Make_bib user's manual

Introduction

It is often the case that you have to produce bibliographic records, either for yourself or for a list of people (your team or your lab). In general, you do not have a reliable local database of information, and you are then forced to chase people for their references or to compile the information yourself. Since a number of people are involved, references arrive in a wide variety of formats, which is devastating for the consistency of the global bibliographic list. This is a pain, and furthermore you end up doing it over and over, each year.

make_bib is a package that tries to make this task as automatic as possible by taking advantage of existing databases and character string manipulation software. Given a list of people's names and a range of years, it will produce an HTML bibliographic list for these people, using data available at the NASA ADS abstract service. Why HTML? Because it is easy to create HTML code, because once you've posted it, all the people in your lab can check their references (see the author breakdown option), because Microsoft Word can take in HTML code and secretaries use Microsoft Word and not LaTeX, and because HTML to LaTeX converters should exist (if you are interested, you can find a variety of such converters at the W3 Consortium repository).

make_bib would not have seen the light of day if Gary Mamon (IAP) had not decided to build code to interact with ADS to get citation information, or to download papers automatically. If you are interested in these functionalities, check his software page.

Requirements

make_bib is a rather heterogeneous package, mostly because it was "simple" to retrieve data from ADS with a shell script (and Gary is a shell script guru), and because it was easier to do the extensive character string manipulation in a language I know, IDL. Therefore, to run make_bib you need IDL, a Unix shell to run the shell scripts, and the Lynx text browser, which the shell scripts use to query ADS.

If you meet these requirements, you can download make_bib.

Installation

Once you unpack the distribution tar file, you should have a number of text files and scripts. Some, the .pro ones, are IDL scripts; the others are shell scripts. Let's discuss them in turn.
get_bibcodes
This is a shell script that connects to ADS to obtain the bibcodes of papers (a bibcode is a unique code name that fully describes a paper, such as 1990A&A...230..412C). It is advisable to place this script in your bin directory, to be clean, or anywhere in your path; otherwise the IDL scripts will not see it. Given a name and a certain number of options, get_bibcodes returns the list of bibcodes belonging to that name. get_bibcodes can be called from the shell, so in case of trouble you can check whether this part of the package works.
get_refinfo
This is a shell script that gets the reference information (title, author list, journal name, page numbers, etc.) from ADS for a given bibcode. Place it in your bin directory or somewhere in your path. get_refinfo can also be called from the shell, so you can check that it works well. Indeed, a common source of errors is that ADS frequently changes its output format.
make_bclist.pro
This is the first IDL task of the package. It is a function that gets all the bibcodes for the person or personnel list of your choice. It returns a structure that serves as input for the second step. The work is split into two steps because the whole process is long: you can save this intermediate structure and continue with the compilation later. The split can also save time when faced with ADS format changes.
filter_arxiv.pro
Depending on your policy, you may or may not want to include the arXiv preprints in your publication list. There is a way to avoid ADS returning those papers, but it is not implemented in get_bibcodes yet. Instead, you can use this IDL routine to filter the output of make_bclist before moving on to the next step.
make_reflist.pro
This is the second IDL task of the package. It is also a function; it compiles the bibliographic information for all the bibcodes returned by make_bclist. Its output is an array of IDL structures, each corresponding to a bibliographic entry of the reference list. Note that at this stage a number of references do not in fact belong to the person(s) you have selected, due to the presence of homonyms in the ADS database.
bib2html.pro
This is the IDL procedure that converts the result of make_reflist into HTML page(s). If you are not satisfied with the look of the reference pages, this is the file to edit, not make_reflist or make_bclist.
bib_update.pro
This is the IDL function that will allow you to update the reference information with the form inputs you receive from authors' inspection of their own reference list.
jnl_code.pro
This is an IDL routine that lists journals available in the ADS, their ADS name, their bibliographic abbreviation and their referee status (y/n). It is used by make_reflist when it fills the bibliographic structure. When make_reflist encounters a journal that is not listed in jnl_code, it prints it out. You can update this routine accordingly if you want to abbreviate these journals as well.
make_html_code.pro
This is an IDL function to transform the information in the bibliographic structure into an HTML sentence. Edit it to change the appearance of references (though be careful when doing that). It is called by bib2html.
tex2html.pro
This is a function to convert TeX character sequences that appear in ADS outputs into HTML codes. It is called by bib2html.
ads2html.pro
This is a function to convert strange HTML-like character sequences that appear in ADS outputs into true HTML codes. It is called by bib2html.
strallpos.pro
This is a function that finds all occurrences of a substring in a string (it is a generalization of the IDL function strpos). It is called by bib_update.
fict_personnel.txt
An example of a personnel list, so that you can understand the syntax.
email.txt
An example of the email file that contains parsable results from HTML form validations.
A note on the installation of Lynx: when I developed the code, Lynx was not installed on my system, so I made a private installation in my bin directory. Thus the configuration file of Lynx is in the directory ~/bin. If Lynx is installed differently for you, you will have to edit the shell scripts get_bibcodes and get_refinfo to reflect that (search for the string lynx -cfg ~/bin).

Using make_bib

Now that the installation is complete, let's explore how the package works.

General description

This is a three-stage process. First, we create a structure (a sort of database variable) that contains all the references found for a group of people and a range of dates that you provide (we will see later how). This first stage is itself now split into two steps, performed by the functions make_bclist and make_reflist. The references are extracted from the ADS database using its three categories: AST for astrophysics, PHY for physics and geophysics, and INST for instrumentation.

Once this is done, bib2html lets you convert this structure into HTML pages with references sorted by author or by year, and by type (refereed journals, refereed conferences, conferences, non-refereed journals and books). You can then use any HTML 4.0 compliant browser to view the file.

If you have chosen to write the HTML files as forms, then upon validation of them by their owners, you will receive coded updating information by email, and you can process this information to update your bibliographic structure with bib_update.

These three operations use IDL scripts. You therefore have to be in IDL to use them, and this is what I will assume in the following examples.

For all routines (including the shell scripts), calling them with no arguments prints the on-line help.

Creating the bibliographic structure

You need not be familiar with IDL structures to be able to use this package; it is only required that you have used IDL before and know the difference between a function and a procedure. In short, a structure can be considered as an array whose elements need not be of the same type (some could be strings, some could be integers). A structure is an ideal variable to store information on a paper, for instance, since this information comes in strings (the title, the journal, the author names) but also in numbers (the volume, the page numbers, the number of authors). In a way, this structure is the equivalent of a card where all information pertaining to a given paper is stored.

Since we are making a reference list, we will have many of these cards, which is no problem as IDL allows you to create arrays of structures (i.e. collections of structures). This can be considered as the filing cabinet for our reference list.

Getting the bibcodes

To populate this cabinet, we will first use the function make_bclist. This first step only gives us bibcodes: vital information, but not really what you are looking for. We'll see later how to get the rest of the reference. If you just call it without arguments, this is what you get:

IDL> result = make_bclist()
correct call is:
 
   bib_str = make_bclist([name=name,personnel=personnel,
              [,year=year,/since,upto=upto,/verb,/debug
              loserver=loserver,timeout=timeout]
 
where
 
   bib_str   will be a structure where the bibcodes and
             related data are be stored
   name      to make the bibliography of 1 person
             this should be a string of the form:
               surname initial 
                   (e.g. sauvage m) if year is set
                    or
               surname initial year
                   where year is the starting date for that
                   person's bibliography
   personnel can be used to provide the file with the personnel
             names.
             note that either name or personnel has to be set
   year      to build only one year's bibliography
             year is an integer
   upto      can be used to build the bibliography between
             year and upto
   /since    indicates that the bibliography should be made
             from the date given in year up to now
   /verb     prints information for debugging
   loserver  is a string array that contains the choice of 
             servers to use in order to avoid being banned from
             ADS.
             choose between cl, eso, fr, in, jp, kr and us
             default is all
   timeout   in seconds is a period when we wait before making
             the call to ADS, again not to be banned
             default is 30s
Thus, if you want to get the bibcodes of your complete bibliography (as known by ADS), and assuming that your name is einstein but your initial is b, the way to do it is:
IDL> my_bibcodes = make_bclist(name='einstein b')

Adding your initial to your name is not required; it simply avoids retrieving papers from people with the same name as yours, which you would then have to delete.

Sometimes you just want to have your bibliography for a given year. This is easy, for instance, if this year is 1990, type:

IDL> my_bibcodes90 = make_bclist(name='einstein b',year=1990)

If what you want is your bibliography since 1990, then just add the keyword /since to the previous command line:

IDL> my_bibcodes90_on = make_bclist(name='einstein b',year=1990,/since)
And if what you want is your bibliography between 1990 and 2000, just use the keyword upto:
IDL> my_bibcodes90_on = make_bclist(name='einstein b',year=1990,upto=2000)

That is all well for one person, but if you want to do it for a list of people, you will have to use another method (you could call make_bclist several times, but that is not very elegant). The simplest way is to create a personnel list, i.e. a list containing the names of the people in your group. Since you are editing a file, this is the occasion to introduce some other information, such as the fact that not all your collaborators entered your group the same year, and not all remained in your group till now. Thus for each person you can define an entrance date and an exit date, which are used to select the papers they published while in your group. You should then create a text file where each line looks like this:

newton   b  i1989 o2000

The initial is not mandatory here either, and the formatting is free (i.e. the number of spaces is not a problem). The only mandatory element is the entrance date, which must be present and flagged with the letter i. The exit date is optional, but when it is there it must be flagged with an o. You can insert comments in this file (e.g. to identify the group of people who are post-docs, PhD students...) by placing the character ; at the beginning of the line. You can also decide to restrict the databases searched for some authors. Imagine that one of your team members has a very common name, e.g. Martin, and you know he has not published in instrumentation journals. It would thus make sense to exclude instrumentation journals from the search, as they are bound to generate a high number of false references. You can do that by adding -INST after the entrance (or exit) date, as in the following example:

martin   b  i1989 o2000   -INST

Similarly, the astrophysics database is excluded with -AST and the physics/geophysics one with -PHY.
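
To make the syntax of a personnel line concrete, here is a minimal sketch of how such a line could be parsed. It is written in Python purely for illustration and is not part of make_bib: the function name parse_personnel_line and the returned dictionary are my own assumptions, and the actual parsing done inside make_bclist.pro may well differ (for instance in how multi-word surnames are handled, which this sketch ignores).

```python
def parse_personnel_line(line):
    """Parse one personnel-file line; return None for blank or comment lines.

    Hypothetical helper: it only mirrors the rules described in the text
    (free spacing, optional initial, mandatory iYYYY, optional oYYYY,
    optional -AST / -PHY / -INST exclusions, comments starting with ';').
    """
    stripped = line.strip()
    if not stripped or stripped.startswith(";"):
        return None
    fields = stripped.split()  # any amount of whitespace is accepted
    entry = {"surname": fields[0], "initial": None,
             "entrance": None, "exit": None, "excluded": []}
    for field in fields[1:]:
        if field[0] == "i" and field[1:].isdigit():
            entry["entrance"] = int(field[1:])   # entrance date, e.g. i1989
        elif field[0] == "o" and field[1:].isdigit():
            entry["exit"] = int(field[1:])       # exit date, e.g. o2000
        elif field in ("-AST", "-PHY", "-INST"):
            entry["excluded"].append(field[1:])  # database to skip
        else:
            entry["initial"] = field             # anything else: the initial
    if entry["entrance"] is None:
        raise ValueError("entrance date (iYYYY) is mandatory: " + line)
    return entry


print(parse_personnel_line("martin   b  i1989 o2000   -INST"))
```

A line without an entrance date raises an error in this sketch, which reflects the only mandatory element of the format.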

To illustrate the various possibilities you have to create a personnel list, I have included a fictitious one in the distribution (fict_personnel.txt).

So, to get the bibcodes of a group of people listed in the file my_group.txt, just type:

IDL> group_bib = make_bclist(personnel='my_group.txt')

Note that the year and /since keywords can still be used with the personnel keyword. When they are, make_bib selects only papers that satisfy both the entrance and exit date criteria (when present) and the year and /since criteria.
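
The way the two sets of criteria combine can be pictured as the intersection of two year windows. The sketch below is in Python and only illustrates that reading; the function effective_years, its arguments and the assumption that all bounds are inclusive are mine, not taken from the make_bclist source.

```python
def effective_years(entrance, exit_year, current_year,
                    year=None, upto=None, since=False):
    """Return the (first, last) years actually searched, or None if empty.

    Assumption: a paper is kept only if its year lies in both the window
    from the personnel file and the window from the keywords.
    """
    # Window from the personnel file (the exit date is optional).
    first = entrance
    last = exit_year if exit_year is not None else current_year
    # Window from the keywords, when year is set.
    if year is not None:
        if since:
            key_last = current_year   # from year up to now
        elif upto is not None:
            key_last = upto           # from year up to upto
        else:
            key_last = year           # a single year
        first, last = max(first, year), min(last, key_last)
    return (first, last) if first <= last else None


# Someone present from 1989 to 2000, queried with year=1995 and /since:
print(effective_years(1989, 2000, 2007, year=1995, since=True))  # (1995, 2000)
```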

Finally, you will notice two extra keywords that you can use in the call to make_bclist: loserver and timeout. These keywords are there to protect you from being banned by the ADS. Indeed, this package makes heavy use of the ADS: for each person, getting the complete list of bibcodes requires 2 queries to the ADS (to be able to separate the refereed from the non-refereed papers), and then for each bibcode, getting the complete reference information requires 2 to 3 queries (because not all the information is present or easily processable in the BibTeX or tagged output of the ADS). Compiling the bibliography of a reasonable list of people can therefore easily generate a thousand requests to the ADS which, should they arrive too quickly, will result in your IP address being categorized as an attacker and banned from further access. To avoid this, the package distributes the queries over 7 different mirrors (if you find that one is down or very slow, you can use loserver to restrict the list of servers to use). Furthermore, it waits for timeout seconds before making each call. In my experience, using all 7 servers with a timeout of 1 second is viable.
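
To get a feeling for the numbers involved, here is a small Python sketch that estimates the number of queries and lays them out over the mirrors in turn. It is only an illustration under stated assumptions: the helper names estimate_queries and schedule are hypothetical, the simple round-robin order is my guess at what "rotating between mirrors" means, and no network call is made.

```python
from itertools import cycle

# Server codes as listed in the on-line help of make_bclist.
MIRRORS = ["cl", "eso", "fr", "in", "jp", "kr", "us"]


def estimate_queries(n_people, n_bibcodes, per_bibcode=3):
    """Upper estimate: 2 queries per person, then up to 3 per bibcode."""
    return 2 * n_people + per_bibcode * n_bibcodes


def schedule(n_queries, servers=MIRRORS, timeout=1.0):
    """Return (server, seconds_waited_so_far) for each query, round-robin.

    Only the waiting time is counted; the duration of the queries
    themselves is ignored.
    """
    rotation = cycle(servers)
    return [(next(rotation), (i + 1) * timeout) for i in range(n_queries)]


total = estimate_queries(n_people=20, n_bibcodes=300)  # hypothetical group
print(total)                  # 940 queries
print(schedule(total)[-1])    # last server used and total waiting time
```

For this hypothetical group of 20 people and 300 bibcodes, a timeout of 1 second adds about 16 minutes of waiting, whereas the 30 second default adds close to 8 hours.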

Dealing with preprints

At this stage, the output variable of make_bclist contains the list of bibcodes corresponding to the bibliography we are trying to build (there are other variables in that output, but these are meant for the inner workings of the package). Each bibcode in this variable is unique, but quite a number of them will not make it to the final bibliography (those belonging to homonyms, for instance). Among them, preprints hold a special place. The ADS now lists the arXiv preprints. get_bibcodes could be modified to avoid returning them, but (1) it is not straightforward and (2) in some cases you may want to keep some of them (those corresponding to papers in press). Therefore all the arXiv preprints found are in the output variable of make_bclist. If you want to get rid of all of them, simply do:

IDL> filtered = filter_arxiv(group_bib,/verb)
The /verb option simply shows you which codes are discarded and how many of them there were.

Getting the bibliographic information

We are now ready for the second part of the retrieval: getting the bibliographic information for all the bibcodes. This is the task of the function make_reflist. Calling it without arguments prints:

IDL> result = make_reflist()
correct call is:
 
   result = make_reflist(bibcodes[,/debug,
              ,/verb,affil=affil,loserver=loserver,
              /timeout=timeout])
 
where
 
   bibcodes  is the output structure of make_bclist
   affil     an array of strings to search in the affilitation
             field to declare that a paper really belongs to
             your group
   /verb     prints information for debugging
   /debug    prints more information
   loserver  is a string array containing the codes for the
             servers to use in order not to be banned by ADS.
             choose between cl, eso, fr, in, jp, kr, and us.
             default is all
   timeout   is a wait time before making the call to ADS.
             default is 30s

Most of the parameters are self-explanatory. The only potentially intriguing one is affil. It allows a first automated check that the papers retrieved indeed belong to your group, by checking the affiliation present in ADS (although note that this field is not always filled in). You do that by providing key strings to search for with the keyword affil. For instance, assuming your team members sign their papers as IAP or UMR 245, the previous call could be:

IDL> reflist = make_reflist(group_bib,affil=['IAP','UMR 245'])

If affil is not set, then make_reflist searches for Saclay affiliations. It is easy to change this default at the beginning of the source code. At the end of this process, you now have a potentially large, and very precious array of structures and it would be wise to save it...

Creating the HTML output

This is the job of bib2html. Once again, calling the routine (which is a procedure) without any argument returns its on-line help:
IDL> bib2html
Correct call is:
 
     bib2html,bib_str,file[,/authors,/verb,/debug,
              personnel=personnel,/separated,title=title
              /url,/form,/single,address=address
 
where
 
     bib_str    is the bibliographical structure produced by
                make_reflist
     file       is the filename for the HTML source
     /authors   list references for each authors (for checking
                purposes mainly) and therefore leads to repetitions.
                This option requires a personnel list.
     /verb      prints information on the execution.
     /debug     adds information in the html output
     personnel  to provide the name of the personnel file (used in
                the /authors option, default is SAp_personnel.txt)
     /separated in the /author option, makes a separate page
                per author
     title      to provide a title to the HTML page (default is
                tailored to SAp)
     /url       add a link to the ADS abstract after each reference
     /form      makes the HTML page a form so that users can check
                their references
     /single    combined with /form, creates a single form for the
                complete bibliography instead of one per authors.
                This is useful when a single person will check
                the complete bibliography.
     address    can be used to specify who should receive the forms
                (default is msauvage_at_cea_dot_fr)
bib2html has two arguments: the bibliographical structure that has just been created by make_reflist, and a file name, which will be the name of the HTML document. There are two sorting orders for the references: by increasing year and then alphabetically, or by author in your personnel list and then by year. There is then a further subdivision into refereed journal papers, refereed conference papers (a rather theoretical species, as it is not easy to get that information from ADS), conference papers, non-refereed journal papers, and books.

Since the HTML document can be rather long, a number of anchors are placed throughout it and are accessible from the top of the document. In the author sorting option you can actually make one HTML document per author.

An important point to mention for the author sorting option is that it leads to repetition of references with more than one author from your list. This is because this option is meant to allow authors to check their publication list. If you know what HTML forms are, then there is a very powerful checking option in bib2html that you can set with the /form keyword. You can transform each author's HTML page into a form where each reference is followed by checkboxes that allow the author to exclude references because they belong to a homonym, change the refereeing status of references, and change the affiliation status of references. Once these forms are filled in, the send form button sends the information to the recipient specified with the address keyword in the call to bib2html. Use the function bib_update to update the bibliographic structure with the results from the form.

If the full bibliographic file is going to be checked by one person, you can combine /form with /single to have a single form containing all the references.

To ease the checking, you can also include links to the publications in the HTML form. This is done by adding /url to the bib2html command line. Note that the /url option does not require the /form option to be set.

To give you a feeling of what output is produced, I have compiled the publication list for a certain number of people in the Service d'Astrophysique, for the years 1998 to 2000, which is now in a variable called bib98. The calls below produce the different kinds of output.

A simple year-sorted list
IDL> bib2html,bib98,'year_sort.html'
An author-sorted list on a single page
IDL> bib2html,bib98,'author_sort.html',/authors
An author-sorted list, with each author on a different page, with the form option and URLs included, and forms sent to admin@ici.fr
IDL> bib2html,bib98,'authors_check.html',/form,address="admin@ici.fr",/url

In these HTML pages, the references in red are those for which the affiliation check produced a negative result (you can only see these color changes if your browser correctly handles Cascading Style Sheets). You can force all references to appear in black (or red) by manually changing the affiliation flag in the bibliographic structure. This is done easily with:

IDL> bib98(*).affil = 'y'

or

IDL> bib98(*).affil = 'n'

Note that when /authors is used, bib2html requires a file listing the authors. This is the same file you provided to make_bclist. Here again it is defaulted to an SAp-tailored file. For your use, you should provide that file name with the keyword personnel.

Also note that setting /form without setting /single sets the authors and separated keywords as well, so you do not need to repeat them.

Updating the reference structure

The interesting point of the form output of bib2html is that each member of your personnel list can check his/her own bibliographic list. When this is done, i.e. when they have clicked on the send form button, the custodian of the bibliographic list (i.e. the person corresponding to the email specified in the address keyword of bib2html) will receive many emails with rather cryptic information corresponding to the validation of the HTML forms by their owners. This is where bib_update comes in: its task is to read these emails, understand their content, and modify the bibliographic structure accordingly.

An important point here is that even if a user marks a reference for deletion (either because it belongs to a homonym or because it is not affiliated to the lab), bib_update will not delete this reference from the database. It will flag it in such a way that subsequent calls to bib2html will ignore it. This is done with the tag exclude in the reference structure. This way, the reference is not lost.

Before showing you how this works, there is a little caveat to mention. Although HTML is standardized, the format in which you will receive the form information varies with the browser. Let me dwell on that as it has some importance. The way the forms are set, we can define a unit of information as:

tag_name=bibcode

where, in our case, tag_name can have the following values (they are set by bib2html):

Tag name     Meaning and action
exclude      Exclude this reference from the list
is_ref       This reference was refereed, sets referee to y
not_ref      This reference was not refereed, sets referee to n
is_affil     This reference belongs to the lab, sets affil to y
not_affil    This reference does not belong to the lab, excludes it from the list

bibcode represents a standardized bibliographic code. In a form sent by the HTML 4.0-compliant browser iCab (a very nice browser indeed), there is only one information unit per line, and the bibcodes are unaltered. This makes for very simple parsing. In Netscape, however, the information units appear on the same line, separated by the character &. When this line gets too long, it is broken arbitrarily, generally inside an information unit. I have not yet checked what happens with Internet Explorer, but I suspect it will be different again.

Line breaks inside information units are a problem that bib_update is not able to handle yet. Therefore it is your job to edit the mails you receive to make sure line breaks do not occur inside information units. Note also that Netscape appears to insert the character ! at the line break. You must remove this character.

Firefox appears to return all the units on a single line separated by the &. This is handled "gracefully" by bib_update so no editing is in principle required. You should still check the form output as the separation character could also appear in the bibcodes of the Astronomy and Astrophysics journal.

Firefox and Netscape replace the & in bibcodes by %26, which bib_update also handles correctly, but if the & is still in the journal code, then the bibcode will be split. This should produce an error from bib_update.
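
The two layouts described above (one unit per line, or units joined by &) can be handled by the same splitting logic. The following Python sketch illustrates the idea; it is not the code of bib_update, the function name parse_form_output is hypothetical, and the second bibcode in the example is made up. The length check relies on ADS bibcodes being 19 characters long.

```python
from urllib.parse import unquote

# Tag names set by bib2html, as listed in the table above.
KNOWN_TAGS = {"exclude", "is_ref", "not_ref", "is_affil", "not_affil"}


def parse_form_output(text):
    """Return a list of (tag_name, bibcode) pairs from form output text.

    Accepts one unit per line as well as units joined by '&' on a line.
    A raw '&' left inside a bibcode splits the unit and raises an error.
    """
    units = []
    for line in text.splitlines():
        for unit in line.strip().split("&"):
            if not unit:
                continue
            tag, sep, code = unit.partition("=")
            code = unquote(code)  # turns %26 back into &
            if not sep or tag not in KNOWN_TAGS or len(code) != 19:
                raise ValueError("malformed information unit: %r" % unit)
            units.append((tag, code))
    return units


# Single-line layout, with the & of A&A encoded as %26:
line = "exclude=1990A%26A...230..412C&is_ref=2000ApJ...999..123X"
print(parse_form_output(line))
```

With an unencoded & in the bibcode (exclude=1990A&A...230..412C), the unit is cut after 1990A and the sketch raises an error, which is the situation you need to fix by hand in the email file.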

Now let's assume you have done this little editing, and that the results from the form emails are stored in a file called email.txt (an example of such a file is included in the distribution; note that I have combined results from iCab and Netscape, and so the & characters in the bibcodes sent by Netscape are replaced by %26).

A simple call to bib_update, a function, with no parameters will produce the on-line help:

IDL> result = bib_update()
Correct call is:
  
   new_bib = bib_update(bib_str,email[,/single,/verb,
                        parsed=parsed])
  
where
  
   bib_str     the bibliographic structure created by
               make_bib. Of particular importance are the
               existence of the tags referee and affil which
               are updated.
   email       name of the file where the results of form
               checkings are stored. This results are sent by
               email once the authors clik on the submit button.
               this file has to be less than 5000 lines long.
   /single     to be used when all the form codes are on a single
               line separated by &
   /verb       makes the software talkative

   parsed      an output structure containing the information in
               the email file parsed along the categories exclude,
               refereed, not refereed, affiliated, not affiliated
   new_bib     the updated bibliographic structure, where some
               references have been removed and some updated.

The principle of bib_update is that it parses the information in the email file, i.e. sorts the bibcodes according to the tags they were attached to, and then removes references or updates the information on them. Thus, assuming the bibliographic structure is in a variable called bib98, you would simply send the command:

IDL> new_bib = bib_update(bib98,'email.txt')

Since you will get inputs from different sources, there is a chance that they will contain contradictory information: for instance, two authors of the same paper may not attribute it the same refereeing status or, more likely, the same affiliation status. bib_update checks for such contradictions and, if it finds any, it will not update the structure. Instead it returns the information it has obtained from the email file, parsed in a structure containing 7 arrays:

Structure tag name   Content
exclude              bibcodes marked for deletion
is_ref               bibcodes marked as refereed
not_ref              bibcodes marked as not refereed
is_affil             bibcodes marked as affiliated to the lab
not_affil            bibcodes marked as not affiliated to the lab
ref_conflict         bibcodes with conflicting referee information
affil_conflict       bibcodes with conflicting affiliation information

With this structure, and the bibliographic structure, you should be able to understand the reasons for the contradictions. You need to resolve them before bib_update can run. This is to avoid losing information. Note that in principle, if users check a common version of the bibliography, there can be no conflict: e.g. you can flag that a refereed reference is in fact not refereed, but you cannot confirm that it is refereed (this is implicit). Thus this feature of bib_update serves more to make sure that everyone is using the same version of the bibliography.
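
One simple way to picture the conflict check is as a set intersection: a bibcode is in conflict when it appears under two opposite tags. The Python sketch below illustrates this reading only; find_conflicts is a hypothetical name, the bibcodes other than 1990A&A...230..412C are made up, and the real test inside bib_update may be more elaborate.

```python
def find_conflicts(units):
    """Given (tag_name, bibcode) pairs, return (ref_conflict, affil_conflict).

    Assumption: a conflict means the same bibcode was marked with both
    tags of an opposite pair (is_ref / not_ref, is_affil / not_affil).
    """
    marked = {}
    for tag, code in units:
        marked.setdefault(tag, set()).add(code)
    ref_conflict = marked.get("is_ref", set()) & marked.get("not_ref", set())
    affil_conflict = (marked.get("is_affil", set())
                      & marked.get("not_affil", set()))
    return sorted(ref_conflict), sorted(affil_conflict)


# Two authors disagree on the affiliation of the same (made-up) paper:
units = [("not_affil", "2000ApJ...999..123X"),
         ("is_affil", "2000ApJ...999..123X"),
         ("not_ref", "1990A&A...230..412C")]
print(find_conflicts(units))
```

Here the referee list is empty and the affiliation list contains the disputed bibcode, which corresponds to the ref_conflict and affil_conflict arrays of the parsed structure.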

For checking purposes, and even if no conflicts are found by bib_update, you can retrieve the parsed update information with the parsed keyword:

IDL> new_bib = bib_update(bib98,'email.txt',parsed=parsed)

If no conflict is detected, then new_bib is your updated bibliographic structure (better save it!) and you can use bib2html to produce your final HTML bibliographic list.

For your information, note that a reference will be printed by bib2html as long as its exclude tag is 'n' (even if the affiliation tag, affil, is 'n'). You can update this tag manually to exclude some references. You can also select all the references based on this tag to extract a smaller array of structures that contains only the references belonging to your bibliography. I have kept this step manual in order to always keep all the references returned by ADS. This way you can always go backward in the process without having to query the ADS again, which is a rather long process.

Modification history

08-Mar-2000
First installation of the package, no upgrading facility yet, still highly Saclay specific.
15-Mar-2000
Generalization of the package, all Saclay-specific fields can be customized.
Tags in the bibliographic structures have been added or renamed (e.g. saclay is now affil, and there is a db_key tag). As a result, older bibliographic structures are no longer compatible with the current version of the package (sorry for this).
New options are added to bib2html to allow form-checking of the reference.
The routine bib_update.pro has been added to allow the processing of the form input and updating of the reference structure.
This version only supports form validation through iCab and Netscape.
04-Sep-2007
Following a ban from ADS (too many requests), I have now split the task in two, one for making the list of bibcodes, and one for compiling the reference information.
In order to reduce the load on the ADS, I rotate between the different mirrors.
I no longer use iCab or Netscape, but Firefox handles the validation fine.

Contact and frequently asked questions

If you have questions, just email them to me.

Q: While running make_bclist I get the following message:

lynx: can't access startfile http://cdsads.u-strasbg.fr/cgi-bin/nph-abs_connect?author=lada%2C%20&db_key=PHY&nr_to_return=1000&star_year=1995&sort=ODATE&jou_pick=NO

What does this mean?

A: This indicates that the lynx query failed, most likely because the server was unavailable at the time of the request. Although make_bclist is able to handle failed requests, this means you have lost information. It is advisable to restart make_bclist. Note that in this case we have jou_pick=NO, which indicates that we are querying ADS for refereed journals only. This means that the error occurred when make_bclist was producing the second list, that of refereed papers. This list is used to check whether the referee status that we attribute on the basis of journal names is consistent with ADS. Therefore, in that particular case, only this information is lost, and all the papers will still be collected.

This type of error can also affect make_reflist, in which case you are also warned, but you do lose information. To know what has been lost, if the bibcode is not printed by the script, search the output array of structures for a structure where all the fields are 'void'.