Below is an excessively long summary of our "Introduction to Systematics" class's experience using BioStor as a starting point for exploring the primary literature in biodiversity, and of the issues associated with internet resources for systematics. Rod Page very kindly helped out by providing feedback on the assignment and adding links for the students on the front page of the wiki associated with BioStor - thanks!
The exercise is posted as a PDF and as TeX source in case someone else would like to adapt it. It was used in Biology 550, "Introduction to Systematics," an upper-level undergraduate course here at KU. The exercise was assigned after we had covered the basics of nomenclature and collections-based research. I chose the assigned articles so that they would all be fairly short, and assigned 3-4 students to each article. The students did not work together, so that I could compare the information that different students gleaned from the same exercise.
Overall I think that the exercise was a success; it certainly helped students realize how messy the literature is and the problems associated with synonymy.
What is sobering (but not surprising) is the number of times different aggregating web sites gave conflicting answers about the current name and classification for an organism. I was also somewhat disappointed by the relatively poor agreement between the taxonomic names harvested by software and the manually generated list of taxonomic names. We've got a long way to go with respect to getting our literature online and linked up.
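The agreement check itself is just set arithmetic over the two name lists; a minimal sketch (the names below are invented for illustration, not actual results from the exercise):

```python
# Compare software-harvested names against a manually compiled list.
# Both name sets here are hypothetical examples.
harvested = {"Eumeces gaigei", "Eumeces multivirgatus", "Kansas"}
manual = {"Eumeces gaigei", "Eumeces multivirgatus", "Eumeces humilis"}

missing = manual - harvested   # names the software failed to extract
spurious = harvested - manual  # names flagged that should not have been

print(sorted(missing))   # ['Eumeces humilis']
print(sorted(spurious))  # ['Kansas']
```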
It is also clear that it won't be easy to do this in an automated fashion. I tried to track down the source of many of the errors (I focused on the taxonomic name extraction aspect, but I intend to go back and look at some of the other errors when I get a chance).
There is not a glaring "weak-link" in the chain that would be easy to correct.
As far as I could tell, none of the taxonomic name extraction errors were caused by the BioStor software itself. The errors appear to have been propagated from the taxonomic name extraction in the Biodiversity Heritage Library.
Suggestion for BioStor's Wiki : One counterintuitive aspect of the Wiki is the disagreement between the numbering of pages on a reference's Wiki page and the page numbers in the journal. I suspect that this is a difficult-to-avoid side effect of the number of pages scanned in the journal (including frontmatter) disagreeing with the printed page numbers. For example, http://biostor.org/wiki/Reference:4504 covers journal pages 1-4, but the links to the individual pages are to wiki pages 19-22 (e.g. http://biostor.org/wiki/Page:Breviora0166harv.djvu/21 ). The correct content is shown, but the students were confused that the link names were not what they were expecting. Adding another set of links that use the printed page numbers might help wiki users.
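For any one scanned volume the mismatch is just a constant offset, so a hypothetical link-building helper would only need that one number (the offset of 18 below comes from the Breviora example above; the function names are my own invention, not anything in BioStor):

```python
def scan_page(printed_page, offset=18):
    """Map a printed journal page number to the wiki/DjVu page index.

    `offset` is the number of scanned pages (covers, frontmatter, etc.)
    preceding printed page 1; 18 matches the Breviora0166harv.djvu example,
    where printed pages 1-4 correspond to scan pages 19-22.
    """
    return printed_page + offset

def wiki_page_url(volume, printed_page, offset=18):
    # Hypothetical URL builder following the Page: naming scheme seen above.
    return "http://biostor.org/wiki/Page:%s.djvu/%d" % (volume, scan_page(printed_page, offset))

print(wiki_page_url("Breviora0166harv", 3))
# http://biostor.org/wiki/Page:Breviora0166harv.djvu/21
```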
Suggestion to me : I should have done a better job of explaining this (the wiki page numbering) in the assignment.
I only followed through and added metadata such as "<section begin=types/>" to one page. It was a bit tedious, but you can definitely see this as a promising avenue for really "adding value" to the OCR text and making this a great resource.
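For anyone unfamiliar with the markup: these tags come from MediaWiki's Labeled Section Transclusion extension, so a tagged span of OCR text can be pulled into another page. A sketch (the page name and the surrounding text are placeholders, not content from the actual wiki):

```
<section begin=types/>
Holotype. KU 12345, an adult male, collected ...
<section end=types/>

{{#lst:Page:SomeVolume.djvu/21|types}}   <!-- transcludes just the "types" span -->
```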
Suggestion on the Wiki : It seems like the per-page markup is going to be a problem here just because any section can span multiple pages. I don't know how hard it would be to refactor that.
Question on the Wiki : I wasn't sure how to flag things like page numbers, headers, and footnotes in the wiki (I must admit that I didn't dig around much for guidance). Is there a set of tags that I should use for those things?
Suggestion to me : In the exercise, I should have had the students add a topic page for the organism and the type specimen (and then required them to fill in the appropriate info).
Minor one-off types of things:
1. http://biostor.org/reference/14337 is on the Wiki at http://biostor.org/wiki/Reference:14437 instead of at http://biostor.org/wiki/Reference:14337
2. The PDF link for http://biostor.org/reference/1205 has just two pages, while the Wiki has the full paper. There were a lot of these. Correction of this type of error is one of the motivations behind Rod Page's advocacy of wikis.
It certainly doesn't look like it's going to be easy to do a better job with the OCR using an automated system. I tried some open-source tools such as tesseract, but got much worse accuracy (though I did not train tesseract for the journals' fonts -- that could have a big effect; in fact I did not train tesseract at all). Adobe Acrobat Professional's OCR was better in some spots and worse in others -- not obviously better.
Bianca Lipscomb of BHL helpfully referred me to the pages of the Internet Archive (which is the source of most of BHL's digitized content). See http://www.archive.org/about/faqs.php#291 and the software they use: http://sourceforge.net/projects/scribesw
Unfortunately, while much of scribesw is in Java, several pieces are compiled Windows executables (I haven't run it yet). The SourceForge svn repo has not been updated in several years.
Just looking at the scans, it is unsurprising that an OCR system would have a hard time. Some of them are tough for a human to read.
Suggestion to someone : It does seem like better results might be obtained if the OCR were trained on a particular journal. Just learning the fonts might help a lot. It is appealing to imagine a system in which OCR was done jointly with textual analysis. Some of that clearly is a part of OCR systems (e.g. tesseract wants to know the language used in the document, not just the character set). But the formatting of particular journals is so standardized that you'd think OCR would do better if it were smart enough to expect a certain font for titles.
If the OCR engine were smart enough to use a set of typesetting conventions, fonts, and common words that had been inferred from reading scans of the same journal, then you'd think that it might do considerably better.
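Even without retraining the engine, a crude approximation of the "common words" idea is a post-correction pass against a journal-specific lexicon; a toy sketch using Python's difflib (the lexicon here is invented for illustration, and real OCR garbling is often too severe for this to catch, as the example below suggests):

```python
import difflib

# Hypothetical lexicon of words common in one journal, e.g. harvested
# from pages of the same journal that OCR'd cleanly.
journal_lexicon = ["PLATYSOMATICHTHYS", "STOMIAS", "FLOUNDER", "DESCRIPTION", "CALIFORNIA"]

def correct_word(word, lexicon=journal_lexicon, cutoff=0.6):
    """Replace an OCR'd word with its closest lexicon entry, if one is close enough."""
    matches = difflib.get_close_matches(word.upper(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(correct_word("STO.TIIAS"))  # STOMIAS
print(correct_word("FLOUNDEE"))   # FLOUNDER
```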
For example, titles in "Proceedings of the United States National Museum" were in a funky font and all caps -- they were particularly hard for the OCR. The title for http://biostor.org/wiki/Page:Proceedingsofuni31880unit.djvu/317
"DESCRIPTION OF A NEW FLOUNDER (PLATYSOMATICHTHYS STOMIAS), FROM THE COAST OF CALIFORNIA."
was OCR'd as:
"I>ESC'RI5»TIO:V OF A IVEW FI^Ol'.'VDER (PI^ATV .SOMATICHTHVS STO.TIIAS), FKOm THE COAST OF €AEIFOIlI\IA."
In the not-at-all-surprising category: tabular formatting is really tough on OCR. http://biostor.org/wiki/Page:Greatbasinnatura34brig.djvu/255 is a good example of a table that was not dealt with well by the OCR.
Name extraction errors (cases when the name is in the OCR'd text, but the name is not extracted by uBio's software as a taxonomic name, and the reverse). There seem to be plenty of errors here too, but I need to do a more thorough diagnosis. A couple of common problems:
- Page-level extraction of names is not article-level extraction. Several articles were flagged with taxa that appeared in the preceding or following article.
- Place names showed up as taxonomic names. This seems hard to avoid. Presumably checking an extracted name against a gazetteer web service of common place names would allow it to be flagged as potentially geographical.
- Specific epithets were often missed if they occurred by themselves or after the abbreviation of the genus name.
- Hyphenation can inhibit taxonomic name recognition (in http://biostor.org/reference/14131 this kept the uBio tools from recognizing Salvelinus fontinalis). This seems like it could be dealt with pretty easily.
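For the hyphenation problem, a minimal pre-processing pass that rejoins words split across line breaks might be enough (a sketch of the idea, not anything uBio actually does):

```python
import re

def dehyphenate(text):
    """Rejoin words split across line breaks, e.g. 'Salvelinus fonti-\\nnalis'.

    This naively assumes every end-of-line hyphen is a soft break, which
    will occasionally mangle genuinely hyphenated words.
    """
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)

ocr_text = "We collected Salvelinus fonti-\nnalis from the stream."
print(dehyphenate(ocr_text))
# We collected Salvelinus fontinalis from the stream.
```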
Suggestion to someone : It seems like it would really help if the OCR software also spit out font-face info. Obviously, one can't rely solely on italics, but knowing that a word was italicized would be a good clue that it may be a scientific name. Surely some software does this. I'm afraid that I don't keep up with this field...
Below are some notes that I made as I read through the assignments. I'll try to return to this and make them more useful for others, but I thought that I'd post them now in case they are of use to anyone in raw form.
The convention is the URL of the BioStor article followed by the taxonomic names. A * means that the name was missing from the names associated with the article in BHL; a + means that the name was included but should not have been.
3. taxonomy ok but ncbi's differs on phrysomatinae vs phrynosomatidae
4. Author name confuses EOL:
http://www.eol.org/pages/1250765
(with author)
http://www.eol.org/pages/11013922
(without author, NCBI, but few other links)
Eumeces gaigei
Eumeces multivirgatus
Eumeces humilis
* "epipleurotus" could refer to "Eumeces epipleurotus" or "Eumeces multivirgatus epipleurotus"
Plestiodon multivirgatus HALLOWELL 1857
http://eol.org/pages/8830550
http://eol.org/pages/794679
Difficult to tell the current status:
http://www.iucnredlist.org/apps/redlist/details/64246/0
list Leiolopisma caudaequinae as a synonym of Scincella silvicola
not sure how to deal with footnotes in the Wiki ( http://biostor.org/wiki/Page:Universityofkans3401univ.djvu/207 )
* Hyla euthysanota Kellogg
* Centrolenella viridissima
Centrolenella
Hyla
Hyla erythromma
Hyla pinorum
Ptychohyla
Ptychohyla adipoventris