Classification of Museum Specimens
Overview
The Cal museums hold extensive collections of specimens that have been digitized but not catalogued, like the item above. There are far too many items for staff to enter directly, so the museums are pursuing two approaches: crowdsourcing and automatic labeling. In both cases, machine learning is needed to produce a final label. The sample above is from an insect collection, but similar problems arise across many collections at several museums, including plants, fossils, and anthropological artifacts.
Questions
There are two problems to solve:
1. Reconciling four transcriptions of label data from citizen scientists into one consensus record for each specimen.
2. Extracting and parsing label text directly from images into a database.
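To make the second problem concrete, here is a minimal sketch of turning a transcribed label into database fields. The label layout (taxon; locality; date; collector) and all example values are hypothetical assumptions, not the museums' actual schema; real labels vary far more widely.

```python
import re

# Hypothetical label layout: "taxon; locality; date; coll. collector".
# This is illustrative only -- real specimen labels are far less uniform.
LABEL_PATTERN = re.compile(
    r"(?P<taxon>[A-Z][a-z]+ [a-z]+)\s*;\s*"      # binomial name, e.g. "Apis mellifera"
    r"(?P<locality>[^;]+);\s*"                   # free-text locality
    r"(?P<date>\d{1,2} [A-Za-z]{3} \d{4})\s*;\s*"  # e.g. "12 Jun 1987"
    r"coll\. (?P<collector>.+)"                  # collector name
)

def parse_label(text):
    """Return a dict of label fields, or None if the text doesn't match."""
    m = LABEL_PATTERN.search(text)
    return m.groupdict() if m else None

record = parse_label(
    "Apis mellifera; Strawberry Canyon, Berkeley CA; 12 Jun 1987; coll. J. Smith"
)
```

A single regular expression will not cover 300,000 heterogeneous labels, but a parsed-field dictionary like `record` illustrates the target output a learned parser would produce for the database.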
More details on the transformations needed to produce these annotations are given in these papers.
Dataset
There are currently around 300,000 images that need classification or annotation resolution. A subset of these will be provided at the start of the project.
Tools
You may use any machine learning toolset to resolve the transcriptions from citizen scientists. You will probably want to parse them to determine parts of speech and attachments. You should be able to reach a soft consensus by selecting the most representative sentences from the citizen annotations, rather than generating new ones. For text and handwriting recognition, several commercial and open-source products are available. We are also exploring a collaboration with a company whose new handwriting/text recognizer offers improved performance; it may be available in time for this project.
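One simple way to realize the "most representative sentence" idea is to pick, among the four citizen transcriptions, the one with the highest average similarity to the others (a medoid), rather than synthesizing new text. The sketch below uses the standard library's `difflib.SequenceMatcher` as the similarity measure; the example transcriptions are invented for illustration.

```python
from difflib import SequenceMatcher

def consensus(transcriptions):
    """Pick the transcription most similar, on average, to all the others.

    Selects an existing annotation (a medoid) instead of generating new
    text, matching the soft-consensus approach described above.
    """
    def avg_similarity(i):
        others = [t for j, t in enumerate(transcriptions) if j != i]
        return sum(SequenceMatcher(None, transcriptions[i], t).ratio()
                   for t in others) / len(others)
    best = max(range(len(transcriptions)), key=avg_similarity)
    return transcriptions[best]

# Four hypothetical citizen-scientist readings of the same label:
votes = [
    "Apis mellifera, Berkeley CA, 12 Jun 1987",
    "Apis mellifera, Berkely CA, 12 Jun 1987",
    "Apis melifera, Berkeley CA, 12 June 1987",
    "illegible",
]
print(consensus(votes))  # picks the first reading
```

A production system would likely swap in a stronger similarity measure (e.g. token-level edit distance or learned embeddings), but the medoid selection step stays the same.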