This Challenge was posted 2 years ago

Challenge view

Back to Project

14. Label Recognition for Herbarium

Search batches of herbarium images for text entries related to the collector, collection or field trip

Goal

Set up a machine-learning (ML) pipeline that search for a given image pattern among a set of digitised herbarium vouchers.

Dataset

United Herbaria of the University and ETH Zurich

Team

  • Marina
  • Rae
  • Ivan
  • Lionel
  • Ralph

Problem

Imaging herbaria is pretty time- and cost-efficient, for herbarium vouchers are easy to handle, do not vary much in size, and can be handled as 2D-objects. Recording metadata from the labels is, however, much more time-consuming, because writing is often difficult to decipher, relevant information is neither systematically ordered, nor complete, leading to putative errors and misinterpretations. To improve current digitisation workflows, improvements therefore need to focus on optimising metadata recording.

Benefit

Developing the proposed pipeline would (i) save herbarium staff much time when recording label information and simultaneously prevent them from doing putative errors (e.g. attributing wrong collector or place names), (ii) allow to merge duplicates or re-assemble special collections virtually, and (iii) would facilitate transfer of metadata records between institutions, which share the same web portal or aggregator. Obviously, this pipeline could be useful to many natural history collections.

Case study

The vascular plant collection of the United Herbaria Z+ZT of the University (Z) and ETH Zurich (ZT) encompasses about 2.5 million objects, of which ca. 15% for a total of approximately 350,000 specimens have already been imaged and are publicly available on GBIF (see this short movie for a brief presentation of the institution). Metadata recording is, however, still very much incomplete.

Method

  1. Define search input, either by cropping portion of a label (e.g. header from a specific field trip, collector name or stamp), or by typing the given words in a dialog box
  2. Detect labels on voucher images and extract them
  3. Extract text from the labels using OCR and create searchable text indices
  4. Search for this pattern among all available images
  5. Retrieve batches of specimens that entail the given text as a list of barcodes

Examples

I have attached some images of vouchers, where I highlighted what could typically be searched for (see red squares).

a) Search for "Expedition in Angola"  Title

b) Search for "Herbier de la Soie"  Title

c) Search for "Herbier des Chanoines"  Title

d) Search for "Walo Koch"  Title

e) Search for "Herbarium Chanoine Mce Besse"  Title

Data Wrangling documentation

How did we clean and select the data ?

Selected tool

https://escriptorium.fr/

Setting up escriptorium on DigitalOcean

Tools we tried but where we were not successful

Transkribus

https://readcoop.eu/transkribus/. Not open source but very powerful. Not easy to use via an api

Open Image Search

The pipeline imageSearch developed by the ETH Library. The code is freely available on GitHub and comes with a MIT License.

We were not able to run it

This content is a preview from an external site.
 
Contributed 2 years ago by AlessiaGuggisberg for GLAMhack 2022
All attendees, sponsors, partners, volunteers and staff at our hackathon are required to agree with the Hack Code of Conduct. Organisers will enforce this code throughout the event. We expect cooperation from all participants to ensure a safe environment for everybody.

Creative Commons LicenceThe contents of this website, unless otherwise stated, are licensed under a Creative Commons Attribution 4.0 International License.