Challenge Project

Label Recognition for Herbarium

(14) Search batches of herbarium images for text entries related to the collector, collection or field trip

Goal

Set up a machine-learning (ML) pipeline that searches for a given image pattern among a set of digitised herbarium vouchers.

Dataset

United Herbaria of the University and ETH Zurich

Team

  • Marina
  • Rae
  • Ivan
  • Lionel
  • Ralph

Problem

Imaging herbaria is relatively time- and cost-efficient, because herbarium vouchers are easy to handle, do not vary much in size, and can be treated as 2D objects. Recording metadata from the labels is, however, much more time-consuming: the writing is often difficult to decipher, and the relevant information is neither systematically ordered nor complete, leading to potential errors and misinterpretations. Improvements to current digitisation workflows therefore need to focus on optimising metadata recording.

Benefit

Developing the proposed pipeline would (i) save herbarium staff considerable time when recording label information while preventing potential errors (e.g. attributing wrong collector or place names), (ii) make it possible to merge duplicates or virtually re-assemble special collections, and (iii) facilitate the transfer of metadata records between institutions that share the same web portal or aggregator. This pipeline could clearly be useful to many natural history collections.

Case study

The vascular plant collection of the United Herbaria Z+ZT of the University (Z) and ETH Zurich (ZT) encompasses about 2.5 million objects, of which ca. 15% (approximately 350,000 specimens) have already been imaged and are publicly available on GBIF (see this short movie for a brief presentation of the institution). Metadata recording, however, is still very incomplete.

Method

  1. Define the search input, either by cropping a portion of a label (e.g. the header from a specific field trip, a collector name or a stamp), or by typing the words in a dialog box
  2. Detect labels on voucher images and extract them
  3. Extract text from the labels using OCR and create searchable text indices
  4. Search for this pattern among all available images
  5. Retrieve the batches of specimens that contain the given text, as a list of barcodes
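Once OCR text exists for each voucher (steps 3–5), the search itself can be a simple pattern match over the indexed text. A minimal sketch, using hypothetical barcodes and label texts:

```python
import re

def search_vouchers(ocr_texts, pattern):
    """ocr_texts: dict mapping specimen barcode -> OCR'd label text.
    Returns the sorted barcodes whose text matches the pattern."""
    query = re.compile(pattern, re.IGNORECASE)
    return sorted(bc for bc, text in ocr_texts.items() if query.search(text))

# Hypothetical example data, not real Z+ZT barcodes:
vouchers = {
    "Z-000101": "Expedition in Angola, leg. W. Koch 1936",
    "Z-000102": "Herbier de la Soie",
    "Z-000103": "det. Walo Koch 1941",
}
print(search_vouchers(vouchers, r"walo koch"))  # ['Z-000103']
```

A real deployment would replace the in-memory dict with a proper text index built from the OCR output, but the retrieval contract (pattern in, barcodes out) stays the same.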

Examples

Here are some voucher images with typical search targets highlighted (see red squares).

a) Search for "Expedition in Angola"

b) Search for "Herbier de la Soie"

c) Search for "Herbier des Chanoines"

d) Search for "Walo Koch"

e) Search for "Herbarium Chanoine Mce Besse"

Problem selected for the hackathon

The most useful thing for the curators who enter metadata for the scanned herbarium images would be to know which vouchers were created by the same botanist.

Vouchers can have multiple different labels on them. The oldest and most important one (for us) was added by the botanist who collected the plant. This label is usually the biggest one and sits at the bottom of the page. We saw that this label is often typed or stamped (but sometimes handwritten). It sometimes includes the abbreviation 'leg.', short for legit, Latin for 'he/she collected'.

After that, more labels can be added whenever another botanist reviews the voucher, for instance to identify the species, which can happen multiple times. These labels are usually higher up on the page and often handwritten. They might include the abbreviation 'det.', short for determinavit, 'he/she determined (identified)'.

We decided to focus on specimens collected by the botanist Walo Koch. He also determined many specimens, so it is important to distinguish vouchers bearing 'leg. Walo Koch' from those bearing 'det. Walo Koch'.
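The leg./det. distinction can be made mechanically on the transcribed label text. A hypothetical helper, assuming the OCR preserved the abbreviations (real OCR output would be noisier and might need fuzzy matching):

```python
import re

# Match 'leg.' or 'det.' immediately before the collector's name,
# tolerating a missing period and varying case.
LEG = re.compile(r"\bleg\.?\s+walo\s+koch\b", re.IGNORECASE)
DET = re.compile(r"\bdet\.?\s+walo\s+koch\b", re.IGNORECASE)

def classify(label_text):
    """Return whether a label credits Walo Koch as collector or determiner."""
    if LEG.search(label_text):
        return "collected"
    if DET.search(label_text):
        return "determined"
    return "unknown"

print(classify("leg. Walo Koch, 14.VII.1938"))  # collected
print(classify("det. Walo Koch 1941"))          # determined
```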

Data Wrangling documentation

How did we clean and select the data?

Selected tool

We decided to use eScriptorium to run OCR on the voucher images. It is open source, and we could host our own instance, which meant we could make as many requests as we wanted. We also found a Python library for connecting to an eScriptorium server, which we hoped to use to automate image processing.
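A self-hosted eScriptorium instance exposes a REST API (it is a Django application using token authentication). A minimal sketch of talking to it with only the standard library; the host name is hypothetical, and the exact endpoint paths should be checked against your own instance's API documentation:

```python
import json
import urllib.request

BASE_URL = "https://escriptorium.example.org"  # hypothetical self-hosted instance

def auth_headers(token):
    # Django REST framework style token authentication
    return {"Authorization": f"Token {token}"}

def list_documents(token):
    """Fetch the document list from the instance's REST API."""
    req = urllib.request.Request(f"{BASE_URL}/api/documents/",
                                 headers=auth_headers(token))
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

In practice a dedicated client library (as mentioned above) wraps calls like this, including image upload and transcription export.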

Setting up eScriptorium on DigitalOcean

eScriptorium lets you manually transcribe a selection of images, and then builds a machine-learning model that it can use to automatically transcribe more of the same kind of image. You can also import a pre-existing model. We found a list of existing datasets for OCR and handwriting transcription at HTR-United, and Open Access models at Zenodo.

From the Zenodo list, we picked this model as probably a good fit with our data: HTR-United - Manu McFrench V1 (Manuscripts of Modern and Contemporaneous French). We could refine the model further in eScriptorium by training it on our images, but this gave us a good start.

Video Demo

Video demo

Presentation

Final presentation

Google Drive folder

Data, documentation, ...

Tools we tried without success

Transkribus

https://readcoop.eu/transkribus/. Not open source, but very powerful. Not easy to use via an API.

Open Image Search

The imageSearch pipeline developed by the ETH Library. The code is freely available on GitHub under an MIT License.

We were not able to run it, because the repository is missing some files that are necessary to build the Docker containers.

Github

See also our work on GitHub, kindly shared by Ivan.

Event finished

Preparing the final presentation...

05.11.2022 13:48 ~ liowalter

We have set up an instance of eScriptorium that's available online, and we've found a collection of images that might be relevant to our question. Now we're training an OCR model on those images in eScriptorium.

05.11.2022 11:30 ~ rae_knowler

Event started

First post

04.10.2022 15:07 ~ AlessiaGuggisberg

Contributed 1 year ago by AlessiaGuggisberg for GLAMhack 2022