Make provenance data machine-readable with AI & expand knowledge with Wikidata


In many cases, provenance data is not machine-readable. The provenance texts recorded in museums' databases are sometimes only available as free-text strings. The data is often incomplete, inconsistently formatted, and interspersed with additional notes. This is usually not a problem for human readers, but computers cannot rely on precise structures and separators to translate the text into clean, machine-readable data.

The challenge is to parse provenance texts into machine-readable formats. This could be done with the support of existing machine learning algorithms, trained on provenance data from various sources.

Level 1: Parse

The provenance texts should be split into separate transaction events. Each event should carry metadata such as the time range, location, transaction type, and the actors involved (previous owner, new owner, and transaction facilitators such as auction houses). This step requires comparison with dictionaries and other authority files to parse the provenance information correctly. An open question is whether the parser can also handle texts in different languages (English, German, French, …).
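As a rough illustration of the parsing step, the sketch below splits a provenance string into events and extracts a year range and a transaction type. It rests on several assumptions that are not part of the challenge description: events are separated by semicolons, years are four-digit numbers, and the small keyword table stands in for real dictionaries and authority files. The names `ProvenanceEvent`, `TRANSACTION_KEYWORDS` and `parse_provenance` are hypothetical, and the sample text is invented.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProvenanceEvent:
    raw: str                                  # original text of this event
    year_from: Optional[int] = None           # start of the time range
    year_to: Optional[int] = None             # end of the time range
    location: Optional[str] = None            # left empty here; needs a gazetteer
    transaction_type: Optional[str] = None
    actors: List[str] = field(default_factory=list)  # left empty; needs authority files


# Hypothetical multilingual keyword table (assumption, not a real dictionary).
# Order matters: the first matching type wins.
TRANSACTION_KEYWORDS = {
    "auction": ["auction", "auktion", "vente"],
    "purchase": ["purchased", "bought", "sold", "gekauft", "acheté"],
    "gift": ["gift", "donated", "schenkung", "don de"],
}

# Four-digit years between 1500 and 2099 (assumption about the data).
YEAR_RE = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


def split_events(text: str) -> List[str]:
    """Split on semicolons, assuming one transaction event per segment."""
    return [part.strip() for part in text.split(";") if part.strip()]


def parse_event(raw: str) -> ProvenanceEvent:
    """Extract the year range and a keyword-based transaction type."""
    years = [int(y) for y in YEAR_RE.findall(raw)]
    lowered = raw.lower()
    transaction_type = next(
        (name for name, words in TRANSACTION_KEYWORDS.items()
         if any(word in lowered for word in words)),
        None,
    )
    return ProvenanceEvent(
        raw=raw,
        year_from=min(years) if years else None,
        year_to=max(years) if years else None,
        transaction_type=transaction_type,
    )


def parse_provenance(text: str) -> List[ProvenanceEvent]:
    return [parse_event(part) for part in split_events(text)]


if __name__ == "__main__":
    # Invented sample text, for illustration only.
    sample = ("Collection A. Example, Basel, 1920-1935; "
              "sold at auction, Zurich, 1936; gift to the museum, 1950")
    for event in parse_provenance(sample):
        print(event)
```

Rule-based splitting like this will break on real, inconsistently formatted records; it is only meant to show the target structure that a trained model would have to fill.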

Level 2: Link

Retrieve additional information from Wikidata to fill in gaps. This is useful when, for example, only the name of an art trader or institution is available but the related geodata is missing. If scaled up properly, this could save considerable research time. Optional: extend data gathering to other authority files such as the GND.
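One possible starting point for the linking step is the `wbsearchentities` action of the Wikidata API, which returns candidate entities for a name. The sketch below uses only the Python standard library. The helper names and the User-Agent string are hypothetical, and choosing the right candidate and fetching its geodata are deliberately left out.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(name: str, language: str = "en", limit: int = 5) -> str:
    """Build a wbsearchentities request URL for an actor or institution name."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "limit": limit,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urllib.parse.urlencode(params)


def extract_candidates(response: dict) -> List[Dict[str, str]]:
    """Reduce the API response to id, label and description per candidate."""
    return [
        {
            "id": hit["id"],
            "label": hit.get("label", ""),
            "description": hit.get("description", ""),
        }
        for hit in response.get("search", [])
    ]


def search_wikidata(name: str, language: str = "en") -> List[Dict[str, str]]:
    """Query Wikidata over the network (not called automatically)."""
    request = urllib.request.Request(
        build_search_url(name, language),
        # Hypothetical User-Agent; replace with a real contact string.
        headers={"User-Agent": "provenance-linker-sketch/0.1"},
    )
    with urllib.request.urlopen(request, timeout=10) as resp:
        return extract_candidates(json.load(resp))
```

Calling `search_wikidata("...")` with the name of an auction house would return a list of candidates. Deciding which candidate is the right match still needs human review or further disambiguation rules.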

Level 3: Gamify (this part may be for the next hackathon)

Make it interactive: turn the parser into a web-based tool that any provenance researcher can use without programming knowledge. In return, researchers train the tool by telling it whether it parsed the provenance information correctly.
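The feedback loop described here could be stored as simple records that pair the original text with the accepted or corrected parse. The sketch below is one hypothetical shape for such records. The `Feedback` class and the JSON Lines output are assumptions, not an existing tool.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feedback:
    raw_text: str                      # provenance text shown to the researcher
    parsed: dict                       # what the tool extracted
    is_correct: bool                   # the researcher's verdict
    correction: Optional[dict] = None  # corrected fields, if the parse was wrong


def to_training_example(feedback: Feedback) -> Optional[dict]:
    """Turn a verdict into an (input, target) pair for later training.

    Rejected parses without a correction carry no usable target and are skipped.
    """
    target = feedback.parsed if feedback.is_correct else feedback.correction
    if target is None:
        return None
    return {"input": feedback.raw_text, "target": target}


def to_jsonl_line(feedback: Feedback) -> Optional[str]:
    """Serialize one training example as a JSON Lines record."""
    example = to_training_example(feedback)
    if example is None:
        return None
    return json.dumps(example, ensure_ascii=False)
```

Keeping the raw text next to the corrected target means the collected records can later be reused to retrain or evaluate whichever parser is chosen.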

Datasets:
- (still needed) Provenance data samples from participating museums in DE + FR (for example from MEG, MKB, MRZ, VMZ, …)
- Zuckerman, Laurel (2022): "Replication Data for: artwork provenances NEPIP", Harvard Dataverse, CC0 1.0

ChatGPT already does the job quite well, although it still assigns some data to the wrong fields. It could therefore serve as a starting point to build on, with Wikidata used to expand the available information. A free alternative AI tool should be used, though.
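Because a language model can put values into the wrong fields, its output is worth checking before it is stored. The sketch below assumes the model is asked to return a JSON list of events. The field names and the allowed transaction types are hypothetical and would need to match whatever schema the team settles on.

```python
import json
from typing import List

# Hypothetical schema (assumption): adjust to the agreed event structure.
EXPECTED_FIELDS = {"year_from", "year_to", "location", "transaction_type", "actors"}
ALLOWED_TYPES = {"purchase", "gift", "auction", "inheritance", "unknown"}


def validate_event(event: dict) -> List[str]:
    """Return a list of problems found in one parsed event (empty if none)."""
    problems = []
    keys = set(event)
    for name in sorted(EXPECTED_FIELDS - keys):
        problems.append(f"missing field: {name}")
    for name in sorted(keys - EXPECTED_FIELDS):
        problems.append(f"unexpected field: {name}")
    for name in ("year_from", "year_to"):
        value = event.get(name)
        if value is not None and not isinstance(value, int):
            problems.append(f"{name} should be an integer or null, got {value!r}")
    if "transaction_type" in event and event["transaction_type"] not in ALLOWED_TYPES:
        problems.append(f"unknown transaction_type: {event['transaction_type']!r}")
    if "actors" in event and not isinstance(event["actors"], list):
        problems.append("actors should be a list")
    return problems


def validate_model_output(text: str) -> List[str]:
    """Check that the model's reply is a JSON list of well-formed events."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, list):
        return ["expected a JSON list of events"]
    problems = []
    for index, event in enumerate(data):
        if not isinstance(event, dict):
            problems.append(f"event {index}: not a JSON object")
            continue
        problems += [f"event {index}: {p}" for p in validate_event(event)]
    return problems
```

A check like this only catches structural mismatches, such as a place name in a year field. Whether the extracted values are factually right still has to be verified against the source text.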

Event finished

30.09.2023 15:30

Joined the team

29.09.2023 09:54 ~ jonaslendenmann

Event started

29.09.2023 09:00


Challenge posted

20.09.2023 08:19 ~ jonaslendenmann
Contributed 2 months ago by jonaslendenmann for GLAMhack 2023
The contents of this website, unless otherwise stated, are licensed under a Creative Commons Attribution 4.0 International License.
