Make provenance data machine-readable with AI & expand knowledge with Wikidata
In many cases, provenance data is not machine-readable. The provenance texts digitally recorded in museums’ databases are sometimes only available as plain strings. The data is often incomplete, inconsistently formatted, and interspersed with additional notes. This is usually not a problem for humans, but computers need precise structures and separators to translate the text into clean, machine-readable data.
The challenge consists of parsing provenance texts into machine-readable formats. This could be done with the support of readily available machine learning algorithms, trained on provenance data from various sources.
Level 1: Parse
The provenance texts should be split into separate transaction events. Each entry related to a transaction should contain metadata such as the time range and location, the transaction type, and the involved actors (previous owner, new owner, and transaction facilitators such as auction houses). This step will require comparison with dictionaries and other authority files to parse the provenance information correctly. Ideally, the parser should also handle texts in different languages (English, German, French, …).
Level 2: Link
Retrieve additional information from Wikidata to fill in gaps. This comes in handy when, for example, only the name of an art trader or institution is available but the related geodata is missing. Scaled up properly, this could save quite some future research time! Optional: expand data gathering to other authority files such as the GND.
Level 3: Gamify
(This part is maybe one for the next hackathon…) Make it interactive! Turn it into a web-based tool that any provenance researcher can use without programming knowledge, and which in return trains the tool by telling it whether it parsed the provenance information correctly.
Datasets:
— (still needed) provenance data samples from participating museums in DE + FR (for example from MEG, MKB, MRZ, VMZ, …)
— Zuckerman, Laurel (2022): "Replication Data for: artwork provenances NEPIP", https://doi.org/10.7910/DVN/LIRCGI, Harvard Dataverse, CC0 1.0
ChatGPT already does the job quite well (though it still maps data to the wrong fields at times), so this could be a starting point for expanding the available information with the support of Wikidata. A free and open alternative AI tool should be used, though.
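Since LLM output sometimes lands in the wrong field, a lightweight sanity check can flag suspect entries for human review before they enter the database. This is only a sketch: the field names and rules below are assumptions about what a parsed event might look like, not a fixed schema from the challenge.

```python
import re

# Hypothetical field set for one parsed transaction event:
REQUIRED_FIELDS = {"actor", "location", "date", "transaction_type"}
DATE_PATTERN = re.compile(r"^\d{4}(-\d{4})?$")  # a year or a year range

def validate_event(event: dict) -> list[str]:
    """Return a list of problems found in one LLM-parsed transaction event."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - event.keys())]
    date = event.get("date")
    if date is not None and not DATE_PATTERN.match(str(date)):
        problems.append(f"date does not look like a year or range: {date!r}")
    # A year showing up in the actor field is a common field-mismatch symptom.
    if event.get("actor") and re.search(r"\d{4}", event["actor"]):
        problems.append("actor field contains a year, possible field mix-up")
    return problems
```

Events that come back with a non-empty problem list could be routed to the gamified review tool from Level 3, closing the feedback loop that retrains the parser.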