Big Data Analytics

(bibliographical data)

Demo

We try to analyse bibliographical data using big data technologies (Flink, Elasticsearch, Metafacture).

Here is a first sketch of what we are aiming at:

Datasets

We use bibliographical metadata:

Swissbib bibliographical data https://www.swissbib.ch/

  • Catalog of all the Swiss University Libraries, the Swiss National Library, etc.

  • 960 libraries / 23 library networks (Bibliotheksverbunde)

  • approx. 30 million records

  • MARC21 XML format

  • → raw data stored in MongoDB

  • → transformed and clustered data stored in CBS (central library system)

edoc http://edoc.unibas.ch/

  • Institutional repository of the University of Basel (document server for open access publications)

  • approx. 50,000 records

  • JSON file

crossref https://www.crossref.org/

  • Digital Object Identifier (DOI) Registration Agency

  • approx. 90 million records (we only use 30 million)

  • JSON scraped from the API

Use Cases

Swissbib

Librarian:

  • To prioritize which of our holdings should be digitized most urgently, I want to know which of them are held nowhere else.

  • We would like to have a list of all the DVDs in Swissbib.

  • What is special about the holdings of a given library or institution? Can we derive a profile?

Data analyst:

  • I want to get to know my data better, and to do so faster.

→ e.g. I want to know which records have no entry for 'year of publication'. I want to analyze whether these records should be sent through the merging process of CBS. Therefore I also want to know whether these records contain other 'relevant' fields as defined by CBS (e.g. ISBN). To analyze the results, a visualization tool might be useful.
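
As a starting point, such a check could be expressed as a plain Elasticsearch query against the local index. The following is only a sketch under assumptions, not the project's actual code: the index name (swissbib) and the field names (publicationYear, isbn, title) are guesses and would have to be adapted to the real mapping.

```python
# Sketch (assumed index/field names): count Swissbib records that lack a
# publication year and show which other 'relevant' fields they contain.
import requests

query = {
    "query": {
        "bool": {
            "must_not": {"exists": {"field": "publicationYear"}}
        }
    },
    "_source": ["isbn", "title"],  # other fields CBS considers relevant
    "size": 10
}

resp = requests.post("http://localhost:9200/swissbib/_search", json=query)
resp.raise_for_status()
hits = resp.json()["hits"]

print("Records without a publication year:", hits["total"])
for hit in hits["hits"]:
    print(hit["_source"])
```

The same must_not/exists pattern can be repeated for any other field defined as relevant by CBS.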

edoc

Goal: Enrichment. I want to add missing identifiers (e.g. DOIs, ORCIDs, funder IDs) to the edoc dataset.

→ Match the two datasets (edoc and Crossref) by author and title

→ Assess the quality of the matches (score)
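
One simple way to sketch this matching is to query the Crossref index with an edoc record's title and first author and to treat Elasticsearch's relevance score as a rough measure of match quality. The example below is a sketch under assumptions: the index name (crossref), the field names (title, author.family, DOI) and the score threshold are placeholders that would need to be adjusted to the actual data.

```python
# Sketch (assumed index/field names): look up a DOI for an edoc record by
# matching author and title against a Crossref index in Elasticsearch.
import requests

ES_URL = "http://localhost:9200/crossref/_search"

def find_doi(title, author_family, min_score=10.0):
    query = {
        "query": {
            "bool": {
                "must": [
                    {"match": {"title": title}},
                    {"match": {"author.family": author_family}},
                ]
            }
        },
        "size": 1,
    }
    resp = requests.post(ES_URL, json=query)
    resp.raise_for_status()
    hits = resp.json()["hits"]["hits"]
    if hits and hits[0]["_score"] >= min_score:
        # the relevance score serves as a rough match-quality indicator
        return hits[0]["_source"].get("DOI"), hits[0]["_score"]
    return None, None

print(find_doi("Some example article title", "Example"))
```

In practice the score threshold would have to be calibrated, e.g. by manually checking a sample of matches.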

Tools

elasticsearch https://www.elastic.co/de/

Java-based search engine; results can be exported as JSON

Flink https://flink.apache.org/

Open-source stream-processing framework

Metafacture https://culturegraph.github.io/, https://github.com/dataramblers/hackathon17/wiki#metafacture

Tool suite for metadata processing and transformation

Zeppelin https://zeppelin.apache.org/

Notebook environment for visualisation of the results

How to get there

Use case 1: Swissbib

Use case 2: edoc

Links

Data Ramblers Project Wiki https://github.com/dataramblers/hackathon17/wiki

Team

  • Data Ramblers https://github.com/dataramblers
  • Dominique Blaser
  • Jean-Baptiste Genicot
  • Günter Hipler
  • Jacqueline Martinelli
  • Rémy Mej
  • Andrea Notroff
  • Sebastian Schüpbach
  • T
  • Silvia Witzig

hackathon17

Files and notes about the Swiss Open Cultural Data Hackathon 2017. For information about the data, use cases, tools, etc., see the Wiki: https://github.com/dataramblers/hackathon17/wiki

Requirements

Elasticsearch cluster

Docker

  • Docker CE, most recent version
  • recommended: 17.06
  • see also hints.md

Docker Compose

  • most recent version
  • see also hints.md

Installation notes

The technical environment runs in Docker containers. This enables everyone to run the infrastructure locally on their computers provided that Docker and Docker Compose are installed.

How to install Docker and Docker Compose

The package sources of many (Linux) distributions do not contain the most recent version of Docker and do not contain Docker Compose at all, so you will most likely have to install them manually.

  1. Install Docker CE, preferably 17.06: https://docs.docker.com/engine/installation/
  2. Verify the installation: sudo docker run hello-world
  3. Install Docker Compose: https://docs.docker.com/compose/install/
  4. Check the installed version: docker-compose --version
  5. Clone this repository: git clone git@github.com:dataramblers/hackathon17.git
  6. Increase your virtual memory: https://www.elastic.co/guide/en/elasticsearch/reference/current/vm-max-map-count.html

How to run the technical environment in Docker (for Linux users)

Use sudo for Docker commands.

  1. Make sure you have sufficient virtual memory: sysctl vm.max_map_count must output at least 262144.
  2. To increase the limit for the current session: sudo sysctl -w vm.max_map_count=262144
  3. To increase the limit permanently:

    1. Create a new .conf file in /etc/sysctl.d/ (e.g. 10-vm-max-map-count.conf; the numeric prefix only indicates the order in which the files are parsed)
    2. Add the line vm.max_map_count=262144 and save the file
    3. Reread the sysctl values: sudo sysctl -p --system
  4. cd to your hackathon17 directory.
  5. docker-compose up → Docker loads the images and initializes the containers
  6. Access the running applications:

    • Elasticsearch HTTP: localhost:9200
    • Elasticsearch TCP: localhost:9300
    • Zeppelin instance: localhost:8080
    • Flink RPC: localhost:6123
    • Dashboard of the Flink cluster: localhost:8081
  7. To stop and exit the Docker containers, run docker-compose down (in a separate terminal) or press Ctrl-C (in the same terminal)
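
To check that the containers actually came up, a short script can ping the ports listed above. This is just a convenience sketch assuming the default port mapping:

```python
# Sketch: verify that the services listed above respond on their default ports.
import requests

services = {
    "Elasticsearch HTTP": "http://localhost:9200",
    "Zeppelin": "http://localhost:8080",
    "Flink dashboard": "http://localhost:8081",
}

for name, url in services.items():
    try:
        resp = requests.get(url, timeout=5)
        print(f"{name}: up (HTTP {resp.status_code})")
    except requests.RequestException as exc:
        print(f"{name}: not reachable ({exc})")
```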


Event finish

  • Remove affiliation field in author and editor objects (@sschuepbach)
  • Define default running environment (Python) (@sschuepbach)
  • Don't clear cache after each processed line (@sschuepbach)
  • Read file directory instead of single files (@sschuepbach)
  • Merge remote-tracking branch 'origin/master' (@sschuepbach)
  • Load Crossref data into Elasticsearch (@sschuepbach)
  • Add a hint for use the setup with jwilder/nginx-proxy (@tobinski)
