Annohub (short for annotation hub) is the name of a workflow dedicated to the collection of annotation metadata of language resources. It covers freely available resources such as corpora, lexicons, or ontologies originating from the field of computational linguistics. Currently, the data collection comprises metadata of more than 600 different resources.
The workflow involves automated generation of metadata and subsequent curation of the results by domain experts. The generated metadata includes information about
Annohub is a service of the Fachinformationsdienst Linguistik (Specialised Information Service Linguistics), a cooperation project of the University Library Johann Christian Senckenberg and the Applied Computational Linguistics (ACoLi) lab at the Goethe University in Frankfurt, Germany. The Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) supports the project.
The Annohub software is used to download, parse and analyse language resources available in either RDF, CoNLL, or XML format. The software is available under an open licence at https://github.com/ubffm/Annohub.
Additionally, several tools for the conversion of XML documents to CoNLL and RDF formats are employed as part of the workflow. The tool packages are available under https://github.com/acoli-repo/xml2conll and https://github.com/acoli-repo/conll-rdf.
The information resulting from the analysis, together with the basic formal descriptions of the resources (title, author, etc.), is stored in a metadata repository established for this purpose (Annohub-Repository). The repository data has been integrated into the Lin|gu|is|tik portal (www.linguistik.de). You can search for Annohub resources using the LOD search. Additionally, the portal provides a tabular overview of the dataset.
For more information, please consult:
Abromeit et al. (2020). Annohub – Annotation Metadata for Linked Data Applications. In Proceedings of the 7th Workshop on Linked Data in Linguistics (LDL-2020), Marseille, France, May 2020, pages 36-44.
Here you can find an RDF edition of the Annohub dataset. The data is published under a Creative Commons Attribution licence (CC BY). The links below lead to the latest version of the dataset. All previous versions can be found in our archive (https://annohub.linguistik.de/archive/).
Persistent URI: http://annohub.linguistik.de/annohub-dataset
Via content negotiation, we serve different file formats depending on the Accept field in the HTTP request header. The following MIME types are supported:
application/rdf+xml. Additionally, a static data dump is available.
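As a minimal sketch of how content negotiation against the persistent URI might look from a client, the following Python snippet sends an Accept header asking for application/rdf+xml. The function names build_request and fetch_dataset are hypothetical helpers introduced for illustration, and the snippet assumes the server answers a plain HTTP GET on the persistent URI; it is not part of the Annohub software.

```python
from urllib.request import Request, urlopen

# Persistent URI of the dataset, as given above.
DATASET_URI = "http://annohub.linguistik.de/annohub-dataset"


def build_request(uri: str = DATASET_URI,
                  mime_type: str = "application/rdf+xml") -> Request:
    """Build a GET request whose Accept header names the desired format.

    Hypothetical helper: the server is assumed to pick the response
    format from this header (content negotiation).
    """
    return Request(uri, headers={"Accept": mime_type})


def fetch_dataset(uri: str = DATASET_URI,
                  mime_type: str = "application/rdf+xml",
                  timeout: float = 30.0) -> bytes:
    """Send the request and return the raw response body as bytes."""
    with urlopen(build_request(uri, mime_type), timeout=timeout) as response:
        return response.read()
```

Calling fetch_dataset() should then return the serialized dataset as bytes, which can be written to a file or handed to an RDF parser of your choice. If the request fails or the format is not what you expect, the static data dump remains an alternative.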