
ChEMU: Cheminformatics Elsevier Melbourne University

Information Extraction from Chemical Patents

NEWS update: Training data and submission website now available!

Please head over to our new website at: to access the data and formally participate.

We will be running a new evaluation lab named ChEMU, part of the 11th Conference and Labs of the Evaluation Forum (CLEF-2020). ChEMU proposes two key information extraction tasks over chemical reactions from patents.

The tasks are briefly explained in our upcoming ECIR 2020 paper:

Annotation Guidelines

To learn how the datasets are annotated and to gain further insight into the tasks, please see the annotation guidelines:

Sample dataset is available

The data for this task is released in BRAT format. This is a standoff format, with the text in one plain text file (*.txt), and the annotations in a different file (*.ann).
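As a rough illustration of the standoff layout, the sketch below reads the text-bound annotations (IDs starting with "T") out of the content of an .ann file and checks that their character offsets point back into the accompanying text. It relies only on the general BRAT standoff conventions (tab-separated fields, space-separated offsets, ";" between the fragments of a discontinuous span). The class and function names, the sample sentence and the label EXAMPLE_LABEL are all made up for the example and are not taken from the ChEMU data or guidelines.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TextBound:
    """A BRAT text-bound annotation (an ID starting with 'T')."""
    ann_id: str
    label: str
    spans: List[Tuple[int, int]]  # (start, end) character offsets, end exclusive
    text: str


def parse_text_bound(ann_content: str) -> List[TextBound]:
    """Collect the text-bound annotations from the content of an .ann file."""
    entities = []
    for line in ann_content.splitlines():
        if not line.startswith("T"):
            continue  # events (E), relations (R), notes (#) etc. are skipped here
        ann_id, type_and_offsets, text = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous annotations separate their fragments with ';'
        spans = [tuple(int(n) for n in frag.split()) for frag in offsets.split(";")]
        entities.append(TextBound(ann_id, label, spans, text))
    return entities


# Illustrative data only; the label is a placeholder, not a ChEMU label.
sample_txt = "The mixture was stirred in ethanol for 2 h."
sample_ann = "T1\tEXAMPLE_LABEL 27 34\tethanol\n"

for ent in parse_text_bound(sample_ann):
    start, end = ent.spans[0]
    # Offsets index directly into the plain-text file.
    assert sample_txt[start:end] == ent.text
    print(ent.ann_id, ent.label, ent.spans, ent.text)
```

Because the annotations live apart from the text, the same .txt file can be paired with different .ann files without being modified.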

The configuration files required for BRAT are included in each of the two subdirectories, "ner" for Task 1 and "ee" for Task 2.
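Given that layout, one way to collect the documents for each task is to walk the two subdirectories and pair every .txt file with the .ann file of the same name. This is only a sketch: the folder name "sample_data" and the helper paired_documents are assumptions for illustration, and the actual file names in the release may differ.

```python
from pathlib import Path
from typing import List, Tuple


def paired_documents(task_dir: Path) -> List[Tuple[Path, Path]]:
    """Return (text file, annotation file) pairs found in one task subdirectory."""
    pairs = []
    for txt_path in sorted(task_dir.glob("*.txt")):
        ann_path = txt_path.with_suffix(".ann")
        if ann_path.exists():  # skip texts that have no annotation file
            pairs.append((txt_path, ann_path))
    return pairs


# "sample_data" is a placeholder for wherever the download was unpacked.
root = Path("sample_data")
for task in ("ner", "ee"):
    task_dir = root / task
    if task_dir.is_dir():
        print(task, len(paired_documents(task_dir)), "annotated documents")
```

Matching on "*.txt" leaves the BRAT configuration files in each subdirectory out of the pairing.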

A visualization of the latest sample dataset is provided here: Visualization of Sample Dataset.

Latest version: On 7 April, we removed the labeled trigger words from the annotation files in "ner", since those words are not part of the target output for Task 1. This version is available at:

Second version: On 18 March, we created the second version of the sample dataset, correcting some inconsistencies in how character entities were handled. This version is available at:

Note that the file numbers in this version of the sample differ from those in the first version.

First version: Please find the first version of the sample dataset here:

Relevant background:

  1. Zhai Z, Nguyen DQ, Akhondi S, Thorne C, Druckenbrodt C, Cohn T, Gregory M and Verspoor K. (2019) Improving Chemical Named Entity Recognition in Patents with Contextualized Word Embeddings. Proceedings of the Workshop on Biomedical Natural Language Processing (BioNLP) at ACL 2019.
  2. Yoshikawa H, Verspoor K, Baldwin T, Nguyen DQ, Zhai Z, Akhondi S, Thorne C, Druckenbrodt C. (2019) Detecting Chemical Reaction Schemes in Patents. Australian Language Technology Association Workshop (ALTA 2019). Sydney, Australia, December 2019.

If you are interested in participating in the CLEF-2020 ChEMU task on information extraction from chemical patents, please register here:

To access the data and the submission site and to formally participate, you will also need to register here and accept the data usage agreement:

This project is a collaboration between the University of Melbourne natural language processing group in the School of Computing and Information Systems, the Elsevier Content Transformations, Life Science team, and RMIT University. The principal investigator of the project is Karin Verspoor. The research is supported by an Australian Research Council Linkage Project, LP160101469, and Elsevier.

Key Dates

For questions about the task, please email: