Corpus of Abstracts

We developed guidelines for annotating gene names and used them to annotate a corpus of 82 abstracts curated by FlyBase. The guidelines can be found here and the corpus in IOB format here. Inter-annotator agreement, measured with the Kappa coefficient, was over 90%. If you download and use the corpus, please let us know and cite the following paper:

Vlachos A. and Gasperin C. 2006. Bootstrapping and evaluating Named Entity Recognition in the biomedical domain. In Proceedings of BioNLP at HLT-NAACL 2006, New York City, USA, pages 138-145.
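
For readers unfamiliar with IOB tagging, the following is a minimal Python sketch of how gene-name mentions could be recovered from token-level tags. It is an illustration only: the tag labels (`B-GENE`, `I-GENE`, `O`), the function name `iob_to_spans` and the example sentence are assumptions, not taken from the corpus files, whose exact layout should be checked against the distributed data.

```python
from typing import List, Tuple


def iob_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[int, int, str]]:
    """Group IOB-tagged tokens into (start, end, text) mentions; end is exclusive.

    Assumes a single entity type: tags starting with "B" open a mention,
    tags starting with "I" continue it, and anything else closes it.
    """
    spans = []
    start = None  # index of the first token of the mention currently open
    for i, tag in enumerate(tags):
        if tag.startswith("B"):
            if start is not None:
                spans.append((start, i))  # close the previous mention
            start = i
        elif tag.startswith("I"):
            if start is None:
                start = i  # tolerate an "I" tag that has no preceding "B"
        else:
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))  # mention that runs to the end
    return [(s, e, " ".join(tokens[s:e])) for s, e in spans]


# Hypothetical example sentence with assumed tag labels.
tokens = ["Suppressor", "of", "Hairless", "interacts", "with", "Notch", "."]
tags = ["B-GENE", "I-GENE", "I-GENE", "O", "O", "B-GENE", "O"]
print(iob_to_spans(tokens, tags))
```

Returning token offsets alongside the text keeps each mention traceable to its position in the sentence, which is useful when comparing two annotators' output.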

Corpus of Full Papers

We produced the first corpus of full papers annotated with anaphoric relations between noun phrases referring to genes and biologically related entities. The corpus consists of 5 papers. For the annotation of gene names we followed the same guidelines as for the annotation of the abstracts. These data were first used in Vlachos 2007 and are available here. The anaphoric annotation follows these guidelines and is available here. The annotation scheme and process are reported in:

Gasperin C., Karamanis N. and Seal R. 2007. Annotation of anaphoric relations in biomedical full-text articles using a domain-relevant scheme. In Proceedings of DAARC 2007, Algarve, Portugal, pages 19-24.

Corpus of Speculative Sentences

This corpus consists of more than 1,500 sentences annotated as speculative (hedges) or not speculative. The annotation guidelines and the data are available here. They were used for the experiments reported in the following paper:

Medlock B. and Briscoe T. 2007. Weakly supervised learning for hedge classification in scientific literature. In Proceedings of ACL 2007, Prague, Czech Republic, pages 992-999.

The speculative sentences were further annotated for speculative events at the clausal level. These data are available here.

SciXML-CB

SciXML-CB is an interface to a variety of XML markup schemes used by the publishers of scientific journals. It was designed and tested with scientific articles in chemistry and biology in mind. SciXML-CB is currently used in FlySlip as an interface to articles whose full-text XML appears in the National Library of Medicine (PubMed Central) archives, for example PLoS Biology. The SciXML-CB software contains a definition of SciXML-CB plus scripts for converting (some) journal XML into SciXML-CB and for validating purported SciXML-CB against the definition. Click here to download SciXML-CB. More information on SciXML can be found in:

Teufel S. 1999. Argumentative Zoning: Information Extraction from Scientific Text. PhD thesis, School of Cognitive Science, University of Edinburgh.

PaperBrowser

PaperBrowser is a customised web browser which displays a paper in SciXML-CB format enriched with additional NLP analysis. It is equipped with navigational mechanisms which use the NLP mark-up to help curators interact with the text quickly and efficiently. PaperBrowser is written in Java and can be downloaded by clicking here.

This poster provides an overview of PaperBrowser and its evaluation results. PaperBrowser is discussed in more detail in:

Karamanis N., Lewin I., Seal R., Drysdale R. and Briscoe E. J. 2007. Integrating Natural Language Processing with FlyBase curation. In Proceedings of the Pacific Symposium on Biocomputing 2007, Maui, USA, pages 245-256.

Click here for snapshots of PaperBrowser and a photo of a FlyBase curator using it.

