SourceData is making data discoverable

A push for a paradigm change in scientific publication

Bringing to the surface information buried in the figures of scientific papers: this is the purpose of SourceData, an open-access platform developed by EMBO in collaboration with the SIB Swiss Institute of Bioinformatics. SourceData allows researchers and publishers to share figures and their underlying data in a machine-readable, searchable format. This award-winning publishing innovation is highlighted in a Nature Methods paper this week, as well as in the journal’s Editorial.

A wealth of inaccessible information

More than 1 million biomedical papers are published each year. Without appropriate methods to extract information from this wealth of knowledge, data representing years of research could be practically lost.
“Finding ways to promote access to published scientific data is a major preoccupation today”, says Dr Robin Liechti, Senior Researcher at SIB’s Vital-IT Group and co-author on the paper. “So far, a key issue in accessing data is that the core findings of a scientific paper are often depicted in figures.” Such information can, for example, reflect the effect of one substance on another, or indicate levels of gene expression under the influence of a specific drug.
Liechti continues: “Even if a paper is published as ‘open-access’, even if it has a set of carefully defined searchable keywords and that text-mining can be applied to surface specific terms, the information contained in figures remains unsearchable and, therefore, not easy to access and reuse.”

SourceData is making data discoverable

The SourceData platform is an EMBO-led project, jointly developed with SIB’s Vital-IT Group, that provides researchers and publishers alike with an intuitive interface for sharing figures and their underlying data in a machine-readable, searchable format. In addition, SourceData provides a public interface where scientists can efficiently find and reuse published results.

How does it work?

A description of each figure is generated and stored in a structured database. The biological entities represented in the figure, such as genes, proteins or molecules, are linked to a controlled vocabulary to avoid naming ambiguity. This means that each occurrence of a certain biological entity in a figure (e.g. ‘Insulin’ or ‘Glucose’) or result set can be quickly found within the SourceData database.
SourceData also stores the role of each entity, that is, whether it was manipulated or observed in the experiment. This allows very specific searches based on the experimental design.
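As a rough illustration of this principle, the short Python sketch below stores a few figure panels with tagged entities and searches them by experimental role. It is not SourceData's actual schema or API: the class names, the `find_panels` function, the panel descriptions and the `VOCAB:` identifiers are all hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Role(Enum):
    MANIPULATED = "manipulated"  # entity that was perturbed in the experiment
    OBSERVED = "observed"        # entity that was measured


@dataclass(frozen=True)
class Entity:
    label: str     # text as it appears in the figure, e.g. "Insulin"
    vocab_id: str  # placeholder identifier in a controlled vocabulary
    role: Role


@dataclass
class Panel:
    panel_id: str
    description: str
    entities: List[Entity]


def find_panels(panels, manipulated_id, observed_id):
    """Return panels where one entity was manipulated and another was observed."""
    hits = []
    for panel in panels:
        tagged = {(e.vocab_id, e.role) for e in panel.entities}
        if ((manipulated_id, Role.MANIPULATED) in tagged
                and (observed_id, Role.OBSERVED) in tagged):
            hits.append(panel)
    return hits


# Invented example data. Labels differ between panels ("Insulin" vs "INS"),
# but the shared placeholder identifier removes the naming ambiguity.
PANELS = [
    Panel("fig1a", "Glucose levels after insulin treatment",
          [Entity("Insulin", "VOCAB:0001", Role.MANIPULATED),
           Entity("Glucose", "VOCAB:0002", Role.OBSERVED)]),
    Panel("fig2b", "Insulin secretion after glucose stimulation",
          [Entity("glucose", "VOCAB:0002", Role.MANIPULATED),
           Entity("INS", "VOCAB:0001", Role.OBSERVED)]),
]

# Search: experiments where insulin was manipulated and glucose was observed.
matches = find_panels(PANELS, manipulated_id="VOCAB:0001", observed_id="VOCAB:0002")
print([p.panel_id for p in matches])
```

Both panels mention the same two entities, so a keyword search would return both. Because each entity also carries its role, the query above returns only the panel in which insulin was the manipulated entity and glucose the observed one.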

A change of paradigm for scientific publishing

This proof of concept should encourage researchers and publishers to integrate a curation step into their manuscript submission and publication workflows, respectively, so that the data underlying their figures become readily accessible to the broader community.
“Here, we show that a single model can be used to encode the scientific hypothesis behind a wide spectrum of biological experiments”, concludes Dr Thomas Lemberger of EMBO who leads the project. “With a very simple principle, we managed to fit 80% of the experiments present in our dataset”.
The platform is in active development, with SIB and EMBO engaging with academic publishers to establish an open and effective standard for the discovery and reuse of figure data.


Liechti R, George N, Götz L, El-Gebali S, Chasapi A, Crespo I, Xenarios I & Lemberger T (2017). SourceData - a semantic platform for curating and searching figures. Nature Methods. DOI: 10.1038/nmeth.4471
