A toolbox to improve genome annotation

Thousands of prokaryotic genomes are newly sequenced each year, so solutions that annotate these genomes accurately, and thereby provide the basis for uncovering the functions of genes and their encoded proteins, are urgently needed. Scientists from the SIB Swiss Institute of Bioinformatics have developed an open proteogenomics toolbox that allows researchers to obtain more accurate and complete genome annotations. The method is published in Genome Research.

What is genome annotation?

Genome annotation describes the process of adding information to a raw DNA sequence, such as the location of genes and other features. In essence, it gives meaning to a string of nucleotides and provides the basis for later identifying the functions of the encoded proteins and for studying their regulation or the way they interact.

Proteomics data, generated experimentally, help to improve genome annotations by providing evidence that genes are expressed, and by identifying new protein-coding genes that have so far been missed.

What is proteogenomics?

This emerging field of biological research, at the intersection of proteomics and genomics, commonly refers to studies that use proteomic information (derived from mass spectrometry, for example) to provide experimental evidence that genes are expressed, and to improve gene annotations.

The toolbox developed here relies on proteogenomics as it searches mass spectrometry data against large protein search databases that capture the entire protein-coding potential of prokaryotes.
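To make the idea of a search database that covers "the entire protein-coding potential" more concrete, here is a minimal Python sketch. It is not the toolbox's own code: it simply assumes that one ingredient of such a database is a six-frame translation of the genome, in which every stretch between two stop codons becomes a candidate sequence. The function names and the `min_len` parameter are hypothetical, and the standard genetic code is used for simplicity.

```python
from itertools import product

# Standard genetic code, built in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_peptides(dna, min_len=2):
    """List (strand, frame, sequence) for every stop-to-stop segment.

    Hypothetical helper: real pipelines also track genome coordinates
    and alternative start codons, which this sketch ignores.
    """
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            for segment in translate(seq[frame:]).split("*"):
                if len(segment) >= min_len:
                    results.append((strand, frame, segment))
    return results


if __name__ == "__main__":
    # Toy sequence, invented for illustration only.
    for entry in six_frame_peptides("ATGAAATAG"):
        print(entry)
```

Mass spectrometry spectra matched against such candidate sequences can then support genes that no reference annotation lists, which is why a database of this kind becomes large.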

A pressing need for high-quality genome annotation

Recent advances in Next Generation Sequencing (NGS) enable researchers to sequence and assemble most prokaryotic genomes within a few hours. Twenty years ago, this process took several months or even years.

Thousands of prokaryotic genomes can thus be sequenced yearly. But accurately annotating these genomes is a different story altogether.

"It is a long-standing problem in genomics: even for an identical genome sequence, reference annotations from different genome annotation centres can vary significantly, with regard to the overall number of genes, or their precise start sites," explains SIB Group Leader Christian Ahrens from the Agroscope (Bern), who led the project.

"Moreover, a genome annotation - even from the same resource - can change substantially from one release to another."

"Of course, the better a genome annotation, the more value it provides to the research community," comments Ahrens. "However, current software solutions for predicting the location of genes in genomes still miss important genes. This is especially true for short protein-coding genes," he continues. "Researchers have been finding out more and more about the important functions of short proteins, and we can expect many additional surprises from this rapidly expanding field."

A solution that helps to improve genome annotations by integrating and highlighting the differences among reference genome annotations and additional predictions is therefore urgently needed.

A method and toolbox to improve genome annotations

To fill this gap, and building on its proteogenomics expertise, the team has developed and released a web server and a suite of pre-computed integrated Proteogenomics search databases (iPtgxDBs) for reference model organisms that allow researchers to obtain more accurate and complete genome annotations in prokaryotes.

"The solution we have developed lets researchers visualize all annotation differences between releases, for example. Experimental data can also be added to identify the correct annotations and provide evidence for new genes," indicates Ahrens.
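The kind of difference described here can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the web server's implementation: each annotation is assumed to be a dictionary of gene identifier to (strand, start, stop), two entries are treated as the same gene if they share strand and stop coordinate, and all identifiers and coordinates below are invented.

```python
def compare_annotations(ann_a, ann_b):
    """Group genes by (strand, stop) and report how two annotations differ.

    Hypothetical helper: genes with the same strand and stop coordinate
    are assumed to be the same locus, so a differing start coordinate
    indicates a disagreement about the start site.
    """
    def starts_by_stop(annotation):
        return {(strand, stop): start
                for strand, start, stop in annotation.values()}

    a, b = starts_by_stop(ann_a), starts_by_stop(ann_b)
    report = {"identical": [], "different_start": [], "only_a": [], "only_b": []}
    for key in sorted(set(a) | set(b)):
        if key not in b:
            report["only_a"].append(key)
        elif key not in a:
            report["only_b"].append(key)
        elif a[key] == b[key]:
            report["identical"].append(key)
        else:
            report["different_start"].append((key, a[key], b[key]))
    return report


# Invented example data: two annotations of the same genome sequence.
release_1 = {"g1": ("+", 100, 400), "g2": ("+", 500, 800), "g3": ("-", 900, 1200)}
release_2 = {"x1": ("+", 100, 400), "x2": ("+", 530, 800), "x3": ("+", 1300, 1450)}

if __name__ == "__main__":
    for category, genes in compare_annotations(release_1, release_2).items():
        print(category, genes)
```

Peptides detected by mass spectrometry in the region between two competing start sites, or inside a gene found in only one annotation, would then indicate which version is supported by experimental evidence.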

The software relies on concepts from another tool, PeptideClassifier (Qeli and Ahrens, Nature Biotechnology 2010), to ensure that the large protein search database it generates is highly informative.

From improved annotations to agricultural and public health applications

As a proof of concept, the solution was validated on several prokaryotic organisms: Bartonella henselae, the causative agent of cat-scratch disease (bartonellosis); Bradyrhizobium diazoefficiens, a root symbiont of soybean and other leguminous plants; and Escherichia coli, the most widely studied prokaryotic model organism.

"The solution already allowed us to unambiguously identify novel ORFs, including short proteins, metabolic enzymes, differentially expressed proteins and membrane-localized lipoproteins," says Ahrens.

Widely applicable both to reference model organisms and to newly sequenced genomes from human or environmental microbiomes, the proteogenomics toolbox promises to help uncover the functions of important microbial strains, which could be exploited for human health as well as for plant protection.

"During the validation process, we demonstrated that a correct genome assembly is of critical importance, as we could identify single amino acid variations at the protein level. This has important implications for our ability to track clinically relevant pathogens over time, or to follow the development and spread of antibiotic resistance," he concludes.


Omasits U et al. An integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics. Genome Research 2017. 27:2083-2095. DOI: 10.1101/gr.218255.116
