Gauging the completeness of genomics data with BUSCO

As sequencing costs continue to fall, genome sequencing projects have been initiated for a wide range of organisms, but the vast majority of genomes currently exist only as draft assemblies. Because a major driving force behind sequencing projects is the acquisition of a complete catalogue of genes, assessing the integrity and completeness of the assembled genome or transcriptome is an essential task. Researchers from the SIB Swiss Institute of Bioinformatics have developed an assessment tool that quantifies genomic data completeness in terms of expected gene content. The method is published in Molecular Biology and Evolution.

BUSCO Quality Assessments

The rapid accumulation of sequenced genomes means that comparative genomics approaches are becoming ever more powerful tools for improving both genome-wide gene structural annotations and large-scale gene functional inferences. Such approaches are well established as immensely valuable for gene discovery and characterization, helping to build resources that support biological research. The success of such interpretative analyses relies on the comprehensiveness and accuracy of the input data, which makes quality assessment an important part of genome sequencing, assembly, and annotation.

The Benchmarking Universal Single-Copy Ortholog (BUSCO) assessment tool gauges the quality and completeness of genome assemblies and their annotations by providing intuitive quantitative measures of how much of the expected gene content is present.
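To make the idea of completeness "in terms of expected gene content" concrete, here is a minimal sketch of how such a score could be tallied. It assumes the four outcome categories used in BUSCO's published summaries (complete single-copy, complete duplicated, fragmented, and missing), which this article does not spell out. The names BuscoCounts and summarize are hypothetical and are not part of the BUSCO software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuscoCounts:
    """Hypothetical container for the outcome of one assessment (not a BUSCO API)."""
    single_copy: int  # expected genes found complete, exactly once
    duplicated: int   # expected genes found complete, more than once
    fragmented: int   # expected genes found only partially
    missing: int      # expected genes not found at all

    @property
    def total(self) -> int:
        # Every expected gene falls into exactly one of the four categories.
        return self.single_copy + self.duplicated + self.fragmented + self.missing


def pct(part: int, total: int) -> float:
    """Percentage of the expected genes, rounded to one decimal place."""
    return round(100.0 * part / total, 1)


def summarize(counts: BuscoCounts) -> str:
    """Build a one-line summary in the style of BUSCO's short notation."""
    n = counts.total
    if n == 0:
        raise ValueError("no expected genes were assessed")
    complete = counts.single_copy + counts.duplicated
    return (
        f"C:{pct(complete, n)}%"
        f"[S:{pct(counts.single_copy, n)}%,D:{pct(counts.duplicated, n)}%],"
        f"F:{pct(counts.fragmented, n)}%,M:{pct(counts.missing, n)}%,n:{n}"
    )


if __name__ == "__main__":
    # Illustrative numbers only, not results from any real genome.
    print(summarize(BuscoCounts(single_copy=900, duplicated=50, fragmented=30, missing=20)))
```

With these illustrative counts, 950 of the 1,000 expected genes are complete, so the summary reports 95% completeness. A high duplicated fraction or a high missing fraction would each point to a different kind of assembly or annotation problem.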

Since its initial publication in the journal Bioinformatics, BUSCO has been rapidly adopted by the genomics community. This is exemplified by the large number of genome projects that presented BUSCO results at the recent Plant and Animal Genome conference in California (#PAGXXVI), attended by Felipe Simão, co-author and SIB member from the Zdobnov group at the University of Geneva.

Presentations of projects using BUSCO included genomics of oat plants, rubber trees, calanoid copepod zooplankton, house dust mites, albacore longfin tuna, finger millet, almaco jack fish, red abalone molluscs, reindeer, cowpea legumes, spoon and acorn worms, Japanese honey bees, Holstein dairy cows, ice crawler insects, mango fruits, sugar cane aphids, coconuts, jaltomata plants, tomatoes, weeping lovegrass, and fungal pathogens of strawberries and alfalfa.

BUSCO updates and new genomics applications

The updated BUSCO tool offers a greatly extended array of assessment datasets. The six original datasets have been updated with improved species sampling, datasets are now also available for nematodes, protists, and plants, and 34 new subsets have been built for fungi, arthropods, vertebrates, and prokaryotes, greatly enhancing the resolution for those groups. "The latest BUSCO software release is more flexible and extendable to improve options for high-throughput assessments by implementing a complete refactoring of the code," explains Mathieu Seppey, co-author and SIB member from the Zdobnov group.

The latest publication in Molecular Biology and Evolution presents analyses that highlight BUSCO's wide-ranging utility, which now encompasses not only quality assessments but also the training of gene predictor software, as well as metagenomics, phylogenomics, and comparative genomics analyses.

Lead author and new SIB Group Leader at the University of Lausanne, Robert Waterhouse, explains, "BUSCO offers important complementary metrics to assess the quality, integrity, and completeness of draft genomes, transcriptomes, or annotated gene sets. This facilitates informative comparisons, for example, of newly-sequenced draft genome assemblies to those of gold-standard models, or to quantify iterative improvements to assemblies or annotations. BUSCO assessments therefore offer intuitive metrics, based on evolutionarily informed expectations of gene content from hundreds of species, to gauge completeness of rapidly-accumulating genomic data."

Waterhouse RM et al. BUSCO applications from quality assessments to gene prediction and phylogenomics. Molecular Biology and Evolution 2018. DOI:10.1093/molbev/msx319