GABenchToB: A Genome Assembly Benchmark Tuned on Bacteria and Benchtop Sequencers

September 09, 2014

Researchers from the University of Münster and the Center for Biotechnology at Bielefeld University, Germany, have evaluated de novo assemblers for microbial genomes from a practical user perspective.

According to the authors of the assembly benchmark tuned on bacteria and benchtop sequencers, GABenchToB for short, the main goal of the study was to “give the research community a practice-oriented assembly evaluation” addressing questions that were “covered insufficiently in the past”.

After evaluating nine different assembly algorithms applied to ten different bacterial next-generation sequencing data sets, the researchers reported in a paper published this week in PLoS One that the assemblies show a high degree of variation and that “consequently no assembler can be rated best for all preconditions”.

That alone is not surprising, as it was also the main conclusion of one of the most recent assembly software comparisons, Assemblathon 2, hosted by researchers at the University of California, Santa Cruz and UC Davis.

Still, GABenchToB did many things differently. In contrast to the Assemblathon competitions, to GAGE, a ‘bake-off’ of assemblers run by researchers at the University of Maryland, and to its bacteria-based successor GAGE-B, the GABenchToB evaluation scenario was set up to cover most of what is regarded as the basis of applied genome assembly.

The set of assemblers compared in this study, for instance, covers not only what has proven to be the cream of the crop in previous evaluations but also commercial assemblers that have so far been disregarded. Among them is Roche's GS De Novo Assembler, better known as Newbler, which is one of the most widely used genome assemblers but had never before been a candidate in any assembler challenge. Indeed, the study showed that Newbler is very “robust with respect to all kind of sequencing data”, even though it was originally designed to handle 454 reads only.

For this evaluation, the researchers built upon data from benchtop sequencing platforms, which provide sufficient genomic coverage […] to be efficiently used for sequencing bacterial genomes. As with the choice of assembler candidates, not only the top dog among the sequencing companies, Illumina with its benchtop sequencing platform MiSeq, is represented but also Ion Torrent's Personal Genome Machine (PGM), for which “assembly evaluations are missing” so far. The test objects for this evaluation were three different bacterial species for which a finished, high-quality reference sequence was available.

The evaluators are well aware that these preconditions hardly allow general questions to be answered, such as “which assembler performs best and which [sequencing] platform allows for the best assemblies”. However, the authors argue that researchers are still “confronted with concrete application scenarios” and therefore “require decision-making support”.

One aspect aimed directly at this need is the evaluation of the computational cost that an assembly process demands. “From a practical point of view, the run time of an assembly, for instance, is one of the most crucial parameter which influences the decision to use a particular assembler or not”, says Sebastian Jünemann, first author of the study. “Even the world's best assembler is of little use if its run time requirement renders that specific assembler as practically unusable”. By measuring run time and memory usage, the paper highlights some remarkable differences between the assemblers. While some candidates finished a genome assembly within minutes, others took several hours for the same genome.

The paper also took a closer look at the question of to what extent the depth of coverage affects an assembly result. “For 454 sequencing data, it is no secret that a too high coverage can have a negative influence on the assembly. Surprisingly, we couldn't find any article analyzing and describing this effect in detail, neither for 454 nor for PGM data, which due to its comparable read pattern would be most probable prone to similar effects”, says Jünemann. As hypothesized, the evaluation revealed that for PGM data “researchers should consider to sub-sample their data […] in order to prevent negative oversampling effects”.
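To make the sub-sampling idea concrete, here is a minimal Python sketch. It is not taken from the GABenchToB scripts; the function names, the target coverage and the simple random-keep strategy are assumptions chosen for illustration. It estimates the current depth of coverage as total sequenced bases divided by genome size and keeps each read with the probability needed to reach a chosen target depth.

```python
import random

def subsample_fraction(genome_size, total_bases, target_coverage):
    """Fraction of reads to keep so the expected coverage is about target_coverage.

    Coverage is estimated as total_bases / genome_size (an approximation that
    ignores unmappable or contaminating reads).
    """
    if genome_size <= 0 or total_bases <= 0:
        raise ValueError("genome_size and total_bases must be positive")
    current_coverage = total_bases / genome_size
    # Never ask for more reads than are available.
    return min(1.0, target_coverage / current_coverage)

def subsample_reads(reads, fraction, seed=0):
    """Keep each read independently with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return [read for read in reads if rng.random() < fraction]

# Example with assumed numbers: a 5 Mb genome sequenced to 100x, reduced to about 40x.
fraction = subsample_fraction(5_000_000, 500_000_000, 40)
```

Because reads are kept independently, the resulting coverage only approximates the target; which target depth works best depends on the assembler and the data, which is exactly what the benchmark set out to examine.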

Another aspect that this benchmark puts in the limelight is one familiar to anyone who has ever used an assembler from the family of so-called de Bruijn graph assemblers: the choice of the k-mer parameter. The researchers examined the effect of the k-mer parameter on the assembly result and reported that the best-performing k-mer value was highly inconsistent across data sets and assemblers. This means that “parameters proven successful in the past may not be adequate for new assembly problems”, they wrote. They added that “currently the best solution is to pursue a trial-and-error approach. The downsides of [this] … , in turn, are drastically increased running times countering the speed advantage of [de Bruijn Graph] … assemblers”.
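For readers unfamiliar with the k-mer parameter, the following Python sketch shows what it controls. It is a toy illustration, not the method of any assembler in the study, and all function names are hypothetical. Each read is cut into overlapping substrings of length k; a small k makes repeated sequence collapse into the same k-mers, while a large k yields fewer k-mers per read. A trial-and-error search simply repeats the whole procedure for each candidate k, which is where the extra run time comes from.

```python
def kmers(seq, k):
    """All overlapping substrings of length k (empty if the read is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def kmer_summary(reads, k):
    """Count total and distinct k-mers over all reads for one value of k."""
    counts = {}
    for read in reads:
        for kmer in kmers(read, k):
            counts[kmer] = counts.get(kmer, 0) + 1
    return {"k": k, "total": sum(counts.values()), "distinct": len(counts)}

def sweep(reads, k_values):
    """Trial-and-error style sweep: one full pass over the reads per candidate k."""
    return [kmer_summary(reads, k) for k in k_values]

# Toy read containing a repeat: with k=3 the repeat collapses, with k=5 it does not.
results = sweep(["ACGTACGT"], [3, 5])
```

A real assembler would build a graph from these k-mers and would be judged on the resulting contigs, so each candidate k costs a full assembly run rather than a simple count.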

The assembly benchmark concludes with some recommendations for combinations of assemblers and benchtop sequencers. In summary, the SPAdes assembler was a very good choice for data originating from both the MiSeq and PGM platforms. In addition, the assembler from CLC bio was a promising and very fast choice for assembling MiSeq data. Both the SPAdes and CLC bio assemblers “offer good performing default k-mer parameters, are generally easy to execute and show one of the highest NGA50 and lowest mis-assembly rates”. For PGM data, Roche's Newbler and the MIRA assembler also proved to be good choices. “MIRA is able to achieve very high NGA50 values particular at higher coverages … [and] NEWBLER … convinces with … a low rate of mis-assemblies and a fairly quick execution time”.
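The NGA50 metric quoted above belongs to the N50 family of contiguity statistics. As a rough guide, the Python sketch below computes the two simpler relatives, N50 and NG50, from a list of contig lengths. It does not compute NGA50 itself, which additionally requires aligning contigs to the reference and breaking them at mis-assemblies; the function names and example lengths are made up for illustration.

```python
def _length_at_half(lengths, total):
    """Largest length L such that pieces of length >= L sum to at least half of total."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # the pieces never reach half of total

def n50(contig_lengths):
    """N50: the halfway point is measured against the total assembly length."""
    return _length_at_half(contig_lengths, sum(contig_lengths))

def ng50(contig_lengths, genome_size):
    """NG50: the halfway point is measured against the reference genome size."""
    return _length_at_half(contig_lengths, genome_size)

# Made-up contig lengths for an assembly of 290 bases in total.
contigs = [80, 70, 50, 40, 30, 20]
print(n50(contigs), ng50(contigs, 400))
```

Applying the same calculation to aligned blocks instead of raw contigs gives NGA50, which is why that metric penalizes assemblies that look contiguous but contain mis-joins.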

To enable the community to reproduce the study results, the sequencing data, processed data sets, assemblies and scripts have been made freely available at the GABenchToB repository. The sequencing data can also be accessed at the European Nucleotide Archive (ENA; project ERP006674).

This work was supported in large part by the European Commission’s Seventh Framework Programme (EU PathoNGenTrace project agreement no. 278864).

Citation:
Jünemann S, Prior K, Albersmeier A, Albaum S, Kalinowski J, Goesmann A, Stoye J and Harmsen D (2014). GABenchToB: A genome assembly benchmark tuned on bacteria and benchtop sequencers. PLoS ONE 9(9): e107014. doi:10.1371/journal.pone.0107014

URL: www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0107014

Univ.-Prof. Dr. med. Dag Harmsen
Head of Research
Zentrum für Zahn-, Mund- und Kieferheilkunde
Poliklinik für Parodontologie
Universitätsklinikum Münster
Albert-Schweitzer-Campus 1, Gebäude W 30
Waldeyerstrasse 30
48149 Münster, Germany