Tuesday, May 16, 2017

Lecture 19: Jennifer Lu (Steven Salzberg Lab)

KrakEN and Bracken.

Metagenomics is a rapidly growing field of study, driven in part by our ability to generate enormous amounts of DNA sequence rapidly and inexpensively. Since the human genome was first published in 2001 (The International Human Genome Sequencing Consortium, 2001; Venter et al., 2001), sequencing technology has become approximately one million times faster and cheaper, making it possible for individual labs to generate as much sequence data as the entire Human Genome Project in just a few days. In the context of metagenomics experiments, this makes it possible to sample a complex mixture of microbes by “shotgun” sequencing, which involves simply isolating DNA, preparing the DNA for sequencing, and sequencing the mixture as deeply as possible. Shotgun sequencing is relatively unbiased compared to targeted sequencing methods (Venter et al., 2004), including widely used 16S ribosomal RNA sequencing, and it has the additional advantage that it captures any species with a DNA-based genome, including eukaryotes that lack a 16S rRNA gene. Because it is relatively unbiased, shotgun sequencing can also be used to estimate the abundance of each taxon (species, genus, phylum, etc.) in the original sample by counting the number of reads belonging to each taxon.
Along with the technological advances, the number of finished and draft genomes has also grown exponentially over the past decade. At present, there are thousands of complete bacterial genomes, 20,000 draft bacterial genomes, and 80,000 full or partial virus genomes in the public GenBank archive (Benson et al., 2015). This rich resource of sequenced genomes now makes it possible to sequence uncultured, unprocessed microbial DNA from almost any environment, ranging from soil to the deep ocean to the human body, and use computational sequence comparisons to identify many of the formerly hidden species in these environments (Riesenfeld, Schloss & Handelsman, 2004). Several methods have appeared that can align a sequence “read” to a database of microbial genomes rapidly and accurately, but this step alone is not sufficient to estimate how much of a species is present. Complications arise when closely related species are present in the same sample, a situation that arises quite frequently, because many reads align equally well to more than one species. Resolving this ambiguity requires a separate abundance estimation algorithm. In their recent article [1], Jennifer Lu and Steven Salzberg from Johns Hopkins University and their colleagues describe a new method, Bracken, that goes beyond simply classifying individual reads and computes the abundance of species, genera, or other taxonomic categories from the DNA sequences collected in a metagenomics experiment.
Figure: Number of reads within the Mycobacterium genus as assigned by Kraken (blue) and estimated by Bracken (purple), compared to the true read counts (green) [1].
Bracken (Bayesian Reestimation of Abundance after Classification with KrakEN) uses the taxonomic assignments made by Kraken [2], a very fast read-level classifier, along with information about the genomes themselves to estimate abundance at the species level, the genus level, or above. The authors of the study demonstrate that Bracken can produce accurate species- and genus-level abundance estimates even when a sample contains multiple near-identical species.
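
To make the re-estimation idea concrete, here is a minimal sketch in Python of how reads that a classifier could only place at the genus level might be redistributed to species with Bayes' rule. It is a simplified illustration, not the authors' implementation: the function name `redistribute_genus_reads`, the species labels, and all of the numbers are hypothetical. It also assumes that the per-species probabilities of a read landing at the genus level have already been computed from the reference genomes.

```python
from typing import Dict


def redistribute_genus_reads(
    species_reads: Dict[str, int],
    genus_reads: int,
    p_genus_given_species: Dict[str, float],
) -> Dict[str, float]:
    """Reassign genus-level reads to species in proportion to a posterior.

    species_reads: reads the classifier assigned directly to each species.
    genus_reads: reads the classifier could only place at the genus level.
    p_genus_given_species: assumed precomputed P(genus | species), i.e. the
        fraction of a species' reads that would be left at the genus level.
    """
    total = sum(species_reads.values())
    if total == 0:
        # No species-level evidence, so no prior can be estimated.
        return {s: 0.0 for s in species_reads}

    # Unnormalized posterior: P(species | genus) is proportional to
    # P(genus | species) * P(species), with the prior P(species) taken
    # from the share of reads assigned directly to that species.
    weights = {
        s: p_genus_given_species[s] * species_reads[s] / total
        for s in species_reads
    }
    norm = sum(weights.values())
    if norm == 0:
        # Nothing can be redistributed; keep the direct assignments.
        return {s: float(n) for s, n in species_reads.items()}

    # Each species keeps its own reads and gains its share of genus reads.
    return {
        s: species_reads[s] + genus_reads * weights[s] / norm
        for s in species_reads
    }


# Hypothetical example: two closely related species share 400 ambiguous reads.
estimates = redistribute_genus_reads(
    species_reads={"species_a": 100, "species_b": 300},
    genus_reads=400,
    p_genus_given_species={"species_a": 0.6, "species_b": 0.2},
)
print(estimates)
```

In this toy case the species with fewer direct reads is also the one more likely to produce ambiguous reads, so it receives a larger share of the genus-level reads than its direct count alone would suggest. The total number of reads is conserved.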

[1] Lu J, Breitwieser FP, Thielen P, Salzberg SL. Bracken: Estimating species abundance in metagenomics data. PeerJ Computer Science. 2017 Jan 2;3:e104.
[2] Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology. 2014 Mar 3;15(3):R46.
Jennifer Lu is a Biomedical Engineering Ph.D. candidate in Professor Steven Salzberg's lab at the Center for Computational Biology at Johns Hopkins University. With a background in Chemical and Biomolecular Engineering and Computer Science, Jennifer began her Ph.D. with the intent of applying her knowledge of computer science to biomedical research. Currently, her research focuses on computational genomics and the use of next-generation sequencing to diagnose bacterial, fungal, and viral infections relevant to human health and disease. As part of this research, she develops and uses computational methods for quantifying DNA sequence similarity and analyzing the genomes of human pathogens.
