Posts by Scott Peterson

Evaluating Strain-level Variation of Key Acidogenic Species in Dental Plaque Biofilms

The characterization of the dental plaque microbiome using traditional 16S rDNA profiling strategies illustrates both the strengths and the limitations of this method. The central limitation of the 16S rDNA methodology is its inability to decipher strain-level variation within a microbiome. Why is this important? It is becoming a common theme in microbiome research that microbiomes associated with the human host are distinct from those that inhabit the environment. The species present in distinct human microbiomes represent only a small number of taxa. Within these taxa are relatively few genera that have massive representation of member species. This structure has been referred to as the deep fan structure. When comparing microbiomes from healthy and diseased subjects, important strain-level variations may commonly exist, and in many instances these may be causally related to the health of the human host. The dental plaque microbiome illustrates this point strongly. Oral microbiologists have isolated strains from species including S. mitis, S. sanguinis, S. mutans, S. gordonii and others that differ dramatically in their acid production and acid tolerance characteristics. The genes encoding these activities are not part of the core genome, but reflect functions encoded in the strain-variable portion of the genome (~10-30% of the genome's coding capacity). Important aspects of human disease etiology may be missed if we fail to address this possibility.

Summary of Progress: Dental plaque samples from human subjects with and without dental caries were used to isolate S. mutans and S. sobrinus colonies using enrichment culturing procedures. Most colonies were subjected to 2-3 rounds of replating to obtain pure colonies. The individual clones were then grown in liquid media to isolate genomic DNA for fingerprinting of strains by RFLP analysis. This allowed us to collapse strains that appeared identical or highly similar into a set of strains of apparently maximal diversity, encoding the largest number of unique gene sequences. We further characterized the individual strains using primer pairs that are specific for either S. mutans or S. sobrinus. Several of the isolates were negative by PCR; these corresponded to isolates with unusual RFLP patterns and were excluded from further analysis. Some isolates tested positive for one of the two primer pairs used for screening and were marked as such but retained for further analysis using genome sequencing. The isolates obtained were multiplexed into two lanes of the Solexa Genome Analyzer IIx at a theoretical depth of coverage of 50X. Previous evidence based on comparative analyses indicates that strain-specific regions of the S. mutans genome are not randomly distributed but rather are present at discrete locations. The breadth of these regions is not fully characterized, but our analyses should greatly improve that characterization. To date no reference genome sequence is available for S. sobrinus, a potentially important contributor to dental caries.

Each genome to be sequenced was uniquely barcoded using the EpiBio Nextera DNA sample prep kit, and sequencing was performed using an Illumina Genome Analyzer IIx. The sequenced reads were then searched against the GenBank non-redundant nucleotide database for quality assessment and to determine the top hit for each genome. As shown in Table 1, 76 isolates generated best hits to S. mutans and 47 to S. sobrinus genomes. For the 17 isolates that do not appear to be either S. mutans or S. sobrinus, it is somewhat puzzling how they were cultivated on the media used. We believe these colonies were impure and consisted predominantly of the organism whose genome was sequenced.

Top BLAST hit genome       # of isolates
S. sobrinus                47
S. parasanguinis           1
E. faecalis                1
Lactobacillus spp.         1
S. mutans                  76
Chryseobacterium gleum     1
S. aureus                  8
S. epidermidis             1
S. caprae                  4

Table 1. Summary of the top hits of the reads from each isolate sequenced.

We used Newbler to assemble the sequence reads from each genome. For S. mutans we used mapping assembly against the S. mutans UA159 sequence, and for S. sobrinus we performed de novo assembly because no reference genome sequence is available. Overall the sequencing of isolates was successful, with one exception. The remaining 75 S. mutans isolates assembled with an average coverage of 91% with respect to the reference genome. Given what is known about strain-specific gene content in S. mutans, one expects 90% coverage to be equivalent to complete coverage, since ~10% of UA159’s genome sequence is not likely to be shared with these isolates. The average number of contigs per isolate is 215, with an average length of 10,842 bp. Based on this outcome it is highly likely that we will identify sequence reads from essentially all strain-specific genes for each isolate. The extent to which full-length gene sequences have been generated, and the extent to which those sequences retain genomic context, are part of our current efforts.

Ongoing Efforts. We are currently identifying strain-specific sequences from each isolate to determine the extent to which these sequences are shared among newly characterized isolates and their association with either caries-free or caries-active subjects. We will also identify the sets of core gene sequences that appear to be present in all S. mutans and all S. sobrinus genomes, respectively. Ultimately we have demonstrated the use of high-throughput sequencing technology as a means of characterizing oral pathogens of interest. Suggested applications for this type of research effort include the generation of strain-specific oligonucleotides to be added to existing DNA microarray content to enhance analysis using standard CGH methods. Another powerful use of these data is the application of a variety of selection schemes that reveal the fitness of individual strains among the groups sequenced. The identification of strain-specific sequence signatures allows us to design primer pairs that can be used to measure the abundance and growth characteristics of that strain by qPCR. Potentially more interesting is the measurement of strains’ growth characteristics in competition with other sequenced strains. We have created mixtures of all of the sequenced S. mutans and S. sobrinus strains as independent pools and also generated a super pool including all sequenced strains. We have subjected these pools to a number of selective growth conditions including oxidative stress, low pH and growth on a variety of sugar substrates. In each case we envision that the generation of gene expression data and/or qPCR data detailing the abundance of each strain before and after selection will reveal individual strains that display high and low resistance to low pH, oxidative stress, etc.
This experimental procedure is analogous to phenotypic screens involving pools of single gene KO strains that have been uniquely barcoded to allow highly parallel analysis using DNA microarrays as popularized by the S. cerevisiae community. The variation performed here is to make use of the strain-specific gene sequences as a surrogate for the molecular barcode. Each strain will have at least one and probably hundreds of unique sequence identifiers that may be exploited for this purpose.
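
The pooled-selection readout described above can be illustrated with a minimal Python sketch. It assumes that a per-strain signal (for example qPCR copies of a strain-specific sequence) has been measured before and after selection; the strain names, the values and the pseudocount are hypothetical and are not taken from our data.

```python
import math

def relative_abundance(counts):
    """Convert raw per-strain signal (e.g. qPCR copies) to fractions of the pool."""
    total = sum(counts.values())
    return {strain: value / total for strain, value in counts.items()}

def log2_fold_change(before, after, pseudocount=1e-6):
    """log2(after / before) of relative abundance for each strain.

    Positive values suggest enrichment under selection, negative values depletion.
    The pseudocount is an assumption that avoids division by zero for lost strains.
    """
    rel_before = relative_abundance(before)
    rel_after = relative_abundance(after)
    return {
        strain: math.log2((rel_after.get(strain, 0.0) + pseudocount)
                          / (rel_before[strain] + pseudocount))
        for strain in rel_before
    }

# Hypothetical signal for three pooled strains before and after growth at low pH.
before = {"strain_A": 1000, "strain_B": 1000, "strain_C": 1000}
after = {"strain_A": 3000, "strain_B": 900, "strain_C": 100}

scores = log2_fold_change(before, after)
# Strains ordered from most enriched to most depleted under selection.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Ranking on relative rather than absolute abundance means a strain is scored against the rest of the pool, which is the point of a competition experiment.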

It is our hope that this demonstration will provide the dental research community with a blueprint for how genome sequence data can be exploited and become more than a simple GenBank record for reference purposes. The experimental process described above provides a novel way to relate genotypic and phenotypic information on collections of strains derived from healthy and diseased human subjects. The sequence data for all assemblies have been placed in the public domain and we are currently awaiting accession number assignments. If you have ideas for negative selection, let me know. I am happy to share the strains/pools and, funding permitting, primer pair aliquots targeting specific strains in the pools.

The projects described above were supported by NIAID via a contract to JCVI under the Pathogen Functional Genomics Resource Center (N01-AI15447) and funds from NIDCR to PFGRC in an attempt to enable the HMP research community to exploit genomic and metagenomic methods. The work pertaining to the oral cavity was done in collaboration with Dr. Walter Bretz at NYU and the efforts pertaining to the gut microbiome were done in collaboration with Dr. Cynthia Sears at JHU.

Cataloguing the Gene Expression Patterns of Dental Plaque Biofilms: A Reference Dental Plaque Transcriptome

The RNA-Seq method has been widely adopted as an alternative to the use of DNA microarrays. In most contexts, the RNA-Seq method is implemented when a single reference organism is being studied. Our project endeavored to establish working methods to enable the generation of cDNA libraries that were depleted of contaminating human mRNA and host/microbiome rRNA sequences that would otherwise represent over 95% of the total sequence reads obtained. We have also made significant efforts to define bioinformatics procedures that allow RNA-Seq data to be assigned to appropriate species such that global gene expression analyses can be routinely conducted by the dental research community and those involved in HMP research objectives.

We have established a catalogue of expressed genes in dental plaque by turning to the Solexa sequencing platform and applying RNA-Seq to a collection of 19 twin pairs that are either concordant for dental health (caries-free concordant twin pairs), concordant for dental caries (caries-active concordant twin pairs) or discordant for dental caries (one twin caries-free and the other member of the twin pair caries-active). Based on our analysis of the data we have established that the ten most abundant species in each sample vary significantly from subject to subject. This fact greatly complicates the mapping of reads to reference genomes. Another significant conceptual challenge we faced was how to conduct highly specific mapping of transcripts to genomes of interest. We know that genes in genomes evolve at substantially different rates; some genes may differ by 2-5% across species boundaries whereas others may differ by 25-30%. The consequence is that no single cut-off for mapping a transcript to a reference genome can be reliably employed. We therefore reasoned that by creating an oral cavity reference genome database we could map each transcript according to reasonable specificity criteria but impose a best-hit criterion on the data to ensure minimal mis-mapping.
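
The best-hit idea can be sketched in a few lines of Python. This is an illustration only, not our production pipeline: the 90% identity floor, the rule that ties are left unassigned, and the example hits are all assumptions made for the sketch.

```python
from collections import defaultdict

def assign_best_hits(hits, min_identity=90.0):
    """Assign each read to the single genome giving its highest-identity hit.

    hits: iterable of (read_id, genome, percent_identity) tuples.
    Reads with no hit at or above min_identity, or whose best hit is tied
    between genomes, are left unassigned (an assumption of this sketch).
    """
    by_read = defaultdict(list)
    for read_id, genome, identity in hits:
        if identity >= min_identity:
            by_read[read_id].append((identity, genome))

    assignments = {}
    for read_id, candidates in by_read.items():
        best_identity = max(identity for identity, _ in candidates)
        best_genomes = {genome for identity, genome in candidates
                        if identity == best_identity}
        if len(best_genomes) == 1:
            assignments[read_id] = best_genomes.pop()
    return assignments

# Hypothetical alignment results against a multi-genome reference database.
hits = [
    ("read1", "S. mutans", 99.0),
    ("read1", "S. sobrinus", 92.0),   # passes the floor but is not the best hit
    ("read2", "S. mitis", 97.0),
    ("read2", "S. sanguinis", 97.0),  # tie: ambiguous, left unassigned
    ("read3", "S. gordonii", 85.0),   # below the identity floor
]
assignments = assign_best_hits(hits)
```

The permissive floor admits fast-evolving genes, while the best-hit step keeps a read that matches several related genomes from being counted for the wrong one.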

Based upon the data generated (38 samples × ~32.8 million reads/sample, i.e. ~1 billion reads or over 100 Gb of sequence data), we have fulfilled the goal of establishing a robust procedure for RNA-Seq and a catalogue of the specific transcripts expressed in dental plaque biofilms. These sequences and the associated SOPs developed for effective microbial RNA enrichment have been made available through the DACC (http://www.hmpdacc.org/RSEQ/). In addition, we have devised a strategy for mapping reads to particular functional or biochemical pathways, such as those related to acid/base production, as an independent means of exploiting RNA-Seq data. In this scheme, which species is expressing a function is not considered important; what matters is the sum total of expressed sequences related to acid/base production. The approach is similar to that described above in that a database is created from all sequence data derived from particular biochemical pathways and used to recruit reads of appropriate sequence identity to annotated genes. Over- or under-representation of expressed genes constituting discrete pathways may then be evaluated.
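
A minimal Python sketch of the pathway-level tally follows. It assumes reads have already been mapped to annotated genes; the gene-to-pathway table, species labels and read counts are illustrative placeholders rather than results.

```python
from collections import Counter

def pathway_totals(read_assignments, gene_to_pathway):
    """Sum mapped reads per pathway, ignoring which species each read came from.

    read_assignments: iterable of (species, gene, n_reads) tuples.
    Genes that are absent from gene_to_pathway are skipped.
    """
    totals = Counter()
    for _species, gene, n_reads in read_assignments:
        pathway = gene_to_pathway.get(gene)
        if pathway is not None:
            totals[pathway] += n_reads
    return totals

def pathway_fractions(totals):
    """Normalize pathway totals to fractions of all pathway-assigned reads."""
    grand_total = sum(totals.values())
    return {pathway: n / grand_total for pathway, n in totals.items()}

# Illustrative gene-to-pathway lookup (placeholder, not a curated database).
gene_to_pathway = {
    "ldh": "acid production",
    "arcA": "base production",
    "ureC": "base production",
}

# Hypothetical per-species read counts for one plaque sample.
sample = [
    ("S. mutans", "ldh", 500),
    ("S. sobrinus", "ldh", 300),
    ("S. gordonii", "arcA", 150),
    ("S. salivarius", "ureC", 50),
    ("S. mutans", "recA", 40),  # not in the lookup, so not counted
]

totals = pathway_totals(sample, gene_to_pathway)
fractions = pathway_fractions(totals)
```

Comparing such fractions between caries-free and caries-active samples is one simple way to look for over- or under-representation of a pathway.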

The projects described above were supported by NIAID via a contract to JCVI under the Pathogen Functional Genomics Resource Center (N01-AI15447) and funds from NIDCR to PFGRC in an attempt to enable the HMP research community to exploit genomic and metagenomic methods. The work pertaining to the oral cavity was done in collaboration with Dr. Walter Bretz at NYU and the efforts pertaining to the gut microbiome were done in collaboration with Dr. Cynthia Sears at JHU.

Surrogate Methods for Profiling Species of the Oral and Gut Microbiome

We engaged in an effort focused on alleviating a substantial barrier facing the human microbiome research community. While powerful, the 16S rDNA gene is insufficiently divergent to allow discrimination of many species and of essentially any strains present within communities. The increasing costs of Sanger sequencing have forced most investigators to adopt the Roche 454 sequencing platform to address the question, “who’s there?” The benefits of the 454 sequence data are clear, as investigators enjoy deep data sets with excellent statistical power. A major drawback is that the read length of the 454 platform does not allow the acquisition of a sufficient number of “informative bases” for species-level identification, so the data generally depict the genera present in the microbiome. While there is much to be gained by large-scale analysis of genus-based comparisons, it is highly desirable to have species and even strain-level resolution. Much of the difference between healthy and diseased human microbiomes may lie at the species and strain level, making it important to develop strategies that allow species abundance measurements to be made on large human cohorts in a cost-effective manner. We used capture array technology in an iterative fashion to establish a comprehensive sequence database of seven conserved gene sequences. We performed a proof of concept using two model systems: the oral (dental plaque) microbiome and the fecal microbiome. We designed capture oligonucleotides that tiled each of seven universally conserved gene sequences present in GenBank belonging to genera known to be present in the gut and oral cavity, respectively. We refer to these oligonucleotides as “seed sequences” for use in capturing orthologous sequences present in stool, dental plaque biofilms and saliva.

We next prepared complex mixtures of dental plaque and saliva from several individuals and separately prepared a similar stool mixture representing a diversity of subjects. The DNAs generated from these microbiome samples were used in conjunction with the capture array. We refer to the captured DNAs as “cloud sequences” that represent related sequences (phylogenetic clades) surrounding the original seed sequences. We repeated the capture array process three times such that newly identified sequences, relative to the original seeds, were added to subsequent capture array designs. Our goal is to establish a taxonomic representation of these microbiomes based on detailed DNA sequence data from seven housekeeping genes, reminiscent of long-standing MLST approaches. We are leveraging existing and future reference genome sequences to annotate the sequence data obtained from the capture arrays. Additional species may subsequently be added to this framework by the HMP research community simply by sequencing the relevant loci from defined species available via ATCC, BEI or from the strain collections held by hundreds of investigators worldwide. The power of this approach lies in the provision of DNA sequences that can be used to design qPCR primer pairs capable of highly discriminatory amplification and abundance measurements of species and strains of potential interest.

Despite the fluctuation in the efficiency of capturing orthologs among the seven target genes, we were able to generate a substantial depth of coverage for three genes in the oral cavity (pyrG, pgi and recA) and four genes in the gut (pyrG, dnaG, pgi and recA). We have been analyzing the total gene sequence data obtained from capture arrays, including four 454 runs each for the oral and fecal microbiomes. Because the sequence data represent highly related sequences derived from tens or hundreds of strains belonging to the same species, we were pessimistic that assembly of sequence reads would be fruitful. Our attempt at de novo assembly using Newbler confirmed our concerns and was not successful. We have defined an in silico approach to organize the sequence data that involves generating a microbiome reference genome database populated with relevant genomes derived from the oral cavity and gut. In addition to the original genes collected from GenBank, we added the 7 targeted gene sequences from 134 oral-related genomes and 162 gut-related genomes. By creating this database we will be able to map each gene sequence to a reference genome to enhance the specificity of each assignment. We are mapping the reads from our sequencing data to genomes using high-stringency cut-offs. Reads mapping to reference genomes will be used to generate multiple sequence alignments from which we derive a consensus sequence and identify exploitable polymorphisms for qPCR primer design. For this we will not only rely on the multiple sequence alignments but will also compare alignments for any individual species to others within a major clade (common genera). This will allow us to determine the sequences with the highest probability of being unique to the species of interest.
Preliminary assessment of the DNA sequence data has shown promising outcomes, as we are able to recapitulate phylogenetic clades such as the viridans group of Streptococci using gene sequences derived from recA. This supports the idea that genes from species known to be present in the oral cavity were effectively captured. The clade or sub-clade primer design will be based on all the sequences reliably mapped to genomes.
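
To make the consensus and polymorphism step concrete, here is a small Python sketch. It assumes the reads for one gene have already been aligned to equal length; the majority-rule consensus and the toy sequences are assumptions for illustration and do not come from our recA data.

```python
from collections import Counter

def consensus_and_variable_sites(aligned_seqs):
    """Return the majority-rule consensus and the 0-based columns that vary.

    aligned_seqs: equal-length strings from a multiple sequence alignment
    ('-' may be used for gaps and is treated like any other character).
    """
    if len({len(seq) for seq in aligned_seqs}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    variable_sites = []
    for column_index, column in enumerate(zip(*aligned_seqs)):
        counts = Counter(column)
        # Most frequent character in this alignment column.
        consensus.append(counts.most_common(1)[0][0])
        if len(counts) > 1:
            variable_sites.append(column_index)
    return "".join(consensus), variable_sites

# Toy alignment of four reads mapped to the same reference gene.
alignment = [
    "ATGGCTAAAG",
    "ATGGCTAAAG",
    "ATGGCAAAAG",
    "ATGACTAAAG",
]
consensus, variable_sites = consensus_and_variable_sites(alignment)
```

Columns that vary within a species are poor primer positions for that species, whereas columns that are fixed within a species but differ from its clade neighbours are candidates for discriminatory primers.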

It is our goal to design useful primer pairs that provide species-level resolution. This will be achievable in many cases but not all. We are seeking funds to create a repository of primer pairs to share with the HMP community. It should be noted that initially none of the primer designs will be experimentally validated, and users will therefore need to carefully evaluate their usage in the context of their experimental goals. We plan to continue this project and to conduct validations to the extent that funding permits. Validation results will be added to the primer designs as they are validated or deemed unsuitable for experimental use.

The projects described above were supported by NIAID via a contract to JCVI under the Pathogen Functional Genomics Resource Center (N01-AI15447) and funds from NIDCR to PFGRC in an attempt to enable the HMP research community to exploit genomic and metagenomic methods. The work pertaining to the oral cavity was done in collaboration with Dr. Walter Bretz at NYU and the efforts pertaining to the gut microbiome were done in collaboration with Dr. Cynthia Sears at JHU.

DNA microarrays vs RNAseq — The winner and new heavyweight champion is?… It’s a draw.

In the past year or so there have been several articles stating that the death of microarray technology is drawing near. These proclamations are due to the more recently introduced methodology referred to as RNAseq. At first glance I wrote these claims off as silly and premature. Over time, though, I am starting to appreciate that while the claim is still clearly wrong, the issue isn’t about technology displacement at all. My group works on a wide variety of gene expression problems ranging from simple in vitro microbial gene expression studies to problems involving metagenomic samples of enormous complexity (http://pfgrc.jcvi.org). In my experience, the decision of whether to use DNA microarrays or RNAseq is straightforward and unambiguous. In reality the two technologies couldn’t be more complementary. Taking the simple in vitro gene expression study as an example, the low cost, short turn-around time, exceptional quantitative accuracy and ease of data generation all make the glass slide microarray the clear choice.

About three years ago our laboratory began thinking about how to examine gene expression of pathogenic bacteria in the context of host infection. The challenge here is related to assay sensitivity since any RNA preparation derived from such an infection will yield host RNAs in an abundance 100 to 1000 times greater than that obtained from the infectious agent. Labeled RNAs from such an experiment would yield little useful information about the bacterial gene expression using standard DNA microarray procedures. This represents a clear case for RNAseq. The bewildering number of sequence reads we have come to enjoy from NextGen sequencing platforms is only going to get better. The extra bonus of applying RNAseq is that both the host and infectious agent can be profiled at the same time. There are still many technical problems to work out for routine use of RNAseq, such as effective rRNA removal and the development of appropriate data analysis tools, but the effort required seems quite justifiable.

I can think of only one application that is beginning to gain momentum where an investigator may truly ponder which strategy makes the most sense to apply. The approach is one that mimics EST sequencing as a means of defining genes and gene limits. Our ability to properly identify coding DNA sequences (CDS) in genomes ranges from very good to relatively poor, depending on the genome in question. Members of the parasite research community, to name one, have struggled with this problem often. Generally speaking, substantial over-calling of genes occurs, making it difficult for scientists to begin down the path of functional characterization of their favorite genome. We have worked with such groups recently to provide an independent means of substantiating gene calls via evidence of RNA expression. The design of such studies involves generating RNAs from a wide variety of experimental conditions to increase the number of evidence-based gene calls. DNA microarrays designed as low or high density tiling arrays can be acquired at a reasonable cost and with good experimental outcomes. The case for applying RNAseq rests on the increased ability to detect transcripts that are expressed at low levels and defy routine detection using DNA microarrays.

In summary, I find very few instances where one might reasonably stop to wonder which technology is best suited for the biological/technical problem at hand. When sensitivity isn’t limiting, use DNA microarrays. When sensitivity is everything, look toward the short read sequencing technologies. In the end it turns out that it wasn’t really a contest at all. We should all feel fortunate that each strategy has its appropriate time and place for use. Those researchers who, like myself, have invested much time and effort working with DNA microarrays have nothing to fear; we just have more options now. This is a good thing, to say the least. Most of our gene expression work is supported through a contract from NIAID to the PFGRC under contract N01-AI15447.

Scott Peterson http://www.jcvi.org/cms/about/bios/speterson/

Professor, JCVI

Scientific Director, PFGRC at JCVI