We undertook an effort to alleviate a substantial barrier facing the human microbiome research community. While powerful, the 16S rDNA gene is insufficiently divergent to discriminate many of the species, and essentially any of the strains, present within communities. The increasing cost of Sanger sequencing has forced most investigators to adopt the Roche 454 sequencing platform to address the question, “who’s there?” The benefits of 454 sequence data are clear: investigators enjoy deep data sets with excellent statistical power. A major drawback is that the read length of the 454 platform does not yield enough “informative bases” for species-level identification, so the data generally depict only the genera present in the microbiome. While there is much to be gained from large-scale genus-based comparisons, it is highly desirable to have species- and even strain-level resolution. Much of the difference between healthy and diseased human microbiomes may lie at the species and strain level, making it important to develop strategies that allow species abundance to be measured in large human cohorts in a cost-effective manner.
We used capture array technology in an iterative fashion to establish a comprehensive sequence database of seven conserved genes. We performed a proof of concept using two model systems: the oral (dental plaque) microbiome and the fecal microbiome. For each system, we designed capture oligonucleotides that tiled each of seven universally conserved gene sequences present in GenBank and belonging to genera known to be present in the oral cavity and gut, respectively. We refer to these oligonucleotides as “seed sequences” for use in capturing orthologous sequences present in stool, dental plaque biofilms and saliva.
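The tiling of capture oligonucleotides across a seed gene can be illustrated with a minimal sketch. The oligonucleotide length (60 nt), the step between adjacent oligonucleotides (20 nt) and the function name tile_oligos are illustrative assumptions; the actual array design parameters are not given in this summary.

```python
def tile_oligos(sequence, oligo_len=60, step=20):
    """Return (start, oligo) pairs that tile `sequence` with overlapping windows.

    oligo_len and step are hypothetical values chosen for illustration only.
    """
    if oligo_len <= 0 or step <= 0:
        raise ValueError("oligo_len and step must be positive")
    seq = sequence.upper()
    # A gene shorter than one oligonucleotide is returned as a single probe.
    if len(seq) <= oligo_len:
        return [(0, seq)]
    starts = list(range(0, len(seq) - oligo_len + 1, step))
    # Add a final window so the 3' end of the gene is always covered.
    last_start = len(seq) - oligo_len
    if starts[-1] != last_start:
        starts.append(last_start)
    return [(s, seq[s:s + oligo_len]) for s in starts]
```

Calling tile_oligos on each seed gene sequence would give the set of overlapping probes for that gene; a smaller step gives denser tiling at the cost of more array features.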
We next prepared a complex mixture of dental plaque and saliva from several individuals and, separately, a similar stool mixture representing a diversity of subjects. The DNAs generated from these microbiome samples were used in conjunction with the capture array. We refer to the captured DNAs as “cloud sequences” that represent related sequences (phylogenetic clades) surrounding the original seed sequences. We repeated the capture array process three times, adding newly identified sequences relative to the original seeds to each subsequent capture array design. Our goal is to establish a taxonomic representation of these microbiomes based on detailed DNA sequence data of seven housekeeping genes, reminiscent of long-standing MLST approaches. We are leveraging existing and future reference genome sequences to annotate the sequence data obtained from the capture arrays. Additional species may subsequently be added to this framework by the HMP research community simply by sequencing the relevant loci from defined species available via ATCC, BEI or the strain collections held by hundreds of investigators worldwide. The power of this approach lies in the provision of DNA sequences that can be used to design qPCR primer pairs capable of highly discriminatory amplification and abundance measurements of species and strains of potential interest.
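The iterative expansion of seed sequences with newly captured “cloud sequences” might be sketched as below. This summary does not state how novelty relative to the seeds was judged; the k-mer Jaccard similarity, the 0.95 cut-off and the names kmers, jaccard and expand_seeds are stand-in assumptions used only to make the loop concrete.

```python
def kmers(seq, k=8):
    """Set of all k-mers in a sequence (empty if the sequence is shorter than k)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two k-mer sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def expand_seeds(seeds, captured, k=8, max_similarity=0.95):
    """Return the seeds plus captured sequences that are not near-duplicates.

    A captured sequence is treated as novel (an assumption for illustration)
    when its k-mer similarity to every sequence already in the design is
    below max_similarity.
    """
    design = list(seeds)
    profiles = [kmers(s, k) for s in design]
    for read in captured:
        profile = kmers(read, k)
        if not profile:
            continue  # too short to compare
        if all(jaccard(profile, p) < max_similarity for p in profiles):
            design.append(read)
            profiles.append(profile)
    return design
```

Three rounds of capture would then correspond to calling expand_seeds three times, each time passing the previous design as the seeds and the newly captured sequences as the candidates.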
Despite fluctuation in the efficiency of capturing orthologs among the seven target genes, we generated a substantial depth of coverage for three genes in the oral cavity (pyrG, pgi and recA) and four genes in the gut (pyrG, dnaG, pgi and recA). We have been analyzing the total gene sequence data obtained from the capture arrays, including four 454 runs each for the oral and fecal microbiomes. Because the sequence data represent highly related sequences derived from tens or hundreds of strains belonging to the same species, we were pessimistic that assembly of sequence reads would be fruitful. Our attempt at de novo assembly using Newbler confirmed our concerns and was not successful. We have defined an in silico approach to organize the sequence data that involves generating a microbiome reference genome database populated with relevant genomes derived from the oral cavity and gut. In addition to the original genes collected from GenBank, we added the seven targeted gene sequences from 134 oral-related genomes and 162 gut-related genomes. By creating this database we will be able to map each gene sequence to a reference genome to enhance the specificity of each assignment. We are mapping the reads from our sequencing data to genomes using high-stringency cut-offs. Reads that map to reference genomes will be used to generate multiple sequence alignments, from which we will derive a consensus sequence and identify exploitable polymorphisms for qPCR primer design. For this we will rely not only on the multiple sequence alignments for each species; we will also compare the alignments for any individual species to those of other species within a major clade (common genera). This will allow us to determine the sequences with the highest probability of being unique to the species of interest.
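The consensus and polymorphism steps described above could be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the project: it assumes reads have already been mapped and aligned into equal-length rows with gaps written as "-", and the function names and the 0.9 majority threshold are hypothetical.

```python
from collections import Counter


def consensus(aligned, min_fraction=0.9):
    """Column-wise consensus of equal-length aligned sequences.

    A column yields 'N' when no base reaches min_fraction of the non-gap
    bases, and '-' when the column contains only gaps.
    """
    if not aligned:
        raise ValueError("no sequences given")
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    out = []
    for column in zip(*aligned):
        bases = [b.upper() for b in column if b != "-"]
        if not bases:
            out.append("-")
            continue
        base, count = Counter(bases).most_common(1)[0]
        out.append(base if count / len(bases) >= min_fraction else "N")
    return "".join(out)


def discriminating_positions(target, others):
    """0-based alignment columns where the target consensus has an unambiguous
    base that differs from an unambiguous base in every other consensus."""
    if not others:
        return []
    if any(len(o) != len(target) for o in others):
        raise ValueError("consensus sequences must have equal length")
    positions = []
    for i, base in enumerate(target):
        if base in "N-":
            continue
        if all(o[i] not in "N-" and o[i] != base for o in others):
            positions.append(i)
    return positions
```

Here the target would be the consensus for the species of interest and the others would be consensus sequences for related species in the same clade; columns returned by discriminating_positions are candidate sites for placing the 3' ends of species-specific qPCR primers. Ambiguous or gapped columns are skipped deliberately, which is a conservative choice.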
Preliminary assessment of the DNA sequence data has shown promising outcomes: we are able to recapitulate phylogenetic clades, such as the viridans group of streptococci, using gene sequences derived from recA. This supports the idea that genes from species known to be present in the oral cavity were effectively captured. The clade or sub-clade primer design will be based on all the sequences reliably mapped to genomes.
It is our goal to design useful primer pairs that provide species-level resolution. This will be achievable in many cases but not all. We are seeking funds to create a repository of primer pairs to share with the HMP community. Initially, none of the primer designs will be experimentally validated, so users will need to evaluate them carefully in the context of their experimental goals. We plan to continue this project and to conduct validations to the extent that funding permits. Validation results will be added to the primer designs as they are validated or deemed unsuitable for experimental use.
The projects described above were supported by NIAID via a contract to JCVI under the Pathogen Functional Genomics Resource Center (N01-AI15447) and by funds from NIDCR to PFGRC, in an attempt to enable the HMP research community to exploit genomic and metagenomic methods. The work pertaining to the oral cavity was done in collaboration with Dr. Walter Bretz at NYU, and the efforts pertaining to the gut microbiome were done in collaboration with Dr. Cynthia Sears at JHU.