Posts in category Genomic Medicine


Thule, Greenland Year Two

Sequence data from the previous year allowed us to determine the overall microbial population at each site, and this year we decided to focus on the Rich Lake site, which seems to have representation of nearly all microbes found at the other sites. Luckily for us, we only had to work on one site this year rather than six. This in itself had me excited to go back to Thule. After a five-hour flight on a military plane from BWI, I finally arrived at Thule, Greenland, where we were greeted at the hangar by the Colonel as well as other high-ranking military officials. Once I cleared the customs processing area, I arrived at the dorm where the other scientists were living. It was a little different from last year’s accommodations, but nevertheless the luxuries of Wi-Fi, Internet and cable TV were all available. As anxious as I was to get to the field and see the changes at the Rich Lake site, we were given some interesting news. That day was not a good day to travel to the site because a mother polar bear and her two cubs had been spotted nearby not long before by military police. However, we managed to get other work done by preparing the schedule for the sampling, cultivation and other lab work.


The next few days consisted of preparing culture media, cultivation traps and diffusion chambers, and going out into the field (polar bear spray in hand; yes it’s a real thing!). We were extra careful in the field since there was quite a bit of fog in the area that did not seem to go anywhere and fog happens to be the same color as polar bears. The fog did however make it a bit easier to sleep since most of the sunlight was covered and when there’s 24 hours of daylight from mid-April until September, a little fog can still serve a purpose.

Rich Lake Site




Scientist Spotlight: Meet Sarah Highlander

Sarah Highlander, Ph.D., is an esteemed scientist and professor who joined JCVI in La Jolla this year. She comes from a long line of academically successful professors, including a great uncle who was a university dean. As a young child, Sarah was influenced by her parents: her mother was a musician and her father was a Ph.D. chemical engineer. Sarah too was a musician, and she still enjoys jazz and the opera. But it was her father’s scientific career that influenced her own decision to pursue scientific research as her career.

Dr. Sarah Highlander


As a chemical engineer and early IT specialist, her father shared his interests with her at the kitchen table by doing mathematical puzzles and simple experiments. They explored the impact light had on grass growth by placing plants in the closet. Then in high school, she had the opportunity to work on a microbiology project with the help of her father. Using agar slants from his colleague’s lab, she looked for antimicrobial features of bacteria in the soil. Even with these opportunities, her focus in the sciences wasn’t fully set until, after completing her bachelor’s degree, she began working as a technician in a fermentation research lab, where she had the opportunity to work with plasmids. At that time, plasmids and restriction enzymes were not readily available and researchers had to isolate them in their own labs. She was extremely successful as a technician and even published several papers and secured several patents.

This experience launched Highlander into medical microbiology. She went to the Sackler Institute of Biomedical Sciences at the New York University School of Medicine, where she earned her Ph.D. in 1985. With her curious nature and the burgeoning field of biotechnology, she began to research the replication of DNA plasmids in Staphylococcus. She asked basic but as yet unanswered questions such as, “How are these molecules controlled in the cell?” and “How can they best be manipulated in the laboratory?” Her thesis involved characterizing small RNA molecules that control plasmid copy number.

During her postdoctoral fellowship, she shifted her focus to infectious diseases and worked on vaccine development for a cattle disease called “shipping fever” at the University of Texas Health Science Center. Shipping fever is the most common and costly problem affecting calves. It accounts for major economic losses to the cattle producer by reducing average daily weight gain, impairing feed efficiency, and diminishing the overall performance and health of beef calves. Vaccination is key to reducing the disease, and Highlander’s research culminated in the development of a subunit vaccine that is still in use.

After her fellowship, she began her professorship at Baylor College of Medicine (BCM), where she continued her research into shipping fever. The primary bacterial agent in this disease is Mannheimia haemolytica, which is in the same family as the human respiratory pathogen Haemophilus influenzae. JCVI scientists were the first to sequence and publish the H. influenzae genome in 1995. Dr. Highlander’s group performed extensive characterization of the M. haemolytica leukotoxin and developed numerous genetic tools for manipulation and tagging of the organism. She holds patents for subunit and live-attenuated vaccines to prevent shipping fever.

In 2002, Highlander founded Prokaryon Technologies, a for-profit company focused on animal health to prevent and control diseases associated with food animals. One of Prokaryon’s lead products was a genomics-derived vaccine to prevent shipping fever in cattle.

While leading and growing her company, Highlander stayed committed to her academic research interests and joined the Human Genome Sequencing Center at Baylor. At BCM, she participated in genome sequencing of several pathogens (including M. haemolytica) and she moved to focus more on human pathogens. From 2006 to 2013, Highlander was a principal investigator for the Human Microbiome Project (HMP), a National Institutes of Health-funded program in which JCVI researchers were also key leaders.

In addition to her research, Highlander was involved in graduate and medical education at BCM. She was the co-director of the departmental graduate program for 15 years and directed and taught courses focused on bacterial physiology and molecular laboratory methods. Preparation for lectures and interactions with students helped her stay on top of new techniques and research, which in turn helped her further her own research. Sarah had the opportunity to mentor many graduate students both formally and informally.

At JCVI, Highlander is continuing her work on the microbes that live in and on the human body. Specifically, she and her team are looking at the complex microbial communities that live in the human gut. While many microbes are associated with disease, most in the human body are associated with health. Highlander and her team are working to develop specific healthy bacterial mixtures that could be used to treat conditions such as recurrent Clostridium difficile diarrhea, inflammatory bowel disease and others. She is also using bioinformatics tools to look for new causes of diarrhea. “I am delighted to be a part of the collaborative environment here at JCVI and to be surrounded by colleagues who share common interests in bacterial genetics, genomics, microbial physiology and pathogenesis. The microbiome group at JCVI is strong and I hope to be able to make significant contributions to ongoing and future projects here.”

Even in her personal life, Sarah researches, through her hobby of tracing her genealogy. She has been able to find family roots dating back to the 1500s. This detective work is challenging but it keeps her mind sharp and detail-oriented. She points out that learning family naming patterns can be as critical to genealogy research as algorithm development is to genomic research.

Never having lost that early scientific curiosity and excitement of discovery that her father instilled in her as a young girl, Sarah loves working in the laboratory at JCVI and asking questions. Her analytical and inquisitive nature is one of her greatest professional strengths. She is fascinated by the complexity of the microbial ecosystem in our bodies and the impact these microbes have on our health. As she says, “Microbes are going to continue to win through evolution. We need to figure out the next step to keep ahead!” Let’s hope Highlander and her team can win this battle.

Professional Development Opportunities this Summer

This summer we are offering two professional development workshops: GenomeSolver and Bioinformatics: Unlocking Life through Computation. Both explore bioinformatics, microbial diversity and their implementation in undergraduate or high school classrooms.

The GenomeSolver workshop trains faculty on genome analysis. Workshop attendees will learn about general methodologies, standards, and processes used to annotate and analyze microbial genomes. The workshop contents will be available to aid the faculty in developing teaching modules. In addition, extensive documentation on methodologies and tools will be available via the online environment created for this project. An online web portal, Genome Solver, will be a virtual space for developing and sustaining the community. Genome Solver will assist faculty with technical issues and curricular design, and will provide an online environment for the ongoing sharing of information, including publication of student work.

Bioinformatics: Unlocking Life through Computation is a new opportunity for high school teachers. Genomics and biotechnology are valuable tools in our quest to understand life and nature. However, introducing the science classroom to the computational and mathematical underpinnings of biology can be challenging. The goal of this workshop is to introduce a curriculum for mathematics and science education in the area of genomics (with a focus on the fascinating world of microbes). Educators will be introduced to the various analysis and computational challenges that arise in this discipline. Workflow examples illustrating comparative genomic analysis will be made available through the JCVI Metagenomics Report (METAREP) software infrastructure. The eventual aim is for the educational material to be integrated with local high school curricula requirements to expose students to both hypothesis-driven and discovery-based science.

JCVI Hosts South African Scientists to Share Microbiome Research Techniques

Two scientists from the University of Cape Town, South Africa have joined Dr. Bill Nierman’s lab for the next month as part of NIH’s Human Heredity and Health in Africa (H3Africa) Initiative, a training program designed to build out technical biological skills in the African research community. This training relates specifically to developing techniques around the area of microbiome analysis, a relatively new field in the biological sciences.

Microbiome analysis for the collaborative study is looking at the entire community of microorganisms in the respiratory tract of South African infants to better understand how the microbiome is associated with infant pneumonia and wheezing episodes. The expectation is that the organisms that reside in the infant respiratory tract will provide either protection from or a predisposition to pneumonia or wheezing episodes.


The Nierman Group

The Nierman group (left to right: Sarah Lucas, Bill Nierman, Shantelle Claassen, Mamadou Kaba and Stephanie Mounaud; not pictured: Jyoti Shankar and Lilliana Losada) welcomes visiting scientists Ms. Claassen and Dr. Kaba from the University of Cape Town for a month-long training in microbiome sequencing and analysis.

Mamadou Kaba, MD, PhD, and colleague Shantelle Claassen from the University of Cape Town will be working closely under the guidance of JCVI’s Stephanie Mounaud, who is functioning as the project manager and coordinating the laboratory components of a similar project at JCVI studying the microbiomes of infants in the Philippines and also in South Africa. These studies are sponsored by the Bill and Melinda Gates Foundation. The training will focus initially on preparing samples for DNA sequencing on a modern DNA sequencing platform, the Illumina MiSeq instrument. Once the sequence reads are off the sequencer, the instructional focus will shift to analysis of the reads by means of an informatics pipeline that develops phylogenies, or family trees, of the microbes obtained from the infant respiratory tract so that the abundance and relatedness of the microbes can be established. The bioinformatics training will be provided by Jyoti Shankar, the statistical analyst working on the Gates Foundation project.

Mamadou Kaba is a Wellcome Trust Fellow working in the Division of Medical Microbiology, Faculty of Health Sciences, University of Cape Town. Mamadou’s research interests include the molecular epidemiology of infectious diseases and the study of the human microbiome in health and disease. He has contributed to establishing a new research group conducting studies on how the composition of the upper respiratory tract, gastrointestinal, and house dust microbial communities influences the development of respiratory diseases.

Prior to joining the University of Cape Town, Mamadou worked as Research Associate at the Laboratory of Medical Microbiology, Timone University Hospital, Marseille, France, where he studied the epidemiological characteristics of infection with hepatitis E virus in South-eastern France.

Shantelle Claassen is pursuing a Master’s degree in the Division of Medical Microbiology at the University of Cape Town. She has completed a BSc (Med) Honours degree in Infectious Diseases and Immunology at the University of Cape Town, during which she examined the relative efficacy of five commercial DNA extraction kits for extracting bacterial genomic DNA from human faecal samples. The kits were evaluated based on their ability to efficiently lyse bacterial cells, cause minimal DNA shearing, produce reproducible results and ensure broad-range representation of bacterial diversity.

Mamadou and Shantelle are currently involved in an additional prospective, longitudinal study of which the primary objective is to investigate the association between fecal bacterial communities and recurrent wheezing during the first two years of life.

The 2014 Summer Internship Application is Open and Announcing the Genomics Scholar Program

The 2014 Summer Internship Application is now open.   Last summer, we hosted 49 interns from a pool of 424 applicants. They presented their research in the First Annual Summer Internship Poster Sessions held in San Diego and Rockville. The posters were judged by a team of volunteer JCVI scientists and the poster sessions were open to all employees, interns and their guests to share what great work they all participated in this summer.



2013 Intern Poster Session


We are also excited to announce the new Genomics Scholar Program, which begins this summer and is also accepting applications. The Genomics Scholar Program (GSP) is a targeted research experience program for community college students in Rockville. Our program incorporates multiple avenues of support for students through the research experience, with the Principal Investigators as mentors and supplemental professional development provided by JCVI. Additionally, selected students will have the opportunity to participate in undergraduate research conferences.

The GSP is supported by the National Institute of Diabetes and Digestive and Kidney Diseases of the National Institutes of Health under award number R25DK098111.

Thule, Greenland – Day One

Arrived at Thule, Greenland after a 5 hr flight from Copenhagen. It was pretty interesting seeing a long line of people all getting on a flight headed to a part of the world that usually has fewer than 600 people there at any given time. Arrival was pretty straightforward: no jetway, no customs, no LCD screens telling you where to pick up your bag. Just a few military personnel checking your documents to ensure that you have approval from the Danish government and USAF to be on base. First impression getting off the plane…it’s cold. Not as cold as I expected it to be, but it was 90 degrees F when I left home a few days ago. Today’s high was 39 degrees F. Standing in the sun it’s not so bad, but when the wind starts blowing it turns into a recipe for chapped lips and windburn. Oh, and did I mention the massive mosquitos here? There is not much wildlife in this part of the world, but the mosquitos probably outnumber the vertebrates a million to one. They are also VERY aggressive; they even swarmed the trucks while we were driving around the base. We were shown our living quarters, which were very nice and reminded me of living in the dorms during undergrad. There are individual rooms and a shared bathroom on each floor. We toured the various sites that our collaborator Slava Epstein had already pointed out as good sampling sites that vary in vegetation and proximity to water. The land here is quite desolate, without much green; mostly moss and small shrubs grow here. Traditional trees are nonexistent but “ground trees” are actually common. They are trees that grow outward on the grass and not upward. The rest resembles pictures taken by the Mars rover. As the day went by I noticed the sun was circling, and I came to the realization that the typical Arctic summer was happening right in front of me. The sun literally circles and will not go down until around September. It was quite odd getting in bed at midnight and seeing the sun still in the sky.
Tomorrow will be more interesting since we will be going farther from base to sample additional areas.



Thule, Greenland – Day Three

Day three started with me missing breakfast. It seems that folks around here only eat breakfast between 5am and 8am. Today was a very rough day for sampling: about an hour’s drive to the area near the site, about a three-mile hike to one spot, another half-mile hike to another spot, and then the three-and-a-half-mile hike back to the truck. We sampled “rich” soil and “rich” soil from a lake. These two sites were categorized as “rich” due to the abundance of vegetation around and near them. The area surrounding Thule is very desolate, so I can imagine the plants have a hard enough time growing. It would be very interesting to see what microbes are present in these two sites to allow such vegetation to grow; even more interesting to see how water affects the microbial population. Samples were frozen once we got back to the on-site lab. A small portion was saturated with AllProtect to ensure preservation of RNA for transcriptomics analysis.




The day ended with a lecture from the recipient of another NSF grant, this one to install a telescope on the Greenlandic ice cap. It was an interesting idea: coordinating radio imaging with other telescopes around the world to look at quantum singularities that are very far away. After speaking to some of the other scientists here, I found out that the members of our group, which includes me and our collaborators Slava Epstein and Dawoon Jung, were the ONLY microbiologists on the base. Everyone else was a geologist, environmental scientist, astronomer, or meteorologist. It was great to hear about everyone else’s projects.

Evaluating Strain-level Variation of Key Acidogenic Species in Dental Plaque Biofilms

The characterization of the dental plaque microbiome, using traditional 16S rDNA profiling strategies, illustrates both the strengths and the limitations of this method. The central limitation of the 16S rDNA methodology is the inability to decipher strain-level variation within a microbiome. Why is this important? It is becoming a common theme in microbiome research that microbiomes associated with the human host are distinct from those that inhabit the environment. The species present in distinct human microbiomes represent only a small number of taxa. Within these taxa are relatively few genera that have massive representation of member species. This structure has been referred to as the deep fan structure. When comparing microbiomes representing healthy and diseased subjects, important strain-level variations may commonly exist, and in many instances these are potentially causally related to the health of the human host. The dental plaque microbiome illustrates this point strongly. Oral microbiologists have isolated strains from species including S. mitis, S. sanguinis, S. mutans, S. gordonii and others that differ dramatically in their acid production and acid tolerance characteristics. The genes encoding these activities are not part of the core genome, but reflect functions encoded in the strain-variable portion of the genome (~10-30% of the genome’s coding capacity). Important aspects of human disease etiology may be missed if we fail to address this possibility.

Summary of Progress: Dental plaque samples from human subjects with and without dental caries were used to isolate S. mutans and S. sobrinus colonies using enrichment culturing procedures. Most colonies were subjected to 2-3 rounds of replating to obtain pure colonies. The individual clones were then grown in liquid media to isolate genomic DNA for fingerprinting of strains based on RFLP analysis. This allowed us to collapse strains that appeared identical or highly similar into a set of strains that appeared to be of maximal diversity, encoding the largest number of unique gene sequences. We further characterized the individual strains using primer pairs that are specific for either S. mutans or S. sobrinus. Several of the isolates were negative by PCR; these corresponded to isolates with unusual RFLP patterns and so were excluded from further analysis. Some isolates tested positive for only one of the two primer pairs used for screening and were marked as such but retained for further analysis using genome sequencing. The isolates obtained were multiplexed into two lanes of the Illumina (Solexa) Genome Analyzer IIx at a theoretical depth of coverage of 50X. Previous evidence based on comparative analyses indicates that strain-specific regions of the S. mutans genome are not randomly distributed but rather are present at discrete locations. The breadth of these regions is not fully characterized, but our understanding of it will be greatly enhanced by our analyses. To date no reference genome sequence is available for S. sobrinus, a potentially important contributor to dental caries.
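As a rough illustration of what a theoretical depth of coverage such as 50X means, expected depth is commonly estimated as (number of reads × read length) / genome size. The sketch below applies that relationship. The genome size, read length and function names are placeholder assumptions chosen for illustration; they are not values from our runs.

```python
import math

def expected_depth(num_reads, read_length_bp, genome_size_bp):
    """Expected average depth: total sequenced bases divided by genome size."""
    return num_reads * read_length_bp / genome_size_bp

def reads_for_depth(target_depth, read_length_bp, genome_size_bp):
    """Smallest whole number of reads that reaches the target depth."""
    return math.ceil(target_depth * genome_size_bp / read_length_bp)

# Assumed values for illustration only (not taken from the sequencing runs).
GENOME_SIZE_BP = 2_000_000   # assumed genome size of about 2 Mb
READ_LENGTH_BP = 100         # assumed read length

reads_needed = reads_for_depth(50, READ_LENGTH_BP, GENOME_SIZE_BP)
achieved_depth = expected_depth(reads_needed, READ_LENGTH_BP, GENOME_SIZE_BP)
print(reads_needed, achieved_depth)
```

Under these assumed numbers, each isolate would need on the order of a million reads; with shorter reads or a larger genome the requirement rises proportionally.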

Each genome to be sequenced was uniquely barcoded using the EpiBio Nextera DNA sample prep kit, and sequencing was performed using an Illumina Genome Analyzer IIx. The sequenced reads were then searched against the GenBank non-redundant nucleotide database for quality assessment and to determine the top hit of each genome. As shown in Table 1, 76 isolates generated best hits to S. mutans and 47 to S. sobrinus genomes. It is somewhat puzzling how the 17 isolates that do not appear to be either S. mutans or S. sobrinus were cultivated on the media used. We believe these colonies were impure and consisted predominantly of the organism whose genome was sequenced.

Top BLAST hit genome        # of isolates
S. sobrinus                 47
S. parasanguinis            1
E. faecalis                 1
Lactobacillus spp.          1
S. mutans                   76
Chryseobacterium gleum      1
S. aureus                   8
S. epidermidis              1
S. caprae                   4

Table 1. Summary of the top hits of the reads from each isolate sequenced.
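The counts in Table 1 can be cross-checked with a short tally. This is only an illustrative sketch that re-enters the numbers from the table; the variable names are our own.

```python
# Top BLAST hit per isolate, copied from Table 1.
top_hits = {
    "S. sobrinus": 47,
    "S. parasanguinis": 1,
    "E. faecalis": 1,
    "Lactobacillus spp.": 1,
    "S. mutans": 76,
    "Chryseobacterium gleum": 1,
    "S. aureus": 8,
    "S. epidermidis": 1,
    "S. caprae": 4,
}

# The two species the enrichment culturing was meant to recover.
targets = {"S. mutans", "S. sobrinus"}

on_target = sum(n for species, n in top_hits.items() if species in targets)
off_target = sum(n for species, n in top_hits.items() if species not in targets)
total = on_target + off_target

print(on_target, off_target, total)
```

The off-target sum reproduces the 17 isolates mentioned in the text, out of 140 isolates in total.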

We used Newbler to assemble the genomic sequence reads for each isolate. For S. mutans we used mapping assembly against the S. mutans UA159 sequence, and we performed de novo assembly for S. sobrinus sequence reads due to the lack of an available reference genome sequence. Overall the sequencing of isolates was successful, with one exception. The remaining 75 isolates assembled with an average coverage of 91% with respect to the reference genome. Given what is known about strain-specific gene content in S. mutans, one expects 90% coverage to be equivalent to complete coverage, since ~10% of UA159’s genome sequence is not likely to be shared with these isolates. The average number of contigs per isolate is 215, with an average length of 10,842 bp. Based on this outcome it is highly likely that we will identify sequence reads from essentially all strain-specific genes for each isolate. Determining the extent to which full-length gene sequences have been generated, and the extent to which those sequences display genomic context, is part of our current efforts.

Ongoing Efforts. We are currently identifying strain-specific sequences from each isolate to determine the extent that these sequences might be shared among newly characterized isolates and their association with either caries-free or caries-active subjects. We will also identify the set of core gene sequences that appear to be present in all S. mutans and S. sobrinus genomes respectively. Ultimately we have demonstrated the use of high throughput sequencing technology as a means for characterizing oral pathogens of interest. Suggested applications for this type of research effort include the generation of strain-specific oligonucleotides to be added to existing DNA microarray content to enhance analysis using standard CGH methods. Another powerful use of this data can be obtained via the application of a variety of selection schemes that reveal the fitness of individual strains among the groups sequenced. The identification of strain-specific sequence signatures allows us to design primer pairs that can be used to measure the abundance and growth characteristics of that strain by qPCR. Potentially more interesting is the measurement of strains’ growth characteristics in competition with other sequenced strains. We have created mixtures of all of the sequenced S. mutans and S. sobrinus strains as independent pools and also generated a super pool including all sequenced strains. We have subjected these pools to a number of selective growth conditions including oxidative stress, low pH and growth on a variety of sugar substrates. In each case we envision that the generation of gene expression data and/or qPCR data detailing the abundance of each strain before and after selection will reveal individual strains that display high and low resistance to low pH, oxidative stress etc. 
This experimental procedure is analogous to phenotypic screens involving pools of single gene KO strains that have been uniquely barcoded to allow highly parallel analysis using DNA microarrays as popularized by the S. cerevisiae community. The variation performed here is to make use of the strain-specific gene sequences as a surrogate for the molecular barcode. Each strain will have at least one and probably hundreds of unique sequence identifiers that may be exploited for this purpose.
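To make the before-and-after comparison concrete, the sketch below computes a log2 fold change in relative abundance for each strain in a pool. It assumes that a strain-specific qPCR signal has already been converted into an abundance estimate per strain; the strain names, numbers and pseudocount are hypothetical and chosen only for illustration.

```python
import math

def log2_fold_changes(before, after, pseudocount=1.0):
    """Return log2(relative abundance after / before) for each strain.

    `before` and `after` map strain name -> abundance estimate. A pseudocount
    is added so that strains that drop to zero still give a finite value.
    """
    adj_before = {s: v + pseudocount for s, v in before.items()}
    adj_after = {s: after.get(s, 0.0) + pseudocount for s in before}
    total_before = sum(adj_before.values())
    total_after = sum(adj_after.values())
    return {
        s: math.log2((adj_after[s] / total_after) / (adj_before[s] / total_before))
        for s in before
    }

# Hypothetical pool of three strains, before and after low-pH selection.
before = {"strain_A": 100.0, "strain_B": 100.0, "strain_C": 100.0}
after = {"strain_A": 400.0, "strain_B": 100.0, "strain_C": 0.0}

changes = log2_fold_changes(before, after)
for strain, change in sorted(changes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{strain}\t{change:+.2f}")
```

Positive values flag strains that gained share of the pool under selection and negative values flag strains that lost it. Because the values are relative, a strain whose absolute abundance is unchanged (strain_B here) can still show a negative change when a competitor expands.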

It is our hope that this demonstration will provide the dental research community a blueprint for how genome sequence data can be exploited and become more than a simple GenBank record for reference purposes. The experimental process described above provides a novel way to relate genotypic and phenotypic information on collections of strains derived from healthy and diseased human subjects. The sequence data for all assemblies has been placed in the public domain and we are currently awaiting accession number assignments. If you have some ideas for negative selection, let me know, I am happy to share the strains/pools and funding permitting, primer pair aliquots targeting specific strains in the pools.

The projects described above were supported by NIAID via a contract to JCVI under the Pathogen Functional Genomics Resource Center (N01-AI15447) and funds from NIDCR to PFGRC in an attempt to enable the HMP research community to exploit genomic and metagenomic methods. The work pertaining to the oral cavity was done in collaboration with Dr. Walter Bretz at NYU and the efforts pertaining to the gut microbiome were done in collaboration with Dr. Cynthia Sears at JHU.

Cataloguing the Gene Expression Patterns of Dental Plaque Biofilms: A Reference Dental Plaque Transcriptome

The RNA-Seq method has been widely adopted as an alternative to the use of DNA microarrays. In most contexts, the RNA-Seq method is implemented when a single reference organism is being studied. Our project endeavored to establish working methods to enable the generation of cDNA libraries that were depleted of contaminating human mRNA and host/microbiome rRNA sequences that would otherwise represent over 95% of the total sequence reads obtained. We have also made significant efforts to define bioinformatics procedures that allow RNA-Seq data to be assigned to appropriate species such that global gene expression analyses can be routinely conducted by the dental research community and those involved in HMP research objectives.

We have established a catalogue of expressed genes in dental plaque by turning to the Solexa sequencing platform and applying RNA-Seq to a collection of 19 twin pairs that are either concordant for dental health (caries-free concordant twin pairs), concordant for dental caries (caries-active concordant twin pairs) or discordant for dental caries (one twin caries-free and the other member of the twin pair caries-active). Based on our analysis of the data we have established that the ten most abundant species in each sample vary significantly from subject to subject. This fact greatly complicates the mapping of reads to reference genomes. Another significant conceptual challenge we faced was how to conduct highly specific mapping of transcripts to genomes of interest. We know that genes in genomes evolve at substantially different rates; some genes may differ by 2-5% across species boundaries whereas others may differ by 25-30%. The consequence of this is that no single cut-off for mapping a transcript to a reference genome may be reliably employed. We therefore reasoned that by creating an oral cavity reference genome database we could map each transcript according to reasonable specificity criteria but impose a best-hit criterion on the data to ensure minimal mis-mapping.
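The idea of combining a permissive specificity filter with a best-hit rule can be sketched as follows. This is a minimal illustration and not our production pipeline: the `Hit` record, the thresholds and the example alignments are all assumptions made for the sketch.

```python
from collections import namedtuple

# Hypothetical alignment record: one row per read/genome alignment.
Hit = namedtuple("Hit", ["read_id", "genome", "identity", "aligned_fraction"])

def best_hits(hits, min_identity=0.90, min_aligned_fraction=0.80):
    """Keep, for each read, the single best alignment that passes the filters.

    Thresholds are illustrative assumptions. Ties are resolved in favour of
    the hit seen first.
    """
    best = {}
    for hit in hits:
        if hit.identity < min_identity or hit.aligned_fraction < min_aligned_fraction:
            continue  # fails the basic specificity criteria
        current = best.get(hit.read_id)
        if current is None or (hit.identity, hit.aligned_fraction) > (
            current.identity,
            current.aligned_fraction,
        ):
            best[hit.read_id] = hit
    return best

# Made-up example alignments.
hits = [
    Hit("read1", "S. mutans", 0.99, 1.00),
    Hit("read1", "S. sobrinus", 0.93, 1.00),  # passes filters but is not the best hit
    Hit("read2", "S. mitis", 0.85, 0.95),     # fails the identity filter
]

assigned = best_hits(hits)
for read_id, hit in assigned.items():
    print(read_id, hit.genome, hit.identity)
```

Here read1 passes the filter against two genomes but is assigned only to the closer one, while read2 stays unassigned. The filter can therefore be kept loose enough for fast-evolving genes without letting conserved genes map to several species at once.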

Based upon the data generated (38 samples × ~32.8 million reads/sample; over 1 billion reads, or more than 100 Gb of sequence data), we have fulfilled the goal of establishing a robust procedure for RNA-Seq and cataloguing the specific transcripts expressed in dental plaque biofilms. These sequences and the associated SOPs developed for effective microbial RNA enrichment have been made available through the DACC. In addition, we have devised a strategy for mapping reads to particular functional or biochemical pathways, such as those related to acid/base production, as an independent means of exploiting RNA-Seq data. In this scheme, which species is expressing a function is not considered important; what matters is the sum total of expressed sequences related to acid/base production. The approach is similar to that described above in that a database is created from all sequence data derived from particular biochemical pathways and used to recruit reads of appropriate sequence identity that map to annotated genes. Over- or under-representation of expressed genes constituting discrete pathways may then be evaluated.
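A pathway-level tally of this kind can be sketched in a few lines. The gene-to-pathway table and the read assignments below are made-up examples chosen for illustration, not entries from our database.

```python
from collections import Counter

# Hypothetical lookup: (species, gene) -> pathway of interest.
gene_to_pathway = {
    ("S. mutans", "ldh"): "acid production",
    ("S. sobrinus", "ldh"): "acid production",
    ("S. gordonii", "arcA"): "base production",
}

def pathway_counts(read_assignments, gene_to_pathway):
    """Sum mapped reads per pathway, ignoring which species they came from."""
    counts = Counter()
    for species, gene in read_assignments:
        pathway = gene_to_pathway.get((species, gene))
        if pathway is not None:
            counts[pathway] += 1
    return counts

# Made-up read assignments for one sample: one (species, gene) pair per read.
sample_reads = [
    ("S. mutans", "ldh"),
    ("S. sobrinus", "ldh"),
    ("S. mutans", "ldh"),
    ("S. gordonii", "arcA"),
    ("S. mitis", "gyrB"),  # not in a pathway of interest, so it is skipped
]

counts = pathway_counts(sample_reads, gene_to_pathway)
print(dict(counts))
```

Comparing such totals between samples (after normalizing for sequencing depth, which this sketch omits) is one way to look for over- or under-representation of a pathway.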

The projects described above were supported by NIAID via a contract to JCVI under the Pathogen Functional Genomics Resource Center (N01-AI15447) and funds from NIDCR to PFGRC in an attempt to enable the HMP research community to exploit genomic and metagenomic methods. The work pertaining to the oral cavity was done in collaboration with Dr. Walter Bretz at NYU and the efforts pertaining to the gut microbiome were done in collaboration with Dr. Cynthia Sears at JHU.

Surrogate Methods for Profiling Species of the Oral and Gut Microbiome

We engaged in an effort focused on alleviating a substantial barrier facing the human microbiome research community. While powerful, the 16S rDNA gene is insufficiently divergent to allow discrimination of many species and essentially no strains present within communities. The increasing costs of Sanger sequencing have forced most investigators to adopt the Roche 454 sequencing platform to address the question, “who’s there?” The benefits of the 454 sequence data are clear, as investigators enjoy deep data sets with excellent statistical power. A major drawback is that the read length of the 454 platform does not allow the acquisition of a sufficient number of “informative bases” for species-level identification, and the data therefore generally depict only the genera present in the microbiome. While there is much to be gained by large-scale analysis of genus-based comparisons, it is highly desirable to have species and even strain-level resolution. Much of the difference between healthy and diseased human microbiomes may lie at the species and strain level, making it important to develop strategies that allow species abundance measurements to be made on large human cohorts in a cost-effective manner. We used capture array technology in an iterative fashion to establish a comprehensive sequence database of seven conserved gene sequences. We performed a proof of concept using two model systems: the oral (dental plaque) microbiome and the fecal microbiome. We designed capture oligonucleotides that tiled each of seven universally conserved gene sequences present in GenBank belonging to genera known to be present in the gut and oral cavity, respectively. We refer to these oligonucleotides as “seed sequences” for use in capturing orthologous sequences present in stool, dental plaque biofilms and saliva.

We next prepared complex mixtures of dental plaque and saliva from several individuals and, separately, a similar stool mixture representing a diversity of subjects. The DNAs generated from these microbiome samples were used in conjunction with the capture array. We refer to the captured DNAs as “cloud sequences”: related sequences (phylogenetic clades) surrounding the original seed sequences. We repeated the capture array process three times, adding newly identified sequences that differed from the original seeds to each subsequent capture array design. Our goal is to establish a taxonomic representation of these microbiomes based on detailed DNA sequence data from seven housekeeping genes, reminiscent of long-standing MLST approaches. We are leveraging existing and future reference genome sequences to annotate the sequence data obtained from the capture arrays. Additional species may subsequently be added to this framework by the HMP research community simply by sequencing the relevant loci from defined species available via ATCC, BEI or the strain collections held by hundreds of investigators worldwide. The power of this approach lies in the provision of DNA sequences that can be used to design qPCR primer pairs capable of highly discriminatory amplification and abundance measurements of species and strains of potential interest.
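The seed-to-cloud iteration described above can be pictured with a toy in silico analogy. The sketch below is not how array capture works physically (that is hybridization chemistry); it simply treats "capture" as keeping any sequence in a pool that is sufficiently similar to a sequence already on the design. The function names, the identity measure and the 0.8 threshold are all assumptions made for illustration.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def capture(design, pool, min_identity):
    """Toy stand-in for hybridization: keep pool sequences close to any probe."""
    return {s for s in pool
            if any(identity(s, probe) >= min_identity for probe in design)}


def iterate_capture(seeds, pool, rounds=3, min_identity=0.8):
    """Grow the design by adding newly captured sequences after each round."""
    design = set(seeds)
    for _ in range(rounds):
        cloud = capture(design, pool, min_identity)
        novel = cloud - design
        if not novel:
            break  # nothing new was captured, so further rounds add nothing
        design |= novel
    return design


if __name__ == "__main__":
    seeds = {"AAAAAAAAAA"}
    pool = {"AAAAAAAACC", "AAAAAACCCC", "CCCCCCCCCC"}
    print(sorted(iterate_capture(seeds, pool)))
```

In this toy pool the second sequence is too distant from the seed to be captured in round one, but becomes reachable once the first captured sequence has been added to the design, which is the point of iterating.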

Despite fluctuation in the efficiency of capturing orthologs among the seven target genes, we were able to generate a substantial depth of coverage for three genes in the oral cavity (pyrG, pgi and recA) and four genes in the gut (pyrG, dnaG, pgi and recA). We have been analyzing the total gene sequence data obtained from the capture arrays, including four 454 runs each for the oral and fecal microbiomes. Because the sequence data represent highly related sequences derived from tens or hundreds of strains belonging to the same species, we were pessimistic that assembly of sequence reads would be fruitful. Our attempt at de novo assembly using Newbler confirmed our concerns and was not successful. We have instead defined an in silico approach to organize the sequence data, which involves generating a microbiome reference genome database populated with relevant genomes derived from the oral cavity and gut. In addition to the original genes collected from GenBank, we added the seven targeted gene sequences from 134 oral-related genomes and 162 gut-related genomes. This database will allow us to map each gene sequence to a reference genome and so enhance the specificity of each assignment. We are mapping the reads from our sequencing data to genomes using high-stringency cut-offs. Reads that map to reference genomes will be used to generate multiple sequence alignments, from which we will derive a consensus sequence and identify exploitable polymorphisms for qPCR primer design. For this we will rely not only on the multiple sequence alignment for each species but also on comparisons between the alignment for an individual species and those of other species within the same major clade (common genera). This will allow us to determine the sequences with the highest probability of being unique to the species of interest.

Preliminary assessment of the DNA sequence data has shown promising outcomes: we are able to recapitulate phylogenetic clades, such as the viridans group of Streptococci, using gene sequences derived from recA. This supports the idea that genes from species known to be present in the oral cavity were effectively captured. The clade or sub-clade primer design will be based on all the sequences reliably mapped to genomes.
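The consensus and polymorphism steps outlined above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not our analysis pipeline: it assumes the reads have already been mapped and aligned to equal length (with "-" for gaps), uses simple majority rule for the consensus, and uses hypothetical function names. A column is reported as a candidate only if every sequence from the species of interest agrees there and no sequence from the other species in the clade carries the same base.

```python
from collections import Counter


def consensus(aligned):
    """Majority-rule consensus of equal-length aligned sequences ('-' = gap)."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to equal length")
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))


def discriminating_positions(target, others):
    """Columns where the target species is uniform and differs from all others.

    target: aligned sequences mapped to the species of interest.
    others: aligned sequences from other species in the same clade.
    """
    positions = []
    for i, col in enumerate(zip(*target)):
        bases = set(col)
        if len(bases) != 1 or "-" in bases:
            continue  # variable or gapped within the target species
        base = col[0]
        if all(seq[i] != base for seq in others):
            positions.append(i)
    return positions


if __name__ == "__main__":
    # Toy aligned fragments, not real recA data.
    target = ["ACGTAC", "ACGTAC", "ACGAAC"]
    others = ["ACCTAC", "ACCTAT"]
    print(consensus(target))
    print(discriminating_positions(target, others))
```

Positions found this way are only candidates; a primer's 3' end would typically be placed on such a position, and the design would still need experimental validation.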

It is our goal to design useful primer pairs that provide species-level resolution. This will be achievable in many cases but not all. We are seeking funds to create a repository of primer pairs to share with the HMP community. Note that initially none of the primer designs will be experimentally validated, so users will need to evaluate them carefully in the context of their own experimental goals. We plan to continue this project and to conduct validations to the extent that funding permits. The results will be added to the primer designs as they are validated or deemed unsuitable for experimental use.

The projects described above were supported by NIAID via a contract to JCVI under the Pathogen Functional Genomics Resource Center (N01-AI15447) and by funds from NIDCR to PFGRC, in an attempt to enable the HMP research community to exploit genomic and metagenomic methods. The work pertaining to the oral cavity was done in collaboration with Dr. Walter Bretz at NYU, and the efforts pertaining to the gut microbiome were done in collaboration with Dr. Cynthia Sears at JHU.