Posts tagged metagenomics

Carl Woese 1928-2012

Editor’s Note: This post originally appeared on T. Taxus, December 31, 2012, by Jonathan Badger. Dr. Badger is an Assistant Professor in the Microbial and Environmental Genomics Group at the J. Craig Venter Institute in La Jolla, CA. Reprinted by permission.

As you may have heard, Carl Woese died of pancreatic cancer yesterday at the age of 84. I had the honor of working with Carl in grad school at the University of Illinois where my advisor, Gary Olsen, ran a joint lab with Carl.

Carl Woese. Photo courtesy IGB.

As the originator of the use of ribosomal RNA to distinguish and classify organisms (including obviously the Archaea), Carl both revolutionized evolutionary biology and created a method that is still very much in use today. Even in the latest metagenomic study of the oceans or of the human gut, a 16S rRNA diversity study is required as a control in addition to whatever additional markers or random sequencing is used.

One of the things that fascinated me about Carl is how he constantly reinvented himself and explored new fields of biology. His early work in the 1960s dealt with classical molecular biology and the genetic code (the origins of which continued to fascinate him for the rest of his life). He then turned to the study of the ribosome and its structure, which in turn led to his study of 16S and its evolutionary implications. In the 1990s, when I worked with him, he was a pioneering microbial genomicist and collaborated with TIGR to sequence the first two Archaeal genomes. And in his final years he focused on early evolution and the last common ancestor of life in the light of what genomics has taught us.

Carl also had his humorous and counter-cultural side. I remember him telling me how his lab in the 1960s heard the rumor that compounds in banana peels were a legal narcotic and how they launched an unofficial research project to isolate these. His verdict was that there was nothing there and neither the peels nor anything in them could get you high, but he wanted to test that empirically. Also, when reading about a supposed “Qi master” who claimed to be able to influence mutation rates with his mind, he invited him to the lab to give a demonstration, which naturally failed to show any effect under controlled conditions, but he wanted to see if the guy could really do it.

Genomics, metagenomics, and evolutionary biology have lost one of their greats, but his legacy lives on.

JCVI Supports Human Microbiome Body Site Experts with Shotgun Data Analysis

Members of the Human Microbiome Project (HMP) Consortium, including human microbiome body site experts, gathered for a virtual Jamboree on January 19th. The fully online Jamboree was set up to communicate the initial data products and the tools best suited for analysis, primarily to make the data amenable to analysis in a user-friendly way for body site experts. 61 participants followed the Jamboree agenda, with presenters given access to a common desktop that was shared via the internet using an online collaboration tool. Results from the Data Analysis Working Group (DAWG) were presented in the areas of 16S rRNA gene sequence (16S DAWG) and metagenomic whole-genome shotgun analysis (WGS DAWG). The efforts of the 16S DAWG focus on marker-gene based approaches to estimate biological diversity and how marker variability is associated with patient metadata. The WGS DAWG complements results from the 16S marker based analysis with comprehensive sequencing of random pieces of genomic DNA from the collection of microorganisms that inhabit a particular site on, or in, the human body (the microbiome). These analyses allow researchers to investigate, among other questions, what microorganisms are present, and the nature and extent of their collective metabolism, at a particular body site. Ultimately researchers want to relate this information to healthy versus diseased states in humans.

METAREP tutorial presented as part of the HMP Virtual Jamboree

The current survey comprises more than 700 samples from hundreds of individuals taken from up to 16 distinct body sites. Illumina sequencing has yielded more than 20 billion reads, and the annotation data produced from the sequences exceeds 10 terabytes. In anticipation of such data volumes, in 2010 we developed JCVI Metagenomics Reports (METAREP), an open source tool for high-performance comparative analysis. The tool enables users to slice and dice data using a combination of taxonomic and functional/pathway signatures. To demonstrate how the tool can be used by body site experts, we loaded sample data from 17 oral samples and presented a quick tutorial on how users can view, search, and browse individual samples and compare multiple samples (see video). The functionality was very well received, and body site experts asked JCVI to make all 700+ samples available. As a result of the Jamboree, JCVI, in collaboration with the HMP Data Analysis and Coordination Center and the rest of the HMP consortium, will soon set up a dedicated HMP METAREP instance that will allow body site experts and eventually other users to analyze the DAWG data in a user-friendly way via the web.

The Microbiome of Esophageal Cancer

In anticipation of the International Human Microbiome Congress, our group has worked diligently to generate data to present for our HMP demo project studying the microbiome of patients who have developed esophageal cancer, gastroesophageal reflux disease, and Barrett’s esophagus. We received a large number of samples in December of 2010, which surveyed four body sites (esophagus, fecal, oral and stomach) of twelve patients. Upon isolation of DNA, we amplified a variable region of the 16S gene for each sample using barcoded PCR primers. Incorporating the 454 A and B adaptors into our primers also minimized the loss of sequence data compared with previous methods that ligate the adaptors to amplicons after PCR. This method also allowed us to generate sequence reads that are all in the same 5’-3’ orientation. A large dataset with high quality sequence reads was generated and is currently going through phylogenetic analysis. Metagenomic data is also currently being generated from DNA extracted from esophageal brushings taken from a healthy individual as well as a patient who has developed esophageal cancer. This comparative analysis will help identify key structural and functional elements that contribute to the pathogenesis of a complex disease such as cancer. We are eagerly awaiting results from the analysis of these sequences and expect to present a thorough investigation of the esophagus microbiome.
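To illustrate how barcoded primers let pooled reads be traced back to their samples, here is a minimal Python sketch of demultiplexing. The barcode sequences, sample names, and the `demultiplex` function are hypothetical and are not taken from our pipeline; the sketch assumes every read starts with an exact barcode of fixed length, which the uniform 5’-3’ orientation makes plausible.

```python
def demultiplex(reads, barcode_to_sample):
    """Assign each read to a sample by the barcode at its 5' end.

    Assumes exact matches and barcodes of equal length (illustrative only).
    Returns (reads grouped by sample with the barcode trimmed, unassigned reads).
    """
    barcode_length = len(next(iter(barcode_to_sample)))
    by_sample = {sample: [] for sample in barcode_to_sample.values()}
    unassigned = []
    for read in reads:
        sample = barcode_to_sample.get(read[:barcode_length])
        if sample is None:
            unassigned.append(read)
        else:
            by_sample[sample].append(read[barcode_length:])
    return by_sample, unassigned


# Made-up barcodes and reads, purely for illustration.
barcodes = {"ACGT": "esophagus_patient1", "TGCA": "stomach_patient1"}
reads = ["ACGTGGCCTTAA", "TGCAGGCCTTAG", "NNNNGGCCTTAA"]
grouped, leftover = demultiplex(reads, barcodes)
```

Real 454 data would also need tolerance for sequencing errors in the barcode, which this sketch deliberately leaves out.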

Lucene Revolution Conference 2010

I arrived late in Boston after my plane from Washington DC was delayed. On the agenda for the next four days: the Lucene Revolution conference and a Solr application development workshop organized by Lucid Imagination. The conference promised a unique venue (the first of its kind in the US) to meet developers who all share the same challenge: to enable users to find relevant information in growing bodies of data quickly and intuitively. I was looking forward to hearing many interesting talks given by experts in the field, to learning how to build intuitive search interfaces, and to getting an idea of where things are heading in the next few years. As the developer of JCVI’s Metagenomics Reports (METAREP), I was especially looking forward to the Solr workshop to learn some tricks from the experts for tweaking the search engine behind this open-source metagenomics analysis tool.

The Early Revolution

But before the revolution could happen and I could enjoy some splendid time at the Washington Dulles airport, Doug Cutting had to start developing a Java based full-text search engine called Lucene in 1997. Lucene became an open-source project in 2000 and an Apache Software Foundation project one year later. In 2004, Solr emerged as an internal CNET project created by Yonik Seeley to serve Lucene powered search results to the company’s website. It was donated by CNET to the Apache Software Foundation in 2006.

Google Trend for Solr

Early this year, both projects merged and development since then has been carried out jointly under the umbrella of the Apache Software Foundation. Meanwhile, many companies use Solr/Lucene, among them IBM, LinkedIn, Twitter, and Netflix. How did this happen?

The Lucid Imagination Solr Application Development Workshop

In search of an answer, I made my way from my hotel to the conference venue, the Hyatt Hotel located along the beautiful Boston harbor. The 2-day workshop was a tour de force of Solr features, configuration, and optimization. It also touched on the mathematical theory behind Lucene’s search result scoring and on evaluating result relevance. It covered enough material to warrant a third day. Given this optimistic agenda, there was not much time for the labs (exercises) and the trainer had to focus more on breadth than on depth. As I had been using Solr for a year, many of the general concepts were familiar, so I was more interested in details. A comprehensive handbook and an excellent exercise compilation came to the rescue and provided me with the detail I needed to follow up on subjects that were touched on. There were two parallel Solr classes. In my class, 25 participants followed the training. The mix included developers working for media, defense, and other corporations. Academia was represented by several libraries and universities.

Solr Application Development Workshop

A powerful feature I had not heard of before is the DisMax request handler. The handler hides the complexity of queries: users can enter a simple query without complex syntax or a search field, and behind the scenes the handler does its magic. It searches across a set of specified fields which (among other things) can be weighted by importance. Additional information about this handler and other snippets I collected during the class can be found in my Solr workshop notes.
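As a rough illustration of field weighting with DisMax, the Python sketch below builds a Solr select URL in which a plain user query is searched across two boosted fields. The core URL, the field names, and the helper `build_dismax_url` are assumptions made for this example; `defType`, `q`, `qf`, and `rows` are standard Solr request parameters.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_dismax_url(base_url, user_query, field_weights, rows=10):
    """Build a Solr select URL that uses the DisMax query parser.

    field_weights maps field names to boosts; they become the `qf` parameter,
    e.g. {"title": 2.0, "body": 1.0} -> "title^2.0 body^1.0".
    """
    qf = " ".join(f"{field}^{boost}" for field, boost in field_weights.items())
    params = {
        "defType": "dismax",  # select the DisMax query parser
        "q": user_query,      # plain user input, no field prefixes required
        "qf": qf,             # fields to search, weighted by importance
        "rows": rows,
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core URL and field names.
url = build_dismax_url(
    "http://localhost:8983/solr/annotations",
    "sugar transporter",
    {"description": 2.0, "blast_hit": 1.0},
)
sent = parse_qs(urlparse(url).query)  # decode again to inspect what is sent
```

The user only types "sugar transporter"; which fields are searched and how much each one counts is decided on the server side of the query.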

The Lucene Revolution Conference

After a mediocre coffee brewed in my hotel room, I headed to the conference venue on the second floor of the Hyatt Hotel. The first day of the conference started with a panel discussion about the Cutting Edge of Search that included Michael Busch (Twitter), John Wang (LinkedIn), Joshua Tuberville (eHarmony), and Bill Press. The discussion went back and forth, showcasing each search platform and the experience of developing it. When asked what he would do differently in retrospect, John Wang from LinkedIn joked that he would “ban recruiters”; if I remember correctly, he mentioned that they “spam-up” the system.

Lightning Talk “Using Solr/Lucene for High-Performance Comparative Metagenomics”

Joshua Tuberville from eHarmony provided valuable advice to developers: “Avoid pet queries for benchmarking a system – use a random set of queries instead.” He also suggested tracking queries that web site users enter for optimization, adding “it surprises me every day that the world is not made up from engineers, but it is a fact.” Avoid unnecessary complexity and duplicating efforts. Use open-source if available. For example, instead of implementing their own Lucene wrapper, eHarmony made use of the open-source project Solr. Bill Press added “Do not be afraid to tear things down, rebuild it many times if needed.”

“Companies do not have time to debug code.” Eric Gries (CEO Lucid Imagination)

Eric Gries, CEO of Lucid Imagination, presented ‘The Search Revolution: How Lucene & Solr Are Changing the World’. In the introduction, he pointed out that Solr/Lucene is the 10th largest community project and the 5th largest Apache Software Foundation project. “Open-source projects need a commercial entity behind them to help them grow”. “Companies need no errors, they do not have time to debug.” The main part of his talk focused on his company’s LucidWorks Enterprise software, which is based on the open-source project Solr/Lucene. Features that separate it from the open-source version include smart defaults, additional data sources, a REST API that allows programmatic access via Perl/Python/PHP code, standardized error messages, and click-based relevance boosting. Later, Brian Pinkerton, also from Lucid Imagination, presented additional details. He revealed that their software is based on elements of the upcoming Solr 4.0 version and is fully cloud enabled (with the SolrCloud patch added). It uses ZooKeeper to manage node configuration and failover. All website communication is done in JSON. The enterprise version supports field collapsing for distributed search.

“A picture communicates a thousand words but a video communicates a thousand pictures.” Satish Gannu (Cisco)

Satish Gannu from Cisco stressed the increasing prevalence of video data and how such data is changing the world. More and more video enabled devices are pushed on the market. Collaboration is increasingly done across the world. Meetings are recorded and shared globally. Videos are replacing manuals. Corporate communication/PR via video is increasing. He related the popularity of video to the fact that “A picture communicates a thousand words, but a video communicates a thousand pictures” and that “60% of human communication is non-verbal.” Satish went on to highlight Cisco’s video solutions that make use of automatic voice and face recognition software to store metadata about speakers to enrich the user experience. For example, users can filter out certain speakers when watching recorded meetings. More can be found here.

View of Boston

“Mobile application development will be the driver of open-source innovation.” Bill McQuaide (Black Duck Software)

One of the highlights that morning was Bill McQuaide’s talk on open source trends. Based on diverse sources, including his company Black Duck Software, he showed that software IT spending is down, that 22% of software is open source, and that 40% of software projects use open source. There is an enormous number of new open source projects targeting the cloud, with a lot of competition. Among top open-source licenses are the GNU General Public Licenses, GPL 3.0, and BSD licenses. The three predominant programming languages used by open-source developers are C, C++ and Java. Mobile development will be the driver of innovation in the open-source community, especially developments around Google’s Android operating system. Managing licenses for projects that integrate dozens of open-source projects, such as Android, and shipping the bundled software to customers can become very complex. For this and other reasons, McQuaide recommends that companies and institutions have policies for implementing open source, integrating third party tools, and identifying and cataloging all open source software used.

Distributed Solr/Lucene using Hadoop

An excellent talk was presented by Rod Cope from Open Logic: Real-Time Searching of Big Data with Solr and Hadoop. The search infrastructure centered around Hadoop’s distributed file system, on top of which they cleverly arranged several other technologies. For example, Hadoop’s HBase database provides fast database lookups but does not provide the power of Lucene text searches. Solr/Lucene, however, is not as well optimized for returning stored document information. Their solution is to use Solr/Lucene to search indexed text fields, storing and returning only the document ID. The returned document ID is then used to fetch additional information from the HBase database. Open Logic uses the open-source software katta to integrate Lucene indices with Hadoop and increased fault tolerance by replicating Solr cores across different machines. Also, corresponding master and slave servers were set up to run on different machines for indexing and searching respectively. The setup he described runs completely on commodity hardware, and new machines can be added on the fly to scale out horizontally.
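The division of labor described in the talk can be sketched in a few lines of Python. This is only my illustration of the pattern, not Open Logic’s code: `TextIndex` is a toy stand-in for a Solr/Lucene index that returns nothing but document IDs, and a plain dictionary stands in for HBase as the key-value store holding the full records.

```python
class TextIndex:
    """Toy inverted index: maps tokens to document IDs and stores nothing else."""

    def __init__(self):
        self._postings = {}

    def add(self, doc_id, text):
        for token in set(text.lower().split()):
            self._postings.setdefault(token, set()).add(doc_id)

    def search(self, query):
        """Return the sorted IDs of documents containing every query token."""
        tokens = query.lower().split()
        if not tokens:
            return []
        hits = set.intersection(*(self._postings.get(t, set()) for t in tokens))
        return sorted(hits)


def search_and_fetch(index, store, query):
    """Step 1: the text search yields IDs. Step 2: a key lookup yields records."""
    return [store[doc_id] for doc_id in index.search(query)]


# Hypothetical records; the dictionary plays the role of the HBase table.
store = {
    "doc1": {"title": "Apache license text"},
    "doc2": {"title": "BSD license text"},
}
index = TextIndex()
for doc_id, record in store.items():
    index.add(doc_id, record["title"])

results = search_and_fetch(index, store, "bsd license")
```

Keeping the index free of stored fields keeps it small, while the key-value store does what it is good at: fast lookups by ID.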

“It surprises me every day that the world is not made up from engineers but it is a fact.” Joshua Tuberville (eHarmony)

Next on the agenda were seven-minute lightning talks. I opened the lightning talk session by describing our Solr/Lucene based open-source web project METAREP for high-performance comparative metagenomics (watch). Next was Stefan Olafsson from TwigKit presenting ‘The 7-minute Search UI’, a presentation which I thought was another gem of this conference. In contrast to other talks, it focused on user experience and intuitive user interfaces. TwigKit has developed a framework that provides well designed search widgets that can be integrated with several search engines.

“If nobody is against you in open source then you are not right.” Marten Mickos (CEO Eucalyptus)

The keynote on the second day was presented by Marten Mickos, the CEO of Eucalyptus and former CEO of MySQL. He opened by advocating his philosophy of making money out of open source projects. “Innovation is a change that creates a new dimension in performance,” he said, and mentioned the open-source Apache web server that allows anyone to run a powerful web server. He added “Market disruption is a change that creates a new level of efficiency” and referred to MySQL, which was originally designed to scale horizontally. While in 1995 such a design was a drawback compared to other solutions on the market, scale-out has become the dominant design today. Now, within the cloud, horizontal scaling is the key, a fact that has made MySQL the most used database in the cloud.

He observed that “while most successful open-source projects are related to building infrastructure software, servers and algorithms, there are only a few open-source projects centered around human behavior, user experience and user interfaces. The latter projects are mainly developed in closed source environments.” Then he went on to praise open source as a driver for innovation: “Open source is so effective because you are not protected. Code can be scrutinized by everybody. In a closed-source company, your only competition is within the company, while in open source you compete with everybody.” Open source is a way to innovate and it is more productive. It usually takes a stubborn individual to drive things. Innovation mostly stems from single individuals that are supported by the community.

When asked how to maintain property rights as a company when running an open-source model, he responded “keep things that keep the business going proprietary but open-up others. The key is to be very transparent with your model.”

What’s next?

In a panel session, the core Solr/Lucene committer team discussed future features. The team works on rapid front-end prototyping using the Apache Velocity template engine and Ajax. The prototyping code can be found in the current trunk of the Solr/Lucene code repository under the /browse directory. A Solr/Lucene cloud enabled version is being developed. Twitter’s real time search functionality will be integrated. Other open source projects that are being integrated are Nutch, a web-search software, and Mahout for machine learning. New features will include pivot tables (table matrices), a hierarchical data type, spatial searching, and flexible indexing.

The above represents a subset of the talks that took place. There were many other interesting talks, some of which took place in parallel sessions. Individual presentations can be downloaded from the Lucid Imagination conference page. A selection of videos is available here. The next Lucene Revolution conference will take place in San Francisco in May 2011.

After four days of Solr/Lucene, many coffees, talks, discussions, I left inspired by the conference. It dawned on me that the real revolution is not the search technology but the strong community spirit itself that has emerged and drives developers to jointly work towards a common goal.

Entamoeba histolytica research presented at the Molecular Parasitology Meeting

Entamoeba histolytica causes invasive intestinal and extraintestinal infections, known as amoebiasis, in about 50 million people and still remains a significant cause of human death in developing countries. However, for unknown reasons, fewer than 10% of E. histolytica infections are symptomatic (causing symptoms such as diarrhea, dysentery or liver abscess). The J. Craig Venter Institute is among the institutions awarded the NIAID Genome Sequencing Centers for Infectious Diseases (GSCID) contracts to provide high-quality genome sequencing and high-throughput genotyping of NIAID Category A-C priority pathogens.

Photo of Entamoeba histolytica

Entamoeba histolytica in the trophozoite stage.

A GSCID project led at JCVI by Dr. Elisabet Caler includes performing whole-genome sequencing of Entamoeba phenotypic variants from symptomatic, asymptomatic and liver abscess-causing strains chosen to include a range of clinical manifestations and taken from human cases, as well as strains grown under different conditions. Our objective is to develop a genome-wide landscape of Entamoeba diversity to understand how sequence variations in the parasite relate to pathogenicity (ability to cause disease) and clinical outcome.

The Molecular Parasitology Meeting held at the Woods Hole Oceanographic Institution, Woods Hole, MA last week provided a window into the exciting science of Parasitology.  The keynote speaker, Fotis Kafatos, spoke on “Major Challenges to Global Health in the Tropics and Beyond–Insect Vectors of Malaria and Other Parasitic or Viral Diseases.”  Dr. Kafatos stressed that a multi-pronged approach to the control of malaria is necessary to prevent the devastating loss of life that malaria causes.

Woods Hole Oceanographic Institution

A view of Woods Hole Oceanographic Institution.

The many excellent papers and posters provided an overview of the field, including   Plasmodium falciparum, Toxoplasma gondii, the trypanosomes, Giardia lamblia, Trichomonas vaginalis, Entamoeba histolytica, Schistosoma species, Babesia bovis, and associated vectors.  Topics spanned basic biology, drug design, sequencing and host-pathogen interactions.

I presented an overview of the Entamoeba sequencing project at the meeting. Discussions as a result of the presentation included questions about the details of sequencing and handling the next-generation sequencing data. We had animated discussions about methods for assembly of the DNA sequences, including reference-guided vs de novo assembly. Many attendees were impressed with JCVI’s open-source METAREP metagenomic tool (J. Goll, et al., Bioinformatics 2010). Determination of the best methods for the analysis of differences in the clinical isolates generated much discussion. Entamoeba researchers see the sequences as a great resource and are looking forward to being able to mine the data. One, from India, was very excited that he was going to have about 15 times the resources he has had in the past, since he has had only one genome to mine up until now.

The Molecular Parasitology Meeting was an excellent venue for scientific exchange.  The Entamoeba histolytica GSCID project will help us understand the pathogenicity of Entamoeba histolytica, and has the potential to save lives in developing countries.

Virtual Comparative Metagenomics

We have created an Open Virtualization Format (OVF) package of JCVI’s Metagenomics Reports (METAREP), a high performance comparative metagenomics analysis tool. The software runs on a web server, retrieves data from two different database systems and uses R for statistical analysis. The new OVF package bundles all these 3rd party tools and is configured to run out of the box in a virtual machine.

Screenshot of the VirtualBox appliance import wizard. The wizard allows you to specify the CPU and memory usage of the virtual machine on which METAREP will run.

To run a virtual version of METAREP on your machine, follow these steps:

  1. Download the METAREP OVF package from our ftp site [download].
  2. Unzip the OVF package.
  3. Download and install Oracle’s VirtualBox, an OVF-compatible virtualization software [download].
  4. Start VirtualBox.
  5. Click File/Import Appliance and select the OVF file.
  6. Adjust RAM/CPU usage using the Appliance Import Wizard (see image).
  7. Start the VM.
  8. Double-click on the METAREP Firefox link on the VM desktop.
  9. Log into METAREP with username=admin and password=admin.
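Steps 5 to 7 can probably also be scripted with VirtualBox’s `VBoxManage` command-line tool instead of the wizard. The Python sketch below only assembles and prints the commands; the file name `metarep.ovf` and the VM name `METAREP` are placeholders I made up, and the `--vsys`, `--memory`, and `--cpus` options of `VBoxManage import` should be checked against the VirtualBox version you have installed.

```python
import shlex


def import_command(ovf_path, memory_mb, cpus):
    """Arguments for importing an appliance with explicit RAM and CPU settings."""
    return ["VBoxManage", "import", ovf_path, "--vsys", "0",
            "--memory", str(memory_mb), "--cpus", str(cpus)]


def start_command(vm_name):
    """Arguments for starting an already imported virtual machine by name."""
    return ["VBoxManage", "startvm", vm_name]


# Placeholder file and VM names; nothing is executed here, only printed.
commands = [import_command("metarep.ovf", 2048, 2), start_command("METAREP")]
for command in commands:
    print(shlex.join(command))
```

To actually run the commands, each list could be passed to `subprocess.run(command, check=True)`. Opening the METAREP link and logging in (steps 8 and 9) still happen inside the VM.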

This virtual machine appliance is the first step in developing a fully cloud-enabled analysis platform where users can easily launch the application wherever is most convenient: on their personal desktop or in the cloud, where they can scale out the appliance to suit their needs.

Future virtual machine images will be certified to run on other virtualization software platforms. Stay tuned.

If you would like to learn more about METAREP and talk to the developers, join us at the Lucene Revolution Conference in Boston (October 7-8, 2010). We will present a lightning talk about METAREP on the first day of the conference at 5pm (see agenda).





METAREP Source Code

HMP Consortium – St. Louis Missouri

Human Microbiome Project Consortium – September 2010 – St Louis, Missouri

We received warm welcome messages from Dr George Weinstock and Dr Jane Petersen as well as a humorous welcome from Dr Larry Shapiro, Dean of Washington University Medical School. 

It was wonderful to see so many scientists come together to share the progress on their individual HMP related demonstration projects.  Our own demonstration project with Dr Zhiheng Pei, involving the esophagus microbiome and how that relates to esophageal adenocarcinoma (EA), was quite unique compared to the other projects as we were the only group to focus on the correlation between bacterial population and a form of cancer. 

With over 400 participants and 59 speakers, the conference was quite successful and very interesting. JCVI Director Dr Karen Nelson did a wonderful job moderating one of the segments. Dr Roger Lasken also gave a thorough presentation on his lab’s single cell approaches to genomic sequencing of unculturable bacteria. Johannes Goll gave a great presentation on his recent work with an open source tool called METAREP (recently published in Bioinformatics 8/26/2010), which is designed to help scientists with analyzing annotated metagenomic data. And Dan Haft presented his interesting work with algorithmically tuning protein families from reference genomes for systems discovery.

Overall the conference was quite interesting and informative.  I continue to wish all of the participating sequencing centers, PIs, and others involved with the HMP much success with their projects. 

Hope to see everyone in Vancouver!!!

Advance Access JCVI Metagenomics Reports Application Note

A significant JCVI informatics development is JCVI Metagenomics Reports, an open source Web 2.0 application designed to help scientists analyze and compare annotated metagenomics data sets. Users can download the application to upload and analyze their own metagenomics datasets.

METAREP has just been published in Bioinformatics (08/26/2010) as an open access article. The publication is currently accessible under the Bioinformatics Advance Access model. The PDF version can be downloaded at

Supplementary information includes the METAREP data model and an overview about its search performance accessible at

One key feature that distinguishes METAREP from other metagenomics tools is that it utilizes a high-performance scalable search engine that allows users to analyze and compare extremely large metagenomics datasets, e.g. datasets produced by the Human Microbiome Project.

If you would like to learn more about METAREP and talk to the developers, join us at the Human Microbiome Research Conference in St. Louis, Missouri (August 31 – September 2, 2010). We will present METAREP on the first day of the conference at 10:35am (see agenda).

Contact Us:

We would like to hear from you. If you have questions or feedback or if you wish to contribute to the METAREP open source project please send an email to

High-performance comparative metagenomics

Are you carrying out large scale metagenomics analyses to identify differences among multiple sample sites? Are you looking for suitable analysis tools?

If you have not yet found the right analysis tool, you may be interested in the latest beta version of JCVI Metagenomics Reports (METAREP) [Test It].

METAREP is a new open source tool developed for high-performance comparative metagenomics.

It provides a suite of web based tools to help scientists view, query, browse, and compare metagenomics annotation data derived from ORFs called on metagenomics reads or assemblies.

Users can either specify fields, or logical combinations of fields, to filter and refine datasets. Users can compare multiple datasets at various functional and taxonomic levels, applying statistical tests as well as hierarchical clustering, multidimensional scaling, and heatmaps (see image gallery).

For each of these features, tab delimited files can be exported for downstream analysis. The web site is optimized to be user friendly and fast.

Feature Summary [download Flyer]:

  • Handle extremely large datasets. Uses scalable high-performance Solr/Lucene search engine (we have indexed 300 million annotation entries, but much larger volumes can be handled as shown by Hathi Trust).
  • Compare 20+ datasets at the same time. Use various compare
    options including statistical tests and plot options to visualize
    dataset differences at various taxonomic and functional levels.
  • Apply statistical tests such as METASTATS (White et al.), a modified
    non-parametric t-test to compare two sample populations (e.g.
    metagenomics samples from healthy and diseased individuals).
  • Export publication-ready graphics. Export heatmaps, hierarchical clustering, and multi-dimensional scaling plots in PDF format.
  • Analyze KEGG metabolic pathways. Summaries include enzyme
    highlights on KEGG maps, pathway enzyme distributions, and
    statistics about pathway coverage at various pathway levels.
  • Search using a SQL-like query syntax. Build your query using 14
    different fields that can be combined logically.
  • Drill down into data using METAREP’s NCBI Taxonomy, Gene
    Ontology, Enzyme Classification or KEGG Pathway browser.
  • Install your own METAREP version. Flexible central configuration; the METAREP and 3rd party code base is completely open source.
  • Cross-link function with phylogeny. Slice your data at various
    taxonomic and/or functional levels. For example, search for all
    bacteria or exclude eukaryotes or search for a certain (GO/EC
    ID)/taxonomic combination.
  • Generic data format. Data types that can be populated include a
    free text functional description, best BLAST hit information, as well
    as GO ID, EC ID, and HMMs.
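To give a feel for how fields can be combined logically, here is a small Python sketch that assembles a query string in Lucene-style syntax (`field:value` clauses joined with AND, OR, and NOT). The field names `taxon` and `ec_id` and the helper functions are hypothetical; the actual field names are listed in the manual, and this sketch does not escape special characters in values.

```python
def term(field, value):
    """A single field:value clause; values containing spaces are quoted."""
    value = str(value)
    if " " in value:
        value = f'"{value}"'
    return f"{field}:{value}"


def all_of(*clauses):
    """Every clause must match."""
    return "(" + " AND ".join(clauses) + ")"


def any_of(*clauses):
    """At least one clause must match."""
    return "(" + " OR ".join(clauses) + ")"


def exclude(clause):
    """Negate a clause, e.g. to drop eukaryotic hits."""
    return f"NOT {clause}"


# Hypothetical fields: bacterial hits with either of two enzyme classes,
# excluding anything assigned to eukaryotes.
query = all_of(
    term("taxon", "Bacteria"),
    any_of(term("ec_id", "2.7.1.1"), term("ec_id", "2.7.1.2")),
    exclude(term("taxon", "Eukaryota")),
)
print(query)
```

The same pattern covers the cross-linking case from the list above: one clause restricts the taxonomy, another the function, and the two are joined with AND.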

How to analyze your own data: You can install your own METAREP version to analyze your metagenomics annotation data [download source]. We have written a comprehensive manual that describes the installation process step by step [download manual]. Since METAREP only operates on annotated data, raw sequences need to be annotated first. Supported data types that can be loaded for each sequence include functional descriptions, best BLAST hits fields (E-Value, Percent Identity, NCBI Taxon, Percent Sequence Coverage), GO, EC, and HMM assignments. The installation also contains a set of example annotations that can be imported.

Genomics of the Indoor Air Environment

Microbial air concentrators in the air mixing room of a large office building. These concentrators work by pulling air through the water-filled chamber (the glowing boxes) and entraining bacteria and particles in liquid. The concentrator on the right is sampling from the building's internal air vent, while the machine on the left is sampling the fresh air entering the mixing room at a 90° angle.

Most of our life is spent indoors, well-buffered from the constant changes in temperature, humidity, wind and light which shape life outside our homes and offices. It seems intuitive that the types of microorganisms which inhabit our indoor environment must be different from those on the outside; after all, by removing environmental stresses such as UV, desiccation and wind, we eliminate selective pressures on populations. We spend 23 hours a day indoors, but we know very little about the types of aerosolized microorganisms we encounter – and inhale – with every breath. It has been postulated that ‘everything is everywhere’: microbes are widely distributed around the world, with particular environments selecting for population subsets. If that is the case, then are the organisms inhabiting a house different from those in a hospital, school or a downtown high rise? Are cosmopolitan microorganisms simply taking advantage of our climate-controlled environment, or are they interested in us in particular, bearing genes for virulence and pathogenicity? And how are indoor microorganisms adapting to our widespread use of antibiotics, antimicrobials and other bacterial control agents?

Kitchen appliances and workspaces are potential sources of aerosolized microorganisms. We deployed several air samplers in this home, including "Dr. Hibbert" in the kitchen.

Through a grant from the Alfred P. Sloan Foundation, the J. Craig Venter Institute is developing new tools and techniques to examine the composition of microbial populations in the indoor air environment. As many of these organisms are resistant to cultivation, our approach is modeled on the techniques developed during the Global Ocean Sampling (GOS) Expedition, namely the shotgun sequencing of bulk DNA to generate a genomic snapshot, or metagenome, of the indoor air environment. A fundamental difference from the GOS expedition is that of scale: a milliliter of open ocean seawater may contain 10,000 microorganisms or more, so filtering a few hundred liters of seawater will often capture enough DNA to construct a high quality random shotgun library for sequencing. Microbial density in the air is quite different: aerosolized bacteria counts for outdoor air are closer to 10,000 per cubic meter, meaning that a given volume of air contains a million times fewer organisms than an equivalent volume of seawater.
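As a rough check on that comparison, here is a minimal Python sketch of the arithmetic. It uses only the round figures quoted above (10,000 cells per milliliter of seawater, about 10,000 cells per cubic meter of outdoor air). The function names and the 200-liter example volume are illustrative assumptions, and the calculation assumes every cell is captured, which real samplers do not achieve.

```python
# Back-of-the-envelope comparison of microbial density in seawater and air.
# Both densities are the round figures quoted in the text, not measurements.

SEAWATER_CELLS_PER_ML = 10_000   # "10,000 microorganisms or more" per milliliter
AIR_CELLS_PER_M3 = 10_000        # roughly 10,000 bacteria per cubic meter of outdoor air
ML_PER_M3 = 1_000_000            # 1 cubic meter = 1,000 liters = 1,000,000 milliliters


def seawater_cells_per_m3() -> int:
    """Seawater density converted to the same unit used for air."""
    return SEAWATER_CELLS_PER_ML * ML_PER_M3


def density_ratio() -> float:
    """How many times denser seawater is than air, volume for volume."""
    return seawater_cells_per_m3() / AIR_CELLS_PER_M3


def air_m3_matching_seawater(liters: float) -> float:
    """Cubic meters of air holding as many cells as `liters` of seawater.

    Assumes 100% collection efficiency (an idealization).
    """
    cells = liters * 1_000 * SEAWATER_CELLS_PER_ML  # liters -> mL -> cells
    return cells / AIR_CELLS_PER_M3


if __name__ == "__main__":
    print(f"Seawater is about {density_ratio():,.0f} times denser than air")
    # 200 liters is a hypothetical stand-in for "a few hundred liters".
    print(f"200 L of seawater matches {air_m3_matching_seawater(200):,.0f} cubic meters of air")
```

Under these assumptions the ratio works out to one million, which is why air samplers have to process far larger volumes than ocean filtration does.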

Field decontamination of an air sampler - this particular picture was taken in the air handling room of a medical center. Each machine was completely disassembled and all of the internal plumbing was replaced to prevent contamination. Those belts on either side of me let out a shrieking 110 dB of sound, hence the earplugs!

An additional issue is collection efficiency. The collection efficiency of most cyclone-style air samplers is usually less than 100% – this is especially true as the particle size drops below 1 micron. As aerosolized bacteria are often small and desiccated, this efficiency becomes a problem, and air samplers have to be run for days at a time to collect sufficient DNA for sequencing. These long collection times lead to problems with growth and contamination. Our air samplers are wet-cyclone Spin-Con concentrators, and to prevent bacteria from growing inside the collection chamber, we add a number of bacteriostatic compounds to keep the organisms from multiplying in the collection liquid and skewing our population data. We have also programmed our collectors to dispense the collected sample into a refrigerated vessel containing additional growth inhibitors, and this is done every 2 hours. Lastly, we completely clean or replace almost all of the tubing and parts on a daily basis – we do this to reduce the chance of biofilms forming inside any of the tubing or collection chambers. Air sampling is a labor-intensive process, but the results have been relatively clean and diverse samples reflecting the actual microbial composition of the air environment.

An array of microbial air samplers at the end of the Scripps Research Pier. This 300 m long pier intercepts air before it reaches land, so is an ideal place to determine the marine microbial component to regional air quality.

Our original dataset was from a high-rise in Midtown Manhattan, where we collected air from an air mixing room over 20 floors up. Both indoor and outdoor air samples were collected, and these samples form a baseline of data for much of the sampling we are currently conducting in California. The air in New York City generally arrives from the west, so in addition to its urban signature, it also contains soil, dust and pollen from an entire continent. In San Diego, the predominant winds are from the Pacific, and we suspect that there will be a strong marine component to populations of microorganisms in both indoor and outdoor environments. To determine this baseline, we set up an array of samplers at the end of the 1,000-foot-long research pier at Scripps Institution of Oceanography. These samplers ran for five days, and aside from an osprey menacing the water bags, we were able to collect relatively clean marine air before it was influenced by the terrestrial environment.

Doug Fadrosh checks on our de-aggregation and filtration setup. After the microorganisms are dissociated from the particulate matter and each other, they are filtered through 3.0 and 0.1 micron filters, and the balance - the viral fraction - is collected in the flask.

As can be seen in the pictures accompanying this blog, we have sampled in a home and a medical center, and we plan to sample in a school and an office building. Each of these indoor environments is unique, and because some of the sites are ten miles inland, we are interested to see how the marine microbial component in the air attenuates with distance. We run multiple machines in parallel for several days, and produce two liters of collected liquid, which we then process and concentrate before we attempt to isolate the DNA. We have noticed that many of the organisms are associated with particles, so we use surfactants and mild physical techniques to de-aggregate the microbes prior to filtration, and we have found that this increases our yield of DNA substantially.

A 0.1 micron filter following de-aggregation and processing. The filtrate had already passed through the 3.0 micron filter, and is composed of bacteria and small particles. This filter contains particles from 5.4 million liters of air!

After the sample has been de-aggregated, we pass the liquid through sequential 3.0 and 0.1 micron filters in order to fractionate the sample. Larger material, particularly fungal spores, pollen and eukaryotic cells, tends to get trapped on the 3.0 micron filter, leaving more bacteria on the 0.1 micron filter. Any material which passes through the 0.1 micron filter is generally viruses and very small particulates – these too will be sequenced. An example of the quantity of material on a 0.1 micron filter can be seen in the picture on the left – it is surprising just how ‘clean’ 5 million liters of air is! In posts to come we will describe more on how we get from filters to DNA and on to libraries, as well as share some of our preliminary results, so stay tuned!