Posts in category Bioinformatics

H3Africa Update

The National Institutes of Health (NIH) and the UK-based Wellcome Trust, in partnership with the African Society of Human Genetics, developed a program to foster genomic and epidemiological research in African scientific institutions. The laboratory and computational infrastructure available to most scientists on the African continent is currently insufficient to keep up with the rapid developments in DNA sequencing technologies and the need to use advanced computationally intensive methods to analyze this data.

Through the H3Africa Consortium, a partnership between NIH and Wellcome Trust, funding has become available to support knowledge development and implementation of genomics-centered research in several African academic institutions. The first scientific paper to come from this effort, Enabling the Genomic Revolution in Africa, was published in the journal Science in June 2014.

H3Africa Efforts at J. Craig Venter Institute (JCVI)

One of the main initiatives of H3Africa is to foster scientific exchange between US-based partners and their African-based consortium members. JCVI is involved in a number of such partnerships through training and research collaborations.

Tuberculosis Research with Addis Ababa University

Addis Ababa University is the only Ethiopian institution to receive a primary award from NIH under H3Africa. It is based on a collaboration with JCVI. Professor Gobena Ameni of Addis Ababa University and Dr. Rembert Pieper of JCVI developed a proposal on Systems Biology for Molecular Analysis of Tuberculosis in Ethiopia which was initiated earlier this year. The research focuses on genomic variability in M. tuberculosis strains in Ethiopian pastoralist societies and also has an oral microbiome and proteomic biomarker discovery component.

Bioinformatics Training for African Scientists

As part of H3Africa, JCVI is leveraging its recent GCID award, where appropriate, to train African scientists. As part of this effort, Dr. Andrey Tovchigrechko taught microbiome analysis to graduate students in Ibadan, Nigeria. The workshop, organized by the local H3Africa Bioinformatics Network node, took place in July 2014 and drew students from Nigeria and other West and Central African countries.

Symposium presenters.

Workshop student participants.

The workshop was held at IITA.

During the three-day workshop, Dr. Tovchigrechko taught the students how to launch and control computing instances on the Amazon cloud, the basics of Python and R programming, the MG-RAST Web interface, the MG-RAST R package matR, and the JCVI-developed R code MGSAT. MG-RAST tutorials were provided by one of its developers, Andreas Wilke (ANL).

Dr. Tovchigrechko also gave a talk, along with a dozen other speakers, at a one-day symposium at the University of Ibadan that preceded the workshop and included approximately 200 participants. Special thanks go to Nash Oyekanmi, the organizer and manager of the whole event, for his relentless efforts.

Collaborations with University of Cape Town

Also as part of the H3Africa Consortium, Dr. William Nierman from JCVI and Dr. Mark Nicol from the University of Cape Town, South Africa, are collaborating to study the nasopharyngeal microbiome and respiratory disease in African children. Dr. Nierman’s group has conducted a month-long in-house microbiome training workshop with students from Dr. Nicol’s group.

The focus of the training was to teach students JCVI’s complete microbiome pipeline (including sample preparation, sequence generation, and final association analysis). The aim of the training collaboration is to ensure that this complete pipeline can be performed at the University of Cape Town, to help build independent and sustainable capacity in this field within South Africa.


J. Craig Venter at Recent Google Zeitgeist Conference [VIDEO]

Dr. J. Craig Venter recently spoke at a Google Zeitgeist conference in Arizona on advances in genomics, synthetic biology, and DNA as the software of life.

Understanding Complex Data through Better Visualization

Recently, researchers at JCVI reported on the Rhizoctonia solani mitochondrial genome, which was the largest fungal mitochondrial genome sequenced to date. We showed that its unusually large size was probably due to the expansion of multiple genetic elements that populated the genome in a somewhat ‘parasitic’ relationship. The visualization was meant to convey the number and variety of these repetitive genetic elements, and it was highlighted in a commentary in FEMS Microbiology Letters as an example of how to summarize molecular data in order to obtain an overall view of the results.

The outermost circle represents the chromosome and repetitive elements. Other important features, such as genes, endonucleases, exons, and RNAseq coverage, are represented in the inner concentric circles. Grey links represent short repeats (< 35bp) found up to 100 times in the genome; colored links show the location of repeats and follow the coloration in Track 1.

Professional Development Opportunities this Summer

This summer we are offering two professional development workshops: GenomeSolver and Bioinformatics: Unlocking Life through Computation. Both explore bioinformatics, microbial diversity, and their implementation in undergraduate or high school classrooms.

The GenomeSolver workshop trains faculty on genome analysis. Workshop attendees will learn about general methodologies, standards, and processes used to annotate and analyze microbial genomes. The workshop contents will be available to aid the faculty in developing teaching modules. In addition, extensive documentation on methodologies and tools will be available via the online environment created for this project. An online web portal, Genome Solver, will be a virtual space for developing and sustaining a community. Genome Solver will assist faculty with technical issues and curricular design, and will provide an online environment for the ongoing sharing of information, including publication of student work.

Bioinformatics: Unlocking Life through Computation is a new opportunity for high school teachers. Genomics and biotechnology are valuable tools in our quest to understand life and nature. However, introducing the science classroom to the computational and mathematical underpinnings of biology can be challenging. The goal of this workshop is to introduce a curriculum for mathematics and science education in the area of genomics (with a focus on the fascinating world of microbes). Educators will be introduced to the various analysis and computational challenges that arise in this discipline. Workflow examples illustrating comparative genomic analysis will be made available through the JCVI Metagenomics Report (METAREP) software infrastructure. The eventual aim is for the educational material to be integrated with local high school curricula requirements to expose students to both hypothesis-driven and discovery-based science.

JCVI Hosts South African Scientists to Share Microbiome Research Techniques

Two scientists from the University of Cape Town, South Africa have joined Dr. Bill Nierman’s lab for the next month as part of NIH’s Human Heredity and Health in Africa (H3Africa) Initiative, a training program designed to build out technical biological skills in the African research community. This training relates specifically to developing techniques around the area of microbiome analysis, a relatively new field in the biological sciences.

Microbiome analysis for the collaborative study looks at the entire community of microorganisms in the respiratory tract of South African infants to better understand how the microbiome is associated with infant pneumonia and wheezing episodes. The expectation is that the organisms residing in the infant respiratory tract will either protect against or predispose to pneumonia or wheezing episodes.


The Nierman Group

The Nierman group, left to right: Sarah Lucas, Bill Nierman, Shantelle Claassen, Mamadou Kaba, and Stephanie Mounaud (not pictured: Jyoti Shankar and Lilliana Losada). The group welcomes visiting scientists Ms. Claassen and Dr. Kaba from the University of Cape Town for a month-long training in microbiome sequencing and analysis.

Mamadou Kaba, MD, PhD, and colleague Shantelle Claassen from the University of Cape Town will be working closely under the guidance of JCVI’s Stephanie Mounaud, who is the project manager and coordinates the laboratory components of a similar project at JCVI studying the microbiomes of infants in the Philippines and also in South Africa. These studies are sponsored by the Bill and Melinda Gates Foundation. The training will focus initially on preparing samples for DNA sequencing on a modern DNA sequencing platform, the Illumina MiSeq instrument. Once the sequence reads are off the sequencer, the instructional focus will shift to analysis of the reads by means of an informatics pipeline that develops phylogenies, or family trees, of the microbes obtained from the infant respiratory tract so that the abundance and relatedness of the microbes can be established. The bioinformatics training will be provided by Jyoti Shankar, the statistical analyst working on the Gates Foundation Project.

Mamadou Kaba is a Wellcome Trust Fellow working in the Division of Medical Microbiology, Faculty of Health Sciences, University of Cape Town. Mamadou’s research interests include the molecular epidemiology of infectious diseases and the study of the human microbiome in health and disease. He has contributed to establishing a new research group conducting studies on how the composition of the upper respiratory tract, gastrointestinal, and house dust microbial communities influences the development of respiratory diseases.
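The abundance step mentioned above can be illustrated with a minimal sketch. It is not JCVI's pipeline: it assumes each read has already been assigned a taxon label by some upstream classifier, and the function name `relative_abundance` and the read labels are invented for illustration.

```python
from collections import Counter


def relative_abundance(read_taxa):
    """Return the fraction of reads assigned to each taxon.

    `read_taxa` is a list with one taxon label per read (hypothetical input;
    a real pipeline would derive these labels from the sequence data).
    """
    counts = Counter(read_taxa)
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


# Made-up labels for ten reads from one sample.
reads = ["Moraxella"] * 5 + ["Streptococcus"] * 3 + ["Haemophilus"] * 2
abundance = relative_abundance(reads)
```

Comparing such per-sample fractions across infants is one simple way to relate community composition to clinical outcomes; establishing relatedness requires the phylogenetic step, which this sketch does not attempt.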

Prior to joining the University of Cape Town, Mamadou worked as Research Associate at the Laboratory of Medical Microbiology, Timone University Hospital, Marseille, France, where he studied the epidemiological characteristics of infection with hepatitis E virus in South-eastern France.

Shantelle Claassen is pursuing a Master’s degree in the Division of Medical Microbiology at the University of Cape Town. She has completed a BSc (Med) Honours degree in Infectious Diseases and Immunology at the University of Cape Town, during which she examined the relative efficacy of extracting bacterial genomic DNA from human faecal samples using five commercial DNA extraction kits. The DNA extraction kits were evaluated based on their ability to efficiently lyse bacterial cells, cause minimal DNA shearing, produce reproducible results and ensure broad-range representation of bacterial diversity.

Mamadou and Shantelle are currently involved in an additional prospective, longitudinal study whose primary objective is to investigate the association between fecal bacterial communities and recurrent wheezing during the first two years of life.

Plant Bioinformatics Workshop

JCVI recently held its 3rd Annual Plant Bioinformatics Workshop from July 15 to 19. During the week-long workshop, 20 scientists from the plant research community visited JCVI and learned many aspects of bioinformatics from the members of Chris Town’s Plant Genome group. Attendees included undergraduate and graduate students, post-doctoral fellows, research scientists and faculty from various universities throughout the United States, as well as from a biotech company. In addition to the on-site participants, 5 participants attended the workshop via WebEx. The virtual participants had the opportunity to sit in on the lectures and complete the hands-on exercises by logging into an Amazon Cloud instance, which was set up specifically for this purpose. The topics covered during the workshop included UNIX tools for Bioinformatics, Genome Assembly, Structural and Functional Annotation, RNA-seq assembly and analysis, and SNPs. In addition to JCVI’s instructors, external instructors covered several sections. Eric Lyons (University of Arizona and iPlant) presented on Comparative Genomics and the iPlant Infrastructure, and Ann Loraine (UNC Charlotte) presented on Integrated Genome Browser. All sessions contained a hands-on component so the students would have the opportunity to use the tools discussed during the lecture portion. Watch our website for future offerings!


JCVI Viral Finishing Pipeline: a Winning Combination of Advanced Sequencing Technologies, Software Development and Automated Data Processing

JCVI viral projects are supported by the NIAID Genomic Sequencing Center for Infectious Disease (GSCID). The viral sequencing and finishing pipeline at JCVI combines next generation sequencing technologies with automated data processing. This allowed us to complete over 1,800 viral genomes in the last 12 months, and almost 8,800 genomes since 2005.

Viral Projects at JCVI

JIRA Viral Sample Tracking Workflow

Our NextGen pipeline, which utilizes SISPA-generated libraries with Roche/454 and Illumina sequencing, enables us to complete a wide variety of viral genomes, including challenging samples. The automated assembly pipeline employs CLCbio command-line tools and JCVI cas2consed, a cas-to-ace assembly format conversion tool. Our complementary Sanger pipeline software is currently being integrated with the NextGen pipeline. This will improve our data processing and will allow us to use validation software (autoTasker) more efficiently.

Assembly of Repetitive Viral Genomes

Genome Organization of Varicella-Zoster

Assembly of Novel Viral Genomes

CLC Assembly Viewer Representation

Promoter of Bat Genome

During the past year we have found that novel viruses, repetitive genomes, and mixed infection samples could not be easily integrated with our high-throughput assembly pipeline. We have developed an assembly and finishing process that utilizes components of the high-throughput pipeline and combines them with manual reference selection and editing. Using this approach we completed novel adenovirus genomes and mixed-infection avian influenza genomes, and improved assemblies of previously unknown arbovirus genomes. We are currently working on optimizing and automating this new pipeline.

Assembly of Mixed Viral Genomes

Consed Representation of Mixed Viral Sample

Repetitive genomes have long been known to present great challenges during assembly and finishing. We are presenting a new approach to assembly and finishing of the repetitive varicella genome that is based on separating it into overlapping PCR amplicons and then merging the sequenced amplicons during assembly.
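The amplicon-merging idea can be illustrated with a toy sketch. This is not the JCVI pipeline: the function names `overlap_length` and `merge_amplicons`, the minimum overlap of 5 bases, and the sequences are all hypothetical, and the sketch assumes error-free amplicons supplied in genome order.

```python
def overlap_length(left, right, min_overlap=5):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def merge_amplicons(amplicons, min_overlap=5):
    """Merge ordered, overlapping amplicon sequences into one sequence."""
    merged = amplicons[0]
    for nxt in amplicons[1:]:
        size = overlap_length(merged, nxt, min_overlap)
        if size == 0:
            raise ValueError("adjacent amplicons do not overlap")
        merged += nxt[size:]  # append only the part not already covered
    return merged


# Made-up 36-base "genome" cut into three overlapping amplicons.
genome = "ATGCGTACGTTAGCCGATAGGCTTAACGGATCCGTA"
amplicons = [genome[0:15], genome[10:26], genome[20:36]]
assembled = merge_amplicons(amplicons)
```

Because each amplicon is anchored by its PCR primers, a repeat that would confuse a whole-genome assembler only has to be resolved within one amplicon and its immediate neighbors. Real data would also need tolerance for sequencing errors in the overlaps, which exact string matching does not provide.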

To streamline our viral pipelines, we have fully integrated them with JCVI’s LIMS and JIRA Workflow Management to create a semi-automated tracking interface that follows the progress of viral samples from acquisition through to NCBI submission. This allows us to process a large volume of samples with limited manual interaction and, at the same time, gives us flexibility to work on challenging and novel genomes.


The JCVI Viral Genomics Group is supported by federal funds from the National Institute of Allergy and Infectious Disease, the National Institutes of Health, and the Department of Health and Human Services under contract no. HHSN272200900007C.

The bat coronavirus project is a collaboration with Kathryn Holmes and Sam Dominguez, University of Colorado Medical Center.

The authors would like to thank members of the Viral Genomics and Informatics group at JCVI.


Viral genome sequencing by random priming methods. Djikeng A, Halpin R, Kuzmickas R, Depasse J, Feldblyum J, Sengamalay N, Afonso C, Zhang X, Anderson NG, Ghedin E, Spiro DJ. BMC Genomics. 2008 Jan 7;9:5.

A virus discovery method incorporating DNase treatment and its application to the identification of two bovine parvovirus species. Allander T, Emerson SU, Engle RE, Purcell RH, Bukh J.


This post is based on a poster by Nadia Fedorova, Danny Katzel, Tim Stockwell, Peter Edworthy, Rebecca Halpin, and David E. Wentworth.

Summit on Systems Biology, June 15-17, 2011

I attended the Summit on Systems Biology hosted by Virginia Commonwealth University in Richmond, VA June 15-17.  So, judging from the talks given, what is systems biology?

  • Systems biology is non-linear and/or multi-step.  Heavy math does not make something systems biology if it’s directly solvable.  Taking a big gene expression matrix, using principal component analysis on it, and coming up with a linear equation for the contributions of a list of biomarker genes, is not systems biology.  The same microarray expression experiment, coupled with pathway analysis in order to reduce candidate genes and so do a less stringent multiple-hypothesis-testing-correction and so have fewer false negatives, is.  So is a non-linear model of how just a few genes interact over time.
  • Standard bioinformatic analysis seeks correlations.  Systems biology goes beyond that to seek cause and effect.  Thus, most systems biology work involves time series, and sometimes simulation.
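The multiple-testing point in the first bullet can be made concrete with a small sketch. The talks did not specify a correction method; Bonferroni is used here only because it is the simplest, and the gene names, p-values, and pathway size are invented.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance cutoff under the Bonferroni correction."""
    return alpha / n_tests


def significant(p_values, alpha=0.05):
    """Return the genes whose p-value passes the Bonferroni cutoff."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return {gene for gene, p in p_values.items() if p <= threshold}


# Hypothetical experiment: 20,000 genes, one of which has a modest p-value.
all_p = {f"gene{i}": 0.5 for i in range(20000)}
all_p["gene7"] = 1e-4

# Pathway analysis (assumed) narrows the candidates to 200 genes.
pathway_p = {f"gene{i}": all_p[f"gene{i}"] for i in range(200)}

genome_wide_hits = significant(all_p)    # cutoff 0.05 / 20000 = 2.5e-6
pathway_hits = significant(pathway_p)    # cutoff 0.05 / 200   = 2.5e-4
```

With the genome-wide cutoff, gene7 is a false negative; after the pathway filter, the same p-value passes. That is the sense in which reducing the candidate list buys statistical power.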

What data and techniques do systems biologists use?

  • Large datasets of all types.  Microarray time-series, genomes, SNPs, protein-protein interactions, automated protein annotation – anything that comes in gigabytes instead of kilobytes.
  • There was marked interest in protein-protein interaction networks, and in micro RNAs (which inhibit translation of multiple target mRNAs).
  • There were several papers using reverse-phase protein microarrays.  RPMAs can distinguish phosphorylated (which usually means active) from unphosphorylated proteins, which helps understand protein interaction dynamics.
  • There were several papers using weighted gene co-expression network analysis.  WGCNA analyzes modules of co-expressed genes, rather than individual genes.  This gives more statistical power from sparse data.  Brian Sayre of VSU identified disease-resistance genes in livestock and crop species using single-nucleotide polymorphisms (SNPs) from related species.  We might know about some goats that are resistant to a disease that also affects sheep; but sheep don’t have the same SNPs as goats.  His group categorized the SNPs into genes, and the genes into pathways common across species, then looked for pathways associated with disease resistance in other species, and hypothesized that the same pathways would be involved in disease resistance in the target species.
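To show what "modules of co-expressed genes" means, here is a deliberately simplified sketch. It is not the WGCNA algorithm, which uses topological overlap and hierarchical clustering; it keeps only the soft-threshold adjacency |correlation|^beta and then groups genes by connected components. The function names, the `beta` and `cutoff` values, and the expression data are assumptions for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def coexpression_modules(expression, beta=6, cutoff=0.5):
    """Group genes whose soft-thresholded correlation reaches `cutoff`."""
    genes = list(expression)
    neighbors = {g: set() for g in genes}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            adjacency = abs(pearson(expression[g], expression[h])) ** beta
            if adjacency >= cutoff:
                neighbors[g].add(h)
                neighbors[h].add(g)
    # Connected components of the thresholded network act as modules here.
    modules, seen = [], set()
    for g in genes:
        if g in seen:
            continue
        stack, module = [g], set()
        while stack:
            cur = stack.pop()
            if cur in module:
                continue
            module.add(cur)
            stack.extend(neighbors[cur] - module)
        seen |= module
        modules.append(module)
    return modules


# Made-up expression profiles over five samples.
expression = {
    "geneA": [1, 2, 3, 4, 5],
    "geneB": [2, 4, 6, 8, 10.5],
    "geneC": [5, 1, 4, 2, 3],
    "geneD": [10, 2, 8, 4, 6.5],
}
modules = coexpression_modules(expression)
```

Testing two modules for association with a trait, instead of four genes, is a miniature version of the power gain described above.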

What do people do with systems biology?

  • Medical applications predominated.  The main areas of interest were cancer, aging, cell simulation, eukaryotic model organisms, genome-wide association studies, pathway analysis, and immunology.
  • There were no talks about industrial applications or synthetic biology.
  • There were no talks on prokaryotes, except one on host-pathogen interactions.  This struck me as odd, since eukaryotes are more difficult to analyze or simulate than prokaryotes, and we haven’t done these things with prokaryotes yet.
  • There were no talks on metagenomics.  This also struck me as odd; bacterial communities seem like a natural systems biology problem.

What does the future hold for systems biology?

  • Omniomics:  We don’t want just a protein’s sequence – we want to know where and when it is expressed, what regulates it, what it interacts with, and what parameters describe those interactions. Soon, annotating a genome will not mean producing a list of genes and their functions – it will mean producing a simulation.
  • We need to learn to think at a higher level of abstraction.  If you have tens or hundreds of thousands of genes, transcripts, proteins, small molecules, and structures interacting, you need to figure out what it is you’re really interested in (e.g., “How did this cancer bypass the G1 cell-cycle restriction checkpoint?”), how to specify that precisely enough to ask the computer for an answer, and not to insist on understanding all the details if the answer checks out.
  • There is a growing gap between research and practice.  We can make more and more detailed analyses of diseases, especially in cancer, where each patient has a unique disease at the genetic level.  Meanwhile, the FDA approval process is so long and expensive that even in diseases (for example, Alzheimer’s and FTLD) for which there are millions of patients and a handful of known causes, pharmaceutical companies don’t try to develop three to four separate therapies for those three to four causes.  And the gap is growing wider:  Even as we are coming up with ways to combine weak information from across an entire genome, the FDA is considering proposals to regulate genomic sequencing that would forbid doctors from acquiring a full sequence.

NASA and JCVI host symposium on the evolution of Earth and Life

On May 12th and 13th, the J. Craig Venter Institute in San Diego will be hosting a NASA Astrobiology Institute-funded symposium titled “Paleobiology in the genomics era.” Paleobiology is the study of the origins and evolution of life and, by nature, is interdisciplinary. The goal is to bring together scientists united by this common interest but differentiated by expertise.  A major intellectual challenge to paleobiology is the close interaction between environment and life.  As life evolved, it changed the environment and suffered the consequences.  One of the most extreme examples is the invention of oxygenic photosynthesis by cyanobacteria (blue-green algae); the sunlight-fueled production of oxygen dramatically changed the availability of crucial elements of life, like nitrogen, sulfur, iron, zinc, copper, and other trace metals.  Genome-based analyses showed that these environmental changes modulated the emergence of metal-requiring proteins.  For example, proteins that bind Fe evolved when the earth was Fe rich. Essentially, one biological event changed the environment, which in turn induced a subsequent biological change; a feedback cycle between biota and planet.

In order to study these interactions in a robust fashion, numerous lines of evidence must be integrated, despite originating from disparate fields like organic and inorganic geochemistry (oils and metals in rocks), micropaleontology (tiny fossils), and evolutionary biology.  Recent years have seen the emergence and maturation of synthetic biology and computational biology, two fields with tremendous potential for the formulation and testing of hypotheses about the evolution of life. To facilitate a dialog between these fields, I, along with Ariel Anbar from Arizona State University, and John Peters and Eric Boyd from Montana State University, have invited experts to present their work as it pertains to paleobiology.  The topic list almost appears schizophrenic, with numerous hard-core geochemical talks being followed by presentations on molecular genetics, synthetic biology, metagenomics, and comparative genomics.  This was intentional. I hope to feel intellectually challenged in the fashion of a 1st year graduate student and further hope that I’m not the only one.  A major wild card at the moment is the identity of over two-thirds of the attendees.  With travel grants available for graduate students, post-doctoral researchers, and faculty, we hope to incorporate novel perspectives not covered by the confirmed speakers.

While the content of the meeting is exciting, the format is pretty sweet too. As part of NASA’s Workshop Without Walls series, the meeting will be webcast live with an accompanying live stream chat.  Thus, people will be able to see the presentations and pose questions and comments during the attendant discussions.  Previous workshops have often had hundreds of live viewers throughout the meeting, despite only dozens of in situ attendees.  The actual energy savings for a single meeting are modest in isolation; imagine 250 people not flying 500 miles and you basically have a single 737 flight that remains grounded.  However, the future of environmentally-friendly science requires important preliminary steps to change dominant trends.  Similarly, the talks will be streamed live without charge and deposited on an open access scientific podcast site, so economic barriers to information exchange are removed.

Needless to say, I’m looking forward to this meeting. Organizing something like this is an absolute undertaking. The number of details that need attention is astounding.  And if you think I actually could do that, you don’t know me.  Numerous people at JCVI have provided invaluable assistance, including Matt LaPointe and Jasmine Pollard, Robert Friedman, Dave Negrotto, and Jody Wilson.  It would also have no chance of happening if not for Pat Goley, who has handled the numerous (read: uncountable) details I’ve lapsed on.

Check out the NASA page for the meeting and webcast registration.

JCVI Supports Human Microbiome Body Site Experts with Shotgun Data Analysis

Members of the Human Microbiome Project (HMP) Consortium, including human microbiome body site experts, gathered for a virtual Jamboree on January 19th. The fully online Jamboree was set up to communicate initial data products and the tools best suited for analysis, primarily to make the data amenable to body site experts in a user-friendly way. 61 participants followed the Jamboree agenda, with presenters given access to a common desktop that was shared via the internet using an online collaboration tool. Results from the Data Analysis Working Group (DAWG) were presented in the areas of 16S rRNA gene sequence (16S DAWG) and metagenomic whole-genome shotgun analysis (WGS DAWG). The efforts of the 16S DAWG focus on marker-gene based approaches to estimate biological diversity and how marker variability is associated with patient meta-data. The WGS DAWG complements results from the 16S marker based analysis with comprehensive sequencing of random pieces of genomic DNA from the collection of microorganisms which inhabit a particular site on, or in, the human body (microbiome). These analyses allow researchers to investigate, among other questions, what microorganisms are present, and the nature and extent of their collective metabolism, at a particular body site. Ultimately researchers want to relate this information to healthy versus disease states in humans.

METAREP tutorial presented as part of the HMP Virtual Jamboree

The current survey comprises more than 700 samples from hundreds of individuals taken from up to 16 distinct body sites. Illumina sequencing has yielded more than 20 billion reads, and the annotation data produced from the sequences exceeds 10 terabytes. In anticipation of such data volumes, we developed JCVI Metagenomics Reports (METAREP), an open source tool for high-performance comparative analysis, in 2010. The tool enables users to slice and dice data using a combination of taxonomic and functional/pathway signatures. To demonstrate how the tool can be used by body site experts, we picked and loaded sample data from 17 oral samples and presented a quick tutorial on how users can view, search, and browse individual samples and compare multiple samples (see video). The functionality was very well received and body site experts asked JCVI to make all the 700+ samples available. As a result of the Jamboree, JCVI, in collaboration with the HMP Data Analysis and Coordination Center and the rest of the HMP consortium, will soon set up a dedicated HMP METAREP instance that will allow body-site experts and eventually other users to analyze the DAWG data in a user-friendly way via the web.