JCVI Viral Finishing Pipeline: a Winning Combination of Advanced Sequencing Technologies, Software Development and Automated Data Processing

JCVI viral projects are supported by the NIAID Genomic Sequencing Center for Infectious Disease (GSCID). The viral sequencing and finishing pipeline at JCVI combines next-generation sequencing technologies with automated data processing. This combination allowed us to complete over 1,800 viral genomes in the last 12 months, and almost 8,800 genomes since 2005.

Viral Projects at JCVI

JIRA Viral Sample Tracking Workflow

Our NextGen pipeline, which uses SISPA-generated libraries with Roche/454 and Illumina sequencing, enables us to complete a wide variety of viral genomes, including challenging samples. The automated assembly pipeline employs CLCbio command-line tools and JCVI cas2consed, a tool that converts assemblies from cas to ace format. Our complementary Sanger pipeline software is currently being integrated with the NextGen pipeline. This will improve our data processing and allow us to use our validation software (autoTasker) more efficiently.

Assembly of Repetitive Viral Genomes

Genome Organization of Varicella-Zoster

Assembly of Novel Viral Genomes

CLC Assembly Viewer Representation

Promoter of Bat Genome

During the past year we have found that novel viruses, repetitive genomes, and mixed infection samples could not be easily integrated with our high-throughput assembly pipeline. We have developed an assembly and finishing process that utilizes components of the high-throughput pipeline and combines them with manual reference selection and editing. Using this approach we completed novel adenovirus genomes and mixed-infection avian influenza genomes, and improved assemblies of previously unknown arbovirus genomes. We are currently working on optimizing and automating this new pipeline.

Assembly of Mixed Viral Genomes

Consed Representation of Mixed Viral Sample

Repetitive genomes have long been known to present great challenges during assembly and finishing. We present a new approach to assembling and finishing the repetitive varicella genome: the genome is divided into overlapping PCR amplicons, which are sequenced separately and then merged during assembly.
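
As a rough illustration of the merging step only (this is not the JCVI pipeline code), the Python sketch below joins amplicon consensus sequences that are already listed in genome order. The function names, the toy sequences, and the exact-match overlap rule are all assumptions made for the example; real amplicon data would need error-tolerant alignment.

```python
def longest_overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_amplicons(amplicons: list[str], min_len: int = 3) -> str:
    """Merge amplicon sequences given in genome order (hypothetical helper)."""
    merged = amplicons[0]
    for nxt in amplicons[1:]:
        size = longest_overlap(merged, nxt, min_len)
        if size == 0:
            raise ValueError("adjacent amplicons do not overlap")
        # Append only the part of the next amplicon that extends the genome.
        merged += nxt[size:]
    return merged


# Toy sequences, invented for illustration.
amplicons = ["ACGTTGCA", "TGCAGGCT", "GCTTAACG"]
genome = merge_amplicons(amplicons)
print(genome)
```

Because the amplicon order is known from the PCR design, each join only has to consider one neighbor, which is what keeps repeats elsewhere in the genome from confusing the merge.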

To streamline our viral pipelines, we have fully integrated them with JCVI’s LIMS and JIRA Workflow Management to create a semi-automated tracking interface that follows the progress of viral samples from acquisition through to NCBI submission. This allows us to process a large volume of samples with limited manual interaction and, at the same time, gives us flexibility to work on challenging and novel genomes.

Acknowledgements

The JCVI Viral Genomics Group is supported by federal funds from the National Institute of Allergy and Infectious Diseases, the National Institutes of Health, and the Department of Health and Human Services under contract no. HHSN272200900007C.

The bat coronavirus project is a collaboration with Kathryn Holmes and Sam Dominguez, University of Colorado Medical Center.

The authors would like to thank members of the Viral Genomics and Informatics group at JCVI.

References

Viral genome sequencing by random priming methods. Djikeng A, Halpin R, Kuzmickas R, Depasse J, Feldblyum J, Sengamalay N, Afonso C, Zhang X, Anderson NG, Ghedin E, Spiro DJ. BMC Genomics. 2008 Jan 7;9:5.

A virus discovery method incorporating DNase treatment and its application to the identification of two bovine parvovirus species. Allander T, Emerson SU, Engle RE, Purcell RH, Bukh J.

Note

This post is based on a poster by Nadia Fedorova, Danny Katzel, Tim Stockwell, Peter Edworthy, Rebecca Halpin, and David E. Wentworth.

Summit on Systems Biology, June 15-17, 2011

I attended the Summit on Systems Biology, hosted by Virginia Commonwealth University in Richmond, VA, June 15-17.  So, judging from the talks given, what is systems biology?

  • Systems biology is non-linear and/or multi-step.  Heavy math does not make something systems biology if it’s directly solvable.  Taking a big gene expression matrix, running principal component analysis on it, and coming up with a linear equation for the contributions of a list of biomarker genes is not systems biology.  The same microarray expression experiment is systems biology when it is coupled with pathway analysis to reduce the number of candidate genes, which permits a less stringent multiple-hypothesis-testing correction and therefore fewer false negatives.  So is a non-linear model of how just a few genes interact over time.
  • Standard bioinformatic analysis seeks correlations.  Systems biology goes beyond that to seek cause and effect.  Thus, most systems biology work involves time series, and sometimes simulation.
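
To make the multiple-testing point concrete, here is a small Python sketch.  It assumes a Bonferroni correction (the talks did not necessarily use that method), and the gene names, p-values, and gene counts are all made up for illustration.  Shrinking the candidate list from 20,000 genes to 200 raises the per-gene cutoff from 2.5e-6 to 2.5e-4, so a gene with p = 1e-4 is no longer a false negative.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value cutoff under a Bonferroni correction."""
    return alpha / n_tests


def significant(p_values: dict[str, float], alpha: float = 0.05) -> list[str]:
    """Genes whose p-value survives correction for len(p_values) tests."""
    cutoff = bonferroni_threshold(alpha, len(p_values))
    return sorted(gene for gene, p in p_values.items() if p <= cutoff)


# Hypothetical data: 20,000 genes, two of which carry a real signal.
all_genes = {f"gene{i}": 0.5 for i in range(20000)}
all_genes["gene0"] = 1e-7
all_genes["gene1"] = 1e-4

# Hypothetical pathway analysis keeps only 200 candidate genes.
pathway_genes = {name: all_genes[name] for name in list(all_genes)[:200]}

genome_wide_hits = significant(all_genes)   # cutoff 0.05 / 20000
pathway_hits = significant(pathway_genes)   # cutoff 0.05 / 200
print(genome_wide_hits, pathway_hits)
```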

What data and techniques do systems biologists use?

  • Large datasets of all types.  Microarray time-series, genomes, SNPs, protein-protein interactions, automated protein annotation – anything that comes in gigabytes instead of kilobytes.
  • There was marked interest in protein-protein interaction networks, and in micro RNAs (which inhibit translation of multiple target mRNAs).
  • There were several papers using reverse-phase protein microarrays.  RPMAs can distinguish phosphorylated (which usually means active) from unphosphorylated proteins, which helps understand protein interaction dynamics.
  • There were several papers using weighted gene co-expression network analysis.  WGCNA analyzes modules of co-expressed genes, rather than individual genes.  This gives more statistical power from sparse data.  Brian Sayre of VSU identified disease-resistance genes in livestock and crop species using single-nucleotide polymorphisms (SNPs) from related species.  We might know about some goats that are resistant to a disease that also affects sheep; but sheep don’t have the same SNPs as goats.  His group categorized the SNPs into genes, and the genes into pathways common across species, then looked for pathways associated with disease resistance in other species, and hypothesized that the same pathways would be involved in disease resistance in the target species.
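
The module idea behind WGCNA can be sketched in a few lines of Python.  This is a simplified stand-in, not the WGCNA package: it keeps the soft-threshold adjacency |r|^beta but replaces WGCNA's topological overlap and hierarchical clustering with a plain cutoff and connected components.  The expression values, `beta`, and `cutoff` are assumptions chosen for the example.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def adjacency(expr: dict[str, list[float]], beta: int = 6) -> dict:
    """Soft-threshold adjacency |r|**beta for every gene pair."""
    genes = list(expr)
    return {
        (g, h): abs(pearson(expr[g], expr[h])) ** beta
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
    }


def modules(expr: dict[str, list[float]], beta: int = 6, cutoff: float = 0.5):
    """Group genes linked by adjacency >= cutoff (connected components)."""
    parent = {g: g for g in expr}

    def find(g: str) -> str:
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path compression
            g = parent[g]
        return g

    for (g, h), weight in adjacency(expr, beta).items():
        if weight >= cutoff:
            parent[find(g)] = find(h)

    groups: dict[str, list[str]] = {}
    for g in expr:
        groups.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in groups.values())


# Invented expression profiles over five samples.
expr = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 6, 8, 10.5],
    "g3": [2, 5, 1, 4, 3],
    "g4": [4, 10, 2, 8, 6.5],
}
found = modules(expr)
print(found)
```

Testing two modules instead of four genes is where the extra statistical power comes from: fewer hypotheses, each backed by several correlated measurements.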

What do people do with systems biology?

  • Medical applications predominated.  The main areas of interest were cancer, aging, cell simulation, eukaryotic model organisms, genome-wide association studies, pathway analysis, and immunology.
  • There were no talks about industrial applications or synthetic biology.
  • There were no talks on prokaryotes, except one on host-pathogen interactions.  This struck me as odd, since eukaryotes are more difficult to analyze or simulate than prokaryotes, and we haven’t done these things with prokaryotes yet.
  • There were no talks on metagenomics.  This also struck me as odd; bacterial communities seem like a natural systems biology problem.

What does the future hold for systems biology?

  • Omniomics:  We don’t want just a protein’s sequence – we want to know where and when it is expressed, what regulates it, what it interacts with, and what parameters describe those interactions. Soon, annotating a genome will not mean producing a list of genes and their functions – it will mean producing a simulation.
  • We need to learn to think at a higher level of abstraction.  If you have tens or hundreds of thousands of genes, transcripts, proteins, small molecules, and structures interacting, you need to figure out what it is you’re really interested in (e.g., “How did this cancer bypass the G1 cell-cycle restriction checkpoint?”), how to specify that precisely enough to ask the computer for an answer, and not to insist on understanding all the details if the answer checks out.
  • There is a growing gap between research and practice.  We can make more and more detailed analyses of diseases, especially in cancer, where each patient has a unique disease at the genetic level.  Meanwhile, the FDA approval process is so long and expensive that even in diseases (for example, Alzheimer’s and FTLD) for which there are millions of patients and a handful of known causes, pharmaceutical companies don’t try to develop three to four separate therapies for those three to four causes.  And the gap is growing wider:  Even as we are coming up with ways to combine weak information from across an entire genome, the FDA is considering proposals to regulate genomic sequencing that would forbid doctors from acquiring a full sequence.