Posts by Samuel Payne

Recomb – Computational Proteomics

I recently attended the Recomb satellite conference on Computational Proteomics (downloads for talk and poster) in San Diego, CA.  It was a kind of homecoming for me.  I was a computational proteomics researcher at UCSD as a grad student with Vineet Bafna.  Many of my classmates were still there, as were lots of familiar faces and friends.  I joked with Marshall Bern that this was almost all of the people at ASMS that I wanted to hear, gathered in the same room, and it was only two days. Oh, and there was a beach.

Revisiting computational proteomics was a lot like hearing an old high school favorite on the radio.  You sing along with the chorus and then mumble through the verse.  But by the end, you remember it all again.  Having diversified my research interests while at JCVI, I was excited to hear new developments in proteomics, as well as get a refresher on some oldies-but-goodies.

I presented the progress on my proteogenomics research, introducing some comparative proteogenomics. I submitted a poster abstract, and was asked to give a “flash-presentation” during the Saturday session as well.  As the attendees were like-minded geeks, I presented my trouble spots as well as the highlights, hoping for advice and suggestions.  For highlights, we have the pipeline running smoothly on the JCVI infrastructure, and can analyze any LTQ dataset rapidly.  This produces a list of genomic loci needing re-annotation.  I have somewhere around 10 datasets done now, and they have all been informative. Each shows a significant amount of gene novelty. As we amass a more diverse set of results, we are starting to look at the annotations from a comparative or evolutionary standpoint.

For trouble spots, I noticed that larger datasets need more attention to downstream processing.  You can’t simply claim that all ‘novel’ peptides are new genes. First, we require high quality PSMs (p < 0.005) and then do a fairly detailed analysis of results.  I found that even with very strict PSM filters, there were a significant number of false-positive ORF identifications.  For this I have started to develop ORF level filters that look at the whole set of peptides in an ORF rather than at each identification individually.  You could, as Pavel suggested to me, go with perfect ‘zero FDR’ at the PSM level, but that tends to kill your spectrum identifications to the point where you see very few novel results (slide 6 from the talk).  I feel that a more sensitive approach is to introduce the orthogonal filter (ORF level).  I hope that later this summer I’ll be able to get out a Nature Methods paper about all the lessons learned.  Till then …
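
To make the two-stage idea concrete, here is a minimal sketch in Python of a PSM-level cutoff followed by an ORF-level filter. It is not the pipeline described above: the tuple layout, the function name filter_orfs, and the rule of requiring at least two distinct peptides per ORF are all assumptions chosen for illustration. Only the p < 0.005 cutoff comes from the post.

```python
from collections import defaultdict


def filter_orfs(psms, max_p=0.005, min_peptides=2):
    """Keep ORFs supported by several distinct, confident peptides.

    psms: iterable of (orf_id, peptide, p_value) tuples (hypothetical layout).
    min_peptides: assumed ORF-level rule, not the author's actual filter.
    """
    peptides_by_orf = defaultdict(set)

    # Stage 1: PSM-level filter. Drop any match that misses the p-value cutoff.
    for orf_id, peptide, p_value in psms:
        if p_value < max_p:
            peptides_by_orf[orf_id].add(peptide)

    # Stage 2: ORF-level filter. Judge the set of peptides, not each hit alone.
    return {
        orf_id: peptides
        for orf_id, peptides in peptides_by_orf.items()
        if len(peptides) >= min_peptides
    }


example_psms = [
    ("orf1", "PEPTIDEK", 0.001),
    ("orf1", "ANOTHERK", 0.002),
    ("orf2", "LONELYK", 0.0001),  # confident, but the only peptide in its ORF
    ("orf3", "WEAKK", 0.01),      # fails the PSM cutoff
    ("orf3", "WEAKERK", 0.02),    # fails the PSM cutoff
]
supported = filter_orfs(example_psms)
```

In this toy input only orf1 survives: orf2 has a single peptide, however confident, and orf3 has no PSMs that pass the first stage. The point of the second stage is that it can reject a spurious ORF without tightening the PSM cutoff for everything else.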

AGBT, Marco Island 2010

I just got back from AGBT in Marco Island, Florida and I am still in awe. As noted in the name, this conference highlights advances in both genome biology and technology.  The biology seemed to be very human genome centric.  Many of the talks presented full genome sequences of cancer genomes or familial cohorts.  Some of the numbers that people threw around were shocking. It was only a short time ago that Craig Venter came out with the first personal genome, and now sequencing centers like Washington University in St. Louis and the Broad are talking about sequencing hundreds of human genomes in a year.  I was really impressed by a talk from Wash-U where they sequenced 4 exomes for Miller’s syndrome – two normal parents and two affected children – and discovered the causative gene.  They were also very honest about their efforts not being as readily successful in a variety of other Mendelian autosomal recessive traits.

One of the more interesting fallouts from all this sequencing is that everyone is submitting to dbSNP; I think that soon nearly every base will have a SNP.  Unfortunately people are currently using dbSNP to filter out candidate SNPs for their study.  Their logic runs something like, “If it’s in dbSNP then it must be benign.”  You can easily see that with the sequencing of several hundred cancer genomes, this proposition will no longer hold.  dbSNP will hold a vast array of deleterious SNPs.  What it needs to transform into is not just a database of ‘Chr17:2345656 A->T’, but rather something that keeps track of frequency and phenotype. Also there was a discussion about false-discovery rates, and how to keep databases clean.  Some of these studies were going to submit 18 million new SNPs, with a 10% FDR. That’s nearly 2 million false positives (1.8 million, to be exact) 🙁
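
As a rough illustration of both points, here is a small Python sketch of a SNP record that carries frequency and phenotype alongside the bare coordinate, plus the false-positive arithmetic. Everything in it is hypothetical: the field names are not the dbSNP schema, and the 1% frequency threshold in looks_benign is an assumption, not a recommendation from the conference.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SnpRecord:
    # Hypothetical fields for illustration; this is not the dbSNP schema.
    chrom: str
    pos: int
    ref: str
    alt: str
    allele_frequency: Optional[float] = None  # fraction of chromosomes with alt
    phenotype: Optional[str] = None           # associated phenotype, if known


def looks_benign(snp, min_frequency=0.01):
    """Only treat a known SNP as benign if it is common and has no phenotype.

    Mere presence in the database is deliberately not enough.
    """
    if snp.phenotype is not None or snp.allele_frequency is None:
        return False
    return snp.allele_frequency >= min_frequency


def expected_false_positives(n_calls, fdr):
    """Expected number of false calls among n_calls at the given FDR."""
    return round(n_calls * fdr)


bare = SnpRecord("Chr17", 2345656, "A", "T")
common = SnpRecord("Chr17", 2345656, "A", "T", allele_frequency=0.2)
false_calls = expected_false_positives(18_000_000, 0.10)
```

Here the bare record, which is all that ‘Chr17:2345656 A->T’ tells you, is not treated as benign, while the same variant with a recorded 20% frequency and no phenotype is. For the submission numbers quoted above, false_calls comes out to 1,800,000.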

The technology portion of the conference was amazing.  There were new instruments being rolled out by numerous companies, and they were all promising. I loved sitting and listening to the clever new set-ups. There were several single molecule sequencers, which I am very excited about, because we can hopefully get past metagenomic pools and on to metagenomic genomes.

I was lucky to be chosen as a presenter in the Genome Informatics session on Thursday evening.  There were lots of talks about RNA-seq and getting more out of your sequencing data.  So I felt like a bit of an outsider not talking about sequencing.  My presentation was on proteogenomics, which uses proteomic data to improve genome annotation.  The major thrust for this audience is that genome annotation is far from perfect, and proteomics evidence can reveal many novel proteins in every genome that I’ve come across. I think that as biology expands past E. coli and B. subtilis into the vastness of genome diversity, we are seeing genes that look nothing like we’ve ever imagined.  Rather, as I point out in the talk, we are often NOT seeing these new genes because gene predictors fail to recognize them.  I’ve attached the slides to the post for your viewing pleasure (Payne.AGBT.2010).  Let me know what you think.

Sam