Monthly Archive for November, 2010

Starting the Atlantic Crossing

Wednesday, November 17th, 2010

On November 10th, Sorcerer II set sail from Valencia, Spain, to begin the voyage back to America. The first leg was a three-day sail down the Spanish coast to Gibraltar.

Coastline to Gibraltar

Valencia Coastline

John showing the delivery crew around Sorcerer II

We spent one night in Gibraltar to get fuel and supplies. The next day we took a very important sample on the Mediterranean Sea side of the Straits of Gibraltar. We collected a surface sample, which should be the lower-salinity Atlantic water coming into the Mediterranean Sea. At the same location we collected a deeper sample, which is the saltier Mediterranean water flowing along the bottom into the Atlantic Ocean.

CTD cast from Med. Sea side of the Straits of Gibraltar. Salinity increased from 36 to 38 PSU

After we collected our last Mediterranean Sea sample, we sailed through the Straits of Gibraltar into the Atlantic Ocean and started our way to the Canary Islands.

Gibraltar

Sailing through the Straits of Gibraltar at sunset

Moroccan Coast

Boat traffic on Med. Sea Side of the Straits

Scientist Spotlight: Meet Vanessa Hayes

Geneticist Vanessa Hayes does not think small or move slowly. From completing her postdoc in six months (the US national average is 3 to 7 years) to completing the first South African Genome Project in 2010, with her goal set on defining the extent of human diversity in all populations, she is on a mission. Just 11 years past her postdoc, she has the credentials of someone who has been in science much longer. Her work and talent have taken her to remote regions of Southern Africa, all over Australia, Europe, and the U.S., and now to the J. Craig Venter Institute with her appointment as Professor of Genomic Medicine at the San Diego facility.

Of Cartoons and Men…

Born and raised in South Africa, Vanessa first headed a laboratory near Cape Town investigating genetic susceptibility to HIV/AIDS after earning a Ph.D. in Medical Genetics in 1999 at the University of Groningen, Netherlands. After three years at the University of Stellenbosch she moved to Sydney, Australia to become group leader of Cancer Genetics, first at the Garvan Institute of Medical Research and later at the Children’s Cancer Institute of Australia. During those years she began two major cancer research projects that continue today. One is a study assessing how ethnicity affects prostate cancer risk and outcomes by genetically profiling men with and without prostate cancer from different ethnic groups and geographical locations (initially South Africa and Australia). “I believe in going to the extremes of phenotypic diversity to understand genotype. For example, the clinical disparities of prostate cancer in African compared to non-African populations have not been adequately explored,” she said, explaining that the genetics of ethnic diversity is one of her main research interests. “We don’t always have clear clinical definitions to describe phenotype, but genomics can help to define disease,” she added.

This cancer research then led her to what might seem like an unlikely suspect: the Tasmanian devil. The inspiration for a much beloved Looney Tunes character, the devil is the largest carnivorous marsupial indigenous to Australia, and Vanessa became acquainted with it upon learning that it was a good model for human cancers. She partnered with Stephan Schuster of Pennsylvania State University to sequence the animal using next-generation (next-gen) sequencing, in turn establishing what was then the first next-gen sequencing research laboratory in Australia. By establishing a Tasmanian devil genome, she and her team were able to define the extent of dwindling genetic diversity within the devil population caused by an unusual infectious facial cancer. The hope is that this information and the tools developed will be used for the insurance breeding program, which Australian authorities have established to save this iconic species from inevitable extinction within the next decade.

Putting Africa on the Genetic Map

In early 2010 Vanessa embarked on another collaborative effort with Schuster’s lab, this one to help get African populations represented in genetic databases so they can reap the benefits of human genomics research. The initiation of the South African Genome Project was a key step in helping to define the extent of human variation and its relevance to assessing disease risk and response to various medicines. The effort was conceived out of Vanessa’s frustration in earlier studies with African populations, when she found a complete lack of African reference genomes and susceptibility gene array profiles in existing databases. Africa, believed to be the birthplace of mankind and home to its oldest populations, offers much greater diversity than is found in individuals of European descent. Another issue was that the little African genetic data represented in early 2010 was based on a single population, the Yoruba people from Nigeria. Since the Yoruba are clearly not representative of the majority of the more than 500 different linguistic groups in central to southern Africa, Vanessa was determined to change the face of European-driven genomic research.

Vanessa and a Bushman lady from the Southern African Kalahari desert in deep discussions about what we can read in the blood (aka genomics). This lady is one of only a few click-speaking hunter-gatherer peoples left who represent an ancient line for all modern humans. (photo credit: Chris Bennett - www.chrisbennettphoto.com)

Ingenuity and perseverance led Vanessa to knock on the door of Nobel Peace Prize recipient Archbishop Desmond Tutu. He was, she knew, a critical step toward gaining access to a potential treasure trove of South African genetic data. She made her case directly to the Archbishop in front of a room of advisors who told him not to participate in a genetic study. Much to her surprise, however, Tutu agreed to be the first South African to have his genome sequenced. Vanessa believes he did so, against the advice of his advisors, because he knew the importance of this type of research to the people of his country. The Archbishop’s participation was both critical and significant: he not only represents the Bantu linguistic group, to which 80% of the South African population belongs, but is also a survivor of TB, polio, and prostate cancer. The researchers were able to correlate his genetic markers (genotype) potentially associated with disease susceptibility with his family and medical history (phenotype), providing valuable information about the Bantu people. Vanessa and her team also sequenced one complete genome and three exomes (protein-coding genes only) from four individuals representing diverse ethnic groups of what are known as the Kalahari Bushmen. Bushmen (or San) is the term for the click-speaking hunter-gatherers who inhabit the Kalahari Desert, which spans parts of Botswana, Namibia, and Angola. Her studies, published in Nature in 2010, showed that two different linguistic groups of Kalahari Bushmen were as genetically divergent from each other as Europeans and Asians. Some found this result surprising; however, the extent of the diversity should not be unexpected considering these Bushmen represent the oldest living lineage of modern humans.

By this point in 2010 Vanessa had decided she had reached the technological limits of her human genomics research in her position in Australia. She was searching for a place to expand her capabilities, particularly in next-gen sequencing and bioinformatics. Last spring she was interviewing in Melbourne at the Walter and Eliza Hall Institute of Medical Research, where Dr. Craig Venter happened to be giving a keynote lecture. The JCVI was not on her radar at the time, as she had several job offers within and outside Australia, but Craig was able to convince her to come work with him and the team at JCVI.

Sleep Is Overrated

The sequencing of Archbishop Tutu was only a start to Vanessa’s plans in human genomics research. She is continuing to expand her work with indigenous groups in Africa. Much like the aspirations (and accomplishments) of her new boss, she claims a ‘modest’ goal: “To define the extent of human diversity that exists globally so we can have a true picture of the variation that human genomes have, and to help make sense of that variation by linking genotype to phenotype.” Phenotype can mean not only disease conditions (associated with genes) but also evolved behaviors. For example, how the Bushmen are able to go a week without water in the desert climate is a phenotype that may be encoded in their genes. Understanding the genetic basis for disease and behavior in different populations will certainly be a challenge, but clearly Vanessa is a person who thrives when presented with challenges.

Vanessa’s limited spare time revolves around her family, including two children, each born on a different continent, who keep her busy with their latest goal: to teach mom how to surf! A keen soccer player in Australia, she has turned to a new adventure since her move to San Diego: kickboxing. She says she doesn’t get much sleep, particularly little in the past three years, but at least now she’s working mostly on U.S. time rather than across two opposite time zones.

Could she imagine another career? “It is hard to think of another career, as I am doing exactly what I love: combining my passion for the rich diversity of people from Southern Africa (and globally), from whom we have so much to learn, with the speed and dynamics of everyday life in 21st century science. What better place to combine these two worlds than here at JCVI.” Through her new position, Vanessa hopes to understand, and to educate others about, the breadth of human genetic diversity that exists in populations worldwide.

Lucene Revolution Conference 2010

I arrived late in Boston after my plane from Washington DC was delayed. On the agenda for the next four days: the Lucene Revolution conference and a Solr application development workshop organized by Lucid Imagination. The conference promised a unique venue (the first of its kind in the US) to meet developers who all share the same challenge: to enable users to find relevant information in growing bodies of data quickly and intuitively. I was looking forward to hearing many interesting talks given by experts in the field, to learning how to build intuitive search interfaces, and to getting an idea of where things are heading in the coming years. As the developer of JCVI’s Metagenomics Reports (METAREP), I was especially looking forward to the Solr workshop to learn some tricks from the experts to tweak the search engine behind this open-source metagenomics analysis tool.

The Early Revolution

But before the revolution could happen and I could enjoy some splendid time at the Washington Dulles airport, Doug Cutting had to start developing a Java-based full-text search engine called Lucene in 1997. Lucene became an open-source project in 2000 and an Apache Software Foundation project one year later. In 2004, Solr emerged as an internal CNET project created by Yonik Seeley to serve Lucene-powered search results to the company’s website. CNET donated it to the Apache Software Foundation in 2006.

Google Trend for Solr

Early this year, the two projects merged, and development has since been carried out jointly under the umbrella of the Apache Software Foundation. Meanwhile, many companies use Solr/Lucene, among them IBM, LinkedIn, Twitter, and Netflix. How did this happen?

The Lucid Imagination Solr Application Development Workshop

In search of an answer, I made my way from my hotel to the conference venue, the Hyatt Hotel located on the beautiful Boston harbor. The two-day workshop was a whirlwind tour of Solr features, configuration, and optimization. It also touched on the mathematical theory behind Lucene’s search result scoring and on evaluating result relevance. The workshop covered enough material to warrant a third day; given this optimistic agenda, there was not much time for the labs (exercises), and the trainer had to focus more on breadth than on depth. Having used Solr for a year, I was already familiar with many of the general concepts and was more interested in the details. A comprehensive handbook and an excellent exercise compilation came to the rescue and provided me with the detail I needed to follow up on subjects that were only touched on. There were two parallel Solr classes; in mine, 25 participants followed the training. The mix included developers working for media, defense, and other corporations, while academia was represented by several libraries and universities.

Solr Application Development Workshop

A powerful feature I had not heard of before is the DisMax request handler. It abstracts away complex query syntax: users can enter a simple query without any special operators or a search field, and behind the scenes the handler does its magic, searching across a set of configured fields which (among other things) can be weighted by importance. Additional information about this handler and other snippets I collected during the class can be found in my Solr workshop notes.
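To make that concrete, here is a minimal sketch of a DisMax query issued through Solr’s Java client library, SolrJ. The Solr URL and the field names (title, abstract, body) with their boosts are illustrative assumptions, not values from the workshop:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class DisMaxExample {
        public static void main(String[] args) throws Exception {
            // Assumed local Solr instance; adjust the URL for your setup.
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            SolrQuery query = new SolrQuery();
            query.setQuery("coral reef");               // plain user input, no field syntax
            query.set("defType", "dismax");             // switch to the DisMax query parser
            query.set("qf", "title^3 abstract^2 body"); // hypothetical fields, weighted by importance
            query.set("mm", "2");                       // require at least two terms to match

            QueryResponse response = server.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("id"));
            }
        }
    }

The qf parameter is where the weighting happens: a match in title counts three times as much as a match in body, so relevance tuning becomes a configuration exercise rather than a query-rewriting one.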

The Lucene Revolution Conference

After a mediocre coffee brewed in my hotel room, I headed to the conference venue on the second floor of the Hyatt Hotel. The first day of the conference started with a podium discussion on the cutting edge of search, featuring Michael Busch (Twitter), John Wang (LinkedIn), Joshua Tuberville (eHarmony), and Bill Press (Salesforce.com). The discussion went back and forth, showcasing each search platform and the experience of developing it. When asked what he would do differently in retrospect, John Wang from LinkedIn wryly said that he would “ban recruiters”; if I remember correctly, he said they “spam up” the system.

Lightning Talk “Using Solr/Lucene for High-Performance Comparative Metagenomics”

Joshua Tuberville from eHarmony offered valuable advice to developers: “Avoid pet queries for benchmarking a system – use a random set of queries instead.” He also suggested tracking the queries that web site users actually enter and using them for optimization, adding, “it surprises me every day that the world is not made up from engineers, but it is a fact.” His other advice: avoid unnecessary complexity and duplicated effort, and use open source where available. For example, instead of implementing their own Lucene wrapper, eHarmony made use of the open-source project Solr. Bill Press added, “Do not be afraid to tear things down, rebuild it many times if needed.”

“Companies do not have time to debug code.” Eric Gries (CEO Lucid Imagination)

Eric Gries, CEO of Lucid Imagination, presented ‘The Search Revolution: How Lucene & Solr Are Changing the World’. In the introduction, he pointed out that Solr/Lucene is the 10th largest community project and the 5th largest Apache Software Foundation project. “Open-source projects need a commercial entity behind them to help them grow,” he argued, and “companies need no errors, they do not have time to debug.” The main part of his talk focused on his company’s LucidWorks Enterprise software, which is based on the open-source Solr/Lucene project. Features that separate it from the open-source version include smart defaults, additional data sources, a REST API that allows programmatic access from Perl/Python/PHP code, standardized error messages, and click-based relevance boosting. Later, Brian Pinkerton, also from Lucid Imagination, presented additional details. He revealed that their software is based on elements of the upcoming Solr 4.0 version and is fully cloud enabled (it includes the SolrCloud patch). It uses ZooKeeper to manage node configuration and failover, all website communication is done in JSON, and the enterprise version supports field collapsing for distributed search.

“A picture communicates a thousand words but a video communicates a thousand pictures.” Satish Gannu (Cisco)

Satish Gannu from Cisco stressed the increasing prevalence of video data and how such data is changing the world. More and more video-enabled devices are being pushed onto the market, collaboration is increasingly done across the world, meetings are recorded and shared globally, videos are replacing manuals, and corporate communication/PR via video is on the rise. He related the popularity of video to the fact that “a picture communicates a thousand words, but a video communicates a thousand pictures” and that “60% of human communication is non-verbal.” Satish went on to highlight Cisco’s video solutions, which use automatic voice and face recognition software to store metadata about speakers and enrich the user experience. For example, users can filter out certain speakers when watching recorded meetings. More can be found here.

View of Boston

“Mobile application development will be the driver of open-source innovation.” Bill McQuaide (Black Duck Software)

One of the highlights that morning was Bill McQuaide’s talk on open-source trends. Based on diverse sources, including his company Black Duck Software, he showed that IT software spending is down, that 22% of software is open source, and that 40% of software projects use open source. There is an enormous number of new open-source projects targeting the cloud, with a lot of competition. Among the top open-source licenses are the GNU General Public Licenses, including GPL 3.0, and the BSD licenses. The three predominant programming languages used by open-source developers are C, C++, and Java. Mobile development will be the driver of innovation in the open-source community, especially development around Google’s Android operating system. Managing licenses for projects such as Android that integrate dozens of open-source components, and shipping the bundled software to customers, can become very complex. For this and other reasons, McQuaide recommends that companies and institutions have policies for implementing open source, integrating third-party tools, and identifying and cataloging all open-source software used.

Distributed Solr/Lucene using Hadoop

An excellent talk was presented by Rod Cope from OpenLogic: Real-Time Searching of Big Data with Solr and Hadoop. Their search infrastructure is centered on Hadoop’s distributed file system, on top of which they cleverly arranged several other technologies. For example, the Hadoop-based HBase database provides fast record lookups but lacks the power of Lucene text searches, while Solr/Lucene is not as well optimized for returning stored document content. Their solution is to use Solr/Lucene to search the indexed text fields while storing and returning only the document ID; the returned document ID is then used to fetch the full record from HBase. OpenLogic uses the open-source software Katta to integrate Lucene indices with Hadoop and increases fault tolerance by replicating Solr cores across different machines. Corresponding master and slave servers are also set up on different machines for indexing and searching, respectively. The setup he described runs entirely on commodity hardware, and new machines can be added on the fly to scale out horizontally.
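Here is a minimal sketch of that search-then-fetch pattern, again using SolrJ plus the HBase Java client. The index contents, the “documents” table, and the “content:body” column are hypothetical, and the exact HBase API varies by version:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrDocument;

    public class SearchThenFetch {
        public static void main(String[] args) throws Exception {
            // Step 1: full-text search in Solr, asking only for document IDs.
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery query = new SolrQuery("apache license");
            query.setFields("id"); // skip stored content; keep the index lean
            List<String> ids = new ArrayList<String>();
            for (SolrDocument doc : solr.query(query).getResults()) {
                ids.add((String) doc.getFieldValue("id"));
            }

            // Step 2: use each ID as an HBase row key to fetch the full record.
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "documents");
            for (String id : ids) {
                Result row = table.get(new Get(Bytes.toBytes(id)));
                byte[] body = row.getValue(Bytes.toBytes("content"), Bytes.toBytes("body"));
                System.out.println(id + ": " + Bytes.toString(body));
            }
            table.close();
        }
    }

The division of labor is the point: Lucene’s inverted index answers “which documents match?”, while HBase’s key-value lookups answer “what do those documents contain?”, each doing only what it is fast at.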

“It surprises me every day that the world is not made up from engineers but it is a fact.” Joshua Tuberville (eHarmony)

Next on the agenda were seven-minute lightning talks. I opened the lightning-talk session by describing our Solr/Lucene-based open-source web project METAREP for high-performance comparative metagenomics (watch). Next was Stefan Olafsson from TwigKit presenting ‘The 7-minute Search UI’, a presentation I thought was another gem of the conference. In contrast to other talks, it focused on user experience and intuitive user interfaces. TwigKit has developed a framework that provides well-designed search widgets that can be integrated with several search engines.

“If nobody is against you in open source then you are not right.” Marten Mickos (CEO Eucalyptus)

The keynote presentation on the second day was given by Marten Mickos, the CEO of Eucalyptus and former CEO of MySQL. He opened by advocating his philosophy of making money from open-source projects. “Innovation is a change that creates a new dimension in performance,” he said, citing the open-source Apache web server, which allows anyone to run a powerful web server. He added, “Market disruption is a change that creates a new level of efficiency,” and referred to MySQL, which was originally designed to scale horizontally. While in 1995 such a design was a drawback compared to other market solutions, scale-out has become the dominant design today; within the cloud, horizontal scaling is key, a fact that has made MySQL the most used database in the cloud.

He observed that “while most successful open-source projects are related to building infrastructure software, servers and algorithms, there are only a few open-source projects centered around human behavior, user experience and user interfaces. The latter projects are mainly developed in closed-source environments.” He then went on to praise open source as a driver of innovation: “Open source is so effective because you are not protected. Code can be scrutinized by everybody. In a closed-source company, your only competition is within the company, while in open source you compete with everybody.” Open source is a way to innovate, and it is more productive. It usually takes a stubborn individual to drive things; innovation mostly stems from single individuals who are supported by the community.

When asked how to maintain property rights as a company when running an open-source model, he responded, “keep things that keep the business going proprietary but open up others. The key is to be very transparent with your model.”

What’s next?

In a podium session, the core Solr/Lucene committer team discussed future features. The team is working on rapid front-end prototyping using the Apache Velocity template engine and Ajax; the prototyping code can be found in the current trunk of the Solr/Lucene code repository under the /browse directory. A cloud-enabled Solr/Lucene version is being developed, and Twitter’s real-time search functionality will be integrated. Other open-source projects being integrated are Nutch, a web-search package, and Mahout, a machine-learning library (http://mahout.apache.org). New features will include pivot tables (table matrices), a hierarchical data type, spatial searching, and flexible indexing.

The above represents only a subset of the talks that took place. There were many other interesting talks, some of them in parallel sessions. Individual presentations can be downloaded from the Lucid Imagination conference page, and a selection of videos is available here. The next Lucene Revolution conference will take place in San Francisco in May 2011.

After four days of Solr/Lucene, many coffees, talks, and discussions, I left inspired by the conference. It dawned on me that the real revolution is not the search technology but the strong community spirit that has emerged around it, driving developers to work jointly toward a common goal.

French Road Sampling Trip Saves Sorcerer II From More Rough Weather!

September 28th 2010

With one last sample to collect and the weather still rough in the Mediterranean, we decided to make the Banyuls sample a road sampling trip. So Jeremy and I loaded up a rental car with carboys and headed out at 5 am to drive the 125 miles (200 km) from Barcelona, Spain to Banyuls, France.

Driving to Banyuls

After being on the boat for a few months straight, the two-hour drive was a welcome adventure, even with the 5 am departure! We were greeted at the Observatory of Banyuls by Dr. Ian Salter, who showed us around the laboratory and down to the dock to their research vessel.

View of Harbor from Lab

Old facilities

Ian and me on the research boat

They have a station less than a quarter of a mile offshore, which they have been monitoring for many years. We motored out to the site, where they did a water column profile with a CTD very similar to the one we have on Sorcerer II, and then we collected our water from a few meters deep with a Niskin bottle. Once we got back to the dock we loaded the carboys into the car and drove back to Sorcerer II to process the sample. It is always good to collect samples with collaborators who have long-term monitoring sites and are interested in working with the Venter Institute to analyze the data!

Filling our carboys

Ian and me as we motor to sample site

Jeremy, me, and Ian