Apache Hadoop is a software framework for large-scale data storage and analytics. It is built on the principles Google uses for handling web-scale data: to build an index of the Internet, the search giant needs to store and process petabytes of data, and to stay ahead of the game it needs to turn its scientists' ideas into running experiments quickly. SARA has started a half-year project to evaluate Hadoop together with Dutch scientists; if you are a Dutch scientist, you are very welcome to join us and do your science on our cluster! In this post I describe some properties of the project and give some examples of the work that is currently being done on Apache Hadoop at SARA.
Apache Hadoop consists of a distributed file system, the Hadoop DFS (HDFS), and a framework for parallel processing of large datasets, MapReduce. The former enables redundant, fail-safe storage of large datasets; the latter provides a very simple programming model for parallel data processing. See this presentation (which I used to introduce scientific programmers within a large BioInformatics project, BioAssist, to Hadoop) for a more extensive overview, and this document (which I wrote for our first Hadoop hackathon last December 7th) for an introduction to the MapReduce programming API. A number of additions have been developed by the community to make storage and data analytics even easier. Apache Pig gives users the power of Pig Latin, a simple but very powerful language that looks a lot like SQL. Apache HBase provides an implementation of Google’s BigTable, offering database-like storage with low-latency access. Yahoo!’s Oozie provides a workflow system to chain MapReduce jobs together. And then there’s Chukwa, Flume, Hama, Hive, Hue, Mahout, and many others. In short, a much-hyped technology that is in production at many data-oriented market leaders and used by a growing number of scientists around the world.
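To give a flavour of the MapReduce model, here is the classic word count example written as two small Python scripts for Hadoop Streaming, the same mechanism as the shell scripts mentioned further down this post. This is a generic illustration rather than code from any of the pilot projects, and the input and output paths in the comments are made up.

```python
#!/usr/bin/env python
# mapper.py - read raw text from stdin, emit "word<TAB>1" for every word.
#
# Submitted with the streaming jar that ships with Hadoop, roughly:
#   hadoop jar hadoop-streaming*.jar -input <hdfs input> -output <hdfs output> \
#       -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py
# (the exact jar location depends on the installation)
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t1" % word)
```

```python
#!/usr/bin/env python
# reducer.py - Hadoop sorts the mapper output by key, so all counts for one
# word arrive consecutively; sum them and print one total per word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```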
Our evaluation project aims to see how Hadoop would fit in SARA’s mission as a provider of compute and storage infrastructure for scientists in national and international projects. Our approach is simple but powerful: we equip a (relatively small) cluster of machines with Hadoop and invite scientists to use the system to answer their data-related questions.
Some properties of our prototype cluster:
- 20 cores for MapReduce
- 55 TB of net storage for HDFS
- An SFTP interface to HDFS
- Cloudera’s Distribution of Hadoop (CDH3b3)
- Apache Pig 0.8
At the moment we have a number of scientists from different universities and disciplines piloting our system. Their backgrounds are diverse and include people from the worlds of Information Retrieval, BioInformatics and Data Mining. We have people analyzing Twitter tweets and DBPedia, sensor data collected on the “Hollandse Brug” (a bridge in The Netherlands), and genome datasets. And we have already seen some very interesting results!
Below I have included some short descriptions of work that is currently being done on our cluster. If you are a scientist, and you are interested in participating in our pilot project by using the system for your work, drop me a mail at evert [dot] lammerts [at] sara [dot] nl.
Edgar is using Hadoop for fast turnover of experiments on large textual corpora. At the moment he is looking at a dump of 417 million Twitter tweets, totalling 600 GB of raw text.
Upcoming experiments include:
(i) A comparison of “new” media, like Twitter, with more traditional media, like papers (see the sketch below this list)
(ii) Several analyses of the ClueWeb09 dataset, a web crawl consisting of around 500 million webpages, totalling around 13.4 TB of raw text.
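As an impression of what such a media comparison might look like on Hadoop, here is a hypothetical Streaming sketch that counts how often each term occurs per corpus, so relative frequencies in the two sources can be compared afterwards. The input paths, the way the source is detected, and the environment variable name are assumptions for illustration, not Edgar’s actual setup.

```python
#!/usr/bin/env python
# compare_mapper.py - tag every token with the corpus it came from.
# Hadoop Streaming exposes the current input file through an environment
# variable (map_input_file on Hadoop 0.20-era versions; the name varies);
# the "twitter"/"news" corpus names below are made up for this sketch.
import os
import re
import sys

source = "twitter" if "twitter" in os.environ.get("map_input_file", "") else "news"
token = re.compile(r"[a-z]+")

for line in sys.stdin:
    for word in token.findall(line.lower()):
        print("%s\t%s\t1" % (word, source))
```

```python
#!/usr/bin/env python
# compare_reducer.py - per word, sum the counts per corpus and print both.
import sys

def flush(word, counts):
    print("%s\ttwitter=%d\tnews=%d" % (word,
                                       counts.get("twitter", 0),
                                       counts.get("news", 0)))

current, counts = None, {}
for line in sys.stdin:
    word, source, count = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            flush(current, counts)
        current, counts = word, {}
    counts[source] = counts.get(source, 0) + int(count)
if current is not None:
    flush(current, counts)
```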
Online reputation management consists of monitoring media, detecting relevant content, and analyzing what people say about an entity (an organization, product, brand, etc.). A bottleneck for reputation management experts is the ambiguity of entity names; this is amplified for organizations with names that are common nouns, for example “Apple”, “Amazon”, etc. To keep the task of monitoring user-generated content manageable, Krisztian is developing a method for filtering out spurious name matches from a stream of tweets. He is analyzing a large collection of historical data from Twitter (~400 million tweets) to observe patterns and changes in language usage and to identify other possibly discriminative features. These features will then be used to train a classifier that can decide for each new incoming message whether it refers to a given company.
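A very rough idea of the filtering step, as a hypothetical Hadoop Streaming mapper: keep only tweets that mention the entity name and emit a few simple features per matching tweet. The JSON tweet layout, the entity “apple”, and the feature set are all made up for illustration; they are not Krisztian’s features or classifier.

```python
#!/usr/bin/env python
# filter_mapper.py - keep tweets that mention the (ambiguous) entity name and
# emit a few simple features per matching tweet. Such a filtering pass needs
# no aggregation, so it can run with zero reduce tasks (-numReduceTasks 0).
import json
import sys

ENTITY = "apple"
CONTEXT = set(["iphone", "ipad", "mac", "ios", "store"])  # hand-picked cue words

for line in sys.stdin:
    try:
        tweet = json.loads(line)          # assumes one JSON tweet per line
    except ValueError:
        continue
    text = tweet.get("text", "").lower()
    if ENTITY not in text:
        continue
    words = set(text.split())
    features = {
        "has_context_word": int(bool(words & CONTEXT)),
        "has_hashtag": int(("#" + ENTITY) in text),
        "n_words": len(words),
    }
    print("%s\t%s" % (tweet.get("id", ""), json.dumps(features)))
```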
Joaquin was one of the two winners of a copy of “Hadoop: The Definitive Guide” during the SARA hackathon. He works on the InfraWatch project, which analyzes the data of 145 sensors embedded in the “Hollandse Brug”, a bridge in The Netherlands. The data comes from strain sensors, vibration sensors, temperature sensors, and even a video camera; leaving out the video data, these sensors collect around 5 GB per day, or 1.8 TB per year. He is using Hadoop for the parts of his work that can be done in an embarrassingly parallel way.
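Aggregating readings per sensor and per day is a typical example of such an embarrassingly parallel task. The sketch below assumes one reading per line in the form “timestamp sensor_id value” with ISO-style timestamps; that record layout is an assumption for illustration, not the actual InfraWatch format.

```python
#!/usr/bin/env python
# strain_mapper.py - key every reading by (sensor, day).
import sys

for line in sys.stdin:
    try:
        timestamp, sensor_id, value = line.split()
        day = timestamp[:10]   # e.g. 2011-03-01T12:00:00 -> 2011-03-01
        print("%s:%s\t%s" % (sensor_id, day, value))
    except ValueError:
        continue               # skip malformed lines
```

```python
#!/usr/bin/env python
# strain_reducer.py - average all readings that share a (sensor, day) key.
import sys

current, total, n = None, 0.0, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current:
        if current is not None:
            print("%s\t%.4f" % (current, total / n))
        current, total, n = key, 0.0, 0
    total += float(value)
    n += 1
if current is not None:
    print("%s\t%.4f" % (current, total / n))
```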
Spyros and Jacopo work in the Large Knowledge Collider project, and experiment with using Hadoop for high-performance Semantic Web applications. They perform experiments on batch query answering over very large datasets. To this end, they and their colleagues have created a mapping from the SPARQL query language for the Semantic Web to the Pig framework for Hadoop.
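Their mapping targets Pig, which in turn compiles to MapReduce jobs. To illustrate the kind of job that batch query answering boils down to, here is a generic Hadoop Streaming sketch, not their translation, that answers a two-pattern query such as “which persons work at an organisation located in Amsterdam?” with a reduce-side join on the shared variable. The predicates and URIs are made up, and the input is assumed to be one “subject predicate object” triple per line.

```python
#!/usr/bin/env python
# bgp_mapper.py - route the triples matching either pattern of the query
#   ?person <worksAt> ?org .  ?org <locatedIn> <Amsterdam>
# to the reducer, keyed by the shared variable ?org.
import sys

for line in sys.stdin:
    parts = line.split()
    if len(parts) < 3:
        continue
    s, p, o = parts[0], parts[1], parts[2]
    if p == "<worksAt>":
        print("%s\tP1\t%s" % (o, s))                 # key: organisation, value: person
    elif p == "<locatedIn>" and o == "<Amsterdam>":
        print("%s\tP2" % s)                          # key: organisation
```

```python
#!/usr/bin/env python
# bgp_reducer.py - per organisation, output the persons from pattern 1
# only if pattern 2 also matched that organisation.
import sys

def flush(org, persons, located):
    if located:
        for person in persons:
            print("%s\t%s" % (person, org))

current, persons, located = None, [], False
for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    org, tag = fields[0], fields[1]
    if org != current:
        if current is not None:
            flush(current, persons, located)
        current, persons, located = org, [], False
    if tag == "P1":
        persons.append(fields[2])
    else:
        located = True
if current is not None:
    flush(current, persons, located)
```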
Barbera was introduced to Hadoop during the SARA hackathon. The tests she performed there were done on a small part of the public DNA sequence data provided by the 1000 Genomes Project (~12 GB). The aim was to detect sequences in the raw experiment data that occur more than once. Within one DNA sequencing experiment such duplicates represent either technical errors of the laboratory process or genuinely repetitive biological sequences. The web interface of the Hadoop cluster (Hue) was used to set up the bioinformatics/Hadoop experiment and to keep track of the progress and the results. During the workshop two shell scripts were uploaded to the Hadoop cluster. The first script (the mapper) extracted the DNA sequences from the input files. The sorting and parallel processing were done by Hadoop. The second script (the reducer) searched for duplicated sequences and reported them. As a next step it would be interesting to expand the experiment to the complete dataset of the 1000 Genomes Project and create a list of sequences with known technical artifacts. Data from new DNA sequencing experiments can then be filtered against this list.
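In the same spirit, here is a small Python pair for Hadoop Streaming that does roughly what the two shell scripts did. It assumes that sequence lines can be recognized because they consist only of the characters A, C, G, T and N, which glosses over the details of the FASTQ input format; it is a sketch of the approach, not the scripts used at the hackathon.

```python
#!/usr/bin/env python
# seq_mapper.py - pick out the lines that look like DNA reads and emit each
# sequence as a key, so that identical sequences end up at the same reducer.
import re
import sys

is_read = re.compile(r"^[ACGTN]+$")

for line in sys.stdin:
    seq = line.strip()
    if is_read.match(seq):
        print("%s\t1" % seq)
```

```python
#!/usr/bin/env python
# seq_reducer.py - sequences arrive sorted, so duplicates are adjacent;
# report every sequence that occurs more than once and how often.
import sys

current, count = None, 0
for line in sys.stdin:
    seq, _ = line.rstrip("\n").split("\t")
    if seq == current:
        count += 1
    else:
        if current is not None and count > 1:
            print("%s\t%d" % (current, count))
        current, count = seq, 1
if current is not None and count > 1:
    print("%s\t%d" % (current, count))
```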
HDFS is based on a paper written by Google employees Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung about the Google File System.
MapReduce is based on a paper written by Google employees Jeffrey Dean and Sanjay Ghemawat about Google’s implementation of MapReduce.