Facebook, Hadoop and Hive

Cloud Computing, Programming June 16th, 2009

Facebook has the second largest installation of Hadoop (a software platform that lets one easily write and run distributed applications that process vast amounts of data); only Yahoo!'s is larger. Facebook is also the creator of Hive, a data warehouse infrastructure built on top of Hadoop.

The following two posts shed some more light on why Facebook chose the Hadoop/Hive path, how they're doing it, and the challenges they're facing:

Facebook, Hadoop, and Hive on DBMS2 by Curt Monash discusses Facebook’s architecture and motivation.

Facebook decided in 2007 to move what was then a 15 terabyte big-DBMS-vendor data warehouse to Hadoop — augmented by Hive — rather than to an MPP data warehouse DBMS…

The daily pipeline took more than 24 hours to process. Although aware that its big-DBMS-vendor warehouse could probably be tuned much better, Facebook didn’t see that as a path to growing its warehouse more than 100-fold.

Hive – A Petabyte Scale Data Warehouse using Hadoop by Ashish Thusoo from the Data Infrastructure team at Facebook discusses Facebook’s Hive implementation in detail.

… using Hadoop was not easy for end users, especially for the ones who were not familiar with map/reduce. End users had to write map/reduce programs for simple tasks like getting raw counts or averages. Hadoop lacked the expressibility of popular query languages like SQL and as a result users ended up spending hours (if not days) to write programs for typical analysis. It was very clear to us that in order to really empower the company to analyze this data more productively, we had to improve the query capabilities of Hadoop. Bringing this data closer to users is what inspired us to build Hive. Our vision was to bring the familiar concepts of tables, columns, partitions and a subset of SQL to the unstructured world of Hadoop, while still maintaining the extensibility and flexibility that Hadoop enjoyed.
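To get a feel for the gap Thusoo describes, here is a minimal sketch (illustrative only, not Facebook's actual code) of the map/reduce boilerplate that a simple grouped count required. It assumes made-up tab-separated log lines whose first field is a page URL; in Hive, the same job collapses to roughly SELECT page, COUNT(1) FROM pageviews GROUP BY page.

// Illustrative sketch of a grouped raw count as a Hadoop map/reduce job.
// The class name, input layout, and paths are made up for this example.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageViewCount {

    // Emits (page, 1) for each tab-separated log line whose first field is a page URL.
    public static class ViewMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text page = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length > 0 && !fields[0].isEmpty()) {
                page.set(fields[0]);
                context.write(page, ONE);
            }
        }
    }

    // Sums the per-page counts emitted by the mappers.
    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values,
                              Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "pageview count");
        job.setJarByClass(PageViewCount.class);
        job.setMapperClass(ViewMapper.class);
        job.setCombinerClass(SumReducer.class); // pre-aggregate on the map side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Compiled into a jar and launched with hadoop jar, that is several dozen lines of Java for what SQL expresses in one — exactly the pain point Hive was built to remove.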


Yahoo Releases Its Own Hadoop Distribution

Cloud Computing, Technology June 11th, 2009

Yahoo! is releasing its own distribution of Hadoop:

Hadoop is a distributed file system and parallel execution environment that enables its users to process massive amounts of data.
In response to frequent requests from the Hadoop community, Yahoo! is opening up its investment in Hadoop quality engineering to benefit the larger ecosystem and to increase the pace of innovation around open and collaborative research and development.
The Yahoo! Distribution of Hadoop has been tested and deployed at Yahoo! on the largest Hadoop clusters in the world.
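For the "distributed file system" half of that description, here is a small hedged sketch using Hadoop's FileSystem API. The namenode address and file path are made-up examples; in practice the address comes from the cluster's configuration files.

// Illustrative sketch of writing and reading a file on HDFS.
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical namenode address; normally picked up from core-site.xml.
        conf.set("fs.default.name", "hdfs://namenode.example.com:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/hello.txt");

        // Write: HDFS transparently splits the file into blocks and
        // replicates them across datanodes.
        FSDataOutputStream out = fs.create(file, true);
        out.writeBytes("hello, hdfs\n");
        out.close();

        // Read it back.
        BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file)));
        System.out.println(in.readLine());
        in.close();
        fs.close();
    }
}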

Hadoop is a free Java software framework born out of an open-source implementation of Google’s published computing infrastructure and fostered within the Apache Software Foundation. Yahoo! has been the primary developer of, and contributor to, Apache Hadoop. In 2006, Hadoop founder Doug Cutting joined Yahoo!, which provided a dedicated team and resources to develop the open-source software and turn Hadoop into a system that ran at web scale. Today, Yahoo! runs the largest Hadoop cluster in the world, with more than 25,000 servers, and the framework underpins many Yahoo! properties, including Yahoo! Search, Yahoo! Mail, and several content and ad services.

Yahoo! says it is opening up the source code of its Hadoop distribution to “benefit the larger ecosystem and to increase the pace of innovation around open and collaborative research and development.”
As Nigel Daley, Quality and Release Engineering Manager at Yahoo! Grid Technologies, summarizes:

Hadoop is helping us solve key science and research problems in hours or days instead of months. It provides us a platform to solve extreme problems requiring massive amounts of data processing. It underpins major revenue-generating systems. Opening our distribution enables a faster pace of innovation for the entire Hadoop ecosystem and broadens the use — and ultimately the quality — of this key platform across the industry.
