Eran Kampf

Facebook, Hadoop and Hive

Facebook runs the second-largest installation of Hadoop (a software platform that lets one easily write and run distributed applications processing vast amounts of data), after Yahoo!. Facebook is also the creator of Hive, a data warehouse infrastructure built on top of Hadoop.

The following two posts shed some more light on why Facebook chose the Hadoop/Hive path, how they’re doing it, and the challenges they’re facing:

Facebook, Hadoop, and Hive on DBMS2 by Curt Monash discusses Facebook’s architecture and motivation.

Facebook decided in 2007 to move what was then a 15 terabyte big-DBMS-vendor data warehouse to Hadoop, augmented by Hive, rather than to an MPP data warehouse DBMS…

The daily pipeline took more than 24 hours to process. Although aware that its big-DBMS-vendor warehouse could probably be tuned much better, Facebook didn’t see that as a path to growing its warehouse more than 100-fold.

Hive – A Petabyte Scale Data Warehouse using Hadoop by Ashish Thusoo from the Data Infrastructure team at Facebook discusses Facebook’s Hive implementation in detail.

… using Hadoop was not easy for end users, especially for the ones who were not familiar with map/reduce. End users had to write map/reduce programs for simple tasks like getting raw counts or averages. Hadoop lacked the expressibility of popular query languages like SQL and as a result users ended up spending hours (if not days) to write programs for typical analysis. It was very clear to us that in order to really empower the company to analyze this data more productively, we had to improve the query capabilities of Hadoop. Bringing this data closer to users is what inspired us to build Hive. Our vision was to bring the familiar concepts of tables, columns, partitions and a subset of SQL to the unstructured world of Hadoop, while still maintaining the extensibility and flexibility that Hadoop enjoyed.
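
To make that concrete, here is a minimal HiveQL sketch of the kind of thing the post describes. The table and column names (page_views, user_id, time_spent, dt) are invented for illustration, not taken from Facebook’s schema; the point is that a partitioned table plus a short SQL-style query replaces a hand-written map/reduce job for a simple count or average.

    -- hypothetical table, partitioned by date so each day's data is a separate partition
    CREATE TABLE page_views (user_id BIGINT, url STRING, time_spent INT)
    PARTITIONED BY (dt STRING);

    -- raw counts and averages for one day; in plain Hadoop this would be a custom map/reduce program
    SELECT dt, COUNT(*) AS views, AVG(time_spent) AS avg_time_spent
    FROM page_views
    WHERE dt = '2009-06-01'
    GROUP BY dt;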