This post is part of our ReadWriteCloud channel, which is dedicated to covering virtualization and cloud computing. The channel is sponsored by Intel and VMware.
When it comes to scaling a service to handle ever-increasing data demands, there may be no better example to look at than Facebook. The social networking giant has grown from a service for Harvard students into one with some 500 million users.
But that number doesn’t adequately capture the company’s scaling and storage demands. So here are a few more statistics: All told, users spend 8 billion minutes on the site every day. There are 3.5 billion pieces of content shared weekly. 2.5 billion photos are uploaded every month, and 1.2 million photos are served up every second. And as 70% of Facebook users are outside the United States, the amount of data served and stored is further complicated by the locations of users and data centers.
It’s no surprise, then, that some of the traditional ways of scaling cannot work for Facebook. You can’t simply shard the database based on a user’s location or on what information he or she will be accessing, as users are interconnected in ways that are both unpredictable and global.
As Facebook has grown, the company has developed a number of tools to handle this data, both for storing content and for delivering it, and it has open sourced many of them. Facebook has been built from the beginning on open source technologies, according to David Recordon, Facebook’s Open Source Programs Manager. But Facebook’s use of open source goes far beyond the LAMP stack (or even the LAMP stack plus Memcached). The company has also created and released several open source projects of its own and participates heavily in others, perhaps most notably Hadoop.
Here are a few of Facebook’s open source tools that have helped it handle the massive amount of data:
Now an Apache Software Foundation project, Cassandra is a distributed storage system for managing structured data that is designed to scale to a very large size across many commodity servers, with no single point of failure. Originally developed to help power Facebook’s Inbox search, Cassandra is one of a growing number of NoSQL database solutions.
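One idea at the heart of systems like Cassandra is consistent hashing: each server owns a segment of a hash "ring," and every key is stored on the first server at or after its hash, so no single node is a point of failure and adding a node only remaps a fraction of the data. The following is a toy Python sketch of that idea, not Cassandra's actual code; the node names and key are made up for illustration.

```python
import hashlib
from bisect import bisect

# Toy sketch of consistent hashing (not Cassandra's real implementation).
# Each server is mapped to several points ("virtual nodes") on a hash ring
# so data stays roughly balanced as nodes join or leave.
class HashRing:
    def __init__(self, nodes, points_per_node=3):
        self.ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(points_per_node)
        )

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise to the first ring point >= hash(key), wrapping around.
        points = [p for p, _ in self.ring]
        idx = bisect(points, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:1234")  # the same key always maps to the same node
```

Because placement depends only on the hash, any client can compute which node owns a key without consulting a central coordinator.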
Also an Apache project, Hive is a data warehouse infrastructure built on top of Hadoop that provides tools for easy data summarization, ad hoc querying, and analysis of large datasets. Hive provides a simple query language called Hive QL which, being based on SQL, lets those familiar with SQL query this data.
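A Hive QL query looks much like ordinary SQL, while under the hood Hive compiles it into Hadoop jobs that run across the cluster. The table and column names below are hypothetical, for illustration only:

```sql
-- Hypothetical page-view table; Hive QL keeps familiar SQL syntax
-- while Hive translates the query into distributed Hadoop jobs.
SELECT page_url, COUNT(*) AS views
FROM page_view_log
WHERE dt = '2010-07-01'
GROUP BY page_url
ORDER BY views DESC
LIMIT 10;
```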
Hive was also the subject of a presentation at FOSDEM earlier this year.
To save server resources, Facebook developed HipHop, which transforms PHP source code into highly optimized C++. HipHop was open sourced earlier this year.
Facebook logs approximately 25 terabytes of data a day. Perhaps unsurprisingly, other tools weren’t able to scale to that capacity, so Facebook developed Scribe to log data streamed in real time from a large number of servers.
Currently an Apache incubator project, Thrift provides a framework for scalable cross-language services development in C++, Java, Python, PHP, and Ruby.
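With Thrift, a service is described once in an interface definition file, and the Thrift compiler then generates client and server stubs for each target language. The service and struct below are hypothetical, sketched only to show the shape of a Thrift definition:

```thrift
// Hypothetical definitions for illustration; running the Thrift compiler
// over a file like this generates matching code in C++, Java, Python,
// PHP, and Ruby, so services written in different languages can talk.
struct Profile {
  1: i64    user_id,
  2: string name,
}

service ProfileService {
  Profile getProfile(1: i64 user_id),
}
```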
Recordon notes that many of Facebook’s new technologies started as “a hack project,” and the company is working not only to develop new tools internally, but to encourage their development and usage externally as well. So while few companies may have the massive data storage and scaling needs that Facebook does, the company’s open source efforts work towards building a “collaborative sustainable model” of development.