Ashish Thusoo knows a lot about Big Data. Thusoo joined Facebook in 2007 when the company had 50 million users. He left when it had some 800 million. During that time he managed Facebook’s internal data analytics team.
Facebook’s analytics team managed the data and analytics for ad targeting, user growth, and user engagement. Now Thusoo has a new company, Qubole, which is building a Big Data platform in the cloud.
Thusoo’s insights have a single overarching theme: the democratization of data. By this he means opening up data analytics to all users in an organization, from data scientists to product engineers and business analysts.
Here’s what Thusoo learned while scaling the data analytics engine at Facebook:
1. New technologies have shifted the conversation from “what data to store” to “what can we do with more data.” The lower comparative cost of open source technologies like Hadoop and Hive makes it possible to gather more key measurements. In the case of Facebook and other Internet properties, that means gathering a lot more data on user activity and behavior.
This reduction in cost also enables more historical data to be online. “The result,” says Thusoo, “is better data driven applications. At least in the data world, simple algorithms on more data seems to yield better results than complex algorithms on a smaller data sample, notwithstanding some exceptions.”
2. Simplify data analytics for end users. Put another way, what Thusoo learned at Facebook was that there “was a lot of power in democratizing data for data users” such as scientists, analysts, and engineers.
His goal was to make every data-related capability easy, from instrumenting applications and collecting data, to understanding and analyzing it, to creating data-driven applications.
“Building familiar interfaces” and tools for working with data were key to increasing the adoption of underlying technologies like Hadoop and Hive within Facebook.
3. More users means data analytics systems have to be more robust. The vision of “democratizing data” among Facebook’s “data scientists, analysts and data engineers made things harder.”
To realize that vision, Thusoo’s team had to design in the ability to handle poorly written queries so they wouldn’t crash the system. They had to build mechanisms for sharing resources fairly, including usage monitoring and limits.
“We had many different kinds of users ranging from business analysts to product engineers with varying levels of understanding of the infrastructure or the best practices of using it.”
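To make the idea of usage monitoring and limits concrete, here is a minimal sketch in Python. It is not Facebook's implementation; the class name `UsageLimiter`, the flat per-user quota, and the notion of an "estimated cost" per query are all assumptions made for illustration.

```python
from collections import defaultdict


class UsageLimiter:
    """Hypothetical per-user quota: tracks consumption and rejects work over budget."""

    def __init__(self, quota_per_user):
        self.quota_per_user = quota_per_user  # assumed budget, e.g. CPU-seconds per day
        self.used = defaultdict(float)        # usage monitoring: user -> units consumed

    def admit(self, user, estimated_cost):
        """Return True and record the cost if the query fits the user's remaining quota."""
        if estimated_cost < 0:
            raise ValueError("estimated_cost must be non-negative")
        if self.used[user] + estimated_cost > self.quota_per_user:
            return False                      # over budget: reject instead of running it
        self.used[user] += estimated_cost
        return True

    def report(self):
        """Usage per user, heaviest consumers first."""
        return sorted(self.used.items(), key=lambda kv: kv[1], reverse=True)


limiter = UsageLimiter(quota_per_user=100.0)
results = [
    limiter.admit("analyst", 30.0),
    limiter.admit("engineer", 90.0),
    limiter.admit("engineer", 50.0),  # would push this user past the quota
    limiter.admit("analyst", 50.0),
]
```

In this sketch an expensive query from one user is turned away without affecting anyone else, and `report()` gives the kind of per-user usage view the team describes. A real system would also need cost estimation, quota resets, and enforcement while queries are running.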
4. Social networking works for Big Data. “We invested in making our tools more and more collaborative so that users could share analysis with each other and discover data by getting connected to expert users of a data set.”
With Facebook’s hyper-growth and data that was changing all the time, a collaboration approach “worked better than creating knowledge bases around metadata.”
5. No single infrastructure can solve all Big Data problems. When it came to real-time reports, Thusoo’s team made “a lot of investment as we discovered use cases… better solved through systems other than Hadoop. In the case of real time reports our team invested in building out Puma. There were many other examples around graph analysis as well as low latency data inspection on large data sets,” where they had to build or invest in new technologies.
6. Building software is hard, but running a service is even harder. Thusoo’s team had to do a lot of work to make the service usable. They invested heavily in building “systems that would measure usage, point out bottlenecks and really quantify for our users how much they were using” the system. They also had to build the capability to monitor agreed-upon service levels and deliver on them.