Apache Hadoop at 10
2016 marks the 10th anniversary of Hadoop. This birthday gives us an opportunity to celebrate, and also to reflect on how we got here and where we are going.
Hadoop has come to symbolize big data, itself central to this century’s industrial revolution: the digital transformation of business. Ten years ago, digital business was limited to a few sectors, like e-commerce and media. Since then, we have seen digital technology become essential to nearly every industry. Every industry is becoming data driven, built around its information systems. Big data tools like Hadoop enable industries to best benefit from all the data they generate.
Hadoop did not cause digital transformation, but it is a critical component of this larger story. Thus by exploring Hadoop’s history, we can better understand the century we are now in.
Pre-Hadoop, there were two software traditions, which I will call “enterprise” and “hacker”. In the enterprise tradition, vendors developed and sold software to businesses who ran it—the two rarely collaborated. Enterprise software relied on a Relational Database Management System (RDBMS) to address almost every problem. Users trusted only their RDBMS to store and process business data. If it was not in the RDBMS, it was not business data.
In the hacker tradition, software was largely used by the same party that developed it, at universities, research centers, and Silicon Valley web companies. Developers wrote software to address specific problems, like routing network traffic, generating and serving web pages, and so on. I came out of this latter tradition, specifically working on search engines for over a decade. We had little use for an RDBMS, since it did not scale well to searching the entire web, becoming too slow, inflexible and expensive.
In 2000, I launched the Apache Lucene project, working in open source for the first time. The methodology was a revelation. I could collaborate with more than just the developers at my employer, plus I could keep working on the same software when I changed employers. But most important, I learned just how great open source is at making software popular. When software is not encumbered by licensing restrictions, users feel much more comfortable trying it and building their businesses around it, without the risks of hard dependencies on opaque, commercial software. When users find problems, they can get involved and help to fix them, increasing the size of the development team. In short, open source is an accelerant for software adoption and development.
A few years later, around 2004, while working on the Apache Nutch project, we arrived at a second insight. We were trying to build a distributed system that could process billions of web pages. It had been rough going: the software was difficult to develop and operate. We heard rumors that Google engineers had a system where they could, with just a few lines of code, write a computation that would run in parallel on thousands of machines, reliably processing many terabytes in minutes. Then Google published two papers describing how this all worked: a distributed filesystem (GFS) with an execution engine (MapReduce) running on top of it. This approach would make Nutch a much more viable system. Moreover, these tools could probably be used in a lot of other applications. MapReduce had unprecedented potential for large-scale data analysis, but was at that time only available to engineers at Google.
Combining these two insights—the efficacy of open source for spreading technology and the broad applicability of Google’s approach—we realized that an open-source implementation of Google’s ideas would not only help us in Nutch, but had the potential to become a very successful open-source project. With that realization, Mike Cafarella and I began implementing such a distributed filesystem and MapReduce engine in Nutch.
By 2005 we had this new Google-inspired version of Nutch limping along on clusters of 20 to 40 computers, Mike at the University of Washington and me at the Internet Archive. However, I realized that, with just a couple of us working part time, it would take many years to make this software stable and reliable enough that anyone could easily use it. Moreover, to truly fulfill its promise, the software needed to be tested and debugged on clusters of thousands of computers, which we did not have. The technology needed more engineers and more hardware.
Late that year I gave a talk about Nutch to folks at Yahoo! and learned that they had a great need for this kind of software. They also had a team of skilled engineers to work on it, and plenty of hardware. It was a perfect match.
So in January 2006, I joined Yahoo!. Shortly thereafter we separated the distributed filesystem and MapReduce software from Nutch into a new project, “Hadoop”, named after my son’s stuffed elephant. With the addition of a dozen or so Yahoo! engineers and access to thousands of their computers, we made rapid progress. By 2007 we had a relatively stable, reliable system that could process petabytes using affordable commodity hardware.
Hadoop was then a game changer. Developers could much more quickly and easily build better methods of advertising, spell-checking, page layout, and so on. Increasingly, users outside of Yahoo! started to deploy Hadoop, at companies like Facebook, Twitter, and LinkedIn. Other projects were soon built on top of Hadoop, like Apache Pig, Apache Hive, Apache HBase, and so on. Academic researchers began to use Hadoop. We had reached the target I had initially imagined: a popular open source project that enabled easy, affordable storage and analysis of bulk data.
Little did I know that things were only getting started. A few venture capitalists approached me suggesting that this software might be useful outside of the web and academia. I thought they were crazy. Banks, insurance companies and railways would never run the open-source “hacker” software that I worked on. But the VCs persisted, and, in 2008, they funded Cloudera, the first company whose express mission was to bring Hadoop and related technologies to traditional enterprises.
A year later, in 2009, I began to recognize this possibility. If we could make Hadoop approachable to Fortune 500 companies, it had the potential to change their businesses. As companies were adopting more technology, from websites and call centers to cash registers and bar code scanners, more and more data about their businesses passed through their fingers. If institutions could capture and use more of this data, they could better understand and improve their businesses. Traditional RDBMS-based technologies were a poor match in several dimensions: they were too rigid to support variable, messy data and rapid experimentation; they did not scale easily to petabytes; plus they were very expensive. Even a small Hadoop cluster could permit companies to ask and answer bigger questions than before, learning and improving. So I joined Cloudera. This was clearly where Hadoop would make the biggest difference going forward.
Now, seven years later, we can see that Cloudera’s founders were right: Hadoop and the movement it started have a valuable role in mainstream enterprises.
We are in a revolution on several fronts. Traditional enterprise RDBMS software now has competition: open source, big data software. Much to my surprise and pleasure, the hacker and enterprise software traditions are no longer distinct, but have merged. No longer is there a strict division between those who develop software and those who use it. We regularly see Cloudera customers collaborating with our engineers. Users are often directly involved in software advances.
No single software component dominates. Hadoop is perhaps the oldest and most successful component, but new, improved technologies arrive every year. New execution engines like Apache Spark and new storage systems like Apache Kudu (incubating) demonstrate that this software ecosystem evolves rapidly, with no central point of control. Users get better software sooner.
The new software is not only more affordable and scalable; it also offers a better style of working. Institutions can explore messy, diverse data sources, perform experiments, and rapidly develop and evolve applications. Data from sensors, social media, and production can be combined to develop insights, inform decisions, and fuel new products. Companies like Cloudera have helped this software meet the requirements of industry, making it more stable, reliable, manageable, secure, and easily integrated with existing systems.
Government and industry themselves are transforming. Not only are new companies like Uber and Tesla using data to reinvent their sectors, but established companies like Caterpillar and Chevron are dramatically improving themselves through data technology. Much of the progress we will make in this century will come from increased understanding of the data we generate.
Looking back, 10 years ago I never would have guessed that Hadoop would form a critical part of such huge trends. I am both surprised by and proud of how far we have come. I look forward to following Hadoop’s continued impact as the data century unfolds.