
Is Hadoop Dead or Just Much Less Important?

I recently read a blog post discussing the fever to declare Hadoop dead. While I agreed with the premise of the post, I didn't agree with some of its conclusions. In summary, it concluded that if Hadoop is too complex, you are using the wrong interface. I agree with that at face value, but in my opinion the user interface only addresses part of the complexity; the management of a Hadoop deployment is still a complex undertaking.

Time to value is important for enterprise customers, which is why the tooling above Hadoop was such an early pain point. The core Hadoop vendors wanted to focus on execution models and programming paradigms, and seemed to ignore the interface to Hadoop. Much of that stems from the desire for Hadoop to be the operating system for Big Data. There was even a push to make it the compute-cluster manager for all things in the Enterprise. This effort, and others like it, tried to expand the footprint of commercial Hadoop distributions without fixing many of Hadoop's core usability issues. But, who knows, maybe people were lining up to run their general-purpose application workloads inside a separate, brand-new resource-management tool that was far less advanced than current offerings.

SPOILER ALERT: They weren't.

The blog says that putting a SQL layer onto Hadoop is a bad idea, but if it weren't for that SQL layer, Hadoop would already have died. Many of Hadoop's early adopters in the Enterprise had big plans for how they would leverage Hadoop, but they almost immediately hit a roadblock: the requirement for Java and MapReduce knowledge. SQL, the lingua franca of data access, came to the rescue and opened up the data stored in Hadoop to actual use by actual business users with their traditional BI tools. There is no question SQL was the correct interface at the time. Technologies such as Apache HAWQ, Apache Impala, and IBM BigSQL played important roles in bridging the gap between complexity and usefulness as Hadoop progressed in the market. Apache Hive started as a slow and weak SQL alternative supporting just a subset of the SQL standard. It has, however, garnered a lot of attention from Hortonworks and the community, and has progressed rapidly over the past few years because of that attention. That team understood how customers needed to interact with Hadoop data and morphed Apache Hive into the de facto standard SQL-on-Hadoop engine. The other entries all provided significant functionality above and beyond Apache Hive, but Hive advanced to the point where it was only ruled out when users genuinely required those advanced features.
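
To make that concrete, here is a minimal sketch of what the SQL-on-Hadoop access pattern looks like from a client's point of view, using the PyHive library against a HiveServer2 endpoint. The host name and the web_logs table are hypothetical placeholders, not something from the original post.

    # Minimal sketch: querying Hadoop data with plain SQL via HiveServer2.
    # The host and the web_logs table are hypothetical placeholders.
    from pyhive import hive

    conn = hive.Connection(host="hive.example.com", port=10000, username="analyst")
    cursor = conn.cursor()

    # Ordinary SQL -- no Java or MapReduce knowledge required.
    cursor.execute("""
        SELECT status_code, COUNT(*) AS hits
        FROM web_logs
        GROUP BY status_code
        ORDER BY hits DESC
    """)

    for status_code, hits in cursor.fetchall():
        print(status_code, hits)

    cursor.close()
    conn.close()

This is exactly the kind of interaction that let business users and BI tools work with Hadoop data without ever touching the underlying execution engine.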

As technologies progressed, new processing capabilities emerged that could leverage the Hadoop infrastructure, and Spark proved to be the shining star of this new trend. A new, simpler programming model, the ability to leverage in-memory processing, and support for multiple languages were all very attractive, but it didn't take long for the Spark community to realize how most people wanted to access the data, and a robust Spark SQL emerged. Spark SQL gained popularity because people were familiar with the interface, it made sense, and it was significantly easier to use, once again demonstrating that SQL was the right interface at the right time.
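
As a sketch of why Spark SQL felt so familiar, here is roughly the same query expressed through PySpark. The table and column names are again hypothetical placeholders.

    # Minimal sketch: the same SQL interface, now backed by Spark's
    # in-memory engine. Table and column names are hypothetical.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("sparksql-sketch")
             .enableHiveSupport()  # reuse existing Hive tables, if any
             .getOrCreate())

    top_codes = spark.sql("""
        SELECT status_code, COUNT(*) AS hits
        FROM web_logs
        GROUP BY status_code
        ORDER BY hits DESC
        LIMIT 10
    """)
    top_codes.show()

    spark.stop()

The query is the same; only the engine underneath changed, which is precisely the point.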

Hadoop has steadily added features, but the market is moving faster than Hadoop. Stream processing, real-time analytics, and event-driven architectures have become a much bigger focus for modern data architectures. With so many companies moving to the cloud, HDFS becomes just one more option for storing data, and a more expensive, self-managed one at that. There are tools, like Datameer and others, that can make Hadoop easier for end users, but the care and feeding of the infrastructure is not an easy task and is much more complex than many alternative solutions in the market.


Taking a look at an Event-Driven Reference Architecture makes it very clear that there are many pieces involved, with none of them being central to its operation. What's more important than the functionality of any one piece is that the management of that piece does not overwhelm all the others. With Hadoop, that's not always true.


Hadoop isn't dead, but it's no longer synonymous with Big Data. Additionally, more than three years after the launch of Hadoop 2, the complexity-to-value curve is still not attractive in many cases. Take a look at the home pages of Cloudera, Hortonworks, and MapR and count the number of times Hadoop is mentioned.

SPOILER ALERT: It's fewer than three times total across all three sites.

Hortonworks even changed the name of its user conference to reflect the changing priorities in the enterprise. Many users have begun to realize that they don't need Hadoop at all, or that the use cases they targeted for Hadoop can be addressed by a mix of offerings. That mix might include Hadoop, but as a piece of the solution, not THE solution.

In many cases, users are turning back to scale-out data warehousing to meet their needs. Advancements in MPP data warehouse offerings allow for massive clusters, the use of cloud storage or an HDFS-based data lake as additional storage or a data source, and simpler deployment and management. These advances often make MPP warehouses a much more attractive option, especially when the plan is to use a SQL interface regardless of the final solution. In other cases, users are adopting cloud storage as a "data lake" and embracing the myriad processing options that exposes.

So, Hadoop isn't dead, but the excitement has worn off and it has been relegated to yet another tool on the tool belt, instead of THE tool belt, as it was billed. The abundance of available tools is an interesting side effect of the market's move toward open source. But that is a topic for another post.

Comments

  1. Hi Dan, I'm not sure I would say that Hadoop is dead, or even that the excitement has worn off. Instead I would say that people have gone from the hype to the trough, and even the vendors are now finding out what the tooling was meant to fulfill. I expect more startups will begin to use Hadoop (which I define as inclusive of Spark and all the SQL engines) as the base for their data platforms. I also expect the vendors to realize that it's not MapReduce or even SQL that's exciting about Hadoop; it's the concept of a clustered operating system on top of which any application can be added. I'm actually more hyped now about what it can become (assuming the vendors can execute).


