
Wait...You did what? With Who?

I am a little behind on this announcement for a variety of reasons, but as I fly home from yet another trip out to CA I have some cycles to jot down my thoughts. The timing is still pretty good, since Hadoop World is coming up next week. So, better late than never, I would like to introduce the Hadoop Starter Kit 2.0, which now supports Pivotal, Cloudera, Hortonworks, and Apache. Yes, we provided instructions for every major distribution... not just Pivotal. That's one of the unique things about EMC and our partner companies like VMware and Pivotal: they are free to partner as they wish, as is EMC. Of course, we work closely together to engineer tightly integrated solutions, but that is more a function of legwork than corporate mandate. I recently moved into the Corporate CTO office from Pivotal (Greenplum), and I enjoy working with the Pivotal folks every chance I get, so look for more uniquely integrated products in the future.

As Big Data technologies are adopted by more and more customers, we are starting to see new requirements that just are not met by the technologies available today. In some cases it's a design issue, but in many cases it's a scope issue: when you step back and ask how you can do X with 100x the data, decisions might be made very differently. EMC knows performance and scale, so how can we apply that knowledge and technology to solve some Enterprise-Ready issues with what's available today?



Now, onto the Hadoop Starter Kit. First, the Hadoop Starter Kit v2 (HSK2) is nothing revolutionary, but it is a blueprint for rapid deployment of any of the major Hadoop distributions leveraging VMware Big Data Extensions, with Isilon as the HDFS datastore. Both of these technologies enable a rapid, cost-effective deployment of Hadoop. Many customers are interested in dipping their toes into the Hadoop waters, but just don't have the dedicated infrastructure to do it quickly. What they typically do have is a VMware infrastructure, and some also have access to Isilon storage. HSK2 lets them quickly get a test environment up and running that leverages both, and it is much more useful than the single-VM training environments that Cloudera, Hortonworks, and Pivotal all provide. They can use this environment for test/dev and then replicate the deployment at a larger scale once they are ready for production. Easy. So get started today. Jim Ruddy did a masterful job documenting this work, and if you stop for a moment and consider that he had not touched Hadoop prior to starting this project, you will be even more impressed. You have a non-Hadoop guy who has taken these enabling technologies and stitched together a blueprint for rolling out Hadoop environments rapidly, aimed squarely at that same type of person within the customer base. As Jim likes to say, and as my earlier blog post mentioned... it's so easy even Hulk could do it. I even recycled the picture, which was taken at last year's Hadoop World.
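To give a flavor of what "Isilon as the HDFS datastore" means in practice: the compute VMs run no DataNodes at all and simply point their HDFS client configuration at the Isilon cluster. Below is a minimal sketch of that wiring; the SmartConnect zone name is a made-up placeholder and the port is just the conventional HDFS RPC port, so check your own cluster's HDFS settings rather than copying these values.

```xml
<!-- core-site.xml on the compute-only Hadoop nodes (illustrative values only) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- isilon-hdfs.example.com is a hypothetical SmartConnect zone name;
         8020 is the typical HDFS RPC port, but verify it on your cluster -->
    <value>hdfs://isilon-hdfs.example.com:8020</value>
  </property>
</configuration>
```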


In my spare time, I have been working on a little piece of software that's almost ready to go. When leveraging Isilon you get full HDFS capability, but what you don't get today is WebHDFS capability. The reality is that with Isilon you don't really need WebHDFS, because Isilon provides native HTTP, NFS, and HDFS access to the same data, so the use case for WebHDFS largely disappears. But some of the Hortonworks training materials for Hadoop leverage HUE (Hadoop User Experience), which in turn relies on WebHDFS. So, I set out to write a translator service. In a nutshell, you run this service on your HUE server and it accepts WebHDFS calls on the standard HTTP port. It translates each call into an equivalent Isilon REST call, then takes the Isilon response and reformats it into the structure the WebHDFS caller expects. I didn't design it as a long-term, production-ready solution, but more as a means to leverage HUE for the Hortonworks sandbox training activities. I have not decided how I am going to release it, or whether there is any demand for it. Short term, you can just email me at dan.baskette@emc.com and I can put it in your hands once I get it cleaned up a bit. I developed it using the Hortonworks Sandbox VM, an Isilon VM, and my Eclipse VM. Thank you, VMware Fusion. (And thanks to Datameer for the Hadoop Flasher graphic.)
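The code itself isn't published here, so here is a minimal sketch of the translation idea in Python/Flask, limited to the LISTSTATUS operation. It assumes the OneFS RESTful Access to Namespace (RAN) API at https://&lt;cluster&gt;:8080/namespace/&lt;path&gt;, and the field names pulled from its response (name, size, type, owner, group, mode, last_modified) are assumptions that may differ by OneFS version; verify them against your Isilon before relying on this.

```python
# Hypothetical WebHDFS -> Isilon REST translator (LISTSTATUS only).
# Assumptions: OneFS RAN API at https://<cluster>:8080/namespace/<path>;
# the OneFS response field names below are unverified guesses.
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

ISILON = "https://isilon.example.com:8080"   # placeholder cluster address
AUTH = ("hdfs", "password")                  # placeholder credentials

@app.route("/webhdfs/v1/<path:hdfs_path>", methods=["GET"])
def webhdfs(hdfs_path):
    op = request.args.get("op", "").upper()
    if op != "LISTSTATUS":
        return jsonify({"RemoteException": {"message": f"unsupported op {op}"}}), 400

    # Ask OneFS for the directory listing with attribute detail.
    r = requests.get(f"{ISILON}/namespace/{hdfs_path}",
                     params={"detail": "default"},
                     auth=AUTH, verify=False)
    r.raise_for_status()

    statuses = []
    for child in r.json().get("children", []):
        statuses.append({
            # Map the (assumed) OneFS fields onto the WebHDFS FileStatus schema.
            "pathSuffix": child.get("name", ""),
            "type": "DIRECTORY" if child.get("type") == "container" else "FILE",
            "length": child.get("size", 0),
            "owner": child.get("owner", ""),
            "group": child.get("group", ""),
            "permission": child.get("mode", "755"),
            "modificationTime": child.get("last_modified", 0),
            "accessTime": 0,
            "blockSize": 134217728,   # report a nominal 128 MB block size
            "replication": 1,
        })

    # WebHDFS wraps listings as {"FileStatuses": {"FileStatus": [...]}}.
    return jsonify({"FileStatuses": {"FileStatus": statuses}})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=50070)  # 50070: classic WebHDFS/NameNode HTTP port
```

With something like this running, you would point HUE's webhdfs_url setting at the service (e.g. http://localhost:50070/webhdfs/v1) and HUE's file browser would see WebHDFS-shaped answers backed by Isilon. A real version would need the rest of the WebHDFS operations (OPEN, CREATE, MKDIRS, and so on), which is where most of the actual work lives.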
