Hadoop so EASY, Hulk could do it.






Wow. I would like to say it's hard to believe how long it's been since I blogged, but I really can't. I never really got into the whole blogging thing. I have gotten to work and play with some great technologies over the past couple of years, and it's time to start talking about some of that. My recent work history has been a whirlwind of change, but any of my prior customers or fellow EMC'ers know that is not a huge surprise. I have been known as a guy that leaps from technology to technology to stay fresh and engaged. Sometimes that's at the expense of career development, but I think it all works out in the long run. It's funny, but because of my history, I have become a jump-mentor for a couple of well-known guys at VMware (ex-vSpecialists) that just needed that little nudge. This October will mark the beginning of my 13th year at EMC. I spent the last 3 years at Greenplum focusing almost entirely on various aspects of Hadoop, and given my background, much of that focus was on deployment and operations. Among other things, I worked on Hadoop virtualization and Python-based deployment, and my last project was Active Directory integration with Hadoop and Kerberos.

I recently moved to EMC ESG to work for Bala Ganeshan. I have known Bala my entire career at EMC, and we had been looking for an opportunity to work together again, so when he needed some Hadoop and Hadoop benchmarking experience for his team, I jumped at the chance. Fast-forward about a month, and the Annual EMC Re-Org struck. Bala's team was split up, with resources redistributed to various groups, but a few of us ended up with Bala in the Corporate Office of the CTO. I no longer work for Bala, but I do work alongside him on many projects in my new role in the OCTO. I now work for another awesome EMC Distinguished Engineer, John Cardente, helping to define the building blocks for Big Data within EMC. This is a unique and challenging role that has me very excited for the future. I will expand out into other technologies while continuing to provide Hadoop expertise to a variety of projects within EMC and Pivotal. This is the first time I have gotten to work VERY closely with product development outside of the Greenplum organization, so I am really enjoying that aspect.

I am also tightly aligned with the OIL group (Bala Ganeshan and vEddie http://veddiew.typepad.com/blog/2013/08/oilintro.html) and will work with them on deploying current and future technologies into the Big Data marketplace. That's where this blog is actually going....took long enough....right? One of the first projects the OIL group took on was an effort to increase Hadoop adoption within the EMC Isilon customer base. Isilon has offered HDFS support, a freely licensable feature within the OneFS platform, for some time, but it had just not taken off within the current customer base for a variety of reasons. One of those is a fundamental misunderstanding of how the Isilon HDFS support works and the value it can bring. To me, that just shows the lack of Hadoop understanding within the current EMC field organization. That doesn't surprise me, and in fact, I wouldn't really expect them to understand Hadoop. They have relied on Greenplum to provide that support, but now that Greenplum has become an independent company (Pivotal), there is a push within EMC Corp and the field to understand where the current products play well and how we can build next-gen products specifically for the market. To me, Isilon sits squarely in the middle of those two requirements....it's a current product that has addressed a portion of the Big Data marketplace, and the plan is to develop it further so it plays even better.

My first real exposure to this technology combination was over a year ago at a customer site (http://chrisgreer.blogspot.com/2012/04/isilon-and-hadoop.html), and it went so smoothly we decided to test the combination further and got some great initial results. Since then, Isilon has worked out the kinks, added support for the Hadoop 2.0 stack, and built much better documentation of the combo. Where else can you find a platform that provides simultaneous access to the EXACT same dataset via NFS, CIFS, HDFSv1 and HDFSv2? Answer: nowhere. Why is that interesting? Data loading challenges go away. Hadoop migration challenges are simplified, because you can spin up a 2.0 cluster and test against the data in place. If your migration was not successful, keep using the 1.0 cluster nodes. Isilon is unique in that it replaces both the Namenode and Datanode functionality. This allows you to scale compute and storage independently, each in its own scale-out cluster. I have personally performance tested Isilon in a variety of configurations, and for all those out there that say shared storage in ANY configuration is a bad idea, I invite you to take a closer look, because this is not your father's shared storage....this is a high-performance scale-out cluster. Add a node....add more bandwidth.
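To make the multi-protocol point concrete, here is a minimal sketch using the standard Hadoop FileSystem API to list a directory on an Isilon-backed HDFS. The SmartConnect hostname (isilon.example.com) and the /data/weblogs path are placeholders I made up for illustration, and I am assuming the default HDFS RPC port of 8020:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class IsilonHdfsList {
        public static void main(String[] args) throws Exception {
            // Point the client at the Isilon cluster instead of a Namenode.
            // "isilon.example.com" is a placeholder for your SmartConnect
            // zone name; 8020 is the usual HDFS RPC port.
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://isilon.example.com:8020"), new Configuration());

            // Files written to the same OneFS directory over NFS or CIFS show
            // up here immediately -- no copy or ingest step required.
            for (FileStatus stat : fs.listStatus(new Path("/data/weblogs"))) {
                System.out.println(stat.getPath() + "\t" + stat.getLen() + " bytes");
            }
            fs.close();
        }
    }

Drop a file into that same directory over an NFS mount, re-run the listing, and it's just there....that's the data loading story in a nutshell.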

The OIL effort to increase adoption immediately focused on a few key enabling technologies: the Apache Hadoop 1.0 stack, VMware Project Serengeti (with the Big Data Extensions, or BDE), and Isilon. Why these technologies? Apache Hadoop, because BDE ships pre-configured for it, so configuration and rollout become much easier. Serengeti and BDE, because almost everyone has a VMware implementation, and it's much easier to deploy test environments there than it is to build out a net-new physical test cluster to play with Hadoop.

Project Serengeti and BDE allow you to easily deploy Hadoop clusters in a variety of configurations: combined storage/compute, separate storage/compute, or, the interesting one, multiple compute farms against a single storage farm. These can be deployed within the VMware environment, or by leveraging the HDFS support within Isilon. The OIL effort leveraged this model and documents a simple deployment of a compute-only cluster that is pointed at Isilon via an HDFS URI (see the sketch below). That documentation, in combination with an HDFS-licensed Isilon cluster, should have a Hadoop test environment up and running in a matter of minutes. James Ruddy (@darth_ruddy) and Ed Walsh (@veddiew) did a masterful job in a short amount of time documenting this process in a way that is very easy for customers to follow. See here: Hadoop Starter Kit. They are currently on-site at VMworld to answer any questions you might have.... I opted to attend HadoopWorld instead, so if you are heading to New York in October, give me a shout and we can chat about this.
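To give a flavor of what a compute-only deployment looks like, below is a rough sketch of a Serengeti cluster spec. Treat it as hypothetical: the attribute names are as I remember them from the Serengeti documentation (the Starter Kit has the real thing), and the Isilon URI, node counts, and sizes are all placeholders. The key idea is that externalHDFS points the cluster at Isilon, while the node groups carry only compute roles:

    {
      "externalHDFS": "hdfs://isilon.example.com:8020",
      "nodeGroups": [
        {
          "name": "master",
          "roles": ["hadoop_jobtracker"],
          "instanceNum": 1,
          "cpuNum": 2,
          "memCapacityMB": 7500
        },
        {
          "name": "worker",
          "roles": ["hadoop_tasktracker"],
          "instanceNum": 4,
          "cpuNum": 2,
          "memCapacityMB": 7500,
          "storage": { "type": "LOCAL", "sizeGB": 20 }
        }
      ]
    }

Because no Namenode or Datanode roles appear in the spec, the VMs carry only the compute side of the stack, and every HDFS read and write goes straight to Isilon.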

If you are interested in these technologies but are well past the introductory stage, take a look at what the EMC Solutions Group put together for EMC World: Hadoop as a Service. This is a much more complex implementation of the same technologies, providing customers with a roadmap for deploying a true multi-tenant Hadoop environment.




