Wow. I would like to say it's hard to believe how long it's been since I blogged, but I really can't. I never really got into the whole blogging thing. I have gotten to work and play with some great technologies over the past couple of years, and it's time to start talking about some of that. My recent work history has been a whirlwind of change, but any of my prior customers or fellow EMC'ers know that is not a huge surprise. I have been known as a guy who leaps from technology to technology to stay fresh and engaged. Sometimes that comes at the expense of career development, but I think it all works out in the long run. It's funny, but because of my history, I have become a jump-mentor for a couple of well-known guys at VMware (ex-vSpecialists) who just needed that little nudge. This October will mark the beginning of my 13th year at EMC. I spent the last 3 years at Greenplum focusing almost entirely on various aspects of Hadoop, and given my background much of that focus was on deployment and operations. Among other things, I worked on Hadoop virtualization and Python-based deployment, and my last project was Active Directory integration with Hadoop and Kerberos.
I am also tightly aligned with the OIL group (Bala Ganeshan and vEddie: http://veddiew.typepad.com/blog/2013/08/oilintro.html) and will work with them on deploying current and future technologies into the Big Data marketplace. That's where this blog is actually going....took long enough....right?

One of the first projects the OIL group took on was an effort to increase Hadoop adoption within the EMC Isilon customer base. Isilon has offered HDFS support, a freely licensable feature within the OneFS platform, for some time, but it just had not taken off within the current customer base for a variety of reasons. One of those is a fundamental misunderstanding of how the Isilon HDFS support works and the value it can bring. To me, that just shows the lack of understanding of Hadoop within the current EMC field organization. That doesn't surprise me, and in fact, I wouldn't really expect them to understand Hadoop. They have relied on Greenplum to provide that support, but now that Greenplum has become an independent company (Pivotal), there is a push within EMC Corp and the field to understand where the current products play well and how we can build next-gen products specifically for the market. To me, Isilon sits squarely in the middle of those two requirements....it's a current product that has addressed a portion of the Big Data marketplace, and the plan is to develop it further so it plays even better.

My first real exposure to this technology combination was over a year ago at a customer site (http://chrisgreer.blogspot.com/2012/04/isilon-and-hadoop.html), and it went so smoothly we decided to test the combination further and got some great initial results. Since then Isilon has worked out the kinks, added support for the Hadoop 2.0 stack, and built much better documentation of the combo. Where else can you find a platform that provides simultaneous access to the EXACT same dataset via NFS, CIFS, HDFSv1, and HDFSv2? Answer: nowhere. Why is that interesting?
Data loading challenges go away. Hadoop migration challenges are simplified, because you can spin up a 2.0 cluster and test against the data in place; if the migration isn't successful, keep using the 1.0 cluster nodes. Isilon is unique in that it replaces the NameNode and DataNode functionality, which allows you to scale compute and storage independently, each in its own scale-out cluster. I have personally performance tested Isilon in a variety of configurations, and for all those out there who say shared storage in ANY configuration is a bad idea, I invite you to take a closer look, because this is not your father's shared storage....this is a high-performance scale-out cluster. Add a node....add more bandwidth.
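To make the multi-protocol point concrete, here is a small illustrative sketch of how one file on OneFS is reachable over every protocol at once. The hostname, port, and /ifs layout below are assumptions for illustration only, not taken from Isilon documentation; your SmartConnect zone name and HDFS root will differ.

```python
# Illustrative only: one OneFS file, many access paths.
# "isilon.example.com" and the /ifs layout are hypothetical values.

def access_paths(ifs_path, smartconnect_zone="isilon.example.com"):
    """Map a single OneFS file path to per-protocol ways of reaching it."""
    rel = ifs_path.lstrip("/")            # e.g. "ifs/data/weblogs/part-0000"
    hdfs_rel = rel[len("ifs/"):]          # HDFS clients see paths below /ifs
    backslash = chr(92)
    return {
        "nfs":  f"{smartconnect_zone}:/{rel}",
        "cifs": f"{backslash * 2}{smartconnect_zone}{backslash}"
                f"{rel.replace('/', backslash)}",
        "hdfs": f"hdfs://{smartconnect_zone}:8020/{hdfs_rel}",
    }

for proto, path in access_paths("/ifs/data/weblogs/part-0000").items():
    print(proto, path)
```

The point is simply that there is no copy step between protocols: the NFS mount, the CIFS share, and the HDFS namespace are all views of the same files.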
The OIL effort to increase adoption immediately focused on a few key enabling technologies: the Apache Hadoop 1.0 stack, VMware Project Serengeti (with the Big Data Extensions), and Isilon. Why these technologies? Apache Hadoop, because BDE ships pre-configured for it, so the configuration and rollout become much easier. Serengeti and BDE, because almost everyone has a VMware implementation, and it's much easier to deploy test environments there than it is to build out a net-new physical test cluster to play with Hadoop.
Project Serengeti and BDE allow you to easily deploy Hadoop clusters in a variety of configurations: combined storage/compute, separate storage/compute, or, the interesting one, multiple compute farms against a single storage farm. The storage can be deployed within the VMware environment, or you can leverage the HDFS support within Isilon. The OIL effort leveraged this model and documented a simple deployment of a compute-only cluster that is pointed at Isilon via an HDFS URI. That documentation, in combination with an HDFS-licensed Isilon cluster, should have a Hadoop test environment up and running in a matter of minutes. James Ruddy (@darth_ruddy) and Ed Walsh (@veddiew) did a masterful job in a short amount of time documenting this process in a way that is very easy for customers to follow. See here: Hadoop Starter Kit. They are currently on-site at VMworld to answer any questions you might have.... I opted to attend HadoopWorld instead, so if you are heading to New York in October, give me a shout and we can chat about this.
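As a rough sketch of what "pointed at Isilon via an HDFS URI" means in practice: on a Hadoop 1.0-stack compute-only cluster, the default filesystem in core-site.xml points at the Isilon cluster instead of a NameNode. The zone name and port below are hypothetical placeholders; use the values from your own Isilon setup and the Starter Kit documentation.

```xml
<!-- core-site.xml on the compute-only Hadoop nodes (illustrative values) -->
<configuration>
  <property>
    <!-- Default filesystem points at the Isilon SmartConnect zone,
         not at a NameNode VM. Hostname/port are placeholders. -->
    <name>fs.default.name</name>
    <value>hdfs://isilon.example.com:8020</value>
  </property>
</configuration>
```

With that one setting, the TaskTrackers and clients read and write HDFS against Isilon, and the compute farm can be grown, shrunk, or thrown away without touching the data.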
If you are interested in these technologies but are well past the introductory stage, take a look at what the Solutions Group from EMC put together for EMC World: Hadoop as a Service. This is a much more complex implementation of the same technologies that provides customers with a roadmap for deploying a true multi-tenant Hadoop implementation.