
EMC World Day 1

I am officially back in the blogosphere. It will take me a bit to get into the swing of things, but I wanted to crank out a Day 1 blog, so here goes.

This year at EMC World has been a bit of a different experience for me. My role changed a bit this year: I have joined the Greenplum Product Architecture team. As a member of this team, I am tasked with working with Product Engineering to help them design products that meet customer needs. So, with that, I am working with field resources a good bit so I can gather those requirements. I am also tasked with driving products out of Engineering and into the field prior to their General Availability...so this includes POCs and Betas. EMC World is proving to be a treasure-trove of feedback since I am much more of an attendee than a worker-bee, but I do have a couple of responsibilities other than learning this week:

  1. Co-presenter on Optimizing Greenplum Database on a VMware Virtualized Infrastructure. So, what's a co-presenter? Well, Kevin O'Leary is the rock star here ๐ŸŽธ๐ŸŽค...I am the slide flipper ๐Ÿ˜„! Actually, I collaborated with him and Shane Dempsey over the last couple of months on the "what and how to test" and participated with them as we worked with VMware Performance Engineering to make this a reality. Kevin and Shane did all the heavy lifting, and Kevin has brought a wealth of knowledge from Cork, Ireland to help customers learn a bit about what was involved in running our Proof of Concept test cases. Come watch at 4:15 today @ Lando 4201A. I will be there as well for Q&A.
  2. Co-presenter on Hadoop Analytics + Enterprise Storage. This one has been my baby for quite a while. It's being primarily presented by the Isilon team, but I have been intimately involved for many months and am one of the few with direct hands-on experience with the combination of these technologies in the field. Isilon's HDFS is unique in the industry, and it's a little tough to wrap your head around in the beginning. At its simplest level, Isilon provides a Highly Available Namenode and Datanode implementation which is distributed over their cluster. This allows you to scale your compute and storage capabilities separately as your Hadoop workload grows. This is especially interesting to customers looking to provide Hadoop as a Service, as they can provide storage capability for multiple users of the infrastructure while allowing them to share the compute infrastructure. If you went to the keynote, you saw a real-world example of this combination. I was actually involved with the real-world version of that implementation. Dr. Don Miner and I went onsite at that particular customer. He wore the hat of MapReduce guy and my hat was Hadoop Infrastructure guy. We showed up on site at 9 AM and had a LIVE Hadoop cluster processing hundreds of GB of XML and log files within 43 minutes. Now, two things helped our time....Don had prewritten the MR job and only had to tweak it on-site, and I leveraged an EXTREMELY early version of an automated Hadoop installer I have been writing called Pachyderm (blog post coming soon). The installer had a couple of issues that forced us to do a couple of steps manually, and we had to configure the Isilon cluster as well. So the 43 minutes is still an amazing feat, as most Hadoop implementations would have taken longer than that just to get the data into HDFS. With Isilon, we just provide an NFS share to the end user; they put the data on that share and it's immediately available to HDFS. In the case of weblogs, you can write those directly to the shares so you have no data movement at all. <sarcasm๐Ÿ˜œ> The Isilon and Hadoop integration was VERY DIFFICULT to enable. </sarcasm> We had to make a single change in the core-site.xml file (see the sketch after this list). I just talked to a customer yesterday that loved the work EMC has done and has already ordered GPHD and Isilon! Also, stay tuned for another blog entry and possibly a VMworld session that leverages these technologies to provide Hadoop in the private cloud. This session is Thurs @ 1:00 in Marcello 4403, or catch the Birds of a Feather session tomorrow at 1:30.
  3. World Wide Hadoop Booth. I have been involved with this EMC black-ops project since it was first created and have been amazed at what our friends in our Solutions Group in Russia have been able to put together in a very short period of time. This is designed to become a Hadoop community project over time, but for now it is an EMC Proof of Concept. We are actively seeking customers to help us further define use cases and requirements so we can expand the project as it's pushed into the public domain. Basically, the idea is that Hadoop clusters are great at processing data that they have locally, but what about data that you can't consolidate because of size, legal compliance, financial laws, etc.? What if you could write a single MapReduce job, choose the geographically dispersed data sets to run it against, and then remotely submit the job? The job is dynamically passed to the appropriate geographies and run against the local Hadoop cluster. The resulting aggregated data sets are then passed back to the management system or local cluster for review. Then, if needed, another final MR job can be run to further consolidate the results. So, what we have created is a global or world-wide Namenode view that allows you to seamlessly view all directory information regardless of location, and we have created a remote job submission system that allows a job to be submitted to multiple job trackers in a single operation (a conceptual sketch of that idea also follows this list). As a Proof of Concept, this was all built in a matter of days leveraging SpringHadoop as the orchestration framework. SpringHadoop provides Hadoop hooks to the Spring Framework and Spring Batch, which allows you to build Spring apps that leverage Hadoop services, or you can use it to implement a batch processing workflow. Stop by and see me during lunch sometime this week.
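
For those curious about the "single change" I mentioned in the Isilon item above, here is a minimal sketch of what the core-site.xml edit can look like: you point the default filesystem at the Isilon cluster rather than at a traditional Namenode. The hostname and port below are hypothetical placeholders, not values from the actual engagement; substitute the name of your own Isilon cluster.

```xml
<!-- core-site.xml: point HDFS clients at the Isilon cluster instead of
     a standalone Namenode. Hostname below is a made-up example. -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://isilon-hdfs.example.com:8020</value>
</property>
```

Because OneFS itself answers the Namenode and Datanode requests, every node in the Isilon cluster can serve HDFS traffic behind that single name.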
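And to make the World Wide Hadoop idea a bit more concrete, here is a conceptual Java sketch of submitting the same MapReduce job to several geographically dispersed JobTrackers using the stock Hadoop 1.x API. To be clear, this is NOT the actual project code (that is built on SpringHadoop and adds the global Namenode view and result aggregation); the JobTracker addresses and paths are hypothetical, and identity map/reduce stands in for the real job.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class WorldWideSubmit {
    public static void main(String[] args) throws Exception {
        // Hypothetical JobTracker endpoints, one per geography.
        List<String> jobTrackers = Arrays.asList(
                "jt.us-east.example.com:8021",
                "jt.eu-west.example.com:8021");

        for (String jobTracker : jobTrackers) {
            JobConf conf = new JobConf(WorldWideSubmit.class);
            conf.setJobName("worldwide-aggregate");

            // Run the SAME job against each geography's LOCAL cluster.
            conf.set("mapred.job.tracker", jobTracker);
            // A real deployment would also point fs.default.name at the
            // geography's local HDFS (or Isilon cluster, as above).

            // Identity map/reduce keeps the sketch self-contained; the
            // real job would do the per-geography aggregation here.
            conf.setMapperClass(IdentityMapper.class);
            conf.setReducerClass(IdentityReducer.class);
            conf.setOutputKeyClass(LongWritable.class);
            conf.setOutputValueClass(Text.class);

            FileInputFormat.setInputPaths(conf, new Path("/data/local"));
            FileOutputFormat.setOutputPath(conf, new Path("/results/partial"));

            // Submit asynchronously and move on; the partial results are
            // pulled back and consolidated by a final MR job later.
            RunningJob job = new JobClient(conf).submitJob(conf);
            System.out.println("Submitted " + job.getID() + " to " + jobTracker);
        }
    }
}
```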
So, besides these duties, I am meeting customers and attending all the Hadoop and UAP sessions so I can listen, provide feedback, and hopefully help EMC better meet customer requirements in the Big Data space. More tomorrow...
