
EMC World Day 1

I am officially back in the blogosphere. It will take me a bit to get back into the swing of things, but I wanted to crank out a Day 1 post, so here goes.

This year at EMC World has been a bit of a different experience for me. My role changed a bit this year: I have joined the Greenplum Product Architecture team. As a member of this team, I am tasked with working with Product Engineering to help them design products that meet customer needs, so I am working with field resources a good bit in order to gather those requirements. I am also tasked with driving products out of Engineering and into the field prior to their General Availability, which includes POCs and Betas. EMC World is proving to be a treasure trove of feedback since I am much more of an attendee than a worker bee, but I do have a couple of responsibilities other than learning this week:

  1. Co-presenter on Optimizing Greenplum Database on a VMware Virtualized Infrastructure. So, what's a co-presenter? Well, Kevin O'Leary is the rock star here 🎸🎤...I am the slide flipper 😄! Actually, I collaborated with him and Shane Dempsey over the last couple of months on the "what and how to test" and participated with them as we worked with VMware Performance Engineering to make this a reality. Kevin and Shane did all the heavy lifting, and Kevin has brought a wealth of knowledge from Cork, Ireland to help customers learn a bit about what was involved in running our Proof of Concept test cases. Come watch at 4:15 today @ Lando 4201A. I will be there as well for Q&A.
  2. Co-presenter on Hadoop Analytics + Enterprise Storage. This one has been my baby for quite a while. It's being primarily presented by the Isilon team, but I have been intimately involved for many months and am one of the few that have direct hands-on experience with the combination of these technologies in the field. Isilon's HDFS is unique in the industry, and it's a little tough to wrap your head around in the beginning. At its simplest level, Isilon provides a Highly Available Namenode and Datanode implementation which is distributed over their cluster. This allows you to scale your compute and storage capabilities separately as your Hadoop workload grows. This is especially interesting to customers looking to provide Hadoop as a Service, as they can provide storage capability for multiple users of the infrastructure while allowing them to share the compute infrastructure. If you went to the keynote, you saw a real-world example of this combination. I was actually involved with the real-world version of that implementation. Dr. Don Miner and I went onsite at that particular customer. He wore the hat of MapReduce guy and my hat was Hadoop Infrastructure guy. We showed up on site at 9 AM and had a LIVE Hadoop cluster processing hundreds of GB of XML and log files within 43 minutes. Now, two things helped our time: Don had prewritten the MR job and only had to tweak it on-site, and I leveraged an EXTREMELY early version of an automated Hadoop installer I have been writing called Pachyderm (blog post coming soon). The installer had a couple of issues that forced us to do a couple of steps manually, and we had to configure the Isilon cluster as well. So the 43 minutes is still an amazing feat, as most Hadoop implementations would have taken longer than that just to get the data into HDFS. With Isilon, we just provide an NFS share to the end user; they put the data on it and it's immediately available to HDFS. In the case of weblogs, you can write those directly to the shares so you have no data movement at all. <sarcasm 😜>The Isilon and Hadoop integration was VERY DIFFICULT to enable.</sarcasm> We had to make a single change in the core-site.xml file (a sketch of that change follows this list). I just talked to a customer yesterday that loved the work EMC has done and has already ordered GPHD and Isilon! Also, stay tuned for another blog entry and possibly a VMworld session that leverages these technologies to provide Hadoop in the private cloud. This session is Thursday @ 1:00 in Marcello 4403, or catch the Birds of a Feather session tomorrow at 1:30.
  3. World Wide Hadoop Booth. I have been involved with this EMC black-ops project since it was first created and have been amazed at what our friends in our Solutions Group in Russia have been able to put together in a very short period of time. This is designed to become a Hadoop community project over time, but for now it is an EMC Proof of Concept. We are actively seeking customers to help us further define use cases and requirements so we can expand the project as it's pushed into the public domain. Basically, the idea is that Hadoop clusters are great at processing data that they have locally, but what about data that you can't consolidate because of size, legal compliance, financial laws, etc.? What if you could write a single MapReduce job, choose the geographically dispersed data sets to run it against, and then remotely submit the job? The job is dynamically passed to the appropriate geographies and run against the local Hadoop cluster in each. The resulting aggregated data sets are then passed back to the management system or local cluster for review. Then, if needed, a final MR job can be run to further consolidate the results. So, what we have created is a global, world-wide Namenode view that allows you to seamlessly view all directory information regardless of location, and a remote job submission system that allows a job to be submitted to multiple JobTrackers in a single operation (a rough sketch of what that looks like follows this list). As a Proof of Concept, this was all built in a matter of days leveraging Spring Hadoop as the orchestration framework. Spring Hadoop provides Hadoop hooks to the Spring Framework and Spring Batch, which lets you build Spring apps that leverage Hadoop services or use it to implement a batch processing workflow. Stop by and see me during lunch sometime this week.
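For anyone curious, here is a minimal sketch of that single core-site.xml change mentioned in item 2. The hostname is a placeholder for the Isilon cluster's SmartConnect zone name, and the property shown is the Hadoop 1.x-style fs.default.name; the whole trick is simply pointing the filesystem URI at Isilon instead of a traditional Namenode:

    <!-- core-site.xml: point HDFS at the Isilon cluster instead of a dedicated Namenode -->
    <!-- "isilon.example.com" is a placeholder for the SmartConnect zone name -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://isilon.example.com:8020</value>
    </property>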
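And to make the World Wide Hadoop idea from item 3 a bit more concrete, here is a rough sketch of multi-cluster job submission using the plain Hadoop 1.x Java API rather than the actual Spring Hadoop wiring we used in the Proof of Concept. The cluster addresses, paths, and identity map/reduce classes are stand-ins for illustration; the point is just that the same job definition is retargeted at each geography's Namenode and JobTracker and submitted locally:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class WorldWideSubmit {
        // Placeholder Namenode/JobTracker pairs, one per geography.
        static final String[][] CLUSTERS = {
            { "hdfs://us-namenode:8020",   "us-jobtracker:8021"   },
            { "hdfs://emea-namenode:8020", "emea-jobtracker:8021" },
        };

        public static void main(String[] args) throws Exception {
            for (String[] cluster : CLUSTERS) {
                JobConf conf = new JobConf(WorldWideSubmit.class);
                conf.setJobName("worldwide-aggregate");

                // Retarget the same job at this geography's local cluster.
                conf.set("fs.default.name", cluster[0]);
                conf.set("mapred.job.tracker", cluster[1]);

                // Identity map/reduce stands in for the real analytic job.
                conf.setMapperClass(IdentityMapper.class);
                conf.setReducerClass(IdentityReducer.class);
                conf.setOutputKeyClass(LongWritable.class);
                conf.setOutputValueClass(Text.class);

                FileInputFormat.setInputPaths(conf, new Path("/data/local"));
                FileOutputFormat.setOutputPath(conf, new Path("/results/aggregated"));

                // Submit asynchronously; the aggregated output is pulled back afterward.
                new JobClient(conf).submitJob(conf);
            }
        }
    }

Each cluster's aggregated output can then be copied back to a central location and, as described above, run through one final MR job to consolidate the results.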
So, besides these duties, I am meeting with customers and attending all of the Hadoop and UAP sessions to listen, so I can provide feedback and hopefully help EMC better meet customer requirements in the Big Data space. More tomorrow...
