
Project Mystique


REST APIs are becoming ubiquitous these days, because users expect easy, programmatic access to just about any piece of technology. Hadoop is no exception. Apache Hadoop provides WebHDFS to expose HDFS via REST API calls. You can not only query information, but also upload and download data through simple calls such as:
http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=GETFILESTATUS
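As a quick illustration, here is roughly what that status call looks like from a bare-bones Java client (the host and path are placeholders for your own cluster, and the JSON in the comment is abbreviated):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal sketch: ask WebHDFS for a file's status and print the raw JSON reply.
public class WebHdfsStatus {
    public static void main(String[] args) throws Exception {
        String host = "namenode.example.com";           // placeholder namenode host
        String path = "/user/hue/sample.txt";           // placeholder HDFS path
        URL url = new URL("http://" + host + ":50070/webhdfs/v1" + path + "?op=GETFILESTATUS");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);               // e.g. {"FileStatus":{"type":"FILE","length":1024,...}}
            }
        }
    }
}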
 
One application that depends quite heavily on WebHDFS is HUE (Hadoop User Experience). It provides a web-based interface to Hive, Pig, and a File Browser for HDFS, and it is developed and maintained by Cloudera (thanks @templedf of Cloudera for pointing out the oversight). If you are new to Hadoop, the Hortonworks Sandbox tutorials are all driven via HUE and are a nice introduction both to Hadoop functionality and to getting a feel for HUE. HUE is a Python-based app designed to improve the overall Hadoop experience.

EMC Corporation has been hard at work not only developing native REST APIs for its platforms, but also letting users overlay something like ViPR to provide a universal REST API regardless of the backend storage. This work became more important when we started on the Hadoop Starter Kit, because Isilon is a wholesale HDFS replacement: it replaces the actual HDFS datastore while still presenting the same HDFS API for reading and writing your data. In standard HDFS, the namenode serves WebHDFS over its HTTP management port, so if you aren't running a namenode, you don't get WebHDFS. That's where the REST APIs come into play. I started looking at the WebHDFS API and the Isilon Namespace API, and they were very comparable, but the actual formatting of requests and responses was, unsurprisingly, very different. To solve that little issue, I wrote a Java app that runs a Grizzly web server on port 50070 (the WebHDFS port). Enter Project Mystique. The application accepts all the standard WebHDFS API calls and translates them into Isilon REST API calls. It issues the call to the Isilon cluster, gets the response, modifies it if needed, and returns it to the client as a WebHDFS response. GET/PUT/DELETE/etc. all work and are translated on the fly, so anyone who uses WebHDFS (including HUE) can keep doing so even with Isilon as the backend data store.
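To give a feel for the shape of the translation, here is a stripped-down sketch of the idea. This is not the real Mystique code, and the Isilon namespace URL, port, and "?metadata" query are just stand-ins for the actual OneFS calls:

import org.glassfish.grizzly.http.server.HttpHandler;
import org.glassfish.grizzly.http.server.HttpServer;
import org.glassfish.grizzly.http.server.Request;
import org.glassfish.grizzly.http.server.Response;

// Sketch: a Grizzly handler bound to the WebHDFS port that rewrites an incoming
// WebHDFS URL into an Isilon-style namespace URL.  The real application issues
// the Isilon call, reshapes the JSON, and streams it back; here we just echo
// the translated URL so the mapping is visible.
public class MystiqueSketch {

    static String toIsilonUrl(String webHdfsPath, String op) {
        String isilonHost = "isilon.example.com";                      // placeholder cluster address
        String nsPath = webHdfsPath.replaceFirst("^/webhdfs/v1", "");  // strip the WebHDFS prefix
        if ("GETFILESTATUS".equalsIgnoreCase(op)) {
            return "https://" + isilonHost + ":8080/namespace" + nsPath + "?metadata";
        }
        return "https://" + isilonHost + ":8080/namespace" + nsPath;
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.createSimpleServer(null, 50070);
        server.getServerConfiguration().addHttpHandler(new HttpHandler() {
            @Override
            public void service(Request request, Response response) throws Exception {
                String op = request.getParameter("op");
                String target = toIsilonUrl(request.getRequestURI(), op);
                response.setContentType("text/plain");
                response.getWriter().write("Would call: " + target + "\n");
            }
        }, "/webhdfs/v1");
        server.start();
        System.out.println("Listening on 50070; press Enter to stop.");
        System.in.read();
        server.shutdownNow();
    }
}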

Along the way, I ran into a couple of challenges. The first was that WebHDFS supports an append operation that the Isilon Namespace API does not. HUE allows you to upload files to HDFS, and it leverages the append operation to build the file in 64 MB chunks. So how do you append to a file when the API doesn't support appends? My first thought was to append via Java code, but that meant re-reading the base file every time I added another 64 MB. That works fine for a 200 MB file, but once you go bigger, the constant re-read overhead kills performance. My final solution was to leverage another REST API within Isilon that does support appends: WebDAV. So for file PUTs, the translator actually calls a different API to perform the task, but the end user never notices a difference.
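The shape of that workaround looks roughly like this: stream each 64 MB chunk straight to the file's WebDAV URL instead of re-reading and rewriting the whole file. The endpoint and the Content-Range header below are just stand-ins for whatever append mechanism your Isilon version actually exposes, so treat this as a sketch of the approach rather than the literal call:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Rough sketch of chunked uploads to a WebDAV-style endpoint.  The URL and the
// Content-Range header are assumptions standing in for the real append call.
public class ChunkedPut {

    static void putChunk(String fileUrl, byte[] chunk, long offset) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(fileUrl).openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        // Assumed append mechanism: tell the server where this chunk lands in the file.
        conn.setRequestProperty("Content-Range",
                "bytes " + offset + "-" + (offset + chunk.length - 1) + "/*");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(chunk);
        }
        if (conn.getResponseCode() >= 300) {
            throw new IllegalStateException("Chunk upload failed: HTTP " + conn.getResponseCode());
        }
    }
}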

All in all, it was a fun project that leveraged a lot of technology that was new to me.
Now for a quick walk-through of how it works:

1) We launch the app and the properties file is created.  This tells the app how to communicate with the backend Isilon.
2) The app then launches the web server on port 50070. As long as HUE is pointed at the server running the app, that's it from a user perspective. There are a few permission settings that might need to be tweaked on an as-needed basis, depending on how the Isilon is configured.
3) Mystique listens on port 50070 for webHDFS calls. Based on the call it receives, Mystique's webHDFS processing engine calls the Isilon processing engine, which takes the information from the webHDFS call and morphs it into an Isilon REST API call. It issues that call and gets back the response object.
4) If the response object is file or directory information, an Isilon object is instantiated by parsing the JSON response. This object has a convert method that turns the information back into JSON, but this time in webHDFS format (a toy version of this conversion is sketched after this list).
5) The response object (including the converted info from #4) is then returned to the webHDFS engine, then back to the Grizzly instance, and finally to the client.
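To make step 4 a little more concrete, here is a toy version of that convert idea. The input parameters are stand-ins for whatever the parsed Isilon object actually carries, and the permission, replication, and block-size values are just illustrative defaults, since Isilon doesn't track them the way HDFS does:

// Toy converter: take a few pieces of file metadata and emit JSON shaped like
// a WebHDFS GETFILESTATUS response.
public class FileStatusConverter {

    public static String toWebHdfsJson(String name, boolean isDirectory, long length,
                                       String owner, String group, long modificationTime) {
        String type = isDirectory ? "DIRECTORY" : "FILE";
        return "{\"FileStatus\":{"
                + "\"pathSuffix\":\"" + name + "\","
                + "\"type\":\"" + type + "\","
                + "\"length\":" + length + ","
                + "\"owner\":\"" + owner + "\","
                + "\"group\":\"" + group + "\","
                + "\"permission\":\"755\","            // illustrative default
                + "\"replication\":1,"                 // no HDFS replication factor on Isilon
                + "\"blockSize\":134217728,"           // nominal 128 MB block size
                + "\"accessTime\":" + modificationTime + ","
                + "\"modificationTime\":" + modificationTime
                + "}}";
    }

    public static void main(String[] args) {
        System.out.println(toWebHdfsJson("sample.txt", false, 1024L, "hue", "hadoop", 1384000000000L));
    }
}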

There are lots of twists and turns in the actual code, but that is a pretty fair snapshot of how it works. For instance, when writing a file to HDFS, WebHDFS first returns the address of a datanode to write to as an HTTP 307 redirect. In cases like this, Mystique returns the redirect but passes back the same URL the user hit in the first request. Isilon has no concept of a datanode, so the redirect simply becomes a no-op.
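In Grizzly terms, that no-op redirect is just a status code and a Location header pointing back at the caller. Again, this is a sketch of the idea rather than the real code:

import org.glassfish.grizzly.http.server.Request;
import org.glassfish.grizzly.http.server.Response;

// Answer a write request with a 307 that points right back at the URL the
// client already used, so the WebHDFS two-step write protocol is satisfied
// even though there is no separate datanode to redirect to.
final class NoOpRedirect {
    static void send(Request request, Response response) {
        String original = request.getRequestURI()
                + (request.getQueryString() != null ? "?" + request.getQueryString() : "");
        response.setStatus(307);
        response.setHeader("Location", original);   // same address, no datanode hop
    }
}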

I am an amateur coder at best; let's call it a hobbyist. I have no illusions that my code is the best way to do this or implemented in the best manner, but I use these projects to exercise the problem-solving portions of my brain, so I consider them a success when I solve the problem and they work. With that said, this code is nowhere close to a production implementation; it is really a stop-gap for testing until Isilon implements webHDFS natively sometime early next year.

I don't have approval to post it for public consumption, but if you are using Isilon HDFS and would like to give it a whirl, I would be glad to send it to you. Just shoot me an email at dan.baskette@emc.com or tweet me at @dbbaskette.



