
Project Mystique


REST APIs are becoming ubiquitous these days, because users expect easy, programmatic access to just about any piece of technology. Hadoop is no exception. Apache Hadoop provides WebHDFS to expose HDFS via REST API calls. You can not only query information, but also upload and download data through simple calls such as:
http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=GETFILESTATUS
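As a quick illustration, here is roughly what that status call looks like from a bare-bones Java client (the host and path are placeholders for your own cluster, and the JSON in the comment is abbreviated):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal sketch: ask WebHDFS for a file's status and print the raw JSON reply.
public class WebHdfsStatus {
    public static void main(String[] args) throws Exception {
        String host = "namenode.example.com";           // placeholder namenode host
        String path = "/user/hue/sample.txt";           // placeholder HDFS path
        URL url = new URL("http://" + host + ":50070/webhdfs/v1" + path + "?op=GETFILESTATUS");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);               // e.g. {"FileStatus":{"type":"FILE","length":1024,...}}
            }
        }
    }
}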
 
One application that depends quite heavily on WebHDFS is HUE (Hadoop User Experience). It provides a web-based interface to Hive, Pig, and a File Browser for HDFS, and it is developed and maintained by Cloudera (thanks @templedf of Cloudera for pointing out the oversight). If you are new to Hadoop, the Hortonworks Sandbox tutorials are all driven via HUE and are a nice introduction both to Hadoop functionality and to getting a feel for HUE. HUE is a Python-based app designed to improve the overall Hadoop experience.

EMC Corporation has been hard at work not only developing native REST APIs for its platforms, but also letting users overlay something like ViPR to provide a universal REST API regardless of the backend storage. This work became more important when we started on the Hadoop Starter Kit, because Isilon is a wholesale HDFS replacement: it replaces the actual HDFS datastore while still presenting the same HDFS API for reading and writing your data. In standard HDFS, the namenode serves WebHDFS over its HTTP management port, so if you aren't running a namenode, you don't get WebHDFS. That's where the REST APIs come into play. I started looking at the WebHDFS API and the Isilon Namespace API, and they were very comparable, but the actual formatting of requests and responses was, unsurprisingly, very different. To solve that little issue, I wrote a Java app that runs a Grizzly web server on port 50070 (the WebHDFS port). Enter Project Mystique. The application accepts all the standard WebHDFS API calls and translates them into Isilon REST API calls. It issues the call to the Isilon cluster, gets the response, modifies it if needed, and returns it to the client as a WebHDFS response. GET/PUT/DELETE/etc. all work and are translated on the fly, so anyone who uses WebHDFS (including HUE) can keep doing so even with Isilon as the backend data store.
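To give a feel for the shape of the translation, here is a stripped-down sketch of the idea. This is not the real Mystique code, and the Isilon namespace URL, port, and "?metadata" query are just stand-ins for the actual OneFS calls:

import org.glassfish.grizzly.http.server.HttpHandler;
import org.glassfish.grizzly.http.server.HttpServer;
import org.glassfish.grizzly.http.server.Request;
import org.glassfish.grizzly.http.server.Response;

// Sketch: a Grizzly handler bound to the WebHDFS port that rewrites an incoming
// WebHDFS URL into an Isilon-style namespace URL.  The real application issues
// the Isilon call, reshapes the JSON, and streams it back; here we just echo
// the translated URL so the mapping is visible.
public class MystiqueSketch {

    static String toIsilonUrl(String webHdfsPath, String op) {
        String isilonHost = "isilon.example.com";                      // placeholder cluster address
        String nsPath = webHdfsPath.replaceFirst("^/webhdfs/v1", "");  // strip the WebHDFS prefix
        if ("GETFILESTATUS".equalsIgnoreCase(op)) {
            return "https://" + isilonHost + ":8080/namespace" + nsPath + "?metadata";
        }
        return "https://" + isilonHost + ":8080/namespace" + nsPath;
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.createSimpleServer(null, 50070);
        server.getServerConfiguration().addHttpHandler(new HttpHandler() {
            @Override
            public void service(Request request, Response response) throws Exception {
                String op = request.getParameter("op");
                String target = toIsilonUrl(request.getRequestURI(), op);
                response.setContentType("text/plain");
                response.getWriter().write("Would call: " + target + "\n");
            }
        }, "/webhdfs/v1");
        server.start();
        System.out.println("Listening on 50070; press Enter to stop.");
        System.in.read();
        server.shutdownNow();
    }
}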

Along the way, I ran into a couple of challenges. The first was that WebHDFS supports an append operation that the Isilon Namespace API does not. HUE allows you to upload files to HDFS, and it leverages the append operation to build the file in 64 MB chunks. So how do you append to a file when the API doesn't support appends? My first thought was to append via Java code, but that meant re-reading the base file every time I added another 64 MB. That works fine for a 200 MB file, but once you go bigger, the constant re-read overhead kills performance. My final solution was to leverage another REST API within Isilon that does support appends: WebDAV. So for file PUTs, the translator actually calls a different API to perform the task, but the end user never notices a difference.
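The shape of that workaround looks roughly like this: stream each 64 MB chunk straight to the file's WebDAV URL instead of re-reading and rewriting the whole file. The endpoint and the Content-Range header below are just stand-ins for whatever append mechanism your Isilon version actually exposes, so treat this as a sketch of the approach rather than the literal call:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Rough sketch of chunked uploads to a WebDAV-style endpoint.  The URL and the
// Content-Range header are assumptions standing in for the real append call.
public class ChunkedPut {

    static void putChunk(String fileUrl, byte[] chunk, long offset) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(fileUrl).openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        // Assumed append mechanism: tell the server where this chunk lands in the file.
        conn.setRequestProperty("Content-Range",
                "bytes " + offset + "-" + (offset + chunk.length - 1) + "/*");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(chunk);
        }
        if (conn.getResponseCode() >= 300) {
            throw new IllegalStateException("Chunk upload failed: HTTP " + conn.getResponseCode());
        }
    }
}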

All in all, it was a fun project that leveraged a lot of technology that was new to me.
Now for a quick walk-through of how it works:

1) We launch the app and the properties file is created.  This tells the app how to communicate with the backend Isilon.
2) The app then launches the web server on port 50070. As long as HUE is pointed at the server running the app, that's it from a user perspective. There are a few permission settings that might need to be tweaked on an as-needed basis, depending on how the Isilon is configured.
3) Mystique listens on port 50070 for webHDFS calls. Based on the call it receives, Mystique's webHDFS processing engine calls the Isilon processing engine, which takes the information from the webHDFS call and morphs it into an Isilon REST API call. It issues that call and gets back the response object.
4) If the response object is file or directory information, an Isilon object is instantiated by parsing the JSON response. This object has a convert method that turns the information back into JSON, but this time in webHDFS format (a toy version of this conversion is sketched after this list).
5) The response object (including the converted info from #4) is then returned to the webHDFS engine, then back to the Grizzly instance, and finally to the client.
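To make step 4 a little more concrete, here is a toy version of that convert idea. The input parameters are stand-ins for whatever the parsed Isilon object actually carries, and the permission, replication, and block-size values are just illustrative defaults, since Isilon doesn't track them the way HDFS does:

// Toy converter: take a few pieces of file metadata and emit JSON shaped like
// a WebHDFS GETFILESTATUS response.
public class FileStatusConverter {

    public static String toWebHdfsJson(String name, boolean isDirectory, long length,
                                       String owner, String group, long modificationTime) {
        String type = isDirectory ? "DIRECTORY" : "FILE";
        return "{\"FileStatus\":{"
                + "\"pathSuffix\":\"" + name + "\","
                + "\"type\":\"" + type + "\","
                + "\"length\":" + length + ","
                + "\"owner\":\"" + owner + "\","
                + "\"group\":\"" + group + "\","
                + "\"permission\":\"755\","            // illustrative default
                + "\"replication\":1,"                 // no HDFS replication factor on Isilon
                + "\"blockSize\":134217728,"           // nominal 128 MB block size
                + "\"accessTime\":" + modificationTime + ","
                + "\"modificationTime\":" + modificationTime
                + "}}";
    }

    public static void main(String[] args) {
        System.out.println(toWebHdfsJson("sample.txt", false, 1024L, "hue", "hadoop", 1384000000000L));
    }
}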

There are lots of twists and turns in the actual code, but that is a pretty fair snapshot of how it works. For instance, when writing a file to HDFS, WebHDFS first returns the address of a datanode to write to as an HTTP 307 redirect. In cases like this, Mystique returns the redirect but passes back the same URL the user hit in the first request. Isilon has no concept of a datanode, so the redirect simply becomes a no-op.
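In Grizzly terms, that no-op redirect is just a status code and a Location header pointing back at the caller. Again, this is a sketch of the idea rather than the real code:

import org.glassfish.grizzly.http.server.Request;
import org.glassfish.grizzly.http.server.Response;

// Answer a write request with a 307 that points right back at the URL the
// client already used, so the WebHDFS two-step write protocol is satisfied
// even though there is no separate datanode to redirect to.
final class NoOpRedirect {
    static void send(Request request, Response response) {
        String original = request.getRequestURI()
                + (request.getQueryString() != null ? "?" + request.getQueryString() : "");
        response.setStatus(307);
        response.setHeader("Location", original);   // same address, no datanode hop
    }
}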

I am an amateur coder at best; let's call it a hobbyist. I have no illusions that my code is the best way to do this or implemented in the best manner, but I use these projects to exercise the problem-solving portions of my brain, so I consider them a success when I solve the problem and they work. With that said, this code is nowhere close to a production implementation; it is really a stop-gap for testing until Isilon implements webHDFS natively sometime early next year.

I don't have approval to post it for public consumption, but if you are using Isilon HDFS and would like to give it a whirl, I would be glad to send it to you. Just shoot me an email at dan.baskette@emc.com or tweet me at @dbbaskette.



