
Isilon HDFS User Access

I recently posted a blog about using my app Mystique to let you use HUE (WebHDFS) while leveraging Isilon for your HDFS data storage. I had a few questions about the entire system and decided to also approach it from a different angle. That angle is more of "Why would you even use WebHDFS and the HUE File Browser when you have Isilon?" The reality is you really don't need them, because the Isilon platform gives you multiple options for working directly with the files that need to be accessed via Hadoop. Isilon HDFS is implemented as just another protocol into OneFS, so the data stored in OneFS can be accessed via NFS, SMB, HTTP, FTP, and HDFS. This opens up a lot of possibilities and removes the need for some of the traditional tools like WebHDFS, and in some cases Flume, because I can read and write via something like NFS. For example, one customer is leveraging the NFS functionality to write weblogs directly to the share, and Hadoop can then run MapReduce against those logs without having to move the data again. So, for this blog, I am going to show how you can leverage the Isilon REST API in place of WebHDFS when you want remote access to the data stored in HDFS. This is a nice option for shops that have a separate group managing their storage, since using the API only requires permissions in the managed filesystem.

Let's start with the basic case of just viewing the files in a particular directory. HUE front-ends WebHDFS very nicely to provide a graphical file browser for this task. The interface is fairly standard and is very helpful to those who don't wish to use hadoop fs commands to access this information. The Hadoop CLI provides the same information, though, and for most users it is really the go-to place for it.

Finally, if you are leveraging Isilon you can just mount the HDFS directory as an NFS or SMB share directly on your client. You can do this with native HDFS as well using the new NFSv3 support in HDFS, but it is not a scalable implementation the way Isilon is. This is a big part of the value proposition of Isilon with Hadoop: you get an enterprise-class file sharing implementation in addition to the availability of the data via HDFS.


Now, let's take a look at how we gather this same information using the REST APIs available to us. First, we will use the WebHDFS API to retrieve the root directory listing. The information returned is a JSON data structure containing the metadata for each entry in the root of the HDFS filesystem.

# curl -i "http://sandbox:50070/webhdfs/v1/?op=LISTSTATUS&user.name=hdfs"
HTTP/1.1 200 OK
Content-Type: application/json
Date: Mon, 11 Nov 2013 22:01:10 GMT
Content-Length: 756

{"FileStatuses":{"FileStatus":[{"accessTime":0,"replication":1,"owner":"hdfs","length":220,"permission":"775","blockSize":0,"modificationTime":1384202214000,"type":"DIRECTORY","group":"hdfs","pathSuffix":"user"},{"accessTime":0,"replication":1,"owner":"hdfs","length":0,"permission":"777","blockSize":0,"modificationTime":1384202297000,"type":"DIRECTORY","group":"hdfs","pathSuffix":"tmp"},{"accessTime":0,"replication":1,"owner":"mapred","length":0,"permission":"775","blockSize":0,"modificationTime":1384202269000,"type":"DIRECTORY","group":"hdfs","pathSuffix":"mapred"},{"accessTime":0,"replication":1,"owner":"hdfs","length":144,"permission":"775","blockSize":0,"modificationTime":1384201858000,"type":"DIRECTORY","group":"hdfs","pathSuffix":"apps"}]}}
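If you would rather pull this listing from code than from curl, here is a minimal Python 2 sketch using the same sandbox host and hdfs user as the example above (urllib2 and simplejson, which the script later in this post also uses, are assumed to be available):

import urllib2
import simplejson

# List the HDFS root via WebHDFS LISTSTATUS and print a few fields from each entry
url = "http://sandbox:50070/webhdfs/v1/?op=LISTSTATUS&user.name=hdfs"
statuses = simplejson.load(urllib2.urlopen(url))["FileStatuses"]["FileStatus"]
for entry in statuses:
    print entry["type"], entry["pathSuffix"], entry["owner"], entry["permission"]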

Now, let's take a look at the Isilon API version of that command. 

# curl -i -k "https://192.168.135.100:8080/namespace/ifs/hadoop/?detail=default"
HTTP/1.1 401 Authorization Required
Date: Tue, 12 Nov 2013 02:05:16 GMT
Server: Apache/2.2.21 (FreeBSD) mod_ssl/2.2.21 OpenSSL/0.9.8x mod_fastcgi/2.4.6
WWW-Authenticate: Basic
Last-Modified: Tue, 26 Jun 2012 18:38:58 GMT
ETag: "6351-31-4c36469527480"
Accept-Ranges: bytes
Content-Length: 49
Content-Type: application/json
{"errors":[{"message":"authorization required"}]}

The first thing you should notice is that because Isilon is an enterprise-class array, it requires silly little things such as authentication and authorization. Natively, HDFS just assumes you are who you say you are, as shown in the first example (hdfs). Let's try this again, but this time we will pass some credentials with the command. There are multiple ways to do this, but for simplicity I will have the system prompt me for my password:

# curl -i -k -u hdfs "https://192.168.135.100:8080/namespace/ifs/hadoop/?detail=default"
Enter host password for user 'hdfs':
HTTP/1.1 200 Ok
Date: Tue, 12 Nov 2013 02:14:41 GMT
Server: Apache/2.2.21 (FreeBSD) mod_ssl/2.2.21 OpenSSL/0.9.8x mod_fastcgi/2.4.6
Allow: DELETE, GET, HEAD, POST, PUT
Last-Modified: Mon, 11 Nov 2013 20:38:29 GMT
x-isi-ifs-access-control: 0775
x-isi-ifs-spec-version: 1.0
x-isi-ifs-target-type: container
Transfer-Encoding: chunked
Content-Type: application/json

{"children":[{
   "group" : "hdfs",
   "last_modified" : "Mon, 11 Nov 2013 20:36:54 GMT",
   "mode" : "0775",
   "name" : "user",
   "owner" : "hdfs",
   "size" : 220,
   "type" : "container"
}
,{
   "group" : "hdfs",
   "last_modified" : "Mon, 11 Nov 2013 20:38:17 GMT",
   "mode" : "0777",
   "name" : "tmp",
   "owner" : "hdfs",
   "size" : 0,
   "type" : "container"
}
,{
   "group" : "hdfs",
   "last_modified" : "Mon, 11 Nov 2013 20:37:49 GMT",
   "mode" : "0775",
   "name" : "mapred",
   "owner" : "mapred",
   "size" : 0,
   "type" : "container"
}
,{
   "group" : "hdfs",
   "last_modified" : "Mon, 11 Nov 2013 20:30:58 GMT",
   "mode" : "0775",
   "name" : "apps",
   "owner" : "hdfs",
   "size" : 144,
   "type" : "container"
}
]}

This time we got what we were expecting: the root directory metadata. Notice that the Isilon REST API returns the data in JSON as well, just in a slightly different format. Now we have seen the files in a directory, but what if you wanted a higher-level report that gives you a snapshot of your overall utilization? This next command leverages GETCONTENTSUMMARY to get a usage report in HDFS:

# curl -i "http://sandbox:50070/webhdfs/v1/apps?op=GETCONTENTSUMMARY&user.name=hdfs"
HTTP/1.1 200 OK
Content-Type: application/json
Expires: Thu, 01-Jan-1970 00:00:00 GMT
Set-Cookie: hadoop.auth="u=hdfs&p=hdfs&t=simple&e=1384245692918&s=BwXiXAJZu3gmp8gFh0wGGnNtisI=";Path=/
Transfer-Encoding: chunked
Server: Jetty(6.1.26)

{"ContentSummary":{"directoryCount":32,"fileCount":20,"length":120291359,"quota":-1,"spaceConsumed":120291359,"spaceQuota":-1}}

This one is a little tougher to translate directly into the Isilon API. From my Mac CLI (with the Isilon share mounted), I can run standard Linux CLI commands to get some of the data:

$ find apps -type f -follow  | wc -l
      20
$ du apps | wc -l
      32
$ du -k apps
0    apps/webhcat/test
117272    apps/webhcat
0    apps/hbase/data/usertable/4c144d4cbcb051de4297fad89f11f9c8/recovered.edits
2    apps/hbase/data/usertable/4c144d4cbcb051de4297fad89f11f9c8/family
3    apps/hbase/data/usertable/4c144d4cbcb051de4297fad89f11f9c8
0    apps/hbase/data/usertable/.tmp
4    apps/hbase/data/usertable
0    apps/hbase/data/.corrupt
0    apps/hbase/data/.META./1028785192/recovered.edits
3    apps/hbase/data/.META./1028785192/info
1    apps/hbase/data/.META./1028785192/.oldlogs
0    apps/hbase/data/.META./1028785192/.tmp
5    apps/hbase/data/.META./1028785192
6    apps/hbase/data/.META.
0    apps/hbase/data/.oldlogs
0    apps/hbase/data/.logs/sandbox,60020,1384184784780
1    apps/hbase/data/.logs
0    apps/hbase/data/.tmp
0    apps/hbase/data/-ROOT-/70236052/recovered.edits
4    apps/hbase/data/-ROOT-/70236052/info
1    apps/hbase/data/-ROOT-/70236052/.oldlogs
0    apps/hbase/data/-ROOT-/70236052/.tmp
6    apps/hbase/data/-ROOT-/70236052
0    apps/hbase/data/-ROOT-/.tmp
7    apps/hbase/data/-ROOT-
19    apps/hbase/data
19    apps/hbase
46    apps/hive/warehouse/sample_07
46    apps/hive/warehouse/sample_08
92    apps/hive/warehouse
92    apps/hive
117384    apps


If you want an equivalent report via the Isilon REST API, you would need to write a script to gather the information and present it in the proper format. This is because the Isilon REST API is designed to process the details of one object or container at a time: I can get the information for apps, but not the information about the objects within the containers (directories) under apps. For example, this quick Python script I wrote gathers the number of files and directories in a path and the amount of space consumed, using the Isilon API. I hard-coded a lot of it for speed, but you can easily see how to make it more general-purpose.

import urllib2
import base64
import simplejson

# Fetch a single container's listing from the Isilon namespace API and return the parsed JSON
def Requester(url):
    param = "?detail=default"
    request = urllib2.Request(url+param)
    # Credentials are hard-coded for brevity; prompt or read a config file in real code
    base64string = base64.encodestring('%s:%s' % ("hdfs", "password"))[:-1]
    request.add_header("Authorization", "Basic %s" % base64string)
    return simplejson.load(urllib2.urlopen(request))

if __name__ == '__main__':
    files = list()
    dirs = list()
    baseUrl = "https://192.168.135.100:8080/namespace/ifs/hadoop/apps/"

    # Start with the immediate children of the apps directory
    children = Requester(baseUrl)['children']
    for child in children:
        if child['type'] == 'object':
            files.append(child)
        else:
            dirs.append(child)

    # Walk each container found so far; appending to dirs inside the loop turns this
    # into a breadth-first traversal of the whole tree under apps
    for hDir in dirs:
        url = baseUrl+hDir['name']
        children = Requester(url)['children']
        for child in children:
            if child['type'] == 'object':
                files.append(child)
            else:
                child['name'] = str(hDir['name'])+"/"+str(child['name'])
                dirs.append(child)

    print len(files)
    print len(dirs)+1                # +1 counts the apps directory itself
    totalSize = 0
    for hFile in files:
        totalSize = totalSize+hFile['size']
    print "SpaceConsumed: "+str(totalSize)


Open / Read / Download

Below is an example of using the WebHDFS API natively. In this particular case, you have to add -L to the curl statement, because WebHDFS returns a 307 HTTP redirect to point you at a datanode to actually read the data. If you want to save the data to a file, you can just redirect it into one and then clean up the HTTP header info in the first few lines.

# curl -i -L "http://192.168.135.30:50070/webhdfs/v1/apps/hive/warehouse/sample_07/sample_07.csv?op=OPEN&user.name=hue&offset=0&doas=hdfs" | more
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0HTTP/1.1 307 Temporary Redirect
Location: http://192.168.135.31:50070/webhdfs/v1/apps/hive/warehouse/sample_07/sample_07.csv?op=OPEN&user.name=hue&offset=0&doas=hdfs
Date: Wed, 13 Nov 2013 12:19:53 GMT
Content-Length: 0

HTTP/1.1 200 OK
Content-Type: text/plain
Date: Wed, 13 Nov 2013 12:19:53 GMT
Transfer-Encoding: chunked

00-0000     All Occupations     134354250     40690
11-0000     Management occupations     6003930     96150
11-1011     Chief executives     299160     151370
11-1021     General and operations managers     1655410     103780
11-1031     Legislators     61110     33880
11-2011     Advertising and promotions managers     36300     91100
11-2021     Marketing managers     165240     113400
11-2022     Sales managers     322170     106790
11-2031     Public relations managers     47210     97170
11-3011     Administrative services managers     239360     76370 
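If you are scripting this, the same read takes only a few lines of Python 2. This is just a sketch reusing the URL from the example above; urllib2 follows the 307 redirect automatically, so there is no header cleanup to do:

import urllib2

# Read a file via WebHDFS OPEN; urllib2 follows the 307 redirect to the datanode
url = ("http://192.168.135.30:50070/webhdfs/v1/apps/hive/warehouse/"
       "sample_07/sample_07.csv?op=OPEN&user.name=hue&offset=0&doas=hdfs")
data = urllib2.urlopen(url).read()
with open("sample_07.csv", "wb") as out:
    out.write(data)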

With Isilon we have a couple of options. You can mount the directory as a file share and use standard OS commands to add or remove files, or simply drag and drop them. This is obviously the easiest approach and once again demonstrates the power of the Isilon and HDFS combination. But if you prefer to use a REST API to get the contents of a file, you can use the following command, which gets you the same results as WebHDFS.

# curl -i -k -u hdfs:password https://192.168.135.101:8080/namespace/ifs/hadoop/apps/hive/warehouse/sample_07/sample_07.csv
HTTP/1.1 200 Ok
Date: Wed, 13 Nov 2013 08:30:17 GMT
Server: Apache/2.2.21 (FreeBSD) mod_ssl/2.2.21 OpenSSL/0.9.8x mod_webkit2/1.0 mod_fastcgi/2.4.6
Allow: DELETE, GET, HEAD, POST, PUT
Last-Modified: Wed, 13 Nov 2013 05:33:46 GMT
x-isi-ifs-access-control: 0777
x-isi-ifs-spec-version: 1.0
x-isi-ifs-target-type: object
Vary: Accept-Encoding
Transfer-Encoding: chunked
Content-Type: text/plain

00-0000     All Occupations     134354250     40690
11-0000     Management occupations     6003930     96150
11-1011     Chief executives     299160     151370
11-1021     General and operations managers     1655410     103780
11-1031     Legislators     61110     33880
11-2011     Advertising and promotions managers     36300     91100
11-2021     Marketing managers     165240     113400
11-2022     Sales managers     322170     106790
11-2031     Public relations managers     47210     97170
11-3011     Administrative services managers     239360     76370
11-3021     Computer and information systems managers     264990     113880
11-3031     Financial managers     484390     106200
11-3041     Compensation and benefits managers     41780     88400
11-3042     Training and development managers     28170     90300
11-3049     Human resources managers, all other     58100     99810
11-3051     Industrial production managers     152870     87550
11-3061     Purchasing managers     65600     90430
11-3071     Transportation, storage, and distribution managers     92790     81980 
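The same read works from Python as well. Here is a minimal sketch using the path and credentials from the curl example above; note that older Python 2 builds do not verify HTTPS certificates, which is the rough equivalent of curl's -k here:

import urllib2
import base64

# A plain GET on an object path returns the file contents
url = ("https://192.168.135.101:8080/namespace/ifs/hadoop/apps/hive/"
       "warehouse/sample_07/sample_07.csv")
request = urllib2.Request(url)
request.add_header("Authorization",
                   "Basic %s" % base64.encodestring("hdfs:password")[:-1])
print urllib2.urlopen(request).read()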

Now that you have a general idea how this works, rather than show the output of each and every command, I have listed a couple of the other possible WebHDFS commands and their Isilon REST API equivalents, as well as links to the documentation. A short Python sketch of the upload case follows the list.
  • Creating / Uploading a File:
WebHDFS
curl -i -L --data-binary @/tmp/sample_07.csv -X PUT "http://192.168.135.30:50070/webhdfs/v1/user/hdfs/sample_07.csv?op=CREATE"

Isilon REST
curl -i -k -u hdfs:password -H "x-isi-ifs-target-type:object" --data-binary @/tmp/sample_07.csv -X PUT "https://192.168.135.101:8080/namespace/ifs/hadoop/user/hdfs/sample_07.csv"
  • Make a Directory
WebHDFS
curl -i -X PUT "http://192.168.135.30:50070/webhdfs/v1/user/hdfs/test?op=MKDIRS"

Isilon REST
curl -i -k -u hdfs:password -H "x-isi-ifs-target-type:container" -X PUT "https://192.168.135.101:8080/namespace/ifs/hadoop/user/hdfs/test"
  • Rename a File/Directory
WebHDFS
curl -i -X PUT "http://192.168.135.30:50070/webhdfs/v1/user/hdfs/sample_07.csv?op=RENAME&destination=/user/hdfs/sample_08.csv"

Isilon REST 
curl -i -k -u hdfs:password -H "x-isi-ifs-set-location:/namespace/ifs/hadoop/user/hdfs/sample_07.csv" -X POST "https://192.168.135.101:8080/namespace/ifs/hadoop/user/hdfs/sample_08.csv"
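As with the read examples, these commands map naturally to Python if you are scripting against the cluster. Here is a rough sketch of the upload (create) case, mirroring the Isilon curl command above; the URL, header, and hard-coded credentials are the same ones used throughout this post:

import urllib2
import base64

# Upload a local file as a new object via the Isilon namespace API
url = "https://192.168.135.101:8080/namespace/ifs/hadoop/user/hdfs/sample_07.csv"
request = urllib2.Request(url, data=open("/tmp/sample_07.csv", "rb").read())
request.add_header("Authorization",
                   "Basic %s" % base64.encodestring("hdfs:password")[:-1])
request.add_header("x-isi-ifs-target-type", "object")
request.get_method = lambda: "PUT"   # urllib2 defaults to POST when a body is supplied
print urllib2.urlopen(request).getcode()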


If what you are looking for isn't shown, the Isilon API documentation can help you out... or give me a shout at @dbbaskette. There are lots of options: in Mystique, for instance, I create a login cookie and resubmit it with each request to ease logins, but for this blog I took the simple approach of just logging in for each command issued.
