

Finally, if you are leveraging Isilon you can simply mount the HDFS directory as an NFS or SMB share directly on your client. You can do this with native HDFS as well using the new NFSv3 support in HDFS, but that implementation does not scale the way Isilon does. This is a big portion of the value proposition of Isilon with Hadoop: you get an Enterprise-class file-sharing implementation in addition to the availability of the data via HDFS.
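As a quick sketch of what that looks like from a Linux client (the cluster name and mount point below are placeholders, not values from the environment used in this post):
# mount -t nfs isilon.example.com:/ifs/hadoop /mnt/hadoop
# ls -l /mnt/hadoop
Once mounted, the same data that HDFS clients see is available to any ordinary filesystem tool.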

Now, let's take a look at how we gather this same information using the REST APIs available to us. First, we will use the WebHDFS API to retrieve the root directory information. What comes back is a JSON data structure containing the metadata for the root directory of the HDFS filesystem.
# curl -i "http://sandbox:50070/webhdfs/v1/?op=LISTSTATUS&user.name=hdfs"
HTTP/1.1 200 OK
Content-Type: application/json
Date: Mon, 11 Nov 2013 22:01:10 GMT
Content-Length: 756
{"FileStatuses":{"FileStatus":[
{"accessTime":0,"replication":1,"owner":"hdfs","length":220,"permission":"775","blockSize":0,"modificationTime":1384202214000,"type":"DIRECTORY","group":"hdfs","pathSuffix":"user"},
{"accessTime":0,"replication":1,"owner":"hdfs","length":0,"permission":"777","blockSize":0,"modificationTime":1384202297000,"type":"DIRECTORY","group":"hdfs","pathSuffix":"tmp"},
{"accessTime":0,"replication":1,"owner":"mapred","length":0,"permission":"775","blockSize":0,"modificationTime":1384202269000,"type":"DIRECTORY","group":"hdfs","pathSuffix":"mapred"},
{"accessTime":0,"replication":1,"owner":"hdfs","length":144,"permission":"775","blockSize":0,"modificationTime":1384201858000,"type":"DIRECTORY","group":"hdfs","pathSuffix":"apps"}
]}}
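If all you want out of that response are the directory names, a quick pipe through Python tidies it up (purely a client-side convenience, not part of the API):
# curl -s "http://sandbox:50070/webhdfs/v1/?op=LISTSTATUS&user.name=hdfs" | python -c 'import json,sys; print "\n".join(f["pathSuffix"] for f in json.load(sys.stdin)["FileStatuses"]["FileStatus"])'
user
tmp
mapred
apps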
Now, let's take a look at the Isilon API version of that same listing.
# curl -i -k "https://192.168.135.100:8080/namespace/ifs/hadoop/?detail=default"
HTTP/1.1 401 Authorization Required
Date: Tue, 12 Nov 2013 02:05:16 GMT
Server: Apache/2.2.21 (FreeBSD) mod_ssl/2.2.21 OpenSSL/0.9.8x mod_fastcgi/2.4.6
WWW-Authenticate: Basic
Last-Modified: Tue, 26 Jun 2012 18:38:58 GMT
ETag: "6351-31-4c36469527480"
Accept-Ranges: bytes
Content-Length: 49
Content-Type: application/json
{"errors":[{"message":"authorization required"}]}
The first thing you should notice is that because Isilon is an Enterprise-class array, it requires silly little things such as authentication and authorization. Natively, HDFS just assumes you are who you say you are, as shown in the first example (hdfs). Let's try this again, but this time we will pass some credentials with the command. There are multiple ways to do this, but for simplicity I will have the system prompt me for my password:
# curl -i -k -u hdfs "https://192.168.135.100:8080/namespace/ifs/hadoop/?detail=default"
Enter host password for user 'hdfs':
HTTP/1.1 200 Ok
Date: Tue, 12 Nov 2013 02:14:41 GMT
Server: Apache/2.2.21 (FreeBSD) mod_ssl/2.2.21 OpenSSL/0.9.8x mod_fastcgi/2.4.6
Allow: DELETE, GET, HEAD, POST, PUT
Last-Modified: Mon, 11 Nov 2013 20:38:29 GMT
x-isi-ifs-access-control: 0775
x-isi-ifs-spec-version: 1.0
x-isi-ifs-target-type: container
Transfer-Encoding: chunked
Content-Type: application/json
{"children":[{
"group" : "hdfs",
"last_modified" : "Mon, 11 Nov 2013 20:36:54 GMT",
"mode" : "0775",
"name" : "user",
"owner" : "hdfs",
"size" : 220,
"type" : "container"
}
,{
"group" : "hdfs",
"last_modified" : "Mon, 11 Nov 2013 20:38:17 GMT",
"mode" : "0777",
"name" : "tmp",
"owner" : "hdfs",
"size" : 0,
"type" : "container"
}
,{
"group" : "hdfs",
"last_modified" : "Mon, 11 Nov 2013 20:37:49 GMT",
"mode" : "0775",
"name" : "mapred",
"owner" : "mapred",
"size" : 0,
"type" : "container"
}
,{
"group" : "hdfs",
"last_modified" : "Mon, 11 Nov 2013 20:30:58 GMT",
"mode" : "0775",
"name" : "apps",
"owner" : "hdfs",
"size" : 144,
"type" : "container"
}
]}
This time we got what we were expecting: the root directory metadata. Notice that the Isilon REST API returns the data in JSON as well, just in a slightly different format. Now we have seen the files in a directory, but what if you want a higher-level report that gives you a snapshot of your overall utilization? This next command leverages GETCONTENTSUMMARY to get a usage report in HDFS:
# curl -i "http://sandbox:50070/webhdfs/v1/apps?op=GETCONTENTSUMMARY&user.name=hdfs"
HTTP/1.1 200 OK
Content-Type: application/json
Expires: Thu, 01-Jan-1970 00:00:00 GMT
Set-Cookie: hadoop.auth="u=hdfs&p=hdfs&t=simple&e=1384245692918&s=BwXiXAJZu3gmp8gFh0wGGnNtisI=";Path=/
Transfer-Encoding: chunked
Server: Jetty(6.1.26)
{"ContentSummary":{"directoryCount":32,"fileCount":20,"length":120291359,"quota":-1,"spaceConsumed":120291359,"spaceQuota":-1}}
This one is a little tougher to translate directly into the Isilon API. From my Mac's CLI, with the share mounted, I can use standard Linux CLI commands to get some of the data:
$ find apps -type f -follow | wc -l
20
$ du apps | wc -l
32
$ du -k apps
0 apps/webhcat/test
117272 apps/webhcat
0 apps/hbase/data/usertable/4c144d4cbcb051de4297fad89f11f9c8/recovered.edits
2 apps/hbase/data/usertable/4c144d4cbcb051de4297fad89f11f9c8/family
3 apps/hbase/data/usertable/4c144d4cbcb051de4297fad89f11f9c8
0 apps/hbase/data/usertable/.tmp
4 apps/hbase/data/usertable
0 apps/hbase/data/.corrupt
0 apps/hbase/data/.META./1028785192/recovered.edits
3 apps/hbase/data/.META./1028785192/info
1 apps/hbase/data/.META./1028785192/.oldlogs
0 apps/hbase/data/.META./1028785192/.tmp
5 apps/hbase/data/.META./1028785192
6 apps/hbase/data/.META.
0 apps/hbase/data/.oldlogs
0 apps/hbase/data/.logs/sandbox,60020,1384184784780
1 apps/hbase/data/.logs
0 apps/hbase/data/.tmp
0 apps/hbase/data/-ROOT-/70236052/recovered.edits
4 apps/hbase/data/-ROOT-/70236052/info
1 apps/hbase/data/-ROOT-/70236052/.oldlogs
0 apps/hbase/data/-ROOT-/70236052/.tmp
6 apps/hbase/data/-ROOT-/70236052
0 apps/hbase/data/-ROOT-/.tmp
7 apps/hbase/data/-ROOT-
19 apps/hbase/data
19 apps/hbase
46 apps/hive/warehouse/sample_07
46 apps/hive/warehouse/sample_08
92 apps/hive/warehouse
92 apps/hive
117384 apps
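Note that du reports kilobytes of allocated blocks, so it will not exactly match the byte-level length/spaceConsumed figure that GETCONTENTSUMMARY returned. Summing the actual file sizes over the mounted share gets you the byte count instead; a sketch (the awk column assumes standard ls -l output):
$ find apps -type f -follow -exec ls -l {} + | awk '{sum += $5} END {print sum}'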
If you want an equivalent report via the Isilon REST API, you would need to write a script to gather the information and present it in the proper format. This is because the Isilon REST API is designed to process the details of one object or container at a time. So I can get the information for apps, but not the information about the objects within the containers (directories) beneath apps. For example, this quick Python script I wrote gathers the number of files and directories in a path and the amount of space consumed, using the Isilon API. I hard-coded a lot of it for speed of coding, but you can easily get the idea of how to make it more general purpose.
import urllib2
import base64
import simplejson

def Requester(url):
    # GET a container listing from the Isilon namespace API using HTTP
    # Basic auth, and return the parsed JSON. (Old urllib2 does not verify
    # the self-signed cert; newer Pythons may need an SSL context.)
    param = "?detail=default"
    request = urllib2.Request(url + param)
    base64string = base64.encodestring('%s:%s' % ("hdfs", "password"))[:-1]
    request.add_header("Authorization", "Basic %s" % base64string)
    return simplejson.load(urllib2.urlopen(request))

if __name__ == '__main__':
    files = list()
    dirs = list()
    baseUrl = "https://192.168.135.100:8080/namespace/ifs/hadoop/apps/"

    # Sort the starting container's children into files and directories.
    children = Requester(baseUrl)['children']
    for child in children:
        if child['type'] == 'object':
            files.append(child)
        else:
            dirs.append(child)

    # Walk the tree: appending to dirs while iterating over it means each
    # newly discovered container gets visited in turn.
    for hDir in dirs:
        url = baseUrl + hDir['name']
        children = Requester(url)['children']
        for child in children:
            if child['type'] == 'object':
                files.append(child)
            else:
                # Store the name as a path relative to baseUrl so the
                # container can be fetched on a later pass.
                child['name'] = str(hDir['name']) + "/" + str(child['name'])
                dirs.append(child)

    print len(files)        # fileCount
    print len(dirs) + 1     # directoryCount (+1 for apps itself)
    totalSize = 0
    for hFile in files:
        totalSize = totalSize + hFile['size']
    print "SpaceConsumed: " + str(totalSize)
Open / Read / Download
Below is an example of using the WebHDFS API natively to read a file. In this particular case, you have to add -L to the curl statement, because WebHDFS returns a 307 HTTP redirect to point you at a datanode that will actually serve the data. If you want to save the data to a file, you can just redirect it into one and then clean up the HTTP header info in the first few lines.
# curl -i -L "http://192.168.135.30:50070/webhdfs/v1/apps/hive/warehouse/sample_07/sample_07.csv?op=OPEN&user.name=hue&offset=0&doas=hdfs" | more
HTTP/1.1 307 Temporary Redirect
Location: http://192.168.135.31:50070/webhdfs/v1/apps/hive/warehouse/sample_07/sample_07.csv?op=OPEN&user.name=hue&offset=0&doas=hdfs
Date: Wed, 13 Nov 2013 12:19:53 GMT
Content-Length: 0
HTTP/1.1 200 OK
Content-Type: text/plain
Date: Wed, 13 Nov 2013 12:19:53 GMT
Transfer-Encoding: chunked
00-0000 All Occupations 134354250 40690
11-0000 Management occupations 6003930 96150
11-1011 Chief executives 299160 151370
11-1021 General and operations managers 1655410 103780
11-1031 Legislators 61110 33880
11-2011 Advertising and promotions managers 36300 91100
11-2021 Marketing managers 165240 113400
11-2022 Sales managers 322170 106790
11-2031 Public relations managers 47210 97170
11-3011 Administrative services managers 239360 76370
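If you want the file saved locally rather than paged to the screen, dropping -i (so no headers land in the file) and letting curl write it out with -o avoids the cleanup step entirely:
# curl -L -s -o sample_07.csv "http://192.168.135.30:50070/webhdfs/v1/apps/hive/warehouse/sample_07/sample_07.csv?op=OPEN&user.name=hue&offset=0&doas=hdfs"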
With Isilon we have a couple of options. You can mount the directory as a file share and use standard OS commands to read, add, or remove files, or simply drag and drop them. This is obviously the easiest route and once again demonstrates the power of the Isilon/HDFS combination. But if you prefer to use a REST API to get the contents of a file, the following command gets you the same results as using WebHDFS:
# curl -i -k -u hdfs:password https://192.168.135.101:8080/namespace/ifs/hadoop/apps/hive/warehouse/sample_07/sample_07.csv
HTTP/1.1 200 Ok
Date: Wed, 13 Nov 2013 08:30:17 GMT
Server: Apache/2.2.21 (FreeBSD) mod_ssl/2.2.21 OpenSSL/0.9.8x mod_webkit2/1.0 mod_fastcgi/2.4.6
Allow: DELETE, GET, HEAD, POST, PUT
Last-Modified: Wed, 13 Nov 2013 05:33:46 GMT
x-isi-ifs-access-control: 0777
x-isi-ifs-spec-version: 1.0
x-isi-ifs-target-type: object
Vary: Accept-Encoding
Transfer-Encoding: chunked
Content-Type: text/plain
00-0000 All Occupations 134354250 40690
11-0000 Management occupations 6003930 96150
11-1011 Chief executives 299160 151370
11-1021 General and operations managers 1655410 103780
11-1031 Legislators 61110 33880
11-2011 Advertising and promotions managers 36300 91100
11-2021 Marketing managers 165240 113400
11-2022 Sales managers 322170 106790
11-2031 Public relations managers 47210 97170
11-3011 Administrative services managers 239360 76370
11-3021 Computer and information systems managers 264990 113880
11-3031 Financial managers 484390 106200
11-3041 Compensation and benefits managers 41780 88400
11-3042 Training and development managers 28170 90300
11-3049 Human resources managers, all other 58100 99810
11-3051 Industrial production managers 152870 87550
11-3061 Purchasing managers 65600 90430
11-3071 Transportation, storage, and distribution managers 92790 81980
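Both APIs can also hand back just a slice of a file, which is handy for peeking at large datasets. WebHDFS accepts offset and length parameters on OPEN; on the Isilon side, a standard HTTP Range header should do the same job (I am assuming byte-range support here based on the Accept-Ranges: bytes header the web server advertises, rather than quoting the docs):
# curl -L "http://192.168.135.30:50070/webhdfs/v1/apps/hive/warehouse/sample_07/sample_07.csv?op=OPEN&user.name=hdfs&offset=0&length=1024"
# curl -k -u hdfs:password -H "Range: bytes=0-1023" "https://192.168.135.101:8080/namespace/ifs/hadoop/apps/hive/warehouse/sample_07/sample_07.csv"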
Now that you have a general idea of how this works, rather than show the output of each and every command, I have listed a couple of the other possible WebHDFS commands and their Isilon REST API equivalents, as well as links to the documentation:
- Creating / Uploading a File (see the note on WebHDFS's two-step upload after this list):
curl -i -d @/tmp/sample_07.csv -X PUT "http://192.168.135.30:50070/webhdfs/v1/user/hdfs/sample_07.csv?op=CREATE"
Isilon REST
curl -i -k -u hdfs:password -H "x-isi-ifs-target-type:object" -d @/tmp/sample_07.csv -X PUT "https://192.168.135.101:8080/namespace/ifs/hadoop/user/hdfs/sample_07.csv"
- Make a Directory
curl -i -X PUT "http://192.168.135.30:50070/webhdfs/v1/user/hdfs/test?op=MKDIRS”
Isilon REST
curl -i -k -u hdfs:password -H "x-isi-ifs-target-type:container" -X PUT "https://192.168.135.101:8080/namespace/ifs/hadoop/user/hdfs/test”
- Rename a File/Directory
curl -i -X PUT "http://192.168.135.30:50070/webhdfs/v1/user/hdfs/sample_07.csv?op=RENAME&destination=/user/hdfs/sample_08.csv”
Isilon REST
curl -i -k -u hdfs:password -H "x-isi-ifs-set-location:/namespace/ifs/hadoop/user/hdfs/sample_07.csv" -X POST "https://192.168.135.101:8080/namespace/ifs/hadoop/user/hdfs/sample_08.csv”
If what you are looking for isn’t shown, you can use the Isilon API Documentation to help out, or give me a shout at @dbbaskette. There are lots of options; for instance, in Mystique I create a login cookie and resubmit it with each call to ease logins, whereas for this blog I took the simple approach of logging in fresh for each command issued.
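If you want to try the cookie approach yourself, something along these lines should work. The /session/1/session endpoint comes from the OneFS API documentation; verify it against your OneFS version before relying on it:
# curl -k -c cookies.txt -H "Content-Type: application/json" -X POST -d '{"username":"hdfs","password":"password","services":["namespace"]}' "https://192.168.135.100:8080/session/1/session"
# curl -k -b cookies.txt "https://192.168.135.100:8080/namespace/ifs/hadoop/?detail=default"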