
BeastHD–Benchmarking and Automated Stress Testing for Hadoop



This project came about from the many, many benchmarks I have had to run for internal and customer testing. I was constantly doing the same things over and over just to get the tests running and then collect the results. HiBench from Intel went a long way toward scripting the process, but it was really just a bunch of custom scripts that kicked off particular tests in a particular way. What I wanted was an application that would let me run a set of preconfigured tests in a certain way, but also let me add tests to the mix over time through some simple configuration files. BeastHD was born from those ideas and grew into something much bigger.

So, what is it exactly? BeastHD is an application that lets you create batch jobs from a set of benchmarks and kick them off, all via a simple REST interface. If you want even simpler repetitive configurations, you can just change the defaults in the benchmark configuration file.

The app leverages a couple of different technologies to make this all happen. First, it's Java 7 based. This gave me simple access to the Hadoop APIs, so that many things can be easily discovered and automated by diving into the Hadoop configuration and gaining access to things like the Job object. Second, I leveraged Spring Data Hadoop technology to build the benchmark definition files. For example, for TestDFSIO:

<context:property-placeholder
    location="./resources/TestDFSIO/TestDFSIO.properties"
    ignore-resource-not-found="true" ignore-unresolvable="true" />
<hdp:configuration />
<hdp:tool-runner id="TestDFSIOJob" tool-class="org.apache.hadoop.fs.TestDFSIO"
    jar="file://${jar}">
    <hdp:arg value="-${test}" />
    <hdp:arg value="-nrFiles" />
    <hdp:arg value="${nrFiles}" />
    <hdp:arg value="-fileSize" />
    <hdp:arg value="${fileSize}" />
    <hdp:arg value="-resFile" />
    <hdp:arg value="${logPath}/TestDFSIO-${test}-${timestamp}.out" />
</hdp:tool-runner>


This bit of XML has three distinct sections, all of which are required.


  1. Context: Property-Placeholder: This section is where the default properties are defined for this particular test. In this case, we are giving it a properties file to load that contains any needed variables and their default values.

     class=org.apache.hadoop.fs.TestDFSIO
     workingDir=/beasthd/TestDFSIO
     nrFiles=4
     fileSize=100
     benchmarkClass=TestDFSIO
     jar=/usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.0.5-alpha-gphd-2.1.0.0-tests.jar
     test=write

     Any of these values can be overridden later in the process, but we define them here so that just calling TestDFSIO without any other information will launch a 4-file DFSIO write.
  2. Configuration: This section is intentionally left blank. It provides a placeholder into which the true Hadoop configuration (*-site.xml) files can be loaded.
  3. Runner section: This section is where the actual benchmark is defined. In this particular example, we leverage Tool-Runner from Spring Hadoop, which provides a way to run CLI-based Hadoop tests from within our Java code. You can also leverage Jar-Runner here to run any Hadoop tests that do not implement the Tool interface.
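For a benchmark that is launched from a runnable jar rather than through the Tool interface, the runner section might look like the sketch below. It uses the jar-runner element from Spring for Apache Hadoop in the same style as the TestDFSIO definition. The bean id, the `teragen` program name handed to the examples jar, and the `${rows}` and `${outputdir}` placeholders are illustrative assumptions, not BeastHD's actual Teragen definition.

```xml
<!-- Hypothetical sketch: jar, rows and outputdir are assumed to be
     supplied by a matching properties file, as with TestDFSIO. -->
<hdp:jar-runner id="TeragenJob" jar="file://${jar}">
    <!-- First argument selects the program inside the examples jar -->
    <hdp:arg value="teragen" />
    <!-- Number of rows to generate, then the HDFS output directory -->
    <hdp:arg value="${rows}" />
    <hdp:arg value="${outputdir}" />
</hdp:jar-runner>
```

As with the tool-runner version, each command-line argument becomes its own `hdp:arg` element, so defaults can live in the properties file and be overridden per run.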


So, that’s all there is to adding a new benchmark to the tool: create an XML file and a properties file that define it. Currently, I have implemented TestDFSIO, Teragen/Terasort, NNBench, MRBench, and SWIM. One other nice feature is the ability to define, within the XML, a simple Groovy script that takes care of any HDFS pre- or post-processing that normally needs to occur, such as clean-up. This one removes the HDFS directory used for the TestDFSIO output, so that you can run it over and over without changing the path every time.

<hdp:script id="hdfsClean" language="groovy">
    outputPath = "${outputdir}"
    if (fsh.test(outputPath)) {
        fsh.rmr(outputPath)
    }
</hdp:script>
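
One way to have this clean-up run automatically before each benchmark is to reference the script from the runner. Spring for Apache Hadoop runners accept a `pre-action` attribute naming a bean to invoke first. Whether BeastHD wires its scripts exactly this way is an assumption here, and the argument list is abbreviated from the TestDFSIO definition above.

```xml
<!-- Sketch only: assumes the hdfsClean script bean is defined in the same context -->
<hdp:tool-runner id="TestDFSIOJob" tool-class="org.apache.hadoop.fs.TestDFSIO"
    jar="file://${jar}" pre-action="hdfsClean">
    <hdp:arg value="-${test}" />
    <hdp:arg value="-nrFiles" />
    <hdp:arg value="${nrFiles}" />
</hdp:tool-runner>
```

A `post-action` attribute works the same way for steps that should run after the benchmark finishes.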

 

 
