PivotalHD - Decommissioning Slave Nodes

Continuing on the theme of "So easy that Hulk could do it", I recently wrote this document as a Pivotal blog piece, but I seem to have outrun the launch of the new Pivotal Technical Blog. So it was decided to add the content to the product documentation while the blog remains in a "coming soon" state. Nothing earth-shattering here, just some best practices for taking slave nodes out of a cluster in the proper manner.



Decommissioning, Repairing, or Replacing Hadoop Slave Nodes

Decommissioning Hadoop Slave Nodes

The Hadoop distributed scale-out cluster-computing framework was inherently designed to run on commodity hardware with a typical JBOD configuration (just a bunch of disks; a disk configuration where individual disks are accessed directly by the operating system without the need for RAID). The idea behind it relates not only to cost, but also to fault tolerance: nodes (machines) or disks are expected to fail occasionally without bringing the cluster down. For these reasons, Hadoop administrators are often tasked with decommissioning, repairing, or even replacing nodes in a Hadoop cluster.

Decommissioning slave nodes is a process used to prevent data loss when you need to shut down or remove these nodes from a Pivotal HD cluster. For instance, if multiple nodes need to be taken down, there is a possibility that all of the replicas of one or more data blocks live on those nodes. If the nodes are simply taken down without preparation, those blocks will no longer be available to the active nodes in the cluster, so the files that contain them will be marked as corrupt and will appear as unavailable.

Hadoop administrators may also want to decommission nodes to shrink an existing cluster or to proactively remove nodes. Decommissioning is not an instant process, since it requires replicating all of the blocks on the decommissioned node(s) to active nodes that will remain in the cluster. Decommissioning should only be used when more than one node needs to be taken down for maintenance, because it evacuates the blocks from the targeted hosts and can affect both data balance and data locality for Hadoop and higher-level services, such as HAWQ (see HAWQ Considerations).

DataNode Decommission

The procedures below assume NameNode HA is configured (which is a Pivotal HD best practice); if it is not, simply skip the additional steps for the Standby NameNode.

It is recommended best practice to run a filesystem check on HDFS to verify the filesystem is healthy before you proceed with decommissioning any nodes.

gpadmin# sudo -u hdfs hdfs fsck /
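A quick way to confirm the overall result is to filter the fsck summary for its key fields (an illustrative check; the exact field names can vary slightly between Hadoop releases):

gpadmin# sudo -u hdfs hdfs fsck / | grep -E "Status:|Corrupt blocks|Under-replicated"

The filesystem should report a HEALTHY status with no corrupt blocks before you proceed.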

On the Active NameNode:

·      Edit the /etc/gphd/hadoop/conf/dfs.exclude file and add the hostnames of the DataNodes to be removed (separated by newline characters). Make sure you use the fully qualified domain name (FQDN) for each hostname; see the example after the refresh command below.
·      Instruct the Active NameNode to refresh its node list by re-reading the .exclude and .include files.

gpadmin# sudo -u hdfs hdfs dfsadmin -fs hdfs://<active_namenode_fqdn> -refreshNodes
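For example, a dfs.exclude file that removes two DataNodes might simply contain the following (the hostnames shown are placeholders):

sdw12.example.local
sdw13.example.local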

On the Standby NameNode:

·      Edit the /etc/gphd/hadoop/conf/dfs.exclude file and add the hostnames of the DataNodes to be removed (separated by newline characters). Make sure you use the FQDN for each hostname.
·      Instruct the Standby NameNode to refresh its node list by re-reading the .exclude and .include files.

gpadmin# sudo -u hdfs hdfs dfsadmin -fs hdfs://<standby_namenode_fqdn> -refreshNodes

·      Monitor decommission progress with the NameNode WebUI (http://<active_namenode_host>:50070) by navigating to the Decommissioning Nodes page. You can also monitor the status from the command line by executing one of the following commands (verbose/concise) on any NameNode or DataNode in the cluster:

gpadmin# sudo -u hdfs hdfs dfsadmin -report
gpadmin# sudo -u hdfs hdfs dfsadmin -report | grep -B 2 Decommission
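If you prefer to watch from a shell, a simple loop can poll the report until no nodes remain in the "Decommission in progress" state (a minimal sketch; the exact status string and polling interval may need adjusting for your release):

gpadmin# while sudo -u hdfs hdfs dfsadmin -report | grep -q "Decommission in progress"; do sleep 60; done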

YARN NodeManager Decommission

Use the following procedure if YARN NodeManager daemons are running on the nodes being decommissioned. This process is almost immediate and only requires notifying the ResourceManager that the excluded nodes are no longer available for use.

On the YARN ResourceManager host machine:

·      Edit the /etc/gphd/hadoop/conf/yarn.exclude file and add the NodeManager hostnames to be removed (separated by newline characters). Make sure you use the FQDN for each hostname.
·      Instruct the ResourceManager to refresh its node list by re-reading the .exclude and .include files:

gpadmin# sudo -u yarn yarn rmadmin -refreshNodes

·      The decommission state can be verified via the ResourceManager WebUI (http://<resource_manager_host>:8088) or from the command line by executing the following command on the ResourceManager host:

gpadmin# sudo -u yarn yarn node -list
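Decommissioned NodeManagers drop out of the default listing; if your Hadoop release supports the -all flag, their state can be confirmed explicitly (an illustrative check):

gpadmin# sudo -u yarn yarn node -list -all | grep -i DECOMMISSIONED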

Slave Node Shutdown Procedures

Next, the slave processes running on the newly decommissioned nodes need to be shut down using the Pivotal HD ICM CLI.
Create a text file (for example, hostfile.txt) containing the hostnames that have been decommissioned (separated by newline characters). Make sure you use the FQDN for each hostname.
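For example, the host file could be built as follows (the hostnames shown are placeholders):

gpadmin# echo "sdw12.example.local" > hostfile.txt
gpadmin# echo "sdw13.example.local" >> hostfile.txt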

·      If the hosts are HDFS DataNodes:

gpadmin# icm_client stop -l <clustername> -r datanode -o <hostfile.txt>

·      If the hosts are YARN NodeManagers:

gpadmin# icm_client stop -l <clustername> -r yarn-nodemanager -o <hostfile.txt>

·      If the hosts are HBase RegionServers:

It is preferable to use the graceful_stop.sh script that HBase provides. The graceful_stop.sh script checks whether the Region Load Balancer is operational and turns it off before starting its RegionServer decommission process. If you want to decommission more than one node at a time by stopping multiple RegionServers concurrently, the RegionServers can be put into a "draining" state to avoid regions being offloaded to other servers that are also being drained. This is done by marking a RegionServer as a draining node: create an entry in ZooKeeper under the <hbase_root>/draining znode. This znode has the format name,port,startcode, just like the RegionServer entries under the <hbase_root>/rs znode.

Using zkCli, list the current HBase RegionServers:

[zk:] ls /hbase/rs

Now, use the following command to put any servers you wish into draining status.  Copy the entry exactly as it exists in the /hbase/rs znode:

[zk:] create /hbase/draining/<FQDN Hostname>,<Port>,<startcode>
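For example, if the ls output includes an entry such as sdw12.example.local,60020,1387687145293 (the hostname, port, and startcode here are illustrative), the matching draining entry would be created as shown below. Note that some zkCli versions require a data argument, in which case append an empty string ("") to the command.

[zk:] create /hbase/draining/sdw12.example.local,60020,1387687145293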

This process ensures that these nodes do not receive new regions while other nodes are being decommissioned.

·      If the hosts are GemfireXD Servers:

gpadmin# gfxd server stop -dir=<working directory containing status file>
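For example, if the server was started with a working directory of /var/pivotal/gfxd/server1 (a hypothetical path; use the directory that contains the server's status file), the stop command would be:

gpadmin# gfxd server stop -dir=/var/pivotal/gfxd/server1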

·      If the hosts are HAWQ Segment Servers:

If HAWQ is deployed on the hosts, data locality concerns must be considered before leveraging the HDFS DataNode decommission process. HAWQ leverages a hash distribution policy to distribute its data evenly across the cluster, but this distribution is negatively affected when the data blocks are evacuated to other hosts throughout the cluster. If the DataNode is later brought back online, two states are possible:

·      Data In-Place

In this case, when the DataNode is brought back online, HDFS will report the blocks stored on the node as "over-replicated". HDFS will, over time, randomly remove a replica of each of those blocks. This process may negatively impact the data locality awareness of the HAWQ segments, because data that hashes to this node could now be stored elsewhere in the cluster. Operations can resume in this state, the only impact being potential HDFS network reads for some of the data blocks whose primary replica was moved off the host as the over-replication is resolved. This will not, however, affect co-located database joins, because the segment servers will be unaware that the data is being retrieved over the network rather than from a local disk read.

·      Data Removed

In this case, when the DataNode is brought back online, HDFS will use the node for net-new storage activities, but the pre-existing blocks will not be moved back to their original locations. This negatively impacts data locality for the co-located HAWQ segments, because the existing data will no longer be local to the segment host. It will not result in a database gather motion, since the data still appears to be local to the segment servers, but it will require the data blocks to be fetched over the network during HDFS reads. The HDFS Balancer should not be used to repopulate data onto the returning server unless a HAWQ table redistribution is planned as well, because the HDFS Balancer will affect segment host data locality on every node in the cluster as it moves data around to bring HDFS utilization into balance.

In either case, a HAWQ table redistribution can be performed on specific tables, or on all tables, to restore data locality. If possible, it is recommended that maintenance on a cluster containing HAWQ be performed one host at a time to avoid the situations described above. This alleviates the need to decommission the host, because two valid replicas of the data exist at all times.
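As a point of reference, a per-table redistribution is typically triggered by rewriting the table in place. The following is a hypothetical example for a single hash-distributed table (the database, schema, and table names are placeholders, and the exact syntax should be confirmed against your HAWQ version):

gpadmin# psql -d <database> -c "ALTER TABLE <schema>.<table> SET WITH (REORGANIZE=TRUE);"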

There is no specific decommission process for a HAWQ segment host, but if the host needs to be decommissioned, the HAWQ segment servers running on it should be shut down.

o   On the Decommissioned Node, stop the postgres processes and then verify they are down:

gpadmin# pkill -SIGTERM postgres
gpadmin# ps -ef | grep postgres
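Alternatively, a small wait loop can be used to block until all postgres processes on the node have exited (a minimal sketch using pgrep):

gpadmin# while pgrep postgres > /dev/null; do sleep 5; done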

o   On the HAWQ Master (Verify segments are down):

gpadmin# source /usr/local/hawq/greenplum_path.sh
gpadmin# gpstate

Slave Node Replacement

There are many situations in which a slave node goes down and the entire server must be replaced. In these cases, the administrator is not able to issue a decommission, so HDFS will mark the server offline and begin replicating the now-missing blocks to bring the replica count back within policy guidelines. To replace the node, a new server can be brought online with the same configuration (disk mounts, etc.), and the following procedure can be used on the PCC/ICM server to bring the replacement node into the cluster.

·      Get the current cluster configuration

gpadmin# icm_client fetch-configuration -o <config directory target> -l <clustername>

·      Remove the failed node from the cluster by creating a text file containing the fully qualified hostname of the host to replace and then running the ICM command below. This step is required even if the replacement node will have the same name, because adding a "net-new" node to the cluster allows us to leverage the ICM automation to properly configure the replaced host.

gpadmin# icm_client remove-slaves -f <replaced host text file> -l <clustername>

·      Add the replaced host back into the cluster by using the original configuration from the first step.

gpadmin# icm_client add-slaves -f <replaced host text file> -l <clustername>


·      Manually start the slave processes on the newly replaced node:

o   If the node is a DataNode:

gpadmin# icm_client start -l <clustername> -r datanode -o <hostfile.txt>

o   If the node is a NodeManager:

gpadmin# icm_client start -l <clustername> -r yarn-nodemanager -o <hostfile.txt>


o   If the node is an HBase RegionServer:

gpadmin# icm_client start -l <clustername> -r hbase-regionserver -o <hostfile.txt>


o   If the node is a HAWQ Segment Server:

gpadmin# sudo massh <replaced host text file> verbose "service hawq start"

With HAWQ, the database engine needs to be informed that the new segment server is back online, so log in to the HAWQ Master and issue the appropriate recovery commands for the HAWQ segments.

·      On the HAWQ Master:

gpadmin# source /usr/local/hawq/greenplum_path.sh
gpadmin# gprecoverseg -F -d <master data directory>

These commands will bring the server back online, but please refer to the HAWQ Considerations section above for how to proceed with regard to the data within the database instance itself.

Slave Node Disk Replacement

Hadoop is extremely resilient to hardware failure, but disk failure is one scenario that requires the administrator to put some thought in as the system is configured. In the default configuration, Hadoop will blacklist a slave node if a single disk fails. In most cases, this is an extreme reaction to a relatively inconsequential failure that is fairly common in large Hadoop clusters. The parameter that controls this behavior is dfs.datanode.failed.volumes.tolerated, found in the hdfs-site.xml file. Its value is the number of HDFS DataNode data directories that can fail before the node is blacklisted. A good rule of thumb is to tolerate one disk failure for every six data disks in the system; for example, a 12-disk server would have dfs.datanode.failed.volumes.tolerated = 2.
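For example, on a 12-disk DataNode the property could be set as follows in hdfs-site.xml (a minimal snippet; adjust the value for your disk count):

<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>2</value>
</property>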

In the majority of scenarios with the proper failure tolerance configured, the disk will fail but the DataNode will remain operational. 

To replace the disk drive:

·      Stop the DataNode, NodeManager, GemfireXD, and/or HAWQ processes using the methods described above.
·      Replace the failed disk drive(s).
·      Follow the Slave Node Replacement procedures above (remove-slaves/add-slaves).



