Continuing on the theme of "So easy that Hulk could do it", I recently wrote this document as a Pivotal blog piece, but I seem to have outrun the startup of the new Pivotal Technical Blog. So, it was decided to add the content to the product documentation since the blog is still in a "coming soon" state. Nothing earth-shattering here, just some best practices for taking slave nodes out of the cluster in the proper manner.
Decommissioning, Repairing, or Replacing Hadoop Slave Nodes
Decommissioning Hadoop Slave Nodes
The Hadoop distributed scale-out cluster-computing framework was inherently designed to run on commodity hardware with a typical JBOD configuration (just a bunch of disks; a disk configuration where individual disks are accessed directly by the operating system without the need for RAID). The idea behind this relates not only to cost, but also to fault tolerance, where nodes (machines) or disks are expected to fail occasionally without bringing the cluster down. For these reasons, Hadoop administrators are often tasked with decommissioning, repairing, or even replacing nodes in a Hadoop cluster.
Decommissioning slave nodes is a process used to prevent data loss when you need to shut down or remove these nodes from a Pivotal HD cluster. For instance, if multiple nodes need to be taken down, there is a possibility that all of the replicas of one or more data blocks live on those nodes. If the nodes are simply taken down without preparation, those blocks will no longer be available to the active nodes in the cluster, so the files that contain those blocks will be marked as corrupt and will appear as unavailable.
Hadoop administrators may also want to decommission nodes to shrink an existing cluster or to proactively remove nodes. Decommissioning is not an instantaneous process, since it requires the replication of all of the blocks on the decommissioned node(s) to the active nodes that will remain in the cluster. Decommissioning should only be used in cases where more than one node needs to be taken down for maintenance, because it evacuates the blocks from the targeted hosts and can affect both data balance and data locality for Hadoop and higher-level services, such as HAWQ (see HAWQ Considerations).
DataNode Decommission
The procedures below assume that NameNode HA is configured (which is a Pivotal HD best practice); if it is not, simply skip the additional steps for the Standby NameNode.
It is a recommended best practice to run a filesystem check on HDFS to verify that the filesystem is healthy before you proceed with decommissioning any nodes.
gpadmin# sudo -u hdfs hdfs fsck /
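On a healthy cluster, the report should show no missing or corrupt blocks and should end with a status line similar to the following:
The filesystem under path '/' is HEALTHY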
On the Active NameNode:
· Edit the /etc/gphd/hadoop/conf/dfs.exclude file and add the DataNode hostnames to be removed (separated by a newline character). Make sure you use the fully qualified domain name (FQDN) for each hostname.
· Instruct the Active NameNode to refresh its node list by re-reading the .exclude and .include files:
gpadmin# sudo -u hdfs hdfs dfsadmin -fs hdfs://<active_namenode_fqdn> -refreshNodes
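For example, if the hypothetical DataNodes dn03.example.com and dn04.example.com were being decommissioned, /etc/gphd/hadoop/conf/dfs.exclude would contain the two lines:
dn03.example.com
dn04.example.com
followed by the refresh (nn01.example.com is a placeholder for the Active NameNode FQDN):
gpadmin# sudo -u hdfs hdfs dfsadmin -fs hdfs://nn01.example.com -refreshNodes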
On the Standby NameNode:
· Edit the /etc/gphd/hadoop/conf/dfs.exclude file and add the DataNode hostnames to be removed (separated by a newline character). Make sure you use the FQDN for each hostname.
· Instruct the Standby NameNode to refresh its node list by re-reading the .exclude and .include files:
gpadmin# sudo -u hdfs hdfs dfsadmin -fs hdfs://<standby_namenode_fqdn> -refreshNodes
· Monitor decommission progress with the NameNode WebUI (http://<active_namenode_host>:50070) by navigating to the Decommissioning Nodes page. You can also monitor the status from the command line by executing one of the following commands (verbose/concise) on any NameNode or DataNode in the cluster:
gpadmin# sudo -u hdfs hdfs dfsadmin -report
gpadmin# sudo -u hdfs hdfs dfsadmin -report | grep -B 2 Decommission
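The concise form prints a few lines per DataNode; output similar to the following (addresses and hostnames are placeholders) indicates that dn03.example.com is still draining its blocks:
Name: 192.168.10.13:50010 (dn03.example.com)
Hostname: dn03.example.com
Decommission Status : Decommission in progress
Once every targeted node reports Decommissioned, it is safe to shut down its slave processes.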
YARN NodeManager Decommission
Use the following procedure if YARN NodeManager daemons are running on the nodes that are being decommissioned. This process is almost immediate and only requires a notification to the ResourceManager that the excluded nodes are no longer available for use.
On the YARN ResourceManager host machine:
· Edit the /etc/gphd/hadoop/conf/yarn.exclude file and add the NodeManager hostnames to be removed (separated by a newline character). Make sure you use the FQDN for each hostname.
· On the ResourceManager host, instruct the ResourceManager to refresh its node list by re-reading the .exclude and .include files:
gpadmin# sudo -u yarn yarn rmadmin -refreshNodes
· The decommission state can be verified via the ResourceManager WebUI (http://<resource_manager_host>:8088) or from the command line by executing the following command on the ResourceManager host:
gpadmin# sudo -u yarn yarn node -list
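For example, to decommission a hypothetical NodeManager nm03.example.com, /etc/gphd/hadoop/conf/yarn.exclude would contain the single line:
nm03.example.com
After the refresh, nm03.example.com should no longer appear among the running nodes reported by yarn node -list, and the ResourceManager WebUI will show it as decommissioned.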
Slave Node Shutdown Procedures
Next, the slave processes running on the newly decommissioned nodes need to be shut down using the Pivotal HD ICM CLI.
Create a text file (for example, hostfile.txt) containing the hostnames that have been decommissioned (separated by a newline character). Make sure you use the FQDN for each hostname.
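For example, with the hypothetical hosts dn03.example.com and dn04.example.com:
gpadmin# cat > hostfile.txt <<EOF
dn03.example.com
dn04.example.com
EOF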
· If the hosts are HDFS DataNodes:
gpadmin# icm_client stop -l <clustername> -r datanode -o <hostfile.txt>
· If the hosts are YARN NodeManagers:
gpadmin# icm_client stop -l <clustername> -r yarn-nodemanager -o <hostfile.txt>
· If the hosts are HBase RegionServers:
It is preferable to use the graceful_stop.sh script that HBase provides. The graceful_stop.sh script checks whether the Region Load Balancer is operational and turns it off before starting its RegionServer decommission process. If you want to decommission more than one node at a time by stopping multiple RegionServers concurrently, the RegionServers can be put into a "draining" state to avoid having regions offloaded onto other servers that are also being drained. This is done by marking a RegionServer as a draining node, which is accomplished by creating an entry in ZooKeeper under the <hbase_root>/draining znode. This znode has the format name,port,startcode, just like the RegionServer entries under the <hbase_root>/rs znode.
Using zkCli, list the current HBase RegionServers:
[zk:] ls /hbase/rs
Now, use the following command to put any servers you wish into draining status. Copy the entry exactly as it exists in the /hbase/rs znode:
[zk:] create /hbase/draining/<FQDN Hostname>,<Port>,<startcode>
This process ensures that these nodes do not receive new regions while other nodes are being decommissioned.
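For a single node, a hedged sketch of the graceful stop, assuming $HBASE_HOME points at the HBase installation and region-host03.example.com is a placeholder RegionServer hostname:
gpadmin# sudo -u hbase $HBASE_HOME/bin/graceful_stop.sh region-host03.example.com
When draining multiple RegionServers instead, the zkCli session might look like the following, with the name, port, and startcode copied exactly from the /hbase/rs listing (the values shown are placeholders):
[zk:] ls /hbase/rs
[region-host03.example.com,60020,1397284354011, region-host04.example.com,60020,1397284360122]
[zk:] create /hbase/draining/region-host03.example.com,60020,1397284354011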
· If the hosts are GemfireXD Servers:
gpadmin# gfxd server stop -dir=<working directory containing status file>
· If the hosts are HAWQ Segment Servers:
If HAWQ is deployed on the hosts, data locality concerns must be considered before leveraging the HDFS DataNode decommission process. HAWQ leverages a hash distribution policy to distribute its data evenly across the cluster, but this distribution is negatively affected when the data blocks are evacuated to the other hosts throughout the cluster. If the DataNode is later brought back online, two states are possible:
· Data In-Place
In this case, when the DataNode is brought back online, HDFS will report the blocks stored on the node as "over-replicated" blocks. HDFS will, over time, randomly remove a replica of each of these blocks. This process may negatively impact the data locality awareness of the HAWQ segments, because data that hashes to this node could now be stored elsewhere in the cluster. Operations can resume in this state, with the only impact being potential HDFS network reads for some of the data blocks that had their primary replica moved off the host as the "over-replication" is resolved. This will not, however, affect co-located database joins, because the segment servers will be unaware that the data is being retrieved over the network rather than from a local disk read.
· Data Removed
In this case, when the DataNode is brought back online, HDFS will use this node for net-new storage activities, but the pre-existing blocks will not be moved back to their original location. This will negatively impact data locality for the co-located HAWQ segments, because any existing data will not be local to the segment host. It will not result in a database gather motion, since the data will still appear to be local to the segment servers, but it will require the data blocks to be fetched over the network during HDFS reads. The HDFS Balancer should not be used to repopulate data onto the newly decommissioned server unless a HAWQ table redistribution is planned as well, because the HDFS Balancer will affect segment host data locality on every node in the cluster as it moves data around to bring HDFS utilization into balance across the cluster.
In either case, a HAWQ table redistribution can be performed on specific tables, or on all tables, in order to restore data locality. If possible, it is recommended that maintenance on a cluster containing HAWQ be performed one host at a time to avoid the situations described above; this alleviates the need to decommission the host, because two valid replicas of the data would exist at all times.
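As a minimal sketch, assuming a hash-distributed table, a single table can be redistributed (rewriting its data and restoring locality) from the HAWQ Master; the database, schema, and table names below are placeholders:
gpadmin# psql -d <database> -c 'ALTER TABLE <schema>.<table> SET WITH (REORGANIZE=TRUE);'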
There is no specific decommission process for a HAWQ segment host, but if the host needs to be decommissioned, the HAWQ segment servers should be shut down.
o On the decommissioned node, stop the postgres processes and then verify they are down:
gpadmin# pkill -SIGTERM postgres
gpadmin# ps -ef | grep postgres
o On the HAWQ Master, verify the segments are down:
gpadmin# source /usr/local/hawq/greenplum_path.sh
gpadmin# gpstate
Slave Node Replacement
There are many situations in which a slave node goes down and the entire server must be replaced. In these cases, the administrator is not able to issue a decommission, so HDFS will mark the server offline and begin replicating the now-missing blocks to bring the replica count back within policy guidelines. To replace the node, a new server can be brought online with the same configuration (disk mounts, etc.), and the following procedure can be used on the PCC/ICM server to bring the replacement node into the cluster.
· Get the current cluster configuration:
gpadmin# icm_client fetch-configuration -o <config directory target> -l <clustername>
· Remove the failed node from the cluster by creating a text file containing the fully qualified hostname of the host to replace and then running the ICM command below. This step is required even if the replacement node will have the same name, because adding a "net-new" node to the cluster allows the ICM automation to properly configure the replaced host.
gpadmin# icm_client remove-slaves -f <replaced host text file> -l <clustername>
· Add the replaced host back into the cluster by using the original configuration from the first step:
gpadmin# icm_client add-slaves -f <replaced host text file> -l <clustername>
· Manually start the slave processes on the newly replaced node:
o If the node is a DataNode:
gpadmin# icm_client start -r datanode -o <hostfile.txt>
o If the node is a NodeManager:
gpadmin# icm_client start -r yarn-nodemanager -o <hostfile.txt>
o If the node is an HBase RegionServer:
gpadmin# icm_client start -r hbase-regionserver -o <hostfile.txt>
o If the node is a HAWQ Segment Server:
gpadmin# sudo massh <replaced host text file> verbose "service hawq start"
With HAWQ, the database engine needs to be informed that it now has the new segment server online, so you need to log in to the HAWQ Master and issue the appropriate recovery commands for the HAWQ segments.
· On the HAWQ Master:
gpadmin# source /usr/local/hawq/greenplum_path.sh
gpadmin# gprecoverseg -F -d <master data directory>
These commands will bring the server back online, but refer to the HAWQ Considerations section above for how to proceed with regard to the data within the database instance itself.
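Once the recovery completes, the segment status can be re-checked from the HAWQ Master; all segments should report as up:
gpadmin# gpstate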
Slave Node Disk Replacement
Hadoop is extremely resilient in terms of hardware failure, but disk failure is one failure scenario that requires the administrator to put some thought into how the system is configured. In the default configuration, Hadoop will blacklist the slave node if a single disk fails. In most cases, this response is an extreme reaction to a relatively inconsequential failure that is fairly common in large Hadoop clusters. The parameter that controls this response is dfs.datanode.failed.volumes.tolerated, found in the hdfs-site.xml file. The value given to this parameter represents the number of HDFS DataNode directories that can fail before the node is blacklisted. A good rule of thumb for this setting is to tolerate 1 disk failure for every 6 data disks in the system. For example, a 12-disk server would have dfs.datanode.failed.volumes.tolerated = 2.
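An illustrative hdfs-site.xml entry for that 12-disk example (the value should reflect your own disk layout) would be:
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>2</value>
</property>
The active value can be confirmed on a DataNode with:
gpadmin# hdfs getconf -confKey dfs.datanode.failed.volumes.tolerated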
In the majority of scenarios, with the proper failure tolerance configured, the disk will fail but the DataNode will remain operational.
To replace the disk drive:
- Stop the DataNode, NodeManager, GemfireXD, and/or HAWQ processes using the methods described above.
- Replace the failed disk drive(s).
- Follow the Slave Node Replacement procedures above (remove-slaves/add-slaves).