
Preface
In a storage system, data safety is without question the top priority, so when data does get lost, quickly finding where it used to live and restoring it is the most urgent task. In this article the author talks about locating lost data on HDFS. In HDFS, data is stored as multiple replicas across DataNode nodes; if all replicas of a block are lost at the same time, we end up with the familiar missing block situation. However, HDFS does not internally record where the data was last stored, which means we have no way of knowing which nodes held these blocks before they went missing, and that makes fast recovery noticeably harder. This article describes the author's improvement for this, based on the Hadoop 2.6 code base.
The removal logic of HDFS block replica storage locations
Before introducing the improvement for missing blocks, it is worth understanding how HDFS currently removes a block's storage locations: when are they removed, to the point that none of them are kept and administrators find it very hard to recover the lost data?
The main scenario is the dead-node path: after a DataNode becomes a dead node, the NameNode runs the removeDeadDatanode logic and removes the storage locations of the blocks associated with that dead node.
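As background, the NameNode judges a DataNode dead when its heartbeat has not been refreshed within the heartbeat expiry window. Below is only a minimal sketch of that check, assuming the default dfs.namenode.heartbeat.recheck-interval (5 minutes) and dfs.heartbeat.interval (3 seconds); the actual logic lives in DatanodeManager#isDatanodeDead.

// Sketch only: expiry window = 2 * recheck interval + 10 * heartbeat interval,
// which is 630,000 ms (10.5 minutes) with the default settings assumed here.
public class DeadNodeCheckSketch {
  private static final long RECHECK_INTERVAL_MS = 5 * 60 * 1000L; // dfs.namenode.heartbeat.recheck-interval
  private static final long HEARTBEAT_INTERVAL_MS = 3 * 1000L;    // dfs.heartbeat.interval
  private static final long HEARTBEAT_EXPIRE_MS =
      2 * RECHECK_INTERVAL_MS + 10 * HEARTBEAT_INTERVAL_MS;

  static boolean isDatanodeDead(long lastHeartbeatMs) {
    // A node whose last heartbeat is older than the expiry window is treated as dead.
    return lastHeartbeatMs < System.currentTimeMillis() - HEARTBEAT_EXPIRE_MS;
  }
}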
DatanodeManager's removeDeadDatanode method:
  /** Remove a dead datanode. */
  void removeDeadDatanode(final DatanodeID nodeID) {
    synchronized(datanodeMap) {
      DatanodeDescriptor d;
      try {
        d = getDatanode(nodeID);
      } catch(IOException e) {
        d = null;
      }
      if (d != null && isDatanodeDead(d)) {
        NameNode.stateChangeLog.info(
            "BLOCK* removeDeadDatanode: lost heartbeat from " + d);
        removeDatanode(d);
      } else {
        LOG.warn("datanode is timeout but not removed " + d);
      }
    }
  }

  /**
   * Remove a datanode descriptor.
   * @param nodeInfo datanode descriptor.
   */
  private void removeDatanode(DatanodeDescriptor nodeInfo) {
    assert namesystem.hasWriteLock();
    heartbeatManager.removeDatanode(nodeInfo);
    // This is where the dead storage locations are removed from the blocks
    blockManager.removeBlocksAssociatedTo(nodeInfo);
    networktopology.remove(nodeInfo);
    decrementVersionCount(nodeInfo.getSoftwareVersion());
    blockManager.getBlockReportLeaseManager().unregister(nodeInfo);

    if (LOG.isDebugEnabled()) {
      LOG.debug("remove datanode " + nodeInfo);
    }
    namesystem.checkSafeMode();
  }
This eventually reaches the removeStorage method of BlockInfoContiguous, the class that ultimately holds the replica location information of a block:
  /**
   * Remove {@link DatanodeStorageInfo} location for a block
   */
  boolean removeStorage(DatanodeStorageInfo storage) {
    int dnIndex = findStorageInfo(storage);
    if (dnIndex < 0) // the node is not found
      return false;
    assert getPrevious(dnIndex) == null && getNext(dnIndex) == null :
        "Block is still in the list and must be removed first.";
    // find the last not null node
    int lastNode = numNodes() - 1;
    // replace current node triplet by the lastNode one
    setStorageInfo(dnIndex, getStorageInfo(lastNode));
    setNext(dnIndex, getNext(lastNode));
    setPrevious(dnIndex, getPrevious(lastNode));
    // set the last triplet to null
    setStorageInfo(lastNode, null);
    setNext(lastNode, null);
    setPrevious(lastNode, null);
    return true;
  }
The internal object array triplets of BlockInfoContiguous stores, for each replica of the block, a reference to that replica's storage location together with references to the previous and next block on that storage.
The BlockInfoContiguous class:
  /**
   * This array contains triplets of references. For each i-th storage, the
   * block belongs to triplets[3*i] is the reference to the
   * {@link DatanodeStorageInfo} and triplets[3*i+1] and triplets[3*i+2] are
   * references to the previous and the next blocks, respectively, in the list
   * of blocks belonging to this storage.
   *
   * Using previous and next in Object triplets is done instead of a
   * {@link LinkedList} list to efficiently use memory. With LinkedList the cost
   * per replica is 42 bytes (LinkedList#Entry object per replica) versus 16
   * bytes using the triplets.
   */
  private Object[] triplets;
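To make the layout concrete, here is a tiny hypothetical illustration of the 3*i / 3*i+1 / 3*i+2 indexing described in the javadoc above (the class and accessor names are made up for the example):

// Hypothetical demo of the triplets indexing for a block with replication = 3.
class TripletsLayoutDemo {
  private final Object[] triplets = new Object[3 * 3];

  Object storage(int i)  { return triplets[3 * i];     } // DatanodeStorageInfo of replica i
  Object previous(int i) { return triplets[3 * i + 1]; } // previous block on that storage
  Object next(int i)     { return triplets[3 * i + 2]; } // next block on that storage
}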
So here is the question: why does the NameNode remove the dead node's location from the block the moment the node becomes dead? If all three replica nodes go dead at the same instant, won't we end up with a block whose replica locations cannot be found at all?
Exactly; multiple replica nodes dying at the same time is the downside of this handling. The removeStorage logic of the block itself is not wrong, though: it has to update the current replica locations promptly and drop stale storage information, because otherwise, with blocks counted in the tens of millions, the stale location entries would add up to GB-level heap usage. That is why the NameNode cleans up the storage information so aggressively here.
Optimizing the HDFS block's last stored location
But then again, if we wipe out a missing block's last locations purely for the sake of keeping location data minimal, we are left with no useful information at all. Can we additionally keep its last stored location? Since this information is kept separately, it does not break the existing block handling logic, so the risk is small.
Before landing this optimization, the author implemented two candidate solutions:
- Option 1: simply record every block's current replica locations. To save space, only each replica's IP address is kept, stored as a byte array (a String stores every character as a 2-byte char, while a byte array needs only 1 byte per character and is therefore more compact; see the small sketch after this list). This option keeps complete replica location information, but its downside is a fairly large heap footprint.
- Option 2: store only the last location, and only for blocks that would otherwise become missing blocks, again keeping the IP address as bytes. The principle is: whenever a block has had all of its locations removed, record the last one. This option has almost no impact on NameNode heap usage, because most blocks do have valid current storage locations and never need a last location recorded.
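A side note on the byte-array trick used by both options: HDFS already ships the helpers DFSUtil.string2Bytes and DFSUtil.bytes2String (both are used in the patches below), so an IP address costs one byte per character instead of one 2-byte char. A minimal round-trip sketch, with a made-up address:

import org.apache.hadoop.hdfs.DFSUtil;

public class IpAsBytesDemo {
  public static void main(String[] args) {
    String ip = "192.168.1.100";                 // hypothetical replica address
    byte[] compact = DFSUtil.string2Bytes(ip);   // 13 bytes instead of 13 two-byte chars
    String restored = DFSUtil.bytes2String(compact);
    System.out.println(restored.equals(ip));     // prints true
  }
}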
OK, let's look at the first implementation; only the diff of the change is shown here.
diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfoContiguous.java b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfoContiguous.java
index bc020aff6a..185827f1bf 100644
--- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfoContiguous.java
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfoContiguous.java
@@ -20,7 +20,9 @@
 import java.util.LinkedList;
 
 import org.apache.hadoop.classification.InterfaceAudience;
+import org.apache.hadoop.hdfs.DFSUtil;
 import org.apache.hadoop.hdfs.protocol.Block;
+import org.apache.hadoop.hdfs.protocol.DatanodeID;
 import org.apache.hadoop.hdfs.server.common.HdfsServerConstants.BlockUCState;
 import org.apache.hadoop.util.LightWeightGSet;
 
@@ -56,18 +58,27 @@
    */
   private Object[] triplets;
 
+  /**
+   * Last locations that this block stored. The locations will miss when
+   * all replicated DN node got removed when they become dead. Use byte array
+   * to store ip address to reduce memory size cost.
+   */
+  private byte[][] lastStoredLocations;
+
   /**
    * Construct an entry for blocksmap
    * @param replication the block's replication factor
    */
   public BlockInfoContiguous(short replication) {
     this.triplets = new Object[3*replication];
+    this.lastStoredLocations = new byte[replication][];
     this.bc = null;
   }
 
   public BlockInfoContiguous(Block blk, short replication) {
     super(blk);
     this.triplets = new Object[3*replication];
+    this.lastStoredLocations = new byte[replication][];
     this.bc = null;
   }
 
@@ -79,6 +90,7 @@ public BlockInfoContiguous(Block blk, short replication) {
   protected BlockInfoContiguous(BlockInfoContiguous from) {
     super(from);
     this.triplets = new Object[from.triplets.length];
+    this.lastStoredLocations = new byte[from.lastStoredLocations.length][];
     this.bc = from.bc;
   }
 
@@ -101,6 +113,25 @@ DatanodeStorageInfo getStorageInfo(int index) {
     return (DatanodeStorageInfo)triplets[index*3];
   }
 
+  /**
+   * Get last stored node address that stores this block before, this method
+   * currently used in hdfs fsck command.
+   */
+  public String[] getLastStoredNodes() {
+    if (lastStoredLocations != null) {
+      String[] nodeAddresses = new String[lastStoredLocations.length];
+      for (int i = 0; i < lastStoredLocations.length; i++) {
+        if (lastStoredLocations[i] != null) {
+          nodeAddresses[i] = DFSUtil.bytes2String(lastStoredLocations[i]);
+        }
+      }
+
+      return nodeAddresses;
+    } else {
+      return null;
+    }
+  }
+
   private BlockInfoContiguous getPrevious(int index) {
     assert this.triplets != null : "BlockInfo is not initialized";
     assert index >= 0 && index*3+1 < triplets.length : "Index is out of bound";
@@ -182,6 +213,42 @@ private int ensureCapacity(int num) {
     return last;
   }
 
+  /**
+   * Ensure that there is enough space to include num more triplets.
+   * @return first free triplet index.
+   */
+  private void updateLastStoredLocations() {
+    if (lastStoredLocations != null) {
+      if (lastStoredLocations.length * 3 == triplets.length) {
+        setLastStoredLocations();
+      } else {
+        // This is for the case of increasing replica number from users.
+        lastStoredLocations = new byte[triplets.length / 3][];
+        setLastStoredLocations();
+      }
+    }
+  }
+
+  /**
+   * Reset block last stored locations from triplets objects.
+   */
+  private void setLastStoredLocations() {
+    int storedIndex = 0;
+    // DatanodeStorageInfo is stored in position of triplets[i*3]
+    for (int i = 0; i < triplets.length; i += 3) {
+      if (triplets[i] != null) {
+        DatanodeStorageInfo currentDN = (DatanodeStorageInfo) triplets[i];
+        if (currentDN.getDatanodeDescriptor() != null) {
+          String ipAddress = currentDN.getDatanodeDescriptor().getIpAddr();
+          if (ipAddress != null) {
+            lastStoredLocations[storedIndex] = DFSUtil.string2Bytes(ipAddress);
+            storedIndex++;
+          }
+        }
+      }
+    }
+  }
+
   /**
    * Count the number of data-nodes the block belongs to.
    */
@@ -204,6 +271,10 @@ boolean addStorage(DatanodeStorageInfo storage) {
     setStorageInfo(lastNode, storage);
     setNext(lastNode, null);
     setPrevious(lastNode, null);
+    // Only need update last stored locations in adding storage,
+    // adding storage behavior mean there is a block replicated successfully
+    // and we can ensure there is at least one valid block location.
+    updateLastStoredLocations();
     return true;
   }
 
diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeFsck.java b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeFsck.java
index 1ee7d879fa..82c49d7d35 100644
--- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeFsck.java
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeFsck.java
@@ -676,7 +676,17 @@ private void collectBlocksSummary(String parent, HdfsFileStatus file, Result res
           report.append(" " + sb.toString());
         }
       }
-      report.append('\n');
+
+      // Print last location the block stored.
+      BlockInfoContiguous blockInfo = bm.getStoredBlock(block.getLocalBlock());
+      String[] nodeAddresses = blockInfo.getLastStoredNodes();
+      report.append("\nBlock last stored locations: [");
+      if (nodeAddresses != null) {
+        for (String node : nodeAddresses) {
+          report.append(node).append(", ");
+        }
+      }
+      report.append("].\n");
       blockNumber++;
     }
The core of this scheme is that every time a new storage is added, we refresh the current valid storage locations into the internal field lastStoredLocations. If some storage is removed in between, then after the next addStorage call (triggered by replication, balancing, or a returning node's block report) lastStoredLocations again reflects the latest locations.
The author then exposed lastStoredLocations through the fsck tool, which gives a pretty good user experience. While testing this scheme, however, the extra heap consumption turned out to be too large: keeping 3 IP addresses per block as byte arrays costs 3 x 16 bytes (assuming up to 16 characters per address) = 48 bytes. With 50 million blocks that is 48 B x 50M, roughly 2.24 GB, and on a cluster with hundreds of millions of blocks the change would add 10+ GB of heap usage, which some cluster operators may not be able to accept.
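A quick back-of-the-envelope check of that estimate, using the same assumptions as above (50 million blocks, 3 replicas, up to 16 characters per address) and ignoring byte[] object headers:

public class LastLocationHeapEstimate {
  public static void main(String[] args) {
    long blocks = 50_000_000L;   // assumed block count
    int replicas = 3;            // replication factor
    int bytesPerAddress = 16;    // assumed address length in bytes
    long totalBytes = blocks * replicas * bytesPerAddress;
    // Prints roughly 2.24 GB; real usage is higher once array headers are counted.
    System.out.printf("extra heap ~= %.2f GB%n", totalBytes / (1024.0 * 1024 * 1024));
  }
}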
So the author implemented the improvement the other way, i.e. the second option discussed above. Let's look at that implementation:
diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfoContiguous.java b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfoContiguous.java
index bc020aff6a..5e27b45d8b 100644
--- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfoContiguous.java
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfoContiguous.java
@@ -20,6 +20,7 @@
 import java.util.LinkedList;
 
 import org.apache.hadoop.classification.InterfaceAudience;
+import org.apache.hadoop.hdfs.DFSUtil;
 import org.apache.hadoop.hdfs.protocol.Block;
 import org.apache.hadoop.hdfs.server.common.HdfsServerConstants.BlockUCState;
 import org.apache.hadoop.util.LightWeightGSet;
@@ -56,18 +57,27 @@
    */
   private Object[] triplets;
 
+  /**
+   * Last location that this block stored. The locations will miss when
+   * all replicated DN node got removed when they become dead. Use byte array
+   * to store ip address to reduce memory size cost.
+   */
+  private byte[] lastStoredLocation;
+
   /**
    * Construct an entry for blocksmap
    * @param replication the block's replication factor
    */
   public BlockInfoContiguous(short replication) {
     this.triplets = new Object[3*replication];
+    this.lastStoredLocation = null;
     this.bc = null;
   }
 
   public BlockInfoContiguous(Block blk, short replication) {
     super(blk);
     this.triplets = new Object[3*replication];
+    this.lastStoredLocation = null;
     this.bc = null;
   }
 
@@ -79,6 +89,7 @@ public BlockInfoContiguous(Block blk, short replication) {
   protected BlockInfoContiguous(BlockInfoContiguous from) {
     super(from);
     this.triplets = new Object[from.triplets.length];
+    this.lastStoredLocation = from.lastStoredLocation;
     this.bc = from.bc;
   }
 
@@ -101,6 +112,18 @@ DatanodeStorageInfo getStorageInfo(int index) {
     return (DatanodeStorageInfo)triplets[index*3];
   }
 
+  /**
+   * Get last stored node address that stores this block before, this method
+   * currently used in hdfs fsck command.
+   */
+  public String getLastStoredNodes() {
+    if (lastStoredLocation != null) {
+      return DFSUtil.bytes2String(lastStoredLocation);
+    } else {
+      return null;
+    }
+  }
+
   private BlockInfoContiguous getPrevious(int index) {
     assert this.triplets != null : "BlockInfo is not initialized";
     assert index >= 0 && index*3+1 < triplets.length : "Index is out of bound";
@@ -182,6 +205,31 @@ private int ensureCapacity(int num) {
     return last;
   }
 
+  /**
+   * Update block last storage location once given removed storage is the last
+   * block replica location.
+   */
+  private void updateLastStoredLocation(DatanodeStorageInfo removedStorage) {
+    if (triplets != null) {
+      for (int i = 0; i < triplets.length; i += 3) {
+        if (triplets[i] != null) {
+          // There still exists other storage location and given removed
+          // storage isn't last block storage location.
+          return;
+        }
+      }
+
+      // If given removed storage is the block last storage, we need to
+      // store the ip address of this storage node.
+      if (removedStorage.getDatanodeDescriptor() != null) {
+        String ipAddress = removedStorage.getDatanodeDescriptor().getIpAddr();
+        if (ipAddress != null) {
+          lastStoredLocation = DFSUtil.string2Bytes(ipAddress);
+        }
+      }
+    }
+  }
+
   /**
    * Count the number of data-nodes the block belongs to.
   */
@@ -204,6 +252,12 @@ boolean addStorage(DatanodeStorageInfo storage) {
     setStorageInfo(lastNode, storage);
     setNext(lastNode, null);
     setPrevious(lastNode, null);
+
+    if (lastStoredLocation != null) {
+      // Reset last stored location to null since we have new
+      // valid storage location added.
+      lastStoredLocation = null;
+    }
     return true;
   }
 
@@ -225,7 +279,9 @@ assert getPrevious(dnIndex) == null && getNext(dnIndex) == null :
     // set the last triplet to null
     setStorageInfo(lastNode, null);
     setNext(lastNode, null);
-    setPrevious(lastNode, null);
+    setPrevious(lastNode, null);
+    // update last block stored location
+    updateLastStoredLocation(storage);
     return true;
   }
 
diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeFsck.java b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeFsck.java
index 1ee7d879fa..20d3a31c7d 100644
--- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeFsck.java
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeFsck.java
@@ -676,7 +676,11 @@ private void collectBlocksSummary(String parent, HdfsFileStatus file, Result res
           report.append(" " + sb.toString());
         }
       }
-      report.append('\n');
+
+      // Print last location the block stored.
+      BlockInfoContiguous blockInfo = bm.getStoredBlock(block.getLocalBlock());
+      report.append("\nBlock last stored location: [" + blockInfo.getLastStoredNodes());
+      report.append("].\n");
       blockNumber++;
     }
 
diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockInfo.java b/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockInfo.java
index f57dece80e..47c7c299c3 100644
--- a/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockInfo.java
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockInfo.java
@@ -172,4 +172,37 @@ public void testBlockListMoveToHead() throws Exception {
           blockInfoList.get(j), dd.getBlockListHeadForTesting());
     }
   }
+
+  @Test
+  public void testLastStoredLocation() {
+    BlockInfoContiguous blockInfo = new BlockInfoContiguous((short) 3);
+
+    String lastStoredLocation = "127.0.0.3";
+    DatanodeStorageInfo storage1 = DFSTestUtil.createDatanodeStorageInfo(
+        "storageID1", "127.0.0.1");
+    DatanodeStorageInfo storage2 = DFSTestUtil.createDatanodeStorageInfo(
+        "storageID2", "127.0.0.2");
+    DatanodeStorageInfo storage3 = DFSTestUtil.createDatanodeStorageInfo(
+        "storageID3", lastStoredLocation);
+
+    blockInfo.addStorage(storage1);
+    blockInfo.addStorage(storage2);
+    blockInfo.addStorage(storage3);
+
+    Assert.assertEquals(storage1, blockInfo.getStorageInfo(0));
+    Assert.assertEquals(storage2, blockInfo.getStorageInfo(1));
+    Assert.assertEquals(storage3, blockInfo.getStorageInfo(2));
+    Assert.assertNull(blockInfo.getLastStoredNodes());
+
+    blockInfo.removeStorage(storage1);
+    blockInfo.removeStorage(storage2);
+    blockInfo.removeStorage(storage3);
+
+    Assert.assertNull(blockInfo.getDatanode(0));
+    Assert.assertNull(blockInfo.getDatanode(1));
+    Assert.assertNull(blockInfo.getDatanode(2));
+
+    Assert.assertNotNull(blockInfo.getLastStoredNodes());
+    Assert.assertEquals(lastStoredLocation, blockInfo.getLastStoredNodes());
+  }
 }
\ No newline at end of file
The key point of option 2 is that when a block removes its last storage location, it records that last removed storage. If those dead nodes come back later, addStorage is triggered and lastStoredLocation is cleared again. In other words, option 2 only records extra location information for blocks that are actually missing. Because the logic sits inside removeStorage, a normal file deletion also goes through the lastStoredLocation update, but the impact on the delete path is acceptable.
Testing the HDFS missing block lastStoredLocation
The author tested the above change with the fsck command and the effect looks quite good, as shown below:
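For reference, the outputs below can be produced with an fsck invocation along the following lines (the exact flags are an assumption here; any fsck run that prints per-block details will include the new field):

hdfs fsck /tmp/testfile3 -files -blocks -locations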
FSCK started by hdfs from /xx.xx.xx.xx for path /tmp/testfile3 at Tue Mar 03 00:32:55 GMT-07:00 2020
/tmp/testfile3 12 bytes, 1 block(s):  OK
0. BP-xx.xx.xx.xx:blk_1075951756_2217202 len=12 repl=3 [DatanodeInfoWithStorage[xx.xx.xx.xx:50010,DS-1c6de9cf-4440-426e-b5a6-c6f3b212d39b,DISK], DatanodeInfoWithStorage[xx.xx.xx.xx:50010,DS-7ea05055-025d-4a96-ad17-b87c677ff421,DISK], DatanodeInfoWithStorage[xx.xx.xx.xx:50010,DS-8ce45298-7109-4d9d-82cf-dfaece62263f,DISK]]
Block last stored location: [null].

// After the DataNodes become dead at the same time
FSCK started by hdfs from /xx.xx.xx.xx for path /tmp/testfile3 at Tue Mar 03 00:52:14 GMT-07:00 2020
/tmp/testfile3 12 bytes, 1 block(s):
/tmp/testfile3: CORRUPT blockpool BP-xx.xx.xx.xx block blk_1075951756
 MISSING 1 blocks of total size 12 B
0. BP-xx.xx.xx.xx:blk_1075951756_2217202 len=12 MISSING!
Block last stored location: [xx.xx.xx.xx].
So far the author has only verified this change on a test cluster and has not yet run it in production, but this improvement should be a nice help for cluster maintainers.
