Requirement

Count the frequency of each word in a specified file on HDFS and store the result in a new file on HDFS.

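Requirements Analysis

1. Connect to the HDFS cluster and open the specified file;
2. Read the file data line by line;
3. Split each line into an array of words on spaces or tabs;
4. Use a Map to count how often each word occurs; the map keys take care of deduplication;
5. Write the results to the target file in key-value format;
6. Verify that the file was written successfully, for example with HDFS shell commands.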
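Steps 3 and 4 can be tried out locally, without a cluster. The following is a minimal sketch under that assumption: the class name WordCountSketch and the method countWords are hypothetical and not part of the original post, and a TreeMap is used only so that the printed order is stable.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {

    // Counts whitespace-separated words across all lines.
    // TreeMap keeps the keys sorted, so the output order is deterministic.
    public static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // trim() avoids a leading empty token when a line starts with whitespace
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    // merge() inserts 1 for a new word, otherwise adds 1 to the old count
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("hello world", "hello\thdfs", "");
        // Prints: hdfs=1, hello=2, world=1 (one entry per line)
        countWords(lines).forEach((word, count) -> System.out.println(word + "=" + count));
    }
}
```

Blank lines produce a single empty token after splitting, which is why the empty-string check is needed before counting.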

Code Implementation

The following is a complete solution implemented in Java:

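import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WordCount {
    private static Map<String, Integer> wordFrequencyMap = new HashMap<>();

    public static void main(String[] args) throws IOException, InterruptedException, URISyntaxException {
        // Connect to the HDFS cluster as user "root"
        FileSystem fs = FileSystem.get(new URI("hdfs://hadoop01:9000"), new Configuration(), "root");

        // Read the input file line by line
        FSDataInputStream inputStream = fs.open(new Path("/data/ghm.txt"));
        BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8));
        String line = null;
        while ((line = reader.readLine()) != null) {
            // Split the line into words on spaces or tabs
            String[] words = line.trim().split("\\s+");
            for (String word : words) {
                if (word.isEmpty()) {
                    continue; // a blank line yields one empty token
                }
                Integer count = wordFrequencyMap.getOrDefault(word, 0);
                wordFrequencyMap.put(word, count + 1);
            }
        }

        // Write each entry as "word=count" to the target file
        Path outputPath = new Path("/user/hdfs-wordcount/part-r-0001");
        FSDataOutputStream outputStream = fs.create(outputPath);
        Set<Map.Entry<String, Integer>> entrySet = wordFrequencyMap.entrySet();
        for (Map.Entry<String, Integer> entry : entrySet) {
            String entryKey = entry.getKey();
            Integer entryValue = entry.getValue();
            outputStream.write((entryKey + "=" + entryValue + "\r\n").getBytes(StandardCharsets.UTF_8));
            System.out.println("[INFO] Result: " + entryKey + "=" + entryValue);
        }

        // Close the output stream as well, otherwise buffered data may never reach HDFS
        outputStream.close();
        reader.close();
        inputStream.close();
        fs.close();
    }
}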

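The code above counts word frequencies in the specified HDFS file and stores them as key-value pairs in the target file. To check the result, run a command such as hdfs dfs -ls /user/hdfs-wordcount to confirm that the file exists, or hdfs dfs -cat /user/hdfs-wordcount/part-r-0001 to view its contents.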
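The output can also be checked programmatically once the file's lines have been read back. The following is a minimal sketch of parsing the word=count format written above. The class name ResultParserSketch and the method parse are hypothetical, and the lines are passed in as a plain list here instead of being read from HDFS.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ResultParserSketch {

    // Parses "word=count" lines into a map, preserving the file's line order.
    // Splits on the LAST '=' because a word may itself contain '='.
    public static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim(); // also strips the '\r' left over from "\r\n"
            int sep = line.lastIndexOf('=');
            if (sep <= 0 || sep == line.length() - 1) {
                continue; // skip blank lines and lines without both a word and a count
            }
            // Integer.parseInt throws NumberFormatException if the count is not a number
            result.put(line.substring(0, sep), Integer.parseInt(line.substring(sep + 1)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("hello=2\r", "a=b=1\r", "");
        System.out.println(parse(lines));
    }
}
```

Comparing the parsed map against an independently computed count is one way to confirm that the job wrote what was expected.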
