
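Data Cleaning: A Hands-On Example
Requirement: remove every log line that has 11 or fewer fields.
Input data: (Figure 1: sample input data)
Expected output: every remaining line has more than 11 fields.
Analysis: the input has to be filtered in the Map phase according to this rule. Because nothing needs to be aggregated, no Reduce phase is required.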
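Before wiring the rule into a Mapper, it can be tried out in plain Java. The following is only a minimal sketch, assuming that fields are separated by a single space; the class name LogFilterSketch and the methods isValid and clean are hypothetical and are not part of the original job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only: the filtering rule without any Hadoop dependency.
class LogFilterSketch {
    // Threshold taken from the requirement: a line is kept only if it has more than 11 fields.
    static final int MIN_FIELDS_EXCLUSIVE = 11;

    // Assumption: fields are separated by a single space.
    static boolean isValid(String line) {
        return line.split(" ").length > MIN_FIELDS_EXCLUSIVE;
    }

    // Returns only the lines that pass the field-count check, in their original order.
    static List<String> clean(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (isValid(line)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String twelveFields = "a b c d e f g h i j k l";
        String elevenFields = "a b c d e f g h i j k";
        // Only the 12-field line survives, so this prints 1.
        System.out.println(clean(Arrays.asList(twelveFields, elevenFields)).size());
    }
}
```

Keeping the rule in a small pure function like this makes it easy to check the boundary case (exactly 11 fields is dropped, 12 is kept) before running a full job.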
Implementation, Mapper stage: parse each line and, while parsing, record how many lines pass or fail the check in a counter.
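Code example:
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input: <byte offset, line>; output: <line, NullWritable>.
public class LogMap extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Drop the line if it fails the field-count check.
        boolean result = parseLog(line, context);
        if (!result) {
            return;
        }
        // Write the original line unchanged; no value is needed.
        context.write(value, NullWritable.get());
    }

    // Returns true if the line has more than 11 space-separated fields.
    // Both outcomes are tallied in the "map" counter group.
    private boolean parseLog(String line, Context context) {
        String[] fields = line.split(" ");
        if (fields.length > 11) {
            context.getCounter("map", "true").increment(1);
            return true;
        } else {
            context.getCounter("map", "false").increment(1);
            return false;
        }
    }
}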
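Driver stage: configure the job as usual, but set the number of ReduceTasks to 0. With no reducers, the Mapper output is written directly to the output path.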
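Code example:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogDriver {
    public static void main(String[] args) {
        Configuration configuration = new Configuration();
        Job job = null;
        try {
            job = Job.getInstance(configuration);
            job.setJarByClass(LogDriver.class);
            job.setMapperClass(LogMap.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            // Map-only job: no Reduce phase is needed for filtering.
            job.setNumReduceTasks(0);
            // Both paths are truncated in the source; replace them with real input and output directories.
            FileInputFormat.setInputPaths(job, new Path("G:\\Projects..."));
            FileOutputFormat.setOutputPath(job, new Path("G:\\Projects..."));
            boolean result = job.waitForCompletion(true);
            System.exit(result ? 0 : 1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}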
