Getting Started with Hadoop: The WordCount (wc) Program

Preface

After installing Hadoop, you need to verify that it starts and runs correctly, so let's run an example. WC refers to the Linux command wc (word count), which counts the words (along with lines and bytes) in a file.

Command:
wc [filename]

So how do you do this kind of counting at big-data scale? We use Hadoop; the corresponding job ships in the hadoop-mapreduce-examples-2.8.4.jar package.

Uploading, downloading, and deleting files

After starting Hadoop, first upload a file named test to HDFS:
hadoop fs -put /root/Desktop/test.tar.gz hdfs://zc01:9000/test

You can download a file either through the web UI or from the command line:
hadoop fs -get hdfs://zc01:9000/test /home # download to the /home directory

Delete a file (add -r, i.e. hadoop fs -rm -r, to delete a directory):
hadoop fs -rm hdfs://zc01:9000/test

These are the basics.
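
The same operations can also be done programmatically through the HDFS Java API. Below is a minimal sketch, assuming the NameNode address hdfs://zc01:9000 and the file paths used in the commands above (the user name "root" and the class name HdfsFileOps are my own choices for illustration):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the NameNode used in the commands above (assumed address)
        FileSystem fs = FileSystem.get(URI.create("hdfs://zc01:9000"), conf, "root");

        // Upload: equivalent to hadoop fs -put
        fs.copyFromLocalFile(new Path("/root/Desktop/test.tar.gz"), new Path("/test"));

        // Download: equivalent to hadoop fs -get
        fs.copyToLocalFile(new Path("/test"), new Path("/home"));

        // Delete: equivalent to hadoop fs -rm (second argument true = recursive, for directories)
        fs.delete(new Path("/test"), true);

        fs.close();
    }
}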

Running a wc count with the bundled examples jar

First, change into the directory that contains the jar:

[root@zc01 mapreduce]# pwd
/hadoop/hadoop-2.8.4/share/hadoop/mapreduce

The command is:
hadoop jar hadoop-mapreduce-examples-2.8.4.jar wordcount [input] [output]

Both the input and the output are HDFS paths, so first upload the words.txt file to be counted to HDFS:
hadoop fs -put /root/Desktop/words.txt hdfs://zc01:9000/words

Then run the job:
hadoop jar hadoop-mapreduce-examples-2.8.4.jar wordcount hdfs://zc01:9000/words hdfs://zc01:9000/out

When the job finishes, running hadoop fs -ls / shows a new out directory; you can also view the results directly in the web UI.


(screenshot: hadoop1-3)

When downloading the results, the browser cannot resolve the hostname. Add a mapping to the hosts file (C:\Windows\System32\drivers\etc\hosts), which acts as a local DNS:

192.168.8.88 zc01

Download the part-r-00000 file; open it and you will see the word-count results.
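
If you prefer to read the result file without downloading it, a small sketch using the same FileSystem API could stream it to the console (again assuming hdfs://zc01:9000 and the /out output directory from above; the class name ReadWcResult is illustrative):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadWcResult {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://zc01:9000"), conf, "root");
        // Stream the reducer output file straight to stdout
        try (FSDataInputStream in = fs.open(new Path("/out/part-r-00000"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}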

Writing a simple WordCount program yourself

Create a new Maven project with the following pom dependencies:

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <hadoopVersion>2.8.4</hadoopVersion>
</properties>

<dependencies>
    <!-- Hadoop start -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>${hadoopVersion}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>${hadoopVersion}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>${hadoopVersion}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoopVersion}</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>RELEASE</version>
    </dependency>
</dependencies>

Write your own WordCount class:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {
    // Mapper: split each input line into tokens and emit (word, 1) for every token
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sum up all the 1s emitted for the same word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Error in args");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // input path
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        // output path (must not already exist on HDFS)
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Once you understand how MapReduce works, implementing your own Mapper and Reducer classes like this is straightforward.
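
One common refinement, not included in the version above, is to also run the reducer as a combiner so that partial sums are computed on the map side and less data crosses the network. It takes a single extra line in the job setup:

// Reuse IntSumReducer as a map-side combiner to pre-aggregate counts
job.setCombinerClass(IntSumReducer.class);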

Package the project as a jar file, copy it to the server (hadoop jar runs a local jar, not one stored in HDFS), and run it from the command line:

hadoop jar hadoop01.jar demo01.WordCount hdfs://hadoop01:9000/wc_test_01 hdfs://hadoop01:9000/wc_out2

Assuming the program is correct, the output matches what the examples jar produced. With that, the simple WordCount program is complete.