
hadoop - Does a MapReduce program consume all files in a folder (the input dataset) by default?


Hello fellow Stack Overflow users,

I ran a MapReduce job that finds the unique words in its input. The input dataset (files) lives in a folder in HDFS, so when running the program I passed the folder name as the input.

I had not realized that there were two more files in that same folder. The MapReduce job went ahead, read all three files, and produced output. The output was fine.

Is this the default behaviour of MapReduce? That is, if you point it at a folder rather than a file (as the input dataset), does MapReduce consume every file in that folder? The reason I am surprised is that there is no code in the mapper to read multiple files. I know that the first argument in the driver, args[0], is the folder name I supplied.
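
For reference, here is a rough sketch (the class name ListInputFiles is made up for illustration) of listing what HDFS actually holds under that input folder with the FileSystem API; every regular file shown here ends up in the job's input when the folder is passed as args[0]:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListInputFiles {
    public static void main(String[] args) throws Exception {
        // List the contents of the directory given on the command line.
        FileSystem fs = FileSystem.get(new Configuration());
        for (FileStatus status : fs.listStatus(new Path(args[0]))) {
            System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
        }
    }
}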

Here is the driver code:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DataSort {

    public static void main(String[] args) throws Exception {

        /*
         * Validate that two arguments were passed from the command line.
         */
        if (args.length != 2) {
            System.out.printf("Usage: StubDriver <input dir> <output dir>\n");
            System.exit(-1);
        }

        Job job = Job.getInstance();

        /*
         * Specify the jar file that contains your driver, mapper, and reducer.
         * Hadoop will transfer this jar file to nodes in your cluster running
         * mapper and reducer tasks.
         */
        job.setJarByClass(DataSort.class);

        /*
         * Specify an easily-decipherable name for the job.
         * This job name will appear in reports and logs.
         */
        job.setJobName("Data Sort");

        /*
         * TODO implement
         */
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(ValueIdentityMapper.class);
        job.setReducerClass(IdentityReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        /*
         * Start the MapReduce job and wait for it to finish.
         * If it finishes successfully, return 0. If not, return 1.
         */
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}

Mapper code:
import java.io.IOException;  
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ValueIdentityMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        // Split each input line into words and emit (word, 1) for every non-empty token.
        String line = value.toString();
        for (String word : line.split("\\W+")) {
            if (word.length() > 0) {
                context.write(new Text(word), new IntWritable(1));
            }
        }
    }
}

Reducer code:
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IdentityReducer extends Reducer<Text, IntWritable, Text, Text> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {

        // Emit each unique word once, with an empty value.
        String word = "";
        context.write(key, new Text(word));
    }
}

Best Answer

Is this the default behaviour of mapreduce?


It is not the default behaviour of MapReduce itself, just of the InputFormat you are using.

From the FileInputFormat API reference:

setInputPaths(JobConf conf, Path... inputPaths)

Set the array of Paths as the list of inputs for the map-reduce job.


From the Path API reference:

Names a file or directory in a FileSystem.


So when you say

there is no code to read multiple files


Actually, there is; you just did not have to write it yourself. The Mapper<LongWritable, Text, ...> input types mean the framework feeds the mapper a byte offset and a line of text, and it does so for every line of every file that the specified InputFormat resolves from the input path.
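
To make that concrete, here is a minimal sketch, reusing the job object from the driver above, of the forms of input FileInputFormat accepts (the paths are hypothetical, and each setInputPaths call replaces the previous one, so these are alternatives rather than a sequence):

FileInputFormat.setInputPaths(job, new Path("/user/data/input"));           // a directory: every non-hidden file inside it
FileInputFormat.setInputPaths(job, new Path("/user/data/input/words.txt")); // a single file
FileInputFormat.setInputPaths(job, new Path("/user/data/input/*.txt"));     // a glob: only the matching files

// Specific files can also be added one at a time:
FileInputFormat.addInputPath(job, new Path("/user/data/input/extra1.txt"));
FileInputFormat.addInputPath(job, new Path("/user/data/input/extra2.txt"));

So pointing args[0] at the folder made all three files part of the job's input; nothing in the mapper needs to change.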

Regarding "hadoop - Does a MapReduce program consume all files in a folder (the input dataset) by default?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/38063116/
