
java - How to read all of Common Crawl's data from AWS in Java?


I am completely new to Hadoop and MapReduce programming, and I am trying to write my first MapReduce program using Common Crawl's data.

I want to read all of the April 2015 data from AWS. For example, if I wanted to download all of the April 2015 data from the command line, I would run:

s3cmd get s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-18/segments/1429246633512.41/wat/*.warc.wat.gz

This command line works, but I don't want to download all of the April 2015 data; I only want to read all of the "warc.wat.gz" files directly (so I can analyze the data).
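(For reference, and not part of the original post: reading an object directly from S3 is also possible outside MapReduce with the Hadoop FileSystem API. The following is a minimal sketch; the class name, the credential handling, and the file-name argument args[0] are assumptions.)

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamOneWatFile {
    public static void main(String[] args) throws Exception {
        // Assumption: credentials are set here; in practice use a safer mechanism.
        Configuration conf = new Configuration();
        conf.set("fs.s3n.awsAccessKeyId", "your_key");
        conf.set("fs.s3n.awsSecretAccessKey", "your_secret");

        // args[0] names one .warc.wat.gz object under the wat/ prefix (assumption).
        Path wat = new Path("s3n://aws-publicdatasets/common-crawl/crawl-data/"
                + "CC-MAIN-2015-18/segments/1429246633512.41/wat/" + args[0]);
        FileSystem fs = wat.getFileSystem(conf);

        // WAT files are gzip-compressed WARC metadata records whose payloads are JSON.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(fs.open(wat)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // real analysis would parse the WARC/JSON records here
            }
        }
    }
}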

I tried to create my job, which looks like this:

public class FirstJob extends Configured implements Tool {

    private static final Logger LOG = Logger.getLogger(FirstJob.class);

    /**
     * Main entry point that uses the {@link ToolRunner} class to run the Hadoop
     * job.
     */
    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new FirstJob(), args);
        System.out.println("done !!");
        System.exit(res);
    }

    /**
     * Builds and runs the Hadoop job.
     *
     * @return 0 if the Hadoop job completes successfully and 1 otherwise.
     */
    public int run(String[] arg0) throws Exception {
        Configuration conf = getConf();
        //
        Job job = new Job(conf);
        job.setJarByClass(FirstJob.class);
        job.setNumReduceTasks(1);

        // String inputPath = "data/*.warc.wat.gz";
        String inputPath = "s3n://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-18/segments/1429246633512.41/wat/*.warc.wat.gz";
        LOG.info("Input path: " + inputPath);
        FileInputFormat.addInputPath(job, new Path(inputPath));

        String outputPath = "/tmp/cc-firstjob/";
        FileSystem fs = FileSystem.newInstance(conf);
        if (fs.exists(new Path(outputPath))) {
            fs.delete(new Path(outputPath), true);
        }
        FileOutputFormat.setOutputPath(job, new Path(outputPath));

        job.setInputFormatClass(WARCFileInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        job.setMapperClass(FirstJobUrlTypeMap.ServerMapper.class);
        job.setReducerClass(LongSumReducer.class);

        if (job.waitForCompletion(true)) {
            return 0;
        } else {
            return 1;
        }
    }
}

But I am getting this error:

Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).

How can I solve this problem? Thanks in advance.

Best answer

I solved my problem. In the code, change:

 Configuration conf = getConf();
//
Job job = new Job(conf);

to:
Configuration conf = new Configuration();
conf.set("fs.s3n.awsAccessKeyId", "your_key");
conf.set("fs.s3n.awsSecretAccessKey", "your_key");
Job job = new Job(conf);
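(Editor's note, not part of the original answer: hard-coding the keys works, but it puts credentials into source code. Because FirstJob extends Configured and is launched through ToolRunner, the same properties can also be passed on the command line as generic options, e.g. hadoop jar <your-jar> FirstJob -D fs.s3n.awsAccessKeyId=... -D fs.s3n.awsSecretAccessKey=..., provided getConf() is kept. A minimal sketch that instead falls back to the conventional AWS environment variable names, which are an assumption here:)

// Sketch: keep getConf() so ToolRunner's -D options still apply, and fall back
// to the standard AWS environment variables if they are set.
Configuration conf = getConf();
String accessKey = System.getenv("AWS_ACCESS_KEY_ID");
String secretKey = System.getenv("AWS_SECRET_ACCESS_KEY");
if (accessKey != null && secretKey != null) {
    conf.set("fs.s3n.awsAccessKeyId", accessKey);
    conf.set("fs.s3n.awsSecretAccessKey", secretKey);
}
Job job = new Job(conf);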

Regarding "java - How to read all of Common Crawl's data from AWS in Java?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/31287956/
