
java - How can I get Hadoop to ignore \n characters in my input files?


I am writing an inverted-index creator using Hadoop's map reduce functions. Some lines in my input files have the characters \n written into them as actual characters (not ASCII 10, but the two literal characters "\" and "n"). For a reason I don't understand, this seems to cause the map function to split my line into two lines.
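To illustrate the distinction in Java terms (this snippet is only an illustration, not part of the job code):

String literalSlashN = "foo\\nbar"; // backslash followed by 'n': two characters
String realNewline   = "foo\nbar";  // a single ASCII 10 line feed
System.out.println(literalSlashN.contains("\n")); // false
System.out.println(realNewline.contains("\n"));   // true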

Here are some sample lines from my files:

32155: Wyldwood Radio: On the Move WILL begin on Friday May 1st, as originally planned!\n\nWe had some complications with... http://t.co/g8STpuHn5Q

5: RT @immoumita: #SaveJalSatyagrahi\nJal Satyagraha 'holding on to the truth by water' https://t.co/x3XgRvCE5H via @4nks

15161: RT @immoumita: #SaveJalSatyagrahi\nJal Satyagraha 'holding on to the truth by water' https://t.co/x3XgRvCE5H via @4nks

Here is the output:

co :78516: tweets0001:30679;2, ... , tweets0001:We had some complications with... http;1, ...

x3XgRvCE5H :2: tweets0000:Jal Satyagraha 'holding on to the truth by water' https;2

Below is my MapReduce code:

Map

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
    private final static Text word = new Text();
    private final static Text location = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        String line = value.toString();

        // Everything before the first colon is the tweet (line) number.
        int colon_index = line.indexOf(":");
        if (colon_index > 0) {
            String tweet_num = line.substring(0, colon_index);
            line = line.substring(colon_index + 1);

            StringTokenizer tokenizer = new StringTokenizer(line, " !@$%^&*()-+=\"\\:;/?><.,{}[]|`~");
            FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
            String filename = fileSplit.getPath().getName();
            location.set(filename + ":" + tweet_num);

            // Emit one {word, filename:tweet_num} pair per token on the line.
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, location);
            }
        }
    }
}

Reduce

public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        boolean first = true;
        int count = 0;
        StringBuilder locations = new StringBuilder();
        HashMap<String, Integer> frequencies = new HashMap<String, Integer>();

        // Count how many times each filename:line_number location occurs.
        while (values.hasNext()) {
            String location = values.next().toString();
            if (frequencies.containsKey(location)) {
                int frequency = frequencies.get(location).intValue() + 1;
                frequencies.put(location, new Integer(frequency));
            } else {
                frequencies.put(location, new Integer(1));
            }
            count++;
        }

        // Build the comma-separated "location;frequency" list.
        for (String location : frequencies.keySet()) {
            int frequency = frequencies.get(location).intValue();
            if (!first)
                locations.append(", ");
            locations.append(location);
            locations.append(";" + frequency);
            first = false;
        }

        StringBuilder finalString = new StringBuilder();
        finalString.append(":" + String.valueOf(count) + ": ");
        finalString.append(locations.toString());
        output.collect(key, new Text(finalString.toString()));
    }
}
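For reference, a minimal driver that wires these two classes together with the old (org.apache.hadoop.mapred) API might look like the sketch below; the enclosing class name InvertedIndex and the job name are placeholders, not taken from the original code:

public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(InvertedIndex.class); // hypothetical enclosing class
    conf.setJobName("inverted-index");               // placeholder job name

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);   // default line-oriented input
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
}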

The general data flow is to map each line to a {word, filename:line_number} pair, and then reduce those pairs by counting how often each occurs. The output should be:

Word-->:occurrences:filename1:line_number:occurrences_on_this_line, filename2....

The map and reduce parts work fine. You can even see from my samples that the tweets on lines 5 and 15161 both contain the string x3XgRvCE5H, and, because my mapper prepends the line number it finds before the colon and the two tweets contain identical text, they both map to the same index location, giving a "frequency" value of 2.

So, my question is: how can I get Hadoop's input format to not treat the characters "\" and "n" as a newline? After all, they are not ASCII 10, the actual newline/line-feed character, but two separate literal characters.

Best Answer

You would have to extend FileInputFormat and write a new class overriding that behavior. For example:

public class ClientTrafficInputFormat extends FileInputFormat {

    @Override
    public RecordReader createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        return new ClientTrafficRecordReader();
    }
}
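Note that this answer uses the new org.apache.hadoop.mapreduce API, while the question's code uses the old mapred one. A new-API job would then be pointed at the custom format, e.g. (assuming Hadoop 2.x's Job API):

Job job = Job.getInstance(new Configuration(), "inverted-index");
job.setInputFormatClass(ClientTrafficInputFormat.class);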

The RecordReader has to be overridden as well:

public class ClientTrafficRecordReader extends
        RecordReader<ClientTrafficKeyWritable, ClientTrafficValueWritable> {

    ...

    // Create your own RecordReader: this is where you specify that "\n"
    // should not be treated as a record break but read as the two
    // characters "\" and "n".
    private LineRecordReader reader = new LineRecordReader();

    @Override
    public void initialize(InputSplit is, TaskAttemptContext tac) throws IOException,
            InterruptedException {
        reader.initialize(is, tac);
    }

    ...

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        // customize your input here
    }
}
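Since the answer elides most of the reader, here is a minimal, self-contained sketch of the delegating pattern with plain LongWritable/Text keys and values; the class name LiteralLineRecordReader is illustrative, not from the original answer. The stock LineRecordReader breaks records only on real line feeds (ASCII 10), so a literal backslash-n in the data already passes through it intact; any additional per-record cleanup would go in nextKeyValue():

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class LiteralLineRecordReader extends RecordReader<LongWritable, Text> {
    // Delegate the actual byte-level reading to Hadoop's stock line reader.
    private final LineRecordReader reader = new LineRecordReader();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        reader.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        // LineRecordReader splits records only on ASCII 10 line feeds, so a
        // two-character "\" + "n" sequence in the data is left untouched.
        return reader.nextKeyValue();
    }

    @Override
    public LongWritable getCurrentKey() { return reader.getCurrentKey(); }

    @Override
    public Text getCurrentValue() { return reader.getCurrentValue(); }

    @Override
    public float getProgress() throws IOException { return reader.getProgress(); }

    @Override
    public void close() throws IOException { reader.close(); }
}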

Regarding "java - How can I get Hadoop to ignore \n characters in my input files?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/30498165/
