
java - Map-reduce with Java - java.lang.StringIndexOutOfBoundsException: String index out of range: 0

Reposted. Author: 行者123. Updated: 2023-12-02 00:35:15

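I'm trying to write a Spark application that outputs the number of words starting with each letter. I'm getting a string index out of range error. Any suggestions, or am I approaching this map-reduce problem the wrong way?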

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class Main {
    public static void main(String[] args) throws Exception {

        // Configure Spark to run with a local master
        SparkConf conf = new SparkConf().setAppName("App").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        System.out.printf("%d lines\n", sc.textFile("pg100.txt").count());

        // MARK: Mapping
        // Read target file into a Resilient Distributed Dataset (RDD)
        JavaRDD<String> lines = sc.textFile("pg100.txt");

        // Split lines into individual words by converting each line into an array of words
        // Treat all words as lowercase
        // Strip every character except letters, digits, '_' and '-'
        JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator()).map(line -> line.replaceAll("[^a-zA-Z0-9_-]","").replaceAll("\\.", "").toLowerCase());

        // MARK: Sorting
        // Pair each word with its first character and a count of 1
        JavaPairRDD<Character, Integer> letters = words.mapToPair(w -> new Tuple2<>(w.charAt(0), 1));

        // MARK: Reducing
        // Sum the counts for each starting character
        JavaPairRDD<Character, Integer> counts = letters.reduceByKey((n1,n2) -> n1 + n2);

        counts.saveAsTextFile("result");
        sc.stop();

    }
}

Best answer

I suspect that some words consist only of characters that are stripped out by the following line:

JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator()).map(line -> line.replaceAll("[^a-zA-Z0-9_-]","").replaceAll("\\.", "").toLowerCase());

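As a result, some words become empty strings and still remain in the words RDD. When you try to access index 0 of one of them, you naturally get the exception you mentioned.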

You might expect that an empty string produced by map would not be included in words, but that is not the case.
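To see this without a cluster, here is a minimal plain-Java sketch (no Spark) that applies the same split and cleaning step to a single line. The class and method names (EmptyTokenDemo, clean, cleanedTokens) are made up for illustration and are not part of the question's code. It also shows two other sources of empty tokens that are likely in a text file: consecutive spaces and blank lines, since split(" ") on an empty line still returns one empty string.

```java
import java.util.ArrayList;
import java.util.List;

public class EmptyTokenDemo {
    // Same cleaning expression as in the question's map step
    static String clean(String token) {
        return token.replaceAll("[^a-zA-Z0-9_-]", "").replaceAll("\\.", "").toLowerCase();
    }

    // Split a line on single spaces and clean every token, keeping empty results
    static List<String> cleanedTokens(String line) {
        List<String> out = new ArrayList<>();
        for (String token : line.split(" ")) {
            out.add(clean(token));
        }
        return out;
    }

    public static void main(String[] args) {
        // The double space, "..." and "!" all end up as empty strings
        System.out.println(cleanedTokens("To be,  or not ... !")); // [to, be, , or, not, , ]

        // A blank line still yields one (empty) token
        System.out.println(cleanedTokens("").size()); // 1

        try {
            clean("!").charAt(0);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("charAt(0) failed on an empty token");
        }
    }
}
```

The exact exception message depends on the Java version, so the sketch does not print it.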

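UPD. You can filter out the empty strings like this (filter returns a new RDD, so call mapToPair on its result rather than on the original words):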

words.filter(line -> !line.equals(""));
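For reference, here is a rough end-to-end sketch of the corrected pipeline written with plain java.util.stream instead of Spark, so it can be run anywhere. It is only an analogue: the class and method names (FirstLetterCount, countByFirstLetter) and the sample lines are invented for illustration, and Collectors.toMap stands in for mapToPair plus reduceByKey. The second replaceAll from the question is left out because the first pattern already removes dots.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class FirstLetterCount {
    static Map<Character, Integer> countByFirstLetter(String... lines) {
        return Arrays.stream(lines)
                // split each line into tokens (flatMap in the question)
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // strip unwanted characters and lowercase (map in the question)
                .map(w -> w.replaceAll("[^a-zA-Z0-9_-]", "").toLowerCase())
                // the fix: drop empty tokens before calling charAt(0)
                .filter(w -> !w.isEmpty())
                // key by first character and sum the ones (mapToPair + reduceByKey)
                .collect(Collectors.toMap(w -> w.charAt(0), w -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        Map<Character, Integer> counts =
                countByFirstLetter("To be,  or not to be!", "", "  ... that is the question");
        System.out.println(counts); // {b=2, i=1, n=1, o=1, q=1, t=4}
    }
}
```

In the Spark version, the equivalent change would be to chain the same filter between the map and mapToPair calls, for example words.filter(w -> !w.isEmpty()).mapToPair(...).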

Regarding "java - Map-reduce with Java - java.lang.StringIndexOutOfBoundsException: String index out of range: 0", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/57983977/
