
java - Slow reading from a CSV file

Reposted · Author: 行者123 · Updated: 2023-11-30 06:06:14

I am trying to read a CSV file, but it is slow. Here is a rough outline of the code:

private static Film[] readMoviesFromCSV() {
    // Regex to split on commas without splitting inside double quotes.
    // https://regexr.com/3s3me <- example on this data
    var pattern = Pattern.compile(",(?=(?:[^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)");
    Film[] films = null;
    try (var br = new BufferedReader(new FileReader(FILENAME))) {
        var start = System.currentTimeMillis();
        var temparr = br.lines().skip(1).collect(Collectors.toList()); // skip the header line and read into a List
        films = temparr.stream().parallel()
                .map(pattern::split)
                .filter(x -> x.length == 24 && x[7].equals("en")) // all 24 fields present, English-language movies only
                .filter(x -> x[14].length() > 0) // check that x[14] (date) is present
                .map(movieData -> new Film(movieData[8], movieData[9], movieData[14], movieData[22], movieData[23], movieData[7]))
                // movieData[8] = String title, movieData[9] = String overview
                // movieData[14] = String date (constructor parses it to a LocalDate)
                // movieData[22] = String avgRating
                .toArray(Film[]::new);
        System.out.println(MessageFormat.format("Execution time: {0}", System.currentTimeMillis() - start));
        System.out.println(films.length);
    } catch (IOException e) { // FileNotFoundException is a subclass of IOException, so one catch suffices
        e.printStackTrace();
    }
    return films;
}

The file is about 30 MB and reading it takes roughly 3-4 seconds on average. I am already using streams, but it is still slow. Is it because of the split performed on every line?
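The split is indeed a likely culprit: the lookahead in the pattern re-scans the remainder of the line at every comma, so splitting one line can approach quadratic cost in the line length. As a minimal sketch (an illustration only, not the library approach from the answer below): a single pass over the characters splits the same way in linear time. This version keeps the surrounding quotes in the field and does not handle escaped `""` pairs.

```java
import java.util.ArrayList;
import java.util.List;

public class CsvSplit {
    // Split one CSV line on commas, treating double-quoted sections as atomic.
    // Every character is examined exactly once, so the cost is O(n) per line,
    // versus the lookahead regex which re-scans the tail at each comma.
    static String[] splitLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;   // toggle quoted state, keep the quote char
                current.append(c);
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString()); // field boundary outside quotes
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // last field has no trailing comma
        return fields.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] parts = splitLine("1,\"Hello, World\",en");
        System.out.println(parts.length); // 3
        System.out.println(parts[1]);     // "Hello, World" (quotes retained)
    }
}
```

A method like this could replace `pattern::split` in the stream above without changing the rest of the pipeline.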

Edit: using the uniVocity-parsers library, I managed to speed up reading and processing by a factor of 3. It now takes 950 ms on average, which is quite impressive.

private static Film[] readMoviesWithLib() {
    Film[] films = null;
    CsvParserSettings parserSettings = new CsvParserSettings();
    parserSettings.setLineSeparatorDetectionEnabled(true);
    RowListProcessor rowProcessor = new RowListProcessor();
    parserSettings.setProcessor(rowProcessor);
    parserSettings.setHeaderExtractionEnabled(true);
    CsvParser parser = new CsvParser(parserSettings);
    var start = System.currentTimeMillis();
    try {
        parser.parse(new BufferedReader(new FileReader(FILENAME)));
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    List<String[]> rows = rowProcessor.getRows();
    films = rows.stream()
            .filter(Objects::nonNull)
            .filter(x -> x.length == 24 && x[14] != null && x[7] != null)
            .filter(x -> x[7].equals("en"))
            .map(movieData -> new Film(movieData[8], movieData[9], movieData[14], movieData[22], movieData[23], movieData[7]))
            .toArray(Film[]::new);
    System.out.println(MessageFormat.format("Time: {0}", System.currentTimeMillis() - start));
    return films;
}

Best Answer

Author of the univocity-parsers library here. You can speed up the code you posted in your edit even further by rewriting it like this:

// Initialize an ArrayList with a good size to avoid reallocation.
final ArrayList<Film> films = new ArrayList<>(20000);
CsvParserSettings parserSettings = new CsvParserSettings();
parserSettings.setLineSeparatorDetectionEnabled(true);
parserSettings.setHeaderExtractionEnabled(true);

// Don't generate strings for columns you don't want.
parserSettings.selectIndexes(7, 8, 9, 14, 22, 23);

// Keep generating rows with the same number of columns found in the input;
// indexes not selected will contain nulls, as they are not processed.
parserSettings.setColumnReorderingEnabled(false);

parserSettings.setProcessor(new AbstractRowProcessor() {
    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        if (row.length == 24 && "en".equals(row[7]) && row[14] != null) {
            films.add(new Film(row[8], row[9], row[14], row[22], row[23], row[7]));
        }
    }
});

CsvParser parser = new CsvParser(parserSettings);
long start = System.currentTimeMillis();
try {
    parser.parse(new File(FILENAME), "UTF-8");
} catch (FileNotFoundException e) {
    e.printStackTrace();
}

System.out.println(MessageFormat.format("Time: {0}", System.currentTimeMillis() - start));
return films.toArray(new Film[0]);

For convenience, if you need to map the content into different classes, you can also use annotations in your Film class.
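As a sketch of what that annotation-based mapping might look like with univocity's `@Parsed` and `BeanListProcessor` (the header names `title`, `overview`, `release_date` and the file name `movies.csv` are assumptions for illustration, not taken from the question's data):

```java
import com.univocity.parsers.annotations.Parsed;
import com.univocity.parsers.common.processor.BeanListProcessor;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import java.io.File;
import java.util.List;

// Hypothetical bean: @Parsed ties each field to a CSV header name,
// so the parser builds these objects directly instead of String[] rows.
public class AnnotatedFilm {
    @Parsed(field = "title")
    private String title;

    @Parsed(field = "overview")
    private String overview;

    @Parsed(field = "release_date")
    private String releaseDate;

    public static void main(String[] args) {
        CsvParserSettings settings = new CsvParserSettings();
        settings.setLineSeparatorDetectionEnabled(true);
        settings.setHeaderExtractionEnabled(true);

        // BeanListProcessor collects one AnnotatedFilm per row.
        BeanListProcessor<AnnotatedFilm> processor =
                new BeanListProcessor<>(AnnotatedFilm.class);
        settings.setProcessor(processor);

        new CsvParser(settings).parse(new File("movies.csv"), "UTF-8");
        List<AnnotatedFilm> films = processor.getBeans();
        System.out.println(films.size());
    }
}
```

This trades the manual index bookkeeping (`row[8]`, `row[14]`, ...) for named bindings, at the cost of one bean class per target type.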

Hope this helps.

Regarding "java - Slow reading from a CSV file", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51263473/
