
java - Optimizing large data file reading in Java


I am writing an application to help improve machine translation for my thesis. For this I need a large amount of ngram data. I got the data from Google, but it is not in a useful format.

The Google data is formatted like this:

ngram TAB year TAB match_count TAB page_count TAB volume_count NEWLINE

This is what I want:

ngram total_match_count_for_all_years
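
For example, with a hypothetical bigram (the counts here are made up purely for illustration), two input lines such as

analysis is	1991	4	2	1
analysis is	1992	3	2	1

would collapse into the single output line

analysis is	7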

So I wrote a small application to run through the files, extract the ngrams, and aggregate the data across years to get the total counts. It seems to work fine, but since the Google files are so big (1.5GB each! and there are 99 of them >.<) it takes a very long time to get through them all.

Here is the code:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class mergeData
{
    private static List<String> storedNgrams = new ArrayList<String>(100001);
    private static List<String> storedParts = new ArrayList<String>(100001);
    private static List<String> toWritePairs = new ArrayList<String>(100001);
    private static int rows = 0;
    private static int totalFreq = 0;

    public static void main(String[] args) throws Exception
    {
        File bigram = new File("data01");
        BufferedReader in = new BufferedReader(new FileReader(bigram));
        File myFile = new File("newData.txt");
        Writer out = new BufferedWriter(new FileWriter(myFile));
        // Process the input in chunks of 1,000,000 lines until end of file
        // (a plain while (true) here would never terminate).
        boolean moreData = true;
        while (moreData)
        {
            rows = 0;
            moreData = merge(in, out);
        }
        in.close();
        out.close();
    }

    // Buffers up to 1,000,000 lines, tokenizes them, and merges adjacent
    // records that belong to the same bigram. Returns false at end of file.
    public static boolean merge(BufferedReader in, Writer out) throws IOException
    {
        boolean moreData = true;
        while (rows != 1000000)
        {
            String line = in.readLine();
            if (line == null)
            {
                moreData = false;
                break;
            }
            storedNgrams.add(line);
            rows++;
        }

        // Split each buffered line on whitespace; a bigram line yields six
        // tokens: word1 word2 year match_count page_count volume_count.
        while (!storedNgrams.isEmpty())
        {
            storedParts.addAll(Arrays.asList(storedNgrams.get(0).split("\\s")));
            storedNgrams.remove(0);
        }

        while (storedParts.size() >= 8)
        {
            // Debug output; printing every record slows processing considerably.
            System.out.println(storedParts.get(0) + " " + storedParts.get(1) + " "
                    + storedParts.get(6) + " " + storedParts.get(7));
            if (toWritePairs.size() == 0 && storedParts.get(0).equals(storedParts.get(6))
                    && storedParts.get(1).equals(storedParts.get(7)))
            {
                // Two consecutive records for the same bigram: sum their counts.
                totalFreq = Integer.parseInt(storedParts.get(3))
                        + Integer.parseInt(storedParts.get(9));
                toWritePairs.add(storedParts.get(0));
                toWritePairs.add(storedParts.get(1));
                toWritePairs.add(Integer.toString(totalFreq));
                storedParts.subList(0, 11).clear();
            }
            else if (!toWritePairs.isEmpty() && storedParts.get(0).equals(toWritePairs.get(0))
                    && storedParts.get(1).equals(toWritePairs.get(1)))
            {
                // The next record continues the bigram being accumulated.
                int totalFreq = Integer.parseInt(storedParts.get(3))
                        + Integer.parseInt(toWritePairs.get(2));
                toWritePairs.remove(2);
                toWritePairs.add(Integer.toString(totalFreq));
                storedParts.subList(0, 5).clear();
            }
            else if (!toWritePairs.isEmpty()
                    && !(storedParts.get(0).equals(storedParts.get(6))
                            && storedParts.get(1).equals(storedParts.get(7))))
            {
                // A new bigram starts: queue it for writing.
                toWritePairs.add(storedParts.get(0));
                toWritePairs.add(storedParts.get(1));
                toWritePairs.add(storedParts.get(2));
                storedParts.subList(0, 2).clear();
            }
            else if (!toWritePairs.isEmpty())
            {
                // Flush the finished bigram and its running total.
                out.append(toWritePairs.get(0) + " " + toWritePairs.get(1) + " "
                        + toWritePairs.get(2) + "\n");
                toWritePairs.subList(0, 2).clear();
            }
            // Flushing on every iteration is another performance drag.
            out.flush();
        }
        return moreData;
    }
}

If anyone has any ideas how to speed up the processing of these files, it would help me a great deal.

Best Answer

Create a temporary table in a database. Populate it with the rows from the files. Create an index if necessary and let the database do the grouping. It will simplify the logic of the program and will most likely run faster.
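
A minimal sketch of that approach, assuming an embedded H2 database on the classpath; the table name ngram_raw and the output file names are made up for illustration:

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class DbMerge {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection("jdbc:h2:./ngrams");
        Statement st = conn.createStatement();
        st.execute("CREATE TABLE ngram_raw (ngram VARCHAR(255), match_count BIGINT)");

        // Load the tab-separated Google file; only the ngram and
        // match_count columns are needed for the final totals.
        PreparedStatement ins = conn.prepareStatement(
                "INSERT INTO ngram_raw VALUES (?, ?)");
        BufferedReader in = new BufferedReader(new FileReader("data01"));
        String line;
        int batch = 0;
        while ((line = in.readLine()) != null) {
            // Fields: ngram, year, match_count, page_count, volume_count
            String[] f = line.split("\t");
            ins.setString(1, f[0]);
            ins.setLong(2, Long.parseLong(f[2]));
            ins.addBatch();
            if (++batch % 10000 == 0) ins.executeBatch(); // insert in batches
        }
        ins.executeBatch();
        in.close();

        // Let the database aggregate the counts across years.
        ResultSet rs = st.executeQuery(
                "SELECT ngram, SUM(match_count) FROM ngram_raw GROUP BY ngram");
        while (rs.next()) {
            System.out.println(rs.getString(1) + " " + rs.getLong(2));
        }
        conn.close();
    }
}

With 99 files of 1.5GB each the temporary table has to live on disk rather than in memory, but either way the GROUP BY replaces all of the hand-rolled merge logic above.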

Regarding java - Optimizing large data file reading in Java, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/9619237/
