java - Comparing data in two large files line by line

Reposted. Author: 行者123. Updated: 2023-11-29 07:55:57

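I need to analyze the differences between two large data files that should have an identical structure. Each file is several gigabytes in size and may contain 30 million lines of text data. The files are so large that I am hesitant to load each one into its own array, when it might be easier to simply iterate through the lines in sequence. Each line has the structure: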

topicIdx, recordIdx, other fields...  

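topicIdx and recordIdx are sequential, starting at zero and incrementing by 1 with each iteration, so they are easy to find in the files. (There is no need to search around; just step forward in order.)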

I need to do something like this:

for each line in fileA
    store line in String itemsA
    get topicIdx and recordIdx
    find line in fileB with same topicIdx and recordIdx
    if exists
        store this line in string itemsB
        for each item in itemsA
            compare value with same index in itemsB
            if these two items are not virtually equal
                //do something
    else
        //do something else

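I wrote the following code with FileReader and BufferedReader, but their APIs do not seem to provide the functionality I need. Can anyone show me how to fix the code below so that it does what I want?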

void checkData(){
    FileReader FileReaderA;
    FileReader FileReaderB;
    int topicIdx = 0;
    int recordIdx = 0;
    try {
        int numLines = 0;
        FileReaderA = new FileReader("B:\\mypath\\fileA.txt");
        FileReaderB = new FileReader("B:\\mypath\\fileB.txt");
        BufferedReader readerA = new BufferedReader(FileReaderA);
        BufferedReader readerB = new BufferedReader(FileReaderB);
        String lineA = null;
        while ((lineA = readerA.readLine()) != null) {
            if (lineA != null && !lineA.isEmpty()) {
                List<String> itemsA = Arrays.asList(lineA.split("\\s*,\\s*"));
                topicIdx = Integer.parseInt(itemsA.get(0));
                recordIdx = Integer.parseInt(itemsA.get(1));
                String lineB = null;
                //lineB = readerB.readLine();//i know this syntax is wrong
                setB = rows from FileReaderB where itemsB.get(0).equals(itemsA.get(0));
                for each lineB in setB{
                    List<String> itemsB = Arrays.asList(lineB.split("\\s*,\\s*"));
                    for(int m = 0;m<itemsB.size();m++){}
                    for(int j=0;j<itemsA.size();j++){
                        double myDblA = Double.parseDouble(itemsA.get(j));
                        double myDblB = Double.parseDouble(itemsB.get(j));
                        if(Math.abs(myDblA-myDblB)>0.0001){
                            //do something
                        }
                    }
                }
            }
        }
        readerA.close();
    } catch (IOException e) {e.printStackTrace();}
}

Best answer

You need both files to be sorted by the search keys (recordIdx and topicIdx), so that you can perform a kind of merge operation like this:

open file 1
open file 2
read lineA from file1
read lineB from file2
while (there is lineA and lineB)
    if (key lineB < key lineA)
        read lineB from file 2
        continue loop
    if (key lineB > key lineA)
        read lineA from file 1
        continue
    // at this point, you have lineA and lineB with matching keys
    process your data
    read lineB from file 2

Note that only two records are held in memory at any time.
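
The merge loop above could be sketched in Java roughly as follows. This is only an illustration under some assumptions that are not in the original post: both files are already sorted by topicIdx and then recordIdx, every field after the two keys is numeric, and the names SortedFileCompare, diff, fields, compareKeys and nextLine are hypothetical. The 0.0001 tolerance is taken from the question's code. Mismatches are collected in a list to keep the sketch short; for files of this size you would more likely write them out as they are found.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class SortedFileCompare {

    // Same threshold as in the question's code.
    static final double TOLERANCE = 0.0001;

    /** Splits a line on commas, dropping whitespace around each field. */
    static String[] fields(String line) {
        return line.trim().split("\\s*,\\s*");
    }

    /** Orders two records by topicIdx (field 0), then recordIdx (field 1). */
    static int compareKeys(String[] a, String[] b) {
        int byTopic = Integer.compare(Integer.parseInt(a[0]), Integer.parseInt(b[0]));
        return byTopic != 0 ? byTopic
                : Integer.compare(Integer.parseInt(a[1]), Integer.parseInt(b[1]));
    }

    /** Returns the next non-blank line, or null at end of input. */
    static String nextLine(BufferedReader reader) throws IOException {
        String line = reader.readLine();
        while (line != null && line.trim().isEmpty()) {
            line = reader.readLine();
        }
        return line;
    }

    /** Walks both sorted inputs in step and reports fields that differ by more than TOLERANCE. */
    static List<String> diff(BufferedReader readerA, BufferedReader readerB) throws IOException {
        List<String> report = new ArrayList<>();
        String lineA = nextLine(readerA);
        String lineB = nextLine(readerB);
        while (lineA != null && lineB != null) {
            String[] a = fields(lineA);
            String[] b = fields(lineB);
            int cmp = compareKeys(a, b);
            if (cmp > 0) {            // key of B < key of A: advance B
                lineB = nextLine(readerB);
                continue;
            }
            if (cmp < 0) {            // key of B > key of A: advance A
                lineA = nextLine(readerA);
                continue;
            }
            // Keys match: compare the remaining fields numerically (assumed numeric).
            int n = Math.min(a.length, b.length);
            for (int i = 2; i < n; i++) {
                double x = Double.parseDouble(a[i]);
                double y = Double.parseDouble(b[i]);
                if (Math.abs(x - y) > TOLERANCE) {
                    report.add(a[0] + "," + a[1] + " field " + i + ": " + a[i] + " vs " + b[i]);
                }
            }
            lineB = nextLine(readerB); // as in the pseudocode, only B advances here
        }
        return report;
    }

    public static void main(String[] args) throws IOException {
        // args[0] and args[1] are the paths of the two sorted files.
        try (BufferedReader a = Files.newBufferedReader(Paths.get(args[0]));
             BufferedReader b = Files.newBufferedReader(Paths.get(args[1]))) {
            for (String mismatch : diff(a, b)) {
                System.out.println(mismatch);
            }
        }
    }
}
```

Because only B advances after a match, the loop also tolerates several lines in file B sharing one key; lines whose key appears in only one file are skipped silently, exactly as in the pseudocode.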

Regarding java - comparing data in two large files line by line, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/17662747/
