
java - Compare 2 large unsorted CSV files based on 2 columns

Reposted. Author: 太空宇宙. Updated: 2023-11-03 19:28:53

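My task is to compare two large unsorted .csv files on columns 1 and 3. Each file contains about 200k records. For the output, I need to know which records, keyed on columns 1 and 3, exist in the first file but not in the second. The files are comma-separated, with each value enclosed in quotes. Column 3 must be compared case-insensitively.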

Sample file 1:

"id", "name", "email", "country"
"1233", "jake", "jake@mailinator.com", "USA"
"2345", "alison", "Alison@mailinator.com", "Canada"
"3456", "jacob", "jacob@mailinator.com", "USA"
"5678", "natalia", "natalia@mailinator.com", "USA"

File 2:

"id", "name", "email", "country"
"2345", "alison", "alison@mailinator.com", "Canada"
"3456", "jacob", "jacob@mailinator.com", "USA"
"5690", "lina", "lina@mailinator.com", "Canada"

Desired output file:

"5678", "natalia", "natalia@mailinator.com", "USA"

A code sample would be much appreciated.

Best answer

Try:

join -v 1 -i -t, -1 1 -2 1 -o 1.2 1.3 1.4 1.5  <(awk -F, '{print $1":"$3","$0}' f1.txt | sort) <(awk -F, '{print $1":"$3","$0}' f2.txt | sort)

How it works:

1) I first create a composite key column by concatenating columns 1 and 3, and prepend it to each line:

awk -F, '{print $1":"$3","$0}' f1.txt
awk -F, '{print $1":"$3","$0}' f2.txt

2) I sort both outputs:

awk -F, '{print $1":"$3","$0}' f1.txt | sort 
awk -F, '{print $1":"$3","$0}' f2.txt | sort

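3) Then I use the join command to join on the first column (my composite key) and output the unpairable lines from file 1. Here -v 1 prints only the lines of file 1 that have no match in file 2, -i ignores case when comparing keys, -t, sets the comma as the field separator, and -o 1.2 1.3 1.4 1.5 prints the four original columns without the composite key. The GNU join documentation notes that with -i the inputs should be ordered case-insensitively, so sort -f may be needed on data where keys differ only in case.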

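Output (row 1233 is also absent from file 2, so it appears in addition to the row listed in the question):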

"1233",  "jake", "jake@mailinator.com", "USA"
"5678", "natalia", "natalia@mailinator.com", "USA"

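Since the question is tagged java, the same comparison can also be sketched in Java. This is not the answer's sort-and-join method: it swaps in a hash set of composite keys built from file 2, which fits in memory comfortably at roughly 200k records. The class name CsvKeyDiff and its method names are hypothetical, and the sketch assumes every row has at least three columns and that no field contains an embedded comma (true of the sample data, but not of CSV in general).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class; a sketch, not a full CSV parser.
class CsvKeyDiff {

    // Strips surrounding whitespace and one pair of double quotes.
    static String unquote(String field) {
        String s = field.trim();
        if (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"")) {
            s = s.substring(1, s.length() - 1);
        }
        return s;
    }

    // Builds the composite key from column 1 and lower-cased column 3.
    // Assumption: fields contain no embedded commas, so split(",") is safe.
    static String key(String line) {
        String[] cols = line.split(",");
        String id = unquote(cols[0]);
        String email = unquote(cols[2]).toLowerCase(Locale.ROOT);
        return id + ":" + email;
    }

    // Returns the lines of 'first' whose key does not occur in 'second',
    // preserving the order of 'first'. Blank lines are skipped.
    static List<String> onlyInFirst(List<String> first, List<String> second) {
        Set<String> seen = new HashSet<>();
        for (String line : second) {
            if (!line.trim().isEmpty()) {
                seen.add(key(line));
            }
        }
        List<String> result = new ArrayList<>();
        for (String line : first) {
            if (!line.trim().isEmpty() && !seen.contains(key(line))) {
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java CsvKeyDiff.java f1.txt f2.txt
        List<String> first = Files.readAllLines(Paths.get(args[0]));
        List<String> second = Files.readAllLines(Paths.get(args[1]));
        for (String line : onlyInFirst(first, second)) {
            System.out.println(line);
        }
    }
}
```

The header row produces the same key in both files, so it is treated as matched and is not printed. For files with quoted commas or escaped quotes, a real CSV parser would be a safer choice than split(",").
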
Regarding java - Compare 2 large unsorted CSV files based on 2 columns, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/6999705/
