
python - Compare 2 huge (5-6 GB) CSV files and count the matching and non-matching rows

Reposted. Author: 太空宇宙. Updated: 2023-11-04 04:11:20

There are 2 huge CSV files (5-6 GB each). The goal is to compare the two files: how many rows match, and how many do not?

Suppose file1.csv contains 5 duplicate rows; they should be counted as 1 rather than 5. Likewise, any redundant rows in file2.csv should be counted as 1.

I want the output to show the number of matching rows and the number of different rows.
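
One reading of this requirement, shown on toy data: duplicates collapse to a single row within each file, "matching" means a distinct row present in both files, and "different" means a distinct row present in only one of them. The function name `count_match_diff` and the sample rows are hypothetical, and this in-memory sketch is only meant to pin down the expected counts, not to handle 5-6 GB inputs.

```python
def count_match_diff(rows1, rows2):
    """Return (matching, different) counts over the distinct rows of each input."""
    unique1 = set(rows1)  # duplicates within file 1 collapse to one entry
    unique2 = set(rows2)  # duplicates within file 2 collapse to one entry
    matching = len(unique1 & unique2)   # rows present in both files
    different = len(unique1 ^ unique2)  # rows present in exactly one file
    return matching, different


# Hypothetical sample data: "a,1" appears three times in file 1 but counts once.
file1_rows = ["a,1", "a,1", "a,1", "b,2", "c,3"]
file2_rows = ["a,1", "b,2", "b,2", "d,4"]

print(count_match_diff(file1_rows, file2_rows))  # (2, 2)
```

Here "a,1" and "b,2" are the two matching rows, while "c,3" and "d,4" are the two different rows.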

Best answer

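I wrote a file comparer in Python that is optimized for comparing large files and reports the number of matching lines and the number of different lines. Replace input_file1 and input_file2 with the paths to your 2 large files and run it. Let me know the results.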

input_file1 = r'input_file.txt'
input_file2 = r'input_file.1.txt'

__author__ = 'https://github.com/praveen-kumar-rr'

# Simple, memory-efficient, high-performance file comparer.
# Can be used to efficiently compare large files.

# Algorithm:
# 1. Hash every line of both files and compare the hashes first.
# 2. Hashes found in only one file are counted as different lines.
# 3. For hashes found in both files, the exact lines are read back from the files.
# 4. Those lines go through the same comparison, this time on the strings themselves.


def accumulate_index(values):
    '''
    Returns a dict mapping each value to the list of indexes where it occurs
    '''
    result = {}
    for i, v in enumerate(values):
        # Append in place; rebuilding the list on every hit is quadratic for repeated values
        result.setdefault(v, []).append(i)
    return result


def get_lines(fp, line_numbers):
    '''
    Lazily yields the lines of the file pointer whose indexes are in line_numbers
    '''
    wanted = set(line_numbers)  # constant-time membership test
    return (v for i, v in enumerate(fp) if i in wanted)


def get_match_diff(left, right):
    '''
    Compares the left and right iterables and returns (different items, matching items)
    '''
    left_set = set(left)
    right_set = set(right)
    return left_set ^ right_set, left_set & right_set


if __name__ == '__main__':
    # Get the hashes of all lines in both files
    with open(input_file1) as fp1, open(input_file2) as fp2:
        dict1 = accumulate_index(map(hash, fp1))
        dict2 = accumulate_index(map(hash, fp2))

    diff_hashes, matching_hashes = get_match_diff(
        dict1.keys(), dict2.keys())

    diff_lines_count = len(diff_hashes)

    matching_lines_count = 0
    for h in matching_hashes:
        # Each matching hash re-reads both files to fetch the exact lines
        with open(input_file1) as fp1, open(input_file2) as fp2:
            left_lines = get_lines(fp1, dict1[h])
            right_lines = get_lines(fp2, dict2[h])
            d, m = get_match_diff(left_lines, right_lines)
            diff_lines_count += len(d)
            matching_lines_count += len(m)

    print('Total number of matching lines is : ', matching_lines_count)
    print('Total number of different lines is : ', diff_lines_count)
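
The script above opens both files again for every hash that appears in both, which can become slow when most lines match. As a possible alternative (not the answer author's method), the sketch below reads each file exactly once and keeps only a fixed-size BLAKE2b digest per distinct line. The names `line_digests` and `count_match_diff_files` are hypothetical, and the sketch assumes that a 128-bit digest collision is unlikely enough to ignore, so the exact-line verification step is dropped. Memory use still grows with the number of distinct lines.

```python
import hashlib


def line_digests(path, digest_size=16):
    """Return the set of BLAKE2b digests of the distinct lines in a file."""
    digests = set()
    with open(path, 'rb') as fp:
        for line in fp:
            # Strip the line terminator so that a final line without a
            # trailing newline, or a CRLF file, still compares equal.
            content = line.rstrip(b'\r\n')
            digests.add(hashlib.blake2b(content, digest_size=digest_size).digest())
    return digests


def count_match_diff_files(path1, path2):
    """Return (matching, different) counts of distinct lines across two files."""
    left = line_digests(path1)
    right = line_digests(path2)
    # Assumption: equal digests are treated as equal lines (no re-read to verify).
    return len(left & right), len(left ^ right)


# Usage, with the same placeholder file names as the script above:
#   matching, different = count_match_diff_files('input_file.txt', 'input_file.1.txt')
#   print('Total number of matching lines is : ', matching)
#   print('Total number of different lines is : ', different)
```

Unlike the built-in `hash`, these digests are stable across Python processes, so the sets could also be computed separately for each file if needed.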

Regarding "python - Compare 2 huge (5-6 GB) CSV files and count the matching and non-matching rows", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56216081/
