
python - Converting a Perl script to Python: dedupe 2 files based on hash keys

Reposted. Author: 太空宇宙. Updated: 2023-11-04 06:49:33

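I'm new to Python and was wondering if someone would be kind enough to convert a fairly simple Perl script into Python as an example?

The script takes 2 files and, by comparing hash keys, outputs only the lines that are unique to the second file. It also writes the duplicate lines to a file. I have found this method of deduplication to be extremely fast in Perl, and I would like to see how Python compares.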

#!/usr/bin/perl
use strict;
use warnings;

## Compare file1 and file2 and output only the unique lines from file2.

my (%file1hash, %file2hash);

## Open file1.txt and store each line as a hash key.
open my $file1, '<', "file1.txt" or die $!;
while ( my $line = <$file1> ) {
    $file1hash{$line} = $line;
}
close $file1;

## Open file2.txt and store each line as a hash key.
open my $file2, '<', "file2.txt" or die $!;
while ( my $line = <$file2> ) {
    $file2hash{$line} = $line;
}
close $file2;

open my $dfh, '>', "duplicate.txt" or die $!;

## Compare the keys: write lines found in both files to duplicate.txt
## and remove them from the file2 hash.
foreach my $line ( keys %file1hash ) {
    if ( exists $file2hash{$line} ) {
        print $dfh $file2hash{$line};
        delete $file2hash{$line};
    }
}
close $dfh;

## Whatever remains in the file2 hash is unique to file2.
open my $ofh, '>', "file2_clean.txt" or die $!;
print $ofh values %file2hash;
close $ofh;

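I have tested both the Perl and Python scripts on 2 files of more than 1 million lines each, and the total time was under 6 seconds. For the business purpose this serves, the performance is excellent!

I modified the script Kriss provided, and I am very happy with both results: 1) the performance of the script and 2) how easily I could modify it to make it more flexible: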

#!/usr/bin/env python3

import os

filename1 = input("What is the first file name to compare? ")
filename2 = input("What is the second file name to compare? ")

# Read each file into a set of lines (trailing newlines are kept).
with open(filename1) as f1, open(filename2) as f2:
    file1set = set(f1)
    file2set = set(f2)

# Write the lines common to both files, then the lines found only in file2.
for name, results in [
        (os.path.join(os.getcwd(), "duplicate.txt"), file1set & file2set),
        (os.path.join(os.getcwd(), filename2 + "_clean.txt"), file2set - file1set)]:
    with open(name, 'w') as fh:
        fh.writelines(results)

Best answer

If you don't care about order, you can use sets in Python:

with open("file1") as f1, open("file2") as f2:
    file1 = set(f1)
    file2 = set(f2)
intersection = file1 & file2      # common lines
non_intersection = file2 - file1  # lines in file2 but not in file1
for item in intersection:
    print(item, end="")  # each line already ends with a newline
for item in non_intersection:
    print(item, end="")
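
Sets discard the original line order. If order matters, one option is to key a dict by line, which mirrors the Perl hashes in the question. The following is a minimal sketch rather than a definitive translation: the function name `dedupe` and its parameters are made up for illustration, and it relies on dicts preserving insertion order (Python 3.7+). As in the Perl script, lines are compared with their trailing newline, and repeated lines within the second file collapse to one.

```python
def dedupe(path1, path2, dup_path="duplicate.txt", clean_path="file2_clean.txt"):
    """Write file2 lines that also occur in file1 to dup_path, the rest to clean_path."""
    with open(path1) as f1:
        file1_keys = dict.fromkeys(f1)  # keys are whole lines, newline included
    with open(path2) as f2:
        file2_keys = dict.fromkeys(f2)  # repeated lines collapse to a single key

    with open(dup_path, "w") as dup, open(clean_path, "w") as clean:
        for line in file2_keys:         # iterates in first-seen order
            target = dup if line in file1_keys else clean
            target.write(line)
```

Calling `dedupe("file1.txt", "file2.txt")` would produce the same two output files as the Perl script, except that lines keep the order in which they first appear in the second file.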

Other approaches include using the difflib and filecmp libraries.
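
As a rough illustration of what those two standard-library modules offer, here is a small sketch. The helper names `files_identical` and `added_lines` are hypothetical. Note that `difflib` compares sequences position by position, so a line that merely moved may be reported as added; it is not equivalent to a set difference, and for large files it is likely to be much slower.

```python
import difflib
import filecmp


def files_identical(path1, path2):
    """Return True if the two files have the same contents."""
    # shallow=False forces a content comparison instead of trusting os.stat data.
    return filecmp.cmp(path1, path2, shallow=False)


def added_lines(lines1, lines2):
    """Return the lines that difflib.ndiff marks as present only in lines2."""
    # ndiff prefixes each output line with a two-character code; "+ " means
    # the line appears in the second sequence but not the first.
    return [d[2:] for d in difflib.ndiff(lines1, lines2) if d.startswith("+ ")]
```

`filecmp.cmp` only answers whether the files are the same, so it is mainly useful as a quick check before doing any line-level work.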

Another way, using only list comparison.

# lines in file2 that also appear in file1
with open("file1") as f1:
    data1 = [line.rstrip() for line in f1]
with open("file2") as f2:
    for line in f2:
        line = line.rstrip()
        if line in data1:
            print(line)

# lines in file2 that do not appear in file1: use "not in"
with open("file1") as f1:
    data1 = [line.rstrip() for line in f1]
with open("file2") as f2:
    for line in f2:
        line = line.rstrip()
        if line not in data1:
            print(line)
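
The list-based loops above scan the whole list for every line of the second file, which can become slow for files with around a million lines. A possible variation, sketched here under the assumption that the first file's lines fit in memory, keeps the same loop but looks lines up in a set and produces both results in a single pass. The function name `partition_lines` is made up for illustration.

```python
def partition_lines(file1_lines, file2_lines):
    """Split file2's lines into (common, unique) relative to file1, keeping file2 order."""
    # A set gives average constant-time membership tests, unlike a list.
    seen = {line.rstrip() for line in file1_lines}
    common, unique = [], []
    for line in file2_lines:
        line = line.rstrip()
        (common if line in seen else unique).append(line)
    return common, unique
```

Both arguments can be open file objects, for example `partition_lines(open("file1"), open("file2"))`. Unlike the set-only approach, repeated lines in the second file are kept.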

Regarding python - Converting a Perl script to Python: dedupe 2 files based on hash keys, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/1782033/
