
Python: comparing similar or equal lines in text files

Reposted. Author: 太空宇宙. Updated: 2023-11-03 14:14:10

I have 2 text files. My goal is to find the lines in First.txt that are not in Second.txt and write those lines to a third text file, Missing.txt. This is what I have so far:

fn = "Missing.txt"
try:
    fileOutPut = open(fn, 'w')
except IOError:
    fileOutPut = open(fn, 'w')
fileOutPut.truncate()
filePrimary = open('First.txt', 'r', encoding='utf-8', errors='ignore')
fileSecondary = open('Second.txt', 'r', encoding='utf-8', errors='ignore')
bLines = set([thing.strip() for thing in fileSecondary.readlines()])
for line in filePrimary:
    line = line.strip()
    if line in bLines:
        continue
    else:
        fileOutPut.write(line)
        fileOutPut.write('\n')
fileOutPut.close()
filePrimary.close()
fileSecondary.close()

But after running the script I ran into a problem: some lines are nearly identical, for example:

[PR] Zero One Two Three ft Four

and (no space after the bracket)

[PR]Zero One Two Three ft Four

[PR] Zero One Two Three ft Four

and (uppercase letter F)

[PR] Zero One Two Three Ft Four

I found SequenceMatcher, which does what I need, but how do I work it into the comparison, given that I am not comparing two strings but a string against a set?

Best answer

IIUC, you want lines to match even when their whitespace or letter case differs.

A simple way to do this is to remove all whitespace and convert everything to the same case as you read it:

import re

def format_line(line):
    # Remove all whitespace and lowercase, so spacing and case are ignored.
    return re.sub(r"\s+", "", line).lower()

with open('Second.txt', 'r', encoding='utf-8', errors='ignore') as fileSecondary:
    bLines = {format_line(thing) for thing in fileSecondary}

with open('First.txt', 'r', encoding='utf-8', errors='ignore') as filePrimary, \
        open('Missing.txt', 'w', encoding='utf-8') as fileOutPut:
    for line in filePrimary:
        if format_line(line) not in bLines:
            # Write the original line; it already ends with a newline, so avoid doubling it.
            fileOutPut.write(line.rstrip('\n') + '\n')
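To check the normalization idea without touching any files, here is a minimal in-memory sketch that runs the same logic on the sample lines from the question. The helper name missing_lines and the two sample lists are made up for illustration; format_line is the function from the answer.

```python
import re

def format_line(line):
    # Remove every whitespace character, then lowercase.
    return re.sub(r"\s+", "", line).lower()

def missing_lines(primary, secondary):
    # Hypothetical helper: lines of `primary` that have no normalized match in `secondary`.
    seen = {format_line(s) for s in secondary}
    return [p for p in primary if format_line(p) not in seen]

# Sample data: the first pair differs only in spacing and the case of "ft".
first = ["[PR] Zero One Two Three ft Four", "[PR] Five Six"]
second = ["[PR]Zero One Two Three Ft Four"]

print(missing_lines(first, second))  # ['[PR] Five Six']
```

Because the lookup still goes through a set, each membership test stays roughly constant time, which is the speed advantage that is lost once you move to fuzzy matching.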

Update 1: fuzzy matching

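If you want fuzzy matching, you can use nltk.metrics.distance.edit_distance (docs), but you cannot avoid comparing each line against every other line (in the worst case). You lose the speed of the in operation on a set.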

For example:

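from nltk.metrics.distance import edit_distance as dist

# Lines fewer than this many edits apart are treated as a match.
threshold = 3

# Drop-in replacement for the loop in the earlier code (format_line and bLines come from there).
for line in filePrimary:
    fline = format_line(line)
    # A generator lets any() stop at the first line that is close enough.
    match_found = any(dist(fline, other_line) < threshold for other_line in bLines)

    if not match_found:
        # line already ends with a newline, so avoid writing a second one.
        fileOutPut.write(line.rstrip('\n') + '\n')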
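Since the question mentions SequenceMatcher, the same string-against-a-set comparison can also be sketched with the standard library's difflib, with no NLTK install. This is only an illustration, not part of the original answer: the helper names is_similar and missing_fuzzy, the sample lists, and the 0.9 cutoff are all assumptions you would tune for your own data.

```python
from difflib import SequenceMatcher

def is_similar(a, b, cutoff=0.9):
    # ratio() is 1.0 for identical strings and drops toward 0.0 as they diverge.
    return SequenceMatcher(None, a, b).ratio() >= cutoff

def missing_fuzzy(primary, secondary, cutoff=0.9):
    # Hypothetical helper: keep lines of `primary` with no close match in `secondary`.
    # Each line may be compared against every candidate, so the cost grows with
    # len(primary) * len(secondary).
    return [
        p for p in primary
        if not any(is_similar(p.lower(), s.lower(), cutoff) for s in secondary)
    ]

# Sample data (assumed): one near-duplicate pair and one unrelated line.
first = ["[PR] Zero One Two Three ft Four", "Something else entirely"]
second = ["[PR]Zero One Two Three Ft Four"]

print(missing_fuzzy(first, second))  # ['Something else entirely']
```

The set is simply iterated: any() compares the one string against each member in turn and stops at the first match. A ratio cutoff scales with line length, whereas the edit-distance threshold above is an absolute number of edits, so the two approaches can disagree on very short or very long lines.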

Regarding comparing similar or equal lines in text files in Python, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/48310446/
