python - How to completely remove duplicated lines with Linux tools?

Reposted. Author: 太空狗. Updated: 2023-10-30 01:47:59

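This question is not the same as How to print only the unique lines in BASH?, because that one suggests removing all copies of the duplicated lines, while this one is only about removing their duplicates, i.e. changing 1, 2, 3, 3 into 1, 2, 3 rather than just 1, 2.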

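This question was hard to put into words, because I could not find a phrasing that captures it well. The example, however, is straightforward. If I have a file like this: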

1
2
2
3
4

After parsing the file and removing the duplicated lines, it becomes:

1
3
4

I know some Python, and this is the script I wrote to do it. Create a file named clean_duplicates.py and run it as shown in the comment at the top:

import sys

#
# To run it use:
# python clean_duplicates.py < input.txt > clean.txt
#
def main():

    lines = sys.stdin.readlines()

    # print( lines )
    clean_duplicates( lines )

#
# It only removes adjacent duplicated lines, so you need to sort the input
# case-sensitively before running it.
#
def clean_duplicates( lines ):

    linesCount = len( lines )

    # An empty file has nothing to print
    if linesCount == 0:

        return

    # If it is a one-line file, print it and stop the algorithm
    if linesCount == 1:

        sys.stdout.write( lines[ 0 ] )
        return

    lastLine = lines[ 0 ]

    # Print the first line if it differs from the second one
    if lines[ 0 ] != lines[ 1 ]:

        sys.stdout.write( lines[ 0 ] )

    # Print the middle lines; range( 1, linesCount - 1 ) skips the first and last index
    for index in range( 1, linesCount - 1 ):

        currentLine = lines[ index ]
        nextLine = lines[ index + 1 ]

        # Same as the previous line: it belongs to a run of duplicates
        if currentLine == lastLine:

            continue

        lastLine = currentLine

        # Same as the next line: it starts a run of duplicates
        if currentLine == nextLine:

            continue

        sys.stdout.write( currentLine )

    # Print the last line if it differs from the one before it
    if lines[ linesCount - 2 ] != lines[ linesCount - 1 ]:

        sys.stdout.write( lines[ linesCount - 1 ] )

if __name__ == "__main__":

    main()
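
For comparison, the same adjacent-run logic can be sketched more compactly with itertools.groupby from the standard library. This is only an illustrative alternative, not the asker's script: the function name drop_repeated_runs is made up for this example, and like the script above it assumes the input is already sorted, since it only looks at neighbouring lines.

```python
from itertools import groupby


def drop_repeated_runs(lines):
    """Keep only lines that are not part of a run of adjacent duplicates.

    Hypothetical helper for illustration; it compares neighbours only,
    so unsorted input must be sorted by the caller first.
    """
    kept = []
    # groupby() yields one group per run of equal consecutive items.
    for _, run in groupby(lines):
        run = list(run)
        # A run of length 1 means the line has no adjacent duplicate.
        if len(run) == 1:
            kept.append(run[0])
    return kept


if __name__ == "__main__":
    sample = ["1\n", "2\n", "2\n", "3\n", "4\n"]
    print("".join(drop_repeated_runs(sample)), end="")
```

Because only neighbours are compared, an unsorted input such as 2, 1, 2 would pass through unchanged; calling sorted() on the lines first avoids that.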

However, when searching for ways to remove duplicated lines, it seems easier to use tools such as grep, sort, sed and uniq:

  1. How to remove duplicate lines inside a text file?
  2. removing line from list using sort, grep LINUX
  3. Find duplicate lines in a file and count how many time each line was duplicated?
  4. Remove duplicate entries in a Bash script
  5. How to delete duplicate lines in a file without sorting it in Unix?
  6. How to delete duplicate lines in a file...AWK, SED, UNIQ not working on my file

Best answer

You can use uniq with the -u/--unique option. According to the uniq man page:

-u / --unique

Don't output lines that are repeated in the input.
Print only lines that are unique in the INPUT.

For example:

cat /tmp/uniques.txt | uniq -u

Or, as explained in UUOC: Useless use of cat, a better way is to pass the file to uniq directly:

uniq -u /tmp/uniques.txt

Both commands print:

1
3
4

where /tmp/uniques.txt contains the numbers mentioned in the question, i.e.

1
2
2
3
4

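Note: uniq only compares adjacent lines, so the file contents must be sorted for every duplicate to be detected. As stated in the doc: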

By default, uniq prints the unique lines in a sorted file, it discards all but one of identical successive input lines. so that the OUTPUT contains unique lines.

If the file is not sorted, you need to sort the contents first and then run uniq on the sorted output:

sort /tmp/uniques.txt | uniq -u
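
If sorting is undesirable, the same result can be approximated in Python by counting every line first. The sketch below is an assumption-laden illustration rather than part of the original answer: lines_seen_once is a hypothetical name, and unlike sort | uniq -u it keeps the surviving lines in their original order instead of sorted order.

```python
from collections import Counter


def lines_seen_once(lines):
    """Return the lines that occur exactly once, in their original order.

    Hypothetical helper for illustration; it holds all lines in memory,
    so it suits small to medium files.
    """
    # Count occurrences of each line across the whole input.
    counts = Counter(lines)
    # Keep a line only if it never repeats anywhere in the input.
    return [line for line in lines if counts[line] == 1]


if __name__ == "__main__":
    unsorted_sample = ["2\n", "1\n", "2\n", "3\n", "4\n"]
    print("".join(lines_seen_once(unsorted_sample)), end="")
```

The trade-off is memory: every distinct line is held in a dictionary, whereas sort can spill to disk for very large files.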

Regarding "python - How to completely remove duplicated lines with Linux tools?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/40916782/
