I'm trying to use difflib to generate a diff for two text files containing tweets. Here is the code:
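#!/usr/bin/env python
# difflib_test
import difflib

# Read both tweet files as lists of lines (line endings are kept)
with open('/home/saad/Code/test/new_tweets', 'r') as file1:
    new_lines = file1.readlines()
with open('/home/saad/PTITVProgs', 'r') as file2:
    old_lines = file2.readlines()

diff = difflib.context_diff(new_lines, old_lines)
delta = ''.join(diff)
print(delta)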
Here is the PTITVProgs text file:
Watch PTI on April 6th (7) Dr Israr Shah at 10PM on Business Plus in "Talking Policy". Rgds #PTI
CORRECTION!! Watch PTI on April 6th (5) @Asad_Umar at 8PM on ARY News. Rgds #PTI
Watch PTI on April 6th (5) @Asad_Umar at 8PM on AAJ News (6) PTI vs PMLN at 8PM on NewsOne. Rgds #PTI
Watch PTI on April 6th (5) Asad Umar at 8PM on AAJ News (6) PTI vs PMLN at 8PM on NewsOne. Rgds #PTI
Watch PTI on April 6th (5) Waleed Iqbal at 8PM on Channel 5. Rgds #PTI
Watch PTI on April 6th (3) Dr Israr Shah at 10PM on PTV News. Rgds #PTI
Watch PTI on April 6th (4) Javed hashmi at 1PM on PTV News. Rgds #PTI
Watch PTI on April 6th (3) Imran Alvi at 1PM on AAJ News. Rgds #PTI
Watch PTI on April 6th (1) Dr @ArifAlvi, Andleeb Abbas and Ehtisham Ameer at 11PM on ARY News (2) Hamid Khan at 10PM on ATV. Rgds #PTI
Watch PTI on April 5th (1) Farooq Amjad Meer at 10:45PM on Dunya News. Rgds #PTI
Watch PTI on April 4th (4) Faisal Khan at 8PM on PTV News. Rgds #PTI
@FaisalJavedKhan
Watch PTI on April 4th (3) Faisal Khan at 11PM on ATV. Rgds #PTI
@FaisalJavedKhan
Watch PTI on April 4th (1) Dr Israr Shah at 8PM on Waqt News (2) Dr Arif Alvi at 9PM on PTV World. Rgds #PTI
@ArifAlvi
Watch PTI on April 3rd (12) Abrar ul Haq on 10PM on Dawn News (13) Shabbir Sial at 10PM on Channel5. Rgds #PTI
Watch PTI on April 3rd (11) Sadaqat Abbasi on 8PM on RohiTV. Rgds #PTI
Watch PTI on April 3rd (10) Dr Zarqa and Andleeb Abbas on 8PM on Waqt News. Rgds #PTI
Watch PTI on April 3rd (9) Fauzia Kasuri at 8PM on Din News. Rgds #PTI
Watch PTI on April 3rd (8) Mehmood Rasheed at 8PM on ARY News. Rgds #PTI
Watch PTI on April 3rd (7) Israr Abbasi (Repeat on Arp 4th) at 1:20AM and 1PM on Vibe TV. Rgds #PTI
Watch PTI on April 3rd (5) Rao Fahad at 9PM on Express News (6) Dr Seems Zia at 10:30PM on Health TV. Rgds #PTI
Here is the new_tweets text file:
Watch PTI on April 7th (3) Malaika Reza at 8PM on AAJ News (4) Shah Mehmood Qureshi at 8PM on Geo News. Rgds #PTI
Watch PTI on April 7th (2) Chairman IMRAN KHAN at 10PM on PTV News in News Night with Sadia Afzal, Rpt: 2AM, 2PM. Rgds #PTI
@ImranKhanPTI
Watch PTI on April 7th (1) Dr Waseem Shahzad NOW at 6PM on PTV News. Rgds #PTI
Watch PTI on April 6th (7) Dr Israr Shah at 10PM on Business Plus in "Talking Policy". Rgds #PTI
CORRECTION!! Watch PTI on April 6th (5) @Asad_Umar at 8PM on ARY News. Rgds #PTI
Watch PTI on April 6th (5) @Asad_Umar at 8PM on AAJ News (6) PTI vs PMLN at 8PM on NewsOne. Rgds #PTI
Watch PTI on April 6th (5) Asad Umar at 8PM on AAJ News (6) PTI vs PMLN at 8PM on NewsOne. Rgds #PTI
Watch PTI on April 6th (5) Waleed Iqbal at 8PM on Channel 5. Rgds #PTI
Watch PTI on April 6th (3) Dr Israr Shah at 10PM on PTV News. Rgds #PTI
Watch PTI on April 6th (4) Javed hashmi at 1PM on PTV News. Rgds #PTI
Watch PTI on April 6th (3) Imran Alvi at 1PM on AAJ News. Rgds #PTI
Watch PTI on April 6th (1) Dr @ArifAlvi, Andleeb Abbas and Ehtisham Ameer at 11PM on ARY News (2) Hamid Khan at 10PM on ATV. Rgds #PTI
Watch PTI on April 5th (1) Farooq Amjad Meer at 10:45PM on Dunya News. Rgds #PTI
Watch PTI on April 4th (4) Faisal Khan at 8PM on PTV News. Rgds #PTI
@FaisalJavedKhan
Watch PTI on April 4th (3) Faisal Khan at 11PM on ATV. Rgds #PTI
@FaisalJavedKhan
Watch PTI on April 4th (1) Dr Israr Shah at 8PM on Waqt News (2) Dr Arif Alvi at 9PM on PTV World. Rgds #PTI
@ArifAlvi
Watch PTI on April 3rd (12) Abrar ul Haq on 10PM on Dawn News (13) Shabbir Sial at 10PM on Channel5. Rgds #PTI
Watch PTI on April 3rd (11) Sadaqat Abbasi on 8PM on RohiTV. Rgds #PTI
Watch PTI on April 3rd (10) Dr Zarqa and Andleeb Abbas on 8PM on Waqt News. Rgds #PTI
Watch PTI on April 3rd (9) Fauzia Kasuri at 8PM on Din News. Rgds #PTI
Here is the diff I get from the program:
***
---
***************
*** 1,7 ****
- Watch PTI on April 7th (3) Malaika Reza at 8PM on AAJ News (4) Shah Mehmood Qureshi at 8PM on Geo News. Rgds #PTI
- Watch PTI on April 7th (2) Chairman IMRAN KHAN at 10PM on PTV News in News Night with Sadia Afzal, Rpt: 2AM, 2PM. Rgds #PTI
- @ImranKhanPTI
- Watch PTI on April 7th (1) Dr Waseem Shahzad NOW at 6PM on PTV News. Rgds #PTI
Watch PTI on April 6th (7) Dr Israr Shah at 10PM on Business Plus in "Talking Policy". Rgds #PTI
CORRECTION!! Watch PTI on April 6th (5) @Asad_Umar at 8PM on ARY News. Rgds #PTI
Watch PTI on April 6th (5) @Asad_Umar at 8PM on AAJ News (6) PTI vs PMLN at 8PM on NewsOne. Rgds #PTI
--- 1,3 ----
***************
*** 21,24 ****
Watch PTI on April 3rd (12) Abrar ul Haq on 10PM on Dawn News (13) Shabbir Sial at 10PM on Channel5. Rgds #PTI
Watch PTI on April 3rd (11) Sadaqat Abbasi on 8PM on RohiTV. Rgds #PTI
Watch PTI on April 3rd (10) Dr Zarqa and Andleeb Abbas on 8PM on Waqt News. Rgds #PTI
! Watch PTI on April 3rd (9) Fauzia Kasuri at 8PM on Din News. Rgds #PTI--- 17,23 ----
Watch PTI on April 3rd (12) Abrar ul Haq on 10PM on Dawn News (13) Shabbir Sial at 10PM on Channel5. Rgds #PTI
Watch PTI on April 3rd (11) Sadaqat Abbasi on 8PM on RohiTV. Rgds #PTI
Watch PTI on April 3rd (10) Dr Zarqa and Andleeb Abbas on 8PM on Waqt News. Rgds #PTI
! Watch PTI on April 3rd (9) Fauzia Kasuri at 8PM on Din News. Rgds #PTI
! Watch PTI on April 3rd (8) Mehmood Rasheed at 8PM on ARY News. Rgds #PTI
! Watch PTI on April 3rd (7) Israr Abbasi (Repeat on Arp 4th) at 1:20AM and 1PM on Vibe TV. Rgds #PTI
! Watch PTI on April 3rd (5) Rao Fahad at 9PM on Express News (6) Dr Seems Zia at 10:30PM on Health TV. Rgds #PTI
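A quick comparison of the two source files (PTITVProgs and new_tweets) shows that they differ by 3 tweets from April 7th and 3 tweets from April 3rd.
I only want the lines in new_tweets that are not in PTITVProgs to appear in the diff.
Instead, it spits out a lot of text I don't want to see. I don't know what *** 1,7*** and *** 1,3*** in the diff output mean. What is the correct way to get only the changed lines?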
Best answer
Just parse the output of the diff like this (change '-' to '+' if needed):
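#!/usr/bin/env python
# difflib_test
import difflib

with open('/home/saad/Code/test/new_tweets', 'r') as file1:
    new_lines = file1.readlines()
with open('/home/saad/PTITVProgs', 'r') as file2:
    old_lines = file2.readlines()

# ndiff marks lines found only in the first sequence with '- ';
# keep those lines and strip the two-character prefix
diff = difflib.ndiff(new_lines, old_lines)
delta = ''.join(x[2:] for x in diff if x.startswith('- '))
print(delta)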
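To see the filtering idea in isolation, here is a minimal sketch that runs on in-memory lists rather than the files above. The helper name `lines_only_in_first` and the sample tweet strings are made up for illustration; only `difflib.ndiff` and its two-character line prefixes come from the standard library.

```python
import difflib


def lines_only_in_first(first, second):
    """Return the lines that ndiff reports as present only in `first`.

    Hypothetical helper for illustration. ndiff prefixes every output line
    with a two-character code:
      '- ' line only in the first sequence
      '+ ' line only in the second sequence
      '  ' line common to both
      '? ' hint line that appears in neither input
    """
    return [line[2:] for line in difflib.ndiff(first, second)
            if line.startswith('- ')]


# Made-up sample data standing in for the two tweet files
new_tweets = ["tweet C\n", "tweet B\n", "tweet A\n"]
old_tweets = ["tweet B\n", "tweet A\n", "tweet Z\n"]

only_new = lines_only_in_first(new_tweets, old_tweets)
print(''.join(only_new))
```

Swapping the argument order, or filtering on `'+ '` instead, should give the lines unique to the other input. As for the markers in the context diff from the question, `*** 1,7 ****` and `--- 1,3 ----` are hunk headers: they give the line range that the hunk covers in the first and second input respectively.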
For this question about comparing files with Python difflib, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/15864641/