
regex - Regular expression | Remove words spanning multiple lines before a given word

Reposted. Author: 行者123. Updated: 2023-12-01 08:59:26

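I scraped several articles from a website, and now I am trying to make the corpus more readable by removing the first part of the scraped text. The span to remove runs from the <p>Advertisement tag to the last </time> tag before the article starts. As you can see, the regular expression has to remove several words across multiple lines. I tried the DOTALL flag, but without success.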

This is my first attempt:

import re

text='''
<p>Advertisement</p>, <p class="byline-dateline"><span class="byline"itemprop="author creator" itemscope="" itemtype="http://schema.org/Person">By <span class="byline-author"
data-byline-name="MILAN SCHREUER" itemprop="name">MILAN SCHREUER</span> and </span><span class="byline"
itemid="http://topics.nytimes.com/top/reference/timestopics/people/r/alissa_johannsen_rubin/index.html"
itemprop="author creator" itemscope="" itemtype="http://schema.org/Person"><a href="http://topics.nytimes.com/top/reference/timestopics/people/r/alissa_johannsen_rubin/index.html"
title="More Articles by ALISSA J. RUBIN"><span class="byline-author" data-byline-name="ALISSA J. RUBIN" data-twitter-handle="Alissanyt" itemprop="name">ALISSA J. RUBIN</span></a></span><time class="dateline" content="2016-10-06T01:02:19-04:00"
datetime="2016-10-06T01:02:19-04:00" itemprop="dateModified">OCT. 5, 2016</time>
</p>, <p class="story-body-text story-content" data-para-count="163" data-total-count="163">BRUSSELS — A man wounded two police officers with a knife in Brussels around noon on Wednesday in what the authorities called “a potential terrorist attack.”</p>, <p class="story-body-text story-content"
data-para-count="231" data-total-count="394">The two officers were attacked on the Boulevard Lambermont in the Schaerbeek district, just north of the city center. A third police officer, who came to their aid, was also injured. None of the three had life-threatening injuries.</p>
'''
my_pattern=("(.*)</time>")
results= re.sub(my_pattern," ", text)
print(results)

Best answer

Try this:

my_pattern = r"[\s\S]+</time>"

If you also want to remove the </p> tag that follows, along with the comma and the space, you can use this:

my_pattern = r"[\s\S]+</time>[\s\S]</p>,\s"
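
To see why the first attempt fails and how the patterns above behave, here is a minimal sketch. It assumes a shortened stand-in for the scraped text (the `sample` string is hypothetical, not the full article), and it compares the `[\s\S]` patterns with an equivalent written using the `re.DOTALL` flag.

```python
import re

# Shortened, hypothetical stand-in for the scraped text in the question.
sample = '''
<p>Advertisement</p>, <p class="byline-dateline"><span class="byline">By
<span class="byline-author">MILAN SCHREUER</span></span><time class="dateline"
datetime="2016-10-06T01:02:19-04:00">OCT. 5, 2016</time>
</p>, <p class="story-body-text story-content">A man wounded two police officers with a knife in Brussels.</p>
'''

# The question's attempt: by default "." does not match "\n", so only the
# part of the line that contains </time> is replaced.
first_attempt = re.sub(r"(.*)</time>", " ", sample)

# The answer's second pattern as a raw string: [\s\S] matches any character,
# including newlines, so the match can span several lines.
answer_cleaned = re.sub(r"[\s\S]+</time>[\s\S]</p>,\s", "", sample)

# Equivalent idea with re.DOTALL, which makes "." match newlines as well.
# \s* is slightly more tolerant of whitespace between </time> and </p>.
header = re.compile(r".+</time>\s*</p>,\s", re.DOTALL)
cleaned = header.sub("", sample, count=1)

print(cleaned)
```

Both patterns are greedy, so they remove everything up to the last </time> in the string. If several articles were concatenated into one string, that would probably delete the earlier articles too, so it seems safer to apply the substitution to each article separately.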

For more on regex - Regular expression | Remove words spanning multiple lines before a given word, see the similar question on Stack Overflow: https://stackoverflow.com/questions/40451622/
