
python - How do I remove entire sentences that contain numbers, special characters, website URLs, or emails?

Reposted. Author: 太空宇宙. Updated: 2023-11-04 09:34:13

How can I remove entire sentences that contain numbers, special characters, website URLs, or emails?

Sample input, option A:

['Hi my name is blank.', 'Do it 3 times.', 'Check out this website: https://blah.com', 'I like pie.', 'My email is asdf@jkl@gmail.com.']

Sample input, option B:

['Hi my name is blank. Do it 3 times. Check out this website: https://blah.com', 'I like pie. My email is asdf@jkl@gmail.com.']

Sample output:

['Hi my name is blank.', 'I like pie.']

Current code:

# Note: pandas >= 2.0 treats the str.replace pattern as a literal unless regex=True is passed.
def remove_emails(self, dataframe):
    self.log.info('Removing emails from text data')
    no_emails = dataframe.str.replace(r'\S*@\S*\s?', '', regex=True)
    return no_emails

def remove_website_links(self, dataframe):
    self.log.info('Removing website links from text data')
    no_website_links = dataframe.str.replace(r'http\S+', '', regex=True)
    return no_website_links

def remove_special_characters(self, dataframe):
    self.log.info('Removing special characters from text data')
    no_special_characters = dataframe.replace(r'[^A-Za-z0-9 ]+', '', regex=True)
    return no_special_characters

def remove_numbers(self, dataframe):
    self.log.info('Removing numbers from text data')
    no_numbers = dataframe.str.replace(r'\d+', '', regex=True)
    return no_numbers

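The problem is that the code above works for replacing unwanted strings with an empty string, but I don't know how to drop the whole list element when it matches any of the regexes given above. I also don't want to go through the list multiple times to extract these sentences. Overall, I am removing "bad" sentences from my corpus.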

Best answer

You can use this regex to check for the various cases and reject any string that matches it.

https?:|@\w+|\d
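
To see what each alternative in the pattern catches, here is a small sketch that reports the first offending substring per sentence. The names `BAD` and `first_hit` are illustrative and not part of the original answer.

```python
import re

# The answer's pattern: a match on any one alternative rejects the sentence.
#   https?:  -> a URL scheme ("http:" or "https:")
#   @\w+     -> an "@" followed by word characters, as in an email address
#   \d       -> any digit
BAD = re.compile(r'https?:|@\w+|\d')

def first_hit(sentence):
    """Return the first substring that disqualifies the sentence, or None."""
    match = BAD.search(sentence)
    return match.group(0) if match else None

samples = [
    'Check out this website: https://blah.com',
    'My email is asdf@jkl@gmail.com.',
    'Do it 3 times.',
    'I like pie.',
]

for s in samples:
    print(repr(first_hit(s)), '<-', s)
```

Note that the pattern only recognizes links that carry an explicit `http:` or `https:` scheme, so a bare `www.blah.com` would pass unless you add another alternative for it.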

Python code:

import re

arr = ['Hi my name is blank.', 'Do it 3 times.', 'Check out this website: https://blah.com', 'I like pie', 'My email is asdf@jkl@gmail.com']

for s in arr:
    # Print only the sentences with no URL scheme, no "@word", and no digit
    if not re.search(r'https?:|@\w+|\d', s):
        print(s)

This prints only the sentences you want:

Hi my name is blank.
I like pie
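
If you need the result as a list rather than printed output, and want to cover input option B as well, something along these lines may work. It is a minimal sketch, not a definitive implementation: `keep_clean_sentences`, `BAD_PATTERN`, and `SENTENCE_BREAK` are hypothetical names, and the sentence splitter assumes that a sentence ends with `.`, `!`, or `?` followed by whitespace.

```python
import re

# Same pattern as in the answer, compiled once and reused.
BAD_PATTERN = re.compile(r'https?:|@\w+|\d')

# Assumption: sentences end with '.', '!' or '?' followed by whitespace.
# The dots inside "blah.com" or "gmail.com" are not followed by whitespace,
# so they do not split a sentence.
SENTENCE_BREAK = re.compile(r'(?<=[.!?])\s+')

def keep_clean_sentences(texts, bad=BAD_PATTERN):
    """Split each text into sentences and keep those `bad` does not match."""
    kept = []
    for text in texts:  # single pass over the input list
        for sentence in SENTENCE_BREAK.split(text.strip()):
            if sentence and not bad.search(sentence):
                kept.append(sentence)
    return kept

option_a = ['Hi my name is blank.', 'Do it 3 times.',
            'Check out this website: https://blah.com',
            'I like pie.', 'My email is asdf@jkl@gmail.com.']
option_b = ['Hi my name is blank. Do it 3 times. Check out this website: https://blah.com',
            'I like pie. My email is asdf@jkl@gmail.com.']

print(keep_clean_sentences(option_a))  # ['Hi my name is blank.', 'I like pie.']
print(keep_clean_sentences(option_b))  # ['Hi my name is blank.', 'I like pie.']
```

Each text is visited once and each sentence is tested against one combined pattern, so there is no need for a separate pass per rule. The pattern does not reject "special characters" on its own. If those should also disqualify a sentence, you could pass a stricter `bad` pattern with an extra alternative such as `[^A-Za-z\s.,!?']`, but which characters count as special is something you would have to decide for your corpus.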

Regarding "python - How do I remove entire sentences that contain numbers, special characters, website URLs, or emails?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54437872/
