
python - Keep text clean from URLs

Reposted. Author: 行者123. Updated: 2023-11-28 18:08:44

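As part of an information retrieval project in Python (building a mini search engine), I want to keep only the clean text from downloaded tweets (a .csv dataset of tweets, 27,000 tweets to be exact). A tweet looks like this: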

"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." —@POTUS https://twitter.com/OZRd5o4wRL

"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" —@POTUS in Greece https://twitter.com/PIO9dG2qjX

I want to use a regular expression to remove the unwanted parts of each tweet, such as URLs, punctuation, and so on.

So the result would be:

"The basic longing to live with dignity these yearnings are universal They burn in every human heart POTUS"

"Democracy allows us to peacefully work through our differences and move closer to our ideals POTUS in Greece"

I tried this: pattern = RegexpTokenizer(r'[A-Za-z]+|^[0-9]'), but it does not work perfectly. For example, the URL still shows up in the result.
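To see why the URL survives, the sketch below applies the same pattern with re.findall from the standard library. This assumes RegexpTokenizer in its default mode simply returns every match of the pattern, which is how NLTK is understood to behave; the name ATTEMPT_PATTERN is only for illustration. The pattern has no notion of a URL, so it splits the link into runs of letters and keeps each run as a token.

```python
import re

# The pattern from the attempt above. re.findall stands in for the tokenizer
# here (an assumption about how RegexpTokenizer applies the pattern).
ATTEMPT_PATTERN = r"[A-Za-z]+|^[0-9]"

tweet = (
    '"Democracy...allows us to peacefully work through our differences, '
    'and move closer to our ideals" —@POTUS in Greece https://twitter.com/PIO9dG2qjX'
)

tokens = re.findall(ATTEMPT_PATTERN, tweet)

# The last six tokens are all fragments of the URL.
print(tokens[-6:])
# ['https', 'twitter', 'com', 'PIO', 'dG', 'qjX']
```

The URL therefore has to be removed before (or separately from) the step that keeps only letters.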

Please help me find a regex pattern that does what I need.

Best answer

This might help.

Demo:

import re

s1 = """"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" —@POTUS in Greece https://twitter.com/PIO9dG2qjX"""
s2 = """"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." —@POTUS https://twitter.com/OZRd5o4wRL"""

def cleanString(text):
    res = []
    for i in text.strip().split():
        if not re.search(r"(https?)", i):  # Removes URL. Note: works only if http or https is in the string.
            # Strip everything that is not a letter (upper or lower) or a dot, then turn dots into spaces
            res.append(re.sub(r"[^A-Za-z\.]", "", i).replace(".", " "))
    return " ".join(map(str.strip, res))

print(cleanString(s1))
print(cleanString(s2))
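A variant of the same idea, offered as a sketch rather than as the answer's own method: remove URLs with one pattern, then replace every run of non-letters with a single space. The names URL_RE, NON_LETTER_RE and clean_tweet are made up for this example, and the URL pattern only covers links that start with http://, https:// or www., which is an assumption about the dataset.

```python
import re

# Hypothetical helper patterns, not part of the original answer.
URL_RE = re.compile(r"https?://\S+|www\.\S+")   # a whole URL up to the next whitespace
NON_LETTER_RE = re.compile(r"[^A-Za-z]+")        # any run of characters that are not ASCII letters


def clean_tweet(text):
    """Drop URLs, then collapse everything that is not a letter into single spaces."""
    without_urls = URL_RE.sub(" ", text)
    letters_only = NON_LETTER_RE.sub(" ", without_urls)
    return letters_only.strip()


tweet = (
    '"The basic longing to live with dignity...these yearnings are universal. '
    'They burn in every human heart 1234." —@POTUS https://twitter.com/OZRd5o4wRL'
)
print(clean_tweet(tweet))
# The basic longing to live with dignity these yearnings are universal They burn in every human heart POTUS
```

Because each run of non-letters becomes exactly one space, "dignity...these" comes out as "dignity these" with no extra whitespace, whereas replacing each dot separately leaves several spaces in a row. One trade-off to be aware of: contractions such as "don't" are split into "don t", and non-ASCII letters are dropped.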

Regarding python - keep text clean from URLs, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51985530/
