python - text.replace(punctuation, '') does not remove all punctuation contained in list(punctuation)?

Reposted. Author: 太空宇宙. Updated: 2023-11-04 01:01:29

import urllib2,sys
from bs4 import BeautifulSoup,NavigableString
from string import punctuation as p

# URL for Obama's presidential acceptance speech in 2008
obama_4427_url = 'http://www.millercenter.org/president/obama/speeches/speech-4427'

# read in URL
obama_4427_html = urllib2.urlopen(obama_4427_url).read()

# BS magic
obama_4427_soup = BeautifulSoup(obama_4427_html)

# find the speech itself within the HTML
obama_4427_div = obama_4427_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})

# obama_4427_div.text.lower() removes extraneous characters (e.g. '<br/>')
# and places all letters in lowercase
obama_4427_str = obama_4427_div.text.lower()

# for further text analysis, remove punctuation
for punct in list(p):
    obama_4427_str_processed = obama_4427_str.replace(p,'')
obama_4427_str_processed_2 = obama_4427_str_processed.replace(p,'')
print(obama_4427_str_processed_2)

# store individual words
words = obama_4427_str_processed.split(' ')
print(words)

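Long story short, I have a speech by President Obama and I want to remove all punctuation so that I am left with just the words. I imported punctuation from the string module and ran a for loop, but it did not remove all of my punctuation. What am I doing wrong here?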

Best answer

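str.replace() searches for the complete value of its first argument. It is not a pattern, so the text is only replaced with the empty string where the entire `string.punctuation` value occurs in it. In the loop above, every iteration passes p, the whole punctuation string, rather than the loop variable punct.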
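
To make this concrete, here is a minimal sketch (Python 3; the sample string `text` is made up for illustration and is not from the speech). It shows that passing the whole `string.punctuation` value to `str.replace()` leaves the text untouched, while replacing one character per iteration and reassigning the result does strip the punctuation.

```python
from string import punctuation

# Hypothetical sample text, not taken from the speech
text = "hello, world! it's a test."

# str.replace() looks for the entire punctuation string as one
# literal substring, which does not occur in ordinary text
unchanged = text.replace(punctuation, '')

# Replacing one character per iteration, and reassigning the result
# each time, removes every punctuation character
stripped = text
for punct in punctuation:
    stripped = stripped.replace(punct, '')

print(unchanged)
print(stripped)
```

The loop works, but it scans the string once per punctuation character, which is one reason a single regular-expression pass may be preferable for longer texts.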

Use a regular expression instead:

import re
from string import punctuation as p

punctuation = re.compile('[{}]+'.format(re.escape(p)))

obama_4427_str_processed = punctuation.sub('', obama_4427_str)
words = obama_4427_str_processed.split()

Note that you can simply call str.split() without arguments to split on whitespace of arbitrary width, including newlines.
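
As a small illustration of both points, the sketch below applies the same pattern to a made-up sample string (the names `punct_re` and `sample` are assumptions for this example only) and compares `split(' ')` with `split()`.

```python
import re
from string import punctuation

# Same pattern as in the answer: one or more punctuation characters
punct_re = re.compile('[{}]+'.format(re.escape(punctuation)))

# Hypothetical sample with punctuation, a double space and a newline
sample = "yes,  we\ncan!"

cleaned = punct_re.sub('', sample)

# split(' ') splits on every single space, so the double space yields
# an empty string and the newline stays inside a token
by_space = cleaned.split(' ')

# split() with no argument splits on any run of whitespace
by_whitespace = cleaned.split()

print(by_space)
print(by_whitespace)
```

With `split(' ')`, as used in the question, empty strings and embedded newlines end up in the word list, so `split()` is usually the better fit for word counting.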

Regarding python - text.replace(punctuation, '') does not remove all punctuation contained in list(punctuation)?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/32636653/
