After sending a GET request to Project Gutenberg, I have the full text of the play Macbeth as a string:
import requests
response = requests.get('https://www.gutenberg.org/cache/epub/2264/pg2264.txt')
full_text = response.text
macbeth = full_text[16648:]
I split it into words:
words_raw = macbeth.split()
word_count = len(words_raw)
print("Macbeth contains {} words".format(word_count))
print("Here are some examples:", words_raw[400:460])
Then I strip all punctuation and convert each word to lowercase with lower():
import string
punctuation = string.punctuation
words_cleaned = []
for word in words_raw:
    # remove punctuation
    word = word.strip(punctuation)
    # make lowercase
    word = word.lower()
    words_cleaned.append(word)
print("Cleaned word examples:", words_cleaned[400:460])
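However, I cannot strip all punctuation, because I need the period after a name or abbreviated name as the indicator that a character is about to speak.
The character who is speaking is indicated by a (usually abbreviated) version of their name followed by a . (period) as the first thing on a line. So, for example, when Macbeth speaks, the line starts with "Macb." You will need to modify how punctuation is handled, since you cannot strip all punctuation.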
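One way to meet that requirement is to strip punctuation per token while putting a single trailing period back. The following is only a sketch: the helper name clean_word and the constant STRIP_CHARS are hypothetical, and the sample tokens are taken from the examples in this question. Note that it also keeps periods on ordinary sentence-final words, so a later step still has to decide which period-terminated tokens are speaker labels.

```python
import string

# All ASCII punctuation except the period, which marks a speaker label.
STRIP_CHARS = string.punctuation.replace(".", "")


def clean_word(word):
    """Lowercase a token and strip its punctuation, keeping one trailing period."""
    # Does the token end in a period once other trailing punctuation is removed?
    ends_with_period = word.rstrip(STRIP_CHARS).endswith(".")
    core = word.strip(string.punctuation).lower()
    if ends_with_period and core:
        return core + "."
    return core


words_raw = ["Mal.", "The", "worthy", "Thane", "of", "Rosse", "here?", "thee,"]
words_cleaned = [clean_word(w) for w in words_raw]
print(words_cleaned)
```

In this sketch "Mal." becomes "mal." while "here?" and "thee," lose their punctuation entirely.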
Output of the words_raw split:
Macbeth contains 17737 words
Here are some examples: ['Gashes', 'cry', 'for', 'helpe', 'King.', 'So', 'well', 'thy', 'words', 'become', 'thee,', 'as', 'thy', 'wounds,', 'They', 'smack', 'of', 'Honor', 'both:', 'Goe', 'get', 'him', 'Surgeons.', 'Enter', 'Rosse', 'and', 'Angus.', 'Who', 'comes', 'here?', 'Mal.', 'The', 'worthy', 'Thane', 'of', 'Rosse', 'Lenox.', 'What', 'a', 'haste', 'lookes', 'through', 'his', 'eyes?', 'So', 'should', 'he', 'looke,', 'that', 'seemes', 'to', 'speake', 'things', 'strange', 'Rosse.', 'God', 'saue', 'the', 'King', 'King.']
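We know that "Malcolm" is speaking when his name is followed by a period (the "Mal." in the output above), and the same holds for "Lenox" when he starts speaking ("Lenox."). Sometimes a character's name is abbreviated; other characters use the full name immediately followed by a period.
["duncan", "malcolm", "donalbaine", "macbeth", "banquo", "macduff", "lenox", "rosse", "menteth", "angus", "cathnes", "fleance", "seyward", "seyton", "boy", "lady", "messenger", "wife"]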
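A name list like this could be used to test whether a period-terminated token is an abbreviation of a known character. The sketch below assumes a simple prefix match. The function name match_speaker and the shortened CHARACTERS list are hypothetical. Labels such as "King." (for Duncan) are not prefixes of any listed name and would need a separate alias table, and short labels can be ambiguous.

```python
CHARACTERS = ["duncan", "malcolm", "macbeth", "macduff", "lenox", "rosse"]


def match_speaker(token, characters=CHARACTERS):
    """Return every character name that starts with the token's stem.

    Only tokens that end in a period and are otherwise alphabetic qualify.
    """
    if not token.endswith("."):
        return []
    stem = token[:-1].lower()
    if not stem.isalpha():
        return []
    return [name for name in characters if name.startswith(stem)]


print(match_speaker("Mal."))   # one candidate
print(match_speaker("Mac."))   # ambiguous: two candidates
print(match_speaker("King."))  # no prefix match in this list
```

Because an ordinary word at the end of a sentence can also be a prefix of a name, it is probably safer to apply this check only to the first token of a line.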
My attempt at isolating the non-alphanumeric characters:
print(len(words_raw))
def extra(string):
    return list(c for c in string if not c.isalnum() and not c.isspace())
weird = extra(macbeth)
weird
discard = []
for char in weird:
    if char != '.':
        discard.append(char)
print(len(weird))
print(len(discard))
print(discard)
revised_macbeth = []
for character in words_raw:
    if not character in discard:
        revised_macbeth.append(character)
print(len(revised_macbeth))
# for character in words_raw:
#     if not character.isalnum():
#         print("found: \'{}\'".format(character))
Its output:
17737
4788
3553
['?', ',', ',', '?', '-', "'", ',', "'", ',', '?', ',', '-', ':', ',', ',', ',', ',', ',', ',', ',', '?', ',', ',', ',', "'", ':', ';', ',', ',', ',', ',', ',', ':', '(', ',', ')', "'", ',', ',', "'", ':', "'", ':', '(', ')', ',', ',', "'", '(', ')', "'", ',', "'", ':', "'", ',', ',', "'", "'", ',', "'", ',', "'", ',', ',', ':', ',', "'", ',', ':', ',', ',', ',', "'", ',', "'", ',', ',', ',', ',', ',', "'", ',', '?', ',', ',', ';', ',', ':', ',', '-', "'", ',', ':', ',', ',', ':', ',', ',', ',', ':', '?', '?', ',', "'", ',', '?', ',', ',', ',', ',', ',', ',', ',', ',', "'", ',', ',', '-', ',', ',', "'", ',', ':', ',', ',', ',', ':', ',', ',', ',', ',', ':', ',', ',', ',', '?', ',', '?', ',', ',', '&', ',', ':', ',', ',', ',', '-', "'", ',', "'", "'", ':', ',', ',', ',', ',', "'", ',', ',', ',', "'", "'", '-', ':', '-', ':', ':', "'", ',', ',', ',', ',', ':', ',', '-', ',', ',', ',', ',', ':', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', "'", "'", "'", '?', ',', "'", ',', ',', "'", "'", "'", ',', "'", '?', ',', '?', ',', ':', ',', ':', '?', ',', ',', ',', ',', ',', '?', "'", "'", ',', '?', ',', ',', ',', ':', ',', ',', ',', ',',
print(macbeth)
The Tragedie of Macbeth
Actus Primus. Scoena Prima.
Thunder and Lightning. Enter three Witches.
1. When shall we three meet againe?
In Thunder, Lightning, or in Raine?
2. When the Hurley-burley's done,
When the Battaile's lost, and wonne
3. That will be ere the set of Sunne
1. Where the place?
2. Vpon the Heath
3. There to meet with Macbeth
1. I come, Gray-Malkin
print(revised_macbeth)
['The', 'Tragedie', 'of', 'Macbeth', 'Actus', 'Primus.', 'Scoena', 'Prima.', 'Thunder', 'and', 'Lightning.', 'Enter', 'three', 'Witches.', '1.', 'When', 'shall', 'we', 'three', 'meet', 'againe?', 'In', 'Thunder,', 'Lightning,', 'or', 'in', 'Raine?', '2.', 'When', 'the', "Hurley-burley's", 'done,', 'When', 'the', "Battaile's", 'lost,', 'and', 'wonne', '3.', 'That', 'will', 'be', 'ere', 'the', 'set', 'of', 'Sunne', '1.', 'Where', 'the', 'place?', '2.', 'Vpon', 'the', 'Heath', '3.', 'There', 'to', 'meet', 'with', 'Macbeth', '1.', 'I', 'come,', 'Gray-Malkin', 'All.', 'Padock', 'calls', 'anon:', 'faire', 'is', 'foule,', 'and', 'foule', 'is', 'faire,', 'Houer', 'through', 'the', 'fogge', 'and', 'filthie', 'ayre.', 'Exeunt.', 'Scena', 'Secunda.', 'Alarum', 'within.', 'Enter', 'King,', 'Malcome,', 'Donalbaine,', 'Lenox,', 'with', 'attendants,', 'meeting', 'a', 'bleeding', 'Captaine.', 'King.', 'What', 'bloody', 'man', 'is', 'that?', 'he', 'can', 'report,', 'As', 'seemeth', 'by', 'his', 'plight,', 'of', 'the', 'Reuolt', 'The', 'newest', 'state', 'Mal.', 'This', 'is', 'the', 'Serieant,', 'Who', 'like', 'a', 'good', 'and', 'hardie', 'Souldier', 'fought', "'Gainst", 'my', 'Captiuitie:', 'Haile', 'braue', 'friend;', 'Say', 'to', 'the', 'King,', 'the', 'knowledge', 'of', 'the', 'Broyle,', 'As', 'thou', 'didst', 'leaue', 'it', 'Cap.', 'Doubtfull', 'it', 'stood,', 'As', 'two', 'spent', 'Swimmers,', 'that', 'doe', 'cling', 'together,', 'And', 'choake', 'their', 'Art:', 'The', 'me
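The revised_macbeth list printed above still contains tokens such as 'againe?' and 'Thunder,'. A likely reason is that the loop compares whole tokens from words_raw against single punctuation characters, and a multi-character token is never equal to one of them. A minimal sketch of a character-level alternative, assuming str.translate is applied to the text before splitting (DROP_TABLE is a hypothetical name, and the sample lines come from the opening of the play shown above):

```python
import string

# Translation table that deletes all ASCII punctuation except the period.
DROP_TABLE = str.maketrans("", "", string.punctuation.replace(".", ""))

sample = "1. When shall we three meet againe?\nIn Thunder, Lightning, or in Raine?"
tokens = sample.split()

# Token-level membership test: a multi-character token is never equal to a
# single punctuation character, so nothing is filtered out here.
discard = ["?", ","]
unchanged = [t for t in tokens if t not in discard]

# Character-level removal applied to the text before splitting.
revised = sample.translate(DROP_TABLE).split()

print(len(tokens), len(unchanged))
print(revised)
```

One trade-off of this approach is that it also removes apostrophes and hyphens inside words, so "Hurley-burley's" would become "Hurleyburleys".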
Best answer
You can use collections.defaultdict to group the lines by speaker name. enumerate can be used to get the line number of each piece of text spoken by a character:
import requests, re
from collections import defaultdict
r = requests.get('https://www.gutenberg.org/cache/epub/2264/pg2264.txt').text
d, l, keywords = defaultdict(list), None, ['Enter', 'Exit', 'Flourish', 'Thunder']
# iterate over the play lines, ignoring empty strings (generated from the split)
for i, a in filter(lambda x: x[-1], enumerate(re.split(r'[\n\r]+', r[r.index('Actus Primus. Scoena Prima.')+27:]))):
    # check that the line contains character dialog, not stage prompts
    if not re.findall('|'.join(keywords), a):
        # grab the name of the character and append to "d"
        if (n := re.findall(r'^\s+[A-Z](?:\.[A-Z])*[a-z]+\.(?=\s\w+)|^[A-Z](?:\.[A-Z])*[a-z\.]+\.(?=\s\w+)', a)):
            d[(l := re.sub(r'^\s+|\.$', '', n[0]).lower())].append((i, a[len(n[0])+1:].lower()))
        elif l:
            # the line might be a continuation of a larger block of character text
            d[l].append((i, a.lower()))
print(list(d.keys()))  # detected characters
print(d['macb'][:10])  # first ten occurrences of Macbeth speaking
Output:
['all', 'king', 'mal', 'cap', 'lenox', 'rosse', 'macb', 'banquo', 'mac', 'banq', 'ang', 'lady', 'mess', 'la', 'fleance', 'porter', 'macd', 'port', 'exeunt', 'ban', 'donal', 'malc', 'don', 'ross', 'seruant', 'murth', 'lords', 'mur', 'len', 'hec', 'lord', 'appar', 'musicke', 'wife', 'son', 'mes', 'doct', 'ro', 'gent', 'lad', 'ment', 'cath', 'ser', 'sey', 'seyw', 'sold', 'syw', 'y.sey']
[(137, 'so foule and faire a day i haue not seene'), (170, 'stay you imperfect speakers, tell me more:'), (171, 'by sinells death, i know i am thane of glamis,'), (172, 'but how, of cawdor? the thane of cawdor liues'), (173, 'a prosperous gentleman: and to be king,'), (174, 'stands not within the prospect of beleefe,'), (175, 'no more then to be cawdor. say from whence'), (176, 'you owe this strange intelligence, or why'), (177, 'vpon this blasted heath you stop our way'), (178, 'with such prophetique greeting?')]
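The grouping idea can also be tried offline on a few lines before running it on the whole play. The sketch below is a simplified variant, not the code above: the names SAMPLE, SPEAKER and group_by_speaker are hypothetical, the line breaks in SAMPLE are an assumption, and the pattern only recognizes a capitalized word followed by a period at the start of a line. It does not filter stage directions, so a line such as "Exeunt. ..." would be treated as a speaker.

```python
import re
from collections import defaultdict

# A few lines from the play; the line breaks here are assumed.
SAMPLE = """King. What bloody man is that? he can report,
As seemeth by his plight, of the Reuolt
The newest state
Mal. This is the Serieant,
Who like a good and hardie Souldier fought"""

# A capitalized word followed by a period, then the rest of the line.
SPEAKER = re.compile(r"^\s*([A-Z][a-z]*)\.\s+(.*)$")


def group_by_speaker(text):
    """Map each lowercased speaker label to a list of (line number, text)."""
    speeches = defaultdict(list)
    current = None
    for number, line in enumerate(text.splitlines()):
        match = SPEAKER.match(line)
        if match:
            # A new speaker label starts a new speech.
            current = match.group(1).lower()
            line = match.group(2)
        if current is not None:
            # Lines without a label continue the current speech.
            speeches[current].append((number, line.lower()))
    return speeches


result = group_by_speaker(SAMPLE)
print(list(result))
print(result["king"])
```

Keeping the most recent speaker in a variable is what lets unlabeled continuation lines be attributed to the right character.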
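Edit: most common words per character:
To find the most common words for each character, iterate over that character's sentences in d, then iterate over the str.split result of each sentence. Note that the result of this step will contain many stop words. My solution below gives you the option of filtering them out: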
from collections import Counter
def common_words(character, filter_stop=False, stop_words=()):
    if filter_stop:
        # download a list of English stop words, dropping empty lines
        stop_words = set(filter(None, requests.get("https://gist.githubusercontent.com/sebleier/554280/raw/7e0e4a1ce04c2bb7bd41089c9821dbcf6d0c786c/NLTK's%2520list%2520of%2520english%2520stopwords").text.split('\n')))
    # the keys of d are lowercased speaker labels, so normalize the lookup
    w = [i for _, b in d.get(character.lower(), []) for i in re.sub(r'[:.?]+', '', b).split() if i.lower() not in stop_words]
    return Counter(w).most_common(5)
print(common_words('Macb', filter_stop=True))
Output:
[('haue', 39), ('thou', 34), ('thy', 23), ('shall', 21), ('thee', 20)]
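The counting step can be checked without any network access by using a small inline stop word set. This is only a sketch: the names STOP_WORDS and top_words are hypothetical, the stop word set is a tiny illustrative stand-in for a real list, and the two input lines are taken from the Macbeth output shown earlier.

```python
import re
from collections import Counter

# Tiny illustrative stop word set; a real list would be much longer.
STOP_WORDS = {"i", "a", "and", "the", "not", "so", "of", "but", "how"}


def top_words(lines, n=3, stop_words=STOP_WORDS):
    """Count words across lines, skipping stop words, and return the n most common."""
    words = []
    for line in lines:
        # Drop punctuation, then split on whitespace.
        for word in re.sub(r"[^\w\s]", "", line.lower()).split():
            if word not in stop_words:
                words.append(word)
    return Counter(words).most_common(n)


lines = [
    "so foule and faire a day i haue not seene",
    "but how, of cawdor? the thane of cawdor liues",
]
print(top_words(lines, n=1))
```

Archaic forms such as "thou", "thy" and "haue" are unlikely to appear in a modern stop word list, which may explain why they dominate the result above; adding them to the set is one possible refinement.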
Source: "python - Splitting Macbeth when a character speaks", a similar question on Stack Overflow: https://stackoverflow.com/questions/67536083/