
python - Splitting Macbeth wherever a character speaks

Reposted. Author: 行者123. Last updated: 2023-12-04 01:03:04

After sending a GET request to Project Gutenberg, I have the full play Macbeth as a string:

import requests

response = requests.get('https://www.gutenberg.org/cache/epub/2264/pg2264.txt')
full_text = response.text
macbeth = full_text[16648:]

I split it into words:

words_raw = macbeth.split()
word_count = len(words_raw)

print("Macbeth contains {} words".format(word_count))
print("Here are some examples:", words_raw[400:460])

Then I strip all punctuation and convert each word to lowercase with lower():

import string
punctuation = string.punctuation

words_cleaned = []

for word in words_raw:
    # remove punctuation
    word = word.strip(punctuation)
    # make lowercase
    word = word.lower()
    words_cleaned.append(word)

print("Cleaned word examples:", words_cleaned[400:460])

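However, I cannot strip all the punctuation, because I need the period after a name or abbreviated name as the signal that a character is about to speak.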

Excerpt from the class assignment

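The character who is speaking is indicated by a (usually abbreviated) version of their name followed by a . (period) as the first thing on a line. So, for example, when Macbeth speaks, the line starts with "Macb." You will need to change how you handle punctuation, since you cannot strip all of it.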
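
One possible way to follow that hint, sketched under the assumption that only the period needs to survive: strip every punctuation character except "." from both ends of each word. The names `PUNCT_NO_PERIOD` and `clean_word` are illustrative and do not come from the original post.

```python
import string

# Every ASCII punctuation character except the period.
PUNCT_NO_PERIOD = string.punctuation.replace(".", "")

def clean_word(word):
    """Lowercase a word and strip punctuation from its ends, keeping periods."""
    return word.strip(PUNCT_NO_PERIOD).lower()

sample = ["Mal.", "The", "worthy", "Thane", "of", "Rosse", "Lenox.", "here?", "thee,"]
print([clean_word(w) for w in sample])
```

Sentence-ending periods (for example in "Angus.") survive this step as well, so a trailing period on its own does not prove that a token is a speaker tag.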

Slice of the raw data after split()

Names followed by a period were shown in bold

Macbeth contains 17737 words
Here are some examples: ['Gashes', 'cry', 'for', 'helpe', 'King.', 'So', 'well', 'thy', 'words', 'become', 'thee,', 'as', 'thy', 'wounds,', 'They', 'smack', 'of', 'Honor', 'both:', 'Goe', 'get', 'him', 'Surgeons.', 'Enter', 'Rosse', 'and', 'Angus.', 'Who', 'comes', 'here?', 'Mal.', 'The', 'worthy', 'Thane', 'of', 'Rosse', 'Lenox.', 'What', 'a', 'haste', 'lookes', 'through', 'his', 'eyes?', 'So', 'should', 'he', 'looke,', 'that', 'seemes', 'to', 'speake', 'things', 'strange', 'Rosse.', 'God', 'saue', 'the', 'King', 'King.']

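We know that Malcolm is speaking when his name is followed by a period (the bold "Mal." above), and the same holds for Lenox when he starts speaking ("Lenox."). Some character names are abbreviated, while others appear in full, immediately followed by a period.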

List of the most common names in Macbeth

["duncan", "malcolm", "donalbaine", "macbeth", "banquo", "macduff", "lenox", "rosse", "menteth", "angus", "cathnes", "fleance", "seyward", "seyton", "boy", "lady", "messenger", "wife"]

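Goals

  • For every character in the list above, find the name and, where one is used, its abbreviated form
  • Find each place where a character starts speaking, as indicated by a period, and split there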
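
A minimal sketch of both goals on a short word list, assuming that a speaker tag is a capitalized token ending in a period whose stem is a prefix of a known name. `NAMES`, `speaker_for` and `split_speeches` are hypothetical helpers written for illustration, not part of the original question.

```python
# Illustrative subset of the name list from the question.
NAMES = ["macbeth", "malcolm", "lenox", "rosse"]

def speaker_for(token, names=NAMES):
    """Return the full name if token looks like 'Abbrev.' for a known name, else None."""
    if not token.endswith(".") or not token[:1].isupper():
        return None
    stem = token[:-1].lower()
    if not stem:
        return None
    for name in names:
        # An abbreviation such as "mal" is a prefix of "malcolm".
        if name.startswith(stem):
            return name
    return None

def split_speeches(words, names=NAMES):
    """Group words into (speaker, [words]) chunks, opening a chunk at each speaker tag."""
    speeches = []
    for word in words:
        name = speaker_for(word, names)
        if name is not None:
            speeches.append((name, []))
        elif speeches:
            speeches[-1][1].append(word)
    return speeches

words = ["Who", "comes", "here?", "Mal.", "The", "worthy", "Thane", "of", "Rosse",
         "Lenox.", "What", "a", "haste", "lookes", "through", "his"]
for name, speech in split_speeches(words):
    print(name, "->", " ".join(speech))
```

This heuristic has known gaps: a sentence that ends in a name ("... of Rosse.") would be mistaken for a speaker tag, an abbreviation such as "Mac." is ambiguous between Macbeth and Macduff, and words before the first tag are dropped. Checking that the tag is the first thing on its line, as the assignment describes, is more reliable than working on the flat word list.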

Here is what I have tried so far

Trying to isolate the non-alphanumeric characters

print(len(words_raw))
def extra(string):
    return list(c for c in string if not c.isalnum() and not c.isspace())
weird = extra(macbeth)
weird

discard = []
for char in weird:
    if char != '.':
        discard.append(char)
print(len(weird))
print(len(discard))
print(discard)

revised_macbeth = []

for character in words_raw:
    if not character in discard:
        revised_macbeth.append(character)
print(len(revised_macbeth))

# for character in words_raw:
#     if not character.isalnum():
#         print("found: \'{}\'".format(character))

Its output

17737
4788
3553
['?', ',', ',', '?', '-', "'", ',', "'", ',', '?', ',', '-', ':', ',', ',', ',', ',', ',', ',', ',', '?', ',', ',', ',', "'", ':', ';', ',', ',', ',', ',', ',', ':', '(', ',', ')', "'", ',', ',', "'", ':', "'", ':', '(', ')', ',', ',', "'", '(', ')', "'", ',', "'", ':', "'", ',', ',', "'", "'", ',', "'", ',', "'", ',', ',', ':', ',', "'", ',', ':', ',', ',', ',', "'", ',', "'", ',', ',', ',', ',', ',', "'", ',', '?', ',', ',', ';', ',', ':', ',', '-', "'", ',', ':', ',', ',', ':', ',', ',', ',', ':', '?', '?', ',', "'", ',', '?', ',', ',', ',', ',', ',', ',', ',', ',', "'", ',', ',', '-', ',', ',', "'", ',', ':', ',', ',', ',', ':', ',', ',', ',', ',', ':', ',', ',', ',', '?', ',', '?', ',', ',', '&', ',', ':', ',', ',', ',', '-', "'", ',', "'", "'", ':', ',', ',', ',', ',', "'", ',', ',', ',', "'", "'", '-', ':', '-', ':', ':', "'", ',', ',', ',', ',', ':', ',', '-', ',', ',', ',', ',', ':', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', "'", "'", "'", '?', ',', "'", ',', ',', "'", "'", "'", ',', "'", '?', ',', '?', ',', ':', ',', ':', '?', ',', ',', ',', ',', ',', '?', "'", "'", ',', '?', ',', ',', ',', ':', ',', ',', ',', ',',

Comparison

print(macbeth)
The Tragedie of Macbeth

Actus Primus. Scoena Prima.

Thunder and Lightning. Enter three Witches.

1. When shall we three meet againe?
In Thunder, Lightning, or in Raine?
2. When the Hurley-burley's done,
When the Battaile's lost, and wonne

3. That will be ere the set of Sunne

1. Where the place?
2. Vpon the Heath

3. There to meet with Macbeth

1. I come, Gray-Malkin
print(revised_macbeth)
['The', 'Tragedie', 'of', 'Macbeth', 'Actus', 'Primus.', 'Scoena', 'Prima.', 'Thunder', 'and', 'Lightning.', 'Enter', 'three', 'Witches.', '1.', 'When', 'shall', 'we', 'three', 'meet', 'againe?', 'In', 'Thunder,', 'Lightning,', 'or', 'in', 'Raine?', '2.', 'When', 'the', "Hurley-burley's", 'done,', 'When', 'the', "Battaile's", 'lost,', 'and', 'wonne', '3.', 'That', 'will', 'be', 'ere', 'the', 'set', 'of', 'Sunne', '1.', 'Where', 'the', 'place?', '2.', 'Vpon', 'the', 'Heath', '3.', 'There', 'to', 'meet', 'with', 'Macbeth', '1.', 'I', 'come,', 'Gray-Malkin', 'All.', 'Padock', 'calls', 'anon:', 'faire', 'is', 'foule,', 'and', 'foule', 'is', 'faire,', 'Houer', 'through', 'the', 'fogge', 'and', 'filthie', 'ayre.', 'Exeunt.', 'Scena', 'Secunda.', 'Alarum', 'within.', 'Enter', 'King,', 'Malcome,', 'Donalbaine,', 'Lenox,', 'with', 'attendants,', 'meeting', 'a', 'bleeding', 'Captaine.', 'King.', 'What', 'bloody', 'man', 'is', 'that?', 'he', 'can', 'report,', 'As', 'seemeth', 'by', 'his', 'plight,', 'of', 'the', 'Reuolt', 'The', 'newest', 'state', 'Mal.', 'This', 'is', 'the', 'Serieant,', 'Who', 'like', 'a', 'good', 'and', 'hardie', 'Souldier', 'fought', "'Gainst", 'my', 'Captiuitie:', 'Haile', 'braue', 'friend;', 'Say', 'to', 'the', 'King,', 'the', 'knowledge', 'of', 'the', 'Broyle,', 'As', 'thou', 'didst', 'leaue', 'it', 'Cap.', 'Doubtfull', 'it', 'stood,', 'As', 'two', 'spent', 'Swimmers,', 'that', 'doe', 'cling', 'together,', 'And', 'choake', 'their', 'Art:', 'The', 'me

Best answer

You can use collections.defaultdict to group the lines by speaker name. enumerate can be used to get the line number of each piece of text a character speaks:

import requests, re
from collections import defaultdict
r = requests.get('https://www.gutenberg.org/cache/epub/2264/pg2264.txt').text
d, l, keywords = defaultdict(list), None, ['Enter', 'Exit', 'Flourish', 'Thunder']
#iterate over the play lines, ignoring empty strings (generated from the split)
for i, a in filter(lambda x:x[-1], enumerate(re.split('[\n\r]+', r[r.index('Actus Primus. Scoena Prima.')+27:]))):
    #check that the line contains character dialog, not stage prompts
    if not re.findall('|'.join(keywords), a):
        #grab the name of the character and append to "d"
        if (n:=re.findall(r'^\s+[A-Z](?:\.[A-Z])*[a-z]+\.(?=\s\w+)|^[A-Z](?:\.[A-Z])*[a-z\.]+\.(?=\s\w+)', a)):
            d[(l:=re.sub(r'^\s+|\.$', '', n[0]).lower())].append((i, a[len(n[0])+1:].lower()))
        elif l:
            #the line might be a continuation of a larger block of character text
            d[l].append((i, a.lower()))

print(list(d.keys())) #detected characters
print(d['macb'][:10]) #first ten occurrences of Macbeth speaking

Output:

['all', 'king', 'mal', 'cap', 'lenox', 'rosse', 'macb', 'banquo', 'mac', 'banq', 'ang', 'lady', 'mess', 'la', 'fleance', 'porter', 'macd', 'port', 'exeunt', 'ban', 'donal', 'malc', 'don', 'ross', 'seruant', 'murth', 'lords', 'mur', 'len', 'hec', 'lord', 'appar', 'musicke', 'wife', 'son', 'mes', 'doct', 'ro', 'gent', 'lad', 'ment', 'cath', 'ser', 'sey', 'seyw', 'sold', 'syw', 'y.sey']
[(137, 'so foule and faire a day i haue not seene'), (170, 'stay you imperfect speakers, tell me more:'), (171, 'by sinells death, i know i am thane of glamis,'), (172, 'but how, of cawdor? the thane of cawdor liues'), (173, 'a prosperous gentleman: and to be king,'), (174, 'stands not within the prospect of beleefe,'), (175, 'no more then to be cawdor. say from whence'), (176, 'you owe this strange intelligence, or why'), (177, 'vpon this blasted heath you stop our way'), (178, 'with such prophetique greeting?')]
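
The same grouping idea can be tried without a network request. The sketch below is a simplified variant, not the answer's exact code: it uses a shorter speaker regex, and the sample lines are reassembled from the word list shown in the question, so the line breaks in `SAMPLE` are an assumption. `group_by_speaker`, `SPEAKER` and `STAGE` are illustrative names.

```python
import re
from collections import defaultdict

# Sample text rebuilt from the tokens printed in the question (line breaks assumed).
SAMPLE = """Alarum within. Enter King, Malcome, Donalbaine, Lenox, with attendants, meeting a bleeding Captaine.
King. What bloody man is that? he can report,
As seemeth by his plight, of the Reuolt
The newest state
Mal. This is the Serieant,
Who like a good and hardie Souldier fought
"""

# A capitalized word followed by a period at the start of a line, then the speech.
SPEAKER = re.compile(r"^\s*([A-Z][a-z]+)\.\s+(\S.*)$")
# Lines containing these words are treated as stage directions and skipped.
STAGE = re.compile(r"\b(Enter|Exit|Exeunt|Flourish|Thunder)\b")

def group_by_speaker(text):
    """Map each lowercased speaker tag to a list of (line_number, lowercased_text)."""
    speeches = defaultdict(list)
    current = None
    for number, line in enumerate(text.splitlines()):
        if not line.strip() or STAGE.search(line):
            continue
        match = SPEAKER.match(line)
        if match:
            current = match.group(1).lower()
            speeches[current].append((number, match.group(2).lower()))
        elif current is not None:
            # No speaker tag: treat the line as a continuation of the current speech.
            speeches[current].append((number, line.strip().lower()))
    return speeches

for speaker, lines in group_by_speaker(SAMPLE).items():
    print(speaker, lines)
```

Unlike the answer's pattern, this regex does not handle tags such as "Y.Sey." or the numbered witches ("1.", "2."), so it would need extending before being run on the full play.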

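Edit: most common words per character:

To find the most common words for each character, iterate over that character's sentences in d, then over the result of str.split for each sentence. Note that the result of this step will contain many stop words. My solution below gives you the option to filter them out: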

from collections import Counter
def common_words(character, filter_stop = False, stop_words = []):
    if filter_stop:
        #download a stop word list, one word per line
        stop_words = set(filter(None, requests.get("https://gist.githubusercontent.com/sebleier/554280/raw/7e0e4a1ce04c2bb7bd41089c9821dbcf6d0c786c/NLTK's%2520list%2520of%2520english%2520stopwords").text.split('\n')))
    #the keys of "d" are lowercased speaker names, so look up the lowercased argument
    w = [i for _, b in d[character.lower()] for i in re.sub(r'[:.?]+', '', b).split() if i.lower() not in stop_words]
    return Counter(w).most_common(5)

print(common_words('Macb', filter_stop=True))

Output:

[('haue', 39), ('thou', 34), ('thy', 23), ('shall', 21), ('thee', 20)]
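
For experimenting offline, here is a minimal self-contained sketch of the same counting step. It assumes a tiny hand-written stop word set instead of the downloaded list, and it takes the lines directly rather than reading them from d. The `STOP_WORDS` set and this function signature are illustrative only.

```python
import re
from collections import Counter

# Tiny illustrative stop word set; a real run would use a fuller list.
STOP_WORDS = {"i", "a", "and", "to", "the", "of", "not", "so"}

def common_words(lines, stop_words=frozenset(), top=5):
    """Count words across lines, ignoring punctuation and the given stop words."""
    words = []
    for line in lines:
        # Drop everything except word characters, whitespace and apostrophes.
        for word in re.sub(r"[^\w\s']", "", line.lower()).split():
            if word not in stop_words:
                words.append(word)
    return Counter(words).most_common(top)

# First lines of Macbeth's speech, taken from the answer's output above.
macb_lines = [
    "so foule and faire a day i haue not seene",
    "stay you imperfect speakers, tell me more:",
    "by sinells death, i know i am thane of glamis,",
    "but how, of cawdor? the thane of cawdor liues",
]
print(common_words(macb_lines, STOP_WORDS, top=2))
```

Modern stop word lists do not cover early modern spellings such as "haue", "thou" or "thy", which is why those words still top the filtered output shown above.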

Regarding python - Splitting Macbeth wherever a character speaks, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/67536083/
