
python - Cleaning data efficiently in Python

Reposted. Author: 行者123. Updated: 2023-11-28 16:34:55

I have data in the following format:

(TOP (S (PP-LOC (IN In) (NP (NP (DT an) (NNP Oct.) (CD 19) (NN review) ) (PP (IN of) (NP (`` ``) (NP-TTL (DT The) (NN Misanthrope) ) ('' '') (PP-LOC (IN at) (NP (NP (NNP Chicago) (POS 's) ) (NNP Goodman) (NNP Theatre) )))) (PRN (-LRB- -LRB-) (`` ``) (S-HLN (NP-SBJ (VBN Revitalized) (NNS Classics) ) (VP (VBP Take) (NP (DT the) (NN Stage) ) (PP-LOC (IN in) (NP (NNP Windy) (NNP City) )))) (, ,) ('' '') (NP-TMP (NN Leisure) (CC &) (NNS Arts) ) (-RRB- -RRB-) ))) (, ,) (NP-SBJ-2 (NP (NP (DT the) (NN role) ) (PP (IN of) (NP (NNP Celimene) ))) (, ,) (VP (VBN played) (NP (-NONE- *) ) (PP (IN by) (NP-LGS (NNP Kim) (NNP Cattrall) ))) (, ,) ) (VP (VBD was) (VP (ADVP-MNR (RB mistakenly) ) (VBN attributed) (NP (-NONE- *-2) ) (PP-CLR (TO to) (NP (NNP Christina) (NNP Haag) )))) (. .) ))

(TOP (S (NP-SBJ (NNP Ms.) (NNP Haag) ) (VP (VBZ plays) (NP (NNP Elianti) )) (. .) ))

..... (and more than 7,000 others ..)

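This data comes from a newspaper. Each new line is a new sentence (starting with 'TOP'). From this data I only need the (tag word) pairs of each sentence, which were shown in bold in the original post, without the enclosing brackets: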

(IN In) (DT an) (NNP Oct.) (CD 19) (NN review) (IN of) (`` ``) (DT The) (NN Misanthrope) ('' '') (IN at) (NNP Chicago) (POS 's) (NNP Goodman) (NNP Theatre) (-LRB- -LRB-) (`` ``) (VBN Revitalized) (NNS Classics) (VBP Take) (DT the) (NN Stage) (IN in) (NNP Windy) (NNP City) (, ,) ('' '') (NN Leisure) (CC &) (NNS Arts) (-RRB- -RRB-) (, ,) (DT the) (NN role) (IN of) (NNP Celimene) (, ,) (VBN played) (-NONE- *) (IN by) (NNP Kim) (NNP Cattrall) (, ,) (VBD was) (RB mistakenly) (VBN attributed) (-NONE- *-2) (TO to) (NNP Christina) (NNP Haag) (. .)

(NNP Ms.) (NNP Haag) (VBZ plays) (NNP Elianti) (. .)

I tried the following:

# Read every line (one parsed sentence per line); the file is closed automatically.
with open('filename') as f:
    data = f.readlines()

This part creates an array of tuples for each line (using a regular expression):

import re
import numpy

tag_word_train = numpy.empty(5000, dtype='object')
for i in range(5000):
    tag_word_train[i] = re.findall(r'\(([\w.-]+)\s([\w.-]+)\)', data[i])

It took a very long time, so I don't know whether it is correct.
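One note on correctness: the character class [\w.-] in the pattern above does not match tokens such as `,`, `''`, `&`, `*` or `'s`, so pairs like (, ,) or (POS 's) would be skipped. A minimal sketch of an alternative, using only the standard library and assuming every leaf has the form "(TAG word)" with no spaces or parentheses inside either part; the names LEAF and leaf_pairs are hypothetical, not from the original post:

```python
import re

# A leaf is "(TAG word)": two runs of characters that are neither
# whitespace nor parentheses. Inner nodes such as "(NP (DT the) ...)"
# do not match because their second element starts with "(".
LEAF = re.compile(r'\(([^\s()]+)\s+([^\s()]+)\)')

def leaf_pairs(line):
    """Return the (tag, word) pairs of one bracketed sentence, in order."""
    return LEAF.findall(line)

sample = "(TOP (S (NP-SBJ (NNP Ms.) (NNP Haag) ) (VP (VBZ plays) (NP (NNP Elianti) )) (. .) ))"
pairs = leaf_pairs(sample)
print(" ".join("({} {})".format(tag, word) for tag, word in pairs))
```

Applied to the whole file, this could be a plain list comprehension such as [leaf_pairs(line) for line in data], which avoids the hard-coded 5000 and the numpy object array.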

Do you know how to do this efficiently?

Thanks,

Hadas

Best answer

nltk has a Tree class that may suit your needs. In particular, you will want to use the class method nltk.tree.Tree.fromstring:

>>> import nltk.tree
>>> nltk.tree.Tree.fromstring("(S (NP (DT The) (N cat)) (VP (V ran)))")
Tree('S', [Tree('NP', [Tree('DT', ['The']), Tree('N', ['cat'])]), Tree('VP', [Tree('V', ['ran'])])])
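Building on that, here is a minimal sketch of how the (tag word) pairs from the question might be pulled out of each parsed line. It assumes nltk is installed and that every line is one complete bracketed tree; the helper name tagged_leaves is hypothetical. Tree.pos() returns (word, tag) tuples, so the sketch swaps them to match the layout in the question:

```python
from nltk.tree import Tree

def tagged_leaves(line):
    """Return the (tag, word) pairs of one bracketed parse, in sentence order."""
    tree = Tree.fromstring(line)
    # Tree.pos() yields (word, tag); reorder to (tag, word).
    return [(tag, word) for word, tag in tree.pos()]

line = "(TOP (S (NP-SBJ (NNP Ms.) (NNP Haag) ) (VP (VBZ plays) (NP (NNP Elianti) )) (. .) ))"
pairs = tagged_leaves(line)
print(" ".join("({} {})".format(tag, word) for tag, word in pairs))
```

For the whole file, something like [tagged_leaves(line) for line in data if line.strip()] would process every sentence; whether it is fast enough for 7,000+ lines is something to measure rather than assume.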

This post is based on the Stack Overflow question "python - Cleaning data efficiently in Python": https://stackoverflow.com/questions/27499429/
