
python - Tokenizing a list of lists

Reposted. Author: 行者123. Updated: 2023-12-05 07:37:18

I am trying to tokenize a CSV file of scraped tweets. I loaded the CSV file as a list:

import csv

with open('recent_tweet_purex.csv', 'r') as purex:
    reader_purex = csv.reader(purex)
    purex_list = list(reader_purex)

The tweets are now in a list:

["b'I miss having someone to talk to all night..'"], ["b'Pergunte-me 
qualquer coisa'"], ["b'RT @Caracolinhos13: Tenho a
tl cheia dessa merda de quem vos visitou nas \\xc3\\xbaltimas horas'"],
["b'RT @B24pt: #CarlosHadADream'"], ['b\'"Tudo tem
um fim"\''], ["b'RT @thechgama: stalkear as curtidas \\xc3\\xa9 um caminho
sem volta'"], ["b'Como consegues fumar 3 purexs seguidas? \\xe2\\x80\\x94
Eram 2 purex e mix...'"]

I have imported nltk along with the following packages:

import nltk
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')

I tried using

purex_words = word_tokenize(purex_words)

to tokenize it, but I keep getting an error.

Any help?

Best answer

You are passing a list to the word_tokenize function, which expects a string or bytes-like object. If you feed it a string, it works. A quick example:
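The wording of the error hints at the cause. As a rough illustration only, the sketch below uses the standard-library `re` module as a stand-in, on the assumption that the tokenizer applies regular expressions to its input. `simple_tokens` is a hypothetical helper, not part of nltk. Passing a whole row (a list) fails, while passing the string inside the row works:

```python
import re

def simple_tokens(text):
    # Stand-in for a regex-based tokenizer; this is NOT nltk's implementation.
    return re.findall(r"\w+", text)

row = ['I miss having someone to talk to all night..']  # one CSV row is a list

try:
    simple_tokens(row)           # wrong: the whole row (a list) is passed
    error_message = None
except TypeError as exc:
    error_message = str(exc)     # the regex engine rejects non-string input

tokens = simple_tokens(row[0])   # right: pass the string inside the row
print(error_message)
print(tokens)
```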

purex_words = [
    ['I miss having someone to talk to all night..'],
    ['Pergunte-me qualquer coisa'],
    ['RT @Caracolinhos13: Tenho a tl cheia dessa merda de quem vos visitou nas \\xc3\\xbaltimas horas'],
    ['RT @B24pt: #CarlosHadADream'],
    ["Tudo tem um fim"],
    ["RT @thechgama: stalkear as curtidas \\xc3\\xa9 um caminho sem volta"],
    ['Como consegues fumar 3 purexs seguidas? \\xe2\\x80\\x94 Eram 2 purex e mix...'],
]

for sentence in purex_words:
    print(word_tokenize(sentence[0]))  # indexing each one-element row works, but it looks ugly to me

You can flatten the list before iterating over the sentences. Note that I added an outer [] around your list.

flat_list = [item for sublist in purex_words for item in sublist]
for sentence in flat_list:
    print(word_tokenize(sentence))
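The same flattening can also be written with `itertools.chain.from_iterable`. The sketch below is one possible variant, not the answer's own code: `tokenize_rows` is a hypothetical helper that takes the tokenizer as a parameter, and `str.split` is used only as a placeholder so the example runs without nltk or its data.

```python
from itertools import chain

def tokenize_rows(rows, tokenizer):
    """Flatten one level of nesting, then tokenize each string."""
    return [tokenizer(text) for text in chain.from_iterable(rows)]

rows = [['Pergunte-me qualquer coisa'], ['RT @B24pt: #CarlosHadADream']]

# str.split is only a placeholder tokenizer (whitespace splitting, no nltk needed).
result = tokenize_rows(rows, str.split)
print(result)
```

With nltk installed and the punkt data downloaded, calling `tokenize_rows(purex_words, word_tokenize)` should return the token lists that the loop above prints.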

The result looks like this:

['I', 'miss', 'having', 'someone', 'to', 'talk', 'to', 'all', 'night..']
['Pergunte-me', 'qualquer', 'coisa']
['RT', '@', 'Caracolinhos13', ':', 'Tenho', 'a', 'tl', 'cheia', 'dessa', 'merda', 'de', 'quem', 'vos', 'visitou', 'nas', '\\xc3\\xbaltimas', 'horas']
['RT', '@', 'B24pt', ':', '#', 'CarlosHadADream']
['Tudo', 'tem', 'um', 'fim']
['RT', '@', 'thechgama', ':', 'stalkear', 'as', 'curtidas', '\\xc3\\xa9', 'um', 'caminho', 'sem', 'volta']
['Como', 'consegues', 'fumar', '3', 'purexs', 'seguidas', '?', '\\xe2\\x80\\x94', 'Eram', '2', 'purex', 'e', 'mix', '...']
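Tokens such as '\\xc3\\xa9' survive because each CSV cell appears to hold the text form of a Python bytes object (b'...') rather than the decoded tweet. A possible cleanup before tokenizing, assuming the cells are valid bytes literals and that the bytes are UTF-8 (the question does not state the encoding), is to parse each cell with `ast.literal_eval` and decode it. `decode_cell` is a hypothetical helper introduced for this sketch.

```python
import ast

def decode_cell(cell):
    """Turn a cell like "b'...'" back into text; leave other cells unchanged."""
    if cell.startswith(("b'", 'b"')):
        try:
            value = ast.literal_eval(cell)  # safely parse the bytes literal
        except (ValueError, SyntaxError):
            return cell                     # not a valid literal: keep as-is
        if isinstance(value, bytes):
            # Assumption: the tweets were encoded as UTF-8.
            return value.decode("utf-8", errors="replace")
    return cell

cell = "b'stalkear as curtidas \\xc3\\xa9 um caminho sem volta'"
decoded = decode_cell(cell)
print(decoded)
```

Applying this to every cell of the flattened list before tokenizing should yield words with their accented characters rather than escape sequences.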

Regarding python - tokenizing a list of lists, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48677718/
