
Python tokenization UnicodeDecodeError

Reposted. Author: 行者123. Updated: 2023-11-28 17:27:48

I am trying to tokenize some documents, but I get this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 6: ordinal not in range(128)

import nltk
import pandas as pd

df = pd.DataFrame(pd.read_csv('status2.csv'))
documents = df['status']

result = [nltk.word_tokenize(sent) for sent in documents]

I assumed it was a Unicode problem, so I added:

documents = unicode(documents, 'utf-8')

That produced a different error:

TypeError: coercing to Unicode: need string or buffer, Series found

print documents

1 Brandon Cachia ,All I know is that,you're so n...
2 Melissa Zejtunija:HAM AND CHEESE BIEX INI??? *...
3 .........Where is my mind?????
4 Having a philosophical discussion with Trudy D...

Best answer

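unicode operates on a single string or byte buffer, but documents is a pandas Series, so the conversion has to be applied to each element rather than to the Series as a whole.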

Perhaps:

result = [nltk.word_tokenize(unicode(sent, 'utf-8')) for sent in documents]
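
The same per-element idea can be sketched for Python 3, where unicode() no longer exists and byte strings are decoded with .decode() instead. This is only an illustration under stated assumptions: to_text and simple_tokenize are hypothetical helpers (simple_tokenize is a rough stand-in for nltk.word_tokenize so the sketch runs without NLTK data), and a plain list stands in for the pandas Series. The first sample value is made up so that byte 0xef sits at position 6, as in the error message above.

```python
import re


def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def simple_tokenize(text):
    # Hypothetical stand-in for nltk.word_tokenize:
    # runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


# Stand-in for the pandas Series (assumption): iterating over a Series
# yields its values one at a time, just like iterating over this list.
documents = [
    b"Hello \xef\xbc\x81",   # UTF-8 bytes; 0xef is at position 6
    "Where is my mind?",     # already text, left unchanged
]

# Decode each element individually, then tokenize it.
result = [simple_tokenize(to_text(doc)) for doc in documents]
```

Decoding the first value as ASCII would fail at byte 0xef in the same way as the original traceback, whereas decoding it as UTF-8 succeeds. If the data comes from a CSV file, passing encoding='utf-8' to pd.read_csv may be another option worth trying, assuming the file really is UTF-8.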

This post on Python tokenization and UnicodeDecodeError is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/37289944/
