
python - TypeError: expected string or bytes-like object – with Python/NLTK word_tokenize


I have a dataset with roughly 40 columns, and I'm using .apply(word_tokenize) on 5 of them like so: df['token_column'] = df.column.apply(word_tokenize)

I only get a TypeError for one of the columns, which we'll call problem_column:

TypeError: expected string or bytes-like object

Here is the full error (with the df and column names, and PII, stripped out). I'm new to Python and still trying to figure out which parts of the error message are relevant:

TypeError                                 Traceback (most recent call last)
<ipython-input-51-22429aec3622> in <module>()
----> 1 df['token_column'] = df.problem_column.apply(word_tokenize)

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
2353 else:
2354 values = self.asobject
-> 2355 mapped = lib.map_infer(values, f, convert=convert_dtype)
2356
2357 if len(mapped) and isinstance(mapped[0], Series):

pandas\_libs\src\inference.pyx in pandas._libs.lib.map_infer (pandas\_libs\lib.c:66440)()

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py in word_tokenize(text, language, preserve_line)
128 :type preserver_line: bool
129 """
--> 130 sentences = [text] if preserve_line else sent_tokenize(text, language)
131 return [token for sent in sentences
132 for token in _treebank_word_tokenizer.tokenize(sent)]

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py in sent_tokenize(text, language)
95 """
96 tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
---> 97 return tokenizer.tokenize(text)
98
99 # Standard word tokenizer.

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in tokenize(self, text, realign_boundaries)
1233 Given a text, returns a list of the sentences in that text.
1234 """
-> 1235 return list(self.sentences_from_text(text, realign_boundaries))
1236
1237 def debug_decisions(self, text):

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in sentences_from_text(self, text, realign_boundaries)
1281 follows the period.
1282 """
-> 1283 return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
1284
1285 def _slices_from_text(self, text):

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in span_tokenize(self, text, realign_boundaries)
1272 if realign_boundaries:
1273 slices = self._realign_boundaries(text, slices)
-> 1274 return [(sl.start, sl.stop) for sl in slices]
1275
1276 def sentences_from_text(self, text, realign_boundaries=True):

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in <listcomp>(.0)
1272 if realign_boundaries:
1273 slices = self._realign_boundaries(text, slices)
-> 1274 return [(sl.start, sl.stop) for sl in slices]
1275
1276 def sentences_from_text(self, text, realign_boundaries=True):

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _realign_boundaries(self, text, slices)
1312 """
1313 realign = 0
-> 1314 for sl1, sl2 in _pair_iter(slices):
1315 sl1 = slice(sl1.start + realign, sl1.stop)
1316 if not sl2:

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _pair_iter(it)
310 """
311 it = iter(it)
--> 312 prev = next(it)
313 for el in it:
314 yield (prev, el)

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _slices_from_text(self, text)
1285 def _slices_from_text(self, text):
1286 last_break = 0
-> 1287 for match in self._lang_vars.period_context_re().finditer(text):
1288 context = match.group() + match.group('after_tok')
1289 if self.text_contains_sentbreak(context):

TypeError: expected string or bytes-like object
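The last frame points at the root cause: punkt's _slices_from_text runs a compiled regex over each cell via finditer, and Python's re module raises this exact TypeError for any argument that isn't str or bytes (for example a float NaN). A minimal sketch that reproduces the error without NLTK or pandas:

```python
import re

# punkt calls period_context_re().finditer(text) on each cell;
# re raises the same TypeError whenever text is not str/bytes.
for value in ["A sentence.", float("nan")]:
    try:
        list(re.finditer(r"\.", value))
        print(type(value).__name__, "tokenizes fine")
    except TypeError as err:
        print(type(value).__name__, "raises:", err)
```

The str value matches fine; the float raises "expected string or bytes-like object" (the exact wording varies slightly across Python versions).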

All 5 columns are character/string columns (verified in SQL Server, in SAS, and with .select_dtypes(include=[object])).

For good measure, I used .to_string() to make sure problem_column really contained nothing but strings, but I still get the error. If I process the columns one at a time, good_column1-good_column4 keep working and problem_column still raises the error.

I've Googled around, and apart from removing any numbers from the data (which I can't do, because they're meaningful), I haven't found any other fix.
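Before choosing a fix, it can help to see exactly which rows are not strings. A sketch with a hypothetical stand-in for problem_column (the real data obviously differs); filtering on isinstance surfaces NaN/None and numeric cells that .select_dtypes(include=[object]) cannot distinguish, because an object-dtype column happily holds all of them:

```python
import pandas as pd

# Hypothetical stand-in for problem_column: object dtype overall,
# but one cell is None and one is an int.
df = pd.DataFrame({"problem_column": ["first note", None, 42, "last note"]})

# Cells that are not str are the ones word_tokenize will choke on.
bad_rows = df[~df["problem_column"].apply(lambda x: isinstance(x, str))]
print(bad_rows.index.tolist())  # indices of the offending rows
```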

Best answer

The problem is that your DF contains NaN (missing) values in that column. An object-dtype column can still hold float NaN alongside strings, which is why .select_dtypes(include=[object]) didn't flag anything. Try this:

# dropna() returns a new Series with the NaN rows removed;
# note that df['label'].dropna(inplace=True) would NOT modify df itself
tokens = df['label'].dropna().apply(word_tokenize)
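If the NaN rows must be kept (dropping them would also lose those rows' values in the other ~35 columns), coercing to string preserves row alignment so the result can be assigned straight back to the frame. A sketch with a hypothetical frame, using .str.split() as a stand-in for .apply(word_tokenize) so it runs without NLTK; in the real code you would swap word_tokenize back in:

```python
import pandas as pd

df = pd.DataFrame({"problem_column": ["first note", None, "42 widgets"]})

# fillna('') keeps every row; astype(str) guards against stray numerics;
# .str.split() stands in for .apply(word_tokenize) in this sketch.
df["token_column"] = (
    df["problem_column"]
    .fillna("")
    .astype(str)
    .str.split()
)
print(df["token_column"].tolist())  # [['first', 'note'], [], ['42', 'widgets']]
```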

Regarding "python - TypeError: expected string or bytes-like object – with Python/NLTK word_tokenize", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46105180/
