
Python MySQLdb changing the string encoding

Reposted. Author: 太空宇宙. Updated: 2023-11-03 10:52:35

I think my problem is that Python does not handle the character encoding of a column in my SQL table well:

| column | varchar(255) | latin1_swedish_ci | YES  |     | NULL              |                             | select,insert,update,references |    | 

The line above is the output for this column. Its type is varchar(255) and its collation is latin1_swedish_ci, which implies the latin1 character set.

Now, when I try to have Python process this data, I get the following error:

 dictionary = gs.corpora.Dictionary(tweets)
File "/usr/local/lib/python2.7/dist-packages/gensim-0.9.1-py2.7.egg/gensim/corpora/dictionary.py", line 50, in __init__
self.add_documents(documents)
File "/usr/local/lib/python2.7/dist-packages/gensim-0.9.1-py2.7.egg/gensim/corpora/dictionary.py", line 97, in add_documents
_ = self.doc2bow(document, allow_update=True) # ignore the result, here we only care about updating token ids
File "/usr/local/lib/python2.7/dist-packages/gensim-0.9.1-py2.7.egg/gensim/corpora/dictionary.py", line 121, in doc2bow
document = sorted(utils.to_utf8(token) for token in document)
File "/usr/local/lib/python2.7/dist-packages/gensim-0.9.1-py2.7.egg/gensim/corpora/dictionary.py", line 121, in <genexpr>
document = sorted(utils.to_utf8(token) for token in document)
File "/usr/local/lib/python2.7/dist-packages/gensim-0.9.1-py2.7.egg/gensim/utils.py", line 164, in any2utf8
return unicode(text, encoding, errors=errors).encode('utf8')
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 0: invalid start byte

gs is the gensim topic-modeling library. I think the problem is that gensim requires Unicode-compatible (UTF-8) input.
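The failure in the traceback can be reproduced without a database. The sketch below is written for Python 3 (the traceback comes from Python 2.7) and uses a made-up sample value, `raw`, that is not taken from the original table. It relies on the fact that MySQL's latin1 corresponds to Windows-1252, where byte 0x96 is an en dash; that byte is not a valid UTF-8 start byte, which is what the error message reports.

```python
# Hypothetical sample: bytes as they might come back from a latin1 column.
raw = b"\x96 example"

# Decoding as UTF-8 fails, mirroring the error in the traceback.
try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = str(exc)

# MySQL's latin1 is Windows-1252, so decode with cp1252 instead,
# then re-encode as UTF-8 for libraries that expect UTF-8 input.
text = raw.decode("cp1252")
utf8_bytes = text.encode("utf-8")

print(error)
print(repr(text))
```

Decoding each value this way before passing it to gensim is one possible workaround if the connection settings cannot be changed; it assumes every value really is Windows-1252.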

  1. How do I change the character encoding (collation?) of this column in the database?
  2. Is there an alternative solution?

Thanks for all the help!

Best answer

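I think your MySQLdb Python library does not know it is supposed to encode to utf8, and is instead using the default system-defined character set, latin1.

When you connect() to your database, pass the charset='utf8' parameter. This should also make a manual SET NAMES statement unnecessary.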
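A minimal sketch of that suggestion follows, written for Python 3. The helper names `utf8_connect_kwargs` and `fetch_first_column`, and all credentials and the query, are hypothetical placeholders rather than part of the original question. `charset` and `use_unicode` are keyword arguments accepted by `MySQLdb.connect()`; the driver is imported lazily so the first helper can be tried without it installed.

```python
def utf8_connect_kwargs(host, user, passwd, db):
    """Build keyword arguments for MySQLdb.connect() that request UTF-8."""
    return {
        "host": host,
        "user": user,
        "passwd": passwd,
        "db": db,
        "charset": "utf8",    # connection character set, as the answer suggests
        "use_unicode": True,  # return text columns as Unicode strings
    }


def fetch_first_column(kwargs, query):
    """Run a query and return the first column of every row."""
    import MySQLdb  # lazy import: only needed when a connection is opened

    conn = MySQLdb.connect(**kwargs)
    try:
        cur = conn.cursor()
        cur.execute(query)
        return [row[0] for row in cur.fetchall()]
    finally:
        conn.close()


# Placeholder values; replace with real connection details.
kwargs = utf8_connect_kwargs("localhost", "user", "secret", "tweets_db")
```

With a reachable server, something like `fetch_first_column(kwargs, "SELECT text FROM tweets")` (table and column names assumed) should return strings that can be tokenized and passed to `gs.corpora.Dictionary`.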

Regarding "Python MySQLdb changing the string encoding", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23348819/
