
CSV file written by Python is unreadable in Excel (Chinese characters)

Reposted. Author: 太空狗. Updated: 2023-10-29 23:58:13

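I am trying to run a text analysis on Chinese text. The program is given below. The results contain unreadable characters such as 滨烘暯镞ユ姤捐. If I change the output file from result.csv to result.txt, the characters display correctly as 人民日报社论. What is wrong here? I can't figure it out. I have tried several approaches, including adding decode and encode calls.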

# -*- coding: utf-8 -*-
import os
import glob
import jieba
import jieba.analyse
import csv
import codecs

segList = []
raw_data_path = 'monthly_raw_data/'
file_name = ["201010", "201011", "201012", "201101", "201103", "201105", "201107", "201109", "201110", "201111", "201112", "201201", "201202", "201203", "201205", "201206", "201208", "201210", "201211"]

jieba.load_userdict("customized_dict.txt")

for name in file_name:
    all_text = ""
    multi_line_text = ""
    with open(raw_data_path + name + ".txt", "r") as file:
        for line in file:
            if line != '\n':
                multi_line_text += line
    templist = multi_line_text.split('\n')
    for text in templist:
        all_text += text
    seg_list = jieba.cut(all_text,cut_all=False)
    temp_text = []
    for item in seg_list:
        temp_text.append(item.encode('utf-8'))

    stop_list = []
    with open("stopwords.txt", "r") as stoplistfile:
        for item in stoplistfile:
            stop_list.append(item.rstrip('\r\n'))

    text_without_stopwords = []
    for word in temp_text:
        if word not in stop_list:
            text_without_stopwords.append(word)

    segList.append(text_without_stopwords)


with open("results/result.csv", 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(segList)

Best answer

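To recognize a file as UTF-8, Excel needs a BOM (byte order mark) code point written at the start of the file. Without it, Excel assumes an ANSI encoding, which depends on the locale. U+FEFF is the Unicode BOM. Here is an example that opens correctly in Excel: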

#!python2
#coding:utf8
import csv

data = [[u'American',u'美国人'],
        [u'Chinese',u'中国人']]

with open('results.csv','wb') as f:
    f.write(u'\ufeff'.encode('utf8'))
    w = csv.writer(f)
    for row in data:
        w.writerow([item.encode('utf8') for item in row])
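
To make the role of the BOM concrete, here is a small Python 3 sketch of what happens to the bytes. It assumes GBK as a stand-in for the ANSI code page on a Chinese-locale Windows machine; the code page Excel actually picks depends on the system, so the exact garbled characters may differ.

```python
import codecs

text = "人民日报社论"
utf8_bytes = text.encode("utf-8")

# The UTF-8 BOM is U+FEFF encoded as UTF-8: the three bytes EF BB BF.
bom = "\ufeff".encode("utf-8")
assert bom == codecs.BOM_UTF8 == b"\xef\xbb\xbf"

# Assumption: with no BOM, the reader falls back to the locale's ANSI code
# page. GBK is used here only as an illustrative choice.
misread = utf8_bytes.decode("gbk", errors="replace")
print(misread == text)    # False: same bytes, wrong decoder

# Decoding with the matching codec recovers the text; "utf-8-sig" also
# drops a leading BOM if one is present.
recovered = (bom + utf8_bytes).decode("utf-8-sig")
print(recovered == text)  # True
```

The bytes on disk are the same in both cases; only the decoder the reader chooses changes, which is why the .txt version of the file can look fine while the .csv version does not.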

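Python 3 makes this easier. Instead of 'wb', open the file with the arguments 'w', newline='', encoding='utf-8-sig'. The writer then accepts Unicode strings directly and the BOM is written automatically: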

#!python3
#coding:utf8
import csv

data = [['American','美国人'],
        ['Chinese','中国人']]

with open('results.csv','w',newline='',encoding='utf-8-sig') as f:
    w = csv.writer(f)
    w.writerows(data)
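
A quick way to check the result without opening Excel is to inspect the first bytes of the file and read it back. The sketch below is Python 3 and writes to a temporary directory, which is an assumption made only to keep the example self-contained:

```python
import csv
import os
import tempfile

data = [["American", "美国人"],
        ["Chinese", "中国人"]]

# Temporary location so the example does not touch the working directory.
path = os.path.join(tempfile.mkdtemp(), "results.csv")

with open(path, "w", newline="", encoding="utf-8-sig") as f:
    csv.writer(f).writerows(data)

# The raw file should begin with the UTF-8 BOM bytes EF BB BF.
with open(path, "rb") as f:
    starts_with_bom = f.read(3) == b"\xef\xbb\xbf"

# Reading with utf-8-sig strips the BOM, so the first cell comes back clean.
with open(path, "r", newline="", encoding="utf-8-sig") as f:
    rows_back = list(csv.reader(f))

print(starts_with_bom)    # True
print(rows_back == data)  # True
```

Reading the same file with plain encoding='utf-8' would leave '\ufeff' attached to the first cell, so 'utf-8-sig' is the safer choice on the reading side as well.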

There is also a third-party unicodecsv module that makes this easier on Python 2:

#!python2
#coding:utf8
import unicodecsv

data = [[u'American',u'美国人'],
        [u'Chinese',u'中国人']]

with open('results.csv','wb') as f:
    w = unicodecsv.writer(f,encoding='utf-8-sig')
    w.writerows(data)
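
Applied to the program in the question, the same idea might look like the Python 3 sketch below. The helper name write_rows_for_excel and the sample tokens are hypothetical; the tokens stand in for the output of jieba.cut, which returns str values in Python 3, so the item.encode('utf-8') step is no longer needed.

```python
import csv
import os
import tempfile

def write_rows_for_excel(path, rows, stop_words=()):
    """Write rows of str tokens as a BOM-prefixed UTF-8 CSV, skipping stop words."""
    stop = set(stop_words)  # set lookup is faster than scanning a list
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        for tokens in rows:
            writer.writerow([t for t in tokens if t not in stop])

# Hypothetical sample standing in for the segmented text of each input file.
sample_rows = [["人民日报", "的", "社论"],
               ["中国", "的", "经济"]]

out_path = os.path.join(tempfile.mkdtemp(), "result.csv")
write_rows_for_excel(out_path, sample_rows, stop_words=["的"])

with open(out_path, "r", newline="", encoding="utf-8-sig") as f:
    written = list(csv.reader(f))
print(written == [["人民日报", "社论"], ["中国", "经济"]])  # True
```

In a Python 3 port, the input files and the stop-word list would also need an explicit encoding when opened (for example encoding='utf-8', assuming that is how they were saved), so that the tokens and the stop words compare as the same kind of string.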

Regarding "CSV file written by Python is unreadable in Excel (Chinese characters)", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/34481700/
