
python - UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' when converting series object to unicode in pandas with utf-16

Reposted. Author: 太空宇宙. Updated: 2023-11-04 06:04:20

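I have a UTF-16 CSV file that I am trying to load into pandas. By default, the data comes in as the object dtype. I plan to do some modeling on the caption column, so I want to convert the column df['caption'] from object to Unicode strings. Currently, the line df['caption']=df['caption'].astype(unicode) raises the following error: UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 6: ordinal not in range(128).

I tried to work around this by calling the encode and decode functions on the individual values in the df['caption'] column, but I could not get that to work.

I am new to pandas and Unicode, so I am wondering whether anyone knows what I am doing wrong.

Thanks in advance.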

邓丽君

Additional information:

The traceback is as follows:

UnicodeEncodeError: Traceback (most recent call last)
<ipython-input-5-aad36f4acf38> in <module>()
10 print df['caption'].head(10)
11
---> 12 df['caption']=df['caption'].astype(unicode)

/opt/anaconda/envs/np18py27-1.9/lib/python2.7/site-packages/pandas/core/generic.pyc in astype(self, dtype, copy, raise_on_error)
2016
2017 mgr = self._data.astype(
-> 2018 dtype, copy=copy, raise_on_error=raise_on_error)
2019 return self._constructor(mgr).__finalize__(self)
2020

/opt/anaconda/envs/np18py27-1.9/lib/python2.7/site-packages/pandas/core/internals.pyc in astype(self, *args, **kwargs)
2414
2415 def astype(self, *args, **kwargs):
-> 2416 return self.apply('astype', *args, **kwargs)
2417
2418 def convert(self, *args, **kwargs):

/opt/anaconda/envs/np18py27-1.9/lib/python2.7/site-packages/pandas/core/internals.pyc in apply(self, f, *args, **kwargs)
2373
2374 else:
-> 2375 applied = getattr(blk, f)(*args, **kwargs)
2376
2377 if isinstance(applied, list):

/opt/anaconda/envs/np18py27-1.9/lib/python2.7/site-packages/pandas/core/internals.pyc in astype(self, dtype, copy, raise_on_error, values)
425 def astype(self, dtype, copy=False, raise_on_error=True, values=None):
426 return self._astype(dtype, copy=copy, raise_on_error=raise_on_error,
--> 427 values=values)
428
429 def _astype(self, dtype, copy=False, raise_on_error=True, values=None,

/opt/anaconda/envs/np18py27-1.9/lib/python2.7/site-packages/pandas/core/internals.pyc in _astype(self, dtype, copy, raise_on_error, values, klass)
442 # force the copy here
443 if values is None:
--> 444 values = com._astype_nansafe(self.values, dtype, copy=True)
445 newb = make_block(values, self.items, self.ref_items,
446 ndim=self.ndim, placement=self._ref_locs,

/opt/anaconda/envs/np18py27-1.9/lib/python2.7/site-packages/pandas/core/common.pyc in _astype_nansafe(arr, dtype, copy)
2222 return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
2223 elif issubclass(dtype.type, compat.string_types):
-> 2224 return lib.astype_str(arr.ravel()).reshape(arr.shape)
2225
2226 if copy:

/opt/anaconda/envs/np18py27-1.9/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.astype_str (pandas/lib.c:12944)()

/opt/anaconda/envs/np18py27-1.9/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.astype_str (pandas/lib.c:12862)()

UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 6: ordinal not in range(128)
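The final line of the traceback can be reproduced without pandas. The sketch below is a minimal illustration, assuming the failure comes from an ASCII encoding step applied to a string that contains the left curly quote U+201C; the `caption` value is a hypothetical stand-in, not a row from the question's file. It also shows that an explicit UTF-8 round trip preserves the text.

```python
# Hypothetical caption containing curly quotes (U+201C and U+201D).
caption = u'0:03 \u201cA Man And His Truck\u201d'

# Encoding to ASCII fails because U+201C lies outside the 0-127 range.
try:
    caption.encode('ascii')
    message = None
except UnicodeEncodeError as exc:
    message = str(exc)

# Encoding with a codec that covers the character works, and decoding
# with the same codec returns the original text.
utf8_bytes = caption.encode('utf-8')
round_trip = utf8_bytes.decode('utf-8')

print(message)
print(round_trip == caption)
```

On Python 2, calling str() on a unicode value performs this kind of ASCII encoding implicitly, which is probably what happens inside the string conversion shown in the last traceback frames.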

My code is as follows:

import pandas as pd
import numpy as np

df = pd.read_csv('Chevrolet_4-7-2014_cvid_data.csv',encoding='utf-16',header=0,na_values=['N/A',''],names=['channel','link','title','posted','views','likes','dislikes','description','category','statdate','statviews','timewatched','averagetw','subsdriven','shares','caption'])
print df.head(5)
print df.dtypes


print df['caption'].head(10)

df['caption']=df['caption'].astype(unicode)
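For readers on Python 3, where the unicode builtin no longer exists and str is already Unicode text, the conversion in the last line might look like the sketch below. This is an assumption about a different Python version than the one in the question (Python 2.7), and the Series is a hypothetical stand-in for df['caption'].

```python
import pandas as pd

# Hypothetical stand-in for df['caption']; not the data from the question.
captions = pd.Series(
    [u'\u201cOpen House\u201d', u'0:03 A Man And His Truck'],
    name='caption',
    dtype=object,
)

# On Python 3, str is Unicode text, so no ASCII encoding step is involved.
as_text = captions.astype(str)

# Check that every element is a str and that the curly quotes survive.
all_text = all(isinstance(value, str) for value in as_text)
print(all_text)
print(as_text.iloc[0])
```

Note that astype(str) also turns missing values into literal strings, so it is worth handling NaN entries before converting a real column.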

The data looks like this:

channel                                        link  \
0 Chevrolet http://www.youtube.com/watch?v=dCayKZe6WvI
1 Chevrolet http://www.youtube.com/watch?v=IRXK35dPXbE
2 Chevrolet http://www.youtube.com/watch?v=XXdj4QMw748
3 Chevrolet http://www.youtube.com/watch?v=_ger32ROs94
4 Chevrolet http://www.youtube.com/watch?v=Chfm7Pou49k
5 Chevrolet http://www.youtube.com/watch?v=ySmEJyQ94BI

title posted views \
0 Chevy Open House Event: From Our House to Your... Apr 1 2014 73111
1 Truck Towing Capabilities: 2014 Silverado -- #... Mar 26 2014 11934
2 Potholes at the Milford Proving Grounds: Tips ... Mar 20 2014 8037
3 Diesel Trucks: Heavy Duty Strengths -- 2015 Si... Mar 20 2014 12096
4 Captain America: All in a Day's Work -- 2014 T... Mar 14 2014 93377
5 Media Blasting: Camaro Engineering -- 2014 Cam... Mar 13 2014 109931

likes dislikes description \
0 43 13 In March over 100000 people visited our Chevy ...
1 183 56 Farmer Dewayne Kleman and General Motors engin...
2 58 10 Chevrolet vehicles are carefully designed to w...
3 210 6 Introducing the all-new 2015 Silverado HD. The...
4 1095 35 From saving the world to working on math homew...

category statdate statviews timewatched averagetw subsdriven \
0 Autos & Vehicles NaN NaN NaN NaN NaN
1 Autos & Vehicles NaN NaN NaN NaN NaN
2 Autos & Vehicles NaN NaN NaN NaN NaN
3 Autos & Vehicles NaN NaN NaN NaN NaN
4 Autos & Vehicles NaN NaN NaN NaN NaN

shares caption
0 NaN The Chevy Spring Open House Sale the perfect ...
1 NaN 0:03 A Man And His Truck And An Engineer / To...
2 NaN 0:02 Severe Bump road sign 0:07 Pothole Facil...
3 NaN 0:03 And there's no stronger Silverado than t...
4 NaN 0:03 Are you doing anything fun Saturday nigh...
5 NaN 0:05 Camaro Z/28 logo 0:07 Z/28 Bead Lock 0:0...

[5 rows x 16 columns]
channel object
link object
title object
posted object
views object
likes int64
dislikes int64
description object
category object
statdate object
statviews float64
timewatched object
averagetw object
subsdriven float64
shares float64
caption object

dtype: object
0 The Chevy Spring Open House Sale the perfect ...
1 0:03 A Man And His Truck And An Engineer / To...
2 0:02 Severe Bump road sign 0:07 Pothole Facil...
3 0:03 And there's no stronger Silverado than t...
4 0:03 Are you doing anything fun Saturday nigh...
5 0:05 Camaro Z/28 logo 0:07 Z/28 Bead Lock 0:0...

Name: caption, dtype: object

Best answer

Could you try adding dtype={'caption' : str} to your read_csv() call? Like this:

df = pd.read_csv('Chevrolet_4-7-2014_cvid_data.csv',
encoding='utf-16',
header=0,
na_values=['N/A',''],
names=[...],
dtype={'caption' : str})

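By the way, pandas uses header=0 by default. Not that I can see your CSV, but if you are using the names keyword argument, it may be redundant, since pandas will automatically use the column names if they are in row 0 of the CSV. Either way, let me know whether the other suggestion works for you. :)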
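The suggestion can be tried end to end on a small file. The following is a minimal sketch written for Python 3, not the asker's actual setup: the two-column sample, the temporary file, and the shortened names list are all hypothetical, since the real CSV is not available.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample with a header row and curly quotes in one caption.
sample = (
    u'channel,caption\n'
    u'Chevrolet,\u201cOpen House\u201d\n'
    u'Chevrolet,0:03 A Man And His Truck\n'
)

# Write the sample as UTF-16 so that read_csv has to decode it.
fd, path = tempfile.mkstemp(suffix='.csv')
os.close(fd)
with open(path, 'w', encoding='utf-16', newline='') as f:
    f.write(sample)

# header=0 together with names replaces the header row in the file;
# dtype asks pandas to keep the caption column as strings.
df = pd.read_csv(
    path,
    encoding='utf-16',
    header=0,
    na_values=['N/A', ''],
    names=['channel', 'caption'],
    dtype={'caption': str},
)
os.remove(path)

print(df['caption'].tolist())
```

When names is passed without header=0, pandas treats the first line of the file as data, so whether either argument is redundant depends on whether the CSV has its own header row.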

Regarding python - UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' when converting series object to unicode in pandas with utf-16, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/23117159/
