
Python - Remove accents from all files in a folder

Reposted. Author: 行者123. Updated: 2023-11-28 22:05:41

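I am trying to remove all accents from every source code file in a folder. I have already managed to build the list of files; the problem is that when I try to normalize with unicodedata, I get this error:

Traceback (most recent call last):
  File "/usr/lib/gedit-2/plugins/pythonconsole/console.py", line 336, in __run
    exec command in self.namespace
  File "", line 2, in
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf3 in position 25: invalid continuation byte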

if options.remove_nonascii:
    nERROR = 0
    print _("# Removing all acentuation from coding files in %s") % (options.folder)
    exts = ('.f90', '.f', '.cpp', '.c', '.hpp', '.h', '.py'); files=set()
    for dirpath, dirnames, filenames in os.walk(options.folder):
        for filename in (f for f in filenames if f.endswith(exts)):
            files.add(os.path.join(dirpath,filename))
    for i in range(len(files)):
        f = files.pop()
        os.rename(f,f+'.BACK')
        with open(f,'w') as File:
            for line in open(f+'.BACK').readlines():
                try:
                    newLine = unicodedata.normalize('NFKD',unicode(line)).encode('ascii','ignore')
                    File.write(newLine)
                except UnicodeDecodeError:
                    nERROR +=1
                    print "ERROR n %i - Could not remove from Line: %i" % (nERROR,i)
                    newLine = line
                    File.write(newLine)

Best answer

It looks like the file may have been encoded with the cp1252 codec:

In [18]: print('\xf3'.decode('cp1252'))
ó

unicode(line) fails because unicode is trying to decode line with the utf-8 codec instead, which is why you get the error UnicodeDecodeError: 'utf8' codec can't decode...
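To make the failure concrete, here is a small Python 3 sketch (the session above is Python 2) that decodes the same bytes with both codecs. The sample bytes are a made-up line, not taken from the asker's files.

```python
# Python 3 sketch: the same bytes decoded with two different codecs.
raw = b"funci\xf3n"  # hypothetical sample line; 0xf3 is 'ó' in cp1252

# cp1252 is a single-byte encoding, so 0xf3 decodes on its own to 'ó'.
as_cp1252 = raw.decode("cp1252")

# In UTF-8, 0xf3 is the lead byte of a 4-byte sequence, so the next byte
# must be a continuation byte (0x80-0xBF). 'n' is not one, so decoding fails.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc.reason

print(as_cp1252)
print(utf8_error)
```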

You could try decoding line with cp1252 first and, if that fails, try utf-8:

# Python 2. Assumes os and unicodedata are imported, and that options and _
# come from the surrounding program.
if options.remove_nonascii:
    nERROR = 0
    print _("# Removing all acentuation from coding files in %s") % (options.folder)
    exts = ('.f90', '.f', '.cpp', '.c', '.hpp', '.h', '.py'); files=set()
    for dirpath, dirnames, filenames in os.walk(options.folder):
        for filename in (f for f in filenames if f.endswith(exts)):
            files.add(os.path.join(dirpath,filename))
    for i,f in enumerate(files):
        os.rename(f,f+'.BACK')
        with open(f,'w') as fout:
            with open(f+'.BACK','r') as fin:
                for line in fin:
                    try:
                        try:
                            line=line.decode('cp1252')
                        except UnicodeDecodeError:
                            line=line.decode('utf-8')
                            # If this still raises a UnicodeDecodeError, let the outer
                            # except block handle it
                        newLine = unicodedata.normalize('NFKD',line).encode('ascii','ignore')
                        fout.write(newLine)
                    except UnicodeDecodeError:
                        nERROR +=1
                        print "ERROR n %i - Could not remove from Line: %i" % (nERROR,i)
                        # line is still the undecoded byte string here; write it unchanged
                        newLine = line
                        fout.write(newLine)
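For readers on Python 3, where unicode() and str.decode no longer exist, the same fallback idea might look like the sketch below. The names decode_bytes, to_ascii and strip_accents_in_file are hypothetical helpers, not part of the original answer. The sketch also tries UTF-8 before cp1252, which is a different order from the code above: Python's cp1252 codec rejects only five byte values (0x81, 0x8d, 0x8f, 0x90, 0x9d), so a UTF-8 attempt placed after it would rarely run.

```python
import os
import unicodedata


def decode_bytes(raw, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first codec that decodes raw.

    Returns (None, None) if every codec fails.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return None, None


def to_ascii(raw):
    """Strip accents from one line of bytes.

    Lines that no codec can decode are returned unchanged.
    """
    text, _ = decode_bytes(raw)
    if text is None:
        return raw
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore")


def strip_accents_in_file(path):
    """Rewrite path without accents, keeping the original as path + '.BACK'."""
    backup = path + ".BACK"
    os.rename(path, backup)
    # Binary mode on both sides, so no implicit decoding happens.
    with open(backup, "rb") as fin, open(path, "wb") as fout:
        for raw in fin:
            fout.write(to_ascii(raw))
```

Calling strip_accents_in_file on each path collected by the os.walk loop would replace the inner loop above. Like the original, it decides the encoding line by line, so a file that mixes encodings is still handled.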

By the way,

unicodedata.normalize('NFKD',line).encode('ascii','ignore')

is a bit dangerous. For example, it removes u'ß' and some quotation marks entirely:

In [23]: unicodedata.normalize('NFKD',u'ß').encode('ascii','ignore')
Out[23]: ''

In [24]: unicodedata.normalize('NFKD',u'‘’“”').encode('ascii','ignore')
Out[24]: ''

If this is a problem, use the unidecode module:

In [25]: import unidecode
In [28]: print(unidecode.unidecode(u'‘’“”ß'))
''""ss
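If installing a third-party module is not an option, one standard-library alternative is to remove only the combining marks and leave every other character alone. This is a different technique from both approaches above, shown as a Python 3 sketch; drop_combining_marks is a hypothetical helper name. Note that the result is not guaranteed to be pure ASCII, since characters such as 'ß' and curly quotes are kept as they are.

```python
import unicodedata


def drop_combining_marks(text):
    """Remove accents by deleting combining marks after canonical decomposition."""
    # NFD splits 'ó' into 'o' followed by U+0301 (combining acute accent).
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left into canonical composed form.
    return unicodedata.normalize("NFC", stripped)


print(drop_combining_marks("función"))
print(drop_combining_marks("ß‘’“”"))
```

Whether keeping 'ß' and the quotes is acceptable depends on whether the output really has to be ASCII; if it does, unidecode is the better fit.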

Regarding "Python - Remove accents from all files in a folder", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/4935347/
