
python - C Python API Extensions is ignoring open(errors="ignore") and keeps throwing the encoding exception

Reposted. Author: 太空宇宙. Updated: 2023-11-03 23:37:13

Given a file /myfiles/file_with_invalid_encoding.txt that contains invalid UTF-8:

parse this correctly
Føö»BÃ¥r
also parse this correctly

I am using the built-in Python open function from the C API, as in the following minimal example (C Python setup boilerplate excluded):

const char* filepath = "/myfiles/file_with_invalid_encoding.txt";
PyObject* iomodule = PyImport_ImportModule( "builtins" );

if( iomodule == NULL ) {
    PyErr_PrintEx(100); return;
}
PyObject* openfunction = PyObject_GetAttrString( iomodule, "open" );

if( openfunction == NULL ) {
    PyErr_PrintEx(100); return;
}

PyObject* openfile = PyObject_CallFunction( openfunction,
        "s", filepath, "s", "r", "i", -1, "s", "UTF8", "s", "ignore" );

if( openfile == NULL ) {
    PyErr_PrintEx(100); return;
}

PyObject* iterfunction = PyObject_GetAttrString( openfile, "__iter__" );
Py_DECREF( openfunction );

if( iterfunction == NULL ) {
    PyErr_PrintEx(100); return;
}
PyObject* openfileresult = PyObject_CallObject( iterfunction, NULL );
Py_DECREF( iterfunction );

if( openfileresult == NULL ) {
    PyErr_PrintEx(100); return;
}
PyObject* fileiterator = PyObject_GetAttrString( openfile, "__next__" );
Py_DECREF( openfileresult );
if( fileiterator == NULL ) {
    PyErr_PrintEx(100); return;
}
PyObject* readline;
std::cout << "Here 1!" << std::endl;

while( ( readline = PyObject_CallObject( fileiterator, NULL ) ) != NULL ) {
    std::cout << "Here 2!" << std::endl;
    std::cout << PyUnicode_AsUTF8( readline ) << std::endl;
    Py_DECREF( readline );
}
PyErr_PrintEx(100);
PyErr_Clear();

PyObject* closefunction = PyObject_GetAttrString( openfile, "close" );

if( closefunction == NULL ) {
    PyErr_PrintEx(100); return;
}

PyObject* closefileresult = PyObject_CallObject( closefunction, NULL );
Py_DECREF( closefunction );

if( closefileresult == NULL ) {
    PyErr_PrintEx(100); return;
}

Py_XDECREF( closefileresult );
Py_XDECREF( iomodule );
Py_XDECREF( openfile );
Py_XDECREF( fileiterator );

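I am calling the open function with the ignore argument so that encoding errors are skipped, but Python disregards it and keeps throwing encoding exceptions whenever it finds an invalid UTF-8 character: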

Here 1!
Traceback (most recent call last):
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 26: invalid start byte

As you can see above and below, I pass the ignore argument when I call builtins.open(), but it has no effect. I also tried changing ignore to replace, but C Python still throws the exception:

PyObject* openfile = PyObject_CallFunction( openfunction,
        "s", filepath, "s", "r", "i", -1, "s", "UTF8", "s", "ignore" );

Best answer

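PyObject_CallFunction (like Py_BuildValue and others) takes a single format string that describes all of the arguments. When you write

PyObject* openfile = PyObject_CallFunction( openfunction,
        "s", filepath, "s", "r", "i", -1, "s", "UTF8", "s", "ignore" );

you are saying "one string argument", and every argument after filepath is ignored. The call therefore reaches Python as open(filepath), with errors left at its default of strict. Instead, you should write:

PyObject* openfile = PyObject_CallFunction( openfunction,
        "ssiss", filepath, "r", -1, "UTF8", "ignore" );

which says "5 arguments: two strings, an int, and two more strings". Even if you choose to use one of the other PyObject_Call* functions, you will find it easier to build the arguments with Py_BuildValue in the same way.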
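To see the difference from the Python side, here is a minimal sketch of what the two C calls amount to once they reach open(). It rests on a few assumptions: the sample bytes are a reconstruction of the question's file (valid UTF-8 plus one stray 0xBB byte at offset 26, the position named in the traceback), the helper names read_lines and demo are hypothetical, and the encoding is passed explicitly in the strict case so the result does not depend on the locale, whereas the original one-argument call would use the locale default.

```python
import os
import tempfile

# Assumed reconstruction of the sample file: valid UTF-8 except for a lone
# 0xBB byte at offset 26, matching the position reported in the traceback.
SAMPLE = (
    b"parse this correctly\n"
    b"F\xc3\xb8\xc3\xb6\xbbB\xc3\xa5r\n"
    b"also parse this correctly\n"
)


def read_lines(path, *open_args):
    """Call open() positionally, as PyObject_CallFunction does, and read every line.

    Returns the file object's effective `errors` setting and the decoded lines.
    """
    with open(path, *open_args) as handle:
        return handle.errors, list(handle)


def demo():
    # Write the sample bytes to a temporary file (hypothetical stand-in for
    # /myfiles/file_with_invalid_encoding.txt).
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(SAMPLE)
        path = tmp.name
    try:
        # Without an errors argument, open() keeps errors="strict", which is
        # what happens when only the first format unit is consumed.
        try:
            read_lines(path, "r", -1, "UTF8")
            strict_outcome = "decoded without error"
        except UnicodeDecodeError as exc:
            strict_outcome = exc.reason
        # With all five positional arguments, errors="ignore" takes effect.
        effective_errors, lines = read_lines(path, "r", -1, "UTF8", "ignore")
    finally:
        os.remove(path)
    return strict_outcome, effective_errors, lines


if __name__ == "__main__":
    print(demo())
```

Checking the errors attribute of the returned file object is a quick way to confirm which arguments open() actually received: if it reports strict after a call that was meant to pass ignore, the arguments never arrived.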

Regarding "python - C Python API Extensions is ignoring open(errors="ignore") and keeps throwing the encoding exception", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56482802/
