
python - File contains \u00c2\u00a0, convert to characters

Reposted. Author: 太空宇宙. Updated: 2023-11-04 08:26:10

I have a JSON file that contains text like this:

 .....wax, and voila!\u00c2\u00a0At the moment you can't use our ...

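My simple question is how to convert (not remove) these \u codes into spaces, apostrophes, etc.?

Input: a text file containing .....wax, and voila!\u00c2\u00a0At the moment you can't use our ...

Output: .....wax, and voila!(converted to a line break)At the moment you can't use our ...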
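
For context, `\u00c2\u00a0` is what a non-breaking space (U+00A0) looks like when its UTF-8 bytes `C2 A0` are misread as Latin-1 and then JSON-escaped. The sketch below shows how that mangling can arise and how it reverses. It assumes a Latin-1 misread is what happened upstream, which the question does not confirm.

```python
import json

nbsp = "\u00a0"                          # a real non-breaking space
utf8_bytes = nbsp.encode("utf-8")        # b'\xc2\xa0'

# Assumption: something upstream decoded the UTF-8 bytes as Latin-1.
mangled = utf8_bytes.decode("latin-1")   # 'Â' followed by a non-breaking space

# JSON-escaping the mangled text yields the sequence seen in the question.
escaped = json.dumps(mangled)            # '"\\u00c2\\u00a0"'

# Undoing the misread recovers the original single character.
repaired = mangled.encode("latin-1").decode("utf-8")
print(escaped, repaired == nbsp)
```

So each `\u00XX` in such a run stands for one raw byte, not one character, which is why the escapes have to be regrouped into bytes before decoding.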

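Python code

import requests


def TEST():
    # .text is already a decoded str in Python 3, so it has no .decode() method
    export = requests.get('https://sample.uk/', auth=('user', 'pass')).text

    # Write the text as UTF-8 explicitly instead of relying on the platform default
    with open("TEST.json", 'w', encoding='utf8') as file:
        file.write(export)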

What I have tried:

  • Using .json()
  • Various combinations of .encode().decode(), etc.

Edit 1

When I upload this file to BigQuery, a stray Â symbol (the \u00c2 from the text above) shows up.

Larger sample:

{
"xxxx1": "...You don\u2019t nee...",
"xxxx2": "...Gu\u00e9rer...",
"xxxx3": "...boost.\u00a0Sit back an....",
"xxxx4": "\" \u306f\u3058\u3081\u307e\u3057\u3066\"",
"xxxx5": "\u00a0\n\u00a0",
"xxxx6": "It was Christmas Eve babe\u2026",
"xxxx7": "It\u2019s xxx xxx\u2026"
}

Python code:

import json
import re
import codecs


def load():
    epos_export = r'{"xxxx1": "...You don\u2019t nee...","xxxx2": "...Gu\u00e9rer...","xxxx3": "...boost.\u00a0Sit back an....","xxxx4": "\" \u306f\u3058\u3081\u307e\u3057\u3066\"","xxxx5": "\u00a0\n\u00a0","xxxx6": "It was Christmas Eve babe\u2026","xxxx7": "It\u2019s xxx xxx\u2026"}'
    x = json.loads(re.sub(r"(?i)(?:\\u00[0-9a-f]{2})+", unmangle_utf8, epos_export))

    with open("TEST.json", "w") as file:
        json.dump(x, file)


def unmangle_utf8(match):
    escaped = match.group(0)                   # '\\u00e2\\u0082\\u00ac'
    hexstr = escaped.replace(r'\u00', '')      # 'e282ac'
    buffer = codecs.decode(hexstr, "hex")      # b'\xe2\x82\xac'

    try:
        return buffer.decode('utf8')           # '€'
    except UnicodeDecodeError:
        print("Could not decode buffer: %s" % buffer)
        # Not valid UTF-8 (e.g. a lone \u00e9 or \u00a0): keep the escapes
        # as they are instead of returning None, which would drop them
        return escaped


if __name__ == '__main__':
    load()

Best answer

I made this crude UTF-8 unmangler, which appears to solve your messed-up encoding situation:

import codecs
import re
import json


def unmangle_utf8(match):
    escaped = match.group(0)                   # '\\u00e2\\u0082\\u00ac'
    hexstr = escaped.replace(r'\u00', '')      # 'e282ac'
    buffer = codecs.decode(hexstr, "hex")      # b'\xe2\x82\xac'

    try:
        return buffer.decode('utf8')           # '€'
    except UnicodeDecodeError:
        print("Could not decode buffer: %s" % buffer)
        # Not valid UTF-8: keep the original escapes rather than returning
        # None, which re.sub would treat as an empty replacement
        return escaped

Usage:

broken_json = '{"some_key": "... \\u00e2\\u0080\\u0099 w\\u0061x, and voila!\\u00c2\\u00a0\\u00c2\\u00a0At the moment you can\'t use our \\u00e2\\u0082\\u00ac ..."}'
print("Broken JSON\n", broken_json)

converted = re.sub(r"(?i)(?:\\u00[0-9a-f]{2})+", unmangle_utf8, broken_json)
print("Fixed JSON\n", converted)

data = json.loads(converted)
print("Parsed data\n", data)
print("Single value\n", data['some_key'])

It uses a regular expression to pick up the hex sequences from your string, converts them to individual bytes, and decodes them as UTF-8.
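
To make those three steps concrete, here is a small trace of a single match, using the same regular expression and the `€` sequence from the usage string. The variable names are only illustrative.

```python
import codecs
import re

pattern = re.compile(r"(?i)(?:\\u00[0-9a-f]{2})+")
text = r"our \u00e2\u0082\u00ac ..."     # raw string: the backslashes are literal

match = pattern.search(text)
escaped = match.group(0)                 # step 1: the run of \u00XX escapes
hexstr = escaped.replace(r"\u00", "")    # step 2a: keep only the hex digits
raw = codecs.decode(hexstr, "hex")       # step 2b: hex digits -> raw bytes
char = raw.decode("utf-8")               # step 3: bytes -> the intended character

print(escaped, hexstr, raw, char)
```

Note that the pattern only matches escapes of the form `\u00XX`, so ordinary escapes such as `\u2019` pass through untouched and are decoded later by `json.loads`.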

For the sample string above (I have included the 3-byte character € as a test) it prints:

Broken JSON
 {"some_key": "... \u00e2\u0080\u0099 w\u0061x, and voila!\u00c2\u00a0\u00c2\u00a0At the moment you can't use our \u00e2\u0082\u00ac ..."}
Fixed JSON
 {"some_key": "... ’ wax, and voila!  At the moment you can't use our € ..."}
Parsed data
 {'some_key': "... ’ wax, and voila!\xa0\xa0At the moment you can't use our € ..."}
Single value
 ... ’ wax, and voila!  At the moment you can't use our € ...

The \xa0 in "Parsed data" is an artifact of how Python prints dicts to the console; it is still an actual non-breaking space.
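
A short sketch of that difference, and of how the same character ends up in a file: printing a dict shows the `repr` of each string, which escapes the non-breaking space, while the string itself holds the real character. The `ensure_ascii=False` part is an addition that the answer does not mention; it is a standard `json.dumps` / `json.dump` parameter that writes non-ASCII characters literally instead of as `\uXXXX` escapes.

```python
import json

data = {"some_key": "voila!\u00a0At the moment"}

shown_in_dict = repr(data)     # the dict repr escapes the character as \xa0
value = data["some_key"]       # the string itself contains a real U+00A0

# Default: non-ASCII characters are written as \uXXXX escapes.
as_escapes = json.dumps(data)
# With ensure_ascii=False the characters are written as-is.
as_text = json.dumps(data, ensure_ascii=False)

print(shown_in_dict)
print(as_escapes)
print(as_text)
```

Both JSON forms parse back to the same data, so the choice only affects how the file looks to tools that read it as plain text.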

Regarding "python - File contains \u00c2\u00a0, convert to characters", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/56955320/
