
postgresql - UTF16 hex to text

Reposted. Author: 行者123. Updated: 2023-11-29 11:38:44

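I have a UTF-16 hex representation such as "0633064406270645", which is "سلام" in Arabic.

I would like to convert it to the equivalent text. Is there a direct way to do this in PostgreSQL?

I can convert the UTF-8 byte sequence as shown below; unfortunately, UTF16 does not seem to be supported. Any ideas on how to do this in PostgreSQL? In the worst case I will write a function.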

SELECT convert_from (decode (E'D8B3D984D8A7D985', 'hex'),'UTF8');

"سلام"

SELECT convert_from (decode (E'0633064406270645', 'hex'),'UTF16');

ERROR: invalid source encoding name "UTF16"
********** Error **********

Best answer

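PostgreSQL does not support UTF-16 natively. I suggest converting your data to UTF-8 before supplying it to the database. If it is too late for that (the badly encoded data is already in your database), you can use these maintenance functions to convert data from UTF-16 (logic copied from Wikipedia):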

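-- Convert from bytea containing UTF-16BE data.
-- chr() takes Unicode code points only in a UTF8 database, and chr(0) raises an error.
CREATE OR REPLACE FUNCTION convert_from_utf16be(utf16_data bytea, invalid_replacement text DEFAULT '?')
  RETURNS text
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
  -- Combine each byte pair into one 16-bit code unit; keep the byte offset for ordering.
  WITH source(pos, unit) AS (
    SELECT i, (get_byte(utf16_data, i) << 8) | get_byte(utf16_data, i + 1)
    FROM generate_series(0, octet_length(utf16_data) - 2, 2) i
  ),
  -- Attach the neighbouring code units so surrogate pairs can be detected.
  codes(pos, lag, unit, lead) AS (
    SELECT pos, lag(unit, 1) OVER (ORDER BY pos), unit, lead(unit, 1) OVER (ORDER BY pos)
    FROM source
  )
  SELECT string_agg(CASE
    -- low surrogate (0xDC00..0xDFFF)
    WHEN unit >= 56320 AND unit <= 57343 THEN CASE
      WHEN lag >= 55296 AND lag <= 56319 THEN '' -- already processed
      ELSE invalid_replacement
    END
    -- high surrogate (0xD800..0xDBFF)
    WHEN unit >= 55296 AND unit <= 56319 THEN CASE
      -- 56613888 = (0xD800 << 10) + 0xDC00 - 0x10000
      WHEN lead >= 56320 AND lead <= 57343 THEN chr((unit << 10) + lead - 56613888)
      ELSE invalid_replacement
    END
    ELSE chr(unit)
  END, '' ORDER BY pos)
  FROM codes
$function$;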
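
The constant 56613888 in the surrogate branch is easy to misread, so here is a small standalone query that checks it against the textbook surrogate-pair formula. It is only a sketch for illustration: the column aliases are made up here, and the sample pair D852 DF62 is the one used in the answer's own test data.

```sql
-- Standalone check of the surrogate-pair arithmetic (no functions required).
SELECT
  -- the shortcut used in the function: (high << 10) + low - 56613888
  (x'D852'::int << 10) + x'DF62'::int - 56613888 AS code_point,
  -- the textbook form: 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
  65536 + ((x'D852'::int - x'D800'::int) << 10)
        + (x'DF62'::int - x'DC00'::int)          AS code_point_textbook,
  to_hex((x'D852'::int << 10) + x'DF62'::int - 56613888) AS code_point_hex,
  -- where the constant comes from
  (x'D800'::int << 10) + x'DC00'::int - x'10000'::int    AS magic_constant;
-- code_point = 150370, code_point_textbook = 150370,
-- code_point_hex = 24b62, magic_constant = 56613888
```

Note the parentheses around each shift: in PostgreSQL, `<<` binds less tightly than `+` and `-`, so `unit << 10 + lead` would not mean what it appears to.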

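-- Convert from bytea containing UTF-16LE data.
-- chr() takes Unicode code points only in a UTF8 database, and chr(0) raises an error.
CREATE OR REPLACE FUNCTION convert_from_utf16le(utf16_data bytea, invalid_replacement text DEFAULT '?')
  RETURNS text
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
  -- Combine each byte pair into one 16-bit code unit (low byte first); keep the byte offset for ordering.
  WITH source(pos, unit) AS (
    SELECT i, get_byte(utf16_data, i) | (get_byte(utf16_data, i + 1) << 8)
    FROM generate_series(0, octet_length(utf16_data) - 2, 2) i
  ),
  -- Attach the neighbouring code units so surrogate pairs can be detected.
  codes(pos, lag, unit, lead) AS (
    SELECT pos, lag(unit, 1) OVER (ORDER BY pos), unit, lead(unit, 1) OVER (ORDER BY pos)
    FROM source
  )
  SELECT string_agg(CASE
    -- low surrogate (0xDC00..0xDFFF)
    WHEN unit >= 56320 AND unit <= 57343 THEN CASE
      WHEN lag >= 55296 AND lag <= 56319 THEN '' -- already processed
      ELSE invalid_replacement
    END
    -- high surrogate (0xD800..0xDBFF)
    WHEN unit >= 55296 AND unit <= 56319 THEN CASE
      -- 56613888 = (0xD800 << 10) + 0xDC00 - 0x10000
      WHEN lead >= 56320 AND lead <= 57343 THEN chr((unit << 10) + lead - 56613888)
      ELSE invalid_replacement
    END
    ELSE chr(unit)
  END, '' ORDER BY pos)
  FROM codes
$function$;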
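
The big-endian and little-endian functions differ only in how each byte pair is combined into a 16-bit code unit. A minimal standalone query, assuming nothing beyond built-in functions (the aliases `d`, `unit_be` and `unit_le` are illustrative), makes the difference visible for the first two characters of the question's input:

```sql
-- Read the same four bytes (06 33 06 44) as big-endian and as little-endian code units.
SELECT i AS byte_offset,
       (get_byte(d, i) << 8) | get_byte(d, i + 1) AS unit_be,  -- first byte is the high byte
       get_byte(d, i) | (get_byte(d, i + 1) << 8) AS unit_le   -- first byte is the low byte
FROM (SELECT decode('06330644', 'hex') AS d) AS s,
     generate_series(0, octet_length(d) - 2, 2) AS i
ORDER BY i;
-- byte_offset | unit_be | unit_le
--           0 |    1587 |   13062
--           2 |    1604 |   17414
```

Here 1587 and 1604 are U+0633 and U+0644, the intended Arabic letters; reading the same bytes with the wrong byte order gives unrelated code points rather than an error, which is why the byte order has to be known or signalled by a BOM.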

-- Convert from bytea containing UTF-16 data (with or without BOM).
CREATE OR REPLACE FUNCTION convert_from_utf16(utf16_data bytea, invalid_replacement text DEFAULT '?')
  RETURNS text
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
  SELECT CASE COALESCE(octet_length(utf16_data), 0)
    WHEN 0 THEN ''
    WHEN 1 THEN invalid_replacement
    ELSE CASE substring(utf16_data FOR 2)
      -- BOM FF FE: little-endian, skip the BOM
      WHEN E'\\xFFFE' THEN convert_from_utf16le(substring(utf16_data FROM 3), invalid_replacement)
      -- BOM FE FF: big-endian, skip the BOM
      WHEN E'\\xFEFF' THEN convert_from_utf16be(substring(utf16_data FROM 3), invalid_replacement)
      -- no BOM: assume big-endian and keep every byte
      ELSE convert_from_utf16be(utf16_data, invalid_replacement)
    END
  END
$function$;

With these functions you can convert from the various flavors of UTF-16:

SELECT convert_from_utf16be(decode('0633064406270645D852DF62', 'hex')),
convert_from_utf16le(decode('330644062706450652D862DF', 'hex')),
convert_from_utf16(decode('FEFF0633064406270645D852DF62', 'hex')),
convert_from_utf16(decode('FFFE330644062706450652D862DF', 'hex'));

-- convert_from_utf16be | convert_from_utf16le | convert_from_utf16 | convert_from_utf16
-- ----------------------+----------------------+--------------------+-------------------
-- سلام𤭢 | سلام𤭢 | سلام𤭢 | سلام𤭢
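
Applied to the input from the question, and to malformed input, the calls might look like the sketch below. It assumes `convert_from_utf16be` from the answer has already been created in a UTF8-encoded database; the column aliases and the malformed byte strings are made up for illustration.

```sql
-- Assumes convert_from_utf16be (defined in the answer above) exists in a UTF8 database.
SELECT convert_from_utf16be(decode('0633064406270645', 'hex'))  AS question_input,
       -- 'A', an unpaired low surrogate DF62, 'B', with a custom replacement
       convert_from_utf16be(decode('0041DF620042', 'hex'), '#') AS lone_low_surrogate,
       -- 'A' followed by an unpaired high surrogate D852, default replacement
       convert_from_utf16be(decode('0041D852', 'hex'))          AS lone_high_surrogate;
-- question_input | lone_low_surrogate | lone_high_surrogate
-- سلام           | A#B                | A?
```

An unpaired surrogate is replaced by `invalid_replacement` rather than raising an error, so passing a distinctive marker makes damaged rows easier to find afterwards.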

Regarding postgresql - UTF16 hex to text, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/26607867/
