
python - Truncate a string to a byte length in Python

Reposted. Author: 太空狗. Updated: 2023-10-29 17:58:52

I have a function here that truncates a given string to a given byte length:

LENGTH_BY_PREFIX = [
    (0xC0, 2),  # first byte mask, total codepoint length
    (0xE0, 3),
    (0xF0, 4),
    (0xF8, 5),
    (0xFC, 6),
]

def codepoint_length(first_byte):
    if first_byte < 128:
        return 1  # ASCII
    for mask, length in LENGTH_BY_PREFIX:
        if first_byte & mask == mask:
            return length
    assert False, 'Invalid byte %r' % first_byte

def cut_string_to_bytes_length(unicode_text, byte_limit):
    utf8_bytes = unicode_text.encode('UTF-8')
    cut_index = 0
    while cut_index < len(utf8_bytes):
        step = codepoint_length(ord(utf8_bytes[cut_index]))
        if cut_index + step > byte_limit:
            # can't go a whole codepoint further, time to cut
            return utf8_bytes[:cut_index]
        else:
            cut_index += step
    # length limit is longer than our byte string, so no cutting
    return utf8_bytes

This seemed to work fine until emoji came into the picture:

string = u"\ud83d\ude14"
trunc = cut_string_to_bytes_length(string, 100)

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "<console>", line 5, in cut_string_to_bytes_length
  File "<console>", line 7, in codepoint_length
AssertionError: Invalid byte 152

Can anyone explain exactly what is happening here, and what a possible solution would be?
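
A sketch of why the assertion fires, written in Python 3 (an assumption; the code in the question is Python 2). The table is scanned from the shortest prefix up and each entry is tested with `first_byte & mask == mask`, so the lead byte 0xF0 of a 4-byte sequence already matches the first entry (0xC0) and is reported as 2 bytes long. The cursor then lands on a continuation byte, 0x98 (152), which matches no entry. The names `buggy_length` and `fixed_length` are hypothetical, and `fixed_length` is only one possible repair.

```python
# Python 3 sketch; buggy_length mirrors the lookup order used in the question.
LENGTH_BY_PREFIX = [(0xC0, 2), (0xE0, 3), (0xF0, 4), (0xF8, 5), (0xFC, 6)]

def buggy_length(first_byte):
    """Lookup as in the question: the first matching mask wins."""
    if first_byte < 128:
        return 1
    for mask, length in LENGTH_BY_PREFIX:
        if first_byte & mask == mask:
            return length
    raise ValueError('Invalid byte %r' % first_byte)

def fixed_length(first_byte):
    """Hypothetical repair: test the longest prefix first."""
    if first_byte < 128:
        return 1
    for mask, length in reversed(LENGTH_BY_PREFIX):
        if first_byte & mask == mask:
            return length
    raise ValueError('Invalid byte %r' % first_byte)

data = '\U0001F614'.encode('utf-8')  # the four bytes F0 9F 98 94
print(list(data))                    # [240, 159, 152, 148]
print(buggy_length(data[0]))         # 2, because 0xF0 & 0xC0 == 0xC0
print(fixed_length(data[0]))         # 4
```

With the buggy lookup the loop advances 2 bytes instead of 4 and then calls the function on `data[2]`, which is 152.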

Edit: I have another snippet here that does not throw an exception, but sometimes behaves strangely:

import encodings
_incr_encoder = encodings.search_function('utf8').incrementalencoder()

def utf8_byte_truncate(text, max_bytes):
    """ truncate utf-8 text string to no more than max_bytes long """
    byte_len = 0
    _incr_encoder.reset()
    for index,ch in enumerate(text):
        byte_len += len(_incr_encoder.encode(ch))
        if byte_len > max_bytes:
            break
    else:
        return text
    return text[:index]

>>> string = u"\ud83d\ude14\ud83d\ude14\ud83d\ude14\ud83d\ude14\ud83d\ude14"
>>> print string
(prints a set of 5 Apple Emoji...)😔😔😔😔😔
>>> len(string)
10
>>> trunc = utf8_byte_truncate(string, 4)
>>> print trunc
???
>>> len(trunc)
1

So for the second example, I have a 10-byte string and truncate it to 4, but something odd happens: the result is a string 1 byte in size.
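
These numbers can be reproduced with a short Python 3 sketch, assuming the session above ran on a narrow Python 2 build (the reported `len(string)` of 10 suggests this). On such a build each emoji is stored as two UTF-16 surrogate code units, so `len` counts code units rather than bytes, and the loop encodes one surrogate at a time. The `surrogatepass` error handler is used here to imitate how Python 2 encodes a lone surrogate as 3 bytes; that correspondence is an assumption of this sketch.

```python
import struct

s = '\U0001F614' * 5

# Three different "lengths" of the same text.
utf16 = s.encode('utf-16-le')
units = struct.unpack('<%dH' % (len(utf16) // 2), utf16)
print(len(s))                  # 5 code points
print(len(units))              # 10 UTF-16 code units
print(len(s.encode('utf-8')))  # 20 UTF-8 bytes

# Replay the loop from the question over code units with max_bytes = 4.
max_bytes = 4
byte_len = 0
cut = len(units)
for index, unit in enumerate(units):
    # Each lone surrogate encodes to 3 bytes under surrogatepass.
    byte_len += len(chr(unit).encode('utf-8', 'surrogatepass'))
    if byte_len > max_bytes:
        cut = index
        break
print(cut)  # 1: the slice keeps only a lone high surrogate
```

Under these assumptions the running total goes 3, then 6, so the loop stops at index 1 and `text[:1]` is half of a surrogate pair, which would explain both the `???` output and the length of 1.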

Best answer

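As @jwpat7 pointed out, the algorithm is wrong. Below is a simpler algorithm, but note that some perceived single characters (called graphemes) are made up of multiple Unicode code points, such as 👨‍👩‍👧‍👦. This does not attempt to preserve graphemes.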

# NOTE: This is Python 2 to match OP's code

# s = u'\ud83d\ude14\ud83d\ude14\ud83d\ude14\ud83d\ude14\ud83d\ude14'
# Same as above
s = u'\U0001f614' * 5  # Unicode character U+1F614

def utf8_lead_byte(b):
    '''A UTF-8 intermediate byte starts with the bits 10xxxxxx.'''

    # (b & 0xC0) != 0x80  # Python 3 no need for ord()
    return (ord(b) & 0xC0) != 0x80

def utf8_byte_truncate(text, max_bytes):
    '''If text[max_bytes] is not a lead byte, back up until a lead byte is
    found and truncate before that character.'''
    utf8 = text.encode('utf8')
    if len(utf8) <= max_bytes:
        return utf8
    i = max_bytes
    while i > 0 and not utf8_lead_byte(utf8[i]):
        i -= 1
    return utf8[:i]

# test for various max_bytes:
for m in range(len(s.encode('utf8'))+1):
    b = utf8_byte_truncate(s,m)
    print m,len(b),b.decode('utf8')

### Output

0 0 
1 0
2 0
3 0
4 4 😔
5 4 😔
6 4 😔
7 4 😔
8 8 😔😔
9 8 😔😔
10 8 😔😔
11 8 😔😔
12 12 😔😔😔
13 12 😔😔😔
14 12 😔😔😔
15 12 😔😔😔
16 16 😔😔😔😔
17 16 😔😔😔😔
18 16 😔😔😔😔
19 16 😔😔😔😔
20 20 😔😔😔😔😔
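
For readers on Python 3, here is a sketch of the same back-up-to-a-lead-byte idea; this port is an assumption and not part of the original answer. Indexing `bytes` yields integers in Python 3, so `ord()` is dropped. The second function, `utf8_byte_truncate_ignore` (a hypothetical name), shows a different technique: slice the bytes and let the decoder discard the incomplete trailing sequence with `errors='ignore'`.

```python
def utf8_byte_truncate(text, max_bytes):
    """Return text encoded as UTF-8, cut to at most max_bytes bytes
    without splitting a code point."""
    utf8 = text.encode('utf-8')
    if len(utf8) <= max_bytes:
        return utf8
    i = max_bytes
    # Continuation bytes look like 10xxxxxx; back up to a lead byte.
    while i > 0 and (utf8[i] & 0xC0) == 0x80:
        i -= 1
    return utf8[:i]

def utf8_byte_truncate_ignore(text, max_bytes):
    """Alternative: slice, then drop the incomplete tail while decoding."""
    return text.encode('utf-8')[:max_bytes].decode('utf-8', errors='ignore')

s = '\U0001F614' * 5
for m in (0, 3, 4, 7, 8, 20):
    # Print the byte limit, the truncated byte length, and the
    # number of code points that survive.
    print(m, len(utf8_byte_truncate(s, m)), len(utf8_byte_truncate_ignore(s, m)))

# Graphemes are not preserved: this family emoji is seven code points
# (four people joined by three zero-width joiners), 25 bytes in UTF-8.
family = '\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466'
cut = utf8_byte_truncate(family, 10).decode('utf-8')
print(cut == '\U0001F468\u200d')  # True: one person plus a dangling joiner
```

Both functions stop on code point boundaries only. Keeping whole graphemes would need a segmentation step that neither of them attempts.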

Regarding python - Truncate a string to a byte length in Python, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/13727977/
