string - Loading a file with two different encodings in Ruby

Reposted. Author: 行者123. Updated: 2023-12-02 02:14:08

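I have a large file that contains two different encodings. The "main" file is UTF-8, but some characters, such as <80> (the euro sign in isoxxx) or <9F> (ß in isoxxx), are encoded in ISO-8859-1. I can replace the invalid characters with this: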

 string.encode("iso8859-1", "utf-8", {:invalid => :replace, :replace => "-"}).encode("utf-8")

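The problem is that I need those wrongly encoded characters, so replacing them with "-" is no use to me. How can I repair the wrongly encoded characters in the document with Ruby?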

Edit: I tried the :fallback option, but without success (no replacement was made):

 string.encode("iso8859-1", "utf-8",
   :fallback => {"\x80" => "123"}
 )
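
A short sketch, assuming Ruby 1.9 or later, of one likely reason the :fallback attempt has no effect: :fallback is consulted for characters that are valid in the source encoding but have no equivalent in the target encoding, whereas a stray <80> byte is an invalid byte sequence, which takes a different code path. The variable names (`euro`, `mixed`, `outcome`) are illustrative only.

```ruby
# Case 1: a valid UTF-8 character that ISO-8859-1 cannot represent.
# This is an "undefined conversion", so the :fallback hash is consulted.
euro = "\u20AC"
converted = euro.encode("ISO-8859-1", :fallback => { "\u20AC" => "EUR" })

# Case 2: a byte that is not valid UTF-8 at all.
# This is an "invalid byte sequence"; :fallback is not used for it.
mixed = "a\x80b".dup.force_encoding("UTF-8")
begin
  mixed.encode("ISO-8859-1", :fallback => { "\x80" => "123" })
  outcome = :converted
rescue Encoding::InvalidByteSequenceError
  outcome = :invalid_byte_sequence
end

puts converted
puts outcome
```

Invalid bytes are governed by the :invalid option instead, which is why the first snippet in the question was able to replace them.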

Best answer

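I used the following code (Ruby 1.8.7). It tests every byte >= 128 (that is, outside the ASCII range) to check whether it starts a valid UTF-8 sequence. If it does not, the byte is assumed to be ISO-8859-1 and is converted to UTF-8.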

Since your file is large, this process may be slow!

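class String
  # Returns a copy in which every char is valid UTF-8.
  # Bytes that are not part of a valid UTF-8 sequence are treated as ISO-8859-1.
  # based on http://php.net/manual/en/function.utf8-encode.php#39986
  def utf8
    ret = ''

    # scan the string byte by byte
    # I'd use self.each_byte do |b|, but the index i has to be advanced manually
    a = self.unpack('C*')
    i = 0
    l = a.length
    while i < l
      b = a[i]
      i += 1

      # if it's ascii, copy it unchanged
      if b < 0x80
        ret << b.chr
        next
      end

      # check whether it's the beginning of a valid utf-8 sequence;
      # n ends up as the number of continuation bytes that should follow
      m = [0xc0, 0xe0, 0xf0, 0xf8, 0xfc, 0xfe]
      n = 0
      # stop at m.length so that m[n] is never nil
      n += 1 until n >= m.length || (b & m[n]) == m[n-1]

      # if not, treat the byte as iso8859-1 and convert it to utf-8
      if n >= m.length
        ret << [b].pack('U')
        next
      end

      # if yes, check whether the rest of the sequence is utf-8, too
      r = [b]
      u = true

      # do n bytes matching 10bbbbbb follow?
      n.times do
        u = i < l && (a[i] & 0xc0) == 0x80
        # leave a non-matching byte for the next iteration of the outer loop
        break unless u
        r << a[i]
        i += 1
      end

      # complete sequence: copy the bytes; otherwise convert each byte
      chunk = r.pack(u ? 'C*' : 'U*')
      # Ruby 1.9+ tags pack('C*') output as binary, so mark it as UTF-8
      chunk.force_encoding('UTF-8') if chunk.respond_to?(:force_encoding)
      ret << chunk
    end

    ret
  end

  def utf8!
    replace utf8
  end
end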

# let s be the string containing your file.
s2 = s.utf8

# or
s.utf8!
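
On Ruby 2.1 or later, the same idea (keep valid UTF-8, reinterpret everything else as a single-byte legacy encoding) can be sketched with the built-in `String#scrub`, which passes each invalid byte run to a block. This is a minimal sketch, not the answer's original method: `repair_mixed_encoding`, `repair_file` and the `legacy_encoding` parameter are hypothetical names. Note that byte 0x80 is the euro sign in Windows-1252, while in ISO-8859-1 it is a control character, so passing 'Windows-1252' may match the data described in the question better.

```ruby
# Returns a valid UTF-8 copy of +data+. Byte runs that are not valid UTF-8
# are reinterpreted as +legacy_encoding+ (a single-byte encoding) and
# transcoded to UTF-8.
def repair_mixed_encoding(data, legacy_encoding = 'ISO-8859-1')
  text = data.dup.force_encoding(Encoding::UTF_8)
  text.scrub do |bad|
    # bad holds only the offending bytes; valid sequences are left untouched
    bad.b.force_encoding(legacy_encoding).encode(Encoding::UTF_8)
  end
end

# Repairs a file line by line so a large file is never held in memory at once.
# Splitting on "\n" in binary mode is safe because byte 0x0A never occurs
# inside a multi-byte UTF-8 sequence.
def repair_file(src_path, dst_path, legacy_encoding = 'ISO-8859-1')
  File.open(dst_path, 'wb') do |out|
    File.open(src_path, 'rb') do |input|
      input.each_line do |line|
        out.write(repair_mixed_encoding(line, legacy_encoding))
      end
    end
  end
end

# UTF-8 text containing one stray ISO-8859-1 byte (0xDF, "ß")
fixed = repair_mixed_encoding("Stra\xDFe 5 \xE2\x82\xAC gr\xC3\xBC\xC3\x9F".b)
puts fixed
```

Like the byte-scanning code above, this cannot tell apart legacy bytes that happen to form a valid UTF-8 sequence; such bytes are kept as they are.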

Regarding "string - Loading a file with two different encodings in Ruby", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/11395253/
