java - OutOfMemoryError when detecting UTF-8 encoding

Reposted. Author: 行者123. Updated: 2023-12-02 07:12:35

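This class should check currentFile and detect its encoding. If the result is UTF-8, it returns true.

After running it, the output is: java.lang.OutOfMemoryError: Java heap space

The data is read with Files.readAllBytes(path), which requires JDK 7.

Code: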

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class EncodingsCheck implements Checker {

    @Override
    public boolean check(File currentFile) {
        return isUTF8(currentFile);
    }

    public static boolean isUTF8(File file) {
        // validate input
        if (null == file) {
            throw new IllegalArgumentException("input file can't be null");
        }
        if (file.isDirectory()) {
            throw new IllegalArgumentException(
                    "input file refers to a directory");
        }

        // read input file
        byte[] buffer;
        try {
            buffer = readUTFHeaderBytes(file);
        } catch (IOException e) {
            throw new IllegalArgumentException(
                    "Can't read input file, error = " + e.getLocalizedMessage());
        }

        // an empty file has no first byte to inspect
        if (buffer.length == 0) {
            return false;
        }

        if (0 == (buffer[0] & 0x80)) {
            return true; // ASCII subset character, fast path
        } else if (0xF0 == (buffer[0] & 0xF8)) { // start of 4-byte sequence
            // compare the array length, not the byte value, against the sequence size
            if (buffer.length < 4) {
                return false;
            }
            if ((0x80 == (buffer[1] & 0xC0)) && (0x80 == (buffer[2] & 0xC0))
                    && (0x80 == (buffer[3] & 0xC0))) {
                return true;
            }
        } else if (0xE0 == (buffer[0] & 0xF0)) { // start of 3-byte sequence
            if (buffer.length < 3) {
                return false;
            }
            if ((0x80 == (buffer[1] & 0xC0)) && (0x80 == (buffer[2] & 0xC0))) {
                return true;
            }
        } else if (0xC0 == (buffer[0] & 0xE0)) { // start of 2-byte sequence
            if (buffer.length < 2) {
                return false;
            }
            if (0x80 == (buffer[1] & 0xC0)) {
                return true;
            }
        }

        return false;
    }

    private static byte[] readUTFHeaderBytes(File input) throws IOException {
        // read data (this loads the entire file into memory)
        Path path = Paths.get(input.getAbsolutePath());
        byte[] data = Files.readAllBytes(path);
        return data;
    }
}

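Questions:

  • How can this problem be solved?
  • How can UTF-16 be checked in the same way (do we need to worry about it, or is it just a useless hassle)?

Best answer

You don't need to read the whole file.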

private static byte[] readUTFHeaderBytes(File input) throws IOException {
    // Read only the first four bytes instead of loading the whole file.
    // try-with-resources (JDK 7) closes the stream even if read() throws.
    try (FileInputStream fileInputStream = new FileInputStream(input)) {
        byte[] firstBytes = new byte[4];
        int count = fileInputStream.read(firstBytes);
        if (count < 4) {
            throw new IOException("File is empty or shorter than 4 bytes");
        }
        return firstBytes;
    }
}

To detect other UTF encodings, compare the first bytes of the file with the following byte order mark (BOM) patterns:

Bytes           Encoding Form
00 00 FE FF     UTF-32, big-endian
FF FE 00 00     UTF-32, little-endian
FE FF           UTF-16, big-endian
FF FE           UTF-16, little-endian
EF BB BF        UTF-8
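
The patterns above could be checked roughly as follows. This is only a sketch, not part of the original answer: the class name BomSniffer and both detectBom methods are hypothetical names chosen for illustration, and the returned strings are plain labels. It tests the UTF-32 patterns before the UTF-16 ones, because FF FE 00 00 begins with the same two bytes as the UTF-16 little-endian mark, and it tolerates files shorter than four bytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not taken from the original answer.
final class BomSniffer {

    /**
     * Returns a label for the encoding implied by a leading BOM,
     * or null if the first bytes match none of the patterns.
     * Only the first 'length' bytes of 'head' are considered valid.
     */
    static String detectBom(byte[] head, int length) {
        // Java bytes are signed, so mask to 0..255; -1 marks "byte not present".
        int b0 = length > 0 ? head[0] & 0xFF : -1;
        int b1 = length > 1 ? head[1] & 0xFF : -1;
        int b2 = length > 2 ? head[2] & 0xFF : -1;
        int b3 = length > 3 ? head[3] & 0xFF : -1;

        // UTF-32 first: FF FE 00 00 would otherwise match the UTF-16 LE mark.
        if (b0 == 0x00 && b1 == 0x00 && b2 == 0xFE && b3 == 0xFF) return "UTF-32BE";
        if (b0 == 0xFF && b1 == 0xFE && b2 == 0x00 && b3 == 0x00) return "UTF-32LE";
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return "UTF-8";
        if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
        if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
        return null;
    }

    /** Reads at most four bytes from the file and checks them for a BOM. */
    static String detectBom(Path file) throws IOException {
        byte[] head = new byte[4];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested; loop until full or EOF.
            while (total < head.length) {
                int n = in.read(head, total, head.length - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
        }
        return detectBom(head, total);
    }
}
```

A null result only means that no BOM is present; the file may still be UTF-8, since the BOM is optional there. A UTF-16 little-endian file whose first character is U+0000 would be labelled UTF-32LE by this check, which is an ambiguity of the byte patterns themselves.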

Regarding java - OutOfMemoryError when detecting UTF-8 encoding, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/15326361/
