
windows - R: Cannot read Unicode text files even when specifying the encoding

Reposted. Author: 可可西里. Updated: 2023-11-01 13:23:43

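I am using R 3.1.1 on Windows 7 32-bit. I am having a lot of trouble reading some text files that I want to run text analysis on. According to Notepad++, the files are encoded as "UCS-2 Little Endian". (grepWin, a tool whose name says it all, reports the files as "Unicode".)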

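The problem is that I cannot seem to read the files even when I specify the encoding. (The characters belong to the standard Spanish Latin set, -ñáó-, and should be easy to handle with CP1252 or something similar.)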

> Sys.getlocale()
[1] "LC_COLLATE=Spanish_Spain.1252;LC_CTYPE=Spanish_Spain.1252;LC_MONETARY=Spanish_Spain.1252;LC_NUMERIC=C;LC_TIME=Spanish_Spain.1252"
> readLines("filename.txt")
[1] "ÿþE" "" "" "" "" ...
> readLines("filename.txt",encoding="UTF-8")
[1] "\xff\xfeE" "" "" "" "" ...
> readLines("filename.txt",encoding="UCS2LE")
[1] "ÿþE" "" "" "" "" "" "" ...
> readLines("filename.txt",encoding="UCS2")
[1] "ÿþE" "" "" "" "" ...

Any ideas?

Thanks!!


Edit: the "UTF-16", "UTF-16LE" and "UTF-16BE" encodings fail in the same way.

Best answer

After reading the documentation more closely, I found the answer to my question.

The encoding argument of readLines only applies to the input strings. The documentation says:

encoding to be assumed for input strings. It is used to mark character strings as known to be in Latin-1 or UTF-8: it is not used to re-encode the input. To do the latter, specify the encoding as part of the connection con or via options(encoding=): see the examples. See also ‘Details’.
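The difference between marking a string and re-encoding it can be seen on a single character. The following is only a small sketch, not taken from the original answer: the variable names `x` and `y` and the choice of the byte 0xF1 (the letter ñ in Latin-1 and CP1252) are assumptions made for illustration.

```r
# The single byte 0xF1 is "ñ" in Latin-1 / CP1252.
x <- rawToChar(as.raw(0xF1))

# Marking: declare that the bytes are Latin-1. The bytes do not change.
Encoding(x) <- "latin1"
print(charToRaw(x))   # still the one byte f1

# Re-encoding: iconv() rewrites the bytes into another encoding.
y <- iconv(x, from = "latin1", to = "UTF-8")
print(charToRaw(y))   # two bytes, c3 b1
```

The encoding argument of readLines does the first of these two things, which is why it cannot turn UCS-2 bytes into readable text.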

The correct way to read a file with an uncommon encoding is therefore:

filetext <- readLines(con <- file("UnicodeFile.txt", encoding = "UCS-2LE"))
close(con)
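To try this without the original files, a small test file can be built by hand. The sketch below is an illustration under stated assumptions, not part of the original answer: the byte sequence, the temporary file and the names `bytes`, `path`, `bom` and `lines` are all made up. It writes a byte order mark (FF FE, which is what shows up as "ÿþ" in the question) followed by two short lines in little-endian two-byte form, then reads them back through a connection that carries the encoding.

```r
# Hypothetical sample content: BOM, "ñáó", newline, "E", newline,
# each character stored as two bytes, low byte first.
bytes <- as.raw(c(0xFF, 0xFE,              # byte order mark
                  0xF1, 0x00,              # ñ
                  0xE1, 0x00,              # á
                  0xF3, 0x00,              # ó
                  0x0A, 0x00,              # newline
                  0x45, 0x00,              # E
                  0x0A, 0x00))             # newline
path <- tempfile(fileext = ".txt")
writeBin(bytes, path)

# Peek at the first two bytes to confirm the byte order mark.
bom <- readBin(path, what = "raw", n = 2)
print(bom)

# Put the encoding on the connection, not on readLines().
con <- file(path, encoding = "UCS-2LE")
lines <- readLines(con)
close(con)

print(lines)
print(utf8ToInt(enc2utf8(lines[1])))       # code points of the first line

unlink(path)
```

Checking the first bytes with readBin is a quick way to tell whether a file that an editor calls "Unicode" really starts with FF FE before choosing the encoding name.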

Regarding "windows - R: Cannot read Unicode text files even when specifying the encoding", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/26305884/
