
r - Force character vector encoding from "unknown" to "UTF-8" in R

Reposted. Author: 行者123. Updated: 2023-12-03 06:14:31

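I have a problem with inconsistent encoding of a character vector in R.

The text file I read the table from is encoded (via Notepad++) in UTF-8 (I also tried UTF-8 without BOM).

I want to read a table from this text file, convert it to a data.table, set a key, and use binary search. When I try to do so, I get the following: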

Warning message: In [.data.table(poli.dt, "żżonymi", mult = "first") : A known encoding (latin1 or UTF-8) was detected in a join column. data.table compares the bytes currently, so doesn't support mixed encodings well; i.e., using both latin1 and UTF-8, or if any unknown encodings are non-ascii and some of those are marked known and others not. But if either latin1 or UTF-8 is used exclusively, and all unknown encodings are ascii, then the result should be ok. In future we will check for you and avoid this warning if everything is ok. The tricky part is doing this without impacting performance for ascii-only cases.

and the binary search does not work.

I realized that my data.table key column contains a mix of "unknown" and "UTF-8" encoding marks:

> table(Encoding(poli.dt$word))
unknown   UTF-8
2061312 2739122

I tried to convert this column (before creating the data.table object) using:

  • Encoding(word) <- "UTF-8"
  • word <- enc2utf8(word)

but with no effect.

I also tried several different ways of reading the file into R (setting every relevant parameter, e.g. encoding = "UTF-8"):

  • data.table::fread
  • utils::read.table
  • base::scan
  • colbycol::cbc.read.table

but with no effect.

=====================================================

My R.version:

> R.version
               _
platform       x86_64-w64-mingw32
arch           x86_64
os             mingw32
system         x86_64, mingw32
status
major          3
minor          0.3
year           2014
month          03
day            06
svn rev        65126
language       R
version.string R version 3.0.3 (2014-03-06)
nickname       Warm Puppy

My session info:

> sessionInfo()
R version 3.0.3 (2014-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Polish_Poland.1250 LC_CTYPE=Polish_Poland.1250 LC_MONETARY=Polish_Poland.1250
[4] LC_NUMERIC=C LC_TIME=Polish_Poland.1250

base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.9.2 colbycol_0.8 filehash_2.2-2 rJava_0.9-6

loaded via a namespace (and not attached):
[1] plyr_1.8.1 Rcpp_0.11.1 reshape2_1.2.2 stringr_0.6.2 tools_3.0.3

Best answer

The Encoding function returns unknown if a string carries a "native encoding" mark (CP-1250 in your case) or if it is plain ASCII. To tell these two cases apart, call:

library(stringi)
stri_enc_mark(poli.dt$word)
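
The difference between the two functions can be seen on a tiny vector. This is only a sketch, assuming the stringi package is installed; the vector `x` is a made-up sample and not the asker's data.

```r
library(stringi)

# Hypothetical sample: one pure-ASCII word and one word with non-ASCII letters.
# The \u escapes make R mark the second string as UTF-8 on any platform.
x <- c("abc", "\u017c\u017conymi")

# Base R reports "unknown" for ASCII strings, because ASCII is never marked.
base_marks <- Encoding(x)

# Trying to force a mark onto the ASCII element is silently ignored.
forced <- x
Encoding(forced) <- "UTF-8"
forced_marks <- Encoding(forced)

# stringi separates plain ASCII from genuinely native-encoded strings.
stri_marks <- stri_enc_mark(x)

print(base_marks)
print(forced_marks)
print(stri_marks)
```

If every "unknown" entry turns out to be ASCII here, the mix of "unknown" and "UTF-8" is the harmless case described in the data.table warning; entries reported as native are the ones worth investigating.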

To check whether each string is a valid UTF-8 byte sequence, call:

all(stri_enc_isutf8(poli.dt$word))

If this is not the case, your file is definitely not in UTF-8.
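
To illustrate what a failing check looks like, here is a small sketch that builds one string from raw CP-1250 bytes. The sample words and variable names are assumptions made for illustration; stringi is assumed to be installed.

```r
library(stringi)

# Hypothetical inputs. 0xBF is the CP-1250 byte for the Polish letter "ż";
# followed by ASCII letters it does not form a valid UTF-8 sequence.
cp1250_word <- rawToChar(as.raw(c(0xbf, 0x6f, 0x6e, 0x61)))  # "żona" in CP-1250
utf8_word   <- "\u017cona"                                   # same word, UTF-8

# One logical value per string.
valid <- stri_enc_isutf8(c("abc", utf8_word, cp1250_word))

print(valid)
print(all(valid))     # FALSE as soon as a single element fails
print(which(!valid))  # positions of the offending elements
```

Using `which(!valid)` rather than only `all(valid)` helps locate the rows that were read with the wrong encoding.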

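I suspect you haven't forced UTF-8 mode in the data read function (try inspecting the contents of poli.dt$word to verify this). If my guess is right, try one of the following: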

read.csv2(file("filename", encoding="UTF-8"))

poli.dt$word <- stri_encode(poli.dt$word, "", "UTF-8") # re-mark encodings
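
The effect of re-encoding can be reproduced without a Windows session by starting from raw bytes. This is a sketch under assumptions: the sample word is invented, and the source encoding is named explicitly as "windows-1250", whereas the call above passes "" to mean the session's native encoding.

```r
library(stringi)

# Hypothetical input: the bytes of "żółty" as they would be stored in CP-1250.
raw_word <- rawToChar(as.raw(c(0xbf, 0xf3, 0xb3, 0x74, 0x79)))

# Convert from an explicitly named source encoding (an assumption made here
# so that the example does not depend on the current locale).
fixed <- stri_encode(raw_word, from = "windows-1250", to = "UTF-8")

print(Encoding(fixed))
print(fixed == "\u017c\u00f3\u0142ty")
```

Naming the source encoding explicitly is worth considering when the script may run on machines whose native encoding differs from that of the file.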

If data.table still complains about "mixed" encodings, you may want to transliterate the non-ASCII characters, e.g.:

stri_trans_general("Zażółć gęślą jaźń", "Latin-ASCII")
## [1] "Zazolc gesla jazn"
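
If transliteration is used for a keyed lookup, the same transformation has to be applied to both the key column and the search term. The sketch below uses a plain character vector and `match()` in place of a data.table join; the words, the helper `to_ascii`, and the variable names are all hypothetical.

```r
library(stringi)

# Hypothetical key column and lookup word.
words <- c("\u017c\u017conymi", "kot", "\u0142aska", "laska")
query <- "\u017c\u017conymi"

# Illustrative helper: strip diacritics so that every key is pure ASCII.
to_ascii <- function(x) stri_trans_general(x, "Latin-ASCII")

keys_ascii  <- to_ascii(words)
query_ascii <- to_ascii(query)

# Look the transliterated query up among the transliterated keys.
hit <- match(query_ascii, keys_ascii)
print(hit)

# Caveat: distinct words can collapse to the same ASCII key.
collapsed <- keys_ascii[3] == keys_ascii[4]
print(collapsed)
```

Because of the collapsing shown at the end, it may be safer to keep the original column next to the ASCII key and compare it after the lookup.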

Regarding r - Force character vector encoding from "unknown" to "UTF-8" in R, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23699271/
