r - Abnormal structure of a txt file

Reposted. Author: 行者123. Updated: 2023-12-02 01:18:24

I have a text file like this (example): https://drive.google.com/open?id=0B1vq9WjkqkvzTEVEUnlXMGVFa00

The original file has 65k lines. I need to load it into R and get it into a workable form. I tried the following functions:

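  1. read.table - did not work (R never returned a result).
  2. fread from the data.table package - requires a lot of manual preprocessing of the file and still does not work as needed (quotes break the lines, so the file does not come out in a proper form).
  3. scan - returns a vector, and converting it to a matrix did not give the desired result.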

The desired form of the file is a regular data frame:

mydata <- structure(list(fieldName = structure(c(3L, 3L), .Label = c("description", 
"scraped_manufacturer", "title"), class = "factor"), foreign_id = c(13389,
13389), is_single_product = structure(1:2, .Label = c("FALSE",
"TRUE"), class = "factor"), matched_manufacturers = c("Foden /manId: 76775",
"Caterpillar /manId: 74, Skogsjan-Caterpillar /manId: 10329"),
matched_products = c("", "C12 /modelId: 32774 /manId: 74"
), raw_string = c("CAT FODEN C-12 ENGINE", "CATERPILLAR C-12 ENGINE"
), pagesource = structure(c(84L, 84L), .Label = c("", "585e362f6b010083d6962041",
"585f270a300000c614b819ed", "585f84be6b0100c6ee962ab1", "585f84dc66010074efac42ca",
"585f875a6b0100c7ee963000", "585f878c66010074efac483e", "585f87ad66010075efac4880",
"585f88e06b0100b6ee96331c", "585f8b4566010074efac4fcb", "agriaffaires",
"apex-auctions", "arlington-plastics-machinery", "auctelia",
"auctions-international", "autogilles", "baestlein", "baupool",
"bavaria-swiss-ag", "big-iron", "big-machinery", "blackforxx",
"blue-group", "bpi-associates", "buk-baumaschinen", "cegema",
"christophbusch", "cjm-asset", "classified", "cnc-auction",
"cottrill-and-co", "daan", "de-vries", "dechow", "dimex-import-export",
"e-farm", "ebay", "ebay-de", "eberle-hald-gmbh", "eggers-landmaschinen",
"euro-auctions", "fabricating-machinery-corp", "fastline",
"ferwood", "fh-machinery", "first-machinery-auctions-limited",
"forklift-international", "ga-tec-gabelstaplertechnik", "gambtec",
"geiger", "german-graphics", "goindustry-dovebid", "graf",
"gruma-nutzfahrzeuge-gmbh", "hanselmann", "heinrich-kuper-gmbh",
"hooray-machinery", "imz-maschinen", "industrial-discount",
"ipr-petmachinery", "ironplanet", "ironplanet-com", "karl-guenter-wirths-gmbh",
"karner-dechow", "kurt-steiger", "kvd-auctions", "lagermaschinen",
"leinweber-landtechnik", "mach4metal", "machinefinder", "machinery-park",
"machineryzone", "maschinenbau-rehnen-gmbh", "mideast-equipment",
"mmtequipment", "oskar-broziat-maschinen", "perfection-global",
"perlick", "perry-videx", "pfeifer-machinery", "plustech-as",
"polboto-agri-sp-z-oo", "pressenhaas", "rc-tuxford-exports",
"resale", "restlos", "richter-friedewald-gmbh", "ritchie-bros",
"rock-and-dirt", "rogiers", "rs-auktionen", "stig-bindner",
"surplex", "technikboerse", "themar-trucks", "traktorpool",
"unilift", "vebim", "vertimac", "zeppelin-caterpillar", "zoll-auktion",
"zuern-gmbh"), class = "factor")), .Names = c("fieldName",
"foreign_id", "is_single_product", "matched_manufacturers", "matched_products",
"raw_string", "pagesource"), row.names = 1:2, class = "data.frame")

Any ideas on how to work with this file?

Best answer

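Consider opening the text file in software that can read RTF. On Windows machines, both Microsoft Word and the built-in WordPad can read .rtf documents. When the file is opened this way, valid JSON is displayed in the document (without the markup).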
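Before reaching for Word or WordPad, it can help to confirm that the file really is RTF despite its .txt extension. RTF documents begin with the byte sequence `{\rtf`, so a quick check of the first five bytes is usually enough. The helper name `is_rtf` below is hypothetical and not part of the original answer; this is only a minimal sketch.

```r
# Hypothetical helper (not from the original answer): report whether a file
# starts with the RTF signature "{\rtf".
is_rtf <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  # Read the first five bytes and compare them with the signature bytes.
  header <- readBin(con, what = "raw", n = 5L)
  identical(header, charToRaw("{\\rtf"))
}
```

Calling `is_rtf("C:\\Path\\To\\Data.txt")` should return TRUE for a file like the one described here, which would also explain why read.table, fread, and scan choke on it: they see the RTF markup rather than the JSON inside it.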

[Screenshot: JSON text]

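Fortunately, R on Windows can connect to the MS Word object library through the RDCOMClient package, where you can extract the text using the Document.Content property. After reading the JSON text, use the jsonlite package to convert the content into a data frame: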

library(RDCOMClient)
library(jsonlite)

# OPEN WORD APP
wrdApp = COMCreate("Word.Application")
wrdDoc = wrdApp$Documents()$Open("C:\\Path\\To\\Data.txt")
wrdtext = wrdDoc[['Content']]

# EXTRACT TEXT TO R VARIABLE
doc = wrdtext$Text()

# CLOSE APP
wrdDoc$Close(FALSE)
wrdApp$Quit()

# RELEASE RESOURCES
wrdtext <- wrdDoc <- wrdApp <- NULL
rm(wrdtext, wrdDoc, wrdApp)
gc()

# RAW DF: NAME / COLUMNS / VALUES LIST TYPES
rawdf <- fromJSON(doc)[[1]][[1]][[1]]

# FINAL DF: NORMALIZING VALUES WITH COL NAMES
finaldf <- setNames(data.frame(rawdf$values, stringsAsFactors = FALSE),
rawdf$columns[[1]])

Output

[Screenshot: final data frame]
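To make the reshaping step easier to follow, here is a small self-contained sketch on a toy payload. It assumes the JSON stores the header in a `columns` array and the rows in a `values` array, as the answer's code implies; the outer key names `results` and `series` and the sample rows are assumptions made for illustration only. Unlike the answer, it parses with `simplifyVector = FALSE` so that each level of nesting is an ordinary list and every step is explicit.

```r
library(jsonlite)

# Toy payload. "results" and "series" are assumed key names; only "columns"
# and "values" are taken from the answer's code.
json_txt <- '{"results":[{"series":[{"name":"items",
  "columns":["fieldName","raw_string"],
  "values":[["title","CAT FODEN C-12 ENGINE"],
            ["title","CATERPILLAR C-12 ENGINE"]]}]}]}'

# Parse without simplification so every level stays a plain list.
parsed <- fromJSON(json_txt, simplifyVector = FALSE)
series <- parsed$results[[1]]$series[[1]]

# Header: one name per column.
col_names <- unlist(series$columns)

# Rows: flatten each row to a character vector and stack them into a matrix.
# Assumes no null cells, since unlist() would silently drop them.
value_matrix <- do.call(rbind, lapply(series$values, unlist))

# Attach the header to obtain a regular data frame.
finaldf <- setNames(data.frame(value_matrix, stringsAsFactors = FALSE),
                    col_names)
print(finaldf)
```

With the default simplification used in the answer, jsonlite collapses much of this nesting automatically, which is why the answer can reach the same element with positional `[[1]]` indexing.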


Alternative

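Should you not have MS Word installed, launch a CMD prompt and open WordPad (a built-in Windows application) from the command line, then copy all of the content into a .json file (alternatively, right-click the text file and open it with WordPad). On another operating system (Linux/Mac), use the equivalent application and terminal call: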

write "D:\Path\To\Data.txt"

After saving the JSON file, run the following in R:

rawdf <- do.call(rbind, lapply(paste(readLines("C:\\Path\\To\\Data.json", warn=FALSE),
collapse=""),
jsonlite::fromJSON))[[1]][[1]][[1]]

finaldf <- setNames(data.frame(rawdf$values, stringsAsFactors = FALSE),
rawdf$columns[[1]])

Regarding "r - Abnormal structure of a txt file", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41531562/
