
r - Conditionally merging rows

Reposted. Author: 行者123. Updated: 2023-12-05 09:31:48

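I'm doing some tricky data cleaning. I have a dataset (first excerpt below) that is the output of digitizing a PDF table. Unfortunately, the columns were not digitized correctly. Sometimes the content that belongs in column X3 ends up in column X2, concatenated with the last word of X2...

What I'd like to do is move that content back into X3 and collapse the two X2 rows into one.

I've attached an excerpt of the output I'm trying to create (second excerpt below).

Any ideas on how I could do this?

Thanks!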

Input data:

structure(list(X1 = c(111L, NA, 2L, NA, NA, 121L, NA, NA, 121L,
NA, NA, 141L, NA, NA, 141L, NA), X2 = structure(c(7L, 1L, 8L,
1L, 1L, 9L, 1L, 1L, 6L, 3L, 1L, 5L, 2L, 1L, 10L, 4L), .Label = c("",
"A - BWHITE", "ASMITH", "B - DBURNEY", "Garden Harris", "House M. Aba",
"House M. Bab", "House M. Cac", "Street M. Bak", "Villa Thomas"
), class = "factor"), X3 = structure(c(2L, 1L, 3L, 1L, 1L, 4L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "A",
"A - C", "D"), class = "factor")), class = "data.frame", row.names = c(NA,
-16L))

Desired output:

structure(list(X1 = c(111L, NA, 2L, NA, NA, 121L, NA, NA, 121L,
NA, NA, 141L, NA, NA, 141L), X2 = structure(c(4L, 1L, 5L, 1L,
1L, 6L, 1L, 1L, 3L, 1L, 1L, 2L, 1L, 1L, 7L), .Label = c("", "Garden Harris WHITE",
"House M. Aba SMITH", "House M. Bab", "House M. Cac", "Street M. Bak",
"Villa Thomas BURNEY"), class = "factor"), X3 = structure(c(2L,
1L, 4L, 1L, 1L, 6L, 1L, 1L, 2L, 1L, 1L, 3L, 1L, 1L, 5L), .Label = c("",
"A", "A - B", "A - C", "B - D", "D"), class = "factor")), class = "data.frame", row.names = c(NA,
-15L))

Follow-up question here: Cleaning extract_tables conditional merge rows, systematic extraction

Best answer

You can use the tidyverse:

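library(tidyr)
library(stringr)
library(dplyr)

# df is the input data frame from the question (the first dput() excerpt)
df %>%
  # drop the rows whose X2 is empty so each fused row directly follows its parent row
  filter(X2 != "") %>%
  mutate(
    # surname: capitals that directly follow another capital, taken from the next row
    extract_name = lead(str_extract(X2, "(?<=[A-Z])[A-Z]+")),
    # X3 code: "A" or "A - B" sitting right in front of the surname, taken from the next row
    extract_part = lead(str_extract(X2, "[A-Z](\\s-\\s[A-Z]){0,1}(?=[A-Z]+)")),
    new_X2 = ifelse(!is.na(extract_name), paste(X2, extract_name), as.character(X2)),
    new_X3 = ifelse(X3 != "", as.character(X3), extract_part)
  ) %>%
  # the fused rows have no X1, so this removes them once their content has been moved up
  drop_na(X1) %>%
  select(-extract_name, -extract_part)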

This returns

   X1            X2    X3              new_X2 new_X3
1 111  House M. Bab     A        House M. Bab      A
2   2  House M. Cac A - C        House M. Cac  A - C
3 121 Street M. Bak     D       Street M. Bak      D
4 121  House M. Aba        House M. Aba SMITH      A
5 141 Garden Harris       Garden Harris WHITE  A - B
6 141  Villa Thomas       Villa Thomas BURNEY  B - D

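Note: I don't think this approach is really robust, given the regular expressions it relies on. For readability, I filtered out some of the annoying rows containing NA and empty strings; remove those steps if you need to keep those rows.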
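
To see what the two regular expressions pick up, and why they may be fragile, here is a minimal base R sketch. It applies the same patterns (with `{0,1}` written as `?` and the lookahead shortened to a single capital, which should match the same text) to the fused cells from the question. The helper `first_match` and the value "Villa USA" are hypothetical and are not part of the original data or answer.

```r
# Fused cells from the question's X2 column, one clean cell for contrast,
# and one hypothetical value ("Villa USA") that is NOT in the original data.
cells <- c("ASMITH", "A - BWHITE", "B - DBURNEY", "House M. Bab", "Villa USA")

# Hypothetical helper: first match of `pattern` in each string, NA if none.
first_match <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)   # perl = TRUE enables lookarounds
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)        # regmatches() returns matched elements only
  out
}

# Surname: a run of capitals that directly follows another capital.
surname <- first_match(cells, "(?<=[A-Z])[A-Z]+")

# X3 code: one capital, optionally " - " plus another capital,
# immediately followed by a further capital (the start of the surname).
code <- first_match(cells, "[A-Z](\\s-\\s[A-Z])?(?=[A-Z])")

print(data.frame(cells, code, surname))
```

The three fused cells split into "A" / "SMITH", "A - B" / "WHITE" and "B - D" / "BURNEY", and the clean cell yields NA for both. The hypothetical "Villa USA" also matches (code "U", surname "SA"), which is the kind of false positive that makes the patterns unreliable on any X2 value containing adjacent capitals.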

For r - conditionally merging rows, the original question is on Stack Overflow: https://stackoverflow.com/questions/68602693/
