
r - How to identify repeated words and the position and number of repetitions in a sentence

Reposted. Author: 行者123. Last updated: 2023-12-04 12:37:01

I have a dataset of sentences that contain consecutive word repetitions:

Data:

df <- data.frame(
Turn = c("oh is that that steak i got the other night", # that that
"no no no i 'm dave and you 're alan", # no no no
"yeah i mean the the film was quite long though", # the the
"it had steve martin in it it 's a comedy")) # it it

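Goal:

What I would like to obtain is three additional columns in this data frame:

  • df$rep_Word : a column giving the repeated word
  • df$rep_Pos : a column giving the first position at which the word is repeated in the sentence
  • df$rep_Numb : a column giving the number of times the word is repeated

So the expected data frame looks like this:

Expected result: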

df
                                            Turn rep_Word rep_Pos rep_Numb
1    oh is that that steak i got the other night     that       4        1
2            no no no i 'm dave and you 're alan       no       2        2
3 yeah i mean the the film was quite long though      the       5        1
4       it had steve martin in it it 's a comedy       it       7        1

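Solution attempted so far:

My hunch is that the repeated word, its position, and the number of repetitions can be obtained with strsplit and the function duplicated, for example like this: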

df_split <- apply(df, 2, function(x) strsplit(x, "\\s"))

df_split
$Turn
$Turn[[1]]
[1] "oh" "is" "that" "that" "steak" "i" "got" "the" "other" "night"
$Turn[[2]]
[1] "no" "no" "no" "i" "'m" "dave" "and" "you" "'re" "alan"
$Turn[[3]]
[1] "yeah" "i" "mean" "the" "the" "film" "was" "quite" "long" "though"
$Turn[[4]]
[1] "it" "had" "steve" "martin" "in" "it" "it" "'s" "a" "comedy"

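For example, for the first sentence in df, duplicated shows which word is repeated (that is, the word for which duplicated evaluates to TRUE), and the number and position of the repetitions can also be read off the result: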

duplicated(df_split$Turn[[1]])
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE

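The problem is that I don't know how to work with the output of duplicated to get the desired additional columns in df. Any help with this would be much appreciated.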
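One caveat with duplicated is worth checking before building on it: it flags any later occurrence of a word, not only one that directly follows its twin. The sketch below, which uses the fourth sentence and variable names chosen purely for illustration, compares duplicated with a plain neighbour comparison.

```r
words <- strsplit("it had steve martin in it it 's a comedy", " ", fixed = TRUE)[[1]]

# duplicated() marks every later occurrence of a word, adjacent or not
dup_any <- duplicated(words)

# comparing each word with its predecessor marks only consecutive repeats
dup_adjacent <- c(FALSE, words[-1] == words[-length(words)])

which(dup_any)       # 6 7  (position 6 repeats the "it" at position 1)
which(dup_adjacent)  # 7    (only the "it it" pair)
```

So for rep_Pos to come out as 7 in the fourth row, the test has to look at neighbouring words rather than at the sentence as a whole.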

Best answer

Here is another way to solve the problem.

df <- data.frame(
Turn = c("oh is that that steak i got the other night", # that that
"no no no i 'm dave and you 're alan", # no no no
"yeah i mean the the film was quite long though", # the the
"it had steve martin in it it 's a comedy", # it it
"it had steve martin in in it it 's a comedy",
"yeah i mean the film was quite long though",
"hi hi then other words and hi hi again",
"no no no i 'm dave yes yes and you 're alan no no no no")) # no no no and no no no no

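library(data.table)
cols <- c("rep_Word", "rep_Pos", "rep_Numb")
setDT(df)[, (cols) := {
  # split the sentence into words
  words <- strsplit(as.character(Turn), " ")[[1]]
  # run id: consecutive identical words share one id
  idx <- rleid(words)
  # TRUE for every word that repeats its predecessor
  check <- duplicated(idx)
  # +1 where a block of repeats begins, -1 just after it ends
  chg <- check - shift(check, fill = FALSE)
  starts <- which(chg == 1)
  # a block that runs to the end of the sentence has no -1, so close it manually
  aend <- if(sum(chg) == 0L) which(chg == -1) else c(which(chg == -1), length(chg) + 1L)
  freq <- aend - starts   # number of repetitions per block
  wrd <- words[starts]    # the repeated word of each block
  # sentences without repetitions get NA in all three columns
  no_dup_default <- .(.(NA_character_), .(NA_integer_), .(NA_integer_))
  if(length(wrd)) .(.(wrd), .(starts), .(freq)) else no_dup_default
}, seq.int(nrow(df))]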
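To see what the intermediate vectors hold, the steps can be traced by hand for the second sentence. This is only a sketch: it swaps in base R stand-ins for data.table's rleid() and shift() so that it runs without the package, and those stand-ins (and the name lagged) are assumptions made for illustration, not part of the answer's code.

```r
words <- strsplit("no no no i 'm dave and you 're alan", " ", fixed = TRUE)[[1]]

# base R stand-in for rleid(): a new id starts whenever the word changes
idx <- cumsum(c(TRUE, words[-1] != words[-length(words)]))
# TRUE for the second and later words of each run
check <- duplicated(idx)
# base R stand-in for shift(check, fill = FALSE): lag by one position
lagged <- c(FALSE, head(check, -1))
chg <- check - lagged

idx  # 1 1 1 2 3 4 5 6 7 8
chg  # 0 1 0 -1 0 0 0 0 0 0

starts <- which(chg == 1)
aend <- if (sum(chg) == 0L) which(chg == -1) else c(which(chg == -1), length(chg) + 1L)
freq <- aend - starts

starts  # 2
freq    # 2
```

The +1 in chg marks the first repeated copy (position 2) and the -1 marks the first word after the block (position 4), so their difference is the number of repetitions.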


df
#                                                     Turn  rep_Word rep_Pos rep_Numb
# 1:          oh is that that steak i got the other night      that       4        1
# 2:                  no no no i 'm dave and you 're alan        no       2        2
# 3:       yeah i mean the the film was quite long though       the       5        1
# 4:             it had steve martin in it it 's a comedy        it       7        1
# 5:          it had steve martin in in it it 's a comedy     in,it     6,8      1,1
# 6:           yeah i mean the film was quite long though        NA      NA       NA
# 7:               hi hi then other words and hi hi again     hi,hi     2,8      1,1
# 8: no no no i 'm dave yes yes and you 're alan no no no no no,yes,no 2, 8,14    2,1,3

# or
df[, lapply(.SD, unlist), seq.int(nrow(df))][, -1]
#                                                      Turn rep_Word rep_Pos rep_Numb
#  1:          oh is that that steak i got the other night     that       4        1
#  2:                  no no no i 'm dave and you 're alan       no       2        2
#  3:       yeah i mean the the film was quite long though      the       5        1
#  4:             it had steve martin in it it 's a comedy       it       7        1
#  5:          it had steve martin in in it it 's a comedy       in       6        1
#  6:          it had steve martin in in it it 's a comedy       it       8        1
#  7:           yeah i mean the film was quite long though     <NA>      NA       NA
#  8:               hi hi then other words and hi hi again       hi       2        1
#  9:               hi hi then other words and hi hi again       hi       8        1
# 10: no no no i 'm dave yes yes and you 're alan no no no no       no       2        2
# 11: no no no i 'm dave yes yes and you 're alan no no no no      yes       8        1
# 12: no no no i 'm dave yes yes and you 're alan no no no no       no      14        3
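If data.table is not available, the same three values can probably be derived in base R with rle(), which returns the length and value of each run of identical neighbours. The sketch below is an alternative technique, not the answer's method; find_repeats and turns are hypothetical names introduced here. Unlike the answer, it returns zero rows rather than NA for a sentence without repetitions.

```r
# Hypothetical helper: one row per block of consecutive repeated words
find_repeats <- function(sentence) {
  words <- strsplit(sentence, " ", fixed = TRUE)[[1]]
  runs <- rle(words)
  # position of the first word of each run
  run_start <- cumsum(runs$lengths) - runs$lengths + 1L
  keep <- runs$lengths > 1L
  data.frame(
    rep_Word = runs$values[keep],
    rep_Pos  = run_start[keep] + 1L,     # first repeated copy, not the original
    rep_Numb = runs$lengths[keep] - 1L,  # copies beyond the first
    stringsAsFactors = FALSE
  )
}

turns <- c("it had steve martin in in it it 's a comedy",
           "yeah i mean the film was quite long though",
           "no no no i 'm dave yes yes and you 're alan no no no no")

res <- do.call(rbind, lapply(turns, find_repeats))
res
```

For these three sentences the result has five rows (in, it, no, yes, no), matching the corresponding rows of the long-format output above except for the missing NA row.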

Regarding r - How to identify repeated words and the position and number of repetitions in a sentence, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/60463993/
