
r - Extract all text between two delimiters when they occur multiple times in a string

Reposted. Author: 行者123. Updated: 2023-12-03 23:34:26

I have several rows of chat data containing transcripts that look like this:

"Participant 1 (Me): I don't know the answer to this. Participant 2: What do you think? Maybe 20%? Participant 2: I don't know either. Participant 1 (Me): that was my guess Participant 2: ok, let's go for it! ...what do you think? Participant 1 (Me): sure! Participant 2: ok! Participant 2: aww! sorry!"

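I want to extract only the text after the prefix Participant 1 (Me): up until it says Participant 2. All text after Participant 1 up to that delimiter should be stored in a variable called participant_1_text. I would like to store all the remaining text in a separate variable called participant_2_text, as follows: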

participant_1_text = "I don't know the answer to this. that was my guess. sure!"
participant_2_text = "What do you think? Maybe 20%? I don't know either. ok, let's go for it! ...what do you think? ok! aww! sorry!"

So all of Participant 1's text and all of Participant 2's text are now separated.

I tried a regular expression like the following:

(?<=Participant 1)(.*)(?=Participant 2)

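But this matches all the text between the first and the last occurrence of the two delimiters, rather than matching each occurrence separately.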
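The pattern above behaves this way because `.*` is greedy: it consumes as much as possible before the lookahead is checked. A minimal sketch of a lazy alternative in base R follows. It assumes every turn starts with "Participant N" (optionally followed by " (Me)") and a colon, and that the word "Participant" followed by a digit never occurs inside a message. The names `chat`, `p1_pattern`, `p2_pattern` and `extract_turns` are hypothetical and not part of the original question.

```r
chat <- "Participant 1 (Me): I don't know the answer to this. Participant 2: What do you think? Maybe 20%? Participant 2: I don't know either. Participant 1 (Me): that was my guess Participant 2: ok, let's go for it! ...what do you think? Participant 1 (Me): sure! Participant 2: ok! Participant 2: aww! sorry!"

# Lazy `.*?` stops at the first position where the lookahead succeeds,
# i.e. just before the next "Participant <digit>" or at the end of the string.
p1_pattern <- "(?<=Participant 1 \\(Me\\): ).*?(?=\\s*Participant \\d|$)"
p2_pattern <- "(?<=Participant 2: ).*?(?=\\s*Participant \\d|$)"

# Hypothetical helper: return every match of `pattern` in a single string.
# perl = TRUE is needed because lookbehind and lookahead are PCRE features.
extract_turns <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

participant_1_text <- paste(extract_turns(chat, p1_pattern), collapse = " ")
participant_2_text <- paste(extract_turns(chat, p2_pattern), collapse = " ")

participant_1_text
# [1] "I don't know the answer to this. that was my guess sure!"
```

Each speaker needs its own pattern here because a PCRE lookbehind must have a fixed length, so the two prefixes cannot share one lookbehind.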


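Edit: I am now trying to take the versions of the code below and apply them to a data frame containing a large number of chat logs.

So, using @akrun's code, I created a function that separates a given chat log into my chat and my partner's chat and returns a named list: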

extract_chat <- function(chat_text) {
  # Requires dplyr, stringr and tidyr to be attached
  final_output <- tibble(col1 = chat_text) %>%
    mutate(col1 = str_replace_all(col1, "Participant", "\nParticipant")) %>%
    separate_rows(col1, sep = "\n") %>%
    filter(nzchar(col1)) %>% # keep only the non-empty strings
    separate(col1, into = c("Participant", "text"), sep = ":") %>%
    group_by(Participant) %>%
    summarise(text = str_c(text, collapse = " ")) %>%
    # fixed() matches the literal "(Me)" instead of treating the parentheses as a regex group
    mutate(Participant = ifelse(str_detect(Participant, fixed("(Me)")), "my_chat_extracted", "partner_chat_extracted")) %>%
    spread(Participant, text)

  return(list(my_chat_extracted = final_output$my_chat_extracted,
              partner_chat_extracted = final_output$partner_chat_extracted))
}

This seems to work fine, but I'm not sure how to mutate the actual columns in my data frame using this function.

Here is an example data.frame to work with:

str1 <- "Participant 1 (Me): I don't know the answer to this. Participant 2: What do you think? Maybe 20%? Participant 2: I don't know either. Participant 1 (Me): that was my guess Participant 2: ok, let's go for it! ...what do you think? Participant 1 (Me): sure! Participant 2: ok! Participant 2: aww! sorry!"
str2 <- "Participant 1 (Me): Hey, how are you? Participant 2: I'm good, how about you? Participant 2: I'm excited. Participant 1 (Me): I'm also good."
test = data.frame(chat = c(str1, str2))

I would like to do something like this:

tester = test %>%
  rowwise() %>%
  mutate(my_chat_extracted = extract_chat(chat)$my_chat_extracted)

But this seems very slow on my actual dataset, and it feels sloppy.

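Best answer

We can insert a newline character before each "Participant" (using str_replace_all), then split at \n with separate_rows, filter out any blank strings (nzchar), separate the column into two at the ":", group by 'Participant', and paste the 'text' strings into a single string: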

library(dplyr)
library(stringr)
library(tidyr)
out <- tibble(col1 = str1) %>%
  mutate(col1 = str_replace_all(col1, "Participant", "\nParticipant")) %>%
  separate_rows(col1, sep = "\n") %>%
  filter(nzchar(col1)) %>%
  separate(col1, into = c('Participant', "text"), sep = ":") %>%
  group_by(Participant = str_remove(Participant, "\\s*\\(.*")) %>%
  summarise(text = str_c(text, collapse = ' '))

out
# A tibble: 2 x 2
# Participant text
# <chr> <chr>
#1 Participant 1 " I don't know the answer to this. that was my guess sure! "
#2 Participant 2 " What do you think? Maybe 20%? I don't know either. ok, let's go for it! ...what do you think? ok! aww! sorry!"

It is better to keep the result in a data.frame, but if we need separate objects, use list2env after deframe-ing:

library(tibble)
list2env(as.list(deframe(out)), .GlobalEnv)
`Participant 1`
#[1] " I don't know the answer to this. that was my guess sure! "
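For the data-frame case raised in the question's edit, one possible extension of the steps above is to carry a row identifier through the pipeline instead of calling a function once per row. This is only a sketch and not part of the accepted answer. The names `chat_id`, `turn`, `speaker` and `chats_wide` are hypothetical, `pivot_wider` is swapped in for `spread`, and the code assumes dplyr >= 1.0 and tidyr >= 1.0 and that the word "Participant" never appears inside a message.

```r
library(dplyr)
library(stringr)
library(tidyr)

str1 <- "Participant 1 (Me): I don't know the answer to this. Participant 2: What do you think? Maybe 20%? Participant 2: I don't know either. Participant 1 (Me): that was my guess Participant 2: ok, let's go for it! ...what do you think? Participant 1 (Me): sure! Participant 2: ok! Participant 2: aww! sorry!"
str2 <- "Participant 1 (Me): Hey, how are you? Participant 2: I'm good, how about you? Participant 2: I'm excited. Participant 1 (Me): I'm also good."
test <- data.frame(chat = c(str1, str2), stringsAsFactors = FALSE)

chats_wide <- test %>%
  mutate(chat_id = row_number()) %>%  # hypothetical id so turns stay tied to their chat
  mutate(turn = str_replace_all(chat, "Participant", "\nParticipant")) %>%
  separate_rows(turn, sep = "\n") %>%
  filter(nzchar(turn)) %>%
  # extra = "merge" splits only at the first colon, so colons inside a message survive
  separate(turn, into = c("Participant", "text"), sep = ":\\s*", extra = "merge") %>%
  mutate(speaker = if_else(str_detect(Participant, fixed("(Me)")),
                           "my_chat_extracted", "partner_chat_extracted")) %>%
  group_by(chat_id, chat, speaker) %>%
  summarise(text = str_c(str_trim(text), collapse = " "), .groups = "drop") %>%
  pivot_wider(names_from = speaker, values_from = text)

chats_wide$my_chat_extracted[2]
# [1] "Hey, how are you? I'm also good."
```

Because every verb here operates on whole columns, the split, group and reshape steps run once for the entire data frame rather than once per chat, which may help with the slowness described in the question. That has not been measured here.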

data

str1 <- "Participant 1 (Me): I don't know the answer to this. Participant 2: What do you think? Maybe 20%? Participant 2: I don't know either. Participant 1 (Me): that was my guess Participant 2: ok, let's go for it! ...what do you think? Participant 1 (Me): sure! Participant 2: ok! Participant 2: aww! sorry!"

Regarding "r - Extract all text between two delimiters when they occur multiple times in a string", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/62436341/
