
python - Extract the words that differ between two sentences

Reposted. Author: 太空宇宙. Updated: 2023-11-03 12:55:42

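I have a very large data frame with two columns, sentence1 and sentence2. I am trying to create a new column containing the words that differ between the two sentences, for example: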

sentence1=c("This is sentence one", "This is sentence two", "This is sentence three")
sentence2=c("This is the sentence four", "This is the sentence five", "This is the sentence six")
df = as.data.frame(cbind(sentence1,sentence2))

My data frame has the following structure:

ID  sentence1               sentence2
1   This is sentence one    This is the sentence four
2   This is sentence two    This is the sentence five
3   This is sentence three  This is the sentence six

My expected result is:

ID  sentence1    sentence2    Expected_Result
1   This is ...  This is ...  one the four
2   This is ...  This is ...  two the five
3   This is ...  This is ...  three the six

In R, I tried splitting the sentences and then taking the elements that differ between the resulting lists, for example:

df$split_Sentence1<-strsplit(df$sentence1, split=" ")
df$split_Sentence2<-strsplit(df$sentence2, split=" ")
df$Dif<-setdiff(df$split_Sentence1, df$split_Sentence2)

But this approach does not work when I apply setdiff...

In Python, I tried NLTK: first getting the tokens, then extracting the difference between the two lists, for example:

from nltk.tokenize import word_tokenize

df['tokensS1'] = df.sentence1.apply(lambda x: word_tokenize(x))
df['tokensS2'] = df.sentence2.apply(lambda x: word_tokenize(x))

At this point, I have not found a function that gives me the result I need.

I hope you can help me. Thanks.

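Best answer

Here is an R solution.

I wrote an exclusiveWords function that finds the words belonging to only one of the two sets and returns a "sentence" made of those words. I wrapped it in Vectorize() so that it processes all rows of the data.frame at once.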

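# sentence1 and sentence2 are the character vectors defined in the question
df = as.data.frame(cbind(sentence1, sentence2), stringsAsFactors = FALSE)

exclusiveWords <- function(x, y){
  # split each sentence into words
  x <- strsplit(x, " ")[[1]]
  y <- strsplit(y, " ")[[1]]
  u <- union(x, y)
  # words missing from x, followed by words missing from y
  u <- union(setdiff(u, x), setdiff(u, y))
  return(paste0(u, collapse = " "))
}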

exclusiveWords <- Vectorize(exclusiveWords)

df$result <- exclusiveWords(df$sentence1, df$sentence2)
df
#                sentence1                 sentence2        result
# 1   This is sentence one This is the sentence four  the four one
# 2   This is sentence two This is the sentence five  the five two
# 3 This is sentence three  This is the sentence six the six three
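
Since the question also asks about Python, here is a rough Python counterpart, offered as a sketch rather than as part of the original answer. The name exclusive_words is made up for illustration, and the function assumes plain whitespace tokenization with str.split() instead of NLTK's word_tokenize. Unlike the R version, it lists the words unique to the first sentence before those unique to the second, which matches the order shown in the question's expected result.

```python
def exclusive_words(sentence1: str, sentence2: str) -> str:
    """Return the words that occur in only one of the two sentences.

    Hypothetical helper: splits on whitespace, so punctuation stays
    attached to words and the comparison is case-sensitive.
    """
    words1 = sentence1.split()
    words2 = sentence2.split()
    set1, set2 = set(words1), set(words2)
    # Words unique to sentence1 first, then words unique to sentence2.
    # dict.fromkeys drops repeated words while keeping first-seen order.
    only1 = dict.fromkeys(w for w in words1 if w not in set2)
    only2 = dict.fromkeys(w for w in words2 if w not in set1)
    return " ".join([*only1, *only2])


pairs = [
    ("This is sentence one", "This is the sentence four"),
    ("This is sentence two", "This is the sentence five"),
    ("This is sentence three", "This is the sentence six"),
]
results = [exclusive_words(a, b) for a, b in pairs]
print(results)  # ['one the four', 'two the five', 'three the six']
```

With a pandas data frame like the one in the question, the same function could presumably be applied row by row, for example df["result"] = [exclusive_words(a, b) for a, b in zip(df["sentence1"], df["sentence2"])].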

Regarding "python - Extract the words that differ between two sentences", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43358543/
