
r - How to remove all words from a data frame column in R except those in a specified list

Reposted. Author: 行者123. Last updated: 2023-12-05 01:24:45

I have a data frame of Twitter bios, formatted as in the table below.

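account  bio
38374    i love candy as much as life itself proud liberal
45673    can all just get along
94928    conserv christian mom and proud pro trump veteran maga
11204    professor of women and gender studies at wesleyan university blacklivesmatter
37465    former ohio state football coach now a proud papa to seven grandchildren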

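Many Stack Overflow questions ask how to remove a specified list of words from a data frame column (for example, "R - remove word from a sentence" and "How to remove words of a sentence by using a dictionary as reference"). I want the opposite: remove every word in the bio column unless it appears in a predetermined list of words. The list of words to keep contains 1052 words (partly shown below).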

> termstokeep
[1] love life follow live just like music regist trademark
[10] make fan one copyright lover thing world time god
[19] can get design peopl artist girl univers writer will
[28] student work busi good new know friend famili best
[37] day account market sport art game manag want book
[46] enthusiast person alway travel never free real help dream
[55] servic mom husband profession beauti offici wife now news
[64] social food come father heart educ develop need anim
[73] everyth proud tri year happi also media way man
[82] team produc look state take back support director home
[91] find call engin learn provid photograph great author video
[100] guy communiti coach name big passion see teacher school
[109] product sinc gamer enjoy keep player better let believ
[118] mother think mind dog futur give colleg say owner
[127] jesus fun got littl chang founder boy use first
[136] liberal write footbal kid fuck event polit consult care
[145] conserv much health technolog tech opinion stay everi right
[154] full former member special well young high creat snap
[163] entrepreneur movi feel view compani coffe cat citi human
[172] digit show singer sometim interest dad watch scienc creativ
[181] blogger base addict fit read bless fashion part noth
[190] run forev editor born hard die around onlin nerd
[199] class web musician made stuff leader ever inspir still
[208] christian place current public danc pleas geek talk film
[217] realli babi someth page rock lot women lead two

Ideally, after removing every word that is not in the list, the data frame would look like this:

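account  bio
38374    love life proud liberal
45673
94928    conserv christian mom proud pro trump veteran maga
11204    professor women gender university blacklivesmatter
37465    ohio state football coach proud grandchildren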

How can this be done?

Best answer

Here is one way using base R's gregexpr and regmatches.

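# Wrap each term in word-boundary anchors so only whole words match
pattern <- paste0("\\<", termstokeep, "\\>")
# Join all terms into a single alternation: \<love\>|\<life\>|...
pattern <- paste(pattern, collapse = "|")
# Locate every match in each bio, then extract the matched words
m <- gregexpr(pattern, df1$bio)
r <- regmatches(df1$bio, m)

# Paste the kept words of each bio back into one string
df1$bio_clean <- sapply(r, paste, collapse = " ")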

Created on 2022-02-22 by the reprex package (v2.0.1)
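The same filtering can also be sketched without a regular expression, by splitting each bio on whitespace and keeping the words that appear in the list. This is an alternative to the answer's approach, not part of it: the helper name keep_listed_words and the shortened keep vector are made up for illustration. One difference to be aware of is that the regex word boundaries treat punctuation as a separator, whereas this version only splits on whitespace.

```r
# Hypothetical helper (not from the original answer): keep only listed words.
keep_listed_words <- function(text, keep) {
  # Split each string on runs of whitespace into a list of word vectors.
  words <- strsplit(text, "[[:space:]]+")
  # Keep the words found in `keep`, then glue them back into one string.
  vapply(words, function(w) paste(w[w %in% keep], collapse = " "), character(1))
}

# Shortened keep list and sample bios, for illustration only.
keep <- c("love", "life", "proud", "liberal", "much", "can", "just", "get")
bios <- c("i love candy as much as life itself proud liberal",
          "can all just get along",
          "professor of women and gender studies")

keep_listed_words(bios, keep)
# [1] "love much life proud liberal" "can just get"
# [3] ""
```

A bio with no listed words becomes an empty string, which matches the blank row in the desired output.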

Data

termstokeep <-
c("love", "life", "follow", "live", "just", "like", "music",
"regist", "trademark", "make", "fan", "one", "copyright", "lover",
"thing", "world", "time", "god", "can", "get", "design", "peopl",
"artist", "girl", "univers", "writer", "will", "student", "work",
"busi", "good", "new", "know", "friend", "famili", "best", "day",
"account", "market", "sport", "art", "game", "manag", "want",
"book", "enthusiast", "person", "alway", "travel", "never", "free",
"real", "help", "dream", "servic", "mom", "husband", "profession",
"beauti", "offici", "wife", "now", "news", "social", "food",
"come", "father", "heart", "educ", "develop", "need", "anim",
"everyth", "proud", "tri", "year", "happi", "also", "media",
"way", "man", "team", "produc", "look", "state", "take", "back",
"support", "director", "home", "find", "call", "engin", "learn",
"provid", "photograph", "great", "author", "video", "guy", "communiti",
"coach", "name", "big", "passion", "see", "teacher", "school",
"product", "sinc", "gamer", "enjoy", "keep", "player", "better",
"let", "believ", "mother", "think", "mind", "dog", "futur", "give",
"colleg", "say", "owner", "jesus", "fun", "got", "littl", "chang",
"founder", "boy", "use", "first", "liberal", "write", "footbal",
"kid", "fuck", "event", "polit", "consult", "care", "conserv",
"much", "health", "technolog", "tech", "opinion", "stay", "everi",
"right", "full", "former", "member", "special", "well", "young",
"high", "creat", "snap", "entrepreneur", "movi", "feel", "view",
"compani", "coffe", "cat", "citi", "human", "digit", "show",
"singer", "sometim", "interest", "dad", "watch", "scienc", "creativ",
"blogger", "base", "addict", "fit", "read", "bless", "fashion",
"part", "noth", "run", "forev", "editor", "born", "hard", "die",
"around", "onlin", "nerd", "class", "web", "musician", "made",
"stuff", "leader", "ever", "inspir", "still", "christian", "place",
"current", "public", "danc", "pleas", "geek", "talk", "film",
"realli", "babi", "someth", "page", "rock", "lot", "women", "lead",
"two")


df1 <- read.table(text = "
account bio
38374 'i love candy as much as life itself proud liberal'
45673 'can all just get along'
94928 'conserv christian mom and proud pro trump veteran maga'
11204 'professor of women and gender studies at wesleyan university blacklivesmatter'
37465 'former ohio state football coach now a proud papa to seven grandchildren'
", header = TRUE)

Created on 2022-02-22 by the reprex package (v2.0.1)
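The keep list holds stemmed forms such as "univers" and "footbal", while the bios contain full words such as "university" and "football", so whole-word matching drops them. If the intent is for each entry to match any word that begins with it, a sketch along these lines may help. That intent is an assumption, not something the answer states, and the stems and bios vectors here are shortened for illustration. Prefix matching is also looser than it looks: a short stem like "can" would keep "candy" as well.

```r
# Assumption: each entry of `stems` should match any word that starts with it.
stems <- c("univers", "footbal", "women", "coach")
bios <- c("professor of women and gender studies at wesleyan university",
          "former ohio state football coach")

# \\< anchors at the start of a word; [[:alnum:]]* takes in the rest of it.
pattern <- paste0("\\<(", paste(stems, collapse = "|"), ")[[:alnum:]]*")

# Extract every matching word per bio and paste the matches back together.
kept <- regmatches(bios, gregexpr(pattern, bios))
bios_clean <- vapply(kept, paste, character(1), collapse = " ")
bios_clean
```

Stemming the bios with the same stemmer that produced the list would be a stricter alternative, but the source does not say which stemmer was used.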

Regarding "r - How to remove all words from a data frame column in R except those in a specified list", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/71226002/
