r - How can I use R to find the most common sequences in my data?

Reposted. Author: 行者123. Updated: 2023-12-04 14:59:40

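I'm trying to work out how to use the rollapply function (from the zoo package) to find the most common sequences of strings in a dataset, but I also need to group by certain variables (e.g. date, row, etc.).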

Before I go on, it's worth noting that this question builds on an earlier one I posted here: How can I find most common sequences (of strings) in my data using Tableau?

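The solution provided there worked very well, but I now want to apply it to a different dataset, which brings some new challenges! Here is a sample of the data in this new dataset: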

structure(list(Title = c("Dragons' Den", "One Hot Summer", "Keeping Faith", 
"Cuckoo", "Match of the Day", "Sportscene", "Sportscene", "The Irish League Show",
"Match of the Day", "EastEnders", "Dragons' Den", "Fake or Fortune?",
"Asian Provocateur", "In The Flesh", "Two Pints of Lager and a Packet of Crisps",
"Travels in Trumpland with Ed Balls", "Hidden", "Train Surfing Wars: A Matter of Life and Death",
"Bollywood: The World's Biggest Film Industry", "One Hot Summer",
"Asian Provocateur", "In The Flesh", "Two Pints of Lager and a Packet of Crisps",
"Travels in Trumpland with Ed Balls", "EastEnders", "Match of the Day",
"Dragons' Den", "The Next Step", "Doctor Who Series 11 Trailer",
"Doctor Who", "Doctor Who", "Doctor Who", "Picnic at Hanging Rock",
"Sylvia", "Keeping Faith", "Cardinal: Blackfly Season", "Picnic at Hanging Rock",
"Age Before Beauty", "One Hot Summer", "Stewart Lee's Comedy Vehicle",
"Asian Provocateur", "In The Flesh", "Two Pints of Lager and a Packet of Crisps",
"Travels in Trumpland with Ed Balls", "EastEnders", "Age Before Beauty",
"Holby City", "Who Do You Think You Are?", "Louis Theroux: Dark States",
"Louis Theroux: Dark States", "Louis Theroux", "Louis Theroux's Weird Weekends",
"Picnic at Hanging Rock", "Sylvia", "Keeping Faith", "Cardinal: Blackfly Season"
), Programme_Genre = c("Entertainment", "Documentary", "Drama",
"New SeriesComedy", "Sport", "Sport", "Sport", "Sport", "Sport",
"Drama", "Entertainment", "Documentary", "Comedy", "Drama", "Comedy",
"Documentary", "Crime Drama", "Documentary", "Documentary", "Documentary",
"Comedy", "Drama", "Comedy", "Documentary", "Drama", "Sport",
"Entertainment", "CBBC", "Sci-Fi", "Sci-Fi", "Sci-Fi", "Sci-Fi",
"Drama", "Film", "Drama", "Crime Drama", "On Now", "Drama", "Documentary",
"Comedy", "Comedy", "Drama", "Comedy", "Documentary", "Drama",
"Drama", "Drama", "History", "Documentary", "Documentary", "Documentary",
"Archive", "Drama", "Film", "Drama", "Crime Drama"), Programme_Category = c("Featured",
"Featured", "Featured", "Featured", "This Weekend's Football",
"This Weekend's Football", "This Weekend's Football", "This Weekend's Football",
"Most Popular", "Most Popular", "Most Popular", "Most Popular",
"Box Sets", "Box Sets", "Box Sets", "Box Sets", "Featured", "Featured",
"Featured", "Featured", "Box Sets", "Box Sets", "Box Sets", "Box Sets",
"Most Popular", "Most Popular", "Most Popular", "Most Popular",
"Doctor Who S1-S10", "Doctor Who S1-S10", "Doctor Who S1-S10",
"Doctor Who S1-S10", "Drama", "Drama", "Drama", "Drama", "Featured",
"Featured", "Featured", "Featured", "Box Sets", "Box Sets", "Box Sets",
"Box Sets", "Most Popular", "Most Popular", "Most Popular", "Most Popular",
"Louis Theroux", "Louis Theroux", "Louis Theroux", "Louis Theroux",
"Drama", "Drama", "Drama", "Drama"), date = c("13/08/2018", "13/08/2018",
"13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018",
"13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018",
"13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", "14/08/2018",
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018",
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018",
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018",
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "15/08/2018",
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018",
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018",
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018",
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018"), column = c("1",
"2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4", "1", "2",
"3", "4", "1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3",
"4", "1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4",
"1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4", "1",
"2", "3", "4"), row = c("1", "1", "1", "1", "2", "2", "2", "2",
"3", "3", "3", "3", "4", "4", "4", "4", "1", "1", "1", "1", "2",
"2", "2", "2", "3", "3", "3", "3", "4", "4", "4", "4", "5", "5",
"5", "5", "1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3",
"3", "4", "4", "4", "4", "5", "5", "5", "5")), row.names = c(NA,
-56L), class = "data.frame")

Sorry, I'm not sure what the best practice is for sharing data. Hopefully the above works. It should look like this:

  Title             Programme_Genre    Programme_Category  date        column  row
1 Dragons Den       Entertainment      Featured            13/08/2018  1       1
2 One Hot Summer    Documentary        Featured            13/08/2018  2       1
3 Keeping Faith     Drama              Featured            13/08/2018  3       1
4 Cuckoo            New Series Comedy  Featured            13/08/2018  4       1
5 Match of the Day  Sport              This Weekends...    13/08/2018  1       2
6 Sportscene        Sport              This Weekends...    13/08/2018  2       2

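What I'd like to do is use the rollapply function in a similar way to what was suggested in my previous question (see the link above), but only to find sequences that occur on the same date and span a certain range of columns. For example, I'd like to know what the most common sequences of genres ("Programme_Genre") are, but I only want rollapply to look across columns 1-4 of each row on each date. I'm sure I haven't explained this very well (I'm not from a data science background, in case you hadn't guessed), so I'm happy to elaborate if necessary. Thanks in advance!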
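To make the requirement concrete, here is a minimal base-R sketch (not part of the original post) of what "sequences within one row" means for a single row. It takes the four genres in row 1 on 13/08/2018 from the sample data and lists every run of consecutive genres. The window width of 3 and the names `genres`, `width`, `starts` and `windows` are assumptions for illustration; the question does not state a width.

```r
# Genres in columns 1-4 of row 1 on 13/08/2018, copied from the sample data.
genres <- c("Entertainment", "Documentary", "Drama", "New SeriesComedy")
width <- 3  # assumed window width; not specified in the question

# Start positions of every full window that fits inside this one row.
starts <- seq_len(length(genres) - width + 1)

# Each window is `width` consecutive genres, kept in column order.
windows <- lapply(starts, function(i) genres[i:(i + width - 1)])
print(windows)
```

With four columns and a width of 3, each row contributes two windows, and no window runs over into the next row or date.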

Best answer

Using tidyverse, zoo and lubridate, try:

library(tidyverse)
library(zoo)
library(lubridate)

# `df` is the data frame built from the dput() output in the question.
df %>%
  mutate(date = lubridate::dmy(date)) %>%  # Optional. Parses date as Date class, which makes sorting easier.
  filter(as.integer(column) <= 4) %>%      # Step 1. Exclude observations with `column` above 4 (`column` is stored as text, so convert first).
  group_split(row, date) %>%               # Step 2. Split the DF into smaller DFs, one per row and date group.
  # Step 3 (below). Apply the solution from the previous question to each group,
  # get a DF back, and attach that group's date and row values to each observation.
  map_df(.x = .,
         .f = ~ (rollapply(data = .x$Programme_Genre, 3, c) %>%
                   as_tibble() %>%
                   mutate(date = unique(.x$date), row = unique(.x$row)))) %>%
  group_by_all() %>%
  tally() %>%
  arrange(date, row, n)

# A tibble: 26 x 6
# Groups:   V1, V2, V3, date [26]
   V1            V2            V3               date       row       n
   <chr>         <chr>         <chr>            <date>     <chr> <int>
 1 Documentary   Drama         New SeriesComedy 2018-08-13 1         1
 2 Entertainment Documentary   Drama            2018-08-13 1         1
 3 Sport         Sport         Sport            2018-08-13 2         2
 4 Drama         Entertainment Documentary      2018-08-13 3         1
 5 Sport         Drama         Entertainment    2018-08-13 3         1
 6 Comedy        Drama         Comedy           2018-08-13 4         1
 7 Drama         Comedy        Documentary      2018-08-13 4         1
 8 Crime Drama   Documentary   Documentary      2018-08-14 1         1
 9 Documentary   Documentary   Documentary      2018-08-14 1         1
10 Comedy        Drama         Comedy           2018-08-14 2         1
# ... with 16 more rows
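
The output above counts each sequence separately per date and row. If the goal is instead a single ranking of the most common sequences across all dates and rows, one possible approach is to pool the windows from every group and tabulate them. The following is a minimal base-R sketch under stated assumptions, not the answerer's method: `toy` is a small hand-copied subset of the question's data (row 2 and row 4 on 13/08/2018, row 2 on 14/08/2018 and 15/08/2018), `genre_windows` is a hypothetical helper, the " > " separator is arbitrary, and the width of 3 is taken from the rollapply call above.

```r
# Toy subset of the question's data: four (date, row) groups, columns 1-4 each.
toy <- data.frame(
  date = rep(c("13/08/2018", "13/08/2018", "14/08/2018", "15/08/2018"), each = 4),
  row = rep(c("2", "4", "2", "2"), each = 4),
  column = rep(as.character(1:4), times = 4),
  Programme_Genre = c(rep("Sport", 4),
                      rep(c("Comedy", "Drama", "Comedy", "Documentary"), times = 3)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: every run of `width` consecutive genres, pasted into one label.
genre_windows <- function(x, width = 3) {
  n <- length(x) - width + 1
  if (n < 1) return(character(0))  # group too short to hold a full window
  vapply(seq_len(n),
         function(i) paste(x[i:(i + width - 1)], collapse = " > "),
         character(1))
}

# `column` is stored as text, so convert it before comparing or sorting.
toy$column <- as.integer(toy$column)
toy <- toy[toy$column <= 4, ]
toy <- toy[order(toy$date, toy$row, toy$column), ]

# One genre vector per (date, row) group, so windows never cross a group boundary.
groups <- split(toy$Programme_Genre, list(toy$date, toy$row), drop = TRUE)

# Pool the windows from every group and count each distinct sequence.
all_windows <- unlist(lapply(groups, genre_windows, width = 3), use.names = FALSE)
sequence_counts <- sort(table(all_windows), decreasing = TRUE)
print(sequence_counts)
```

Splitting before windowing is the design choice that matters here: running a rolling window over the whole column at once would create sequences that straddle two rows or two dates, which is what the question wants to avoid.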

Regarding "r - How can I use R to find the most common sequences in my data?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/67218140/
