r - How can I find the most common sequences in my data using R?

Reposted by: 行者123 Updated: 2023-12-04 14:59:40

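I'm trying to work out how to use the rollapply function (from the zoo package) to find the most common sequences of strings in a dataset, but I also need to group by certain variables (e.g. date, row, etc.).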

Before I go on, it's worth noting that this query builds on a question I posted here previously: How can I find most common sequences (of strings) in my data using Tableau?

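The solution provided there worked really well, but I now want to apply it to a different dataset, which brings some new challenges! Here is a sample of the data I'm using in this new dataset: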

df <- structure(list(Title = c("Dragons' Den", "One Hot Summer", "Keeping Faith", 
"Cuckoo", "Match of the Day", "Sportscene", "Sportscene", "The Irish League Show", 
"Match of the Day", "EastEnders", "Dragons' Den", "Fake or Fortune?", 
"Asian Provocateur", "In The Flesh", "Two Pints of Lager and a Packet of Crisps", 
"Travels in Trumpland with Ed Balls", "Hidden", "Train Surfing Wars: A Matter of Life and Death", 
"Bollywood: The World's Biggest Film Industry", "One Hot Summer", 
"Asian Provocateur", "In The Flesh", "Two Pints of Lager and a Packet of Crisps", 
"Travels in Trumpland with Ed Balls", "EastEnders", "Match of the Day", 
"Dragons' Den", "The Next Step", "Doctor Who Series 11 Trailer", 
"Doctor Who", "Doctor Who", "Doctor Who", "Picnic at Hanging Rock", 
"Sylvia", "Keeping Faith", "Cardinal: Blackfly Season", "Picnic at Hanging Rock", 
"Age Before Beauty", "One Hot Summer", "Stewart Lee's Comedy Vehicle", 
"Asian Provocateur", "In The Flesh", "Two Pints of Lager and a Packet of Crisps", 
"Travels in Trumpland with Ed Balls", "EastEnders", "Age Before Beauty", 
"Holby City", "Who Do You Think You Are?", "Louis Theroux: Dark States", 
"Louis Theroux: Dark States", "Louis Theroux", "Louis Theroux's Weird Weekends", 
"Picnic at Hanging Rock", "Sylvia", "Keeping Faith", "Cardinal: Blackfly Season"
), Programme_Genre = c("Entertainment", "Documentary", "Drama", 
"New SeriesComedy", "Sport", "Sport", "Sport", "Sport", "Sport", 
"Drama", "Entertainment", "Documentary", "Comedy", "Drama", "Comedy", 
"Documentary", "Crime Drama", "Documentary", "Documentary", "Documentary", 
"Comedy", "Drama", "Comedy", "Documentary", "Drama", "Sport", 
"Entertainment", "CBBC", "Sci-Fi", "Sci-Fi", "Sci-Fi", "Sci-Fi", 
"Drama", "Film", "Drama", "Crime Drama", "On Now", "Drama", "Documentary", 
"Comedy", "Comedy", "Drama", "Comedy", "Documentary", "Drama", 
"Drama", "Drama", "History", "Documentary", "Documentary", "Documentary", 
"Archive", "Drama", "Film", "Drama", "Crime Drama"), Programme_Category = c("Featured", 
"Featured", "Featured", "Featured", "This Weekend's Football", 
"This Weekend's Football", "This Weekend's Football", "This Weekend's Football", 
"Most Popular", "Most Popular", "Most Popular", "Most Popular", 
"Box Sets", "Box Sets", "Box Sets", "Box Sets", "Featured", "Featured", 
"Featured", "Featured", "Box Sets", "Box Sets", "Box Sets", "Box Sets", 
"Most Popular", "Most Popular", "Most Popular", "Most Popular", 
"Doctor Who S1-S10", "Doctor Who S1-S10", "Doctor Who S1-S10", 
"Doctor Who S1-S10", "Drama", "Drama", "Drama", "Drama", "Featured", 
"Featured", "Featured", "Featured", "Box Sets", "Box Sets", "Box Sets", 
"Box Sets", "Most Popular", "Most Popular", "Most Popular", "Most Popular", 
"Louis Theroux", "Louis Theroux", "Louis Theroux", "Louis Theroux", 
"Drama", "Drama", "Drama", "Drama"), date = c("13/08/2018", "13/08/2018", 
"13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", 
"13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", 
"13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", "14/08/2018", 
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", 
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", 
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", 
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "15/08/2018", 
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", 
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", 
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", 
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018"), column = c("1", 
"2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4", "1", "2", 
"3", "4", "1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3", 
"4", "1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4", 
"1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4", "1", 
"2", "3", "4"), row = c("1", "1", "1", "1", "2", "2", "2", "2", 
"3", "3", "3", "3", "4", "4", "4", "4", "1", "1", "1", "1", "2", 
"2", "2", "2", "3", "3", "3", "3", "4", "4", "4", "4", "5", "5", 
"5", "5", "1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", 
"3", "4", "4", "4", "4", "5", "5", "5", "5")), row.names = c(NA, 
-56L), class = "data.frame")

Apologies, I'm not sure of the best practice for sharing data. Hopefully the above works. It should look something like this:

   Title             Programme_Genre    Programme_Category  date        column  row
1  Dragons Den       Entertainment      Featured            13/08/2018  1       1
2  One Hot Summer    Documentary        Featured            13/08/2018  2       1
3  Keeping Faith     Drama              Featured            13/08/2018  3       1
4  Cuckoo            New Series Comedy  Featured            13/08/2018  4       1
5  Match of the Day  Sport              This Weekends...    13/08/2018  1       2
6  Sportscene        Sport              This Weekends...    13/08/2018  2       2

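What I'd like to do is use the rollapply function, in a similar way to what was suggested in my previous question (see the link above), but only to find sequences that appear on the same date and within a certain range of columns. For example, I'd like to know what the most common sequences of genres ("Programme_Genre") are, but I only want rollapply to do this across columns 1-4 of each row on each date. I'm sure I haven't explained this very well (I'm not from a data science background, in case you hadn't guessed), so I'm happy to elaborate if necessary. Thanks in advance!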
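To make the request concrete, here is a minimal base-R sketch of what a "sequence" means for a single row, assuming a window width of 3. The helper name `rolling_windows` is hypothetical and is not part of zoo; it only mimics the windowing idea so the expected shape of the result is easy to see.

```r
# Genres in columns 1-4 of row 1 on 13/08/2018, copied from the sample data.
row_genres <- c("Entertainment", "Documentary", "Drama", "New SeriesComedy")

# Hypothetical helper: every run of `width` consecutive elements of `x`.
rolling_windows <- function(x, width) {
  starts <- seq_len(max(length(x) - width + 1, 0))
  lapply(starts, function(i) x[i:(i + width - 1)])
}

# Four columns and a width of 3 give two overlapping windows.
windows <- rolling_windows(row_genres, width = 3)
print(windows)
```

Because the window is applied to one row at a time, no sequence can run from the end of one row into the start of the next, which is the grouping behaviour asked for here.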

Accepted answer

Using tidyverse, zoo, and lubridate, try:

library(tidyverse)
library(zoo)
library(lubridate)

df %>% 
  mutate(date = lubridate::dmy(date)) %>% # Optional. Parses `date` as a Date so it sorts chronologically.
  filter(as.integer(column) <= 4) %>% # Step 1. Keep columns 1-4 only. `column` is stored as character, so convert it before comparing.
  group_split(row, date) %>% # Step 2. Split the DF into one smaller DF per row and date group.
  # Step 3 (below). Apply the rolling window from the previous question to each group, then tag every window with that group's date and row.
  map_df(.x = . ,
         .f = ~(rollapply(data = .x$Programme_Genre , 3, c) %>% 
                  as_tibble() %>% 
                  mutate(date = unique(.x$date), row = unique(.x$row)))) %>% 
  group_by_all() %>% 
  tally() %>% 
  arrange(date, row, n)

# A tibble: 26 x 6
# Groups:   V1, V2, V3, date [26]
   V1            V2            V3               date       row       n
   <chr>         <chr>         <chr>            <date>     <chr> <int>
 1 Documentary   Drama         New SeriesComedy 2018-08-13 1         1
 2 Entertainment Documentary   Drama            2018-08-13 1         1
 3 Sport         Sport         Sport            2018-08-13 2         2
 4 Drama         Entertainment Documentary      2018-08-13 3         1
 5 Sport         Drama         Entertainment    2018-08-13 3         1
 6 Comedy        Drama         Comedy           2018-08-13 4         1
 7 Drama         Comedy        Documentary      2018-08-13 4         1
 8 Crime Drama   Documentary   Documentary      2018-08-14 1         1
 9 Documentary   Documentary   Documentary      2018-08-14 1         1
10 Comedy        Drama         Comedy           2018-08-14 2         1
# ... with 16 more rows
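
The result above counts each sequence within a single date and row. If the goal is one overall ranking across every date and row (an assumption about what "most common" means here), the per-group counts can be summed. The following is a minimal base-R sketch on a few rows copied from the printed output; the names `per_group` and `overall` are illustrative only.

```r
# A few rows copied from the printed result above (not the full 26-row table).
per_group <- data.frame(
  V1   = c("Sport", "Comedy", "Drama", "Comedy"),
  V2   = c("Sport", "Drama", "Comedy", "Drama"),
  V3   = c("Sport", "Comedy", "Documentary", "Comedy"),
  date = as.Date(c("2018-08-13", "2018-08-13", "2018-08-13", "2018-08-14")),
  row  = c("2", "4", "4", "2"),
  n    = c(2L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Sum the per-(date, row) counts for each distinct three-genre sequence.
overall <- aggregate(n ~ V1 + V2 + V3, data = per_group, FUN = sum)

# Most frequent sequences first.
overall <- overall[order(-overall$n), ]
print(overall)
```

On the full result, the same aggregation would be applied to all 26 rows rather than to this excerpt.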

Regarding "r - How can I find the most common sequences in my data using R?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/67218140/
