r - A faster way to run a regression on big data

Reposted by: 行者123  Updated: 2023-12-04 04:12:20


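I have a large dataset with 70k+ rows and multiple columns of variable data. I also have one column, a factor with more than 5000 levels, that I need to use.

Is there any way to speed up the regression? It currently takes more than 40 minutes to run. The only ways I can think of to speed it up are to restrict the training data to the factor levels that appear in the test data, or to use data.table and run the regression from that.

Any help would be greatly appreciated.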

```r
library(dbplyr)
library(dplyr)
library(data.table)
library(readxl)  # read_excel() comes from readxl, not readr

# Load the 'Vic' sheet and treat each dog's name as a factor
greys <- read_excel("Punting'/Dogs/greys.xlsx", sheet = 'Vic')
greys$name <- as.factor(greys$name)

# Keep the last 63000 rows; hold out the final 190 of them as test data
ggtrain <- tail(greys, 63000)
gtrain  <- head(ggtrain, -190)
gtest1  <- tail(ggtrain, 190)
gtest   <- filter(gtest1, runnum > 5)

#mygrey <- gam(time ~ s(name, bs = 'fs') + s(box) + s(distance), data = gtrain, method = 'ML')
# Bare column names are the idiomatic form when data = is supplied
mygrey <- lm(margin ~ name + box + distance + trate + grade + trackid, data = gtrain)
pgrey  <- predict(mygrey, gtest)
gdf    <- data.frame(gtest$name, pgrey)
#gdf
write.csv(gdf, 'thedogs.csv')
```

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   63000 obs. of  25 variables:
 $ position: num  4 5 6 7 1 2 3 4 5 6 ...
 $ box     : num  3 10 5 8 3 7 9 5 2 4 ...
 $ name    : Factor w/ 5903 levels "AARON'S ME BOY",..: 4107 2197 3294 3402 4766 4463 5477 274 5506 2249 ...
 $ trainer : chr  "Marcus Lloyd" "Ian Robinson" "Adam Richardson" "Nathan  Hunt" ...
 $ time    : num  22.9 23 23.1 23.5 22.5 ...
 $ margin  : num  7.25 8.31 9.96 15.33 0 ...
 $ split   : num  9.17 8.98 9.12 9.14 8.62 8.73 8.8 8.99 9.04 9.02 ...
 $ inrun   : num  75 44 56 67 11 22 33 54 76 67 ...
 $ weight  : num  27.9 26.2 30.3 27.7 26.5 31.5 34.1 32.8 31.2 34 ...
 $ sire    : chr  "Didda Joe" "Swift Fancy" "Barcia Bale" "Hostile" ...
 $ dam     : chr  "Hurricane Queen" "Ulla Allen" "Diva's Shadow" "Flashing Bessy" ...
 $ odds    : num  20.3 55.5 1.6 33.2 1.6 5 22.6 7.9 12.5 9.9 ...
 $ distance: num  390 390 390 390 390 390 390 390 390 390 ...
 $ grade   : num  4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 ...
 $ race    : chr  "Race 11" "Race 11" "Race 11" "Race 11" ...
 $ location: chr  "Ballarat" "Ballarat" "Ballarat" "Ballarat" ...
 $ date    : chr  "Monday 5th of August 2019" "Monday 5th of August 2019" "Monday 5th of August 2019" "Monday 5th of August 2019" ...
 $ state   : chr  "vic" "vic" "vic" "vic" ...
 $ trate   : num  0.515 0.376 0.818 0.226 0.55 ...
 $ espeed  : num  75 44 56 67 11 22 33 54 76 67 ...
 $ trackid : num  3 3 3 3 3 3 3 3 3 3 ...
 $ runnum  : num  4 6 3 2 2 2 3 4 2 4 ...
 $ qms     : chr  "M/75" "M/44" "M/56" "M/67" ...

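Best answer

Your regression is slow to fit because of the name variable. Fitting a factor with 5903 levels adds roughly 5,900 columns to your design matrix (one indicator per level, minus a reference level under R's default contrasts). It is like trying to fit about 5,900 separate variables.

Your design matrix will be about 63000 x 5908, which, first, takes up a lot of memory and, second, makes lm work hard to compute its estimates (hence the 40-minute fit time).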
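
To make the size argument concrete, here is a small sketch on simulated data. The `toy` data frame and its columns are made up for illustration and are not the asker's data. It counts the columns that `model.matrix` creates for a factor, compares dense storage with the sparse representation from the Matrix package, and then does a back-of-the-envelope memory estimate for a 63000 x 5908 matrix of doubles.

```r
library(Matrix)  # recommended package that ships with R

set.seed(1)
# Hypothetical toy data: 200 "dogs" with 10 runs each (not the real dataset)
toy <- data.frame(
  name = factor(rep(sprintf("dog%03d", 1:200), each = 10)),
  box  = sample(1:8, 2000, replace = TRUE)
)

# Dense design matrix: intercept + 199 name indicators + box
X_dense <- model.matrix(~ name + box, data = toy)
dim(X_dense)

# Same matrix stored sparsely: only the non-zero entries are kept
X_sparse <- sparse.model.matrix(~ name + box, data = toy)
c(dense  = as.numeric(object.size(X_dense)),
  sparse = as.numeric(object.size(X_sparse)))

# Rough dense storage for the matrix in the question (8 bytes per double)
dense_gib <- 63000 * 5908 * 8 / 2^30
dense_gib
```

The estimate comes out at a little under 3 GiB for a single copy of the matrix, and a least-squares fit typically needs working copies on top of that.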

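You have a few options:

1. Keep your design as it is and wait (or find a slightly faster lm).
2. Drop the name variable, in which case lm will fit almost instantly.
3. Fit a mixed-effects model with name as a random effect, using lmer or another package. lmer in particular uses a sparse design matrix for random effects, taking advantage of the fact that each observation can have only one of the 5903 names (so most of the matrix is empty).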
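
On the first option, one standard way to get a "faster lm" while keeping name as a fixed effect is the within transformation: subtract each dog's mean from the response and from every numeric predictor, then regress without the name dummies. The answer does not spell this out, so treat the following as a sketch on simulated data (the `toy` data frame and the `demean` helper are hypothetical). By the Frisch-Waugh-Lovell theorem the slope estimates match the full dummy-variable fit.

```r
set.seed(42)
# Hypothetical toy data: 300 dogs with 20 runs each
n_dogs <- 300
toy <- data.frame(name = factor(rep(seq_len(n_dogs), each = 20)))
toy$box      <- sample(1:8, nrow(toy), replace = TRUE)
toy$distance <- sample(c(390, 450, 520), nrow(toy), replace = TRUE)
dog_effect   <- rnorm(n_dogs, sd = 2)
toy$margin   <- 5 + dog_effect[as.integer(toy$name)] + 0.3 * toy$box +
  0.01 * toy$distance + rnorm(nrow(toy))

# Full dummy-variable fit: one coefficient per dog
fit_full <- lm(margin ~ name + box + distance, data = toy)

# Within transformation: subtract each dog's own mean from every variable
demean <- function(x, g) x - ave(x, g)
within_df <- data.frame(
  margin   = demean(toy$margin, toy$name),
  box      = demean(toy$box, toy$name),
  distance = demean(toy$distance, toy$name)
)
fit_within <- lm(margin ~ box + distance - 1, data = within_df)

# The slopes agree; the per-dog intercepts are simply not estimated explicitly
cbind(full   = coef(fit_full)[c("box", "distance")],
      within = coef(fit_within))
```

Two caveats: the standard errors reported by the second `lm` do not account for the degrees of freedom used up by the dog means, and predicting for a specific dog requires adding that dog's mean back in. Packages such as fixest and lfe automate this approach.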

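Of these three options, the third is probably the most principled way forward. The random effect accounts for individual-level differences between observations and also pools information across individuals, which gives better estimates for dogs that do not have many observations. On top of that, it is quick to compute thanks to the sparse design matrix.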
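
The pooling can be made explicit for a simplified case. Assume, purely for illustration, an intercept-only model with no other covariates and with the variance components treated as known. The symbols below are not from the answer.

```latex
% Random-intercept model for run i of dog j
y_{ij} = \mu + u_j + \varepsilon_{ij}, \qquad
u_j \sim \mathcal{N}(0, \sigma_u^2), \qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)

% Predicted effect for dog j with n_j runs and sample mean \bar{y}_j
\hat{u}_j = \frac{n_j \sigma_u^2}{n_j \sigma_u^2 + \sigma^2} \, (\bar{y}_j - \mu)

% Example with \sigma_u^2 = 4 and \sigma^2 = 16:
%   n_j = 2  gives a weight of  8 / 24   = 1/3
%   n_j = 50 gives a weight of 200 / 216 \approx 0.93
```

A dog with only two runs is pulled most of the way toward the overall mean, while a dog with fifty runs keeps an estimate close to its own average.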

A simple model for your dataset might look like this:

```r
library(lme4)
## read data
# name enters as a random intercept; the other predictors stay fixed effects
mygrey <- lmer(margin ~ (1 | name) + box + distance + trate + grade + trackid,
               data = gtrain)
```
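
Because the real spreadsheet is not available, here is a self-contained sketch of the same idea on simulated data. The `toy` data frame, the dog names, and the effect sizes are all invented. It also shows one way to predict for a dog that never appeared in training, using the `allow.new.levels` argument of lme4's `predict` method, which falls back to the fixed effects for unseen levels.

```r
library(lme4)

set.seed(7)
# Hypothetical toy data standing in for gtrain: 200 dogs with 15 runs each
n_dogs <- 200
toy <- data.frame(
  name = factor(rep(sprintf("dog%03d", seq_len(n_dogs)), each = 15))
)
toy$box    <- sample(1:8, nrow(toy), replace = TRUE)
dog_effect <- rnorm(n_dogs, sd = 2)
toy$margin <- 5 + dog_effect[as.integer(toy$name)] + 0.3 * toy$box +
  rnorm(nrow(toy))

# Random intercept per dog, fixed slope for box
fit <- lmer(margin ~ (1 | name) + box, data = toy)

fixef(fit)             # overall intercept and box slope
head(ranef(fit)$name)  # shrunken per-dog deviations from the intercept

# New runs: one dog seen in training and one that was not
new_runs <- data.frame(name = factor(c("dog001", "dog999")), box = c(3, 3))
pred <- predict(fit, newdata = new_runs, allow.new.levels = TRUE)
pred
```

For the unseen dog the prediction uses only the fixed effects, which is the "average dog" at that box.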

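If you want to go down that route, I recommend reading more about mixed-effects models so that you can choose a model structure that makes sense for your data. Here are two good resources:

Guide to using lmer - https://stats.stackexchange.com/questions/13166/rs-lmer-cheat-sheet
A more theoretical discussion of random/mixed effects - https://stats.stackexchange.com/a/151800/

Regarding "r - A faster way to run a regression on big data", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/61492260/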
