r - A tidyverse/faster solution for conditional formatting with openxlsx in R?

Reposted by: 行者123  Updated: 2023-12-04 11:46:31

I am working with genetic data that looks like this table, only much larger:

ID allele.a allele.b
A      115       90
A      115       90
A      116       90
B      120       82
B      120       82
B      120      82M

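My goal is to highlight, for each ID, the alleles that do not match the alleles listed in the first row of that ID's group. I need to export the data to a nicely formatted Excel file.

This is what I want (in the table above, the 116 in the third row and the 82M in the last row would be highlighted):

I can get there with the script below, but the real script involves about 67 IDs, 1000 rows of data, and 37 columns. It takes about 5 minutes to run, so I am hoping for a solution that cuts the processing time significantly. Perhaps a "do" solution from the tidyverse, though I am not sure what that would look like.

Here is my script, including a test data.frame. A larger test data.frame for speed testing is included as well.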

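# Only openxlsx is needed for the workbook functions; the xlsx package also
# defines createWorkbook() and saveWorkbook(), so loading both invites clashes.
library(openxlsx)
library(tidyverse)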

# Small data.frame
dframe <- data.frame(ID = c("A", "A", "A", "B", "B", "B"),
                     allele.a = c("115", "115", "116", "120", "120", "120"),
                     allele.b = c("90", "90", "90", "82", "82", "82M"),
                     stringsAsFactors = FALSE)

# Bigger data.frame for speed test
# dframe <- data.frame(ID = rep(letters, each = 30),
#                      allele.a = rep(as.character(round(rnorm(n = 30, mean = 100, sd = 0.3), 0)), 26),
#                      allele.b = rep(as.character(round(rnorm(n = 30, mean = 90, sd = 0.3), 0)), 26),
#                      allele.c = rep(as.character(round(rnorm(n = 30, mean = 80, sd = 0.3), 0)), 26),
#                      allele.d = rep(as.character(round(rnorm(n = 30, mean = 70, sd = 0.3), 0)), 26),
#                      allele.e = rep(as.character(round(rnorm(n = 30, mean = 60, sd = 0.3), 0)), 26),
#                      allele.f = rep(as.character(round(rnorm(n = 30, mean = 50, sd = 0.3), 0)), 26),
#                      allele.g = rep(as.character(round(rnorm(n = 30, mean = 40, sd = 0.3), 0)), 26),
#                      allele.h = rep(as.character(round(rnorm(n = 30, mean = 30, sd = 0.3), 0)), 26),
#                      allele.i = rep(as.character(round(rnorm(n = 30, mean = 20, sd = 0.3), 0)), 26),
#                      allele.j = rep(as.character(round(rnorm(n = 30, mean = 10, sd = 0.3), 0)), 26),
#                      stringsAsFactors = FALSE)



# Create a new excel workbook ----
wb <- createWorkbook()

# Add a worksheet
addWorksheet(wb, sheetName = "Sheet1", gridLines = TRUE)

# Add the data to the worksheet
writeData(wb, sheet = 1, dframe, rowNames = FALSE)

# Create a style to show alleles that do not match the first row.
style_Red_NoMatch <- createStyle(fontColour = "#FFFFFF", # white text
                                 bgFill = "#CC0000", # Dark red background
                                 textDecoration = c("BOLD")) # bold text

Groups <- unique(dframe$ID)

start_time <- Sys.time()
# For each unique group,
for (i in seq_along(Groups)) {

  # Print a message telling us where the script is processing in the file.
  print(paste0("Formatting unique group ", i, "/", length(Groups)))

  # What are the allele values of the *first* individual in the group?
  Allele.values <- dframe %>%
    filter(ID == Groups[i]) %>%
    slice(1) %>%
    select(2:ncol(dframe)) %>%
    as.character()

  # For each column that has allele values in it,
  for (j in seq_along(Allele.values)) {
    # Format the rows of this group so that a value that does not match the
    # first row's value gets the red style. The + 1 in rows skips the header
    # row; the 1 + j in cols skips the ID column.
    conditionalFormatting(wb, sheet = 1,
                          style_Red_NoMatch,
                          rows = which(dframe$ID == Groups[i]) + 1,
                          cols = 1 + j,
                          rule = paste0("<>\"", Allele.values[j], "\""))
  }

}
end_time <- Sys.time()
end_time - start_time

saveWorkbook(wb, "Example.xlsx", overwrite = TRUE)

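Best answer

One improvement I can think of is to apply conditionalFormatting over entire columns instead of looping through each cell.
Here is one approach. A drawback is that it creates logical TRUE/FALSE columns for conditionalFormatting to use. These columns can be hidden with the setColWidths function.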
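Since the question asks about the tidyverse, the TRUE/FALSE helper columns could also be built with dplyr. The following is only a sketch, not part of the original answer: it assumes a recent dplyr (1.0 or later, for across()) and that every allele column name starts with "allele". The _chk suffix follows the naming used in the answer.

```r
library(dplyr)

# Small example data from the question
dframe <- data.frame(ID = c("A", "A", "A", "B", "B", "B"),
                     allele.a = c("115", "115", "116", "120", "120", "120"),
                     allele.b = c("90", "90", "90", "82", "82", "82M"),
                     stringsAsFactors = FALSE)

# Within each ID group, compare every allele column with the value in the
# group's first row and store the result in a new <column>_chk column.
dframe_chk <- dframe %>%
  group_by(ID) %>%
  mutate(across(starts_with("allele"),
                ~ .x == first(.x),
                .names = "{.col}_chk")) %>%
  ungroup()

print(dframe_chk)
```

The row order of dframe is kept, because a grouped mutate() does not reorder rows, so the _chk columns line up with the data as written to the sheet.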
Data

library(openxlsx)

dframe <- data.frame(ID = rep(letters, each = 30),
                     allele.a = rep(as.character(round(rnorm(n = 30, mean = 100, sd = 0.3), 0)), 26),
                     allele.b = rep(as.character(round(rnorm(n = 30, mean = 90, sd = 0.3), 0)), 26),
                     allele.c = rep(as.character(round(rnorm(n = 30, mean = 80, sd = 0.3), 0)), 26),
                     allele.d = rep(as.character(round(rnorm(n = 30, mean = 70, sd = 0.3), 0)), 26),
                     allele.e = rep(as.character(round(rnorm(n = 30, mean = 60, sd = 0.3), 0)), 26),
                     allele.f = rep(as.character(round(rnorm(n = 30, mean = 50, sd = 0.3), 0)), 26),
                     allele.g = rep(as.character(round(rnorm(n = 30, mean = 40, sd = 0.3), 0)), 26),
                     allele.h = rep(as.character(round(rnorm(n = 30, mean = 30, sd = 0.3), 0)), 26),
                     allele.i = rep(as.character(round(rnorm(n = 30, mean = 20, sd = 0.3), 0)), 26),
                     allele.j = rep(as.character(round(rnorm(n = 30, mean = 10, sd = 0.3), 0)), 26),
                     stringsAsFactors = FALSE)

The first part of the script is unchanged.

# Create a new excel workbook ----
wb <- createWorkbook()

# Add a worksheet
addWorksheet(wb, sheetName = "Sheet1", gridLines = TRUE)

# Create a style to show alleles that do not match the first row.
style_Red_NoMatch <- createStyle(fontColour = "#FFFFFF", # white text
                                 bgFill = "#CC0000", # Dark red background
                                 textDecoration = c("BOLD")) # bold text

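Then identify the first row of each ID and merge it back into the original dataset. Next, loop over the columns and check whether each cell differs from that first row.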

# selects first row for each ID which will be used as benchmark
first_row <- dframe[!duplicated(dframe$ID), ]

# Creating new df with the first_row columns added
dframe_chk <- merge(dframe, first_row, by = "ID",  all.x = TRUE, suffixes = c("", "_first"))

# Add a logical TRUE/FALSE column for each allele column showing whether it
# matches the first row of its ID ([-1] excludes the ID column)
for (j in names(dframe)[-1]) {
  dframe_chk[, paste0(j, "_chk")] <- dframe_chk[, j] == dframe_chk[, paste0(j, "_first")]
}

# Remove _first columns when exporting into Excel
cols <- names(dframe_chk)[!grepl("_first", names(dframe_chk))]

# add the data to the worksheet        
writeData(wb, sheet = 1, dframe_chk[, cols], rowNames = FALSE)      

# Ranges for the conditional formatting
# Row 1 of the sheet is the header, so the data starts at row 2
row_start <- 2

# Add 1 to cover the full range (the first row is the header)
row_end <- nrow(dframe) + 1

# The first column is ID
col_start <- 2

# The last column of the original dataset
col_end <- ncol(dframe)

# Excel letter of the first _chk column.
# int2col() from openxlsx also handles columns beyond Z (27 becomes "AA"),
# which LETTERS[col_end + 1] would not.
rule_col <- int2col(col_end + 1)

# Use the _chk columns in the conditional formatting formula.
# The cell reference is relative to the top-left cell of the range, so each
# allele column is checked against its own _chk column on the same row.
conditionalFormatting(wb, sheet = 1,
                      style_Red_NoMatch,
                      rows = row_start:row_end,
                      cols = col_start:col_end,
                      rule = paste0(rule_col, "2 = FALSE"))

# Exported file includes _chk columns. Hide these columns.
setColWidths(wb, sheet = 1, cols = (col_end + 1):length(cols), hidden = TRUE)

saveWorkbook(wb, "Example2.xlsx", overwrite = TRUE)
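If the highlighting does not need to update when values are later edited in Excel, the helper columns can be avoided altogether by styling the mismatching cells directly. The sketch below is an alternative to the answer above, not part of it: it computes the mismatches in R and applies a static style with a single addStyle() call. The names first_idx, mismatch, hits and style_red are illustrative, and the style uses fgFill because, as far as I know, bgFill is the fill used for conditional formatting while fgFill is the one for ordinary cell styles.

```r
library(openxlsx)

# Small example data from the question
dframe <- data.frame(ID = c("A", "A", "A", "B", "B", "B"),
                     allele.a = c("115", "115", "116", "120", "120", "120"),
                     allele.b = c("90", "90", "90", "82", "82", "82M"),
                     stringsAsFactors = FALSE)

allele_cols <- names(dframe)[-1]

# For every row, the index of the first row that has the same ID
first_idx <- match(dframe$ID, dframe$ID)

# Logical matrix: TRUE where a cell differs from its group's first row
mismatch <- as.matrix(dframe[allele_cols]) !=
  as.matrix(dframe[first_idx, allele_cols, drop = FALSE])

# (row, col) positions of the mismatching cells within the allele block
hits <- which(mismatch, arr.ind = TRUE)

wb <- createWorkbook()
addWorksheet(wb, sheetName = "Sheet1", gridLines = TRUE)
writeData(wb, sheet = 1, dframe, rowNames = FALSE)

# Static style: white bold text on a dark red fill
style_red <- createStyle(fontColour = "#FFFFFF",
                         fgFill = "#CC0000",
                         textDecoration = "bold")

if (nrow(hits) > 0) {
  addStyle(wb, sheet = 1, style = style_red,
           rows = unname(hits[, "row"]) + 1,  # + 1 skips the header row
           cols = unname(hits[, "col"]) + 1,  # + 1 skips the ID column
           gridExpand = FALSE)                # treat rows/cols as pairs
}

out_file <- tempfile(fileext = ".xlsx")
saveWorkbook(wb, out_file, overwrite = TRUE)
```

The comparison is vectorised over the whole data frame, so the cost no longer grows with one conditionalFormatting() call per group and column. The trade-off is that the red cells are fixed at export time.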

Regarding "r - A tidyverse/faster solution for conditional formatting with openxlsx in R?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50992957/
