r - Extract numbers after a specific phrase

Reposted. Author: 行者123. Updated: 2023-12-04 15:57:15

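I have been trying to write two regular expressions to accomplish the following two tasks:

  1. Pull out the number that follows "EDG ICD HCUP CCS"
  2. Pull out the words that follow "EDG ICD HCUP CCS 159 (PREDICTIVE MODELS-VERSION 1.0)-"

I want to store the number in a column named "category" and the words in a column named "diagnosis".

The strings are in the column named "GROUPER_NAME".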

df <- structure(list(GROUPER_ID = structure(c("9001742130", "9001742138", 
"9001742058", "9001742062", "9001742102", "9001742247", "9001742055",
"9001742158", "9001742036", "9001742053"), label = "GROUPER_ID", format.sas = "$"),
GROUPER_NAME = structure(c("EDG ICD HCUP CCS 130 (PREDICTIVE MODELS-VERSION 1.0)-PLEURISY; PNEUMOTHORAX; PULMONARY COLLAPSE",
"EDG ICD HCUP CCS 138 (PREDICTIVE MODELS-VERSION 1.0)-ESOPHAGEAL DISORDERS",
"EDG ICD HCUP CCS 58 (PREDICTIVE MODELS-VERSION 1.0)-OTHER NUTRITIONAL; ENDOCRINE; AND METABOLIC DISORDERS",
"EDG ICD HCUP CCS 62 (PREDICTIVE MODELS-VERSION 1.0)-COAGULATION AND HEMORRHAGIC DISORDERS",
"EDG ICD HCUP CCS 102 (PREDICTIVE MODELS-VERSION 1.0)-NONSPECIFIC CHEST PAIN",
"EDG ICD HCUP CCS 247 (PREDICTIVE MODELS-VERSION 1.0)-LYMPHADENITIS",
"EDG ICD HCUP CCS 55 (PREDICTIVE MODELS-VERSION 1.0)-FLUID AND ELECTROLYTE DISORDERS",
"EDG ICD HCUP CCS 158 (PREDICTIVE MODELS-VERSION 1.0)-CHRONIC KIDNEY DISEASE",
"EDG ICD HCUP CCS 36 (PREDICTIVE MODELS-VERSION 1.0)-CANCER OF THYROID",
"EDG ICD HCUP CCS 53 (PREDICTIVE MODELS-VERSION 1.0)-DISORDERS OF LIPID METABOLISM"
), label = "GROUPER_NAME", format.sas = "$")), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))

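For the "159" example, I would want to extract "159" and "URINARY TRACT INFECTIONS" and put them in the "category" and "diagnosis" columns, respectively. I tried adapting some of the solutions posted here to my case, but I am really bad at regex and could not get anything to work. Any help would be greatly appreciated!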

Best answer

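We can use sub from base R. Capture the digits (\\d+) that follow the prefix substring, and the characters that follow )-. In the replacement, refer to the two capture groups by backreference (\\1 and \\2, joined here with a : separator), then read the result into a two-column data.frame with read.csv.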

# Rewrite each string as "<number>:<diagnosis>", then parse the result as ":"-separated text
read.csv(text = sub("\\w+ \\w+ \\w+ \\w+ (\\d+)\\s.*\\)-(.*)",
                    "\\1:\\2", df$GROUPER_NAME),
         sep = ":", header = FALSE,
         col.names = c("category", "diagnosis"))

-output

   category                                             diagnosis
1       130            PLEURISY; PNEUMOTHORAX; PULMONARY COLLAPSE
2       138                                  ESOPHAGEAL DISORDERS
3        58 OTHER NUTRITIONAL; ENDOCRINE; AND METABOLIC DISORDERS
4        62                 COAGULATION AND HEMORRHAGIC DISORDERS
5       102                                NONSPECIFIC CHEST PAIN
6       247                                         LYMPHADENITIS
7        55                       FLUID AND ELECTROLYTE DISORDERS
8       158                                CHRONIC KIDNEY DISEASE
9        36                                     CANCER OF THYROID
10       53                         DISORDERS OF LIPID METABOLISM
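
The sub/read.csv approach relies on ":" never appearing in the diagnosis text. If it did, read.csv would likely split that row into extra fields. As a possible alternative, here is a minimal sketch that uses strcapture from the utils package (part of base R since 3.4.0, to the best of my knowledge) to pull the capture groups straight into typed columns, with no separator round trip. The vector x, the pattern, and the names proto and res are illustrative assumptions, not part of the original answer. The two strings are copied from the sample data, and the pattern assumes every string starts with the literal prefix "EDG ICD HCUP CCS".

```r
# Two strings copied from the GROUPER_NAME sample data (illustrative subset)
x <- c(
  "EDG ICD HCUP CCS 130 (PREDICTIVE MODELS-VERSION 1.0)-PLEURISY; PNEUMOTHORAX; PULMONARY COLLAPSE",
  "EDG ICD HCUP CCS 58 (PREDICTIVE MODELS-VERSION 1.0)-OTHER NUTRITIONAL; ENDOCRINE; AND METABOLIC DISORDERS"
)

# Group 1: the digits after the literal prefix
# Group 2: everything after the parenthesised part and the "-" that follows it
pattern <- "^EDG ICD HCUP CCS (\\d+) \\([^)]*\\)-(.*)$"

# Prototype giving the name and type of each output column
proto <- data.frame(category = integer(), diagnosis = character(),
                    stringsAsFactors = FALSE)

# One row per input string; a string that does not match yields NA values
res <- utils::strcapture(pattern, x, proto)
res
```

With the full data, passing df$GROUPER_NAME in place of x and combining the result with cbind(df, res) should add both columns to the original data frame.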

Regarding "r - Extract numbers after a specific phrase", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/68024208/
