Update
Based on OP's updated HTML, here's a way to extract the data into a table.
Inspired by @hrbrmstr's post here.
UPDATE基于OP更新的HTML,这里有一种将数据提取到表中的方法。灵感来自@hrbrmstr在这里的帖子。
library(tidyverse)
library(rvest)
library(xml2)
html <- read_html("~/Desktop/GenderStats.html")
xml_find_all(html, ".//br") %>% xml_add_sibling("p", "$$$")
xml_find_all(html, ".//br") %>% xml_remove()
tbls <- html_table(html)
^The first key is to replace <br>
tags with some distinct delimiter. Here I've chosen '$$$'
but you can use anything unlikely to appear normally in the text you're scraping.
^第一个关键是用不同的分隔符替换
标记。我在这里选择了‘$’,但你可以使用任何不太可能在你抓取的文本中正常显示的东西。
The reason for this is html_table()
converts into nondescript whitespace - which becomes indistinguishable from whitespace within valid strings and then impossible to split on.
这是因为html_table()会转换为非描述性空格--它与有效字符串中的空格无法区分,因此无法拆分。
tbls[[1]] |>
rename(values_pct = `Values, n (%)`) |> # just for ease of typing
filter(values_pct != "") |> # drop "spacer" row entries
# :Characteristics: values are now separated by $$$, split on that delimiter
separate_wider_regex(Characteristics,
patterns = c(var = ".*?", " \\$\\$\\$", category = ".*")) |>
group_by(var, category) |>
# now :values_pct: has entries like: $$$69 (68.3) $$$19 (18.8) $$$9 (8.9) $$$4 (4.0)
# however, some values for :var: have multiple rows already, like Sex
# so we create a list column of :values_pct: , then expand into columns
# with unnest_wide()
# then remove leading $$$ values with str_replace()
# finally expand each value into a column with separate()
# then convert to long format with pivot_wider()
summarise(values_pct = list(values_pct)) |>
unnest_wider(values_pct, names_sep = "_") |>
mutate(values_pct_1 = str_replace(values_pct_1, "^\\$\\$\\$", "")) |>
separate_wider_delim(values_pct_1, "$$$", names_sep = "__",
too_few="align_start") |>
pivot_longer(-c(var, category)) |>
filter(!is.na(value)) |> # drop empty values
# now split :category: into one row per value, splitting on the delimiter
separate_longer_delim(category, " $$$") |>
# the trick is to align each category with its associated value
# do this by enumerating categories and values in the order they appear
group_by(var, value = fct_inorder(value)) |>
mutate(value_id = cur_group_id()) |>
group_by(var, category = fct_inorder(category)) |>
mutate(cat_id = cur_group_id()) |>
ungroup() |>
# now reduce to just those entries where category ID and value ID match
filter(cat_id == value_id) |>
select(var, category, value)
Output
输出
# A tibble: 27 × 3
var category value
<chr> <fct> <fct>
1 Age <70 years of age "63 (62.4) "
2 Age ≥70 years of age "38 (37.6)"
3 Brain metastases Yes "16 (15.8) "
4 Brain metastases No "85 (84.2)"
5 CCI 0–2 "91 (90.1) "
6 CCI ≥3 "10 (9.9)"
7 Disease stage IIIB "2 (2.3) "
8 Disease stage IV "86 (97.7)"
9 ECOG PS 0 "20 (19.8) "
10 ECOG PS 1 "43 (42.6) "
11 ECOG PS 2 "30 (29.7) "
12 ECOG PS 3 "8 (7.9)"
13 Frailty Scoring System Low "28 (27.7) "
14 Frailty Scoring System Intermediate "41 (40.6) "
15 Frailty Scoring System High "32 (31.7)"
16 Histology Adenocarcinoma "69 (68.3) "
17 Histology Squamous "19 (18.8) "
18 Histology NSCLC poorly differentiated "9 (8.9) "
19 Histology Others "4 (4.0)"
20 NLR ≥4 "58 (57.4) "
21 NLR <4 "43 (42.6)"
22 PD-L1 TPS% <90% "74 (73.3) "
23 PD-L1 TPS% ≥90% "27 (26.7)"
24 Sex Male "75 (74.3)"
25 Sex Female "26 (25.7)"
26 Smoking history No "93 (92.1) "
27 Smoking history Yes "8 (7.9)"
Note: This still feels like a brittle solution and will almost certainly not generalize well. Also verging on more suitable for Data Science Stack Exchange than SO, as it's less about coding (IMO) and more about thinking creatively through a data organization problem. Keeping it here in the spirit that it may be helpful to others learning to code in R/tidyverse.
注意:这仍然感觉像是一个脆弱的解决方案,几乎可以肯定不会很好地推广。也接近于更适合数据科学堆栈交换,因为它不是关于编码(IMO),而是更多地通过数据组织问题进行创造性的思考。将它保存在这里的精神是,它可能会对其他学习R/tidyverse编程的人有所帮助。
Original
Here's a solution that pulls out each value in List
and gives it its own row in the data frame that comes out of tbls
. Then just drop the row with an empty Value
:
这里有一个解决方案,它可以提取List中的每个值,并在从tbls出来的数据框中为其提供自己的行。然后删除具有空Value的行:
library(tidyverse)
tbls[[1]] |>
rownames_to_column() |>
rowwise() |>
mutate(List = str_split_1(List, " ")[[as.numeric(rowname)]]) |>
filter(`Values, n (%)` != "") |>
select(-rowname)
# A tibble: 2 × 2
# Rowwise:
List `Values, n (%)`
<chr> <chr>
1 Male 75 (74.3)
2 Female 26 (25.7)
更多回答
Thanks you, this worked for above table. However it does not work when I try to scrap table from this URL as there are no spaces: ncbi.nlm.nih.gov/pmc/articles/PMC9953107/table/…
谢谢,这对上面的桌子很管用。但是,当我尝试从此url中删除表时,它不起作用,因为没有空格:ncbi.nlm.nih.gov/pmc/articles/PMC9953107/table/…
Can you update your post with example data that reflects the actual data you want to scrape? As I noted in my comment, the html in your example data doesn't have a reliable structure and so solutions may not generalize well. The more your example covers the actual use case, the better others can help you.
你能用反映你想要抓取的实际数据的示例数据来更新你的帖子吗?正如我在评论中指出的,示例数据中的html没有可靠的结构,因此解决方案可能不能很好地推广。您的示例覆盖的实际用例越多,其他人就越能更好地帮助您。
我是一名优秀的程序员,十分优秀!