gpt4 book ai didi

How to handle rowspan and colspan while scraping a Table using R(如何在使用R刮取表时处理行跨度和列跨度)

转载 作者:bug小助手 更新时间:2023-10-28 20:58:39 27 4
gpt4 key购买 nike



I am trying to scrape data from table (HTML pasted below):

我正在尝试从表中抓取数据(粘贴在下面的HTML):


I tried to use below code, but it does not return the contents as required. It merges all names in one for the rows (where rowspan is mentioned). Refer Table1 below.

我尝试使用下面的代码,但它没有按要求返回内容。它将各行的所有名称合并为一个(其中提到了rowspan)。请参阅下面的表1。


This table does have structural issues as it is using br tags. Could someone please help me to get a table with values mapped properly to all items. (Like Table 2)

此表确实存在结构问题,因为它使用了br标签。有没有人能帮我弄一张表,把值正确地映射到所有项目上?(如表2所示)


<!--HTML for Table -->
<table frame="hsides" rules="groups" class="rendered small default_table">
<thead>
<tr>
<th align="center" valign="middle" style="border-top:solid thin;border-bottom:solid thin" rowspan="1" colspan="1">Characteristics</th>
<th align="center" valign="middle" style="border-top:solid thin;border-bottom:solid thin" rowspan="1" colspan="1">Values, n (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="3" align="center" valign="middle" style="border-bottom:solid thin" colspan="1">Sex <br />Male <br />Female </td>
<td align="center" valign="middle" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="center" valign="middle" rowspan="1" colspan="1">75 (74.3)</td>
</tr>
<tr>
<td align="center" valign="middle" style="border-bottom:solid thin" rowspan="1" colspan="1">26 (25.7)</td>
</tr>
<tr>
<td rowspan="2" align="center" valign="middle" style="border-bottom:solid thin" colspan="1">Age <br />&#x0003c;70 years of age <br />&#x02265;70 years of age </td>
<td align="center" valign="middle" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="center" valign="middle" style="border-bottom:solid thin" rowspan="1" colspan="1">63 (62.4) <br />38 (37.6) </td>
</tr>
<tr>
<td rowspan="2" align="center" valign="middle" style="border-bottom:solid thin" colspan="1">Smoking history <br />Yes <br />No </td>
<td align="center" valign="middle" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="center" valign="middle" style="border-bottom:solid thin" rowspan="1" colspan="1">93 (92.1) <br />8 (7.9) </td>
</tr>
<tr>
<td align="center" valign="middle" rowspan="1" colspan="1">Histology <br />Adenocarcinoma <br />Squamous <br />NSCLC poorly differentiated <br />Others </td>
<td align="center" valign="middle" rowspan="1" colspan="1">
<br />69 (68.3) <br />19 (18.8) <br />9 (8.9) <br />4 (4.0)
</td>
</tr>
<tr>
<td align="center" valign="middle" style="border-top:solid thin" rowspan="1" colspan="1">Disease stage <br />IIIB <br />IV </td>
<td align="center" valign="middle" style="border-top:solid thin" rowspan="1" colspan="1">
<br />2 (2.3) <br />86 (97.7)
</td>
</tr>
<tr>
<td align="center" valign="middle" style="border-top:solid thin;border-bottom:solid thin" rowspan="1" colspan="1">Brain metastases <br />Yes <br />No </td>
<td align="center" valign="middle" style="border-top:solid thin;border-bottom:solid thin" rowspan="1" colspan="1">16 (15.8) <br />85 (84.2) </td>
</tr>
<tr>
<td align="center" valign="middle" rowspan="1" colspan="1">PD-L1 TPS% <br />&#x0003c;90% <br />&#x02265;90% </td>
<td align="center" valign="middle" rowspan="1" colspan="1">
<br />74 (73.3) <br />27 (26.7)
</td>
</tr>
<tr>
<td align="center" valign="middle" style="border-top:solid thin;border-bottom:solid thin" rowspan="1" colspan="1">ECOG PS <br />0 <br />1 <br />2 <br />3 </td>
<td align="center" valign="middle" style="border-top:solid thin;border-bottom:solid thin" rowspan="1" colspan="1">
<br />20 (19.8) <br />43 (42.6) <br />30 (29.7) <br />8 (7.9)
</td>
</tr>
<tr>
<td align="center" valign="middle" rowspan="1" colspan="1">CCI <br />0&#x02013;2 <br />&#x02265;3 </td>
<td align="center" valign="middle" rowspan="1" colspan="1">
<br />91 (90.1) <br />10 (9.9)
</td>
</tr>
<tr>
<td align="center" valign="middle" style="border-top:solid thin" rowspan="1" colspan="1">NLR <br />&#x02265;4 <br />&#x0003c;4 </td>
<td align="center" valign="middle" style="border-top:solid thin" rowspan="1" colspan="1">
<br />58 (57.4) <br />43 (42.6)
</td>
</tr>
<tr>
<td align="center" valign="middle" style="border-top:solid thin;border-bottom:solid thin" rowspan="1" colspan="1">Frailty Scoring System <br />Low <br />Intermediate <br />High </td>
<td align="center" valign="middle" style="border-top:solid thin;border-bottom:solid thin" rowspan="1" colspan="1">
<br />28 (27.7) <br />41 (40.6) <br />32 (31.7)
</td>
</tr>
</tbody>
</table>




#R Code

library(rvest)
tbls <- html_table(read_html("c:/GenderStats.html"))
for (t in 1:length(tbls)) {
assign(paste0("Table", t), tbls[[t]])
}


#Table 1

# A tibble: 15 × 2
Characteristics `Values, n (%)`
<chr> <chr>
1 SexMaleFemale ""
2 SexMaleFemale "75 (74.3)"
3 SexMaleFemale "26 (25.7)"
4 Age<70 years of age≥70 years of age ""
5 Age<70 years of age≥70 years of age "63 (62.4)38 (37.6)"
6 Smoking historyYesNo ""
7 Smoking historyYesNo "93 (92.1)8 (7.9)"
8 HistologyAdenocarcinomaSquamousNSCLC poorly differentiatedOthers "69 (68.3)19 (18.8)9 (8.9)4 (4.0)"
9 Disease stageIIIBIV "2 (2.3)86 (97.7)"
10 Brain metastasesYesNo "16 (15.8)85 (84.2)"
11 PD-L1 TPS%<90%≥90% "74 (73.3)27 (26.7)"
12 ECOG PS0123 "20 (19.8)43 (42.6)30 (29.7)8 (7.9)"
13 CCI0–2≥3 "91 (90.1)10 (9.9)"
14 NLR≥4<4 "58 (57.4)43 (42.6)"
15 Frailty Scoring SystemLowIntermediateHigh "28 (27.7)41 (40.6)32 (31.7)"



**Is there a way to get like below attached table?**



#Table 2

# A tibble: 38 × 2
Characteristics `Values, n (%)`
<chr> <chr>
1 Sex ""
2 Male "75 (74.3)"
3 Female "26 (25.7)"
4 Age ""
5 <70 years of age "63 (62.4)"
6 >=70 years of age "38 (37.6)"
7 Smoking history ""
8 Yes "93 (92.1)"
9 No "8 (7.9)"
10 Histology ""
11 Adenocarcinoma "69 (68.3)"
12 Squamous "19 (18.8)"
13 NSCLC poorly differentiated "9 (8.9)"
14 Others "4 (4.0)"
15 Disease stage ""
16 IIIB "2 (2.3)"
17 IV "86 (97.7)"
18 Brain metastases ""
19 Yes "16 (15.8)"
20 No "85 (84.2)"
21 PD-L1 TPS% ""
22 <90% "74 (73.3)"
23 >=90% "27 (26.7)"
24 ECOG PS ""
25 0 "20 (19.8)"
26 1 "43 (42.6)"
27 2 "30 (29.7)"
28 3 "8 (7.9)"
29 CCI ""
30 0-2 "91 (90.1)"
31 >=3 "10 (9.9)"
32 NLR ""
33 >=4 "58 (57.4)"
34 <4 "43 (42.6)"
35 Frailty Scoring System ""
36 Low "28 (27.7)"
37 Intermediate "41 (40.6)"
38 High "32 (31.7)"


更多回答

The problem is that this HTML uses line breaks (<br>) in one column to align labels to values in <td> cells in other column. There's nothing structural that allows for mapping list names to values. You might be able to create a list of labels, by splitting on <br>, and then map that to the Value <td>s, including the empty first one - but that isn't a very sturdy solution. The quality of the HTML here will limit scraping options.

问题是,该HTML在一列中使用换行符(
)来将标签与其他列中的单元格中的值对齐。没有任何结构允许将列表名称映射到值。您可以通过拆分
来创建标签列表,然后将其映射到S的值,包括第一个空的标签-但这不是一个非常可靠的解决方案。这里的超文本标记语言的质量将限制抓取选项。

Thanks for the update! But please post output tables inline, as code, instead of screenshots - screenshots don't show up for screen readers.

感谢您的更新!但请将输出表内联,作为代码,而不是屏幕截图-屏幕截图不会显示给屏幕阅读器。

Separately - I get a 403 error when trying to load that URL from rvest::read_html.

另外-当我尝试从rvest::Read_html加载该URL时,出现403错误。

I have updated tables inline and also the HTML now

我现在已经更新了内联表格和HTML

优秀答案推荐

Update

Based on OP's updated HTML, here's a way to extract the data into a table.

Inspired by @hrbrmstr's post here.

UPDATE基于OP更新的HTML,这里有一种将数据提取到表中的方法。灵感来自@hrbrmstr在这里的帖子。


library(tidyverse)
library(rvest)
library(xml2)

html <- read_html("~/Desktop/GenderStats.html")
xml_find_all(html, ".//br") %>% xml_add_sibling("p", "$$$")
xml_find_all(html, ".//br") %>% xml_remove()

tbls <- html_table(html)

^The first key is to replace <br> tags with some distinct delimiter. Here I've chosen '$$$' but you can use anything unlikely to appear normally in the text you're scraping.

^第一个关键是用不同的分隔符替换
标记。我在这里选择了‘$’,但你可以使用任何不太可能在你抓取的文本中正常显示的东西。


The reason for this is html_table() converts into nondescript whitespace - which becomes indistinguishable from whitespace within valid strings and then impossible to split on.

这是因为html_table()会转换为非描述性空格--它与有效字符串中的空格无法区分,因此无法拆分。


tbls[[1]] |> 
rename(values_pct = `Values, n (%)`) |> # just for ease of typing
filter(values_pct != "") |> # drop "spacer" row entries

# :Characteristics: values are now separated by $$$, split on that delimiter
separate_wider_regex(Characteristics,
patterns = c(var = ".*?", " \\$\\$\\$", category = ".*")) |>
group_by(var, category) |>

# now :values_pct: has entries like: $$$69 (68.3) $$$19 (18.8) $$$9 (8.9) $$$4 (4.0)
# however, some values for :var: have multiple rows already, like Sex
# so we create a list column of :values_pct: , then expand into columns
# with unnest_wide()

# then remove leading $$$ values with str_replace()
# finally expand each value into a column with separate()
# then convert to long format with pivot_wider()

summarise(values_pct = list(values_pct)) |>
unnest_wider(values_pct, names_sep = "_") |>
mutate(values_pct_1 = str_replace(values_pct_1, "^\\$\\$\\$", "")) |>
separate_wider_delim(values_pct_1, "$$$", names_sep = "__",
too_few="align_start") |>
pivot_longer(-c(var, category)) |>
filter(!is.na(value)) |> # drop empty values

# now split :category: into one row per value, splitting on the delimiter
separate_longer_delim(category, " $$$") |>

# the trick is to align each category with its associated value
# do this by enumerating categories and values in the order they appear
group_by(var, value = fct_inorder(value)) |>
mutate(value_id = cur_group_id()) |>
group_by(var, category = fct_inorder(category)) |>
mutate(cat_id = cur_group_id()) |>
ungroup() |>

# now reduce to just those entries where category ID and value ID match
filter(cat_id == value_id) |>
select(var, category, value)

Output

输出


# A tibble: 27 × 3
var category value
<chr> <fct> <fct>
1 Age <70 years of age "63 (62.4) "
2 Age ≥70 years of age "38 (37.6)"
3 Brain metastases Yes "16 (15.8) "
4 Brain metastases No "85 (84.2)"
5 CCI 0–2 "91 (90.1) "
6 CCI ≥3 "10 (9.9)"
7 Disease stage IIIB "2 (2.3) "
8 Disease stage IV "86 (97.7)"
9 ECOG PS 0 "20 (19.8) "
10 ECOG PS 1 "43 (42.6) "
11 ECOG PS 2 "30 (29.7) "
12 ECOG PS 3 "8 (7.9)"
13 Frailty Scoring System Low "28 (27.7) "
14 Frailty Scoring System Intermediate "41 (40.6) "
15 Frailty Scoring System High "32 (31.7)"
16 Histology Adenocarcinoma "69 (68.3) "
17 Histology Squamous "19 (18.8) "
18 Histology NSCLC poorly differentiated "9 (8.9) "
19 Histology Others "4 (4.0)"
20 NLR ≥4 "58 (57.4) "
21 NLR <4 "43 (42.6)"
22 PD-L1 TPS% <90% "74 (73.3) "
23 PD-L1 TPS% ≥90% "27 (26.7)"
24 Sex Male "75 (74.3)"
25 Sex Female "26 (25.7)"
26 Smoking history No "93 (92.1) "
27 Smoking history Yes "8 (7.9)"

Note: This still feels like a brittle solution and will almost certainly not generalize well. Also verging on more suitable for Data Science Stack Exchange than SO, as it's less about coding (IMO) and more about thinking creatively through a data organization problem. Keeping it here in the spirit that it may be helpful to others learning to code in R/tidyverse.

注意:这仍然感觉像是一个脆弱的解决方案,几乎可以肯定不会很好地推广。也接近于更适合数据科学堆栈交换,因为它不是关于编码(IMO),而是更多地通过数据组织问题进行创造性的思考。将它保存在这里的精神是,它可能会对其他学习R/tidyverse编程的人有所帮助。


Original

Here's a solution that pulls out each value in List and gives it its own row in the data frame that comes out of tbls. Then just drop the row with an empty Value:

这里有一个解决方案,它可以提取List中的每个值,并在从tbls出来的数据框中为其提供自己的行。然后删除具有空Value的行:


library(tidyverse)

tbls[[1]] |>
rownames_to_column() |>
rowwise() |>
mutate(List = str_split_1(List, " ")[[as.numeric(rowname)]]) |>
filter(`Values, n (%)` != "") |>
select(-rowname)

# A tibble: 2 × 2
# Rowwise:
List `Values, n (%)`
<chr> <chr>
1 Male 75 (74.3)
2 Female 26 (25.7)

更多回答

Thanks you, this worked for above table. However it does not work when I try to scrap table from this URL as there are no spaces: ncbi.nlm.nih.gov/pmc/articles/PMC9953107/table/…

谢谢,这对上面的桌子很管用。但是,当我尝试从此url中删除表时,它不起作用,因为没有空格:ncbi.nlm.nih.gov/pmc/articles/PMC9953107/table/…

Can you update your post with example data that reflects the actual data you want to scrape? As I noted in my comment, the html in your example data doesn't have a reliable structure and so solutions may not generalize well. The more your example covers the actual use case, the better others can help you.

你能用反映你想要抓取的实际数据的示例数据来更新你的帖子吗?正如我在评论中指出的,示例数据中的html没有可靠的结构,因此解决方案可能不能很好地推广。您的示例覆盖的实际用例越多,其他人就越能更好地帮助您。

27 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com