How to handle rowspan and colspan while scraping a Table using R

Reposted by: bug小助手 | Updated: 2023-10-28 20:58:39

I am trying to scrape data from a table (HTML pasted below):

I tried the code below, but it does not return the contents as required. It merges all the names into one string for each row (where rowspan is set). See Table 1 below.

This table does have structural issues, as it uses br tags. Could someone please help me get a table with the values mapped properly to all items (like Table 2)?

<!--HTML for Table -->
<table frame="hsides" rules="groups" class="rendered small default_table">
  <thead>
    <tr>
      <th align="center" valign="middle" style="border-top:solid thin;border-bottom:solid thin" rowspan="1" colspan="1">Characteristics</th>
      <th align="center" valign="middle" style="border-top:solid thin;border-bottom:solid thin" rowspan="1" colspan="1">Values, n (%)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="3" align="center" valign="middle" style="border-bottom:solid thin" colspan="1">Sex <br />Male <br />Female </td>
      <td align="center" valign="middle" rowspan="1" colspan="1"></td>
    </tr>
    <tr>
      <td align="center" valign="middle" rowspan="1" colspan="1">75 (74.3)</td>
    </tr>
    <tr>
      <td align="center" valign="middle" style="border-bottom:solid thin" rowspan="1" colspan="1">26 (25.7)</td>
    </tr>
    <tr>
      <td rowspan="2" align="center" valign="middle" style="border-bottom:solid thin" colspan="1">Age <br />&#x0003c;70 years of age <br />&#x02265;70 years of age </td>
      <td align="center" valign="middle" rowspan="1" colspan="1"></td>
    </tr>
    <tr>
      <td align="center" valign="middle" style="border-bottom:solid thin" rowspan="1" colspan="1">63 (62.4) <br />38 (37.6) </td>
    </tr>
    <tr>
      <td rowspan="2" align="center" valign="middle" style="border-bottom:solid thin" colspan="1">Smoking history <br />Yes <br />No </td>
      <td align="center" valign="middle" rowspan="1" colspan="1"></td>
    </tr>
    <tr>
      <td align="center" valign="middle" style="border-bottom:solid thin" rowspan="1" colspan="1">93 (92.1) <br />8 (7.9) </td>
    </tr>
    <tr>
      <td align="center" valign="middle" rowspan="1" colspan="1">Histology <br />Adenocarcinoma <br />Squamous <br />NSCLC poorly differentiated <br />Others </td>
      <td align="center" valign="middle" rowspan="1" colspan="1">
        <br />69 (68.3) <br />19 (18.8) <br />9 (8.9) <br />4 (4.0)
      </td>
    </tr>
    <tr>
      <td align="center" valign="middle" style="border-top:solid thin" rowspan="1" colspan="1">Disease stage <br />IIIB <br />IV </td>
      <td align="center" valign="middle" style="border-top:solid thin" rowspan="1" colspan="1">
        <br />2 (2.3) <br />86 (97.7)
      </td>
    </tr>
    <tr>
      <td align="center" valign="middle" style="border-top:solid thin;border-bottom:solid thin" rowspan="1" colspan="1">Brain metastases <br />Yes <br />No </td>
      <td align="center" valign="middle" style="border-top:solid thin;border-bottom:solid thin" rowspan="1" colspan="1">16 (15.8) <br />85 (84.2) </td>
    </tr>
    <tr>
      <td align="center" valign="middle" rowspan="1" colspan="1">PD-L1 TPS% <br />&#x0003c;90% <br />&#x02265;90% </td>
      <td align="center" valign="middle" rowspan="1" colspan="1">
        <br />74 (73.3) <br />27 (26.7)
      </td>
    </tr>
    <tr>
      <td align="center" valign="middle" style="border-top:solid thin;border-bottom:solid thin" rowspan="1" colspan="1">ECOG PS <br />0 <br />1 <br />2 <br />3 </td>
      <td align="center" valign="middle" style="border-top:solid thin;border-bottom:solid thin" rowspan="1" colspan="1">
        <br />20 (19.8) <br />43 (42.6) <br />30 (29.7) <br />8 (7.9)
      </td>
    </tr>
    <tr>
      <td align="center" valign="middle" rowspan="1" colspan="1">CCI <br />0&#x02013;2 <br />&#x02265;3 </td>
      <td align="center" valign="middle" rowspan="1" colspan="1">
        <br />91 (90.1) <br />10 (9.9)
      </td>
    </tr>
    <tr>
      <td align="center" valign="middle" style="border-top:solid thin" rowspan="1" colspan="1">NLR <br />&#x02265;4 <br />&#x0003c;4 </td>
      <td align="center" valign="middle" style="border-top:solid thin" rowspan="1" colspan="1">
        <br />58 (57.4) <br />43 (42.6)
      </td>
    </tr>
    <tr>
      <td align="center" valign="middle" style="border-top:solid thin;border-bottom:solid thin" rowspan="1" colspan="1">Frailty Scoring System <br />Low <br />Intermediate <br />High </td>
      <td align="center" valign="middle" style="border-top:solid thin;border-bottom:solid thin" rowspan="1" colspan="1">
        <br />28 (27.7) <br />41 (40.6) <br />32 (31.7)
      </td>
    </tr>
  </tbody>
</table>




#R Code

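library(rvest)

# html_table() returns a list with one tibble per <table> in the page
tbls <- html_table(read_html("c:/GenderStats.html"))

# store each table as Table1, Table2, ...
for (t in seq_along(tbls)) {
  assign(paste0("Table", t), tbls[[t]])
}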


#Table 1

    # A tibble: 15 × 2
            Characteristics                                                         `Values, n (%)`                     
            <chr>                                                                   <chr>                               
     1      SexMaleFemale                                                           ""                                  
     2      SexMaleFemale                                                           "75 (74.3)"                         
     3      SexMaleFemale                                                           "26 (25.7)"                         
     4      Age<70 years of age≥70 years of age                                     ""                                  
     5      Age<70 years of age≥70 years of age                                     "63 (62.4)38 (37.6)"                
     6      Smoking historyYesNo                                                    ""                                  
     7      Smoking historyYesNo                                                    "93 (92.1)8 (7.9)"                  
     8      HistologyAdenocarcinomaSquamousNSCLC poorly differentiatedOthers        "69 (68.3)19 (18.8)9 (8.9)4 (4.0)"  
     9      Disease stageIIIBIV                                                     "2 (2.3)86 (97.7)"                  
    10      Brain metastasesYesNo                                                   "16 (15.8)85 (84.2)"                
    11      PD-L1 TPS%<90%≥90%                                                      "74 (73.3)27 (26.7)"                
    12      ECOG PS0123                                                             "20 (19.8)43 (42.6)30 (29.7)8 (7.9)"
    13      CCI0–2≥3                                                                "91 (90.1)10 (9.9)"                 
    14      NLR≥4<4                                                                 "58 (57.4)43 (42.6)"                
    15      Frailty Scoring SystemLowIntermediateHigh                               "28 (27.7)41 (40.6)32 (31.7)"



**Is there a way to get a table like the one below?**

    

#Table 2

    # A tibble: 38 × 2
       Characteristics             `Values, n (%)`
       <chr>                       <chr>          
     1 Sex                         ""             
     2 Male                        "75 (74.3)"    
     3 Female                      "26 (25.7)"    
     4 Age                         ""             
     5 <70 years of age            "63 (62.4)"    
     6 >=70 years of age           "38 (37.6)"    
     7 Smoking history             ""             
     8 Yes                         "93 (92.1)"    
     9 No                          "8 (7.9)"      
    10 Histology                   ""             
    11 Adenocarcinoma              "69 (68.3)"    
    12 Squamous                    "19 (18.8)"    
    13 NSCLC poorly differentiated "9 (8.9)"      
    14 Others                      "4 (4.0)"      
    15 Disease stage               ""             
    16 IIIB                        "2 (2.3)"      
    17 IV                          "86 (97.7)"    
    18 Brain metastases            ""             
    19 Yes                         "16 (15.8)"    
    20 No                          "85 (84.2)"    
    21 PD-L1 TPS%                  ""             
    22 <90%                        "74 (73.3)"    
    23 >=90%                       "27 (26.7)"    
    24 ECOG PS                     ""             
    25 0                           "20 (19.8)"    
    26 1                           "43 (42.6)"    
    27 2                           "30 (29.7)"    
    28 3                           "8 (7.9)"      
    29 CCI                         ""             
    30 0-2                         "91 (90.1)"    
    31 >=3                         "10 (9.9)"     
    32 NLR                         ""             
    33 >=4                         "58 (57.4)"    
    34 <4                          "43 (42.6)"    
    35 Frailty Scoring System      ""             
    36 Low                         "28 (27.7)"    
    37 Intermediate                "41 (40.6)"    
    38 High                        "32 (31.7)"

Comments

The problem is that this HTML uses line breaks (<br>) in one column to align labels with the values in <td> cells in the other column. Nothing structural maps the list names to values. You might be able to build a list of labels by splitting on <br> and then map it to the value <td>s, including the empty first one, but that isn't a very sturdy solution. The quality of the HTML here will limit your scraping options.

Thanks for the update! But please post output tables inline, as code, instead of screenshots - screenshots don't show up for screen readers.

Separately - I get a 403 error when trying to load that URL from rvest::read_html.

I have now updated the tables inline, and also the HTML.

Recommended answer

Update

Based on OP's updated HTML, here's a way to extract the data into a table.

Inspired by @hrbrmstr's post here.

library(tidyverse)
library(rvest)
library(xml2)

html <- read_html("~/Desktop/GenderStats.html")
xml_find_all(html, ".//br") %>% xml_add_sibling("p", "$$$")
xml_find_all(html, ".//br") %>% xml_remove()

tbls <- html_table(html)

^The first key is to replace <br> tags with some distinct delimiter. Here I've chosen '$$$' but you can use anything unlikely to appear normally in the text you're scraping.

The reason for this is that html_table() converts <br> into nondescript whitespace, which becomes indistinguishable from whitespace within valid strings and is then impossible to split on.
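
As a rough illustration of the delimiter step on its own, here is a minimal sketch that uses only xml2 on a single cell. The one-cell HTML string and the names `doc`, `cell_text` and `parts` are made up for this example; the `$$$` delimiter is the one chosen above.

```r
library(xml2)

# Hypothetical one-cell document modelled on the "Sex" cell in the question.
doc <- read_html("<table><tr><td>Sex <br/>Male <br/>Female </td></tr></table>")

# Put a <p>$$$</p> marker after every <br>, then drop the <br> tags themselves.
xml_add_sibling(xml_find_all(doc, ".//br"), "p", "$$$")
xml_remove(xml_find_all(doc, ".//br"))

# The cell text now carries an explicit delimiter that can be split on.
cell_text <- xml_text(xml_find_first(doc, ".//td"))
parts <- trimws(strsplit(cell_text, "$$$", fixed = TRUE)[[1]])
print(parts)
```

After the split, `parts` should hold the three labels separately (Sex, Male, Female), which is the property the longer pipeline relies on.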

tbls[[1]] |> 
  rename(values_pct = `Values, n (%)`) |>  # just for ease of typing
  filter(values_pct != "") |> # drop "spacer" row entries

  # :Characteristics: values are now separated by $$$, split on that delimiter
  separate_wider_regex(Characteristics, 
                       patterns = c(var = ".*?", " \\$\\$\\$", category = ".*")) |> 
  group_by(var, category) |> 

  # now :values_pct: has entries like: $$$69 (68.3) $$$19 (18.8) $$$9 (8.9) $$$4 (4.0) 
  # however, some values for :var: have multiple rows already, like Sex
  # so we create a list column of :values_pct: , then expand into columns
  # with unnest_wider()

  # then remove leading $$$ values with str_replace()
  # next expand each value into a column with separate_wider_delim()
  # finally convert to long format with pivot_longer()

  summarise(values_pct = list(values_pct)) |> 
  unnest_wider(values_pct, names_sep = "_") |> 
  mutate(values_pct_1 = str_replace(values_pct_1, "^\\$\\$\\$", "")) |> 
  separate_wider_delim(values_pct_1, "$$$", names_sep = "__", 
                       too_few="align_start") |> 
  pivot_longer(-c(var, category)) |> 
  filter(!is.na(value)) |> # drop empty values

  # now split :category: into one row per value, splitting on the delimiter
  separate_longer_delim(category, " $$$") |> 

  # the trick is to align each category with its associated value
  # do this by enumerating categories and values in the order they appear
  group_by(var, value = fct_inorder(value)) |> 
  mutate(value_id = cur_group_id()) |> 
  group_by(var, category = fct_inorder(category)) |> 
  mutate(cat_id = cur_group_id()) |> 
  ungroup() |> 

  # now reduce to just those entries where category ID and value ID match
  filter(cat_id == value_id) |> 
  select(var, category, value)

Output

# A tibble: 27 × 3
   var                    category                    value       
   <chr>                  <fct>                       <fct>       
 1 Age                    <70 years of age            "63 (62.4) "
 2 Age                    ≥70 years of age            "38 (37.6)" 
 3 Brain metastases       Yes                         "16 (15.8) "
 4 Brain metastases       No                          "85 (84.2)" 
 5 CCI                    0–2                         "91 (90.1) "
 6 CCI                    ≥3                          "10 (9.9)"  
 7 Disease stage          IIIB                        "2 (2.3) "  
 8 Disease stage          IV                          "86 (97.7)" 
 9 ECOG PS                0                           "20 (19.8) "
10 ECOG PS                1                           "43 (42.6) "
11 ECOG PS                2                           "30 (29.7) "
12 ECOG PS                3                           "8 (7.9)"   
13 Frailty Scoring System Low                         "28 (27.7) "
14 Frailty Scoring System Intermediate                "41 (40.6) "
15 Frailty Scoring System High                        "32 (31.7)" 
16 Histology              Adenocarcinoma              "69 (68.3) "
17 Histology              Squamous                    "19 (18.8) "
18 Histology              NSCLC poorly differentiated "9 (8.9) "  
19 Histology              Others                      "4 (4.0)"   
20 NLR                    ≥4                          "58 (57.4) "
21 NLR                    <4                          "43 (42.6)" 
22 PD-L1 TPS%             <90%                        "74 (73.3) "
23 PD-L1 TPS%             ≥90%                        "27 (26.7)" 
24 Sex                    Male                        "75 (74.3)" 
25 Sex                    Female                      "26 (25.7)" 
26 Smoking history        No                          "93 (92.1) "
27 Smoking history        Yes                         "8 (7.9)"

Note: This still feels like a brittle solution and will almost certainly not generalize well. Also verging on more suitable for Data Science Stack Exchange than SO, as it's less about coding (IMO) and more about thinking creatively through a data organization problem. Keeping it here in the spirit that it may be helpful to others learning to code in R/tidyverse.
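
One sign of that brittleness: in the output above, Smoking history lists "No" with 93 (92.1) and "Yes" with 8 (7.9), whereas the question's HTML and Table 2 pair "Yes" with 93 (92.1). This is possibly because the string "8 (7.9)" also occurs under ECOG PS, which can upset the ID matching. For comparison, here is a sketch of a purely positional pairing that reads the text nodes around each <br> directly with xml2 and never goes through html_table(). It assumes that a two-cell row always starts a new characteristic and that a one-cell row continues the one above, which holds for this table but is not guaranteed elsewhere. The helper `cell_lines` and the names `blocks` and `result` are hypothetical, and the HTML is a cut-down copy of the question's table with the styling attributes removed.

```r
library(xml2)

# Cut-down copy of the question's table: a rowspan = 3 block, a rowspan = 2
# block, and a single-row block.
html <- read_html('<table><tbody>
  <tr><td rowspan="3">Sex <br/>Male <br/>Female </td><td></td></tr>
  <tr><td>75 (74.3)</td></tr>
  <tr><td>26 (25.7)</td></tr>
  <tr><td rowspan="2">Smoking history <br/>Yes <br/>No </td><td></td></tr>
  <tr><td>93 (92.1) <br/>8 (7.9) </td></tr>
  <tr><td>Brain metastases <br/>Yes <br/>No </td><td>16 (15.8) <br/>85 (84.2) </td></tr>
</tbody></table>')

# Non-empty text fragments of a cell; each <br> separates two text nodes.
cell_lines <- function(cell) {
  txt <- trimws(xml_text(xml_find_all(cell, "./text()")))
  txt[nzchar(txt)]
}

rows <- xml_find_all(html, ".//tr")
blocks <- list()
for (i in seq_along(rows)) {
  cells <- xml_find_all(rows[[i]], "./td")
  if (length(cells) == 2) {
    # A two-cell row starts a new characteristic block.
    blocks[[length(blocks) + 1]] <- list(
      labels = cell_lines(cells[[1]]),
      values = cell_lines(cells[[2]])
    )
  } else {
    # A one-cell row continues the block opened by the rowspan cell above.
    k <- length(blocks)
    blocks[[k]]$values <- c(blocks[[k]]$values, cell_lines(cells[[1]]))
  }
}

# First label is the characteristic name; the rest pair up with the values
# in document order.
result <- do.call(rbind, lapply(blocks, function(b) {
  stopifnot(length(b$labels) - 1 == length(b$values))
  data.frame(var = b$labels[1], category = b$labels[-1], value = b$values)
}))
print(result)
```

On this cut-down input the result has six rows, and Smoking history "Yes" is paired with 93 (92.1). The stopifnot() check makes the sketch fail loudly, rather than misalign silently, when a block has a different number of categories and values.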

Original

Here's a solution that pulls out each value in List and gives it its own row in the data frame that comes out of tbls. Then just drop the row with an empty Value:

library(tidyverse)

tbls[[1]] |> 
  rownames_to_column() |> 
  rowwise() |> 
  mutate(List = str_split_1(List, " ")[[as.numeric(rowname)]]) |> 
  filter(`Values, n (%)` != "") |> 
  select(-rowname)

# A tibble: 2 × 2
# Rowwise: 
  List   `Values, n (%)`
  <chr>  <chr>          
1 Male   75 (74.3)      
2 Female 26 (25.7)

Comments

Thank you, this worked for the table above. However, it does not work when I try to scrape the table from this URL, as there are no spaces: ncbi.nlm.nih.gov/pmc/articles/PMC9953107/table/…

Can you update your post with example data that reflects the actual data you want to scrape? As I noted in my comment, the html in your example data doesn't have a reliable structure and so solutions may not generalize well. The more your example covers the actual use case, the better others can help you.
