
html - Reading a complex HTML file into R with rvest

Reposted. Author: 太空狗. Last updated: 2023-10-29 14:18:42

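I'm new to R and to Stack Overflow, so please be gentle; I will try to keep this post in good shape. I'm working on a project that compares whole-exome sequencing (WES) results with proteome data. Our WES facility delivers its data only as HTML files, so I need to read them into R to continue my work.

I tried to follow the DataCamp tutorial for rvest, but I think the HTML file may be too complex, because all I get is some text in between runs of \t and \n characters. I guess the selector passed to html_nodes is not correct?

Here is my R code, followed by the HTML, which has been shortened and has had its variants altered.

What I would like to get is a data frame with the same columns as in the HTML. As the example shows, some variants affect multiple transcripts; in those cases one row per transcript would be perfect, but it is not a must.

Thank you very much for your help!

Sebastian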

library(tidyverse)  
library(rvest)

htmlALL <- read_html("Example_html")

getDATA <- function(html){
  html %>%
    html_nodes(".table") %>%
    html_text() %>%
    str_trim() %>%
    unlist()
}

df_html <- getDATA(htmlALL)

<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head>
<!-- add title in the browser tab bar -->
<title>Homozygous variants of sample XXX </title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>


<!-- change style to look nice -->
<style type="text/css">


html {
text-align: center;
vertical-align: middle;
height: 100%;
width: 100%;
}
body {
background: #eee url('http://i.imgur.com/eeQeRmk.png'); /* http://subtlepatterns.com/weave/ */
font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;
font-size: 62.5%;
line-height: 1;
color: #585858;
padding: 22px 10px;
padding-bottom: 55px;

}

::selection { background: #5f74a0; color: #fff; }
::-moz-selection { background: #5f74a0; color: #fff; }
::-webkit-selection { background: #5f74a0; color: #fff; }

br { display: block; line-height: 1.6em; }

input, textarea {
-webkit-font-smoothing: antialiased;
-webkit-text-size-adjust: 100%;
-ms-text-size-adjust: 100%;
-webkit-box-sizing: border-box;
-moz-box-sizing: border-box;
box-sizing: border-box;
outline: none;
}

blockquote, q { quotes: none; }
blockquote:before, blockquote:after, q:before, q:after { content: ''; content: none; }
strong, b { font-weight: bold; }


h1 {
font-weight: bold;
font-size: 3.6em;
line-height: 1.7em;
margin-bottom: 10px;
text-align: center;
}

h2 {
font-weight: bold;
font-size: 2.6em;
line-height: 1.7em;
margin-bottom: 10px;
text-align: center;
}

/** big white sheet everything is on **/
.wrapper {
display: block;
width: 95%;
background: #fff;
margin: 0 auto;
padding: 10px 17px 100px;
box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-webkit-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-moz-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-ms-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-o-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
overflow-x: auto;
overflow-y: visible;
}

/* smaller box the family information is on */
.info{
display: block;
width: 800px;
background: #f2f2f2;
margin: 0 auto;
padding: 10px 17px 10px 10px;
box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-webkit-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-moz-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-ms-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-o-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
font-size: 1.8em;
margin-bottom: 10px;
}


/* this is what actually contains the info */
.table {
display: table;
margin: 0 auto;
width: 99%;
font-size: 1.2em;
margin-bottom: 15px;
border-collapse: collapse;
overflow: visible;
}

/* one row of the variants */
.tablerow {
display: table-row;
overflow: visible;
border: 1px solid gray;
width: 100%;
}

/* headers are bigger and may in the future be clickable to sort accordingly */
.tableheader {
display: table-cell;
background: #f2f2f2;
padding: 3px 10px;
margin-bottom: 25px;
font-size: 1.8em;
box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-webkit-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-moz-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-ms-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-o-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
}

/* in the following, each column gets specified to increase readability */

.position {
display: table-cell;
padding: 3px 10px;
font-size: 1.4em;
height: 100%;
text-align: center;
vertical-align: middle;
}

.variants {
display: table-cell;
height: 100%;
vertical-align: middle;
overflow: visible;
white-space: nowrap;

}

.stacked {
display: table;
height: 50%;
width: 100%;

}

.center {
display: table-cell;
vertical-align: middle;
width: 100%;
padding: 0px 5px;
}


.consequences {
display: table-cell;
height: 100%;
vertical-align: middle;
padding: 3px 10px;
}

.gene {
display: table-cell;
padding: 3px 15px;
height: 100%;
vertical-align: middle;
font-size: 1.4em;
font-weight: bold;
}

.transcripts {
display: table-cell;
vertical-align: middle;
height: 100%;
}

.list {
height: 100%;
width: 100%;
display: table;
table-layout: fixed;
}
.row {
display: table-row;
overflow: visible;
vertical-align: middle;
}
.entry {
display: table-cell;
vertical-align:middle;
padding: 0% 1% 0% 1%;
white-space: nowrap;
text-overflow: ellipsis;
overflow: hidden;
}

.cdspos {
display: table-cell;
vertical-align: middle;
height: 100%;
}

.exon {
display: table-cell;
vertical-align: middle;
height: 100%;
}



.hgvs {
display: table-cell;
height: 100%;
vertical-align: middle;
}

.hgvs .list .row{
display: table-row;
vertical-align: middle;
}

.polyphen {
display: table-cell;
height: 100%;
vertical-align: middle;
}
.polyphen .list .row{
display: table-row;
vertical-align: middle;
}

.sift {
display: table-cell;
height: 100%;
vertical-align: middle;
}
.sift .list .row{
display: table-row;
vertical-align: middle;
}

.allelefreq {
display: table-cell;
height: 100%;
vertical-align: middle;
}



/* Tooltip container */
.tooltip_gene, .tooltip_allelefrq ,.tooltip_qual{
position: relative;
display: inline-block;
border-bottom: 1px dotted black; /* If you want dots under the hoverable text */

}



.tooltiptext{
visibility: hidden;
overflow: auto;
min-width: 400px;
background-color: #ffb380;
color: black;
text-align: left;
padding: 5px 10px;
border-radius: 6px;
font-size: 12pt;
font-weight: normal;

/* Position the tooltip text - see examples below! */
position: absolute;
z-index:1;

/* shadow */
box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-webkit-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-moz-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-ms-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-o-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);

opacity: 0.95;
filter: alpha(opacity=95);

}

/* Tooltip text */
.tooltip_gene .tooltiptext {
top: -5px;
left: 105%;

}


/* Tooltip text */
.tooltip_allelefrq .tooltiptext {
top: -5px;
right: 105%;
min-width: 120px;


}

/* Show the tooltip text when you mouse over the tooltip container */
.tooltip_allelefrq:hover .tooltiptext, .tooltip_gene:hover .tooltiptext {
visibility: visible;
}


.clin {
display: table-cell;
height: 100%;
vertical-align: middle;
padding: 0% 1% 0% 1%;
white-space: nowrap;
text-overflow: ellipsis;
overflow: hidden;
}

</style>


<body>
<div class="wrapper">
<!-- add info about patients -->
<h1>Homozygous variants of sample XXX</h1>
<h2>Tue Jan 23 09:01:56 2018</h2>
<div class="info">

Patient only<br>

</div>
<!-- variants table start -->
<div class="table">
<!-- table header start -->
<div class="tablerow">
<div class="tableheader">
Position
</div>
<div class="tableheader">
Variant
</div>
<div class="tableheader">
Cons
</div>
<div class="tableheader">
Gene
</div>
<div class="tableheader">
Transcript
</div>
<div class="tableheader">
HGVSC
</div>
<div class="tableheader">
HGVSP
</div>
<div class="tableheader">
PolyPhen
</div>
<div class="tableheader">
SIFT
</div>
<div class="tableheader">
AF
</div>
<div class="tableheader">
Clin
</div>
</div>
<!-- table header stop -->
<!-- var loop start -->

<div class="tablerow" >
<!-- position start -->
<div class="position">
<a href="http://gnomad.broadinstitute.org/region/1-117635467-117635507">1:117635487</a>
</div>
<!-- position stop -->
<!-- variants start -->
<div class="variants">


G->T


</div>
<!-- variants stop -->
<!-- consequences start -->
<div class="consequences" style="background: rgb(196, 197, 198);">

synonymous

</div>
<!-- consequences stop -->
<!-- gene start -->
<div class="gene" >




<div class="tooltip_gene">
<a href="http://www.genecards.org/cgi-bin/carddisp.pl?gene=TTF2" >
TTF2
</a>
<span class="tooltiptext">GeneCards Summary<hr>
TTF2 (Transcription Termination Factor 2) is a Protein Coding gene.
Diseases associated with TTF2 include Sexual Sadism and Narcissistic Personality Disorder.
Among its related pathways are Human Thyroid Stimulating Hormone (TSH) signaling pathway and Insulin secretion.
GO annotations related to this gene include hydrolase activity and DNA-dependent ATPase activity.
An important paralog of this gene is HLTF.</span>
</div>

</div>
<!-- gene stop -->
<!-- transcripts start -->
<div class="transcripts">
<div class="list">

<div class="row">
<div class="entry">
<a href="http://grch37.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;t=ENST00000369466">ENST00000369466
</a>
</div>
</div>

</div>
</div>
<!-- transcripts stop -->
<!-- exon start -->
<!-- <div class="exon">
<div class="list">

</div>
</div>-->
<!-- exon stop -->
<!-- hgvsc start -->
<div class="hgvs">
<div class="list">

<div class="row">
<div class="entry">

c.2940G>T

</div>
</div>

</div>
</div>
<!-- hgvsc stop -->
<!-- hgvsp start -->
<div class="hgvs">
<div class="list">

<div class="row">
<div class="entry">

c.2940G>T(p.%3D)

</div>
</div>

</div>
</div>
<!-- hgvsp stop -->
<!-- polyphen start -->
<div class="polyphen">
<div class="list">

<div class="row">
<div class="entry">



</div>
</div>

</div>
</div>
<!-- polyphen stop -->
<!-- sift start -->
<div class="sift">
<div class="list">

<div class="row">
<div class="entry">



</div>
</div>

</div>
</div>
<!-- sift stop -->
<!--.allelefreq start -->
<div class="allelefreq">


<div class="tooltip_allelefrq">
0.00000
<span class="tooltiptext">allele counts<hr>ht: <span style='float:right;'>0</span><br>hm: <span style='float:right;'>0</span><br>wt: <span style='float:right;'>0</span><hr>inhouse:<span style='float:right;'>0.00118</span></span>
</div>


</div>
<!--.allelefreq stop -->
<!--.allelefreq start -->
<div class="clin">



</div>
<!--.allelefreq stop -->
</div>
<!-- table row stop-->


<div class="tablerow" >
<!-- position start -->
<div class="position">
<a href="http://gnomad.broadinstitute.org/region/1-149898435-149898475">1:149898455</a>
</div>
<!-- position stop -->
<!-- variants start -->
<div class="variants">



<a href="https://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=rs143105666">G->A</a>



</div>
<!-- variants stop -->
<!-- consequences start -->
<div class="consequences" style="background: rgb(196, 197, 198);">

synonymous

</div>
<!-- consequences stop -->
<!-- gene start -->
<div class="gene" >




<div class="tooltip_gene">
<a href="http://www.genecards.org/cgi-bin/carddisp.pl?gene=SF3B4" >
SF3B4
</a>
<span class="tooltiptext">GeneCards Summary<hr>
SF3B4 (Splicing Factor 3b Subunit 4) is a Protein Coding gene.
Diseases associated with SF3B4 include Acrofacial Dysostosis 1, Nager Type and Acrofacial Dysostosis Syndrome Of Rodriguez.
Among its related pathways are mRNA Splicing - Major Pathway and Gene Expression.
GO annotations related to this gene include nucleic acid binding and nucleotide binding.
</span>
</div>

</div>
<!-- gene stop -->
<!-- transcripts start -->
<div class="transcripts">
<div class="list">

<div class="row">
<div class="entry">
<a href="http://grch37.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;t=ENST00000457312">ENST00000457312
</a>
</div>
</div>

<div class="row">
<div class="entry">
<a href="http://grch37.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;t=ENST00000271628">ENST00000271628
</a>
</div>
</div>

</div>
</div>
<!-- transcripts stop -->
<!-- exon start -->
<!-- <div class="exon">
<div class="list">

</div>
</div>-->
<!-- exon stop -->
<!-- hgvsc start -->
<div class="hgvs">
<div class="list">

<div class="row">
<div class="entry">

c.390C>A

</div>
</div>

<div class="row">
<div class="entry">

c.519C>A

</div>
</div>

</div>
</div>
<!-- hgvsc stop -->
<!-- hgvsp start -->
<div class="hgvs">
<div class="list">

<div class="row">
<div class="entry">

c.390C>A(p.%3D)

</div>
</div>

<div class="row">
<div class="entry">

c.519C>A(p.%3D)

</div>
</div>

</div>
</div>
<!-- hgvsp stop -->
<!-- polyphen start -->
<div class="polyphen">
<div class="list">

<div class="row">
<div class="entry">



</div>
</div>

<div class="row">
<div class="entry">



</div>
</div>

</div>
</div>
<!-- polyphen stop -->
<!-- sift start -->
<div class="sift">
<div class="list">

<div class="row">
<div class="entry">



</div>
</div>

<div class="row">
<div class="entry">



</div>
</div>

</div>
</div>
<!-- sift stop -->
<!--.allelefreq start -->
<div class="allelefreq">


<div class="tooltip_allelefrq">
0.00021
<span class="tooltiptext">allele counts<hr>ht: <span style='float:right;'>57</span><br>hm: <span style='float:right;'>0</span><br>wt: <span style='float:right;'>277082</span><hr>inhouse:<span style='float:right;'>0.00236</span></span>
</div>


</div>
<!--.allelefreq stop -->
<!--.allelefreq start -->
<div class="clin">



</div>
<!--.allelefreq stop -->
</div>
<!-- table row stop-->

<!-- var loop stop -->
</div>
<!-- variant table stop -->
</div>
</body>
</html>

Best answer

This is the best I can offer you. Note that the output includes the "tooltip text" that pops up when you hover over the data in the Gene column.

library(rvest)

# I saved your sample to my Desktop as test.html
pg = read_html('~/Desktop/test.html')

# count rows (including header):
n_rows = pg %>% html_nodes('div.tablerow') %>% length

# sprintf-friendly format to get the %d-th node matching
# //div[@class="tablerow"] (same as div.tablerow in CSS)
# All of the /div after this are columns
xp_fmt = '//div[@class="tablerow"][%d]/div'

# div.tableheader nodes contain column names
col_names = pg %>% html_nodes(xpath = sprintf(xp_fmt, 1L)) %>%
  html_text %>% trimws

# rows 2:n contain the actual data; gsub strips
# leading/trailing whitespace and deletes (rather than
# collapses) any run of two or more whitespace characters,
# which is why multi-transcript values end up glued together
rows = lapply(2:n_rows, function(ii) {
  pg %>% html_nodes(xpath = sprintf(xp_fmt, ii)) %>%
    html_text %>% gsub('^\\s+|\\s{2,}|\\s+$', '', .)
})

# can't forget those pesky factors
DF = as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(DF) = col_names
DF
# Position Variant Cons
# 1 1:117635487 G->T synonymous
# 2 1:149898455 G->A synonymous
# Gene
# 1 TTF2GeneCards Summary\nTTF2 (Transcription Termination Factor 2) is a Protein Coding gene.\nDiseases associated with TTF2 include Sexual Sadism and Narcissistic Personality Disorder.\nAmong its related pathways are Human Thyroid Stimulating Hormone (TSH) signaling pathway and Insulin secretion.\nGO annotations related to this gene include hydrolase activity and DNA-dependent ATPase activity.\nAn important paralog of this gene is HLTF.
# 2 SF3B4GeneCards Summary\nSF3B4 (Splicing Factor 3b Subunit 4) is a Protein Coding gene.\nDiseases associated with SF3B4 include Acrofacial Dysostosis 1, Nager Type and Acrofacial Dysostosis Syndrome Of Rodriguez.\nAmong its related pathways are mRNA Splicing - Major Pathway and Gene Expression.\nGO annotations related to this gene include nucleic acid binding and nucleotide binding.
# Transcript HGVSC
# 1 ENST00000369466 c.2940G>T
# 2 ENST00000457312ENST00000271628 c.390C>Ac.519C>A
# HGVSP PolyPhen SIFT
# 1 c.2940G>T(p.%3D)
# 2 c.390C>A(p.%3D)c.519C>A(p.%3D)
# AF
# 1 0.00000allele countsht: 0hm: 0wt: 0inhouse:0.00118
# 2 0.00021allele countsht: 57hm: 0wt: 277082inhouse:0.00236
# Clin
# 1
# 2
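The question also asked for one row per transcript, and the output above still carries the tooltip text in the Gene column. The following is only a rough sketch of one way to address both, not part of the original answer. It assumes that every gene cell contains a link whose text is the gene symbol, that the first div.hgvs cell in a row is HGVSC and the second is HGVSP, and that all per-transcript cells in a row hold the same number of .entry nodes. The inline HTML is a made-up miniature of the report, and the names entries, per_variant and DF_long are hypothetical.

```r
library(rvest)

# Hypothetical miniature of the report: a header row plus one variant row
# that affects two transcripts.
html <- '<div class="table">
<div class="tablerow"><div class="tableheader">Gene</div></div>
<div class="tablerow">
  <div class="gene"><div class="tooltip_gene"><a href="#">SF3B4</a>
    <span class="tooltiptext">GeneCards Summary: SF3B4 is a gene.</span></div></div>
  <div class="transcripts"><div class="list">
    <div class="row"><div class="entry"><a href="#">ENST00000457312</a></div></div>
    <div class="row"><div class="entry"><a href="#">ENST00000271628</a></div></div>
  </div></div>
  <div class="hgvs"><div class="list">
    <div class="row"><div class="entry"> c.390C>A </div></div>
    <div class="row"><div class="entry"> c.519C>A </div></div>
  </div></div>
  <div class="hgvs"><div class="list">
    <div class="row"><div class="entry"> c.390C>A(p.%3D) </div></div>
    <div class="row"><div class="entry"> c.519C>A(p.%3D) </div></div>
  </div></div>
</div>
</div>'
pg <- read_html(html)

# Trimmed text of every .entry node inside one cell (one value per transcript)
entries <- function(cell) trimws(html_text(html_nodes(cell, ".entry")))

# Drop the first div.tablerow, which holds the column headers
data_rows <- html_nodes(pg, "div.tablerow")[-1]

per_variant <- lapply(data_rows, function(row) {
  hgvs <- html_nodes(row, "div.hgvs")  # assumed order: HGVSC, then HGVSP
  data.frame(
    # Reading only the link text skips the tooltip span entirely
    Gene       = trimws(html_text(html_node(row, ".gene a"))),
    Transcript = entries(html_node(row, ".transcripts")),
    HGVSC      = entries(hgvs[[1]]),
    HGVSP      = entries(hgvs[[2]]),
    stringsAsFactors = FALSE
  )
})

# The single Gene value is recycled across that variant's transcripts
DF_long <- do.call(rbind, per_variant)
DF_long
```

Selecting each column by its own class, rather than taking every child div of the row positionally, keeps the per-transcript values separate instead of gluing them together. The remaining columns of the real report (Position, Variant, Cons, PolyPhen, SIFT, AF, Clin) could be added in the same way.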

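Note that it is not needed here, since all of your columns appear to be of type character, but a more sophisticated approach would write the rows out to a regular file (e.g. a csv) and then read the text back in with read.table (or better, fread), which detects column types automatically.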
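As a minimal sketch of that round trip, assume a cleaned data frame in which the AF column holds only the number (no tooltip text). The small DF below is made up for illustration and uses base R's read.csv in place of read.table or fread. type.convert is shown as an in-memory alternative that the original answer does not mention.

```r
# Hypothetical cleaned result: every column is still character
DF <- data.frame(
  Position = c("1:117635487", "1:149898455"),
  AF       = c("0.00000", "0.00021"),
  stringsAsFactors = FALSE
)

# Round trip through a temporary csv so the reader guesses column types
tmp <- tempfile(fileext = ".csv")
write.csv(DF, tmp, row.names = FALSE)
DF_typed <- read.csv(tmp, stringsAsFactors = FALSE)
str(DF_typed)

# In-memory alternative: let base R convert each column where it can
DF_typed2 <- type.convert(DF, as.is = TRUE)
str(DF_typed2)
```

In both cases Position stays character because of the colon, while AF becomes numeric.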

Regarding "html - Reading a complex HTML file into R with rvest", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/52297953/
