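I am new to R and to Stack Overflow, so please be gentle; I will try to keep this post in proper form. I am working on a project that compares whole exome sequencing (WES) results with proteomics data. Our WES facility delivers the data only as HTML files, so I need to read them into R to continue my work.
I tried to follow the DataCamp tutorial for rvest, but I think the problem may be that the HTML file is too complex: all I get is some text scattered between runs of \t and \n. I guess my html_node selector is not correct?
Here is my R code, followed by the HTML, shortened and with the variants modified.
What I would like to get is a data frame with the same columns as the HTML table. As the example shows, some variants affect multiple transcripts; in those cases one row per transcript would be perfect, but it is by no means a must.
Thank you very much for your help!
Sebastian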
library(tidyverse)
library(rvest)
htmlALL <- read_html("Example_html")
getDATA <- function(html){
html %>%
html_nodes(".table") %>%
html_text() %>%
str_trim() %>%
unlist()
}
df_html <- getDATA(htmlALL)
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head>
<!-- add title in the brower tab bar -->
<title>Homozygous variants of sample XXX </title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<!-- change style to look nice -->
<style type="text/css">
html {
text-align: center;
vertical-align: middle;
height: 100%;
width: 100%;
}
body {
background: #eee url('http://i.imgur.com/eeQeRmk.png'); /* http://subtlepatterns.com/weave/ */
font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;
font-size: 62.5%;
entry-height: 1;
color: #585858;
padding: 22px 10px;
padding-bottom: 55px;
}
::selection { background: #5f74a0; color: #fff; }
::-moz-selection { background: #5f74a0; color: #fff; }
::-webkit-selection { background: #5f74a0; color: #fff; }
br { display: block; entry-height: 1.6em; }
input, textarea {
-webkit-font-smoothing: antialiased;
-webkit-text-size-adjust: 100%;
-ms-text-size-adjust: 100%;
-webkit-box-sizing: border-box;
-moz-box-sizing: border-box;
box-sizing: border-box;
outentry: none;
}
blockquote, q { quotes: none; }
blockquote:before, blockquote:after, q:before, q:after { content: ''; content: none; }
strong, b { font-weight: bold; }
h1 {
font-weight: bold;
font-size: 3.6em;
entry-height: 1.7em;
margin-bottom: 10px;
text-align: center;
}
h2 {
font-weight: bold;
font-size: 2.6em;
entry-height: 1.7em;
margin-bottom: 10px;
text-align: center;
}
/** big white sheet everything is on **/
.wrapper {
display: block;
width: 95%;
background: #fff;
margin: 0 auto;
padding: 10px 17px 100px;
box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-webkit-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-moz-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-ms-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-o-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
overflow-x: auto;
overflow-y: visible;
}
/* smaller box the family information is on */
.info{
display: block;
width: 800px;
background: #f2f2f2;
margin: 0 auto;
padding: 10px 17px 10px 10px;
box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-webkit-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-moz-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-ms-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-o-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
font-size: 1.8em;
margin-bottom: 10px;
}
/* this is what actually contains the info */
.table {
display: table;
margin: 0 auto;
width: 99%;
font-size: 1.2em;
margin-bottom: 15px;
border-collapse: collapse;
overflow: visible;
}
/* one row of the variants */
.tablerow {
display: table-row;
overflow: visible;
border: 1px solid gray;
width: 100%;
}
/* header are bigger and may in the future be clickable to sort accordginly*/
.tableheader {
display: table-cell;
background: #f2f2f2;
padding: 3px 10px;
margin-bottom: 25px;
font-size: 1.8em;
box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-webkit-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-moz-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-ms-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-o-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
}
/* in the following each column gets specified to increase readablity*/
.position {
display: table-cell;
padding: 3px 10px;
font-size: 1.4em;
height: 100%;
text-align: center;
vertical-align: middle;
}
.variants {
display: table-cell;
height: 100%;
vertical-align: middle;
overflow: visible;
white-space: nowrap;
}
.stacked {
display: table;
height: 50%;
width: 100%;
}
.center {
display: table-cell;
vertical-align: middle;
width: 100%;
padding: 0px 5px;
}
.consequences {
display: table-cell;
height: 100%;
vertical-align: middle;
padding: 3px 10px;
}
.gene {
display: table-cell;
padding: 3px 15px;
height: 100%;
vertical-align: middle;
font-size: 1.4em;
font-weight: bold;
}
.transcripts {
display: table-cell;
vertical-align: middle;
height: 100%;
}
.list {
height: 100%;
width: 100%;
display: table;
table-layout: fixed;
}
.row {
display: table-row;
overflow: visible;
vertical-align: middle;
}
.entry {
display: table-cell;
vertical-align:middle;
padding: 0% 1% 0% 1%;
white-space: nowrap;
text-overflow: ellipsis;
overflow: hidden;
}
.cdspos {
display: table-cell;
vertical-align: middle;
height: 100%;
}
.exon {
display: table-cell;
vertical-align: middle;
height: 100%;
}
.hgvs {
display: table-cell;
height: 100%;
vertical-align: middle;
}
.hgvs .list .row{
display: table-row;
vertical-align: middle;
}
.polyphen {
display: table-cell;
height: 100%;
vertical-align: middle;
}
.polyphen .list .row{
display: table-row;
vertical-align: middle;
}
.sift {
display: table-cell;
height: 100%;
vertical-align: middle;
}
.sift .list .row{
display: table-row;
vertical-align: middle;
}
.allelefreq {
display: table-cell;
height: 100%;
vertical-align: middle;
}
/* Tooltip container */
.tooltip_gene, .tooltip_allelefrq ,.tooltip_qual{
position: relative;
display: inline-block;
border-bottom: 1px dotted black; /* If you want dots under the hoverable text */
}
.tooltiptext{
visibility: hidden;
overflow: auto;
min-width: 400px;
background-color: #ffb380;
color: black;
text-align: left;
padding: 5px 10px;
border-radius: 6px;
font-size: 12pt;
font-weight: normal;
/* Position the tooltip text - see examples below! */
position: absolute;
z-index:1;
/* shadow */
box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-webkit-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-moz-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-ms-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-o-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
opacity: 0.95;
filter: alpha(opacity=95);
}
/* Tooltip text */
.tooltip_gene .tooltiptext {
top: -5px;
left: 105%;
}
/* Tooltip text */
.tooltip_allelefrq .tooltiptext {
top: -5px;
right: 105%;
min-width: 120px;
}
/* Show the tooltip text when you mouse over the tooltip container */
.tooltip_allelefrq:hover .tooltiptext, .tooltip_gene:hover .tooltiptext {
visibility: visible;
}
.clin {
display: table-cell;
height: 100%;
vertical-align: middle;
padding: 0% 1% 0% 1%;
white-space: nowrap;
text-overflow: ellipsis;
overflow: hidden;
}
</style>
<body>
<div class="wrapper">
<!-- add info about patients -->
<h1>Homozygous variants of sample XXX</h1>
<h2>Tue Jan 23 09:01:56 2018</h2>
<div class="info">
Patient only<br>
</div>
<!-- variants table start -->
<div class="table">
<!-- table header start -->
<div class="tablerow">
<div class="tableheader">
Position
</div>
<div class="tableheader">
Variant
</div>
<div class="tableheader">
Cons
</div>
<div class="tableheader">
Gene
</div>
<div class="tableheader">
Transcript
</div>
<div class="tableheader">
HGVSC
</div>
<div class="tableheader">
HGVSP
</div>
<div class="tableheader">
PolyPhen
</div>
<div class="tableheader">
SIFT
</div>
<div class="tableheader">
AF
</div>
<div class="tableheader">
Clin
</div>
</div>
<!-- table header stop -->
<!-- var loop start -->
<div class="tablerow" >
<!-- position start -->
<div class="position">
<a href="http://gnomad.broadinstitute.org/region/1-117635467-117635507">1:117635487</a>
</div>
<!-- position stop -->
<!-- variants start -->
<div class="variants">
G->T
</div>
<!-- variants stop -->
<!-- consequences start -->
<div class="consequences" style="background: rgb(196, 197, 198);">
synonymous
</div>
<!-- consequences stop -->
<!-- gene start -->
<div class="gene" >
<div class="tooltip_gene">
<a href="http://www.genecards.org/cgi-bin/carddisp.pl?gene=TTF2" >
TTF2
</a>
<span class="tooltiptext">GeneCards Summary<hr>
TTF2 (Transcription Termination Factor 2) is a Protein Coding gene.
Diseases associated with TTF2 include Sexual Sadism and Narcissistic Personality Disorder.
Among its related pathways are Human Thyroid Stimulating Hormone (TSH) signaling pathway and Insulin secretion.
GO annotations related to this gene include hydrolase activity and DNA-dependent ATPase activity.
An important paralog of this gene is HLTF.</span>
</div>
</div>
<!-- gene stop -->
<!-- transcripts start -->
<div class="transcripts">
<div class="list">
<div class="row">
<div class="entry">
<a href="http://grch37.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;t=ENST00000369466">ENST00000369466
</a>
</div>
</div>
</div>
</div>
<!-- transcripts stop -->
<!-- exon start -->
<!-- <div class="exon">
<div class="list">
</div>
</div>-->
<!-- exon stop -->
<!-- hgvsc start -->
<div class="hgvs">
<div class="list">
<div class="row">
<div class="entry">
c.2940G>T
</div>
</div>
</div>
</div>
<!-- hgvsc stop -->
<!-- hgvsp start -->
<div class="hgvs">
<div class="list">
<div class="row">
<div class="entry">
c.2940G>T(p.%3D)
</div>
</div>
</div>
</div>
<!-- hgvsp stop -->
<!-- polyphen start -->
<div class="polyphen">
<div class="list">
<div class="row">
<div class="entry">
</div>
</div>
</div>
</div>
<!-- polyphen stop -->
<!-- sift start -->
<div class="sift">
<div class="list">
<div class="row">
<div class="entry">
</div>
</div>
</div>
</div>
<!-- sift stop -->
<!--.allelefreq start -->
<div class="allelefreq">
<div class="tooltip_allelefrq">
0.00000
<span class="tooltiptext">allele counts<hr>ht: <span style='float:right;'>0</span><br>hm: <span style='float:right;'>0</span><br>wt: <span style='float:right;'>0</span><hr>inhouse:<span style='float:right;'>0.00118</span></span>
</div>
</div>
<!--.allelefreq stop -->
<!--.allelefreq start -->
<div class="clin">
</div>
<!--.allelefreq stop -->
</div>
<!-- table row stop-->
<div class="tablerow" >
<!-- position start -->
<div class="position">
<a href="http://gnomad.broadinstitute.org/region/1-149898435-149898475">1:149898455</a>
</div>
<!-- position stop -->
<!-- variants start -->
<div class="variants">
<a href="https://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=rs143105666">G->A</a>
</div>
<!-- variants stop -->
<!-- consequences start -->
<div class="consequences" style="background: rgb(196, 197, 198);">
synonymous
</div>
<!-- consequences stop -->
<!-- gene start -->
<div class="gene" >
<div class="tooltip_gene">
<a href="http://www.genecards.org/cgi-bin/carddisp.pl?gene=SF3B4" >
SF3B4
</a>
<span class="tooltiptext">GeneCards Summary<hr>
SF3B4 (Splicing Factor 3b Subunit 4) is a Protein Coding gene.
Diseases associated with SF3B4 include Acrofacial Dysostosis 1, Nager Type and Acrofacial Dysostosis Syndrome Of Rodriguez.
Among its related pathways are mRNA Splicing - Major Pathway and Gene Expression.
GO annotations related to this gene include nucleic acid binding and nucleotide binding.
</span>
</div>
</div>
<!-- gene stop -->
<!-- transcripts start -->
<div class="transcripts">
<div class="list">
<div class="row">
<div class="entry">
<a href="http://grch37.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;t=ENST00000457312">ENST00000457312
</a>
</div>
</div>
<div class="row">
<div class="entry">
<a href="http://grch37.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;t=ENST00000271628">ENST00000271628
</a>
</div>
</div>
</div>
</div>
<!-- transcripts stop -->
<!-- exon start -->
<!-- <div class="exon">
<div class="list">
</div>
</div>-->
<!-- exon stop -->
<!-- hgvsc start -->
<div class="hgvs">
<div class="list">
<div class="row">
<div class="entry">
c.390C>A
</div>
</div>
<div class="row">
<div class="entry">
c.519C>A
</div>
</div>
</div>
</div>
<!-- hgvsc stop -->
<!-- hgvsp start -->
<div class="hgvs">
<div class="list">
<div class="row">
<div class="entry">
c.390C>A(p.%3D)
</div>
</div>
<div class="row">
<div class="entry">
c.519C>A(p.%3D)
</div>
</div>
</div>
</div>
<!-- hgvsp stop -->
<!-- polyphen start -->
<div class="polyphen">
<div class="list">
<div class="row">
<div class="entry">
</div>
</div>
<div class="row">
<div class="entry">
</div>
</div>
</div>
</div>
<!-- polyphen stop -->
<!-- sift start -->
<div class="sift">
<div class="list">
<div class="row">
<div class="entry">
</div>
</div>
<div class="row">
<div class="entry">
</div>
</div>
</div>
</div>
<!-- sift stop -->
<!--.allelefreq start -->
<div class="allelefreq">
<div class="tooltip_allelefrq">
0.00021
<span class="tooltiptext">allele counts<hr>ht: <span style='float:right;'>57</span><br>hm: <span style='float:right;'>0</span><br>wt: <span style='float:right;'>277082</span><hr>inhouse:<span style='float:right;'>0.00236</span></span>
</div>
</div>
<!--.allelefreq stop -->
<!--.allelefreq start -->
<div class="clin">
</div>
<!--.allelefreq stop -->
</div>
<!-- table row stop-->
<!-- var loop stop -->
</div>
<!-- variant table stop -->
</div>
</body>
</html>
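Best answer
This is the best I can do for you. Note that the output includes the "tooltip text" that pops up when you hover over the data in the Gene column.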
library(rvest)
# I saved your sample to my Desktop as test.html
pg = read_html('~/Desktop/test.html')
# count rows (including header):
n_rows = pg %>% html_nodes('div.tablerow') %>% length
# sprintf-friendly format to get the %d-th node matching
# //div[@class="tablerow"] (same as div.tablerow in CSS)
# All of the /div after this are columns
xp_fmt = '//div[@class="tablerow"][%d]/div'
# div.tableheader nodes contain column names
col_names = pg %>% html_nodes(xpath = sprintf(xp_fmt, 1L)) %>%
html_text %>% trimws
# rows 2:n contain the actual data; gsub is
# stripping leading/trailing whitespace and
# any duplicate internal whitespace
rows = lapply(2:n_rows, function(ii) {
pg %>% html_nodes(xpath = sprintf(xp_fmt, ii)) %>%
html_text %>% gsub('^\\s+|\\s{2,}|\\s+$', '', .)
})
# can't forget those pesky factors
DF = as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(DF) = col_names
DF
# Position Variant Cons
# 1 1:117635487 G->T synonymous
# 2 1:149898455 G->A synonymous
# Gene
# 1 TTF2GeneCards Summary\nTTF2 (Transcription Termination Factor 2) is a Protein Coding gene.\nDiseases associated with TTF2 include Sexual Sadism and Narcissistic Personality Disorder.\nAmong its related pathways are Human Thyroid Stimulating Hormone (TSH) signaling pathway and Insulin secretion.\nGO annotations related to this gene include hydrolase activity and DNA-dependent ATPase activity.\nAn important paralog of this gene is HLTF.
# 2 SF3B4GeneCards Summary\nSF3B4 (Splicing Factor 3b Subunit 4) is a Protein Coding gene.\nDiseases associated with SF3B4 include Acrofacial Dysostosis 1, Nager Type and Acrofacial Dysostosis Syndrome Of Rodriguez.\nAmong its related pathways are mRNA Splicing - Major Pathway and Gene Expression.\nGO annotations related to this gene include nucleic acid binding and nucleotide binding.
# Transcript HGVSC
# 1 ENST00000369466 c.2940G>T
# 2 ENST00000457312ENST00000271628 c.390C>Ac.519C>A
# HGVSP PolyPhen SIFT
# 1 c.2940G>T(p.%3D)
# 2 c.390C>A(p.%3D)c.519C>A(p.%3D)
# AF
# 1 0.00000allele countsht: 0hm: 0wt: 0inhouse:0.00118
# 2 0.00021allele countsht: 57hm: 0wt: 277082inhouse:0.00236
# Clin
# 1
# 2
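The tooltip text and the run-together transcript IDs in the output above can probably be avoided by editing the parsed document before extracting text. What follows is only a minimal sketch under stated assumptions, not a tested solution for the full report: the inline HTML is a cut-down stand-in for the file, the first div.tablerow is assumed to be the header, and `cell` is a hypothetical helper introduced for illustration. It removes every span.tooltiptext with xml2::xml_remove() and then builds one row per div.entry, so a variant with two transcripts yields two rows. Only one hgvs column is shown because HGVSC and HGVSP share the class `hgvs` in the real file and would have to be told apart by position.

```r
library(rvest)
library(xml2)

# Cut-down stand-in for the report; class names follow the question's HTML.
html <- '
<div class="table">
  <div class="tablerow">
    <div class="tableheader">Gene</div>
    <div class="tableheader">Transcript</div>
    <div class="tableheader">HGVSC</div>
  </div>
  <div class="tablerow">
    <div class="gene"><div class="tooltip_gene"><a>SF3B4</a>
      <span class="tooltiptext">GeneCards Summary</span></div></div>
    <div class="transcripts"><div class="list">
      <div class="row"><div class="entry"><a>ENST00000457312</a></div></div>
      <div class="row"><div class="entry"><a>ENST00000271628</a></div></div>
    </div></div>
    <div class="hgvs"><div class="list">
      <div class="row"><div class="entry">c.390C>A</div></div>
      <div class="row"><div class="entry">c.519C>A</div></div>
    </div></div>
  </div>
</div>'
pg <- read_html(html)

# Drop the hover-only tooltip spans so html_text() no longer returns them.
xml_remove(html_nodes(pg, "span.tooltiptext"))

# Assumption: the first div.tablerow is the header, the rest are variants.
rows <- html_nodes(pg, "div.tablerow")[-1]

# Hypothetical helper: text of one column within a row. Returns one element
# per div.entry when the cell holds a list, otherwise the cell's own text.
cell <- function(row, css) {
  node <- html_node(row, css)
  entries <- html_nodes(node, "div.entry")
  txt <- if (length(entries) > 0) html_text(entries) else html_text(node)
  trimws(txt)
}

# One data frame per variant; the length-1 Gene value is recycled
# across that variant's transcripts.
per_transcript <- do.call(rbind, lapply(rows, function(row) {
  data.frame(
    Gene       = cell(row, "div.gene"),
    Transcript = cell(row, "div.transcripts"),
    HGVSC      = cell(row, "div.hgvs"),
    stringsAsFactors = FALSE
  )
}))
per_transcript
```

Recycling only works when every list-valued column in a row has the same number of entries, which holds in the sample but is an assumption about the full report.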
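Note that automatic type detection is not needed here, since all of your columns appear to be of type character, but a more sophisticated approach would write the rows out to a regular file (e.g. csv) and then read the text back in with read.table (or better, fread) so that column types are detected automatically.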
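The round trip through a file might look roughly like the sketch below. It is an illustration under assumptions rather than part of the original answer: `DF` is a hand-built stand-in with three of the columns, and the AF values assume the cell holds only the number, which is not the case in the output above, where the tooltip text is glued onto it. read.csv is used to keep the example in base R; data.table::fread could be swapped in for the read step.

```r
# Stand-in for the all-character data frame produced by scraping.
# Assumption: AF contains only the number, with no tooltip text attached.
DF <- data.frame(
  Position = c("1:117635487", "1:149898455"),
  Variant  = c("G->T", "G->A"),
  AF       = c("0.00000", "0.00021"),
  stringsAsFactors = FALSE
)

# Round-trip through a temporary CSV so the reader can infer column types.
tmp <- tempfile(fileext = ".csv")
write.csv(DF, tmp, row.names = FALSE)
DF_typed <- read.csv(tmp, stringsAsFactors = FALSE)
unlink(tmp)

# Base R shortcut that skips the file: type.convert() guesses a type
# for each column of the data frame directly.
DF_typed2 <- type.convert(DF, as.is = TRUE)

# Inspect the detected column classes.
sapply(DF_typed, class)
```

Position and Variant stay character because they contain non-numeric symbols, while AF is read as numeric.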
This question, "html - Reading a complex html file into R with rvest", comes from Stack Overflow: https://stackoverflow.com/questions/52297953/