html - R: Parsing incomplete text from web pages (HTML)

Reposted. Author: 行者123. Updated: 2023-11-28 01:09:24

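I am trying to parse plain text from several scientific articles for subsequent text analysis. So far I have been using an R script by Tony Breyal based on the packages RCurl and XML. This works for all targeted journals except those published by http://www.sciencedirect.com. When I try to parse articles from SD (and this is consistent across all the SD journals I tested and need access to), the text object in R stores only the first part of the whole document. Unfortunately, I am not very familiar with HTML, but I think the problem must lie in the SD HTML code, since it works in all other cases. I am aware that some journals are not open access, but I have access rights, and the problem also occurs with open-access articles (see the example). Here is the code from GitHub: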

htmlToText <- function(input, ...) {
  ###--- PACKAGES ---###
  require(RCurl)
  require(XML)


  ###--- LOCAL FUNCTIONS ---###
  # Determine how to grab html for a single input element
  evaluate_input <- function(input) {
    # if input is a .html file
    if(file.exists(input)) {
      char.vec <- readLines(input, warn = FALSE)
      return(paste(char.vec, collapse = ""))
    }

    # if input is html text
    if(grepl("</html>", input, fixed = TRUE)) return(input)

    # if input is a URL, probably should use a regex here instead?
    if(!grepl(" ", input)) {
      # download SSL certificate in case of https problem
      if(!file.exists("cacert.perm")) download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.perm")
      return(getURL(input, followlocation = TRUE, cainfo = "cacert.perm"))
    }

    # return NULL if none of the conditions above apply
    return(NULL)
  }

  # convert HTML to plain text
  convert_html_to_text <- function(html) {
    doc <- htmlParse(html, asText = TRUE)
    text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
    return(text)
  }

  # format text vector into one character string
  collapse_text <- function(txt) {
    return(paste(txt, collapse = " "))
  }

  ###--- MAIN ---###
  # STEP 1: Evaluate input
  html.list <- lapply(input, evaluate_input)

  # STEP 2: Extract text from HTML
  text.list <- lapply(html.list, convert_html_to_text)

  # STEP 3: Return text
  text.vector <- sapply(text.list, collapse_text)
  return(text.vector)
}
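
(Editor's illustration, not part of the original question.) The core of the function above is the XPath expression that keeps every text node except those nested inside script, style, noscript, or form elements. The following is a minimal offline sketch of that filter applied to a small inline HTML string. It assumes the xml2 package instead of the XML package used above, and the sample markup is invented for illustration only.

```r
library(xml2)

# Hypothetical sample page: two visible paragraphs plus content that
# the XPath filter is expected to skip.
sample_html <- paste0(
  "<html><head><style>p {color: red;}</style></head><body>",
  "<p>Visible paragraph.</p>",
  "<script>var hidden = 1;</script>",
  "<form><label>Search box</label></form>",
  "<p>Second paragraph.</p>",
  "</body></html>"
)

# Same XPath filter as in convert_html_to_text() above
text_xpath <- paste0(
  "//text()[not(ancestor::script)][not(ancestor::style)]",
  "[not(ancestor::noscript)][not(ancestor::form)]"
)

doc <- read_html(sample_html)
text_nodes <- xml_find_all(doc, text_xpath)

# Collapse the remaining text nodes into one string
plain_text <- paste(trimws(xml_text(text_nodes)), collapse = " ")
print(plain_text)
```

If a page delivers only part of the article in its initial HTML, this step can only see that part, which would be consistent with the truncation described below.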

Now here is my code with an example article:

target <- "http://www.sciencedirect.com/science/article/pii/S1754504816300319"
temp.text <- htmlToText(target)

The unformatted text stops somewhere in the Methods section:

DNA was extracted using the MasterPure™ Yeast DNA Purification Kit (Epicentre, Madison, Wisconsin, USA) following the manufacturer's instructions.

Any suggestions or ideas?

P.S. I also tried html_text from rvest, with the same result.

Best answer

You can use your existing code as is and simply append ?np=y to the end of the URL, but this is a bit more compact:

library(rvest)
library(stringi)

target <- "http://www.sciencedirect.com/science/article/pii/S1754504816300319?np=y"

pg <- read_html(target)
pg %>%
  html_nodes(xpath=".//div[@id='centerContent']//child::node()/text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]") %>%
  stri_trim() %>%
  paste0(collapse=" ") %>%
  write(file="output.txt")
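
(Editor's illustration.) The first option mentioned in the answer, keeping the existing htmlToText function and only appending ?np=y to the URL, could be sketched as follows. The helper name append_query_param is hypothetical and not from the original answer; it assumes the URL has no fragment part and uses & when a query string is already present.

```r
# Hypothetical helper: append a query parameter to one or more URLs.
# Uses "?" when the URL has no query string yet, "&" otherwise.
append_query_param <- function(url, param = "np=y") {
  sep <- ifelse(grepl("?", url, fixed = TRUE), "&", "?")
  paste0(url, sep, param)
}

target <- "http://www.sciencedirect.com/science/article/pii/S1754504816300319"
target_np <- append_query_param(target)
print(target_np)

# With htmlToText() from the question defined and network access available,
# the call would then be:
# temp.text <- htmlToText(target_np)
```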

Some of the output (the total for that article is >80K):

 Fungal Ecology Volume 22 , August 2016, Pages 61–72        175394|| Species richness 
influences wine ecosystem function through a dominant species Primrose J. Boynton a , , ,
Duncan Greig a , b a Max Planck Institute for Evolutionary Biology, Plön, 24306, Germany
b The Galton Laboratory, Department of Genetics, Evolution, and Environment, University
College London, London, WC1E 6BT, UK Received 9 November 2015, Revised 27 March 2016,
Accepted 15 April 2016, Available online 1 June 2016 Corresponding editor: Marie Louise
Davey Abstract Increased species richness does not always cause increased ecosystem function.
Instead, richness can influence individual species with positive or negative ecosystem effects.
We investigated richness and function in fermenting wine, and found that richness indirectly
affects ecosystem function by altering the ecological dominance of Saccharomyces cerevisiae .
While S. cerevisiae generally dominates fermentations, it cannot dominate extremely species-rich
communities, probably because antagonistic species prevent it from growing. It is also diluted
from species-poor communities,
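
(Editor's illustration.) To check quickly whether an extraction was cut short, as in the question, one rough heuristic is to test the length of the result and look for a marker expected near the end of an article. The function name looks_complete, the "References" marker, and the 1000-character threshold are all assumptions chosen for this sketch, not something the original answer prescribes.

```r
# Hypothetical heuristic: treat the text as complete only if it is
# reasonably long AND contains a marker expected near the end.
looks_complete <- function(text, end_marker = "References", min_chars = 1000) {
  nchar(text) >= min_chars & grepl(end_marker, text, fixed = TRUE)
}

# Toy inputs standing in for real extraction results
truncated <- "Methods DNA was extracted following the manufacturer's instructions."
full <- paste(c(rep("Body text of the article.", 60), "References 1. ..."),
              collapse = " ")

print(looks_complete(truncated))
print(looks_complete(full))
```

A failed check does not prove truncation (some articles label the final section differently), so the marker should be adapted to the journal at hand.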

Regarding "html - R: Parsing incomplete text from web pages (HTML)", the corresponding question can be found on Stack Overflow: https://stackoverflow.com/questions/38347902/
