
java - How to extract elements from an HTML page using HtmlUnit

Reposted. Author: 行者123. Updated: 2023-12-03 23:08:45

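I have two questions (problems, really) about parsing an HTML page with HtmlUnit. I tried their "Getting Started" guide and searched Google, but neither helped. This is my first question here.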

1) I want to extract the text of the following bold tag from the page:

<b class="productPrice">Five Dollars</b>

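2) I want to extract the entire text of the last paragraph in the following structure (including the text of any nested spans or links, if present):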

<div class="alertContainer">
<p>Hello</p>
<p>Haven't you registeret yet?</p>
<p>Registrations will close on 3 July 2012.<span>So don't wait</span></p>
</div>

Could you give me a one-line code snippet? How should I do this? I'm new to HtmlUnit.

Edit:

HtmlUnit has getElementByName() and getElementById(), so what do we use when we want to select by class?

That would answer my first question.

Best answer

Actually, I suggest you use XPath and JTidy instead, like this:

import java.io.IOException;
import java.net.MalformedURLException;
import java.util.List;

import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlItalic;
import com.gargoylesoftware.htmlunit.html.HtmlOption;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlRadioButtonInput;
import com.gargoylesoftware.htmlunit.html.HtmlSelect;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextArea;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

public class WebScraper {

    private static final String TEXT = "some random text here";
    private static final String SWALLOW = "continental";
    private static final String COLOR = "indigo2";
    private static final String QUESTION = "why?";
    private static final String NAME = "Leo";

    /**
     * @param args
     * @throws IOException
     * @throws MalformedURLException
     * @throws FailingHttpStatusCodeException
     */
    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {

        //to get the HTML Xpath, download and install firefox plugin Xpather from
        //http://jassage.com/xpather-1.4.5b.xpi
        //
        //then right-click on any part of the html and choose "show in xpather"
        //
        //HtmlUnit is a suite for functional web app tests (headless) with a
        //built-in "browser". Very useful for screen scraping.
        //
        //for HtmlUnit examples and usage, try
        //http://htmlunit.sourceforge.net/gettingStarted.html
        //
        //sometimes, the HTML is malformed, so you'll need to "clean it"
        //that's why I've also added JTidy to this project

        WebClient webClient = new WebClient();

        HtmlPage page = webClient.getPage("http://cgi-lib.berkeley.edu/ex/simple-form.html");

        // System.out.println(page.asXml());

        HtmlForm form = (HtmlForm) page.getByXPath("/html/body/form").get(0);

        HtmlTextInput name = form.getInputByName("name");
        name.setValueAttribute(NAME);

        HtmlTextInput quest = form.getInputByName("quest");
        quest.setValueAttribute(QUESTION);

        HtmlSelect color = form.getOneHtmlElementByAttribute("select", "name", "color");
        List<HtmlOption> options = color.getOptions();
        for (HtmlOption op : options) {
            if (op.getValueAttribute().equals(COLOR)) {
                op.setSelected(true);
            }
        }

        HtmlTextArea text = form.getOneHtmlElementByAttribute("textarea", "name", "text");
        text.setText(TEXT);

        //swallow
        HtmlRadioButtonInput swallow = form.getInputByValue(SWALLOW);
        swallow.click();

        HtmlSubmitInput submit = form.getInputByValue("here");

        //submit
        HtmlPage page2 = submit.click();

        // System.out.println(page2.asXml());

        String color2 = ((HtmlItalic) page2.getByXPath("//dd[1]/i").get(0)).getTextContent();
        String name2 = ((HtmlItalic) page2.getByXPath("//dd[2]/i").get(0)).getTextContent();
        String quest2 = ((HtmlItalic) page2.getByXPath("//dd[3]/i").get(0)).getTextContent();
        String swallow2 = ((HtmlItalic) page2.getByXPath("//dd[4]/i").get(0)).getTextContent();
        String text2 = ((HtmlItalic) page2.getByXPath("//dd[5]/i").get(0)).getTextContent();

        System.out.println(COLOR.equals(color2)
                && NAME.equals(name2)
                && QUESTION.equals(quest2)
                && SWALLOW.equals(swallow2)
                && TEXT.equals(text2));

        webClient.closeAllWindows();

    }

}
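
The example above fills in a form rather than selecting by class, so here is a minimal sketch of how the XPath approach could apply to the two fragments in the question. It is not part of the original answer. To stay runnable without extra libraries it uses the JDK's own javax.xml.xpath on a well-formed copy of the markup. The class name ClassXPathSketch, the wrapping body element, and the helper textOf are assumptions made for illustration. The same two expressions should be usable with page.getByXPath(...) as called in the answer's code, though that has not been verified here against a live page.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class ClassXPathSketch {

    // The two fragments from the question, wrapped in a hypothetical <body>
    // root so that they form a single well-formed XML document.
    static final String MARKUP =
            "<body>"
            + "<b class=\"productPrice\">Five Dollars</b>"
            + "<div class=\"alertContainer\">"
            + "<p>Hello</p>"
            + "<p>Haven't you registeret yet?</p>"
            + "<p>Registrations will close on 3 July 2012.<span>So don't wait</span></p>"
            + "</div>"
            + "</body>";

    // Question 1: select an element by its class attribute (exact match).
    static final String PRICE_XPATH = "//b[@class='productPrice']";

    // Question 2: the last <p> child of the container div.
    static final String LAST_PARAGRAPH_XPATH = "//div[@class='alertContainer']/p[last()]";

    // Evaluates the expression against MARKUP and returns the string value of
    // the first matching node, which includes the text of nested elements.
    static String textOf(String xpathExpression) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(xpathExpression, new InputSource(new StringReader(MARKUP)));
    }

    public static void main(String[] args) throws XPathExpressionException {
        System.out.println(textOf(PRICE_XPATH));
        System.out.println(textOf(LAST_PARAGRAPH_XPATH));
    }
}
```

Note that @class='productPrice' only matches when the attribute holds exactly that one class name. An element carrying several classes would need a token-based test such as contains(concat(' ', normalize-space(@class), ' '), ' productPrice ').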

Regarding "java - How to extract elements from an HTML page using HtmlUnit", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/21278722/
