
java - Jsoup: remove everything before an H2 tag

Reposted. Author: 行者123. Updated: 2023-12-02 02:50:25

I have the HTML source of a page, fetched from the website with Jsoup.connect(). Below is a fragment of that HTML source (link: https://learn.microsoft.com/en-us/visualstudio/install/workload-component-id-vs-community )

.....
<p>When you set dependencies in your VSIX manifest, you must specify Component IDs
only. Use the tables on this page to determine our minimum component dependencies.
In some scenarios, this might mean that you specify only one component from a workload.
In other scenarios, it might mean that you specify multiple components from a single
workload or multiple components from multiple workloads. For more information, see
the
<a href="../extensibility/how-to-migrate-extensibility-projects-to-visual-studio-2017" data-linktype="relative-path">How to: Migrate Extensibility Projects to Visual Studio 2017</a> page.</p>
.....
<h2 id="visual-studio-core-editor-included-with-visual-studio-community-2017">Visual Studio core editor (included with Visual Studio Community 2017)</h2>
.....
<h2 id="see-also">See also</h2>
.....

What I want to do with jsoup is remove every piece of HTML before <h2 id="visual-studio-core-editor-included-with-visual-studio-community-2017">Visual Studio core editor (included with Visual Studio Community 2017)</h2>, as well as everything from <h2 id="see-also">See also</h2> onward (including that heading itself).

I have the following attempt, but it barely works for me:

        try {
            document = Jsoup.connect(Constants.URL).get();
        }
        catch (IOException iex) {
            iex.printStackTrace();
        }
        document = Parser.parse(document.toString().replaceAll(".*?<a href=\"workload-and-component-ids\" data-linktype=\"relative-path\">Visual Studio 2017 Workload and Component IDs</a> page.</p>", "") , Constants.URL);
        document = Parser.parse(document.toString().replaceAll("<h2 id=\"see-also\">See also</h2>?.*", "") , Constants.URL);
        return null;
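
A note on why the attempt above likely removes so little: in Java regular expressions, `.` does not match line terminators unless DOTALL is enabled, and `document.toString()` normally produces multi-line HTML, so `.*?` and `.*` stop at the first line break. The sketch below uses only the standard library to show the `(?s)` flag; the class name `DotAllDemo`, the helper `trim`, and the sample markup are made up for illustration and are not part of the original post.

```java
import java.util.regex.Pattern;

public class DotAllDemo {
    // Hypothetical helper: drops everything before startMarker and
    // everything from endMarker (inclusive) to the end of the input.
    static String trim(String html, String startMarker, String endMarker) {
        // (?s) enables DOTALL so '.' also matches line terminators;
        // Pattern.quote treats the markers as literal text, not regex syntax.
        String withoutHead = html.replaceFirst(
                "(?s)^.*?(?=" + Pattern.quote(startMarker) + ")", "");
        return withoutHead.replaceFirst(
                "(?s)" + Pattern.quote(endMarker) + ".*$", "");
    }

    public static void main(String[] args) {
        // Invented sample input standing in for the real page.
        String html = "<p>intro</p>\n<h2 id=\"a\">A</h2>\n<p>body</p>\n"
                + "<h2 id=\"see-also\">See also</h2>\n<p>links</p>";
        System.out.println(trim(html, "<h2 id=\"a\">", "<h2 id=\"see-also\">"));
    }
}
```

Regex-based trimming of HTML remains fragile (attribute order and whitespace can change), so this is meant only to explain the symptom, not as a recommended fix.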

Any help would be appreciated.

Best answer

A simple approach might be: take the page's entire HTML as a string, cut out the substring you need, and then parse that substring again with jsoup.

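        // Fetch the page and render it to a string once.
        Document doc = Jsoup.connect("https://learn.microsoft.com/en-us/visualstudio/install/workload-component-id-vs-community").get();
        String full = doc.html();
        // indexOf points at the id value; subtracting 8 steps back over `<h2 id="` to the start of the tag.
        int start = full.indexOf("visual-studio-core-editor-included-with-visual-studio-community-2017") - 8;
        int end = full.indexOf("unaffiliated-components") - 8; // use "see-also" here to cut at the "See also" heading instead
        Document doc2 = Jsoup.parse(full.substring(start, end));
        System.out.println(doc2);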
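
As an alternative to the substring approach above, the same cut can be sketched on the parsed DOM, which avoids the hard-coded offset. This is a minimal sketch, not the answer author's method: it assumes both headings are siblings under the same parent element (not verified against the live page), and the class name `HeadingSlice`, the helper `slice`, and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

public class HeadingSlice {
    // Hypothetical helper: keeps the nodes from startId (inclusive) up to
    // endId (exclusive) and returns their shared parent element.
    static Element slice(Document doc, String startId, String endId) {
        Element start = doc.getElementById(startId);
        Element end = doc.getElementById(endId);
        // Assumption: both headings exist and share one parent.
        if (start == null || end == null || start.parent() != end.parent()) {
            throw new IllegalArgumentException("headings not found or not siblings");
        }
        // Remove every node (elements and text) before the start heading.
        for (Node n = start.previousSibling(); n != null; n = start.previousSibling()) {
            n.remove();
        }
        // Remove every node after the end heading, then the end heading itself.
        for (Node n = end.nextSibling(); n != null; n = end.nextSibling()) {
            n.remove();
        }
        Element parent = start.parent();
        end.remove();
        return parent;
    }

    public static void main(String[] args) {
        // Invented sample input standing in for the real page.
        String html = "<div><p>intro</p><h2 id=\"first\">First</h2><p>keep</p>"
                + "<h2 id=\"see-also\">See also</h2><p>links</p></div>";
        Document doc = Jsoup.parse(html);
        System.out.println(slice(doc, "first", "see-also").html());
    }
}
```

The helper only prunes siblings inside the headings' parent; anything outside that parent (navigation, footer) stays in the document, which is why the sketch returns the parent element rather than the whole `Document`.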

Regarding java - Jsoup: remove everything before an H2 tag, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43935780/
