java - jsoup : parse data of p tag which is between every h2 tag

Reposted. Author: 行者123. Updated: 2023-12-02 13:15:18

For the last 3 days I have been trying to parse some information with jsoup in Java -_-. This is my code:

    Document document = Jsoup.connect(urlofpage).get();
    Elements links = document.select(".contentBox");
    for (Element link : links) {
        // String name = link.text();
        String title = link.select("h2").text();
        int h2length = link.select("h2").size();

        for (int i = 0; i <= h2length - 1; i++) {
            String s = link.select("h2").get(i).text();
            boolean desc1 = Pattern.compile("What is").matcher(s).find();
            boolean desc2 = Pattern.compile("Uses for").matcher(s).find();

            if (desc1 == true || desc2 == true) {
                String descritop = "";
                int plength = link.select("p ~ h2 ~ p").size() - link.select("h2 ~ p").size();
                System.out.println(h2length);
                String ssv = link.select("h2 ~ p").get(1).text();
            }
        }
    }

It fetches the data as instructed, getting the h2 and p data separately, but the problem is that I want to parse the data inside the <p> tags that come right after each <h2> tag.

For example (HTML content):

    <h2>main content</h2>
    <div class="acx"></div>
    <p>content</p>
    <p>content 2</p>

    <h2>content 2</h2>
    <div class="acx"></div>
    <p>new content od 2</p>
    <p>new 2</p>

Now it should be fetched like this (in an array):

array[0] = "content content 2",
array[1] = "new content od 2 new 2",

Is there any solution?

The URL being parsed is https://www.drugs.com/mtm/a-d-topical.html

Best answer

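My idea is simple. Get the first p element after the h2 element and add it to an ArrayList, then check whether the next element is also a p and add it too. For example: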

    ArrayList<ArrayList<String>> textInsidePList = new ArrayList<ArrayList<String>>();
    for (Element link : links) {
        // "h2 ~ p" matches every p sibling that follows an h2
        Elements headings2 = link.select("h2 ~ p");
        for (int i = 0; i < headings2.size(); i++) {
            ArrayList<String> textInsideP = new ArrayList<String>();
            textInsideP.add(headings2.get(i).text());
            Element nextPar = headings2.get(i).nextElementSibling();
            // compare strings with equals(), not ==, and guard against a missing sibling
            if (nextPar != null && "p".equals(nextPar.nodeName())) {
                textInsideP.add(nextPar.text());
            }
            textInsidePList.add(textInsideP);
        }
    }

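If you have more than 2 p elements, you just need to write a recursion. But this code will not work if other elements can appear between the p elements. As a result you will have an ArrayList of ArrayLists, where each inner ArrayList represents one h2 element together with the text from its p nodes.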

Edit. Recursion example:

    import java.io.IOException;
    import java.util.ArrayList;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class Main { // class name is arbitrary

        public static void main(String[] args) throws IOException {
            String html = "<h2>first h2</h2>" +
                    "<div class=\"acx\"></div>" +
                    "<p>first h2 content 1</p>" +
                    "<p>first h2 content 2</p>" +
                    "<p>first h2 content 3</p>" +
                    "<p>first h2 content 4</p>" +
                    "<h2>second h2</h2>" +
                    "<div class=\"acx\"></div>" +
                    "<p>second h2 content 1</p>" +
                    "<p>second h2 content 2</p>";
            Document document = Jsoup.parse(html);

            /* creating first order ArrayList */
            ArrayList<ArrayList<String>> textInsidePList = new ArrayList<ArrayList<String>>();
            Elements headings2 = document.select("h2");
            for (Element heading2 : headings2) {
                /* creating second order ArrayList and adding data */
                ArrayList<String> textInsideP = new ArrayList<String>();
                textInsideP.add(heading2.text()); // delete this line to remove h2 content from array, this is just for example
                parsingRecursion(heading2, textInsideP);
                textInsidePList.add(textInsideP);
            }

            /* iterating through ArrayList */
            for (ArrayList<String> firstH2 : textInsidePList) {
                System.out.println("h2:");
                for (String parsInsideH2 : firstH2) {
                    System.out.println(parsInsideH2);
                }
            }
        }

        /* recursive function: walks the following siblings until the next h2 or the end */
        private static void parsingRecursion(Element heading2, ArrayList<String> textInsideP) {
            Element nextPar = heading2.nextElementSibling();
            // strings are compared with equals(), not ==
            if (nextPar != null && "p".equals(nextPar.nodeName())) {
                textInsideP.add(nextPar.text());
                parsingRecursion(nextPar, textInsideP);
            } else if (nextPar != null && !"h2".equals(nextPar.nodeName())) {
                // neither p nor h2 (e.g. the div): skip it and keep walking
                parsingRecursion(nextPar, textInsideP);
            }
        }
    }

Console output:

    h2:
    first h2
    first h2 content 1
    first h2 content 2
    first h2 content 3
    first h2 content 4
    h2:
    second h2
    second h2 content 1
    second h2 content 2

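Recursion is used because we don't know how many "p" nodes we will meet before the next h2. An ArrayList is used instead of an array because we can add elements dynamically without having to set the array's size up front.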
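The same walk can also be written as a plain loop over the sibling chain. What follows is a minimal sketch rather than the answer's code: so that it runs without jsoup on the classpath, it uses a hypothetical `Node` class (a tag name plus its text) as a stand-in for jsoup's `Element`, and the names `SiblingGrouping` and `groupByHeading` are made up for illustration. Anything that is neither `h2` nor `p` is simply skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SiblingGrouping {

    /** Hypothetical stand-in for a jsoup Element: a tag name and its text. */
    static final class Node {
        final String tag;
        final String text;

        Node(String tag, String text) {
            this.tag = tag;
            this.text = text;
        }
    }

    /** Collects the text of every "p" under the most recent "h2". */
    static List<List<String>> groupByHeading(List<Node> siblings) {
        List<List<String>> groups = new ArrayList<>();
        List<String> current = null; // stays null until the first h2 is seen
        for (Node node : siblings) {
            if ("h2".equals(node.tag)) {
                // a new heading starts a new group
                current = new ArrayList<>();
                groups.add(current);
            } else if ("p".equals(node.tag) && current != null) {
                current.add(node.text);
            }
            // any other tag (for example the div) is ignored
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Node> siblings = Arrays.asList(
                new Node("h2", "first h2"),
                new Node("div", ""),
                new Node("p", "first h2 content 1"),
                new Node("p", "first h2 content 2"),
                new Node("h2", "second h2"),
                new Node("div", ""),
                new Node("p", "second h2 content 1"));
        // prints [[first h2 content 1, first h2 content 2], [second h2 content 1]]
        System.out.println(groupByHeading(siblings));
    }
}
```

With jsoup, the list of siblings would presumably come from the children of the container element. The loop form avoids deep call stacks when a section holds many paragraphs.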

Edit #2, because the question was changed:

    import java.io.IOException;
    import java.util.ArrayList;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class Main { // class name is arbitrary

        public static void main(String[] args) throws IOException {
            // URL taken from the question; replace it with your own
            String pathToYourCustomUrl = "https://www.drugs.com/mtm/a-d-topical.html";
            Document document = Jsoup.connect(pathToYourCustomUrl).get();
            Elements links = document.select(".contentBox");
            for (Element link : links) {
                /* creating first order ArrayList */
                ArrayList<ArrayList<String>> textInsidePList = new ArrayList<ArrayList<String>>();
                // search inside the current .contentBox, not the whole document
                Elements headings2 = link.select("h2");
                for (Element heading2 : headings2) {
                    /* creating second order ArrayList and adding data */
                    ArrayList<String> textInsideP = new ArrayList<String>();
                    parsingRecursion(heading2, textInsideP);
                    textInsidePList.add(textInsideP);
                }

                /* iterating through ArrayList */
                for (ArrayList<String> firstH2 : textInsidePList) {
                    System.out.println("h2:");
                    for (String parsInsideH2 : firstH2) {
                        System.out.println("p:" + parsInsideH2);
                    }
                }
            }
        }

        /* recursive function: walks the following siblings until the next h2 or the end */
        private static void parsingRecursion(Element heading2, ArrayList<String> textInsideP) {
            Element nextPar = heading2.nextElementSibling();
            // strings are compared with equals(), not ==
            if (nextPar != null && "p".equals(nextPar.nodeName())) {
                textInsideP.add(nextPar.text());
                parsingRecursion(nextPar, textInsideP);
            } else if (nextPar != null && !"h2".equals(nextPar.nodeName())) {
                // neither p nor h2: skip it and keep walking
                parsingRecursion(nextPar, textInsideP);
            }
        }
    }

Console output:

    h2:
    p:Vitamins A, D, and E topical (for the skin) is a skin protectant. It works by moisturizing and sealing the skin, and aids in skin healing.
    p:This medication is used to treat diaper rash, dry or chafed skin, and minor cuts or burns.
    p:Vitamins A, D, and E may also be used for purposes not listed in this medication guide.
    h2:
    p:You should not use this medication if your child is allergic to it. Do not apply vitamins A, D, and E topical without a rubber glove or finger cot if you are allergic this medication.
    p:Ask a doctor or pharmacist if it is safe for you to use this medication on your child if the child is allergic to any medicines or skin products, including soaps, oils, lotions, or creams.
    p:Stop using the medication and call your doctor at once if your child has a serious side effect such as warmth, redness, oozing, or severe irritation where the medicine is applied.
    p:Keep the baby's diaper area as dry as possible. Change wet or soiled diapers immediately to keep wetness and bacteria from irritating the baby's skin. Always put on a new diaper when the baby first wakes up in the morning, and also just before putting the baby to bed each night.

And so on...
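The question asks for one string per heading ("content content 2") rather than a nested list. One possible way to get there, sketched here under the assumption that the nested list has already been built, is to join each inner list with a space. The class and method names (`JoinParagraphs`, `joinEach`) are hypothetical; `String.join` is standard Java 8.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class JoinParagraphs {

    /** Joins each inner list into a single space-separated string. */
    static List<String> joinEach(List<List<String>> groups) {
        List<String> joined = new ArrayList<>();
        for (List<String> group : groups) {
            joined.add(String.join(" ", group));
        }
        return joined;
    }

    public static void main(String[] args) {
        // sample data copied from the HTML example in the question
        List<List<String>> groups = Arrays.asList(
                Arrays.asList("content", "content 2"),
                Arrays.asList("new content od 2", "new 2"));
        List<String> array = joinEach(groups);
        System.out.println(array.get(0)); // prints: content content 2
        System.out.println(array.get(1)); // prints: new content od 2 new 2
    }
}
```

If a real array is required, `array.toArray(new String[0])` converts the resulting list.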

Regarding "java - jsoup : parse data of p tag which is between every h2 tag", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/43802164/
