
java - Is it possible to use jsoup to parse HTML while keeping some tags in the output?

Reposted. Author: 行者123. Updated: 2023-12-02 18:36:14

I have to parse the body section of the HTML below into the output shown further down.

Some tags must remain in the output: it may contain the {p,i,b,br} tags. All remaining tags must be removed so that only their text is output.

Here is my input.

<!DOCTYPE HTML>
<html>
<head>
<title>Introduction</title>
</head>
<body>
<article id="mobi_content">
<h1 class="mobi-page-title">Introduction</h1>
<section id="dataSectionInstanceId-431331" class="body-text">This book is about creating a great career. <p>You might be saying to yourself, "I don't want to talk about a career, much less a great career. Right now I just need a job. I need to eat!" <p>Well, if you're looking, we're going to show you how to get that great job now. That's the first, short-term step. <p>But the day will come when you'll want to do more than just eat. And beyond that day will come another day when you look back at your life and take measure of your entire professional contribution to the world. <p>This book is about today and tomorrow. It's about getting a great job now and enjoying a great career for life. <p>When we say a person has had a great career, what do we mean? That he or she made a lot of money? moved spectacularly up the corporate ladder? became famous or renowned in his or her profession? What about the familiar comment from every movie star on every talk show: "I can't believe I get paid for doing this!" Are only a few people entitled to feel that way, but not the rest of us? <p>And what about you? Are you looking forward to a great career? Would you describe your current career as "great"? When you get to the end of your productive life, will you be looking back on a mediocre career? a good career? a great career? And how will you know? <p>Furthermore, just how do you create a great career for yourself? <p>As coauthors of this book, we are fascinated by these provocative questions. We have been associated in our work for many years as avid students of what it takes to build a great life and career. And we bring two different sets of experiences to the issue, so occasionally, we will speak to you directly in our own voices. We'll share with you our discoveries and provide tools and insights that will help you find answers for yourself. Whether you're looking for a job or want to make the job you have more meaningful, this book is for you.
</section>
</article>
</body>
</html>

The expected output is as follows:

This book is about creating a great career.
<P>You might be saying to yourself, "I don't want to talk about a career, much less a great career. Right now I just need a job. I need to eat!"
<P>Well, if you're looking, we're going to show you how to get that great job now. That's the first, short-term step.
<P>But the day will come when you'll want to do more than just eat. And beyond that day will come another day when you look back at your life and take measure of your entire professional contribution to the world.
<P>This book is about today and tomorrow. It's about getting a great job now and enjoying a great career for life.
<P>When we say a person has had a great career, what do we mean? That he or she made a lot of money? moved spectacularly up the corporate ladder? became famous or renowned in his or her profession? What about the familiar comment from every movie star on every talk show: "I can't believe I get paid for doing this!" Are only a few people entitled to feel that way, but not the rest of us?
<P>And what about you? Are you looking forward to a great career? Would you describe your current career as "great"? When you get to the end of your productive life, will you be looking back on a mediocre career? a good career? a great career? And how will you know?
<P>Furthermore, just how do you create a great career for yourself?
<P>As coauthors of this book, we are fascinated by these provocative questions. We have been associated in our work for many years as avid students of what it takes to build a great life and career. And we bring two different sets of experiences to the issue, so occasionally, we will speak to you directly in our own voices. We'll share with you our discoveries and provide tools and insights that will help you find answers for yourself. Whether you're looking for a job or want to make the job you have more meaningful, this book is for you.

My code:

doc.body().traverse(new NodeVisitor() {

    @Override
    public void head(Node node, int depth) {

        String name = node.nodeName();
        String paraText = "";

        if (node instanceof TextNode) {

            TextNode tn = ((TextNode) node);

            if (node.nodeName().equals("p")) {
                //finalHtml+="<p>"+tn.text()+"</p>";
            } else {
                finalHtml += tn.text();
            }

        } else if (node instanceof Node) {

            if (node.nodeName() == "p") {
                System.out.println("fnbdnv"+node.toString());
            }
            if (node.nodeName() == "h1") {
                // finalHtml+="<p>"+node.toString()+"<p>";
            } else if (node.nodeName() == "div") {
                node.removeAttr("class");
                finalHtml += node.toString();
            } else if (node.nodeName() == "seection") {
                finalHtml += node.toString();
            } else if (node.nodeName() == "<b>") {
                finalHtml += node.toString();
            } else if (node.nodeName() == "<i>") {
                finalHtml += "<i>" + node.toString() + "</i>";
            }
        }

    }

    @Override
    public void tail(Node node, int depth) {
        // Do Nothing
    }
});

Best answer

In this case it may be better to use a regular expression.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {

    public static void main(String[] args) {
        try {
            String html = "<!DOCTYPE HTML>" +
                    "<html>" +
                    "<head>" +
                    "<title>Introduction</title>" +
                    "</head>" +
                    "<body>" +
                    "<article id=\"mobi_content\">" +
                    "<h1 class=\"mobi-page-title\">Introduction</h1>" +
"<section id=\"dataSectionInstanceId-431331\" class=\"body-text\">This <i>book</i> is about creating a great career. <p>You might be saying to yourself, \"I don't want to talk about a career, much less a great career. Right now I just need a job. I need to eat!\" <p>Well, if you're looking, we're going to show you how to get that great job now. That's the first, short-term step. <p>But the day will come when you'll want to do more than just eat. And beyond that day will come another day when you look back at your life and take measure of your entire professional contribution to the world. <p>This book is about today and tomorrow. It's about getting a great job now and enjoying a great career for life. <p>When we say a person has had a great career, what do we mean? That he or she made a lot of money? moved spectacularly up the corporate ladder? became famous or renowned in his or her profession? What about the familiar comment from every movie star on every talk show: \"I can't believe I get paid for doing this!\" Are only a few people entitled to feel that way, but not the rest of us? <p>And what about you? Are you looking forward to a great career? Would you describe your current career as \"great\"? When you get to the end of your productive life, will you be looking back on a mediocre career? a good career? a great career? And how will you know? <p>Furthermore, just how do you create a great career for yourself? <p>As coauthors of this book, we are fascinated by these provocative questions. We have been associated in our work for many years as avid students of what it takes to build a great life and career. And we bring two different sets of experiences to the issue, so occasionally, we will speak to you directly in our own voices. We'll share with you our discoveries and provide tools and insights that will help you find answers for yourself. Whether you're looking for a job or want to make the job you have more meaningful, this book is for you." +
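                    "</section>" +
                    "</article>" +
                    "</body>" +
                    "</html>";

            Document doc = Jsoup.parse(html);

            // Serialize the parsed body, then strip every tag that is not allowed.
            System.out.println(removeTags(doc.body().toString()));

        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Replaces each tag with a space unless the negative lookahead sees
    // <p>, <i>, <b> (opening or closing) or <br> at that position.
    public static String removeTags(String source) {
        return source.replaceAll("(?!(</?p>|</?i>|</?b>|<br/?>))(</?.*?>)", " ");
    }
}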
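
To see what the lookahead does without pulling in jsoup, here is a minimal sketch that applies the same expression to a short fragment using only java.util.regex. The class name TagFilterSketch and the sample fragment are made up for illustration; only the regular expression comes from the answer.

```java
import java.util.regex.Pattern;

class TagFilterSketch {
    // Same expression as removeTags in the answer: the negative lookahead
    // refuses to start a match at an allowed tag, so only other tags match.
    private static final Pattern DISALLOWED_TAG =
            Pattern.compile("(?!(</?p>|</?i>|</?b>|<br/?>))(</?.*?>)");

    static String removeTags(String source) {
        // Every disallowed tag is replaced by a single space.
        return DISALLOWED_TAG.matcher(source).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Hypothetical sample input, not taken from the question.
        String fragment =
                "<section id=\"s1\">This <i>book</i> is great. <p>Next</p></section>";
        System.out.println(removeTags(fragment));
    }
}
```

The opening and closing section tags are each replaced by a space, while the i and p tags pass through untouched. Because the lookahead requires ">" right after the tag name, a tag such as <pre> is not mistaken for <p> and is removed.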

Update

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {

    public static void main(String[] args) {
        try {
            String html = "<!DOCTYPE HTML>" +
                    "<html>" +
                    "<head>" +
                    "<title>Introduction</title>" +
                    "</head>" +
"<body> <article id=\"mobi_content\"> <h1 class=\"mobi-page-title\">\"Build Your Village\" Tool</h1> <section id=\"dataSectionInstanceId-431408\" class=\"body-text\"><p class=\"nonindent\">Your great career depends not only on you,</p> <p class=\"nonindent\">Sample deposits in the Emotional Bank Account:</p> <ul class=\"bullet\"> <li><p class=\"nonindent\">Congratulate the person on a job well done.</p></li> <li><p class=\"nonindent\">Send birthday greetings.</p></li></section></article></body>" +
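                    "</html>";

            Document doc = Jsoup.parse(html);

            // Serialize the parsed body, then strip every tag that is not allowed.
            System.out.println(removeTags(doc.body().toString()));

        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // "<p .*?>" keeps opening paragraph tags that carry attributes,
    // such as <p class="nonindent">.
    public static String removeTags(String source) {
        return source.replaceAll("(?!(</p>|<p .*?>|</?i>|</?b>|<br/?>))(</?.*?>)", " ");
    }
}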
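
One thing to watch with the updated expression: the alternative "<p .*?>" requires a space after the tag name, so a bare <p> is no longer on the allowed list and gets stripped. The sketch below compares the expression from the update with a hypothetical variant that accepts both forms. The class name, method names, and the variant pattern are assumptions for illustration and are not part of the original answer.

```java
import java.util.regex.Pattern;

class ParagraphAttributeSketch {
    // Expression from the update, unchanged.
    private static final Pattern FROM_UPDATE = Pattern.compile(
            "(?!(</p>|<p .*?>|</?i>|</?b>|<br/?>))(</?.*?>)");

    // Hypothetical variant: "<p(\\s[^>]*)?>" accepts <p> with or without
    // attributes, and "[^>]*" keeps each match inside a single tag.
    private static final Pattern ANY_P = Pattern.compile(
            "(?!(</p>|<p(\\s[^>]*)?>|</?i>|</?b>|<br\\s*/?>))(</?[^>]*>)");

    static String removeTagsFromUpdate(String source) {
        return FROM_UPDATE.matcher(source).replaceAll(" ");
    }

    static String removeTagsKeepingAnyP(String source) {
        return ANY_P.matcher(source).replaceAll(" ");
    }

    public static void main(String[] args) {
        String bare = "<p>a</p>";
        String withClass = "<li><p class=\"nonindent\">Hi</p></li>";
        System.out.println(removeTagsFromUpdate(bare));       // opening <p> is lost
        System.out.println(removeTagsKeepingAnyP(bare));      // both tags survive
        System.out.println(removeTagsKeepingAnyP(withClass)); // <li> tags removed
    }
}
```

If the input only ever contains paragraph tags with attributes, as in the sample HTML of the update, the expression from the answer is sufficient as written.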

Update 2

import java.util.ListIterator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Attributes;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

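public class Main {

    public static void main(String[] args) {
        try {
            // Captures the text between a '/' and a following '.', with no
            // '/' in between: here, the image file name without its extension.
            Pattern pattern = Pattern.compile("/(((?!/).)*)[.]");

            String html = "<!DOCTYPE HTML>" +
                    "<html>" +
                    "<head>" +
                    "<title>Introduction</title>" +
                    "</head>" +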
"<body> <article id=\"mobi_content\"> <h1 class=\"mobi-page-title\">\"Build Your Village\" Tool</h1> <section id=\"dataSectionInstanceId-431408\" class=\"body-text\"><p class=\"nonindent\">Your great career depends not only on you,</p> <p class=\"center\"><img src=\"mpla/multimedia/Cove_9781936111107_epub_005_r1.png\" id=\"mobi_image_12776\" class=\"inline-img\" alt=\"PNG\"/></p><p class=\"nonindent\">Sample deposits in the Emotional Bank Account:</p> <ul class=\"bullet\"> <li><p class=\"nonindent\">Congratulate the person on a job well done.</p></li> <li><p class=\"nonindent\">Send birthday greetings.</p></li></section></article></body>" +
                    "</html>";

            Document doc = Jsoup.parse(html);
            Elements imgs = doc.select("img");
            System.out.println(imgs);
            ListIterator<Element> iter = imgs.listIterator();
            while (iter.hasNext()) {
                Element img = iter.next();
                String src = img.attr("src");
                Matcher matcher = pattern.matcher(src);
                if (matcher.find()) {
                    // Turn <img src=".../name.png"> into <graphic>name</graphic>.
                    img.tagName("graphic").text(matcher.group(1));
                    removeAttr(img);
                }
            }

            System.out.println(removeTags(doc.body().toString()));

        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Removes every attribute of the element. Iterating over asList(), which
    // returns a separate list, avoids modifying the attributes while
    // iterating over them.
    public static void removeAttr(Element e) {
        Attributes at = e.attributes();
        for (Attribute a : at.asList()) {
            e.removeAttr(a.getKey());
        }
    }

    // Same filter as before, with <graphic> added to the allowed tags.
    public static String removeTags(String source) {
        return source.replaceAll("(?!(</p>|<p .*?>|</?graphic>|</?i>|</?b>|<br/?>))(</?.*?>)", " ").trim();
    }
}
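
The file-name pattern used for the src attribute can be checked on its own with java.util.regex. This is a minimal sketch; the class name SrcBaseNameSketch, the helper baseName, and every sample path except the first are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SrcBaseNameSketch {
    // Pattern from Update 2: a '/', then a run of non-'/' characters
    // (captured as group 1), then a literal '.'.
    private static final Pattern BASE_NAME = Pattern.compile("/(((?!/).)*)[.]");

    // Returns the captured name, or null when the pattern finds nothing.
    static String baseName(String src) {
        Matcher matcher = BASE_NAME.matcher(src);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // The first path is the one from the answer's sample HTML.
        System.out.println(baseName("mpla/multimedia/Cove_9781936111107_epub_005_r1.png"));
        System.out.println(baseName("image.png"));        // no '/', so no match
        System.out.println(baseName("dir/v1.2/img.png")); // matches inside a directory name
    }
}
```

Two edge cases follow from how the pattern is written. A src without any "/" does not match, so in the answer's code that img element is left as it is and later stripped by removeTags. A directory name containing a dot matches before the file name is reached, because find() returns the first match. If such paths can occur, taking the substring after the last "/" first would be a safer starting point.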

Regarding "java - Is it possible to use jsoup to parse HTML while keeping some tags in the output?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/25643266/
