
php - Preserving line breaks inside <p> tags using DOMXPath?

Reposted. Author: 可可西里. Updated: 2023-10-31 22:15:58

I am currently using PHP and DOMXPath to get the contents of all <p> elements on a web page:

<?php
...
$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$paragraphs = $xpath->evaluate("/html/body//p");

foreach ($paragraphs as $paragraph) {
    echo $paragraph->textContent . "<br />";
}

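My problem is that the string produced by textContent does not respect the <br /> tags inside the <p> elements. Instead, it drops the line break and runs together words that would normally sit on separate lines. For example:

Sample HTML: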

<p>
Some happy talk goes here talking about our great product.<br />
We would love for you to buy it!
</p>

<p>
Random information and what not<br />
Isn't that cool?
</p>

Current output of the PHP above:

Some happy talk goes here talking about our great product.We would love for you to buy it!

Random information and what notIsn't that cool?

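I also tried $paragraphs = $doc->getElementsByTagName("p"); and it gave me the same result.

Is there a way to make DOMXPath/DOMDocument preserve the line breaks? I need to be able to separate every word in a paragraph, and the current output does not allow that.

If there is another way to retrieve the strings in the <p> elements while keeping the <br /> tags or '\n' characters, that would be great as well.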
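One other way to retrieve the text while keeping the breaks is to walk the child nodes of each <p> yourself and emit a newline whenever a <br> element turns up. The following is only a minimal sketch under that assumption: textWithBreaks() is a hypothetical helper name (it is not part of the DOM extension), and the sample markup is a shortened version of the HTML above.

```php
<?php
// Hypothetical helper: concatenates the text inside $node, emitting "\n" for each <br>.
function textWithBreaks(DOMNode $node): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            // Plain text (or CDATA): keep it as-is.
            $out .= $child->nodeValue;
        } elseif ($child instanceof DOMElement) {
            if (strtolower($child->nodeName) === 'br') {
                // A <br> becomes a literal newline.
                $out .= "\n";
            } else {
                // Any other element (e.g. <a>): recurse into its children.
                $out .= textWithBreaks($child);
            }
        }
        // Comments and other node types are ignored.
    }
    return $out;
}

$html = '<html><body><p>Random information and what not<br />Isn\'t that cool?</p></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
foreach ($xpath->evaluate('/html/body//p') as $paragraph) {
    echo textWithBreaks($paragraph), "\n";
}
```

With this approach the document itself is left unmodified, and the text on each side of a <br /> ends up on its own line, so it can be split on whitespace afterwards.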

Edit


After further investigation, the HTML in question is actually a list of anchors separated by <br> tags, with no actual newline characters between them:

<p class="home_page_list"><a href="/home/personal-banking/checking/Category-Page-Classic-Checking/classic-checking.html">Classic Checking</a><br> <a href="/home/personal-banking/checking/Category-Page-Interest-Checking/interest-checking.html">Interest Checking</a><br> <a href="/home/personal-banking/checking/Category-Page-Interest-Checking/interest-premium-checking.html">Premium Checking</a><br> <a href="/home/personal-banking/Savings-Category-Page/Basic-Savings-Category-Page/basic-savings.html">Savings Plans</a><br> <a href="/home/personal-banking/Savings-Category-Page/Money-Market-Accounts-Category-Page/money-market-accounts.html">Money Market Accounts</a><br> <a href="/home/personal-banking/Savings-Category-Page/Certificates-of-Deposit-Category-Page/fixed-rate-CD.html">CDs</a><br> <a href="/home/personal-banking/Savings-Category-Page/Individual-Retirement-Account-Category-Page/individual-retirement-account.html">IRAs</a></p>

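As it turns out, the code above does work for the sample HTML originally given, because that markup has a real newline after each <br />; it only fails on markup like the list above.

Update: Solved


With the help of @ircmaxell's answer and the comments left by @netcoder and @Gordon, this has been solved. It is not very elegant, but it will do for now.

Example: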

foreach ($paragraphs as $paragraph) {
    // DOMinnerHTML() is a user-defined helper (not shown here) that returns the node's inner HTML.
    $p_text = new DOMDocument();
    $p_text->loadHTML(str_ireplace(array("<br>", "<br />"), "\r\n", DOMinnerHTML($paragraph)));
    // Do whatever; in this case, get all of the words in an array.
    $words = explode(" ", str_ireplace(array(",", ".", "&", ":", "-", "\r\n"), " ", $p_text->textContent));
    print_r($words);
}

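This uses DOMinnerHTML (as suggested by @netcoder) to get each paragraph's markup, replaces every instance of <br> with "\r\n" (as suggested by @ircmaxell), and then reads the result back through textContent.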
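DOMinnerHTML is not a built-in function, and the version actually used here is not shown. A common way to write such a helper is to serialize each child node with DOMDocument::saveHTML(); the sketch below assumes that approach and uses made-up, shortened hrefs.

```php
<?php
// Assumed implementation of the DOMinnerHTML helper; the version used in the
// question is not shown, so treat this as one plausible way to write it.
function DOMinnerHTML(DOMNode $element): string
{
    $innerHTML = '';
    foreach ($element->childNodes as $child) {
        // saveHTML($node) serializes just that node (and its subtree) as HTML.
        $innerHTML .= $element->ownerDocument->saveHTML($child);
    }
    return $innerHTML;
}

// Shortened stand-in for the anchor list; the hrefs are placeholders.
$html = '<html><body><p><a href="/a.html">Classic Checking</a><br> '
      . '<a href="/b.html">Interest Checking</a></p></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$p = $doc->getElementsByTagName('p')->item(0);
echo DOMinnerHTML($p), "\n";
```

Because the helper returns markup rather than text, the <br> tags are still present in the string and can be replaced before the text is extracted.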

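There is clearly still room for improvement, but it solves my current problem.

Thanks for the help, everyone.

Best answer

Well, what I would do is replace the <br> elements with literal newline characters: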

$doc = new DOMDocument();
$doc->loadHTML($html);

// getElementsByTagName() returns a live list, so copy it to an array first;
// replacing nodes while iterating the live list would skip elements.
$brs = iterator_to_array($doc->getElementsByTagName('br'));
foreach ($brs as $node) {
    $node->parentNode->replaceChild($doc->createTextNode("\r\n"), $node);
}

$xpath = new DOMXPath($doc);
$paragraphs = $xpath->evaluate("/html/body//p");

foreach ($paragraphs as $paragraph) {
    echo $paragraph->textContent . "<br />";
}

Regarding "php - Preserving line breaks inside <p> tags using DOMXPath?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/4739896/
