
python - Why does Beautiful Soup extract only CDATA and not regular comments?

Reposted. Author: 行者123. Updated: 2023-11-28 17:00:06

I am writing a script that extracts all comments from the page source of a list of websites.

for addr in links:
    driver.get(addr)
    print(addr)
    for comments in soup.findAll(text=lambda text: isinstance(text, Comment)):
        comments.extract()
        print(comments)

Why does it only extract CDATA like this:

//<![CDATA[
Feedback.Bootstrap.InitializeFeedback({page:true},"epf",true,false,false,false,false);;
//]]>
//<![CDATA[
function addRemoveListenersOnAll(){var t=_ge("b_content"),n,i,r;t&&(n=_d.createElement("STYLE"),n.id=styleIdString,n.innerText="#b_results h2>a {padding: 16px 40px 0 6px;margin: -16px -40px 0 -6px;}",_d.head&&_d.head.appendChild(n),i=t.getElementsByClassName("b_ad"),i&&AddRemoveListener(i),r=t.getElementsByClassName("b_algo"),r&&AddRemoveListener(r))}function AddRemoveListener(n){for(var t,i,u=n.length,r=0;r<u;r++)if(t=n[r].getElementsByTagName("CITE"),t&&t.length>0)for(i=0;i<t.length;i++)sj_be(t[i],"click",algo_c)}function mouseMoveAfterTouchHandler(){sj_ue(document,"mousemove",mouseMoveAfterTouchHandler);var n=_d.getElementById(styleIdString);n&&n.parentNode&&n.parentNode.removeChild(n);sj_log("CI.TTC","mouse","started");sj_ue(document,"mousemove",mouseMoveAfterTouchHandler)}function touchStartHandlerAll(n){n.pointerType==="touch"&&(addRemoveListenersOnAll(),sj_log("CI.TTC","touch","started"),sj_ue(document,"pointerdown",touchStartHandlerAll),document.addEventListener("mousemove",mouseMoveAfterTouchHandler))}var styleIdString="ttcDynStyle",algo_c=function(n){function i(n){var t=n.getElementsByTagName("a"),i;t&&t.length>0&&(i=t[0],si_ct(i),sj_log("CI.TTC","click","touch"),_w.open(i.href,"_self"))}n=sj_ev(n);var t=sj_et(n);if(t){if(t.tagName=="A")return!0;while(t&&!(t.className.indexOf("b_algo")>=0||t.className.indexOf("sb_add")>=0)){if(t.tagName=="BODY")return;t=t.parentNode}}return t?(i(t),!0):(sj_sp(n),!1)};document.addEventListener("pointerdown",touchStartHandlerAll);Feedback.Bootstrap.InitializeFeedback({page:true},"sb_feedback",1,0,0);;
//]]>

But it does not extract regular comments like these:

<!--div class="s-bk-lf"><div class="acc-title" >Следите за новостями и акциями нашего проекта!!!</div> </div><br-->
<!--LiveInternet counter-->
<!--img src="/img/ng6.png"width="150" height="" hspace="100" vspace="80" align="left" -->

How can I extract regular comments and not just CDATA?

Best answer

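I found that the loop was looking at HTML that BeautifulSoup had parsed earlier in the script (from the Bing search results), not at the page that had just been loaded.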

I fixed it by creating the BeautifulSoup object inside the loop.

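from bs4 import BeautifulSoup, Comment

# driver (presumably a Selenium WebDriver) and links are created earlier in the script.
for addr in links:
    print(addr)
    driver.get(addr)
    # Re-parse the page that was just loaded so soup matches the current addr.
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    for comments in soup.findAll(text=lambda text: isinstance(text, Comment)):
        comments.extract()
        print(comments)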

It is not clean, because I just copied the html and soup lines from earlier in the script, but it does work.
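
One way to avoid the copied lines might be to move the parsing into a small helper that builds a fresh soup for every page it is given. The sketch below is only an illustration, not the original author's code: the helper name extract_comments and the sample pages dictionary are hypothetical, and the hard-coded HTML strings stand in for driver.page_source. It uses find_all(string=...), the newer spelling of findAll(text=...), which assumes a reasonably recent Beautiful Soup 4 release.

```python
from bs4 import BeautifulSoup, Comment


def extract_comments(html):
    """Return the text of every HTML comment found in one page's source."""
    # Parse a fresh soup for this page so no earlier page's tree is reused.
    soup = BeautifulSoup(html, "html.parser")
    found = soup.find_all(string=lambda s: isinstance(s, Comment))
    return [str(c) for c in found]


# Hypothetical page sources standing in for driver.page_source on two pages.
pages = {
    "page-a": "<p>one</p><!--LiveInternet counter-->",
    "page-b": "<div><!-- second page note --></div>",
}

for addr, html in pages.items():
    print(addr, extract_comments(html))
```

In the original loop this would presumably be called as extract_comments(driver.page_source) right after driver.get(addr), so each address is parsed on its own.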

Regarding "python - Why does Beautiful Soup extract only CDATA and not regular comments?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55187812/
