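First, I'd like to say that I have already found the same question with answers, but I couldn't get them to work. I am trying to extract data from reviews: for now, the review text and its helpfulness. I am new to both BeautifulSoup and Python in general.
Right now I use the findAll method to get a list of the divs that contain reviews, for example from some random site with opinions about a product: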
import urllib2
from BeautifulSoup import BeautifulSoup
turl = "http://www.amazon.com/The-Great-Gatsby-Scott-Fitzgerald/product-reviews/0743273567/ref=cm_cr_pr_hist_5?ie=UTF8&filterBy=addFiveStar&showViewpoints=0"
page= urllib2.urlopen(turl);
soup = BeautifulSoup(page);
products = soup.findAll("div", style = "margin-left:0.5em;")
print products[0]
This gives me output like this:
<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;">
335 of 368 people found the following review helpful
</div>
<div style="margin-bottom:0.5em;">
<span style="margin-right:5px;"><span class="swSprite s_star_5_0 " title="5.0 out of 5 stars"><span>5.0 out of 5 stars</span></span> </span>
<span style="vertical-align:middle;"><b>Decades later, still great but on different terms.</b>, <nobr>August 24, 2001</nobr></span>
</div>
<div style="margin-bottom:0.5em;">
<div><div style="float:left;">By </div><div style="float:left;"><a href="http://www.amazon.com/gp/pdp/profile/A1IKD6BDEE18CI"><span style="font-weight: bold;">mirope "mirope"</span></a> - <a href="http://www.amazon.com/gp/cdp/member-reviews/A1IKD6BDEE18CI?ie=UTF8&sort_by=MostRecentReview">See all my reviews</a><br />
<a href="http://www.amazon.com/gp/help/customer/display.html?ie=UTF8&nodeId=14279681&pop-up=1#VN" target="AmazonHelp" onclick="return amz_js_PopWin(this.href,'AmazonHelp','width=340,height=340,resizable=1,scrollbars=1,toolbar=1,status=1');"><span class="cmtySprite s_BadgeVineVoice "><span>(VINE VOICE)</span></span></a>
</div></div><div style="clear:both;"></div>
</div>
<div class="tiny" style="margin-bottom:0.5em;">
<span class="crVerifiedStripe"><b class="h3color tiny" style="margin-right: 0.5em;">Amazon Verified Purchase</b><span class="tiny verifyWhatsThis">(<a href="http://www.amazon.com/gp/community-help/amazon-verified-purchase" target="AmazonHelp" onclick="amz_js_PopWin('http://www.amazon.com/gp/community-help/amazon-verified-purchase', 'AmazonHelp', 'width=400,height=500,resizable=1,scrollbars=1,toolbar=0,status=1');return false; ">What's this?</a>)</span></span>
</div>
<div class="tiny" style="margin-bottom:0.5em;">
<b><span class="h3color tiny">This review is from: </span><a href="https://rads.stackoverflow.com/amzn/click/com/0684801523" rel="nofollow noreferrer">The Great Gatsby (Paperback)</a></b>
</div>
Having reread this book for the first time in 20 years, I can confirm that there's a reason that it's considered one of the very best American novels. However, my reaction to the story was different than when I first read it in high school. I recall that back then I was hoping that Daisy and Gatsby's love story would ultimately yield a happy ending. Now, I found them both to be such shallow creatures that they inspired no pity. While I considered the characters to be emotionally stunted, that dooesn't mean I was not impressed with Fitzergerald's skillful rendering. As in most forms of art, in literature it is more difficult to accurately and interestingly portray nothingness than to describe a richly endowed subject. At this more cynical age, I found Daisy to be a remarkable emotional void, and Gatsby's quest to pour all of his hopes and dreams into such a shallow cauldron only confirmed his own vapidity. One thing that hasn't changed in all these years is my amazement at Fitzgerald's ability to set a scene. His descriptive passages are truly poetic, and his command of word choice in unparalleled. All this made for a stimulating and delightful read.
<div style="padding-top: 10px; clear: both; width: 100%;">
<div class="reviews-voting-stripe" style="float:left; padding-right:15px; border-right:1px solid #CCCCCC"><div style="padding-bottom:5px;"><b class="tiny" style="color:#666666;white-space:nowrap;">Help other customers find the most helpful reviews</b> </div><div style="width:300px;">
<a name="R3KCIEAV000FPG.2115.Helpful.Reviews" style="font-size:1px;"> </a><span class="crVotingButtons"><nobr><span class="votingPrompt">Was this review helpful to you? </span><a rel="nofollow" class="votingButtonReviews votingButton-yes" href="http://www.amazon.com/gp/voting/cast/Reviews/2115/R3KCIEAV000FPG/Helpful/1?ie=UTF8&target=aHR0cDovL3d3dy5hbWF6b24uY29tL3Jldmlldy8wNzQzMjczNTY3&token=9BE8627F650F9D873DB4042D67CB37FA98AFD161&voteAnchorName=R3KCIEAV000FPG.2115.Helpful.Reviews&voteSessionID=000-0000000-0000000"><span class="cmtySprite s_largeYes "><span>Yes</span></span></a>
<a rel="nofollow" class="votingButtonReviews votingButton-no" href="http://www.amazon.com/gp/voting/cast/Reviews/2115/R3KCIEAV000FPG/Helpful/-1?ie=UTF8&target=aHR0cDovL3d3dy5hbWF6b24uY29tL3Jldmlldy8wNzQzMjczNTY3&token=B35087155FEB75AC5155B500CE8518AEFD4ADBAC&voteAnchorName=R3KCIEAV000FPG.2115.Helpful.Reviews&voteSessionID=000-0000000-0000000"><span class="cmtySprite s_largeNo "><span>No</span></span></a></nobr> <span class="votingMessage"></span></span>
</div></div><div style="float:left;"><div style="padding-left:15px;"><div style="white-space:nowrap;"><span class="tiny">
<a name="R3KCIEAV000FPG.2115.Inappropriate.Reviews" style="font-size:1px;"> </a><span class="reportingButton"><nobr><a rel="nofollow" class="reportingButton" href="http://www.amazon.com/gp/voting/cast/Reviews/2115/R3KCIEAV000FPG/Inappropriate/1?ie=UTF8&target=aHR0cDovL3d3dy5hbWF6b24uY29tL3Jldmlldy8wNzQzMjczNTY3&token=414B10F161A63A55D269D6EE7DC174FF22482F7E&voteAnchorName=R3KCIEAV000FPG.2115.Inappropriate.Reviews&voteSessionID=000-0000000-0000000">Report abuse</a></nobr></span>
</span> <span style="color:#CCCCCC;">|</span> <span class="tiny"><a href="http://www.amazon.com/review/R3KCIEAV000FPG">Permalink</a></span></div><div style="white-space:nowrap;padding-left:-5px;padding-top:5px;"><a href="http://www.amazon.com/review/R3KCIEAV000FPG"><span class="swSprite s_comment "><span>Comment</span></span></a> <a href="http://www.amazon.com/review/R3KCIEAV000FPG">Comments (19)</a></div></div></div><div style="clear:both;"></div>
</div>
<br />
</div>
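From this output I want to extract two integers, 335 and 368 (how many people found the review helpful), and a string containing the review text itself (just the words, without tags and newlines), which sits in the main div below the 5 child divs. How can I get just that part of the div, without the rest, and deal with the tags?
I convert the object returned by BeautifulSoup to a string and load it back into a soup. Is there another way? It doesn't look very nice. Then I used your method, but I get a lot of empty lines; I tried to remove them with strip and a condition, but they are still there: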
import urllib2
from BeautifulSoup import BeautifulSoup
turl = "http://www.amazon.com/The-Great-Gatsby-Scott-Fitzgerald/product-reviews/0743273567/ref=cm_cr_pr_hist_5?ie=UTF8&filterBy=addFiveStar&showViewpoints=0"
toppage = urllib2.urlopen(turl);
soup = BeautifulSoup(toppage);
products = soup.findAll("div", style = "margin-left:0.5em;")
for (counter,i) in enumerate(products):
    soup2 = BeautifulSoup(str(products[counter]))
    for (counter2,x) in enumerate(soup2.div):
        if x.string:
            if x.string.isspace:
                print "empty string"
            else:
                print "string number " + str(counter) + " " + x.string.strip().lstrip()
Best answer
Using your source web page, here is a complete example:
import urllib2, re
from BeautifulSoup import BeautifulSoup
turl = "http://rads.stackoverflow.com/amzn/click/0743273567"
toppage = urllib2.urlopen(turl)
soup = BeautifulSoup(toppage)
review_tag = {'class':re.compile("mt9 reviewText")}
helpful_tag = {'class':re.compile("hlp")}
all_reviews = soup.findAll(attrs=review_tag)
all_helpful = soup.findAll(attrs=helpful_tag)
for text,info in zip(all_reviews, all_helpful):
    print info.string.strip()
    print '\n'.join(text.findAll(text=True)).strip()
    print "*******************************************"
This gives:
337 of 370 people found the following review helpful
Having reread this book for the first time in 20 years, I can confirm that there's a reason that it's [...]
*******************************************
114 of 123 people found the following review helpful
It's difficult to give any even-handed critique F. Scott Fitzgerald's standard-setting Jazz Age [...]
*******************************************
54 of 60 people found the following review helpful
Scott Fitzgerald, a monumental talent who only occasionally got things working right, made Gatsby great by the extraordinary invention of Nick Carraway. Carraway as
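The output above still has the two counts embedded in a sentence, while the question asks for them as integers and for the review text without newlines. A minimal sketch of that last step, assuming the helpfulness line always has the form "N of M people found ..." (the helper names parse_helpful and squash_whitespace are made up for illustration, and the snippet is written for Python 3 rather than the Python 2 used above):

```python
import re

def parse_helpful(line):
    """Return (helpful, total) from a line like '335 of 368 people found ...'.

    Returns None when the line does not match the assumed format.
    """
    match = re.search(r"(\d+)\s+of\s+(\d+)", line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def squash_whitespace(text):
    """Collapse newlines and runs of spaces into single spaces."""
    return " ".join(text.split())

votes = parse_helpful("335 of 368 people found the following review helpful")
review = squash_whitespace("Having reread this book\n for the first time   in 20 years")
print(votes)
print(review)
```

In the loop above, parse_helpful would take info.string and squash_whitespace would take the joined review text.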
This was done before the post was edited:
Assuming you have loaded your data into a soup called, unimaginatively, soup:
for x in soup.body.div:
    if x.string:
        print x.string.strip()
This gives:
335 of 368 people found the following review helpful
Having reread this book for the first time in 20 years, [... more here]
which are the strings you are looking for.
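On the empty lines mentioned in the question: the question's loop tests x.string.isspace without parentheses, which refers to the method instead of calling it, and a method object is always truthy. Separately, a whitespace-only string such as "\n" is itself truthy, so a bare "if x.string:" lets it through and strip() then prints an empty line. A small sketch of one way to filter these out, using a plain list as a hypothetical stand-in for the x.string values (Python 3, with non_blank being an illustrative name):

```python
def non_blank(strings):
    """Yield stripped strings, skipping None and whitespace-only entries."""
    for s in strings:
        if s is None:
            continue
        stripped = s.strip()
        if stripped:  # an empty string after stripping is falsy
            yield stripped

# Hypothetical stand-ins for the x.string values seen while looping over a div.
samples = [
    "\n",
    None,
    "335 of 368 people found the following review helpful",
    "   ",
    "Having reread this book",
]

# A bound method is always truthy, so "if s.isspace:" never filters anything.
always_true = bool("text".isspace)

cleaned = list(non_blank(samples))
print(always_true)
print(cleaned)
```

Stripping first and testing the result covers both the None case and the whitespace-only case in a single check.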
HTML can be a mess, so let me give you some tips that will help when you search a new web page. First I found the text:
import re
x = soup.find(text=re.compile('Having reread this book'))
Then I used parent to work out what I was looking at:
print x.parent
print x.parent.parent
print x.parent.parent.parent
From there I saw that everything was contained as strings in the main div. After that, looping over what I was looking for was simple!
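The observation that the review text sits directly inside the main div, while everything else is wrapped in child elements, can also be illustrated without BeautifulSoup. The sketch below swaps in Python 3's standard html.parser and keeps only text found at nesting depth 1. The class name DirectTextCollector, the shortened sample markup, and the small list of void tags are all assumptions made for the illustration, not part of the original answer:

```python
from html.parser import HTMLParser

class DirectTextCollector(HTMLParser):
    """Collect text that sits directly inside the outermost element."""

    # Elements that never get a closing tag; a minimal, assumed list.
    VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of currently open non-void elements
        self.chunks = []  # non-blank text found at depth 1

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in self.VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse newlines and extra spaces
        if self.depth == 1 and text:
            self.chunks.append(text)

# Shortened, hypothetical version of the review markup from the question.
sample = (
    '<div style="margin-left:0.5em;">\n'
    '<div style="margin-bottom:0.5em;">\n'
    '335 of 368 people found the following review helpful\n'
    '</div>\n'
    '<div><b>Decades later, still great but on different terms.</b></div>\n'
    'Having reread this book for the first time\n'
    'in 20 years\n'
    '<br />\n'
    '</div>'
)

collector = DirectTextCollector()
collector.feed(sample)
collector.close()
review_text = " ".join(collector.chunks)
print(review_text)
```

Text inside the child divs is at depth 2 or deeper and is skipped, which mirrors looping over the direct children of the main div and keeping only the plain strings.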
For this topic (python - extracting the contents of a div with BeautifulSoup), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/16269104/