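I am trying to scrape some player data rows (tr) from a URL, but nothing seems to happen when I run the code. I am sure my code is fine, because it works with other stats sites that contain tables. Can anyone tell me why nothing is happening? Thanks in advance.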
import urllib.request
from bs4 import BeautifulSoup

def make_soup(url):
    # Fetch the page and parse the returned HTML.
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

soup = make_soup("https://www.whoscored.com/Regions/252/Tournaments/7/Seasons/6365/Stages/13832/PlayerStatistics/England-Championship-2016-2017")
# Print the text of every table row found in the parsed HTML.
for record in soup.find_all('tr'):
    print(record.text)
Best answer
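Short answer: the player data you are looking for is not at that URL.
Then you may ask: why? I can see it on that page, so why isn't it there?
So let me try to explain what happens when you browse that URL with a modern browser such as Chrome.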
You: Type in the url and hit enter.
Chrome: Gotcha. I'll get that page for you asap, just a second. (fetching content from that url), great, now I have it! But wait, let me read/parse it first before I show it to you, (reading what's inside the content), oh crap, this javascript tells me to get additional information from another url, ok I'll do it; oh wait, here's another one telling me to load an ad in the header, well I don't like it but I'm just gonna do what I'm told; just a second, this css tells me to display player names in bold, ok not bad; oh here's another photo from url xxx I need to load, no problem... oh man, how much stuff is there for me to process? I'm not happy with this website... (working on a bunch of other stuff...) Finally everything's ready! Now check it out!
You: Player xxx is actually quite good, I'll check it out. (click player xxx)
Chrome: ......
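As you can see, every time you browse a web page the browser does a lot of "behind the scenes" work to show you the page. So basically: enter URL >> fetch content from URL >> parse content >> fetch additional content >> render everything >> display the page (one or more steps may happen at the same time).
Your code only does the "fetch content from URL" step, and the stats you want happen to be "additional content" that has to be loaded from somewhere else. That is why you get nothing.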
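To make this concrete, here is a minimal sketch that uses only the standard library. The HTML string is a made-up stand-in for what a server might return for a JavaScript-driven page (it is not the real WhoScored markup, and the ids and script path are invented): the table container is present, but the rows are not, because a script is expected to fill them in later.

```python
from html.parser import HTMLParser

# Hypothetical raw HTML, invented for illustration: the table body is empty
# and a script is supposed to fetch and insert the rows after page load.
RAW_HTML = """
<html><body>
  <div id="statistics-table">
    <table><tbody id="player-table-body"></tbody></table>
  </div>
  <script src="/js/load-stats.js"></script>
</body></html>
"""


class TagCounter(HTMLParser):
    """Counts how often each start tag appears in the parsed document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


counter = TagCounter()
counter.feed(RAW_HTML)

# A plain HTTP fetch never runs the script, so no <tr> rows ever appear.
print("tr tags:", counter.counts.get("tr", 0))
print("script tags:", counter.counts.get("script", 0))
```

A parser only sees what the server sent. Anything a script would have added afterwards is simply missing, which matches the empty result from the code in the question.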
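So how do I get those stats? Once you know the URLs responsible for loading them, just go after those URLs. How do I find them? Well, you can always read the JavaScript... if you have enough patience...
The easiest way to get what you need is to analyze the traffic while the page loads and figure out what is going on behind the scenes. I would recommend Fiddler, but you can use whatever tool you see fit.
There are literally hundreds of requests involved in fully rendering the page you visited; all you need to do is find out which request delivers the "actual" stats. There is even a URL with "StatisticsFeed" in it. Could that be the one? Let's take a look: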
{
"playerTableStats": [{
"name": "Conor Hourihane",
"firstName": "Conor",
"lastName": "Hourihane",
"playerId": 134172,
"height": 181,
"weight": 62,
"age": 25,
"isManOfTheMatch": false,
"isActive": true,
"isOpta": true,
"playedPositions": "-MC-",
"positionText": "Midfielder",
"playedPositionsShort": "M(C)",
"teamId": 142,
"teamName": "Barnsley",
"seasonId": 6365,
"seasonName": "2016/2017",
"tournamentId": 7,
"tournamentRegionId": 252,
"tournamentRegionCode": "gb-eng",
"regionCode": "ie",
"tournamentName": "Championship",
"tournamentShortName": "EC",
"rating": 7.8705882352941181,
"ranking": 1,
"apps": 17,
"subOn": 0,
"minsPlayed": 1530,
"manOfTheMatch": 4,
"yellowCard": 5.0,
"redCard": 0.0,
"goal": 3,
"assistTotal": 8,
"shotsPerGame": 2.2352941176470589,
"aerialWonPerGame": 0.6470588235294118,
"passSuccess": 81.370449678800867
},
{
"name": "Anthony Knockaert",
"firstName": "Anthony",
"lastName": "Knockaert",
"playerId": 86794,
"height": 172,
"weight": 69,
"age": 25,
"isManOfTheMatch": false,
"isActive": true,
"isOpta": true,
"playedPositions": "-AML-AMR-",
"positionText": "Midfielder",
"playedPositionsShort": "AM(LR)",
"teamId": 211,
"teamName": "Brighton",
"seasonId": 6365,
"seasonName": "2016/2017",
"tournamentId": 7,
"tournamentRegionId": 252,
"tournamentRegionCode": "gb-eng",
"regionCode": "fr",
"tournamentName": "Championship",
"tournamentShortName": "EC",
"rating": 7.6722222222222216,
"ranking": 2,
"apps": 18,
"subOn": 1,
"minsPlayed": 1471,
"manOfTheMatch": 5,
"yellowCard": 4.0,
"redCard": 0.0,
"goal": 6,
"assistTotal": 0,
"shotsPerGame": 2.3888888888888888,
"aerialWonPerGame": 0.22222222222222221,
"passSuccess": 83.420593368237348
},
{
"name": "Lewis Dunk",
"firstName": "Lewis",
"lastName": "Dunk",
"playerId": 86441,
"height": 192,
"weight": 88,
"age": 25,
"isManOfTheMatch": false,
"isActive": true,
"isOpta": true,
"playedPositions": "-DC-",
"positionText": "Defender",
"playedPositionsShort": "D(C)",
"teamId": 211,
"teamName": "Brighton",
"seasonId": 6365,
"seasonName": "2016/2017",
"tournamentId": 7,
"tournamentRegionId": 252,
"tournamentRegionCode": "gb-eng",
"regionCode": "gb-eng",
"tournamentName": "Championship",
"tournamentShortName": "EC",
"rating": 7.660000000000001,
"ranking": 3,
"apps": 18,
"subOn": 0,
"minsPlayed": 1620,
"manOfTheMatch": 3,
"yellowCard": 8.0,
"redCard": 0.0,
"goal": 1,
"assistTotal": 1,
"shotsPerGame": 0.61111111111111116,
"aerialWonPerGame": 3.5,
"passSuccess": 79.72251867662753
},
{
"name": "Tom Clarke",
"firstName": "Tom",
"lastName": "Clarke",
"playerId": 133974,
"height": 180,
"weight": 77,
"age": 28,
"isManOfTheMatch": false,
"isActive": true,
"isOpta": true,
"playedPositions": "-DC-",
"positionText": "Defender",
"playedPositionsShort": "D(C)",
"teamId": 181,
"teamName": "Preston",
"seasonId": 6365,
"seasonName": "2016/2017",
"tournamentId": 7,
"tournamentRegionId": 252,
"tournamentRegionCode": "gb-eng",
"regionCode": "gb-eng",
"tournamentName": "Championship",
"tournamentShortName": "EC",
"rating": 7.6126315789473677,
"ranking": 4,
"apps": 19,
"subOn": 0,
"minsPlayed": 1692,
"manOfTheMatch": 4,
"yellowCard": 0.0,
"redCard": 0.0,
"goal": 2,
"assistTotal": 0,
"shotsPerGame": 0.89473684210526316,
"aerialWonPerGame": 5.4736842105263159,
"passSuccess": 66.666666666666657
},
{
"name": "Pontus Jansson",
"firstName": "Pontus",
"lastName": "Jansson",
"playerId": 121123,
"height": 194,
"weight": 89,
"age": 25,
"isManOfTheMatch": false,
"isActive": true,
"isOpta": true,
"playedPositions": "-DC-",
"positionText": "Defender",
"playedPositionsShort": "D(C)",
"teamId": 19,
"teamName": "Leeds",
"seasonId": 6365,
"seasonName": "2016/2017",
"tournamentId": 7,
"tournamentRegionId": 252,
"tournamentRegionCode": "gb-eng",
"regionCode": "se",
"tournamentName": "Championship",
"tournamentShortName": "EC",
"rating": 7.5976923076923066,
"ranking": 5,
"apps": 13,
"subOn": 0,
"minsPlayed": 1126,
"manOfTheMatch": 1,
"yellowCard": 6.0,
"redCard": 0.0,
"goal": 1,
"assistTotal": 0,
"shotsPerGame": 0.53846153846153844,
"aerialWonPerGame": 3.5384615384615383,
"passSuccess": 86.336633663366342
},
{
"name": "Angus MacDonald",
"firstName": "Angus",
"lastName": "MacDonald",
"playerId": 110825,
"height": 184,
"weight": 70,
"age": 24,
"isManOfTheMatch": false,
"isActive": true,
"isOpta": true,
"playedPositions": "-DC-",
"positionText": "Defender",
"playedPositionsShort": "D(C)",
"teamId": 142,
"teamName": "Barnsley",
"seasonId": 6365,
"seasonName": "2016/2017",
"tournamentId": 7,
"tournamentRegionId": 252,
"tournamentRegionCode": "gb-eng",
"regionCode": "gb-eng",
"tournamentName": "Championship",
"tournamentShortName": "EC",
"rating": 7.5066666666666677,
"ranking": 6,
"apps": 12,
"subOn": 0,
"minsPlayed": 1080,
"manOfTheMatch": 0,
"yellowCard": 3.0,
"redCard": 0.0,
"goal": 0,
"assistTotal": 0,
"shotsPerGame": 0.33333333333333331,
"aerialWonPerGame": 4.833333333333333,
"passSuccess": 72.147651006711413
},
{
"name": "Marc Roberts",
"firstName": "Marc",
"lastName": "Roberts",
"playerId": 138949,
"height": 183,
"weight": 81,
"age": 26,
"isManOfTheMatch": false,
"isActive": true,
"isOpta": true,
"playedPositions": "-DC-",
"positionText": "Defender",
"playedPositionsShort": "D(C)",
"teamId": 142,
"teamName": "Barnsley",
"seasonId": 6365,
"seasonName": "2016/2017",
"tournamentId": 7,
"tournamentRegionId": 252,
"tournamentRegionCode": "gb-eng",
"regionCode": "gb-eng",
"tournamentName": "Championship",
"tournamentShortName": "EC",
"rating": 7.503125,
"ranking": 7,
"apps": 16,
"subOn": 0,
"minsPlayed": 1440,
"manOfTheMatch": 1,
"yellowCard": 3.0,
"redCard": 0.0,
"goal": 2,
"assistTotal": 2,
"shotsPerGame": 0.625,
"aerialWonPerGame": 7.0625,
"passSuccess": 61.595547309833023
},
{
"name": "Bradley Johnson",
"firstName": "Bradley",
"lastName": "Johnson",
"playerId": 12490,
"height": 178,
"weight": 68,
"age": 29,
"isManOfTheMatch": false,
"isActive": true,
"isOpta": true,
"playedPositions": "-MC-ML-",
"positionText": "Midfielder",
"playedPositionsShort": "M(CL)",
"teamId": 20,
"teamName": "Derby",
"seasonId": 6365,
"seasonName": "2016/2017",
"tournamentId": 7,
"tournamentRegionId": 252,
"tournamentRegionCode": "gb-eng",
"regionCode": "gb-eng",
"tournamentName": "Championship",
"tournamentShortName": "EC",
"rating": 7.4954545454545443,
"ranking": 8,
"apps": 11,
"subOn": 0,
"minsPlayed": 952,
"manOfTheMatch": 1,
"yellowCard": 4.0,
"redCard": 0.0,
"goal": 2,
"assistTotal": 1,
"shotsPerGame": 1.3636363636363635,
"aerialWonPerGame": 4.0909090909090908,
"passSuccess": 71.908127208480565
},
{
"name": "Christophe Berra",
"firstName": "Christophe",
"lastName": "Berra",
"playerId": 8287,
"height": 186,
"weight": 81,
"age": 31,
"isManOfTheMatch": false,
"isActive": true,
"isOpta": true,
"playedPositions": "-DC-",
"positionText": "Defender",
"playedPositionsShort": "D(C)",
"teamId": 165,
"teamName": "Ipswich",
"seasonId": 6365,
"seasonName": "2016/2017",
"tournamentId": 7,
"tournamentRegionId": 252,
"tournamentRegionCode": "gb-eng",
"regionCode": "gb-sct",
"tournamentName": "Championship",
"tournamentShortName": "EC",
"rating": 7.4789473684210526,
"ranking": 9,
"apps": 19,
"subOn": 0,
"minsPlayed": 1710,
"manOfTheMatch": 3,
"yellowCard": 4.0,
"redCard": 0.0,
"goal": 0,
"assistTotal": 1,
"shotsPerGame": 0.94736842105263153,
"aerialWonPerGame": 6.2105263157894735,
"passSuccess": 58.636363636363633
},
{
"name": "Adam Webster",
"firstName": "Adam",
"lastName": "Webster",
"playerId": 109922,
"height": 191,
"weight": 0,
"age": 21,
"isManOfTheMatch": false,
"isActive": true,
"isOpta": true,
"playedPositions": "-DC-",
"positionText": "Defender",
"playedPositionsShort": "D(C)",
"teamId": 165,
"teamName": "Ipswich",
"seasonId": 6365,
"seasonName": "2016/2017",
"tournamentId": 7,
"tournamentRegionId": 252,
"tournamentRegionCode": "gb-eng",
"regionCode": "gb-eng",
"tournamentName": "Championship",
"tournamentShortName": "EC",
"rating": 7.4780000000000006,
"ranking": 10,
"apps": 15,
"subOn": 1,
"minsPlayed": 1227,
"manOfTheMatch": 2,
"yellowCard": 1.0,
"redCard": 0.0,
"goal": 0,
"assistTotal": 0,
"shotsPerGame": 0.2,
"aerialWonPerGame": 5.0666666666666664,
"passSuccess": 58.256029684601117
}],
"paging": {
"currentPage": 1,
"totalPages": 34,
"resultsPerPage": 10,
"totalResults": 338,
"firstRecordIndex": 1,
"lastRecordIndex": 10
},
"statColumns": ["apps",
"subOn",
"minsPlayed",
"goal",
"assistTotal",
"yellowCard",
"redCard",
"shotsPerGame",
"passSuccess",
"aerialWonPerGame",
"manOfTheMatch"]
}
Exactly! So what now? Simulate this request and parse the content. Since it is already JSON, the built-in json module can do the job easily, and you don't even need BeautifulSoup.
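As a rough sketch of the parsing step, the snippet below runs json.loads over a trimmed copy of the response shown above (two players, a few fields each, plus the paging block). The helper name summarize is made up for this example.

```python
import json

# A trimmed copy of the response shown above: two players, a few fields each.
SAMPLE = """
{
  "playerTableStats": [
    {"name": "Conor Hourihane", "teamName": "Barnsley", "rating": 7.8705882352941181, "goal": 3},
    {"name": "Anthony Knockaert", "teamName": "Brighton", "rating": 7.6722222222222216, "goal": 6}
  ],
  "paging": {"currentPage": 1, "totalPages": 34, "resultsPerPage": 10, "totalResults": 338}
}
"""


def summarize(feed_text):
    """Return (rows, paging); each row is (name, team, rating, goals)."""
    data = json.loads(feed_text)
    rows = [
        (p["name"], p["teamName"], round(p["rating"], 2), p["goal"])
        for p in data["playerTableStats"]
    ]
    return rows, data["paging"]


rows, paging = summarize(SAMPLE)
for name, team, rating, goals in rows:
    print(f"{name} ({team}): rating {rating}, goals {goals}")
print(f"page {paging['currentPage']} of {paging['totalPages']}")
```

The paging block reports 34 pages of 10 results each, so a full scrape would presumably repeat the request once per page. How the page number is passed to the feed is not shown in the response, so that has to be read from the captured traffic.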
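You may ask: why do I get nothing when I browse to this link directly? That is because the server is set up so that only requests with valid headers receive the feed. So how do I get around that? Simulate the request "vividly", with the right parameters (mainly the headers), so that the server believes you.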
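A minimal sketch of what that could look like with urllib.request follows. The header set is an assumption (the answer does not say which headers the server checks), so copy the real values from the request you captured. The feed URL is a placeholder, and build_feed_request is a name made up for this example.

```python
import urllib.request

# The page from the question; sent as the Referer so the request looks like
# it came from that page.
PAGE_URL = (
    "https://www.whoscored.com/Regions/252/Tournaments/7/Seasons/6365/"
    "Stages/13832/PlayerStatistics/England-Championship-2016-2017"
)


def build_feed_request(feed_url, referer=PAGE_URL):
    """Build (but do not send) a GET request that carries browser-like headers."""
    headers = {
        # Assumed header set: replace with the values from your captured traffic.
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "X-Requested-With": "XMLHttpRequest",
        "Referer": referer,
    }
    return urllib.request.Request(feed_url, headers=headers)


# Placeholder URL: replace it with the StatisticsFeed URL you captured.
request = build_feed_request("https://example.com/StatisticsFeed/placeholder")
for key, value in request.header_items():
    print(f"{key}: {value}")

# To actually send it and decode the JSON body:
# import json
# with urllib.request.urlopen(request) as response:
#     data = json.load(response)
```

Building the request separately from sending it makes it easy to compare the headers against the captured browser request before hitting the server.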
Source: the Stack Overflow question about Python BeautifulSoup not scraping this URL: https://stackoverflow.com/questions/41007470/