gpt4 book ai didi

python - 简单的 Python 社交媒体抓取公共(public)信息

转载 作者:太空宇宙 更新时间:2023-11-03 17:15:15 25 4
gpt4 key购买 nike

我只想从我在两个社交媒体网站上的帐户中获取公共(public)信息。 (Instagram 和 Twitter)我的代码返回 twitter 的信息,我知道 xpath 对于 Instagram 是正确的,但由于某种原因我没有获取它的数据。我知道 XPATH 可能更具体,但我可以稍后修复它。我的两个帐户都是公开的。

1) 我想也许它不喜欢 python 头,所以我尝试更改它,但仍然一无所获。该行已被注释掉,但它仍然存在。

2)我在 github 上听说过一些关于 API 的事情,这段冗长的代码非常令人生畏,远远超出了我的理解水平。我不知道我在那里读到的内容的一半以上。

from lxml import html
import requests
import webbrowser

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}

#page = requests.get('https://www.instagram.com/<my account>/', headers=headers)
page = requests.get('https://www.instagram.com/<my account>/')
tree = html.fromstring(page.text)
pageTwo = requests.get('http://www.twitter.com/<my account>')
treeTwo = html.fromstring(pageTwo.text)

instaFollowers = tree.xpath("//span[@data-reactid='.0.1.0.0:0.1.3.1.0']/span[2]/text()")

instaFollowing = tree.xpath("//span[@data-reactid='.0.1.0.0:0.1.3.2.0']/span[2]/text()")

twitFollowers = treeTwo.xpath("//a[@data-nav='followers']/span[@class='ProfileNav-value']/text()")

twitFollowing = treeTwo.xpath("//a[@data-nav='following']/span[@class='ProfileNav-value']/text()")

print ''
print '--------------------'
print 'Social Media Checker'
print '--------------------'
print ''
print 'Instagram: ' + str(instaFollowers) + ' / ' + str(instaFollowing)
print ''
print 'Twitter: ' + str(twitFollowers) + ' / ' + str(twitFollowing)

最佳答案

如上所述,Instragram 的页面源代码并不反射(reflect)其渲染源代码,因为调用 Javascript 函数将内容从 JSON 数据传递到浏览器。因此,Python 在页面源中抓取的内容并不能准确显示浏览器在屏幕上呈现的内容。欢迎来到动态网络编程的新世界!考虑使用Instagram's API或其他可以检索 html 生成内容(不仅仅是页面源)的 Web 解析器。

话虽如此,如果您只需要 IG 帐户数据,您仍然可以使用 Python 的 lxml 对 <script> 中的 JSON 内容进行 XPath 处理。标签(具体是第六次出现,但根据您需要的页面进行调整)。下面的示例解析 Google's Instagram JSON 数据:

import lxml.etree as et
import urllib.request as rq

rqpage = rq.urlopen('https://instagram.com/google')
txtpage = rqpage.read()

tree = et.HTML(txtpage)
jsondata = tree.xpath("//script[@type='text/javascript' and position()=6]/text()")

for i in jsondata:
print(i)

输出

window._sharedData = {"qs":"{\"shift\":10,\"header
\":\"n3bTdmHGHDgxvZYPN0KDFHqbkxd6zpTl\",\"edges\":100,\"blob
\":\"AQCq42rOTCnKOZcOxFn06L1J6_W8wY6ntAS1bX88VBClAjQD9PyJdefCzOwfSAbUdsBwHKb1QSndurPtjyN-
rHMOrZ_6ubE_Xpu908cyron9Zczkj4QMkAYUHIgnmmftuXG8rrFzq_Oq3BoXpQgovI9hefha-
6SAs1RLJMwMArrbMlFMLAwyd1TZhArcxQkk9bgRGT4MZK4Tk2VNt1YOKDN1pO3NJneFlUxdUJTdDX
zj3eY-stT7DnxF_GM_j6xwk1o\",\"iterations\":7,\"size\":42}","static_root":"
\/\/instagramstatic-a.akamaihd.net\/bluebar\/5829dff","entry_data":
{"ProfilePage":[{"__query_string":"?","__path":"\/google\/","__get_params":
{},"user":{"username":"google","has_blocked_viewer":false,"follows":
{"count":10},"requested_by_viewer":false,"followed_by":
{"count":977186},"country_block":null,"has_requested_viewer":false,"followed_
by_viewer":false,"follows_viewer":false,"profile_pic_url":"https:
\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xfp1\/t51.2885-19\/s150x150
\/11910217_933356470069152_115044571_a.jpg","is_private":false,"full_name":
"Google","media":{"count":180,"page_info":
{"has_previous_page":false,"start_cursor":"1126896719808871555","end_cursor":
"1092117490206686720","has_next_page":true},"nodes":[{"code":"-
jipiawryD","dimensions":{"width":640,"height":640},"owner":
{"id":"1067259270"},"comments":{"count":105},"caption":"Today's the day!
Your searches are served. Happy Thanksgiving \ud83c\udf57\ud83c\udf70
#GoogleTrends","likes":
{"count":11410},"date":1448556579.0,"thumbnail_src":"https:\/
\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\
/11848856_482502108621097_589421586_n.jpg","is_video":true,"id":"112689671980
8871555","display_src":"https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-
xat1\/t51.2885-15
...

JSON Pretty Print (从上面提取 window._sharedData 变量)

参见下文,用户(关注者、关注者等)数据在开头显示:

{
"qs": "{\"shift\":10,\"header\":\"n3bTdmHGHDgxvZYPN0KDFHqbkxd6zpTl\",\"edges\":100,\"blob\":\"AQCq42rOTCnKOZcOxFn06L1J6_W8wY6ntAS1bX88VBClAjQD9PyJdefCzOwfSAbUdsBwHKb1QSndurPtjyN-rHMOrZ_6ubE_Xpu908cyron9Zczkj4QMkAYUHIgnmmftuXG8rrFzq_Oq3BoXpQgovI9hefha-6SAs1RLJMwMArrbMlFMLAwyd1TZhArcxQkk9bgRGT4MZK4Tk2VNt1YOKDN1pO3NJneFlUxdUJTdDXzj3eY-stT7DnxF_GM_j6xwk1o\",\"iterations\":7,\"size\":42}",
"static_root": "\/\/instagramstatic-a.akamaihd.net\/bluebar\/5829dff",
"entry_data": {
"ProfilePage": [
{
"__query_string": "?",
"__path": "\/google\/",
"__get_params": {

},
"user": {
"username": "google",
"has_blocked_viewer": false,
"follows": {
"count": 10
},
"requested_by_viewer": false,
"followed_by": {
"count": 977186
},
"country_block": null,
"has_requested_viewer": false,
"followed_by_viewer": false,
"follows_viewer": false,
"profile_pic_url": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xfp1\/t51.2885-19\/s150x150\/11910217_933356470069152_115044571_a.jpg",
"is_private": false,
"full_name": "Google",
"media": {
"count": 180,
"page_info": {
"has_previous_page": false,
"start_cursor": "1126896719808871555",
"end_cursor": "1092117490206686720",
"has_next_page": true
},
"nodes": [
{
"code": "-jipiawryD",
"dimensions": {
"width": 640,
"height": 640
},
"owner": {
"id": "1067259270"
},
"comments": {
"count": 105
},
"caption": "Today's the day! Your searches are served. Happy Thanksgiving \ud83c\udf57\ud83c\udf70 #GoogleTrends",
"likes": {
"count": 11410
},
"date": 1448556579,
"thumbnail_src": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\/11848856_482502108621097_589421586_n.jpg",
"is_video": true,
"id": "1126896719808871555",
"display_src": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\/11848856_482502108621097_589421586_n.jpg"
},
{
"code": "-hwbf2wr0O",
"dimensions": {
"width": 640,
"height": 640
},
"owner": {
"id": "1067259270"
},
"comments": {
"count": 95
},
"caption": "Thanksgiving dinner is waiting. But first, the airport. \u2708\ufe0f #GoogleApp",
"likes": {
"count": 12621
},
...

关于python - 简单的 Python 社交媒体抓取公共(public)信息,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/33713794/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com