
Python - Getting the redirect URL for links from Google Alerts feeds

Reposted. Author: 太空宇宙. Updated: 2023-11-04 03:41:14

If you create a Google Alert as an RSS feed (rather than having it delivered to your email address automatically), it contains links like this: https://www.google.com/url?rct=j&sa=t&url=http://www.statesmanjournal.com/story/opinion/readers/2014/10/13/gmo-labels-encourage-people-make-choices/17171289/&ct=ga&cd=CAIyGjkyZjE1NGUzMGIwZjRkNGQ6Y29tOmVuOlVT&usg=AFQjCNHrCLmbml7baTXaqySagcuKHp-KHA .

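This link is clearly a redirect (just try it and you end up here: http://www.statesmanjournal.com/story/opinion/readers/2014/10/13/gmo-labels-encourage-people-make-choices/17171289/), but I can't get this final URL with Python (short of stripping the beginning of the URL, which is very ugly).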

So far I have tried the urllib2, httplib2 and requests packages:

  • urllib2.urlopen, then geturl() on the return value
  • an httplib2 request with follow_all_redirects=True, then "content-location" in the return value
  • requests.get, then the history of the return value

Has anyone run into this problem? Thanks!

Best answer

Google does not give you an HTTP redirect; it returns a 200 OK response, not a 30x redirect:

>>> import requests
>>> url = 'https://www.google.com/url?rct=j&sa=t&url=http://www.statesmanjournal.com/story/opinion/readers/2014/10/13/gmo-labels-encourage-people-make-choices/17171289/&ct=ga&cd=CAIyGjkyZjE1NGUzMGIwZjRkNGQ6Y29tOmVuOlVT&usg=AFQjCNHrCLmbml7baTXaqySagcuKHp-KHA'
>>> response = requests.get(url)
>>> response.url
u'https://www.google.com/url?rct=j&sa=t&url=http://www.statesmanjournal.com/story/opinion/readers/2014/10/13/gmo-labels-encourage-people-make-choices/17171289/&ct=ga&cd=CAIyGjkyZjE1NGUzMGIwZjRkNGQ6Y29tOmVuOlVT&usg=AFQjCNHrCLmbml7baTXaqySagcuKHp-KHA'
>>> response.text
u'<script>window.googleJavaScriptRedirect=1</script><script>var m={navigateTo:function(b,a,d){if(b!=a&&b.google){if(b.google.r){b.google.r=0;b.location.href=d;a.location.replace("about:blank");}}else{a.location.replace(d);}}};m.navigateTo(window.parent,window,"http://www.statesmanjournal.com/story/opinion/readers/2014/10/13/gmo-labels-encourage-people-make-choices/17171289/");\n</script><noscript><META http-equiv="refresh" content="0;URL=\'http://www.statesmanjournal.com/story/opinion/readers/2014/10/13/gmo-labels-encourage-people-make-choices/17171289/\'"></noscript>'

The response is a piece of HTML and JavaScript that your browser interprets to load the new URL. You will have to parse that response to extract the target.

String splitting can achieve this:

>>> response.text.partition("URL='")[-1].rpartition("'\"")[0]
u'http://www.statesmanjournal.com/story/opinion/readers/2014/10/13/gmo-labels-encourage-people-make-choices/17171289/'
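
As an alternative to chained partition calls, the same extraction can be sketched with a regular expression. This is only a sketch: the helper name extract_meta_refresh_target is made up for illustration, the pattern assumes the body still carries a URL='...' meta refresh fallback like the one shown above, and the html.unescape step (Python 3 only) assumes that any HTML entities in the attribute should be decoded.

```python
import re
from html import unescape  # Python 3 only

# Assumption: the target sits inside single quotes after "URL=",
# as in the <META http-equiv="refresh"> fallback shown above.
META_REFRESH_URL = re.compile(r"URL='([^']+)'", re.IGNORECASE)


def extract_meta_refresh_target(html):
    """Return the meta refresh target found in html, or None if absent."""
    match = META_REFRESH_URL.search(html)
    if match is None:
        return None
    # Decode entities such as &amp; that may appear inside the attribute.
    return unescape(match.group(1))


# Hypothetical, shortened response body for demonstration purposes.
sample = (
    '<script>window.googleJavaScriptRedirect=1</script>'
    '<noscript><META http-equiv="refresh" '
    'content="0;URL=\'http://www.example.com/story/\'"></noscript>'
)
print(extract_meta_refresh_target(sample))
```

Returning None when the pattern is missing makes it easier to notice when the page layout changes, whereas the partition approach would silently return an empty string.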

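If we assume that the URL in the body is just a direct reflection of the url parameter in the query string, then you can extract it from there as well, and we don't even have to ask Google to perform the redirect: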

try:
    from urllib.parse import parse_qs, urlsplit
except ImportError:
    # Python 2
    from urlparse import parse_qs, urlsplit

target = parse_qs(urlsplit(url).query)['url'][0]
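
When processing a whole feed, the query-string approach can be wrapped in a small helper that leaves other links untouched. The following is a minimal sketch for Python 3; the function name resolve_alert_link and the host check are assumptions for illustration, not part of the original answer.

```python
from urllib.parse import parse_qs, urlsplit

# Assumption: Google Alerts links use one of these hosts with the /url path.
GOOGLE_HOSTS = {"google.com", "www.google.com"}


def resolve_alert_link(link):
    """Return the value of the url parameter of a Google redirect link.

    Links that are not Google /url links, or that lack a url parameter,
    are returned unchanged.
    """
    parts = urlsplit(link)
    if parts.hostname in GOOGLE_HOSTS and parts.path == "/url":
        targets = parse_qs(parts.query).get("url")
        if targets:
            return targets[0]
    return link


alert_link = (
    "https://www.google.com/url?rct=j&sa=t"
    "&url=http://www.statesmanjournal.com/story/opinion/readers/2014/10/13/"
    "gmo-labels-encourage-people-make-choices/17171289/"
    "&ct=ga&cd=CAIyGjkyZjE1NGUzMGIwZjRkNGQ6Y29tOmVuOlVT"
    "&usg=AFQjCNHrCLmbml7baTXaqySagcuKHp-KHA"
)
print(resolve_alert_link(alert_link))
```

Note that if a target URL carried its own unescaped & parameters, parse_qs would cut it off at the first one, so parsing the response body may be the safer route in that case.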

Regarding Python - getting the redirect URL for links from Google Alerts feeds, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/26358453/
