python - Scraping Google search result titles and URLs with Python

Reposted by: 行者123 | Updated: 2023-12-01 00:53:37

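I'm working on a project in Python (3.7) in which I need to scrape the titles and URLs of the first few Google results. I tried it with BeautifulSoup, but it doesn't work:

Here is what I tried: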

import requests
from my_fake_useragent import UserAgent
from bs4 import BeautifulSoup

ua = UserAgent()

google_url = "https://www.google.com/search?q=python" + "&num=" + str(5)
response = requests.get(google_url, {"User-Agent": ua.random})
soup = BeautifulSoup(response.text, "html.parser")

result_div = soup.find_all('div', attrs={'class': 'g'})

links = []
titles = []
descriptions = []
for r in result_div:
    # Checks if each element is present, else, raise exception
    try:
        link = r.find('a', href=True)
        title = r.find('h3', attrs={'class': 'r'}).get_text()
        description = r.find('span', attrs={'class': 'st'}).get_text()

        # Check to make sure everything is present before appending
        if link != '' and title != '' and description != '':
            links.append(link['href'])
            titles.append(title)
            descriptions.append(description)
    # Next loop if one element is not present
    except:
        continue

print(titles)

But it doesn't return anything.

When I try to fetch the HTML like this:
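A note on the request in the snippet above: the second positional argument of `requests.get` is `params`, not `headers`, so the `User-Agent` dict is added to the query string rather than sent as a header. This may be one reason Google serves markup without the `g`, `r`, and `st` classes the code looks for. The sketch below builds both requests without sending them; the agent string `example-agent` is a placeholder chosen for illustration.

```python
import requests

url = "https://www.google.com/search?q=python&num=5"
fake_agent = {"User-Agent": "example-agent"}  # hypothetical value for illustration

# requests.get(url, fake_agent) is equivalent to passing params=fake_agent:
# the dict is appended to the query string and no header is set.
as_params = requests.Request("GET", url, params=fake_agent).prepare()

# Passing the dict with the headers= keyword sends it as an HTTP header.
as_headers = requests.Request("GET", url, headers=fake_agent).prepare()

print(as_params.url)                      # query string now contains User-Agent=...
print("User-Agent" in as_params.headers)  # False
print(as_headers.headers["User-Agent"])   # example-agent
```

With the keyword form, the call in the question would read `requests.get(google_url, headers={"User-Agent": ua.random})`. Whether Google then returns the expected markup is not guaranteed.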

url = 'https://google.com/search?q=python'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
print(soup.prettify())

This is what it returns (a sample of the returned HTML is included):

<div id="main">
   <div class="ZINbbc xpd O9g5cc uUPGi">
    <div>
     <div class="jfp3ef">
      <a href="/url?q=https://www.python.org/&amp;sa=U&amp;ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQFjAAegQIBxAB&amp;usg=AOvVaw0nCy-teBd7nOrThY5YGQ4o">
       <div class="BNeawe vvjwJb AP7Wnd">
        Python.org
       </div>
       <div class="BNeawe UPmit AP7Wnd">
        https://www.python.org
       </div>
      </a>
     </div>
     <div class="NJM3tb">
     </div>
     <div class="jfp3ef">
      <div>
       <div class="BNeawe s3v9rd AP7Wnd">
        <div>
         <div>
          <div class="Ap5OSd">
           <div class="BNeawe s3v9rd AP7Wnd">
            The official home of the Python Programming Language.
           </div>
          </div>
          <div class="v9i61e">
           <div class="BNeawe s3v9rd AP7Wnd">
            <span class="BNeawe">
             <a href="/url?q=https://www.python.org/downloads/&amp;sa=U&amp;ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQjBAwAXoECAcQAw&amp;usg=AOvVaw0TKe6ApGOQcWuHcXIkvAT0">
              <span class="XLloXe AP7Wnd">
               Download Python
              </span>
             </a>
            </span>
           </div>
          </div>
          <div class="v9i61e">
           <div class="BNeawe s3v9rd AP7Wnd">
            <span class="BNeawe">
             <a href="/url?q=https://www.python.org/about/gettingstarted/&amp;sa=U&amp;ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQjBAwAnoECAcQBQ&amp;usg=AOvVaw03o9Qt-KFSbwECm8-wmUZS">
              <span class="XLloXe AP7Wnd">
               Python For Beginners
              </span>
             </a>
            </span>
           </div>
          </div>
          <div class="v9i61e">
           <div class="BNeawe s3v9rd AP7Wnd">
            <span class="BNeawe">
             <a href="/url?q=https://www.python.org/doc/&amp;sa=U&amp;ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQjBAwA3oECAcQBw&amp;usg=AOvVaw3Yz3mO8HXGJoaf35qhyb3V">
              <span class="XLloXe AP7Wnd">
               Documentation
              </span>
             </a>
            </span>
           </div>
          </div>
          <div class="v9i61e">
           <div class="BNeawe s3v9rd AP7Wnd">
            <span class="BNeawe">
             <a href="/url?q=https://docs.python.org/&amp;sa=U&amp;ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQjBAwBHoECAcQCQ&amp;usg=AOvVaw0nY6NyZm0wErJJ1RIgTiPm">
              <span class="XLloXe AP7Wnd">
               Python Docs
              </span>
             </a>
            </span>
           </div>
          </div>
          <div class="v9i61e">
           <div class="BNeawe s3v9rd AP7Wnd">
            <span class="BNeawe">
             <a href="/url?q=https://www.python.org/psf/&amp;sa=U&amp;ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQjBAwBXoECAcQCw&amp;usg=AOvVaw3HoEDHmdRBcufXuwakPCAz">
              <span class="XLloXe AP7Wnd">
               Python Software Foundation
              </span>
             </a>
            </span>
           </div>
          </div>
          <div>
           <div class="BNeawe s3v9rd AP7Wnd">
            <span class="BNeawe">
             <a href="/url?q=https://www.python.org/downloads/release/python-373/&amp;sa=U&amp;ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQjBAwBnoECAcQDQ&amp;usg=AOvVaw3HsJpvpsCvYikd_mP7ndN3">
              <span class="XLloXe AP7Wnd">
               Python 3.7.3
              </span>
             </a>
            </span>
           </div>
          </div>
         </div>
        </div>
       </div>
      </div>
     </div>
    </div>
   </div>
</div>
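Two things stand out in this markup. It contains no `g`, `r`, or `st` classes, which is consistent with the first snippet finding nothing. Each result link is also a Google redirect of the form `/url?q=<target>&sa=...`. A minimal sketch of recovering the target URL with the standard library follows; the helper name `unwrap_google_redirect` is made up for this example, and the markup itself may change at any time.

```python
from urllib.parse import parse_qs, urlparse


def unwrap_google_redirect(href):
    """Return the target of a Google '/url?q=...' link, or href unchanged."""
    parsed = urlparse(href)
    if parsed.path == "/url":
        targets = parse_qs(parsed.query).get("q")
        if targets:
            return targets[0]
    return href


# href values from the HTML above, after the parser has decoded &amp; to &
sample_hrefs = [
    "/url?q=https://www.python.org/&sa=U&ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQFjAAegQIBxAB&usg=AOvVaw0nCy-teBd7nOrThY5YGQ4o",
    "/url?q=https://docs.python.org/&sa=U&ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQjBAwBHoECAcQCQ&usg=AOvVaw0nY6NyZm0wErJJ1RIgTiPm",
]
clean_urls = [unwrap_google_redirect(h) for h in sample_hrefs]
print(clean_urls)  # ['https://www.python.org/', 'https://docs.python.org/']
```

The same helper can be applied to each `link['href']` collected by BeautifulSoup. Links that are not redirects pass through unchanged.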

Best answer

You should try the Selenium automation library. It lets you scrape data from pages that are rendered dynamically (via JS or AJAX requests).

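from selenium import webdriver
from bs4 import BeautifulSoup
import time
from bs4.element import Tag

# Selenium 3 style: the driver path is passed positionally.
# Selenium 4 expects the path to be wrapped in a Service object instead.
driver = webdriver.Chrome('/usr/bin/chromedriver')
google_url = "https://www.google.com/search?q=python" + "&num=" + str(5)
driver.get(google_url)
time.sleep(3)  # give the page time to finish rendering

soup = BeautifulSoup(driver.page_source,'lxml')
driver.quit()  # close the browser once the HTML has been captured
result_div = soup.find_all('div', attrs={'class': 'g'})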


links = []
titles = []
descriptions = []
for r in result_div:
    # Collect link, title and description; a missing piece is stored as None
    try:
        link = r.find('a', href=True)
        title = r.find('h3')

        if isinstance(title,Tag):
            title = title.get_text()

        description = r.find('span', attrs={'class': 'st'})

        if isinstance(description, Tag):
            description = description.get_text()

        # Note: find() returns None (never '') when nothing matches,
        # so this check does not filter out missing titles or descriptions
        if link != '' and title != '' and description != '':
            links.append(link['href'])
            titles.append(title)
            descriptions.append(description)
    # Print the error and move on if a result has no link
    except Exception as e:
        print(e)
        continue

print(titles)
print(links)
print(descriptions)

Output:

['Welcome to Python.org', 'Download Python | Python.org', 'Python Tutorial - W3Schools', 'Introduction to Python - W3Schools', 'Python Programming Language - GeeksforGeeks', 'Python: 7 Important Reasons Why You Should Use Python - Medium', 'Python: 7 Important Reasons Why You Should Use Python - Medium', 'Python Tutorial - Tutorialspoint', 'Python Download and Installation Instructions', 'Python vs C++ - Find Out The 9 Important Differences - eduCBA', None, 'Description']
['https://www.python.org/', 'https://www.python.org/downloads/', 'https://www.w3schools.com/python/', 'https://www.w3schools.com/python/python_intro.asp', 'https://www.geeksforgeeks.org/python-programming-language/', 'https://medium.com/@mindfiresolutions.usa/python-7-important-reasons-why-you-should-use-python-5801a98a0d0b', 'https://medium.com/@mindfiresolutions.usa/python-7-important-reasons-why-you-should-use-python-5801a98a0d0b', 'https://www.tutorialspoint.com/python/', 'https://www.ics.uci.edu/~pattis/common/handouts/pythoneclipsejava/python.html', 'https://www.educba.com/python-vs-c-plus-plus/', '/search?num=5&q=Python&stick=H4sIAAAAAAAAAONgFuLQz9U3MK0yjFeCs7SEs5Ot9JPzc3Pz86yKM1NSyxMri1cxsqVZOQZ4Fi9iZQuoLMnIzwMAlVPV1j0AAAA&sa=X&ved=2ahUKEwigvcqKx8XiAhUOSX0KHdtmBgoQzTooADAQegQIChAC', 'mailto:?body=Python%20https%3A%2F%2Fwww.google.com%2Fsearch%3Fkgmid%3D%2Fm%2F05z1_%26hl%3Den-IN%26kgs%3De1764a9f31831e11%26q%3DPython%26shndl%3D0%26source%3Dsh%2Fx%2Fkp%26entrypoint%3Dsh%2Fx%2Fkp']
['The official home of the Python Programming Language.', 'Looking for Python 2.7? See below for specific releases. Contribute to the PSF by Purchasing a PyCharm License. All proceeds benefit the PSF. Donate Now\xa0...', 'Python can be used on a server to create web applications. ... Our "Show Python" tool makes it easy to learn Python, it shows both the code and the result.', 'What is Python? Python is a popular programming language. It was created by Guido van Rossum, and released in 1991. It is used for: web development\xa0...', 'Python is a widely used general-purpose, high level programming language. It was initially designed by Guido van Rossum in 1991 and developed by Python\xa0...', None, None, None, None, None, None, None]
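The output above still contains `None` entries and non-result links (an internal `/search?...` link and a `mailto:` link), because the `!= ''` check in the loop never rejects `None`. A minimal sketch of a stricter filter is shown below; the function name `keep_complete` and the rule "a result needs a title and an absolute http(s) link" are assumptions made for illustration, and the sample lists are shortened stand-ins for entries in the output.

```python
def keep_complete(titles, links, descriptions):
    """Keep only results that have a title and an absolute http(s) link."""
    kept = []
    for title, link, description in zip(titles, links, descriptions):
        if not title or not link:
            continue  # drop entries with a missing title or link
        if not link.startswith(("http://", "https://")):
            continue  # drop internal '/search?...' and 'mailto:' links
        kept.append((title, link, description))
    return kept


# Shortened stand-ins for the first and the last two entries printed above
titles = ['Welcome to Python.org', None, 'Description']
links = ['https://www.python.org/', '/search?num=5&q=Python', 'mailto:?body=Python']
descriptions = ['The official home of the Python Programming Language.', None, None]

kept = keep_complete(titles, links, descriptions)
print(kept)
```

Descriptions are allowed to be `None` here, since several genuine results in the output have no snippet text.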

Here '/usr/bin/chromedriver' is the path to the Selenium web driver.

Download the Selenium web driver for the Chrome browser:

http://chromedriver.chromium.org/downloads

Install the web driver for the Chrome browser:

https://christopher.su/2015/selenium-chromedriver-ubuntu/

Selenium tutorial:

https://selenium-python.readthedocs.io/

Regarding "python - Scraping Google search result titles and URLs with Python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56392962/
