
python - Scraping data with pyppeteer

Reposted. Author: 行者123. Updated: 2023-12-04 14:13:31

I'm trying to scrape data from this site, https://quickfs.net/company/BABA:US, using pyppeteer, without the site knowing that I'm scraping.
So my first question is:

  • Is pyppeteer the right tool for scraping if I don't want the site to notice that I'm scraping?

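  • When you open the link above, there is a dropdown list at the top right with the following items: Overview, Income Statement, ..., Key Ratios.
    I want to use pyppeteer to choose Key Ratios from the dropdown, and from there extract the data in the Per-Share Items section, specifically the Book Value row.
    In the last comment on a previous question I asked about this site, someone told me that this dropdown "only triggers a different way of rendering the same data".
    So my second and third questions are (perhaps they are the same):
  • Should I somehow simulate choosing Key Ratios with pyppeteer?
  • How do I extract the data from the Key Ratios view using pyppeteer, without the site knowing that someone is scraping it?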

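  • I used the questions below to write code for this, but my code only extracts data from the Overview page, which is the first one.
    These are the questions I based my code on: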
  • How can I retrieve data from a web page with a loading screen?
  • Scraping content using pyppeteer in association with asyncio

  • I also tried to learn how to work with the button from this article: Web Scraping with a Headless Browser: A Puppeteer Tutorial, but it uses Puppeteer rather than pyppeteer for Python.
    This is the code I'm using:
    import asyncio
    import pyppeteer

    async def main():
        # Launch a Chromium browser (Chrome can be used instead).
        browser = await pyppeteer.launch(headless=False)
        # Open a blank page.
        page = await browser.newPage()
        # Go to the requested page and let the site's dynamic code run.
        await page.goto("https://api.quickfs.net/stocks/BABA:US/ovr/Annual/")
        # Get the HTML content of the page.
        cont = await page.content()
        # Close the browser before returning.
        await browser.close()
        return cont

    # Run the coroutine once, keep the result, and print the HTML code.
    ovr = asyncio.get_event_loop().run_until_complete(main())
    print(ovr)
Thanks in advance.

Best answer

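Question 1: Is pyppeteer the right tool for scraping if I don't want the site to notice that I'm scraping?
Short answer: Yes. The site relies on JavaScript, so you need something like pyppeteer to render the page. pyppeteer also makes you look like an ordinary user, so there is less chance of being detected.
Technical answer: This takes more web-scraping experience, but if you look at the requests the page makes, you will see that the site uses an API to render the data. It is therefore more efficient to request the API directly, with the appropriate method and headers to avoid detection.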

    GET https://api.quickfs.net/stocks/BABA:US/ovr/Annual/

    {"datasets":{"metadata":{"_id":{},"qfs_symbol":"NYSE:BABA","currency":"USD","fsCat":"normal","name":"Alibaba Group Holding Limited","gs3_version_at_metadata_update":20191106,"exchange":"NYSE","industry":"Retailing","symbol":"BABA","country":"US","price":215.7,"p_pretax_inc":"24.9","ps":"8.1","ev_ebit":"42.5","ev_fcf":"21.7","ev_s":"7.7","ev_ebitda":"37.2","pb":"4.7","mkt_cap":588375,"pe":"27.7","ev_pretax_inc":"23.6","ev":558430,"qfs_symbol_v2":"BABA:US","description":"","avg_vol_50d":19498671,"beta":1.8212,"betaLastUpdated":20200419,"share_turnover":"180","sector":"Consumer Discretionary","template_version":4,"gics":"25502020","template_type":"normal"},"ks":"\n\t\t <div class=\"ksTblBg\">\n\t\t <table class=\"ksTbl\">\n\t\t <thead>\n\t\t <tr>\n\t\t <th colspan=\"6\" style=\"text-align:center\">Key Statistics<\/th>\n\t\t <\/tr>\n\t\t <\/thead>\n\t\t <tbody>\n\t\t \n\t\t <tr>\n\t\t <td class=\"ksSectHead\" colspan=\"2\">Valuation Ratios<\/td>\n\t\t <td class=\"ksSectHead\" colspan=\"2\">10-Yr Median Returns<\/td>\n\t\t <td class=\"ksSectHead\" colspan=\"2\">10-Yr Median Margins<\/td>\n\t\t <\/tr>\n\t\t <tr>\n\t\t <td class='lt'>P\/E<\/td><td class='rt' id='ks-pe'><\/td>\n\t\t <td class='lt'>ROA<\/td><td class='rt'>13.0%<\/td>\n\t\t <td class='lt'>Gross Profit<\/td><td class='rt'>66.7%<\/td>\n\t\t <\/tr>\n\t\t <tr>\n\t\t <td class='lt'>P\/B<\/td><td class='rt' id='ks-pb'><\/td>\n\t\t <td class='lt'>ROE<\/td><td class='rt'>22.3%<\/td>\n\t\t <td class='lt'>EBIT<\/td><td class='rt'>28.9%<\/td>\n\t\t <\/tr>\n\t\t <tr>\n\t\t <td class='lt'>P\/S<\/td><td class='rt' id='ks-ps'><\/td>\n\t\t <td class='lt'>ROIC<\/td><td class='rt'>30.4%<\/td>\n\t\t <td class='lt'>Pre-Tax Income<\/td><td class='rt'>35.6%<\/td>\n\t\t <\/tr>\n\t\t <tr>\n\t\t <td class='lt'>EV\/S<\/td><td class='rt' id='ks-ev_s'><\/td>\n\t\t <td class='ksSectHead' colspan='2'>10-Year CAGR<\/td>\n\t\t <td class='lt'>FCF<\/td><td class='rt'>40.8%<\/td>\n\t\t <\/tr>\n\t\t <tr>\n\t\t <td 
class='lt'>EV\/EBITDA<\/td><td class='rt' id='ks-ev_ebitda'><\/td>\n\t\t <td class='lt'>Revenue<\/td><td class='rt'>56.3%<\/td>\n\t\t <td class='ksSectHead' colspan='2'>Capital Structure<\/td>\n\t\t <\/tr>\n\t\t <tr>\n\t\t <td class='lt'>EV\/EBIT<\/td><td class='rt' id='ks-ev_ebit'><\/td>\n\t\t <td class='lt'>Assets<\/td><td class='rt'>58.2%<\/td>\n\t\t <td class='lt'>Assets \/ Equity<\/td><td class='rt'>1.6<\/td>\n\t\t <\/tr>\n\t\t <tr>\n\t\t <td class='lt'>EV\/Pretax<\/td><td class='rt' id='ks-ev_pretax_income'><\/td>\n\t\t <td class='lt'>FCF<\/td><td class='rt'>51.1%<\/td>\n\t\t <td class='lt'>Debt \/ Equity<\/td><td class='rt'>0.3<\/td>\n\t\t <\/tr>\n\t\t <tr>\n\t\t <td class='lt'>EV\/FCF<\/td><td class='rt' id='ks-ev_fcf'><\/td>\n\t\t <td class='lt'>EPS<\/td><td class='rt'>68.6%<\/td>\n\t\t <td class='lt'>Debt \/ Assets<\/td><td class='rt'>0.2<\/td>\n\t\t <\/tr>\n\t\t \n\t\t <\/tbody>\n\t\t <\/table>\n\t\t <\/div>","ovr":"<table class='fs-table' id='ovr-table'>\n <tbody>\n <tr class='thead'><td><\/td><td>2011<\/td><td>2012<\/td><td>2013<\/td><td>2014<\/td><td>2015<\/td><td>2016<\/td><td>2017<\/td><td>2018<\/td><td>2019<\/td><td>2020<\/td><\/tr><tr class=' '><td class='labelCell'>Revenue<\/td><td class='dataCell' data-type='normal' data-value='1010821000'>1,011<\/td><td class='dataCell' data-type='normal' data-value='3172277000'>3,172<\/td><td class='dataCell' data-type='normal' data-value='5553464000'>5,553<\/td><td class='dataCell' data-type='normal' data-value='8505565000'>8,506<\/td><td class='dataCell' data-type='normal' data-value='12214920000'>12,215<\/td><td class='dataCell' data-type='normal' data-value='15554001000'>15,554<\/td><td class='dataCell' data-type='normal' data-value='22958079000'>22,958<\/td><td class='dataCell' data-type='normal' data-value='39615348000'>39,615<\/td><td class='dataCell' data-type='normal' data-value='56145652000'>56,146<\/td><td class='dataCell' data-type='normal' data-value='72603233000'>72,603<\/td><\/tr><tr class=' 
'><td class='labelCell italic indent'>Revenue Growth<\/td><td class='dataCell italic' data-type='percentage' data-value='0.20945600737049'>20.9%<\/td><td class='dataCell italic' data-type='percentage' data-value='2.1383172688339'>213.8%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.75062392092494'>75.1%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.53157830860162'>53.2%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.43610918263513'>43.6%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.27336085705023'>27.3%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.47602401465706'>47.6%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.72555151500263'>72.6%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.41727019537983'>41.7%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.29312298305842'>29.3%<\/td><\/tr><tr class=' '><td class='labelCell'>Gross Profit<\/td><td class='dataCell' data-type='normal' data-value='812343000'>812<\/td><td class='dataCell' data-type='normal' data-value='2134020000'>2,134<\/td><td class='dataCell' data-type='normal' data-value='3989767000'>3,990<\/td><td class='dataCell' data-type='normal' data-value='6339808000'>6,340<\/td><td class='dataCell' data-type='normal' data-value='8394512000'>8,395<\/td><td class='dataCell' data-type='normal' data-value='10270811000'>10,271<\/td><td class='dataCell' data-type='normal' data-value='14329852000'>14,330<\/td><td class='dataCell' data-type='normal' data-value='22671036000'>22,671<\/td><td class='dataCell' data-type='normal' data-value='25315484000'>25,315<\/td><td class='dataCell' data-type='normal' data-value='32382879000'>32,383<\/td><\/tr><tr class=' '><td class='labelCell italic indent'>Gross Margin %<\/td><td class='dataCell italic' data-type='percentage' data-value='0.80364673864116'>80.4%<\/td><td class='dataCell italic' 
data-type='percentage' data-value='0.67270922432057'>67.3%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.71842853397447'>71.8%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.74537176542652'>74.5%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.68723430034744'>68.7%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.66033241221985'>66.0%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.62417469684637'>62.4%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.57227910758224'>57.2%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.45088948294696'>45.1%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.44602530303299'>44.6%<\/td><\/tr><tr class=' '><td class='labelCell'>Operating Profit<\/td><td class='dataCell' data-type='normal' data-value='266271000'>266<\/td><td class='dataCell' data-type='normal' data-value='847525000'>848<\/td><td class='dataCell' data-type='normal' data-value='1820317000'>1,820<\/td><td class='dataCell' data-type='normal' data-value='4084952000'>4,085<\/td><td class='dataCell' data-type='normal' data-value='3736415000'>3,736<\/td><td class='dataCell' data-type='normal' data-value='4607009000'>4,607<\/td><td class='dataCell' data-type='normal' data-value='7035973000'>7,036<\/td><td class='dataCell' data-type='normal' data-value='11137968000'>11,138<\/td><td class='dataCell' data-type='normal' data-value='8604121000'>8,604<\/td><td class='dataCell' data-type='normal' data-value='13105334000'>13,105<\/td><\/tr><tr class=' '><td class='labelCell italic indent'>Operating Margin %<\/td><td class='dataCell italic' data-type='percentage' data-value='0.26342052648293'>26.3%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.267166139653'>26.7%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.32778046278863'>32.8%<\/td><td class='dataCell italic' 
data-type='percentage' data-value='0.48026815384986'>48.0%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.30588943685264'>30.6%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.29619446469111'>29.6%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.30647045861285'>30.6%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.28115285015293'>28.1%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.15324643482633'>15.3%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.18050620417964'>18.1%<\/td><\/tr><tr class=' '><td class='labelCell'>Earnings Per Share<\/td><td class='dataCell' data-type='eps' data-value='0.053'>$0.05<\/td><td class='dataCell' data-type='eps' data-value='0.287'>$0.29<\/td><td class='dataCell' data-type='eps' data-value='0.574'>$0.57<\/td><td class='dataCell' data-type='eps' data-value='1.62'>$1.62<\/td><td class='dataCell' data-type='eps' data-value='1.555'>$1.56<\/td><td class='dataCell' data-type='eps' data-value='4.289'>$4.29<\/td><td class='dataCell' data-type='eps' data-value='2.462'>$2.46<\/td><td class='dataCell' data-type='eps' data-value='3.88'>$3.88<\/td><td class='dataCell' data-type='eps' data-value='4.973'>$4.97<\/td><td class='dataCell' data-type='eps' data-value='7.965'>$7.97<\/td><\/tr><tr class=' '><td class='labelCell italic indent'>EPS Growth<\/td><td class='dataCell italic' data-type='percentage' data-value='0.23255813953488'>23.3%<\/td><td class='dataCell italic' data-type='percentage' data-value='4.4150943396226'>441.5%<\/td><td class='dataCell italic' data-type='percentage' data-value='1'>100.0%<\/td><td class='dataCell italic' data-type='percentage' data-value='1.8222996515679'>182.2%<\/td><td class='dataCell italic' data-type='percentage' data-value='-0.040123456790124'>-4.0%<\/td><td class='dataCell italic' data-type='percentage' data-value='1.7581993569132'>175.8%<\/td><td class='dataCell italic' 
data-type='percentage' data-value='-0.42597342037771'>-42.6%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.57595450852965'>57.6%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.28170103092784'>28.2%<\/td><td class='dataCell italic' data-type='percentage' data-value='0.60164890408204'>60.2%<\/td><\/tr><tr class=' '><td class='labelCell'>Return on Assets<\/td><td class='dataCell' data-type='percentage' data-value='0.12490081137912'>12.5%<\/td><td class='dataCell' data-type='percentage' data-value='0.13547055438638'>13.5%<\/td><td class='dataCell' data-type='percentage' data-value='0.15474766176667'>15.5%<\/td><td class='dataCell' data-type='percentage' data-value='0.26661125490522'>26.7%<\/td><td class='dataCell' data-type='percentage' data-value='0.13179228026259'>13.2%<\/td><td class='dataCell' data-type='percentage' data-value='0.22667998521059'>22.7%<\/td><td class='dataCell' data-type='percentage' data-value='0.097819038194011'>9.8%<\/td><td class='dataCell' data-type='percentage' data-value='0.10848994208585'>10.8%<\/td><td class='dataCell' data-type='percentage' data-value='0.10177987302833'>10.2%<\/td><td class='dataCell' data-type='percentage' data-value='0.12868657986734'>12.9%<\/td><\/tr><tr class=' '><td class='labelCell'>Return on Equity<\/td><td class='dataCell' data-type='percentage' data-value='0.26227533616942'>26.2%<\/td><td class='dataCell' data-type='percentage' data-value='0.20200237445123'>20.2%<\/td><td class='dataCell' data-type='percentage' data-value='0.38077278024081'>38.1%<\/td><td class='dataCell' data-type='percentage' data-value='0.90392646328004'>90.4%<\/td><td class='dataCell' data-type='percentage' data-value='0.24438521190488'>24.4%<\/td><td class='dataCell' data-type='percentage' data-value='0.34553804941983'>34.6%<\/td><td class='dataCell' data-type='percentage' data-value='0.14914185483796'>14.9%<\/td><td class='dataCell' data-type='percentage' 
data-value='0.17542701201745'>17.5%<\/td><td class='dataCell' data-type='percentage' data-value='0.16392435911507'>16.4%<\/td><td class='dataCell' data-type='percentage' data-value='0.19830371476362'>19.8%<\/td><\/tr><tr class=' '><td class='labelCell'>Return on Invested Capital<\/td><td class='dataCell' data-type='percentage' data-value='0.41743100812616'>41.7%<\/td><td class='dataCell' data-type='percentage' data-value='0.31146385668929'>31.1%<\/td><td class='dataCell' data-type='percentage' data-value='0.56166392937543'>56.2%<\/td><td class='dataCell' data-type='percentage' data-value='0.79357545168436'>79.4%<\/td><td class='dataCell' data-type='percentage' data-value='0.29563665163366'>29.6%<\/td><td class='dataCell' data-type='percentage' data-value='0.40666624726852'>40.7%<\/td><td class='dataCell' data-type='percentage' data-value='0.15645567128128'>15.6%<\/td><td class='dataCell' data-type='percentage' data-value='0.17835726067885'>17.8%<\/td><td class='dataCell' data-type='percentage' data-value='0.15560704472355'>15.6%<\/td><td class='dataCell' data-type='percentage' data-value='0.20185124127701'>20.2%<\/td><\/tr><\/tbody><\/table>","chart":[["2006-12",0],["2007-12",-2.7333985391131],["2008-12",1.4806594382205],["2009-12",0.44823138109063],["2010-12",0.57515717254689],["2011-12",0.41743100812616],["2012-03",0.31146385668929],["2013-03",0.56166392937543],["2014-03",0.79357545168436],["2015-03",0.29563665163366],["2016-03",0.40666624726852],["2017-03",0.15645567128128],["2018-03",0.17835726067885],["2019-03",0.15560704472355],["2020-03",0.20185124127701]]},"errors":[],"code":0,"qfs_symbol_v2":"BABA:US","statementPeriod":"Annual"}
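The `ovr` field in the response above is an HTML table fragment in which each data cell carries the unformatted number in a `data-value` attribute. A minimal sketch of turning that fragment into a dictionary using only the standard library, assuming the `labelCell` / `dataCell` class names seen in this response stay the same (`FsTableParser` and `parse_fs_table` are illustrative names, not part of any library):

```python
from html.parser import HTMLParser


class FsTableParser(HTMLParser):
    """Collect table rows as {row label: [data-value strings]}."""

    def __init__(self):
        super().__init__()
        self.rows = {}          # row label -> list of raw data-value strings
        self._label = None      # label of the row currently being read
        self._in_label = False  # True while inside a labelCell <td>

    def handle_starttag(self, tag, attrs):
        if tag != "td":
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "labelCell" in classes:
            # A new row starts: begin collecting its label text.
            self._in_label = True
            self._label = ""
        elif "dataCell" in classes and self._label is not None:
            # Store the unformatted value, not the displayed text.
            self.rows.setdefault(self._label, []).append(attrs.get("data-value"))

    def handle_data(self, data):
        if self._in_label:
            self._label += data

    def handle_endtag(self, tag):
        if tag == "td" and self._in_label:
            self._in_label = False
            self._label = self._label.strip()


def parse_fs_table(html):
    """Parse an HTML fragment such as the `ovr` field of the API response."""
    parser = FsTableParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

With the decoded JSON in a variable `data`, `parse_fs_table(data["datasets"]["ovr"])["Revenue"]` would give the revenue figures as strings, one per year column. The header row with the years has no `labelCell`, so this sketch skips it.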
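Question 2: Should I somehow simulate choosing Key Ratios with pyppeteer?
Short answer: pyppeteer uses CSS selectors to pick elements on the page. To reach that dropdown, you need to find a selector path that matches the element. With a tool such as Chrome DevTools (F12) you can right-click the element and copy its CSS selector. Then use the selector with pyppeteer: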
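    # Click the dropdown button that lists Key Ratios.
    # page.select() only works on <select> elements and needs option values; this selector targets a button.
    await page.click('body > app-root > app-company > div > div > div.pageHead > div > div:nth-child(3) > div.col-xs-offset-3.col-xs-2 > select-fs-dropdown > div > button > div')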
The documentation for pyppeteer should give you a better idea of how to actually do this.
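Opening the dropdown and then choosing an entry could look roughly like the sketch below. It assumes the entries are ordinary elements whose visible text is the label (for example "Key Ratios"); that markup is not confirmed by the source. `xpath_for_text` and `choose_dropdown_item` are hypothetical helper names. The pyppeteer calls used are `page.click`, `page.waitForXPath` and `ElementHandle.click`.

```python
import asyncio


def xpath_for_text(text):
    """Build an XPath matching any element whose trimmed text equals `text`."""
    # Hypothetical helper; assumes `text` contains no double quotes.
    return f'//*[normalize-space(text())="{text}"]'


async def choose_dropdown_item(page, button_selector, item_text, timeout_ms=10000):
    """Click the dropdown button, then click the entry labelled `item_text`."""
    # Open the dropdown.
    await page.click(button_selector)
    # Wait until the wanted entry is in the DOM, then click it.
    item = await page.waitForXPath(xpath_for_text(item_text), {"timeout": timeout_ms})
    await item.click()


async def main():
    # Imported here so the helpers above can be used without pyppeteer installed.
    import pyppeteer

    browser = await pyppeteer.launch(headless=False)
    try:
        page = await browser.newPage()
        await page.goto("https://quickfs.net/company/BABA:US")
        # Selector copied from DevTools, as in the answer; it may change if the site changes.
        button = ("body > app-root > app-company > div > div > div.pageHead > div > "
                  "div:nth-child(3) > div.col-xs-offset-3.col-xs-2 > "
                  "select-fs-dropdown > div > button > div")
        await choose_dropdown_item(page, button, "Key Ratios")
    finally:
        await browser.close()

# To try it against the live site: asyncio.run(main())
```

Matching the entry by its visible text keeps the second step independent of a long generated selector, at the cost of breaking if the label wording changes.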
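Question 3: How do I extract the data from the Key Ratios view using pyppeteer, without the site knowing that someone is scraping it?
Short answer: You can get the table with a selector, as in the answer to question 2, and then parse the table.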
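One way to "get the table with a selector and parse it" is to let the browser collect the cell texts and return them to Python. The sketch below uses pyppeteer's `page.waitForSelector` and `page.querySelectorAllEval`. `read_table` is a hypothetical helper, and the selector to pass in is an assumption: `#ovr-table` is the id of the Overview table in the API response, while the id of the Key Ratios table is not given in the source and has to be looked up in DevTools.

```python
import asyncio

# JavaScript run in the page: turn every <tr> into a list of trimmed cell texts.
ROWS_TO_TEXT = """rows => rows.map(
    row => Array.from(row.querySelectorAll('td')).map(td => td.innerText.trim())
)"""


async def read_table(page, table_selector):
    """Return {row label: [cell texts]} for the table matched by `table_selector`."""
    # Wait until the table has been rendered by the site's JavaScript.
    await page.waitForSelector(table_selector)
    rows = await page.querySelectorAllEval(f"{table_selector} tr", ROWS_TO_TEXT)
    # The first cell is the row label; rows without a label (e.g. the year header) are skipped.
    return {row[0]: row[1:] for row in rows if row and row[0]}

# Usage inside a coroutine with an open pyppeteer page:
#     table = await read_table(page, "#ovr-table")
#     print(table.get("Revenue"))
```

This returns the formatted text shown on the page (for example "72,603"), not the raw numbers stored in the `data-value` attributes.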
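Technical answer: Get a better understanding of how the site works by reverse-engineering it. With a tool like Chrome DevTools you can see that it calls an API, and the API returns all the data you need as easy-to-parse JSON. Using the API is simple: just change the ticker symbol.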
    # get data for Alibaba
    https://api.quickfs.net/stocks/BABA:US/ovr/Annual/

    # get data for Tesla
    https://api.quickfs.net/stocks/TSLA:US/ovr/Annual/

    # get data for Apple
    https://api.quickfs.net/stocks/AAPL:US/ovr/Annual/
Then you can simply call the API from Python using requests:
    import requests
    resp = requests.get("https://api.quickfs.net/stocks/AAPL:US/ovr/Annual/")
    # json is a method, so it has to be called to get the parsed data
    data = resp.json()
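The same call can be wrapped so that only the ticker changes and explicit headers are sent. The sketch below uses the standard library's `urllib` instead of requests, so it has no third-party dependency. The header values are assumptions, since the source does not say which headers the API expects, and the function names are illustrative. `summarize` reads fields that appear under `datasets.metadata` in the response shown earlier.

```python
import json
from urllib.request import Request, urlopen

# URL pattern taken from the examples above; only the ticker symbol varies.
API_TEMPLATE = "https://api.quickfs.net/stocks/{symbol}/ovr/Annual/"


def overview_url(symbol):
    """Build the overview URL for a ticker such as 'BABA:US'."""
    return API_TEMPLATE.format(symbol=symbol)


def fetch_overview(symbol, timeout=10):
    """Request the overview JSON for `symbol` and return it as a dict."""
    # Assumed headers: a browser-like User-Agent and a JSON Accept header.
    request = Request(
        overview_url(symbol),
        headers={"User-Agent": "Mozilla/5.0", "Accept": "application/json"},
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


def summarize(payload):
    """Pick a few fields from datasets.metadata; missing keys become None."""
    meta = payload["datasets"]["metadata"]
    return {key: meta.get(key) for key in ("name", "symbol", "price", "pe", "pb")}

# Usage (performs a network request):
#     print(summarize(fetch_overview("AAPL:US")))
```

Keeping URL construction, fetching and field extraction in separate functions makes it possible to check the parsing against a saved response without calling the site again.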

For this topic (python - Scraping data with pyppeteer), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/62429010/
