
javascript - Web scraper being blocked: how to rotate IP addresses with Puppeteer?

Reposted. Author: 行者123. Updated: 2023-12-05 06:00:39

So in my web-scraping function, I have the following lines of code:

let portList = [9050, 9052, 9053, 9054, 9055, 9056, 9057, 9058, 9059, 9060];
let spoofPort = portList[Math.floor(Math.random()*portList.length)];
console.log("The chosen port was " + spoofPort);

const browser = await puppeteerExtra.launch({ headless: true, args: [
'--no-sandbox', '--disable-setuid-sandbox', '--proxy-server=socks5://127.0.0.1:' + spoofPort
]});

const page = await browser.newPage();

const userAgent = 'Mozilla/5.0 (X11; Linux x86_64)' +
'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.39 Safari/537.36';

await page.setUserAgent(userAgent);

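I'm trying to rotate the IP address for every request (the function containing this code is essentially called for each request from the client) so that I don't get blocked quickly by the website being scraped. I get the following error: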

2021-05-17T12:08:19.625349+00:00 app[web.1]: The chosen port was 9050
2021-05-17T12:08:20.042016+00:00 app[web.1]: Error: net::ERR_PROXY_CONNECTION_FAILED at https://expampleDomanPlaceholder.com
2021-05-17T12:08:20.042018+00:00 app[web.1]: at navigate (/app/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:115:23)
2021-05-17T12:08:20.042018+00:00 app[web.1]: at processTicksAndRejections (internal/process/task_queues.js:93:5)
2021-05-17T12:08:20.042019+00:00 app[web.1]: at async FrameManager.navigateFrame (/app/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:90:21)
2021-05-17T12:08:20.042020+00:00 app[web.1]: at async Frame.goto (/app/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:416:16)
2021-05-17T12:08:20.042021+00:00 app[web.1]: at async Page.goto (/app/node_modules/puppeteer/lib/cjs/puppeteer/common/Page.js:819:16)
2021-05-17T12:08:20.042021+00:00 app[web.1]: at async /app/app.js:174:9

I have already tried the solutions detailed in these posts, but maybe the problem lies with my userAgent?

Getting error when attempting to use proxy server in Node.js / Puppeteer

https://github.com/puppeteer/puppeteer/issues/2472

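Update: I tried using this buildpack ( https://github.com/iamashks/heroku-buildpack-tor-proxy.git ), but it kept breaking my web dyno (returning an "H14" error, which means you have to clear the buildpacks and add them back). I'm not sure how to proceed from here, since this seems to be the only solution I have been able to find.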

Best answer

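So there are a few problems.

  1. The posted error message is missing the placeholder.
  2. The request failed because of a misspelling.
  3. You have to actually provide a proxy server to the browser object. It has to be initialized.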
Error: net::ERR_PROXY_CONNECTION_FAILED at https://expampleDomanPlaceholder.com
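
A `net::ERR_PROXY_CONNECTION_FAILED` error usually indicates that Chrome could not reach the proxy itself, for example because nothing is listening on the chosen port. One way to rule that out before launching the browser is a quick TCP check. The sketch below uses only Node's built-in `net` module; the helper name `isPortOpen` and the 2-second default timeout are assumptions made for illustration, not part of the original post.

```javascript
const net = require('net');

// Hypothetical helper: resolves to true if a TCP connection to host:port
// succeeds within timeoutMs, and to false on error or timeout.
function isPortOpen(host, port, timeoutMs = 2000) {
  return new Promise((resolve) => {
    const socket = net.createConnection({ host, port });

    // Tear the socket down and report the result exactly once per event.
    const finish = (result) => {
      socket.destroy();
      resolve(result);
    };

    socket.setTimeout(timeoutMs);
    socket.once('connect', () => finish(true));
    socket.once('timeout', () => finish(false));
    socket.once('error', () => finish(false));
  });
}
```

Calling `await isPortOpen('127.0.0.1', 9050)` before `launch()` would show whether a SOCKS proxy is actually listening on that port. It only confirms that the port accepts connections, not that the proxy forwards traffic correctly.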

Here is an example using a proxy server in Cambodia.

We will use a SOCKS4 proxy whose IP address is located in Cambodia.
Proxy IP address: 96.9.77.192, port 55796 (not sure if it still works).


const puppeteer = require('puppeteer');

(async () => {
  const launchOptions = {
    headless: false,
    args: [
      '--start-maximized',
      '--proxy-server=socks4://96.9.77.192:55796', // this is where we set the proxy
    ],
  };

  const browser = await puppeteer.launch(launchOptions);
  const page = await browser.newPage();

  // set viewport and user agent (just in case for nice viewing)
  await page.setViewport({ width: 1366, height: 768 });
  await page.setUserAgent('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36');

  // go to whatismycountry.com to see if proxy works (based on geography location)
  await page.goto('https://whatismycountry.com');

  // close the browser
  await browser.close();
})();

#Proxy Issue
If the proxy host requires AUTH then the example below would be more fitting.


'use strict';

const puppeteer = require('puppeteer');

(async () => {
  // Proxy credentials are read from environment variables.
  const username = process.env.USER;
  const password = process.env.PASS;
  const url = 'https://www.google.com';

  const browser = await puppeteer.launch({
    // The proxy host must be correct.
    args: [
      '--proxy-server=socks5://proxyhost:8000',
    ],
  });

  const page = await browser.newPage();

  // Supply the proxy credentials before navigating.
  await page.authenticate({
    username,
    password,
  });

  await page.goto(url);

  await browser.close();
})();

This worked with Tor, using the following flag:
'--proxy-server=socks5://localhost:9050'
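
To tie this back to the rotation in the question, here is a minimal sketch of choosing a port per launch. It assumes a Tor SOCKS listener is actually running on every port in the list (for instance via several `SocksPort` lines in `torrc`); the helper names `buildLaunchOptions` and `launchWithRandomProxy` are hypothetical, and the launch flags are copied from the question.

```javascript
// Hypothetical helper: picks one port from `ports` and builds the Puppeteer
// launch options for it. `random` is injectable so the choice can be tested.
function buildLaunchOptions(ports, random = Math.random) {
  if (!Array.isArray(ports) || ports.length === 0) {
    throw new Error('At least one proxy port is required');
  }
  const port = ports[Math.floor(random() * ports.length)];
  return {
    port,
    options: {
      headless: true,
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        `--proxy-server=socks5://127.0.0.1:${port}`,
      ],
    },
  };
}

// Hypothetical wrapper: launches a browser through a randomly chosen port.
// Puppeteer is required lazily, so the helper above works without it installed.
async function launchWithRandomProxy(ports) {
  const puppeteer = require('puppeteer');
  const { port, options } = buildLaunchOptions(ports);
  console.log(`The chosen port was ${port}`);
  return puppeteer.launch(options);
}
```

Because `--proxy-server` is a browser-level flag, this changes the exit only when a new browser is launched; pages opened in the same browser keep using the same proxy.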

References: thanks to @Grant Miller for the Tor test.

https://dev.to/sonyarianto/practical-puppeteer-using-proxy-to-browse-a-page-1m82

How to make puppeteer work through socks5 proxy?

Regarding "javascript - Web scraper being blocked: how to rotate IP addresses with Puppeteer?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/67569465/
