
python - How to scrape a password-protected website

Reposted. Author: 行者123. Updated: 2023-12-01 07:49:55

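I am having trouble scraping a password-protected website. I know there are many questions about this, but none of them solved my problem.

The trouble is that I cannot tell what is going wrong. I do get a 200 response from their server, but the content is not what I expect. It is a large HTML document containing words such as "access", "RequestURLDenied", "Password", "Help", and "Sign in", which suggests that my login attempt is not working. I do not know what to change. Does anyone have experience with this kind of scraping?

Here is my code so far (adapted from an existing example):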

import requests
from lxml import html

USERNAME = "XXX"
PASSWORD = "XXX"
LOGIN_URL = "https://signin.lexisnexis.com/lnaccess/app/signin?back=https%3A%2F%2Fadvance.lexis.com%3A443%2Fnexis-uni%2Flaapi%2Fpermalink%2F35a8b8d7-925d-4219-b89d-af27c10a7a31%2F%3Fcontext%3D1516831&aci=nu"
LOGIN_URL2 = "https://signin.lexisnexis.com:443/lnaccess/Transition?aci=nu"
URL = "https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:7XM6-WXH0-Y9M6-H1V0-00000-00&context=1516831"

def main():
    # Create session
    session = requests.session()

    # Get login cookies
    session.get(LOGIN_URL)

    # Create payload - used to log into password protected area
    login_data = {
        "rmtoken": "dummy",
        "request_id": "null",
        "OAM_REQ": "null",
        "userid": USERNAME,
        "password": PASSWORD,
        "rmflag": "0",
        "aci": "nu"
    }

    # Perform login
    session.post(LOGIN_URL, data = login_data)

    # Scrape url
    result = session.get(URL)

    # Content
    print(result.content)


if __name__ == '__main__':
    main()

This is the response I get when I run the script:

[screenshot: script output]
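
A 200 status alone does not show that the login worked, since a site can serve its sign-in page with that status. As a rough check, the sketch below scans a response body for the marker words mentioned above. The helper name `looks_like_login_page`, the marker list, and the two-hit threshold are illustrative assumptions, not part of any library.

```python
# Hypothetical helper: the markers are words reported in the question;
# adjust them for the site being scraped.
LOGIN_MARKERS = ("RequestURLDenied", "Sign in", "Password")


def looks_like_login_page(body, markers=LOGIN_MARKERS, min_hits=2):
    """Return True if the body contains at least min_hits of the markers."""
    text = body.lower()
    hits = [marker for marker in markers if marker.lower() in text]
    return len(hits) >= min_hits


# Made-up sample bodies so the sketch runs without network access.
sample_denied = "<html><title>Sign in</title><form>Password: RequestURLDenied</form></html>"
sample_article = "<html><h1>Quarterly results</h1><p>Revenue rose.</p></html>"

print(looks_like_login_page(sample_denied))   # True
print(looks_like_login_page(sample_article))  # False
```

With a real session, this could be called as `looks_like_login_page(result.text)` before trying to parse the document.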

Another question: suppose I can log in from code and then make thousands of server requests to extract text. Would that cause problems for their server? :D
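
One common way to keep a large crawl from straining a server is to pause between requests. This is a minimal sketch, assuming a fixed delay is acceptable. `fetch_politely`, its parameters, and the one-second default are hypothetical choices, and `str.upper` merely stands in for a real fetch function.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing delay_seconds between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        results.append(fetch(url))
    return results


# Stand-in fetch function so the sketch runs without network access.
pages = fetch_politely(["a", "b", "c"], fetch=str.upper, delay_seconds=0.01)
print(pages)  # ['A', 'B', 'C']
```

With a logged-in `requests` session, `fetch=session.get` could be passed instead of the stand-in.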

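Best answer

In short, your code looks right. You only made a mistake in the URL you send the POST request to, and the payload you are using is incomplete.

Try the following code: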

import requests
from lxml import html
from lxml.etree import tostring

USERNAME = "XXX"
PASSWORD = "XXX"
LOGIN_URL = "https://signin.lexisnexis.com/lnaccess/app/signin?back=https%3A%2F%2Fadvance.lexis.com%3A443%2Fnexis-uni%2Flaapi%2Fpermalink%2F35a8b8d7-925d-4219-b89d-af27c10a7a31%2F%3Fcontext%3D1516831&aci=nu"
URL = "https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:7XM6-WXH0-Y9M6-H1V0-00000-00&context=1516831"

def main():
    session_requests = requests.session()

    # Get login cookies
    session_requests.get(LOGIN_URL)

    # Create payload - used to log into password protected area
    payload = {
        "rmtoken": "dummy",
        "request_id": "null",
        "OAM_REQ": "null",
        "userid": USERNAME,
        "password": PASSWORD,
        "rmflag": "0",
        "aci": "nu"
    }

    # Perform login: POST to the Transition endpoint, not to the sign-in page URL
    result = session_requests.post("https://signin.lexisnexis.com:443/lnaccess/Transition?aci=nu", data = payload)

    # Scrape url
    result = session_requests.get(URL)
    tree = html.fromstring(result.content)
    # bucket_names = tree.xpath("//div[@class='repo-list--repo']/a/text()")

    # Print the parsed document serialized back to markup
    print(tostring(tree))

if __name__ == '__main__':
    main()

Hope this helps.
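
Since the fix comes down to posting to the form's real target with every field it expects, both can be read from the sign-in page instead of being typed by hand. The sketch below uses only the standard library. `LoginFormParser`, `extract_login_form`, and `SAMPLE_PAGE` are hypothetical names, and the sample markup is invented from the field names above, not copied from the actual LexisNexis page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LoginFormParser(HTMLParser):
    """Collect the action URL and named <input> values of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action", "")
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Inputs without a value attribute are stored as empty strings
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True  # ignore any later forms on the page


def extract_login_form(page_html, page_url):
    """Return (absolute POST URL, dict of form fields) for the first form."""
    parser = LoginFormParser()
    parser.feed(page_html)
    return urljoin(page_url, parser.action or ""), parser.fields


# Invented markup for illustration only; the real page may differ.
SAMPLE_PAGE = """
<form action="/lnaccess/Transition?aci=nu" method="post">
  <input type="hidden" name="rmtoken" value="dummy">
  <input type="hidden" name="OAM_REQ" value="">
  <input type="text" name="userid">
  <input type="password" name="password">
</form>
"""

post_url, fields = extract_login_form(
    SAMPLE_PAGE, "https://signin.lexisnexis.com/lnaccess/app/signin"
)
fields.update({"userid": "XXX", "password": "XXX"})
print(post_url)  # https://signin.lexisnexis.com/lnaccess/Transition?aci=nu
```

In the script above, the page text would come from `session_requests.get(LOGIN_URL).text`, and the result would be sent with `session_requests.post(post_url, data=fields)`.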

For more on python - How to scrape a password-protected website, see the similar question on Stack Overflow: https://stackoverflow.com/questions/56289061/
