
python - Creating a LinkedIn web scraper with Python

Reposted. Author: 行者123. Updated: 2023-12-03 08:54:06

I'm using Anaconda and trying to build a scraper that can log in to LinkedIn and scrape relevant information from its pages. For now, I just want it to log in and pull the source code of the relevant pages. However, the code below keeps returning "TypeError: 'NoneType' object is not subscriptable". Does anyone know what's wrong with this code?

import http.cookiejar as cookielib
import os
import urllib.request
import urllib.parse
import re
import string
import html5lib
from bs4 import BeautifulSoup

username = "user@email.com"
password = "password"

cookie_filename = "parser.cookies.txt"

class LinkedInParser(object):

    def __init__(self, login, password):
        """ Start up... """
        self.login = login
        self.password = password

        # Simulate browser with cookies enabled
        self.cj = cookielib.MozillaCookieJar(cookie_filename)
        if os.access(cookie_filename, os.F_OK):
            self.cj.load()
        self.opener = urllib.request.build_opener(
            urllib.request.HTTPRedirectHandler(),
            urllib.request.HTTPHandler(debuglevel=0),
            urllib.request.HTTPSHandler(debuglevel=0),
            urllib.request.HTTPCookieProcessor(self.cj)
        )
        self.opener.addheaders = [
            ('User-agent', ('Mozilla/4.0 (compatible; MSIE 6.0; '
                            'Windows NT 5.2; .NET CLR 1.1.4322)'))
        ]

        # Login
        self.loginPage()

        title = self.loadTitle()
        print(title)

        self.cj.save()

    def loadPage(self, url, data=None):
        """
        Utility function to load HTML from URLs for us with hack to continue despite 404
        """
        # We'll print the url in case of infinite loop
        # print "Loading URL: %s" % url
        try:
            if data is not None:
                response = self.opener.open(url, data)
            else:
                response = self.opener.open(url)
            return ''.join([str(l) for l in response.readlines()])
        except Exception as e:
            # If URL doesn't load for ANY reason, try again...
            # Quick and dirty solution for 404 returns because of network problems
            # However, this could infinite loop if there's an actual problem
            return self.loadPage(url, data)

    def loadSoup(self, url, data=None):
        """
        Combine loading of URL, HTML, and parsing with BeautifulSoup
        """
        html = self.loadPage(url, data)
        soup = BeautifulSoup(html, "html5lib")
        return soup

    def loginPage(self):
        """
        Handle login. This should populate our cookie jar.
        """
        soup = self.loadSoup("https://www.linkedin.com/")
        csrf = soup.find(id="loginCsrfParam-login")['value']
        login_data = urllib.parse.urlencode({
            'session_key': self.login,
            'session_password': self.password,
            'loginCsrfParam': csrf,
        }).encode('utf8')

        self.loadPage("https://www.linkedin.com/uas/login-submit", login_data)
        return

    def loadTitle(self):
        soup = self.loadSoup("http://www.linkedin.com/nhome")
        return soup.find("title")

parser = LinkedInParser(username, password)

The error message is as follows:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-24-43815804ad91> in <module>()
87 return soup.find("linked")
88
---> 89 parser = LinkedInParser(username, password)

<ipython-input-24-43815804ad91> in __init__(self, login, password)
34
35 # Login
---> 36 self.loginPage()
37
38 title = self.loadTitle()

<ipython-input-24-43815804ad91> in loginPage(self)
73 """
74 soup = self.loadSoup("https://www.linkedin.com/")
---> 75 csrf = soup.find(id="loginCsrfParam-login")['value']
76 login_data = urllib.parse.urlencode({
77 'session_key': self.login,

TypeError: 'NoneType' object is not subscriptable

Best Answer

It looks like

soup.find(id="loginCsrfParam-login")

returned nothing. Most likely, no element with that particular ID could be found on the page.
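That failure mode is easy to reproduce in isolation. A minimal sketch (using bs4's bundled html.parser and a made-up snippet of HTML, not the real LinkedIn page): when no element matches, find() returns None, and subscripting None raises exactly this TypeError.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: note there is no element with the expected id.
html = '<form><input id="some-other-field" value="x"></form>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.find(id="loginCsrfParam-login")
print(tag)  # None -- no matching element

try:
    csrf = tag['value']
except TypeError as e:
    print(e)  # 'NoneType' object is not subscriptable
```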

Your script is using a cookie jar to accept cookies from LinkedIn. LinkedIn uses these cookies to keep your account logged in, so you don't have to log in again. However, your LinkedInParser object attempts to log in every time a new instance is created. It loads cookies from a file:

    if os.access(cookie_filename, os.F_OK):
        self.cj.load()

but it does not check whether these cookies exist before attempting to log in:

    # Login
    self.loginPage()

My guess at what's happening: you ran the script once, and it saved a cookie. Now you run it again, and it tries to log in again; but when it fetches the LinkedIn home page, there is no login form because you are already logged in. The script can't find the element it's looking for, and you can't take the ['value'] of an element that doesn't exist.

What you probably want to do instead is:

    if not os.access(cookie_filename, os.F_OK):
        self.loginPage()
        self.cj.save()
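The idea behind that check can be sketched on its own (a minimal sketch: login_needed is a hypothetical helper, and a temporary file stands in for the saved cookie jar): only attempt a login when no cookie file has been saved yet.

```python
import os
import tempfile

def login_needed(cookie_filename):
    """Hypothetical helper: True only while no saved cookie file exists."""
    return not os.access(cookie_filename, os.F_OK)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "parser.cookies.txt")

    first_run = login_needed(path)    # no cookie file yet -> must log in
    open(path, "w").close()           # stands in for self.cj.save() after login
    second_run = login_needed(path)   # cookie file exists -> skip login

print(first_run, second_run)  # True False
```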

Regarding python - Creating a LinkedIn web scraper with Python, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/31836828/
