
python - Beautiful Soup: open all URLs with a pid in them


I am trying to open all the links that have a pid in them, but I run into two cases:

  1. In this version it opens all the URLs (I mean even the junk URLs):

    def get_links(self):
        # requires: import re; from urllib.parse import urlparse
        links = []
        host = urlparse(self.url).hostname
        scheme = urlparse(self.url).scheme
        domain_link = scheme + '://' + host
        pattern = re.compile(r'(/pid/)')  # compiled but never used, so nothing here filters on pid

        for a in self.soup.find_all(href=True):
            href = a['href']
            if not href or len(href) <= 1:
                continue
            elif 'javascript:' in href.lower():
                continue
            elif 'forgotpassword' in href.lower():
                continue
            elif 'images' in href.lower():
                continue
            elif 'seller-account' in href.lower():
                continue
            elif 'review' in href.lower():
                continue
            else:
                href = href.strip()
                if href[0] == '/':
                    href = (domain_link + href).strip()
                elif href[:4] == 'http':
                    href = href.strip()
                elif href[0] != '/' and href[:4] != 'http':
                    href = (domain_link + '/' + href).strip()
                if '#' in href:
                    indx = href.index('#')
                    href = href[:indx].strip()
                if href in links:
                    continue

                links.append(self.re_encode(href))

        return links
  2. In this version it opens only the URLs with a pid in them, but then it does not follow the links any further and stays limited to the home page. After opening a few pid links it crashes:

    def get_links(self):
        # requires: import re; from urllib.parse import urlparse
        links = []
        host = urlparse(self.url).hostname
        scheme = urlparse(self.url).scheme
        domain_link = scheme + '://' + host
        pattern = re.compile(r'(/pid/)')

        for a in self.soup.find_all(href=True):
            # only hrefs that already contain /pid/ are collected, so the
            # crawler never queues the plain navigation links it would need
            # to get beyond the home page
            if pattern.search(a['href']) is not None:
                href = a['href']
                if not href or len(href) <= 1:
                    continue
                elif 'javascript:' in href.lower():
                    continue
                elif 'forgotpassword' in href.lower():
                    continue
                elif 'images' in href.lower():
                    continue
                elif 'seller-account' in href.lower():
                    continue
                elif 'review' in href.lower():
                    continue
                else:
                    href = href.strip()
                    if href[0] == '/':
                        href = (domain_link + href).strip()
                    elif href[:4] == 'http':
                        href = href.strip()
                    elif href[0] != '/' and href[:4] != 'http':
                        href = (domain_link + '/' + href).strip()
                    if '#' in href:
                        indx = href.index('#')
                        href = href[:indx].strip()
                    if href in links:
                        continue

                    links.append(self.re_encode(href))

        return links

Could someone help me get all the links, including the inner links of each URL, and in the end accept only the pid links as the returned result?
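In other words, two separate collections are needed: every same-site link, so the crawler can keep moving into inner pages, and the /pid/ links that should actually be returned. A minimal sketch of that separation, assuming the same self.soup, self.url, and self.re_encode attributes as in the snippets above:

from urllib.parse import urljoin, urldefrag

def get_links(self):
    # crawl_links: everything worth following; pid_links: the actual results
    crawl_links, pid_links = [], []
    skip_words = ('javascript:', 'forgotpassword', 'images',
                  'seller-account', 'review')

    for a in self.soup.find_all(href=True):
        href = a['href'].strip()
        if len(href) <= 1 or any(w in href.lower() for w in skip_words):
            continue
        # urljoin resolves relative hrefs against the page URL,
        # urldefrag drops any '#fragment'
        href = urldefrag(urljoin(self.url, href))[0]
        href = self.re_encode(href)
        if href not in crawl_links:
            crawl_links.append(href)    # follow every page on the site
        if '/pid/' in href and href not in pid_links:
            pid_links.append(href)      # but only report the pid URLs

    return crawl_links, pid_links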

Best Answer

Maybe I am missing something, but why not put an if statement inside the for loop instead of the regex? It would look like this:

def get_links(self):
    # requires: from urllib.parse import urlparse
    links = []
    host = urlparse(self.url).hostname
    scheme = urlparse(self.url).scheme
    domain_link = scheme + '://' + host

    for a in self.soup.find_all(href=True):
        href = a['href']
        if not href or len(href) <= 1:
            continue
        # plain substring test replaces the compiled regex
        if href.lower().find("/pid/") != -1:
            if 'javascript:' in href.lower():
                continue
            elif 'forgotpassword' in href.lower():
                continue
            elif 'images' in href.lower():
                continue
            elif 'seller-account' in href.lower():
                continue
            elif 'review' in href.lower():
                continue

            if href[0] == '/':
                href = (domain_link + href).strip()
            elif href[:4] == 'http':
                href = href.strip()
            elif href[0] != '/' and href[:4] != 'http':
                href = (domain_link + '/' + href).strip()

            if '#' in href:
                indx = href.index('#')
                href = href[:indx].strip()

            if href in links:
                continue

            links.append(self.re_encode(href))

    return links

I also removed the following lines, because I believe your code would otherwise never reach the lower section, since every branch ends with a continue:

else:
    continue
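To sanity-check what the substring filter accepts, here is a small self-contained example against a made-up page (the HTML and hrefs are illustrative only, not from the original site):

from bs4 import BeautifulSoup

html = """
<a href="/pid/12345">product</a>
<a href="/images/logo.png">logo</a>
<a href="/seller-account/pid/99">seller</a>
<a href="/about">about</a>
"""

soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all(href=True):
    href = a['href'].lower()
    keep = '/pid/' in href and not any(
        w in href for w in ('javascript:', 'forgotpassword', 'images',
                            'seller-account', 'review'))
    print(a['href'], '->', 'kept' if keep else 'skipped')
# only /pid/12345 is kept; the seller-account pid link is blacklisted,
# and the rest contain no /pid/ at all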

Regarding "python - Beautiful Soup: open all URLs with a pid in them", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/32440541/
