python - BeautifulSoup output does not match Chrome inspect when web scraping in Python

Reposted. Author: 行者123. Last updated: 2023-11-28 00:46:45

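I am currently trying to scrape protein sequences from the NCBI protein database. At this point, the user can search for a protein and I can get the link to the first result the database returns. However, when I run that page through Beautiful Soup, the soup does not match what Chrome's Inspect Element shows, and it contains no sequence at all.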

Here is my current code:

import string
import requests
from bs4 import BeautifulSoup

def getSequence():
    searchProt = input("Enter a Protein Name!:")
    if searchProt != '':
        searchString = "https://www.ncbi.nlm.nih.gov/protein/?term=" + searchProt
        page = requests.get(searchString)
        soup = BeautifulSoup(page.text, 'html.parser')
        soup = str(soup)
        accIndex = soup.find("a")
        accessionStart = soup.find('<dd>',accIndex)
        accessionEnd = soup.find('</dd>', accessionStart + 4)
        accession = soup[accessionStart + 4: accessionEnd]
        newSearchString = "https://www.ncbi.nlm.nih.gov/protein/" + accession
        try:
            newPage = requests.get(newSearchString)
            #This is where it fails
            newSoup = BeautifulSoup(newPage.text, 'html.parser')
            aaList = []
            spaceCount = newSoup.count("ff_line")
            print(spaceCount)
            for i in range(spaceCount):
                startIndex = newSoup.find("ff_line")
                startIndex = newSoup.find(">", startIndex) + 2
                nextAA = newSoup[startIndex]
                while nextAA in string.ascii_lowercase:
                    aaList.append(nextAA)
                    startIndex += 1
                    nextAA = newSoup[startIndex]
            return aaList
        except:
            print("Please Enter a Valid Protein")

I have been testing this with the search term "p53", and the code does get as far as finding the link to the first result.

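I have looked through a number of web-scraping posts on this site and tried many things, including installing Selenium and using different parsers. I am still confused about why the two do not match. (Sorry if this is a duplicate question; I am new to web scraping and currently have a concussion, so I am looking for some case-specific feedback.)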

Best answer

This code uses Selenium to extract the protein sequence you want. I modified your original code so that it gives the result you are after.

from bs4 import BeautifulSoup
from selenium import webdriver
import requests

# Selenium drives a real browser, so JavaScript on the page is executed.
driver = webdriver.Firefox()

def getSequence():
    searchProt = input("Enter a Protein Name!:")
    if searchProt != '':
        # Search page: plain requests is enough to find the first accession.
        searchString = "https://www.ncbi.nlm.nih.gov/protein/?term=" + searchProt
        page = requests.get(searchString)
        soup = BeautifulSoup(page.text, 'html.parser')
        soup = str(soup)
        accIndex = soup.find("a")
        accessionStart = soup.find('<dd>',accIndex)
        accessionEnd = soup.find('</dd>', accessionStart + 4)
        accession = soup[accessionStart + 4: accessionEnd]
        newSearchString = "https://www.ncbi.nlm.nih.gov/protein/" + accession
        try:
            # Record page: load it in the browser and read the HTML it holds.
            driver.get(newSearchString)
            html = driver.page_source
            # The "lxml" parser requires the lxml package to be installed.
            newSoup = BeautifulSoup(html, "lxml")
            ff_tags = newSoup.find_all(class_="ff_line")
            aaList = []
            for tag in ff_tags:
                # Each ff_line holds one line of sequence; drop the spaces.
                aaList.append(tag.text.strip().replace(" ",""))
            protSeq = "".join(aaList)
            return protSeq
        except Exception:
            print("Please Enter a Valid Protein")

sequence = getSequence()
print(sequence)
# Close the browser window once the sequence has been printed.
driver.quit()
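
One caveat with reading `driver.page_source` immediately after `driver.get(...)`: if the sequence is filled in by JavaScript after the initial load, the `ff_line` elements may not exist yet at that moment. Below is a minimal sketch of waiting for them explicitly. It assumes the record page still renders each sequence line in an element with class `ff_line`, as the code above does; the names `join_sequence_lines` and `fetch_sequence_lines` and the 15-second timeout are illustrative choices, not part of the original answer.

```python
def join_sequence_lines(lines):
    """Join raw sequence lines into one string, dropping all whitespace."""
    return "".join("".join(line.split()) for line in lines)


def fetch_sequence_lines(driver, url, timeout=15):
    """Load url and wait until at least one ff_line element is present.

    Assumes each sequence line is rendered in an element with class
    "ff_line"; the timeout value is an arbitrary choice.
    """
    # Imported here so the pure helper above works without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(url)
    # until() raises TimeoutException if no matching element shows up in time.
    elements = WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, "ff_line"))
    )
    return [element.text for element in elements]
```

With a driver already created, `join_sequence_lines(fetch_sequence_lines(driver, newSearchString))` would stand in for the `page_source` and `find_all` steps. A timeout then surfaces as a `TimeoutException`, which is easier to tell apart from an invalid protein name than an empty result.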

For the input "p53", getSequence() produces the following output:

meepqsdlsielplsqetfsdlwkllppnnvlstlpssdsieelflsenvtgwledsggalqgvaaaaastaedpvtetpapvasapatpwplsssvpsyktfqgdygfrlgflhsgtaksvtctyspslnklfcqlaktcpvqlwvnstpppgtrvramaiykklqymtevvrrcphherssegdslappqhlirvegnlhaeylddkqtfrhsvvvpyeppevgsdcttihynymcnsscmggmnrrpiltiitledpsgnllgrnsfevricacpgrdrrteeknfqkkgepcpelppksakralptntssspppkkktldgeyftlkirgherfkmfqelnealelkdaqaskgsedngahssylkskkgqsasrlkklmikregpdsd
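
To see why the `requests` version finds no sequence while Chrome's inspector shows one, it can help to count the `ff_line` elements in both HTML sources. The sketch below uses only the standard library; `ClassCounter` and `count_class` are hypothetical helper names, and the idea that the sequence lines carry the class `ff_line` is taken from the code in this thread rather than from any NCBI documentation.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute includes a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name="ff_line"):
    """Return how many elements in html carry class_name."""
    counter = ClassCounter(class_name)
    counter.feed(html)
    counter.close()
    return counter.count
```

Comparing `count_class(requests.get(url).text)` with `count_class(driver.page_source)` for the same URL shows where the difference comes from. If the first is 0 and the second is positive, the elements were added by JavaScript after the initial response. Chrome's inspector displays the live DOM, so it includes such elements, whereas `requests` only ever sees the HTML the server sent.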

Regarding "python - BeautifulSoup output does not match Chrome inspect when web scraping in Python", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50012384/
