
Python Selenium scraping the wrong set of brackets (Python, Selenium, MySQL)

Reposted. Author: 行者123. Updated: 2023-11-30 21:35:31

Question:

I have some code that scrapes https://au.pcpartpicker.com/products/cpu/overall-list/# . It grabs the text inside each set of brackets and adds it to a MySQL database by matching on the name. This works fine, but if an entry has two sets of brackets it picks the second set, see below.

Example

Incorrect set of brackets chosen
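
A minimal sketch of the cause, using a hypothetical title string (not taken from the page): the lookaround regex used in the code below captures every bracket group, so an entry with two groups yields two matches, and the later unpacking keeps the last one.

import re

# Hypothetical CPU title with two bracket groups (illustrative only).
text = "AMD A8-7670K (Godavari) (on-die)"
print(re.findall(r'(?<=\().*?(?=\))', text))
# ['Godavari', 'on-die'] -- both groups are captured; the code below ends up storing the last one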

Code:

Here is my code:

import mysql.connector
from bs4 import BeautifulSoup as soup
from selenium import webdriver
import time, re

mydb = mysql.connector.connect(
    host="host",
    user="user",
    passwd="passwd",
    database="database"
)

mycursor = mydb.cursor()

d = webdriver.Chrome('D:/Uskompuf/Downloads/chromedriver')
d.get('https://au.pcpartpicker.com/products/cpu/overall-list/#page=1')
def cpus(_source):
    result = soup(_source, 'html.parser').find('ul', {'id':'category_content'}).find_all('li')
    _titles = list(filter(None, [(lambda x:'' if x is None else x.text)(i.find('div', {'class':'title'})) for i in result]))
    data = [list(filter(None, [re.findall('(?<=\().*?(?=\))', c.text) for c in i.find_all('div')])) for i in result]
    return _titles, [a for *_, [a] in filter(None, data)]


_titles, _cpus = cpus(d.page_source)
sql = "UPDATE cpu set family = %s where name = %s"
mycursor.executemany(sql, list(zip(_cpus, _titles)))
print(sql, list(zip(_cpus, _titles)))
mydb.commit()
_last_page = soup(d.page_source, 'html.parser').find_all('a', {'href':re.compile('#page\=\d+')})[-1].text
for i in range(2, int(_last_page)+1):
    d.get(f'https://au.pcpartpicker.com/products/cpu/overall-list/#page={i}')
    time.sleep(3)
    _titles, _cpus = cpus(d.page_source)
    sql = "UPDATE cpu set family = %s where name = %s"
    mycursor.executemany(sql, list(zip(_cpus, _titles)))
    mydb.commit()

mydb.commit()

更新

I tried the suggestion below from @Daniel Scott,

changing

_titles = list(filter(None, [(lambda x:'' if x is None else x.text)(i.find('div', {'class':'title'})) for i in result]))

to

_titles = list(filter(None, [(lambda x:'' if x is None else str(x.text).split(")")[0])(i.find('div', {'class':'title'})) for i in result]))

but I still seem to be getting on-die. Any ideas?

Update 2

Here is the class; both sets of brackets seem to be part of the title, however:

(screenshot)

I'm thinking I may have to change this str(x.text).split(")

Update 3

I have changed my code to:

import mysql.connector
from bs4 import BeautifulSoup as soup
from selenium import webdriver
import time, re

mydb = mysql.connector.connect(
    host="host",
    user="root",
    passwd="passwd",
    database="database"
)

mycursor = mydb.cursor()

d = webdriver.Chrome('D:/Uskompuf/Downloads/chromedriver')
d.get('https://au.pcpartpicker.com/products/cpu/overall-list/#page=1')
def cpus(_source):
    result = soup(_source, 'html.parser').find('ul', {'id':'category_content'}).find_all('li')
    _titles = list(filter(None, [(lambda x:'' if x is None else x.text)(i.find('div', {'class':'title'})) for i in result]))
    data = [list(filter(None, [re.findall('(?<=\().*?(?=\))', c.text) for c in i.find_all('div')])) for i in result]
    data2 = []
    for i in data:
        ii = 0
        arr2 = []
        for c in i:
            # Skip the rest of the section if we've already seen a closing bracket
            if (")" in c) and ii > 1:
                a = 1
            if ")" in c:
                ii += 1
            try:
                arr2.append(c.replace("(", "").replace(")", ""))
            except Exception:
                pass
        data2.append(arr2)
    data = data2
    return _titles, [a for *_, [a] in filter(None, data)]

_titles, _cpus = cpus(d.page_source)
sql = "UPDATE cpu set family = %s where name = %s"
mycursor.executemany(sql, list(zip(_cpus, _titles)))
mydb.commit()
_last_page = soup(d.page_source, 'html.parser').find_all('a', {'href':re.compile('#page\=\d+')})[-1].text
for i in range(2, int(_last_page)+1):
    d.get(f'https://au.pcpartpicker.com/products/cpu/overall-list/#page={i}')
    time.sleep(3)
    _titles, _cpus = cpus(d.page_source)
    sql = "UPDATE cpu set family = %s where name = %s"
    mycursor.executemany(sql, list(zip(_cpus, _titles)))
    mydb.commit()

mydb.commit()

Updated as per the answer, but this doesn't work at all, as all the family values in my database are now Null.

This is what print(data) returns after it runs.

Any ideas?

Update 4

No luck, back to square one.

Update 5

If I print([list(filter(None, [re.findall('(?<=\().*?(?=\))', c.text) for c in i.find_all('div')])) for i in result])

this is an example of what I don't want; I need to get rid of on-die.

[['0'], ['0'], ['OEM/Tray'], ['Godavari'], ['on-die']]

Update 6

I think this is the code that needs to be changed:

return _titles, [a for *_, [a] in filter(None, data)]
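
To see why this comprehension picks on-die, here is a small sketch using the entry printed in update 5 as input: the *_, [a] unpacking keeps only the last bracket group of each entry.

data = [[['0'], ['0'], ['OEM/Tray'], ['Godavari'], ['on-die']]]  # one entry, taken from update 5
print([a for *_, [a] in filter(None, data)])  # ['on-die'] -- the last group wins, not 'Godavari'

One possible adjustment, assuming on-die is the only extra group that needs discarding (an assumption, not something confirmed in this thread), would be to drop it before the unpacking:

data = [[g for g in groups if g != ['on-die']] for groups in data]  # hypothetical filter
print([a for *_, [a] in filter(None, data)])  # ['Godavari']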

Other:

If you need any more information, please let me know.

Thanks

Best Answer

You can change the line:

_titles = list(filter(None, [(lambda x:'' if x is None else x.text)(i.find('div', {'class':'title'})) for i in result]))

to this:

_titles = list(filter(None, [(lambda x:'' if x is None else str(x.text).split(")")[0])(i.find('div', {'class':'title'})) for i in result]))

This should change it so that a double set of brackets is split into two strings and only the first one is used. Let me know if that works for you!
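
As a quick illustration of what that split does, with a hypothetical title string: everything after the first closing bracket is thrown away.

title = "AMD A8-7670K (Godavari) (on-die)"
print(title.split(")")[0])  # 'AMD A8-7670K (Godavari' -- only the part before the first ')' is kept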

Reference: Splitting strings

Update

My scraper is really patchy at the moment, so I couldn't work this out properly.

After this line:

data = [list(filter(None, [re.findall('(?<=\().*?(?=\))', c.text) for c in i.find_all('div')])) for i in result]

paste this:

data2 = []
for i in data:
    ii = 0
    arr2 = []
    for c in i:
        # Skip the rest of the section if we've already seen a closing bracket
        if (")" in c) and ii > 1:
            a = 1
        if ")" in c:
            ii += 1
        try:
            arr2.append(c.replace("(", "").replace(")", ""))
        except Exception:
            pass
    data2.append(arr2)
data = data2

This code is so ugly it is begging for downvotes, but at least it gives you a start so you aren't held up. It iterates over the array and ignores the second set of brackets that appears in each ul.
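
As a rough sketch of that intent (an alternative formulation under the assumption that each entry's bracket groups arrive as a list and only the first group should be kept, not the code above):

data = [[['Godavari'], ['on-die']]]  # hypothetical input: one entry with two bracket groups
data = [groups[:1] for groups in data]  # keep only the first group per entry, ignoring later ones
print(data)  # [[['Godavari']]] -- later groups such as 'on-die' are dropped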

Cheers mate.

Regarding "Python Selenium scraping the wrong set of brackets (Python, Selenium, MySQL)", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54048478/
