
python - Parsing HTML table data into a dictionary with BeautifulSoup

Reposted. Author: 太空宇宙. Updated: 2023-11-04 10:03:31

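I am trying to use BeautifulSoup to parse information stored in an HTML table and store it in a dictionary. I have been able to locate the table and iterate over its values, but the result still contains a lot of junk that I am not sure how to handle.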

import requests
from bs4 import BeautifulSoup

# load the HTML file
r = requests.get("http://www.ebay.com/itm/222378225962")
soup = BeautifulSoup(r.content, "html.parser")

# navigate to the item attributes table
table = soup.find('div', 'itemAttr')

# iterate through the attribute information, one string per table row
attr = []
for i in table.find_all("tr"):
    attr.append(i.text.strip().replace('\t', ''))

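With this approach, the data looks like the following. As you can see, there is a lot of junk in it, and some rows contain more than one item, such as Year and VIN.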

[u'Condition:\nUsed',
u'Seller Notes:\n\u201cExcellent Condition\u201d',
u'Year: \n\n2015\n\n VIN (Vehicle Identification Number): \n\n2G1FJ1EW2F9192023',
u'Mileage: \n\n29,000\n\n Transmission: \n\nManual',
u'Make: \n\nChevrolet\n\n Body Type: \n\nCoupe',
u'Model: \n\nCamaro\n\n Warranty: \n\nVehicle has an existing warranty',
u'Trim: \n\nSS Coupe 2-Door\n\n Vehicle Title: \n\nClear',
u'Engine: \n\n6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated\n\n Options: \n\nLeather Seats',
u'Drive Type: \n\nRWD\n\n Safety Features: \n\nAnti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags',
u'Power Options: \n\nAir Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats\n\n Sub Model: \n\n1LE',
u'Fuel Type: \n\nGasoline\n\n Color: \n\nWhite',
u'For Sale By: \n\nPrivate Seller\n\n Interior Color: \n\nBlack',
u'Disability Equipped: \n\nNo\n\n Number of Cylinders: \n\n8',
u'']

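Ultimately, I would like to store the data in a dictionary like the one below. I know how to create a dictionary, but I do not know how to clean up the data that needs to go into it without resorting to brute-force find and replace.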

{'Condition' : 'Used',
'Seller Notes' : 'Excellent Condition',
'Year': '2015',
'VIN (Vehicle Identification Number)': '2G1FJ1EW2F9192023',
'Mileage': '29,000',
'Transmission': 'Manual',
'Make': 'Chevrolet',
'Body Type': 'Coupe',
'Model': 'Camaro',
'Warranty': 'Vehicle has an existing warranty',
'Trim': 'SS Coupe 2-Door',
'Vehicle Title' : 'Clear',
'Engine': '6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated',
'Options': 'Leather Seats',
'Drive Type': 'RWD',
'Safety Features' : 'Anti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags',
'Power Options' : 'Air Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats',
'Sub Model' : '1LE',
'Fuel Type' : 'Gasoline',
'Exterior Color' : 'White',
'For Sale By' : 'Private Seller',
'Interior Color' : 'Black',
'Disability Equipped' : 'No',
'Number of Cylinders': '8'}

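Best answer

Rather than trying to parse the data out of the tr elements, a better approach is to iterate over the td.attrLabels elements. You can use each label as a key and its adjacent sibling element as the value.

In the example below, the CSS selector div.itemAttr td.attrLabels selects every td element with the attrLabels class that is a descendant of div.itemAttr. From there, the .find_next_sibling() method finds the adjacent sibling element, which holds the value.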

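import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.ebay.com/itm/222378225962")
soup = BeautifulSoup(r.content, 'lxml')  # the 'lxml' parser requires the lxml package

# each label cell becomes a key; the cell right after it becomes the value
data = []
for label in soup.select('div.itemAttr td.attrLabels'):
    data.append({ label.text.strip(): label.find_next_sibling().text.strip() })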

Output:

> [{'Year:': '2015'}, {'VIN (Vehicle Identification Number):': '2G1FJ1EW2F9192023'}, {'Mileage:': '29,000'}, {'Transmission:': 'Manual'}, {'Make:': 'Chevrolet'}, {'Body Type:': 'Coupe'}, {'Model:': 'Camaro'}, {'Warranty:': 'Vehicle has an existing warranty'}, {'Trim:': 'SS Coupe 2-Door'}, {'Vehicle Title:': 'Clear'}, {'Engine:': '6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated'}, {'Options:': 'Leather Seats'}, {'Drive Type:': 'RWD'}, {'Safety Features:': 'Anti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags'}, {'Power Options:': 'Air Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats'}, {'Sub Model:': '1LE'}, {'Fuel Type:': 'Gasoline'}, {'Exterior Color:': 'White'}, {'For Sale By:': 'Private Seller'}, {'Interior Color:': 'Black'}, {'Disability Equipped:': 'No'}, {'Number of Cylinders:': '8'}]
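
Because the listing URL may no longer be available, the same label-and-sibling technique can be tried on a small inline snippet. The following is only a minimal sketch: the HTML is hypothetical markup that mimics the structure described above (the real page may differ), and the names `html`, `attributes` and `value_cell` are illustrative. It uses the built-in `html.parser` so that lxml is not needed, and it writes into a single dictionary rather than a list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the attribute table described above;
# the real page structure is an assumption here.
html = """
<div class="itemAttr">
  <table>
    <tr>
      <td class="attrLabels">Year:</td><td>2015</td>
      <td class="attrLabels">Mileage:</td><td>29,000</td>
    </tr>
    <tr>
      <td class="attrLabels">Make:</td><td><span>Chevrolet</span></td>
      <td class="attrLabels">Body Type:</td><td>Coupe</td>
    </tr>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

attributes = {}
for label in soup.select("div.itemAttr td.attrLabels"):
    # find_next_sibling skips whitespace text nodes and returns the next <td> tag
    value_cell = label.find_next_sibling("td")
    if value_cell is None:
        continue  # a label without a value cell is skipped rather than raising
    key = label.get_text(strip=True).rstrip(":")
    attributes[key] = value_cell.get_text(strip=True)

print(attributes)
```

Checking for `None` is a defensive choice: if a label cell happens to be the last cell in its row, `find_next_sibling` returns `None`, and calling `.text` on it would raise an `AttributeError`.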

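If you also want to retrieve the table header th elements, you can select the table element first and then use the CSS selector th, td.attrLabels to retrieve both kinds of label: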

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.ebay.com/itm/222378225962")
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('div', 'itemAttr')

# 'th, td.attrLabels' matches header cells as well as label cells
data = []
for label in table.select('th, td.attrLabels'):
    data.append({ label.text.strip(): label.find_next_sibling().text.strip() })

Output:

> [{'Condition:': 'Used'}, {'Seller Notes:': '“Excellent Condition”'}, {'Year:': '2015'}, {'VIN (Vehicle Identification Number):': '2G1FJ1EW2F9192023'}, {'Mileage:': '29,000'}, {'Transmission:': 'Manual'}, {'Make:': 'Chevrolet'}, {'Body Type:': 'Coupe'}, {'Model:': 'Camaro'}, {'Warranty:': 'Vehicle has an existing warranty'}, {'Trim:': 'SS Coupe 2-Door'}, {'Vehicle Title:': 'Clear'}, {'Engine:': '6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated'}, {'Options:': 'Leather Seats'}, {'Drive Type:': 'RWD'}, {'Safety Features:': 'Anti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags'}, {'Power Options:': 'Air Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats'}, {'Sub Model:': '1LE'}, {'Fuel Type:': 'Gasoline'}, {'Exterior Color:': 'White'}, {'For Sale By:': 'Private Seller'}, {'Interior Color:': 'Black'}, {'Disability Equipped:': 'No'}, {'Number of Cylinders:': '8'}]

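If you want to strip non-alphanumeric characters from the keys (note that this also removes the spaces between words), you can use: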

import re

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.ebay.com/itm/222378225962")
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('div', 'itemAttr')

data = []
for label in table.select('th, td.attrLabels'):
    # \W+ matches runs of characters that are not letters, digits or underscores
    key = re.sub(r'\W+', '', label.text.strip())
    value = label.find_next_sibling().text.strip()

    data.append({ key: value })

Output:

> [{'Condition': 'Used'}, {'SellerNotes': '“Excellent Condition”'}, {'Year': '2015'}, {'VINVehicleIdentificationNumber': '2G1FJ1EW2F9192023'}, {'Mileage': '29,000'}, {'Transmission': 'Manual'}, {'Make': 'Chevrolet'}, {'BodyType': 'Coupe'}, {'Model': 'Camaro'}, {'Warranty': 'Vehicle has an existing warranty'}, {'Trim': 'SS Coupe 2-Door'}, {'VehicleTitle': 'Clear'}, {'Engine': '6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated'}, {'Options': 'Leather Seats'}, {'DriveType': 'RWD'}, {'SafetyFeatures': 'Anti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags'}, {'PowerOptions': 'Air Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats'}, {'SubModel': '1LE'}, {'FuelType': 'Gasoline'}, {'ExteriorColor': 'White'}, {'ForSaleBy': 'Private Seller'}, {'InteriorColor': 'Black'}, {'DisabilityEquipped': 'No'}, {'NumberofCylinders': '8'}]
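
The examples above return a list of one-entry dictionaries, whereas the question asks for a single dictionary whose keys keep their spaces and whose values have no surrounding quotation marks. One possible way to bridge that gap is sketched below in plain Python. The `data` list is a shortened copy of the colon-suffixed output shown earlier, and the helpers `clean_key` and `clean_value` are hypothetical names introduced only for this illustration.

```python
import re

# Shortened copy of the list produced by the 'th, td.attrLabels' example
data = [
    {'Condition:': 'Used'},
    {'Seller Notes:': '“Excellent Condition”'},
    {'Year:': '2015'},
    {'VIN (Vehicle Identification Number):': '2G1FJ1EW2F9192023'},
]

def clean_key(label):
    """Collapse runs of whitespace and drop the trailing colon, keeping spaces."""
    return re.sub(r'\s+', ' ', label).strip().rstrip(':').strip()

def clean_value(text):
    """Trim whitespace and any straight or curly double quotes around the value."""
    return text.strip().strip('“”"')

# Merge every one-entry dictionary into a single dictionary
attributes = {}
for pair in data:
    for label, text in pair.items():
        attributes[clean_key(label)] = clean_value(text)

print(attributes)
```

If two labels clean to the same key, the later one overwrites the earlier one, so this sketch assumes that labels are unique within a listing.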

Regarding "python - Parsing HTML table data into a dictionary with BeautifulSoup", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42184367/
