python - Removing duplicate entries from a JSON file - BeautifulSoup

Reposted. Author: 太空宇宙. Updated: 2023-11-04 04:43:17

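I'm running a script that scrapes a website for textbook information, and the script itself works. However, when it writes to the JSON file, I get duplicate results. I'm trying to figure out how to remove the duplicates from the JSON file. Here is my code: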

from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import json

urls = ['https://open.bccampus.ca/find-open-textbooks/',
        'https://open.bccampus.ca/find-open-textbooks/?start=10']

data = []
# opening up connection and grabbing page
for url in urls:
    uClient = urlopen(url)
    page_html = uClient.read()
    uClient.close()

    # html parsing
    page_soup = soup(page_html, "html.parser")

    # grabs info for each textbook
    containers = page_soup.findAll("h4")

    for container in containers:
        item = {}
        item['type'] = "Textbook"
        item['title'] = container.parent.a.text
        item['author'] = container.nextSibling.findNextSibling(text=True)
        item['link'] = "https://open.bccampus.ca/find-open-textbooks/" + container.parent.a["href"]
        item['source'] = "BC Campus"
        data.append(item)  # add the item to the list

with open("./json/bc.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)

Here is a sample of the JSON output:

{
    "type": "Textbook",
    "title": "Exploring Movie Construction and Production",
    "author": " John Reich, SUNY Genesee Community College",
    "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=19892992-ae43-48c4-a832-59faa1d7108b&contributor=&keyword=&subject=",
    "source": "BC Campus"
}, {
    "type": "Textbook",
    "title": "Exploring Movie Construction and Production",
    "author": " John Reich, SUNY Genesee Community College",
    "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=19892992-ae43-48c4-a832-59faa1d7108b&contributor=&keyword=&subject=",
    "source": "BC Campus"
}, {
    "type": "Textbook",
    "title": "Project Management",
    "author": " Adrienne Watt",
    "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8678fbae-6724-454c-a796-3c6667d826be&contributor=&keyword=&subject=",
    "source": "BC Campus"
}, {
    "type": "Textbook",
    "title": "Project Management",
    "author": " Adrienne Watt",
    "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8678fbae-6724-454c-a796-3c6667d826be&contributor=&keyword=&subject=",
    "source": "BC Campus"
}
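
To confirm which entries are repeated before removing anything, one option is to count how often each link appears. This is only a minimal sketch, assuming the records look like the sample above; `find_duplicate_links` and the shortened `sample` list are hypothetical names and data used for illustration, not part of the original script.

```python
from collections import Counter


def find_duplicate_links(records):
    """Return {link: count} for every link that appears more than once."""
    counts = Counter(record["link"] for record in records)
    return {link: n for link, n in counts.items() if n > 1}


# Hypothetical stand-in for the scraped `data` list (links shortened).
sample = [
    {"title": "Project Management", "link": "?uuid=8678fbae"},
    {"title": "Project Management", "link": "?uuid=8678fbae"},
    {"title": "Exploring Movie Construction and Production", "link": "?uuid=19892992"},
]

print(find_duplicate_links(sample))  # {'?uuid=8678fbae': 2}
```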

Best answer

Figured it out. Here is the solution in case anyone else runs into this issue:

textbook_list = []
for item in data:
    if item not in textbook_list:
        textbook_list.append(item)

with open("./json/bc.json", "w") as writeJSON:
    json.dump(textbook_list, writeJSON, ensure_ascii=False)
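
The `item not in textbook_list` check compares each entry against every entry kept so far, which is fine for a few dozen books but grows quadratically with the number of records. As an alternative sketch (not the answerer's method), and assuming the `link` field uniquely identifies a textbook, a set of already-seen links gives an order-preserving result in a single pass. `dedupe_by_key` is a hypothetical helper name, and the small `data` list here is made-up stand-in data.

```python
def dedupe_by_key(records, key="link"):
    """Keep the first record for each distinct value of `key`, preserving order."""
    seen = set()
    unique = []
    for record in records:
        marker = record[key]
        if marker not in seen:
            seen.add(marker)
            unique.append(record)
    return unique


# Hypothetical stand-in for the scraped `data` list.
data = [
    {"title": "A", "link": "?uuid=1"},
    {"title": "A", "link": "?uuid=1"},
    {"title": "B", "link": "?uuid=2"},
]
textbook_list = dedupe_by_key(data)
print([item["title"] for item in textbook_list])  # ['A', 'B']
```

The resulting list can be passed to `json.dump` in the same way as in the answer's code.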

Regarding "python - Removing duplicate entries from a JSON file - BeautifulSoup", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50160675/
