
python - Fetch content from HTML and write it to a CSV in a specific format

Reposted. Author: 太空宇宙. Updated: 2023-11-03 15:02:37

My HTML code is as follows:

<!-- Snippet snippets/search_result_text.html end -->
</h2>





<p class="filter-list">


<span class="facet">Organisations:</span>

<span class="filtered pill">**Reserve Bank of Australia**
<a href="/dataset?groups=business" class="remove" title="Remove"><i class="icon-remove"></i></a>
</span>



<span class="facet">Groups:</span>

<span class="filtered pill">**Business Support and Regulation**
<a href="/dataset?organization=reservebankofaustralia" class="remove" title="Remove"><i class="icon-remove"></i></a>
</span>


</p>



</form>




<!-- Snippet snippets/search_form.html end -->




<!-- Snippet snippets/search_package_list.html start -->



<ul class="dataset-list unstyled">






<!-- Snippet snippets/package_item.html start -->






<li class="dataset-item">

<div class="dataset-content">
<h3 class="dataset-heading">



<a href="/dataset/banks-assets">**Banks – Assets**</a>




</h3>


<div>These data are derived from returns submitted to the Australian Prudential Regulation Authority (APRA) by banks authorised under the Banking Act 1959. APRA assumed...</div>

</div>

<ul class="dataset-resources unstyled">

<li>

<a href="/dataset/banks-assets" class="label" data-format="xls">XLS</a>

</li>

</ul>


</li>
<!-- Snippet snippets/package_item.html end -->





<!-- Snippet snippets/package_item.html start -->






<li class="dataset-item">

<div class="dataset-content">
<h3 class="dataset-heading">



<a href="/dataset/consolidated-exposures-immediate-and-ultimate-risk-basis">**Consolidated Exposures – Immediate and Ultimate Risk Basis**</a>




</h3>


<div>In March 2003, banks and selected Registered Financial Corporations (RFCs) began reporting their international assets, liabilities and country exposures to APRA in ARF/RRF 231...</div>

</div>

<ul class="dataset-resources unstyled">

<li>

<a href="/dataset/consolidated-exposures-immediate-and-ultimate-risk-basis" class="label" data-format="xls">XLS</a>

</li>

</ul>


</li>
<!-- Snippet snippets/package_item.html end -->

I want to extract the data shown in bold above and write it to a CSV in a specific format, for example:

Group                            Organisation               Title
Business Support and Regulation  Reserve Bank of Australia  Banks-Assets
Business Support and Regulation  Reserve Bank of Australia  Consolidated Exposures – Immediate and Ultimate Risk Basis

and so on. My Python code currently produces two separate files.

import urllib.request

import pandas as pd
from bs4 import BeautifulSoup

webpage_urls = ["https://data.gov.au/dataset?q=&groups=business&sort=extras_harvest_portal+asc%2C+score+desc%2C+metadata_modified+desc&_organization_limit=0&organization=reservebankofaustralia&_groups_limit=0",
"https://data.gov.au/dataset?q=&groups=business&sort=extras_harvest_portal+asc%2C+score+desc%2C+metadata_modified+desc&_organization_limit=0&organization=department-of-finance&_groups_limit=0",
"https://data.gov.au/dataset?q=&groups=business&sort=extras_harvest_portal+asc%2C+score+desc%2C+metadata_modified+desc&_organization_limit=0&organization=departmentofagriculturefisheriesandforestry&_groups_limit=0",
"https://data.gov.au/dataset?organization=department-of-communications&q=&groups=business&sort=extras_harvest_portal+asc%2C+score+desc%2C+metadata_modified+desc&_organization_limit=0&_groups_limit=0",
"https://data.gov.au/dataset?organization=ip-australia&q=&groups=business&sort=extras_harvest_portal+asc%2C+score+desc%2C+metadata_modified+desc&_organization_limit=0&_groups_limit=0",
"https://data.gov.au/dataset?q=&organization=australiancommunicationsandmediaauthority&groups=business&sort=extras_harvest_portal+asc%2C+score+desc%2C+metadata_modified+desc&_organization_limit=0&_groups_limit=0",
"https://data.gov.au/dataset?q=&organization=www-mitchellshirecouncil-vic-gov-au&groups=business&sort=extras_harvest_portal+asc%2C+score+desc%2C+metadata_modified+desc&_organization_limit=0&_groups_limit=0",
"https://data.gov.au/dataset?q=&groups=business&sort=extras_harvest_portal+asc%2C+score+desc%2C+metadata_modified+desc&_organization_limit=0&organization=digital-transformation-agency&_groups_limit=0"]
# fetching data from all urls
data = []
dfs = []

for i in webpage_urls:
    wiki2 = i
    page = urllib.request.urlopen(wiki2)
    soup = BeautifulSoup(page)

    lobbying = {}
    data2 = soup.find_all('h3', class_="dataset-heading")
    for element in data2:
        lobbying[element.a.get_text()] = {}
    data2[0].a["href"]
    prefix = "https://data.gov.au"
    for element in data2:
        print()
        lobbying[element.a.get_text()]["link"] = prefix + element.a["href"]
        #print(lobbying)
    df = pd.DataFrame.from_dict(lobbying, orient='index').rename_axis('Titles').reset_index()
    dfs.append(df)
df = pd.concat(dfs, ignore_index=True)
df1 = df.drop_duplicates(subset='Titles')
print(df1)
df1.to_csv('D:/output2.csv')

for i in webpage_urls:
    wiki2 = i
    page = urllib.request.urlopen(wiki2)
    soup = BeautifulSoup(page)

    # fetching organisations
    data3 = soup.find_all('li', class_="nav-item active")
    lobbying1 = []
    for element in data3:
        lobbying1.append(element.span.get_text())
    data.append(lobbying1)

df_ = pd.DataFrame(data, columns=['Organisations', 'Groups'])
df2 = df_.drop_duplicates(subset='Organisations')
with pd.option_context('display.max_rows', 999):
    print(df2)
df2.to_csv('D:/output_new.csv')

The links are given above as well. Please help me get the required format in a single CSV with three columns.

Best answer

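I tried modifying your original solution a little. It is better to loop only once and build one big DataFrame that holds all the data. You can then create new DataFrames by selecting only the columns you need with a subset such as [['col1','col2']].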
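The single-pass idea can be tried offline before hitting the live site. The sketch below is only an illustration: it parses a trimmed copy of the markup from the question and takes the organisation and group from the `span.facet` / `span.pill` pairs in that snippet, whereas the answer's code reads them from `li.nav-item.active`. The helper name `parse_page`, the `SAMPLE_HTML` constant and the choice of the built-in `html.parser` are assumptions made for this example, and the live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup shown in the question (assumed to be
# representative of every results page).
SAMPLE_HTML = """
<p class="filter-list">
  <span class="facet">Organisations:</span>
  <span class="filtered pill">Reserve Bank of Australia
    <a href="/dataset?groups=business" class="remove" title="Remove"><i class="icon-remove"></i></a>
  </span>
  <span class="facet">Groups:</span>
  <span class="filtered pill">Business Support and Regulation
    <a href="/dataset?organization=reservebankofaustralia" class="remove" title="Remove"><i class="icon-remove"></i></a>
  </span>
</p>
<ul class="dataset-list unstyled">
  <li class="dataset-item"><h3 class="dataset-heading">
    <a href="/dataset/banks-assets">Banks – Assets</a></h3></li>
  <li class="dataset-item"><h3 class="dataset-heading">
    <a href="/dataset/consolidated-exposures-immediate-and-ultimate-risk-basis">Consolidated Exposures – Immediate and Ultimate Risk Basis</a></h3></li>
</ul>
"""

PREFIX = "https://data.gov.au"


def parse_page(html):
    """Return one dict per dataset, tagged with the page's active filters."""
    soup = BeautifulSoup(html, "html.parser")

    # Map each facet label ("Organisations", "Groups") to the pill after it.
    filters = {}
    for facet in soup.find_all("span", class_="facet"):
        pill = facet.find_next_sibling("span", class_="pill")
        if pill is not None:
            label = facet.get_text(strip=True).rstrip(":")
            filters[label] = pill.get_text(strip=True)

    rows = []
    for heading in soup.find_all("h3", class_="dataset-heading"):
        rows.append({
            "Group": filters.get("Groups", ""),
            "Organisation": filters.get("Organisations", ""),
            "Title": heading.a.get_text(strip=True),
            "link": PREFIX + heading.a["href"],
        })
    return rows


# One pass over all pages, collecting every row in a single list.
all_rows = []
for html in [SAMPLE_HTML]:
    all_rows.extend(parse_page(html))

for row in all_rows:
    print(row)
```

Because every row already carries its group and organisation, the list can be passed to `pd.DataFrame(all_rows)` and written out as one file instead of two.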

You can also remove the numbers in parentheses with str.replace:

import urllib.request

import pandas as pd
from bs4 import BeautifulSoup

# webpage_urls is the list of URLs from the question
dfs = []

for url in webpage_urls:
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, "lxml")

    lobbying = {}
    # always only 2 active li, so select first by [0] and second by [1]
    active = soup.find_all('li', class_="nav-item active")
    org = active[0].span.get_text()
    groups = active[1].span.get_text()

    prefix = "https://data.gov.au"
    for element in soup.find_all('h3', class_="dataset-heading"):
        # one entry per dataset title, tagged with the page's organisation and group
        lobbying[element.a.get_text()] = {
            "link": prefix + element.a["href"],
            "Organisation": org,
            "Group": groups,
        }
    df = pd.DataFrame.from_dict(lobbying, orient='index') \
           .rename_axis('Titles').reset_index()
    dfs.append(df)

df = pd.concat(dfs, ignore_index=True)
df1 = df.drop_duplicates(subset='Titles').reset_index(drop=True)

# strip the "(number)" counts; regex=True is required because
# pandas 2.0+ treats the pattern as a literal string by default
df1['Organisation'] = df1['Organisation'].str.replace(r'\(\d+\)', '', regex=True).str.strip()
df1['Group'] = df1['Group'].str.replace(r'\(\d+\)', '', regex=True).str.strip()
# Preview the first rows of the combined DataFrame
print(df1.head())
                                              Titles            Organisation  \
0                                     Banks – Assets  Reserve Bank of Aus...
1  Consolidated Exposures – Immediate and Ultimat...  Reserve Bank of Aus...
2  Foreign Exchange Transactions and Holdings of ...  Reserve Bank of Aus...
3  Finance Companies and General Financiers – Sel...  Reserve Bank of Aus...
4                   Liabilities and Assets – Monthly  Reserve Bank of Aus...

                                                link                   Group
0           https://data.gov.au/dataset/banks-assets  Business Support an...
1  https://data.gov.au/dataset/consolidated-expos...  Business Support an...
2  https://data.gov.au/dataset/foreign-exchange-t...  Business Support an...
3  https://data.gov.au/dataset/finance-companies-...  Business Support an...
4  https://data.gov.au/dataset/liabilities-and-as...  Business Support an...

# Keep only the title and link columns
df2 = df1[['Titles', 'link']]
print(df2.head())
                                              Titles  \
0                                     Banks – Assets
1  Consolidated Exposures – Immediate and Ultimat...
2  Foreign Exchange Transactions and Holdings of ...
3  Finance Companies and General Financiers – Sel...
4                   Liabilities and Assets – Monthly

                                                link
0           https://data.gov.au/dataset/banks-assets
1  https://data.gov.au/dataset/consolidated-expos...
2  https://data.gov.au/dataset/foreign-exchange-t...
3  https://data.gov.au/dataset/finance-companies-...
4  https://data.gov.au/dataset/liabilities-and-as...

# Keep the three columns requested in the question
df3 = df1[['Group', 'Organisation', 'Titles']]
print(df3.head())
                    Group            Organisation  \
0  Business Support an...  Reserve Bank of Aus...
1  Business Support an...  Reserve Bank of Aus...
2  Business Support an...  Reserve Bank of Aus...
3  Business Support an...  Reserve Bank of Aus...
4  Business Support an...  Reserve Bank of Aus...

                                              Titles
0                                     Banks – Assets
1  Consolidated Exposures – Immediate and Ultimat...
2  Foreign Exchange Transactions and Holdings of ...
3  Finance Companies and General Financiers – Sel...
4                   Liabilities and Assets – Monthly
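The answer stops at the DataFrame, so here is a minimal sketch of the last step the question asks for: a single CSV with the three columns. It uses two hand-written rows shaped like `df1` rather than scraped data. The `(12)` and `(40)` counts are made-up placeholders for whatever numbers the live page shows, and renaming `Titles` to `Title` is only there to match the header in the question.

```python
import io

import pandas as pd

# Two rows shaped like df1 in the answer. The "(12)" and "(40)" counts are
# invented placeholders, not values taken from the site.
df1 = pd.DataFrame({
    "Titles": [
        "Banks – Assets",
        "Consolidated Exposures – Immediate and Ultimate Risk Basis",
    ],
    "link": [
        "https://data.gov.au/dataset/banks-assets",
        "https://data.gov.au/dataset/consolidated-exposures-immediate-and-ultimate-risk-basis",
    ],
    "Organisation": ["Reserve Bank of Australia (12)"] * 2,
    "Group": ["Business Support and Regulation (40)"] * 2,
})

for column in ["Organisation", "Group"]:
    # regex=True makes pandas treat the pattern as a regular expression;
    # str.strip() then drops the space left behind by the removed count.
    df1[column] = df1[column].str.replace(r"\(\d+\)", "", regex=True).str.strip()

# Select the three requested columns in the requested order.
df3 = df1[["Group", "Organisation", "Titles"]].rename(columns={"Titles": "Title"})

# index=False leaves out the 0, 1, 2... row labels.
buffer = io.StringIO()
df3.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a file instead of a string, pass a path such as `'D:/output.csv'` (a hypothetical name, in the style of the paths in the question) to `to_csv` in place of `buffer`.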

Regarding "python - Fetch content from HTML and write it to a CSV in a specific format", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/44941796/
