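I am trying to scrape the list on the wiki page "https://en.wikipedia.org/wiki/Glossary_of_nautical_terms" and get the title and description of each nautical term. My first problem is correctly handling descriptions that are themselves lists. This is what I have so far: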
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Glossary_of_nautical_terms'
page = requests.get(url)
get_title = []
get_desc = []
corrected_desc = []
output = ''
if page.status_code == 200:
    soup = BeautifulSoup(page.text, 'html.parser')
    get_title = soup.find_all('dt', class_='glossary')
    get_desc = soup.find_all('dd', class_='glossary')
    for i in get_desc:
        first_char = i.get_text()[:1]
        second_char = i.get_text()[1:2]
        if (first_char.isnumeric() and second_char == '.'):
            if(first_char == '1' and output):
                corrected_desc.append(output)
                output = ''
                output += '{} '.format(i.get_text())
                continue
            else:
                output += '{} '.format(i.get_text())
                continue
        if output:
            corrected_desc.append(output)
            output = ''
            corrected_desc.append(i.get_text())
        else:
            corrected_desc.append(i.get_text())
else:
    print('failed to get the page!')
print(str(len(get_title)) + ' - ' + str(len(corrected_desc)))
zipped = zip(get_title, corrected_desc)
for j in zipped:
    output = '{}, {}\n'.format(j[0].get_text(), j[1].strip())
    with open('test.txt', "a", encoding='utf-8') as myfile:
        myfile.write(output)
But I can't figure out how to handle descriptions that contain both a list and a sentence.
Edit: the output I am looking for is:
"Title", "Description"
"Title", "Description"
"Title", "Description"
"Title", "Description"
But I am not sure how to adjust my code to handle the case where a description is a list plus a sentence.
Best answer
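All the titles are inside <dt> tags, and the descriptions are inside <dd> tags. So the first step is to find all of these tags, which can be done with soup.find_all(['dt', 'dd']). Then loop over the tags and check whether each one is a dt or a dd using if tag.name == 'dt'. If the tag is a dd, append its content to the description variable; otherwise, print the current contents of the variables.
Full code: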
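import requests
from bs4 import BeautifulSoup

r = requests.get('https://en.wikipedia.org/wiki/Glossary_of_nautical_terms')
soup = BeautifulSoup(r.text, 'lxml')

curr_title, curr_description = '', ''
for tag in soup.find_all(['dt', 'dd']):
    if tag.name == 'dt':
        # A new term starts: print the previous term, then reset.
        if curr_title:
            print('{}: {}'.format(curr_title, curr_description))
        curr_description = ''
        curr_title = tag.text.strip()
    else:
        # Append this <dd> to the description of the current term.
        curr_description = ' '.join((curr_description, tag.text.strip())).strip()

# Print the last term, which the loop itself never reaches.
if curr_title:
    print('{}: {}'.format(curr_title, curr_description))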
Partial output:
A-back: A foresail when against the wind, used when tacking to help the vessel turn.[1]
Abaft: Toward the stern, relative to some object ("abaft the fore hatch").
Abaft the beam: Further aft than the beam: a relative bearing of greater than 90 degrees from the bow: "two points abaft the beam, starboard side". That would describe "an object lying 22.5 degrees toward the rear of the ship, as measured clockwise from a perpendicular line from the right side, center, of the ship, toward the horizon."[2]
Abandon ship!: An imperative to leave the vessel immediately, usually in the face of some imminent overwhelming danger.[3] It is an order issued by the Master or a delegated person in command. (It must be a verbal order). It is usually the last resort after all other mitigating actions have failed or become impossible, and destruction or loss of the ship is imminent; and customarily followed by a command to "man the lifeboats" or life rafts.[3][4]
Abeam: On the beam, a relative bearing at right angles to the ship's keel.[5]
Able seaman: Also able-bodied seaman. A merchant seaman qualified to perform all routine duties, or a junior rank in some navies.
Aboard: On or in a vessel. Synonymous with "on board." (See also close aboard.)
About: "To go about is to change the course of a ship by tacking. Ready about, or boutship, is the order to prepare for tacking."[6]
Above board: On or above the deck, in plain view, not hiding anything. Pirates would hide their crews below decks, thereby creating the false impression that an encounter with another ship was a casual matter of chance.[7]
Above-water hull: The hull section of a vessel above the waterline, the visible part of a ship. Also, topsides.
Absentee pennant: Special pennant flown to indicate absence of commanding officer, admiral, his chief of staff, or officer whose flag is flying (division, squadron, or flotilla commander).
Absolute bearing: The bearing of an object in relation to north. Either true bearing, using the geographical or true north, or magnetic bearing, using magnetic north. See also bearing and relative bearing.
Accommodation ladder: A portable flight of steps down a ship's side.
Accommodation ship (or accommodation hulk): A ship or hulk used as housing, generally when there is a lack of quarters available ashore. An operational ship can be used, but more commonly a hulk modified for accommodation is used.
Act of Pardon or Act of Grace: A letter from a state or power authorising action by a privateer. See also Letter of marque.
Action Stations: See Battle stations.
Admiral: Senior naval officer of Flag rank. In ascending order of seniority, Rear Admiral, Vice Admiral, Admiral and (until about 2001 when all UK five-star ranks were discontinued) Admiral of the Fleet (Royal Navy). Derivation Arabic, from Amir al-Bahr ("Ruler of the sea").
Admiralty: 1. A high naval authority in charge of a state's Navy or a major territorial component. In the Royal Navy (UK) the Board of Admiralty, executing the office of the Lord High Admiral, promulgates Naval law in the form of Queen's (or King's) Regulations and Admiralty Instructions. 2. Admiralty law
Admiralty law: Body of law that deals with maritime cases. In the UK administered by the Probate, Divorce and Admiralty Division of the High Court of Justice or supreme court.
Adrift: 1. Afloat and unattached in any way to the shore or seabed, but not under way. When referring to a vessel, it implies that the vessel is not under control and therefore goes where the wind and current take her (loose from moorings or out of place). 2. Any gear not fastened down or put away properly. 3. Any person or thing that is misplaced or missing. When applied to a member of the navy or marine corps, such a person is "absent without leave" (AWOL) or, in US Navy and US Marine Corps terminology, is guilty of an "unauthorized absence" (UA).[8]
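The question asked for quoted "Title", "Description" rows rather than printed lines. A minimal sketch of one way to get there uses the same dt/dd grouping and hands the pairs to Python's csv module. The names SAMPLE_HTML, pair_terms and to_csv are hypothetical helpers introduced here for illustration, and the inline markup is a made-up stand-in for the glossary page so the example runs without a network request. It uses 'html.parser' instead of 'lxml' only to avoid the extra dependency.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the glossary page.
SAMPLE_HTML = """
<dl>
  <dt>Abaft</dt>
  <dd>Toward the stern.</dd>
  <dt>Admiralty</dt>
  <dd>1. A high naval authority.</dd>
  <dd>2. Admiralty law</dd>
</dl>
"""


def pair_terms(html):
    """Return (title, description) tuples, joining every <dd> that follows a <dt>."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    title, parts = None, []
    for tag in soup.find_all(['dt', 'dd']):
        if tag.name == 'dt':
            # A new term starts: store the previous one, if any.
            if title is not None:
                rows.append((title, ' '.join(parts)))
            title, parts = tag.get_text().strip(), []
        else:
            parts.append(tag.get_text().strip())
    # Store the final term, which no later <dt> will trigger.
    if title is not None:
        rows.append((title, ' '.join(parts)))
    return rows


def to_csv(rows):
    """Serialize the pairs as CSV text with every field quoted."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator='\n')
    writer.writerows(rows)
    return buffer.getvalue()


rows = pair_terms(SAMPLE_HTML)
csv_text = to_csv(rows)
print(csv_text, end='')
```

Letting csv.writer do the quoting means that quotation marks and commas inside a description are escaped properly, which manual string formatting would not do. Note that the writer puts no space after the comma, unlike the layout shown in the question.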
This question, "python - Beautiful Soup - scraping a wiki page", comes from Stack Overflow: https://stackoverflow.com/questions/49833207/