html - Unwrapping an element with beautifulsoup4: does it affect the .string of the parent element?

Reposted by: 行者123 Updated: 2023-11-28 03:15:06


I am scraping text data from a table like the one below, and I want to get this result:

Lorem ipsum
dolor sit amet
consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

html = '''
<table>
<tr class="title last ">
  <td>
   Lorem ipsum
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   <span class="caps">dolor
   </span>
   sit amet
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   consectetur adipiscing elit,
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
    Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
  </td>
  <td>
  </td>
 </tr>
</table>
'''

I unwrapped the <span> elements with beautifulsoup4:

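from bs4 import BeautifulSoup
import re

# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')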

# remove <span> tag but keep content
spans = soup.find_all('span')
for tag in spans:
    tag.unwrap()

However, I either get empty lines for all the empty <td> elements, or the 'dolor sit amet' line is not printed, even though I can see it when I print the HTML with prettify.

# text with empty lines
for line in soup.find_all('td'):
    print(line.get_text().strip())
    print(line.string) # line with <span> prints None

# missing line <span>
for line in soup.find_all('td', text=re.compile(r'\w')):
    print(line.get_text().strip())

print(soup.prettify())

Am I doing something wrong? How can I use unwrap() and still access all the text content without the empty lines?

Thanks for your help!

Best answer

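As far as I can tell from testing, you were close. Apply strip() to the text of each cell, then use the re module to replace runs of whitespace with a single space, for example: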

from bs4 import BeautifulSoup
import re

html = ''' 
<table>
<tr class="title last ">
  <td>
   Lorem ipsum
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   <span class="caps">dolor
   </span>
   sit amet
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   consectetur adipiscing elit,
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
    Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
  </td>
  <td>
  </td>
 </tr>
</table>
'''

# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')

# remove <span> tag but keep content
spans = soup.find_all('span')
for tag in spans:
    tag.unwrap()

print('\n'.join(
  re.sub(r'\s+', ' ', td.text.strip()) 
    for td in soup.find_all('td') if td.text.strip()))

This produces:

Lorem ipsum
dolor sit amet
consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
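
As for why line.string printed None in the question: after unwrap(), the <td> holds several adjacent text nodes instead of one, and .string is only defined for a tag with exactly one child. A tag filter such as find_all('td', text=...) is checked against .string, so that cell is skipped. The sketch below is not part of the original answer; it assumes Beautiful Soup 4.8.0 or later (for smooth()) and uses a shortened, hypothetical copy of the table to show one possible way to merge the text nodes so the regex filter works again:

```python
import re
from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question (assumption: same cell layout).
html = """
<table>
 <tr><td>
   <span class="caps">dolor
   </span>
   sit amet
 </td><td>
 </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td")

# Remove the <span> tags but keep their content.
for span in soup.find_all("span"):
    span.unwrap()

# The cell now has several adjacent text nodes, so .string is None.
child_count_after_unwrap = len(cell.contents)
string_after_unwrap = cell.string
print(child_count_after_unwrap, string_after_unwrap)

# smooth() merges adjacent text nodes throughout the tree.
soup.smooth()

# With a single text node per cell, the regex filter matches the cell again,
# and the whitespace-only cell is still excluded.
matches = soup.find_all("td", string=re.compile(r"\w"))
lines = [re.sub(r"\s+", " ", td.get_text().strip()) for td in matches]
print(lines)
```

If merging the tree is not wanted, the get_text() approach in the answer above works without smooth(), because get_text() concatenates every text node under the cell regardless of how many there are.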

Regarding "html - Unwrapping an element with beautifulsoup4: does it affect the .string of the parent element?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/28528594/
