
python - How to split one html page into multiple html files based on keywords in the text


I want to split a single html file into multiple html files based on the keyword PART. The given html file contains text that mentions four parts - PART I, PART II, PART III and PART IV.

I want to split the html into 5 parts:

  • Part 0 - should contain the text from the beginning of the html up to PART I
  • Part I - should contain the text from PART I up to PART II
  • Part II - should contain the text from PART II up to PART III
  • Part III - should contain the text from PART III up to PART IV
  • Part IV - should contain the text from PART IV to the end.

Here are some sample html files:

https://www.sec.gov/Archives/edgar/data/763744/000076374419000018/lcii-20181231.htm
https://www.sec.gov/Archives/edgar/data/820027/000082002719000010/amp12312018.htm
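Both sample filings separate printed pages with <hr> elements whose inline styles carry page-break rules, which is what my code keys on. A quick sanity check (a minimal sketch, reusing the first sample URL above):

import re
from urllib.request import urlopen

# Count the inline page-break markers in the first sample filing.
sample = "https://www.sec.gov/Archives/edgar/data/763744/000076374419000018/lcii-20181231.htm"
html = urlopen(sample).read().decode("utf-8")
print(len(re.findall(r"page-break-(?:before|after)", html, re.IGNORECASE)))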

Please refer to my code below:

import sys
import re
from bs4 import BeautifulSoup
import os
import numpy as np
from urllib.request import urlopen
import pandas as pd

list_values_page_number = []


type_parts = ['PART 0', 'PART I', 'PART II', 'PART III', 'PART IV']
output_path = r"D:\Tasks\10K\SEGMENTATION\2_segmentation"
input_files = ['https://www.sec.gov/Archives/edgar/data/763744/000076374419000018/lcii-20181231.htm',
               'https://www.sec.gov/Archives/edgar/data/820027/000082002719000010/amp12312018.htm']
input_folder = r'D:\Tasks\10K\input_files'
#content_segmentation_file_name = '/home/mobius365/Downloads/10-K_financial_documents/content_segmentation.csv'

#co_ent_nbr_links = dict(zip(list(input_data_frame["CO_Ent_Nbr"]),list(input_data_frame["Updated_Links"])))


def page_segmentation(list_of_content, prev_index, page_number):
    global Part_page_number
    global previous_index
    global count
    global store_index_list
    global output_file_storage_folder
    global file_content_prettified_list
    global part_repeat_storage_list
    global indices
    page_soup = BeautifulSoup(" ".join(list_of_content), "lxml")
    values_with_part = page_soup.findAll(text=re.compile("Par|PAR|ART"))
    list_of_values = []
    values_with_part = [values_list.strip() for values_list in values_with_part]
    for Part_values in values_with_part:
        if (("ART" in Part_values.strip()[:5] or "art" in Part_values.strip()[:5]) and Part_values.strip()[-1] in ["I", "V"] and len(Part_values) < 9):
            list_of_values.append(Part_values)
        elif (len(Part_values.strip()) < 6):
            list_of_values.append(Part_values)
        else:
            pass

    if len(list_of_values) == 1:
        values_parents_finder = page_soup.find(text=re.compile(list_of_values[0]))
        parent_0_value = values_parents_finder.findParents()[0].text.strip().upper()
        parent_1_value = values_parents_finder.findParents()[1].text.strip().upper()
        parent_0_value = parent_0_value.replace(u'\xa0', u' ')
        parent_1_value = parent_1_value.replace(u'\xa0', u' ')
        parent_0_value = re.sub(' +', '', parent_0_value)
        parent_1_value = re.sub(' +', '', parent_1_value)
        if ((parent_0_value[0] == 'P' and parent_0_value[-1] in ["I", "V"]) or (parent_1_value[0] == 'P' and (parent_1_value[-1] in ["I", "V"] or parent_1_value[-2:] in ["I.", "V."]))):

            if (parent_0_value[:4].upper() == 'PART' and (parent_0_value[-1] in ["I", "V"] or parent_0_value[-2:] in ["I.", "V."])):
                temp_name = re.sub('t', 't ', parent_0_value)
                temp_name = re.sub('T', 'T ', parent_0_value)
            else:
                temp_name = re.sub('t', 't ', parent_1_value)
                temp_name = re.sub('T', 'T ', parent_1_value)

            if (temp_name not in part_repeat_storage_list):
                part_repeat_storage_list.append(temp_name)
                Part_page_number[temp_name.upper()] = page_number
                next_level_index = prev_index
                with open(output_file_storage_folder + "/" + type_parts[count] + ".html", "w", encoding='utf-8') as file:
                    file.write(" ".join(file_content_prettified_list[previous_index:next_level_index]))
                    file.close()
                store_index_list.append((previous_index, next_level_index))
                previous_index = next_level_index
                count += 1
            else:
                pass
    elif len(list_of_values) == 2:
        for two_values in list_of_values:
            values_parents_finder = page_soup.find(text=re.compile(two_values[0]))
            parent_0_value = values_parents_finder.findParents()[0].text.strip().upper()
            parent_1_value = values_parents_finder.findParents()[1].text.strip().upper()
            parent_0_value = parent_0_value.replace(u'\xa0', u' ')
            parent_1_value = parent_1_value.replace(u'\xa0', u' ')
            parent_0_value = re.sub(' +', '', parent_0_value)
            parent_1_value = re.sub(' +', '', parent_1_value)
            if ((parent_0_value[0] == 'P' and parent_0_value[-1] in ["I", "V"]) or (parent_1_value[0] == 'P' and (parent_1_value[-1] in ["I", "V"] or parent_1_value[-2:] in ["I.", "V."]))):
                if (parent_0_value[:4].upper() == 'PART' and parent_0_value[-1] in ["I", "V"]):
                    temp_name = re.sub('t', 't ', parent_0_value)
                    temp_name = re.sub('T', 'T ', parent_0_value)
                else:
                    temp_name = re.sub('t', 't ', parent_1_value)
                    temp_name = re.sub('T', 'T ', parent_1_value)
                if (temp_name not in part_repeat_storage_list):

                    part_repeat_storage_list.append(temp_name)
                    next_level_index = prev_index
                    Part_page_number[temp_name.upper()] = page_number
                    with open(output_file_storage_folder + "/" + type_parts[count] + ".html", "w", encoding='utf-8') as file:
                        file.write(" ".join(file_content_prettified_list[previous_index:indices[indices.index(next_level_index) + 1]]))
                        file.close()
                    store_index_list.append((previous_index, next_level_index))
                    previous_index = next_level_index
                    count += 1


for link in input_files:
    html = urlopen(link).read().decode('utf-8')
    name = link.split('/')[-1]
    with open(input_folder + "/" + name, 'w', encoding='utf-8') as f:
        f.write(html)
        f.close()


for links in input_files:
    files = links.split("/")[-1]
    file_name = os.path.join(input_folder, files)
    print(file_name)
    output_file_storage_folder = os.path.join(output_path, files)
    if not os.path.exists(output_file_storage_folder):
        os.makedirs(output_file_storage_folder)
    try:
        file_content_reading = open(file_name, encoding="utf8").read()
    except Exception as e:
        print(e)
    file_content_bs = BeautifulSoup(file_content_reading, 'lxml')
    file_content_prettified_list = file_content_bs.prettify().split("\n")
    file_content_space_removed = [tags_values.strip() for tags_values in file_content_prettified_list]

    page_splits = file_content_bs.find_all(attrs={'style': re.compile('page-break-before|page-break-after', re.IGNORECASE)})
    if (len(page_splits) < 90):
        page_splits = page_splits
        indices = [index_number for index_number, html_tags in enumerate(file_content_space_removed) if ('page-break-after' in html_tags.lower() or 'page-break-before' in html_tags.lower())]
    else:
        page_splits = [tag_value for tag_value in page_splits if str(tag_value)[:2] != "<p"]
        indices = [index_number for index_number, html_tags in enumerate(file_content_space_removed) if ('page-break-after' in html_tags.lower() or 'page-break-before' in html_tags.lower())]

    type_parts = ['PART 0', 'PART I', 'PART II', 'PART III', 'PART IV']
    previous_index = 0
    store_index_list = []
    part_repeat_storage_list = []
    count = 0

    Part_page_number = {"PART 0": 0, "PART I": np.nan, "PART II": np.nan, "PART III": np.nan, "PART IV": np.nan}

    prev_index = 0
    count_page_number = 1

    for index_value in indices:
        next_index = index_value
        page_segmentation(file_content_space_removed[prev_index:index_value], prev_index, count_page_number)
        prev_index = next_index
        count_page_number += 1
    page_segmentation(file_content_space_removed[next_index:], prev_index, count_page_number)

    if (len(store_index_list) != 0):
        with open(output_file_storage_folder + "/" + type_parts[count] + ".html", "w", encoding='utf-8') as file:
            file.write(" ".join(file_content_prettified_list[store_index_list[-1][-1]:]))
            file.close()
    else:
        with open(output_file_storage_folder + "/" + type_parts[count] + ".html", "w", encoding='utf-8') as file:
            file.write(" ".join(file_content_prettified_list[:]))
            file.close()

    Part_page_number['File_Name'] = files
    list_values_page_number.append(Part_page_number)

df_summary = pd.DataFrame(list_values_page_number)
df_summary.to_excel("summary_10K_Page_Segmentation.xlsx", index=False)

With the code above, I am unable to split the html files the way I want.

EDIT:

I have added a new set of URLs.

https://www.sec.gov/Archives/edgar/data/887921/000088792119000004/rev201810-k.htm
https://www.sec.gov/Archives/edgar/data/104918/000010491819000053/ava-20181231x10k.htm
https://www.sec.gov/Archives/edgar/data/886982/000119312519050198/d669877d10k.htm
https://www.sec.gov/Archives/edgar/data/878927/000156459019004755/odfl-10k_20181231.htm
https://www.sec.gov/Archives/edgar/data/785161/000078516119000011/ehc10k123118.htm
https://www.sec.gov/Archives/edgar/data/1393818/000119312519061011/d663205d10k.htm
https://www.sec.gov/Archives/edgar/data/86521/000008652119000014/sre20181231form10k.htm
https://www.sec.gov/Archives/edgar/data/76282/000007628219000021/pkoh20181231-10k.htm
https://www.sec.gov/Archives/edgar/data/883237/000088323719000026/vrts1231201810-k.htm
https://www.sec.gov/Archives/edgar/data/883945/000088394519000016/usak-20181231.htm
https://www.sec.gov/Archives/edgar/data/1000623/000100062319000048/swmform10-k12312018.htm

Best Answer

Well, I wrote this quickly, so it is somewhat convoluted.

Let me explain the code:

  1. Split the document on the elements that separate the pages ( <hr style="page-break-after:always"></hr> ).
  2. Within each split page, find the text that marks a PART and combine the content accordingly.
  3. Save.

I will paste the code below. I hope it helps.

import requests
from bs4 import BeautifulSoup

# verify=False skips TLS certificate verification; drop it if your environment trusts sec.gov.
response = requests.get("https://www.sec.gov/Archives/edgar/data/763744/000076374419000018/lcii-20181231.htm", verify=False)
file_content_reading = response.text

# 1. Split the document on the <hr> elements that separate printed pages.
split_pages = file_content_reading.split('<hr style="page-break-after:always"></hr>')

# Pages containing these strings (the table of contents) are passed over without a cut.
skip_words = ['INDEX']

part_strings = [['PART I', 'PART I.', 'PART I. '],
                ['PART II', 'PART II.', 'PART II. '],
                ['PART III', 'PART III.', 'PART III. '],
                ['PART IV', 'PART IV.', 'PART IV. ']]

part_content_list = []
append_content = ""
part = 0


def matching_result(content_soup, list_string):
    # Return the first string in list_string that appears as a <span>'s text, else None.
    result = None
    for match_string in list_string:
        if content_soup.find("span", text=match_string) is not None:
            result = match_string
            break
    return result


# 2. Walk the pages; each time the next "PART ..." heading appears, cut the page there:
# everything before the heading closes the previous part, the rest opens the new one.
for page in split_pages:
    content = BeautifulSoup(page, "lxml")
    if (part < len(part_strings)) and matching_result(content, skip_words) is None:
        output = matching_result(content, part_strings[part])
        if output is not None:
            part += 1
            index = page.find(str(content.find("span", text=output)))
            first = page[:index]
            second = page[index:]
            part_content_list.append(append_content + first)
            # Carry only the remainder of this page into the new part.
            append_content = second
        else:
            append_content += page + '<hr style="page-break-after:always"></hr>'
    else:
        append_content += page

part_content_list.append(append_content)

# 3. Save each accumulated part as its own html file.
num = 0
for part in part_content_list:
    soup = BeautifulSoup(part, "lxml")
    with open("output" + str(num) + ".html", "w", encoding="utf-8") as file:
        file.write(str(soup))
    num += 1
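Two practical notes on running this. First, SEC.gov may reject automated requests that do not declare a User-Agent, so a bare requests.get can come back with an error page rather than the filing. Second, not every filing writes the page-break <hr> in exactly this form; the question's own code matches both page-break-before and page-break-after, case-insensitively. A hedged variant of the download-and-split step covering both points (the User-Agent value is a placeholder; substitute your own contact details):

import re
import requests

url = "https://www.sec.gov/Archives/edgar/data/763744/000076374419000018/lcii-20181231.htm"
# SEC EDGAR asks automated clients to identify themselves; this value is a placeholder.
headers = {"User-Agent": "Sample Company admin@example.com"}
file_content_reading = requests.get(url, headers=headers).text

# Split on any <hr ...> whose style mentions a page break, not just one exact tag string.
split_pages = re.split(
    r'<hr[^>]*page-break-(?:before|after)[^>]*>(?:\s*</hr>)?',
    file_content_reading,
    flags=re.IGNORECASE,
)
print(len(split_pages), "pages")

The resulting split_pages list can be fed straight into the loop above in place of the exact-string split.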

Regarding python - how to split one html page into multiple html files based on keywords in the text, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59550860/
