
python - Get the final rendered text of HTML

Reposted. Author: 行者123. Updated: 2023-12-03 20:56:27

I have the following string:

html = '<style>li { list-style-type: lower-alpha; }</style> <ol><li>hello</li></ol>'

Is there a Python library that can convert it into the following string?

'a. hello'

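EDIT: This needs to work for arbitrary HTML/CSS (so the type attribute of the ol tag, the value attribute of the li tag, CSS content counters, and presumably hundreds of other HTML/CSS list patterns and other things).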

EDIT 2: I tried Lynx, and it would actually work if Lynx could handle list-style-type and other common CSS, which it apparently cannot.

Best answer

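You can do this with a headless browser driven by something like Selenium WebDriver. A real browser is needed because we have to call Window.getComputedStyle() to find out which ol li items have a list-style-type value of lower-alpha. There is no way to read a list item's numeric or alphabetic marker as text; the browser draws it, but it is not part of the element's text.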

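We can generate these markers ourselves from the CSS and HTML parameters. HTML lists can get quite complicated: a list may have more than 26 items, in which case the letters must continue as aa., ab., and so on, and there are also the start and reversed attributes of ol. start defines where the numbering begins; for example, with <ol start="3"> counting starts at the letter c. The reversed attribute displays the list in descending order: c., b., a., and so on. We need to handle both cases.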
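The marker rules described above (letters beyond z, start, and reversed) can be sketched without a browser. This is a minimal illustration, not the answer's own code: `to_lower_alpha` and `list_markers` are hypothetical names, and the decimal fallback for ordinals below 1 is an assumption modeled on how browsers treat alphabetic counters that run out of range.

```python
import string


def to_lower_alpha(n):
    """Convert a positive ordinal to a lower-alpha marker (1 -> a, 27 -> aa)."""
    if n < 1:
        # Assumption: fall back to decimal, as browsers do for ordinals
        # that an alphabetic counter cannot represent.
        return str(n)
    letters = string.ascii_lowercase
    marker = ""
    while n > 0:
        # Bijective base-26: take the last letter, then shift one place left.
        n, remainder = divmod(n - 1, len(letters))
        marker = letters[remainder] + marker
    return marker


def list_markers(item_count, start=None, is_reversed=False):
    """Return the marker of each item of an <ol>, in document order."""
    if start is None:
        # Without an explicit start, a reversed list counts down from its length.
        start = item_count if is_reversed else 1
    step = -1 if is_reversed else 1
    return [to_lower_alpha(start + step * i) + "." for i in range(item_count)]


print(list_markers(4, start=3))           # ['c.', 'd.', 'e.', 'f.']
print(list_markers(3, is_reversed=True))  # ['c.', 'b.', 'a.']
print(to_lower_alpha(27), to_lower_alpha(703))  # aa aaa
```

Keeping the counter logic in pure functions like these makes it easy to check edge cases (item 27, item 703, a reversed list with an explicit start) separately from the Selenium part.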

Instructions for installing Selenium with the Chrome WebDriver

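  • Install Selenium with pip:
    pip3 install selenium
  • Download the Chrome WebDriver (ChromeDriver) and add it to your system PATH. Take care to choose the version that matches your local Chrome installation. At the time of writing, the latest Chrome was 80.0.3987.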

  • Python script that scrapes ordered list items and generates lower-alpha counters

    This script uses a live URL, but the comments at the bottom of the script explain how to pass custom HTML like yours to the webdriver through a 'data:text/html;' URL.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    import string


    def get_content(driver, link):
        driver.get(link)

        # Get all ordered lists on the page
        for ol in WebDriverWait(driver, 5).until(
                EC.presence_of_all_elements_located((By.CSS_SELECTOR, "ol"))):
            # Get all items from the current ordered list
            list_items = ol.find_elements(By.CSS_SELECTOR, "li")
            list_items_count = len(list_items)

            # Get the list start attribute, falling back to 1 if it is not present
            ol_start = int(ol.get_attribute("start") or 1)

            # Get the list reversed attribute, will return None if not present
            ol_reversed = ol.get_attribute("reversed")

            # Print information about the ordered list
            print("OL with %s items starting at %s, reversed: %s" % (
                list_items_count,
                ol_start,
                "yes" if ol_reversed else "no"))

            # Counter for the letters.
            # If the list is reversed begin count from the last item to the first,
            # else count from first (start) to last
            li_letter = list_items_count if ol_reversed else ol_start

            # Keep count how many list items found with lower-alpha list-style-type
            list_items_found = 0

            for li in list_items:
                # Execute javascript getComputedStyle to get the list item computed style
                list_style_type = driver.execute_script(
                    "return window.getComputedStyle(arguments[0])['list-style-type']", li)

                # If the list item computed style 'list-style-type' has 'lower-alpha' value
                if list_style_type == "lower-alpha":
                    # Print generated alpha counter and the item text
                    print("%s. %s" % (get_alpha_num(li_letter), li.text))

                    # If the list is reversed, decrease letter by 1, else increase it
                    li_letter += -1 if ol_reversed else 1

                    # Keep counting how many items found with 'lower-alpha'
                    list_items_found += 1

            # If no items with 'lower-alpha' found do something
            if list_items_found == 0:
                print("No list items found with 'lower-alpha' list style type")
            print()


    # Function to convert numbers to letters: 1 => a, 26 => z, 27 => aa
    def get_alpha_num(num):
        letters = string.ascii_lowercase
        letters_count = len(letters)
        result = ''

        # Bijective base-26: take the last letter, then shift one place left
        while num > 0:
            num, remainder = divmod(num - 1, letters_count)
            result = letters[remainder] + result

        return result


    if __name__ == '__main__':
        URL = 'https://zikro.gr/dbg/html/lists.html'

        # If you want to parse HTML code from a string
        # then you can use a 'data:text/html;' URL with the HTML contents like this:
        #
        # html_content = '<style>li { list-style-type: lower-alpha; }</style> <ol><li>hello</li><li>there</li></ol>'
        # URL = "data:text/html;charset=utf-8,{html_content}".format(html_content=html_content)
        #
        # Will result to this:
        # OL with 2 items starting at 1, reversed: no
        # a. hello
        # b. there

        chrome_options = Options()

        # Make headless
        # chrome_options.add_argument("--headless")

        with webdriver.Chrome(options=chrome_options) as driver:
            get_content(driver, URL)
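    The commented-out 'data:text/html;' example in the script interpolates the raw HTML into the URL. That works for simple markup, but a `#` in the HTML (for example an id selector in a style block) would be read as the start of a URL fragment and cut the document short. A hedged sketch of a safer variant, using the standard library's `urllib.parse.quote`; `html_to_data_url` is a hypothetical helper name, not part of the original answer:

```python
from urllib.parse import quote


def html_to_data_url(html_content):
    """Build a data: URL for an HTML string, percent-encoding unsafe characters."""
    return "data:text/html;charset=utf-8," + quote(html_content)


html = ('<style>#ol-a-css li { list-style-type: lower-alpha; }</style> '
        '<ol id="ol-a-css"><li>hello</li></ol>')
url = html_to_data_url(html)
print(url)
```

    The resulting string can be passed to `driver.get()` in place of the live URL.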

    Results

    If you try to parse a page with multiple lists, such as this one:

    #ol-a-css li {
    list-style-type: lower-alpha;
    }
    <ul>
    <li>Unordered list item 1</li>
    <li>Unordered list item 2</li>
    <li>Unordered list item 3</li>
    <li>Unordered list item 4</li>
    </ul>

    <ol>
    <li>Simple ordered list item 1</li>
    <li>Simple ordered list item 2</li>
    <li>Simple ordered list item 3</li>
    <li>Simple ordered list item 4</li>
    </ol>

    <ol type="a">
    <li>Lower alpha, ordered list item 1</li>
    <li>Lower alpha, ordered list item 2</li>
    <li>Lower alpha, ordered list item 3</li>
    <li>Lower alpha, ordered list item 4</li>
    </ol>

    <ol type="a" start="3">
    <li>Lower alpha start=3, ordered list item 1</li>
    <li>Lower alpha start=3, ordered list item 2</li>
    <li>Lower alpha start=3, ordered list item 3</li>
    <li>Lower alpha start=3, ordered list item 4</li>
    </ol>

    <ol type="a" reversed>
    <li>Lower alpha reversed, ordered list item 1</li>
    <li>Lower alpha reversed, ordered list item 2</li>
    <li>Lower alpha reversed, ordered list item 3</li>
    <li>Lower alpha reversed, ordered list item 4</li>
    </ol>

    <ol id="ol-a-css">
    <li>Lower alpha CSS, ordered list item 1</li>
    <li>Lower alpha CSS, ordered list item 2</li>
    <li>Lower alpha CSS, ordered list item 3</li>
    <li>Lower alpha CSS, ordered list item 4</li>
    <li>Lower alpha CSS, ordered list item 5</li>
    <li>Lower alpha CSS, ordered list item 6</li>
    <li>Lower alpha CSS, ordered list item 7</li>
    <li>Lower alpha CSS, ordered list item 8</li>
    <li>Lower alpha CSS, ordered list item 9</li>
    <li>Lower alpha CSS, ordered list item 10</li>
    <li>Lower alpha CSS, ordered list item 11</li>
    <li>Lower alpha CSS, ordered list item 12</li>
    <li>Lower alpha CSS, ordered list item 13</li>
    <li>Lower alpha CSS, ordered list item 14</li>
    <li>Lower alpha CSS, ordered list item 15</li>
    <li>Lower alpha CSS, ordered list item 16</li>
    <li>Lower alpha CSS, ordered list item 17</li>
    <li>Lower alpha CSS, ordered list item 18</li>
    <li>Lower alpha CSS, ordered list item 19</li>
    <li>Lower alpha CSS, ordered list item 20</li>
    <li>Lower alpha CSS, ordered list item 21</li>
    <li>Lower alpha CSS, ordered list item 22</li>
    <li>Lower alpha CSS, ordered list item 23</li>
    <li>Lower alpha CSS, ordered list item 24</li>
    <li>Lower alpha CSS, ordered list item 25</li>
    <li>Lower alpha CSS, ordered list item 26</li>
    <li>Lower alpha CSS, ordered list item 27</li>
    <li>Lower alpha CSS, ordered list item 28</li>
    <li>Lower alpha CSS, ordered list item 29</li>
    <li>Lower alpha CSS, ordered list item 30</li>
    <li>Lower alpha CSS, ordered list item 31</li>
    </ol>


    You will get results like this:

    OL with 4 items starting at 1, reversed: no
    No list items found with 'lower-alpha' list style type

    OL with 4 items starting at 1, reversed: no
    a. Lower alpha, ordered list item 1
    b. Lower alpha, ordered list item 2
    c. Lower alpha, ordered list item 3
    d. Lower alpha, ordered list item 4

    OL with 4 items starting at 3, reversed: no
    c. Lower alpha start=3, ordered list item 1
    d. Lower alpha start=3, ordered list item 2
    e. Lower alpha start=3, ordered list item 3
    f. Lower alpha start=3, ordered list item 4

    OL with 4 items starting at 1, reversed: yes
    d. Lower alpha reversed, ordered list item 1
    c. Lower alpha reversed, ordered list item 2
    b. Lower alpha reversed, ordered list item 3
    a. Lower alpha reversed, ordered list item 4

    OL with 31 items starting at 1, reversed: no
    a. Lower alpha CSS, ordered list item 1
    b. Lower alpha CSS, ordered list item 2
    c. Lower alpha CSS, ordered list item 3
    d. Lower alpha CSS, ordered list item 4
    e. Lower alpha CSS, ordered list item 5
    f. Lower alpha CSS, ordered list item 6
    g. Lower alpha CSS, ordered list item 7
    h. Lower alpha CSS, ordered list item 8
    i. Lower alpha CSS, ordered list item 9
    j. Lower alpha CSS, ordered list item 10
    k. Lower alpha CSS, ordered list item 11
    l. Lower alpha CSS, ordered list item 12
    m. Lower alpha CSS, ordered list item 13
    n. Lower alpha CSS, ordered list item 14
    o. Lower alpha CSS, ordered list item 15
    p. Lower alpha CSS, ordered list item 16
    q. Lower alpha CSS, ordered list item 17
    r. Lower alpha CSS, ordered list item 18
    s. Lower alpha CSS, ordered list item 19
    t. Lower alpha CSS, ordered list item 20
    u. Lower alpha CSS, ordered list item 21
    v. Lower alpha CSS, ordered list item 22
    w. Lower alpha CSS, ordered list item 23
    x. Lower alpha CSS, ordered list item 24
    y. Lower alpha CSS, ordered list item 25
    z. Lower alpha CSS, ordered list item 26
    aa. Lower alpha CSS, ordered list item 27
    ab. Lower alpha CSS, ordered list item 28
    ac. Lower alpha CSS, ordered list item 29
    ad. Lower alpha CSS, ordered list item 30
    ae. Lower alpha CSS, ordered list item 31

Regarding python - getting the final rendered text of HTML, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/60769015/
