
python - Web scraping an encoded price

Reposted. Author: 太空宇宙. Updated: 2023-11-03 19:56:27

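When scraping an article page, the price appears in the rendered elements (the browser's Elements view) but not in the page source. Instead, the source contains the following encoded text: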

<script>
var f3699334f586f4f2bb6edc10899026d63 = function(value) {
return base64UTF8Codec.decode(arguments[0])
};

replaceWith(
document.getElementById('9ad80ca8-79ac-4fd8-8998-cb6662e8cc9a'),
f3699334f586f4f2bb6edc10899026d63('CiAgICAgICAgICAgICAgICA8c3BhbiBjbGFzcz0icHVsbC1yaWdodCI+IDIuNTkwLC0gPC9zcGFuPgogICAgICAgICAgICA=')
);
</script>

How can I decode this text to get the price?


Best answer

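The text is Base64-encoded. If you can locate the right <script> tag with BeautifulSoup, you can pull the encoded string out of it with the re module and then decode it: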

import re
import base64
from bs4 import BeautifulSoup

txt = '''<script>
var f3699334f586f4f2bb6edc10899026d63 = function(value){return base64UTF8Codec.decode(arguments[0])};
replaceWith(document.getElementById('9ad80ca8-79ac-4fd8-8998-cb6662e8cc9a'), f3699334f586f4f2bb6edc10899026d63('CiAgICAgICAgICAgICAgICA8c3BhbiBjbGFzcz0icHVsbC1yaWdodCI+IDIuNTkwLC0gPC9zcGFuPgogICAgICAgICAgICA='));
</script>'''

soup = BeautifulSoup(txt, 'html.parser')

# 1. locate the right <script> tag
script = soup.script

# 2. get coded text from the script tag
coded_text = re.findall(r".*\('(.*?)'\)\);", script.text)[0]

# 3. decode the text
decoded_text = base64.b64decode(coded_text) # b'\n <span class="pull-right"> 2.590,- </span>\n '

# 4. get the price from the decoded text
soup2 = BeautifulSoup(decoded_text, 'html.parser')

print(soup2.span.get_text(strip=True))

Prints:

2.590,-
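
The value printed above is still a display string. If a number is needed, a minimal sketch like the following could normalize it. It assumes the site uses "." as the thousands separator and "," as the decimal separator, with ",-" meaning "no fractional part"; the original question does not confirm this, and the helper name parse_price is hypothetical.

```python
from decimal import Decimal


def parse_price(text: str) -> Decimal:
    """Convert a display price such as '2.590,-' to a Decimal.

    Assumption (not stated in the question): '.' separates thousands,
    ',' marks the decimal part, and ',-' means zero fractional digits.
    """
    cleaned = text.strip()
    # Split off the fractional part, if any: "2.590,-" -> "2.590" and "-".
    whole, _, fraction = cleaned.partition(",")
    whole = whole.replace(".", "")           # drop thousands separators
    fraction = fraction.strip().rstrip("-")  # ",-" carries no digits
    return Decimal(f"{whole}.{fraction or '0'}")


print(parse_price("2.590,-"))   # 2590.0
print(parse_price("1.234,50"))  # 1234.50
```

Decimal is used rather than float so that prices are not subject to binary rounding errors.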

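On a real page, soup.script returns only the first <script> tag, which may not be the one holding the price. One possible way to handle that is to scan every script tag and decode each match, as sketched here. The sketch assumes every obfuscated fragment follows the same replaceWith(document.getElementById('...'), fn('...')) pattern as the single snippet in the question; the names FRAGMENT_RE and decode_fragments and the sample page are illustrative only.

```python
import base64
import re

from bs4 import BeautifulSoup

# Matches: getElementById('<id>'), <fn>('<base64>')
# The pattern is inferred from the one snippet shown in the question.
FRAGMENT_RE = re.compile(
    r"getElementById\('([^']+)'\)\s*,\s*\w+\('([A-Za-z0-9+/=]+)'\)"
)


def decode_fragments(html: str) -> dict:
    """Map each target element id to its decoded, whitespace-stripped text."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for script in soup.find_all("script"):
        source = script.string or ""  # None for empty or external scripts
        for element_id, payload in FRAGMENT_RE.findall(source):
            fragment = base64.b64decode(payload).decode("utf-8")
            text = BeautifulSoup(fragment, "html.parser").get_text(strip=True)
            result[element_id] = text
    return result


# Hypothetical page: an unrelated script plus the script from the question.
page = """
<html><body>
<div id="9ad80ca8-79ac-4fd8-8998-cb6662e8cc9a"></div>
<script>var unrelated = 1;</script>
<script>
var f3699334f586f4f2bb6edc10899026d63 = function(value){return base64UTF8Codec.decode(arguments[0])};
replaceWith(document.getElementById('9ad80ca8-79ac-4fd8-8998-cb6662e8cc9a'), f3699334f586f4f2bb6edc10899026d63('CiAgICAgICAgICAgICAgICA8c3BhbiBjbGFzcz0icHVsbC1yaWdodCI+IDIuNTkwLC0gPC9zcGFuPgogICAgICAgICAgICA='));
</script>
</body></html>
"""

print(decode_fragments(page))
# {'9ad80ca8-79ac-4fd8-8998-cb6662e8cc9a': '2.590,-'}
```

Keying the result by element id makes it possible to match each decoded fragment back to the placeholder element it would have replaced in the browser.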
Regarding "python - Web scraping an encoded price", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59506744/
