python - Extracting the content between specific <script> </script> tags

Reposted. Author: 行者123. Updated: 2023-12-01 04:43:07

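I can extract all the tags with the code below. However, I don't know how to look at what is inside the <script> and </script> tags. In particular, suppose I only want this part (there is more content in between, but I'm not interested in it):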

<script>
var quoteDataObj = [{"symbol":"CLCV1","symbolType":"symbol","code":0,"name":"WTI Crude Oil (Jun\u002715)","shortName":"OIL","last":"59.54","exchange":"New York Mercantile Exchange","source":"","open":"60.69","high":"61.31","low":"59.14","change":"-1.39","currencyCode":"USD","timeZone":"EDT","volume":"189607","provider":"CNBC Quote Cache","altSymbol":"CL/M5","curmktstatus":"REG_MKT","realTime":"false","assetType":"DERIVATIVE","noStreaming":"false","encodedSymbol":"CLCV1"}]
</script>

I'm not sure what code I need to add. I need to get the comma-separated content between [{ and }] into a Python dictionary.

Edited to incorporate the suggestion from the answer:

# -*- coding: utf-8 -*-
"""
Created on Thu May 7 10:31:02 2015

@author: idf
"""

import re
import json
import urllib2
from lxml import etree


url='http://data.cnbc.com/quotes/CLCV1'

def wgetUrl(target):
    try:
        req = urllib2.Request(target)
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3 Gecko/2008092417 Firefox/3.0.3')
        response = urllib2.urlopen(req)
        outtxt = response.read()
        response.close()
    except:
        return ''

    return outtxt

def extract_text(elem):
    if elem is None:
        print None
    else:
        return ''.join(i for i in elem.itertext())

content = wgetUrl(url)
node = etree.HTML(content)
parser = etree.HTMLParser()


nodes = node.findall(r'.//script')
for x in nodes:
    matches = re.findall(r'quoteDataObj\s\=\s(\[.+\])', x)
    if len(matches) > 0:
        python_dict = json.loads(matches[0])

Best answer

You can run a regular expression over the script to find the quoteDataObj variable and load its contents as JSON. Example:

import re
import json

#...your code...

content = wgetUrl(url)
matches = re.findall(r'quoteDataObj\s\=\s\[(\{.+\})\]', content)
if len(matches) > 0:
    python_dict = json.loads(matches[0])

Output:

{u'altSymbol': u'CL/M5',
u'assetType': u'DERIVATIVE',
u'change': u'-1.39',
u'code': 0,
u'curmktstatus': u'REG_MKT',
u'currencyCode': u'USD',
u'encodedSymbol': u'CLCV1',
u'exchange': u'New York Mercantile Exchange',
u'high': u'61.31',
u'last': u'59.54',
u'low': u'59.14',
u'name': u"WTI Crude Oil (Jun'15)",
u'noStreaming': u'false',
u'open': u'60.69',
u'provider': u'CNBC Quote Cache',
u'realTime': u'false',
u'shortName': u'OIL',
u'source': u'',
u'symbol': u'CLCV1',
u'symbolType': u'symbol',
u'timeZone': u'EDT',
u'volume': u'189607'}
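
The greedy `.+` in the pattern above works because the array holds a single object on one line. As a hedged alternative, the sketch below locates the assignment with a regex and then lets `json.JSONDecoder.raw_decode` find the end of the value by itself. It is written for Python 3 (the original code targets Python 2), the helper name `extract_js_value` is hypothetical, and `PAGE` is an abbreviated copy of the snippet from the question rather than a live download.

```python
import json
import re

# Abbreviated sample from the question; the raw string keeps \u0027 intact
# so that the JSON decoder (not Python) interprets the escape.
PAGE = r'''<html><body>
<script>
var quoteDataObj = [{"symbol":"CLCV1","name":"WTI Crude Oil (Jun\u002715)","last":"59.54","code":0}]
</script>
</body></html>'''


def extract_js_value(text, var_name):
    """Return the JSON value assigned to var_name in text, or None if absent.

    Hypothetical helper: assumes the assigned value is valid JSON.
    """
    match = re.search(re.escape(var_name) + r'\s*=\s*', text)
    if match is None:
        return None
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever follows it (here: the closing </script> tag).
    value, _end = json.JSONDecoder().raw_decode(text, match.end())
    return value


quotes = extract_js_value(PAGE, 'quoteDataObj')  # a list of dicts
quote = quotes[0]                                # the single quote record
print(quote['symbol'], quote['last'])
```

Because the whole array is decoded, this returns a list; indexing with `[0]` gives the same dictionary the regex-based version produces, and it would keep working if the page ever put several objects in the array.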

Parsing with LXML

The OP expressed interest in how to solve the problem by parsing with LXML. Here it is:

import re
import json

#...your code...

for x in nodes:
    matches = re.findall(r'quoteDataObj\s\=\s\[(\{.+\})\]', str(x.text))
    if len(matches) > 0:
        python_dict = json.loads(matches[0])
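
The key point in the loop above is that the regex has to run on the text of each script element, not on the element object. To try the same idea without installing lxml, here is a minimal Python 3 sketch that swaps in the standard library's `html.parser` (not the lxml API used above) to collect script contents. The class name `ScriptCollector` and the inline `PAGE` sample are assumptions made for illustration.

```python
from html.parser import HTMLParser
import json
import re


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element (illustrative)."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._buffer = None  # list of text chunks while inside <script>

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'script' and self._buffer is not None:
            self.scripts.append(''.join(self._buffer))
            self._buffer = None


# Abbreviated stand-in for the downloaded page, with one unrelated script.
PAGE = r'''<html><head><script>var other = 1;</script></head><body>
<script>
var quoteDataObj = [{"symbol":"CLCV1","shortName":"OIL","last":"59.54"}]
</script>
</body></html>'''

collector = ScriptCollector()
collector.feed(PAGE)
collector.close()

python_dict = None
for script_text in collector.scripts:
    # Same pattern as in the answer: capture the single object inside [ ].
    matches = re.findall(r'quoteDataObj\s\=\s\[(\{.+\})\]', script_text)
    if matches:
        python_dict = json.loads(matches[0])
        break
```

Scripts that do not define `quoteDataObj` simply produce no match and are skipped, which mirrors what the lxml loop does with `x.text`.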

Regarding "python - Extracting the content between specific <script> </script> tags", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/30106669/
