
javascript - Extracting multi-line JavaScript content from a <script> tag with Scrapy

Reposted. Author: 太空狗. Updated: 2023-10-30 01:51:04

I'm trying to use Scrapy to extract data from this script tag:

<script>
var hardwareTemplateFunctions;
var storefrontContextUrl = '';

jq(function() {
    var data = new Object();
    data.hardwareProductCode = '9054832';
    data.offeringCode = 'SMART_BASIC.TLF12PLEAS';
    data.defaultTab = '';
    data.categoryId = 10001;

    data.bundles = new Object();
    data.bundles['SMART_SUPERX.TLF12PLEAS'] = {
        signupFee: parsePrice('0'),
        newMsisdnFee: parsePrice('199'),
        upfrontPrice: parsePrice('1099'),
        monthlyPrice: parsePrice('499'),
        commitmentTime: parsePrice('12'),
        offeringTitle: 'SMART Super',
        offeringType: 'VOICE',
        monthlyPrice: parsePrice('499'),
        commitmentTime: 12
    };
    data.bundles['SMART_PLUSS.TLF12PLEAS'] = {
        signupFee: parsePrice('0'),
        newMsisdnFee: parsePrice('199'),
        upfrontPrice: parsePrice('1599'),
        monthlyPrice: parsePrice('399'),
        commitmentTime: parsePrice('12'),
        offeringTitle: 'SMART Pluss',
        offeringType: 'VOICE',
        monthlyPrice: parsePrice('399'),
        commitmentTime: 12
    };
    data.bundles['SMART_BASIC.TLF12PLEAS'] = {
        signupFee: parsePrice('0'),
        newMsisdnFee: parsePrice('199'),
        upfrontPrice: parsePrice('2199'),
        monthlyPrice: parsePrice('299'),
        commitmentTime: parsePrice('12'),
        offeringTitle: 'SMART Basis',
        offeringType: 'VOICE',
        monthlyPrice: parsePrice('299'),
        commitmentTime: 12
    };
    data.bundles['SMART_MINI.TLF12PLEAS'] = {
        signupFee: parsePrice('0'),
        newMsisdnFee: parsePrice('199'),
        upfrontPrice: parsePrice('2999'),
        monthlyPrice: parsePrice('199'),
        commitmentTime: parsePrice('12'),
        offeringTitle: 'SMART Mini',
        offeringType: 'VOICE',
        monthlyPrice: parsePrice('199'),
        commitmentTime: 12
    };
    data.bundles['KONTANT_KOMPLETT.REGULAR'] = {
        signupFee: parsePrice('0'),
        newMsisdnFee: parsePrice('0'),
        upfrontPrice: parsePrice('3499'),
        monthlyPrice: parsePrice('0'),
        commitmentTime: parsePrice('0'),
        offeringTitle: 'SMART Kontant',
        offeringType: 'PREPAID',
        monthlyPrice: parsePrice('0'),
        commitmentTime: 0
    };

    data.reviewJson = new Object();


    hardwareTemplateFunctions = hardwareTemplateFunctions(data);
    hardwareTemplateFunctions.init();

    data.reviewSummaryBox = hardwareTemplateFunctions.reviewSummaryBox;

    accessoryFunctions(data).init();
    additionalServiceFunctions(data).init();
});

function parsePrice(str) {
    var price = parseFloat(str);
    return isNaN(price) ? 0 : price;
}

var offerings = {};
</script>

I want to get the data from each section, like this one:

data.bundles['SMART_SUPERX.TLF12PLEAS'] = {
    signupFee: parsePrice('0'),
    newMsisdnFee: parsePrice('199'),
    upfrontPrice: parsePrice('1099'),
    monthlyPrice: parsePrice('499'),
    commitmentTime: parsePrice('12'),
    offeringTitle: 'SMART Super',
    offeringType: 'VOICE',
    monthlyPrice: parsePrice('499'),
    commitmentTime: 12
};

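Then I want to read each field and end up with the final value, for example the value of upfrontPrice (1099 in this case).

I tried to get each object with this:

items = response.xpath('//script/text()').re("data.bundles\[.*\](.*)")

But that only gives me the first line of each match (= {). So how should I do this? Is there a better way to extract this data from a script tag?

Edit: when I use items = response.xpath('//script/text()').re("data.bundles\[.*\] = {( (?s).*) };") I seem to get only the last block (the one with data.bundles['KONTANT_KOMPLETT.REGULAR']).

How do I get a list of all of them?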

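Best answer

If you'd rather not fiddle with regular expressions, you can use js2xml, which parses JavaScript code and converts it into an lxml document. You can then use XPath to query content from the JavaScript statements. (Disclaimer: I wrote and maintain js2xml.)

Here is sample code showing how to get those data.bundles assignments: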

import scrapy

selector = scrapy.Selector(text="""<script>
var hardwareTemplateFunctions;
var storefrontContextUrl = '';

jq(function() {
var data = new Object();
data.hardwareProductCode = '9054832';
data.offeringCode = 'SMART_BASIC.TLF12PLEAS';
data.defaultTab = '';
data.categoryId = 10001;

data.bundles = new Object();
data.bundles['SMART_SUPERX.TLF12PLEAS'] = {
signupFee: parsePrice('0'),
newMsisdnFee: parsePrice('199'),
upfrontPrice: parsePrice('1099'),
monthlyPrice: parsePrice('499'),
commitmentTime: parsePrice('12'),
offeringTitle: 'SMART Super',
offeringType: 'VOICE',
monthlyPrice: parsePrice('499'),
commitmentTime: 12
};
data.bundles['SMART_PLUSS.TLF12PLEAS'] = {
signupFee: parsePrice('0'),
newMsisdnFee: parsePrice('199'),
upfrontPrice: parsePrice('1599'),
monthlyPrice: parsePrice('399'),
commitmentTime: parsePrice('12'),
offeringTitle: 'SMART Pluss',
offeringType: 'VOICE',
monthlyPrice: parsePrice('399'),
commitmentTime: 12
};
data.bundles['SMART_BASIC.TLF12PLEAS'] = {
signupFee: parsePrice('0'),
newMsisdnFee: parsePrice('199'),
upfrontPrice: parsePrice('2199'),
monthlyPrice: parsePrice('299'),
commitmentTime: parsePrice('12'),
offeringTitle: 'SMART Basis',
offeringType: 'VOICE',
monthlyPrice: parsePrice('299'),
commitmentTime: 12
};
data.bundles['SMART_MINI.TLF12PLEAS'] = {
signupFee: parsePrice('0'),
newMsisdnFee: parsePrice('199'),
upfrontPrice: parsePrice('2999'),
monthlyPrice: parsePrice('199'),
commitmentTime: parsePrice('12'),
offeringTitle: 'SMART Mini',
offeringType: 'VOICE',
monthlyPrice: parsePrice('199'),
commitmentTime: 12
};
data.bundles['KONTANT_KOMPLETT.REGULAR'] = {
signupFee: parsePrice('0'),
newMsisdnFee: parsePrice('0'),
upfrontPrice: parsePrice('3499'),
monthlyPrice: parsePrice('0'),
commitmentTime: parsePrice('0'),
offeringTitle: 'SMART Kontant',
offeringType: 'PREPAID',
monthlyPrice: parsePrice('0'),
commitmentTime: 0
};

data.reviewJson = new Object();


hardwareTemplateFunctions = hardwareTemplateFunctions(data);
hardwareTemplateFunctions.init();

data.reviewSummaryBox = hardwareTemplateFunctions.reviewSummaryBox;

accessoryFunctions(data).init();
additionalServiceFunctions(data).init();
});

function parsePrice(str) {
var price = parseFloat(str);
return isNaN(price) ? 0 : price;
}

var offerings = {};
</script>""")

(This first part just loads the HTML input into a Scrapy selector.)

import js2xml
import pprint

data_bundles = {}
for script in selector.xpath('//script/text()').extract():
    # this is how you turn JavaScript code into an XML document (an lxml document, in fact)
    jstree = js2xml.parse(script)

    # then, we're interested in assignments to the data.bundles object
    for a in jstree.xpath('//assign[left//property/identifier/@name="bundles" and right/object]'):
        # the assigned property is given by a <string> property from a <bracketaccessor>
        bundle_prop = a.xpath('./left/bracketaccessor/property/string/text()')
        # xpath() returns a list (never None), so skip assignments without a string key
        if not bundle_prop:
            continue
        curr_prop = bundle_prop[0]

        data_bundles[curr_prop] = {}

        # the left object is assigned an object (inside a <right> element);
        # let's loop on the <property> elements:
        # the values are either numbers or string arguments of a function call
        for prop in a.xpath('./right/object/property'):
            data_bundles[curr_prop][prop.xpath('@name')[0]] = prop.xpath('.//number/@value | .//string/text()')[0]

pprint.pprint(data_bundles)

This is what you get from it:

{'KONTANT_KOMPLETT.REGULAR': {'commitmentTime': '0',
                              'monthlyPrice': '0',
                              'newMsisdnFee': '0',
                              'offeringTitle': 'SMART Kontant',
                              'offeringType': 'PREPAID',
                              'signupFee': '0',
                              'upfrontPrice': '3499'},
 'SMART_BASIC.TLF12PLEAS': {'commitmentTime': '12',
                            'monthlyPrice': '299',
                            'newMsisdnFee': '199',
                            'offeringTitle': 'SMART Basis',
                            'offeringType': 'VOICE',
                            'signupFee': '0',
                            'upfrontPrice': '2199'},
 'SMART_MINI.TLF12PLEAS': {'commitmentTime': '12',
                           'monthlyPrice': '199',
                           'newMsisdnFee': '199',
                           'offeringTitle': 'SMART Mini',
                           'offeringType': 'VOICE',
                           'signupFee': '0',
                           'upfrontPrice': '2999'},
 'SMART_PLUSS.TLF12PLEAS': {'commitmentTime': '12',
                            'monthlyPrice': '399',
                            'newMsisdnFee': '199',
                            'offeringTitle': 'SMART Pluss',
                            'offeringType': 'VOICE',
                            'signupFee': '0',
                            'upfrontPrice': '1599'},
 'SMART_SUPERX.TLF12PLEAS': {'commitmentTime': '12',
                             'monthlyPrice': '499',
                             'newMsisdnFee': '199',
                             'offeringTitle': 'SMART Super',
                             'offeringType': 'VOICE',
                             'signupFee': '0',
                             'upfrontPrice': '1099'}}
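
Every value in that dictionary is a string, including the prices, because the XPath reads attribute values and text nodes. If numeric values are needed (such as 1099 for upfrontPrice), one possible post-processing step is sketched below. The helper names coerce_value and coerce_bundles are hypothetical and not part of js2xml or Scrapy, and the sample dictionary is a shortened copy of one entry from the output above.

```python
import re

# Matches plain integers or decimals such as "1099" or "12.5" (an assumption
# about what counts as numeric; adjust if the page uses other formats).
_NUMERIC = re.compile(r'-?\d+(?:\.\d+)?')


def coerce_value(value):
    """Return an int or float for numeric strings, else the string unchanged."""
    if _NUMERIC.fullmatch(value):
        number = float(value)
        return int(number) if number.is_integer() else number
    return value


def coerce_bundles(bundles):
    """Apply coerce_value to every field of every bundle."""
    return {
        code: {field: coerce_value(raw) for field, raw in fields.items()}
        for code, fields in bundles.items()
    }


# Shortened sample in the same shape as the data_bundles output above.
sample = {
    'SMART_SUPERX.TLF12PLEAS': {
        'commitmentTime': '12',
        'offeringTitle': 'SMART Super',
        'upfrontPrice': '1099',
    },
}
typed = coerce_bundles(sample)
print(typed['SMART_SUPERX.TLF12PLEAS']['upfrontPrice'])  # 1099
```

Titles and type codes such as 'SMART Super' or 'VOICE' pass through unchanged, since they do not match the numeric pattern.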

For more information on the XML schema you get from js2xml.parse(), see https://github.com/redapple/js2xml/blob/master/SCHEMA.rst
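
For comparison, the regex attempts in the question can be made to work for this particular script, although they are more fragile than parsing the JavaScript. The pattern from the question's edit returns only the last block because `.*` is greedy: with dot-matches-newline enabled, `\[.*\]` stretches from the first bundle key to the last one. A minimal sketch with a non-greedy body and one match per bundle follows. It uses only the standard re module; SCRIPT is a shortened excerpt of the page's script, and BUNDLE_RE, FIELD_RE and extract_bundles are hypothetical names chosen for illustration.

```python
import re

# Shortened excerpt of the script text (two bundles, a few fields each).
SCRIPT = """
data.bundles = new Object();
data.bundles['SMART_SUPERX.TLF12PLEAS'] = {
signupFee: parsePrice('0'),
upfrontPrice: parsePrice('1099'),
offeringTitle: 'SMART Super',
commitmentTime: 12
};
data.bundles['KONTANT_KOMPLETT.REGULAR'] = {
signupFee: parsePrice('0'),
upfrontPrice: parsePrice('3499'),
offeringTitle: 'SMART Kontant',
commitmentTime: 0
};
"""

# One match per bundle: the quoted key, then the body up to the first "};".
# ".*?" is non-greedy, and re.DOTALL lets "." cross line breaks.
BUNDLE_RE = re.compile(r"data\.bundles\['([^']+)'\]\s*=\s*\{(.*?)\};", re.DOTALL)

# One match per field: a name followed by parsePrice('x'), a quoted string,
# or a bare number.
FIELD_RE = re.compile(r"(\w+):\s*(?:parsePrice\('([^']*)'\)|'([^']*)'|(-?[\d.]+))")


def extract_bundles(script_text):
    """Return {bundle_key: {field: value_as_string}} for each bundle block."""
    bundles = {}
    for key, body in BUNDLE_RE.findall(script_text):
        fields = {}
        for name, price, text, number in FIELD_RE.findall(body):
            # Exactly one of the three alternatives matched; the others are ''.
            # A repeated field name simply overwrites the earlier value.
            fields[name] = price or text or number
        bundles[key] = fields
    return bundles


bundles = extract_bundles(SCRIPT)
print(bundles['SMART_SUPERX.TLF12PLEAS']['upfrontPrice'])  # 1099

# The greedy variant, similar to the question's second attempt, yields a
# single match holding only the last block's body.
greedy = re.findall(r"data\.bundles\[.*\] = \{(.*)\};", SCRIPT, re.DOTALL)
print(len(greedy))  # 1
```

This relies on the exact layout of the page's script (single-quoted keys, blocks ending in "};", no nested objects), so it would need adjusting if the markup changes.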

Regarding "javascript - Extracting multi-line JavaScript content from a <script> tag with Scrapy", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/27754760/
