gpt4 book ai didi

ruby - 使用 Ruby、Nokogiri 和 Mechanize 解析网页中的 java cookie 链接

转载 作者:行者123 更新时间:2023-12-04 16:19:07 25 4
gpt4 key购买 nike

每个人。
我需要解析一个为每个链接设置了 java cookie 的网页。我可以解析正常搜索,并显示每个产品并将其导入到 mysql 数据库中。
我能够使用以下代码从搜索结果中抓取每个产品及其元素:
这就是我所拥有的:

    require 'rubygems'
require 'logger'
require 'mechanize'
require 'mysql2'

agent = WWW::Mechanize.new{|a| a.log = Logger.new(STDERR) }
#agent.set_proxy('a-proxy', '8080')
agent.read_timeout = 60

def add_cookie(agent, uri, cookie)
uri = URI.parse(uri)
Mechanize::Cookie.parse(uri, cookie) do |cookie|
agent.cookie_jar.add(uri, cookie)
end
end


# get main page
page = agent.get "http://www.site.com.mx"

# get login form
form = page.forms.first
form.correo_ingresar = "user"
form.password = "password"

# submit login form
page = agent.submit form

# parse cookies
myarray = page.body.scan(/SetCookie\(\"(.+)\", \"(.+)\"\)/)

# set session cookies
myarray.each do |item|
add_cookie(agent, 'http://www.site.com.mx', "#{item[0]}=#{item[1]}; path=/; domain=www.site.com.mx")
end
# show 1000 search results per page
add_cookie(agent, 'http://www.site.com.mx', "tampag=1000; path=/; domain=www.site.com.mx")

# order results
add_cookie(agent, 'http://www.site.com.mx', "orden_articulos=existencias asc; path=/; domain=www.site.com.mx")

# section results
add_cookie (agent, 'http://www.site.com.mx', "codigoseccion_buscar=14; path=/; domain=www.site.com.mx")

# get main page
page = agent.get "http://www.site.com.mx/tienda/index.php"

search_form = page.forms.first

search_result = agent.submit search_form

doc = Nokogiri::HTML(search_result.body)

rows = doc.css("table.articulos tr")

i = 0
details = rows.collect do |row|
detail = {}
[
[:sku, 'td[3]/text()'],
[:desc, 'td[4]/text()'],
[:qty, 'td[5]/text()'],
[:qty2, 'td[5]/p/b/text()'],
[:price, 'td[6]/text()']
].collect do |name, xpath|
detail[name] = row.at_xpath(xpath).to_s.strip
end
i = i + 1
detail
end

# walk through paginator links
links = doc.css("a.paginar").map {|l| "http://www.site.com.mx#{l['href']}"}.uniq!

links.each do |l|
page = agent.get l

doc = Nokogiri::HTML(page.body)

rows = doc.css("table.articulos tr")

rows.each do |row|
detail = {}
[
[:sku, 'td[3]/text()'],
[:desc, 'td[4]/text()'],
[:qty, 'td[5]/text()'],
[:qty2, 'td[5]/p/b/text()'],
[:price, 'td[6]/text()']
].collect do |name, xpath|
detail[name] = row.at_xpath(xpath).to_s.strip
end
details << detail
end
end

# update db
client = Mysql2::Client.new(:host => "localhost", :username => "myusername", :password => "mypassword", :database => "mydatabase")

details.each do |d|
if d[:sku] != ""
price = d[:price].split

if price[1] == "D"
currency = 144
else
currency = 168
end

cost = price[0].gsub(",", "").to_f

if d[:qty] == ""
qty = d[:qty2]
else
qty = d[:qty]
end

results = client.query("SELECT * FROM jos_vm_product WHERE product_sku = '#{d[:sku]}' LIMIT 1;")
if results.count == 1
product = results.first

client.query("UPDATE jos_vm_product SET product_sku = '#{d[:sku]}', product_name = '#{d[:desc]}', product_desc = '#{d[:desc]}', product_in_stock = '#{qty}' WHERE product_id =
#{product['product_id']};")

client.query("UPDATE jos_vm_product_price SET product_price = '#{cost}', product_currency = '#{currency}' WHERE product_id = '#{product['product_id']}';")
else
client.query("INSERT INTO jos_vm_product(product_sku, product_name, product_desc, product_in_stock) VALUES('#{d[:sku]}', '#{d[:desc]}', '#{d[:desc]}', '#{qty}');")
last_id = client.last_id

client.query("INSERT INTO jos_vm_product_price(product_id, product_price, product_currency) VALUES('#{last_id}', '#{cost}', #{currency});")
end
end
end
现在我不想搜索我想从类别列表中解析:
主页链接:http://www.site.com.mx/tienda/articulos.php?opcion=lineas&seccion_mostrar=11
这显示了这样的表格(所有内容都包含链接)
顶部名称:ACCESSORIOS 是 ACCESORIOS 类别的链接,下面列出的粗体名称是子类别,粗体名称下面的是品牌。如果我点击 ACCESORIOS,它会显示每个品牌和每个混合的子类别,依此类推。
饰品
配件多媒体(6)
墨西哥 ACTECK (5), 曼哈顿 (1)
配件 P/impress. Punto De Venta(1)
爱普生公司 (1)
Accesorios Para Cableados De Patch Panels(1)
智能网络解决方案 (1)
Accesorios Para Camaras Digitales(1)
曼哈顿 (1)
Accesorios Para Computadoras De Escritorio(32)
墨西哥 ACTECK (2)、GENERICA (1)、曼哈顿 (28)、TARGUS (1)
Accesorios Para Computadoras Portatiles(60)
ACTECK DE MEXICO (3)、GENIUS (2)、HP COMERCIAL (2)、HP IMPRESION (1)、MANHATTAN (17)、PERFECT CHOICES (32)、SOLIDEX (1)、TARGUS (1)、TECH Zone (1)
配件 Para Ipod(3)
墨西哥 ACTECK (1),完美的选择 (2)
配件 Para Mesas(3)
曼哈顿 (2),完美的选择 (1)
配件 Para Redes(13)
英特尔网络解决方案 (5), 曼哈顿 (8)
Accesoriso Para Celulares(14)
黑莓 (14)
适配器蓝牙(6)
墨西哥 ACTECK (1)、曼哈顿 (2)、完美选择 (3)
Adaptadores Para Mouse Y Teclado(3)
曼哈顿 (2),完美的选择 (1)
Audifono/diademas Y Microfonos(49)
墨西哥 ACTECK (14)、BTO (1)、天才 (3)、罗技 (2)、曼哈顿 (11)、完美选择 (18)
这是表格的代码,每个链接都有 cookie,这就是为什么我一直很难抓取它。
    <table width="95%" cellspacing="0" cellpadding="3" border="0">
<tbody>
<tr>
<td valign="top" align="left" style="font-family: verdana; font-size: 12px" colspan="2"><a onClick="fijar_filtro('codigoseccion_buscar','11')" href="javascript:void(0)" class="busquedas"><b>ACCESORIOS</b></a></td>
</tr>
<tr>
<td width="20" valign="top" align="left"></td>
<td valign="top" align="left" style="font-family: verdana; font-size: 12px"><a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','338')" href="javascript:void(0)" class="busquedas"><b>Accesorios Multimedia</b>(6)</a><br>
<a onClick="SetCookie('codigolinea_buscar','338');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (5)</a>, <a onClick="SetCookie('codigolinea_buscar','338');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','540')" href="javascript:void(0)" class="busquedas"><b>Accesorios P/impres. Punto De Venta</b>(1)</a><br>
<a onClick="SetCookie('codigolinea_buscar','540');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','106');" href="javascript:void(0)" class="busquedas">EPSON CORPORATION (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','542')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Cableados De Patch Panels</b>(1)</a><br>
<a onClick="SetCookie('codigolinea_buscar','542');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','635');" href="javascript:void(0)" class="busquedas">INTELLINET NETWORK SOLUTIONS (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','361')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Camaras Digitales</b>(1)</a><br>
<a onClick="SetCookie('codigolinea_buscar','361');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','277')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Computadoras De Escritorio</b>(32)</a><br>
<a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (2)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','530');" href="javascript:void(0)" class="busquedas">GENERICA (1)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (28)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','586');" href="javascript:void(0)" class="busquedas">TARGUS (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','357')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Computadoras Portatiles</b>(60)</a><br>
<a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (3)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','167');" href="javascript:void(0)" class="busquedas">GENIUS (2)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','694');" href="javascript:void(0)" class="busquedas">HP COMERCIAL (2)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','107');" href="javascript:void(0)" class="busquedas">HP IMPRESION (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (17)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (32)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','212');" href="javascript:void(0)" class="busquedas">SOLIDEX (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','586');" href="javascript:void(0)" class="busquedas">TARGUS (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','691');" href="javascript:void(0)" class="busquedas">TECH ZONE (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1302')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Ipod</b>(3)</a><br>
<a onClick="SetCookie('codigolinea_buscar','1302');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (1)</a>, <a onClick="SetCookie('codigolinea_buscar','1302');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (2)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1175')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Mesas</b>(3)</a><br>
<a onClick="SetCookie('codigolinea_buscar','1175');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','1175');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','292')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Redes</b>(13)</a><br>
<a onClick="SetCookie('codigolinea_buscar','292');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','635');" href="javascript:void(0)" class="busquedas">INTELLINET NETWORK SOLUTIONS (5)</a>, <a onClick="SetCookie('codigolinea_buscar','292');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (8)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1378')" href="javascript:void(0)" class="busquedas"><b>Accesoriso Para Celulares</b>(14)</a><br>
<a onClick="SetCookie('codigolinea_buscar','1378');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','714');" href="javascript:void(0)" class="busquedas">BLACKBERRY (14)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1313')" href="javascript:void(0)" class="busquedas"><b>Adaptador Bluetooth</b>(6)</a><br>
<a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (1)</a>, <a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (3)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','555')" href="javascript:void(0)" class="busquedas"><b>Adaptadores Para Mouse Y Teclado</b>(3)</a><br>
<a onClick="SetCookie('codigolinea_buscar','555');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','555');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (1)</a><br>
</td>
</tr>
</tbody>
</table>
所以问题是我要在代码中添加什么才能访问每个链接?如果它使用java cookie。
使用的 cookies :
名称、值范围
codigoseccion_buscar, 11-30
codigomarca_buscar, 100-736
codigolinea_buscar, 15-1385

最佳答案

我设法通过向我的 Ruby 代码添加 cookie 来抓取这些链接内容之一:

    # set cookies
add_cookie(agent, 'http://www.site.com.mx', "codigoseccion_buscar=11; path=/; domain=www.site.com.mx")

add_cookie(agent, 'http://www.site.com.mx', "codigolinea_buscar=; path=/; domain=www.site.com.mx")

add_cookie(agent, 'http://www.site.com.mx', "codigomarca_buscar=; path=/; domain=www.site.com.mx")

add_cookie(agent, 'http://www.site.com.mx', "textobuscar=; path=/; domain=www.site.com.mx")

奇怪的是,如果我只添加其中一个 cookie,它将不起作用。所以我不得不添加所有,即使它们没有任何值,因为每个链接都有一个 cookie,这样它就会删除或清除保存的 cookie。

现在我需要刮掉那些 cookies ,将其用作变量并执行循环或其他操作,有人可以帮助我吗?
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','542')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Cableados De Patch Panels</b>(1)</a><br>

关于ruby - 使用 Ruby、Nokogiri 和 Mechanize 解析网页中的 java cookie 链接,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/6286073/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com