
python - Scrapy Beginners Guide_Urls

Reposted. Author: 行者123. Updated: 2023-11-29 00:04:35

Okay, long story short: I have to rush off to a meeting.

I'm trying to get the start url in scrapy, but no matter what I try, I can't seem to manage it. Here is my code (the spider).

import scrapy
import csv

from scrapycrawler.items import DmozItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.selector import Selector
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request


class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["snipplr.com"]

    def start_requests(self):
        for i in range(1, 230):
            yield self.make_requests_from_url("http://www.snipplr.com/view/%d" % i)

    def make_requests_from_url(self, url):
        item = DmozItem()

        # assign url
        item['link'] = url
        request = Request(url, dont_filter=True)

        # set meta['item'] so we can use the item in the next callback
        request.meta['item'] = item
        return request

    # Rules only apply before
    rules = (
        Rule(LxmlLinkExtractor(deny_domains=('http:\/\/www.snipplr.com\/snippet-not-found\/',)), callback="parse", follow=True),
    )

    def parse(self, response):
        sel = Selector(response)
        item = response.meta['item']
        item['title'] = sel.xpath('//div[@class="post"]/h1/text()').extract()
        # start_url
        item['link'] = response.url
        return item
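The `request.meta` hand-off above can be illustrated without running Scrapy at all: a Request simply carries a dict that Scrapy copies onto the Response, so the next callback can read the same item object back. The classes below are hypothetical stand-ins for Scrapy's real `Request`/`Response`, shown only to make the data flow concrete:

```python
# Hypothetical stand-ins illustrating how request.meta carries an item
# from start_requests/make_requests_from_url into the parse callback.
# These are NOT scrapy's real classes.
class FakeRequest:
    def __init__(self, url):
        self.url = url
        self.meta = {}          # scrapy copies request.meta onto response.meta


class FakeResponse:
    def __init__(self, request):
        self.url = request.url
        self.meta = request.meta


item = {"link": None, "title": None}
req = FakeRequest("http://www.snipplr.com/view/1")
item["link"] = req.url
req.meta["item"] = item          # done in make_requests_from_url

resp = FakeResponse(req)         # scrapy fetches the page and builds the response
same_item = resp.meta["item"]    # read back in parse
same_item["title"] = "Example title"
print(same_item["link"])         # the start url survives the round trip
```

Because `meta` is the same dict object on both sides, mutating the item in `parse` also mutates the item created in `make_requests_from_url`.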

I have tried everything, and up to now all I get in the url column of my database is an "h".


Here is my database code (the pipeline):

import csv
from scrapy.exceptions import DropItem
from scrapy import log
import sys
import mysql.connector


class CsvWriterPipeline(object):

    def __init__(self):
        self.connection = mysql.connector.connect(host='localhost', user='ws', passwd='ps', db='ws')
        self.cursor = self.connection.cursor()

    def process_item(self, item, spider):
        self.cursor.execute("SELECT title, url FROM items WHERE title = %s", item['title'])
        result = self.cursor.fetchone()
        if result:
            log.msg("Item already in database: %s" % item, level=log.DEBUG)
        else:
            self.cursor.execute(
                "INSERT INTO items (title, url) VALUES (%s, %s)",
                (item['title'][0], item['link'][0]))
            self.connection.commit()
            log.msg("Item stored: %s" % item, level=log.DEBUG)
        return item

    def handle_error(self, e):
        log.err(e)
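The pipeline's select-then-insert dedupe logic can be sketched in a self-contained way with the stdlib `sqlite3` module instead of `mysql.connector` (same table and column names as above, `?` placeholders instead of `%s`), so it runs without a MySQL server:

```python
# Minimal sketch of the pipeline's dedupe logic using stdlib sqlite3,
# so it runs anywhere; table/column names mirror the pipeline above.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE items (title TEXT, url TEXT)")


def process_item(item):
    # parameters are passed as a tuple, never interpolated into the SQL string
    cur.execute("SELECT title, url FROM items WHERE title = ?", (item["title"],))
    if cur.fetchone():
        return "duplicate"
    cur.execute("INSERT INTO items (title, url) VALUES (?, ?)",
                (item["title"], item["link"]))
    conn.commit()
    return "stored"


print(process_item({"title": "t1", "link": "http://example.com/1"}))  # stored
print(process_item({"title": "t1", "link": "http://example.com/1"}))  # duplicate
```

Note that `sqlite3` and `mysql.connector` share the DB-API 2.0 `cursor.execute(sql, params)` shape, which is why the logic transfers directly.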

As you can see here, it is clearly working.

How would I get the start url, or rather, how would I populate it? I believe the "h" means the field is empty. The database is MySQL.

Thanks for reading and for any help.

Regards, Charlie

Best Answer

Unlike item['title'], item['link'] is just a string, not a list, so indexing it with [0] yields its first character: the "h" of "http". Drop the index:

self.cursor.execute("INSERT INTO items (title, url) VALUES (%s, %s)",
                    (item['title'][0], item['link']))
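The difference is easy to see in isolation: `sel.xpath(...).extract()` returns a list of strings, while `response.url` is a plain string, and indexing each with `[0]` does very different things:

```python
# Why an "h" ends up in the url column: extract() returns a list,
# but response.url is a plain string.
title = ["My Snippet Title"]             # what sel.xpath(...).extract() returns
link = "http://www.snipplr.com/view/1"   # what response.url returns

print(title[0])  # -> "My Snippet Title" (first element of the list)
print(link[0])   # -> "h" (first character of the URL string)
```

So `item['title'][0]` is correct, but `item['link'][0]` sliced off a single character, which is exactly the "h" that was landing in the database.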

Regarding python - Scrapy Beginners Guide_Urls, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/28121383/

Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号