mysql - Scrapy MySQL pipeline: all database entries are the same

Reposted. Author: 行者123. Updated: 2023-11-29 06:46:40

I'm running Scrapy on Mac OS X Lion 10.7.5 (in case that matters).

Here is my spider:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from BoxOfficeMojo.items import BoxofficemojoItem
    from BoxOfficeMojo.items import ActorItem

    class MojoSpider(BaseSpider):
        name = 'MojoSpider'
        allowed_domains = ['boxofficemojo.com']
        start_urls = ['http://www.boxofficemojo.com/movies/alphabetical.htm?letter=A&p=.htm']

        def parse(self, response):
            items = []
            movie = BoxofficemojoItem()
            hxs = HtmlXPathSelector(response)
            print ('hxs:', hxs)
            links = hxs.select('//div[@id="body"]/div/table/tr/td/table/tr[2]/td/table[2]/tr/td[1]/font/a/@href').extract() #was previously
            titles = hxs.select('//div[@id="body"]/div/table/tr/td/table/tr[2]/td/table[2]/tr/td[1]/font/a/b/text()').extract()
            gross = hxs.select('//div[@id="body"]/div/table/tr/td/table/tr[2]/td/table[2]/tr/td[3]/font/text()').extract()
            opening = hxs.select('//div[@id="body"]/div/table/tr/td/table/tr[2]/td/table[2]/tr/td[7]/font//text()').extract()
            for item in gross:
                if 'Total' in item:
                    gross.remove(item)


            items = []
            for i in range(len(links)):
                movie['title'] = titles[i]
                movie['link'] = 'http://www.boxofficemojo.com' + links[i]
                movie['gross'] = gross[i]
                movie['release_date'] = opening[i]
                items.append(movie)
            return items

Here is my MySQL pipeline:

    import sys; sys.path.append("/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages")
    import MySQLdb
    import hashlib
    from scrapy.exceptions import DropItem
    from scrapy.http import Request

    class BoxofficemojoPipeline(object):

        def __init__(self):
            self.conn = MySQLdb.connect(user='testuser', passwd='test', db='testdb', host='localhost', charset='utf8', use_unicode=True)
            self.cursor = self.conn.cursor()

        def process_item(self, item, spider):
            try:
                self.cursor.execute("""INSERT INTO example_movie (title, link, gross, release_date) VALUES (%s, %s, %s, %s)""", (item['title'], item['link'], item['gross'], item['release_date']))
                self.conn.commit()
            except MySQLdb.Error, e:
                print "Error %d: %s" % (e.args[0], e.args[1])

            return item

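When I look at the entries in the MySQL database, there is one row for each movie on the page, as expected, but every row is the same movie, Act of Worship, which is the last movie on the page. Any suggestions are welcome! Thanks for looking!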

Best answer

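Try moving movie = BoxofficemojoItem() inside the for i in range(len(links)): loop. As written, a single item object is created once and appended on every iteration, so every entry in items refers to that same object, which ends up holding the values of the last row.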

    def parse(self, response):
        items = []

        hxs = HtmlXPathSelector(response)
        print ('hxs:', hxs)
        links = hxs.select('//div[@id="body"]/div/table/tr/td/table/tr[2]/td/table[2]/tr/td[1]/font/a/@href').extract() #was previously
        titles = hxs.select('//div[@id="body"]/div/table/tr/td/table/tr[2]/td/table[2]/tr/td[1]/font/a/b/text()').extract()
        gross = hxs.select('//div[@id="body"]/div/table/tr/td/table/tr[2]/td/table[2]/tr/td[3]/font/text()').extract()
        opening = hxs.select('//div[@id="body"]/div/table/tr/td/table/tr[2]/td/table[2]/tr/td[7]/font//text()').extract()
        for item in gross:
            if 'Total' in item:
                gross.remove(item)

        items = []
        for i in range(len(links)):
            movie = BoxofficemojoItem()
            movie['title'] = titles[i]
            movie['link'] = 'http://www.boxofficemojo.com' + links[i]
            movie['gross'] = gross[i]
            movie['release_date'] = opening[i]
            items.append(movie)
        return items
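
To see why the placement of movie = BoxofficemojoItem() matters, here is a minimal sketch that leaves Scrapy out entirely. It assumes a plain dict behaves like the item for this purpose (both are mutable objects that a list stores by reference). The function names and the first two titles are made up for illustration.

```python
# Illustration only: a plain dict stands in for BoxofficemojoItem.
# build_shared / build_fresh are hypothetical helper names, not Scrapy APIs.

def build_shared(titles):
    """Buggy pattern: one object created before the loop and reused."""
    items = []
    movie = {}                     # created once, outside the loop
    for title in titles:
        movie['title'] = title     # overwrites the same object every time
        items.append(movie)        # appends another reference to that object
    return items


def build_fresh(titles):
    """Fixed pattern: a new object for every iteration."""
    items = []
    for title in titles:
        movie = {}                 # created per row
        movie['title'] = title
        items.append(movie)
    return items


# Made-up sample data; only the last title appears in the question.
titles = ['A Sample Movie', 'Another Sample', 'Act of Worship']

shared = build_shared(titles)
fresh = build_fresh(titles)

print([m['title'] for m in shared])  # all three entries show the last title
print([m['title'] for m in fresh])   # one entry per title
print(shared[0] is shared[-1])       # True: the list holds one object three times
```

Since parse only returns the list after the loop has finished, the pipeline reads the last row's values from every entry, which matches the symptom described in the question.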

Here are some suggestions to make your code simpler:

  • Use a common ancestor for all movie item fields (//div[@id="body"]/div/table/tr/td/table/tr[2]/td/table[2]/tr)
  • Use urlparse.urljoin() to build the "full" URL

    import urlparse

    ...

    def parse(self, response):
        items = []

        hxs = HtmlXPathSelector(response)
        print ('hxs:', hxs)

        # Select each table row once, then read every field relative to that row.
        movie_rows = hxs.select('//div[@id="body"]/div/table/tr/td/table/tr[2]/td/table[2]/tr')
        for m in movie_rows:
            movie = BoxofficemojoItem()

            # The title is the link text; the link is the href joined with the page URL.
            movie['title'] = m.select('td[1]/font/a/b/text()').extract()[0]
            movie['link'] = urlparse.urljoin(
                response.url, m.select('td[1]/font/a/@href').extract()[0])
            movie['gross'] = m.select('td[3]/font/text()').extract()[0]
            movie['release_date'] = m.select('td[7]/font//text()').extract()[0]

            items.append(movie)
        return items
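
The urljoin suggestion can be tried on its own. The sketch below uses made-up href values (the links on the real page may look different) and falls back to urllib.parse on Python 3, where the urlparse module used in the answer was renamed.

```python
try:
    from urlparse import urljoin        # Python 2, as in the answer
except ImportError:
    from urllib.parse import urljoin    # Python 3 location of the same function

# URL of the page being parsed (taken from start_urls in the question).
page_url = 'http://www.boxofficemojo.com/movies/alphabetical.htm?letter=A&p=.htm'

# Hypothetical href values, for illustration only.
absolute_path_href = '/movies/?id=example.htm'
relative_href = 'example.htm'

# An href starting with "/" replaces the whole path of the page URL.
full_from_absolute = urljoin(page_url, absolute_path_href)

# An href without a leading "/" is resolved against the page's directory.
full_from_relative = urljoin(page_url, relative_href)

print(full_from_absolute)  # http://www.boxofficemojo.com/movies/?id=example.htm
print(full_from_relative)  # http://www.boxofficemojo.com/movies/example.htm
```

Compared with concatenating 'http://www.boxofficemojo.com' and the href by hand, urljoin also copes with hrefs that are relative to the current directory or that are already complete URLs.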

Regarding "mysql - Scrapy MySQL pipeline: all database entries are the same", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/18306303/
