What is a crawler?

Logically, a crawler corresponds to a tree: the branches are web pages, and the leaves are the pieces of information we are interested in. When we start from one URL and look for interesting information, the content returned by that URL may itself contain the information we want, or it may contain further URLs that might lead to it. One crawl is one information search, and that search builds up a tree. The scrapy.Spider class exposes the interface that lets us design this whole search process. Runtime parameters for a spider, such as the values after the ? in a URL, can be passed on the command line with crawl -a (see the short sketch after the Spider source below).

2. The spider loop

For a spider, the crawl loop looks roughly like this: initial Requests are built from the starting URLs and given a callback. When a request finishes downloading, a response is generated and passed as an argument to that callback. The initial requests are obtained by calling start_requests(), which reads the URLs in start_urls and generates a Request for each one with parse as the callback. Inside the callback you analyse the returned page content and return Item objects, Requests, or an iterable containing both. Returned Request objects are then processed by Scrapy: their content is downloaded and the configured callback (which may be the same function) is called. Inside a callback you can use Selectors (or BeautifulSoup, lxml, or any other parser you prefer) to analyse the page and build items from the extracted data. Finally, the items returned by the spider are stored in a database (handled by some Item Pipeline) or written to a file using Feed exports.

Scrapy ships with several default spider implementations; the two covered below are the default Spider and CrawlSpider.

3. The simplest spider (the default Spider)

- Builds Request objects from the URLs in the instance attribute start_urls.
- The framework takes care of executing the requests.
- The response returned by each request is passed to the parse method for analysis.

Simplified source:
&&&&"""Base&class&for&scrapy&spiders.&All&spiders&must&inherit&from&this
&&&&class.
&&&&name&=&None
&&&&def&__init__(self,&name=None,&**kwargs):
&&&&&&&&if&name&is¬&None:
&&&&&&&&&&&&self.name&=&name
&&&&&&&&elif¬&getattr(self,&'name',&None):
&&&&&&&&&&&&raise&ValueError("%s&must&have&a&name"&%&type(self).__name__)
&&&&&&&&self.__dict__.update(kwargs)
&&&&&&&&if¬&hasattr(self,&'start_urls'):
&&&&&&&&&&&&self.start_urls&=&[]
&&&&def&start_requests(self):
&&&&&&&&for&url&in&self.start_urls:
&&&&&&&&&&&&yield&self.make_requests_from_url(url)
&&&&def&make_requests_from_url(self,&url):
&&&&&&&&return&Request(url,&dont_filter=True)
&&&&def&parse(self,&response):
&&&&&&&&raise&NotImplementedError
BaseSpider&=&create_deprecated_class('BaseSpider',&Spider)一个回调函数返回多个request的例子import&scrapyfrom&myproject.items&import&MyItemclass&MySpider(scrapy.Spider):
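As mentioned in the first section, runtime parameters can be handed to a spider with crawl -a; because Spider.__init__ copies extra keyword arguments onto the instance, every -a name=value shows up as an attribute. Below is a minimal sketch of that mechanism, assuming the quotes.toscrape.com demo site; the spider name and the tag argument are illustrative, not from the original post. It would be run as:

scrapy crawl quotes -a tag=humor

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        # `-a tag=humor` on the command line becomes `self.tag`, because
        # Spider.__init__ does self.__dict__.update(kwargs) (see the source above).
        tag = getattr(self, 'tag', None)
        url = 'http://quotes.toscrape.com/'
        if tag is not None:
            url = url + 'tag/' + tag + '/'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        for text in response.xpath('//span[@class="text"]/text()').extract():
            yield {'text': text}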
An example where a single callback returns multiple requests (and items):

import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)

Constructing a Request object takes only two parameters: the URL and a callback function.
4. CrawlSpider

In most crawls we need to decide in the spider which pages' links should be followed, and which pages are the end of the line, with no need to follow the links they contain. CrawlSpider provides a useful abstraction for this, the Rule, which makes such crawls simple: you only tell Scrapy in the rules which links need to be followed. Recall the spider that crawled the mininova site:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
# TorrentItem is defined in the project's items module

class MininovaSpider(CrawlSpider):
    name = 'mininova'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/yesterday']
    rules = [Rule(LinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = response.xpath("//h1/text()").extract()
        torrent['description'] = response.xpath("//div[@id='description']").extract()
        torrent['size'] = response.xpath("//div[@id='specifications']/p[2]/text()[2]").extract()
        return torrent

The rules above mean: the content returned by URLs matching /tor/\d+ is handed to parse_torrent, and the URLs in those responses are not followed any further (a Rule with a callback does not follow links by default). The official documentation also has an example:
rules = (
    # Extract links matching 'category.php' (but not 'subsection.php')
    # and follow them (no callback means follow defaults to True).
    Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

    # Extract links matching 'item.php' and parse them with the spider's
    # parse_item method.
    Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
)

Besides Spider and CrawlSpider, Scrapy also provides XMLFeedSpider, CSVFeedSpider, and SitemapSpider.
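As a taste of one of these, here is a minimal SitemapSpider sketch; the sitemap URL, rule pattern, and extracted fields are placeholders, not from the original post.

from scrapy.spiders import SitemapSpider

class ExampleSitemapSpider(SitemapSpider):
    name = 'example_sitemap'
    # SitemapSpider fetches these sitemaps and requests every URL it finds;
    # sitemap_rules narrows that down to URLs matching a pattern and maps
    # them to a callback.
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    sitemap_rules = [('/product/', 'parse_product')]

    def parse_product(self, response):
        yield {
            'url': response.url,
            'title': response.xpath('//h1/text()').extract_first(),
        }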
Modifying CSV export in Scrapy (Stack Overflow)
I seem to be missing something very simple. All I want to do is use ; as the
delimiter in the CSV exporter instead of ,.
I know the CSV exporter passes kwargs to the csv writer, but I can't seem to
figure out how to pass it the delimiter.
I am calling my spider like so:
scrapy crawl spidername --set FEED_URI=output.csv --set FEED_FORMAT=csv
In contrib/feedexport.py:

class FeedExporter(object):

    def open_spider(self, spider):
        file = TemporaryFile(prefix='feed-')
        exp = self._get_exporter(file)  # <-- this is where the exporter is instantiated
        exp.start_exporting()
        self.slots[spider] = SpiderSlot(file, exp)

    def _get_exporter(self, *a, **kw):
        return self.exporters[self.format](*a, **kw)  # <-- not passed in :(
You will need to make your own exporter; here's an example:

from scrapy.conf import settings
from scrapy.contrib.exporter import CsvItemExporter

class CsvOptionRespectingItemExporter(CsvItemExporter):

    def __init__(self, *args, **kwargs):
        # Read the delimiter from a custom setting and forward it to
        # CsvItemExporter, which passes kwargs on to csv.writer.
        delimiter = settings.get('CSV_DELIMITER', ',')
        kwargs['delimiter'] = delimiter
        super(CsvOptionRespectingItemExporter, self).__init__(*args, **kwargs)
In the settings.py file of your crawler directory, add this:
FEED_EXPORTERS = {
    'csv': 'importable.path.to.CsvOptionRespectingItemExporter',
}
Now, you can execute your spider as follows:
scrapy crawl spidername --set FEED_URI=output.csv --set FEED_FORMAT=csv --set CSV_DELIMITER=';'
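To sanity-check the result, the exported file can be read back with Python's standard csv module, using the same ; delimiter (this assumes the output.csv name from the command above):

import csv

# If the custom exporter is wired up correctly, splitting on ';'
# yields the individual item fields.
with open('output.csv') as f:
    for row in csv.reader(f, delimiter=';'):
        print(row)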