Note: if you are working in a conda environment and install Scrapy with conda install scrapy,
running the scrapy command afterwards may fail to find the interpreter.
1. Scrapy basics
Scrapy global commands:
bench Run quick benchmark test
fetch Fetch a URL using the Scrapy downloader (requests the given URL using the project's configuration)
genspider Generate new spider using pre-defined templates (creates a spider from a template)
runspider Run a self-contained spider (without creating a project) (runs a standalone spider file that does not belong to a project)
settings Get settings values (rarely used)
shell Interactive scraping console (used for debugging)
startproject Create new project (creates the project directory)
version Print Scrapy version
view Open URL in browser, as seen by Scrapy (opens the URL in a browser, showing the page as Scrapy sees it)
list List the spiders in the project (strictly speaking this one only works inside a project)
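For example, the shell command fetches a URL and drops you into an interactive console where the response object is already available, which is handy for trying selectors before writing them into a spider. A minimal sketch (baidu.com is simply the example site used later in this post):

scrapy shell "http://baidu.com"
>>> response.status                          # HTTP status code of the fetched page
>>> response.xpath('//title/text()').get()   # try an XPath selector interactively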
Workflow for creating and initializing a project:
scrapy startproject mytest
Create the project directory
cd mytest
scrapy genspider my_test baidu.com
Generate a spider from the built-in template
scrapy crawl my_test
Run the my_test spider
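As a side note, crawl can also export the scraped items straight to a file via feed export; the filename below is just an example, and Scrapy infers the format from the extension:
scrapy crawl my_test -o result.json
Run the my_test spider and write the scraped items to result.json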
2. Understanding settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for mytest project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'mytest'
SPIDER_MODULES = ['mytest.spiders']
NEWSPIDER_MODULE = 'mytest.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'mytest (+http://www.yourdomain.com)'  # configure the User-Agent
# Obey robots.txt rules
ROBOTSTXT_OBEY = False # whether to obey robots.txt; usually set to False in practice; the generated default is True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32 # maximum number of concurrent requests performed by the Scrapy downloader (default: 16)
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3 # delay in seconds between requests to the same website, used to limit the effective concurrency; 0 if this line stays commented out
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 16 # maximum number of concurrent requests per domain
CONCURRENT_REQUESTS_PER_IP = 16 # maximum number of concurrent requests to any single IP; if non-zero, CONCURRENT_REQUESTS_PER_DOMAIN is ignored and this setting is used instead, i.e. the concurrency limit applies per IP rather than per domain
# Disable cookies (enabled by default)
COOKIES_ENABLED = False # whether cookies are enabled
# Disable Telnet Console (enabled by default)
TELNETCONSOLE_ENABLED = False # whether to enable the Telnet console extension
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
} # default request headers
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
'mytest.middlewares.MytestSpiderMiddleware': 543,
} # spider middlewares you write yourself; 543 is the order value, and lower numbers mean higher priority
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
'mytest.middlewares.MytestDownloaderMiddleware': 543,
} # enable the downloader middlewares
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
EXTENSIONS = {
'scrapy.extensions.telnet.TelnetConsole': None,
} # dict of the extensions enabled in the project and their orders
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'mytest.pipelines.MytestPipeline': 300,
} # dict of the item pipelines enabled in the project and their orders; empty by default; the order values are arbitrary but are conventionally kept in the 0-1000 range
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True # enable the AutoThrottle extension
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5 # initial download delay
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 60 # maximum download delay under high latency
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 # average number of requests sent in parallel to each remote site
# Enable showing throttling stats for every response received:
AUTOTHROTTLE_DEBUG = False # enable AutoThrottle debug mode, which shows throttling stats for every response received so you can see how the throttling parameters are adjusted in real time
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = True # whether the HTTP cache is enabled
HTTPCACHE_EXPIRATION_SECS = 0 # expiration time for cached requests, in seconds; cached requests older than this are re-downloaded; 0 means cached requests never expire
HTTPCACHE_DIR = 'httpcache' # directory for the (low-level) HTTP cache; if empty, the HTTP cache is disabled; a relative path is resolved against the project data dir
HTTPCACHE_IGNORE_HTTP_CODES = [] # responses with these HTTP status codes are not cached
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' # class implementing the cache storage backend
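Besides editing settings.py, a single spider can override project settings through the custom_settings class attribute. A minimal sketch, assuming we only want a longer delay for this one spider (the spider name and the values are illustrative):

import scrapy

class SlowSpider(scrapy.Spider):
    name = 'slow_spider'  # hypothetical spider, for illustration only
    # custom_settings overrides the project-wide values for this spider only
    custom_settings = {
        'DOWNLOAD_DELAY': 5,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 8,
    }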
3. Scrapy usage demo
After the project is created with the commands above, the directory looks like this:
./mytest
    scrapy.cfg
    mytest/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            my_test.py
Here, my_test.py is the spider file you will mainly be editing.
# -*- coding: utf-8 -*-
import scrapy


class MyTestSpider(scrapy.Spider):  # spiders inherit from scrapy.Spider
    name = 'my_test'  # the spider's name, used by scrapy crawl
    allowed_domains = ['baidu.com']  # domains the spider is allowed to crawl
    start_urls = ['http://baidu.com/']  # initial URLs
    def parse(self, response):  # parse is the default callback; extraction normally happens here
        pass
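In a real spider, parse would extract data from the response and yield items or follow-up requests instead of pass. A minimal sketch of what it could look like (the XPath expressions and the title field are illustrative, not taken from an actual page structure):

    def parse(self, response):
        # pull the page title out with an XPath selector
        title = response.xpath('//title/text()').get()
        # yield a plain dict as an item; it will flow through the item pipelines
        yield {'title': title}
        # follow every link on the page and parse it with the same callback
        for href in response.xpath('//a/@href').getall():
            yield response.follow(href, callback=self.parse)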
Start the spider: scrapy crawl my_test
The output looks like this:
2020-04-08 09:50:17 [scrapy.utils.log] INFO: Scrapy 2.0.1 started (bot: mytest)
2020-04-08 09:50:17 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.20.0, Twisted 20.3.0, Python 3.7.6 | packaged by conda-forge | (default, Mar 23 2020, 22:22:21) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1f 31 Mar 2020), cryptography 2.8, Platform Windows-10-10.0.18362-SP0
2020-04-08 09:50:17 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-04-08 09:50:17 [scrapy.crawler] INFO: Overridden settings: # everything above this point is initialization
{'BOT_NAME': 'mytest',
'NEWSPIDER_MODULE': 'mytest.spiders',
'SPIDER_MODULES': ['mytest.spiders']}
2020-04-08 09:50:17 [scrapy.extensions.telnet] INFO: Telnet Password: 717073d665f29514
2020-04-08 09:50:17 [scrapy.middleware] INFO: Enabled extensions: # loading the extensions
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-04-08 09:50:17 [scrapy.middleware] INFO: Enabled downloader middlewares: # loading the downloader middlewares
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-04-08 09:50:17 [scrapy.middleware] INFO: Enabled spider middlewares: # loading the spider middlewares
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-04-08 09:50:17 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-04-08 09:50:17 [scrapy.core.engine] INFO: Spider opened
2020-04-08 09:50:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-04-08 09:50:17 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-04-08 09:50:18 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (meta refresh) to <GET http://www.baidu.com/> from <GET http://baidu.com/>
2020-04-08 09:50:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.baidu.com/> (referer: None)
2020-04-08 09:50:18 [scrapy.core.engine] INFO: Closing spider (finished)
2020-04-08 09:50:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 422,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 1818,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 0.393769,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 4, 8, 1, 50, 18, 345679),
'log_count/DEBUG': 2,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2020, 4, 8, 1, 50, 17, 951910)}
2020-04-08 09:50:18 [scrapy.core.engine] INFO: Spider closed (finished)
A rough walk-through of the flow above:
Scrapy first initializes and loads its tooling (lxml/XPath, cssselect, and so on),
then enables the extensions,
loads the downloader middlewares and issues the requests,
and hands the responses to the spider through the spider middlewares.
Because no pipeline code was written, the "Enabled item pipelines" list above is empty.
Finally it dumps the crawl stats:
total request bytes, number of requests, number of GET requests, total response bytes, and so on.
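The piece missing from this demo is exactly that pipeline stage. A minimal sketch of a pipeline that would fill the empty "Enabled item pipelines" list, assuming the default mytest.pipelines module plus the ITEM_PIPELINES entry shown in section 2 (the items.jl filename is just an example):

# pipelines.py
import json

class MytestPipeline:
    def open_spider(self, spider):
        # open the output file once, when the spider starts
        self.file = open('items.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # write each item as one JSON line, then hand it to the next pipeline
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # close the file when the spider finishes
        self.file.close()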