Note: if you are working in a conda environment and install Scrapy with conda install scrapy,
running the scrapy command afterwards may fail to find the interpreter.
1. Scrapy basics
Scrapy global commands:
bench Run quick benchmark test
fetch Fetch a URL using the Scrapy downloader (requests the given URL using the project's configuration)
genspider Generate new spider using pre-defined templates (creates a spider from a template)
runspider Run a self-contained spider (without creating a project) (runs a standalone spider file that does not belong to a project)
settings Get settings values (rarely used)
shell Interactive scraping console (used for debugging)
startproject Create new project (creates the project directory)
version Print Scrapy version
view Open URL in browser, as seen by Scrapy (opens the URL in a browser, showing the page as Scrapy sees it)
list List the spiders in the project (strictly speaking this one only works inside a project)
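For example, the shell command fetches a URL and drops you into an interactive console where the response object is already available, which is handy for trying selectors before writing them into a spider. A minimal sketch (baidu.com is simply the example site used later in this post):

scrapy shell "http://baidu.com"
>>> response.status                          # HTTP status code of the fetched page
>>> response.xpath('//title/text()').get()   # try an XPath selector interactively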
Workflow for creating and initializing a project:
scrapy startproject mytest
Create the project directory
cd mytest
scrapy genspider my_test baidu.com
Generate a spider from the built-in template
scrapy crawl my_test
Run the my_test spider
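As a side note, crawl can also export the scraped items straight to a file via feed export; the filename below is just an example, and Scrapy infers the format from the extension:
scrapy crawl my_test -o result.json
Run the my_test spider and write the scraped items to result.json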
2. Understanding settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for mytest project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'mytest'
SPIDER_MODULES = ['mytest.spiders']
NEWSPIDER_MODULE = 'mytest.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'mytest (+http://www.yourdomain.com)'  # configure the User-Agent
# Obey robots.txt rules
ROBOTSTXT_OBEY = False # whether to obey robots.txt; usually set to False in practice; the generated default is True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32 # maximum number of concurrent requests performed by the Scrapy downloader (default: 16)
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3 # delay in seconds between requests to the same website, used to limit the effective concurrency; 0 if this line stays commented out
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 16 # maximum number of concurrent requests per domain
CONCURRENT_REQUESTS_PER_IP = 16 # maximum number of concurrent requests to any single IP; if non-zero, CONCURRENT_REQUESTS_PER_DOMAIN is ignored and this setting is used instead, i.e. the concurrency limit applies per IP rather than per domain
# Disable cookies (enabled by default)
COOKIES_ENABLED = False # whether cookies are enabled
# Disable Telnet Console (enabled by default)
TELNETCONSOLE_ENABLED = False # whether to enable the Telnet console extension
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
} # default request headers
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
'mytest.middlewares.MytestSpiderMiddleware': 543,
} # spider middlewares you write yourself; 543 is the order value, and lower numbers mean higher priority
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
'mytest.middlewares.MytestDownloaderMiddleware': 543,
} # enable the downloader middlewares
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
EXTENSIONS = {
'scrapy.extensions.telnet.TelnetConsole': None,
} # dict of the extensions enabled in the project and their orders
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'mytest.pipelines.MytestPipeline': 300,
} # dict of the item pipelines enabled in the project and their orders; empty by default; the order values are arbitrary but are conventionally kept in the 0-1000 range
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True # enable the AutoThrottle extension
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5 # initial download delay
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 60 # maximum download delay under high latency
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 # average number of requests sent in parallel to each remote site
# Enable showing throttling stats for every response received:
AUTOTHROTTLE_DEBUG = False # enable AutoThrottle debug mode, which shows throttling stats for every response received so you can see how the throttling parameters are adjusted in real time
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = True # whether the HTTP cache is enabled
HTTPCACHE_EXPIRATION_SECS = 0 # expiration time for cached requests, in seconds; cached requests older than this are re-downloaded; 0 means cached requests never expire
HTTPCACHE_DIR = 'httpcache' # directory for the (low-level) HTTP cache; if empty, the HTTP cache is disabled; a relative path is resolved against the project data dir
HTTPCACHE_IGNORE_HTTP_CODES = [] # responses with these HTTP status codes are not cached
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' # class implementing the cache storage backend
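Besides editing settings.py, a single spider can override project settings through the custom_settings class attribute. A minimal sketch, assuming we only want a longer delay for this one spider (the spider name and the values are illustrative):

import scrapy

class SlowSpider(scrapy.Spider):
    name = 'slow_spider'  # hypothetical spider, for illustration only
    # custom_settings overrides the project-wide values for this spider only
    custom_settings = {
        'DOWNLOAD_DELAY': 5,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 8,
    }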
3. Scrapy usage demo
After the project is created with the commands above, the directory looks like this:
./mytest
    scrapy.cfg
    mytest/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            my_test.py
Here, my_test.py is the spider file you will mainly be editing.
# -*- coding: utf-8 -*-
import scrapy


class MyTestSpider(scrapy.Spider):  # spiders inherit from scrapy.Spider
    name = 'my_test'  # the spider's name, used by scrapy crawl
    allowed_domains = ['baidu.com']  # domains the spider is allowed to crawl
    start_urls = ['http://baidu.com/']  # initial URLs
    def parse(self, response):  # parse is the default callback; extraction normally happens here
        pass
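In a real spider, parse would extract data from the response and yield items or follow-up requests instead of pass. A minimal sketch of what it could look like (the XPath expressions and the title field are illustrative, not taken from an actual page structure):

    def parse(self, response):
        # pull the page title out with an XPath selector
        title = response.xpath('//title/text()').get()
        # yield a plain dict as an item; it will flow through the item pipelines
        yield {'title': title}
        # follow every link on the page and parse it with the same callback
        for href in response.xpath('//a/@href').getall():
            yield response.follow(href, callback=self.parse)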
Start the spider: scrapy crawl my_test
The output looks like this:
2020-04-08 09:50:17 [scrapy.utils.log] INFO: Scrapy 2.0.1 started (bot: mytest)
2020-04-08 09:50:17 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.20.0, Twisted 20.3.0, Python 3.7.6 | packaged by conda-forge | (default, Mar 23 2020, 22:22:21) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1f 31 Mar 2020), cryptography 2.8, Platform Windows-10-10.0.18362-SP0
2020-04-08 09:50:17 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-04-08 09:50:17 [scrapy.crawler] INFO: Overridden settings: # everything above this point is initialization
{'BOT_NAME': 'mytest',
'NEWSPIDER_MODULE': 'mytest.spiders',
'SPIDER_MODULES': ['mytest.spiders']}
2020-04-08 09:50:17 [scrapy.extensions.telnet] INFO: Telnet Password: 717073d665f29514
2020-04-08 09:50:17 [scrapy.middleware] INFO: Enabled extensions: # loading the extensions
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-04-08 09:50:17 [scrapy.middleware] INFO: Enabled downloader middlewares: # loading the downloader middlewares
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-04-08 09:50:17 [scrapy.middleware] INFO: Enabled spider middlewares: # loading the spider middlewares
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-04-08 09:50:17 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-04-08 09:50:17 [scrapy.core.engine] INFO: Spider opened
2020-04-08 09:50:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-04-08 09:50:17 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-04-08 09:50:18 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (meta refresh) to <GET http://www.baidu.com/> from <GET http://baidu.com/>
2020-04-08 09:50:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.baidu.com/> (referer: None)
2020-04-08 09:50:18 [scrapy.core.engine] INFO: Closing spider (finished)
2020-04-08 09:50:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 422,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 1818,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 0.393769,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 4, 8, 1, 50, 18, 345679),
'log_count/DEBUG': 2,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2020, 4, 8, 1, 50, 17, 951910)}
2020-04-08 09:50:18 [scrapy.core.engine] INFO: Spider closed (finished)
A rough walk-through of the flow above:
Scrapy first initializes and loads its tooling (lxml/XPath, cssselect, and so on),
then enables the extensions,
loads the downloader middlewares and issues the requests,
and hands the responses to the spider through the spider middlewares.
Because no pipeline code was written, the "Enabled item pipelines" list above is empty.
Finally it dumps the crawl stats:
total request bytes, number of requests, number of GET requests, total response bytes, and so on.
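The piece missing from this demo is exactly that pipeline stage. A minimal sketch of a pipeline that would fill the empty "Enabled item pipelines" list, assuming the default mytest.pipelines module plus the ITEM_PIPELINES entry shown in section 2 (the items.jl filename is just an example):

# pipelines.py
import json

class MytestPipeline:
    def open_spider(self, spider):
        # open the output file once, when the spider starts
        self.file = open('items.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # write each item as one JSON line, then hand it to the next pipeline
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # close the file when the spider finishes
        self.file.close()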