python scrapy selenium phantomJS

Author: 志明S | 2017-02-14 15:13 | 4,420 reads

I previously crawled Tianyancha (tyc) outbound-investment data single-threaded with Selenium and PhantomJS, but it was painfully slow: roughly 30-60s per company. Worse, the crawler would hang shortly after starting, stuck in some part of the program I never managed to pin down, which left me in a constant cycle of crawl, kill by hand, resume. Having just learned Scrapy, I decided to redo it with scrapy + selenium + phantomJS.
Code first:

#coding:utf-8
import time

import xlrd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

import scrapy
from scrapy.http import Request
from tyc.items import TycItem

class TycSpider(scrapy.Spider):
    name = 'tyc'
    allowed_domains = ['tianyancha.com']

    # read the company names to search for from the first column of an Excel sheet
    fname = "C:\\Users\\Administrator\\Desktop\\test.xlsx"
    workbook = xlrd.open_workbook(fname)
    sheet = workbook.sheet_by_name('Sheet1')
    cols = sheet.col_values(0)
    # one Tianyancha search URL per company name
    start_urls = ['http://www.tianyancha.com/search?key={}&checkFrom=searchBox'.format(col) for col in cols]

    def make_browser(self):
        # PhantomJS still needs an explicit user agent; without these headers
        # the returned page source is incomplete
        dcap = dict(DesiredCapabilities.PHANTOMJS)
        dcap["phantomjs.page.settings.userAgent"] = (
            "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Mobile Safari/537.36"
        )
        return webdriver.PhantomJS(desired_capabilities=dcap)

    def parse(self, response):
        # render the search-results page in PhantomJS
        browser = self.make_browser()
        try:
            browser.get(response.url)
            time.sleep(4)
            # grab the first matching company's detail-page URL
            url = browser.find_element_by_class_name('query_name').get_attribute('href')
            self.logger.info('Search hit: %s', url)
            yield Request(url=url, callback=self.parse_detail)
        except Exception:
            self.logger.info('No company found for this query!')
        finally:
            # quit even on failure, so PhantomJS processes do not pile up
            browser.quit()
    

    def parse_detail(self, response):
        # render the company page and scrape its outbound-investment table
        browser = self.make_browser()
        try:
            browser.get(response.url)
            self.logger.info('url %s', response.url)
            time.sleep(3)
            soup = BeautifulSoup(browser.page_source, 'lxml')
        finally:
            browser.quit()

        item_name = soup.select('.base-company')[0].text.split(' ')[0]
        self.logger.info('Company name: %s', item_name)
        try:
            rows = soup.select('#nav-main-outInvestment .m-plele')
            self.logger.debug('%d investment rows', len(rows))
            for row in rows:
                cells = row.select('div')
                # build a fresh item per row instead of reusing one instance
                item = TycItem()
                item['company'] = item_name
                item['enterprise_name'] = cells[0].text
                item['legal_person_name'] = cells[2].text
                item['industry'] = cells[3].text
                item['status'] = cells[4].text
                item['reg_captial'] = cells[5].text  # field name as defined in items.py
                yield item
        except Exception:
            self.logger.info('This company has no outbound investments!')
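The post never shows items.py, but given the fields assigned above, a minimal TycItem would look something like this (a sketch matching the spider's usage, not the author's actual file):

import scrapy

class TycItem(scrapy.Item):
    company = scrapy.Field()            # the company whose page was scraped
    enterprise_name = scrapy.Field()    # name of the invested enterprise
    legal_person_name = scrapy.Field()  # legal representative
    industry = scrapy.Field()
    status = scrapy.Field()             # operating status
    reg_captial = scrapy.Field()        # registered capital (spelling kept to match the spider)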
A few things to note:
  • Even though Selenium simulates a browser, you still have to set the user-agent header; without it the returned page source is incomplete.
  • Speed has improved, but against truly large volumes of data you still want distributed crawling with scrapy-redis or scrapyd (see the settings sketch below).
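As a rough idea of what the scrapy-redis switch involves, here is a minimal settings.py fragment (assuming a Redis instance on localhost; this is not part of the original post):

# settings.py -- scrapy-redis essentials
SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # Redis-backed request queue
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # dedup shared by all workers
SCHEDULER_PERSIST = True                                    # keep queue state between runs
REDIS_URL = 'redis://localhost:6379'                        # assumed local Redis

Every machine running `scrapy crawl tyc` with these settings pulls requests from the same Redis queue, which is what makes the crawl distributable.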
Next up: learning distributed Scrapy...

Comments

  • 4ffe7586f4e7: webdriver.PhantomJS fails immediately with
    raise URLError(err)
    urllib2.URLError: <urlopen error [Errno 10061]>
    志明S: @大树而倜傥 That's probably a browser issue; try Chrome or Firefox and see whether they work.
    4ffe7586f4e7: @志明S Wow, that was fast :joy:! I passed the path, webdriver.PhantomJS(executable_path="D:/phantomjs/bin/phantomjs.exe"), and it's still the same:
    File "D:\Program Files (x86)\python2.7\lib\urllib2.py", line 407, in _call_chain
    result = func(*args)
    File "D:\Program Files (x86)\python2.7\lib\urllib2.py", line 1228, in http_open
    return self.do_open(httplib.HTTPConnection, req)
    File "D:\Program Files (x86)\python2.7\lib\urllib2.py", line 1198, in do_open
    raise URLError(err)
    urllib2.URLError: <urlopen error [Errno 10061] >
    志明S: Do you need to point it at where your phantomjs binary is stored?
  • 罗罗攀: Write it up in more detail, and keep the posts coming :stuck_out_tongue_winking_eye::stuck_out_tongue_winking_eye:
    志明S: @罗罗攀 Will do.
    罗罗攀: @志明S No worries :blush:, crawling together is what makes it fun.
    志明S: @罗罗攀 Thanks. This is my first write-up, so please bear with the rough documentation.
  • 向右奔跑: You could save straight to a CSV file; that's simpler.
    志明S: @向右奔跑 Selenium works just as well inside Scrapy as outside :smile:, but it's still slow.
    向右奔跑: @志明S Scrapy makes saving CSV/JSON and writing to MongoDB very easy. I've only briefly tried Selenium inside Scrapy, though.
    志明S: @向右奔跑 I need JSON in the end, so I'll go the extra step and save as JSON or into MongoDB.
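For reference, Scrapy's built-in feed exports cover the CSV/JSON cases discussed above from the command line, with no extra pipeline code (using this post's spider name 'tyc'):

scrapy crawl tyc -o items.json    # JSON feed
scrapy crawl tyc -o items.csv     # CSV feed

Writing to MongoDB, by contrast, needs a small item pipeline.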
