python scrapy selenium phantomJS

Author: 志明S | 2017-02-14 15:13 | 4,420 reads

I previously crawled Tianyancha (tyc) outbound-investment data single-threaded with Selenium and PhantomJS, but it was painfully slow: roughly 30-60s per company. Worse, the crawler would hang shortly after starting, stuck in some part of the program I never managed to pin down, which left me in a constant cycle of crawl, kill by hand, resume. Having just learned Scrapy, I decided to redo it with scrapy + selenium + phantomJS.
Code first:

#coding:utf-8
import time

import xlrd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

import scrapy
from scrapy.http import Request
from tyc.items import TycItem

class TycSpider(scrapy.Spider):
    name = 'tyc'
    allowed_domains = ['tianyancha.com']

    # read the company names to search for from the first column of an Excel sheet
    fname = "C:\\Users\\Administrator\\Desktop\\test.xlsx"
    workbook = xlrd.open_workbook(fname)
    sheet = workbook.sheet_by_name('Sheet1')
    cols = sheet.col_values(0)
    # one Tianyancha search URL per company name
    start_urls = ['http://www.tianyancha.com/search?key={}&checkFrom=searchBox'.format(col) for col in cols]

    def make_browser(self):
        # PhantomJS still needs an explicit user agent; without these headers
        # the returned page source is incomplete
        dcap = dict(DesiredCapabilities.PHANTOMJS)
        dcap["phantomjs.page.settings.userAgent"] = (
            "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Mobile Safari/537.36"
        )
        return webdriver.PhantomJS(desired_capabilities=dcap)

    def parse(self, response):
        # render the search-results page in PhantomJS
        browser = self.make_browser()
        try:
            browser.get(response.url)
            time.sleep(4)
            # grab the first matching company's detail-page URL
            url = browser.find_element_by_class_name('query_name').get_attribute('href')
            self.logger.info('Search hit: %s', url)
            yield Request(url=url, callback=self.parse_detail)
        except Exception:
            self.logger.info('No company found for this query!')
        finally:
            # quit even on failure, so PhantomJS processes do not pile up
            browser.quit()
    

    def parse_detail(self, response):
        # render the company page and scrape its outbound-investment table
        browser = self.make_browser()
        try:
            browser.get(response.url)
            self.logger.info('url %s', response.url)
            time.sleep(3)
            soup = BeautifulSoup(browser.page_source, 'lxml')
        finally:
            browser.quit()

        item_name = soup.select('.base-company')[0].text.split(' ')[0]
        self.logger.info('Company name: %s', item_name)
        try:
            rows = soup.select('#nav-main-outInvestment .m-plele')
            self.logger.debug('%d investment rows', len(rows))
            for row in rows:
                cells = row.select('div')
                # build a fresh item per row instead of reusing one instance
                item = TycItem()
                item['company'] = item_name
                item['enterprise_name'] = cells[0].text
                item['legal_person_name'] = cells[2].text
                item['industry'] = cells[3].text
                item['status'] = cells[4].text
                item['reg_captial'] = cells[5].text  # field name as defined in items.py
                yield item
        except Exception:
            self.logger.info('This company has no outbound investments!')
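The post never shows items.py, but given the fields assigned above, a minimal TycItem would look something like this (a sketch matching the spider's usage, not the author's actual file):

import scrapy

class TycItem(scrapy.Item):
    company = scrapy.Field()            # the company whose page was scraped
    enterprise_name = scrapy.Field()    # name of the invested enterprise
    legal_person_name = scrapy.Field()  # legal representative
    industry = scrapy.Field()
    status = scrapy.Field()             # operating status
    reg_captial = scrapy.Field()        # registered capital (spelling kept to match the spider)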
A few things to note:
  • Even though Selenium simulates a browser, you still have to set the user-agent header; without it the returned page source is incomplete.
  • Speed has improved, but against truly large volumes of data you still want distributed crawling with scrapy-redis or scrapyd (see the settings sketch below).
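As a rough idea of what the scrapy-redis switch involves, here is a minimal settings.py fragment (assuming a Redis instance on localhost; this is not part of the original post):

# settings.py -- scrapy-redis essentials
SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # Redis-backed request queue
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # dedup shared by all workers
SCHEDULER_PERSIST = True                                    # keep queue state between runs
REDIS_URL = 'redis://localhost:6379'                        # assumed local Redis

Every machine running `scrapy crawl tyc` with these settings pulls requests from the same Redis queue, which is what makes the crawl distributable.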
Next up: learning distributed Scrapy...

Comments

  • 4ffe7586f4e7: webdriver.PhantomJS fails immediately with
    raise URLError(err)
    urllib2.URLError: <urlopen error [Errno 10061]>
    志明S: @大树而倜傥 That's probably a browser issue; try Chrome or Firefox and see whether they work.
    4ffe7586f4e7: @志明S Wow, that was fast :joy:! I passed the path, webdriver.PhantomJS(executable_path="D:/phantomjs/bin/phantomjs.exe"), and it's still the same:
    File "D:\Program Files (x86)\python2.7\lib\urllib2.py", line 407, in _call_chain
    result = func(*args)
    File "D:\Program Files (x86)\python2.7\lib\urllib2.py", line 1228, in http_open
    return self.do_open(httplib.HTTPConnection, req)
    File "D:\Program Files (x86)\python2.7\lib\urllib2.py", line 1198, in do_open
    raise URLError(err)
    urllib2.URLError: <urlopen error [Errno 10061] >
    志明S: Do you need to point it at where your phantomjs binary is stored?
  • 罗罗攀: Write it up in more detail, and keep the posts coming :stuck_out_tongue_winking_eye::stuck_out_tongue_winking_eye:
    志明S: @罗罗攀 Will do.
    罗罗攀: @志明S No worries :blush:, crawling together is what makes it fun.
    志明S: @罗罗攀 Thanks. This is my first write-up, so please bear with the rough documentation.
  • 向右奔跑: You could save straight to a CSV file; that's simpler.
    志明S: @向右奔跑 Selenium works just as well inside Scrapy as outside :smile:, but it's still slow.
    向右奔跑: @志明S Scrapy makes saving CSV/JSON and writing to MongoDB very easy. I've only briefly tried Selenium inside Scrapy, though.
    志明S: @向右奔跑 I need JSON in the end, so I'll go the extra step and save as JSON or into MongoDB.
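For reference, Scrapy's built-in feed exports cover the CSV/JSON cases discussed above from the command line, with no extra pipeline code (using this post's spider name 'tyc'):

scrapy crawl tyc -o items.json    # JSON feed
scrapy crawl tyc -o items.csv     # CSV feed

Writing to MongoDB, by contrast, needs a small item pipeline.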
