BOSS Zhipin Scraper: An Analysis of Data Analyst Job Postings

Author: ygquincy | Published 2019-03-16 19:55

First, a note on the data source: everything below was scraped from BOSS Zhipin postings for the "data analyst" position. The analysis covers the overall salary picture for data analysts, salary distributions across cities and across education levels, how salary relates to work experience in Beijing and Shanghai, the demand for data analyst positions in Beijing, Shanghai, Guangzhou, and Shenzhen, and a word cloud of the industries the hiring companies belong to.

1. Data collection
2. Data cleaning and processing
3. Data analysis

Data Collection

import requests
from fake_useragent import UserAgent
from lxml import etree
import pymysql
import pymongo
import json
import time
from requests import RequestException

mongo_url = 'localhost'
mongo_db = 'zhaopin'

ua = UserAgent()

class Boss(object):
    def __init__(self):
        # URL template: the city code and page number are filled in by run()
        self.url = 'https://www.zhipin.com/{}/?query=数据分析&page={}'
        self.headers = {'user-agent': ua.random,
           'referer':'https://www.zhipin.com/c101020100/?query=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90&page=1',
           # fill in a valid cookie copied from a logged-in browser session;
           # zhipin.com tends to block cookie-less requests
           'cookie': ''}
        self.client = pymongo.MongoClient(mongo_url)
        self.db = self.client[mongo_db]
        # City name -> zhipin.com city code used in the URL path
        self.cityList = {'广州':'c101280100','北京':'c101010100','上海':'c101020100','深圳':'c101280600','杭州':'c101210100','天津':'c101030100','西安':'c101110100','苏州':'c101190400','武汉':'c101200100','厦门':'c101230200','长沙':'c101250100','成都':'c101270100','郑州':'c101180100','重庆':'c101040100'}


    # def get_proxy(self):
    #     PROXY_POOL_URL = 'http://localhost:5555/random'
    #     try:
    #         response = requests.get(PROXY_POOL_URL)
    #         if response.status_code == 200:
    #             return response.text
    #     except ConnectionError:
    #         return None


    def get_one_page(self, url):
        try:
            # proxy = self.get_proxy()
            # proxies = {'http': proxy}
            # print(proxies)
            response = requests.get(url, headers=self.headers)
            if response.status_code == 200:
                return response.text
            return None
        except RequestException:
            print("Request error")
            return None

    def parse_one_page(self, html):
        # Each job posting sits in an <li>/<div class="job-primary"> block
        html = etree.HTML(html)
        content = html.xpath("//li/div[@class='job-primary']")

        for con in content:

            pos_name = con.xpath(".//div[@class='job-title']/text()")[0]
            comp_name = con.xpath(".//div[@class='info-company']/div/h3/a/text()")[0]
            salary = con.xpath(".//h3/a/span/text()")[0]
            scale = con.xpath("./div[@class='info-company']//p/text()[last()]")[0]
            education = con.xpath("./div/p/text()[3]")[0]
            industry = con.xpath(".//div[@class='company-text']/p//text()")[0]
            workyear = con.xpath("./div[@class='info-primary']/p/text()")[1]
            location = con.xpath("./div[@class='info-primary']/p/text()")[0]


            item = {'pos_name':pos_name,
                    'comp_name':comp_name,
                    'salary':salary,
                    'scale':scale,
                    'education':education,
                    'industry':industry,
                    'workyear':workyear,
                    'location':location}
            yield item

    def write_to_file(self, item):
        # Append one JSON object per line (JSON Lines format)
        with open('boss.txt', 'a', encoding='utf-8') as f:
            f.write(json.dumps(item, ensure_ascii=False)+'\n')

    def write_to_csv(self, item):
        # Naive comma-joined output; a field containing a comma would corrupt
        # the file -- csv.writer would be more robust
        with open('爬虫BOSS直聘.txt','a', encoding='utf-8') as file:
            line = str(item['pos_name']) + ',' + str(item['comp_name']) + ',' + str(item['salary']) + ',' + \
                   str(item['scale']) + ',' + str(item['education']) + ',' + str(item['industry']) + ',' + \
                   str(item['workyear']) + ',' + str(item['location']) + '\n'
            file.write(line)

    def save_to_mongo(self, item):
        # insert() was removed in recent pymongo versions; use insert_one()
        if self.db['boss'].insert_one(item):
            print("save successfully")

    def save_to_mysql(self, item):
        conn = pymysql.connect(host='localhost', user='root', password='', db='test7', port=3306,
                               charset='utf8')
        cur = conn.cursor()
        insert_data = "INSERT INTO boss(pos_name, comp_name, salary, scale, education, industry, workyear, location) VALUES(%s, %s, %s, %s, %s, %s, %s, %s)"
        val = (item['pos_name'], item['comp_name'], item['salary'], item['scale'], item['education'], item['industry'], item['workyear'], item['location'])
        cur.execute(insert_data, val)
        conn.commit()
        cur.close()
        conn.close()

    def run(self):
        # Create the output file and write the CSV header row
        title = 'pos_name,comp_name,salary,scale,education,industry,workyear,location' + '\n'
        file = open('爬虫BOSS直聘.txt', 'w', encoding='utf-8')
        file.write(title)
        file.close()


        for city in self.cityList.values():
            for page in range(1, 11):
                url = self.url.format(city, page)
                response = self.get_one_page(url)
                if response is None:
                    continue  # skip pages that failed to download
                for item in self.parse_one_page(response):
                    self.write_to_csv(item)
                time.sleep(3)  # throttle requests to avoid being blocked


if __name__ == '__main__':
    boss = Boss()
    boss.run()

Data Cleaning and Processing

[Figure: a sample of the raw scraped data]

First, the scraped location field is too detailed; we keep only the first two characters, which is enough to identify the city.
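As a rough sketch of this step (assuming the scraped file has been loaded into a pandas DataFrame named df; the loading code is not part of the original post):

import pandas as pd

# Load the comma-separated file produced by the scraper above
df = pd.read_csv('爬虫BOSS直聘.txt')

# Keep only the first two characters of the location -- the city name;
# every city crawled here happens to have a two-character name
df['location'] = df['location'].str[:2]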


[Figure: the location column truncated to the city name]

The salary format is also inconvenient: it comes as a range (for example 15k-25k), so a function is used to split it into a minimum, a maximum, and an average, which are easier to analyse.
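A minimal sketch of such a function, assuming the salary strings follow the plain '15k-25k' range pattern (postings with extra suffixes would need additional stripping first; the new column names are illustrative):

def split_salary(salary):
    # '15k-25k' -> (15, 25, 20.0)
    low, high = salary.lower().replace('k', '').split('-')
    low, high = int(low), int(high)
    return low, high, (low + high) / 2

# Expand the range into three new columns
df[['salary_min', 'salary_max', 'salary_avg']] = df['salary'].apply(
    lambda s: pd.Series(split_salary(s)))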


[Figure: salary split into minimum, maximum, and average columns]

Data Analysis

Overall salary distribution


[Figure: overall salary distribution]
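A histogram like this can be drawn with matplotlib; the sketch below assumes the salary_avg column built earlier, and the font line is only needed so Chinese labels render correctly (it assumes SimHei is installed):

import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']  # optional: lets matplotlib render Chinese text

df['salary_avg'].hist(bins=20)
plt.xlabel('Average monthly salary (k RMB)')
plt.ylabel('Number of postings')
plt.title('Overall salary distribution')
plt.show()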

Salary distribution across cities


[Figure: salary distribution by city]
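One way to reproduce a per-city comparison is a pandas boxplot grouped by city; the same pattern also covers the education and work-experience breakdowns below:

# Box plot of average salary grouped by city
df.boxplot(column='salary_avg', by='location', figsize=(10, 5))
plt.suptitle('')  # drop the automatic super-title pandas adds
plt.title('Salary by city')
plt.ylabel('Average monthly salary (k RMB)')
plt.show()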

Salary distribution across education levels


[Figure: salary distribution by education level]

Let's also take a closer look at the detailed posting counts.


[Figure: detailed posting counts]

Now let's look at how salaries are distributed by work experience in Beijing and Shanghai.


[Figure: salary by work experience in Beijing and Shanghai]

Next, the demand for data analyst positions in Beijing, Shanghai, Guangzhou, and Shenzhen.


[Figure: demand for data analyst positions in Beijing, Shanghai, Guangzhou, and Shenzhen]
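Counting demand per city is a one-liner once the data is in a DataFrame; a sketch, again using the assumed df:

# Number of postings in the four first-tier cities
top4 = df[df['location'].isin(['北京', '上海', '广州', '深圳'])]
top4['location'].value_counts().plot(kind='bar')
plt.ylabel('Number of postings')
plt.show()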

Finally, a word cloud of the industries that the hiring companies operate in.


[Figure: word cloud of hiring companies' industries]
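The word cloud itself can be generated with the wordcloud package; a minimal sketch, where the font path is an assumption (any font file that covers Chinese characters works):

from wordcloud import WordCloud

# Join every industry value into one blob of text
text = ' '.join(df['industry'].astype(str))

wc = WordCloud(font_path='simhei.ttf',  # adjust to a Chinese-capable font on your system
               background_color='white',
               width=800, height=400).generate(text)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()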

The companies hiring data analysts are concentrated in the internet, mobile internet, e-commerce, and finance sectors, so targeting these fields should give a much higher success rate when job hunting.
