链家房源爬虫（含源码）

作者: 机器学习爱好者1 | 来源:发表于2017-11-21 11:30 被阅读0次

链家房源爬虫（含源码）
scrapy 爬取链家北京租房信息
深圳链家在售二手房数据抓取
鲸致|链接朋友圈广告分析
python爬虫爬房多多链家房源信息
Python爬虫实战之爬取链家广州房价_03存储
深圳链家房源分析
链家网房源数据
我爱生活更爱你-租房那些事儿（1）
简述产品（3）-链家

链家APP上有很多在售房源信息以及成交房源信息，如果可以把这些信息爬下来，可以得到很多有价值的信息。因此本文将讲一讲如何爬取这些数据，并保存下来供以后分析。
本文将介绍以下几个方面：

程序介绍
使用教程
实现思路
数据存储
可视化分析

程序介绍

该程序支持爬取链家在线二手房数据，历史成交数据，在线租房数据和指定城市所有小区数据。
数据存储目前支持三种数据库格式（mysql，postgreSql, Sqlite3)。
由于链家网采取对IP限流设置，所以该程序没有采取多线程爬取，并且限制了爬取速度来防止被封。
提供mysql数据转到ES的解决方案，方便进行数据可视化分析。

使用教程

源码地址 https://github.com/XuefengHuang/lianjia-scrawler 如果喜欢，请给个star支持一下，谢谢！
下载源码并安装依赖包

1. git clone https://github.com/XuefengHuang/lianjia-scrawler.git
2. cd lianjia-scrawler
# If you'd like not to use [virtualenv](https://virtualenv.pypa.io/en/stable/), please skip step 3 and 4.
3. virtualenv lianjia
4. source lianjia/bin/activate
5. pip install -r requirements.txt

设置数据库信息以及爬取城市行政区信息（支持三种数据库格式）

DBENGINE = 'mysql' #ENGINE OPTIONS: mysql, sqlite3, postgresql
DBNAME = 'test'
DBUSER = 'root'
DBPASSWORD = ''
DBHOST = '127.0.0.1'
DBPORT = 3306
CITY = 'bj' # only one, shanghai=sh shenzhen=sh......
REGIONLIST = [u'chaoyang', u'xicheng'] # 只支持拼音

运行 python scrawl.py! (请注释14行如果已爬取完所想要的小区信息)
可以修改scrawl.py来只爬取在售房源信息或者成交房源信息或者租售房源信息

实现思路

开始抓取前先观察下目标页面或网站的结构，其中比较重要的是URL的结构。链家网的二手房列表页面共有100个，URL结构为http://bj.lianjia.com/ershoufang/pg9/，其中bj表示城市，/ershoufang/是频道名称，pg9是页面码。我们要抓取的是北京的二手房频道，所以前面的部分不会变，属于固定部分，后面的页面码需要在1-100间变化，属于可变部分。将URL分为两部分，前面的固定部分赋值给url，后面的可变部分使用for循环。我们以根据小区名字搜索二手房出售情况为例：

BASE_URL = u"http://bj.lianjia.com/"
url = BASE_URL + u"ershoufang/rs" + urllib2.quote(communityname.encode('utf8')) + "/"
total_pages = misc.get_total_pages(url) //获取总页数信息
for page in range(total_pages):
    if page > 0:
        url_page = BASE_URL + u"ershoufang/pg%drs%s/" % (page+1, urllib2.quote(communityname.encode('utf8')))

//获取总页数信息代码
def get_total_pages(url):
    source_code = get_source_code(url)
    soup = BeautifulSoup(source_code, 'lxml')
    total_pages = 0
    try:
        page_info = soup.find('div',{'class':'page-box house-lst-page-box'})
    except AttributeError as e:
        page_info = None

    if page_info == None:
        return None
    page_info_str = page_info.get('page-data').split(',')[0]  #'{"totalPage":5,"curPage":1}'
    total_pages = int(page_info_str.split(':')[1])
    return total_pages

页面抓取完成后无法直接阅读和进行数据提取，还需要进行页面解析。我们使用BeautifulSoup对页面进行解析。

soup = BeautifulSoup(source_code, 'lxml')
nameList = soup.findAll("li", {"class":"clear"})

完成页面解析后就可以对页面中的关键信息进行提取了。下面我们分别对房源各个信息进行提取。

for name in nameList: # per house loop
    i = i + 1
    info_dict = {}
    try:
        housetitle = name.find("div", {"class":"title"})
        info_dict.update({u'title':housetitle.get_text().strip()})
        info_dict.update({u'link':housetitle.a.get('href')})

        houseaddr = name.find("div", {"class":"address"})
        info = houseaddr.div.get_text().split('|')
        info_dict.update({u'community':info[0].strip()})
        info_dict.update({u'housetype':info[1].strip()})
        info_dict.update({u'square':info[2].strip()})
        info_dict.update({u'direction':info[3].strip()})

        housefloor = name.find("div", {"class":"flood"})
        floor_all = housefloor.div.get_text().split('-')[0].strip().split(' ')
        info_dict.update({u'floor':floor_all[0].strip()})
        info_dict.update({u'years':floor_all[-1].strip()})

        followInfo = name.find("div", {"class":"followInfo"})
        info_dict.update({u'followInfo':followInfo.get_text()})

        tax = name.find("div", {"class":"tag"})
        info_dict.update({u'taxtype':tax.get_text().strip()})

        totalPrice = name.find("div", {"class":"totalPrice"})
        info_dict.update({u'totalPrice':int(totalPrice.span.get_text())})

        unitPrice = name.find("div", {"class":"unitPrice"})
        info_dict.update({u'unitPrice':int(unitPrice.get('data-price'))})
        info_dict.update({u'houseID':unitPrice.get('data-hid')})
    except:
        continue

提取完后，为了之后数据分析，要存进之前配置的数据库中。

model.Houseinfo.insert(**info_dict).upsert().execute()
model.Hisprice.insert(houseID=info_dict['houseID'], totalPrice=info_dict['totalPrice']).upsert().execute()

数据存储

可支持数据库：mysql，postgreSql, Sqlite3
数据库信息：

Community小区信息（id, title, link, district, bizcurcle, taglist）

Houseinfo在售房源信息（houseID, title, link, community, years, housetype, square, direction, floor, taxtype, totalPrice, unitPrice, followInfo, validdate)

Hisprice历史成交信息（houseID，totalPrice，date）

Sellinfo成交房源信息(houseID, title, link, community, years, housetype, square, direction, floor, status, source,, totalPrice, unitPrice, dealdate, updatedate)

Rentinfo租售房源信息 (houseID, title, link, region, zone, meters, other, subway, decoration, heating, price, pricepre, updatedate)