Finding Books to Read on Douban with Python

Author: dalalaa | Published: 2017-01-17 21:51 | Read 331 times

I sometimes read book reviews on Douban, and I usually find books through this tag page:

[Image: screenshot of the Douban tag page]

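Unlike Taobao, however, this page has no filtering options, so I decided to scrape it with a crawler and do the filtering myself.
I scraped the following fields: title, rating, number of raters, page count, publication date, and price. These are the things I care about most when choosing a book. I store the scraped data directly in a pandas DataFrame so that it can be filtered.
Here is the code: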

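# -*- coding: UTF-8 -*-
import re
import urllib.parse    # provides urllib.parse.quote
import urllib.request  # provides urllib.request.urlopen; a bare "import urllib" does not reliably load submodules
import pandas as pd
from bs4 import BeautifulSoup
# As of this writing, this Douban page has no anti-scraping checks, so there is no need to pose as a browser.
def findTag(url):
    # Download a page and parse it into a BeautifulSoup tree.
    source_code = urllib.request.urlopen(url)
    soup = BeautifulSoup(source_code, "html.parser")
    return soup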
def findTitle(soup):
    # Collect the book titles from the <h2> headings, removing newlines and spaces.
    titles = []
    titletag = soup.findAll('h2', {'class': True})
    for title in titletag:
        t = title.get_text()
        t = re.sub('\n', '', t)
        t = re.sub(' ', '', t)
        titles.append(t)
    return titles
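def findRating(soup):
    ratin = soup.findAll('div', {'class': 'star clearfix'})
    rating = []
    for item in ratin:
        try:
            # A book may have no rating, in which case find() returns None.
            r = item.find('span', {'class': 'rating_nums'}).get_text()
            r = float(r)
        except (AttributeError, ValueError):
            # Missing or empty rating: fall back to 0.0 so the lists stay aligned.
            r = 0.0
        rating.append(r)
    return rating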
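def findPopularity(soup):
    # Extract the number of raters by stripping the parentheses, "人评价" and "少于" around the count.
    popularity = []
    popu = soup.findAll('div', {'class': 'star clearfix'})
    for item in popu:
        p = item.find('span', {'class': 'pl'}).get_text()
        p = re.sub('\n', '', p)
        p = re.sub(' ', '', p)
        p = re.sub(r'人评价\)', '', p)
        p = re.sub(r'\(', '', p)
        p = re.sub('少于', '', p)
        try:
            p = float(p)
        except ValueError:
            # Unexpected text with no usable count: fall back to 0.0.
            p = 0.0
        popularity.append(p)
    return popularity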
def findInfor(soup):
    # Note: urlopen() is the most time-consuming call in the whole crawler.
    # Use it sparingly and combine requests wherever possible.
    thickness = []
    year = []
    price = []
    thick = soup.findAll('h2', {'class': True})
    for item in thick:
        href = item.find('a').attrs['href']
        # One request per book; all three fields are read from the same detail page.
        soup1 = BeautifulSoup(urllib.request.urlopen(href), "html.parser")
        values = []
        for label in ('页数', '出版年', '定价'):  # page count, publication year, list price
            tag = soup1.find('span', string=re.compile(label))
            # Some books lack a field; store None so the three lists stay aligned.
            values.append(tag.next_sibling if tag is not None else None)
        thickness.append(values[0])
        year.append(values[1])
        price.append(values[2])
    infor = [thickness, year, price]
    return infor
def switchPages(keyword):
    book_title_list = []
    rating_list = []
    popularity_list = []
    thickness_list = []
    year_list = []
    price_list = []
    for i in range(1):  # number of result pages to fetch; "start" advances by 20 per page
        page = "https://book.douban.com/tag/" + urllib.parse.quote(keyword) + "?start=" + str(20 * i) + "&type=T"
        a1 = findTag(page)
        b1 = findTitle(a1)
        book_title_list.extend(b1)
        b3 = findRating(a1)
        rating_list.extend(b3)
        b4 = findPopularity(a1)
        popularity_list.extend(b4)
        b5 = findInfor(a1)
        thickness_list.extend(b5[0])
        year_list.extend(b5[1])
        price_list.extend(b5[2])
    # Sanity check: all six lists must have the same length to build the DataFrame.
    print(len(book_title_list), len(rating_list), len(popularity_list), len(thickness_list), len(year_list), len(price_list))
    df = pd.DataFrame({'Title': book_title_list, 'rating': rating_list, 'popularity': popularity_list, 'thickness': thickness_list, 'year': year_list, 'price': price_list})
    print(df)
    return df

switchPages("编程")  # "编程" is the Douban tag for "programming"
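
The script above stops at printing the DataFrame, so the filtering step that motivated it is not shown. It might look roughly like the sketch below. The sample rows, the thresholds, and the helper name `filter_books` are all made up for illustration; only the column names `Title`, `rating`, and `popularity` come from the DataFrame built in `switchPages`.

```python
import pandas as pd

# Hypothetical sample rows standing in for scraped data (not real Douban values).
df = pd.DataFrame({
    'Title': ['Book A', 'Book B', 'Book C', 'Book D', 'Book E'],
    'rating': [9.1, 7.2, 8.6, 0.0, 8.3],
    'popularity': [1500.0, 3000.0, 40.0, 10.0, 500.0],
})

def filter_books(frame, min_rating=8.0, min_popularity=100):
    """Keep well-rated books with enough raters, highest rating first."""
    mask = (frame['rating'] >= min_rating) & (frame['popularity'] >= min_popularity)
    return frame[mask].sort_values('rating', ascending=False).reset_index(drop=True)

picked = filter_books(df)
print(picked)
```

Requiring a minimum number of raters alongside the rating keeps out books whose high score rests on only a handful of votes.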

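The page count, publication year, and price come back from `next_sibling` as raw text, so they would need converting to numbers before they can be filtered on. A rough sketch follows, assuming the values look something like ' 320', ' 2016-7', and ' 45.00元'. Those formats are guesses, and the helper name `first_number` is invented here.

```python
import re

def first_number(text):
    """Return the first number found in text as a float, or None if there is none."""
    if text is None:
        return None
    match = re.search(r'\d+(?:\.\d+)?', str(text))
    return float(match.group()) if match else None

# Assumed examples of raw scraped values, for illustration only.
for raw in (' 320', ' 2016-7', ' 45.00元', None):
    print(repr(raw), first_number(raw))
```

It could then be applied to a column with something like `df['price'] = df['price'].map(first_number)`.
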