Extraction Problems Encountered with BeautifulSoup

Author: 木下瞳 | Published: 2019-07-25 18:14


Original: https://blog.csdn.net/zjkpy_5/article/details/81041407

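1. Installing and importing the library

Install the library with pip install bs4 (the PyPI package itself is named beautifulsoup4; bs4 is a thin wrapper that pulls it in), not pip install BeautifulSoup, which is the old BeautifulSoup 3 package. Import it with 'from bs4 import BeautifulSoup'.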

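2. When it applies

For example, when scraping the hot comments on Jianshu, each article has its own set of fields, but some articles have tips (rewards) and some do not. If every field is selected into its own flat list, the lists end up with different lengths and the data no longer line up: if article 1 has no tip but article 2 does, article 2's tip gets paired with article 1. Selecting each field into a separate list with bs4 only works when every record has a value for every field.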
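The mismatch can be reproduced on a small page. The sketch below uses made-up markup (the class names note, title, and reward are assumptions, not Jianshu's real ones) and contrasts the flat-list approach with one possible workaround: looking up each field inside its own record.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class names below are made up for illustration.
html = """
<ul>
  <li class="note"><a class="title">Article 1</a></li>
  <li class="note"><a class="title">Article 2</a><span class="reward">3 tips</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Flat lists: 2 titles but only 1 reward, so zip() pairs the reward with Article 1.
titles = [t.get_text() for t in soup.select("a.title")]
rewards = [r.get_text() for r in soup.select("span.reward")]
misaligned = list(zip(titles, rewards))

# Per-record lookup: search inside each <li>, and use None when the field is missing.
records = []
for li in soup.select("li.note"):
    reward = li.select_one("span.reward")
    records.append((li.select_one("a.title").get_text(),
                    reward.get_text() if reward else None))

print(misaligned)
print(records)
```

Searching inside each record keeps the fields together even when some of them are absent, at the cost of one extra loop.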

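3. Code structure for scraping a site

Fetching the page (url and headers are defined elsewhere; the 'lxml' parser requires the lxml package)

import requests
from bs4 import BeautifulSoup
res = requests.get(url, headers=headers, timeout=10)
soup = BeautifulSoup(res.text, 'lxml')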

Extracting

names = soup.select('selector copied from the browser')
apps = soup.select('selector copied from the browser')

It is best to write the copied selector as the full path.

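4. The select('selector copied from the HTML') method

li:nth-of-type(1) matches a single item; plain li matches all items.

Sometimes not everything is scraped, and the cause may be the path inside the selector. Adjust it by removing some path segments, for example change

    tops = soup.select('#rankWrap > div.pc_temp_songlist > ul > li > span.pc_temp_num > strong')

to

    tops = soup.select('#rankWrap > div.pc_temp_songlist > ul > li > span.pc_temp_num')

This is only an example: before removing the last segment I got only 3 results; after removing it I got all of them.

If you see the error

    NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type

change nth-child to nth-of-type, e.g. nth-of-type(1). Older bs4 releases implement only nth-of-type; releases from 4.7.0 on, which use the soupsieve package for select(), also accept nth-child.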
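A small sketch of both points, on made-up markup (the id rank and class num are assumptions). It also shows one plausible reason a longer path returns fewer rows: the final tag in the path is simply missing from some items.

```python
from bs4 import BeautifulSoup

# Hypothetical list markup, loosely modeled on a ranking page.
html = """
<div id="rank">
  <ul>
    <li><span class="num"><strong>1</strong></span></li>
    <li><span class="num"><strong>2</strong></span></li>
    <li><span class="num"><strong>3</strong></span></li>
    <li><span class="num">4</span></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

all_items = soup.select("#rank > ul > li")                  # every <li>
first_item = soup.select("#rank > ul > li:nth-of-type(1)")  # only the first <li>

# The longer path misses rows that have no <strong>; the shorter one finds them all.
with_strong = soup.select("#rank > ul > li > span.num > strong")
without_strong = soup.select("#rank > ul > li > span.num")

print(len(all_items), len(first_item), len(with_strong), len(without_strong))
```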

5. get_text()

select returns a list of tags; call get_text() on each tag in that list to extract its text.
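A minimal sketch on made-up markup. The separator and strip arguments are optional; they are used here only to tidy the whitespace.

```python
from bs4 import BeautifulSoup

html = "<ul><li> Alice </li><li>Bob <b>Jr.</b></li></ul>"
soup = BeautifulSoup(html, "html.parser")

items = soup.select("li")
# get_text() is a Tag method, so call it per element.
# strip=True trims each text piece; " " joins the pieces with a single space.
texts = [li.get_text(" ", strip=True) for li in items]
print(texts)
```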

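6. Using find() and find_all()

The former returns a single result (or None when nothing matches); the latter returns a list of all matches.
find(name, attrs, recursive, text, **kwargs)

The tag returned by find() has an .attrs attribute that holds all of its attributes as a dict.

<ul class="name">123
<li>
Area
</li>
</ul>
To get the text of the <li> tag:

ul = html.find('ul', attrs={'class': 'name'})
li = ul.find('li').text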

For example:

<div class="house-title">
<a data-from="" data-company="" title="良乡房山线苏庄站旁大两居落地窗采光好楼层合适诚心出售看房方便" href="https://beijing.anjuke.com/prop/view/A1379345707?from=filter-saleMetro-salesxq&spread=commsearch_p&position=1&kwtype=filter&now_time=1535599420" target="_blank" class="houseListTitle">
良乡房山线苏庄站旁大两居落地窗采光好楼层合适诚心出售看房方便</a>

<em title="该房源已现金担保，保证房源真实，保证可带看" class="guarantee_icon1">安选验真</em>
</div>

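To extract the title text, use .find('div', class_="house-title").a.text.strip(). find_all() returns a list, which has no .a attribute, so use find() here or pick an element of the list first. When the tag contains only text, .text can be replaced with .string.

Note: the class attribute is passed as class_ because class is a Python keyword; .a refers to the <a> tag inside the <div>.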
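A runnable sketch of the difference, on a shortened copy of the markup above (the href is a placeholder, not the real listing URL):

```python
from bs4 import BeautifulSoup

# Shortened version of the markup above; the href is a placeholder.
html = """
<div class="house-title">
<a href="https://example.com/prop/1" class="houseListTitle">
良乡房山线苏庄站旁大两居</a>
<em class="guarantee_icon1">安选验真</em>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find returns one Tag, so .a can be chained directly.
title = soup.find("div", class_="house-title").a.text.strip()

# find_all returns a list-like ResultSet: pick each element before using .a
titles = [div.a.text.strip() for div in soup.find_all("div", class_="house-title")]
print(title, titles)
```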

Another example:

<div class="details-item">
<span>2室1厅</span><em class="spe-lines">|</em><span>95m²</span><em class="spe-lines">|</em><span>低层(共6层)</span><em class="spe-lines">|</em><span>2009年建造</span><span class="brokername"><i class="iconfont"></i>刘亚男</span>
</div>
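To extract 2室1厅: .find('div', class_="details-item").span.text, or .find('div', class_="details-item").contents[1].text

To extract 95m²: .find('div', class_="details-item").contents[3].text

To extract 低层(共6层): .find('div', class_="details-item").contents[5].text

Note that .contents also counts the whitespace text nodes between tags, so these indexes depend on the exact markup.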
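The index-based lookups can be checked with the sketch below. The whitespace layout of the markup is an assumption (one newline after the opening <div>, no whitespace between the inner tags); with a different layout the indexes shift, which is why selecting the <span> tags directly may be the more robust option.

```python
from bs4 import BeautifulSoup

# Assumed layout: one newline after <div>, then the tags with no whitespace between them.
html = (
    '<div class="details-item">\n'
    '<span>2室1厅</span><em class="spe-lines">|</em>'
    '<span>95m²</span><em class="spe-lines">|</em>'
    '<span>低层(共6层)</span><em class="spe-lines">|</em>'
    '<span>2009年建造</span></div>'
)
soup = BeautifulSoup(html, "html.parser")
item = soup.find("div", class_="details-item")

# contents[0] is the "\n" text node, so the first <span> sits at index 1.
by_index = [item.contents[i].text for i in (1, 3, 5)]

# Selecting the <span> tags directly does not depend on whitespace.
by_tag = [span.text for span in item.find_all("span")]
print(by_index, by_tag)
```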

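7. findAll and find_all

findAll is the older name for find_all; in bs4 both do the same thing.

Use the get function to read a tag's attribute:

soup = BeautifulSoup(html, 'html.parser')
pid = soup.findAll('a', {'class': 'sister'})
for i in pid:
    print(i.get('href'))  # call get on each item to read the tag's attribute value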

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
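The same works for other tags. Attribute-style access such as soup.a returns only the first matching object in the document; to find the other matching tags, use find or findAll.
BeautifulSoup provides the powerful search functions find and findAll. Both methods are available only on Tag objects and on the top-level parsed object (the BeautifulSoup object).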

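findAll(name, attrs, recursive, text, limit, **kwargs)

attrs is a dict of attributes and the values to match. For instance, to match tags whose class is either name or ass: {'class': {'name', 'ass'}}

recursive is a boolean that defaults to True and controls how deep into the tag structure the search goes: with True, all descendants of the tag are searched;

with False, only direct children are searched.

text matches against text content rather than tag attributes (newer releases call this argument string). For example, to count the strings in the page whose text is exactly 'pan', write len(html.findAll(text='pan')); to match strings that merely contain 'pan', pass a compiled regular expression instead.

limit caps the number of results: set it when only the first x matches are of interest.

Keyword arguments select tags with the given attribute values, e.g. html.findAll(id='text')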
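The parameters above can be tried out on a small document. This is a sketch on made-up markup; it uses the string argument, which is the newer spelling of text, and passes a list (rather than a set) for the class alternatives.

```python
import re
from bs4 import BeautifulSoup

html = """
<div id="text">
  <p class="name">pan</p>
  <p class="ass">frying pan</p>
  <section><p class="other">pot</p></section>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
div = soup.find_all(id="text")[0]                        # keyword argument

either = soup.find_all("p", {"class": ["name", "ass"]})  # class is name OR ass
nested = div.find_all("p")                               # recursive=True: all descendants
direct = div.find_all("p", recursive=False)              # direct children only
first = soup.find_all("p", limit=1)                      # stop after one match

exact = soup.find_all(string="pan")                      # strings equal to "pan"
partial = soup.find_all(string=re.compile("pan"))        # strings containing "pan"

print(len(either), len(nested), len(direct), len(first), len(exact), len(partial))
```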

for link in soup.find_all('a'):  # soup.find_all returns a list
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

findAll can also search by tag attributes. To find the p tag with id="hehe", which returns a result set:

pid = soup.findAll('p', id='hehe')  # search by the tag's id attribute
print(pid)
[<p class="title" id="hehe"><b>The Dormouse's story</b></p>]
pid = soup.findAll('p', {'id': 'hehe'})  # search with a dict; the result is a list []
print(pid)
[<p class="title" id="hehe"><b>The Dormouse's story</b></p>]


For example, to extract URLs:

for link in html.find_all('a'):
    # if 'href' in link.attrs:
    print(link.get('href'))  # get returns None when there is no href
for link in html.findAll('a'):
    if 'href' in link.attrs:
        print(link.attrs['href'])

8. The attribute holding a URL differs between the browser and the scraped HTML

In the browser, the URL is in the src attribute:

<img width="220" height="220" class="" data-img="1" source-data-lazy-img="" data-lazy-img="done" src="//img12.360buyimg.com/n7/jfs/t1/2481/15/12216/274259/5bd1af8bE2de8c15f/c56a6788061f4d46.jpg">
In the HTML that is actually scraped, the URL is in the source-data-lazy-img attribute:

<img class="err-product" data-img="1" height="220" source-data-lazy-img="//img12.360buyimg.com/n7/jfs/t1/2481/15/12216/274259/5bd1af8bE2de8c15f/c56a6788061f4d46.jpg" width="220"/>
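Scraping the URL via the src attribute seen in the browser returns nothing. Stepping through with a debugger showed that the actual attribute differs from what the browser displays, presumably because the page's lazy-loading script fills in src only after the page has loaded.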
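One way to cope with this is to try several candidate attributes in order. The sketch below is an assumption about how one might handle it, not the approach used in the original scraper; image_url is a hypothetical helper and the image path is a placeholder.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the scraped <img>; the image path is a placeholder.
html = ('<img class="err-product" data-img="1" '
        'source-data-lazy-img="//img.example.com/n7/sample.jpg" width="220"/>')
soup = BeautifulSoup(html, "html.parser")

def image_url(img):
    """Return the first non-empty URL among the candidate attributes, or None."""
    for attr in ("src", "source-data-lazy-img"):
        value = img.get(attr)  # get returns None when the attribute is absent
        if value:
            return value
    return None

url = image_url(soup.find("img"))
print(url)
```

Printing the tag with print(img.attrs) before choosing attribute names is a quick way to see what the server really sent.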

9. .prettify()

Returns the HTML as an indented string; print(soup.prettify()) prints the formatted HTML code.

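10. Checking whether a node is a tag

    if isinstance(tr, bs4.element.Tag):  # filter out non-tag nodes (requires import bs4)
        tds = tr('td')  # query the td tags under tr; tds is a list
        university_list.append([tds[0].string, tds[1].string, tds[2].string])  # take the first three texts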
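The fragment above can be placed in a complete loop as follows. This is a sketch on a made-up table; the rows and values are placeholders.

```python
import bs4
from bs4 import BeautifulSoup

# Hypothetical ranking table; the rows and values are placeholders.
html = """
<tbody>
<tr><td>1</td><td>University A</td><td>City X</td><td>95.0</td></tr>
<tr><td>2</td><td>University B</td><td>City Y</td><td>90.5</td></tr>
</tbody>
"""
soup = BeautifulSoup(html, "html.parser")

university_list = []
# .children yields the "\n" strings between rows as well as the <tr> tags.
for tr in soup.find("tbody").children:
    if isinstance(tr, bs4.element.Tag):  # skip NavigableString nodes
        tds = tr("td")                   # calling a tag is shorthand for find_all
        university_list.append([tds[0].string, tds[1].string, tds[2].string])

print(university_list)
```

Without the isinstance check, the loop would call the newline strings between rows as if they were tags and fail.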

11. link.attrs['href'].startswith('/')

True when the href begins with '/', which is typical of site-internal links.
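A sketch of how this check might be used to collect internal links and turn them into absolute URLs. The base_url value and the markup are assumptions for illustration.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://example.com/list/"  # assumed address of the page being scraped
html = """
<a href="/item/1">internal</a>
<a href="https://other.example.org/">external</a>
<a name="no-href">anchor</a>
"""
soup = BeautifulSoup(html, "html.parser")

internal = []
for link in soup.find_all("a"):
    href = link.attrs.get("href")  # None when the tag has no href
    if href and href.startswith("/"):
        internal.append(urljoin(base_url, href))

print(internal)
```

Using attrs.get instead of attrs['href'] avoids a KeyError on <a> tags that have no href.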