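I. Project analysis

Scrape the "title", "excerpt", and "link" of the articles by Zhihu big-V Zhang Jiawei (张佳玮) and store them in a local file.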

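1. Check whether the target data is in the HTML page

Zhang Jiawei's Zhihu articles are at: https://www.zhihu.com/people/zhang-jia-wei/posts?page=1
Right-click, choose Inspect, open the Network tab, and select All (not XHR). Then refresh the page, click the first request in the list (number 0), posts_by_votes, and open Preview.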

Article titles show up there, so the data appears to be in the HTML. That suggests taking the upper path in the [Knowledge Map].

2. Write the code

Now go back to Elements and look at the page source to work out which tag to extract the data from.
Fetching the article titles:

# Import requests and BeautifulSoup
import requests
from bs4 import BeautifulSoup

url = 'https://www.zhihu.com/people/zhang-jia-wei/posts/posts_by_votes?page=1'
# Sending headers is standard practice; this assumes you already know how
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
# Send the request and assign the response to res
res = requests.get(url, headers=headers)
# Check the status code
print(res.status_code)
# Parse the HTML with BeautifulSoup
bstitle = BeautifulSoup(res.text, 'html.parser')
# Extract the tags we want, together with their contents
title = bstitle.find_all(class_='ContentItem-title')
# Print the matching tags
print(title)

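Only two titles come back. To investigate, print the page source with res.text. Searching that source for the title of the last article on the first page, "出走半生，关山万里，归来仍是少女心气", finds nothing, so the analysis has to start over.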
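
The manual check above (searching the raw source for a title you can see in the browser) can be sketched as a small helper. This is only an illustration: missing_titles and sample_source are hypothetical names, and sample_source stands in for res.text so the snippet runs without a network connection.

```python
def missing_titles(page_source, expected_titles):
    """Return the titles that do not occur anywhere in the raw page source."""
    return [title for title in expected_titles if title not in page_source]


# Hypothetical stand-in for res.text: only the first title is server-rendered.
sample_source = '<h2 class="ContentItem-title"><a>First article</a></h2>'

print(missing_titles(sample_source, ["First article", "Last article"]))
# -> ['Last article']
```

Any title that comes back from this check is visible in the browser but absent from the HTML, which hints that it is loaded by a separate request.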

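3. Re-analyze via XHR

Open Network, click XHR, and refresh the page; many requests appear.
Looking through them, two requests have "articles" in their names, which looks promising. Open the first articles request, go to Preview, and expand it level by level until you see "title: 记住就是一切", which is presumably an article title.
Search the page for "记住就是一切" with command+f (ctrl+f on Windows). Strangely, there is no match.
So look at the other articles request, which closely resembles the first. Again in Preview, you see title: "国产航母下水……让我想到李鸿章和北洋舰队". Search the page for that as well: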

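Sure enough, it is there. So this articles request holds the article titles for the first page, and now we know which URL to request the data from.
What about the first articles request? That comes from how Zhihu designed the site: when you refresh page 1, it also requests the article data for page 2 by default, so that loading feels smoother.
In principle we can now get the first page of articles, but that is not enough to get every later page.
So compare the request that page 1 makes for page 2 with the request that page 2 makes for page 3. The parameters that differ are under Query String Parameters in the Headers tab.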

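Everything is identical except offset. offset is the starting index and limit is the maximum number of items loaded per request, so a loop that steps offset can fetch the content of every page.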
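
To make the offset arithmetic concrete, here is a minimal sketch that maps a 1-based page number to the query parameters. build_params is a hypothetical helper, the page size of 20 is taken from the limit value seen in the request, and the long include parameter is left out for brevity.

```python
def build_params(page, page_size=20):
    """Build the query parameters for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return {
        # Page 1 starts at offset 0, page 2 at 20, page 3 at 40, ...
        "offset": str((page - 1) * page_size),
        "limit": str(page_size),
        "sort_by": "voteups",
    }


for page in (1, 2, 3):
    print(page, build_params(page)["offset"])
# -> 1 0
# -> 2 20
# -> 3 40
```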

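That gives us the rough plan: loop over offset, request the articles URL each time, and pull the title, excerpt, and link out of the JSON response.

4. Fetch the first 3 pages of data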

import requests

articleInfo = []
url = 'https://www.zhihu.com/api/v4/members/zhang-jia-wei/articles'
for i in range(3):
    # offset advances by one page (20 articles) per iteration
    params = {
        'include' : 'data[*].comment_count,suggest_edit,is_normal,thumbnail_extra_info,thumbnail,can_comment,comment_permission,admin_closed_comment,content,voteup_count,created,updated,upvoted_followees,voting,review_info,is_labeled,label_info;data[*].author.badge[?(type=best_answerer)].topic',
        'offset' : str(i*20),
        'limit' : '20',
        'sort_by' : 'voteups'
    }
    header = {
        # page numbers in the site URL start at 1, hence i + 1
        'referer' : 'https://www.zhihu.com/people/zhang-jia-wei/posts/posts_by_votes?page=' + str(i + 1),
        'user-agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
    }
    res = requests.get(url, params=params, headers=header)
    # the response is JSON; the articles are in its 'data' list
    article = res.json()
    for oneArticle in article['data']:
        articleInfo.append([oneArticle['title'], oneArticle['excerpt'], oneArticle['url']])

print(articleInfo)
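
The extraction step can also be tried offline. The sketch below assumes only the fields the code above already reads (data, title, excerpt, url) plus paging.is_end. extract_rows is a hypothetical helper and sample_payload is made-up data, not a real Zhihu response.

```python
def extract_rows(payload):
    """Turn one parsed JSON response into [title, excerpt, url] rows."""
    return [[item["title"], item["excerpt"], item["url"]] for item in payload["data"]]


# Made-up payload with the same shape the loop above relies on.
sample_payload = {
    "paging": {"is_end": False},
    "data": [
        {"title": "Sample title", "excerpt": "Sample excerpt", "url": "https://example.com/p/1"},
    ],
}

print(extract_rows(sample_payload))
# -> [['Sample title', 'Sample excerpt', 'https://example.com/p/1']]
```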

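How can we tell when we have reached the last page?
Compare what the requests for the first page and the last page return:

On the first page is_end is false, while on the last page it is true. This field can be used to end the loop.
As for totals: 919, working from the page count and the number of articles per page, I judged it to be the total number of articles, so it could also serve as the stopping condition. Either field works; here we use is_end.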
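
As an aside, the totals alternative could be sketched like this. page_count and offsets are hypothetical helpers, and 919 is simply the value observed above; the tutorial itself continues with is_end.

```python
import math


def page_count(totals, page_size=20):
    """Number of requests needed to cover `totals` articles."""
    return math.ceil(totals / page_size)


def offsets(totals, page_size=20):
    """Every offset value to request, in order."""
    return list(range(0, totals, page_size))


print(page_count(919))   # -> 46
print(offsets(919)[-1])  # -> 900
```

With totals the number of requests is known before the loop starts, whereas is_end needs no arithmetic and still works if the article count changes mid-run.
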
So the final code is:

import requests
import csv

articleInfo = []
url = 'https://www.zhihu.com/api/v4/members/zhang-jia-wei/articles'
i = 0
# utf-8-sig keeps the Chinese text readable when the CSV is opened in Excel
with open("article.csv", 'w', newline='', encoding='utf-8-sig') as file:
    writer = csv.writer(file)
    while True:
        params = {
            'include' : 'data[*].comment_count,suggest_edit,is_normal,thumbnail_extra_info,thumbnail,can_comment,comment_permission,admin_closed_comment,content,voteup_count,created,updated,upvoted_followees,voting,review_info,is_labeled,label_info;data[*].author.badge[?(type=best_answerer)].topic',
            'offset' : str(i*20),
            'limit' : '20',
            'sort_by' : 'voteups'
        }
        header = {
            # page numbers in the site URL start at 1, hence i + 1
            'referer' : 'https://www.zhihu.com/people/zhang-jia-wei/posts/posts_by_votes?page=' + str(i + 1),
            'user-agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
        }
        res = requests.get(url, params=params, headers=header)
        article = res.json()
        for oneArticle in article['data']:
            row = [oneArticle['title'], oneArticle['excerpt'], oneArticle['url']]
            articleInfo.append(row)
            writer.writerow(row)
        # progress: page index and whether this was the last page
        print(str(i) + " " + str(article['paging']['is_end']))
        if article['paging']['is_end']:
            break
        i += 1
print(articleInfo)
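
The CSV-writing part can be checked without touching the network or the disk. The sketch below is an optional variation rather than part of the original script: write_rows is a hypothetical helper that also writes a header row, and an in-memory io.StringIO buffer stands in for article.csv.

```python
import csv
import io


def write_rows(file_obj, rows):
    """Write a header line followed by [title, excerpt, url] rows."""
    writer = csv.writer(file_obj)
    writer.writerow(["title", "excerpt", "url"])
    writer.writerows(rows)


# In-memory stand-in for the file; newline="" matches how the file is opened above.
buffer = io.StringIO(newline="")
write_rows(buffer, [["Sample title", "An excerpt, with a comma", "https://example.com/p/1"]])
print(buffer.getvalue())
```

The csv module quotes the excerpt because it contains a comma, so the file can be read back into the same three columns.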