Getting Past StackPath Anti-Scraping Protection in a Web Crawler

Author: 是东东 | Published: 2020-09-17 10:17

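When StackPath's anti-scraping protection is triggered, the site returns the following prompt instead of the page content:

[Image: StackPath challenge prompt]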

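The fix is simple: load the page once with Selenium so that a real browser passes the challenge, then reuse the cookies it receives in ordinary requests.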
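To make the cookie hand-off concrete: Selenium's `get_cookies()` returns a list of dictionaries with `name` and `value` keys and, for non-session cookies, an `expiry` timestamp in epoch seconds. The sketch below flattens such a list into a plain dict and skips cookies that have already expired. The helper name `usable_cookies` and the sample cookie names and values are made up for illustration; they are not StackPath's actual cookies.

```python
import time


def usable_cookies(selenium_cookies, now=None):
    """Flatten Selenium-style cookie dicts into {name: value}, dropping expired ones."""
    now = time.time() if now is None else now
    result = {}
    for cookie in selenium_cookies:
        expiry = cookie.get("expiry")  # absent for session cookies
        if expiry is not None and expiry <= now:
            continue  # already expired, so sending it would be pointless
        result[cookie["name"]] = cookie["value"]
    return result


# Made-up sample in the shape returned by browser.get_cookies()
sample = [
    {"name": "challenge_token", "value": "abc123", "expiry": 2_000_000_000},
    {"name": "old_token", "value": "stale", "expiry": 1_000_000_000},
    {"name": "session_id", "value": "xyz"},
]
print(usable_cookies(sample, now=1_600_000_000))
```

A dict in this shape can be passed directly as the `cookies` argument of `requests.get`.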

The target site in this example:

https://dailynewsegypt.com/

The code is as follows:

import time
import requests
from selenium import webdriver

UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"


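def get_cookie(url):
    """Load url in headless Chrome and return its cookies as a {name: value} dict."""
    chrome_options = webdriver.ChromeOptions()
    # Use the same User-Agent that the later requests call will send.
    chrome_options.add_argument('user-agent=' + UA)
    # Skip image loading to speed up the page load.
    chrome_options.add_argument('blink-settings=imagesEnabled=false')
    chrome_options.add_argument('--window-size=1920,1080')
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-gpu')
    # Hide the most obvious automation markers.
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)
    browser = webdriver.Chrome(options=chrome_options)
    try:
        # Make navigator.webdriver read as undefined before any page script runs.
        browser.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
            "source": """
            Object.defineProperty(navigator, 'webdriver', {
              get: () => undefined
            })
          """
        })
        browser.get(url)
        # Give the challenge script time to run and set its cookies.
        time.sleep(5)
        return {c['name']: c['value'] for c in browser.get_cookies()}
    finally:
        # quit() ends the whole driver session; close() would only close the window.
        browser.quit()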


headers = {
    "Host": "dailynewsegypt.com",
    "Connection": "keep-alive",
    "Cache-Control": "max-age=0",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": UA,
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Sec-Fetch-Site": "same-origin",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-User": "?1",
    "Sec-Fetch-Dest": "document",
    "Referer": "https://dailynewsegypt.com/",
    # "br" is left out: requests decodes Brotli only when a brotli package is installed.
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9,zh-TW;q=0.8,th;q=0.7,en;q=0.6",
}
url = 'https://dailynewsegypt.com/category/opinion/page/2/'
# Pass the challenge once in the browser, then replay its cookies with requests.
cookies = get_cookie(url)
req = requests.get(url=url, headers=headers, cookies=cookies)
time.sleep(5)
print(req.text)
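
The cookies obtained this way will eventually expire or be rejected, at which point the challenge page comes back. One possible way to handle that, sketched here under assumptions, is to wrap the fetch in a helper that re-acquires cookies when the response looks blocked. The name `fetch_with_cookie_refresh`, its parameters, and the `is_blocked` check are all hypothetical; how to recognize the challenge page reliably is left to the caller, since the source does not specify it. Stand-in functions replace the browser and the network so the sketch runs on its own.

```python
def fetch_with_cookie_refresh(url, fetch, get_cookies, is_blocked, max_refreshes=1):
    """Fetch url, re-acquiring cookies when the response looks like a challenge page.

    fetch(url, cookies) -> response body as str
    get_cookies(url)    -> dict of cookies (e.g. a Selenium-based helper)
    is_blocked(body)    -> True if body is a challenge page, not real content
    """
    cookies = get_cookies(url)
    body = fetch(url, cookies)
    refreshes = 0
    while is_blocked(body) and refreshes < max_refreshes:
        cookies = get_cookies(url)  # previous cookies expired or were rejected
        body = fetch(url, cookies)
        refreshes += 1
    if is_blocked(body):
        raise RuntimeError(f"still blocked after {refreshes} cookie refresh(es): {url}")
    return body


# Stand-in callables so the sketch runs without a browser or network.
calls = {"cookies": 0}


def fake_get_cookies(url):
    calls["cookies"] += 1
    return {"token": f"v{calls['cookies']}"}


def fake_fetch(url, cookies):
    # Pretend the first token is rejected and the second one is accepted.
    return "<html>article</html>" if cookies["token"] == "v2" else "challenge"


body = fetch_with_cookie_refresh(
    "https://dailynewsegypt.com/category/opinion/page/2/",
    fetch=fake_fetch,
    get_cookies=fake_get_cookies,
    is_blocked=lambda text: text == "challenge",
)
print(body)
```

With the script above, `get_cookies` could be `get_cookie` and `fetch` could be `lambda u, c: requests.get(u, headers=headers, cookies=c).text`. Keeping `max_refreshes` small avoids launching a browser in a loop when the site blocks the crawler for a reason other than stale cookies.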