Web Scraping: Crawler Basics and the requests Module

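Series contents

Chapter 1: Web scraping: crawler basics and the requests module

Chapter 2: Setting up proxies, scraping a video site, scraping news, introduction to BeautifulSoup4, traversing the document tree with bs4, searching the document tree with bs4, using selectors with bs4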

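Preface

Scraping sits in a legal gray area and can be illegal when it ignores a site's rules, so this post is meant purely for learning and discussion.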

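I. Crawler Basics

A crawler essentially imitates a browser: it sends requests to a server and collects the data that comes back. How well a crawler works depends mostly on how closely its requests resemble those of a real browser. Once the server accepts the requests, the remaining work is to clean the returned data and save it.

Crawlers are expected to follow the robots protocol:
A site publishes its crawling rules in robots.txt. Respecting those rules keeps a crawler on much safer ground, although it is not by itself a guarantee of legality. In practice, we should only crawl paths that the server's robots.txt does not disallow.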
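
One way to check a URL against robots.txt before crawling it is the standard library's urllib.robotparser. The sketch below is only an illustration: the robots.txt content, the crawler name "my-crawler", and the example.com URLs are all made up so that it runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "my-crawler" is a hypothetical user agent name.
allowed_news = parser.can_fetch("my-crawler", "https://example.com/news/1.html")
allowed_admin = parser.can_fetch("my-crawler", "https://example.com/admin/users")
print(allowed_news, allowed_admin)
```

Against a real site, calling parser.set_url() with the site's robots.txt address and then parser.read() would load the live rules instead of the sample text.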

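II. The requests Module

1. Introduction to the requests module

The built-in urllib module can send HTTP requests, but its API is cumbersome. The requests module can be used instead to imitate a browser's requests: its API is far more convenient than urllib's (under the hood it wraps urllib3). The module is not limited to crawlers; it is equally useful when one server needs to talk to another.

Note: requests downloads the page content but does not execute JavaScript. If the data is loaded by JavaScript, we have to analyze the target site ourselves and send further requests to the endpoints it calls.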

2. Installing requests

pip3 install requests

3. Sending GET parameters

import requests

# Option 1: put the parameters directly in the URL
res = requests.get('https://www.xxsy.net/search?s_wd=将军，夫人')
# Option 2: pass them through params and let requests encode them
params_dict = {'s_wd': '将军，夫人'}
res1 = requests.get('https://www.xxsy.net/search', params=params_dict)
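
To see what the params option actually produces, the request can be built and prepared without being sent. This is a small sketch rather than part of the original walkthrough; it uses requests.Request and its prepare() method, so no network access is needed.

```python
import requests

# Build the request without sending it.
request = requests.Request(
    'GET',
    'https://www.xxsy.net/search',
    params={'s_wd': '将军，夫人'},
)
prepared = request.prepare()

# The non-ASCII value is percent-encoded as UTF-8 in the final URL.
print(prepared.url)
```

Because requests handles the encoding, the params form is usually the safer choice when values contain Chinese characters, spaces, or symbols.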


4. Sending request headers

Common request headers:
User-Agent: the client type
Referer: https://www.lagou.com/gongsi/ (the page the request came from)

import requests

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}
res = requests.get('https://dig.chouti.com/', headers=header)

5. Sending cookies

import requests

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    # Option 1 needs the cookie string copied from a logged-in browser session:
    # 'Cookie': '<cookie string copied from the browser>',
}

# Option 1: send the cookie inside the request headers
res = requests.post('https://dig.chouti.com/link/vote',
                    headers=header,
                    data={
                        'linkId': '35811284'
                    })

print(res.text)

# Option 2: use the cookies parameter. After a successful login the response
# carries its cookies as a CookieJar object, which can be passed in directly.
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}

res = requests.post('https://dig.chouti.com/link/vote',
                    headers=header,
                    # accepts a dict or a CookieJar
                    # cookies={},
                    data={
                        'linkId': '35811284'
                    })

print(res.text)
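
Both options end up sending the same Cookie header. The sketch below prepares a request without sending it to show how a cookies dict is turned into that header; the cookie name and value are made up for illustration.

```python
import requests

# Hypothetical cookie, for illustration only.
cookies = {'sessionid': 'abc123'}

prepared = requests.Request(
    'POST',
    'https://dig.chouti.com/link/vote',
    cookies=cookies,
    data={'linkId': '35811284'},
).prepare()

# requests serializes the dict into a single Cookie header.
cookie_header = prepared.headers['Cookie']
print(cookie_header)
```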

6. POST requests

import requests

res = requests.post('http://www.aa7a.cn/user.php', data={
    'username': '@qq.com',
    'password': '',
    'captcha': 'aaaa',
    'remember': 1,
    'ref': 'http://www.aa7a.cn/user.php?act=logout',
    'act': 'act_login',
})
print(res.text)
print(res.cookies)  # cookies set by the login response; the CookieJar behaves like a dict
# After a successful login, reuse the cookies on later requests
res1 = requests.get('http://www.aa7a.cn/', cookies=res.cookies)
print('12344554@qq.com' in res1.text)  # True if the logged-in account shows up in the page

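7. Sending data with a POST request

A POST body can be sent in three formats:
form-data, urlencoded (the default), and json

res = requests.post('xxx', json={})  # pick the format with the matching keyword argument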
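
The three formats map to the data, json, and files keyword arguments. As a rough illustration, the sketch below prepares one request of each kind without sending it and prints the Content-Type that requests would use; the file name and contents are placeholders.

```python
import requests

URL = 'http://httpbin.org/post'  # nothing is sent; the requests are only prepared

form = requests.Request('POST', URL, data={'a': 1}).prepare()
as_json = requests.Request('POST', URL, json={'a': 1}).prepare()
# ('a.txt', b'hello') is a placeholder file name and file content.
multipart = requests.Request('POST', URL, files={'file': ('a.txt', b'hello')}).prepare()

for name, prepared in [('data', form), ('json', as_json), ('files', multipart)]:
    print(name, '->', prepared.headers['Content-Type'])
```

Which one to use depends on what the target endpoint expects; the browser's developer tools show the Content-Type of the original request.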

8. Session

requests.session() returns a Session object that keeps track of cookies automatically across requests.

import requests

session = requests.session()
# Send the login request through the session
session.post('http://www.aa7a.cn/user.php', data={
    'username': '1233434@qq.com',
    'password': 'bbc123',
    'captcha': 'aaaa',
    'remember': 1,
    'ref': 'http://www.aa7a.cn/user.php?act=logout',
    'act': 'act_login',
})
# The session sends the login cookies along automatically
res1 = session.get('http://www.aa7a.cn/')
print('1233434@qq.com' in res1.text)
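
A Session also merges its own headers and cookies into every request it sends. The following sketch shows that merge offline using Session.prepare_request; the User-Agent string and the cookie are invented for the example.

```python
import requests

session = requests.Session()
session.headers.update({'User-Agent': 'demo-crawler/0.1'})  # hypothetical UA string
session.cookies.set('sessionid', 'abc123')                   # hypothetical cookie

# prepare_request applies the session's headers and cookies without sending anything.
prepared = session.prepare_request(requests.Request('GET', 'http://www.aa7a.cn/'))
print(prepared.headers['User-Agent'])
print(prepared.headers['Cookie'])
```

Setting common headers once on the session avoids repeating the headers argument on every call.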

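9. Response attributes

requests wraps the HTTP response in a Response object. Its main attributes and methods are:

import requests

response = requests.get('https://www.cnblogs.com/')
response = requests.get('http://www.autohome.com/news')
print(response.text)         # response body as a string
print(response.content)      # response body as bytes
print(response.status_code)  # HTTP status code
print(response.headers)      # response headers
print(response.cookies)      # cookies set by the response
print(response.cookies.get_dict())  # cookies converted to a dict
print(response.cookies.items())     # cookies as (name, value) pairs
print(response.url)          # final URL of the response
print(response.history)      # list of earlier responses if the request was redirected
print(response.encoding)     # encoding used to decode response.text

# iter_content() reads the body in chunks; it is used to download images and videos.
# stream=True keeps requests from loading the whole file into memory first.
res = requests.get('https://video.pearvideo.com/mp4/adshort/20220427/cont-1760318-15870165_adpkg-ad_hd.mp4', stream=True)
with open('sp3.mp4', 'wb') as f:
    # f.write(res.content)  # alternative: write the whole body at once
    for chunk in res.iter_content(chunk_size=1024):  # write 1024 bytes at a time
        f.write(chunk)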
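
The chunked download can be wrapped in a small reusable helper. The function below is a sketch, not the author's code: save_response is a name chosen here, and the commented usage assumes a video_url variable that you supply.

```python
import requests


def save_response(response: requests.Response, path: str, chunk_size: int = 1024) -> int:
    """Write the response body to path chunk by chunk; return the bytes written."""
    written = 0
    with open(path, 'wb') as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written


# Usage (needs network access, so it is left commented out):
# with requests.get(video_url, stream=True, timeout=10) as res:
#     res.raise_for_status()
#     print(save_response(res, 'sp3.mp4'))
```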

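10. Encoding issues

Most sites use UTF-8, but older Chinese sites often use gbk or gb2312.

import requests

response = requests.get('http://www.autohome.com/news')
# response.encoding = 'gbk'  # set this when the guessed encoding is wrong
print(response.text)  # decoded with response.encoding, which requests guesses from the headers; a wrong guess garbles Chinese text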
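
The garbling comes from decoding bytes with the wrong codec, which can be reproduced with plain Python. In this sketch the sample string is made up; response.text does the same kind of decoding using response.encoding, and response.apparent_encoding can offer a guess based on the body when the headers are unhelpful.

```python
# Bytes as an older GBK-encoded site would send them (sample text, not real site data).
raw = '中文'.encode('gbk')

wrong = raw.decode('utf-8', errors='replace')  # decoded with the wrong codec
right = raw.decode('gbk')                      # decoded with the site's real codec

print(repr(wrong))
print(right)
```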

11. Getting binary data

response.content  # the whole body as bytes
response.iter_content(chunk_size=1024)  # the body in chunks of 1024 bytes

12. Parsing JSON

import json

import requests

res = requests.post('http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword', data={
    'cname': '',
    'pid': '',
    'keyword': '北京',
    'pageIndex': 1,
    'pageSize': 10,
})

# Option 1: parse the body text with the json module
dic_res = json.loads(res.text)
print(type(dic_res))
print(dic_res['Table1'][0]['storeName'])

# Option 2: let the response parse itself; res.json() gives the same result
print(type(res.json()))

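13. Advanced usage: certificate verification (for reference only)

When sending an HTTPS request, requests verifies the server's certificate, much as a browser does with its built-in trusted certificates. If the site's certificate was not issued by a trusted authority, verification fails with a certificate error. In that case the certificate has to be passed explicitly with the request (through the verify and cert parameters) before the data can be fetched.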

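14. Proxies

When requests go through proxy IPs, the site can no longer use IP-based rate limits or blacklists (IP bans) to stop us from sending many requests.
Setting up proxies is covered in the next post.

import requests

# The key is the scheme of the target URL, so an https URL needs an 'https' entry
proxies = {
    'http': 'http://112.14.47.6:52024',
    'https': 'http://112.14.47.6:52024',
}
#180.164.66.7
response = requests.get('https://www.cnblogs.com/', proxies=proxies)
print(response.status_code)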
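
requests picks a proxy by matching the scheme of the target URL against the keys of the proxies dict, so a dict with only an 'http' key is ignored for https URLs. The sketch below checks this offline with requests.utils.select_proxy, reusing the proxy address from the example above.

```python
from requests.utils import select_proxy

target = 'https://www.cnblogs.com/'

http_only = {'http': 'http://112.14.47.6:52024'}
both = {
    'http': 'http://112.14.47.6:52024',
    'https': 'http://112.14.47.6:52024',
}

# With only an 'http' key, no proxy is selected for an https URL.
print(select_proxy(target, http_only))
print(select_proxy(target, both))
```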

15. Timeouts

import requests

# timeout is in seconds; a value this small is only meant to force a timeout error
response = requests.get('https://www.baidu.com', timeout=0.0001)

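16. Exception handling

import requests
from requests.exceptions import ConnectionError, ReadTimeout, Timeout

try:
    r = requests.get('http://www.baidu.com', timeout=0.00001)
except ReadTimeout:  # the server did not send data in time
    print('read timeout')
except ConnectionError:  # the network is unreachable or the connection failed
    print('connection error')
except Timeout:  # any other timeout
    print('timeout')
except Exception:  # catch-all for anything else
    print('unexpected error')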
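
In a crawler it is often handy to wrap this handling in one function that returns None on failure. The helper below is a sketch under that assumption: fetch is a name chosen here, and it relies on RequestException being the base class of the errors requests raises.

```python
from typing import Optional

import requests
from requests.exceptions import RequestException, Timeout


def fetch(url: str, timeout: float = 5.0) -> Optional[requests.Response]:
    """Return the response, or None if the request fails for any reason."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx status codes into exceptions
        return response
    except Timeout:
        print(f'timed out after {timeout}s: {url}')
    except RequestException as exc:  # base class of requests errors
        print(f'request failed: {exc!r}')
    return None


# A URL without a scheme is rejected before any network traffic happens.
result = fetch('not-a-url')
print(result)
```

Catching Timeout before RequestException matters because Timeout is itself a subclass of RequestException.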

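17. Uploading files

import requests

# Open the file in binary mode and close it once the upload finishes
with open('a.jpg', 'rb') as f:
    files = {'file': f}
    response = requests.post('http://httpbin.org/post', files=files)
print(response.status_code)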


Original article: https://blog.csdn.net/kdq18486588014/article/details/126062389