Web Scraping: From Getting Started to Going to Jail

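1. Introduction to Web Crawlers

"Crawler" here means a web crawler. A web crawler (also called a web spider or web robot, and more often a "web chaser" in the FOAF community) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names include ant, automatic indexer, emulator, and worm.

Most software, whether client/server (C/S) or browser/server (B/S), communicates mainly over HTTP. A crawler simply simulates sending HTTP requests. Postman can do the same; a crawler does it from Python code. The server returns data (HTML, XML, JSON), which is then cleaned (re, bs4) and finally stored (files, MySQL, Redis, Elasticsearch, MongoDB).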

mysql:  custom protocol on top of TCP
redis:  custom protocol on top of TCP
docker: HTTP, following RESTful conventions
es:     HTTP, following RESTful conventions
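The request, clean, store flow described above can be sketched without touching the network. This is a minimal illustration, not the article's own code: SAMPLE_HTML is a hypothetical stand-in for a downloaded page, the clean and store helpers are made-up names, and an in-memory SQLite database stands in for MySQL.

```python
import re
import sqlite3

# Hypothetical sample standing in for a downloaded page (what res.text would hold).
SAMPLE_HTML = """
<ul class="article">
  <li><h3>First headline</h3></li>
  <li><h3>Second headline</h3></li>
</ul>
"""


def clean(html):
    """Cleaning step: pull the text of every <h3> out of the raw HTML."""
    return [t.strip() for t in re.findall(r"<h3>(.*?)</h3>", html, flags=re.S)]


def store(titles, conn):
    """Storage step: write the cleaned rows into a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS news (title TEXT)")
    conn.executemany("INSERT INTO news (title) VALUES (?)", [(t,) for t in titles])
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, nothing touches disk
store(clean(SAMPLE_HTML), conn)
rows = [r[0] for r in conn.execute("SELECT title FROM news ORDER BY rowid")]
print(rows)
```

In a real crawler the HTML would come from an HTTP response and the connection would point at a real database; the three stages stay the same.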

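In Python, requests can simulate a browser's requests. Compared with urllib, the requests API is far more convenient (under the hood it wraps urllib3).

Note: requests only downloads the page content; it does not execute JavaScript. For data loaded by scripts, analyze the target site yourself and send the additional requests it makes.

Install: pip3 install requests

Request methods: the most commonly used are requests.get() and requests.post()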

>>> import requests
>>> r = requests.get('https://api.github.com/events')
>>> r = requests.post('http://httpbin.org/post', data = {'key':'value'})
>>> r = requests.put('http://httpbin.org/put', data = {'key':'value'})
>>> r = requests.delete('http://httpbin.org/delete')
>>> r = requests.head('http://httpbin.org/get')
>>> r = requests.options('http://httpbin.org/get')

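2. Introduction to the requests module

To send requests from Python, use the requests module. The built-in urllib module also works, but its API is more cumbersome.

The module is not only for scraping: backend code uses it whenever it has to talk to another service.

For example, a company might run a service that turns long URLs into short ones. Register a domain, store each long URL together with its short code in a database, and serve the short links under that domain; visiting a short link then redirects to the long URL.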

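2.1 GET requests with requests

GET is HTTP's default request method.

It normally carries no request body
Its data travels in the URL, so the size is bounded by the URL length limits of browsers and servers (typically a few KB); HTTP itself sets no 1 KB limit
GET data is visible in the browser's address bar

Common situations that produce a GET request:

Entering a URL directly in the browser's address bar always sends a GET request
Clicking a hyperlink on a page also sends a GET request
Submitting a form uses GET by default, but the form can be set to POST

Basic usage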

import requests

# res is a Response object; the response body is available as res.text
res = requests.get('https://www.1biqug.com/')

Adding params

import requests

# Equivalent to requesting https://www.cnblogs.com/?name=xwx&age=19
res = requests.get('https://www.cnblogs.com/', params={'name':'xwx','age':19})
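To see what a params dict turns into, the query string can be built by hand with the standard library. This is only a sketch of the idea, not how requests is implemented internally; build_url is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_url(base, params):
    """Append params to base as a percent-encoded query string."""
    return base + "?" + urlencode(params)


url = build_url("https://www.cnblogs.com/", {"name": "xwx", "age": 19})
print(url)

# Reading the query string back recovers the values (always as strings).
query = parse_qs(urlsplit(url).query)
print(query)

# Non-ASCII values are percent-encoded as UTF-8 bytes.
encoded_name = urlencode({"name": "谢帅哥"})
print(encoded_name)
```

Note that the integer 19 comes back as the string '19': a query string carries text only.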

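Note: if the URL contains Chinese characters, URL encoding and decoding come into play; use urllib.parse.quote and urllib.parse.unquote to handle them.

For example, a URL whose query contains the Chinese name '谢帅哥' looks like this when copied:
https://blog.csdn.net/m0_58987515?type=blog&name=%E8%B0%A2%E5%B8%85%E5%93%A5

from urllib import parse

text = '哈哈哈'
encoded = parse.quote(text)        # percent-encode the UTF-8 bytes
print(encoded)                     # %E5%93%88%E5%93%88%E5%93%88
decoded = parse.unquote(encoded)   # decode the encoded string, not the original
print(decoded)                     # 哈哈哈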

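Adding request headers

Common request headers:

Header	Description
Host	The server's domain name and the TCP port it listens on.
Referer	The page from which the current request was linked.
Accept-Charset	The character sets the client can accept in the response (for example UTF-8).
Accept-Language	The natural languages the user agent can handle.
Authorization	The client's credentials for web authentication.
User-Agent	Information about the HTTP client program.

To get past simple anti-scraping checks, copy a browser's User-Agent into the request headers, as in this example: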

import requests

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}
res = requests.get('https://dig.chouti.com/', headers=header)
print(res.text)

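Adding cookies

Once the cookie is attached, the request carries the login state, so operations that require login become possible.

Method 1: put the cookie in the request headers

import requests
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    # Placeholder: paste the Cookie header copied from a logged-in browser session
    'Cookie': 'key=value',
}
res = requests.post('https://dig.chouti.com/link/vote',
                    headers=header,
                    data={
                        'linkId': '35811284'
                    })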

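Method 2: use the cookies parameter

Although cookies travel in the request headers, requests has a dedicated cookies parameter for them, so there is no need to put them in headers={}. The parameter accepts a dict or a CookieJar object.

import requests

cookies = {
    'user_session': 'wGMHFJKgDcmRIVvcA14_Wrt_3xaUyJNsBnPbYzEL6L0bHcfc',
}

# GitHub places few restrictions on request headers, so no custom User-Agent is needed here; other sites may require one
response = requests.get('https://github.com/settings/emails', cookies=cookies)

print('378533872@qq.com' in response.text)  # True when the session cookie is still valid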
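Browsers show cookies as a single header string, while the cookies parameter is easiest to fill with a dict. A small conversion sketch using the standard library follows; cookie_header_to_dict is a hypothetical helper name and the header value is made up for illustration.

```python
from http.cookies import SimpleCookie


def cookie_header_to_dict(header):
    """Parse a raw 'Cookie:' header value into a plain dict."""
    jar = SimpleCookie()
    jar.load(header)
    return {name: morsel.value for name, morsel in jar.items()}


# Hypothetical header value; real ones come from the browser's developer tools.
raw = "user_session=abc123; logged_in=yes"
cookies = cookie_header_to_dict(raw)
print(cookies)
```

The resulting dict could then be passed as cookies=cookies to requests.get() or requests.post().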

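2.2 POST requests

A POST request:

Data does not appear in the address bar
The protocol sets no upper limit on data size (servers usually impose their own)
Has a request body
Non-ASCII text such as Chinese in a form-encoded body is URL-encoded (requests does this automatically)

requests.post() is used the same way as requests.get(); the difference is its data parameter, which holds the request body.

Common response headers

Header	Description
Keep-Alive	Parameters for a persistent connection, such as its idle timeout and maximum number of requests.
Server	Information about the software the origin server uses to handle requests.
Set-Cookie	Sends cookies from the server to the client, for example a session ID.
Transfer-Encoding	The encoding used to transfer the message body.
Location	Redirects the client to the given URI.
WWW-Authenticate	The authentication scheme the server requires from the client.

Sending data

Request body formats: form-data, urlencoded (the default), and JSON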

import requests

# Form fields are passed in data
res = requests.post('http://www.aa7a.cn/user.php', data={
    'username': '111111',
    'password': '111111',
    'captcha': '111111',
    'remember': 1,
    'ref': 'http://www.aa7a.cn',
    'act': ' act_login',
})

# Cookies set on a successful login: a CookieJar object that behaves much like a dict.
# (A failed login also returns cookies, just fewer of them.)
print(res.cookies)
res1 = requests.get('http://www.aa7a.cn/', cookies=res.cookies)
print('123@qq.com' in res1.text)

Sending JSON data

To send JSON, pass it in the json parameter, as shown below

# 'xxx' stands for the target URL
res=requests.post('xxx', json={})
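The difference between the urlencoded and JSON body formats is easiest to see by building both bodies by hand. The sketch below uses only the standard library and shows roughly what goes on the wire for data= and json=; the exact bytes requests produces may differ in minor details, and the payload is a trimmed-down, illustrative version of the login form.

```python
import json
from urllib.parse import urlencode

payload = {"username": "111111", "remember": 1}

# urlencoded body, sent with Content-Type: application/x-www-form-urlencoded
form_body = urlencode(payload)
print(form_body)

# JSON body, sent with Content-Type: application/json
json_body = json.dumps(payload)
print(json_body)
```

The form body flattens everything to text, while the JSON body keeps the integer as a number, which is why some APIs accept only one of the two.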

requests.session

requests.session() maintains cookies automatically across every request made through it.

import requests

session = requests.session()
# Send the login request through the session
session.post('http://www.aa7a.cn/user.php', data={
    'username': '123@qq.com',
    'password': '123',
    'captcha': 'aaaa',
    'remember': 1,
    'ref': 'http://www.aa7a.cn/user.php?act=logout',
    'act': ' act_login',
})
res1 = session.get('http://www.aa7a.cn/')  # after logging in, no cookies argument is needed
print('123@qq.com' in res1.text)

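2.3 Response attributes

The response object wraps the HTTP response; it has the following attributes and methods

Attribute / method	Description
response.text	Response body as a string
response.content	Response body as raw bytes
response.status_code	Response status code
response.headers	Response headers
response.cookies	Cookies set by the response
response.cookies.get_dict()	Cookies converted to a dict
response.cookies.items()	Cookie key/value pairs
response.url	Final URL of the response (after any redirects)
response.history	List of the earlier responses when redirects occurred
response.encoding	Encoding used to decode the response body
response.iter_content()	Iterates over the body in chunks, for downloading images and videos; chunk_size sets the chunk size in bytes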

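# res is assumed to come from requests.get(url, stream=True), so the body is fetched as it is iterated
with open('致命诱惑3.mp4', 'wb') as f:
    # write either res.content in one go or the chunks below, not both
    for chunk in res.iter_content(chunk_size=1024):  # 1024 bytes at a time
        f.write(chunk)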
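The chunked-write loop can be tried without a network connection by feeding it any iterable of bytes. In this sketch save_chunks is a hypothetical helper, and the fake_chunks list stands in for what res.iter_content(chunk_size=1024) would yield.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path and return the bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written


# Stand-in for res.iter_content(chunk_size=1024): three small byte chunks.
fake_chunks = [b"abc", b"", b"defg"]
target = os.path.join(tempfile.mkdtemp(), "demo.bin")
total = save_chunks(fake_chunks, target)
print(total)
```

Writing chunk by chunk keeps memory use flat, which matters for large videos.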

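2.4 Encoding issues

If Chinese text comes out garbled, set the encoding explicitly. Most sites use UTF-8; older Chinese sites use GBK or GB2312.

import requests

response = requests.get('http://www.autohome.com/news')
response.encoding = 'gbk'
print(response.text)  # without the line above, requests guesses the encoding and Chinese text may be garbled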
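Garbled text is simply bytes decoded with the wrong codec. The following offline sketch reproduces the effect with a GBK-encoded string; the sample text is arbitrary.

```python
raw = "汽车之家".encode("gbk")  # bytes as an old GBK site would send them

wrong = raw.decode("utf-8", errors="replace")  # decoded with the wrong codec
right = raw.decode("gbk")                      # decoded with the site's codec

print(repr(wrong))
print(right)
```

Setting response.encoding tells requests which codec to use when it builds response.text from the raw bytes.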

2.5 Getting binary data

import requests

# response.content                       -> the whole body as bytes
# response.iter_content(chunk_size=1024) -> the body in chunks
res = requests.get('https://gd-hbimg.huaban.com/e1abf47cecfe5848afc2a4a8fd2e0df1c272637f2825b-e3lVMF_fw658')
with open('a.png', 'wb') as f:
    f.write(res.content)

2.6 Parsing JSON

import json
import requests

response = requests.get('http://httpbin.org/get')

res1 = json.loads(response.text)  # works, but verbose

res2 = response.json()  # parse the JSON body directly

print(res1 == res2)  # True

2.7 Advanced usage: Cert Verification (certificates)

# Certificate verification (most sites use HTTPS)
import requests
response = requests.get('https://www.12306.cn')  # for HTTPS the certificate is checked first; if it is invalid an error is raised and the program stops



# Improvement 1: no error, but a warning is printed
import requests
response = requests.get('https://www.12306.cn', verify=False)  # skip verification; prints a warning, returns 200
print(response.status_code)

# Improvement 2: no error and no warning
import requests
import urllib3
urllib3.disable_warnings()  # silence the warning
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)

# Improvement 3: supply a client certificate
# Many sites use HTTPS yet are accessible without a client certificate; in most cases it is optional
# Zhihu, Baidu and similar sites work either way
# Where it is mandatory you must send one, e.g. a site restricted to specific users who have been issued a certificate
import requests
response = requests.get('https://www.12306.cn',
                        cert=('/path/server.crt',
                              '/path/key'))
print(response.status_code)

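2.8 Proxies

Put simply, a proxy means accessing a resource through someone else's IP address, which then relays the response back to you.

Free HTTP proxies in mainland China

import requests

# requests picks the proxy by the scheme of the target URL,
# so an https:// URL needs an 'https' entry
proxies = {
    'http': 'http://112.14.47.6:52024',
    'https': 'http://112.14.47.6:52024',
}
# 180.164.66.7
response = requests.get('https://www.cnblogs.com/', proxies=proxies)
print(response.status_code)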

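2.9 Timeouts, authentication, exceptions, and file uploads

Setting a timeout

import requests
# timeout is in seconds; a value this small makes the request time out almost every time
response = requests.get('https://www.baidu.com', timeout=0.0001)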

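Exception handling

import requests
from requests.exceptions import ConnectionError, ReadTimeout, Timeout

try:
    r = requests.get('http://www.baidu.com', timeout=0.00001)
except ReadTimeout:        # connected, but the server was too slow to respond
    print('===:')
except ConnectionError:    # network unreachable (also covers connect timeouts)
    print('-----')
except Timeout:            # any other timeout
    print('aaaaa')
except Exception:
    print('x')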
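Catching an exception is usually followed by trying again. Below is a minimal retry sketch that works with any callable; the retry helper and the simulated flaky function are illustrative and not part of requests. With requests, the exceptions argument would typically hold the exception classes shown above.

```python
import time


def retry(func, attempts=3, delay=0.0, exceptions=(Exception,)):
    """Call func() until it succeeds or attempts run out; re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)  # wait a little longer after each failure


# Stand-in for a flaky request: fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


result = retry(flaky, attempts=3, exceptions=(TimeoutError,))
print(result, calls["n"])
```

Retrying only on a specific tuple of exceptions avoids hiding programming errors behind repeated attempts.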

Uploading files

import requests

# open the file in binary mode and close it once the upload finishes
with open('a.jpg', 'rb') as f:
    response = requests.post('http://httpbin.org/post', files={'file': f})
print(response.status_code)

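3. Proxy pool

3.1 Building a simple proxy pool

You can build a simple proxy pool with proxy_pool. Project page: proxy_pool (jhao104/proxy_pool on GitHub)

A simple, efficient proxy pool that provides:

Periodic scraping of free proxy sites, simple and extensible.
Redis storage for proxies, ranked by availability.
Periodic testing and filtering that drops dead proxies and keeps working ones.
A proxy API that returns a random proxy that passed testing.

Step 1: clone the code
	git clone git@github.com:jhao104/proxy_pool.git

Step 2: install dependencies
	pip3 install -r requirements.txt

Step 3: edit the configuration file setting.py
	DB_CONN = 'redis://127.0.0.1:6379/1'

Step 4: start the project
	# start the scheduler that crawls and checks proxies
	python3 proxyPool.py schedule
	# start the web API server
	python3 proxyPool.py server

Step 5: fetch a proxy
	http://127.0.0.1:5010/get/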
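The endpoint in step 5 returns one JSON record per call. Assuming that record carries a "proxy" field (host:port) and an "https" flag, a small helper can turn it into the proxies mapping that requests expects. build_proxies is a hypothetical name, and the sample record is trimmed to the two fields used.

```python
import json


def build_proxies(info):
    """Turn one proxy-pool record into a proxies mapping for requests."""
    address = "http://" + info["proxy"]  # assumption: the proxy itself speaks plain HTTP
    proxies = {"http": address}
    if info.get("https"):  # route HTTPS traffic only if the pool reports support
        proxies["https"] = address
    return proxies


sample = json.loads('{"https": false, "proxy": "183.250.163.175:9091"}')
proxies = build_proxies(sample)
print(proxies)
```

Free proxies fail often, so in practice the lookup is usually wrapped in a loop that fetches a fresh record when a request through the current one fails.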

3.2 Getting the client IP in a Django backend

import requests

# Take one proxy from the pool
# The response has this format
'''
{
    "anonymous":"",
    "check_count":1,
    "fail_count":0,
    "https":false,
    "last_status":true,
    "last_time":"2022-08-01 17:47:29",
    "proxy":"183.250.163.175:9091",
    "region":"",
    "source":"freeProxy08/freeProxy06"
}
'''

res = requests.get('http://127.0.0.1:5010/get/').json()
print(res['proxy'])

# Pick the scheme the proxy supports
h = 'https' if res['https'] else 'http'

# requests only uses an entry whose key matches the scheme of the target URL
proxies = {
    h: res['proxy'],
}

# Request the project on a personal server through the proxy; it responds with the IP it sees
res1 = requests.get('http://121.4.75.248/gip/', proxies=proxies)
print(res1.text)

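Notes:

The server's sqlite3 version may cause problems; one way around it is to configure a MySQL database ahead of time and run the migrations against that.
If binding the port fails during deployment, stop nginx first: nginx -s stop
Deployment command: python manage.py runserver 0.0.0.0:80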
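The server side of this test is not shown. As a sketch of what a view behind /gip/ might do, the function below reads the client address from a Django-style request.META mapping. The function name is hypothetical; HTTP_X_FORWARDED_FOR and REMOTE_ADDR are the usual META keys, and in a real view you would pass request.META.

```python
def get_client_ip(meta):
    """Return the client IP from a Django-style request.META mapping."""
    forwarded = meta.get("HTTP_X_FORWARDED_FOR")
    if forwarded:
        # The header can hold a comma-separated chain; the first entry is the original client.
        return forwarded.split(",")[0].strip()
    return meta.get("REMOTE_ADDR", "")


# Plain dicts stand in for request.META here.
direct = get_client_ip({"REMOTE_ADDR": "183.250.163.175"})
proxied = get_client_ip({
    "HTTP_X_FORWARDED_FOR": "10.0.0.1, 172.16.0.2",
    "REMOTE_ADDR": "127.0.0.1",
})
print(direct, proxied)
```

X-Forwarded-For is set by the client or by intermediate proxies and can be forged, so it is useful for this kind of check but not for security decisions.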

4. Small examples

4.1 Scraping videos

Using Pear Video (梨视频) as the example

import os
import re

import requests

res = requests.get('https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=5&start=0')
# print(res.text)

# The original pattern was lost (only an empty string survived). The pattern below is an
# assumption: it expects links of the form href="video_1768482" in the returned HTML.
video_list = re.findall(r'href="(video_\d+)"', res.text)
print(video_list)
os.makedirs('video', exist_ok=True)  # make sure the output directory exists
# https://www.pearvideo.com/video_1768482
for video in video_list:
    video_id = video.split('_')[-1]
    video_url = 'https://www.pearvideo.com/' + video
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
        'Referer': video_url
    }
    res1 = requests.get('https://www.pearvideo.com/videoStatus.jsp?contId=%s&mrd=0.5602821872545047' % video_id,
                        headers=header
                        ).json()
    # print(res1['videoInfo']['videos']['srcUrl'])
    mp4_url = res1['videoInfo']['videos']['srcUrl']
    # Swap the timestamp segment of the file name for cont-<video id> to get a playable URL
    real_mp4_url = mp4_url.replace(mp4_url.split('/')[-1].split('-')[0], 'cont-%s' % video_id)
    print(real_mp4_url)
    # Download the video
    res2 = requests.get(real_mp4_url)
    with open('video/%s.mp4' % video_id, 'wb') as f:
        for line in res2.iter_content(1024):
            f.write(line)

# Requesting the page directly does not return the video: the page fetches it with an AJAX request,
# and that request must carry a Referer
# res=requests.get('https://www.pearvideo.com/video_1768482')
# print(res.text)


# https://video.pearvideo.com/mp4/third/20220729/     1659324669265     -11320310-183708-hd.mp4   # does not play
# https://video.pearvideo.com/mp4/third/20220729/     cont-1768482      -11320310-183708-hd.mp4    # plays

# url='https://video.pearvideo.com/mp4/third/20220729/   1659324669265    -11320310-183708-hd.mp4'

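Other notes:

Whole-site scraping: just change the category id and the start offset
Synchronous scraping is only moderately fast; add threads (a thread pool) to speed it up
IP bans (use a proxy pool)
Video processing (cutting and joining videos with ffmpeg, invoked from the command line)
Driving software from Python: use the subprocess module to run ffmpeg commands
OpenCV from Python (written in C/C++, compiled, and called from Python) enables far more advanced processing (for example adding an intro to a video or trimming its ending)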
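The thread-pool suggestion above can be sketched with the standard library. Here the download function is a placeholder that only builds the page URL, so the example runs offline; in a real crawler it would hold the per-video request and file-writing logic. The IDs after the first are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def download(video_id):
    """Placeholder for the per-video work; here it only builds the page URL."""
    return "https://www.pearvideo.com/video_%s" % video_id


video_ids = ["1768482", "1768483", "1768484"]  # hypothetical IDs apart from the first

# Up to three downloads run at the same time; map returns results in input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(download, video_ids))

print(results)
```

Threads suit this workload because each download spends most of its time waiting on the network rather than using the CPU.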

4.2 Scraping news

Using Autohome (汽车之家) as the example, parsed with bs4

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.autohome.com.cn/news/1/#liststart')

# print(res.text)
# Parsing this page with re is awkward, so bs4 is used instead
#  First argument: the string to parse (HTML or XML)
#  Second argument: the parser to use, here html.parser
soup = BeautifulSoup(res.text, 'html.parser')

# Search the parsed document
# Find every ul tag whose class is article
ul_list = soup.find_all(name='ul', class_='article')
for ul in ul_list:
    li_list = ul.find_all(name='li')
    for li in li_list:
        h3 = li.find(name='h3')
        if h3:
            # The h3 text is the news title
            title = h3.text
            desc = li.find(name='p').text
            # url=li.find(name='a')['href']
            # the code assumes href and src are protocol-relative (//...), so a scheme is prepended
            url = 'http:' + li.find(name='a').attrs['href']
            img = 'http:' + li.find(name='img')['src']

            print('''
            Title:   %s
            Summary: %s
            URL:     %s
            Image:   %s
            ''' % (title, desc, url, img))
            # 1 Save the image locally
            # 2 Store the cleaned data in MySQL
            # 3 For whole-site scraping, change the page number (https://www.autohome.com.cn/news/1/#liststart)
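Two details of this example can be made explicit with the standard library: generating the URL for each list page, and turning protocol-relative links into absolute ones without hand-prefixing a scheme. page_url and absolute are hypothetical helper names, and the example.com link is a placeholder.

```python
from urllib.parse import urljoin


def page_url(page):
    """Build the list-page URL for a given page number."""
    return "https://www.autohome.com.cn/news/%d/#liststart" % page


def absolute(base, link):
    """Resolve a relative or protocol-relative link against the page URL."""
    return urljoin(base, link)


pages = [page_url(n) for n in range(1, 4)]
print(pages)

# A protocol-relative link inherits the scheme of the page it was found on.
img = absolute(pages[0], "//example.com/a.jpg")
# A root-relative link is resolved against the site's host.
article = absolute(pages[0], "/news/202208/1.html")
print(img, article)
```

Using urljoin keeps the scheme consistent with the page that was fetched, so links found on an HTTPS page stay HTTPS.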

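4.3 Scraping Bilibili videos

'''

The video and audio downloaded by this program are saved as two separate files and are not merged:
video: <video name>_video.mp4
audio: <video name>_audio.mp4
Change the value of url to the page you want to download
'''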

# requests sends the HTTP requests
import requests
# json parses the play info embedded in the page
import json
# re extracts that play info from the page source
import re

# Request headers
headers = {
    'Accept': '*/*',
    'Accept-Language': 'en-US,en;q=0.5',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'
}


# Search text with the regex pattern and parse the first capture group as JSON
def my_match(text, pattern):
    match = re.search(pattern, text)
    print(match.group(1))
    print()
    return json.loads(match.group(1))


def download_video(old_video_url, video_url, audio_url, video_name):
    headers.update({"Referer": old_video_url})
    print("Start downloading video: %s" % video_name)
    video_content = requests.get(video_url, headers=headers)
    print('%s video size:' % video_name, video_content.headers['content-length'])
    audio_content = requests.get(audio_url, headers=headers)
    print('%s audio size:' % video_name, audio_content.headers['content-length'])
    # Video download: start
    received_video = 0
    with open('%s_video.mp4' % video_name, 'ab') as output:
        while int(video_content.headers['content-length']) > received_video:
            # Resume from the bytes already received
            headers['Range'] = 'bytes=' + str(received_video) + '-'
            response = requests.get(video_url, headers=headers)
            output.write(response.content)
            received_video += len(response.content)
    # Video download: end
    # Audio download: start
    # The audio size was already read above; drop the stale Range header left over from the video loop
    headers.pop('Range', None)
    received_audio = 0
    with open('%s_audio.mp4' % video_name, 'ab') as output:
        while int(audio_content.headers['content-length']) > received_audio:
            # Resume the audio from the bytes already received
            headers['Range'] = 'bytes=' + str(received_audio) + '-'
            response = requests.get(audio_url, headers=headers)
            output.write(response.content)
            received_audio += len(response.content)
    # Audio download: end
    return video_name


if __name__ == '__main__':
    # Replace with the address of the video you want to scrape
    url = 'https://www.bilibili.com/video/BV1QG41187tj?'
    # Send the request and get the page back
    res = requests.get(url, headers=headers)
    # JSON with the video's play info
    playinfo = my_match(res.text, '__playinfo__=(.*?)