• 【Web Scraping】Part 2: The urllib handler


    2. The urllib handler

    A handler is another technique in the urllib library, following the urlopen() method, for simulating a browser sending requests to a server.

    So why do we need to learn it?

    Because as our business logic grows more complex, a customized Request object alone can no longer satisfy our needs, so we turn to handlers. Handlers let us plug extra behavior, such as proxy routing or cookie management, into every request.
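
    One concrete case: keeping cookies across requests, for example to stay logged in, is something a standalone Request object cannot do, but a handler can. Below is a minimal sketch (not from the original post) using the standard library's HTTPCookieProcessor; the opener pattern it uses is explained in section 2.1.

    import urllib.request
    import http.cookiejar
    
    # A CookieJar collects the cookies that servers send back
    cookie_jar = http.cookiejar.CookieJar()
    
    # HTTPCookieProcessor is a handler that replays the jar's cookies
    # on every request made through this opener
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar))
    
    # Cookies set by the first response are sent back automatically
    # on later requests through the same opener
    opener.open('https://www.xxx.com')
    print(len(cookie_jar))  # number of cookies captured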

    2.1 Basic usage of the handler

    import urllib.request
    url = 'https://www.xxx.com'
    headers = {
        'Referer': 'https://www.xxx.com/link?url=Rg4aCsjouphcJ5OGEv4z8RlR9Wc4ERQipSjI1HqkVfG&wd=&eqid=af45491f000c228e000000036357fbd3',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
        'Cookie': 'BIDUPSID=F8062DDC1948E6DAC11A23B9185211BC; PSTM=1644290569; ab_sr=1.0.1_ZWEzMTIzMDUyNmJjOTQ3MmExYTRhNDkwMGY0M2FkM2U0NzE5MzM2OGY0NjFhNTBlZDJjM2FmNDY2NDg0MDlhY2FkMDlmY2IyODdmNTIzMDg2YzU2MThlYTdhODUxYWRiMWRmN2IzYmNjOWY5ZjNkZWM0MjY3OTRkNGZkOGZjMDliYjY4YzFhNDU0NzdjMDYxNGQ0MTNhZDM3ZjZiYmIzMjUxZDNlNTU3NGM0MmUzYzdjZWU2M2FiZDY4MzFlZjM1; BD_HOME=1; H_PS_PSSID=36546_37553_37518_37355_37584_36885_37626_36786_37536_37581_26350; delPer=0; BD_CK_SAM=1; PSINO=1; H_PS_645EC=0e91%2B6teFT2SdG7NZB2Vx8CCNZDdA40fjlteqshvLmXMdu9%2F3Xm9G32mkpQ; baikeVisitId=f0b5771a-9a25-4525-8569-22025278cb44'
    }
    
    # Disguise the request with browser-like headers
    request = urllib.request.Request(url=url, headers=headers)
    
    # Get a handler object
    handler = urllib.request.HTTPHandler()
    
    # Build an opener object from the handler
    opener = urllib.request.build_opener(handler)
    
    # Call the opener; this step is equivalent to urlopen()
    response = opener.open(request).read().decode()
    
    print(response)
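
    If every request in a script should go through the same handler chain, the opener can also be installed globally, so that ordinary urlopen() calls use it from then on. A short sketch continuing from the code above:

    # install_opener() makes urlopen() route through our opener,
    # so existing urlopen()-based code picks up the handler chain
    urllib.request.install_opener(opener)
    response = urllib.request.urlopen(request).read().decode()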
    
    

    2.2 Proxy servers

    What proxies are good for:

    • Bypass IP-based access restrictions to reach sites hosted abroad
    • Access internal resources of an organization or company
    • Speed up access
    • Hide your real IP address
    • Keep scraping through a proxy after your own IP has been banned

    See the code below for concrete usage.

    import urllib.request
    
    url = 'https://www.xxx.com'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
        'Cookie': 'BIDUPSID=F8062DDC1948E6DAC11A23B9185211BC; PSTM=1644290569; BAIDUID=F8062DDC1948E6DA5E48F5D4AE020258:FG=1; __yjs_duid=1_79801aa46424f5e5bc57241a10490cb21644290773722; BD_UPN=12314753; BDUSS_BFESS=NvckF2cUNuMkhWY25QdUs5ekhQWjczM09RTUY0ekRCV0VNbDJhaG13OVVOMVpqRUFBQUFBJCQAAAAAAAAAAAEAAABujHGhs7~q2NPrzqLCtgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAFSqLmNUqi5jcV; BAIDUID_BFESS=F8062DDC1948E6DA5E48F5D4AE020258:FG=1; BA_HECTOR=al2hag850kakak0k0g8k86f11hlfsa11b; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; B64_BOT=1; RT="z=1&dm=baidu.com&si=zuxuc8uuzs&ss=l9oas0nm&sl=2&tt=4h8&bcn=https%3A%2F%2Ffclog.baidu.com%2Flog%2Fweirwood%3Ftype%3Dperf&ld=4cl&ul=ash&hd=ath"; H_PS_645EC=169b7wti5DafF3FtoCuCPgYMW%2BvxNf%2F8byaWwhPXCkfgwRW7AdS%2FgK3ReXA; baikeVisitId=e0c37c9d-7434-4297-808b-9b0fb72d4635'
    }
    
    # Disguise the request with browser-like headers
    request = urllib.request.Request(url=url, headers=headers)
    
    # proxies: the proxy server information
    proxies = {
        # Free or paid proxies are available from providers such as Kuaidaili
        'http': '223.96.90.216:8085'
    }
    
    # Get a handler object
    handler = urllib.request.ProxyHandler(proxies)
    
    # Build an opener object from the handler
    opener = urllib.request.build_opener(handler)
    
    # Call the opener; this step is equivalent to urlopen()
    response = opener.open(request).read().decode('utf-8')
    
    with open('baidu.html', 'w', encoding='utf-8') as fs:
        fs.write(response)
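
    Paid proxies usually require authentication. ProxyHandler accepts credentials embedded in the proxy URL; here is a minimal sketch where the user, password, and address are all placeholders:

    # 'user:password' is a placeholder; substitute real credentials
    proxies = {
        'http': 'http://user:password@223.96.90.216:8085'
    }
    handler = urllib.request.ProxyHandler(proxies)
    opener = urllib.request.build_opener(handler)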
    
    

    2.3 Proxy pools

    import urllib.request
    import random
    
    # Simulate a simple proxy pool
    proxies_pool = [
        {'http': '61.216.185.88:60808'},
        {'http': '223.96.90.216:8085'},
        {'http': '223.96.90.216:8085'},
        {'http': '58.20.184.187:9091'},
        {'http': '183.247.202.208:30001'}
    ]
    
    # Randomly pick one entry from the list
    proxies = random.choice(proxies_pool)
    
    url = 'https://www.xxx.com'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
        'Referer': 'https://www.xxx.com/link?url=BaNnxSgMcoGHDufIkglmT9y6jMWok_p8P6JD0GdUFue&wd=&eqid=fe3d76e8000024600000000663580923',
        'Cookie': 'Hm_lvt_f4f76646cd877e538aa1fbbdf351c548=1666713895; Hm_lpvt_f4f76646cd877e538aa1fbbdf351c548=1666713895'
    }
    
    # Disguise the request with browser-like headers
    request = urllib.request.Request(url=url, headers=headers)
    
    # Get a handler
    handler = urllib.request.ProxyHandler(proxies)
    
    # Build an opener
    opener = urllib.request.build_opener(handler)
    
    # Send the request like a browser and fetch the data
    response = opener.open(request).read().decode('utf-8')
    
    with open('ip.html', 'w', encoding='utf-8') as fs:
        fs.write(response)
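
    A single random pick fails outright if the chosen proxy happens to be dead. One common refinement, sketched here rather than taken from the original post, is to shuffle the pool and fall back to the next proxy on error:

    random.shuffle(proxies_pool)
    for proxies in proxies_pool:
        handler = urllib.request.ProxyHandler(proxies)
        opener = urllib.request.build_opener(handler)
        try:
            # A timeout keeps one dead proxy from blocking the whole loop
            response = opener.open(request, timeout=5).read().decode('utf-8')
            break
        except OSError as e:  # URLError and socket timeouts both derive from OSError
            print(f'proxy {proxies} failed ({e}), trying the next one')
    else:
        raise RuntimeError('all proxies in the pool failed')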
    
    

    Summary

    That wraps up today's content. I hope you find it helpful!

  • Original article: https://blog.csdn.net/Trees__/article/details/127721394