Xiaohongshu Keyword Crawler

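Contents

1 Gather the keywords to collect and put them in a file
2 Crawl each page of results
3 Extract the title and body
4 If the post can be viewed, crawl its comments
5 Aggregate the results and save each post as a JSON file
6 Summary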

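1 Gather the keywords to collect and put them in a file

For example, to collect travel-related posts, gather keywords such as
旅游, 旅行, and 旅游攻略 (travel, trip, travel guide) and save them in a txt file.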
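The keyword file can be read back with a few lines of Python. This is a minimal sketch, assuming one keyword per line in a UTF-8 text file. The function names, the file name keywords.txt, and the output directory are illustrative and do not come from the original code.

```python
from pathlib import Path


def load_keywords(path):
    """Read one keyword per line, skipping blank lines and duplicates."""
    keywords = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        word = line.strip()
        if word and word not in keywords:
            keywords.append(word)
    return keywords


def make_keyword_dirs(keywords, root="output"):
    """Create one folder per keyword and return a keyword -> folder mapping."""
    dirs = {}
    for word in keywords:
        folder = Path(root) / word
        folder.mkdir(parents=True, exist_ok=True)
        dirs[word] = folder
    return dirs
```

Calling make_keyword_dirs(load_keywords("keywords.txt")) would prepare the per-keyword folders that the final step writes into.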

Log in to a Xiaohongshu account in a browser, then record the cookies from that session.
[Figure: screenshot of the recorded cookies]
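The later snippets pass cookies as a dict and read cookies["a1"] from it. One way to get that dict is to copy the Cookie header from the browser's developer tools and split it. This is a sketch under that assumption: parse_cookie_string is a hypothetical helper, and the values (and the second cookie name) are placeholders only.

```python
def parse_cookie_string(raw):
    """Turn 'k1=v1; k2=v2' into {'k1': 'v1', 'k2': 'v2'}."""
    cookies = {}
    for part in raw.split(";"):
        if "=" not in part:
            continue  # ignore empty or malformed fragments
        name, value = part.split("=", 1)
        cookies[name.strip()] = value.strip()
    return cookies


# Placeholder values; paste the string copied from your own logged-in session.
cookies = parse_cookie_string("a1=PLACEHOLDER_A1; web_session=PLACEHOLDER_SESSION")
```

Splitting on the first "=" only keeps values that themselves contain "=" intact.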

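2 Crawl each page of results

This step mainly uses the requests library and a JavaScript bridge (the js object, which calls the get_xs signing function). The parsed response is stored in res, which holds one page of 20 items.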

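import re
import requests

# Runs once per page inside the paging loop; info is the serialized request body
info = re.sub(r'"page":".*?"', f'"page":"{page}"', info)
ret = js.call('get_xs', api, info, cookies["a1"])  # sign the request with the a1 cookie
headers['x-s'], headers['x-t'] = ret['X-s'], str(ret['X-t'])
response = requests.post(search_url, headers=headers, cookies=cookies, data=info.encode('utf-8'))
res = response.json()  # one page of search results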
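The page substitution can be tried on its own, without touching the network. In this sketch the request body is hypothetical: only the "page" field and the regular expression come from the snippet above, while the keyword field and the set_page helper are assumptions for illustration.

```python
import json
import re

# Hypothetical body. Compact separators matter: the pattern expects "page":"1"
# with no space after the colon, which json.dumps would otherwise insert.
info = json.dumps({"keyword": "旅游", "page": "1"},
                  ensure_ascii=False, separators=(",", ":"))


def set_page(body, page):
    """Rewrite the "page" field of the serialized body for the given page."""
    return re.sub(r'"page":".*?"', f'"page":"{page}"', body)


# Bodies for pages 1 to 3, ready to be signed and posted one by one.
bodies = [set_page(info, page) for page in range(1, 4)]
```

Editing the serialized string rather than a dict keeps the body byte-for-byte identical to what gets signed, which is presumably why the original code takes this route.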

3 Extract the title and body

Parse the title, body text, and other fields out of each note.

note = data['note_card']
result = {}
result["title"] = note['title']
# Strip newlines and tabs so the body fits on one line
result["desc"] = note['desc'].replace("\n", "").replace("\t", "")
tags = []
for tag in note['tag_list']:
    try:
        tags.append(tag['name'])
    except (KeyError, TypeError):
        # Skip tag entries that carry no name
        continue
result["tags"] = tags
# timestamp_to_str is a helper from the full code (not shown in this post)
result["time"] = timestamp_to_str(note['time'])
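The timestamp_to_str helper is not shown in the post. A possible version is sketched below, assuming the time field is a Unix timestamp in milliseconds (drop the division by 1000 if it turns out to be in seconds). The tag_names function is likewise a hypothetical, more compact take on the tag loop.

```python
from datetime import datetime, timezone


def timestamp_to_str(ts_ms, tz=timezone.utc):
    """Format a millisecond Unix timestamp as 'YYYY-MM-DD HH:MM:SS'."""
    return datetime.fromtimestamp(ts_ms / 1000, tz=tz).strftime("%Y-%m-%d %H:%M:%S")


def tag_names(tag_list):
    """Collect the 'name' of every tag, skipping entries without one."""
    return [tag["name"] for tag in tag_list
            if isinstance(tag, dict) and "name" in tag]
```

The sketch defaults to UTC so that output does not depend on the machine's time zone; pass another tzinfo to get local times.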

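4 If the post can be viewed, crawl its comments

The comments of each post live at a separate URL that has to be assembled from the post's ID. Take the ID obtained in step 3 (the code reads it from the end of the post URL as note_id), splice it into the comments URL, and fetch that URL with a GET request to read each comment. Some posts cannot be viewed, so the response has to be checked before it is used.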

# The note ID is the last segment of the post URL
note_id = url.split('/')[-1]
comments_url = "https://edith.xiaohongshu.com/api/sns/web/v2/comment/page?note_id={}&image_scenes=FD_WM_WEBP,CRD_WM_WEBP".format(note_id)
response = requests.get(comments_url, headers=headers, cookies=cookies)
res = response.json()
comments = []
# A post that cannot be viewed may come back without comment data; treat that as no comments
for line in (res.get("data") or {}).get("comments", []):
    comment_str = line["content"]
    comments.append(comment_str)
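The ID extraction and the viewability check can be pulled into two small functions and tested offline. This is a sketch, not the author's code: both function names are hypothetical, and the assumption that an unviewable post returns a response with missing or empty data has not been confirmed against the real API.

```python
from urllib.parse import urlparse


def note_id_from_url(url):
    """Return the last path segment of a note URL, ignoring any query string."""
    return urlparse(url).path.rstrip("/").split("/")[-1]


def extract_comments(res):
    """Return comment texts, or an empty list when no comment data is present."""
    data = res.get("data") or {}
    return [c["content"] for c in data.get("comments") or [] if "content" in c]
```

Parsing the URL first avoids a subtle problem with url.split('/')[-1]: if the post URL carries query parameters, they would otherwise end up inside the note ID.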

5 Aggregate the results and save each post as a JSON file

Each file contains the title, body text, tags, creation time, and comments. Each keyword gets its own folder.
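Writing one file per post could look like the sketch below. The save_post function, the output root directory, and the choice of the note ID as the file name are assumptions for illustration; the post only specifies one JSON file per post and one folder per keyword.

```python
import json
from pathlib import Path


def save_post(result, keyword, note_id, root="output"):
    """Write one post to <root>/<keyword>/<note_id>.json and return the path."""
    folder = Path(root) / keyword
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{note_id}.json"
    with path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese text readable in the saved file
        json.dump(result, f, ensure_ascii=False, indent=2)
    return path
```

Here result would be the dict built in step 3 with the comments list from step 4 added under its own key.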

6 Summary

Message me privately for the full code. Note that this walkthrough does not crawl images; that can be added if needed.

Original article: https://blog.csdn.net/ww596520206/article/details/136309997