[C++ Project] Boost Search Engine

Gitee repository

Project address (copy it into a browser to open):
https://gitee.com/xiao-jiheng/boost_search_engine

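Basic demo

In short: once the server is running, a client can open the page in a browser and search. The searchable content is the Boost library documentation.

Project directory layout:
[Figure: project directory listing]

Commands to start the project: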

[xjh@VM-12-10-centos boost_searcher]$ make           # build the whole project
[xjh@VM-12-10-centos boost_searcher]$ ./parser       # after a successful build, strip the tags from the HTML pages
[xjh@VM-12-10-centos boost_searcher]$ ./http_server  # start the server

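Purpose of tag stripping: it cleans the page content. HTML tags are not something users search for, so they are removed.

Starting the service: the server provides the page resources that users search, and an index has to be built over those resources.
How do you verify that the server started successfully?

[xjh@VM-12-10-centos boost_searcher]$ netstat -nltp  # list listening TCP sockets

[Figure: netstat output]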

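The user supplies search keywords. The search page looks like this (the default port is 8081):
[Figure: search page]

The search results look roughly like this:
[Figure: search results]

Click any result: the page it links to contains the search keyword.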

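Outline

Project background
How a search engine works at a high level
Technology stack and project environment
Forward index vs. inverted index: how the search engine actually works
Writing the tag-stripping and data-cleaning module: Parser
Writing the index-building module: Index
Writing the search module: Searcher
Writing the http_server module
Writing the front-end module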

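1. Project background

Companies: Baidu, Sogou, 360 Search, the Toutiao news client. Building a web-scale search engine like theirs ourselves is not realistic!
The technical bar is high; merely storing that volume of web resources is already a problem,
let alone ranking results against the user's keywords and rendering the page content.
Site search: the data is more vertical and the volume is much smaller.
The Boost website has no site search, so we build one ourselves.

What we write is a site search whose corpus is the Boost library documentation.
Each result shows three key pieces of information: the title, a summary of the page content, and the URL.
Clicking a result jumps to the corresponding page.
Unlike Baidu, there are no images, videos or ads, and keywords are not highlighted in red.
Our site search only applies the basic principles of a search engine.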

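2. How a search engine works at a high level

On the server side:
The resources to be searched are prepared in advance: a crawler fetches pages from the web and stores them on the server's disk.
The crawled pages are then cleaned: tags are removed and only the key information is kept.
An index is built over the crawled content so that lookups on the server are fast.

On the client side (the browser), the keywords are submitted in a GET request. The server parses the request, looks the keywords up in the index, collects the matching resources, and builds an HTML page from them to return to the user.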

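3. Technology stack and project environment

Technology stack: C/C++, C++11, STL, the quasi-standard library Boost, Jsoncpp, cppjieba, cpp-httplib;
HTML5, CSS, JS, jQuery, Ajax (the front end is used very lightly; the project is mostly back end).
Project environment: CentOS 7 cloud server, vim, gcc (g++), Makefile, VS Code.

cppjieba: a word-segmentation library. It splits the user's query into words, which are then looked up individually; the server also uses it to segment document text when building the index.
cpp-httplib: an open-source library for building an HTTP server directly.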

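4. Forward index vs. inverted index: how the search engine actually works

Reference: an article on forward and inverted indexes (linked from the original post).
That article, found online, explains the two concepts in detail. I only give a short explanation here, focusing on the characteristics of each index and the role it plays in a search engine.

Forward index: a mapping from document ID to document content. Given a document ID, you find the document's content (some describe it as finding the keywords inside the document).
Inverted index: a mapping from a keyword to the IDs of the documents that contain it.

Searches always go from keywords to document content,
so the server must segment the document content into words. The purpose of segmentation is to make the inverted index possible.

Segmentation:
Document 1 [雷军买了四斤小米] (Lei Jun bought four jin of millet): 雷军/买/四斤/小米/四斤小米
Document 2 [雷军发布了小米手机] (Lei Jun released the Xiaomi phone): 雷军/发布/小米/小米手机

Here document 1 is split into [雷军] [买] [四斤] [小米] [四斤小米] (this is only an example; there are many segmentation strategies).
The inverted index is built from these segmentation results, so that users can find content by keyword.

Simulating one lookup:
The user enters 小米 -> look it up in the inverted index -> get the document IDs (1, 2) -> look those IDs up in the forward index -> get the document content ->
summarize each result as title + content (desc) + url -> build the response.

Note: in the code, building the inverted index requires segmenting the document content and using the resulting words as keys.
At query time the user's query is segmented as well; each resulting keyword is looked up in the inverted index to get document IDs, and each document ID is then looked up in the forward index to get the document content.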
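The lookup path described above can be sketched with two toy containers. This is a minimal illustration, not the project's Index class: ToyIndex, Add and Lookup are hypothetical names, the words are passed in already segmented, and document IDs start at 0 (the vector subscript) rather than 1 as in the prose example.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Toy index (hypothetical): forward maps doc_id -> content, inverted maps word -> doc_ids.
struct ToyIndex {
  std::vector<std::string> forward;
  std::unordered_map<std::string, std::vector<uint64_t>> inverted;

  // Adds a document whose text has already been segmented into words.
  uint64_t Add(const std::string &content, const std::vector<std::string> &words) {
    uint64_t doc_id = forward.size();  // the subscript doubles as the document ID
    forward.push_back(content);
    for (const auto &w : words) {
      auto &list = inverted[w];
      // Documents arrive in increasing ID order, so a repeat can only be the last entry.
      if (list.empty() || list.back() != doc_id) list.push_back(doc_id);
    }
    return doc_id;
  }

  // Inverted index first (word -> IDs), then forward index (ID -> content).
  std::vector<std::string> Lookup(const std::string &word) const {
    std::vector<std::string> hits;
    auto it = inverted.find(word);
    if (it == inverted.end()) return hits;
    for (uint64_t id : it->second) hits.push_back(forward[id]);
    return hits;
  }
};
```

Adding the two example documents and calling Lookup("小米") returns both of them, while Lookup("发布") returns only the second.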

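5. Writing the tag-stripping and data-cleaning module: Parser

First download the Boost release to the Linux machine; it serves as the server's search corpus.

Boost website: https://www.boost.org/
// For now only the HTML files under boost_1_78_0/doc/html are needed; the index is built from them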

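On the website, find the download icon and download this version to your desktop (any version works; 1_78_0 is simply the one I used).
Then upload it with:

[xjh@VM-12-10-centos boost_search]$ rz -E  # upload the Boost archive from the desktop to the Linux machine

Once the upload succeeds, extract it:

tar -zxvf boost_1_78_0.tar.gz  # extract the archive

The extracted directory holds the contents of the Boost release.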

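For the site search, however, only the resources under this path are used:

boost_1_78_0/doc/html/

It holds the Boost documentation pages, which are the resources this project can search.

Copy the contents of that directory into data/input; this is the search corpus of our Boost search engine.

The next step is to take the contents of data/input and build the index.

Create parser.cc, whose main job is stripping tags.

The tag-stripped content is saved to raw.txt.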

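Goal: strip the tags from every document under data/input and write all of them into the same raw.txt file.
A document's content must not contain any \n. Within one document the title, content and url are separated by \3, and each document ends with \n so that it occupies exactly one line (this is the layout the index-building code in section 6 reads):
title\3content\3url\n

Why \3: it is a non-printable character, so it will not pollute the data source.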

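5.1 Basic code structure of the parser

This is the basic structure of parser.cc.
The file's main job is to clean all the Boost HTML documents that will be searched.
Steps:

Enumerate the names of all HTML documents under const std::string src_path = "data/input"; and store them in the array std::vector<std::string> files_list;
Read each HTML document, that is, each element of files_list, strip its tags, and store the title, content and url in the array std::vector<DocInfo_t> results;
Write the parsed documents from results into the file const std::string output = "data/raw_html/raw.txt";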

#include <iostream>
#include <string>
#include <utility>
#include <vector>
#include <boost/filesystem.hpp>
#include "util.hpp"

const std::string src_path = "data/input";
const std::string output = "data/raw_html/raw.txt";

typedef struct DocInfo
{
  std::string title;   // document title
  std::string content; // document content with tags removed
  std::string url;     // URL of the document on the official site
} DocInfo_t;

// Parameter naming convention used in this project:
/*
 * const& : input parameter
 * *      : output parameter
 * &      : input/output parameter
 * */

bool EnumFile(const std::string &src_path, std::vector<std::string> *files_list);

bool ParseHtml(const std::vector<std::string> &files_list, std::vector<DocInfo_t> *results);

bool SaveHtml(const std::vector<DocInfo_t> &results, const std::string &output);

int main()
{
  std::vector<std::string> files_list; // names of all HTML files under src_path

  // 1. Recursively collect every file name (with its path) under src_path into files_list for later reading
  if (!EnumFile(src_path, &files_list))
  {
    std::cerr << "enum file name error" << std::endl;
    return 1;
  }

  // 2. Read each HTML file and parse it into a DocInfo_t
  std::vector<DocInfo_t> results;

  if (!ParseHtml(files_list, &results))
  {
    std::cerr << "parse html error" << std::endl;
    return 2;
  }

  // 3. Write every DocInfo_t to the output file, using \3 as the separator inside each record
  if (!SaveHtml(results, output))
  {
    std::cerr << "save html error" << std::endl;
    return 3;
  }

  return 0;
}

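5.2 Enumerating the HTML file names with Boost

With the skeleton of parser.cc from section 5.1 in place, the next step is to fill in each function.
Section 5.2 implements bool EnumFile(const std::string &src_path, std::vector<std::string> *files_list);
It enumerates every file under src_path and stores the paths of the files ending in .html in files_list.

In other words, the paths of the .html files under that directory are loaded into memory.

Implementation: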

bool EnumFile(const std::string &src_path, std::vector<std::string> *files_list)
{
  namespace fs = boost::filesystem;
  fs::path root_path(src_path); // root directory where the recursive traversal starts
  // Make sure the directory to search actually exists
  if (!fs::exists(root_path))
  {
    std::cerr << src_path << " not exists " << std::endl;
    return false;
  }
  // Recursively walk root_path
  fs::recursive_directory_iterator end; // default-constructed iterator marks the end of the traversal
  for (fs::recursive_directory_iterator it(root_path); it != end; ++it)
  {
    // Only regular files are wanted; directories and other file types are skipped
    if (!fs::is_regular_file(*it))
      continue;

    // A regular file must also have the .html extension
    if (it->path().extension() != ".html") // extension() returns the file name suffix
      continue;

    // std::cout << "debug: " << it->path().string() << std::endl;
    // At this point the entry is a regular file ending in .html

    files_list->push_back(it->path().string());
  }
  return true;
}

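This function relies on Boost.Filesystem; I am using Boost 1.53. With that version the program has to include <boost/filesystem.hpp> and link against boost_system and boost_filesystem.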

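5.3 Writing the HTML parsing code

Once the name of every HTML document is known, each document has to be parsed.
Before parsing, the document is read from disk by its file name.
Parsing extracts three pieces of information: the title, the content and the url.

This part implements bool ParseHtml(const std::vector<std::string> &files_list, std::vector<DocInfo_t> *results);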

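bool ParseHtml(const std::vector<std::string> &files_list, std::vector<DocInfo_t> *results)
{
  // ParseTitle, ParseContent and ParseUrl are shown after this function;
  // in the real file they must be declared before it.
  for (const std::string &file : files_list)
  {
    // 1. Read the content of the file named `file`
    std::string result;
    if (!ns_util::FileUtil::ReadFile(file, &result))
      continue;

    DocInfo_t doc;
    // 2. Extract the title from the content
    if (!ParseTitle(result, &doc.title))
      continue;
    // 3. Extract the tag-free content
    if (!ParseContent(result, &doc.content))
      continue;
    // 4. Build the url from the file path
    if (!ParseUrl(file, &doc.url))
      continue;

    // Reaching this point means the file was parsed successfully and the result is in doc
    // results->push_back(doc); // detail: this would copy the whole document, which is slow
    results->push_back(std::move(doc)); // doc is large and no longer needed, so move it instead of copying
  }
  return true;
}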
//*****************************************************//
static bool ParseTitle(const std::string &file, std::string *title)
{
  std::size_t begin = file.find("<title>");
  if (begin == std::string::npos)
    return false;

  std::size_t end = file.find("</title>");
  if (end == std::string::npos)
    return false;

  begin += std::string("<title>").size();

  if (begin > end)
    return false;

  *title = file.substr(begin, end - begin);
  return true;
}

// file holds the raw (not yet parsed) content of one HTML file
static bool ParseContent(const std::string &file, std::string *content)
{
  // Tag stripping, written as a simple state machine
  enum status
  {
    LABEL,  // inside a tag
    CONTENT // inside text content
  };

  enum status s = LABEL; // an HTML page is assumed to start with a tag
  for (char c : file)    // walk through every character of the page
  {
    switch (s)
    {
      // In the LABEL state nothing is copied; the state ends when '>' is read
    case LABEL:
      if (c == '>')
        s = CONTENT;
      break;
      // In the CONTENT state characters are appended to content; the state ends when '<' is read
    case CONTENT:
      if (c == '<')
        s = LABEL;
      else
      {
        // '\n' is reserved as the record separator in raw.txt, so it is replaced by a space
        if (c == '\n')
          c = ' '; // detail: c is a copy, not a reference, so the source string is not modified
        content->push_back(c);
      }
      break;

    default:
      break;
    }
  }
  return true;
}

// file_path is the path of the HTML document under ./data/input/ on our Linux machine
static bool ParseUrl(const std::string &file_path, std::string *url)
{
  std::string url_head = "https://www.boost.org/doc/libs/1_78_0/doc/html";
  std::string url_tail = file_path.substr(src_path.size());
  *url = url_head + url_tail;

  return true;
}
The code follows four basic steps:
1. Read each HTML document;
2. Parse the HTML title;
3. Parse the HTML content;
4. Build the HTML url.

How is each HTML document read?
It is opened by its file name (which includes the path) and read line by line.
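util.hpp is not shown in this part of the article, so the following is only a sketch of what a line-by-line reader could look like. ReadFileSketch is a hypothetical name standing in for ns_util::FileUtil::ReadFile; the real helper may differ.

```cpp
#include <fstream>
#include <string>

// Hypothetical stand-in for ns_util::FileUtil::ReadFile.
// Appends the whole file to *out line by line; returns false if the file cannot be opened.
bool ReadFileSketch(const std::string &file_path, std::string *out) {
  std::ifstream in(file_path, std::ios::in);
  if (!in.is_open()) return false;
  std::string line;
  while (std::getline(in, line)) {
    *out += line;  // std::getline drops the '\n', so lines are concatenated directly
  }
  return true;
}
```

One consequence of this design is that the last word of one line and the first word of the next end up adjacent with no space between them; appending a space after each line would be a reasonable variation.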
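How is the HTML title parsed?
The title sits between <title> and </title>. Find the positions of those two tags and take the substring between them.

How is the HTML content parsed?
With a simple state machine.
The document is scanned from the start. A left angle bracket < means a tag begins, which is also where text content ends; a right angle bracket > means the tag ends, which is also where real content begins.

How is the HTML url built?
The paths in the official Boost documentation correspond to the paths in the downloaded release.
Official URL example:            https://www.boost.org/doc/libs/1_78_0/doc/html/accumulators.html
Path in the downloaded release:  boost_1_78_0/doc/html/accumulators.html
Path inside our project:         data/input/accumulators.html  // boost_1_78_0/doc/html/* was copied into data/input/
url_head = "https://www.boost.org/doc/libs/1_78_0/doc/html";
url_tail = the file path with the leading data/input removed, e.g. /accumulators.html
url = url_head + url_tail;  // this reproduces the official link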
 
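How are the parsed HTML documents saved?
The array std::vector<DocInfo_t> results is written to the file const std::string output = "data/raw_html/raw.txt";
The title, content and url of each document are separated by \3 so that they can be split apart again when the file is read back.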
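The article does not list SaveHtml at this point, so here is a minimal sketch of the saving step under the record layout title\3content\3url\n. DocInfoSketch and SaveHtmlSketch are hypothetical names that mirror DocInfo_t and SaveHtml; the project's own implementation may differ.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Mirrors DocInfo_t from parser.cc (hypothetical copy so the sketch is self-contained).
struct DocInfoSketch {
  std::string title, content, url;
};

// Writes one record per line: title\3content\3url\n
bool SaveHtmlSketch(const std::vector<DocInfoSketch> &results, const std::string &output) {
  const char kSep = '\3';
  std::ofstream out(output, std::ios::out | std::ios::binary);
  if (!out.is_open()) return false;
  for (const auto &doc : results) {
    std::string record = doc.title;
    record += kSep;
    record += doc.content;
    record += kSep;
    record += doc.url;
    record += '\n';
    out.write(record.data(), static_cast<std::streamsize>(record.size()));
  }
  return out.good();
}
```

Because every record ends with \n, the index builder can later read the file back with std::getline, one document per line.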
 
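6. Writing the index-building module: Index
In section 5 we cleaned the resources to be searched and wrote all the cleaned HTML documents into
a single file, const std::string output = "data/raw_html/raw.txt";
Next, the index is built from the contents of that file.

This module lives in index.hpp.

Its structure is roughly:
Define the forward-index node struct DocInfo and the inverted-index node struct InvertedElem;
Define the inverted index std::unordered_map<std::string, InvertedList> inverted_index; and the forward index std::vector<DocInfo> forword_index;
Provide the forward-index getter DocInfo* GetForwordIndex(uint64_t doc_id); and the posting-list getter InvertedList* GetInvertedList(std::string& word);
Provide the index-building function bool BulidIndex(const std::string& input);
Provide the inverted-index builder bool BuildInvertedIndex(const DocInfo& doc); and the forward-index builder DocInfo* BulidForWordIndex(const std::string& line);
Make the index a singleton.

The functions and the design ideas behind them are explained below: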
#pragma once
#include <cstdint>
#include <fstream>
#include <iostream>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>
#include "util.hpp"
#include "log.hpp"

namespace ns_index
{
    // Forward index: document ID -> document content, so the content needs a struct to describe it
    struct DocInfo
    {
        std::string title;
        std::string content;
        std::string url;
        uint64_t doc_id;
    };

    // Inverted index: keyword -> document IDs, so each hit needs a struct to describe it
    struct InvertedElem
    {
        uint64_t doc_id; // same type as DocInfo::doc_id
        std::string word;
        int weight;
    };

    // Posting list (inverted list): all hits for one keyword
    typedef std::vector<InvertedElem> InvertedList;

    class Index
    {
    private:
        // Forward index: the array subscript is the document ID, so ID -> content is a direct lookup
        std::vector<DocInfo> forword_index;

        // Inverted index: one keyword maps to many documents.
        // Looking up a keyword yields a vector whose elements each carry a document ID.
        std::unordered_map<std::string, InvertedList> inverted_index;

        static Index *instance;
        static std::mutex mtx;

    private:
        Index() {}
        Index(const Index &) = delete;
        Index &operator=(const Index &) = delete;

    public:
        static Index *GetInstance()
        {
            if (nullptr == instance)
            {
                mtx.lock();
                if (nullptr == instance)
                {
                    instance = new Index();
                }
                mtx.unlock();
            }
            return instance;
        }
        ~Index() {}

    public:
        // Find the document content (the forward-index node) by doc_id
        DocInfo *GetForwordIndex(uint64_t doc_id);
        // Find the posting list by keyword
        InvertedList *GetInvertedList(std::string &word);
        // Build both indexes from data/raw_html/raw.txt, the file produced by parser.cc
        bool BulidIndex(const std::string &input);

    private:
        // Turn one line of raw.txt into a DocInfo and append it to the forward index;
        // afterwards the document can be found directly by doc_id
        DocInfo *BulidForWordIndex(const std::string &line);
        // Take one DocInfo from the forward index and add its keywords to the inverted index
        bool BuildInvertedIndex(const DocInfo &doc);
    };
    Index *Index::instance = nullptr;
    std::mutex Index::mtx;
}
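A side note on the singleton: GetInstance above reads the plain pointer instance outside the lock, which the C++11 memory model formally treats as a data race when another thread is writing it. One common alternative, shown here only as a sketch and not as the project's code, is a function-local static, whose initialization C++11 guarantees to be thread-safe. IndexSketch is a hypothetical name.

```cpp
// Hypothetical alternative to the double-checked locking shown above.
class IndexSketch {
 public:
  // Since C++11, initialization of a function-local static is thread-safe,
  // so no explicit mutex or null check is needed.
  static IndexSketch &GetInstance() {
    static IndexSketch instance;
    return instance;
  }
  IndexSketch(const IndexSketch &) = delete;
  IndexSketch &operator=(const IndexSketch &) = delete;

 private:
  IndexSketch() = default;
};
```

Callers would write IndexSketch::GetInstance() and receive a reference to the same object every time; the trade-off is that the interface returns a reference instead of the pointer used elsewhere in this article.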
 
6.1 Implementing the forward-index and posting-list getters
Implementation:

        // Find the document content (the forward-index node) by doc_id
        DocInfo* GetForwordIndex(uint64_t doc_id)
        {
            if(doc_id >= forword_index.size())
            {
                std::cerr<<"doc_id out range error!"<<std::endl;
                return nullptr;
            }
            return &forword_index[doc_id];
        }
 
        // Find the posting list by keyword
        InvertedList* GetInvertedList(std::string& word)
        {
            auto iter = inverted_index.find(word);
            if(iter == inverted_index.end())
            {
                std::cerr<<word<<" have no InvertedList"<<std::endl;
                return nullptr;
            }
            return &(iter->second); // reuse the iterator instead of looking the key up a second time
        }
 
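6.2 Implementing the index builder
Building the index is fairly involved, so the work is split into three parts: 1. read the file, 2. build the forward index, 3. build the inverted index from the forward index.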
        // Build both indexes (forward and inverted)
        // from data/raw_html/raw.txt, the file produced by parser.cc
        bool BulidIndex(const std::string& input)
        {
            // 1. Open the file the index is built from
            std::ifstream in(input,std::ios::in | std::ios::binary);
            if(!in.is_open())
            {
                std::cerr<<"open "<<input<<" failed!"<<std::endl;
                return false;
            }
            // 2. Index every line; each line is one parsed HTML document
            std::string line; // layout of a line: title\3content\3url\n
            int count = 0;
            while(std::getline(in,line))
            {
                // 3. Build the forward-index entry
                DocInfo* doc = BulidForWordIndex(line);
                if(doc == nullptr)
                {
                    std::cerr<<"sorry:...\n"<<line<<"\nerror"<<std::endl; // for debug
                    continue;
                }
                // 4. Build the inverted-index entries from the forward-index entry
                BuildInvertedIndex(*doc);
                // for debug
                count++;
                if(count % 50 == 0)
                {
                  LOG(NORMAL,"documents indexed so far: "+std::to_string(count));
                }
            }
            in.close();
            return true;
        }
 
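6.3 Implementing the forward-index builder
Building the forward index is a sub-step of the index-building function.
That function reads raw.txt line by line. Each line is one parsed HTML document, and each line produces one forward-index entry.

The cleaned records now have to be split apart again. Each HTML document was reduced to three parts,
title, content and url, separated by \3, so the line is split on \3 and the pieces are stored in the forward-index array.
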
 
        // Building the forward index means turning one line into a DocInfo and appending it to the vector;
        // afterwards the document can be found directly by doc_id
        DocInfo* BulidForWordIndex(const std::string& line) // line is one line of raw.txt
        {
            // Parsing a line is just splitting the string into title, content and url
            std::vector<std::string> results; // pieces produced by the split
            const std::string sep = "\3";
            ns_util::StringUtil::Split(line,&results,sep);

            if(results.size() != 3)
                return nullptr;
            // Copy the pieces into a DocInfo
            DocInfo doc;
            doc.title = results[0];
            doc.content = results[1];
            doc.url = results[2];

            doc.doc_id = forword_index.size(); // the next subscript becomes the document ID
            // Append the DocInfo to the forward index
            forword_index.push_back(std::move(doc));
            // The pointer is only valid until the next push_back, so the caller must use it right away
            return &forword_index.back();
        }
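ns_util::StringUtil::Split lives in util.hpp, which is not listed here. The sketch below shows one way such a helper could behave, using only the standard library; SplitSketch is a hypothetical name and the project's real helper may be implemented differently (for example on top of Boost's string algorithms).

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for ns_util::StringUtil::Split.
// Splits target on every occurrence of sep and drops empty pieces.
void SplitSketch(const std::string &target, std::vector<std::string> *out, const std::string &sep) {
  if (sep.empty()) {  // nothing to split on; avoid looping forever
    out->push_back(target);
    return;
  }
  std::size_t start = 0;
  while (start <= target.size()) {
    std::size_t pos = target.find(sep, start);
    if (pos == std::string::npos) pos = target.size();
    if (pos > start) out->push_back(target.substr(start, pos - start));
    start = pos + sep.size();
  }
}
```

With this behaviour a line such as title\3content\3url yields exactly three pieces, while a record with an empty field yields fewer and is rejected by the results.size() != 3 check above.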
 
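6.4 Implementing the inverted-index builder
How is the inverted index actually built?
1. The forward index gives us the document's title, content and url.
2. The title and content are segmented into keywords, and the frequency of each keyword is counted, giving a keyword-to-frequency map.
   Segmentation uses cppjieba, a header-only open-source library.
3. From the segmented keywords, the posting lists are built and stored in the inverted index.

The details are in the code:
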
 
        // Take one DocInfo that is already in the forward index and build its inverted-index entries
        bool BuildInvertedIndex(const DocInfo& doc)
        {
          // doc holds [title, content, url, doc_id]; the goal is to link each keyword to this doc

          // 1. Segment title and content and count how often each word occurs in each of them
          struct word_cnt
          {
            int title_cnt;   // occurrences in the title
            int content_cnt; // occurrences in the content
            word_cnt():title_cnt(0),content_cnt(0){}
          };

          std::unordered_map<std::string,word_cnt> word_map; // keyword -> counts for title and content
          // Segment the title
          std::vector<std::string> title_words;
          ns_util::JiebaUtil::CurString(doc.title,&title_words);

          // Count the title words; word is a copy (no &) so lowercasing does not touch the segmentation result
          for(std::string word : title_words)
          {
            boost::to_lower(word);
            word_map[word].title_cnt++;
          }
          // Segment the content
          std::vector<std::string> content_words;
          ns_util::JiebaUtil::CurString(doc.content,&content_words);

          for(std::string word : content_words)
          {
            boost::to_lower(word);
            word_map[word].content_cnt++;
          }

          const int X = 10; // weight of one occurrence in the title
          const int Y = 1;  // weight of one occurrence in the content
          /* Detail: should hello, HELLO and HEllO be treated as different keywords?
           * Real search engines ignore case, so results are returned regardless of the case the user typed.
           *
           * Therefore words are lowercased both when counting frequencies and when building the inverted index.
           *
           * Conclusion: for the user, search is case-insensitive.
           *             For the code: every segmented word is lowercased before it enters the index,
           *             and the user's query words are lowercased before lookup,
           *             so any capitalization of a word finds the same entry.
           * */
          // 2. Build the posting-list entries from the keywords of title and content
          for(auto& word_pair : word_map)
          {
              InvertedElem item;                 // one element of a posting list
              item.doc_id = doc.doc_id;          // all words here come from this one document
              item.word = word_pair.first;       // the keyword
              item.weight = X*word_pair.second.title_cnt+Y*word_pair.second.content_cnt; // relevance

              // operator[] returns the posting list for an existing key and inserts an empty one otherwise,
              // so one keyword can accumulate entries from many documents
              InvertedList &inverted_list = inverted_index[word_pair.first];
              inverted_list.push_back(std::move(item));
          }
            return true;
        }
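To make the weight formula concrete, here is a small self-contained sketch. It is not the project's code: a whitespace tokenizer stands in for cppjieba, and CountWords and WeightsSketch are hypothetical names. It computes weight = 10 * (occurrences in the title) + 1 * (occurrences in the content) for every lowercased word.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <unordered_map>

// Whitespace tokenizer used only as a stand-in for cppjieba (assumption).
// Adds per_hit to the weight of every lowercased word found in text.
static void CountWords(const std::string &text, int per_hit,
                       std::unordered_map<std::string, int> *weights) {
  std::istringstream in(text);
  std::string word;
  while (in >> word) {
    for (char &c : word) c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    (*weights)[word] += per_hit;
  }
}

// weight(word) = 10 * hits in title + 1 * hits in content, matching X and Y above.
std::unordered_map<std::string, int> WeightsSketch(const std::string &title,
                                                   const std::string &content) {
  std::unordered_map<std::string, int> weights;
  CountWords(title, 10, &weights);
  CountWords(content, 1, &weights);
  return weights;
}
```

For the title "Boost Filesystem" and the content "filesystem paths and FILESYSTEM iterators", the word filesystem gets 10 + 1 + 1 = 12, boost gets 10 and paths gets 1.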
 
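7. Writing the search module: Searcher
So far the back end has built the index over the data. Building the index is not the goal in itself; the search service it enables is. This calls for a new module, searcher.hpp, which takes the keywords submitted by the user, performs the search and returns the results.

Basic structure: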
#include "index.hpp"
  // Used to deduplicate search results:
  // after jieba segments the query, several words may hit the same document, and those hits should be merged
  struct InvertedElemPrint
  {
    uint64_t doc_id;                // the document that several words point to
    int weight;                     // sum of the weights of those words
    std::vector<std::string> words; // the words that hit this document
    InvertedElemPrint() : doc_id(0), weight(0) {}
  };

namespace ns_searcher{
  class Searcher{
    private:
      ns_index::Index *index; // the index used for lookups
    public:
      Searcher() : index(nullptr) {}
      ~Searcher(){}
    public:
     void InitSearcher(const std::string &input)
    {
      // 1. Get (or create) the Index singleton
      index = ns_index::Index::GetInstance();
      LOG(NORMAL, "got the index singleton...");
      // 2. Build the index through that object
      index->BulidIndex(input);
      LOG(NORMAL, "built the forward and inverted indexes...");
    }
      // query: the search keywords
      // json_string: the search result returned to the user's browser
      void Search(const std::string &query, std::string *json_string)
     {
        // 1. [Segment] split the query into words the same way the index does
        // 2. [Trigger] look each word up in the index
        // 3. [Merge and sort] collect the hits and sort them by relevance (weight), descending
        // 4. [Build] turn the hits into a JSON string with jsoncpp
     }
 };
}
 
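7.1 Writing Search

The key job of this code is deduplicating the search results.
After jieba segments the user's query, several of the resulting keywords may have posting lists that contain the same document,
which means different keywords map to the same document ID. Those repeated documents have to be merged:
each document is kept only once, even when it was reached through different keywords.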
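Before the full function, the merge step can be looked at in isolation. The sketch below is an illustration only: Posting, Merged and MergeSketch are hypothetical names that mirror InvertedElem and InvertedElemPrint. It collapses the posting lists of several query words into one entry per document, sums the weights, and sorts by weight in descending order.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct Posting {  // plays the role of InvertedElem
  uint64_t doc_id;
  std::string word;
  int weight;
};

struct Merged {  // plays the role of InvertedElemPrint
  uint64_t doc_id = 0;
  int weight = 0;
  std::vector<std::string> words;
};

// One entry per document across all lists, weights summed, highest weight first.
std::vector<Merged> MergeSketch(const std::vector<std::vector<Posting>> &lists) {
  std::unordered_map<uint64_t, Merged> by_doc;  // shared by all words, so hits merge
  for (const auto &list : lists) {
    for (const auto &p : list) {
      Merged &m = by_doc[p.doc_id];
      m.doc_id = p.doc_id;
      m.weight += p.weight;
      m.words.push_back(p.word);
    }
  }
  std::vector<Merged> all;
  for (auto &kv : by_doc) all.push_back(std::move(kv.second));
  std::sort(all.begin(), all.end(),
            [](const Merged &a, const Merged &b) { return a.weight > b.weight; });
  return all;
}
```

If the word boost hits documents 1 and 2 with weights 10 and 1, and filesystem hits documents 2 and 3 with weights 12 and 2, the merged result has three entries: document 2 (weight 13), document 1 (weight 10), document 3 (weight 2).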
/*
 * Search: the search service offered to the user.
 *   query       - the user's search keywords
 *   json_string - the search results returned to the user
 */
    void Search(const std::string &query, std::string *json_string)
    {
      // [Tokenise] split the user's query into tokens
      std::vector<std::string> words;
      ns_util::JiebaUtil::CurString(query, &words);

      // [Trigger] look each token up in the index.
      // tokens_map is declared outside the loop so that hits from different
      // tokens on the same doc_id are merged into one entry.
      std::unordered_map<uint64_t, InvertedElemPrint> tokens_map;
      for (std::string word : words) // every token of the user's query
      {
        boost::to_lower(word); // lower-case first so the search is case-insensitive

        // find the inverted list for this token
        ns_index::InvertedList *inverted_list = index->GetInvertedList(word);
        if (nullptr == inverted_list) // no inverted list for this token: skip it
          continue;

        // each element (InvertedElem: doc_id, weight, word) carries a doc_id,
        // which later lets us fetch the document from the forward index
        for (const auto &elem : *inverted_list)
        {
          InvertedElemPrint &item = tokens_map[elem.doc_id]; // one entry per doc_id

          item.doc_id = elem.doc_id;
          item.weight += elem.weight;
          item.words.push_back(elem.word);
        }
      }

      // collect the deduplicated entries
      std::vector<InvertedElemPrint> inverted_list_all;
      for (auto &item : tokens_map) // non-const so that std::move really moves
      {
        inverted_list_all.push_back(std::move(item.second));
      }

      // [Merge and sort] order the merged hits by relevance (weight), descending
      std::sort(inverted_list_all.begin(), inverted_list_all.end(), [](const InvertedElemPrint &e1, const InvertedElemPrint &e2)
           { return e1.weight > e2.weight; });

      // [Build] turn the hits into a JSON string with jsoncpp
      Json::Value root; // JSON array holding one object per result
      for (auto &item : inverted_list_all)
      {
        // use the doc_id to fetch the document from the forward index
        ns_index::DocInfo *doc = index->GetForwordIndex(item.doc_id);
        if (nullptr == doc)
          continue;
        // doc holds the document that matched the keywords
        Json::Value elem;
        elem["title"] = doc->title;
        elem["desc"] = GetDesc(doc->content, item.words[0]); // snippet around the first matching token
        elem["url"] = doc->url;
        root.append(elem);
      }
      // serialise the results
      Json::FastWriter writer;
      *json_string = writer.write(root);
    }
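
Search calls GetDesc, which is not listed in this section. Below is a minimal sketch of what such a helper might look like, assuming it returns a window of text around the first case-insensitive match. MakeSnippet, its default window sizes, and the "None" fallback are placeholders of mine and not the project's code. The window is cut by byte offset, so it could split a multi-byte UTF-8 character; for the mostly ASCII Boost documentation that is probably acceptable.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical snippet builder: return the text surrounding the first
// case-insensitive occurrence of `word` in `content`.
std::string MakeSnippet(const std::string &content, const std::string &word,
                        std::size_t before = 50, std::size_t after = 100) {
  auto it = std::search(content.begin(), content.end(), word.begin(), word.end(),
                        [](unsigned char a, unsigned char b) {
                          return std::tolower(a) == std::tolower(b);
                        });
  if (word.empty() || it == content.end()) return "None"; // nothing to show

  const std::size_t pos = static_cast<std::size_t>(it - content.begin());
  // Clamp the window to the bounds of the content. Using size_t throughout
  // avoids the signed/unsigned pitfalls of "pos - before" going negative.
  const std::size_t start = pos > before ? pos - before : 0;
  const std::size_t end = std::min(content.size(), pos + word.size() + after);

  std::string snippet = content.substr(start, end - start);
  if (end < content.size()) snippet += "..."; // mark that the text continues
  return snippet;
}
```

The match is case-insensitive because the index stores lower-cased tokens while the document text keeps its original case.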

 
8: Writing the http_server module
This module exposes the search service over HTTP.
It uses the open-source library cpp-httplib.
#include "searcher.hpp"
#include "cpp-httplib/httplib.h"

const std::string input = "data/raw_html/raw.txt";
const std::string src_path = "./wwwroot"; // the web root directory
int main()
{

  ns_searcher::Searcher search;
  search.InitSearcher(input); // get the index singleton and build the indexes

  httplib::Server srv;
  srv.set_base_dir(src_path.c_str()); // static files are served from the web root

  // handle GET /s?word=...
  srv.Get("/s", [&search](const httplib::Request &req, httplib::Response &resp){
        if(!req.has_param("word")){
          resp.set_content("The URL must carry a word parameter!","text/plain; charset=utf-8");
            return;
        }

      //1. the URL submitted by the user carries the keyword
      std::string word = req.get_param_value("word"); // read the submitted parameter
      LOG(NORMAL,"user is searching for: "+word);

      //2. run the search for the user
      std::string json_string;
      search.Search(word,&json_string);

      //3. return the results to the user
      resp.set_content(json_string.c_str(),"application/json"); });

  LOG(NORMAL, "Server started...");
  srv.listen("0.0.0.0", 8081);

  return 0;
}
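
The keyword travels in the URL, so spaces and non-ASCII characters arrive percent-encoded (a space becomes %20, for example). As far as I know cpp-httplib already decodes the query string before get_param_value returns, so the handler above does not need to do this itself. Purely to illustrate what that decoding involves, here is a sketch; PercentDecode and HexValue are hypothetical helpers of mine, not the library's implementation, and treating '+' as a space is an assumption borrowed from form encoding.

```cpp
#include <cstddef>
#include <string>

// Convert one hex digit to its value, or return -1 if it is not a hex digit.
int HexValue(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

// Decode %XX escapes; malformed escapes are copied through unchanged.
std::string PercentDecode(const std::string &in) {
  std::string out;
  out.reserve(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) {
    if (in[i] == '%' && i + 2 < in.size()) {
      const int hi = HexValue(in[i + 1]);
      const int lo = HexValue(in[i + 2]);
      if (hi >= 0 && lo >= 0) {
        out.push_back(static_cast<char>(hi * 16 + lo));
        i += 2; // skip the two hex digits just consumed
        continue;
      }
    }
    out.push_back(in[i] == '+' ? ' ' : in[i]); // assumed: '+' means space
  }
  return out;
}
```

A request for /s?word=file%20system would therefore be searched as "file system".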
9: Writing the front-end code
The front end just provides a simple search box for the user.
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <script src="http://code.jquery.com/jquery-2.1.1.min.js"></script>

    <title>Boost Search Engine</title>
    <style>
        /* Remove all default margins and padding (HTML box model) */
        * {
            /* outer margin */
            margin: 0;
            /* inner padding */
            padding: 0;
        }
        /* Make html and body fill 100% of the available height */
        html,
        body {
            height: 100%;
        }
        /* Class selector: .container */
        .container {
            /* width of the div */
            width: 800px;
            /* auto left and right margins centre the block */
            margin: 0px auto;
            /* top margin keeps the content away from the top of the page */
            margin-top: 15px;
        }
        /* Descendant selector: .search inside .container */
        .container .search {
            /* same width as the parent element */
            width: 100%;
            /* height of 52px */
            height: 52px;
        }
        /* Select the input element first (input is a type selector), then set its properties */
        /* The height set on input does not include the border */
        .container .search input {
            /* float left */
            float: left;
            width: 600px;
            height: 50px;
            /* border: width, style, colour */
            border: 1px solid black;
            /* drop the right border of the input box */
            border-right: none;
            /* left padding so the text does not touch the left border */
            padding-left: 10px;
            /* colour and size of the text inside the input */
            color: #CCC;
            font-size: 14px;
        }
        /* Select the button element first (button is a type selector), then set its properties */
        .container .search button {
            /* float left */
            float: left;
            width: 150px;
            height: 52px;
            /* button background colour, #4e6ef2 */
            background-color: #4e6ef2;
            /* colour of the button text */
            color: #FFF;
            /* size of the button text */
            font-size: 19px;
            font-family:Georgia, 'Times New Roman', Times, serif;
        }
        .container .result {
            width: 100%;
        }
        .container .result .item {
            margin-top: 15px;
        }

        .container .result .item a {
            /* block-level element, so it takes a line of its own */
            display: block;
            /* remove the underline of the a element */
            text-decoration: none;
            /* font size of the link text */
            font-size: 20px;
            /* colour of the link text */
            color: #4e6ef2;
        }
        .container .result .item a:hover {
            text-decoration: underline;
        }
        .container .result .item p {
            margin-top: 5px;
            font-size: 16px;
            font-family:'Lucida Sans', 'Lucida Sans Regular', 'Lucida Grande', 'Lucida Sans Unicode', Geneva, Verdana, sans-serif;
        }

        .container .result .item i{
            /* block-level element, so it takes a line of its own */
            display: block;
            /* cancel the italic style */
            font-style: normal;
            color: green;
        }
    </style>
</head>
<body>
    <div class="container">
        <div class="search">
            <input type="text" value="Enter search keywords">
            <button onclick="Search()">Search</button>
        </div>
        <div class="result">
        </div>
    </div>
    <script>
        function Search(){
            // alert() would open a browser pop-up:
            // alert("hello js!");
            // 1. Read the query; $ is an alias for jQuery
            let query = $(".container .search input").val();
            console.log("query = " + query); // console writes to the browser's developer console, handy for inspecting JS values

            // 2. Send the HTTP request; $.ajax is jQuery's function for exchanging data with the back end
            $.ajax({
                type: "GET",
                // encodeURIComponent keeps spaces, & and non-ASCII characters intact in the URL
                url: "/s?word=" + encodeURIComponent(query),
                success: function(data){
                    console.log(data);
                    BuildHtml(data);
                }
            });
        }

        function BuildHtml(data){
            // get the result container in the page
            let result_lable = $(".container .result");
            // clear the previous search results
            result_lable.empty();

            for( let elem of data){
                // console.log(elem.title);
                // console.log(elem.url);
                let a_lable = $("<a>", {
                    text: elem.title,
                    href: elem.url,
                    // open the link in a new page
                    target: "_blank"
                });
                let p_lable = $("<p>", {
                    text: elem.desc
                });
                let i_lable = $("<i>", {
                    text: elem.url
                });
                let div_lable = $("<div>", {
                    class: "item"
                });
                a_lable.appendTo(div_lable);
                p_lable.appendTo(div_lable);
                i_lable.appendTo(div_lable);
                div_lable.appendTo(result_lable);
            }
        }
    </script>
</body>
</html>

 
 
10: Writing the utility classes
These classes live in util.hpp.
#pragma once
#include <iostream>
#include <string>
#include <vector>
#include <fstream>
#include <mutex>
#include <unordered_set>

#include <boost/algorithm/string.hpp>
#include "cppjieba/Jieba.hpp"
#include "log.hpp"
namespace ns_util
{
  class FileUtil
  {
  public:
    static bool ReadFile(const std::string &file_name, std::string *out)
    {
      // open the file for reading
      std::ifstream in(file_name, std::ios::in);
      if (!in.is_open())
      {
        std::cerr << "open file " << file_name << " error" << std::endl;
        return false;
      }
      // the file is open: read it line by line (std::getline drops the line breaks)
      std::string line;

      while (std::getline(in, line))
      {
        *out += line;
      }
      in.close();
      return true;
    }
  };
  class StringUtil
  {
  public:
    static void Split(const std::string &target, std::vector<std::string> *out, const std::string &sep)
    {
      // boost split
      boost::split(*out, target, boost::is_any_of(sep), boost::token_compress_on);
    }
  };
  // cppjieba dictionary paths
  const char *const DICT_PATH = "./dict/jieba.dict.utf8";
  const char *const HMM_PATH = "./dict/hmm_model.utf8";
  const char *const USER_DICT_PATH = "./dict/user.dict.utf8";
  const char *const IDF_PATH = "./dict/idf.utf8";
  const char *const STOP_WORD_PATH = "./dict/stop_words.utf8"; // stop-word dictionary

  // Earlier version of the jieba wrapper, which did not remove stop words
  // class JiebaUtil
  //   {
  //     private:
  //       static cppjieba::Jieba jieba;
  //     public:
  //       // tokenise src and store the tokens in out
  //       static void CurString(const std::string& src,std::vector<std::string>* out)
  //       {
  //         jieba.CutForSearch(src,*out);
  //       }
  //   };
  //     cppjieba::Jieba JiebaUtil:: jieba(DICT_PATH, HMM_PATH,USER_DICT_PATH,IDF_PATH,STOP_WORD_PATH);
  // }

  // Current version: remove stop words after tokenising
  class JiebaUtil
  {
  private:
    cppjieba::Jieba jieba;
    std::unordered_set<std::string> stop_words; // stop words; a set gives fast lookup
    static JiebaUtil *instance;

  private:
    JiebaUtil() : jieba(DICT_PATH, HMM_PATH, USER_DICT_PATH, IDF_PATH, STOP_WORD_PATH) {}
    JiebaUtil(const JiebaUtil &) = delete;
    JiebaUtil &operator=(const JiebaUtil &) = delete;

  public:
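    static JiebaUtil *GetInstance()
    {
      // The mutex must be static: a local one would be created afresh on
      // every call, so no two threads would ever contend for the same lock.
      static std::mutex mtx;
      if (nullptr == instance)
      {
        mtx.lock();
        if (nullptr == instance)
        {
          JiebaUtil *tmp = new JiebaUtil();
          tmp->InitJiebaUtil();
          instance = tmp; // publish only after the stop words are loaded
        }
        mtx.unlock();
      }
      return instance;
    }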
    void InitJiebaUtil()
    {
      std::ifstream in(STOP_WORD_PATH);
      if (!in.is_open())
      {
        LOG(FATAL, "load stop word failed...");
        return;
      }
      std::string line;
      while (std::getline(in, line))
      {
        stop_words.insert(line);
      }
      in.close();
    }
    
    void CutStringHelper(const std::string &src, std::vector<std::string> *out)
    {
      jieba.CutForSearch(src, *out);
      // remove stop words: walk the token vector and erase every token found in the set
      for (auto it = out->begin(); it != out->end();)
      {
        auto iter = stop_words.find(*it);
        if (iter !=stop_words.end())
        {
          // the current token is a stop word
          it = out->erase(it);
        }
        else
        {
          ++it;
        }
      }
    }

  public:
    static void CurString(const std::string &src, std::vector<std::string> *out)
    {
      GetInstance()->CutStringHelper(src, out);
    }
  };
   JiebaUtil *JiebaUtil::instance = nullptr;

}
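
As an aside, since C++11 a function-local static is guaranteed to be initialised exactly once even under concurrent calls, which offers a shorter alternative to the hand-written double-checked locking above. The sketch below shows that pattern together with an erase-remove version of the stop-word filter. StopWords and RemoveStopWords are hypothetical names of mine; this is not how the project is written, just one possible variation.

```cpp
#include <algorithm>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical stop-word holder, used only to illustrate the pattern.
class StopWords {
public:
  // The function-local static is constructed on first use; the language
  // guarantees this happens once, even with several threads calling in.
  static StopWords &Instance() {
    static StopWords instance;
    return instance;
  }
  StopWords(const StopWords &) = delete;
  StopWords &operator=(const StopWords &) = delete;

  void Add(std::string word) { words_.insert(std::move(word)); }
  bool Contains(const std::string &word) const { return words_.count(word) != 0; }

private:
  StopWords() = default;
  std::unordered_set<std::string> words_;
};

// Drop every stop word from tokens in a single pass (erase-remove idiom),
// instead of calling erase once per stop word.
void RemoveStopWords(std::vector<std::string> *tokens) {
  const StopWords &stop = StopWords::Instance();
  tokens->erase(std::remove_if(tokens->begin(), tokens->end(),
                               [&stop](const std::string &t) { return stop.Contains(t); }),
                tokens->end());
}
```

With "the" and "of" registered as stop words, the tokens the / path / of / boost are reduced to path / boost.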

 
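11: Adding logging and deploying the service to Linux
Adding a simple logging facility:
The log only prints a few pieces of information, to make debugging and monitoring easier.

Create a new file log.hpp with the following code: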
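#pragma once

#include <iostream>
#include <string>
#include <ctime>

#define NORMAL  1
#define WARNING 2
#define DEBUG   3
#define FATAL   4

// #LEVEL turns the level name itself (e.g. NORMAL) into a string
#define LOG(LEVEL,MESSAGE) log(#LEVEL,MESSAGE,__FILE__,__LINE__)

// inline avoids multiple-definition errors if this header ends up in more than one translation unit
inline void log(const std::string &level, const std::string &message, const std::string &file, int line)
{
  std::cout << "level [" << level << "] "
            << "timestamp [" << time(nullptr) << "] "
            << "[" << message << "] "
            << "file [" << file << "] "
            << "line [" << line << "]"
            << std::endl;
}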
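
The log above prints the raw value of time(nullptr), i.e. seconds since the epoch. If a readable timestamp is preferred, something along these lines could be used. FormatTimestamp is a hypothetical helper of mine; it assumes a POSIX system (gmtime_r, which suits the CentOS environment) and formats in UTC.

```cpp
#include <cstddef>
#include <ctime>
#include <string>

// Format a time_t as "YYYY-MM-DD HH:MM:SS" in UTC.
// gmtime_r is the thread-safe POSIX variant of std::gmtime, which matters
// because the HTTP server may log from several threads.
std::string FormatTimestamp(std::time_t t) {
  std::tm tm_utc{};
  gmtime_r(&t, &tm_utc);
  char buf[32];
  const std::size_t n = std::strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S", &tm_utc);
  return std::string(buf, n);
}
```

Inside log, time(nullptr) could then be wrapped as FormatTimestamp(time(nullptr)).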

 
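Deploy it to a Linux server; from then on the search page can be reached with just the server's IP and port.
[xjh@VM-12-10-centos boost_searcher]$ nohup ./http_server &
By default this command creates a nohup.out file in the current directory, which is where the log output ends up.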
 
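Directions for extending the project
 Build a whole-site search, although this demands considerably more server resources.
 
 Design an online update scheme (signals, a crawler) to round out the server design:
 a signal periodically triggers rebuilding of the forward and inverted indexes, while a crawler fetches new pages.
 
 Replace the third-party components with designs of your own,
 for example write your own HTTP server, or use a server such as Nginx.
 
 Add paid (bid-based) ranking to the search engine.
 
 Hot-word statistics and smart keyword suggestions (trie, priority queue).
 
 Add login and registration, bringing in MySQL.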
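
To give an idea of the hot-word item above, here is a rough sketch of counting queries and reading off the k most frequent ones with a priority queue. HotWords, Record and TopK are hypothetical names of mine and the project contains nothing like this yet; a real version would also need locking, since the HTTP server handles requests concurrently.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical hot-word counter (not thread-safe in this sketch).
class HotWords {
public:
  // Count one occurrence of a search query.
  void Record(const std::string &query) { ++counts_[query]; }

  // Return up to k (query, count) pairs, most frequent first.
  std::vector<std::pair<std::string, int>> TopK(std::size_t k) const {
    using Entry = std::pair<int, std::string>; // (count, query)
    // Min-heap of at most k entries: the least frequent candidate sits on
    // top and is evicted whenever the heap grows beyond k.
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (const auto &kv : counts_) {
      heap.emplace(kv.second, kv.first);
      if (heap.size() > k) heap.pop();
    }
    std::vector<std::pair<std::string, int>> result;
    while (!heap.empty()) {
      result.emplace_back(heap.top().second, heap.top().first);
      heap.pop();
    }
    std::reverse(result.begin(), result.end()); // heap pops in ascending order
    return result;
  }

private:
  std::unordered_map<std::string, int> counts_;
};
```

Keeping only k entries in the heap costs O(n log k) for n distinct queries, rather than sorting all of them.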

Original article: https://blog.csdn.net/m0_46606290/article/details/126022469
