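Comparing the data differences between two Elasticsearch indices

I. Preface

After changing an index mapping, we usually create a new index and reindex the data into it so that the online service is not affected. But checking whether the data in the new index is correct, and where it differs from the old index, is always hard to do.

I was lucky to come across an expert's article and put two of its approaches into practice to compare the data of the old and new indices. The article: "Illustrated | Four ways to get the differences between two Elasticsearch indices" (图解 | Elasticsearch 获取两个索引数据不同之处的四种方案)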

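II. The Kibana approach

1. Comparing the data differences between two indices in Kibana

Sometimes we need to compare one field across two indices, for example the ids of both indices, to find missing documents. The query below does the job. (It works locally and in any other environment.)

(1) Open Kibana Dev Tools.
(2) Enter the query below.
(3) index_old and index_new are the names of the indices to compare.
(4) id is the field to compare on, ideally a field that is unique in business terms.
(5) Run it and check the result.
How it works: the terms aggregation builds one bucket per id, and the cardinality sub-aggregation counts how many distinct indices that id appears in. An id present in both indices gets a count of 2. The bucket_selector keeps only buckets with a count below 2, so the result contains exactly the ids that exist in just one index. Note that the terms size must be at least the number of distinct ids, otherwise some ids are never examined.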


POST index_new,index_old/_search
{
  "size": 0,
  "aggs": {
    "group_by_uid": {
      "terms": {
        "field": "id",
        "size": 1000000
      },
      "aggs": {
        "count_indices": {
          "cardinality": {
            "field": "_index"
          }
        },
        "values_bucket_filter_by_index_count": {
          "bucket_selector": {
            "buckets_path": {
              "count": "count_indices"
            },
            "script": "params.count < 2"
          }
        }
      }
    }
  }
}
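
To make the principle concrete, here is a minimal in-memory sketch of what the aggregation computes. It does not talk to Elasticsearch; the function name idsInOneIndexOnly and the sample ids are hypothetical and only serve as illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// idsInOneIndexOnly mimics the aggregation in memory: group by id, count the
// distinct indices each id appears in, and keep the ids seen in fewer than two.
func idsInOneIndexOnly(idsByIndex map[string][]int) []int {
	indicesPerID := make(map[int]map[string]struct{})
	for index, ids := range idsByIndex {
		for _, id := range ids {
			if indicesPerID[id] == nil {
				indicesPerID[id] = make(map[string]struct{})
			}
			indicesPerID[id][index] = struct{}{}
		}
	}

	var missing []int
	for id, indices := range indicesPerID {
		if len(indices) < 2 { // same condition as "params.count < 2"
			missing = append(missing, id)
		}
	}
	sort.Ints(missing) // map iteration order is random, so sort for stable output
	return missing
}

func main() {
	// Hypothetical sample ids, not taken from a real index.
	sample := map[string][]int{
		"index_old": {6417, 6418, 6419},
		"index_new": {6417},
	}
	fmt.Println(idsInOneIndexOnly(sample)) // [6418 6419]
}
```

Like the query, this only tells you that an id exists in a single index, not which one. To find that out you still have to look the id up in each index.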

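Result:

Note: "key" : 6418 here means that the document with id 6418 is part of the difference, i.e. it exists in only one of the two indices. You need to check for yourself why it differs.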

{
  "took" : 1851,
  "timed_out" : false,
  "_shards" : {
    "total" : 10,
    "successful" : 10,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 21969,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_uid" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : 6418,
          "doc_count" : 1,
          "count_indices" : {
            "value" : 1
          }
        },
        {
          "key" : 6419,
          "doc_count" : 1,
          "count_indices" : {
            "value" : 1
          }
        }
      ]
    }
  }
}
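
If the bucket list is long, reading it by eye gets tedious. The following is a rough sketch of pulling the differing ids out of such a response with Go's standard library. The type searchResponse and the function differingIDs are names made up for this example, and the sketch assumes the aggregation is called group_by_uid as in the query above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// searchResponse models only the parts of the response that are used here.
type searchResponse struct {
	Aggregations struct {
		GroupByUID struct {
			Buckets []struct {
				Key interface{} `json:"key"`
			} `json:"buckets"`
		} `json:"group_by_uid"`
	} `json:"aggregations"`
}

// differingIDs returns the bucket keys of group_by_uid as strings.
func differingIDs(body string) ([]string, error) {
	dec := json.NewDecoder(strings.NewReader(body))
	dec.UseNumber() // keep numeric ids exact instead of converting to float64

	var resp searchResponse
	if err := dec.Decode(&resp); err != nil {
		return nil, err
	}

	buckets := resp.Aggregations.GroupByUID.Buckets
	ids := make([]string, 0, len(buckets))
	for _, b := range buckets {
		ids = append(ids, fmt.Sprint(b.Key))
	}
	return ids, nil
}

func main() {
	// Trimmed copy of the response shown above.
	body := `{"aggregations":{"group_by_uid":{"buckets":[
		{"key":6418,"doc_count":1,"count_indices":{"value":1}},
		{"key":6419,"doc_count":1,"count_indices":{"value":1}}]}}}`

	ids, err := differingIDs(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(ids) // [6418 6419]
}
```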

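III. Other tools

github: esdiff
PS: the author of this tool is also the author of olivere/elastic, so it comes from an expert and is worth a try.

1. Steps for local use

(1) Download
go install github.com/olivere/esdiff@latest

(2) Run the command
go install places the binary in $GOBIN (by default $GOPATH/bin); if that directory is on your PATH, call esdiff directly instead of ./esdiff.
./esdiff -u=true -d=false 'http://localhost:9200/index_old/type' 'http://localhost:9200/index_new/type'

(3) Output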
Unchanged       1
Updated 3       {*diff.Document}.Source["message"]:
        -: "Playing the piano is fun as well"
        +: "Playing the guitar is fun as well"
 
Created 4       {*diff.Document}:
        -: (*diff.Document)(nil)
        +: &diff.Document{ID: "4", Source: map[string]interface {}{"message": "Climbed that mountain", "user": "sandrae"}}

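2. Common parameters

When fields have been added or removed, exclude or include comes in handy: you can leave those fields out and verify that the rest of the data is correct.

esdiff [flags]

-dsort string    [field to sort the destination index by] "id" or "-id"
-ssort string    [field to sort the source index by] "id" or "-id"
-exclude string  [fields to exclude from _source] "hash_value,sub.*"
-include string  [fields to include from _source] "obj.*"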

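3. Custom document ID

In my case the document ID is currently derived from the index name, for example:

// the id is 1 in both, but the document IDs differ, so the document shows up as a difference
index_old_1
index_new_1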

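What we mainly want is to compare the fields inside _source, so a -replace-with parameter was added to specify the unique ID.
For example:

// use id in place of the document ID, so that the _source fields are compared and the differences are reported

go run main.go -ssort=unit_id -dsort=unit_id -replace-with=id 'http://localhost:9200/index_old/type' 'http://localhost:9200/index_new/type'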
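
The idea behind such an ID replacement can be sketched roughly as follows. This is not the actual esdiff code: the Document type and the replaceID function are hypothetical, and the sketch simply assumes that the ID is swapped for the value of a field in _source before documents are matched.

```go
package main

import "fmt"

// Document is a hypothetical stand-in for a document read from an index.
type Document struct {
	ID     string
	Source map[string]interface{}
}

// replaceID returns a copy of doc whose ID is taken from Source[field].
// It reports false when the field is absent and leaves the ID unchanged.
func replaceID(doc Document, field string) (Document, bool) {
	v, ok := doc.Source[field]
	if !ok {
		return doc, false
	}
	doc.ID = fmt.Sprint(v)
	return doc, true
}

func main() {
	oldDoc := Document{ID: "index_old_1", Source: map[string]interface{}{"id": 1, "message": "hello"}}
	newDoc := Document{ID: "index_new_1", Source: map[string]interface{}{"id": 1, "message": "hello"}}

	a, _ := replaceID(oldDoc, "id")
	b, _ := replaceID(newDoc, "id")

	// After the replacement both documents share the ID "1" and can be paired up.
	fmt.Println(a.ID == b.ID) // true
}
```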


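4. How the tool compares the data

(1) It reads the ES data in batches according to the parameters, using scroll queries, 100 documents per batch by default.
(2) It compares documents with cmp.Equal(srcDoc.Source, dstDoc.Source) from the go-cmp package.
(3) Depending on the parameters, it prints the created, updated, deleted and other differing documents.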
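
The classification step can be illustrated with a simplified sketch. It is not esdiff's implementation: it works on slices that are already in memory rather than on scroll batches, it uses reflect.DeepEqual from the standard library as a stand-in for cmp.Equal so that the example has no external dependency, and the Document type and diff function are hypothetical names.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// Document is a hypothetical stand-in for a document read from an index.
type Document struct {
	ID     string
	Source map[string]interface{}
}

// diff classifies every document ID as unchanged, updated, created
// (only in dst) or deleted (only in src).
func diff(src, dst []Document) map[string][]string {
	srcByID := make(map[string]Document, len(src))
	for _, d := range src {
		srcByID[d.ID] = d
	}

	result := make(map[string][]string)
	for _, d := range dst {
		s, ok := srcByID[d.ID]
		switch {
		case !ok:
			result["created"] = append(result["created"], d.ID)
		case reflect.DeepEqual(s.Source, d.Source): // stand-in for cmp.Equal
			result["unchanged"] = append(result["unchanged"], d.ID)
		default:
			result["updated"] = append(result["updated"], d.ID)
		}
		delete(srcByID, d.ID)
	}

	// Whatever is left in srcByID has no counterpart in dst.
	for id := range srcByID {
		result["deleted"] = append(result["deleted"], id)
	}
	sort.Strings(result["deleted"]) // map iteration order is random
	return result
}

func main() {
	src := []Document{
		{ID: "1", Source: map[string]interface{}{"message": "Hello"}},
		{ID: "2", Source: map[string]interface{}{"message": "Goodbye"}},
		{ID: "3", Source: map[string]interface{}{"message": "Playing the piano is fun as well"}},
	}
	dst := []Document{
		{ID: "1", Source: map[string]interface{}{"message": "Hello"}},
		{ID: "3", Source: map[string]interface{}{"message": "Playing the guitar is fun as well"}},
		{ID: "4", Source: map[string]interface{}{"message": "Climbed that mountain"}},
	}

	r := diff(src, dst)
	fmt.Println("unchanged:", r["unchanged"]) // unchanged: [1]
	fmt.Println("updated:", r["updated"])     // updated: [3]
	fmt.Println("created:", r["created"])     // created: [4]
	fmt.Println("deleted:", r["deleted"])     // deleted: [2]
}
```

Holding one side in a map keeps the sketch short, but it needs the whole source index in memory, which is one reason a real tool would work batch by batch instead.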

end


Original article: https://blog.csdn.net/LJFPHP/article/details/125882840