ElasticSearch聚合分析
发布日期:2021-05-08 23:31:03 浏览次数:21 分类:博客文章

本文共 7245 字,大约阅读时间需要 24 分钟。

聚合用于分析查询结果集的统计指标,我们以观看日志分析为例,介绍各种常用的ElasticSearch聚合操作。

目录:

首先展示一下我们要分析的文档结构:

{    "video_id": 1289643545120062253, // 视频id    "video_uid": 3931482202390368051, // 视频发布者id    "uid": 47381776787453866, // 观看用户id    "time": 1533891263224, // 时间发生时间    "watch_duration": 30 // 观看时长}

每个文档记录了一个观看事件,我们通过聚合分析用户的观看行为。

ElasticSearch引入了两个相关概念:

  • 桶(Buckets): 满足特定条件的文档的集合
  • 指标(Metrics): 桶中文档的统计值,如特定字段的平均值

查询用户观看视频数和观看时长

首先用sql语句描述这个查询:

SELECT uid, count(*) as view_countFROM view_logWHERE time >= #{since} AND time <= #{to} GROUP BY uid;

ES 查询:

GET /view_log/_search{   "size" : 0,   "query": {       "range": {           "time": {               "gte": 0, // since               "lte": 0 // to           }       }   },   "aggs": {      "agg": { // agg为聚合的名称        "terms": { // 聚合的条件为 uid 相同          "field": "uid"        }      }   }}

response:

{  "took": 10,  "timed_out": false,  "_shards": {    "total": 5,    "successful": 5,    "skipped": 0,    "failed": 0  },  "hits": {    "total": 100000,    "max_score": 0,    "hits": []  },  "aggregations": {    "agg": {      "buckets": [        {          "key": 21836334489858688,          "doc_count": 4026        },        {          "key": 31489302390368051,          "doc_count": 2717        }      ]    }}

result.aggregations.agg.buckets列表中包含了查询的结果。

因为我们按照terms:uid进行聚合,每个bucket为uid相同的文档集合,key字段即为uid。

doc_count 字段表明bucket中文档的数目即sql语句中的count(*) as view_count

我们可以为查询添加额外的统计指标, sql描述:

SELECT uid, count(*) as view_count, avg(watch_duration) as avg_duration FROM view_logWHERE time >= #{since} AND time <= #{to} GROUP BY uid;

ES 查询:

GET /view_log/_search{   "size" : 0,   "query": {       "range": {           "time": {               "gte": 0, // since               "lte": 0 // to           }       }   },   "aggs": {      "agg": { // agg为聚合的名称        "terms": { // 聚合的条件为 uid 相同          "field": "uid"        },        "aggs": { // 添加统计指标(Metrics)          "avg_duration": {               "avg": { // 统计 watch_duration 的平均值                "field": "watch_duration"               }          }        }      }   }}

response:

{  "took": 10,  "timed_out": false,  "_shards": {    "total": 5,    "successful": 5,    "skipped": 0,    "failed": 0  },  "hits": {    "total": 100000,    "max_score": 0,    "hits": []  },  "aggregations": {    "agg": {      "buckets": [        {          "key": 21836334489858688,          "doc_count": 4026,          "avg_duration": {            "value": 12778.882352941177          }        },        {          "key": 31489302390368051,          "doc_count": 2717,          "avg_duration": {            "value": 2652.5714285714284          }        }      ]    }}

avg_duration.value 表示 watch_duration 的平均值即该用户的平均观看时长。

聚合分页器

在实际应用中用户的数量非常惊人, 不可能通过一次查询得到全部结果因此我们需要分页器分批取回:

GET /view_log/_search{   "size" : 0,   "query": {       "range": {           "time": {               "gte": 0, // since               "lte": 0 // to           }       }   },   "aggs": {      "agg": {         "terms": {             "field": "uid",            "size": 10000, // bucket 的最大个数            "include": { // 将聚合结果分为10页,序号为[0,9], 取第一页                "partition": 0,                "num_partitions": 10             }        },        "aggs": {           "avg_duration": {               "avg": {                 "field": "watch_duration"               }          }        }      }   }}

上述查询与上节的查询几乎完全相同,只是在aggs.agg.terms字段中添加了include字段进行分页。

查询视频uv

单个视频uv

uv是指观看一个视频的用户数(unique visit),与此相对没有按照用户去重的观看数称为pv(page visit)。

用SQL语句来描述:

SELECT video_id, count(*) as pv, count(distinct uid) as uvFROM view_logWHERE video_id = #{video_id};

ElasticSearch可以方便的进行count(distinct)查询:

GET /view_log/_search{    "aggs": {      "uv": {        "cardinality": {          "field": "uid"        }      }   }}

response:

{  "took": 255,  "timed_out": false,  "_shards": {    "total": 5,    "successful": 5,    "skipped": 0,    "failed": 0  },  "hits": {    "total": 17579,    "max_score": 0,    "hits": []  },  "aggregations": {    "uv": {      "value": 11    }  }}

批量查询视频uv

ElasticSearch也可以批量查询count(distinct), 先用SQL进行描述:

SELECT video_id, count(*) as pv, count(distinct uid) as uvFROM view_logGROUP BY video_id;

查询:

GET /view_log/_search{    "size": 0,    "aggs": {      "video": {        "terms": {          "field": "video_id"        },        "aggs": {          "uv": {              "cardinality": {                "field": "uid"              }          }        }      }   }}

response:

{  "took": 313,  "timed_out": false,  "_shards": {    "total": 5,    "successful": 5,    "skipped": 0,    "failed": 0  },  "hits": {    "total": 16940,    "max_score": 0,    "hits": []  },  "aggregations": {    "video": {      "buckets": [        {          "key": 25417499722062, // 视频id          "doc_count": 427, // 视频观看次数 pv          "uv": {            "value": 124 // 观看视频的用户数 uv          }        },        {          "key": 72446898144,          "doc_count": 744,          "uv": {            "value":233          }        }      ]    }  }}

Having查询

SQL可以使用HAVING语句根据聚合结果进行过滤,ElasticSearch可以使用pipeline aggregations达到此效果不过语法较为繁琐。

根据 count 进行过滤

使用SQL查询观看超过200次的视频:

SELECT video_id, count(*) as view_countFROM view_logGROUP BY video_idHAVING count(*) > 200;
GET /view_log/_search{  "size": 0,  "aggs": {    "view_count": {      "terms": {        "field": "video_id"      },      "aggs": {        "having": {          "bucket_selector": {            "buckets_path": { // 选择 view_count 聚合的 doc_count 进行过滤              "view_count": "_count"            },            "script": {              "source": "params.view_count > 200"            }          }        }      }    }  }}

response:

{  "took": 83,  "timed_out": false,  "_shards": {    "total": 5,    "successful": 5,    "skipped": 0,    "failed": 0  },  "hits": {    "total": 775,    "max_score": 0,    "hits": []  },  "aggregations": {    "view_count": {      "buckets": [        {          "key": 35025417499764062,          "doc_count": 529        },        {          "key": 19913672446898144,          "doc_count": 759        }      ]    }  }}

ElasticSearch实现类似HAVING查询的关键在于使用选择聚合结果进行过滤。

根据其它指标进行过滤

接下来我们尝试查询平均观看时长大于5分钟的视频, 用SQL描述该查询:

SELECT video_id FROM view_logGROUP BY video_idHAVING avg(watch_duration) > 300;
GET /view_log/_search{  "size": 0,  "aggs": {    "video": {      "terms": {        "field": "video_id"      },      "aggs": {        "avg_duration": {          "avg": {            "field": "watch_duration"          }         },        "avg_duration_filter": {          "bucket_selector": {            "buckets_path": {              "avg_duration": "avg_duration"              },              "script": {                "source": "params.avg_duration > 200"              }          }          }      }    }  }}

response:

{  "took": 137,  "timed_out": false,  "_shards": {    "total": 5,    "successful": 5,    "skipped": 0,    "failed": 0  },  "hits": {    "total": 255,    "max_score": 0,    "hits": []  },  "aggregations": {    "video": {      "buckets": [        {          "key": 5417499764062,          "doc_count": 91576,          "avg_duration": {            "value": 103          }        },        {          "key": 19913672446898144,          "doc_count": 15771,          "avg_duration": {            "value": 197          }        }      ]    }  }}
上一篇:Cassandra 数据模型
下一篇:ElasticSearch入门简介

发表评论

最新留言

很好
[***.229.124.182]2025年04月22日 05时59分14秒

关于作者

    喝酒易醉,品茶养心,人生如梦,品茶悟道,何以解忧?唯有杜康!
-- 愿君每日到此一游!

推荐文章

005从零开始学Python—字符串处理 2025-03-29
2024年网络安全八大前沿趋势,零基础入门到精通,收藏这篇就够了 2025-03-29
2024年薪酬最高的五个网络安全职位,零基础入门到精通,收藏这一篇就够 2025-03-29
2024年非科班的人合适转行做程序员吗? 2025-03-29
2024数字安全创新性案例报告,从零基础到精通,收藏这篇就够了! 2025-03-29
2024最新最全CTF入门指南(非常详细)零基础入门到精通,收藏这一篇就够了 2025-03-29
2024最新科普什么是大模型?零基础入门到精通,收藏这篇就够了 2025-03-29
2024最新程序员接活儿搞钱平台盘点 2025-03-29
2024最火专业解读:信息安全(非常详细)零基础入门到精通,收藏这一篇就够了 2025-03-29
2024版最新SRC漏洞挖掘思路手法(非常详细),零基础入门到精通,收藏这一篇就够了 2025-03-29
2024版最新渗透测试零基础入门教程,带你入门到精通(超详细),收藏这篇就够了 2025-03-29
2024版最新网络安全入门必备读书清单(非常详细)零基础入门到精通,收藏这一篇就够了 2025-03-29
2024版最新网络安全教程从入门到精通,看完这一篇就够了 2025-03-29
2024网络安全岗就业前景如何?零基础入门到精通,收藏这篇就够了 2025-03-29
2024零基础如何入门网络安全? 2025-03-29
2024,java开发,已经炸了吗? 2025-03-29
2025入门黑客技术必读书籍(非常全面)带你从小白进阶大佬!收藏这一篇就够了 2025-03-29
2025入门黑客技术必读书籍(非常全面)带你从小白进阶大佬!收藏这篇就够了 2025-03-29
2025大语言模型入门该怎么学?零基础入门到精通,收藏这篇就够了 2025-03-29
2025年3月全国计算等级考试(报名操作指南)从零基础到精通,收藏这篇就够了! 2025-03-29