Aggregation

Aggregation is an attribute of Search API.

POST employees/_search
{
    "size": 0 // Since we don't need the items
    "aggs": {
        "max_salary": {
            "max": {
                "field": "salary"
            }
        },
        "min_salary": {
            "min": {
                "field": "salary"
            }
        },
    }
}

Bucket

It can split the data into different buckets

POST employee/_search
{
    "size": 0,
    "aggs": {
        "age_buckets": {
            "terms": {
                "field": "age",
                "size": 3
            }
        }
    }
}

Nested aggregation

We can put an aggregation under another one.

POST employees/_search
{
    "size": 0 // Since we don't need the items
    "aggs": {
        "job_buckets": {
            "terms": {
                "fields": "job.keyword"
            },
            "aggs": {
                "senior_employees": {
                    "top_hits": {
                        "size": 3,
                        "sort": [
                            {
                                "age": {
                                    "order": "desc"
                                }
                            }
                        ]
                    }
                }
            }
        }
    }
}

Pipeline aggregation

Apply functions on the results of aggregation.

POST employees/_search
{
    "size": 0 // Since we don't need the items
    "aggs": {
        "jobs": {
            "terms": {
                "fields": "job.keyword"
            },
            "aggs": {
                "avg_salary": {
                    "avg": {
                        "field": "salary"
                    }
                }
            }
        },
        "min_salary_by_jobs": {
            "min_bucket": { // pipeline function
                "bucket_path": "jobs>avg_salary" // bucket_path indicates it's a pipeline aggregation
            }
        }
    }
}

Query + Aggregation

Query gives a range for aggregation result.

POST employees/_search
{
    "size": 0,
    "query": { // the employee should be over 30 years old.
        "range": {
            "age": {
                "gte": 30
            }
        }
    },
    "aggs": {
        "max_salary": {
            "max": {
                "field": "salary"
            }
        },
        "all": {
            "global": {} // It overrides the query range.
            "aggs": {
                "max_salary": {
                    "max": {
                        "field": "salary"
                    }
                }
            }
        }
    }
}

Performance

es-performance

If the data size is small, ES works very good.
If the data size is big, ES will share the data on different shards, and the precision goes down.

To improve the it, there are 2 ways:

put all the data in one primary size.
increase the shard_size in the term aggregation.

The terms aggregation fetches more than the top size terms from each shard. It fetches the top shard_size terms, which defaults to size * 1.5 + 10.

{
  "size": 0,
  "aggs": {
    "<a-name>": {
      "terms": {
        "field": "someKeyword",
        "size": 3,
        "shard_size": 10 // the size of data the ES will fetch on each shard
      }
    }
  }
}