STATS
Serverless Stack
The STATS
processing command groups rows according to a common value
and calculates one or more aggregated values over the grouped rows.
Syntax
STATS [column1 =] expression1 [WHERE boolean_expression1][,
...,
[columnN =] expressionN [WHERE boolean_expressionN]]
[BY grouping_expression1[, ..., grouping_expressionN]]
Parameters
columnX
- The name by which the aggregated value is returned. If omitted, the name is
equal to the corresponding expression (
expressionX
). If multiple columns have the same name, all but the rightmost column with this name will be ignored. expressionX
- An expression that computes an aggregated value.
grouping_expressionX
- An expression that outputs the values to group by. If its name coincides with one of the computed columns, that column will be ignored.
boolean_expressionX
- The condition that must be met for a row to be included in the evaluation of
expressionX
.
Individual null
values are skipped when computing aggregations.
Description
The STATS
processing command groups rows according to a common value
and calculates one or more aggregated values over the grouped rows. For the
calculation of each aggregated value, the rows in a group can be filtered with
WHERE
. If BY
is omitted, the output table contains exactly one row with
the aggregations applied over the entire dataset.
The following aggregation functions are supported:
-
Stack
Serverless
ABSENT
AVG
COUNT
COUNT_DISTINCT
-
Stack
Serverless
FIRST
-
Stack
Serverless
LAST
MAX
MEDIAN
MEDIAN_ABSOLUTE_DEVIATION
MIN
PERCENTILE
-
Stack
Serverless
PRESENT
SAMPLE
-
Stack
Serverless
ST_CENTROID_AGG
-
Stack
Serverless
ST_EXTENT_AGG
STD_DEV
SUM
TOP
-
Stack
Serverless
VALUES
WEIGHTED_AVG
When STATS
is used under the TS
source command,
time series aggregation functions
are also supported.
The following grouping functions are supported:
STATS
without any groups is much much faster than adding a group.
Grouping on a single expression is currently much more optimized than grouping
on many expressions. In some tests we have seen grouping on a single keyword
column to be five times faster than grouping on two keyword
columns. Do
not try to work around this by combining the two columns together with
something like CONCAT
and then grouping - that is not going to be faster.
Calculating a statistic and grouping by the values of another column:
FROM employees
| STATS count = COUNT(emp_no) BY languages
| SORT languages
count:long | languages:integer |
---|---|
15 | 1 |
19 | 2 |
17 | 3 |
18 | 4 |
21 | 5 |
10 | null |
Omitting BY
returns one row with the aggregations applied over the entire
dataset:
FROM employees
| STATS avg_lang = AVG(languages)
avg_lang:double |
---|
3.1222222222222222 |
It’s possible to calculate multiple values:
FROM employees
| STATS avg_lang = AVG(languages), max_lang = MAX(languages)
avg_lang:double | max_lang:integer |
---|---|
3.1222222222222222 | 5 |
To filter the rows that go into an aggregation, use the WHERE
clause:
FROM employees
| STATS avg50s = AVG(salary)::LONG WHERE birth_date < "1960-01-01",
avg60s = AVG(salary)::LONG WHERE birth_date >= "1960-01-01"
BY gender
| SORT gender
avg50s:long | avg60s:long | gender:keyword |
---|---|---|
55462 | 46637 | F |
48279 | 44879 | M |
The aggregations can be mixed, with and without a filter and grouping is optional as well:
FROM employees
| EVAL Ks = salary / 1000
| STATS under_40K = COUNT(*) WHERE Ks < 40,
inbetween = COUNT(*) WHERE 40 <= Ks AND Ks < 60,
over_60K = COUNT(*) WHERE 60 <= Ks,
total = COUNT(*)
- thousands
under_40K:long | inbetween:long | over_60K:long | total:long |
---|---|---|---|
36 | 39 | 25 | 100 |
It’s also possible to group by multiple values:
FROM employees
| EVAL hired = DATE_FORMAT("yyyy", hire_date)
| STATS avg_salary = AVG(salary) BY hired, languages.long
| EVAL avg_salary = ROUND(avg_salary)
| SORT hired, languages.long
If the grouping key is multivalued then the input row is in all groups:
ROW price = 10, color = ["blue", "pink", "yellow"]
| STATS SUM(price) BY color
SUM(price):long | color:keyword |
---|---|
10 | blue |
10 | pink |
10 | yellow |
If all the grouping keys are multivalued then the input row is in all groups:
ROW price = 10, color = ["blue", "pink", "yellow"], size = ["s", "m", "l"]
| STATS SUM(price) BY color, size
SUM(price):long | color:keyword | size:keyword |
---|---|---|
10 | blue | l |
10 | blue | m |
10 | blue | s |
10 | pink | l |
10 | pink | m |
10 | pink | s |
10 | yellow | l |
10 | yellow | m |
10 | yellow | s |
The input ROW is in all groups. The entire row. All the values. Even group keys. That means that:
ROW color = ["blue", "pink", "yellow"]
| STATS VALUES(color) BY color
VALUES(color):keyword | color:keyword |
---|---|
[blue, pink, yellow] | blue |
[blue, pink, yellow] | pink |
[blue, pink, yellow] | yellow |
The VALUES
function above sees the whole row - all of the values of the group
key. If you want to send the group key to the function then MV_EXPAND
first:
ROW color = ["blue", "pink", "yellow"]
| MV_EXPAND color
| STATS VALUES(color) BY color
VALUES(color):keyword | color:keyword |
---|---|
blue | blue |
pink | pink |
yellow | yellow |
Refer to elasticsearch/issues/134792 for an even more in depth explanation.
Both the aggregating functions and the grouping expressions accept other
functions. This is useful for using STATS
on multivalue columns.
For example, to calculate the average salary change, you can use MV_AVG
to
first average the multiple values per employee, and use the result with the
AVG
function:
FROM employees
| STATS avg_salary_change = ROUND(AVG(MV_AVG(salary_change)), 10)
avg_salary_change:double |
---|
1.3904535865 |
An example of grouping by an expression is grouping employees on the first letter of their last name:
FROM employees
| STATS my_count = COUNT() BY LEFT(last_name, 1)
| SORT `LEFT(last_name, 1)`
my_count:long | LEFT(last_name, 1):keyword |
---|---|
2 | A |
11 | B |
5 | C |
5 | D |
2 | E |
4 | F |
4 | G |
6 | H |
2 | J |
3 | K |
5 | L |
12 | M |
4 | N |
1 | O |
7 | P |
5 | R |
13 | S |
4 | T |
2 | W |
3 | Z |
Specifying the output column name is optional. If not specified, the new column
name is equal to the expression. The following query returns a column named
AVG(salary)
:
FROM employees
| STATS AVG(salary)
AVG(salary):double |
---|
48248.55 |
Because this name contains special characters, it needs to be quoted with backticks (```) when using it in subsequent commands:
FROM employees
| STATS AVG(salary)
| EVAL avg_salary_rounded = ROUND(`AVG(salary)`)
AVG(salary):double | avg_salary_rounded:double |
---|---|
48248.55 | 48249.0 |