`STATS`

Serverless Stack

The STATS processing command groups rows according to a common value and calculates one or more aggregated values over the grouped rows.

Syntax

		STATS [column1 =] expression1 [WHERE boolean_expression1][,
      ...,
      [columnN =] expressionN [WHERE boolean_expressionN]]
      [BY grouping_expression1[, ..., grouping_expressionN]]
		
	

Parameters

columnX: The name by which the aggregated value is returned. If omitted, the name is equal to the corresponding expression (expressionX). If multiple columns have the same name, all but the rightmost column with this name will be ignored.
expressionX: An expression that computes an aggregated value.
grouping_expressionX: An expression that outputs the values to group by. If its name coincides with one of the computed columns, that column will be ignored.
boolean_expressionX: The condition that must be met for a row to be included in the evaluation of expressionX.

Note

Individual null values are skipped when computing aggregations.

Description

The STATS processing command groups rows according to a common value and calculates one or more aggregated values over the grouped rows. For the calculation of each aggregated value, the rows in a group can be filtered with WHERE. If BY is omitted, the output table contains exactly one row with the aggregations applied over the entire dataset.

The following aggregation functions are supported:

Stack Planned Serverless ABSENT
AVG
COUNT
COUNT_DISTINCT
Stack Planned Serverless FIRST
Stack Planned Serverless LAST
MAX
MEDIAN
MEDIAN_ABSOLUTE_DEVIATION
MIN
PERCENTILE
Stack Planned Serverless PRESENT
SAMPLE
Stack Preview Serverless Preview ST_CENTROID_AGG
Stack Preview Serverless Preview ST_EXTENT_AGG
STD_DEV
SUM
TOP
Stack Preview Serverless Preview VALUES
WEIGHTED_AVG

When STATS is used under the TS source command, time series aggregation functions are also supported.

The following grouping functions are supported:

Note

STATS without any groups is much much faster than adding a group.

Note

Grouping on a single expression is currently much more optimized than grouping on many expressions. In some tests we have seen grouping on a single keyword column to be five times faster than grouping on two keyword columns. Do not try to work around this by combining the two columns together with something like CONCAT and then grouping - that is not going to be faster.

Examples

Calculating a statistic and grouping by the values of another column:

		FROM employees
| STATS count = COUNT(emp_no) BY languages
| SORT languages
		
	

count:long	languages:integer
15	1
19	2
17	3
18	4
21	5
10	null

Omitting BY returns one row with the aggregations applied over the entire dataset:

		FROM employees
| STATS avg_lang = AVG(languages)

avg_lang:double
3.1222222222222222

It’s possible to calculate multiple values:

		FROM employees
| STATS avg_lang = AVG(languages), max_lang = MAX(languages)

avg_lang:double	max_lang:integer
3.1222222222222222	5

To filter the rows that go into an aggregation, use the WHERE clause:

		FROM employees
| STATS avg50s = AVG(salary)::LONG WHERE birth_date < "1960-01-01",
        avg60s = AVG(salary)::LONG WHERE birth_date >= "1960-01-01"
        BY gender
| SORT gender
		
	

avg50s:long	avg60s:long	gender:keyword
55462	46637	F
48279	44879	M

The aggregations can be mixed, with and without a filter and grouping is optional as well:

		FROM employees
| EVAL Ks = salary / 1000
| STATS under_40K = COUNT(*) WHERE Ks < 40,
        inbetween = COUNT(*) WHERE 40 <= Ks AND Ks < 60,
        over_60K  = COUNT(*) WHERE 60 <= Ks,
        total     = COUNT(*)
		
	

thousands

under_40K:long	inbetween:long	over_60K:long	total:long
36	39	25	100

It’s also possible to group by multiple values:

		FROM employees
| EVAL hired = DATE_FORMAT("yyyy", hire_date)
| STATS avg_salary = AVG(salary) BY hired, languages.long
| EVAL avg_salary = ROUND(avg_salary)
| SORT hired, languages.long
		
	

Multivalued inputs

If the grouping key is multivalued then the input row is in all groups:

		ROW price = 10, color = ["blue", "pink", "yellow"]
| STATS SUM(price) BY color

SUM(price):long	color:keyword
10	blue
10	pink
10	yellow

If all the grouping keys are multivalued then the input row is in all groups:

		ROW price = 10, color = ["blue", "pink", "yellow"], size = ["s", "m", "l"]
| STATS SUM(price) BY color, size

SUM(price):long	color:keyword	size:keyword
10	blue	l
10	blue	m
10	blue	s
10	pink	l
10	pink	m
10	pink	s
10	yellow	l
10	yellow	m
10	yellow	s

The input ROW is in all groups. The entire row. All the values. Even group keys. That means that:

		ROW color = ["blue", "pink", "yellow"]
| STATS VALUES(color) BY color

VALUES(color):keyword	color:keyword
[blue, pink, yellow]	blue
[blue, pink, yellow]	pink
[blue, pink, yellow]	yellow

The VALUES function above sees the whole row - all of the values of the group key. If you want to send the group key to the function then MV_EXPAND first:

		ROW color = ["blue", "pink", "yellow"]
| MV_EXPAND color
| STATS VALUES(color) BY color
		
	

VALUES(color):keyword	color:keyword
blue	blue
pink	pink
yellow	yellow

Refer to elasticsearch/issues/134792 for an even more in depth explanation.

Multivalue functions

Both the aggregating functions and the grouping expressions accept other functions. This is useful for using STATS on multivalue columns. For example, to calculate the average salary change, you can use MV_AVG to first average the multiple values per employee, and use the result with the AVG function:

		FROM employees
| STATS avg_salary_change = ROUND(AVG(MV_AVG(salary_change)), 10)

avg_salary_change:double
1.3904535865

An example of grouping by an expression is grouping employees on the first letter of their last name:

		FROM employees
| STATS my_count = COUNT() BY LEFT(last_name, 1)
| SORT `LEFT(last_name, 1)`
		
	

my_count:long	LEFT(last_name, 1):keyword
2	A
11	B
5	C
5	D
2	E
4	F
4	G
6	H
2	J
3	K
5	L
12	M
4	N
1	O
7	P
5	R
13	S
4	T
2	W
3	Z

Naming

Specifying the output column name is optional. If not specified, the new column name is equal to the expression. The following query returns a column named AVG(salary):

		FROM employees
| STATS AVG(salary)

AVG(salary):double
48248.55

Because this name contains special characters, it needs to be quoted with backticks (```) when using it in subsequent commands:

		FROM employees
| STATS AVG(salary)
| EVAL avg_salary_rounded = ROUND(`AVG(salary)`)
		
	

AVG(salary):double	avg_salary_rounded:double
48248.55	48249.0