Summary: in this tutorial, you will learn how to calculate cumulative distribution values of rows using the SQL CUME_DIST() function.
Introduction to SQL CUME_DIST() Function
CUME_DIST() is a window function that calculates the cumulative distribution of a value within a set of values.
The CUME_DIST() function returns the number of rows with values less than or equal to (<=) the current row’s value, divided by the total number of rows:
N / total_rows
In this formula:
- N is the number of rows with a value less than or equal to the current row’s value.
- total_rows is the number of rows in the partition or result set being evaluated.
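To make the formula concrete, the following sketch computes N and total_rows by hand with subqueries and places the result next to CUME_DIST(). The scores data set is hypothetical and is not part of the sample database. The query assumes a database that supports common table expressions and a SELECT without a FROM clause (for example MySQL 8, PostgreSQL, or SQLite).

```sql
-- Hypothetical sample data: five scores, with 70 appearing twice.
WITH scores (score) AS (
    SELECT 50 UNION ALL
    SELECT 70 UNION ALL
    SELECT 70 UNION ALL
    SELECT 80 UNION ALL
    SELECT 90
)
SELECT
    s.score,
    -- N: rows whose score is <= the current row's score
    (SELECT COUNT(*) FROM scores t WHERE t.score <= s.score) AS n,
    -- total_rows: all rows in the single partition
    (SELECT COUNT(*) FROM scores) AS total_rows,
    -- N / total_rows, multiplied by 1.0 to avoid integer division
    (SELECT COUNT(*) FROM scores t WHERE t.score <= s.score) * 1.0
        / (SELECT COUNT(*) FROM scores) AS manual_cume_dist,
    CUME_DIST() OVER (ORDER BY s.score) AS cume_dist_val
FROM
    scores s
ORDER BY
    s.score;
```

In this sketch, both rows with the score 70 have N = 3, so each one gets 3 / 5 = 0.6, and the last row gets 5 / 5 = 1. The manual_cume_dist and cume_dist_val columns should agree on every row.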
The return value of the CUME_DIST() function is greater than 0 and less than or equal to 1:
0 < CUME_DIST() <= 1
Rows with the same value in the ORDER BY column (ties) receive the same CUME_DIST() value.
The syntax of the CUME_DIST() function is as follows:
CUME_DIST() OVER (
    [PARTITION BY expr1, expr2, ...]
    ORDER BY expr1 [ASC | DESC], expr2, ...
)
In this syntax:
- First, the PARTITION BY clause divides the result set into partitions. If you omit the PARTITION BY clause, the function treats the whole result set as a single partition.
- Then, the ORDER BY clause sorts the rows within each partition.
- Finally, the CUME_DIST() function is applied to each sorted partition independently.
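The three steps above can be sketched with a small query that uses both clauses. The salaries data set, including its names and amounts, is hypothetical and is used only for illustration. The query assumes a database that supports common table expressions and a SELECT without a FROM clause.

```sql
-- Hypothetical data: employee salaries in two departments.
WITH salaries (department, employee, salary) AS (
    SELECT 'Sales', 'Ann', 3000 UNION ALL
    SELECT 'Sales', 'Bob', 4000 UNION ALL
    SELECT 'Sales', 'Cat', 4000 UNION ALL
    SELECT 'Sales', 'Dan', 5000 UNION ALL
    SELECT 'IT',    'Eve', 6000 UNION ALL
    SELECT 'IT',    'Fay', 8000
)
SELECT
    department,
    employee,
    salary,
    CUME_DIST() OVER (
        PARTITION BY department  -- one partition per department
        ORDER BY salary          -- rank salaries within each partition
    ) AS cume_dist_val
FROM
    salaries
ORDER BY
    department,
    salary;
```

Because each partition is evaluated independently, the IT rows get 1 / 2 = 0.5 and 2 / 2 = 1, while the Sales rows get 1 / 4 = 0.25, 3 / 4 = 0.75 (for both tied salaries of 4000), and 4 / 4 = 1.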
SQL CUME_DIST() function example
This example uses the employees and departments tables from the sample database.
The following statement creates a view named department_headcounts based on the employees and departments tables for the demonstration:
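CREATE VIEW department_headcounts
AS
SELECT
    d.department_name,
    COUNT(e.employee_id) AS headcount
FROM
    employees e
    INNER JOIN departments d
        ON d.department_id = e.department_id
-- Group by every non-aggregated column so the view is valid in strict SQL modes
GROUP BY
    d.department_id,
    d.department_name;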
The following statement finds the cumulative distribution values by headcount of each department:
SELECT
    department_name,
    headcount,
    ROUND(
        CUME_DIST() OVER (
            ORDER BY headcount
        ),
        2
    ) cume_dist_val
FROM
    department_headcounts;
The following shows the output:
In this example, the departments are sorted by headcount in ascending order. The result set contains 11 rows.
The Administration department has a headcount of one. The Human Resources and Public Relations departments have the same headcount as Administration. As a result, three departments have a headcount of one. Because these three rows are tied, N is 3 for each of them, and the CUME_DIST() function uses the following formula to calculate the cumulative distribution value of the first row:
3 / 11 = 0.27
The same logic is applied to the second and third rows.
The Marketing department has a headcount of two. The function counts the departments, including Marketing itself, whose headcount is less than or equal to 2. The result is 5. Therefore, the CUME_DIST() value of the Marketing department is 5 / 11 = 0.45.
The same logic is applied to the remaining rows.
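One way to check the manual calculations above is to list N and total_rows next to the function's result. The following is a sketch that reuses the department_headcounts view and the ROUND() expression from the earlier query. The aliases n and total_rows are introduced here only for illustration.

```sql
SELECT
    h.department_name,
    h.headcount,
    -- N: departments whose headcount is <= this department's headcount
    (SELECT COUNT(*)
     FROM department_headcounts t
     WHERE t.headcount <= h.headcount) AS n,
    -- total_rows: all departments in the view
    (SELECT COUNT(*) FROM department_headcounts) AS total_rows,
    ROUND(CUME_DIST() OVER (ORDER BY h.headcount), 2) AS cume_dist_val
FROM
    department_headcounts h
ORDER BY
    h.headcount;
```

For the rows discussed above, n should be 3 for Administration and 5 for Marketing, with total_rows equal to 11 on every row.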
In this tutorial, you have learned how to calculate the cumulative distribution of a value in a set of values using the SQL CUME_DIST() function.