Understanding Analytic Functions in Oracle SQL
by Rocky Jagtiani
Oracle Certified Associate
I train on Oracle Certification from last 7 years
Analytic functions in Oracle SQL are very useful and are utilized by many people, but sometimes without completely understanding how they work or what data they operate on.
This article does not focus on a specific function, but on analytic functions in general.
I decided to write it because I have seen people being confused by the results they get when they make even a very small change in the analytic clause of a function call, which is the part that determines on which data the function will operate and how it will calculate the results, and also because several friends have asked me, during my Oracle Training sessions in the past.
So, if you are new to analytic functions, or have only used them sporadically without completely understanding how they work, you might find this article useful.
Contents [45 mins read & execute]
- So, what are analytic functions?
- The Analytic Clause.
- So, what is the partition clause?
- Now, let’s talk about the order by clause.
- And now, let’s talk about the windowing clause!
For understanding Analytic Functions , make sure of :
1> Installing Oracle-11g ( steps of installation and Oracle-11g download copy is here , please see Oracle SQL - Pl/SQL > downloads section ). By the way any version from 9i and upwards would do. In case if don't want to install Oracle , the use liveSQL from here.
2> Reading each and every step.
3> & finally Executing as you read on.
So, what are analytic functions?
Analytic functions are functions that calculate aggregate values based on a group of rows, so they are similar to aggregate functions, but the difference is that as opposed to aggregate functions, analytic functions return multiple rows for each group.They are very useful for performing complex calculations on data, and although most of these operations can be performed using other features, such as joins, sub-queries, user defined functions, etc, analytic functions usually allow you to write simpler and more efficient queries.
Take a look at this example:
SELECT e.*, COUNT(*) OVER(PARTITION BY department_id) AS the_count
FROM top_n_employees e
ORDER BY department_id,employee_id;
For beginners, who wish get the tables used in queries on this blog-post may see here.
Can you tell me what this query does?
Well, it lists all of the records in my employee table, but besides the columns from the table, it includes a column that returns the count of employees in the same department of the employee you are looking at. I’m talking about the "the_count" column. We get the below output :
This (above) query looks very simple, but it is pretty cool, because if I wanted to write a query that gave me the same results, WITHOUT using analytic functions, I would need to do something like this:
SELECT e.*,
(
SELECT COUNT( * )
FROM top_n_employees
WHERE department_id = e.department_id
) AS the_count
FROM top_n_employees e
ORDER BY department_id, employee_id;
I needed an additional correlated scalar sub-query, so, not only does this query look a little more complex. It is also less efficient.
And this is actually one of the simplest things you can do with analytic functions. They are actually very powerful.
Let me show you another example: Let’s say I have to create a report that lists all employees with their salaries, along with a column that tells me the difference between their salary and the average salary of their department, and another column that tells me the difference between their salary and the average salary of whole company.
That would be like this:
SELECT employee_id,first_name,department_id,salary,
salary-AVG(salary) OVER(PARTITION BY department_id) AS diff_dept_avg,
salary-AVG(salary) OVER() AS diff_total_avg
FROM top_n_employees
ORDER BY department_id,employee_id;
As you can see, with analytic functions I can have different levels of aggregation in the same query. In this case, in the first average, I’m essentially grouping by department_id, but in the second average, I’m including the whole table.
Let’s analyze the analytic average I used in this example:
As you remember, AVG is an aggregate function too, so what makes it be used as analytic function in this example? That OVER keyword that you see int lines 2 and 3 is what makes it an analytic function. Without the OVER keyword, it would be treated as a normal aggregate function, and would require a GROUP BY clause added to the query. So, every time you see the OVER keyword, you are looking at an analytic function.
The Analytic Clause:
The analytic clause is the part that comes after the OVER keyword, and it is always enclosed in parenthesis. In the first average the analytic clause includes "partition by department_id", and in the second one, the analytic clause is empty, but as you see, the parenthesis are still there after the OVER keyword.
Now, the analytic clause can include 3 different things:
- A partition clause, as in the first average in our example.
- An order by clause, which we will use in another example.
- And a windowing clause, which we will also discuss in a moment.
So, what is the partition clause?
The partition clause works in a similar way than the GROUP BY clause. It "partitions" the result set into groups defined by the column or columns specified and applies the function to each of those groups. In the case of the first average, the partition clause divides the result set in groups by the department id. In the case of the second average, since we didn’t include a partition clause, the function is applied to the whole result set.
Now, let’s talk about the order by clause.
Did you read my article about writing TOP-N queries? In one of the examples I used the ROW_NUMBER analytic function to create a top-5 query, and it was something like this:
WITH ordered AS
(
SELECT employee_id,first_name,last_name,salary
FROM top_N_employees
ORDER BY salary DESC
)
SELECT *
FROM ordered
WHERE rownum <= 5;
In this case, the analytic clause doesn’t include a partition clause, which means that the function is applied to the whole result set, but it does include an order by clause.
As the name implies, this clause defines the order that will be used to apply the function. So, since the function assigns a row number to every record, if you order by salary in descending order, as in this example, the number 1 will be assigned to the employee with the highest salary, and the last number will be assigned to the employee with the lowest salary, but if you order in ascending order, the row number 1 will be assigned to the employee with the lowest salary.
So, as you can see, not all of the sub clauses of the analytic clause are useful for all of the functions. In the case of row_number and the ranking functions in general, an order by clause is mandatory.
Now, in these examples, I included only one column in the partition and order by clauses, but you can include more than one if you want to.
And now, let’s talk about the windowing clause!
The first thing I need to point out is that the windowing clause cannot be used with all analytic functions. You need to know that not all of the available functions support a windowing clause.
OK, so, the windowing clause allows you to define a "sliding window" on which the function will operate. We call "a window" to a group of rows on which the function will operate, so imagine that we have an actual window as in this image:
The function will operate on the rows that are visible through the window, so if I move the window, the function will operate on a different set of rows.
The window is different from the partition because the partition is static. Once you define a partition, it doesn’t change, but the window can be different or can move as we process different rows.
Let’s see it with an example:
SELECT employee_id, first_name, salary, department_id,
SUM(salary) over (PARTITION BY department_id
ORDER BY employee_id ROWS UNBOUNDED PRECEDING) AS accumulated
FROM top_N_employees
ORDER BY department_id, employee_id;
Let’s review the results.
For the first row in my result set you can see that the accumulated column has the same value as the salary of that employee, but for the second row, it has the sum of the salary from the first row and the salary from the second row which is 75000 plus 23000, which is equal to 98000, and for the third row the accumulated total includes the salary from the first three rows.
So, those records that are included to calculate the function, in this case, the SUM, constitute the WINDOW, and as you can see, in this case, the window is not moving, but growing, because the beginning of the window remains at the beginning of the partition, but the end of the window is the current row.
Now, as you can see, the accumulated value resets when the department id changes. See? For "Rocky", which is the first row of the partition for department id 101, the accumulated salary is again equal to the salary of that employee. Why? Because the window moves or grows only inside the partition, and in this case, we partitioned the result set by department_id.
Now, let’s take a look at the windowing clause, to understand how we achieved this result.
First of all, you need to know that you can add a windowing clause ONLY if you added an ORDER BY clause, and as a matter of fact, when you add an ORDER BY clause, you automatically add a default windowing clause, which you can override by adding your own… windowing clause.
So, in this example, the windowing clause starts with the "ROWS" keyword , which is one of two possibilities. We can use ROWS or RANGE. When we use ROWS, it means that we will define the window in terms of physical numbers of rows. We will see an example of this in a moment. And when we use RANGE, it means that we will define the window in terms of a logical range of values, that is relative to the column or columns we ordered by.
It might sound a little confusing now, but it will be clearer if we use an example.
Take a look at this query:
SELECT employee_id, first_name, salary, department_id,
SUM(salary) OVER (PARTITION BY department_id
ORDER BY employee_id ROWS 1 PRECEDING) accumulated_rows
FROM top_N_employees
ORDER BY department_id, employee_id;
In this case, I’m telling it that I am defining my window in terms of numbers of rows, and I’m stating that my window starts 1 row before the current row (that is what the PRECEDING keyword means).
And if I don’t specify where the window ends, as in this case, it defaults to end with the current row.
So, in this example, the accumulated salary for each row is the sum of the salary from the previous row and the current one.
Now, let’s change this ROWS keyword, to RANGE, to see the difference:
SELECT employee_id, first_name, salary, department_id,
SUM(salary) over (PARTITION BY department_id
ORDER BY employee_id range 1 PRECEDING) accumulated_range
FROM top_N_employees
ORDER BY department_id, employee_id;
When I use RANGE, the number "1" that I put after that word, doesn’t mean 1 row. It means a difference of "1" from the column by which I'm ordering the partition, which in this case is the employee id. This means that if I'm looking at the employee with id 8, the window will include the previous row if the id in that row is 9 because I defined a range of 1.
In our example, the accumulated for Simran is the same as his salary, because she is the first row in the partition. The accumulated for Anant is also his own salary, because the id from the previous row, which is the six for Neetu, doesn't fall inside the range of 1 ID that we defined. For the row of Neetu to be included in the window, it would need to have an id of 7, because the id of Anant is 8, and we defined a range of 1. Does it make sense?
Take a look at the accumulated for Ganesh. It is the sum of the salary of Anant and his salary. Why?, because the ID of Anant is 8 and the Id of Ganesh is 9, so it falls between the range of 1 ID that we specified.
Here is another example that might help understanding how RANGE works: If I apply an analytic function on a transactions table and I define the window using RANGE based on the transaction date, if the range is defined as 5 PRECEDING, the window will include all the transactions whose date is between 5 days before and the date of the current row, and that means that the window could include thousands of rows if there are a lot of transactions per day, or may include only the current row if that was the only transaction in that 6-day period. On the other hand, if I use ROWS 5 PRECEDING, the window will always be composed of the previous 5 rows, and the current one, regardless of their date (as long as they are included in the same partition).
There is one restriction that you need to be aware of regarding the use of RANGE in the windowing clause. You can use RANGE only if you order the partition by a column that is of a numeric or a date data type. If you order by a varchar2 and try to use a RANGE window, you will get an error. Why? If you think about it, it makes sense, because, you can define a range of numbers, or a range of dates, but how would you define a "range" of strings?
Okay, this is the end of the article. After reading this you should have a basic understanding of how analytic functions work, and what the role of the different parts of the analytic clause is.
As always, please feel free to comment, make suggestions or ask for clarification of anything in the comments section below.
See you at one of my training programmes !!
Comments
Post a Comment