Tuesday, November 16, 2010

Using a Correlated Subquery in a T-SQL Statement

What is a Correlated Subquery?

A correlated subquery is a SELECT statement nested inside another T-SQL statement, which contains a reference to one or more columns in the outer query. Therefore, the correlated subquery can be said to be dependent on the outer query. This is the main difference between a correlated subquery and just a plain subquery. A plain subquery is not dependent on the outer query, can be run independently of the outer query, and will return a result set. A correlated subquery, since it is dependent on the outer query will return a syntax errors if it is run by itself.
A correlated subquery will be executed many times while processing the T-SQL statement that contains the correlated subquery. The correlated subquery will be run once for each candidate row selected by the outer query. The outer query columns, referenced in the correlated subquery, are replaced with values from the candidate row prior to each execution. Depending on the results of the execution of the correlated subquery, it will determine if the row of the outer query is returned in the final result set.

Using a Correlated Subquery in a WHERE Clause

Suppose you want a report of all "OrderID's" where the customer did not purchase more than 10% of the average quantity sold for a given product. This way you could review these orders, and possibly contact the customers, to help determine if there was a reason for the low quantity order. A correlated subquery in a WHERE clause can help you produce this report. Here is a SELECT statement that produces the desired list of "OrderID's":
select distinct OrderId
  from Northwind.dbo.[Order Details] OD
  where 
    Quantity &ltl; (select avg(Quantity) * .1 
                      from Northwind.dbo.[Order Details] 
                      where OD.ProductID = ProductID)
The correlated subquery in the above command is contained within the parenthesis following the greater than sign in the WHERE clause above. Here you can see this correlated subquery contains a reference to "OD.ProductID". This reference compares the outer query's "ProductID" with the inner query's "ProductID". When this query is executed, the SQL engine will execute the inner query, the correlated subquery, for each "[Order Details]" record. This inner query will calculate the average "Quantity" for the particular "ProductID" for the candidate row being processed in the outer query. This correlated subquery determines if the inner query returns a value that meets the condition of the WHERE clause. If it does, the row identified by the outer query is placed in the record set that will be returned from the complete T-SQL SELECT statement.
The code below is another example that uses a correlated subquery in the WHERE clause to display the top two customers, based on the dollar amount associated with their orders, per region. You might want to perform a query like this so you can reward these customers, since they buy the most per region.
select CompanyName, ContactName, Address,
       City, Country, PostalCode from Northwind.dbo.Customers OuterC
where CustomerID in ( 
select top 2 InnerC.CustomerId
     from Northwind.dbo.[Order Details] OD join Northwind.dbo.Orders O
               on OD.OrderId = O.OrderID
          join Northwind.dbo.Customers InnerC
               on O.CustomerID = InnerC.CustomerId
     Where Region = OuterC.Region
     group by Region, InnerC.CustomerId
     order by sum(UnitPrice * Quantity * (1-Discount)) desc
     ) 
order by Region
Here you can see the inner query is a correlated subquery because it references "OuterC", which is the table alias for the "Northwind.DBO.Customer" table in the outer query. This inner query uses the "Region" value to calculate the top two customers for the region associated with the row being processed from the outer query. If the "CustomerID" of the outer query is one of the top two customers, then the record is placed in the record set to be returned.

Correlated Subquery in the HAVING Clause

Say your organizations wants to run a yearlong incentive program to increase revenue. Therefore, they advertise to your customers that if each order they place, during the year, is over $750 you will provide them a rebate at the end of the year at the rate of $75 per order they place. Below is an example of how to calculate the rebate amount. This example uses a correlated subquery in the HAVING clause to identify the customers that qualify to receive the rebate. Here is my code for this query:
select C.CustomerID, Count(*)*75 Rebate
  from Northwind.DBO.Customers C
         join
       Northwind.DBO.Orders O
         on c.CustomerID = O.CustomerID
  where Datepart(yy,OrderDate) = '1998'
  group by C.CustomerId
  having 750 < ALL(select sum(UnitPrice * Quantity * (1-Discount)) 
           from Northwind.DBO.Orders O
                  join
                Northwind.DBO.[Order Details] OD
                  on O.OrderID = OD.OrderID
           where CustomerID = C.CustomerId 
             and Datepart(yy,OrderDate) = '1998'
           group by O.OrderId
            )      
By reviewing this query, you can see I am using a correlated query in the HAVING clause to calculate the total order amount for each customer order. I use the "CustomerID" from the outer query and the year of the order "Datepart(yy,OrderDate)", to help identify the Order records associated with each customer, that were placed the year '1998'. For these associated records I am calculating the total order amount, for each order, by summing up all the "[Order Details]" records, using the following formula: sum(UnitPrice * Quantity * (1-Discount)). If each and every order for a customer, for year 1998 has a total dollar amount greater than 750, I then calculate the Rebate amount in the outer query using this formula "Count(*)*75 ".
SQL Server's query engine will only execute the inner correlated subquery in the HAVING clause for those customer records identified in the outer query, or basically only those customer that placed orders in "1998".

Performing an Update Statement Using a Correlated Subquery

A correlated subquery can even be used in an update statement. Here is an example:
create table A(A int, S int)
create table B(A int, B int)

set nocount on 
insert into A(A) values(1)
insert into A(A) values(2)
insert into A(A) values(3)
insert into B values(1,1)
insert into B values(2,1)
insert into B values(2,1)
insert into B values(3,1)
insert into B values(3,1)
insert into B values(3,1)

update A 
  set S = (select sum(B)
             from B 
             where A.A = A group by A)
 
select * from A

drop table A,B
Here is the result set I get when I run this query on my machine:
A           S           
----------- ----------- 
1           1
2           2
3           3
In my query above, I used the correlated subquery to update column A in table A with the sum of column B in table B for rows that have the same value in column A as the row being updated.

Conclusion

Let me summarize. A subquery and a correlated subquery are SELECT queries coded inside another query, known as the outer query. The correlated subquery and the subquery help determine the outcome of the result set returned by the complete query. A subquery, when executed independent of the outer query, will return a result set, and is therefore not dependent on the outer query. Where as, a correlated subquery cannot be executed independently of the outer query because it uses one or more references to columns in the outer query to determine the result set returned from the correlated subquery. I hope that you now understand the different of subqueries and correlated subqueries, and how they can be used in your T-SQL code. 

Source: http://www.databasejournal.com/features/mssql/article.php/3485291/Using-a-Correlated-Subquery-in-a-T-SQL-Statement.htm

No comments: