T-SQL query can use any of three clausesON, WHERE, and HAVINGfor specifying logical expressions that SQL Server uses as filters. The three filters play different roles, but sometimes it can be difficult to know which clause you should specify a logical expression in. Most T-SQL programmers know when to specify a logical expression in the HAVING clause because SQL Server applies the HAVING clause after grouping the input data. But confusion around when to use the ON and WHERE clauses leads to some of the most frequently asked questions in the public SQL Server programming newsgroup (news://msnews.microsoft.com/microsoft.public.sqlserver.programming). Thanks to Lubor Kollar, a program manager with the SQL Server Engine team, for suggesting that I write an article to clarify this often perplexing subject.
Logical Order of Query Processing
To figure out where to specify filters, you need to understand the logical order that ANSI defines for processing a query's clauses. In this article, I consider only the logical aspects of query processing, not the physical ones. During physical processing, the optimizer takes shortcuts in generating an execution plan for a query if it knows that the result will be correct.
My examples use the Customers and Orders tables, which you create and populate by running the code that Listing 1 shows. The Customers table contains rows for four customers: customers A, B, and C from Seattle and customer D from Denver. The Orders table contains five orders. Customer A placed one order, and customer B placed three orders. An unknown customer placed one order, so this row has a NULL in the custid column.
In most programming environments, the server processes the code in the order you write it in. But SQL Server doesn't process the clauses in the order you write them in, mainly because you first specify (in the SELECT list) what you want to see in the result, but SQL Server has to access, join, filter, and group the data before it can generate the final result. Figure 1 shows the general syntax of a query that contains all ANSI clauses.
Let's first look at the logical processing steps of the three clauses in which you can specify logical expressions for filtering. I'll provide a short description for each step, then get into more details later by walking through a couple of examples. As defined by ANSI, the logical processing steps of any query that follows the form that Figure 1 shows are
1.Generate a Cartesian product (cross join) between T1 and T2, which matches all rows from T1 with all rows from T2. Call the resulting virtual table V1. The number of rows in V1 is the product of the number of rows in T1 and the number of rows in T2.
2.Apply the join condition (the filter specified in the ON clause) to the rows in V1. Only rows for which this filter returns TRUE continue to the next step. Call the virtual result table V2.
3.If the join type is outer, mark the tables as preserved or unpreserved. For a left outer join, SQL Server marks the left table (T1) as preserved; for a right outer join, the right table (T2) is marked as preserved; and for a full outer join, both tables are preserved. Add to the rows in V2 all rows from the preserved table that had no match in the unpreserved table, and call the result V3. Note that SQL Server returns NULLs in the columns from the unpreserved table. If the FROM clause specifies more than two tables, apply steps 1 through 3 repeatedlyjoin V3 to the third table, join that result to the fourth table, and so on.
4.Apply the WHERE filter. Only rows for which the filter returns TRUE can continue to the next step. Call the result V4.
5.Group the rows from V4 according to the GROUP BY list. Each unique combination of values in the GROUP BY list represents a group. Call the result V5.
6.Apply the HAVING filter to the rows from V5. Only rows for which the filter returns TRUE continue to the next step. Call the result V6.
7.Process the SELECT list; if DISTINCT is present, remove duplicates. Call the result V7.
8.Sort the rows in V7 according to the ORDER BY list. Call the result V8.
9.If a TOP clause exists, process it (TOP is non-ANSI). Call the result V9.
If a query doesn't contain all possible query clauses, SQL Server simply skips the steps related to those clauses. Now, let's look at two examples and process the queries by applying these steps.
Customers Who Placed No Orders
Suppose you need to write a join query that returns customers who have registered but not yet placed orders. According to the sample data, the expected result is customers C and D, as Figure 2 shows. The join query in Listing 2 generates the desired result. Why did I specify C.custid = O.custid in the ON clause and O.custid IS NULL in the WHERE clause and not both filters in the ON clause? If you try specifying both these logical expressions in the ON clause, you get all customers in the result, not just C and D. To understand why you must specify the first logical expression in the ON clause and the second in the WHERE clause, let's apply the logical processing steps to the query.
Generating a Cartesian product of Customers and Orders produces a result table (V1) that has 20 rows (4 customers x 5 orders). When you apply the join condition (C.custid = O.custid) to the rows in V1, the possible results from the join condition are TRUE, FALSE, and UNKNOWN. Figure 3 shows the rows in V1 and the result of the join condition for each row under the match? column. Only the rows where match? is TRUE appear in V2, which the next step uses. Notice that match? is UNKNOWN for rows where O.custid is NULL, and V2 doesn't include FALSE or UNKNOWN. Figure 4 shows V2's rows.
Prev. page  
[1]
2
next page