There is so much to learn about SQL Server that I am both enthusiastic and overwhelmed when I think about it. Of course, depending on what an individual likes to do, some things will be easier for them. For example, I would trust Itzik Ben Gan with my life if it depended on writing a T-SQL query. But I wouldn’t expect him to be the same level of guru on SSIS or SSAS. (Maybe I am wrong here; he is brilliant and might be an expert on those technologies, too.)
For me, the joy of working with computers is creating things. Maybe that thing is a report or an SSIS package or a T-SQL query. When I was a developer, before focusing on SQL Server, I did a lot of consulting work for very small businesses. The question was frequently “Can you give us a button that does <fill in the blank>?” and more often than not, answer was yes. I loved that I could figure out how to automate processes and make a small business more efficient.
Since I love creating things so much, sometimes it is the process of creating something more than the internals that interest me. I read a lot about internals, but the details don’t always stick with me as much as I would hope they would.
One topic area that just kept dropping out of my brain are the three types of joins: Hash Match, Nested Loop, and Merge Join. Within a 48 hour period, I had two opportunities to explain when each of these would be used. In one case, it is for a book chapter I am working on. In the other case, a mentee of mine asked the question. They say the best way to learn something is to teach it to someone else, and this is a great case in point.
Without going into how each of the three join operators work, I can tell you when they will be used for inner joins.
A nested loop will be used when one of the inputs is estimated to have less than 10 rows. Here is a query for the AdventureWorks database that should utilize a Nested Loop:
SELECT SOH.CustomerID, SOH.SalesOrderID, C.* FROM Sales.SalesOrderHeader SOH JOIN Sales.Customer C ON SOH.CustomerID = C.CustomerID WHERE SOH.CustomerID = 11000;
The cluster key for the Sales.Customer table is CustomerID. The Sales.SalesOrderHeader has a nonclustered index on CustomerID with no other key columns or included columns. Since the WHERE clause limits the Sales.Customer rows to just 1, a Nested Loop is the choice.
If both inputs are estimated to be 10 or more rows, and the inputs are sorted on the join columns, then a Merge Join will be used. This query should use a Merge Join: SELECT SOH.CustomerID, SOH.SalesOrderID, C.* FROM Sales.SalesOrderHeader SOH JOIN Sales.Customer C ON SOH.CustomerID = C.CustomerID;
This query is just like the first except that it is missing the WHERE clause. The same indexes are used, but this time all of the customers are required, not just one. Since the number of rows involved are over 10 from both sides and the input is sorted on the joining column, the Merge Join is selected.
If both inputs are estimated to be 10 or more rows and one or both of the inputs are not sorted by the join columns, then a Hash Match will be used. Here is a query that uses a Hash Match:
SELECT SOH.CustomerID, SOH.SalesOrderID,SOH.OrderDate, C.* FROM Sales.SalesOrderHeader SOH JOIN Sales.Customer C ON SOH.CustomerID = C.CustomerID
This query looks just like the previous query except that an extra column has been added to the SELECT list from Sales.SalesOrderHeader. Because this column is not part of the index on CustomerID, the Sales.SalesOrderHeader’s clustered index is used instead. The clustered index is sorted by SalesOrderID, so it does not meet the requirements for a Merge Join.
So, I guess this means I need to write another post soon that drills into how these work under the hood. Stay tuned…