Styles

Friday, December 18, 2015

INNER JOIN vs WHERE EXISTS Sub-Queries in MSSQL

Currently working at WiseTech, I've learn't a great deal about the pros and cons of monolithic applications, and more so about why microservices have been a massive talking point in the industry today. There are many specialists in the technology stack here and recently I got into a discussion about SQL performance tuning.

For a long time I always thought there was a standard golden rule when it came to arguing against sub-queries, and it was almost as sacrilege as cursors, however there have been a few circumstances where the difference may actually vary very little.

Here is a very simple statement that contains one INNER JOIN:

SELECT *
FROM StmNote
INNER JOIN JobDeclaration ON JE_PK = ST_ParentID
WHERE ST_Description = 'Justification Note'
       AND ST_Table <> 'JobDeclaration';

And below is an identical query that returns the same results, only with the difference being that it contains a WHERE EXISTS sub-query clause instead of an INNER JOIN:

SELECT *
FROM StmNote
WHERE ST_Description = 'Justification Note'
       AND ST_Table <> 'JobDeclaration'
       AND EXISTS (SELECT 1 FROM JobDeclaration WHERE JE_PK = ST_ParentID);

The below execution plan proves there is absolutely no difference in the decision path that the query optimizer makes.


This could easily make one argue that the difference in both statements would simply come down to a matter of preference.

However as we know the query optimizer behaves very differently when considering the number of rows involved in an index-seek/table-scan and the statistics it has to work with in determining and efficient execution path.

So what if there were other joins or larger result sets which could ultimately mean more variations in the execution path for query optimizer?

Below is a more complex scenario involving multiple joins that in the first case uses the following query:

SELECT bna.*
FROM BmncnAttachment bna
INNER JOIN BmncnShape bns
       ON bns.BNS_PK = bna.BNA_BNS_FromShape OR bns.BNS_PK = bna.BNA_BNS_ToShape OR bns.BNS_PK = bna.BNA_BNS_Owner
INNER JOIN ProcessHeader fh
       ON bns.BNS_FH_ProcessHeader = fh.FH_PK
INNER JOIN WorkItem wki
       ON wki.WKI_PK IS NOT NULL
WHERE wki.WKI_PK = fh.FH_ParentId
       AND wki.WKI_Summary LIKE 'Issue %'
       AND wki.WKI_Status IN ('ASN', 'OPN', '');

While the second case uses the following query with a WHERE EXISTS sub-query clause:

SELECT bna.*
FROM BmncnAttachment bna
INNER JOIN BmncnShape bns
       ON bns.BNS_PK = bna.BNA_BNS_FromShape OR bns.BNS_PK = bna.BNA_BNS_ToShape OR bns.BNS_PK = bna.BNA_BNS_Owner
INNER JOIN WorkItem wki
       ON wki.WKI_PK IS NOT NULL
WHERE wki.WKI_Summary LIKE 'Issue %'
       AND wki.WKI_Status IN ('ASN', 'OPN', '')
       AND EXISTS (SELECT 1 FROM ProcessHeader fh WHERE bns.BNS_FH_ProcessHeader = fh.FH_PK AND wki.WKI_PK = fh.FH_ParentId);

And here is the varying execution plans for both:




It seems they differ significantly on the final Hash Matching where it is probing the hash key fh.FH_ParentId and taking 62% or the total time longer than the standard inner join. The is mainly because you lose parallelism when using the sub-query.

So basically to sum it all up, even though the first statement may show identical results, performance issues would exists if there were multiple joins, because the optimizer could choose the sub-query over another join which would otherwise have been more efficient.

Conversely however, performance could be effected for joins that return duplicate rows and that may be another variable to consider when making the decision between WHERE EXIST sub-query clauses and JOINS, and is definitely something to keep in mind when making the final decision on what type of statement to use.