Hub of Tech: December 2015

Currently working at WiseTech, I've learn't a great deal about the pros and cons of monolithic applications, and more so about why microservices have been a massive talking point in the industry today. There are many specialists in the technology stack here and recently I got into a discussion about SQL performance tuning.

For a long time I always thought there was a standard golden rule when it came to arguing against sub-queries, and it was almost as sacrilege as cursors, however there have been a few circumstances where the difference may actually vary very little.

Here is a very simple statement that contains one INNER JOIN:


SELECT *

FROM StmNote

INNER JOIN JobDeclaration ON
JE_PK = ST_ParentID

WHERE ST_Description = 'Justification Note'

       AND
ST_Table <> 'JobDeclaration';

And below is an identical query that returns the same results, only with the difference being that it contains a WHERE EXISTS sub-query clause instead of an INNER JOIN:


SELECT *

FROM StmNote

WHERE ST_Description = 'Justification Note'

       AND
ST_Table <> 'JobDeclaration'

       AND EXISTS (SELECT 1 FROM JobDeclaration WHERE
JE_PK = ST_ParentID);

The below execution plan proves there is absolutely no difference in the decision path that the query optimizer makes.

This could easily make one argue that the difference in both statements would simply come down to a matter of preference.

However as we know the query optimizer behaves very differently when considering the number of rows involved in an index-seek/table-scan and the statistics it has to work with in determining and efficient execution path.

So what if there were other joins or larger result sets which could ultimately mean more variations in the execution path for query optimizer?

Below is a more complex scenario involving multiple joins that in the first case uses the following query:


SELECT bna.* 

FROM BmncnAttachment bna

INNER JOIN BmncnShape bns

       ON bns.BNS_PK = bna.BNA_BNS_FromShape OR
bns.BNS_PK =
bna.BNA_BNS_ToShape OR
bns.BNS_PK =
bna.BNA_BNS_Owner

INNER JOIN ProcessHeader fh

       ON bns.BNS_FH_ProcessHeader =
fh.FH_PK

INNER JOIN WorkItem wki

       ON wki.WKI_PK IS NOT NULL

WHERE wki.WKI_PK = fh.FH_ParentId

       AND wki.WKI_Summary LIKE 'Issue %'

       AND wki.WKI_Status IN ('ASN', 'OPN', '');

While the second case uses the following query with a WHERE EXISTS sub-query clause:


SELECT bna.* 

FROM BmncnAttachment bna

INNER JOIN BmncnShape bns

       ON bns.BNS_PK = bna.BNA_BNS_FromShape OR
bns.BNS_PK =
bna.BNA_BNS_ToShape OR
bns.BNS_PK =
bna.BNA_BNS_Owner

INNER JOIN WorkItem wki

       ON wki.WKI_PK IS NOT NULL

WHERE wki.WKI_Summary LIKE 'Issue %'

       AND wki.WKI_Status IN ('ASN', 'OPN', '')

       AND EXISTS (SELECT 1 FROM ProcessHeader fh WHERE
bns.BNS_FH_ProcessHeader = fh.FH_PK AND wki.WKI_PK = fh.FH_ParentId);

And here is the varying execution plans for both:

It seems they differ significantly on the final Hash Matching where it is probing the hash key fh.FH_ParentId and taking 62% or the total time longer than the standard inner join. The is mainly because you lose parallelism when using the sub-query.

So basically to sum it all up, even though the first statement may show identical results, performance issues would exists if there were multiple joins, because the optimizer could choose the sub-query over another join which would otherwise have been more efficient.

Conversely however, performance could be effected for joins that return duplicate rows and that may be another variable to consider when making the decision between WHERE EXIST sub-query clauses and JOINS, and is definitely something to keep in mind when making the final decision on what type of statement to use.

Hub of Tech

Styles

Friday, December 18, 2015

INNER JOIN vs WHERE EXISTS Sub-Queries in MSSQL