
Why would excluding records by creating a temporary table of their primary keys be faster than simply excluding by value?

+11
−0

I have two tables with millions of records. Every so often I need to join them and exclude just a handful of records where a bit(1) column is set to 1 instead of 0.

I can do it with either,

WHERE is_excluded != 1

or

WHERE example_table.pk NOT IN (
  SELECT pk FROM (
    SELECT pk
    FROM example_table
    WHERE is_excluded = 1
  ) AS t
)

For example,

UPDATE example_table
SET textfield = 'X'
WHERE textfield = 'Y'
  AND pk NOT IN (
    SELECT pk FROM (
      SELECT pk FROM example_table WHERE do_not_touch = 1
    ) AS t
  );

is faster than

UPDATE example_table
SET textfield = 'X'
WHERE textfield = 'Y'
  AND do_not_touch != 1;

The second way (the NOT IN subquery on the primary keys) is sometimes much faster, even though it takes much longer to write out.

Why would the second way be faster?

7 comments

Please show the complete queries you are running and the query plan for each, along with details of any indexes. Ringi about 2 months ago

Use EXPLAIN to get some more details and (unless that results in a quick solution to the problem) post that information here. https://dev.mysql.com/doc/refman/8.0/en/using-explain.html manassehkatz about 2 months ago

Are the results different if you use is_excluded = 0 instead of "!= 1"? manassehkatz about 2 months ago

@manassehkatz The query cost is slightly different but the results are not. Charlie Brumbaugh about 2 months ago

Why the double-select? Why not just a single subquery, as in "where not in (select where is_excluded = 1)"? ArtOfCode about 2 months ago

2 answers

+8
−0

Why would the second way be faster?

Generally speaking, the first form will perform worse (as well as looking a lot worse) than the second. You are hitting an edge case where the opposite is true, because:

  1. The NOT IN in your first example is likely to be transformed into an anti-join. Because you also have "…just a handful of records where a bit(1) column is set to 1…", that anti-join is likely to be fairly fast.

  2. Bad stats or bad luck means that the optimizer is making a wrong choice when filtering. Perhaps it is choosing a full table scan in the second case, or failing to use a good index.

We'd need to see your actual query plans, indexes, and so on to say more, as several people have mentioned in the comments.
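As a starting point for gathering that evidence, EXPLAIN can be prefixed to both statements. A sketch using the table and column names from the question (the actual output depends on your schema and indexes):

```sql
-- Plan for the plain filter form
EXPLAIN
UPDATE example_table
SET textfield = 'X'
WHERE textfield = 'Y'
  AND do_not_touch != 1;

-- Plan for the NOT IN subquery form
EXPLAIN
UPDATE example_table
SET textfield = 'X'
WHERE textfield = 'Y'
  AND pk NOT IN (
    SELECT pk FROM (
      SELECT pk FROM example_table WHERE do_not_touch = 1
    ) AS t
  );
```

Comparing the `type`, `key`, and `rows` columns of the two plans should show whether one of them is falling back to a full table scan.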

0 comments

+4
−0

is_excluded = 1 is very different from do_not_touch != 1. Whenever possible, try to structure your data and queries so that you can do an equi-join, that is, compare things using an = comparison. >, <, and != can be really bad because the database will at the very least have to do an index scan, if there's an appropriate index available; if not, full table scan, baby! Wooo!

If you can use do_not_touch = 0 that would be nice, but I obviously have no idea what that field contains.
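A minimal sketch of the equality form, which is only valid under the assumption that do_not_touch ever holds only 0 or 1:

```sql
UPDATE example_table
SET textfield = 'X'
WHERE textfield = 'Y'
  AND do_not_touch = 0;  -- equality instead of != 1; assumes 0/1 are the only values
```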

Also, it helps to have an appropriate index. For your second query

UPDATE example_table
  SET textfield = 'X'
  WHERE textfield = 'Y'
    AND do_not_touch != 1

it would seem that an index on example_table(textfield, do_not_touch) would perhaps be helpful.
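A sketch of creating such an index (the index name here is just an illustration):

```sql
-- textfield leads because it is tested with '=';
-- do_not_touch second lets that filter be checked from the index entries.
CREATE INDEX idx_textfield_do_not_touch
  ON example_table (textfield, do_not_touch);
```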

1 comment

An index on (textfield, do_not_touch) will be less efficient than one on just textfield because the values in do_not_touch are heavily skewed; adding it to the index just means more IO when scanning. Jack Douglas about 1 month ago
