2

I have run into the age-old problem of MySQL refusing to use an index for seemingly basic stuff. The query in question:

SELECT c.*
FROM app_comments c
LEFT JOIN app_comments reply_c ON c.reply_to = reply_c.id
WHERE (c.external_id = '840774' AND c.external_context = 'deals')
 OR (reply_c.external_id = '840774' AND reply_c.external_context = 'deals')
ORDER BY c.reply_to ASC, c.date ASC

EXPLAIN:

id  select_type table   type    possible_keys   key key_len ref rows    Extra
1   SIMPLE  c   ALL external_context,external_id,idx_app_comments_externals NULL    NULL    NULL    903507  Using filesort
1   SIMPLE  reply_c eq_ref  PRIMARY PRIMARY 4   altero_full.c.reply_to  1   Using where

There are indexes on external_id and external_context separately, and I also tried adding a composite index (idx_app_comments_externals), but that did not help at all.

The query executes in 4-6 seconds in production (>1m records), but removing the OR part of the WHERE condition decreases that to 0.05s (it still uses filesort though). Clearly indexes don't work here, but I have no idea why. Can anyone explain this?

P.S. We're using MariaDB 10.3.18, could that be at fault here?

4

3 Answers

2

With the equality predicates on external_id and external_context columns in the WHERE clause, MySQL could make effective use of an index... when those predicates specify the subset of rows that can possibly satisfy the query.

But with the OR added to the WHERE clause, the rows to be returned from c are no longer limited by external_id and external_context values. Rows with any values in those columns could now be returned, as long as the related reply_c row matches.

And that negates the big benefit of an index range scan operation: very quickly eliminating vast swaths of rows from consideration. Yes, an index range scan is used to quickly locate rows. That is true. But the meat of the matter is that the range scan operation uses the index to quickly bypass millions and millions of rows that can't possibly be returned.


This is not behavior specific to MariaDB 10.3. We are going to observe the same behavior in MariaDB 10.2, MySQL 5.7, MySQL 5.6.


I'm questioning the join operation: is it necessary to return multiple copies of rows from c when there are multiple matching rows from reply_c? Or is the specification to just return distinct rows from c?


We can look at the required resultset as two parts.

1) The rows from app_comments with equality predicates on external_id and external_context

  SELECT c.*
    FROM app_comments c
   WHERE c.external_id       = '840774'
     AND c.external_context  = 'deals'
   ORDER
      BY c.external_id
       , c.external_context
       , c.reply_to
       , c.date

For optimal performance (leaving aside a covering index, which the * in the SELECT list rules out), an index like this could satisfy both the range scan operation and the ORDER BY, eliminating the "Using filesort" operation:

   ... ON app_comments (external_id, external_context, reply_to, date)
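
Spelled out in full, that index and a quick check of the resulting plan might look like the sketch below. The index name is a placeholder that does not appear in the question, and whether the optimizer actually drops the filesort depends on the real table and data.

```sql
-- Hypothetical index name; the column list is the one suggested above.
CREATE INDEX idx_app_comments_ext_reply_date
    ON app_comments (external_id, external_context, reply_to, date);

-- Re-run EXPLAIN on the first part to see whether the new index is chosen
-- and whether "Using filesort" disappears from the Extra column.
EXPLAIN
SELECT c.*
  FROM app_comments c
 WHERE c.external_id      = '840774'
   AND c.external_context = 'deals'
 ORDER BY c.reply_to, c.date;
```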

2) The second part of the result is the rows whose reply_to refers to one of those matching rows

  SELECT d.*
    FROM app_comments d
    JOIN app_comments e
      ON e.id = d.reply_to
   WHERE e.external_id       = '840774'
     AND e.external_context  = 'deals'
   ORDER
      BY d.reply_to
       , d.date

The same index recommended before can be used to access rows in e (range scan operation). Ideally, that index would also include the id column. Our best option is probably to modify the index to include the id column following date:

   ... ON app_comments (external_id, external_context, reply_to, date, id)

Or, for equivalent performance, at the expense of an extra index, we could define an index like this:

   ... ON app_comments (external_id, external_context, id)

For accessing rows from d with a range scan, we likely want an index:

   ... ON app_comments (reply_to, date)
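
As complete statements, the two indexes for this second part could be created roughly as follows. The index names are placeholders. One assumption worth checking: if the table is InnoDB and id is the primary key (the EXPLAIN output in the question suggests it is), secondary indexes already carry the primary key columns, so the trailing id may be redundant there.

```sql
-- Hypothetical index names; column lists are the ones suggested above.

-- Locates the matching parent rows in e and makes e.id available to the join.
-- On InnoDB with id as the primary key, the trailing id is carried implicitly.
CREATE INDEX idx_app_comments_ext_id
    ON app_comments (external_id, external_context, id);

-- Lets the join find the replies in d by reply_to; within each reply_to
-- value the entries are stored in date order.
CREATE INDEX idx_app_comments_reply_date
    ON app_comments (reply_to, date);
```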

We can combine the two sets with a UNION ALL set operator; but there's potential for the same row being returned by both queries. A UNION operator would force a unique sort to eliminate duplicate rows. Or we could add a condition to the second query to eliminate rows that will be returned by the first query.

  SELECT d.*
    FROM app_comments d
    JOIN app_comments e
      ON e.id = d.reply_to
   WHERE e.external_id       = '840774'
     AND e.external_context  = 'deals'
  HAVING NOT ( d.external_id      <=> '840774'
           AND d.external_context <=> 'deals'
             )
   ORDER
      BY d.reply_to
       , d.date

To combine the two parts, wrap each part in a set of parens, add the UNION ALL set operator between them, and put an ORDER BY clause at the end (outside the parens), something like this:

(
  SELECT c.*
    FROM app_comments c
   WHERE c.external_id       = '840774'
     AND c.external_context  = 'deals'
   ORDER
      BY c.external_id
       , c.external_context
       , c.reply_to
       , c.date
)
UNION ALL
(
  SELECT d.*
    FROM app_comments d
    JOIN app_comments e
      ON e.id = d.reply_to
   WHERE e.external_id       = '840774'
     AND e.external_context  = 'deals'
  HAVING NOT ( d.external_id      <=> '840774'
           AND d.external_context <=> 'deals'
             )
   ORDER
      BY d.reply_to
       , d.date
)
ORDER BY `reply_to`, `date`

This will need a "Using filesort" operation over the combined set, but now we've got a really good shot at getting a good execution plan for each part.


There's still my question of how many rows we should return when there are multiple matching reply_to rows.

answered 2019-10-21T17:02:10.310
2

MySQL (and MariaDB) generally cannot make good use of indexes for OR conditions that span different columns or tables. Note that, in the context of the query plan, c and reply_c are treated as different tables. Such queries are usually optimized "by hand" with UNION statements, which often involve a lot of code duplication. In your case, though, with a fairly recent version that supports CTEs (Common Table Expressions), you can avoid most of it:

WITH p AS (
    SELECT *
    FROM app_comments
    WHERE external_id      = '840774'
      AND external_context = 'deals'
)
SELECT * FROM p
UNION DISTINCT
SELECT c.* FROM p JOIN app_comments c ON c.reply_to = p.id
ORDER BY reply_to ASC, date ASC

Good indices for this query would be a composite one on (external_id, external_context) (in any order) and a separate one on (reply_to).

You will not avoid a "filesort" this way, but that shouldn't be a problem once the data has been filtered down to a small set.

answered 2019-10-21T18:11:54.917
-2

However, the name index is not used for lookups in the following queries:

SELECT * FROM test
WHERE last_name='Jones' OR first_name='John';
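
In line with the other answers, the usual workaround is to split the OR into two queries that can each use an index of their own. This is a minimal sketch: it assumes the name index is a composite index on (last_name, first_name), and it adds a separate index on first_name whose name is purely hypothetical.

```sql
-- Hypothetical extra index so the second branch has an index to use.
CREATE INDEX idx_test_first_name ON test (first_name);

-- Each branch is a plain equality lookup; UNION removes the rows
-- that satisfy both conditions so they are returned only once.
SELECT * FROM test WHERE last_name = 'Jones'
UNION
SELECT * FROM test WHERE first_name = 'John';
```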


answered 2020-03-05T06:06:53.957