MySQL :: MySQL 3.23, 4.0, 4.1 Reference Manual :: 12.2.8.10 Optimizing Subqueries

MySQL 3.23, 4.0, 4.1 Reference Manual :: 12 SQL Statement Syntax :: 12.2 Data Manipulation Statements :: 12.2.8 Subquery Syntax :: 12.2.8.10 Optimizing Subqueries

« 12.2.8.9 Subquery Errors

12.2.8.11 Rewriting Subqueries as Joins for Earlier MySQL Versions »

Section Navigation [Toggle]

12.2.8.10. Optimizing Subqueries

Development is ongoing, so no optimization tip is reliable for the long term. The following list provides some interesting tricks that you might want to play with:

Use subquery clauses that affect the number or order of the rows in the subquery. For example:

SELECT * FROM t1 WHERE t1.column1 IN
  (SELECT column1 FROM t2 ORDER BY column1);
SELECT * FROM t1 WHERE t1.column1 IN
  (SELECT DISTINCT column1 FROM t2);
SELECT * FROM t1 WHERE EXISTS
  (SELECT * FROM t2 LIMIT 1);

Replace a join with a subquery. For example, try this:

SELECT DISTINCT column1 FROM t1 WHERE t1.column1 IN (
  SELECT column1 FROM t2);

Instead of this:

SELECT DISTINCT t1.column1 FROM t1, t2
  WHERE t1.column1 = t2.column1;

Some subqueries can be transformed to joins for compatibility with older versions of MySQL before 4.1 that do not support subqueries. However, in some cases, converting a subquery to a join may also improve performance. See Section 12.2.8.11, “Rewriting Subqueries as Joins for Earlier MySQL Versions”.

Move clauses from outside to inside the subquery. For example, use this query:

SELECT * FROM t1
  WHERE s1 IN (SELECT s1 FROM t1 UNION ALL SELECT s1 FROM t2);

Instead of this query:

SELECT * FROM t1
  WHERE s1 IN (SELECT s1 FROM t1) OR s1 IN (SELECT s1 FROM t2);

For another example, use this query:

SELECT (SELECT column1 + 5 FROM t1) FROM t2;

Instead of this query:

SELECT (SELECT column1 FROM t1) + 5 FROM t2;

Use a row subquery instead of a correlated subquery. For example, use this query:

SELECT * FROM t1
  WHERE (column1,column2) IN (SELECT column1,column2 FROM t2);

Instead of this query:

SELECT * FROM t1
  WHERE EXISTS (SELECT * FROM t2 WHERE t2.column1=t1.column1
  AND t2.column2=t1.column2);

Use NOT (a = ANY (...)) rather than a <> ALL (...).
Use x = ANY (table containing (1,2)) rather than x=1 OR x=2.
Use = ANY rather than EXISTS.

For uncorrelated subqueries that always return one row, IN is always slower than =. For example, use this query:

SELECT * FROM t1 WHERE t1.col_name
  = (SELECT a FROM t2 WHERE b = some_const);

Instead of this query:

SELECT * FROM t1 WHERE t1.col_name
  IN (SELECT a FROM t2 WHERE b = some_const);

These tricks might cause programs to go faster or slower. Using MySQL facilities like the BENCHMARK() function, you can get an idea about what helps in your own situation. See Section 11.10.3, “Information Functions”.

Some optimizations that MySQL itself makes are:

MySQL executes uncorrelated subqueries only once. Use EXPLAIN to make sure that a given subquery really is uncorrelated.
MySQL rewrites IN, ALL, ANY, and SOME subqueries in an attempt to take advantage of the possibility that the select-list columns in the subquery are indexed.
MySQL replaces subqueries of the following form with an index-lookup function, which EXPLAIN describes as a special join type (unique_subquery or index_subquery):
```
... IN (SELECT indexed_column FROM single_table ...)
```
MySQL enhances expressions of the following form with an expression involving MIN() or MAX(), unless NULL values or empty sets are involved:
```
value {ALL|ANY|SOME} {> | < | >= | <=} (uncorrelated subquery)
```
For example, this WHERE clause:
```
WHERE 5 > ALL (SELECT x FROM t)
```
might be treated by the optimizer like this:
```
WHERE 5 > (SELECT MAX(x) FROM t)
```

See also the MySQL Internals Manual chapter How MySQL Transforms Subqueries.

Previous / Next / Up / Table of Contents

User Comments

Posted by Are you mortal? Then prepare to die. on December 16 2004 10:21am

[Delete] [Edit]

Some tests I did show that using the join syntax (for a right join to exclude certain rows from the results set based on the values found in another table by using the WHERE KEY IS NULL tric) is slightly (but consistently) faster than using the sub query syntax (with a WHERE NOT IN).

Here are the two queries...

-- Faster
SELECT COUNT(*) FROM (
SELECT
region.PDB,
region.CHAIN
FROM
region LEFT JOIN split_domain USING (SUNID)
WHERE
split_domain.SUNID IS NULL
GROUP BY
PDB, CHAIN
HAVING
COUNT(*)>1
) AS X;

-- Slower
SELECT COUNT(*) FROM (
SELECT
region.PDB,
region.CHAIN
FROM
region
WHERE
SUNID NOT IN (
SELECT
SUNID
FROM
split_domain
)
GROUP BY
PDB, CHAIN
HAVING
COUNT(*)>1
) AS X;

Just in case you got confused, the results of both queries are identical, but the former is significantly faster.

Posted by Are you mortal? Then prepare to die. on December 16 2004 10:29am

[Delete] [Edit]

I found using a 'virtual table' instead of a ROW sub query is *much* faster on my table. It seems that the row subquery isn't optimized, where the join over the 'virtual table' is optimized.

Below are the queryies and the 'EXPLAIN' returned for educational purposes.

-- Query using ROW subquery
EXPLAIN
SELECT
*
FROM
region
WHERE
ROW (PDB,CHAIN) IN (
SELECT
region.PDB,
region.CHAIN
FROM
region LEFT JOIN split_domain USING (SUNID)
WHERE
split_domain.SUNID IS NULL
GROUP BY
PDB, CHAIN
HAVING
COUNT(*)>1
)
LIMIT
10
;

+----+--------------------+--------------+--------+---------------+---------+---------+------------------------+-------+--------------------------------------+
| id | select_type        | table        | type   | possible_keys | key     | key_len | ref                    | rows  | Extra                                |
+----+--------------------+--------------+--------+---------------+---------+---------+------------------------+-------+--------------------------------------+
|  1 | PRIMARY            | region       | ALL    | NULL          | NULL    |    NULL | NULL                   | 57362 | Using where                          |
|  2 | DEPENDENT SUBQUERY | region       | ALL    | NULL          | NULL    |    NULL | NULL                   | 57362 | Using temporary; Using filesort      |
|  2 | DEPENDENT SUBQUERY | split_domain | eq_ref | PRIMARY       | PRIMARY |       3 | scop_1_65.region.SUNID |     1 | Using where; Using index; Not exists |
+----+--------------------+--------------+--------+---------------+---------+---------+------------------------+-------+--------------------------------------+

3 rows in set (0.04 sec)

I can't get any results out of the above (takes too long) - Perhaps the limit clause isn't kicking in?

-- Query using joined virtual table
EXPLAIN
SELECT
*
FROM
region
INNER JOIN (
SELECT
region.PDB,
region.CHAIN
FROM
region LEFT JOIN split_domain USING (SUNID)
WHERE
split_domain.SUNID IS NULL
GROUP BY
PDB, CHAIN
HAVING
COUNT(*)>1
) AS x
ON
region.PDB = x.PDB
AND
region.CHAIN = x.CHAIN
LIMIT
10
;

+----+-------------+--------------+--------+---------------------+-----------+---------+------------------------+-------+--------------------------------------+
| id | select_type | table        | type   | possible_keys       | key       | key_len | ref                    | rows  | Extra                                |
+----+-------------+--------------+--------+---------------------+-----------+---------+------------------------+-------+--------------------------------------+
|  1 | PRIMARY     | <derived2>   | ALL    | NULL                | NULL      |    NULL | NULL                   |  8624 |                                      |
|  1 | PRIMARY     | region       | ref    | PDB,CHAIN,pdb_chain | pdb_chain |       5 | x.PDB,x.CHAIN          |     1 |                                      |
|  2 | DERIVED     | region       | ALL    | NULL                | NULL      |    NULL | NULL                   | 57362 | Using temporary; Using filesort      |
|  2 | DERIVED     | split_domain | eq_ref | PRIMARY             | PRIMARY   |       3 | scop_1_65.region.SUNID |     1 | Using where; Using index; Not exists |
+----+-------------+--------------+--------+---------------------+-----------+---------+------------------------+-------+--------------------------------------+

4 rows in set (1.02 sec)

The above returns ...
10 results in about 1 second
100 results in about 1 second
1000 results in about 1.5 seconds
Full set (20437) in about 2 seconds

The former query dosn't return in 5 mins (even with limit 10).

I hope this is useful to anyone designing (or trying to optimize) complex subqueries, and that the precice details of the data are not necessary to convey the results presented here.

Posted by Matthew Sayler on October 13 2005 9:18pm

[Delete] [Edit]

As an example of factoring out correllated subqueries:

In at least 4.1, the following query is very slow:

-- retreive rows with duplicate usernames

SELECT id, username, companyname
FROM users
WHERE id IN
(SELECT id FROM users GROUP BY username HAVING COUNT(username) > 2)

This takes about 4'40" on a 2.8GHz machine with well-indexed tables (index on username)

This rewrite is very odd looking (to my eyes), but much faster:

SELECT id, username, companyname
FROM users AS u,
(SELECT id
FROM users
GROUP BY username
HAVING COUNT(username) > 2
) AS dups
WHERE u.id = dups.id

(executes in a few ms)

I thought maybe rewriting the first query to use two different implied table names would help, but no dice:

SELECT u.id, u.username, u.companyname
FROM users AS u
WHERE u.id IN
(SELECT d.id FROM users as d GROUP BY d.username HAVING COUNT(d.username) > 2)

Add your own comment.

Top / Previous / Next / Up / Table of Contents