Optimize MySQL Queries By Using BETWEEN Instead of LIMIT and OFFSET

I discovered a simple optimization to a troublesome slow MySQL query the other day and wanted to share the tip.

I have a very large users table of around 1.5M records. Joining the users table with the email_list table gives me the list of subscribers. My email service provider's API accepts batches of 1000 email addresses, which I send via worker scripts.

The obvious query to get batches of 1000 email addresses is to join the tables, LIMIT the results to 1000 records, and progressively increase the OFFSET by 1000.

mysql> SELECT `users`.`id`, `users`.`email`
FROM `users`, `email_list` 
WHERE `users`.`id` = `email_list`.`user_id`
AND `email_list`.`type` = 1 
LIMIT 1000 
OFFSET 1000000;
...
1000 rows in set (4.47 sec)

This query represents the 1000th iteration, starting at the one-millionth record. We only need 1000 rows, so why does this query take a whopping 4.47 seconds?!

My understanding is that even though we only need 1000 records, the server still has to count off the first 1,000,000. Even a well-optimized table with properly indexed keys may need extra memory, temporary tables, and swap to walk past those rows, at a serious cost in speed.
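As a rough illustration (a toy cost model of my own, not MySQL internals), the work for an OFFSET page grows with the offset even though the result size stays constant:

```python
def offset_scan_cost(offset, limit):
    """Toy model: the server must walk past `offset` rows before it can
    return `limit` rows, so rows examined = offset + limit."""
    return offset + limit

# First page vs. the page starting at the one-millionth record:
print(offset_scan_cost(0, 1000))          # 1000 rows examined
print(offset_scan_cost(1_000_000, 1000))  # 1001000 rows examined
```

Same 1000-row result either way, but the deep page touches a thousand times more rows.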

Take 2…
I have the luxury of not needing exactly 1000 email addresses (rows) per worker, so I tried limiting the rows examined with a WHERE clause on the user ID. The BETWEEN comparison operator removes the need to count off the preceding records. Check it out:

mysql> SELECT `users`.`id`, `users`.`email`
FROM `users`, `email_list`
WHERE `users`.`id` = `email_list`.`user_id`
AND `email_list`.`type` = 1 
AND `users`.`id` BETWEEN 1000000 AND 1001000;
...
971 rows in set (0.00 sec)

The above query does the same thing as its predecessor, except that the number of results can vary, because not every user is subscribed to a list and some users may have been deleted. I've effectively cut the query time to milliseconds, because it only has to examine about a thousand rows instead of a million.
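A worker loop over this pattern only needs the id ranges to plug into the BETWEEN clause. Here is a minimal sketch (the function name and clamping are my own; note that BETWEEN is inclusive on both ends, so stepping by exactly the batch size keeps consecutive ranges from overlapping, whereas the 1000000..1001000 range above actually spans 1001 ids):

```python
def between_ranges(max_id, batch_size=1000):
    """Yield inclusive (lo, hi) id ranges for the BETWEEN query.
    Row counts per range will vary: unsubscribed or deleted users
    leave gaps, and a range can even come back empty."""
    lo = 1
    while lo <= max_id:
        yield (lo, min(lo + batch_size - 1, max_id))
        lo += batch_size

print(list(between_ranges(3500)))
# [(1, 1000), (1001, 2000), (2001, 3000), (3001, 3500)]
```

Each (lo, hi) pair becomes `AND users.id BETWEEN lo AND hi` in the query above.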

Here are the EXPLAINs for the queries, to further illustrate the number of rows MySQL examines. Notice that the BETWEEN version is the only one that takes advantage of the primary key.

-- uses email index and scans all rows
mysql> EXPLAIN SELECT `id`, `email` 
FROM `users`, `email_list` 
WHERE `users`.`id` = `email_list`.`user_id` 
AND `email_list`.`type` = 1 
LIMIT 1000 OFFSET 1000000;
+----+-------------+------------+--------+--------------------+---------+---------+-------------------+---------+-------------+
| id | select_type | table      | type   | possible_keys      | key     | key_len | ref               | rows    | Extra       |
+----+-------------+------------+--------+--------------------+---------+---------+-------------------+---------+-------------+
|  1 | SIMPLE      | users      | index  | PRIMARY            | email   | 767     | NULL              | 1472873 | Using index |
|  1 | SIMPLE      | email_list | eq_ref | PRIMARY,users,type | PRIMARY | 5       | db.users.id,const |       1 | Using index |
+----+-------------+------------+--------+--------------------+---------+---------+-------------------+---------+-------------+


-- even forcing the primary key, it still scans all rows
mysql> EXPLAIN SELECT `id`, `email` 
FROM `users` 
JOIN `email_list` FORCE INDEX (PRIMARY) ON `users`.`id` = `email_list`.`user_id` 
WHERE `email_list`.`type` = 1 
LIMIT 1000 OFFSET 1000000;
+----+-------------+------------+--------+---------------+---------+---------+-------------------+---------+-------------+
| id | select_type | table      | type   | possible_keys | key     | key_len | ref               | rows    | Extra       |
+----+-------------+------------+--------+---------------+---------+---------+-------------------+---------+-------------+
|  1 | SIMPLE      | users      | index  | PRIMARY       | email   | 767     | NULL              | 1472873 | Using index |
|  1 | SIMPLE      | email_list | eq_ref | PRIMARY       | PRIMARY | 5       | db.users.id,const |       1 | Using index |
+----+-------------+------------+--------+---------------+---------+---------+-------------------+---------+-------------+


-- BETWEEN uses the primary key
mysql> EXPLAIN SELECT `users`.`id`, `users`.`email` 
FROM `users`, `email_list` 
WHERE `users`.`id` = `email_list`.`user_id` 
AND `email_list`.`type` = 1 
AND `users`.`id` BETWEEN 1000000 AND 1001000;
+----+-------------+------------+--------+--------------------+---------+---------+-------------------+------+-------------+
| id | select_type | table      | type   | possible_keys      | key     | key_len | ref               | rows | Extra       |
+----+-------------+------------+--------+--------------------+---------+---------+-------------------+------+-------------+
|  1 | SIMPLE      | users      | range  | PRIMARY            | PRIMARY | 4       | NULL              | 2016 | Using where |
|  1 | SIMPLE      | email_list | eq_ref | PRIMARY,users,type | PRIMARY | 5       | db.users.id,const |    1 | Using index |
+----+-------------+------------+--------+--------------------+---------+---------+-------------------+------+-------------+

Do you have any MySQL query optimization tips?

  • Tinashe Matate

    How does this scale?

    You only considered cases where each of your BETWEEN ranges contains rows of interest. If the first 1000 of those ranges have no matching users, why waste time on those lookups? Even though each empty range is fast, how does that affect the whole iteration? All those small, fast BETWEENs with no results would be of no help, right?

    With OFFSET, if there are only 1000 users matching your constraints you have one slow query, but you run it once and proceed to send your users their emails. So the question is: in the worst case, which approach does better? The comparison you made doesn't seem to cover that, or did I miss it?

    But thanks

    • Tinashe, thanks for your thoughtful response.

      The whole reason I chose BETWEEN instead of OFFSET is so I COULD scale. The bigger the users table gets, the more expensive the join gets (even with an OFFSET) – the query still has to examine every preceding row to satisfy the offset.

      The BETWEEN only has to examine about 1000 rows – hence the speed. A BETWEEN lookup takes 0.00 sec and an OFFSET takes over 4 sec – how many BETWEENs can you loop over in 4 sec? I'd rather skip an empty set and move on to the next than wait 4.47 sec. Do the math: 1,000,000 / 1000 = 1000 queries, and 1000 queries × 4.47 sec is roughly 75 minutes, while looping the BETWEENs takes less than a minute.
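      Spelling out that math (treating the measured 4.47 s as a flat cost per OFFSET query, which actually overstates the cheaper early offsets, and assuming ~5 ms for each "0.00 sec" BETWEEN):

```python
batches = 1_000_000 // 1000          # 1000 queries either way
offset_total_s = batches * 4.47      # measured worst-case OFFSET time
between_total_s = batches * 0.005    # "0.00 sec" reported; assume ~5 ms

print(round(offset_total_s / 60, 1))   # 74.5 minutes
print(round(between_total_s / 60, 1))  # 0.1 minutes
```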

      Also keep in mind that it's NOT just this query that hits this table. The users table is constantly in use for registrations, logins, and other site activity. I'm not willing to tie up the table with 4-second queries (it affects the overall performance of the database).

      Hope this makes sense — and let me know if I’ve missed the point of your comment. I know there are alternatives and probably better ways to do this. I just found the concept interesting and don’t necessarily recommend it for anyone else. Just sharing my findings.

      -doli
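One of the alternatives the reply alludes to (not from the original exchange, just a common technique) is keyset pagination: remember the last id sent and ask for the next batch above it, e.g. `WHERE users.id > ? ORDER BY users.id LIMIT 1000`. It still seeks on the primary key like BETWEEN, but every batch except the last comes back full, which addresses the empty-range concern. A minimal in-memory simulation:

```python
def keyset_batches(sorted_ids, batch_size):
    """Simulate keyset pagination over an indexed id column: each round
    mirrors  WHERE id > last_id ORDER BY id LIMIT batch_size."""
    last_id = None
    while True:
        if last_id is None:
            batch = sorted_ids[:batch_size]
        else:
            batch = [i for i in sorted_ids if i > last_id][:batch_size]
        if not batch:
            return
        yield batch
        last_id = batch[-1]

# Sparse ids (gaps from deleted or unsubscribed users) still fill each batch:
ids = [7, 12, 30, 55, 90, 91, 92, 1000, 2000, 2001]
print(list(keyset_batches(ids, 4)))
# [[7, 12, 30, 55], [90, 91, 92, 1000], [2000, 2001]]
```

The trade-off is that batches must be consumed in id order, whereas independent BETWEEN ranges can be handed to workers in parallel.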