MySQL Long Query Progress Monitoring
Problem Overview
Just to preface my question, I understand that there is no direct support for something like this. What I am looking for is any sort of work-around, or convoluted derivation that would get me a half-respectable result.
I am working with a rather large MySQL cluster (tables > 400 million rows) using the cluster engine.
Is anyone aware of a way to either directly retrieve or otherwise derive a somewhat (or better) accurate indication of progress through a long query in mysql? I have some queries that can take up to 45 minutes, and I need to determine if we're 10% or 90% through the processing.
EDIT:
As requested in the comments, here is a distilled and generalized version of one of the queries that led to my original question...
SELECT `userId`
FROM `openEndedResponses` AS `oe`
WHERE
`oe`.`questionId` = 3 -- zip code
AND (REPLACE( REPLACE( `oe`.`value`, ' ', '' ), '-', '' ) IN ( '30071', '30106', '30122', '30134', '30135', '30168', '30180', '30185', '30187', '30317', '30004' ));
This query is run against a single table with ~95 million rows. It takes 8 seconds to run the query and another 13 to transfer the data (21 seconds total). Considering the size of the table, and the fact that string manipulation functions are involved, I'd say it's running pretty damn fast. To the user, however, those are still 21 seconds in which the application appears stuck or idle. Some indication of progress would be ideal.
Mysql Solutions
Solution 1 - Mysql
I know this is an old question, but I was looking for a similar answer, when trying to figure out how much longer my update would take on a query of 250m rows.
If you run:
SHOW ENGINE INNODB STATUS \G
Then, under TRANSACTIONS, find the transaction in question and examine this section:
---TRANSACTION 34282360, ACTIVE 71195 sec starting index read
mysql tables in use 2, locked 2
1985355 lock struct(s), heap size 203333840, 255691088 row lock(s), undo log entries 21355084
The important bit is "undo log entries". In my case, each updated row seemed to add one undo log entry (try running it again after a few seconds and see how many have been added).
If you skip to the end of the status report, you'll see this:
Number of rows inserted 606188224, updated 251615579, deleted 1667, read 54873415652
0.00 inserts/s, 1595.44 updates/s, 0.00 deletes/s, 3190.88 reads/s
Here we can see that updates are being applied at 1595.44 rows per second (though if you're running other update queries in tandem, this rate may be split between your queries).
So from this, I know about 21m rows have been updated, leaving (250m - 21m) 229m rows to go.

229,000,000 / 1600 = 143,125 seconds to go
(143,125 / 60) / 60 = 39.76 hours to go
So it would appear I can twiddle my thumbs for another couple of days. Unless this answer is wrong, in which case I'll update it sometime before then!
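If you'd rather not redo that arithmetic by hand every time you check, it can be scripted. A minimal sketch in Python, with the figures hard-coded from the status output above; it assumes one undo log entry per updated row and a steady update rate, both of which are approximations:

```python
import re

def eta_from_innodb_status(status_text, total_rows):
    """Estimate seconds remaining for a big UPDATE from SHOW ENGINE INNODB STATUS text.

    Assumes one undo log entry per updated row and a steady update rate.
    """
    done = int(re.search(r"undo log entries (\d+)", status_text).group(1))
    rate = float(re.search(r"([\d.]+) updates/s", status_text).group(1))
    remaining = max(total_rows - done, 0)
    return remaining / rate if rate > 0 else float("inf")

# Figures copied from the status output shown above.
status = """
---TRANSACTION 34282360, ACTIVE 71195 sec starting index read
1985355 lock struct(s), heap size 203333840, 255691088 row lock(s), undo log entries 21355084
0.00 inserts/s, 1595.44 updates/s, 0.00 deletes/s, 3190.88 reads/s
"""
seconds_left = eta_from_innodb_status(status, total_rows=250_000_000)
print(f"{seconds_left / 3600:.1f} hours to go")
```

Rerun it every few minutes and the estimate will tighten as the rate settles.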
Solution 2 - Mysql
I was able to estimate something like this by querying the number of rows to process then breaking the processing into a loop, working on only a subset of the total rows at a time.
The full loop was rather involved, but the basic logic went like:
SELECT @minID = Min(keyColumn) FROM table WHERE condition
SELECT @maxID = Max(keyColumn) FROM table WHERE condition
SELECT @potentialRows = (@maxID - @minID) / @iterations
WHILE @minID < @maxID
BEGIN
SET @breakID = @minID + @potentialRows
SELECT columns FROM table WITH (NOLOCK, ...)
WHERE condition AND keyColumn BETWEEN @minID AND @breakID
SET @minID = @breakID + 1
END
Note this works best if IDs are evenly distributed.
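The boundary arithmetic in the loop above can be sketched in runnable form (the SQL above is pseudocode). In this Python sketch the min/max IDs and iteration count are made-up example values; each (lo, hi) pair would become one `WHERE keyColumn BETWEEN lo AND hi` query, and counting finished chunks gives the progress figure:

```python
def id_chunks(min_id, max_id, iterations):
    """Split [min_id, max_id] into roughly equal contiguous ranges.

    Mirrors the loop above; assumes IDs are roughly evenly distributed.
    """
    step = max(1, (max_id - min_id) // iterations)
    lo = min_id
    while lo <= max_id:
        hi = min(lo + step, max_id)
        yield lo, hi
        lo = hi + 1

chunks = list(id_chunks(5000, 35469678, iterations=100))
# Each (lo, hi) becomes: SELECT ... WHERE condition AND keyColumn BETWEEN lo AND hi
# Progress after finishing chunk i is roughly (i + 1) / len(chunks).
print(f"{len(chunks)} chunks of ~{chunks[0][1] - chunks[0][0] + 1} ids each")
```

If the IDs are sparse or skewed, chunks will vary wildly in real row count, which is exactly the caveat noted above.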
Solution 3 - Mysql
There is a promising answer to this old question which I found here, written by Baron Schwartz. It's not a precise and complete solution, but it does provide some objective material for estimates, if you're only running that query and nothing else on your server.
You run this command while the query is already running:
mysqladmin extended -r -i 10 | grep Handler
A few notes:
- `10` is the number of seconds after which the command repeats itself, so wait for the refreshes
- add something like `-u root -p` if you need to authenticate
- if you know exactly which Handler you're looking for, you can make the `grep` more focused; for example `Handler_read_rnd_next` seems to be good for `SELECT`s
- ignore the first output and use the second and following ones
- use `Ctrl-C` to exit
Now get that number and do your math. Determine rows handled per second, and with your knowledge of table sizes you might be able to get a fairly precise estimate of total time.
Free extra tip: the command doesn't seem to go into Bash history (maybe because of the exit with Ctrl-C); you can add it there by hand with `history -s mysqladmin extended -r -i 10 -u root -p | grep Handler`.
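The math on the Handler deltas can be scripted as well. A sketch, assuming a line in the usual `mysqladmin` table format (the sample line is hypothetical; the 95-million-row total is the table size from the question; with `-r`, each printed value is the change since the previous sample):

```python
import re

def handler_rate(status_line, interval_seconds):
    """Rows per second from one `mysqladmin extended -r -i N | grep Handler` line.

    With -r, the printed value is the delta since the previous sample,
    so dividing by the interval gives a rate.
    """
    name, delta = re.match(r"\|\s*(\w+)\s*\|\s*(\d+)\s*\|", status_line).groups()
    return name, int(delta) / interval_seconds

# Hypothetical sample line from the second or later refresh.
name, rate = handler_rate("| Handler_read_rnd_next | 429128 |", interval_seconds=10)
total_rows = 95_000_000  # table size, from the question
print(f"{name}: {rate:.0f} rows/s; a full scan would take ~{total_rows / rate / 60:.1f} minutes")
```

As the answer says, you need to know the table size separately; the Handler counters only give you the rate.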
Solution 4 - Mysql
I'm fairly sure MySQL doesn't support any indication of the progress of running queries. The only solution is to optimize/split the queries.
A Select could be split by id as Dour High Arch suggested. Here is a query from a 33 million row table:
mysql> SELECT SQL_NO_CACHE min(id), max(id) FROM `urls`;
+---------+----------+
| min(id) | max(id) |
+---------+----------+
| 5000 | 35469678 |
+---------+----------+
1 row in set (0.00 sec)
You'd better use an integer or at least a date field for splitting. It should be a primary or unique index and should not allow null values.
Solution 5 - Mysql
If it's a complex query you are attempting, the EXPLAIN SQL command or MySQL Query Analyzer might help to understand what is going on. If it's simply a large query, you might try creating a temporary table with SELECT INTO and/or using LIMIT/OFFSET clauses in SELECT queries. If you use LIMIT/OFFSET on the original tables, you might need to set the transaction level to serializable, IIRC, so that you get consistent reads while iterating over the data. If you create a temporary table first, that table should stay consistent regardless.
Solution 6 - Mysql
For now -- for my very specific situation -- there seems to be no real solution for this. Since I can't split my query into several smaller ones, and running select count(*) first before the "real" query proves counterproductive (it doubles the execution time of an already painfully slow query), none of the workarounds seem viable either. Maybe soon MySQL will support something like this.
Solution 7 - Mysql
If your query involves a linear scan through a large table, you can often obtain an excellent estimate by running pmonitor on the file containing that table. Include the --update
option, because MySQL opens table files in update mode.
Example:
$ sudo pmonitor --update --file=/home/mysql/ghtorrent/commits.MYD --interval=5
/home/mysql/ghtorrent/commits.MYD 31.66%
/home/mysql/ghtorrent/commits.MYD 33.16% ETA 0:03:42
/home/mysql/ghtorrent/commits.MYD 34.85% ETA 0:03:24
/home/mysql/ghtorrent/commits.MYD 36.43% ETA 0:03:32
/home/mysql/ghtorrent/commits.MYD 38.36% ETA 0:03:12
/home/mysql/ghtorrent/commits.MYD 40.21% ETA 0:03:01
/home/mysql/ghtorrent/commits.MYD 41.95% ETA 0:02:54
[...]
/home/mysql/ghtorrent/commits.MYD 92.01% ETA 0:00:24
/home/mysql/ghtorrent/commits.MYD 93.85% ETA 0:00:18
/home/mysql/ghtorrent/commits.MYD 95.76% ETA 0:00:12
/home/mysql/ghtorrent/commits.MYD 97.60% ETA 0:00:07
/home/mysql/ghtorrent/commits.MYD 98.83% ETA 0:00:03
/home/mysql/ghtorrent/commits.MYD 100% ETA 0:00:00
If you don't know the file to monitor, run pmonitor with the --diff option. This will show you the file(s) where progress is made.
Example:
$ sudo pmonitor --update --diff --command=mysqld -i 60
[...]
/home/mysql/ghtorrent/projects.MYD 22.41% ETA 2:01:41
/home/mysql/ghtorrent/projects.MYD 23.13% ETA 1:53:23
/home/mysql/ghtorrent/projects.MYD 23.84% ETA 1:50:27
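The arithmetic pmonitor performs can be reproduced from any two progress samples. A sketch using the first two samples from the commits.MYD run above (5-second interval); it lands on the same 3 min 42 s estimate the tool prints:

```python
def eta_from_samples(pct_then, pct_now, interval_seconds):
    """ETA in seconds from two progress percentages sampled one interval apart.

    Same idea as pmonitor: linear extrapolation of the file-offset progress.
    """
    rate = (pct_now - pct_then) / interval_seconds  # percent per second
    if rate <= 0:
        return float("inf")
    return (100.0 - pct_now) / rate

secs = eta_from_samples(31.66, 33.16, 5)
print(f"ETA {int(secs // 60)}:{int(secs % 60):02d}")  # matches the tool's 0:03:42
```

This works for any progress source that moves linearly, not just file offsets.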
Solution 8 - Mysql
Here's what you'll need to do to improve the following query:
SELECT `userId`
FROM `openEndedResponses` AS `oe`
WHERE
`oe`.`questionId` = 3 -- zip code
AND (REPLACE( REPLACE( `oe`.`value`, ' ', '' ), '-', '' ) IN ( '30071', '30106', '30122', '30134', '30135', '30168', '30180', '30185', '30187', '30317', '30004' ));
You'll need to ensure oe.questionId is indexed. You'll also need to ensure that, whenever oe.questionId is 3, oe.value is stored without any spaces or dashes across the entire table; other questionIds (say 4 or 5, which might be city names) can still allow spaces.
By doing this, you'll be able to remove all the REPLACEs, which will let MySQL use an index on oe.value.
MySQL will then merge both indices and give you the result much faster, in terms of processing.
If you have many repeated userIds, you'll want to group them so that duplicate entries from the index are discarded immediately. You still need to scan the whole merged index, but the smaller result set will take less time to transfer; much less than 13 seconds!
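If the data can be cleaned at write time, the REPLACE() logic moves into the application. A sketch of that normalization (the zip code set is taken from the query above; the function name is made up):

```python
# Hypothetical application-side normalization: store `value` already
# stripped of spaces and dashes, so the SELECT can drop both REPLACE()
# calls and use a plain index on `value`.
TARGET_ZIPS = {'30071', '30106', '30122', '30134', '30135', '30168',
               '30180', '30185', '30187', '30317', '30004'}

def normalize_zip(raw):
    # Same transformation the nested REPLACE() calls perform in SQL.
    return raw.replace(' ', '').replace('-', '')

print(normalize_zip('300 71') in TARGET_ZIPS)
```

Normalize once on insert/update and every subsequent lookup becomes a cheap indexed equality test.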
Give it a shot and keep us posted about the result.
Best!
Solution 9 - Mysql
How about looking into partitioning your MySQL table so you can spread the read/write load? Try to limit each partition to 50 million rows (obviously dependent on your hardware).