MySQL Long Query Progress Monitoring
Problem Overview
Just to preface my question, I understand that there is no direct support for something like this. What I am looking for is any sort of work-around, or convoluted derivation that would get me a half-respectable result.
I am working with a rather large MySQL cluster (tables > 400 million rows) using the cluster engine.
Is anyone aware of a way to either directly retrieve or otherwise derive a somewhat (or better) accurate indication of progress through a long query in mysql? I have some queries that can take up to 45 minutes, and I need to determine if we're 10% or 90% through the processing.
EDIT:
As requested in the comments, here is a distilled and generalized version of one of the queries that led to my original question...
SELECT `userId`
FROM `openEndedResponses` AS `oe`
WHERE
`oe`.`questionId` = 3 -- zip code
AND (REPLACE( REPLACE( `oe`.`value`, ' ', '' ), '-', '' ) IN ( '30071', '30106', '30122', '30134', '30135', '30168', '30180', '30185', '30187', '30317', '30004' ));
This query is run against a single table with ~95 million rows. It takes 8 seconds to run the query and another 13 to transfer the data (21 seconds total). Considering the size of the table, and the fact that string manipulation functions are involved, I'd say it's running pretty damn fast. To the user, however, those are still 21 seconds in which the application appears stuck or idle. Some indication of progress would be ideal.
Mysql Solutions
Solution 1 - Mysql
I know this is an old question, but I was looking for a similar answer, when trying to figure out how much longer my update would take on a query of 250m rows.
If you run:
SHOW ENGINE INNODB STATUS \G
Then, under TRANSACTIONS, find the transaction in question and examine this section:
---TRANSACTION 34282360, ACTIVE 71195 sec starting index read
mysql tables in use 2, locked 2
1985355 lock struct(s), heap size 203333840, 255691088 row lock(s), undo log entries 21355084
The important bit is "undo log entries". In my case, each updated row seemed to add one undo log entry (try running it again after a few seconds and see how many have been added).
If you skip to the end of the status report, you'll see this:
Number of rows inserted 606188224, updated 251615579, deleted 1667, read 54873415652
0.00 inserts/s, 1595.44 updates/s, 0.00 deletes/s, 3190.88 reads/s
Here we can see that updates are being applied at 1595.44 rows per second (though if you're running other update queries in tandem, this rate may be split between your queries).
So from this, I know about 21m rows have been updated, leaving (250m - 21m) 229m rows to go.

229,000,000 / 1600 = 143,125 seconds to go
(143,125 / 60) / 60 = 39.76 hours to go
So it would appear I can twiddle my thumbs for another couple of days. Unless this answer is wrong, in which case I'll update it sometime before then!
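If you'd rather not redo that arithmetic by hand every time you check, it can be scripted. A minimal sketch in Python, with the figures hard-coded from the status output above; it assumes one undo log entry per updated row and a steady update rate, both of which are approximations:

```python
import re

def eta_from_innodb_status(status_text, total_rows):
    """Estimate seconds remaining for a big UPDATE from SHOW ENGINE INNODB STATUS text.

    Assumes one undo log entry per updated row and a steady update rate.
    """
    done = int(re.search(r"undo log entries (\d+)", status_text).group(1))
    rate = float(re.search(r"([\d.]+) updates/s", status_text).group(1))
    remaining = max(total_rows - done, 0)
    return remaining / rate if rate > 0 else float("inf")

# Figures copied from the status output shown above.
status = """
---TRANSACTION 34282360, ACTIVE 71195 sec starting index read
1985355 lock struct(s), heap size 203333840, 255691088 row lock(s), undo log entries 21355084
0.00 inserts/s, 1595.44 updates/s, 0.00 deletes/s, 3190.88 reads/s
"""
seconds_left = eta_from_innodb_status(status, total_rows=250_000_000)
print(f"{seconds_left / 3600:.1f} hours to go")
```

Rerun it every few minutes and the estimate will tighten as the rate settles.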
Solution 2 - Mysql
I was able to estimate something like this by querying the number of rows to process then breaking the processing into a loop, working on only a subset of the total rows at a time.
The full loop was rather involved, but the basic logic went like:
SELECT @minID = Min(keyColumn) FROM table WHERE condition
SELECT @maxID = Max(keyColumn) FROM table WHERE condition
SELECT @potentialRows = (@maxID - @minID) / @iterations
WHILE @minID < @maxID
BEGIN
SET @breakID = @minID + @potentialRows
SELECT columns FROM table WITH (NOLOCK, ...)
WHERE condition AND keyColumn BETWEEN @minID AND @breakID
SET @minID = @breakID + 1
END
Note this works best if IDs are evenly distributed.
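The boundary arithmetic in the loop above can be sketched in runnable form (the SQL above is pseudocode). In this Python sketch the min/max IDs and iteration count are made-up example values; each (lo, hi) pair would become one `WHERE keyColumn BETWEEN lo AND hi` query, and counting finished chunks gives the progress figure:

```python
def id_chunks(min_id, max_id, iterations):
    """Split [min_id, max_id] into roughly equal contiguous ranges.

    Mirrors the loop above; assumes IDs are roughly evenly distributed.
    """
    step = max(1, (max_id - min_id) // iterations)
    lo = min_id
    while lo <= max_id:
        hi = min(lo + step, max_id)
        yield lo, hi
        lo = hi + 1

chunks = list(id_chunks(5000, 35469678, iterations=100))
# Each (lo, hi) becomes: SELECT ... WHERE condition AND keyColumn BETWEEN lo AND hi
# Progress after finishing chunk i is roughly (i + 1) / len(chunks).
print(f"{len(chunks)} chunks of ~{chunks[0][1] - chunks[0][0] + 1} ids each")
```

If the IDs are sparse or skewed, chunks will vary wildly in real row count, which is exactly the caveat noted above.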
Solution 3 - Mysql
There is a promising answer to this old question which I found here, written by Baron Schwartz. It's not a precise and complete solution, but it does provide some objective material for estimates, if you're only running that query and nothing else on your server.
You run this command while the query is already running:
mysqladmin extended -r -i 10 | grep Handler
A few notes:
- `10` is the number of seconds after which the command repeats itself, so wait for the refreshes
- add something like `-u root -p` if you need to authenticate
- if you know exactly which Handler you're looking for, you can make the `grep` more focused; for example `Handler_read_rnd_next` seems to be good for `SELECT`s
- ignore the first output and use the second and following ones
- use `Ctrl-C` to exit
Now get that number and do your math. Determine rows handled per second, and with your knowledge of table sizes you might be able to get a fairly precise estimate of total time.
Free extra tip: the command doesn't seem to go into Bash history (maybe because of the exit with Ctrl-C); you can add it there by hand with `history -s mysqladmin extended -r -i 10 -u root -p | grep Handler`.
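The math on the Handler deltas can be scripted as well. A sketch, assuming a line in the usual `mysqladmin` table format (the sample line is hypothetical; the 95-million-row total is the table size from the question; with `-r`, each printed value is the change since the previous sample):

```python
import re

def handler_rate(status_line, interval_seconds):
    """Rows per second from one `mysqladmin extended -r -i N | grep Handler` line.

    With -r, the printed value is the delta since the previous sample,
    so dividing by the interval gives a rate.
    """
    name, delta = re.match(r"\|\s*(\w+)\s*\|\s*(\d+)\s*\|", status_line).groups()
    return name, int(delta) / interval_seconds

# Hypothetical sample line from the second or later refresh.
name, rate = handler_rate("| Handler_read_rnd_next | 429128 |", interval_seconds=10)
total_rows = 95_000_000  # table size, from the question
print(f"{name}: {rate:.0f} rows/s; a full scan would take ~{total_rows / rate / 60:.1f} minutes")
```

As the answer says, you need to know the table size separately; the Handler counters only give you the rate.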
Solution 4 - Mysql
I'm fairly sure MySQL doesn't support any indication of the progress of running queries. The only solution is to optimize/split the queries.
A Select could be split by id as Dour High Arch suggested. Here is a query from a 33 million row table:
mysql> SELECT SQL_NO_CACHE min(id), max(id) FROM `urls`;
+---------+----------+
| min(id) | max(id) |
+---------+----------+
| 5000 | 35469678 |
+---------+----------+
1 row in set (0.00 sec)
You'd better use an integer or at least a date field for splitting. It should be a primary or unique index and should not allow null values.
Solution 5 - Mysql
If it's a complex query you are attempting, the EXPLAIN SQL command or MySQL Query Analyzer might help to understand what is going on. If it's simply a large query, you might try creating a temporary table with SELECT INTO and/or using LIMIT/OFFSET clauses in SELECT queries. If you use LIMIT/OFFSET on the original tables, you might need to set the transaction level to serializable, IIRC, so that you get consistent reads while iterating over the data. If you create a temporary table first, that table should stay consistent regardless.
Solution 6 - Mysql
For now -- for my very specific situation -- there seems to be no real solution for this. Since I can't split my query into several smaller ones, and running select count(*) first before the "real" query proves counterproductive (it doubles the execution time of an already painfully slow query), none of the workarounds seem viable either. Maybe soon MySQL will support something like this.
Solution 7 - Mysql
If your query involves a linear scan through a large table, you can often obtain an excellent estimate by running pmonitor on the file containing that table. Include the --update
option, because MySQL opens table files in update mode.
Example:
$ sudo pmonitor --update --file=/home/mysql/ghtorrent/commits.MYD --interval=5
/home/mysql/ghtorrent/commits.MYD 31.66%
/home/mysql/ghtorrent/commits.MYD 33.16% ETA 0:03:42
/home/mysql/ghtorrent/commits.MYD 34.85% ETA 0:03:24
/home/mysql/ghtorrent/commits.MYD 36.43% ETA 0:03:32
/home/mysql/ghtorrent/commits.MYD 38.36% ETA 0:03:12
/home/mysql/ghtorrent/commits.MYD 40.21% ETA 0:03:01
/home/mysql/ghtorrent/commits.MYD 41.95% ETA 0:02:54
[...]
/home/mysql/ghtorrent/commits.MYD 92.01% ETA 0:00:24
/home/mysql/ghtorrent/commits.MYD 93.85% ETA 0:00:18
/home/mysql/ghtorrent/commits.MYD 95.76% ETA 0:00:12
/home/mysql/ghtorrent/commits.MYD 97.60% ETA 0:00:07
/home/mysql/ghtorrent/commits.MYD 98.83% ETA 0:00:03
/home/mysql/ghtorrent/commits.MYD 100% ETA 0:00:00
If you don't know the file to monitor, run pmonitor with the --diff option. This will show you the file(s) where progress is made.
Example:
$ sudo pmonitor --update --diff --command=mysqld -i 60
[...]
/home/mysql/ghtorrent/projects.MYD 22.41% ETA 2:01:41
/home/mysql/ghtorrent/projects.MYD 23.13% ETA 1:53:23
/home/mysql/ghtorrent/projects.MYD 23.84% ETA 1:50:27
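The arithmetic pmonitor performs can be reproduced from any two progress samples. A sketch using the first two samples from the commits.MYD run above (5-second interval); it lands on the same 3 min 42 s estimate the tool prints:

```python
def eta_from_samples(pct_then, pct_now, interval_seconds):
    """ETA in seconds from two progress percentages sampled one interval apart.

    Same idea as pmonitor: linear extrapolation of the file-offset progress.
    """
    rate = (pct_now - pct_then) / interval_seconds  # percent per second
    if rate <= 0:
        return float("inf")
    return (100.0 - pct_now) / rate

secs = eta_from_samples(31.66, 33.16, 5)
print(f"ETA {int(secs // 60)}:{int(secs % 60):02d}")  # matches the tool's 0:03:42
```

This works for any progress source that moves linearly, not just file offsets.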
Solution 8 - Mysql
Here's what you'll need to do to improve the following query:
SELECT `userId`
FROM `openEndedResponses` AS `oe`
WHERE
`oe`.`questionId` = 3 -- zip code
AND (REPLACE( REPLACE( `oe`.`value`, ' ', '' ), '-', '' ) IN ( '30071', '30106', '30122', '30134', '30135', '30168', '30180', '30185', '30187', '30317', '30004' ));
You'll need to ensure oe.questionId is indexed. You'll also need to ensure that, whenever oe.questionId is 3, oe.value is stored without any spaces or dashes across the entire table; other questionIds (say 4 or 5, which might be city names) can still allow spaces.
By doing this, you'll be able to remove all the REPLACEs, which will let MySQL use an index on oe.value.
MySQL will then merge both indices and give you the result much faster, in terms of processing.
If you have many repeated userIds, you'll want to group them so that duplicate entries from the index are discarded immediately. You still need to scan the whole merged index, but the smaller result set will take less time to transfer; much less than 13 seconds!
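If the data can be cleaned at write time, the REPLACE() logic moves into the application. A sketch of that normalization (the zip code set is taken from the query above; the function name is made up):

```python
# Hypothetical application-side normalization: store `value` already
# stripped of spaces and dashes, so the SELECT can drop both REPLACE()
# calls and use a plain index on `value`.
TARGET_ZIPS = {'30071', '30106', '30122', '30134', '30135', '30168',
               '30180', '30185', '30187', '30317', '30004'}

def normalize_zip(raw):
    # Same transformation the nested REPLACE() calls perform in SQL.
    return raw.replace(' ', '').replace('-', '')

print(normalize_zip('300 71') in TARGET_ZIPS)
```

Normalize once on insert/update and every subsequent lookup becomes a cheap indexed equality test.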
Give it a shot and keep us posted about the result.
Best!
Solution 9 - Mysql
How about looking into partitioning your MySQL table so you can spread the read/write load? Try to limit each partition to 50 million rows (obviously dependent on your hardware).