Text

Finding Problem Queries, Part 1 - The Slow Stuff

At ideeli, we use a variety of techniques to eliminate single points of failure and contention. These involve load balancing multiple application servers, caching layers, and calls to asynchronous services. However, at the end of the day we have a single master MySQL database. It doesn’t matter how many dozens of application servers are handling requests if the database is stalled. Without the right tools, it can be difficult to determine exactly what the database is doing at any given time. In this two part post, I will discuss some techniques for finding those problematic queries.

FINDING THE SLOW STUFF

The MySQL slow query log should be the first stop when looking for queries to optimize. Even with tools like mysqldumpslow, the format of the slow log is verbose and it can be difficult to see what is taking all the time. The challenge when looking in the slow log is determining what to optimize first. What’s more important, a query that runs 600 times a day and takes 1 second each time, or a single query that takes 1 minute to run? Chances are, the query that runs 600 times a day has a greater impact on database performance, but it is difficult to gather these types of statistics from the raw log.

That is where pt-query-digest from the Percona Toolkit comes in. It is the Swiss Army Knife(™) for analyzing queries in aggregate. pt-query-digest distills each query in the slow log to a normalized form and generates statistics about each one. It also provides an aggregated view of the statements, ranked by the total amount of time each normalized query took.

For example, pt-query-digest will take the following query
SELECT * FROM table_1 WHERE id = 123;

and normalize it to
select * from table_? where id = ?;

The default mode is to parse a MySQL slow query log. Just pass the name of the file or pipe it to the utility.

# pt-query-digest mysql-slow.log
or
# gunzip -c mysql-slow.log.1.gz | pt-query-digest

There are a few sections in the default output, but the first place to look is the “Profile” section.

The `SELECT o` query took an average of 0.6443s per execution, ran 6672 times, totaling 4298s of CPU time, which amounts to 54.8% of the time accounted for within the log. That looks like a good candidate for optimization! The Query ID column is a hex number that corresponds to the query’s “fingerprint” (normalized form) and is unique to the query.

For each query in that list, there is detailed output below along with an actual sample so there is no need to dig through logs to find a real example to work with:

This view provides a wealth of statistical information including a histogram of the distribution of run times. In this case, all the runs were in the 100ms range. Along with the standard deviation, the histogram is useful to help determine if the query consistently runs in the same time range, or if it is wildly variable. As the folks at Facebook have pointed out, it is better to be slow and consistent than to have highly variable response times.

The sample above was generated using the —review-history flag (details in the documentation) which stores some information about each query in a table. This provides a little more detail about each query including how long it has been around (first_seen: 2011-07-15 06:27:32). That can be used to help correlate the query with a code deploy or configuration change. The review history table also allows us to add comments to the log by updating the comments field, referencing the Query ID in the checksum field:

UPDATE maatkit.review_history SET comments = ‘FIXIT-1234’ WHERE checksum = ‘0x20112A1CCDBDBA4E’;

I like to put JIRA ticket IDs in there so I don’t keep bugging the devs about the same query over and over again. We have this report mailed out to all of the Operations and Development teams every morning as part of log rotation. Each week, a developer is assigned to be responsible for slow query triage, so someone’s eyes are always on this list.

If you are sick of seeing a query in the log every day because there is nothing to be done about it, update the reviewed_on and reviewed_by fields and it won’t bother you anymore:

UPDATE maatkit.review_history SET reviewed_by = ‘abrown’, reviewed_on = ‘2011-11-28’ WHERE checksum = ‘0x20112A1CCDBDBA4E’;

SHOW FULL PROCESSLIST REPEAT

This is the poor man’s way of profiling a running server, but is actually quite educational. It goes something like this:

  1. Open MySQL prompt
  2. type ‘SHOW FULL PROCESSLIST;’
  3. Look at the results
  4. Press UP
  5. Press ENTER
  6. GOTO 2

If the same query or queries are appearing in the process list constantly, chances are they are problematic and should be optimized. Either that, or they are interacting in some bad way with other queries that are running concurrently. The same queries are probably also in your slow query log, but looking at the process list directly will provide context and help determine if the queries are interacting poorly with each other.

To filter out all the idle transactions, which are plentiful when there are 600+ threads connected from application servers, query the INFORMATION_SCHEMA.PROCESSLIST table instead of running SHOW FULL PROCESSLIST:

SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST WHERE INFO ‘null’; (up arrow, enter, repeat)

This can easily be scripted to constantly log this data for further analysis.

…which is exactly what mysql_logger.rb is for…

MYSQL_LOGGER

I wrote a simple utility that condenses and logs the output of SHOW FULL PROCESSLIST to syslog every second, which then gets dumped into Splunk. This utility has very little overhead and runs on all of our MySQL servers. When troubleshooting a performance problem, it can provide enormous value. With a little Splunk-fu (or grep/awk-fu), we can get an idea about what the server was doing at the time of a problem — in context.

Here is how to get it going:

./mysql_logger.rb —user myuser —pass mypass —syslog —daemonize —comand start

Voila, there will be a new syslog tag, “mysql_logger” that outputs what is currently running on the the database at that instant, every second.

GREATER GRANULARITY

There is a lot of information packed in the slow log and process list, and if you are just starting down the query optimization road, it should probably be the first stop. However, it only exposes the very slow stuff and that is not a complete picture. What about fast queries that run a bazillion times an hour? It is likely that those are just as important as the less frequent slow queries. How do you find out everything that the server is doing? This will be addressed in part 2.

- Aaron B, Senior Systems Engineer

Notes:

  1. dallas reblogged this from ideelitech
  2. ideelitech posted this