linux perf: how to interpret and find hotspots

Tags: C++, Linux, Performance, Profiling, Perf

C++ Problem Overview


I tried out Linux's perf utility today and am having trouble interpreting its results. I'm used to Valgrind's callgrind, which is of course a totally different approach from perf's sampling-based method.

What I did:

perf record -g -p $(pidof someapp)
perf report -g -n

Now I see something like this:

16.92%  kdevelop  libsqlite3.so.0.8.6   [.] 0x3fe57
10.61%  kdevelop  libQtGui.so.4.7.3     [.] 0x81e344
 7.09%  kdevelop  libc-2.14.so          [.] 0x85804
 4.96%  kdevelop  libQtGui.so.4.7.3     [.] 0x265b69
 3.50%  kdevelop  libQtCore.so.4.7.3    [.] 0x18608d
 2.68%  kdevelop  libc-2.14.so          [.] memcpy
 1.15%  kdevelop  [kernel.kallsyms]     [k] copy_user_generic_string
 0.90%  kdevelop  libQtGui.so.4.7.3     [.] QTransform::translate(double, double)
 0.88%  kdevelop  libc-2.14.so          [.] __libc_malloc
 0.85%  kdevelop  libc-2.14.so          [.] memcpy

...

OK, these functions might be slow, but how do I find out where they are being called from? Since all these hotspots lie in external libraries, I see no way to optimize my own code.

Basically I am looking for some kind of callgraph annotated with accumulated cost, where my functions have a higher inclusive sampling cost than the library functions I call.

Is this possible with perf? If so - how?

Note: I found out that pressing "E" unfolds the call graph and gives somewhat more information. But the call graph is often not deep enough and/or terminates seemingly at random without saying how much time was spent where. Example:

10.26%  kate  libkatepartinterfaces.so.4.6.0  [.] Kate::TextLoader::readLine(int&...
        Kate::TextLoader::readLine(int&, int&)
        Kate::TextBuffer::load(QString const&, bool&, bool&)
        KateBuffer::openFile(QString const&)
        KateDocument::openFile()
        0x7fe37a81121c

Could it be an issue that I'm running on 64-bit? See also: http://lists.fedoraproject.org/pipermail/devel/2010-November/144952.html (I'm not using Fedora, but it seems to apply to all 64-bit systems).

C++ Solutions


Solution 1 - C++

With Linux 3.7, perf is finally able to use DWARF information to generate the call graph:

perf record --call-graph dwarf -- yourapp
perf report -g graph --no-children

Neat, but the curses GUI is horrible compared to VTune, KCacheGrind or similar... I recommend trying out flame graphs instead, which are a pretty neat visualization: http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
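For reference, here is a minimal sketch of that flame graph pipeline, assuming you recorded with call graphs as above and cloned Brendan Gregg's FlameGraph repository into ./FlameGraph (that path is just an assumption):

perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > flame.svg

Open flame.svg in a browser to zoom into individual stacks.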

Note: in the report step, -g graph makes the output show simple-to-understand "relative to total" percentages rather than "relative to parent" numbers, and --no-children shows only self cost rather than inclusive cost, a feature I also find invaluable.
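If you do want the inclusive view the question asks for, here is a quick sketch of both report modes on the same perf.data (the comments are mine):

perf report -g graph --children      # inclusive cost: callers accumulate their callees
perf report -g graph --no-children   # self cost only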

If you have a recent perf and an Intel CPU, also try out the LBR unwinder, which has much better performance and produces far smaller result files:

perf record --call-graph lbr -- yourapp

The downside here is that the call stack depth is more limited compared to the default DWARF unwinder configuration.
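If it is instead the DWARF unwinder whose stack depth turns out to be the limit, you can trade bigger result files for deeper stacks by increasing the per-sample stack dump size; the 32768 below is just an illustrative value (the default is 8192 bytes):

perf record --call-graph dwarf,32768 -- yourapp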

Solution 2 - C++

You should give hotspot a try: https://www.kdab.com/hotspot-gui-linux-perf-profiler/

It's available on github: https://github.com/KDAB/hotspot

It is, for example, able to generate flame graphs for you.

[Image: flamegraph generated by hotspot]
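A minimal sketch of a typical workflow, assuming hotspot is installed and that recording uses DWARF call graphs as in Solution 1 (you can also start hotspot without arguments and open the file from its GUI):

perf record --call-graph dwarf -- yourapp
hotspot perf.data       # open the recorded profile in the GUI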

Solution 3 - C++

> Ok, these functions might be slow, but how do I find out where they are getting called from? As all these hotspots lie in external libraries I see no way to optimize my code.

Are you sure that your application someapp is built with the GCC option -fno-omit-frame-pointer (and possibly its dependent libraries)? Something like this:

g++ -m64 -fno-omit-frame-pointer -g main.cpp
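With frame pointers preserved like that, perf's default frame-pointer unwinder can walk your stacks; a small sketch (binary name and optimization flags are placeholders):

g++ -m64 -O2 -g -fno-omit-frame-pointer main.cpp -o someapp
perf record --call-graph fp -- ./someapp    # fp unwinding is also what plain -g uses by default
perf report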

Solution 4 - C++

You can get a very detailed, source-level report with perf annotate; see Source level analysis with perf annotate. It will look something like this (shamelessly stolen from the website):

------------------------------------------------
 Percent |   Source code & Disassembly of noploop
------------------------------------------------
         :
         :
         :
         :   Disassembly of section .text:
         :
         :   08048484 <main>:
         :   #include <string.h>
         :   #include <unistd.h>
         :   #include <sys/time.h>
         :
         :   int main(int argc, char **argv)
         :   {
    0.00 :    8048484:       55                      push   %ebp
    0.00 :    8048485:       89 e5                   mov    %esp,%ebp
[...]
    0.00 :    8048530:       eb 0b                   jmp    804853d <main+0xb9>
         :                           count++;
   14.22 :    8048532:       8b 44 24 2c             mov    0x2c(%esp),%eax
    0.00 :    8048536:       83 c0 01                add    $0x1,%eax
   14.78 :    8048539:       89 44 24 2c             mov    %eax,0x2c(%esp)
         :           memcpy(&tv_end, &tv_now, sizeof(tv_now));
         :           tv_end.tv_sec += strtol(argv[1], NULL, 10);
         :           while (tv_now.tv_sec < tv_end.tv_sec ||
         :                  tv_now.tv_usec < tv_end.tv_usec) {
         :                   count = 0;
         :                   while (count < 100000000UL)
   14.78 :    804853d:       8b 44 24 2c             mov    0x2c(%esp),%eax
   56.23 :    8048541:       3d ff e0 f5 05          cmp    $0x5f5e0ff,%eax
    0.00 :    8048546:       76 ea                   jbe    8048532 <main+0xae>
[...]

Don't forget to pass the -fno-omit-frame-pointer and the -ggdb flags when you compile your code.
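Putting it together, here is a minimal sketch of the compile-and-annotate workflow (file and binary names are placeholders):

g++ -O2 -ggdb -fno-omit-frame-pointer main.cpp -o myapp
perf record ./myapp
perf annotate --stdio    # omit --stdio to browse the annotation in the TUI instead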

Solution 5 - C++

Unless your program has very few functions and hardly ever calls a system function or does I/O, profilers that sample the program counter won't tell you much, as you're discovering. In fact, the well-known profiler gprof was created specifically to try to address the uselessness of self-time-only profiling (not that it succeeded).

What actually works is something that samples the call stack (thereby finding out where the calls are coming from), on wall-clock time (thereby including I/O time), and reports by line or by instruction (thereby pinpointing the function calls you should investigate, not just the functions they live in).

Furthermore, the statistic you should look for is percent of time on stack, not number of calls, not average inclusive function time. Especially not "self time". If a call instruction (or a non-call instruction) is on the stack 38% of the time, then if you could get rid of it, how much would you save? 38%! Pretty simple, no?

An example of such a profiler is Zoom.

There are more issues to be understood on this subject.

Added: @caf got me hunting for the perf info, and since you included the command-line argument -g, it does collect stack samples. Then you can get a call-tree report. If you also make sure you're sampling on wall-clock time (so you get wait time as well as CPU time), then you've got almost what you need.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type      | Original Author  | Original Content on Stackoverflow
Question          | milianw          | View Question on Stackoverflow
Solution 1 - C++  | milianw          | View Answer on Stackoverflow
Solution 2 - C++  | Martin Gerhardy  | View Answer on Stackoverflow
Solution 3 - C++  | user184968       | View Answer on Stackoverflow
Solution 4 - C++  | Ali              | View Answer on Stackoverflow
Solution 5 - C++  | Mike Dunlavey    | View Answer on Stackoverflow