Monitor your application using Pinba

Introduction

At Dailymotion, we use a lot of monitoring tools to watch, analyse and optimize our applications. We use statsd, graphite, collectd and Datadog, but also a lesser-known tool: Pinba (which stands for PHP Is Not A Bottleneck Anymore).

We're currently working on the migration of an old monolithic PHP application with a bunch of legacy features, to a modern scalable set of services. For this, we're using Pinba to gather technical metrics about:

  • our current PHP application: it's helping us for maintenance and rollouts (detecting releases that adversely affect performance) and also to find and understand bottlenecks. It allows us to establish which parts of the framework will give the greatest performance improvement by an early migration. In other words, identifying low-hanging fruit.

  • our new microservices, using different clients according to the language we used to develop the service (Python and Go for the time being).

What is Pinba?

According to its creator tony2001, Pinba is a statistics server using MySQL as an interface. It accumulates and processes data sent over UDP by multiple PHP processes and displays statistics in a nice human-readable form of simple "reports", also providing a read-only interface to the raw data in order to make possible generation of more sophisticated reports.

How does it work?

The Pinba PHP extension is deployed on all the servers in our web farm. For each PHP thread, it collects technical metrics, and sends these metrics to a central Pinba server over UDP after the response generation is complete (using the Google Protobuf data interchange format). On the central server, the Pinba engine aggregates the contents of these messages, and bulk inserts them periodically into a read-only MySQL database, providing a simple and well-known exploration tool: a MySQL client. Simple and efficient.

Pinba is neither, as is often heard, a debugging tool nor a graphing tool. It is simply an exploration tool, providing a picture of what's going on in your application on production, in real time, for the last minute or two. This is very powerful:

  • it provides live metrics in production with no impact on performance,
  • it's a good tool to detect bottlenecks,
  • if you use another tool to consolidate Pinba metrics, you can build some very useful time series graphs.

Pinba is not a magical tool that allows you to do anything with it. A typical mistake is to use it to gather exact data (such as a counter: how many time did an event occur in the last minute). Pinba isn't designed to perform this kind of task (as we saw before, protobuf messages are sent via UDP for performance reasons, so it's not impossible for some packets to be lost in transit). Use it to observe tendencies, not to measure actions with precision.

Our architecture

https://www.dropbox.com/s/p4iaru6vfpm2qen/pinba.png?dl=0

What metrics can you get from it?

Out of the box, the technical metrics gathered in all PHP processes are already very extensive, including:

  • doc_size: response size
  • mem_peak_usage: memory allocation (maximum)
  • request_time: time spent to generate the process (microsend resolution)
  • ru_utime/ru_stime: statistics about resource usage on the server (user and system)
  • status: HTTP response code
  • memory_footprint: size of the process

The engine stores these metrics per process in the request table, so in this table we have one row for every PHP response produced during last pinba_stats_history seconds (60 seconds in our case).

mysql> SELECT script_name, doc_size, mem_peak_usage, status, memory_footprint, hostname  
FROM request  
LIMIT 10;  
script_name doc_size mem_peak_usage status memory_footprint hostname
[PROD]video_item 5.422 10240 200 54728 web-065
[PROD]widget_dispatch_v3 38.056 14848 200 31624 web-031
[PROD]video_list 154.369 20736 200 87176 web-004
[PROD]rest_api 0.235 22784 200 51568 web-128
[PROD]rest_api 9.306 34048 200 55624 web-024
[PROD]rest_api 5.448 35072 200 54676 web-041
[PROD]controller_dispatch 2.003 11776 200 36388 web-033
[PROD]widget_v3_chunks 22.525 12544 200 33544 web-165
[PROD]cdn_director 0 9984 302 42892 web-034
[PROD]rest_api 0.019 20736 200 32500 web-085

As the request table exposes metrics for every PHP process, access to the raw data can be pretty cumbersome if you have a lot of traffic:

mysql> SELECT COUNT(1) from request;  
COUNT(1)
1197502

In other words, 1197502 php responses were generated during a given sixty seconds on our entire web farm.

Fortunately, the engine aggregates on the fly, storing consolidated metrics in report_* tables simultaneously. For instance, metrics grouped by script_name are available in report_by_script_name table, so we can explore metrics for a specific script_name. For instance, if we want statistics about the ten most frequently called routes:

mysql> SELECT script_name, req_count, req_per_sec, memory_footprint_total, req_time_median  
FROM report_by_script_name  
ORDER BY req_count DESC  
LIMIT 10;  
script_name req_count req_per_sec memory_footprint_total req_time_median
[PROD]rest_api 276508 4608.47 12916600000 0.0514983
[PROD]embed_player 162808 2713.47 8074990000 0.109837
[PROD]cdn_director 122526 2042.1 5103160000 0.0280768
[PROD]widget_dispatch_v3 105822 1763.7 4349350000 0.049904
[PROD]history_logger 98481 1641.35 4060860000 0.0127946
[PROD]video_item 66250 1104.17 3563420000 0.254091
[PROD]autocomplete_list 56313 938.55 2322790000 0.0104993
[PROD]webapp_ads_mopub 52187 869.783 2183980000 0.0326299
[PROD]gravity_report 45992 766.533 1895290000 0.0167253
[PROD]player_data 33584 559.733 1724540000 0.113651

Here we can calculate the mean memory footprint, just fetching memory_footprint_total/req_count AS avg_mem_footprint

Out of the box, the engine consolidates metrics in a number of other report tables, which you can explore easily:

  • report_by_server_name
  • report_by_hostname
  • report_by_server_and_script
  • report_by_hostname_and_server
  • report_by_hostname_server_and_script

You can drill down merely by using a table that consolidates data across more dimensions, for instance most used scripts per host and server:

mysql> SELECT hostname, server_name, script_name, req_count, req_per_sec, req_time_median  
FROM report_by_hostname_server_and_script  
ORDER BY req_count DESC  
LIMIT 2;  
hostname server_name script_name req_count req_per_sec req_time_median
web-121 api.dailymotion.com [PROD]api_oauth_token 4144 69.0667 0.0104435
web-038 api.dailymotion.com [PROD]rest_api 3455 57.5833 0.0441618

Collecting your own metrics

As shown, Pinba provides a lot of useful data out of the box, but you can go further, really! Besides report tables, you can use tag_report_* tables. Metrics are consolidated in these tables according to the values of tags you define in your codebase.

For instance, in the old Dailymotion codebase, we have two kinds of tags :

  • I/O: all external access, blocking the execution of the process (principly access to back-ends like MySQL, Elastic, Redis, memcached, ... or external APIs called server to server such as Facebook or Twitter).

  • algo: algorithms we want to monitor

As an example, this is how we tag Memcache:

Client code

$pinbaTimer = $this->pinbaService->startTimer([
        'group' => 'memcached',
        'memcached' => $this->pinba_pool,
        'method' => 'get',
        'namespace' => 'dm_cache'
]);
$value = $this->memcache->get($this->prefix . $key);
$this->pinbaService->stopTimer($pinbaTimer);

Pinba service

public function startTimer(array $tags) {  
    if (!$this->isPinbaInstalled) {
        return null;
    }

    $timerId = 't' . ++$this->incr;
    $this->timers[$timerId] = pinba_timer_start($tags);
    return $timerId;
}

public function stopTimer($timerId) {  
    if ($this->isPinbaInstalled && isset($this->timers[$timerId])) {
         pinba_timer_stop($this->timers[$timerId]);
        unset($this->timers[$timerId]);
    }
}

We then define custom tables in the Pinba database, telling to the engine what to store in it using table comments.

For instance, the table to store metrics about memcached usage is as follows:

CREATE TABLE `tag_report_memcached_method_namespace` (  
  `script_name` varchar(128) DEFAULT NULL,
  `memcached` varchar(64) DEFAULT NULL,
  `method` varchar(64) DEFAULT NULL,
  `namespace` varchar(64) DEFAULT NULL,
  `req_count` int(11) DEFAULT NULL,
  `req_per_sec` float DEFAULT NULL,
  `hit_count` int(11) DEFAULT NULL,
  `hit_per_sec` float DEFAULT NULL,
  `timer_value` float DEFAULT NULL,
  `timer_median` float DEFAULT NULL,
  `index_value` varchar(256) DEFAULT NULL
) ENGINE=PINBA DEFAULT CHARSET=latin1 COMMENT='tagN_report:memcached,method,namespace';

And a table to store metrics about application I/O:

CREATE TABLE `tag_report_group_method` (  
  `script_name` varchar(128) DEFAULT NULL,
  `group` varchar(64) DEFAULT NULL,
  `method` varchar(64) DEFAULT NULL,
  `req_count` int(11) DEFAULT NULL,
  `req_per_sec` float DEFAULT NULL,
  `hit_count` int(11) DEFAULT NULL,
  `hit_per_sec` float DEFAULT NULL,
  `timer_value` float DEFAULT NULL,
  `timer_median` float DEFAULT NULL,
  `index_value` varchar(256) DEFAULT NULL
) ENGINE=PINBA DEFAULT CHARSET=latin1 COMMENT='tag2_report:group,method';

Note to self: next time, don't use SQL reserved words for column names, like group...

Using these tables, we can obtain fine-grained statistics. For instance, statistics about all I/O activity on a specific route:

mysql> SELECT `group`,  
    SUM(req_count) AS req,
    AVG(timer_median) AS timer_median,
    SUM(timer_value)/SUM(hit_count) AS tph,
    SUM(timer_value)/SUM(req_count) AS tpr,
    SUM(timer_value) AS timer_total
FROM tag_report_group_method  
WHERE script_name='[PROD]video_item'  
    AND `group` != "int_api"
GROUP BY `group`  
ORDER BY timer_total DESC;  
group req timer_median tph tpr timer_total
mysql 118124 0.009819403290748596 0.003329653064723691 0.05909533790217434 6980.5776943564415
memcached 188049 0.00976656749844551 0.0003179184672595343 0.01067464985796328 2007.3572311401367
Elastic 32049 0.011334048118442297 0.016838098360988627 0.026841974640797184 860.2584452629089
Redis 48999 0.009611431136727333 0.011986066101148827 0.012204265879965226 597.9968238524161
DMX 61436 0.009767306968569756 0.0016191925102048436 0.0073448760845254615 451.23980712890625
Curl 24882 0.06987743234882753 0.008533559191620778 0.008533559191620778 212.3320198059082
cleeng 50 0.12915635108947754 0.1352721940790749 1.0902938842773438 54.51469421386719
Facebook API 22 0.5436198115348816 0.5515929568897594 0.5515929568897594 12.135045051574707
Live API 1 0.01953125 0.0049230000004172325 0.0049230000004172325 0.0049230000004172325

Specific details about MySQL SELECTs :

mysql> SELECT mysql, method, timer_value,  
    timer_value/req_count AS avg_timer_value,
    hit_count/req_count AS avg_op_count,
    timer_value/hit_count AS avg_op_value,
    hit_count,
    req_count
FROM tag_report_mysql_method  
WHERE method='select'  
    AND script_name='[PROD]video_item'
    AND req_count > 200
ORDER BY avg_op_count DESC  
LIMIT 4;  
mysql method timer_value avg_timer_value avg_op_count avg_op_value hit_count req_count
video select 3971.99 0.10144789052722653 16.6880 0.006079103953896177 653384 39153
video_view_summary select 264.027 0.007824405328308366 12.6024 0.000620868312581275 425254 33744
static_asset_video_sprite select 15.4905 0.022613802443455604 7.6496 0.0029561936400318875 5240 685
video_has_repost select 1468.93 0.027541118671605483 6.7955 0.0040528553527407 362444 53336


We keep an eye on these types of queries: an unexpectedly high number of selects probably indicates insufficient caching.

Going further

We're using another great feature provided by Pinba: request tags. You use them to differentiate requests of different types. The request is flagged with values of the tags you defined. For instance we use request tag to flag queries for:

  • format: response format (html, rss, atom, ...)
  • site_content: localization of the site content (en, de, fr, jp, ru, ...)
  • bot: is the client a bot? (yes/no)
  • auth: is the user authenticated (logged in)? (yes/no)
  • provider: used to see variations between different php versions: native PHP versus HHVM, ...

We then create tables, setting comment directives for the engine, so it knows what to store where. For instance, there is a table based on report_by_script_name, where the engine will store data only for PHP processes with value 'no' for request tag 'bot', and 'no' for request tag 'auth' (in other words, unlogged real users - the usual kind):

CREATE TABLE `auth_no_bot_no_report_by_script_name` (  
  `req_count` int(11) DEFAULT NULL,
  `req_per_sec` float DEFAULT NULL,
...
  `req_time_median` float DEFAULT NULL,
  `index_value` varchar(256) DEFAULT NULL
) ENGINE=PINBA DEFAULT CHARSET=latin1 COMMENT='report1::tag.auth=no,tag.bot=no';

Building graphs

If you want to build graphs from data gathered in Pinba, you must to consolidate it outside (as you store metrics for a minute or two before discarding them). For this, we use collectd coupled to a Mysql plugin. It runs queries against the Pinba database, and stores results in Whisper, and then we query Whisper using Graphite and Tessera or Grafana.

Conclusion

Pinba usage in our old PHP application allowed us to:

  • detect bottlenecks in the application
  • find bugs (excessive backend usage in some routes, ...)
  • build graphs and plug alerting to detect regressions after a release

We're also using Pinba in our Python projects, using Pynba, written by our old friend (and now colleague) johnnoone.

One thing we have noticed with our installation is that the housekeeping thread is not always perfect at evacuating old data: sometimes report tables contain phantom results of activity that took place long before the previous minute. This can be a problem when you're using Pinba to determine that something is not happening (usually after landing new code in production). This is one of the few cases where, for once, the bug is not in your code.

We have found that the simplest way to fix this problem is to restart the Pinba engine. This produces a brief glitch in time series graphs based on Pinba, but effectively cleans out phantom data. (Note: you can't use TRUNCATE TABLE, because it's not really a Mysql engine behind the curtain).

 Essential Pinba Resources