OpenTSDB is a time-series database built on top of the venerable HBase. It lets you aggregate and crunch many thousands of time-series metrics and digest them into useful statistics and graphs.
But the best part is the tagging system, which lets you build dynamic and useful graphs on the fly. With every metric you send, you simply attach arbitrary tags: "datacenter=ec2 cluster=production05 branch=master". Later on you can pull these tags up to compare minute differences between systems.
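To make that concrete, here is a minimal sketch of what a tagged datapoint looks like on the wire. OpenTSDB's telnet-style interface accepts one `put` line per datapoint: the metric name, a Unix timestamp, the value, and then the tags. The helper name `tsdb_put_line` is my own, not part of any library:

```python
import time

def tsdb_put_line(metric, value, tags):
    """Format one datapoint as an OpenTSDB telnet-style `put` command.

    `tags` is a dict like {"datacenter": "ec2", "cluster": "production05"}.
    """
    tag_str = " ".join("%s=%s" % (k, v) for k, v in sorted(tags.items()))
    return "put %s %d %s %s" % (metric, int(time.time()), value, tag_str)

# e.g. "put http.requests 1367409600 1243 branch=master datacenter=ec2"
print(tsdb_put_line("http.requests", 1243,
                    {"datacenter": "ec2", "branch": "master"}))
```

Because the tags ride along with every datapoint, slicing a graph by `datacenter` or `branch` later is just a query-time filter, not a new metric.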
This kind of monitoring blows "enterprise" solutions like Zabbix and Nagios out of the water. There's no way you could fit this kind of data into RRDtool or whatever the heck Zabbix uses to store it (MySQL?!). It's also an "agentless" solution, which makes it well suited for the cloud.
Tcollector
Now you can get realtime metrics on how your Varnish web accelerator is doing. I wrote a tcollector plugin to slurp counters from varnishstat and send them to TSDB. There's a pull request up to merge the collector into the tcollector repo, but in the meantime you can find the varnish collector script here.
The Code
#!/usr/bin/python
"""Send varnishstat counters to TSDB"""

import subprocess
import sys
import json
import time

from collectors.lib import utils

interval = 15  # seconds

# Prefixes here will be prepended to each metric name before being sent
metric_prefix = ['varnishstat']

# Add any additional tags you would like to include into this array as strings
#
# tags = ['production=false', 'cloud=amazon']
tags = []

# By default varnishstat returns about 300 metrics and not all of them are
# very useful.
#
# If you would like to collect all of the counters simply set vstats to "all"
#
# vstats = 'all'

# Some useful default values to send
vstats = [
    'client_conn',
    'client_drop',
    'client_req',
    'cache_hit',
    'cache_hitpass',
    'cache_miss'
]


def main():
    utils.drop_privileges()

    while True:
        try:
            if vstats == "all":
                stats = subprocess.Popen(
                    ["varnishstat", "-1", "-j"],
                    stdout=subprocess.PIPE,
                )
            else:
                fields = ",".join(vstats)
                stats = subprocess.Popen(
                    ["varnishstat", "-1", "-f" + fields, "-j"],
                    stdout=subprocess.PIPE,
                )
        except OSError, (errno, msg):
            # Die and signal to tcollector not to run this script.
            sys.stderr.write("Error: %s\n" % msg)
            sys.exit(13)

        metrics = json.loads(stats.stdout.read())

        # We'll use the timestamp provided by varnishstat for our metrics
        pattern = '%Y-%m-%dT%H:%M:%S'
        timestamp = int(time.mktime(time.strptime(metrics['timestamp'], pattern)))

        for k, v in metrics.iteritems():
            if k != 'timestamp':
                # Prepend any provided prefixes to each metric name
                metric_name = ".".join(metric_prefix) + "." + k
                print "%s %d %s %s" % \
                    (metric_name, timestamp, v['value'], ",".join(tags))

        sys.stdout.flush()
        time.sleep(interval)

if __name__ == "__main__":
    sys.exit(main())
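To see what the collector is chewing on, here's a sketch of the parsing step run against a trimmed, hand-written payload in the shape `varnishstat -1 -j` produces (the real output has roughly 300 counters, and the exact schema varies between Varnish versions, so treat this sample as illustrative only):

```python
import json
import time

# Trimmed, illustrative varnishstat JSON; keys and fields are assumptions
# based on the counters the collector subscribes to above.
sample = '''{
  "timestamp": "2013-05-01T12:00:00",
  "client_req": {"value": 12345, "flag": "a", "description": "Client requests received"},
  "cache_hit":  {"value": 11000, "flag": "a", "description": "Cache hits"}
}'''

metrics = json.loads(sample)

# varnishstat stamps the whole snapshot once; reuse it for every counter
pattern = '%Y-%m-%dT%H:%M:%S'
ts = int(time.mktime(time.strptime(metrics['timestamp'], pattern)))

for name, counter in metrics.items():
    if name != 'timestamp':
        # Same "metric timestamp value" shape tcollector expects on stdout
        print("varnishstat.%s %d %s" % (name, ts, counter['value']))
```

Each printed line is one datapoint in tcollector's stdout protocol; tcollector picks them up and relays them to the TSD.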