Ensuring that database tier is always available and performing optimally is one of the most important tasks for engineers running apps in the Microsoft Azure cloud. Performance slowdowns or outages in the database tier typically impact overall applications in most fundamental ways. In this blog entry, we’ll highlight the secret sauce behind some of the metrics that CloudMonix tracks every minute of every hour when monitoring SQL Azure databases.

There are three important categories of metrics when monitoring SQL Azure databases that one should be aware of

  • Performance Metrics – metrics indicating poor performance of the database, typically because of poorly written queries or un-optimized indices
  • Utilization Metrics – metrics indicating the load on host that is running SQL Azure database
  • Connectivity Metrics – metrics tracking connections, transactions, and if the database is up and responding

Below, we’ll describe how to capture key metrics from these categories. While utilizing CloudMonix, database administrators can easily and efficiently keep track of all publically available monitoring metrics thru a friendly interface and be alerted of issues immediately. However, some IT professionals may choose to gather these metrics into their own tracking systems or simply execute them on demand whenever they have a need to do so.

Performance Metrics

Perhaps the best way to improve performance of the database is to monitor and optimize frequently executing queries.
To see the most utilized queries in the database, one must query dynamic management view sys.dm_exec_query_stats

SELECT
execution_count AS ExecutionCount,
highest_cpu_queries.total_worker_time AS TotalWorkerTime,
(highest_cpu_queries.total_worker_time / execution_count) AS AverageCpuTime,
total_logical_reads AS TotalReads,
total_logical_writes AS TotalWrites,
LEFT(q.[text],2048) AS Text
FROM
(SELECT TOP 10 *
FROM sys.dm_exec_query_stats qs
ORDER BY qs.total_worker_time desc) AS highest_cpu_queries
CROSS APPLY sys.dm_exec_sql_text(plan_handle) AS q
ORDER BY TotalWorkerTime DESC

In order to detect blocking (deadlocked) queries, query dynamic management view sys.dm_exec_requests

SELECT COUNT(*) FROM sys.dm_exec_requests WHERE blocking_session_id = 0 AND session_id IN (SELECT DISTINCT(blocking_session_id) FROM sys.dm_exec_requests)

Presence of blocking queries indicates a potential problem that should be investigated further.

Utilization Metrics

Database Throughput Units (DTUs) (that were recently introduced to SQL Azure) indicate performance-capacity of the database. DTU is a combination of processor, read/write and memory utilizations. Measuring these utilizations can accurately predict when SQL Azure database is over-utilized and must be upgraded or the queries in it, optimized.
To measure processor, memory and read/write utilization one must query dynamic management view sys.dm_db_resourcestats

SELECT
AVG(avg_cpu_percent) AS avg_cpu_percent,
AVG(avg_log_write_percent) AS avg_log_write_percent,
AVG(avg_data_io_percent) AS avg_data_io_percent,
AVG(avg_memory_usage_percent) AS avg_memory_usage_percent
FROM sys.dm_db_resource_stats
WHERE end_time BETWEEN 'X' AND 'Y'

‘X’ and ‘Y’ indicate a time range to query. ¬†DBA’s should pay close attention when even one of these metrics is over 80% for sustained amount of time.

Connectivity Metrics

When needing to know as soon as possible of any potential outages, nothing beats just executing simple connectivity tests to monitored database on a frequent basis.

In order to determine what connections are misbehaving or if there is some buggy code is not properly closing connections, one can inspect dynamic management view sys.dm_exec_sessions

SELECT
s.host_name as HostName,
s.session_id as SessionId,
s.login_name as LoginName,
s.last_request_end_time as LastRequestTime,
s.cpu_time as Cpu,
e.num_reads as Reads,
e.num_writes as Writes,
s.status as Status,
s.open_transaction_count as Transactions
FROM
sys.dm_exec_sessions s INNER JOIN sys.dm_exec_connections e ON s.session_id = e.session_id

When monitoring for high utilization, it may be possible to isolate misbehaving connections coming from certain clients as the culprit of the issue.