⚡ Quick RCA
Run in this order when something is on fire. Don't skip steps.
6-Step RCA Workflow
Memorize this▼
# STEP 1 — Is the service even running?
pgrep -fl mysql || pgrep -fl postgres || pgrep -fl mongod
# STEP 2 — System health (30 seconds)
uptime && free -h && df -h
# STEP 3 — What do the logs say? (run for your DB)
tail -50 /var/log/mysql/error.log | grep -iE 'error|warn'
tail -50 /var/log/postgresql/postgresql-*.log | grep -iE 'error|fatal'
journalctl -u mysql --since "10 minutes ago" | grep -iE 'error|fail'
# STEP 4 — Resource bottleneck?
top -b -n1 | head -25 # CPU + memory snapshot
iostat -x 1 3 # Disk I/O (await > 20ms = bottleneck)
vmstat 1 3 # si/so > 0 = swapping = bad
# STEP 5 — Connection / lock problem?
# MySQL:
mysql -e "SHOW FULL PROCESSLIST;" | grep -v Sleep
# PostgreSQL:
psql -c "SELECT pid,state,wait_event,query_start,LEFT(query,80) FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start;"
# STEP 6 — Timeline: when did it start?
grep -iE 'error|fatal' /var/log/mysql/error.log | awk '{print $1" "$2}' | uniq -c | tail -20
Health Snapshot — One-Liners
Run first▼
Quick health dump to file
{
echo "=== $(date) ===" && uptime
echo "--- MEMORY ---" && free -h
echo "--- DISK ---" && df -h
echo "--- TOP 5 CPU ---" && ps aux --sort=-%cpu | head -6
echo "--- TOP 5 MEM ---" && ps aux --sort=-%mem | head -6
echo "--- CONNECTIONS ---"
ss -ant | awk '{print $1}' | sort | uniq -c
} | tee /tmp/health_$(date +%H%M%S).txt
CPU & Load
Performance▼
uptime                      # load avg — compare to CPU count
lscpu | grep "^CPU(s)"      # how many CPUs you have
top -b -n1 | head -20       # snapshot (non-interactive)
mpstat -P ALL 1 3           # per-core breakdown (needs sysstat)
ps -eo pid,%cpu,etime,cmd --sort=-%cpu | head -10   # top CPU processes
Memory & Swap
Performance▼
⚠️Any swap usage on a database host is a performance cliff. Even 1MB of swap activity means you have a problem.
free -h                            # look at 'available', not 'free'
vmstat 1 5                         # si/so columns — swap in/out
swapon --show                      # is swap even enabled?
grep -E 'MemTotal|MemAvailable|SwapTotal|SwapFree' /proc/meminfo
ps aux --sort=-%mem | head -10     # top memory consumers
dmesg | grep -i "killed process"   # OOM killer fired?
Disk Space & I/O
Performance▼
⚠️Always run df -i too — full inodes will refuse new files even with gigabytes free.
df -h # disk space (human-readable)
df -i # inode usage — the silent killer
iostat -x 1 5 # %util > 80% = saturated, await > 20ms = slow
# Find what's eating disk
du -sh /var/lib/mysql/* 2>/dev/null | sort -hr | head -10
du -sh /var/lib/postgresql/*/base/* 2>/dev/null | sort -hr | head -10
find /var/log -type f -size +200M -exec ls -lh {} \;
# Ghost space: deleted files still held open
lsof | grep deleted | awk '{sum+=$7} END {print sum/1024/1024 " MB held by deleted files"}'
# Which process is doing I/O right now?
iotop -o # only processes with active I/O
pidstat -d 1 3 # per-process I/O (fallback if no iotop)
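When lsof shows ghost space, you can often reclaim it without restarting the process by truncating the deleted file through /proc. A minimal sketch of the mechanism (assumes Linux; fd 9 and the temp file are illustrative — in production you'd find the real pid/fd with `ls -l /proc/<pid>/fd | grep deleted`):

```shell
# Reclaim space held by a deleted-but-open file by truncating it via /proc.
tmp=$(mktemp)
exec 9>"$tmp"                    # this shell holds the file open on fd 9
head -c 1048576 /dev/zero >&9    # write 1 MiB
rm "$tmp"                        # deleted — but space is still held by fd 9
stat -Lc %s /proc/$$/fd/9        # still 1048576 bytes on disk
: > /proc/$$/fd/9                # truncate through /proc — space released
stat -Lc %s /proc/$$/fd/9        # now 0
exec 9>&-                        # close the descriptor
```

The usual real-world target is a runaway log that was rm'ed while the service still had it open; truncating via /proc releases the blocks immediately, no restart needed.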
Connections
Network▼
# Count connections by state
ss -ant | awk '{print $1}' | sort | uniq -c
# Connections to specific DB port
ss -ant state established | grep :3306 | wc -l # MySQL
ss -ant state established | grep :5432 | wc -l # PostgreSQL
# Who is connected (what processes/IPs)
ss -antp | grep :3306
lsof -i :3306
lsof -i :5432
# Connection count by remote IP (find the flood source)
ss -ant | awk '{print $5}' | grep -v '\*' | cut -d: -f1 | sort | uniq -c | sort -nr | head -10
# System file descriptor limits (connections use fds)
ulimit -n # current shell limit
cat /proc/sys/fs/file-max # system-wide max
cat /proc/sys/fs/file-nr # open / unused / max
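To turn those raw numbers into a verdict, a tiny hypothetical helper (`fd_pct` is not a standard command) that computes percent-of-limit — feed it `ls /proc/<pid>/fd | wc -l` and the soft limit from `awk '/Max open files/ {print $4}' /proc/<pid>/limits`:

```shell
# Hypothetical helper: what percent of its fd soft limit is a process using?
fd_pct() {
  awk -v used="$1" -v limit="$2" \
    'BEGIN { printf "%.0f\n", used * 100 / limit }'
}
fd_pct 512 1024    # → 50
fd_pct 1000 1024   # → 98 — connection storms die here with EMFILE
```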
Process Investigation & Kill
Emergency▼
🚨D-state processes cannot be killed with kill -9. They are waiting for I/O. Only fixing the underlying I/O (or rebooting) resolves them.
# Find stuck processes (D = blocked on I/O, Z = zombie)
# (awk on the STAT column — grep '^[0-9]' misses ps's leading spaces)
ps -eo pid,stat,wchan:25,cmd | awk '$2 ~ /^[DZ]/'
# Full process investigation
ps aux | grep -E 'mysql|postgres|spark|glue|python|airflow' | grep -v grep
ps -eo pid,ppid,stat,etime,%cpu,%mem,cmd | grep -E 'spark|glue'
# Long-running processes (etimes = elapsed seconds)
ps -eo pid,etimes,cmd --sort=-etimes | head -15
# Kill gracefully first, force only if needed
kill -15 PID   # SIGTERM (graceful)
kill -9 PID    # SIGKILL (force — last resort)
# Service management
systemctl restart mysql
systemctl restart postgresql
journalctl -u mysql -n 100 --no-pager
Log Investigation — Essential Patterns
Logs▼
# Live error monitoring
tail -f /var/log/mysql/error.log | grep --line-buffered -i error
tail -f app.log | grep --line-buffered -iE 'error|exception|failed'
# Error frequency over time (spot the spike)
grep -i error /var/log/mysql/error.log | awk '{print $1" "$2}' | cut -d: -f1 | uniq -c
# Most common errors (top 10)
grep -i error /var/log/pipeline.log | sort | uniq -c | sort -nr | head -10
# Error with context (5 lines before and after)
grep -i "connection refused" /var/log/pipeline.log -A 5 -B 5
# Kernel / OOM events
dmesg -T | grep -iE 'oom|killed|error|hung' | tail -30
journalctl -k --since "1 hour ago" | grep -i error
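The "spot the spike" idea generalizes: bucket errors per minute and eyeball the counts. A self-contained sketch over fabricated syslog-style lines (the timestamp layout is an assumption — adjust the field split to your log format):

```shell
# Count errors per minute from syslog-style lines ("Jan 15 14:30:01 host ...").
# $1-$3 are the timestamp; drop the seconds to bucket by minute.
errors_per_minute() {
  grep -i error | awk '{ split($3, t, ":"); print $1, $2, t[1] ":" t[2] }' \
    | sort | uniq -c
}
printf '%s\n' \
  'Jan 15 14:30:01 db mysqld: ERROR disk full' \
  'Jan 15 14:30:45 db mysqld: ERROR disk full' \
  'Jan 15 14:31:07 db mysqld: ok' \
  'Jan 15 14:31:59 db mysqld: ERROR lock wait' \
  | errors_per_minute
# →  2 Jan 15 14:30
#    1 Jan 15 14:31
```

A sudden jump in one bucket is your incident start time — feed it to `journalctl --since`.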
🗄 DB Diagnostics
In-database queries — the layer most cheatsheets skip. Run these inside the DB, not the OS.
MySQL / Aurora — Active Sessions
MySQL▼
Who is running what right now
MySQL — run inside DB
-- Full process list (includes sleeping connections)
SHOW FULL PROCESSLIST;
-- Better version with duration and host
SELECT id, user, host, db, command, time AS secs, state,
       LEFT(info, 120) AS query
FROM information_schema.processlist
WHERE command != 'Sleep'
ORDER BY time DESC;
-- Just the long-running ones (> 30 seconds)
SELECT id, user, db, time, state, LEFT(info, 100) AS query
FROM information_schema.processlist
WHERE time > 30 AND command != 'Sleep'
ORDER BY time DESC;
Kill a query or connection
-- Kill just the query (connection stays open)
KILL QUERY 12345;
-- Kill the entire connection
KILL 12345;
-- Kill ALL long-running queries at once (> 60s) — USE WITH CARE
SELECT CONCAT('KILL QUERY ', id, ';')
FROM information_schema.processlist
WHERE time > 60 AND command != 'Sleep' AND user != 'root';
-- Copy output and run manually
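The copy-and-run step can be scripted from the shell as well. A sketch that builds KILL statements from tab-separated `mysql -N -e` output — review the generated statements before piping them back into `mysql`:

```shell
# Build KILL QUERY statements from "id<TAB>seconds" rows, e.g. produced by:
#   mysql -N -e "SELECT id, time FROM information_schema.processlist
#                WHERE command != 'Sleep' AND user != 'root'"
kill_long() {
  awk -F'\t' -v max="$1" '$2 > max { print "KILL QUERY " $1 ";" }'
}
printf '101\t12\n102\t95\n103\t340\n' | kill_long 60
# → KILL QUERY 102;
#   KILL QUERY 103;
```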
Connection counts & limits
-- Current vs max connections
SHOW STATUS LIKE 'Threads_connected';
SHOW VARIABLES LIKE 'max_connections';
-- Connection history (peak ever reached)
SHOW STATUS LIKE 'Max_used_connections';
-- Connections by user (find who's using the most)
SELECT user, COUNT(*) AS connections
FROM information_schema.processlist
GROUP BY user
ORDER BY connections DESC;
MySQL / Aurora — Locks & Transactions
MySQL▼
Active transactions & locks
-- Active InnoDB transactions
SELECT trx_id, trx_state, trx_started,
       TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS secs_running,
       trx_mysql_thread_id AS connection_id,
       trx_query
FROM information_schema.innodb_trx
ORDER BY trx_started;
-- Who is blocking whom (MySQL 5.7 / Aurora 2 —
-- on MySQL 8+ / Aurora 3 use sys.innodb_lock_waits instead;
-- information_schema.innodb_lock_waits was removed in 8.0)
SELECT r.trx_id AS waiting_trx,
       r.trx_mysql_thread_id AS waiting_conn,
       LEFT(r.trx_query, 80) AS waiting_query,
       b.trx_id AS blocking_trx,
       b.trx_mysql_thread_id AS blocking_conn,
       LEFT(b.trx_query, 80) AS blocking_query
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;
InnoDB status (deadlock + lock analysis)
-- Full InnoDB engine status (look for LATEST DETECTED DEADLOCK section)
SHOW ENGINE INNODB STATUS\G
-- Lock wait timeout setting
SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';
-- Row lock waits since restart
SHOW STATUS LIKE 'Innodb_row_lock_waits';
-- Deadlock counter (MariaDB/Percona only — stock MySQL lacks this counter)
SHOW STATUS LIKE 'Innodb_deadlocks';
MySQL / Aurora — Performance
MySQL▼
Slow query log (most valuable tool — enable in RDS parameter group)
-- Check current slow query settings
SHOW VARIABLES LIKE 'slow_query_log%';
SHOW VARIABLES LIKE 'long_query_time';
-- Enable on-the-fly (EC2/self-managed only — use param group for RDS)
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1; -- log queries > 1 second
SET GLOBAL log_queries_not_using_indexes = 'ON';
-- Top slow queries from performance_schema (available on RDS)
SELECT
LEFT(digest_text, 100) AS query_pattern,
count_star AS exec_count,
ROUND(avg_timer_wait/1000000000000, 3) AS avg_sec,
ROUND(max_timer_wait/1000000000000, 3) AS max_sec,
ROUND(sum_rows_examined / count_star) AS avg_rows_examined
FROM performance_schema.events_statements_summary_by_digest
WHERE schema_name NOT IN ('performance_schema','information_schema','mysql')
ORDER BY avg_timer_wait DESC
LIMIT 15;
EXPLAIN — query analysis
-- Basic explain
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
-- Extended (MySQL 8+) — actually runs the query and shows real timings
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
-- JSON format (most detailed)
EXPLAIN FORMAT=JSON SELECT * FROM orders WHERE customer_id = 42\G
-- What to look for:
--   type = 'ALL'           → full table scan (BAD — needs index)
--   type = 'index'         → full index scan (better but still slow on large tables)
--   type = 'ref'/'eq_ref'  → good, using index properly
--   rows > 100000          → scanning too many rows
--   Extra 'Using filesort' → sorting without index (expensive)
Buffer pool & key metrics
-- Buffer pool hit rate (should be > 99%)
SELECT
ROUND(
(1 - innodb_buffer_pool_reads / innodb_buffer_pool_read_requests) * 100, 2
) AS buffer_hit_rate_pct
FROM (
SELECT
(SELECT variable_value FROM performance_schema.global_status WHERE variable_name='Innodb_buffer_pool_reads') + 0 AS innodb_buffer_pool_reads,
(SELECT variable_value FROM performance_schema.global_status WHERE variable_name='Innodb_buffer_pool_read_requests') + 0 AS innodb_buffer_pool_read_requests
) t;
-- Current buffer pool size vs configured
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW STATUS LIKE 'Innodb_buffer_pool_bytes_data';
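The same hit-rate arithmetic works from the shell if you'd rather pull the two counters with `mysql -N -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%'"`. A sketch of just the calculation (the counter values below are fabricated):

```shell
# Buffer pool hit rate = (1 - disk_reads / read_requests) * 100.
hit_rate() {
  awk -v reads="$1" -v reqs="$2" \
    'BEGIN { printf "%.2f\n", (1 - reads / reqs) * 100 }'
}
hit_rate 2000 2000000    # → 99.90 — healthy
hit_rate 50000 2000000   # → 97.50 — working set no longer fits in RAM
```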
MySQL / Aurora — Schema & Tables
MySQL▼
-- Top 20 largest tables in a database
SELECT
table_name,
ROUND(data_length / 1024 / 1024, 1) AS data_mb,
ROUND(index_length / 1024 / 1024, 1) AS index_mb,
ROUND((data_length + index_length) / 1024 / 1024, 1) AS total_mb,
table_rows AS est_rows
FROM information_schema.tables
WHERE table_schema = DATABASE()
ORDER BY (data_length + index_length) DESC
LIMIT 20;
-- Tables with no primary key (data quality / replication risk)
SELECT table_schema, table_name
FROM information_schema.tables t
WHERE table_type = 'BASE TABLE'
AND NOT EXISTS (
SELECT 1 FROM information_schema.table_constraints tc
WHERE tc.table_schema = t.table_schema
AND tc.table_name = t.table_name
AND tc.constraint_type = 'PRIMARY KEY'
)
ORDER BY table_schema, table_name;
-- Unused indexes (rows = 0 in statistics since last restart)
SELECT
object_schema, object_name AS table_name, index_name
FROM performance_schema.table_io_waits_summary_by_index_usage
WHERE index_name IS NOT NULL
AND count_star = 0
AND object_schema NOT IN ('mysql','performance_schema','information_schema')
ORDER BY object_schema, object_name;
RDS MySQL / Aurora — Specific Checks
RDS▼
# Download slow query log from RDS
aws rds describe-db-log-files \
  --db-instance-identifier my-rds-instance \
  --filename-contains slow
aws rds download-db-log-file-portion \
  --db-instance-identifier my-rds-instance \
  --log-file-name slowquery/mysql-slowquery.log \
  --starting-token 0 --output text
# Check RDS events (restarts, failovers, parameter changes)
aws rds describe-events \
  --source-identifier my-rds-instance \
  --source-type db-instance \
  --duration 60   # last 60 minutes
# Aurora replica lag (CloudWatch reports this metric in milliseconds)
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name AuroraReplicaLag \
  --dimensions Name=DBClusterIdentifier,Value=my-aurora-cluster \
  --start-time $(date -u -d '30 min ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 60 --statistics Maximum
# Check parameter group settings
aws rds describe-db-parameters \
  --db-parameter-group-name my-param-group \
  --query 'Parameters[?ParameterValue!=`null`].[ParameterName,ParameterValue]' \
  --output table | grep -iE 'slow|connect|timeout|innodb_buffer'
PostgreSQL / Aurora PG — Active Sessions
PostgreSQL▼
Who is running what
PostgreSQL — run inside DB as superuser
-- All active queries with duration
SELECT pid, usename AS user, datname AS db,
       application_name AS app, client_addr, state,
       wait_event_type, wait_event,
       EXTRACT(EPOCH FROM (now() - query_start))::int AS secs,
       LEFT(query, 100) AS query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start NULLS LAST;
-- Long-running queries only (> 30 seconds)
SELECT pid, usename, state, wait_event,
       EXTRACT(EPOCH FROM (now() - query_start))::int AS secs,
       LEFT(query, 120) AS query
FROM pg_stat_activity
WHERE state != 'idle'
  AND query_start < now() - INTERVAL '30 seconds'
ORDER BY secs DESC;
Cancel or terminate
-- Cancel just the query (connection stays — like KILL QUERY in MySQL)
SELECT pg_cancel_backend(12345);
-- Terminate the entire connection (like KILL in MySQL)
SELECT pg_terminate_backend(12345);
-- Terminate ALL long-running queries > 5 minutes — USE WITH CARE
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state != 'idle'
  AND query_start < now() - INTERVAL '5 minutes'
  AND usename != 'rdsadmin';  -- don't kill internal RDS user
Connection counts
-- Current vs max connections
SELECT current_setting('max_connections') AS max,
COUNT(*) AS current
FROM pg_stat_activity;
-- By user and state
SELECT usename, state, COUNT(*)
FROM pg_stat_activity
GROUP BY usename, state
ORDER BY count DESC;
-- % of max_connections used
SELECT
MAX(conn) AS current_connections,
MAX(setting)::int AS max_connections,
ROUND(MAX(conn) * 100.0 / MAX(setting)::int, 1) AS pct_used
FROM
(SELECT COUNT(*) AS conn FROM pg_stat_activity) c,
pg_settings WHERE name = 'max_connections';
PostgreSQL — Locks & Blocking
PostgreSQL▼
-- Who is blocking whom (the most useful lock query)
SELECT blocked.pid AS blocked_pid,
       blocked.usename AS blocked_user,
       LEFT(blocked.query, 80) AS blocked_query,
       blocking.pid AS blocking_pid,
       blocking.usename AS blocking_user,
       LEFT(blocking.query, 80) AS blocking_query,
       EXTRACT(EPOCH FROM (now() - blocked.query_start))::int AS blocked_for_secs
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;
-- All current locks
SELECT l.pid, l.locktype, l.relation::regclass AS table_name,
       l.mode, l.granted, a.usename, LEFT(a.query, 80) AS query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.relation IS NOT NULL
ORDER BY l.granted, l.pid;
-- Lock waits summary
SELECT wait_event_type, wait_event, COUNT(*) AS count
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
GROUP BY wait_event_type, wait_event
ORDER BY count DESC;
PostgreSQL — Performance & Query Analysis
PostgreSQL▼
Top slow queries (pg_stat_statements — must be enabled)
-- Check if pg_stat_statements is enabled
SELECT * FROM pg_extension WHERE extname = 'pg_stat_statements';
-- Top 15 slowest query patterns
-- (before PostgreSQL 13 the columns are mean_time/max_time/total_time)
SELECT ROUND(mean_exec_time::numeric, 2) AS avg_ms,
       ROUND(max_exec_time::numeric, 2) AS max_ms,
       calls,
       ROUND(total_exec_time::numeric / 1000, 1) AS total_sec,
       rows,
       LEFT(query, 120) AS query
FROM pg_stat_statements
WHERE query NOT LIKE '%pg_%'
ORDER BY mean_exec_time DESC
LIMIT 15;
-- Most called queries (high frequency targets)
SELECT calls, ROUND(mean_exec_time::numeric, 2) AS avg_ms,
       LEFT(query, 100) AS query
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 10;
-- Reset stats (after fixing a query)
SELECT pg_stat_statements_reset();
EXPLAIN ANALYZE — query plan
-- Full analysis (BUFFERS shows cache hit/miss — critical for tuning)
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders
WHERE customer_id = 42 AND created_at > '2024-01-01';
-- What to look for:
--   Seq Scan on large table       → missing index
--   Rows= vs actual rows mismatch → bad statistics (run ANALYZE)
--   Buffers: shared hit=X read=Y  → Y > 0 means going to disk
--   Nested Loop with large outer  → consider Hash Join
--   Sort                          → possible index optimization
Cache hit rate
-- Overall cache hit ratio (should be > 99%)
SELECT
ROUND(
blks_hit * 100.0 / NULLIF(blks_hit + blks_read, 0), 2
) AS cache_hit_pct
FROM pg_stat_database
WHERE datname = current_database();
-- Per-table cache ratio
SELECT relname,
ROUND(heap_blks_hit * 100.0 / NULLIF(heap_blks_hit + heap_blks_read, 0), 1) AS hit_pct,
heap_blks_hit, heap_blks_read
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC LIMIT 15;
PostgreSQL — Table Bloat & Vacuum
PostgreSQL▼
⚠️Bloat is PostgreSQL's silent performance killer. Dead rows accumulate from updates/deletes and must be cleaned by VACUUM. Monitor this daily on write-heavy tables.
-- Tables with most dead rows (bloat candidates)
SELECT schemaname, relname AS table_name,
       n_live_tup AS live_rows,
       n_dead_tup AS dead_rows,
       ROUND(n_dead_tup * 100.0 / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
       last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC
LIMIT 20;
-- When was the last vacuum/analyze run?
SELECT relname, last_vacuum, last_autovacuum,
       last_analyze, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY last_autovacuum NULLS FIRST
LIMIT 20;
-- Manually run vacuum on a bloated table
VACUUM ANALYZE orders;
VACUUM VERBOSE orders;  -- shows what it's doing
-- Full vacuum (rewrites table, reclaims space — blocks reads/writes!)
-- Only use during maintenance window
VACUUM FULL orders;
PostgreSQL — Schema, Tables & Indexes
PostgreSQL▼
-- Largest tables (use relid — quote_ident(relname) breaks outside search_path)
SELECT schemaname, relname AS table_name,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
       pg_size_pretty(pg_relation_size(relid)) AS table_size,
       pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) AS index_size
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 20;
-- Unused indexes (never scanned since last stats reset)
SELECT schemaname, relname AS table_name, indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan AS times_used
FROM pg_stat_user_indexes
JOIN pg_index USING (indexrelid)
WHERE idx_scan = 0
  AND NOT indisprimary
  AND NOT indisunique
ORDER BY pg_relation_size(indexrelid) DESC;
-- Missing indexes (sequential scans on large tables)
SELECT schemaname, relname AS table_name, seq_scan, idx_scan,
       ROUND(seq_scan * 100.0 / NULLIF(seq_scan + idx_scan, 0), 1) AS seq_scan_pct
FROM pg_stat_user_tables
WHERE seq_scan > 1000
ORDER BY seq_scan DESC
LIMIT 15;
Backup & Restore
RDS · Aurora · EC2▼
RDS / Aurora — snapshots via AWS CLI (managed, preferred)
# Create manual snapshot before risky operations
aws rds create-db-snapshot \
  --db-instance-identifier my-rds \
  --db-snapshot-identifier pre-migration-$(date +%Y%m%d-%H%M)
# List available snapshots
aws rds describe-db-snapshots \
  --db-instance-identifier my-rds \
  --query 'DBSnapshots[*].[DBSnapshotIdentifier,SnapshotCreateTime,Status,AllocatedStorage]' \
  --output table
# Restore snapshot to a NEW instance (never restore over existing prod)
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier my-rds-restored \
  --db-snapshot-identifier pre-migration-20240115-1430 \
  --db-instance-class db.r6g.large \
  --no-publicly-accessible
# Check restore progress
aws rds describe-db-instances \
  --db-instance-identifier my-rds-restored \
  --query 'DBInstances[0].[DBInstanceStatus,PercentProgress]' \
  --output text
# Aurora cluster snapshot
aws rds create-db-cluster-snapshot \
  --db-cluster-identifier my-aurora-cluster \
  --db-cluster-snapshot-identifier pre-migration-$(date +%Y%m%d-%H%M)
# Restore Aurora snapshot to new cluster
aws rds restore-db-cluster-from-snapshot \
  --db-cluster-identifier my-aurora-restored \
  --snapshot-identifier pre-migration-20240115-1430 \
  --engine aurora-mysql
PostgreSQL — pg_dump (EC2-hosted or export from RDS)
⚠️For RDS, use snapshots first. pg_dump is for single-table exports, cross-account migrations, and dev/test refreshes — not full production backup replacement.
# Dump single database (custom format, compressed)
pg_dump -h rds-endpoint.amazonaws.com -U masteruser \
  -Fc -d mydb -f mydb_$(date +%Y%m%d).dump
# Parallel dump requires directory format (-j does NOT work with -Fc)
pg_dump -h rds-endpoint.amazonaws.com -U masteruser \
  -Fd -j 4 -d mydb -f mydb_$(date +%Y%m%d).dir
# Dump single table only
pg_dump -h rds-endpoint.amazonaws.com -U masteruser \
  -Fc -t public.orders -d mydb -f orders_backup.dump
# Restore to target DB (--no-owner prevents ownership errors; -j works here)
pg_restore -h target-rds.amazonaws.com -U masteruser \
  -d mydb --no-owner --no-acl -j 4 mydb_20240115.dump
# Restore single table into existing DB
pg_restore -h target-rds.amazonaws.com -U masteruser \
  -d mydb --no-owner -t orders orders_backup.dump
# Upload dump to S3 immediately after
aws s3 cp mydb_$(date +%Y%m%d).dump s3://my-backups-bucket/postgres/
# Quick schema-only export (structure without data)
pg_dump -h rds-endpoint.amazonaws.com -U masteruser \
  --schema-only -d mydb -f schema_$(date +%Y%m%d).sql
MySQL / Aurora MySQL — mysqldump
# Full database dump
mysqldump -h rds-endpoint.amazonaws.com -u admin -p \
  --single-transaction --routines --triggers \
  mydb > mydb_$(date +%Y%m%d).sql
# Single table
mysqldump -h rds-endpoint.amazonaws.com -u admin -p \
  --single-transaction mydb orders > orders_backup.sql
# Compressed dump piped directly to S3
mysqldump -h rds-endpoint.amazonaws.com -u admin -p \
  --single-transaction mydb \
  | gzip | aws s3 cp - s3://my-backups/mysql/mydb_$(date +%Y%m%d).sql.gz
# Restore
mysql -h target-rds.amazonaws.com -u admin -p mydb < mydb_20240115.sql
# Restore from S3 directly
aws s3 cp s3://my-backups/mysql/mydb_20240115.sql.gz - \
  | gunzip | mysql -h target-rds.amazonaws.com -u admin -p mydb
Point-in-Time Recovery (PITR) — RDS / Aurora
# Restore to specific point in time (e.g. just before a bad migration)
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier my-rds \
--target-db-instance-identifier my-rds-pitr \
--restore-time 2024-01-15T14:30:00Z \
--db-instance-class db.r6g.large \
--no-publicly-accessible
# Check what the earliest restore point is
aws rds describe-db-instances \
--db-instance-identifier my-rds \
--query 'DBInstances[0].{Earliest:EarliestRestorableTime,Latest:LatestRestorableTime}' \
--output table
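Getting `--restore-time` right under pressure is error-prone; GNU date can do the arithmetic for you. A sketch (assumes GNU date, as on Amazon Linux/Ubuntu — BSD/macOS date uses different flags):

```shell
# Compute a UTC restore time N minutes before a known-bad timestamp.
restore_time() {
  date -u -d "$1 - $2 minutes" +%Y-%m-%dT%H:%M:%SZ
}
restore_time '2024-01-15T14:45:00Z' 15
# → 2024-01-15T14:30:00Z
# Then pass the result to:
#   aws rds restore-db-instance-to-point-in-time ... --restore-time <that>
```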
Index Maintenance
MySQL · PostgreSQL · Redshift · Snowflake▼
PostgreSQL / Aurora PG — non-blocking index operations
🚨CREATE INDEX without CONCURRENTLY locks the table. On a production table with millions of rows, this means downtime. Always use CONCURRENTLY on live tables.
-- Create index WITHOUT table lock (takes longer but safe in prod)
CREATE INDEX CONCURRENTLY idx_orders_customer_id
ON public.orders (customer_id);
-- Create partial index (smaller, faster for filtered queries)
CREATE INDEX CONCURRENTLY idx_orders_pending
ON public.orders (created_at)
WHERE status = 'pending';
-- Drop index without lock
DROP INDEX CONCURRENTLY idx_orders_old;
-- Rebuild bloated index without lock
REINDEX INDEX CONCURRENTLY idx_orders_customer_id;
-- Rebuild all indexes on a table (PostgreSQL 12+)
REINDEX TABLE CONCURRENTLY public.orders;
-- Check index bloat (indexes > 30% bloat are candidates for rebuild)
SELECT
indexrelname AS index_name,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
idx_scan AS times_used,
ROUND(pg_relation_size(indexrelid) * 100.0 /
NULLIF(pg_total_relation_size(indrelid), 0), 1) AS pct_of_table
FROM pg_stat_user_indexes
JOIN pg_index USING (indexrelid)
WHERE pg_relation_size(indexrelid) > 10 * 1024 * 1024 -- > 10MB
ORDER BY pg_relation_size(indexrelid) DESC;
MySQL / Aurora MySQL
-- Add index online (ALGORITHM=INPLACE avoids full table copy for most cases)
ALTER TABLE orders
  ADD INDEX idx_orders_customer_id (customer_id),
  ALGORITHM=INPLACE, LOCK=NONE;
-- Drop unused index online
ALTER TABLE orders
  DROP INDEX idx_orders_old,
  ALGORITHM=INPLACE, LOCK=NONE;
-- Optimize table (reclaims space + rebuilds indexes — avoid on large tables in prod)
-- For InnoDB this is a full table rebuild. Schedule during a maintenance window.
OPTIMIZE TABLE orders;
-- Check index size (pages × 16KB default InnoDB page size)
SELECT table_name, index_name, stat_value AS pages,
       ROUND(stat_value * 16 / 1024, 1) AS size_mb
FROM mysql.innodb_index_stats
WHERE stat_name = 'size' AND database_name = DATABASE()
ORDER BY stat_value DESC
LIMIT 20;
-- Show all indexes on a table
SHOW INDEX FROM orders;
-- Force index usage for testing (remove in prod)
SELECT * FROM orders FORCE INDEX (idx_orders_customer_id)
WHERE customer_id = 42;
Redshift — sort keys & vacuum
💡Redshift uses sort keys instead of B-tree indexes. VACUUM reclaims deleted space and re-sorts blocks. Run after large DELETE/UPDATE operations.
-- Check tables needing vacuum (unsorted rows & deleted rows)
SELECT "table", unsorted, stats_off, size AS size_mb,
tbl_rows, estimated_visible_rows,
vacuum_sort_benefit
FROM SVV_TABLE_INFO
WHERE unsorted > 5 OR stats_off > 20
ORDER BY vacuum_sort_benefit DESC
LIMIT 20;
-- Vacuum specific table (reclaim space + resort)
VACUUM orders;
-- Vacuum + update statistics
VACUUM orders TO 75 PERCENT;
ANALYZE orders;
-- Full vacuum all tables (run during off-hours)
VACUUM FULL;
-- Check distkey & sortkey configuration
SELECT tablename, "column", sortkey, distkey, encoding
FROM pg_table_def
WHERE schemaname = 'public'
AND tablename = 'orders'
ORDER BY sortkey NULLS LAST;
Snowflake — clustering & search optimization
-- Check clustering quality (high average_depth in the output = re-cluster needed)
SELECT SYSTEM$CLUSTERING_INFORMATION('orders', '(order_date)');
-- Manually trigger re-clustering (background, non-blocking)
ALTER TABLE orders CLUSTER BY (order_date, customer_id);
-- Enable search optimization (for equality/range lookups on large tables)
ALTER TABLE orders ADD SEARCH OPTIMIZATION;
-- Check search optimization status & cost
SELECT * FROM TABLE(INFORMATION_SCHEMA.SEARCH_OPTIMIZATION_HISTORY(
  TABLE_NAME => 'orders'))
ORDER BY start_time DESC LIMIT 10;
-- Show current clustering keys
SHOW TABLES LIKE 'orders';
Data Validation SQL
Run after every pipeline▼
💡These are the 6 checks you run after any pipeline to confirm it produced correct results — not just that it finished without error. Works on MySQL, PostgreSQL, Redshift, and Snowflake unless noted.
1. Row count — source vs target
-- Compare source vs target counts for today's load
-- (alias is row_count — ROWS is a reserved word in MySQL 8+)
SELECT 'source' AS layer, COUNT(*) AS row_count
FROM source_db.orders
WHERE DATE(created_at) = CURRENT_DATE
UNION ALL
SELECT 'target' AS layer, COUNT(*) AS row_count
FROM target_db.orders
WHERE DATE(created_at) = CURRENT_DATE;
-- Row count trend (detect sudden drops to 0 or 10x spikes)
SELECT DATE(created_at) AS day,
       COUNT(*) AS row_count,
       COUNT(*) - LAG(COUNT(*)) OVER (ORDER BY DATE(created_at)) AS day_over_day_delta
FROM orders
WHERE created_at >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY DATE(created_at)
ORDER BY day;
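In an orchestrated pipeline the same comparison becomes a pass/fail gate. A hedged sketch of just the comparison logic — the two counts would come from the queries above via `psql -t` or `mysql -N`, and `counts_match` is a hypothetical helper, not a standard tool:

```shell
# Fail (exit 1) when target is more than <pct>% short of source.
counts_match() {
  awk -v src="$1" -v tgt="$2" -v pct="$3" \
    'BEGIN { exit ((src - tgt) * 100 > src * pct) ? 1 : 0 }'
}
counts_match 100000 99950 1 && echo OK         # 0.05% short → OK
counts_match 100000 90000 1 || echo MISMATCH   # 10% short → MISMATCH
```

Wire the non-zero exit into your orchestrator (Airflow task failure, Step Functions catch) so a silent partial load stops the DAG instead of propagating downstream.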
2. Null rate per column
-- Null rate for key columns (adjust column list to your table)
SELECT COUNT(*) AS total_rows,
       ROUND(COUNT(*) FILTER (WHERE customer_id IS NULL) * 100.0 / COUNT(*), 2) AS customer_id_null_pct,
       ROUND(COUNT(*) FILTER (WHERE order_total IS NULL) * 100.0 / COUNT(*), 2) AS order_total_null_pct,
       ROUND(COUNT(*) FILTER (WHERE created_at IS NULL) * 100.0 / COUNT(*), 2) AS created_at_null_pct,
       ROUND(COUNT(*) FILTER (WHERE status IS NULL) * 100.0 / COUNT(*), 2) AS status_null_pct
FROM orders
WHERE DATE(created_at) = CURRENT_DATE;
-- MySQL / Redshift / Snowflake version (no FILTER support):
SELECT COUNT(*) AS total,
       ROUND(SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS customer_id_null_pct,
       ROUND(SUM(CASE WHEN order_total IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS order_total_null_pct
FROM orders
WHERE DATE(created_at) = CURRENT_DATE;
3. Duplicate key detection
-- Find duplicate primary keys (should return 0 rows if pipeline is correct)
SELECT order_id, COUNT(*) AS occurrences
FROM orders
WHERE DATE(created_at) = CURRENT_DATE
GROUP BY order_id
HAVING COUNT(*) > 1
ORDER BY occurrences DESC
LIMIT 20;
-- Count of duplicates only (quick check)
SELECT COUNT(*) AS duplicate_keys
FROM (
  SELECT order_id
  FROM orders
  WHERE DATE(created_at) = CURRENT_DATE
  GROUP BY order_id
  HAVING COUNT(*) > 1
) dupes;
4. Data freshness — last record loaded
-- When was the last record actually loaded? (staleness check)
SELECT
MAX(created_at) AS latest_record,
NOW() - MAX(created_at) AS lag,
CASE
WHEN NOW() - MAX(created_at) > INTERVAL '2 hours' THEN '⚠ STALE'
ELSE '✓ FRESH'
END AS status
FROM orders;
-- Snowflake version
SELECT
MAX(created_at) AS latest_record,
DATEDIFF('minute', MAX(created_at), NOW()) AS lag_minutes,
CASE WHEN DATEDIFF('minute', MAX(created_at), NOW()) > 120
THEN 'STALE' ELSE 'FRESH' END AS status
FROM orders;
5. Referential integrity check
-- Orders with no matching customer (orphaned foreign keys)
SELECT COUNT(*) AS orphaned_orders
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
WHERE c.id IS NULL;
-- Show sample orphans
SELECT o.order_id, o.customer_id, o.created_at
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
WHERE c.id IS NULL
LIMIT 10;
6. Value distribution sanity check
-- Distribution of status values (unexpected values = upstream change)
SELECT status, COUNT(*) AS count,
       ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 1) AS pct
FROM orders
WHERE DATE(created_at) = CURRENT_DATE
GROUP BY status
ORDER BY count DESC;
-- Min / max / avg sanity check (detect negative prices, future dates, etc.)
SELECT MIN(order_total) AS min_total,
       MAX(order_total) AS max_total,
       ROUND(AVG(order_total), 2) AS avg_total,
       MIN(created_at) AS earliest,
       MAX(created_at) AS latest
FROM orders
WHERE DATE(created_at) = CURRENT_DATE;
PgBouncer & RDS Proxy — Connection Pooler Status
PostgreSQL▼
💡When you hit max_connections on RDS PostgreSQL, the pooler itself is often the bottleneck — not the database. Check the pooler first before scaling the RDS instance.
PgBouncer — connect to admin console
# Connect to PgBouncer admin console (port 6432 by default)
psql -h localhost -p 6432 -U pgbouncer pgbouncer
# Or from inside Docker
docker exec -it pgbouncer psql -h localhost -p 6432 -U pgbouncer pgbouncer
PgBouncer — key status queries
Run inside PgBouncer admin console
-- Pool status: cl_active = clients using connection, sv_idle = free server connections
SHOW POOLS;
-- Key columns:
--   cl_active  = clients currently executing a query
--   cl_waiting = clients WAITING for a connection (THIS IS BAD — means pool is full)
--   sv_idle    = server connections idle (available)
--   sv_active  = server connections in use
--   maxwait    = seconds oldest waiting client has been waiting (> 5s = problem)
-- All active client connections
SHOW CLIENTS;
-- Server connections to the real DB
SHOW SERVERS;
-- Throughput and latency stats
SHOW STATS;
-- avg_query_time in microseconds — divide by 1000 for ms
-- Current configuration
SHOW CONFIG;
-- Reload config without restart (after editing pgbouncer.ini)
RELOAD;
-- Pause a database (graceful — finish current queries first)
PAUSE mydb;
-- Resume
RESUME mydb;
PgBouncer — diagnose pool exhaustion
-- cl_waiting > 0 means clients are queuing for a DB connection
-- This query shows it clearly
SHOW POOLS;
-- Common causes and fixes:
-- 1. pool_size too small → increase pool_size in pgbouncer.ini
-- 2. Long-running transactions holding connections →
-- check pg_stat_activity for 'idle in transaction'
-- 3. Wrong pool mode → use 'transaction' mode for most apps, not 'session'
SHOW CONFIG; -- check pool_mode
-- Current connections consumed at the DB level
-- (run in PostgreSQL, not PgBouncer — pooled connections arrive from the
--  PgBouncer host, so filter on its client_addr, not application_name)
SELECT COUNT(*) AS pooler_connections,
       current_setting('max_connections') AS max_connections
FROM pg_stat_activity
WHERE client_addr = '10.0.0.5';  -- replace with your PgBouncer host IP
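For alerting, you can parse `psql -h localhost -p 6432 -U pgbouncer -Atc 'SHOW POOLS'` output and flag queuing. A sketch over fabricated rows — the field order below (database|user|cl_active|cl_waiting) is an assumption; it varies across PgBouncer versions, so check your `SHOW POOLS` header first:

```shell
# Flag pools where clients are queuing (cl_waiting > 0).
# Input: psql -At output, '|'-separated: database|user|cl_active|cl_waiting
pool_waiters() {
  awk -F'|' '$4 > 0 { print $1 ": " $4 " clients waiting" }'
}
printf 'mydb|app|18|0\nmydb|etl|20|7\n' | pool_waiters
# → mydb: 7 clients waiting
```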
RDS Proxy — status via AWS CLI
# List all RDS proxies
aws rds describe-db-proxies \
  --query 'DBProxies[*].[DBProxyName,Status,Endpoint,EngineFamily]' \
  --output table
# Check proxy target health (is it actually connected to the DB?)
aws rds describe-db-proxy-targets \
  --db-proxy-name my-rds-proxy \
  --query 'Targets[*].[Endpoint,Port,TargetHealth.State,TargetHealth.Reason]' \
  --output table
# TargetHealth.State should be AVAILABLE
# If UNAVAILABLE, the Reason field shows what's wrong
# Connection pool metrics via CloudWatch
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name ClientConnections \
  --dimensions Name=ProxyName,Value=my-rds-proxy \
  --start-time $(date -u -d '30 min ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 60 --statistics Average --output table
# Other key RDS Proxy metrics to watch:
#   DatabaseConnectionsCurrentlyBorrowed → connections in use
#   DatabaseConnectionsBorrowLatency     → how long to get a connection (ms)
#   QueryDatabaseResponseLatency         → DB response time through proxy
#   ClientConnections                    → total clients connected to proxy
☁️ ETL & AWS
Docker, S3, Glue, CloudWatch — the daily AWS data engineering toolkit.
Docker — Daily Commands
Docker▼
# Start / stop stack
docker compose up -d                  # start all services in background
docker compose up --build -d          # rebuild images before starting
docker compose down                   # stop and remove containers + networks
docker compose restart <service>      # restart one service only
# What's running?
docker ps                             # running containers
docker ps -a                          # all containers including stopped
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
# Logs — most used commands daily
docker logs -f <container>            # follow logs (Ctrl+C to stop)
docker logs --tail 200 <container>    # last 200 lines only
docker logs --since 30m <container>   # last 30 minutes of logs
docker logs <container> 2>&1 | grep -i error   # grep for errors only
# Resource usage
docker stats                          # live CPU/RAM per container
docker stats --no-stream              # snapshot (non-interactive)
Docker — Debug & Inspect
Docker▼
# Get a shell inside a container
docker exec -it <container> bash      # bash shell
docker exec -it <container> sh        # sh if bash not available
# Check env vars inside container (crucial for debugging config)
docker exec -it <container> env | sort
# Connect to Postgres inside container
docker exec -it <container> psql -U postgres -d mydb
# Test DB port from inside another container
docker exec -it <container> bash -c '</dev/tcp/db-host/5432' && echo "Port OPEN"
# Copy files in and out
docker cp ./config.yaml <container>:/app/config.yaml   # host → container
docker cp <container>:/app/output.csv ./output.csv     # container → host
# Inspect container details (networking, mounts, env)
docker inspect <container>
docker inspect <container> | grep -E 'IPAddress|Mounts|Env' -A 5
# Check container IP
docker inspect --format '{{.NetworkSettings.IPAddress}}' <container>
# Follow multiple containers at once (docker-compose)
docker compose logs -f --tail 50
Docker — Cleanup
Docker▼
⚠️
docker volume prune deletes database data. Always check which volumes exist before pruning.
docker system prune                   # remove stopped containers, dangling images, unused networks
docker image prune -a                 # remove ALL unused images (frees most space)
docker container prune -f             # remove all stopped containers
docker volume ls                      # LIST volumes before pruning
docker volume prune                   # remove unused volumes (DANGEROUS if DB volume)
docker system df                      # show disk usage by Docker components
docker compose down --remove-orphans  # removes leftover containers hogging ports
Docker — Emergency & Performance
Docker▼
When your laptop fans are screaming
# Live resource monitor — shows which container eats RAM (Spark/JVM culprit)
docker stats
docker stats --no-stream              # snapshot, non-interactive
# See actual Linux processes inside the container
docker top <name>
docker top <name> aux                 # full ps output
# Freeze container CPU without killing it (temporary relief)
docker pause <name>
docker unpause <name>
# Wait for container to finish and capture exit code (CI/CD scripts)
docker wait <name>                    # blocks, then prints the container's exit code
Network testing — the "Swiss Army Knife"
💡Run this from inside your ETL container. "Connected" = network OK, your Python code is the problem. "Hangs" = Docker network or Security Group blocking you.
# Test TCP connectivity from inside a container (verbose output)
docker exec -it <etl_container> curl -v telnet://db-hostname:5432
docker exec -it <etl_container> curl -v telnet://rds-endpoint.amazonaws.com:3306
# Alternative: /dev/tcp method (no curl needed)
docker exec -it <name> timeout 3 bash -c '</dev/tcp/db-host/5432' \
  && echo "Port OPEN" || echo "Port CLOSED"
# nc (netcat) — test from EC2 / local shell
nc -zv rds-endpoint.amazonaws.com 5432        # PostgreSQL / Aurora PG
nc -zv rds-endpoint.amazonaws.com 3306        # MySQL / Aurora MySQL
nc -zv redshift-endpoint.amazonaws.com 5439   # Redshift
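The /dev/tcp trick above is easy to wrap in a small reusable helper. A minimal sketch, assuming bash and coreutils `timeout`; `port_check` is a made-up name, not a standard tool:

```shell
# port_check — tiny wrapper around the bash /dev/tcp trick shown above.
# Usage: port_check HOST PORT [TIMEOUT_SECS]
port_check() {
  local host=$1 port=$2 t=${3:-3}
  if timeout "$t" bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "OPEN"
  else
    echo "CLOSED"
  fi
}
port_check 127.0.0.1 22   # prints OPEN or CLOSED
```

Drop it in .bashrc next to the aliases below and you get a one-word connectivity probe on any box with bash.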
ECR — push image to AWS
# Login to ECR (do this first — token expires in 12h)
aws ecr get-login-password --region us-east-1 | docker login \
  --username AWS --password-stdin \
  <account_id>.dkr.ecr.us-east-1.amazonaws.com
# Tag and push
docker tag my-image:latest <account_id>.dkr.ecr.us-east-1.amazonaws.com/my-repo:latest
docker push <account_id>.dkr.ecr.us-east-1.amazonaws.com/my-repo:latest
# Pull from ECR (e.g. on ECS/EC2 worker)
docker pull <account_id>.dkr.ecr.us-east-1.amazonaws.com/my-repo:latest
Useful aliases — add to .bashrc / .zshrc
alias dps="docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}'"
alias dlogs="docker logs -f --tail 50"
alias dclean="docker system prune -f"
alias dstop="docker compose down --remove-orphans"
AWS S3 — Operations
AWS▼
# Identity check first (always verify before touching prod)
aws sts get-caller-identity --profile production
# List and navigate
aws s3 ls s3://bucket-name/path/
aws s3 ls s3://bucket-name/ --recursive --human-readable --summarize
# Copy and sync
aws s3 cp local_file.csv s3://bucket-name/prefix/
aws s3 cp s3://bucket-name/file.csv ./local_file.csv
# ALWAYS dry-run sync before executing in production
aws s3 sync ./data/ s3://bucket-name/data/ --dryrun
aws s3 sync ./data/ s3://bucket-name/data/    # remove --dryrun when sure
# S3 to S3 copy (fast, server-side)
aws s3 sync s3://source-bucket/path/ s3://dest-bucket/path/
# Delete (careful!)
aws s3 rm s3://bucket-name/file.csv
aws s3 rm s3://bucket-name/prefix/ --recursive --dryrun   # dryrun first
# Object metadata (size, last modified, content-type)
aws s3api head-object --bucket bucket-name --key path/file.csv
# Check bucket region (important for cross-region latency)
aws s3api get-bucket-location --bucket bucket-name
# Secrets (use instead of hardcoding credentials)
aws secretsmanager get-secret-value --secret-id my/db/creds --query SecretString --output text
aws ssm get-parameter --name /myapp/db_password --with-decryption --query Parameter.Value --output text
S3 Diagnostics — the "Source" check
# Find the LATEST file in a prefix (root cause: "file not found")
aws s3 ls s3://my-bucket/prefix/ --recursive | sort | tail -n 5
# Check file size and metadata (root cause: "empty file / zero bytes")
aws s3api head-object --bucket my-bucket --key path/to/file.csv
# Returns: ContentLength, LastModified, ContentType, ETag
# Quick preview of a remote file WITHOUT downloading (root cause: "schema mismatch")
aws s3 cp s3://my-bucket/file.csv - | head -n 5
# Count objects in a prefix
aws s3 ls s3://my-bucket/prefix/ --recursive | wc -l
# Find files modified in the last 24h
aws s3 ls s3://my-bucket/prefix/ --recursive \
| awk '{print $1, $2, $4}' \
| awk -v d="$(date -d '1 day ago' '+%Y-%m-%d')" '$1 >= d'
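The "last 24h" filter above is plain text processing, so it can be sanity-checked offline against canned `aws s3 ls --recursive` output before it gates anything in production. A sketch; the object names and dates are made up:

```shell
# Verify the 'modified since' filter against canned listing output.
sample_listing() {
  cat <<'EOF'
2024-01-14 09:12:01     104857 prefix/old_file.csv
2024-01-16 03:44:59    2097152 prefix/new_file.csv
EOF
}
filter_since() {  # $1 = YYYY-MM-DD cutoff
  awk '{print $1, $2, $4}' | awk -v d="$1" '$1 >= d'
}
sample_listing | filter_since 2024-01-15   # keeps only the newer object
```

The string comparison works because ISO dates sort lexicographically; the same idea extends to hourly cutoffs if you include the time column.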
RDS via AWS CLI
AWS▼
# List RDS instances and their status
aws rds describe-db-instances \
  --query 'DBInstances[*].[DBInstanceIdentifier,DBInstanceStatus,DBInstanceClass,Engine]' \
  --output table
# RDS events (restarts, failovers, maintenance)
aws rds describe-events \
  --source-identifier my-rds-instance \
  --source-type db-instance \
  --duration 120   # last 2 hours
# List available log files
aws rds describe-db-log-files --db-instance-identifier my-rds-instance
# Download a log file
aws rds download-db-log-file-portion \
  --db-instance-identifier my-rds-instance \
  --log-file-name error/mysql-error.log \
  --starting-token 0 --output text > mysql-error.log
# Check parameter group (find key settings)
aws rds describe-db-parameters \
  --db-parameter-group-name my-pg \
  --query 'Parameters[?ParameterValue!=`null`].[ParameterName,ParameterValue]' \
  --output table | grep -iE 'slow|connect|timeout|buffer|innodb'
# Trigger a manual snapshot
aws rds create-db-snapshot \
  --db-instance-identifier my-rds-instance \
  --db-snapshot-identifier manual-snap-$(date +%Y%m%d-%H%M)
Glue & Step Functions
AWS▼
# List Glue jobs
aws glue list-jobs --output table
# Check last 5 runs of a job
aws glue get-job-runs --job-name my-etl-job \
  --query 'JobRuns[0:5].[JobRunState,StartedOn,CompletedOn,ErrorMessage]' \
  --output table
# Get error message for a failed run
aws glue get-job-run \
  --job-name my-etl-job \
  --run-id jr_abc123 \
  --query 'JobRun.ErrorMessage' --output text
# Start a job manually
aws glue start-job-run --job-name my-etl-job
# Reset bookmark (re-process all data from beginning)
aws glue reset-job-bookmark --job-name my-etl-job
# Get Glue error logs from CloudWatch
aws logs get-log-events \
  --log-group-name /aws-glue/jobs/error \
  --log-stream-name my-etl-job/jr_abc123 \
  --query 'events[*].message' --output text | tail -50
# Step Functions — list and check executions
aws stepfunctions list-state-machines --output table
aws stepfunctions list-executions \
  --state-machine-arn arn:aws:states:us-east-1:123456:stateMachine:MyPipeline \
  --status-filter FAILED --max-results 10 --output table
# Get execution history for a failed run
aws stepfunctions get-execution-history \
  --execution-arn arn:aws:states:us-east-1:123456:execution:MyPipeline:run-id \
  --query 'events[?type==`ExecutionFailed`]' --output json
CloudWatch — Key RDS Metrics
AWS▼
# Helper function for cleaner CW queries
function cw_metric() {
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name "$1" \
--dimensions Name=DBInstanceIdentifier,Value=my-rds-instance \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 --statistics "$2" \
--query 'sort_by(Datapoints, &Timestamp)[-1].['$2']' \
--output text
}
# Run key metrics
echo "CPU%: $(cw_metric CPUUtilization Average)"
echo "Connections: $(cw_metric DatabaseConnections Average)"
echo "Free Storage: $(cw_metric FreeStorageSpace Average) bytes"
echo "Read Latency: $(cw_metric ReadLatency Average) sec"
echo "Write Latency: $(cw_metric WriteLatency Average) sec"
echo "Swap Used: $(cw_metric SwapUsage Average) bytes"
echo "Replica Lag: $(cw_metric AuroraReplicaLag Maximum) ms"
# CloudWatch Logs — Glue / Lambda
aws logs describe-log-groups --query 'logGroups[*].logGroupName' --output text | tr '\t' '\n' | grep -i glue
aws logs filter-log-events \
--log-group-name /aws/lambda/my-function \
--filter-pattern "ERROR" \
--start-time $(date -d '1 hour ago' +%s000) \
--query 'events[*].message' --output text
IAM & Auth Debugging
AWS▼
# Who am I? (run this before any production operation)
aws sts get-caller-identity
aws configure list
# Check Docker clock drift (causes AWS auth errors — common on long-running containers)
date && docker exec <container> date
# Simulate permissions (check before hitting access denied)
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::123456789:role/GlueExecutionRole \
  --action-names s3:GetObject s3:PutObject glue:StartJobRun \
  --resource-arns "arn:aws:s3:::my-bucket/*"
# List policies on a role
aws iam list-attached-role-policies --role-name GlueExecutionRole
aws iam list-role-policies --role-name GlueExecutionRole   # inline policies
# VPC / Security group — test connectivity
aws ec2 describe-security-groups \
  --filters "Name=group-name,Values=rds-sg" \
  --query 'SecurityGroups[*].IpPermissions'
Linux / EC2 — File & Data Operations
Linux▼
# Row count, preview, encoding check
wc -l dataset.csv
head -5 dataset.csv
tail -5 dataset.csv
file dataset.csv # detect encoding
head -20 dataset.csv | cat -A # show hidden chars (^M = Windows line endings)
# Column processing
cut -d',' -f1-5 dataset.csv # first 5 columns
awk -F',' '{print NF}' data.csv | sort | uniq -c # column count consistency check
awk -F',' 'NR==1{print; next} $3=="ERROR"' data.csv # header + filter rows
# Find files
find /data -name "*.parquet" -mtime -1 # parquet files modified in last day
find /data -type f -size +1G # files larger than 1GB
grep -r "customer_id" /pipeline/sql/ # search inside files recursively
# Checksums (data integrity)
md5sum dataset.csv
sha256sum dataset.csv
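The column-count consistency check above is worth seeing end-to-end; here it is on a throwaway CSV with one deliberately ragged row (file name and contents are illustrative):

```shell
# Demonstrate the column-consistency check on a throwaway CSV.
tmp=$(mktemp /tmp/demo_XXXXXX.csv)
printf 'id,name,amount\n1,alice,10\n2,bob\n3,carol,30\n' > "$tmp"
# One distinct field count = healthy; more than one = ragged rows
awk -F',' '{print NF}' "$tmp" | sort | uniq -c
# Pinpoint the offending line numbers (header has 3 fields here)
awk -F',' 'NF != 3 {print NR": "$0}' "$tmp"
rm -f "$tmp"
```

The first awk tells you *whether* the file is ragged; the second tells you *where*, which is what you paste into the ticket.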
Transfer & Compression
Linux▼
# Reliable large transfers (resumes on failure)
rsync -avP --progress source/ destination/
rsync -avP /data user@ec2-host:/backup/
# Compression
gzip large_file.csv                  # standard, single-threaded
pigz large_file.csv                  # parallel gzip (much faster on multi-core)
gunzip file.csv.gz
tar -czvf archive.tar.gz ./folder/
tar -xzvf archive.tar.gz
# Progress monitoring for large operations
pv large_file.csv | gzip > file.csv.gz
pv large_file.csv | wc -l            # count rows while piping
Long Jobs & Persistent Sessions
Linux▼
# tmux — persistent sessions (survives SSH disconnect)
tmux new -s etl-run                  # new named session
tmux attach -t etl-run               # reconnect
tmux ls                              # list sessions
# Inside tmux: Ctrl+B, D = detach (leave running)
# nohup — simple background job
nohup python3 pipeline.py > pipeline.log 2>&1 &
echo $!                              # save the PID
# Monitor a long-running job
watch -n 5 "ps aux | grep pipeline.py"   # check every 5 seconds
watch -n 10 df -h                    # disk space every 10 seconds
tail -f pipeline.log                 # live log output
Redshift — Diagnostics
AWS▼
Load errors — debug COPY command failures
Run in Redshift SQL editor
-- COPY errors (essential when COPY command fails silently)
SELECT query, starttime, filename, line_number,
colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;
-- Detailed COPY history
SELECT query, trim(filename) AS file, lines_scanned,
lines_loaded, rows_loaded, is_partial
FROM stl_load_commits
ORDER BY query DESC LIMIT 20;
Hanging queries & locks
-- Currently running queries (inflight)
SELECT pid, user_name, start_time, LEFT(query, 100) AS query
FROM stv_inflight
ORDER BY start_time;
-- Kill a stuck query
CANCEL <pid>;
-- Table locks
SELECT a.txn_owner, a.txn_db, a.table_id, b.relname,
a.lock_mode, a.granted
FROM svv_transactions a
JOIN pg_class b ON a.table_id = b.oid
WHERE a.granted IS FALSE;
-- Long-running queries (> 5 minutes)
-- (STV_RECENTS tracks current activity; duration is in microseconds)
SELECT pid, user_name, duration, LEFT(query, 120) AS query
FROM stv_recents
WHERE status = 'Running'
  AND duration > 300000000 -- 300 seconds
ORDER BY duration DESC LIMIT 10;
Port connectivity test
nc -zv your-redshift-cluster.region.redshift.amazonaws.com 5439
# "Connection to ... succeeded" = network OK, check credentials
# "Connection refused / timed out" = Security Group or VPC blocking
Airflow — CLI Diagnostics
Orchestration▼
Task and DAG status
# Get state of a specific task run (faster than opening the UI)
airflow tasks state <dag_id> <task_id> <execution_date>
# e.g.: airflow tasks state my_etl_dag extract_data 2024-01-15
# List task instances for a DAG
airflow tasks list <dag_id> --tree
# Test a task without marking it in the metadata DB (safe for debugging)
airflow tasks test <dag_id> <task_id> <execution_date>
Fix stuck or failed tasks
# Clear failed tasks to re-run (quick fix)
airflow tasks clear <dag_id> \
  --start-date 2024-01-15 \
  --end-date 2024-01-15
# Clear + confirm without prompt (-y)
airflow tasks clear -y <dag_id> -s 2024-01-15 -e 2024-01-15
# Trigger a DAG run manually
airflow dags trigger <dag_id>
# Pause / unpause a DAG
airflow dags pause <dag_id>
airflow dags unpause <dag_id>
Scheduler & metadata DB health
# Check metadata DB connectivity (scheduler lag root cause)
airflow db check
# Show scheduler heartbeat (last seen time — if stale, scheduler is dead)
airflow jobs check --job-type SchedulerJob --limit 1
# Check Airflow version and config
airflow version
airflow config get-value core sql_alchemy_conn   # metadata DB connection
check_env.sh — Quick Environment Health Script
Script▼
💡 Save this as check_env.sh and run it before any deployment or when a job fails immediately. It tells you in 10 seconds whether the problem is AWS, network, or Docker.
check_env.sh — copy and customize endpoints
#!/bin/bash
# Quick environment health check — run before deployment or on job failure
echo "━━━ 1. AWS Identity ━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
aws sts get-caller-identity --query 'Arn' --output text
echo ""
echo "━━━ 2. S3 Connectivity ━━━━━━━━━━━━━━━━━━━━━━━━━"
aws s3 ls s3://your-main-bucket --max-items 1 > /dev/null \
&& echo "✓ S3 Access: OK" \
|| echo "✗ S3 Access: FAILED"
echo ""
echo "━━━ 3. Database Reachability ━━━━━━━━━━━━━━━━━━━"
# Replace with your actual endpoints
timeout 3 bash -c 'cat < /dev/tcp/your-rds-endpoint/5432' 2>/dev/null \
&& echo "✓ RDS PostgreSQL: Reachable" \
|| echo "✗ RDS PostgreSQL: Unreachable"
timeout 3 bash -c 'cat < /dev/tcp/your-rds-endpoint/3306' 2>/dev/null \
&& echo "✓ RDS MySQL: Reachable" \
|| echo "✗ RDS MySQL: Unreachable"
timeout 3 bash -c 'cat < /dev/tcp/your-redshift-endpoint/5439' 2>/dev/null \
&& echo "✓ Redshift: Reachable" \
|| echo "✗ Redshift: Unreachable"
echo ""
echo "━━━ 4. Docker Health ━━━━━━━━━━━━━━━━━━━━━━━━━━━"
docker info 2>/dev/null | grep -E "Running|Paused|Stopped" \
|| echo "✗ Docker: not running"
echo ""
echo "━━━ 5. Disk Space ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
df -h | awk 'NR==1 || int($5) > 70 {print}' # show header + any disk > 70% full
echo ""
echo "━━━ 6. IAM Permission Check ━━━━━━━━━━━━━━━━━━━━"
ROLE_ARN=$(aws sts get-caller-identity --query 'Arn' --output text)
aws iam simulate-principal-policy \
--policy-source-arn "$ROLE_ARN" \
--action-names s3:GetObject s3:PutObject \
--resource-arns "arn:aws:s3:::your-main-bucket/*" \
--query 'EvaluationResults[*].[EvalActionName,EvalDecision]' \
--output table 2>/dev/null || echo "(IAM simulation skipped — needs role ARN)"
Snowflake — Query & Cost Diagnostics
Snowflake▼
Most expensive failed queries (last 14 days) — cost-weighted by warehouse size
💡This query weights execution time by warehouse credits/hour so a 1-minute query on a 4XL warehouse correctly ranks higher than a 10-minute query on XSmall.
Run in Snowflake worksheet — requires ACCOUNTADMIN or MONITOR privilege
WITH WAREHOUSE_CREDITS_MAP AS (
SELECT * FROM (VALUES
('X-Small', 'STANDARD', 1),
('Small', 'STANDARD', 2),
('Medium', 'STANDARD', 4),
('Large', 'STANDARD', 8),
('X-Large', 'STANDARD', 16),
('2X-Large', 'STANDARD', 32),
('3X-Large', 'STANDARD', 64),
('4X-Large', 'STANDARD', 128),
('5X-Large', 'STANDARD', 256),
('6X-Large', 'STANDARD', 512),
('Medium', 'SNOWPARK-OPTIMIZED', 6),
('Large', 'SNOWPARK-OPTIMIZED', 12),
('X-Large', 'SNOWPARK-OPTIMIZED', 24),
('2X-Large', 'SNOWPARK-OPTIMIZED', 48),
('3X-Large', 'SNOWPARK-OPTIMIZED', 96),
('4X-Large', 'SNOWPARK-OPTIMIZED',192),
('5X-Large', 'SNOWPARK-OPTIMIZED',384),
('6X-Large', 'SNOWPARK-OPTIMIZED',768)
) AS t (WAREHOUSE_SIZE, WAREHOUSE_TYPE, CREDITS_PER_HOUR)
),
JOINED AS (
SELECT
qh.QUERY_TEXT,
qh.QUERY_ID,
qh.EXECUTION_STATUS,
qh.WAREHOUSE_SIZE,
qh.WAREHOUSE_TYPE,
qh.EXECUTION_TIME,
qh.EXECUTION_TIME * wcm.CREDITS_PER_HOUR AS EFFECTIVE_EXECUTION_TIME,
qh.USER_NAME
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY qh
JOIN WAREHOUSE_CREDITS_MAP wcm
ON wcm.WAREHOUSE_SIZE = qh.WAREHOUSE_SIZE
AND wcm.WAREHOUSE_TYPE = qh.WAREHOUSE_TYPE
WHERE qh.EXECUTION_STATUS = 'FAIL'
AND qh.START_TIME BETWEEN DATEADD('day', -14, CURRENT_DATE()) AND CURRENT_DATE()
ORDER BY EFFECTIVE_EXECUTION_TIME DESC
)
SELECT
QUERY_TEXT,
MAX(QUERY_ID) AS LATEST_QUERY_ID,
ANY_VALUE(WAREHOUSE_SIZE) AS WAREHOUSE_SIZE,
ANY_VALUE(WAREHOUSE_TYPE) AS WAREHOUSE_TYPE,
AVG(EFFECTIVE_EXECUTION_TIME::INT / 1000)::INT AS AVG_COST_WEIGHTED_SECS,
ANY_VALUE(USER_NAME) AS USER_NAME
FROM JOINED
GROUP BY QUERY_TEXT
ORDER BY AVG_COST_WEIGHTED_SECS DESC;
Active queries & sessions
-- Currently running queries
SELECT query_id, user_name, warehouse_name,
execution_status,
DATEDIFF('second', start_time, CURRENT_TIMESTAMP()) AS running_secs,
LEFT(query_text, 120) AS query
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(RESULT_LIMIT => 1000))
-- QUERY_HISTORY_BY_SESSION() only sees the current session; use QUERY_HISTORY()
WHERE execution_status = 'RUNNING'
ORDER BY running_secs DESC;
-- Kill a query
SELECT SYSTEM$CANCEL_QUERY('<query_id>');
-- Recent failures with error messages
SELECT query_id, start_time, user_name, warehouse_name,
error_code, error_message,
LEFT(query_text, 120) AS query
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE execution_status = 'FAIL'
AND start_time > DATEADD('hour', -24, CURRENT_TIMESTAMP())
ORDER BY start_time DESC
LIMIT 20;
Warehouse & credit usage
-- Credits used per warehouse last 7 days
SELECT warehouse_name,
SUM(credits_used) AS total_credits,
SUM(credits_used_compute) AS compute_credits,
COUNT(*) AS sessions
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE start_time > DATEADD('day', -7, CURRENT_DATE())
GROUP BY warehouse_name
ORDER BY total_credits DESC;
-- Warehouse queuing — are queries waiting?
SELECT warehouse_name,
       SUM(queued_overload_time) AS queued_overload_ms,
       SUM(queued_provisioning_time) AS queued_provisioning_ms,
       AVG(execution_time) AS avg_exec_ms
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE start_time > DATEADD('hour', -1, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY queued_overload_ms DESC NULLS LAST;
Table & storage diagnostics
-- Largest tables in current database
SELECT table_schema, table_name,
ROUND(bytes / 1024 / 1024 / 1024, 2) AS size_gb,
row_count,
ROUND(bytes / NULLIF(row_count, 0), 0) AS bytes_per_row
FROM information_schema.tables
WHERE table_type = 'BASE TABLE'
ORDER BY bytes DESC LIMIT 20;
-- Query result cache hit rate (high = good, saves credits)
SELECT
COUNT_IF(result_from_cache) AS cache_hits,
COUNT(*) AS total_queries,
ROUND(COUNT_IF(result_from_cache) * 100.0 / COUNT(*), 1) AS cache_hit_pct
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE start_time > DATEADD('day', -1, CURRENT_DATE());
-- Clone and time travel storage cost
SELECT table_name,
ROUND(active_bytes/1024/1024/1024, 3) AS active_gb,
ROUND(time_travel_bytes/1024/1024/1024, 3) AS time_travel_gb,
ROUND(failsafe_bytes/1024/1024/1024, 3) AS failsafe_gb
FROM SNOWFLAKE.ACCOUNT_USAGE.TABLE_STORAGE_METRICS
WHERE active_bytes > 0
ORDER BY (active_bytes + time_travel_bytes + failsafe_bytes) DESC
LIMIT 20;
dbt — CLI Diagnostics
Snowflake · Redshift · BigQuery▼
Environment & connection check
# Check dbt version and connection to warehouse
dbt --version
dbt debug                            # tests connection, profiles.yml, dependencies
# Debug with specific profile / target
dbt debug --profile my_project --target prod
# List all models (check what's in the project)
dbt ls
dbt ls --select tag:daily            # models with a specific tag
dbt ls --select source:*             # all sources
Running models & tests
# Run all models
dbt run
# Run a single model
dbt run --select orders
# Run a model + all its upstream dependencies
dbt run --select +orders
# Run a model + all downstream dependents
dbt run --select orders+
# Run only failed models from the last run
dbt run --select result:error --state ./target
# Run tests only
dbt test
dbt test --select orders             # test a specific model
dbt test --select source:raw_orders  # test a source
# Run source freshness check
dbt source freshness
dbt source freshness --select source:raw_orders
Diagnosing failures
# See full error output (default truncates)
dbt run --select orders 2>&1 | cat
# Read the run results JSON after a failure
cat target/run_results.json | python3 -c "
import json, sys
r = json.load(sys.stdin)
for n in r['results']:
if n['status'] not in ('success','pass'):
print(f\"--- {n['unique_id']} [{n['status']}] ---\")
print(n.get('message',''))
print()
"
# Compile a model to see the rendered SQL (without running it)
dbt compile --select orders
cat target/compiled/my_project/models/orders.sql
# Run with --fail-fast (stop on first error instead of continuing)
dbt run --fail-fast
# Verbose logging
dbt run --select orders --log-level debug 2>&1 | grep -iE 'error|warn|fail'
Source freshness & data quality
# Check all sources for freshness (compares MAX(loaded_at) to warn/error thresholds)
dbt source freshness
# Output to file for logging
dbt source freshness -o target/sources.json
# Read freshness results
cat target/sources.json | python3 -c "
import json, sys
r = json.load(sys.stdin)
for s in r['results']:
if s['status'] != 'pass':
print(f\"{s['unique_id']}: {s['status']} — {s.get('warn_after') or s.get('error_after')}\")
"
# Re-run only tests that failed last time
dbt test --select result:fail --state ./target
Snowflake / Redshift specific tips
# Check which warehouse dbt is using (Snowflake)
# In profiles.yml — confirm warehouse, role, database, schema
cat ~/.dbt/profiles.yml
# Override warehouse for a heavy model run (Snowflake; only works if the
# project wires this var into the model's snowflake_warehouse config)
dbt run --select large_model --vars '{"snowflake_warehouse": "TRANSFORM_XL"}'
# Check dbt model run times in Snowflake query history
SELECT
query_text, start_time,
ROUND(execution_time / 1000, 1) AS secs,
warehouse_name, user_name
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE query_text ILIKE '%/* {"app": "dbt"%'
AND start_time > DATEADD('hour', -24, CURRENT_TIMESTAMP())
ORDER BY execution_time DESC
LIMIT 20;
# Redshift — find dbt queries by comment tag
SELECT pid, starttime,
ROUND(datediff('second', starttime, endtime), 1) AS secs,
LEFT(querytxt, 120) AS query
FROM STL_QUERY
WHERE querytxt LIKE '%dbt%'
AND starttime > CURRENT_DATE - 1
ORDER BY starttime DESC LIMIT 20;
🧠 Output Guide
What the numbers actually mean — and what to do next.
Load Average — uptime
OS▼
output example
14:32 up 42 days, load average: 3.24, 7.81, 9.02
↑ 1min ↑ 5min ↑ 15min
Compare each number to your CPU count. On a 4-core machine: load 4.0 = 100% busy. Above = queue building. Get CPU count: lscpu | grep "^CPU(s)"
✓ Normal
Load < CPU count
System has headroom. No action needed.
⚠ Elevated
1–2× CPU count
Find driver process. May be transient.
✗ Critical
> 2× CPU count
System saturated. Queries timing out. Act now.
What to do next
1
High 1-min, low 15-min → spike from a single job. Run
ps -eo pid,%cpu,cmd --sort=-%cpu | head -5
2
All 3 elevated → sustained pressure. Check if it's CPU or I/O: high load + low CPU% in top = disk bottleneck. Run
iostat -x 1 3
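The load-to-core comparison is easy to script. A minimal sketch, Linux-only since it reads /proc/loadavg (so the numbers match uptime) and uses nproc:

```shell
# Print 1-min load per core. ratio < 1 = headroom, 1-2 = elevated, > 2 = saturated.
load_per_core() {
  local load cores
  load=$(cut -d' ' -f1 /proc/loadavg)
  cores=$(nproc)
  awk -v l="$load" -v c="$cores" \
    'BEGIN { printf "load=%.2f cores=%d ratio=%.2f\n", l, c, l/c }'
}
load_per_core
```

Handy in check scripts because the ratio is host-independent: the same alert threshold works on a 4-core box and a 64-core box.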
free -h — Memory
OS▼
output example
              total   used    free   shared  buff/cache  available
Mem: 15Gi 13Gi 312Mi 1.2Gi 1.8Gi 1.5Gi
Swap: 4Gi 2.1Gi 1.9Gi ← ANY swap on DB host = problem
🚨Read the available column, not free. Linux uses RAM for disk cache aggressively — that's normal. Available = free + reclaimable cache. Any swap usage on a database host is a performance emergency.
What to do next
1
Swap > 0: find the hog →
ps aux --sort=-%mem | head -10
2
MySQL:
innodb_buffer_pool_size should be ~70% of RAM max. PostgreSQL: watch work_mem × max_connections — it multiplies.
3
OOM killer fired? →
dmesg | grep -i "killed process"
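For cron checks and runbooks, the same numbers can be read machine-readably straight from /proc/meminfo. A Linux-only sketch (values in the file are kB):

```shell
# available RAM and swap-in-use, script-friendly (kB → MB).
mem_check() {
  awk '/^MemAvailable:/ {avail=$2}
       /^SwapTotal:/    {st=$2}
       /^SwapFree:/     {sf=$2}
       END { printf "available_mb=%d swap_used_mb=%d\n", avail/1024, (st-sf)/1024 }' \
    /proc/meminfo
}
mem_check   # nonzero swap_used_mb on a DB host = investigate now
```

Unlike parsing `free -h` output, /proc/meminfo field names are stable across distros, so this survives procps version changes.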
vmstat — Virtual Memory Stats
OS▼
vmstat 1 5 — columns that matter
procs ----memory--- --swap-- ---io---- --cpu---
r b swpd free si so bi bo us sy id wa
2 4 2048 312 142 198 8420 12300 34 18 14 34
Key Columns
r = runnable (waiting for CPU slot)
b = blocked on I/O
si = KB/s swapped IN from disk
so = KB/s swapped OUT to disk
wa = CPU% waiting for I/O
id = CPU% idle
Red Flags
si/so > 0 = swapping — immediate problem
wa > 20 = disk is the bottleneck
b > 2 = multiple processes stuck on I/O
r > CPU count = CPU contention
id = 0 = system fully saturated
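The red flags are mechanical enough to script. A sketch that applies them to one vmstat data line; column positions assume the standard 17-column layout (r b swpd free buff cache si so bi bo in cs us sy id wa st), so adjust if your vmstat prints differently:

```shell
# Flag a vmstat data line against the thresholds above.
flag_vmstat_line() {
  awk '{b=$2; si=$7; so=$8; wa=$16
        if (si + so > 0) print "FLAG: swapping (si/so > 0)"
        if (wa > 20)     print "FLAG: I/O wait high (wa > 20)"
        if (b > 2)       print "FLAG: processes blocked on I/O (b > 2)"}'
}
# Canned example; on a live box: vmstat 1 2 | tail -1 | flag_vmstat_line
echo "2 4 2048 312100 1200 54000 142 198 8420 12300 900 1200 34 18 14 34 0" \
  | flag_vmstat_line
```

Silence means none of the three thresholds tripped; any FLAG line maps directly to a row in the table above.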
iostat -x — Disk I/O Extended
OS▼
iostat -x 1 3
Device    r/s   w/s  rMB/s  wMB/s  await  r_await  w_await  %util
nvme0n1 124 890 1.8 28.4 48.2 3.1 52.4 97.3
✓ Healthy
%util < 60%, await < 10ms
I/O not a bottleneck.
⚠ Busy
%util 60–80%, await 10–20ms
Monitor. DB queries may slow.
✗ Saturated
%util > 80%, await > 20ms
Every query is waiting. Find the writer.
Diagnosis
1
High w_await, low r_await → write-heavy. Check ETL jobs, MySQL binary log, PostgreSQL WAL writes.
2
Find the offending process:
iotop -o or pidstat -d 1 5
3
On RDS: correlate with CloudWatch
ReadLatency, WriteLatency, DiskQueueDepth.
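The healthy/busy/saturated bands above collapse into a tiny classifier; feed it the %util and await numbers you read off `iostat -x`. The function name is made up, and the thresholds mirror this table, nothing more:

```shell
# Classify a device by %util and await using the bands above.
classify_io() {  # $1 = %util, $2 = await in ms
  awk -v u="$1" -v a="$2" 'BEGIN {
    if (u > 80 || a > 20)      print "SATURATED"
    else if (u > 60 || a > 10) print "BUSY"
    else                       print "HEALTHY"
  }'
}
classify_io 97.3 48.2   # the nvme0n1 sample above → SATURATED
classify_io 45 4        # → HEALTHY
```

Using `||` between the two conditions is deliberate: either symptom alone (busy device or slow completions) is enough to hurt query latency.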
ss -ant — Connection States
OS▼
ss -ant | awk '{print $1}' | sort | uniq -c
      1 LISTEN
847 ESTAB
312 TIME-WAIT
89 CLOSE-WAIT ← growing = connection leak in app
45 SYN-RECV ← high = server overloaded or SYN flood
ESTAB — established
Active connections. Normal. High count = pool not limiting or pool leak.
Compare to max_connections. If near limit, add PgBouncer/ProxySQL.
TIME-WAIT
Recently closed. Linux holds 60s for safety. Thousands = normal.
Only a problem if exhausting ephemeral port range (65k limit).
CLOSE-WAIT — danger
Remote closed but local app didn't. Growing count = connection leak bug.
Restart the leaking process. Fix the application code (missing close()).
SYN-RECV
Handshake started but not completed. Low = normal. High = overloaded server.
>100 sustained = investigate. Check if server can accept connections fast enough.
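CLOSE-WAIT only matters if it grows, and two samples a few seconds apart answer that. A sketch (assumes `ss` is installed; on a box without it the counts simply read 0):

```shell
# Count connections in a given state, twice, and compare.
count_state() { ss -ant 2>/dev/null | awk -v s="$1" '$1 == s' | wc -l; }
a=$(count_state CLOSE-WAIT)
sleep 2
b=$(count_state CLOSE-WAIT)
echo "CLOSE-WAIT: $a -> $b"
if [ "$b" -gt "$a" ]; then
  echo "growing: suspect a connection leak in the app"
fi
```

Stretch the sleep to 30-60s for a real verdict; a flat count is fine, a monotonically rising one is the leak signature described above.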
ps aux — STAT Column Codes
OS▼
R
Running / runnable — on CPU or ready
S
Sleeping — normal idle state
D
Uninterruptible I/O wait — cannot kill
Z
Zombie — dead, parent not cleaned
T
Stopped by signal or debugger
🚨D-state is the most dangerous. Process is stuck waiting for disk/network I/O and cannot be killed with kill -9. Many D-state processes = disk hung or NFS issue. Only fixing the I/O or rebooting resolves it. Find them:
ps -eo pid,stat,wchan:25,cmd | grep "^[0-9]* D"
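The D-state filter is plain text matching, so here it is demonstrated on canned `ps -eo pid,stat,wchan:25,cmd` output (the PIDs, wait channels, and commands are made up):

```shell
# Demo of the D-state filter on canned ps output.
sample_ps() {
  cat <<'EOF'
 1201 S    poll_schedule_timeout     /usr/sbin/mysqld
 1342 D    io_schedule               [kworker/u8:2]
 2077 R    -                         python3 pipeline.py
 2190 D    nfs_wait_on_request       cp /mnt/nfs/big.csv /data/
EOF
}
# Print PID and kernel wait channel for anything stuck in D state
sample_ps | awk '$2 ~ /^D/ {print $1, $3}'
```

Matching `$2 ~ /^D/` rather than `$2 == "D"` also catches compound codes like `D+` and `Dl` that ps emits for foreground or multi-threaded processes.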
MySQL PROCESSLIST — State Column
MySQL▼
Waiting for lock
Transaction blocked waiting for a row/table lock held by another connection.
Find the blocker via innodb_lock_waits. Kill it if stuck too long.
Copying to tmp table
Query creating temp table — usually a heavy GROUP BY or DISTINCT without proper index.
Add index or rewrite query. Check tmp_table_size setting.
Sending data
Reading data from table and sending to client. Often appears slow — usually the actual query work.
EXPLAIN the query. Look for full table scans (type=ALL).
Sorting result
Filesort — sorting without an index. Expensive on large datasets.
Add index matching ORDER BY clause columns.
Opening tables
Many threads opening tables — table cache too small or table DDL happening.
Increase table_open_cache. Check for ongoing ALTER TABLE operations.
Sleep (very long)
Idle connection holding resources. With wait_timeout expired = connection not properly closed.
Reduce wait_timeout. Ensure app uses connection pooling with proper cleanup.
pg_stat_activity — State & Wait Events
PostgreSQL▼
state column values
state                           meaning
active query is currently running
idle connection open, nothing running (normal in pool)
idle in transaction opened transaction but paused — holds locks!
idle in transaction (aborted) failed tx not rolled back — blocking others
⚠️idle in transaction is the PostgreSQL silent killer. It holds locks without doing work. Long-running idle in transaction sessions will block VACUUM, DDL changes, and other writes. Kill them:
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle in transaction' AND query_start < now() - INTERVAL '5 minutes';
wait_event = Lock
Waiting for a lock held by another session.
Use pg_blocking_pids() to find who holds it.
wait_event = ClientRead
Waiting for client to send next query. Connection sitting idle.
Normal for idle pooled connections. Set idle timeout in PgBouncer.
wait_event = IO / DataFileRead
Reading data from disk — cache miss. Shared buffers too small.
Increase shared_buffers. Check cache hit rate in pg_statio_user_tables.
wait_event = WALWrite
Writing to WAL log. High count = write-heavy workload hitting disk.
Tune checkpoint_completion_target=0.9, increase wal_buffers.
MySQL Error Log — Key Patterns
MySQL▼
Too many connections
Hit max_connections. New connections refused.
Kill idle connections. Add ProxySQL/connection pooler.
Deadlock found when trying to get lock
InnoDB killed one transaction to break the cycle.
Run SHOW ENGINE INNODB STATUS\G and check LATEST DEADLOCK section.
InnoDB: out of memory
Buffer pool or other allocation failed. OOM condition.
Reduce innodb_buffer_pool_size. Check total RAM vs all MySQL settings.
Aborted connection … (got error reading communication packets)
Client disconnected without closing properly. Network drop or app crash.
Usually noise. High frequency = network instability or app bug.
Disk full … waiting for someone to free space
All write operations hanging. MySQL paused.
Free space immediately. Purge binary logs: PURGE BINARY LOGS BEFORE NOW() - INTERVAL 3 DAY;
Table is marked as crashed
MyISAM corruption from unclean shutdown.
REPAIR TABLE tablename; or mysqlcheck --repair db table
PostgreSQL Log — Key Patterns
PostgreSQL▼
FATAL: remaining connection slots are reserved
Hit max_connections. PG reserves 3 slots for superuser so regular users hit this first.
Add PgBouncer in transaction mode. Increase max_connections + tune shared_buffers.
ERROR: deadlock detected
PG detected a lock cycle and killed one transaction. Log includes full dependency trace.
Read DETAIL in the log — it shows exact tables and PIDs. Fix lock ordering in app.
LOG: checkpoints are occurring too frequently
Checkpoints are being triggered by WAL volume instead of the timed schedule — too much dirty data flushed to disk at once.
Set checkpoint_completion_target=0.9. Increase max_wal_size. Check disk with iostat.
ERROR: could not open file "base/…"
Data file missing or corrupt. Unclean shutdown or disk failure.
Check disk health. Run pg_dump on surviving tables. May need PITR restore.
autovacuum: VACUUM table taking too long
Table bloat accumulating. Dead rows not cleaned fast enough.
Run VACUUM ANALYZE tablename manually. Check n_dead_tup in pg_stat_user_tables.
PANIC: could not write to file … No space left on device
Disk full — PG can't write WAL or data files. All transactions fail.
Free space. Check pg_wal/ directory. Review wal_keep_size setting.
Spark / Glue — Error Patterns
ETL▼
java.lang.OutOfMemoryError: Java heap space
Executor ran out of JVM heap. Dataset too large for current config.
Increase --executor-memory. Repartition data to reduce partition size.
Lost executor … heartbeat timeout
Worker node died or became unresponsive. Stage will retry.
Check worker node: disk full or OOM. Check spark.executor.heartbeatInterval.
FetchFailed … SparkException: stage failed
Shuffle read failed — executor that wrote shuffle data died before read.
Increase spark.shuffle.io.maxRetries. Check shuffle disk health on executors.
Task not serializable
Non-serializable object used inside a Spark closure (lambda/map).
Broadcast the object or make it serializable. Never pass DB connections to executors.
Container killed by YARN for exceeding memory limits
Executor total memory (heap + overhead) exceeded YARN limit.
Increase spark.executor.memoryOverhead (spark.yarn.executor.memoryOverhead on Spark < 2.3) or reduce executor-memory.
Job aborted: FileNotFoundException on s3://
Input file disappeared mid-job. Path changed or deleted.
Verify path: aws s3 ls s3://bucket/path/ Check if another process deleted input.
🐛 Debug Protocol
D-W-F-E-V-S — 6 steps, 5 minutes, 90% of issues solved. Memorize this sequence.
D-W-F-E-V-S — The Memory Device
Framework▼
D
Data or System?
Split the problem. Pick one. Focus there.
W
What Changed?
80% rule: recent change in last 24-48h
F
First Error
Top of logs = root cause. Bottom = cascades.
E
Evidence
Don't assume. Always verify with proof.
V
Volume
Compare to baseline. 10x spike = volume issue.
S
Stupid Questions
Check basics. Catches 40% of all issues.
📌The 90-5-5 Rule: 90% of issues solved by Steps 1-6 in ~5 minutes. 5% need specialized checks (~3 min). 5% require deep dive — escalate.
5-Minute Time Breakdown
Framework▼
Step 1
10s
0:10 total
Data or System?
Step 2
60s
1:10 total
What Changed?
Step 3
90s
2:40 total
First Error
Step 4
60s
3:40 total
Verify Evidence
Step 5
60s
4:40 total
Volume Compare
Step 6
60s
5:40 total
Stupid Questions
Step 1 — Data or System? (10 seconds)
Cuts 80% of noise▼
Ask the question. Pick one. Do not investigate both simultaneously — that's how you lose 30 minutes. Split the problem space in half immediately.
📁 DATA Problems
Files missing or empty
Schema changed
Bad / unparseable records
Volume spike or drop
Upstream source changed
⚙️ SYSTEM Problems
Service down
IAM / credentials expired
Network / VPC blocked
CPU / memory exhausted
Disk full
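The SYSTEM half of the split can be triaged mechanically before you read a single log line. A minimal sketch, assuming a Linux host with `df` on the PATH; the 90% threshold is an illustrative assumption, not a standard:

```shell
# Hedged sketch: 10-second SYSTEM-side check — disk pressure on the root filesystem.
# The 90% threshold is an assumption; tune it to your fleet.
used=$(df -P / | awk 'NR==2 {print $5+0}')
if [ "$used" -gt 90 ]; then
  echo "SYSTEM suspect: root filesystem at ${used}% — start on the SYSTEM branch"
else
  echo "disk at ${used}% — if services are up too, lean towards DATA"
fi
```

If this and a `systemctl status` come back clean, commit to the DATA branch and stop second-guessing the split.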
Step 2 — What Changed? (60 seconds)
The 80% rule▼
📌80% of production issues are caused by a change in the last 24-48 hours. Code deploy, schema migration, config update, AWS change, upstream data format shift.
# Code changes in last 2 days
git log --since="2 days ago" --oneline
# Volume trend — did row counts change?
SELECT date, COUNT(*) AS row_count
FROM your_table
WHERE date >= CURRENT_DATE - 7
GROUP BY date
ORDER BY date;
# File count per day in S3 (spot the drop)
for i in {0..7}; do
d=$(date -d "$i days ago" +%Y-%m-%d)
echo "$d: $(aws s3 ls s3://bucket/$d/ | wc -l) files"
done
# Schema changes (PostgreSQL)
SELECT nspname AS schema_name, relname AS table_name, attname, atttypid::regtype
FROM pg_attribute
JOIN pg_class ON attrelid = pg_class.oid
JOIN pg_namespace ON relnamespace = pg_namespace.oid
WHERE attnum > 0 AND NOT attisdropped AND relkind = 'r'
  AND nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY attrelid, attnum;
Step 3 — First Error in Logs (90 seconds)
Root cause is at the top▼
⚠️Read top to bottom, not bottom up. The FIRST error in time is the root cause. Everything after it is a cascading failure triggered by that first event. Junior engineers waste 20 minutes debugging the wrong error.
# 1. Application logs → your code errors first
grep -i "error" logfile.log | head -10
# 2. Glue / system logs → orchestration layer
aws logs filter-log-events \
  --log-group-name /aws/glue/jobs/output \
  --filter-pattern "ERROR" \
  --max-items 10 \
  --query 'events[*].message' --output text
# 3. Infrastructure logs → CloudWatch, RDS
aws logs filter-log-events \
  --log-group-name /aws/rds/instance/my-db/error \
  --filter-pattern "ERROR" --max-items 10 \
  --query 'events[*].message' --output text
# 4. Airflow task logs
airflow tasks log <dag_id> <task_id> <execution_date>
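The "first error wins" rule is easy to apply mechanically: `grep -m1` stops at the earliest match instead of scrolling you to the latest cascade. A minimal sketch with a demo log inlined — swap in your real logfile path:

```shell
# Hedged sketch: find the FIRST error in a log, not the last.
# Demo log inlined so the snippet is self-contained; point $log at your real file.
log=/tmp/demo_app.log
printf '%s\n' \
  '10:00:01 INFO  job started' \
  '10:00:05 ERROR disk full while writing stage output' \
  '10:00:06 ERROR task 12 failed' \
  '10:00:06 ERROR stage aborted' > "$log"
grep -n -m1 -iE 'error|fatal' "$log"   # -m1: stop at the first match = root-cause candidate
```

Here the first match is the disk-full error; the two errors after it are the cascade you would waste time on if you read bottom-up.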
Step 4 — Verify, Don't Assume (60 seconds)
Evidence over intuition▼
🚨"It should work" ≠ "It works." Every assumption you make without verifying adds 10 minutes to diagnosis time. Run the command. Check the actual value.
# Credentials — are they actually valid?
aws sts get-caller-identity
# Does the file/object actually exist?
aws s3 ls s3://bucket/path/file.parquet
aws s3api head-object --bucket my-bucket --key path/file.parquet
# Is the DB actually accepting connections?
nc -zv rds-endpoint.amazonaws.com 5432
telnet rds-endpoint.amazonaws.com 3306
# Is the data actually there?
SELECT COUNT(*) FROM table WHERE date = CURRENT_DATE;
# Compare dev vs prod config
diff <(aws ssm get-parameter --name /dev/db_config --query Parameter.Value --output text) \
     <(aws ssm get-parameter --name /prod/db_config --query Parameter.Value --output text)
# What environment is this actually running in?
echo $ENV && echo $AWS_DEFAULT_REGION && aws configure list
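The environment check is worth turning into a guard rather than an eyeball test. A minimal sketch — `APP_ENV` and `EXPECTED_ENV` are stand-ins for whatever variable your deploy tooling actually sets:

```shell
# Hedged sketch: fail fast on a wrong-environment run.
# APP_ENV / EXPECTED_ENV are assumed names — match them to your own setup.
EXPECTED_ENV=prod
APP_ENV=${APP_ENV:-prod}   # demo default; normally exported by your runner
if [ "$APP_ENV" != "$EXPECTED_ENV" ]; then
  echo "ABORT: running in '$APP_ENV', expected '$EXPECTED_ENV'" >&2
  exit 1
fi
echo "env check ok: $APP_ENV"
```

Dropping this at the top of a job script is how you stop "prod code against dev DB" incidents permanently instead of catching them in step 6.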
Step 5 — Compare Data Volume (60 seconds)
Many perf issues = volume issues▼
💡Red flags: 10x spike, sudden drop to 0. Many "performance issues" are actually volume issues in disguise — the query plan was fine for 1M rows but breaks at 50M.
# Row count trend (PostgreSQL / Snowflake syntax; MySQL wants INTERVAL 7 DAY)
SELECT
DATE(created_at) AS day,
COUNT(*) AS row_count
FROM your_table
WHERE created_at >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY DATE(created_at)
ORDER BY day;
# File count per day in S3
for i in {0..7}; do
d=$(date -d "$i days ago" +%Y-%m-%d)
cnt=$(aws s3 ls s3://your-bucket/prefix/$d/ 2>/dev/null | wc -l)
echo "$d: $cnt files"
done
# Avg file size check (detect 0-byte or undersized files)
aws s3 ls s3://bucket/prefix/ --recursive \
| awk '{sum+=$3; count++} END {print "Files:", count, "Avg size:", int(sum/count), "bytes"}'
# Partition distribution (Spark — detect skew)
df.groupBy("partition_col").count().orderBy("count", ascending=False).show(20)
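Day-over-day counts are easier to scan when anomalies are flagged for you. A sketch applying the 10x red-flag rule from above to a `date count` series — demo counts are inlined; in practice feed it the output of the loops above:

```shell
# Hedged sketch: flag any 10x day-over-day swing in a "date count" series.
# Demo data inlined; replace /tmp/counts.txt with your real per-day counts.
printf '%s\n' '2024-01-01 1000' '2024-01-02 1100' '2024-01-03 90' > /tmp/counts.txt
awk 'prev && ($2 > prev*10 || $2 < prev/10) {print $1": volume anomaly ("prev" -> "$2")"} {prev=$2}' /tmp/counts.txt
```

On the demo data this prints a single anomaly line for 2024-01-03, where the count collapsed from 1100 to 90.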
Step 6 — Stupid Questions Checklist (60 seconds)
Catches 40% of all issues▼
⚠️Senior engineers run this checklist too. There is no shame in checking the obvious. The shame is in spending 3 hours before checking it.
# ✓ Credentials valid?
aws sts get-caller-identity
# ✓ Data exists?
aws s3 ls s3://bucket/path/
SELECT COUNT(*) FROM table WHERE date = CURRENT_DATE;
# ✓ Service actually running?
systemctl status airflow-scheduler
systemctl status mysql
docker ps | grep <service>
# ✓ Right environment? (not running prod code against dev DB)
echo $ENV
echo $AWS_DEFAULT_REGION
aws configure list
# ✓ Network open?
nc -zv hostname 5432 # PostgreSQL
nc -zv hostname 3306 # MySQL
nc -zv hostname 5439 # Redshift
curl -sv https://your-endpoint/health 2>&1 | grep -E 'Connected|Failed'
# ✓ Permissions OK?
aws iam simulate-principal-policy \
  --policy-source-arn $(aws sts get-caller-identity --query Arn --output text) \
  --action-names s3:GetObject --resource-arns "arn:aws:s3:::my-bucket/*"
10 Critical Log Patterns
See pattern → know fix instantly▼
"Field 'x' not found"
Schema drift — upstream changed a column name or removed a field.
Align schema. Add schema validation at ingestion. Check source DDL changes.
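Schema drift is cheap to catch if you keep the expected column list under version control and diff it at ingestion. A minimal sketch — the column lists here are hypothetical demo values; derive the real ones from your source DDL:

```shell
# Hedged sketch: detect schema drift by diffing expected vs observed column lists.
# Column names are demo values; in practice generate actual_cols from the live source.
printf '%s\n' id name email > /tmp/expected_cols.txt
printf '%s\n' id full_name email > /tmp/actual_cols.txt   # demo: upstream renamed "name"
if diff -u /tmp/expected_cols.txt /tmp/actual_cols.txt > /tmp/schema_drift.txt; then
  echo "schema ok"
else
  echo "SCHEMA DRIFT detected:"
  grep -E '^[+-][^+-]' /tmp/schema_drift.txt   # show only the changed columns
fi
```

Run as a pre-flight step, this turns a cryptic mid-job "Field 'x' not found" into an explicit diff before any processing starts.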
"Cannot parse" / "ParseException"
Bad record — malformed JSON, wrong delimiter, encoding issue.
Add input validation. Preview file: head -5 or aws s3 cp s3://... - | head -5
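A large share of "Cannot parse" errors are just the wrong delimiter. Counting fields per candidate delimiter on the first line usually settles it. A sketch with a demo file inlined — run the same pipeline on the head of your real file:

```shell
# Hedged sketch: sniff the delimiter of a suspect file by counting fields per candidate.
# Demo semicolon-delimited file inlined; in practice use the first lines of the real file.
printf 'id;name\n1;alice\n2;bob\n' > /tmp/suspect.csv
for delim in ',' ';' $'\t' '|'; do
  n=$(head -1 /tmp/suspect.csv | awk -F"$delim" '{print NF}')
  echo "delimiter '$delim' -> $n field(s)"
done
```

The delimiter that yields more than one field per line is almost certainly the real one — here `;` gives 2 fields while the others give 1.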
"NullPointerException"
Unexpected null value in a field assumed to be non-null.
Add null checks. Use COALESCE / fillna. Trace to source column.
"OutOfMemoryError: Java heap space"
Data skew or shuffle too large for executor memory.
Repartition. Increase executor-memory. Salt skewed keys.
"Connection timed out"
VPC / firewall / security group blocking the port.
Check network: nc -zv host port. Verify security group inbound rules.
"401 Unauthorized" / "403 Forbidden"
Credentials expired or IAM policy missing the required action.
Rotate secrets. Run IAM simulate-principal-policy to find the missing permission.
"429 Too Many Requests"
Rate limit hit on an API or AWS service.
Add exponential backoff. Batch requests. Check service quotas in AWS console.
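Exponential backoff fits in a few lines of shell. In this sketch `try_once` is a placeholder for your real `curl`/`aws` call; the demo version simply succeeds on the third attempt so the loop is observable:

```shell
# Hedged sketch: retry with exponential backoff. try_once is a stand-in —
# replace its body with the real API call; the demo succeeds on attempt 3.
try_once() { [ "$1" -ge 3 ]; }
attempt=1; delay=1; max_attempts=5
while ! try_once "$attempt"; do
  if [ "$attempt" -ge "$max_attempts" ]; then echo "giving up"; exit 1; fi
  echo "attempt $attempt got 429; backing off ${delay}s"
  sleep 0   # demo keeps it fast; use  sleep "$delay"  for real
  delay=$((delay * 2)); attempt=$((attempt + 1))
done
echo "succeeded on attempt $attempt"
```

Doubling the delay (1s, 2s, 4s, …) is the standard shape; adding random jitter on top helps when many workers hit the same limit at once.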
"0 records processed"
Upstream source is empty — file missing, table empty, wrong partition filter.
Check source: aws s3 ls or SELECT COUNT(*). Verify date partition logic.
"Corrupted parquet" / "Could not read footer"
Mixed schema in a Parquet partition, or file written incompletely.
Regenerate the file. Check schema evolution. Read with .option("mergeSchema", "true") in Spark.
"Task skipped" / "upstream_failed"
Wrong DAG trigger rule or upstream dependency failed silently.
Check trigger_rule on task. Inspect upstream task logs. Fix dependency chain.
5 Failure Categories
Every failure maps to one of these▼
~50%
DATA
Files, schema, records, volume
⚡ Fast to check
~20%
INFRA
IAM, network, storage, resources
🔧 Medium effort
~15%
CODE
Joins, columns, parsing, nulls
🔍 Quick to find
~10%
ORCHESTRATION
Triggers, upstream, timing
🔧 Medium effort
~5%
UPSTREAM
API, format, rate limits
⚡ Fast to check
💡Start with DATA (50% probability, fast to check). Then INFRA (20%, medium effort). This order gives you the best time-to-diagnosis ratio on any incident.
Command Cheat Sheet by Step
Quick Reference▼
Step 2 — What changed?
git log --since="2 days ago" --oneline
SELECT date, COUNT(*) FROM table WHERE date >= CURRENT_DATE - 7 GROUP BY date ORDER BY date;
aws s3 ls s3://bucket/prefix/ --recursive | wc -l
Step 3 — First error
grep -i "error" logfile.log | head -10
aws logs filter-log-events --log-group-name /aws/glue/jobs/output \
  --filter-pattern "ERROR" --max-items 10 --query 'events[*].message' --output text
Step 4 — Verify
aws sts get-caller-identity
aws s3 ls s3://bucket/path/file.parquet
telnet hostname 5432
Step 5 — Volume compare
for i in {0..7}; do d=$(date -d "$i days ago" +%Y-%m-%d); echo "$d: $(aws s3 ls s3://bucket/$d/ | wc -l) files"; done
Step 6 — Stupid questions
aws sts get-caller-identity
SELECT COUNT(*) FROM table WHERE date = CURRENT_DATE;
systemctl status service-name
nc -zv hostname 5432