⚡ Quick RCA
Run in this order when something is on fire. Don't skip steps.
6-Step RCA Workflow
Memorize this▼
# STEP 1 — Is the service even running?
pgrep -fl mysql || pgrep -fl postgres || pgrep -fl mongod
# STEP 2 — System health (30 seconds)
uptime && free -h && df -h
# STEP 3 — What do the logs say? (run for your DB)
tail -50 /var/log/mysql/error.log | grep -iE 'error|warn'
tail -50 /var/log/postgresql/postgresql-*.log | grep -iE 'error|fatal'
journalctl -u mysql --since "10 minutes ago" | grep -iE 'error|fail'
# STEP 4 — Resource bottleneck?
top -b -n1 | head -25 # CPU + memory snapshot
iostat -x 1 3 # Disk I/O (await > 20ms = bottleneck)
vmstat 1 3 # si/so > 0 = swapping = bad
# STEP 5 — Connection / lock problem?
# MySQL:
mysql -e "SHOW FULL PROCESSLIST;" | grep -v Sleep
# PostgreSQL:
psql -c "SELECT pid,state,wait_event,query_start,LEFT(query,80) FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start;"
# STEP 6 — Timeline: when did it start?
grep -iE 'error|fatal' /var/log/mysql/error.log | awk '{print $1" "$2}' | uniq -c | tail -20
Health Snapshot — One-Liners
Run first▼
Quick health dump to file
{
echo "=== $(date) ===" && uptime
echo "--- MEMORY ---" && free -h
echo "--- DISK ---" && df -h
echo "--- TOP 5 CPU ---" && ps aux --sort=-%cpu | head -6
echo "--- TOP 5 MEM ---" && ps aux --sort=-%mem | head -6
echo "--- CONNECTIONS ---"
ss -ant | awk '{print $1}' | sort | uniq -c
} | tee /tmp/health_$(date +%H%M%S).txt
CPU & Load
Performance▼
uptime                      # load avg — compare to CPU count
lscpu | grep "^CPU(s)"      # how many CPUs you have
top -b -n1 | head -20       # snapshot (non-interactive)
mpstat -P ALL 1 3           # per-core breakdown (needs sysstat)
ps -eo pid,%cpu,etime,cmd --sort=-%cpu | head -10   # top CPU processes
Memory & Swap
Performance▼
⚠️Any swap usage on a database host is a performance cliff. Even 1MB of swap activity means you have a problem.
free -h                            # look at 'available', not 'free'
vmstat 1 5                         # si/so columns — swap in/out
swapon --show                      # is swap even enabled?
grep -E 'MemTotal|MemAvailable|SwapTotal|SwapFree' /proc/meminfo
ps aux --sort=-%mem | head -10     # top memory consumers
dmesg | grep -i "killed process"   # OOM killer fired?
Disk Space & I/O
Performance▼
⚠️Always run df -i too — full inodes will refuse new files even with gigabytes free.
df -h # disk space (human-readable)
df -i # inode usage — the silent killer
iostat -x 1 5 # %util > 80% = saturated, await > 20ms = slow
# Find what's eating disk
du -sh /var/lib/mysql/* 2>/dev/null | sort -hr | head -10
du -sh /var/lib/postgresql/*/base/* 2>/dev/null | sort -hr | head -10
find /var/log -type f -size +200M -exec ls -lh {} \;
# Ghost space: deleted files still held open
lsof | grep deleted | awk '{sum+=$7} END {print sum/1024/1024 " MB held by deleted files"}'
# Which process is doing I/O right now?
iotop -o # only processes with active I/O
pidstat -d 1 3 # per-process I/O (fallback if no iotop)
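When lsof shows ghost space, you can often reclaim it without restarting the process by truncating the deleted file through /proc. A minimal sketch of the mechanism (assumes Linux; fd 9 and the temp file are illustrative — in production you'd find the real pid/fd with `ls -l /proc/<pid>/fd | grep deleted`):

```shell
# Reclaim space held by a deleted-but-open file by truncating it via /proc.
tmp=$(mktemp)
exec 9>"$tmp"                    # this shell holds the file open on fd 9
head -c 1048576 /dev/zero >&9    # write 1 MiB
rm "$tmp"                        # deleted — but space is still held by fd 9
stat -Lc %s /proc/$$/fd/9        # still 1048576 bytes on disk
: > /proc/$$/fd/9                # truncate through /proc — space released
stat -Lc %s /proc/$$/fd/9        # now 0
exec 9>&-                        # close the descriptor
```

The usual real-world target is a runaway log that was rm'ed while the service still had it open; truncating via /proc releases the blocks immediately, no restart needed.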
Connections
Network▼
# Count connections by state
ss -ant | awk '{print $1}' | sort | uniq -c
# Connections to specific DB port
ss -ant state established | grep :3306 | wc -l # MySQL
ss -ant state established | grep :5432 | wc -l # PostgreSQL
# Who is connected (what processes/IPs)
ss -antp | grep :3306
lsof -i :3306
lsof -i :5432
# Connection count by remote IP (find the flood source)
ss -ant | awk '{print $5}' | grep -v '\*' | cut -d: -f1 | sort | uniq -c | sort -nr | head -10
# System file descriptor limits (connections use fds)
ulimit -n # current shell limit
cat /proc/sys/fs/file-max # system-wide max
cat /proc/sys/fs/file-nr # open / unused / max
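To turn those raw numbers into a verdict, a tiny hypothetical helper (`fd_pct` is not a standard command) that computes percent-of-limit — feed it `ls /proc/<pid>/fd | wc -l` and the soft limit from `awk '/Max open files/ {print $4}' /proc/<pid>/limits`:

```shell
# Hypothetical helper: what percent of its fd soft limit is a process using?
fd_pct() {
  awk -v used="$1" -v limit="$2" \
    'BEGIN { printf "%.0f\n", used * 100 / limit }'
}
fd_pct 512 1024    # → 50
fd_pct 1000 1024   # → 98 — connection storms die here with EMFILE
```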
Process Investigation & Kill
Emergency▼
🚨D-state processes cannot be killed with kill -9. They are waiting for I/O. Only fixing the underlying I/O (or rebooting) resolves them.
# Find stuck processes (D = blocked on I/O, Z = zombie)
# (awk on the STAT column — grep '^[0-9]' misses ps's leading spaces)
ps -eo pid,stat,wchan:25,cmd | awk '$2 ~ /^[DZ]/'
# Full process investigation
ps aux | grep -E 'mysql|postgres|spark|glue|python|airflow' | grep -v grep
ps -eo pid,ppid,stat,etime,%cpu,%mem,cmd | grep -E 'spark|glue'
# Long-running processes (etimes = elapsed seconds)
ps -eo pid,etimes,cmd --sort=-etimes | head -15
# Kill gracefully first, force only if needed
kill -15 PID   # SIGTERM (graceful)
kill -9 PID    # SIGKILL (force — last resort)
# Service management
systemctl restart mysql
systemctl restart postgresql
journalctl -u mysql -n 100 --no-pager
Log Investigation — Essential Patterns
Logs▼
# Live error monitoring
tail -f /var/log/mysql/error.log | grep --line-buffered -i error
tail -f app.log | grep --line-buffered -iE 'error|exception|failed'
# Error frequency over time (spot the spike)
grep -i error /var/log/mysql/error.log | awk '{print $1" "$2}' | cut -d: -f1 | uniq -c
# Most common errors (top 10)
grep -i error /var/log/pipeline.log | sort | uniq -c | sort -nr | head -10
# Error with context (5 lines before and after)
grep -i "connection refused" /var/log/pipeline.log -A 5 -B 5
# Kernel / OOM events
dmesg -T | grep -iE 'oom|killed|error|hung' | tail -30
journalctl -k --since "1 hour ago" | grep -i error
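The "spot the spike" idea generalizes: bucket errors per minute and eyeball the counts. A self-contained sketch over fabricated syslog-style lines (the timestamp layout is an assumption — adjust the field split to your log format):

```shell
# Count errors per minute from syslog-style lines ("Jan 15 14:30:01 host ...").
# $1-$3 are the timestamp; drop the seconds to bucket by minute.
errors_per_minute() {
  grep -i error | awk '{ split($3, t, ":"); print $1, $2, t[1] ":" t[2] }' \
    | sort | uniq -c
}
printf '%s\n' \
  'Jan 15 14:30:01 db mysqld: ERROR disk full' \
  'Jan 15 14:30:45 db mysqld: ERROR disk full' \
  'Jan 15 14:31:07 db mysqld: ok' \
  'Jan 15 14:31:59 db mysqld: ERROR lock wait' \
  | errors_per_minute
# →  2 Jan 15 14:30
#    1 Jan 15 14:31
```

A sudden jump in one bucket is your incident start time — feed it to `journalctl --since`.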
🗄 DB Diagnostics
In-database queries — the layer most cheatsheets skip. Run these inside the DB, not the OS.
MySQL / Aurora — Active Sessions
MySQL▼
Who is running what right now
MySQL — run inside DB
-- Full process list (includes sleeping connections)
SHOW FULL PROCESSLIST;
-- Better version with duration and host
SELECT id, user, host, db, command, time AS secs, state,
       LEFT(info, 120) AS query
FROM information_schema.processlist
WHERE command != 'Sleep'
ORDER BY time DESC;
-- Just the long-running ones (> 30 seconds)
SELECT id, user, db, time, state, LEFT(info, 100) AS query
FROM information_schema.processlist
WHERE time > 30 AND command != 'Sleep'
ORDER BY time DESC;
Kill a query or connection
-- Kill just the query (connection stays open)
KILL QUERY 12345;
-- Kill the entire connection
KILL 12345;
-- Kill ALL long-running queries at once (> 60s) — USE WITH CARE
SELECT CONCAT('KILL QUERY ', id, ';')
FROM information_schema.processlist
WHERE time > 60 AND command != 'Sleep' AND user != 'root';
-- Copy output and run manually
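The copy-and-run step can be scripted from the shell as well. A sketch that builds KILL statements from tab-separated `mysql -N -e` output — review the generated statements before piping them back into `mysql`:

```shell
# Build KILL QUERY statements from "id<TAB>seconds" rows, e.g. produced by:
#   mysql -N -e "SELECT id, time FROM information_schema.processlist
#                WHERE command != 'Sleep' AND user != 'root'"
kill_long() {
  awk -F'\t' -v max="$1" '$2 > max { print "KILL QUERY " $1 ";" }'
}
printf '101\t12\n102\t95\n103\t340\n' | kill_long 60
# → KILL QUERY 102;
#   KILL QUERY 103;
```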
Connection counts & limits
-- Current vs max connections
SHOW STATUS LIKE 'Threads_connected';
SHOW VARIABLES LIKE 'max_connections';
-- Connection history (peak ever reached)
SHOW STATUS LIKE 'Max_used_connections';
-- Connections by user (find who's using the most)
SELECT user, COUNT(*) AS connections
FROM information_schema.processlist
GROUP BY user
ORDER BY connections DESC;
MySQL / Aurora — Locks & Transactions
MySQL▼
Active transactions & locks
-- Active InnoDB transactions
SELECT trx_id, trx_state, trx_started,
       TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS secs_running,
       trx_mysql_thread_id AS connection_id,
       trx_query
FROM information_schema.innodb_trx
ORDER BY trx_started;
-- Who is blocking whom (MySQL 5.7 / Aurora 2 —
-- on MySQL 8+ / Aurora 3 use sys.innodb_lock_waits instead;
-- information_schema.innodb_lock_waits was removed in 8.0)
SELECT r.trx_id AS waiting_trx,
       r.trx_mysql_thread_id AS waiting_conn,
       LEFT(r.trx_query, 80) AS waiting_query,
       b.trx_id AS blocking_trx,
       b.trx_mysql_thread_id AS blocking_conn,
       LEFT(b.trx_query, 80) AS blocking_query
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;
InnoDB status (deadlock + lock analysis)
-- Full InnoDB engine status (look for LATEST DETECTED DEADLOCK section)
SHOW ENGINE INNODB STATUS\G
-- Lock wait timeout setting
SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';
-- Row lock waits since restart
SHOW STATUS LIKE 'Innodb_row_lock_waits';
-- Deadlock counter (MariaDB/Percona only — stock MySQL lacks this counter)
SHOW STATUS LIKE 'Innodb_deadlocks';
MySQL / Aurora — Performance
MySQL▼
Slow query log (most valuable tool — enable in RDS parameter group)
-- Check current slow query settings
SHOW VARIABLES LIKE 'slow_query_log%';
SHOW VARIABLES LIKE 'long_query_time';
-- Enable on-the-fly (EC2/self-managed only — use param group for RDS)
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1; -- log queries > 1 second
SET GLOBAL log_queries_not_using_indexes = 'ON';
-- Top slow queries from performance_schema (available on RDS)
SELECT
LEFT(digest_text, 100) AS query_pattern,
count_star AS exec_count,
ROUND(avg_timer_wait/1000000000000, 3) AS avg_sec,
ROUND(max_timer_wait/1000000000000, 3) AS max_sec,
ROUND(sum_rows_examined / count_star) AS avg_rows_examined
FROM performance_schema.events_statements_summary_by_digest
WHERE schema_name NOT IN ('performance_schema','information_schema','mysql')
ORDER BY avg_timer_wait DESC
LIMIT 15;
EXPLAIN — query analysis
-- Basic explain
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
-- Extended (MySQL 8+) — actually runs the query and shows real timings
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
-- JSON format (most detailed)
EXPLAIN FORMAT=JSON SELECT * FROM orders WHERE customer_id = 42\G
-- What to look for:
--   type = 'ALL'           → full table scan (BAD — needs index)
--   type = 'index'         → full index scan (better but still slow on large tables)
--   type = 'ref'/'eq_ref'  → good, using index properly
--   rows > 100000          → scanning too many rows
--   Extra 'Using filesort' → sorting without index (expensive)
Buffer pool & key metrics
-- Buffer pool hit rate (should be > 99%)
SELECT
ROUND(
(1 - innodb_buffer_pool_reads / innodb_buffer_pool_read_requests) * 100, 2
) AS buffer_hit_rate_pct
FROM (
SELECT
(SELECT variable_value FROM performance_schema.global_status WHERE variable_name='Innodb_buffer_pool_reads') + 0 AS innodb_buffer_pool_reads,
(SELECT variable_value FROM performance_schema.global_status WHERE variable_name='Innodb_buffer_pool_read_requests') + 0 AS innodb_buffer_pool_read_requests
) t;
-- Current buffer pool size vs configured
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW STATUS LIKE 'Innodb_buffer_pool_bytes_data';
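The same hit-rate arithmetic works from the shell if you'd rather pull the two counters with `mysql -N -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%'"`. A sketch of just the calculation (the counter values below are fabricated):

```shell
# Buffer pool hit rate = (1 - disk_reads / read_requests) * 100.
hit_rate() {
  awk -v reads="$1" -v reqs="$2" \
    'BEGIN { printf "%.2f\n", (1 - reads / reqs) * 100 }'
}
hit_rate 2000 2000000    # → 99.90 — healthy
hit_rate 50000 2000000   # → 97.50 — working set no longer fits in RAM
```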
MySQL / Aurora — Schema & Tables
MySQL▼
-- Top 20 largest tables in a database
SELECT
table_name,
ROUND(data_length / 1024 / 1024, 1) AS data_mb,
ROUND(index_length / 1024 / 1024, 1) AS index_mb,
ROUND((data_length + index_length) / 1024 / 1024, 1) AS total_mb,
table_rows AS est_rows
FROM information_schema.tables
WHERE table_schema = DATABASE()
ORDER BY (data_length + index_length) DESC
LIMIT 20;
-- Tables with no primary key (data quality / replication risk)
SELECT table_schema, table_name
FROM information_schema.tables t
WHERE table_type = 'BASE TABLE'
AND NOT EXISTS (
SELECT 1 FROM information_schema.table_constraints tc
WHERE tc.table_schema = t.table_schema
AND tc.table_name = t.table_name
AND tc.constraint_type = 'PRIMARY KEY'
)
ORDER BY table_schema, table_name;
-- Unused indexes (rows = 0 in statistics since last restart)
SELECT
object_schema, object_name AS table_name, index_name
FROM performance_schema.table_io_waits_summary_by_index_usage
WHERE index_name IS NOT NULL
AND count_star = 0
AND object_schema NOT IN ('mysql','performance_schema','information_schema')
ORDER BY object_schema, object_name;
RDS MySQL / Aurora — Specific Checks
RDS▼
# Download slow query log from RDS
aws rds describe-db-log-files \
  --db-instance-identifier my-rds-instance \
  --filename-contains slow
aws rds download-db-log-file-portion \
  --db-instance-identifier my-rds-instance \
  --log-file-name slowquery/mysql-slowquery.log \
  --starting-token 0 --output text
# Check RDS events (restarts, failovers, parameter changes)
aws rds describe-events \
  --source-identifier my-rds-instance \
  --source-type db-instance \
  --duration 60   # last 60 minutes
# Aurora replica lag (CloudWatch reports this metric in milliseconds)
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name AuroraReplicaLag \
  --dimensions Name=DBClusterIdentifier,Value=my-aurora-cluster \
  --start-time $(date -u -d '30 min ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 60 --statistics Maximum
# Check parameter group settings
aws rds describe-db-parameters \
  --db-parameter-group-name my-param-group \
  --query 'Parameters[?ParameterValue!=`null`].[ParameterName,ParameterValue]' \
  --output table | grep -iE 'slow|connect|timeout|innodb_buffer'
PostgreSQL / Aurora PG — Active Sessions
PostgreSQL▼
Who is running what
PostgreSQL — run inside DB as superuser
-- All active queries with duration
SELECT pid, usename AS user, datname AS db,
       application_name AS app, client_addr, state,
       wait_event_type, wait_event,
       EXTRACT(EPOCH FROM (now() - query_start))::int AS secs,
       LEFT(query, 100) AS query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start NULLS LAST;
-- Long-running queries only (> 30 seconds)
SELECT pid, usename, state, wait_event,
       EXTRACT(EPOCH FROM (now() - query_start))::int AS secs,
       LEFT(query, 120) AS query
FROM pg_stat_activity
WHERE state != 'idle'
  AND query_start < now() - INTERVAL '30 seconds'
ORDER BY secs DESC;
Cancel or terminate
-- Cancel just the query (connection stays — like KILL QUERY in MySQL)
SELECT pg_cancel_backend(12345);
-- Terminate the entire connection (like KILL in MySQL)
SELECT pg_terminate_backend(12345);
-- Terminate ALL long-running queries > 5 minutes — USE WITH CARE
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state != 'idle'
  AND query_start < now() - INTERVAL '5 minutes'
  AND usename != 'rdsadmin';  -- don't kill internal RDS user
Connection counts
-- Current vs max connections
SELECT current_setting('max_connections') AS max,
COUNT(*) AS current
FROM pg_stat_activity;
-- By user and state
SELECT usename, state, COUNT(*)
FROM pg_stat_activity
GROUP BY usename, state
ORDER BY count DESC;
-- % of max_connections used
SELECT
MAX(conn) AS current_connections,
MAX(setting)::int AS max_connections,
ROUND(MAX(conn) * 100.0 / MAX(setting)::int, 1) AS pct_used
FROM
(SELECT COUNT(*) AS conn FROM pg_stat_activity) c,
pg_settings WHERE name = 'max_connections';
PostgreSQL — Locks & Blocking
PostgreSQL▼
-- Who is blocking whom (the most useful lock query)
SELECT blocked.pid AS blocked_pid,
       blocked.usename AS blocked_user,
       LEFT(blocked.query, 80) AS blocked_query,
       blocking.pid AS blocking_pid,
       blocking.usename AS blocking_user,
       LEFT(blocking.query, 80) AS blocking_query,
       EXTRACT(EPOCH FROM (now() - blocked.query_start))::int AS blocked_for_secs
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;
-- All current locks
SELECT l.pid, l.locktype, l.relation::regclass AS table_name,
       l.mode, l.granted, a.usename, LEFT(a.query, 80) AS query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.relation IS NOT NULL
ORDER BY l.granted, l.pid;
-- Lock waits summary
SELECT wait_event_type, wait_event, COUNT(*) AS count
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
GROUP BY wait_event_type, wait_event
ORDER BY count DESC;
PostgreSQL — Performance & Query Analysis
PostgreSQL▼
Top slow queries (pg_stat_statements — must be enabled)
-- Check if pg_stat_statements is enabled
SELECT * FROM pg_extension WHERE extname = 'pg_stat_statements';
-- Top 15 slowest query patterns
-- (before PostgreSQL 13 the columns are mean_time/max_time/total_time)
SELECT ROUND(mean_exec_time::numeric, 2) AS avg_ms,
       ROUND(max_exec_time::numeric, 2) AS max_ms,
       calls,
       ROUND(total_exec_time::numeric / 1000, 1) AS total_sec,
       rows,
       LEFT(query, 120) AS query
FROM pg_stat_statements
WHERE query NOT LIKE '%pg_%'
ORDER BY mean_exec_time DESC
LIMIT 15;
-- Most called queries (high frequency targets)
SELECT calls, ROUND(mean_exec_time::numeric, 2) AS avg_ms,
       LEFT(query, 100) AS query
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 10;
-- Reset stats (after fixing a query)
SELECT pg_stat_statements_reset();
EXPLAIN ANALYZE — query plan
-- Full analysis (BUFFERS shows cache hit/miss — critical for tuning)
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders
WHERE customer_id = 42 AND created_at > '2024-01-01';
-- What to look for:
--   Seq Scan on large table       → missing index
--   Rows= vs actual rows mismatch → bad statistics (run ANALYZE)
--   Buffers: shared hit=X read=Y  → Y > 0 means going to disk
--   Nested Loop with large outer  → consider Hash Join
--   Sort                          → possible index optimization
Cache hit rate
-- Overall cache hit ratio (should be > 99%)
SELECT
ROUND(
blks_hit * 100.0 / NULLIF(blks_hit + blks_read, 0), 2
) AS cache_hit_pct
FROM pg_stat_database
WHERE datname = current_database();
-- Per-table cache ratio
SELECT relname,
ROUND(heap_blks_hit * 100.0 / NULLIF(heap_blks_hit + heap_blks_read, 0), 1) AS hit_pct,
heap_blks_hit, heap_blks_read
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC LIMIT 15;
PostgreSQL — Table Bloat & Vacuum
PostgreSQL▼
⚠️Bloat is PostgreSQL's silent performance killer. Dead rows accumulate from updates/deletes and must be cleaned by VACUUM. Monitor this daily on write-heavy tables.
-- Tables with most dead rows (bloat candidates)
SELECT schemaname, relname AS table_name,
       n_live_tup AS live_rows,
       n_dead_tup AS dead_rows,
       ROUND(n_dead_tup * 100.0 / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
       last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC
LIMIT 20;
-- When was the last vacuum/analyze run?
SELECT relname, last_vacuum, last_autovacuum,
       last_analyze, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY last_autovacuum NULLS FIRST
LIMIT 20;
-- Manually run vacuum on a bloated table
VACUUM ANALYZE orders;
VACUUM VERBOSE orders;  -- shows what it's doing
-- Full vacuum (rewrites table, reclaims space — blocks reads/writes!)
-- Only use during maintenance window
VACUUM FULL orders;
PostgreSQL — Schema, Tables & Indexes
PostgreSQL▼
-- Largest tables (use relid — quote_ident(relname) breaks outside search_path)
SELECT schemaname, relname AS table_name,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
       pg_size_pretty(pg_relation_size(relid)) AS table_size,
       pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) AS index_size
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 20;
-- Unused indexes (never scanned since last stats reset)
SELECT schemaname, relname AS table_name, indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan AS times_used
FROM pg_stat_user_indexes
JOIN pg_index USING (indexrelid)
WHERE idx_scan = 0
  AND NOT indisprimary
  AND NOT indisunique
ORDER BY pg_relation_size(indexrelid) DESC;
-- Missing indexes (sequential scans on large tables)
SELECT schemaname, relname AS table_name, seq_scan, idx_scan,
       ROUND(seq_scan * 100.0 / NULLIF(seq_scan + idx_scan, 0), 1) AS seq_scan_pct
FROM pg_stat_user_tables
WHERE seq_scan > 1000
ORDER BY seq_scan DESC
LIMIT 15;
Backup & Restore
RDS · Aurora · EC2▼
RDS / Aurora — snapshots via AWS CLI (managed, preferred)
# Create manual snapshot before risky operations
aws rds create-db-snapshot \
  --db-instance-identifier my-rds \
  --db-snapshot-identifier pre-migration-$(date +%Y%m%d-%H%M)
# List available snapshots
aws rds describe-db-snapshots \
  --db-instance-identifier my-rds \
  --query 'DBSnapshots[*].[DBSnapshotIdentifier,SnapshotCreateTime,Status,AllocatedStorage]' \
  --output table
# Restore snapshot to a NEW instance (never restore over existing prod)
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier my-rds-restored \
  --db-snapshot-identifier pre-migration-20240115-1430 \
  --db-instance-class db.r6g.large \
  --no-publicly-accessible
# Check restore progress
aws rds describe-db-instances \
  --db-instance-identifier my-rds-restored \
  --query 'DBInstances[0].[DBInstanceStatus,PercentProgress]' \
  --output text
# Aurora cluster snapshot
aws rds create-db-cluster-snapshot \
  --db-cluster-identifier my-aurora-cluster \
  --db-cluster-snapshot-identifier pre-migration-$(date +%Y%m%d-%H%M)
# Restore Aurora snapshot to new cluster
aws rds restore-db-cluster-from-snapshot \
  --db-cluster-identifier my-aurora-restored \
  --snapshot-identifier pre-migration-20240115-1430 \
  --engine aurora-mysql
PostgreSQL — pg_dump (EC2-hosted or export from RDS)
⚠️For RDS, use snapshots first. pg_dump is for single-table exports, cross-account migrations, and dev/test refreshes — not full production backup replacement.
# Dump single database (custom format, compressed)
pg_dump -h rds-endpoint.amazonaws.com -U masteruser \
  -Fc -d mydb -f mydb_$(date +%Y%m%d).dump
# Parallel dump requires directory format (-j does NOT work with -Fc)
pg_dump -h rds-endpoint.amazonaws.com -U masteruser \
  -Fd -j 4 -d mydb -f mydb_$(date +%Y%m%d).dir
# Dump single table only
pg_dump -h rds-endpoint.amazonaws.com -U masteruser \
  -Fc -t public.orders -d mydb -f orders_backup.dump
# Restore to target DB (--no-owner prevents ownership errors; -j works here)
pg_restore -h target-rds.amazonaws.com -U masteruser \
  -d mydb --no-owner --no-acl -j 4 mydb_20240115.dump
# Restore single table into existing DB
pg_restore -h target-rds.amazonaws.com -U masteruser \
  -d mydb --no-owner -t orders orders_backup.dump
# Upload dump to S3 immediately after
aws s3 cp mydb_$(date +%Y%m%d).dump s3://my-backups-bucket/postgres/
# Quick schema-only export (structure without data)
pg_dump -h rds-endpoint.amazonaws.com -U masteruser \
  --schema-only -d mydb -f schema_$(date +%Y%m%d).sql
MySQL / Aurora MySQL — mysqldump
# Full database dump
mysqldump -h rds-endpoint.amazonaws.com -u admin -p \
  --single-transaction --routines --triggers \
  mydb > mydb_$(date +%Y%m%d).sql
# Single table
mysqldump -h rds-endpoint.amazonaws.com -u admin -p \
  --single-transaction mydb orders > orders_backup.sql
# Compressed dump piped directly to S3
mysqldump -h rds-endpoint.amazonaws.com -u admin -p \
  --single-transaction mydb \
  | gzip | aws s3 cp - s3://my-backups/mysql/mydb_$(date +%Y%m%d).sql.gz
# Restore
mysql -h target-rds.amazonaws.com -u admin -p mydb < mydb_20240115.sql
# Restore from S3 directly
aws s3 cp s3://my-backups/mysql/mydb_20240115.sql.gz - \
  | gunzip | mysql -h target-rds.amazonaws.com -u admin -p mydb
Point-in-Time Recovery (PITR) — RDS / Aurora
# Restore to specific point in time (e.g. just before a bad migration)
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier my-rds \
--target-db-instance-identifier my-rds-pitr \
--restore-time 2024-01-15T14:30:00Z \
--db-instance-class db.r6g.large \
--no-publicly-accessible
# Check what the earliest restore point is
aws rds describe-db-instances \
--db-instance-identifier my-rds \
--query 'DBInstances[0].{Earliest:EarliestRestorableTime,Latest:LatestRestorableTime}' \
--output table
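Getting `--restore-time` right under pressure is error-prone; GNU date can do the arithmetic for you. A sketch (assumes GNU date, as on Amazon Linux/Ubuntu — BSD/macOS date uses different flags):

```shell
# Compute a UTC restore time N minutes before a known-bad timestamp.
restore_time() {
  date -u -d "$1 - $2 minutes" +%Y-%m-%dT%H:%M:%SZ
}
restore_time '2024-01-15T14:45:00Z' 15
# → 2024-01-15T14:30:00Z
# Then pass the result to:
#   aws rds restore-db-instance-to-point-in-time ... --restore-time <that>
```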
Index Maintenance
MySQL · PostgreSQL · Redshift · Snowflake▼
PostgreSQL / Aurora PG — non-blocking index operations
🚨CREATE INDEX without CONCURRENTLY locks the table. On a production table with millions of rows, this means downtime. Always use CONCURRENTLY on live tables.
-- Create index WITHOUT table lock (takes longer but safe in prod)
CREATE INDEX CONCURRENTLY idx_orders_customer_id
ON public.orders (customer_id);
-- Create partial index (smaller, faster for filtered queries)
CREATE INDEX CONCURRENTLY idx_orders_pending
ON public.orders (created_at)
WHERE status = 'pending';
-- Drop index without lock
DROP INDEX CONCURRENTLY idx_orders_old;
-- Rebuild bloated index without lock
REINDEX INDEX CONCURRENTLY idx_orders_customer_id;
-- Rebuild all indexes on a table (PostgreSQL 12+)
REINDEX TABLE CONCURRENTLY public.orders;
-- Check index bloat (indexes > 30% bloat are candidates for rebuild)
SELECT
indexrelname AS index_name,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
idx_scan AS times_used,
ROUND(pg_relation_size(indexrelid) * 100.0 /
NULLIF(pg_total_relation_size(indrelid), 0), 1) AS pct_of_table
FROM pg_stat_user_indexes
JOIN pg_index USING (indexrelid)
WHERE pg_relation_size(indexrelid) > 10 * 1024 * 1024 -- > 10MB
ORDER BY pg_relation_size(indexrelid) DESC;
MySQL / Aurora MySQL
-- Add index online (ALGORITHM=INPLACE avoids full table copy for most cases)
ALTER TABLE orders
  ADD INDEX idx_orders_customer_id (customer_id),
  ALGORITHM=INPLACE, LOCK=NONE;
-- Drop unused index online
ALTER TABLE orders
  DROP INDEX idx_orders_old,
  ALGORITHM=INPLACE, LOCK=NONE;
-- Optimize table (reclaims space + rebuilds indexes — avoid on large tables in prod)
-- For InnoDB this is a full table rebuild. Schedule during a maintenance window.
OPTIMIZE TABLE orders;
-- Check index size (pages × 16KB default InnoDB page size)
SELECT table_name, index_name, stat_value AS pages,
       ROUND(stat_value * 16 / 1024, 1) AS size_mb
FROM mysql.innodb_index_stats
WHERE stat_name = 'size' AND database_name = DATABASE()
ORDER BY stat_value DESC
LIMIT 20;
-- Show all indexes on a table
SHOW INDEX FROM orders;
-- Force index usage for testing (remove in prod)
SELECT * FROM orders FORCE INDEX (idx_orders_customer_id)
WHERE customer_id = 42;
Redshift — sort keys & vacuum
💡Redshift uses sort keys instead of B-tree indexes. VACUUM reclaims deleted space and re-sorts blocks. Run after large DELETE/UPDATE operations.
-- Check tables needing vacuum (unsorted rows & deleted rows)
SELECT "table", unsorted, stats_off, size AS size_mb,
tbl_rows, estimated_visible_rows,
vacuum_sort_benefit
FROM SVV_TABLE_INFO
WHERE unsorted > 5 OR stats_off > 20
ORDER BY vacuum_sort_benefit DESC
LIMIT 20;
-- Vacuum specific table (reclaim space + resort)
VACUUM orders;
-- Vacuum + update statistics
VACUUM orders TO 75 PERCENT;
ANALYZE orders;
-- Full vacuum all tables (run during off-hours)
VACUUM FULL;
-- Check distkey & sortkey configuration
SELECT tablename, "column", sortkey, distkey, encoding
FROM pg_table_def
WHERE schemaname = 'public'
AND tablename = 'orders'
ORDER BY sortkey NULLS LAST;
Snowflake — clustering & search optimization
-- Check clustering quality (high average_depth in the output = re-cluster needed)
SELECT SYSTEM$CLUSTERING_INFORMATION('orders', '(order_date)');
-- Manually trigger re-clustering (background, non-blocking)
ALTER TABLE orders CLUSTER BY (order_date, customer_id);
-- Enable search optimization (for equality/range lookups on large tables)
ALTER TABLE orders ADD SEARCH OPTIMIZATION;
-- Check search optimization status & cost
SELECT * FROM TABLE(INFORMATION_SCHEMA.SEARCH_OPTIMIZATION_HISTORY(
  TABLE_NAME => 'orders'))
ORDER BY start_time DESC LIMIT 10;
-- Show current clustering keys
SHOW TABLES LIKE 'orders';
Data Validation SQL
Run after every pipeline▼
💡These are the 6 checks you run after any pipeline to confirm it produced correct results — not just that it finished without error. Works on MySQL, PostgreSQL, Redshift, and Snowflake unless noted.
1. Row count — source vs target
-- Compare source vs target counts for today's load
-- (alias is row_count — ROWS is a reserved word in MySQL 8+)
SELECT 'source' AS layer, COUNT(*) AS row_count
FROM source_db.orders
WHERE DATE(created_at) = CURRENT_DATE
UNION ALL
SELECT 'target' AS layer, COUNT(*) AS row_count
FROM target_db.orders
WHERE DATE(created_at) = CURRENT_DATE;
-- Row count trend (detect sudden drops to 0 or 10x spikes)
SELECT DATE(created_at) AS day,
       COUNT(*) AS row_count,
       COUNT(*) - LAG(COUNT(*)) OVER (ORDER BY DATE(created_at)) AS day_over_day_delta
FROM orders
WHERE created_at >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY DATE(created_at)
ORDER BY day;
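In an orchestrated pipeline the same comparison becomes a pass/fail gate. A hedged sketch of just the comparison logic — the two counts would come from the queries above via `psql -t` or `mysql -N`, and `counts_match` is a hypothetical helper, not a standard tool:

```shell
# Fail (exit 1) when target is more than <pct>% short of source.
counts_match() {
  awk -v src="$1" -v tgt="$2" -v pct="$3" \
    'BEGIN { exit ((src - tgt) * 100 > src * pct) ? 1 : 0 }'
}
counts_match 100000 99950 1 && echo OK         # 0.05% short → OK
counts_match 100000 90000 1 || echo MISMATCH   # 10% short → MISMATCH
```

Wire the non-zero exit into your orchestrator (Airflow task failure, Step Functions catch) so a silent partial load stops the DAG instead of propagating downstream.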
2. Null rate per column
-- Null rate for key columns (adjust column list to your table)
SELECT COUNT(*) AS total_rows,
       ROUND(COUNT(*) FILTER (WHERE customer_id IS NULL) * 100.0 / COUNT(*), 2) AS customer_id_null_pct,
       ROUND(COUNT(*) FILTER (WHERE order_total IS NULL) * 100.0 / COUNT(*), 2) AS order_total_null_pct,
       ROUND(COUNT(*) FILTER (WHERE created_at IS NULL) * 100.0 / COUNT(*), 2) AS created_at_null_pct,
       ROUND(COUNT(*) FILTER (WHERE status IS NULL) * 100.0 / COUNT(*), 2) AS status_null_pct
FROM orders
WHERE DATE(created_at) = CURRENT_DATE;
-- MySQL / Redshift / Snowflake version (no FILTER support):
SELECT COUNT(*) AS total,
       ROUND(SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS customer_id_null_pct,
       ROUND(SUM(CASE WHEN order_total IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS order_total_null_pct
FROM orders
WHERE DATE(created_at) = CURRENT_DATE;
3. Duplicate key detection
-- Find duplicate primary keys (should return 0 rows if pipeline is correct)
SELECT order_id, COUNT(*) AS occurrences
FROM orders
WHERE DATE(created_at) = CURRENT_DATE
GROUP BY order_id
HAVING COUNT(*) > 1
ORDER BY occurrences DESC
LIMIT 20;
-- Count of duplicates only (quick check)
SELECT COUNT(*) AS duplicate_keys
FROM (
  SELECT order_id
  FROM orders
  WHERE DATE(created_at) = CURRENT_DATE
  GROUP BY order_id
  HAVING COUNT(*) > 1
) dupes;
4. Data freshness — last record loaded
-- When was the last record actually loaded? (staleness check)
SELECT
MAX(created_at) AS latest_record,
NOW() - MAX(created_at) AS lag,
CASE
WHEN NOW() - MAX(created_at) > INTERVAL '2 hours' THEN '⚠ STALE'
ELSE '✓ FRESH'
END AS status
FROM orders;
-- Snowflake version
SELECT
MAX(created_at) AS latest_record,
DATEDIFF('minute', MAX(created_at), NOW()) AS lag_minutes,
CASE WHEN DATEDIFF('minute', MAX(created_at), NOW()) > 120
THEN 'STALE' ELSE 'FRESH' END AS status
FROM orders;
5. Referential integrity check
-- Orders with no matching customer (orphaned foreign keys)
SELECT COUNT(*) AS orphaned_orders
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
WHERE c.id IS NULL;
-- Show sample orphans
SELECT o.order_id, o.customer_id, o.created_at
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
WHERE c.id IS NULL
LIMIT 10;
6. Value distribution sanity check
-- Distribution of status values (unexpected values = upstream change)
SELECT status, COUNT(*) AS count,
       ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 1) AS pct
FROM orders
WHERE DATE(created_at) = CURRENT_DATE
GROUP BY status
ORDER BY count DESC;
-- Min / max / avg sanity check (detect negative prices, future dates, etc.)
SELECT MIN(order_total) AS min_total,
       MAX(order_total) AS max_total,
       ROUND(AVG(order_total), 2) AS avg_total,
       MIN(created_at) AS earliest,
       MAX(created_at) AS latest
FROM orders
WHERE DATE(created_at) = CURRENT_DATE;
PgBouncer & RDS Proxy — Connection Pooler Status
PostgreSQL▼
💡When you hit max_connections on RDS PostgreSQL, the pooler itself is often the bottleneck — not the database. Check the pooler first before scaling the RDS instance.
PgBouncer — connect to admin console
# Connect to PgBouncer admin console (port 6432 by default)
psql -h localhost -p 6432 -U pgbouncer pgbouncer
# Or from inside Docker
docker exec -it pgbouncer psql -h localhost -p 6432 -U pgbouncer pgbouncer
PgBouncer — key status queries
Run inside PgBouncer admin console
-- Pool status: cl_active = clients using connection, sv_idle = free server connections
SHOW POOLS;
-- Key columns:
--   cl_active  = clients currently executing a query
--   cl_waiting = clients WAITING for a connection (THIS IS BAD — means pool is full)
--   sv_idle    = server connections idle (available)
--   sv_active  = server connections in use
--   maxwait    = seconds oldest waiting client has been waiting (> 5s = problem)
-- All active client connections
SHOW CLIENTS;
-- Server connections to the real DB
SHOW SERVERS;
-- Throughput and latency stats
SHOW STATS;
-- avg_query_time in microseconds — divide by 1000 for ms
-- Current configuration
SHOW CONFIG;
-- Reload config without restart (after editing pgbouncer.ini)
RELOAD;
-- Pause a database (graceful — finish current queries first)
PAUSE mydb;
-- Resume
RESUME mydb;
PgBouncer — diagnose pool exhaustion
-- cl_waiting > 0 means clients are queuing for a DB connection
-- This query shows it clearly
SHOW POOLS;
-- Common causes and fixes:
-- 1. pool_size too small → increase pool_size in pgbouncer.ini
-- 2. Long-running transactions holding connections →
-- check pg_stat_activity for 'idle in transaction'
-- 3. Wrong pool mode → use 'transaction' mode for most apps, not 'session'
SHOW CONFIG; -- check pool_mode
-- Current connections consumed at the DB level
-- (run in PostgreSQL, not PgBouncer — pooled connections arrive from the
--  PgBouncer host, so filter on its client_addr, not application_name)
SELECT COUNT(*) AS pooler_connections,
       current_setting('max_connections') AS max_connections
FROM pg_stat_activity
WHERE client_addr = '10.0.0.5';  -- replace with your PgBouncer host IP
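For alerting, you can parse `psql -h localhost -p 6432 -U pgbouncer -Atc 'SHOW POOLS'` output and flag queuing. A sketch over fabricated rows — the field order below (database|user|cl_active|cl_waiting) is an assumption; it varies across PgBouncer versions, so check your `SHOW POOLS` header first:

```shell
# Flag pools where clients are queuing (cl_waiting > 0).
# Input: psql -At output, '|'-separated: database|user|cl_active|cl_waiting
pool_waiters() {
  awk -F'|' '$4 > 0 { print $1 ": " $4 " clients waiting" }'
}
printf 'mydb|app|18|0\nmydb|etl|20|7\n' | pool_waiters
# → mydb: 7 clients waiting
```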
RDS Proxy — status via AWS CLI
# List all RDS proxies
aws rds describe-db-proxies \
  --query 'DBProxies[*].[DBProxyName,Status,Endpoint,EngineFamily]' \
  --output table
# Check proxy target health (is it actually connected to the DB?)
aws rds describe-db-proxy-targets \
  --db-proxy-name my-rds-proxy \
  --query 'Targets[*].[Endpoint,Port,TargetHealth.State,TargetHealth.Reason]' \
  --output table
# TargetHealth.State should be AVAILABLE
# If UNAVAILABLE, the Reason field shows what's wrong
# Connection pool metrics via CloudWatch
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name ClientConnections \
  --dimensions Name=ProxyName,Value=my-rds-proxy \
  --start-time $(date -u -d '30 min ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 60 --statistics Average --output table
# Other key RDS Proxy metrics to watch:
#   DatabaseConnectionsCurrentlyBorrowed → connections in use
#   DatabaseConnectionsBorrowLatency     → how long to get a connection (ms)
#   QueryDatabaseResponseLatency         → DB response time through proxy
#   ClientConnections                    → total clients connected to proxy
☁️ ETL & AWS
Docker, S3, Glue, CloudWatch — the daily AWS data engineering toolkit.
Docker — Daily Commands
Docker▼
# Start / stop stack
docker compose up -d                  # start all services in background
docker compose up --build -d          # rebuild images before starting
docker compose down                   # stop and remove containers + networks
docker compose restart <service>      # restart one service only
# What's running?
docker ps                             # running containers
docker ps -a                          # all containers including stopped
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
# Logs — most used commands daily
docker logs -f <container>            # follow logs (Ctrl+C to stop)
docker logs --tail 200 <container>    # last 200 lines only
docker logs --since 30m <container>   # last 30 minutes of logs
docker logs <container> 2>&1 | grep -i error   # grep for errors only
# Resource usage
docker stats                          # live CPU/RAM per container
docker stats --no-stream              # snapshot (non-interactive)
Docker — Debug & Inspect
Docker▼
# Get a shell inside a container
docker exec -it <container> bash      # bash shell
docker exec -it <container> sh        # sh if bash not available
# Check env vars inside container (crucial for debugging config)
docker exec -it <container> env | sort
# Connect to Postgres inside container
docker exec -it <container> psql -U postgres -d mydb
# Test DB port from inside another container
docker exec -it <container> bash -c '</dev/tcp/db-host/5432' && echo "Port OPEN"
# Copy files in and out
docker cp ./config.yaml <container>:/app/config.yaml   # host → container
docker cp <container>:/app/output.csv ./output.csv     # container → host
# Inspect container details (networking, mounts, env)
docker inspect <container>
docker inspect <container> | grep -E 'IPAddress|Mounts|Env' -A 5
# Check container IP
docker inspect --format '{{.NetworkSettings.IPAddress}}' <container>
# Follow multiple containers at once (docker-compose)
docker compose logs -f --tail 50
Docker — Cleanup
Docker▼
⚠️
docker volume prune deletes database data. Always check which volumes exist before pruning.
docker system prune                   # remove stopped containers, dangling images, unused networks
docker image prune -a                 # remove ALL unused images (frees most space)
docker container prune -f             # remove all stopped containers
docker volume ls                      # LIST volumes before pruning
docker volume prune                   # remove unused volumes (DANGEROUS if DB volume)
docker system df                      # show disk usage by Docker components
docker compose down --remove-orphans  # removes leftover containers hogging ports
Docker — Emergency & Performance
Docker▼
When your laptop fans are screaming
# Live resource monitor — shows which container eats RAM (Spark/JVM culprit)
docker stats
docker stats --no-stream              # snapshot, non-interactive
# See actual Linux processes inside the container
docker top <name>
docker top <name> aux                 # full ps output
# Freeze container CPU without killing it (temporary relief)
docker pause <name>
docker unpause <name>
# Wait for container to finish and capture exit code (CI/CD scripts)
docker wait <name>                    # blocks, then prints the container's exit code
Network testing — the "Swiss Army Knife"
💡Run this from inside your ETL container. "Connected" = network OK, your Python code is the problem. "Hangs" = Docker network or Security Group blocking you.
# Test TCP connectivity from inside a container (verbose output)
docker exec -it <etl_container> curl -v telnet://db-hostname:5432
docker exec -it <etl_container> curl -v telnet://rds-endpoint.amazonaws.com:3306
# Alternative: /dev/tcp method (no curl needed)
docker exec -it <name> timeout 3 bash -c '</dev/tcp/db-host/5432' \
  && echo "Port OPEN" || echo "Port CLOSED"
# nc (netcat) — test from EC2 / local shell
nc -zv rds-endpoint.amazonaws.com 5432        # PostgreSQL / Aurora PG
nc -zv rds-endpoint.amazonaws.com 3306        # MySQL / Aurora MySQL
nc -zv redshift-endpoint.amazonaws.com 5439   # Redshift
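The /dev/tcp trick above is easy to wrap in a small reusable helper. A minimal sketch, assuming bash and coreutils `timeout`; `port_check` is a made-up name, not a standard tool:

```shell
# port_check — tiny wrapper around the bash /dev/tcp trick shown above.
# Usage: port_check HOST PORT [TIMEOUT_SECS]
port_check() {
  local host=$1 port=$2 t=${3:-3}
  if timeout "$t" bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "OPEN"
  else
    echo "CLOSED"
  fi
}
port_check 127.0.0.1 22   # prints OPEN or CLOSED
```

Drop it in .bashrc next to the aliases below and you get a one-word connectivity probe on any box with bash.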
ECR — push image to AWS
# Login to ECR (do this first — token expires in 12h)
aws ecr get-login-password --region us-east-1 | docker login \
  --username AWS --password-stdin \
  <account_id>.dkr.ecr.us-east-1.amazonaws.com
# Tag and push
docker tag my-image:latest <account_id>.dkr.ecr.us-east-1.amazonaws.com/my-repo:latest
docker push <account_id>.dkr.ecr.us-east-1.amazonaws.com/my-repo:latest
# Pull from ECR (e.g. on ECS/EC2 worker)
docker pull <account_id>.dkr.ecr.us-east-1.amazonaws.com/my-repo:latest
Useful aliases — add to .bashrc / .zshrc
alias dps="docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}'"
alias dlogs="docker logs -f --tail 50"
alias dclean="docker system prune -f"
alias dstop="docker compose down --remove-orphans"
AWS S3 — Operations
AWS▼
# Identity check first (always verify before touching prod)
aws sts get-caller-identity --profile production
# List and navigate
aws s3 ls s3://bucket-name/path/
aws s3 ls s3://bucket-name/ --recursive --human-readable --summarize
# Copy and sync
aws s3 cp local_file.csv s3://bucket-name/prefix/
aws s3 cp s3://bucket-name/file.csv ./local_file.csv
# ALWAYS dry-run sync before executing in production
aws s3 sync ./data/ s3://bucket-name/data/ --dryrun
aws s3 sync ./data/ s3://bucket-name/data/    # remove --dryrun when sure
# S3 to S3 copy (fast, server-side)
aws s3 sync s3://source-bucket/path/ s3://dest-bucket/path/
# Delete (careful!)
aws s3 rm s3://bucket-name/file.csv
aws s3 rm s3://bucket-name/prefix/ --recursive --dryrun   # dryrun first
# Object metadata (size, last modified, content-type)
aws s3api head-object --bucket bucket-name --key path/file.csv
# Check bucket region (important for cross-region latency)
aws s3api get-bucket-location --bucket bucket-name
# Secrets (use instead of hardcoding credentials)
aws secretsmanager get-secret-value --secret-id my/db/creds --query SecretString --output text
aws ssm get-parameter --name /myapp/db_password --with-decryption --query Parameter.Value --output text
S3 Diagnostics — the "Source" check
# Find the LATEST file in a prefix (root cause: "file not found")
aws s3 ls s3://my-bucket/prefix/ --recursive | sort | tail -n 5
# Check file size and metadata (root cause: "empty file / zero bytes")
aws s3api head-object --bucket my-bucket --key path/to/file.csv
# Returns: ContentLength, LastModified, ContentType, ETag
# Quick preview of a remote file WITHOUT downloading (root cause: "schema mismatch")
aws s3 cp s3://my-bucket/file.csv - | head -n 5
# Count objects in a prefix
aws s3 ls s3://my-bucket/prefix/ --recursive | wc -l
# Find files modified in the last 24h
aws s3 ls s3://my-bucket/prefix/ --recursive \
| awk '{print $1, $2, $4}' \
| awk -v d="$(date -d '1 day ago' '+%Y-%m-%d')" '$1 >= d'
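The "last 24h" filter above is plain text processing, so it can be sanity-checked offline against canned `aws s3 ls --recursive` output before it gates anything in production. A sketch; the object names and dates are made up:

```shell
# Verify the 'modified since' filter against canned listing output.
sample_listing() {
  cat <<'EOF'
2024-01-14 09:12:01     104857 prefix/old_file.csv
2024-01-16 03:44:59    2097152 prefix/new_file.csv
EOF
}
filter_since() {  # $1 = YYYY-MM-DD cutoff
  awk '{print $1, $2, $4}' | awk -v d="$1" '$1 >= d'
}
sample_listing | filter_since 2024-01-15   # keeps only the newer object
```

The string comparison works because ISO dates sort lexicographically; the same idea extends to hourly cutoffs if you include the time column.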
RDS via AWS CLI
AWS▼
# List RDS instances and their status
aws rds describe-db-instances \
  --query 'DBInstances[*].[DBInstanceIdentifier,DBInstanceStatus,DBInstanceClass,Engine]' \
  --output table
# RDS events (restarts, failovers, maintenance)
aws rds describe-events \
  --source-identifier my-rds-instance \
  --source-type db-instance \
  --duration 120   # last 2 hours
# List available log files
aws rds describe-db-log-files --db-instance-identifier my-rds-instance
# Download a log file
aws rds download-db-log-file-portion \
  --db-instance-identifier my-rds-instance \
  --log-file-name error/mysql-error.log \
  --starting-token 0 --output text > mysql-error.log
# Check parameter group (find key settings)
aws rds describe-db-parameters \
  --db-parameter-group-name my-pg \
  --query 'Parameters[?ParameterValue!=`null`].[ParameterName,ParameterValue]' \
  --output table | grep -iE 'slow|connect|timeout|buffer|innodb'
# Trigger a manual snapshot
aws rds create-db-snapshot \
  --db-instance-identifier my-rds-instance \
  --db-snapshot-identifier manual-snap-$(date +%Y%m%d-%H%M)
Glue & Step Functions
AWS▼
# List Glue jobs
aws glue list-jobs --output table
# Check last 5 runs of a job
aws glue get-job-runs --job-name my-etl-job \
  --query 'JobRuns[0:5].[JobRunState,StartedOn,CompletedOn,ErrorMessage]' \
  --output table
# Get error message for a failed run
aws glue get-job-run \
  --job-name my-etl-job \
  --run-id jr_abc123 \
  --query 'JobRun.ErrorMessage' --output text
# Start a job manually
aws glue start-job-run --job-name my-etl-job
# Reset bookmark (re-process all data from beginning)
aws glue reset-job-bookmark --job-name my-etl-job
# Get Glue error logs from CloudWatch
aws logs get-log-events \
  --log-group-name /aws-glue/jobs/error \
  --log-stream-name my-etl-job/jr_abc123 \
  --query 'events[*].message' --output text | tail -50
# Step Functions — list and check executions
aws stepfunctions list-state-machines --output table
aws stepfunctions list-executions \
  --state-machine-arn arn:aws:states:us-east-1:123456:stateMachine:MyPipeline \
  --status-filter FAILED --max-results 10 --output table
# Get execution history for a failed run
aws stepfunctions get-execution-history \
  --execution-arn arn:aws:states:us-east-1:123456:execution:MyPipeline:run-id \
  --query 'events[?type==`ExecutionFailed`]' --output json
CloudWatch — Key RDS Metrics
AWS▼
# Helper function for cleaner CW queries
function cw_metric() {
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name "$1" \
--dimensions Name=DBInstanceIdentifier,Value=my-rds-instance \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 --statistics "$2" \
--query 'sort_by(Datapoints, &Timestamp)[-1].['$2']' \
--output text
}
# Run key metrics
echo "CPU%: $(cw_metric CPUUtilization Average)"
echo "Connections: $(cw_metric DatabaseConnections Average)"
echo "Free Storage: $(cw_metric FreeStorageSpace Average) bytes"
echo "Read Latency: $(cw_metric ReadLatency Average) sec"
echo "Write Latency: $(cw_metric WriteLatency Average) sec"
echo "Swap Used: $(cw_metric SwapUsage Average) bytes"
echo "Replica Lag: $(cw_metric AuroraReplicaLag Maximum) ms"
# CloudWatch Logs — Glue / Lambda
aws logs describe-log-groups --query 'logGroups[*].logGroupName' --output text | tr '\t' '\n' | grep -i glue
aws logs filter-log-events \
--log-group-name /aws/lambda/my-function \
--filter-pattern "ERROR" \
--start-time $(date -d '1 hour ago' +%s000) \
--query 'events[*].message' --output text
IAM & Auth Debugging
AWS▼
# Who am I? (run this before any production operation)
aws sts get-caller-identity
aws configure list
# Check Docker clock drift (causes AWS auth errors — common on long-running containers)
date && docker exec <container> date
# Simulate permissions (check before hitting access denied)
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::123456789:role/GlueExecutionRole \
  --action-names s3:GetObject s3:PutObject glue:StartJobRun \
  --resource-arns "arn:aws:s3:::my-bucket/*"
# List policies on a role
aws iam list-attached-role-policies --role-name GlueExecutionRole
aws iam list-role-policies --role-name GlueExecutionRole   # inline policies
# VPC / Security group — test connectivity
aws ec2 describe-security-groups \
  --filters "Name=group-name,Values=rds-sg" \
  --query 'SecurityGroups[*].IpPermissions'
Linux / EC2 — File & Data Operations
Linux▼
# Row count, preview, encoding check
wc -l dataset.csv
head -5 dataset.csv
tail -5 dataset.csv
file dataset.csv # detect encoding
head -20 dataset.csv | cat -A # show hidden chars (^M = Windows line endings)
# Column processing
cut -d',' -f1-5 dataset.csv # first 5 columns
awk -F',' '{print NF}' data.csv | sort | uniq -c # column count consistency check
awk -F',' 'NR==1{print; next} $3=="ERROR"' data.csv # header + filter rows
# Find files
find /data -name "*.parquet" -mtime -1 # parquet files modified in last day
find /data -type f -size +1G # files larger than 1GB
grep -r "customer_id" /pipeline/sql/ # search inside files recursively
# Checksums (data integrity)
md5sum dataset.csv
sha256sum dataset.csv
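The column-count consistency check above is worth seeing end-to-end; here it is on a throwaway CSV with one deliberately ragged row (file name and contents are illustrative):

```shell
# Demonstrate the column-consistency check on a throwaway CSV.
tmp=$(mktemp /tmp/demo_XXXXXX.csv)
printf 'id,name,amount\n1,alice,10\n2,bob\n3,carol,30\n' > "$tmp"
# One distinct field count = healthy; more than one = ragged rows
awk -F',' '{print NF}' "$tmp" | sort | uniq -c
# Pinpoint the offending line numbers (header has 3 fields here)
awk -F',' 'NF != 3 {print NR": "$0}' "$tmp"
rm -f "$tmp"
```

The first awk tells you *whether* the file is ragged; the second tells you *where*, which is what you paste into the ticket.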
Transfer & Compression
Linux▼
# Reliable large transfers (resumes on failure)
rsync -avP --progress source/ destination/
rsync -avP /data user@ec2-host:/backup/
# Compression
gzip large_file.csv                  # standard, single-threaded
pigz large_file.csv                  # parallel gzip (much faster on multi-core)
gunzip file.csv.gz
tar -czvf archive.tar.gz ./folder/
tar -xzvf archive.tar.gz
# Progress monitoring for large operations
pv large_file.csv | gzip > file.csv.gz
pv large_file.csv | wc -l            # count rows while piping
Long Jobs & Persistent Sessions
Linux▼
# tmux — persistent sessions (survives SSH disconnect)
tmux new -s etl-run                  # new named session
tmux attach -t etl-run               # reconnect
tmux ls                              # list sessions
# Inside tmux: Ctrl+B, D = detach (leave running)
# nohup — simple background job
nohup python3 pipeline.py > pipeline.log 2>&1 &
echo $!                              # save the PID
# Monitor a long-running job
watch -n 5 "ps aux | grep pipeline.py"   # check every 5 seconds
watch -n 10 df -h                    # disk space every 10 seconds
tail -f pipeline.log                 # live log output
Redshift — Diagnostics
AWS▼
Load errors — debug COPY command failures
Run in Redshift SQL editor
-- COPY errors (essential when COPY command fails silently)
SELECT query, starttime, filename, line_number,
colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;
-- Detailed COPY history
SELECT query, trim(filename) AS file, lines_scanned,
lines_loaded, rows_loaded, is_partial
FROM stl_load_commits
ORDER BY query DESC LIMIT 20;
Hanging queries & locks
-- Currently running queries (inflight)
SELECT pid, user_name, start_time, LEFT(query, 100) AS query
FROM stv_inflight
ORDER BY start_time;
-- Kill a stuck query
CANCEL <pid>;
-- Table locks
SELECT a.txn_owner, a.txn_db, a.table_id, b.relname,
a.lock_mode, a.granted
FROM svv_transactions a
JOIN pg_class b ON a.table_id = b.oid
WHERE a.granted IS FALSE;
-- Long-running queries (> 5 minutes)
-- (STV_RECENTS tracks current activity; duration is in microseconds)
SELECT pid, user_name, duration, LEFT(query, 120) AS query
FROM stv_recents
WHERE status = 'Running'
  AND duration > 300000000 -- 300 seconds
ORDER BY duration DESC LIMIT 10;
Port connectivity test
nc -zv your-redshift-cluster.region.redshift.amazonaws.com 5439
# "Connection to ... succeeded" = network OK, check credentials
# "Connection refused / timed out" = Security Group or VPC blocking
Airflow — CLI Diagnostics
Orchestration▼
Task and DAG status
# Get state of a specific task run (faster than opening the UI)
airflow tasks state <dag_id> <task_id> <execution_date>
# e.g.: airflow tasks state my_etl_dag extract_data 2024-01-15
# List task instances for a DAG
airflow tasks list <dag_id> --tree
# Test a task without marking it in the metadata DB (safe for debugging)
airflow tasks test <dag_id> <task_id> <execution_date>
Fix stuck or failed tasks
# Clear failed tasks to re-run (quick fix)
airflow tasks clear <dag_id> \
  --start-date 2024-01-15 \
  --end-date 2024-01-15
# Clear + confirm without prompt (-y)
airflow tasks clear -y <dag_id> -s 2024-01-15 -e 2024-01-15
# Trigger a DAG run manually
airflow dags trigger <dag_id>
# Pause / unpause a DAG
airflow dags pause <dag_id>
airflow dags unpause <dag_id>
Scheduler & metadata DB health
# Check metadata DB connectivity (scheduler lag root cause)
airflow db check
# Show scheduler heartbeat (last seen time — if stale, scheduler is dead)
airflow jobs check --job-type SchedulerJob --limit 1
# Check Airflow version and config
airflow version
airflow config get-value core sql_alchemy_conn   # metadata DB connection
check_env.sh — Quick Environment Health Script
Script▼
💡 Save this as check_env.sh and run it before any deployment or when a job fails immediately. It tells you in 10 seconds whether the problem is AWS, network, or Docker.
check_env.sh — copy and customize endpoints
#!/bin/bash
# Quick environment health check — run before deployment or on job failure
echo "━━━ 1. AWS Identity ━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
aws sts get-caller-identity --query 'Arn' --output text
echo ""
echo "━━━ 2. S3 Connectivity ━━━━━━━━━━━━━━━━━━━━━━━━━"
aws s3 ls s3://your-main-bucket --max-items 1 > /dev/null \
&& echo "✓ S3 Access: OK" \
|| echo "✗ S3 Access: FAILED"
echo ""
echo "━━━ 3. Database Reachability ━━━━━━━━━━━━━━━━━━━"
# Replace with your actual endpoints
timeout 3 bash -c 'cat < /dev/tcp/your-rds-endpoint/5432' 2>/dev/null \
&& echo "✓ RDS PostgreSQL: Reachable" \
|| echo "✗ RDS PostgreSQL: Unreachable"
timeout 3 bash -c 'cat < /dev/tcp/your-rds-endpoint/3306' 2>/dev/null \
&& echo "✓ RDS MySQL: Reachable" \
|| echo "✗ RDS MySQL: Unreachable"
timeout 3 bash -c 'cat < /dev/tcp/your-redshift-endpoint/5439' 2>/dev/null \
&& echo "✓ Redshift: Reachable" \
|| echo "✗ Redshift: Unreachable"
echo ""
echo "━━━ 4. Docker Health ━━━━━━━━━━━━━━━━━━━━━━━━━━━"
docker info 2>/dev/null | grep -E "Running|Paused|Stopped" \
|| echo "✗ Docker: not running"
echo ""
echo "━━━ 5. Disk Space ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
df -h | awk 'NR==1 || int($5) > 70 {print}' # show header + any disk > 70% full
echo ""
echo "━━━ 6. IAM Permission Check ━━━━━━━━━━━━━━━━━━━━"
ROLE_ARN=$(aws sts get-caller-identity --query 'Arn' --output text)
aws iam simulate-principal-policy \
--policy-source-arn "$ROLE_ARN" \
--action-names s3:GetObject s3:PutObject \
--resource-arns "arn:aws:s3:::your-main-bucket/*" \
--query 'EvaluationResults[*].[EvalActionName,EvalDecision]' \
--output table 2>/dev/null || echo "(IAM simulation skipped — needs role ARN)"
Snowflake — Query & Cost Diagnostics
Snowflake▼
Most expensive failed queries (last 14 days) — cost-weighted by warehouse size
💡This query weights execution time by warehouse credits/hour so a 1-minute query on a 4XL warehouse correctly ranks higher than a 10-minute query on XSmall.
Run in Snowflake worksheet — requires ACCOUNTADMIN or MONITOR privilege
WITH WAREHOUSE_CREDITS_MAP AS (
SELECT * FROM (VALUES
('X-Small', 'STANDARD', 1),
('Small', 'STANDARD', 2),
('Medium', 'STANDARD', 4),
('Large', 'STANDARD', 8),
('X-Large', 'STANDARD', 16),
('2X-Large', 'STANDARD', 32),
('3X-Large', 'STANDARD', 64),
('4X-Large', 'STANDARD', 128),
('5X-Large', 'STANDARD', 256),
('6X-Large', 'STANDARD', 512),
('Medium', 'SNOWPARK-OPTIMIZED', 6),
('Large', 'SNOWPARK-OPTIMIZED', 12),
('X-Large', 'SNOWPARK-OPTIMIZED', 24),
('2X-Large', 'SNOWPARK-OPTIMIZED', 48),
('3X-Large', 'SNOWPARK-OPTIMIZED', 96),
('4X-Large', 'SNOWPARK-OPTIMIZED',192),
('5X-Large', 'SNOWPARK-OPTIMIZED',384),
('6X-Large', 'SNOWPARK-OPTIMIZED',768)
) AS t (WAREHOUSE_SIZE, WAREHOUSE_TYPE, CREDITS_PER_HOUR)
),
JOINED AS (
SELECT
qh.QUERY_TEXT,
qh.QUERY_ID,
qh.EXECUTION_STATUS,
qh.WAREHOUSE_SIZE,
qh.WAREHOUSE_TYPE,
qh.EXECUTION_TIME,
qh.EXECUTION_TIME * wcm.CREDITS_PER_HOUR AS EFFECTIVE_EXECUTION_TIME,
qh.USER_NAME
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY qh
JOIN WAREHOUSE_CREDITS_MAP wcm
ON wcm.WAREHOUSE_SIZE = qh.WAREHOUSE_SIZE
AND wcm.WAREHOUSE_TYPE = qh.WAREHOUSE_TYPE
WHERE qh.EXECUTION_STATUS = 'FAIL'
AND qh.START_TIME BETWEEN DATEADD('day', -14, CURRENT_DATE()) AND CURRENT_DATE()
ORDER BY EFFECTIVE_EXECUTION_TIME DESC
)
SELECT
QUERY_TEXT,
MAX(QUERY_ID) AS LATEST_QUERY_ID,
ANY_VALUE(WAREHOUSE_SIZE) AS WAREHOUSE_SIZE,
ANY_VALUE(WAREHOUSE_TYPE) AS WAREHOUSE_TYPE,
AVG(EFFECTIVE_EXECUTION_TIME::INT / 1000)::INT AS AVG_COST_WEIGHTED_SECS,
ANY_VALUE(USER_NAME) AS USER_NAME
FROM JOINED
GROUP BY QUERY_TEXT
ORDER BY AVG_COST_WEIGHTED_SECS DESC;
Active queries & sessions
-- Currently running queries
SELECT query_id, user_name, warehouse_name,
execution_status,
DATEDIFF('second', start_time, CURRENT_TIMESTAMP()) AS running_secs,
LEFT(query_text, 120) AS query
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(RESULT_LIMIT => 1000))
-- QUERY_HISTORY_BY_SESSION() only sees the current session; use QUERY_HISTORY()
WHERE execution_status = 'RUNNING'
ORDER BY running_secs DESC;
-- Kill a query
SELECT SYSTEM$CANCEL_QUERY('<query_id>');
-- Recent failures with error messages
SELECT query_id, start_time, user_name, warehouse_name,
error_code, error_message,
LEFT(query_text, 120) AS query
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE execution_status = 'FAIL'
AND start_time > DATEADD('hour', -24, CURRENT_TIMESTAMP())
ORDER BY start_time DESC
LIMIT 20;
Warehouse & credit usage
-- Credits used per warehouse last 7 days
SELECT warehouse_name,
SUM(credits_used) AS total_credits,
SUM(credits_used_compute) AS compute_credits,
COUNT(*) AS sessions
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE start_time > DATEADD('day', -7, CURRENT_DATE())
GROUP BY warehouse_name
ORDER BY total_credits DESC;
-- Warehouse queuing — are queries waiting?
SELECT warehouse_name,
       SUM(queued_overload_time) AS queued_overload_ms,
       SUM(queued_provisioning_time) AS queued_provisioning_ms,
       AVG(execution_time) AS avg_exec_ms
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE start_time > DATEADD('hour', -1, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY queued_overload_ms DESC NULLS LAST;
Table & storage diagnostics
-- Largest tables in current database
SELECT table_schema, table_name,
ROUND(bytes / 1024 / 1024 / 1024, 2) AS size_gb,
row_count,
ROUND(bytes / NULLIF(row_count, 0), 0) AS bytes_per_row
FROM information_schema.tables
WHERE table_type = 'BASE TABLE'
ORDER BY bytes DESC LIMIT 20;
-- Query result cache hit rate (high = good, saves credits)
SELECT
COUNT_IF(result_from_cache) AS cache_hits,
COUNT(*) AS total_queries,
ROUND(COUNT_IF(result_from_cache) * 100.0 / COUNT(*), 1) AS cache_hit_pct
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE start_time > DATEADD('day', -1, CURRENT_DATE());
-- Clone and time travel storage cost
SELECT table_name,
ROUND(active_bytes/1024/1024/1024, 3) AS active_gb,
ROUND(time_travel_bytes/1024/1024/1024, 3) AS time_travel_gb,
ROUND(failsafe_bytes/1024/1024/1024, 3) AS failsafe_gb
FROM SNOWFLAKE.ACCOUNT_USAGE.TABLE_STORAGE_METRICS
WHERE active_bytes > 0
ORDER BY (active_bytes + time_travel_bytes + failsafe_bytes) DESC
LIMIT 20;
dbt — CLI Diagnostics
Snowflake · Redshift · BigQuery▼
Environment & connection check
# Check dbt version and connection to warehouse
dbt --version
dbt debug                            # tests connection, profiles.yml, dependencies
# Debug with specific profile / target
dbt debug --profile my_project --target prod
# List all models (check what's in the project)
dbt ls
dbt ls --select tag:daily            # models with a specific tag
dbt ls --select source:*             # all sources
Running models & tests
# Run all models
dbt run
# Run a single model
dbt run --select orders
# Run a model + all its upstream dependencies
dbt run --select +orders
# Run a model + all downstream dependents
dbt run --select orders+
# Run only failed models from the last run
dbt run --select result:error --state ./target
# Run tests only
dbt test
dbt test --select orders             # test a specific model
dbt test --select source:raw_orders  # test a source
# Run source freshness check
dbt source freshness
dbt source freshness --select source:raw_orders
Diagnosing failures
# See full error output (default truncates)
dbt run --select orders 2>&1 | cat
# Read the run results JSON after a failure
cat target/run_results.json | python3 -c "
import json, sys
r = json.load(sys.stdin)
for n in r['results']:
if n['status'] not in ('success','pass'):
print(f\"--- {n['unique_id']} [{n['status']}] ---\")
print(n.get('message',''))
print()
"
# Compile a model to see the rendered SQL (without running it)
dbt compile --select orders
cat target/compiled/my_project/models/orders.sql
# Run with --fail-fast (stop on first error instead of continuing)
dbt run --fail-fast
# Verbose logging
dbt run --select orders --log-level debug 2>&1 | grep -iE 'error|warn|fail'
Source freshness & data quality
# Check all sources for freshness (compares MAX(loaded_at) to warn/error thresholds)
dbt source freshness
# Output to file for logging
dbt source freshness -o target/sources.json
# Read freshness results
cat target/sources.json | python3 -c "
import json, sys
r = json.load(sys.stdin)
for s in r['results']:
if s['status'] != 'pass':
print(f\"{s['unique_id']}: {s['status']} — {s.get('warn_after') or s.get('error_after')}\")
"
# Re-run only tests that failed last time
dbt test --select result:fail --state ./target
Snowflake / Redshift specific tips
# Check which warehouse dbt is using (Snowflake)
# In profiles.yml — confirm warehouse, role, database, schema
cat ~/.dbt/profiles.yml
# Override warehouse for a heavy model run (Snowflake; only works if the
# project wires this var into the model's snowflake_warehouse config)
dbt run --select large_model --vars '{"snowflake_warehouse": "TRANSFORM_XL"}'
# Check dbt model run times in Snowflake query history
SELECT
query_text, start_time,
ROUND(execution_time / 1000, 1) AS secs,
warehouse_name, user_name
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE query_text ILIKE '%/* {"app": "dbt"%'
AND start_time > DATEADD('hour', -24, CURRENT_TIMESTAMP())
ORDER BY execution_time DESC
LIMIT 20;
# Redshift — find dbt queries by comment tag
SELECT pid, starttime,
ROUND(datediff('second', starttime, endtime), 1) AS secs,
LEFT(querytxt, 120) AS query
FROM STL_QUERY
WHERE querytxt LIKE '%dbt%'
AND starttime > CURRENT_DATE - 1
ORDER BY starttime DESC LIMIT 20;
🧠 Output Guide
What the numbers actually mean — and what to do next.
Load Average — uptime
OS▼
output example
14:32 up 42 days, load average: 3.24, 7.81, 9.02
↑ 1min ↑ 5min ↑ 15min
Compare each number to your CPU count. On a 4-core machine: load 4.0 = 100% busy. Above = queue building. Get CPU count: lscpu | grep "^CPU(s)"
✓ Normal
Load < CPU count
System has headroom. No action needed.
⚠ Elevated
1–2× CPU count
Find driver process. May be transient.
✗ Critical
> 2× CPU count
System saturated. Queries timing out. Act now.
What to do next
1
High 1-min, low 15-min → spike from a single job. Run
ps -eo pid,%cpu,cmd --sort=-%cpu | head -5
2
All 3 elevated → sustained pressure. Check if it's CPU or I/O: high load + low CPU% in top = disk bottleneck. Run
iostat -x 1 3
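The load-to-core comparison is easy to script. A minimal sketch, Linux-only since it reads /proc/loadavg (so the numbers match uptime) and uses nproc:

```shell
# Print 1-min load per core. ratio < 1 = headroom, 1-2 = elevated, > 2 = saturated.
load_per_core() {
  local load cores
  load=$(cut -d' ' -f1 /proc/loadavg)
  cores=$(nproc)
  awk -v l="$load" -v c="$cores" \
    'BEGIN { printf "load=%.2f cores=%d ratio=%.2f\n", l, c, l/c }'
}
load_per_core
```

Handy in check scripts because the ratio is host-independent: the same alert threshold works on a 4-core box and a 64-core box.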
free -h — Memory
OS▼
output example
              total   used    free   shared  buff/cache  available
Mem: 15Gi 13Gi 312Mi 1.2Gi 1.8Gi 1.5Gi
Swap: 4Gi 2.1Gi 1.9Gi ← ANY swap on DB host = problem
🚨Read the available column, not free. Linux uses RAM for disk cache aggressively — that's normal. Available = free + reclaimable cache. Any swap usage on a database host is a performance emergency.
What to do next
1
Swap > 0: find the hog →
ps aux --sort=-%mem | head -10
2
MySQL:
innodb_buffer_pool_size should be ~70% of RAM max. PostgreSQL: watch work_mem × max_connections — it multiplies.
3
OOM killer fired? →
dmesg | grep -i "killed process"
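For cron checks and runbooks, the same numbers can be read machine-readably straight from /proc/meminfo. A Linux-only sketch (values in the file are kB):

```shell
# available RAM and swap-in-use, script-friendly (kB → MB).
mem_check() {
  awk '/^MemAvailable:/ {avail=$2}
       /^SwapTotal:/    {st=$2}
       /^SwapFree:/     {sf=$2}
       END { printf "available_mb=%d swap_used_mb=%d\n", avail/1024, (st-sf)/1024 }' \
    /proc/meminfo
}
mem_check   # nonzero swap_used_mb on a DB host = investigate now
```

Unlike parsing `free -h` output, /proc/meminfo field names are stable across distros, so this survives procps version changes.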
vmstat — Virtual Memory Stats
OS▼
vmstat 1 5 — columns that matter
procs ----memory--- --swap-- ---io---- --cpu---
r b swpd free si so bi bo us sy id wa
2 4 2048 312 142 198 8420 12300 34 18 14 34
Key Columns
r = runnable (waiting for CPU slot)
b = blocked on I/O
si = KB/s swapped IN from disk
so = KB/s swapped OUT to disk
wa = CPU% waiting for I/O
id = CPU% idle
Red Flags
si/so > 0 = swapping — immediate problem
wa > 20 = disk is the bottleneck
b > 2 = multiple processes stuck on I/O
r > CPU count = CPU contention
id = 0 = system fully saturated
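The red flags are mechanical enough to script. A sketch that applies them to one vmstat data line; column positions assume the standard 17-column layout (r b swpd free buff cache si so bi bo in cs us sy id wa st), so adjust if your vmstat prints differently:

```shell
# Flag a vmstat data line against the thresholds above.
flag_vmstat_line() {
  awk '{b=$2; si=$7; so=$8; wa=$16
        if (si + so > 0) print "FLAG: swapping (si/so > 0)"
        if (wa > 20)     print "FLAG: I/O wait high (wa > 20)"
        if (b > 2)       print "FLAG: processes blocked on I/O (b > 2)"}'
}
# Canned example; on a live box: vmstat 1 2 | tail -1 | flag_vmstat_line
echo "2 4 2048 312100 1200 54000 142 198 8420 12300 900 1200 34 18 14 34 0" \
  | flag_vmstat_line
```

Silence means none of the three thresholds tripped; any FLAG line maps directly to a row in the table above.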
iostat -x — Disk I/O Extended
OS▼
iostat -x 1 3
Device    r/s   w/s  rMB/s  wMB/s  await  r_await  w_await  %util
nvme0n1 124 890 1.8 28.4 48.2 3.1 52.4 97.3
✓ Healthy
%util < 60%, await < 10ms
I/O not a bottleneck.
⚠ Busy
%util 60–80%, await 10–20ms
Monitor. DB queries may slow.
✗ Saturated
%util > 80%, await > 20ms
Every query is waiting. Find the writer.
Diagnosis
1
High w_await, low r_await → write-heavy. Check ETL jobs, MySQL binary log, PostgreSQL WAL writes.
2
Find the offending process:
iotop -o or pidstat -d 1 5
3
On RDS: correlate with CloudWatch
ReadLatency, WriteLatency, DiskQueueDepth.
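The healthy/busy/saturated bands above collapse into a tiny classifier; feed it the %util and await numbers you read off `iostat -x`. The function name is made up, and the thresholds mirror this table, nothing more:

```shell
# Classify a device by %util and await using the bands above.
classify_io() {  # $1 = %util, $2 = await in ms
  awk -v u="$1" -v a="$2" 'BEGIN {
    if (u > 80 || a > 20)      print "SATURATED"
    else if (u > 60 || a > 10) print "BUSY"
    else                       print "HEALTHY"
  }'
}
classify_io 97.3 48.2   # the nvme0n1 sample above → SATURATED
classify_io 45 4        # → HEALTHY
```

Using `||` between the two conditions is deliberate: either symptom alone (busy device or slow completions) is enough to hurt query latency.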
ss -ant — Connection States
OS▼
ss -ant | awk '{print $1}' | sort | uniq -c
      1 LISTEN
847 ESTAB
312 TIME-WAIT
89 CLOSE-WAIT ← growing = connection leak in app
45 SYN-RECV ← high = server overloaded or SYN flood
ESTAB — established
Active connections. Normal. High count = pool not limiting or pool leak.
Compare to max_connections. If near limit, add PgBouncer/ProxySQL.
TIME-WAIT
Recently closed. Linux holds 60s for safety. Thousands = normal.
Only a problem if exhausting ephemeral port range (65k limit).
CLOSE-WAIT — danger
Remote closed but local app didn't. Growing count = connection leak bug.
Restart the leaking process. Fix the application code (missing close()).
SYN-RECV
Handshake started but not completed. Low = normal. High = overloaded server.
>100 sustained = investigate. Check if server can accept connections fast enough.
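CLOSE-WAIT only matters if it grows, and two samples a few seconds apart answer that. A sketch (assumes `ss` is installed; on a box without it the counts simply read 0):

```shell
# Count connections in a given state, twice, and compare.
count_state() { ss -ant 2>/dev/null | awk -v s="$1" '$1 == s' | wc -l; }
a=$(count_state CLOSE-WAIT)
sleep 2
b=$(count_state CLOSE-WAIT)
echo "CLOSE-WAIT: $a -> $b"
if [ "$b" -gt "$a" ]; then
  echo "growing: suspect a connection leak in the app"
fi
```

Stretch the sleep to 30-60s for a real verdict; a flat count is fine, a monotonically rising one is the leak signature described above.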
ps aux — STAT Column Codes
OS▼
R
Running / runnable — on CPU or ready
S
Sleeping — normal idle state
D
Uninterruptible I/O wait — cannot kill
Z
Zombie — dead, parent not cleaned
T
Stopped by signal or debugger
🚨D-state is the most dangerous. Process is stuck waiting for disk/network I/O and cannot be killed with kill -9. Many D-state processes = disk hung or NFS issue. Only fixing the I/O or rebooting resolves it. Find them:
ps -eo pid,stat,wchan:25,cmd | grep "^[0-9]* D"
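The D-state filter is plain text matching, so here it is demonstrated on canned `ps -eo pid,stat,wchan:25,cmd` output (the PIDs, wait channels, and commands are made up):

```shell
# Demo of the D-state filter on canned ps output.
sample_ps() {
  cat <<'EOF'
 1201 S    poll_schedule_timeout     /usr/sbin/mysqld
 1342 D    io_schedule               [kworker/u8:2]
 2077 R    -                         python3 pipeline.py
 2190 D    nfs_wait_on_request       cp /mnt/nfs/big.csv /data/
EOF
}
# Print PID and kernel wait channel for anything stuck in D state
sample_ps | awk '$2 ~ /^D/ {print $1, $3}'
```

Matching `$2 ~ /^D/` rather than `$2 == "D"` also catches compound codes like `D+` and `Dl` that ps emits for foreground or multi-threaded processes.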
MySQL PROCESSLIST — State Column
MySQL▼
Waiting for lock
Transaction blocked waiting for a row/table lock held by another connection.
Find the blocker via innodb_lock_waits. Kill it if stuck too long.
Copying to tmp table
Query creating temp table — usually a heavy GROUP BY or DISTINCT without proper index.
Add index or rewrite query. Check tmp_table_size setting.
Sending data
Reading data from table and sending to client. Often appears slow — usually the actual query work.
EXPLAIN the query. Look for full table scans (type=ALL).
Sorting result
Filesort — sorting without an index. Expensive on large datasets.
Add index matching ORDER BY clause columns.
Opening tables
Many threads opening tables — table cache too small or table DDL happening.
Increase table_open_cache. Check for ongoing ALTER TABLE operations.
Sleep (very long)
Idle connection holding resources. With wait_timeout expired = connection not properly closed.
Reduce wait_timeout. Ensure app uses connection pooling with proper cleanup.
pg_stat_activity — State & Wait Events
PostgreSQL▼
state column values
state                           meaning
active query is currently running
idle connection open, nothing running (normal in pool)
idle in transaction opened transaction but paused — holds locks!
idle in transaction (aborted) failed tx not rolled back — blocking others
⚠️idle in transaction is the PostgreSQL silent killer. It holds locks without doing work. Long-running idle in transaction sessions will block VACUUM, DDL changes, and other writes. Kill them:
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle in transaction' AND query_start < now() - INTERVAL '5 minutes';
wait_event = Lock
Waiting for a lock held by another session.
Use pg_blocking_pids() to find who holds it.
wait_event = ClientRead
Waiting for client to send next query. Connection sitting idle.
Normal for idle pooled connections. Set idle timeout in PgBouncer.
wait_event = IO / DataFileRead
Reading data from disk — cache miss. Shared buffers too small.
Increase shared_buffers. Check cache hit rate in pg_statio_user_tables.
wait_event = WALWrite
Writing to WAL log. High count = write-heavy workload hitting disk.
Tune checkpoint_completion_target=0.9, increase wal_buffers.
MySQL Error Log — Key Patterns
MySQL▼
Too many connections
Hit max_connections. New connections refused.
Kill idle connections. Add ProxySQL/connection pooler.
Deadlock found when trying to get lock
InnoDB killed one transaction to break the cycle.
Run SHOW ENGINE INNODB STATUS\G and check LATEST DEADLOCK section.
InnoDB: out of memory
Buffer pool or other allocation failed. OOM condition.
Reduce innodb_buffer_pool_size. Check total RAM vs all MySQL settings.
Aborted connection … (got error reading communication packets)
Client disconnected without closing properly. Network drop or app crash.
Usually noise. High frequency = network instability or app bug.
Disk full … waiting for someone to free space
All write operations hanging. MySQL paused.
Free space immediately. Purge binary logs: PURGE BINARY LOGS BEFORE NOW() - INTERVAL 3 DAY;
Table is marked as crashed
MyISAM corruption from unclean shutdown.
REPAIR TABLE tablename; or mysqlcheck --repair db table
PostgreSQL Log — Key Patterns
PostgreSQL▼
FATAL: remaining connection slots are reserved
Hit max_connections. PG reserves 3 slots for superuser so regular users hit this first.
Add PgBouncer in transaction mode. Increase max_connections + tune shared_buffers.
ERROR: deadlock detected
PG detected a lock cycle and killed one transaction. Log includes full dependency trace.
Read DETAIL in the log — it shows exact tables and PIDs. Fix lock ordering in app.
LOG: checkpoints are occurring too frequently
Checkpoints are being triggered by WAL volume instead of the timed schedule — too much dirty data flushed to disk at once.
Set checkpoint_completion_target=0.9. Increase max_wal_size. Check disk with iostat.
ERROR: could not open file "base/…"
Data file missing or corrupt. Unclean shutdown or disk failure.
Check disk health. Run pg_dump on surviving tables. May need PITR restore.
autovacuum: VACUUM table taking too long
Table bloat accumulating. Dead rows not cleaned fast enough.
Run VACUUM ANALYZE tablename manually. Check n_dead_tup in pg_stat_user_tables.
PANIC: could not write to file … No space left on device
Disk full — PG can't write WAL or data files. All transactions fail.
Free space. Check pg_wal/ directory. Review wal_keep_size setting.
Spark / Glue — Error Patterns
ETL▼
java.lang.OutOfMemoryError: Java heap space
Executor ran out of JVM heap. Dataset too large for current config.
Increase --executor-memory. Repartition data to reduce partition size.
Lost executor … heartbeat timeout
Worker node died or became unresponsive. Stage will retry.
Check worker node: disk full or OOM. Check spark.executor.heartbeatInterval.
FetchFailed … SparkException: stage failed
Shuffle read failed — executor that wrote shuffle data died before read.
Increase spark.shuffle.io.maxRetries. Check shuffle disk health on executors.
Task not serializable
Non-serializable object used inside a Spark closure (lambda/map).
Broadcast the object or make it serializable. Never pass DB connections to executors.
Container killed by YARN for exceeding memory limits
Executor total memory (heap + overhead) exceeded YARN limit.
Increase spark.executor.memoryOverhead (spark.yarn.executor.memoryOverhead on Spark < 2.3) or reduce executor-memory.
Job aborted: FileNotFoundException on s3://
Input file disappeared mid-job. Path changed or deleted.
Verify path: aws s3 ls s3://bucket/path/ Check if another process deleted input.
🐛 Debug Protocol
D-W-F-E-V-S — 6 steps, 5 minutes, 90% of issues solved. Memorize this sequence.
D-W-F-E-V-S — The Memory Device
Framework▼
D
Data or System?
Split the problem. Pick one. Focus there.
W
What Changed?
80% rule: recent change in last 24-48h
F
First Error
Top of logs = root cause. Bottom = cascades.
E
Evidence
Don't assume. Always verify with proof.
V
Volume
Compare to baseline. 10x spike = volume issue.
S
Stupid Questions
Check basics. Catches 40% of all issues.
📌The 90-5-5 Rule: 90% of issues solved by Steps 1-6 in ~5 minutes. 5% need specialized checks (~3 min). 5% require deep dive — escalate.
5-Minute Time Breakdown
Framework▼
Step 1
10s
0:10 total
Data or System?
Step 2
60s
1:10 total
What Changed?
Step 3
90s
2:40 total
First Error
Step 4
60s
3:40 total
Verify Evidence
Step 5
60s
4:40 total
Volume Compare
Step 6
60s
5:40 total
Stupid Questions
Step 1 — Data or System? (10 seconds)
Cuts 80% of noise▼
Ask the question. Pick one. Do not investigate both simultaneously — that's how you lose 30 minutes. Split the problem space in half immediately.
📁 DATA Problems
Files missing or empty
Schema changed
Bad / unparseable records
Volume spike or drop
Upstream source changed
⚙️ SYSTEM Problems
Service down
IAM / credentials expired
Network / VPC blocked
CPU / memory exhausted
Disk full
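The SYSTEM half of the split can be triaged mechanically before you read a single log line. A minimal sketch, assuming a Linux host with `df` on the PATH; the 90% threshold is an illustrative assumption, not a standard:

```shell
# Hedged sketch: 10-second SYSTEM-side check — disk pressure on the root filesystem.
# The 90% threshold is an assumption; tune it to your fleet.
used=$(df -P / | awk 'NR==2 {print $5+0}')
if [ "$used" -gt 90 ]; then
  echo "SYSTEM suspect: root filesystem at ${used}% — start on the SYSTEM branch"
else
  echo "disk at ${used}% — if services are up too, lean towards DATA"
fi
```

If this and a `systemctl status` come back clean, commit to the DATA branch and stop second-guessing the split.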
Step 2 — What Changed? (60 seconds)
The 80% rule▼
📌80% of production issues are caused by a change in the last 24-48 hours. Code deploy, schema migration, config update, AWS change, upstream data format shift.
# Code changes in last 2 days
git log --since="2 days ago" --oneline
# Volume trend — did row counts change?
SELECT date, COUNT(*) AS row_count
FROM your_table
WHERE date >= CURRENT_DATE - 7
GROUP BY date
ORDER BY date;
# File count per day in S3 (spot the drop)
for i in {0..7}; do
d=$(date -d "$i days ago" +%Y-%m-%d)
echo "$d: $(aws s3 ls s3://bucket/$d/ | wc -l) files"
done
# Schema changes (PostgreSQL)
SELECT nspname AS schema_name, relname AS table_name, attname, atttypid::regtype
FROM pg_attribute
JOIN pg_class ON attrelid = pg_class.oid
JOIN pg_namespace ON relnamespace = pg_namespace.oid
WHERE attnum > 0 AND NOT attisdropped AND relkind = 'r'
  AND nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY attrelid, attnum;
Step 3 — First Error in Logs (90 seconds)
Root cause is at the top▼
⚠️Read top to bottom, not bottom up. The FIRST error in time is the root cause. Everything after it is a cascading failure triggered by that first event. Junior engineers waste 20 minutes debugging the wrong error.
# 1. Application logs → your code errors first
grep -i "error" logfile.log | head -10
# 2. Glue / system logs → orchestration layer
aws logs filter-log-events \
  --log-group-name /aws/glue/jobs/output \
  --filter-pattern "ERROR" \
  --max-items 10 \
  --query 'events[*].message' --output text
# 3. Infrastructure logs → CloudWatch, RDS
aws logs filter-log-events \
  --log-group-name /aws/rds/instance/my-db/error \
  --filter-pattern "ERROR" --max-items 10 \
  --query 'events[*].message' --output text
# 4. Airflow task logs
airflow tasks log <dag_id> <task_id> <execution_date>
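The "first error wins" rule is easy to apply mechanically: `grep -m1` stops at the earliest match instead of scrolling you to the latest cascade. A minimal sketch with a demo log inlined — swap in your real logfile path:

```shell
# Hedged sketch: find the FIRST error in a log, not the last.
# Demo log inlined so the snippet is self-contained; point $log at your real file.
log=/tmp/demo_app.log
printf '%s\n' \
  '10:00:01 INFO  job started' \
  '10:00:05 ERROR disk full while writing stage output' \
  '10:00:06 ERROR task 12 failed' \
  '10:00:06 ERROR stage aborted' > "$log"
grep -n -m1 -iE 'error|fatal' "$log"   # -m1: stop at the first match = root-cause candidate
```

Here the first match is the disk-full error; the two errors after it are the cascade you would waste time on if you read bottom-up.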
Step 4 — Verify, Don't Assume (60 seconds)
Evidence over intuition▼
🚨"It should work" ≠ "It works." Every assumption you make without verifying adds 10 minutes to diagnosis time. Run the command. Check the actual value.
# Credentials — are they actually valid?
aws sts get-caller-identity
# Does the file/object actually exist?
aws s3 ls s3://bucket/path/file.parquet
aws s3api head-object --bucket my-bucket --key path/file.parquet
# Is the DB actually accepting connections?
nc -zv rds-endpoint.amazonaws.com 5432
telnet rds-endpoint.amazonaws.com 3306
# Is the data actually there?
SELECT COUNT(*) FROM table WHERE date = CURRENT_DATE;
# Compare dev vs prod config
diff <(aws ssm get-parameter --name /dev/db_config --query Parameter.Value --output text) \
     <(aws ssm get-parameter --name /prod/db_config --query Parameter.Value --output text)
# What environment is this actually running in?
echo $ENV && echo $AWS_DEFAULT_REGION && aws configure list
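The environment check is worth turning into a guard rather than an eyeball test. A minimal sketch — `APP_ENV` and `EXPECTED_ENV` are stand-ins for whatever variable your deploy tooling actually sets:

```shell
# Hedged sketch: fail fast on a wrong-environment run.
# APP_ENV / EXPECTED_ENV are assumed names — match them to your own setup.
EXPECTED_ENV=prod
APP_ENV=${APP_ENV:-prod}   # demo default; normally exported by your runner
if [ "$APP_ENV" != "$EXPECTED_ENV" ]; then
  echo "ABORT: running in '$APP_ENV', expected '$EXPECTED_ENV'" >&2
  exit 1
fi
echo "env check ok: $APP_ENV"
```

Dropping this at the top of a job script is how you stop "prod code against dev DB" incidents permanently instead of catching them in step 6.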
Step 5 — Compare Data Volume (60 seconds)
Many perf issues = volume issues▼
💡Red flags: 10x spike, sudden drop to 0. Many "performance issues" are actually volume issues in disguise — the query plan was fine for 1M rows but breaks at 50M.
# Row count trend (PostgreSQL / Snowflake syntax; MySQL wants INTERVAL 7 DAY)
SELECT
DATE(created_at) AS day,
COUNT(*) AS row_count
FROM your_table
WHERE created_at >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY DATE(created_at)
ORDER BY day;
# File count per day in S3
for i in {0..7}; do
d=$(date -d "$i days ago" +%Y-%m-%d)
cnt=$(aws s3 ls s3://your-bucket/prefix/$d/ 2>/dev/null | wc -l)
echo "$d: $cnt files"
done
# Avg file size check (detect 0-byte or undersized files)
aws s3 ls s3://bucket/prefix/ --recursive \
| awk '{sum+=$3; count++} END {print "Files:", count, "Avg size:", int(sum/count), "bytes"}'
# Partition distribution (Spark — detect skew)
df.groupBy("partition_col").count().orderBy("count", ascending=False).show(20)
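Day-over-day counts are easier to scan when anomalies are flagged for you. A sketch applying the 10x red-flag rule from above to a `date count` series — demo counts are inlined; in practice feed it the output of the loops above:

```shell
# Hedged sketch: flag any 10x day-over-day swing in a "date count" series.
# Demo data inlined; replace /tmp/counts.txt with your real per-day counts.
printf '%s\n' '2024-01-01 1000' '2024-01-02 1100' '2024-01-03 90' > /tmp/counts.txt
awk 'prev && ($2 > prev*10 || $2 < prev/10) {print $1": volume anomaly ("prev" -> "$2")"} {prev=$2}' /tmp/counts.txt
```

On the demo data this prints a single anomaly line for 2024-01-03, where the count collapsed from 1100 to 90.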
Step 6 — Stupid Questions Checklist (60 seconds)
Catches 40% of all issues▼
⚠️Senior engineers run this checklist too. There is no shame in checking the obvious. The shame is in spending 3 hours before checking it.
# ✓ Credentials valid?
aws sts get-caller-identity
# ✓ Data exists?
aws s3 ls s3://bucket/path/
SELECT COUNT(*) FROM table WHERE date = CURRENT_DATE;
# ✓ Service actually running?
systemctl status airflow-scheduler
systemctl status mysql
docker ps | grep <service>
# ✓ Right environment? (not running prod code against dev DB)
echo $ENV
echo $AWS_DEFAULT_REGION
aws configure list
# ✓ Network open?
nc -zv hostname 5432 # PostgreSQL
nc -zv hostname 3306 # MySQL
nc -zv hostname 5439 # Redshift
curl -sv https://your-endpoint/health 2>&1 | grep -E 'Connected|Failed'
# ✓ Permissions OK?
aws iam simulate-principal-policy \
  --policy-source-arn $(aws sts get-caller-identity --query Arn --output text) \
  --action-names s3:GetObject --resource-arns "arn:aws:s3:::my-bucket/*"
10 Critical Log Patterns
See pattern → know fix instantly▼
"Field 'x' not found"
Schema drift — upstream changed a column name or removed a field.
Align schema. Add schema validation at ingestion. Check source DDL changes.
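Schema drift is cheap to catch if you keep the expected column list under version control and diff it at ingestion. A minimal sketch — the column lists here are hypothetical demo values; derive the real ones from your source DDL:

```shell
# Hedged sketch: detect schema drift by diffing expected vs observed column lists.
# Column names are demo values; in practice generate actual_cols from the live source.
printf '%s\n' id name email > /tmp/expected_cols.txt
printf '%s\n' id full_name email > /tmp/actual_cols.txt   # demo: upstream renamed "name"
if diff -u /tmp/expected_cols.txt /tmp/actual_cols.txt > /tmp/schema_drift.txt; then
  echo "schema ok"
else
  echo "SCHEMA DRIFT detected:"
  grep -E '^[+-][^+-]' /tmp/schema_drift.txt   # show only the changed columns
fi
```

Run as a pre-flight step, this turns a cryptic mid-job "Field 'x' not found" into an explicit diff before any processing starts.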
"Cannot parse" / "ParseException"
Bad record — malformed JSON, wrong delimiter, encoding issue.
Add input validation. Preview file: head -5 or aws s3 cp s3://... - | head -5
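A large share of "Cannot parse" errors are just the wrong delimiter. Counting fields per candidate delimiter on the first line usually settles it. A sketch with a demo file inlined — run the same pipeline on the head of your real file:

```shell
# Hedged sketch: sniff the delimiter of a suspect file by counting fields per candidate.
# Demo semicolon-delimited file inlined; in practice use the first lines of the real file.
printf 'id;name\n1;alice\n2;bob\n' > /tmp/suspect.csv
for delim in ',' ';' $'\t' '|'; do
  n=$(head -1 /tmp/suspect.csv | awk -F"$delim" '{print NF}')
  echo "delimiter '$delim' -> $n field(s)"
done
```

The delimiter that yields more than one field per line is almost certainly the real one — here `;` gives 2 fields while the others give 1.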
"NullPointerException"
Unexpected null value in a field assumed to be non-null.
Add null checks. Use COALESCE / fillna. Trace to source column.
"OutOfMemoryError: Java heap space"
Data skew or shuffle too large for executor memory.
Repartition. Increase executor-memory. Salt skewed keys.
"Connection timed out"
VPC / firewall / security group blocking the port.
Check network: nc -zv host port. Verify security group inbound rules.
"401 Unauthorized" / "403 Forbidden"
Credentials expired or IAM policy missing the required action.
Rotate secrets. Run IAM simulate-principal-policy to find the missing permission.
"429 Too Many Requests"
Rate limit hit on an API or AWS service.
Add exponential backoff. Batch requests. Check service quotas in AWS console.
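Exponential backoff fits in a few lines of shell. In this sketch `try_once` is a placeholder for your real `curl`/`aws` call; the demo version simply succeeds on the third attempt so the loop is observable:

```shell
# Hedged sketch: retry with exponential backoff. try_once is a stand-in —
# replace its body with the real API call; the demo succeeds on attempt 3.
try_once() { [ "$1" -ge 3 ]; }
attempt=1; delay=1; max_attempts=5
while ! try_once "$attempt"; do
  if [ "$attempt" -ge "$max_attempts" ]; then echo "giving up"; exit 1; fi
  echo "attempt $attempt got 429; backing off ${delay}s"
  sleep 0   # demo keeps it fast; use  sleep "$delay"  for real
  delay=$((delay * 2)); attempt=$((attempt + 1))
done
echo "succeeded on attempt $attempt"
```

Doubling the delay (1s, 2s, 4s, …) is the standard shape; adding random jitter on top helps when many workers hit the same limit at once.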
"0 records processed"
Upstream source is empty — file missing, table empty, wrong partition filter.
Check source: aws s3 ls or SELECT COUNT(*). Verify date partition logic.
"Corrupted parquet" / "Could not read footer"
Mixed schema in a Parquet partition, or file written incompletely.
Regenerate the file. Check schema evolution. Read with .option("mergeSchema", "true") in Spark.
"Task skipped" / "upstream_failed"
Wrong DAG trigger rule or upstream dependency failed silently.
Check trigger_rule on task. Inspect upstream task logs. Fix dependency chain.
5 Failure Categories
Every failure maps to one of these▼
~50%
DATA
Files, schema, records, volume
⚡ Fast to check
~20%
INFRA
IAM, network, storage, resources
🔧 Medium effort
~15%
CODE
Joins, columns, parsing, nulls
🔍 Quick to find
~10%
ORCHESTRATION
Triggers, upstream, timing
🔧 Medium effort
~5%
UPSTREAM
API, format, rate limits
⚡ Fast to check
💡Start with DATA (50% probability, fast to check). Then INFRA (20%, medium effort). This order gives you the best time-to-diagnosis ratio on any incident.
Command Cheat Sheet by Step
Quick Reference▼
Step 2 — What changed?
git log --since="2 days ago" --oneline
SELECT date, COUNT(*) FROM table WHERE date >= CURRENT_DATE - 7 GROUP BY date ORDER BY date;
aws s3 ls s3://bucket/prefix/ --recursive | wc -l
Step 3 — First error
grep -i "error" logfile.log | head -10
aws logs filter-log-events --log-group-name /aws/glue/jobs/output \
  --filter-pattern "ERROR" --max-items 10 --query 'events[*].message' --output text
Step 4 — Verify
aws sts get-caller-identity
aws s3 ls s3://bucket/path/file.parquet
telnet hostname 5432
Step 5 — Volume compare
for i in {0..7}; do d=$(date -d "$i days ago" +%Y-%m-%d); echo "$d: $(aws s3 ls s3://bucket/$d/ | wc -l) files"; done
Step 6 — Stupid questions
aws sts get-caller-identity
SELECT COUNT(*) FROM table WHERE date = CURRENT_DATE;
systemctl status service-name
nc -zv hostname 5432