Plan an investigation into a metric anomaly
intermediateClaude SonnetData & AnalyticsData Analystanomaly-detectionincidentinvestigationtriagedata-analyst
Use case
Use this prompt when a metric drops or spikes unexpectedly and the team is asking 'is this real?' This produces the triage plan you bring to the first meeting — not a writeup of what happened.
The prompt
You are a senior data analyst building an anomaly triage plan. The metric just moved, the team is alarmed, and you have an hour to stand up an investigation. Inputs: - Metric and movement:{{metric_movement}}(e.g., "DAU dropped 18% on Mar 12, recovered partial 50% by Mar 13") - When it started:{{onset}}- How it was detected:{{detection}}(alert, dashboard glance, exec ping, customer report) - Recent changes in scope:{{recent_changes}}(deploys, marketing, instrumentation, infra, vendor) - Stakeholders and their question:{{stakeholders}}- Available systems and data:{{systems}}- Severity assessment:{{severity}}(revenue impact, customer-facing, exec-visible) Produce a triage plan with these sections: ## Is It Real? (the first 30 minutes) Three to five quick checks that determine whether this is a data problem or a real-world problem. Order by speed. Examples: tracking pipeline health, comparing two independent data sources for the same metric, checking whether the move is on an aggregate or only specific dimensions. For each: the check, the system or query, and the expected time. ## If Real: Candidate Causes (ranked) For each candidate cause: - One-sentence statement - Why it fits the timing and shape of the anomaly - The specific test to confirm or rule out - Effort to test Rank by how well they fit the anomaly's shape (sudden vs. gradual, recovered vs. persistent, segment-wide vs. concentrated). ## Ownership and Sequencing - Who runs which check (analyst, data engineer, product engineer, ops, vendor) - The order: what runs first, what depends on what - The next checkpoint and what we need to know by then ## Communications Plan - Who needs an update and when (especially if{{severity}}is high) - The interim message to send before the investigation concludes (1–2 sentences) - The escalation trigger: if X by Y time, raise to Z ## What Not to Do Two to three things that are tempting but wrong: shipping a fix before root cause, communicating uncertainty as certainty, or rolling back a deploy without isolating it as the cause. Tone: calm, structured, operational. The reader is in incident mode and needs a plan, not a recap.
Variables
{{{{metric_movement}}}}Replace with your {{metric movement}}{{{{onset}}}}Replace with your {{onset}}{{{{detection}}}}Replace with your {{detection}}{{{{recent_changes}}}}Replace with your {{recent changes}}{{{{stakeholders}}}}Replace with your {{stakeholders}}{{{{systems}}}}Replace with your {{systems}}{{{{severity}}}}Replace with your {{severity}}
Example
Input
metric_movement: Daily revenue (gross transactions) dropped 31% on Mar 18 vs. 28-day average. Mar 19 morning data shows partial recovery to -12%. onset: Drop began at approximately 09:00 UTC on Mar 18; persisted through end of day. detection: Anomaly alert in Looker triggered at 14:00 UTC; CFO emailed at 17:00 UTC asking what happened. recent_changes: | - Checkout v3 deployed 22 hours before drop (rolled to 100%) - Stripe SDK upgraded same day in same deploy - Marketing turned down a paid search campaign for budget reasons Mar 17 evening - Monthly Stripe report run and reconciliation pipeline ran at 08:50 UTC stakeholders: CFO (asking), VP Eng (deploy owner), Head of Growth (marketing change) systems: Stripe dashboard, internal billing DB (Postgres), Looker, Snowplow, Sentry, deploy log severity: High — revenue-visible, exec-visible, customer-impact unknown
Output
## Is It Real? (first 30 minutes) 1. **Reconcile internal billing DB to Stripe dashboard for Mar 18.** Both systems should agree on gross daily revenue. If Stripe shows a smaller drop (or no drop), the anomaly is in our pipeline, not in real transactions. (~10 min — VP Eng or analyst, query Postgres + open Stripe dashboard side by side) 2. **Check Snowplow event volume for `purchase_completed` on Mar 18 vs. 28-day baseline.** A third independent source. If Snowplow agrees with the drop, it's likely real. If Snowplow shows normal volume, the issue is downstream of purchase. (~10 min — analyst) 3. **Look at the reconciliation pipeline log.** It ran at 08:50 UTC, 10 minutes before drop began. Did it complete normally? Did it write to the same tables Looker reads from? (~10 min — data engineer) 4. **Sentry for checkout/payment errors Mar 18.** Spike in checkout errors after the v3 deploy is the simplest "real" cause. (~5 min — VP Eng) Decision point at the end of these checks: real, data, or partial. ## If Real: Candidate Causes (ranked) 1. **Checkout v3 introduced a regression on a payment path.** - Fits: deploy was 22 hours before onset; partial recovery suggests a hotfix or partial rollback may already be active. Sudden onset matches a deploy. - Test: Sentry error rate on checkout endpoints; conversion rate on the payment step pre- vs. post-deploy; segment by browser/country/payment method. Cross-check with deploy timeline for any rollback or feature flag change between Mar 18 and Mar 19. - Effort: small. 2. **Stripe SDK upgrade is silently failing for a subset of cards or 3DS flows.** - Fits: same deploy, but SDK issues often hit specific BIN ranges or 3DS regions, which would explain partial (not total) drop and partial recovery if traffic mix changed. - Test: Stripe error codes by hour on Mar 18; cut by card brand, country, 3DS yes/no. Compare to Mar 17. - Effort: small (Stripe dashboard is fastest). 3. **Paid search pause cut top-of-funnel transaction-likely traffic.** - Fits: pause happened evening of Mar 17; lower volume from a high-converting channel could explain part of the drop. - Test: pull transaction volume by traffic source for Mar 18 vs. 28-day baseline. If the drop is concentrated in paid search and broader sources are flat, this is the cause (or a contributor). - Effort: small. - Note: a 31% revenue drop is unlikely to be fully explained by a single paid search pause unless that channel is unusually large. Likely contributor, not full cause. 4. **Reconciliation pipeline corrupted Mar 18 figures (data, not real).** - Fits: pipeline ran 10 minutes before the apparent drop. If checks 1 and 2 disagree, this is the answer. - Test: covered in "Is It Real?" check 1. If real, drop this candidate. - Effort: small. 5. **External: Stripe outage or card network issue.** - Fits: real but exogenous. Less likely without other companies reporting it. Status page is one click. - Test: Stripe status page, downdetector, ops Slack channels. - Effort: trivial. ## Ownership and Sequencing - **VP Eng:** Sentry check, deploy timeline reconciliation (in parallel with analyst's reconciliation). - **Data engineer:** reconciliation pipeline log; confirm Looker source-of-truth tables. - **Analyst:** Stripe vs. internal DB reconciliation, Snowplow check, traffic-source cut. - **Order:** the four "Is It Real?" checks run in parallel within the first 30 minutes. By T+30, we know whether to investigate as data or as product. Whichever path it is, ranked candidates above sequence accordingly. - **Next checkpoint:** stand-up at T+30, then again at T+90. ## Communications Plan - **Now (interim message to CFO):** "We're seeing the same drop and have an investigation running across pipeline, deploy, and traffic sources. Will have a directional answer within 60 minutes; root cause within 3 hours." - **At T+90:** update with confirmed cause direction (real vs. data, candidate cause if real). - **Escalation trigger:** if no clear direction by T+90 and Sentry shows ongoing checkout errors, page on-call and consider a rollback of the checkout v3 deploy. ## What Not to Do - **Do not roll back the checkout deploy yet.** Without isolating it as the cause, a rollback both costs work and destroys evidence. Hold rollback as the action you take after the deploy is confirmed as the cause, unless customer-facing checkout is breaking right now. - **Do not communicate "checkout is broken" to anyone outside the investigation team yet.** Premature attribution to a deploy creates noise and damages eng trust. - **Do not assume the recovery on Mar 19 means the issue is over.** A partial recovery with no root cause identified means it can come back.
Tips for best results
- 1Always start with 'is it real?' Pipeline issues masquerading as business issues are common, and you cannot un-pull-the-fire-alarm with the CFO.
- 2Two independent data sources beat one trusted source. If Stripe and your internal DB agree, it's real; if they disagree, you have a different (smaller) problem.
- 3Map the anomaly's shape — sudden vs. gradual, persistent vs. partial recovery, segment-concentrated vs. wide — before ranking causes. The shape rules out half the candidates immediately.
- 4Send a calm interim message to execs early. 'We're investigating, will have an update in 60 minutes' beats silence and beats speculation.
- 5Resist the urge to roll back a deploy before isolating it. Rollback is an action of last resort during triage, not first.
Related prompts
Generate hypotheses for a funnel dropoff
intermediateTake a conversion funnel with a notable dropoff and produce a ranked set of hypotheses with a concrete investigation plan for each.
Data & Analyticsfunnel-analysisinvestigationhypotheses
Generate weekly metrics commentary from a CSV
beginnerTurn a weekly metrics CSV into a tight written commentary that explains what moved, why, and what to do about it.
Data & Analyticsweekly-metricsbusiness-reviewcommentary
Plan data quality tests for a new pipeline
advancedProduce a tiered data quality test plan for a new ingest or pipeline — covering source freshness, schema, volume, and business-logic checks without over-testing.
Data & Analyticsdata-qualitytestingdbt
Need help implementing this prompt in your workflow?
Book a call