Plan data quality tests for a new pipeline

advancedClaude SonnetData & AnalyticsAnalytics Engineeringdata-qualitytestingdbtpipelineanalytics-engineering

Use case

Use this prompt when standing up a new pipeline or dbt project area and needing the test plan before you ship to production. Forces a tiered approach so you don't end up with 400 tests, half of them flapping.

The prompt

You are a senior analytics engineer designing the data quality test plan for a new pipeline. The goal is to catch real problems early, fail loudly when they matter, and avoid the test-fatigue trap where engineers ignore alerts because half are false positives.

Inputs:
- Pipeline overview: {{pipeline_overview}} (source, transformations, destination, downstream consumers)
- Source(s) and their reliability: {{source_reliability}} (vendor SLA, known issues, schema stability)
- Critical downstream uses: {{downstream_uses}} (e.g., "drives finance MRR report — must reconcile to Stripe", "powers customer-facing usage dashboard")
- Existing tests in adjacent areas: {{existing_tests}}
- Tooling available: {{tooling}} (dbt tests, dbt-utils, dbt-expectations, Great Expectations, Monte Carlo, Elementary, custom)
- Alert routing: {{alert_routing}} (where failures go, who's on-call)

Produce a plan with these tiers:

## Tier 1: Source / Freshness Tests
Tests that verify data is arriving and is structurally what we expect. Run these as the first gate.
- Source freshness (max age since last load) — set thresholds based on source reliability
- Source row volume bounds (min/max expected daily volume; absolute floor + relative drop %)
- Schema drift detection (new columns, removed columns, type changes)
- Specify alert severity: page or ticket?

## Tier 2: Structural / Schema Tests
Tests on the materialized models that verify dbt-level invariants.
- Primary key uniqueness and not-nullness on every model's grain column
- Referential integrity for foreign keys to dimension tables
- Accepted values for status / enum columns
- Not-null on columns the SQL guarantees won't be null
- Be conservative — only test what the model actually guarantees

## Tier 3: Business Logic Tests
Tests that verify the pipeline produces correct numbers. These are the highest-value tests and the ones most worth the engineering time.
- Reconciliation against an external source of truth (e.g., daily revenue total in pipeline = daily revenue total in vendor dashboard within X%)
- Ratio bounds (e.g., conversion rate should be between 1% and 30%; negative outside means a bug)
- Cross-table consistency (e.g., sum of order line items = order total)
- Derived metric stability (large WoW deltas on metrics that should be stable)

## Tier 4: Anomaly / Trend Tests
Optional. Run sparingly. These are softer signals.
- Volume anomaly detection (today's volume more than 3σ from rolling 30-day mean)
- Distribution drift on key categorical columns
- Late-arriving data warnings

## Routing and Severity
For each tier, define:
- Alert severity (page on-call, ticket, dashboard-only)
- Owner (analytics engineering, data engineering, business owner)
- Expected response SLA

## What NOT to Test
Call out 3–5 things that are tempting to test but produce more noise than signal. Examples: testing every nullable column for non-null; testing aggregate metrics that legitimately move day-to-day; testing exact equality on derived columns subject to ordering.

Tone: pragmatic. The plan should produce a finite, defensible list of tests, not an exhaustive coverage matrix.

Variables

{{{{pipeline_overview}}}}Replace with your {{pipeline overview}}
{{{{source_reliability}}}}Replace with your {{source reliability}}
{{{{downstream_uses}}}}Replace with your {{downstream uses}}
{{{{existing_tests}}}}Replace with your {{existing tests}}
{{{{tooling}}}}Replace with your {{tooling}}
{{{{alert_routing}}}}Replace with your {{alert routing}}

Example

Input

pipeline_overview: |
  Source: Stripe (Fivetran connector, hourly sync)
  Staging: stg_stripe__charges, stg_stripe__customers, stg_stripe__subscriptions
  Marts: fct_charges (transaction-level), fct_subscription_events (subscription state changes), dim_subscriptions
  Consumers: finance MRR/ARR reporting, GTM dashboards, customer success at-risk model
source_reliability: Stripe is highly reliable; Fivetran connector has had 2 schema-drift incidents in the last 12 months (both column additions, not breaking)
downstream_uses: |
  1. Finance MRR report — must reconcile to Stripe dashboard within 0.5%
  2. Customer-facing usage / billing dashboard — wrong numbers are visible to customers
  3. Churn prediction model — feature freshness matters
existing_tests: dbt project has uniqueness/not-null tests on existing dimension models; no business-logic reconciliation tests yet
tooling: dbt + dbt-utils + dbt-expectations; Elementary for anomaly detection; Slack for alerts; PagerDuty integration available
alert_routing: Slack #data-alerts for warnings; PagerDuty for criticals; analytics-eng on-call rotation

Output

## Tier 1: Source / Freshness Tests
- **stg_stripe__charges freshness:** warn at 4h, error at 8h. Stripe is hourly; an 8h gap means the connector is broken. → PagerDuty.
- **stg_stripe__customers / __subscriptions freshness:** warn at 8h, error at 24h. Less time-critical. → Slack.
- **Daily charge volume bounds:** absolute floor of 50 charges/day (sanity), and relative bound: today's volume not less than 50% of trailing 7-day median or more than 200%. → Slack first hit, PagerDuty if persists 2 days.
- **Schema drift on stg_stripe__* tables:** monitor for added/removed columns or type changes via Elementary. Added columns → ticket (non-blocking, common). Removed columns or type changes → PagerDuty (prior incidents had column additions, but type changes would break downstream).
- **Owner:** analytics engineering on-call.

## Tier 2: Structural / Schema Tests
Apply to all marts:
- `unique` and `not_null` on grain columns: `charge_id` (fct_charges), `subscription_event_id` (fct_subscription_events), `subscription_id` (dim_subscriptions).
- `relationships` test from `fct_charges.customer_id` to `dim_customers.customer_id`. Use a where-clause to permit guest charges if any.
- `accepted_values` on `status`, `currency`, `subscription_state` columns.
- `not_null` on `created_at_utc`, `amount_usd_net_discount`, `customer_id` (where the SQL guarantees it).
- **Owner:** model author at PR time; CI blocks on failure.

## Tier 3: Business Logic Tests (the highest-value tier)
- **Daily reconciliation: sum(fct_charges.amount_usd) by date vs. Stripe API daily summary, last 7 days.** Threshold: 0.5% absolute difference. This is the test that earns the rest of the plan its keep — it catches every silent breakage that schema tests miss. → PagerDuty if outside threshold.
- **MRR computed from dim_subscriptions vs. finance MRR report:** monthly. Threshold: 0.5%. → ticket to finance + analytics-eng.
- **Subscription state machine integrity:** for each subscription_id, the sequence of `subscription_event_type` should be a valid path (e.g., 'created' → 'active' → 'cancelled', not 'cancelled' → 'active'). Custom dbt test. → Slack.
- **Charge amount bounds:** `amount_usd >= 0`, `currency in ('USD','EUR','GBP', ...)`. → CI block.
- **Cross-table: sum of `fct_charges` for a subscription should be ≤ contract value × 1.05 (5% buffer for usage overages).** Catches double-charging or stale dim joins. → ticket.
- **Owner:** analytics engineering writes; business owner (finance) approves the threshold.

## Tier 4: Anomaly / Trend Tests
- **Daily charge count anomaly (Elementary):** alert when today's count is >3σ from rolling 30-day mean. → Slack only, suppressed on weekends and known marketing-event days.
- **Daily new-subscription count anomaly:** same setup. → Slack only.
- **Late-arriving subscription events:** flag any event whose `event_at_utc` is more than 48h before the loader timestamp. Stripe occasionally backfills, but a sudden surge is a quality signal. → ticket.
- **Owner:** monitored by analytics-eng on-call; not actioned outside business hours unless paired with a Tier 1–3 alert.

## Routing and Severity
| Tier | Severity | Channel | Owner | SLA |
| --- | --- | --- | --- | --- |
| Tier 1 critical (freshness, schema-break) | Page | PagerDuty | AE on-call | Acknowledge 30 min |
| Tier 1 warning (volume bound) | Ticket | Slack #data-alerts | AE on-call | Same business day |
| Tier 2 | Block CI | n/a | Model author | At PR review |
| Tier 3 reconciliation | Page | PagerDuty | AE on-call + finance | Acknowledge 1h |
| Tier 3 logic | Ticket | Slack | AE on-call | Same business day |
| Tier 4 | Info | Slack only | AE on-call (review weekly) | Weekly review |

## What NOT to Test
- **Don't test every nullable column for not_null.** Many columns are legitimately null (e.g., `cancelled_at_utc` on active subscriptions). Test what the SQL guarantees, not what would feel safer.
- **Don't test exact equality on derived ratios.** Floating-point drift and ordering produce false positives. Use bounds.
- **Don't anomaly-test metrics that should move day-to-day** (e.g., daily new charges during a marketing campaign window). Use suppression windows or skip.
- **Don't reconcile against the dbt project's own MRR for "validation."** Reconcile against an external source (Stripe API or finance's hand-calculated total). A test that compares the model to itself catches nothing.
- **Don't add a test for every business stakeholder request.** Stakeholder concerns are real but should be addressed in CI / dashboards / docs first; tests are a last resort because they have an ongoing maintenance cost.

Tips for best results

1Tier the tests. Treating every test as equally important guarantees alert fatigue and ignored failures.
2The single highest-leverage test is reconciliation to an external source of truth. If you only have time for one Tier 3 test, that's the one.
3Set thresholds based on the cost of a false positive. A flapping test that wakes someone up at 2am will be silenced and stop catching real problems.
4Don't test the model against itself. Self-consistency tests catch nothing — reconciliation has to be against a different system.
5Schedule a quarterly test review. Tests accumulate; the marginal value of each one decays. Prune tests that haven't fired meaningfully in 6 months.

Related prompts

Generate dbt model documentation from SQL

intermediate

Take a dbt model's SQL and generate the schema.yml description, column docs, and recommended tests an analytics engineer would actually approve.

Data & Analyticsdbtdocumentationschema-yml

Review and propose refactors for a SQL query

intermediate

Produce a structured review of a SQL query — correctness, readability, performance, and dbt-fitness — with concrete refactor suggestions.

Data & Analyticssqlcode-reviewrefactor

Plan an investigation into a metric anomaly

intermediate

Produce a structured investigation plan for a sudden metric anomaly: triage steps, candidate causes, ownership, and the order of cuts to run.

Data & Analyticsanomaly-detectionincidentinvestigation

Need help implementing this prompt in your workflow?

Book a call

You are a senior analytics engineer designing the data quality test plan for a new pipeline. The goal is to catch real problems early, fail loudly when they matter, and avoid the test-fatigue trap where engineers ignore alerts because half are false positives. Inputs: - Pipeline overview: {{pipeline_overview}} (source, transformations, destination, downstream consumers) - Source(s) and their reliability: {{source_reliability}} (vendor SLA, known issues, schema stability) - Critical downstream uses: {{downstream_uses}} (e.g., "drives finance MRR report — must reconcile to Stripe", "powers customer-facing usage dashboard") - Existing tests in adjacent areas: {{existing_tests}} - Tooling available: {{tooling}} (dbt tests, dbt-utils, dbt-expectations, Great Expectations, Monte Carlo, Elementary, custom) - Alert routing: {{alert_routing}} (where failures go, who's on-call) Produce a plan with these tiers: ## Tier 1: Source / Freshness Tests Tests that verify data is arriving and is structurally what we expect. Run these as the first gate. - Source freshness (max age since last load) — set thresholds based on source reliability - Source row volume bounds (min/max expected daily volume; absolute floor + relative drop %) - Schema drift detection (new columns, removed columns, type changes) - Specify alert severity: page or ticket? ## Tier 2: Structural / Schema Tests Tests on the materialized models that verify dbt-level invariants. - Primary key uniqueness and not-nullness on every model's grain column - Referential integrity for foreign keys to dimension tables - Accepted values for status / enum columns - Not-null on columns the SQL guarantees won't be null - Be conservative — only test what the model actually guarantees ## Tier 3: Business Logic Tests Tests that verify the pipeline produces correct numbers. These are the highest-value tests and the ones most worth the engineering time. - Reconciliation against an external source of truth (e.g., daily revenue total in pipeline = daily revenue total in vendor dashboard within X%) - Ratio bounds (e.g., conversion rate should be between 1% and 30%; negative outside means a bug) - Cross-table consistency (e.g., sum of order line items = order total) - Derived metric stability (large WoW deltas on metrics that should be stable) ## Tier 4: Anomaly / Trend Tests Optional. Run sparingly. These are softer signals. - Volume anomaly detection (today's volume more than 3σ from rolling 30-day mean) - Distribution drift on key categorical columns - Late-arriving data warnings ## Routing and Severity For each tier, define: - Alert severity (page on-call, ticket, dashboard-only) - Owner (analytics engineering, data engineering, business owner) - Expected response SLA ## What NOT to Test Call out 3–5 things that are tempting to test but produce more noise than signal. Examples: testing every nullable column for non-null; testing aggregate metrics that legitimately move day-to-day; testing exact equality on derived columns subject to ordering. Tone: pragmatic. The plan should produce a finite, defensible list of tests, not an exhaustive coverage matrix.

Variables

{{{{pipeline_overview}}}}Replace with your {{pipeline overview}}

{{{{source_reliability}}}}Replace with your {{source reliability}}

{{{{downstream_uses}}}}Replace with your {{downstream uses}}

{{{{existing_tests}}}}Replace with your {{existing tests}}

{{{{tooling}}}}Replace with your {{tooling}}

{{{{alert_routing}}}}Replace with your {{alert routing}}

Example

Input

pipeline_overview: |
  Source: Stripe (Fivetran connector, hourly sync)
  Staging: stg_stripe__charges, stg_stripe__customers, stg_stripe__subscriptions
  Marts: fct_charges (transaction-level), fct_subscription_events (subscription state changes), dim_subscriptions
  Consumers: finance MRR/ARR reporting, GTM dashboards, customer success at-risk model
source_reliability: Stripe is highly reliable; Fivetran connector has had 2 schema-drift incidents in the last 12 months (both column additions, not breaking)
downstream_uses: |
  1. Finance MRR report — must reconcile to Stripe dashboard within 0.5%
  2. Customer-facing usage / billing dashboard — wrong numbers are visible to customers
  3. Churn prediction model — feature freshness matters
existing_tests: dbt project has uniqueness/not-null tests on existing dimension models; no business-logic reconciliation tests yet
tooling: dbt + dbt-utils + dbt-expectations; Elementary for anomaly detection; Slack for alerts; PagerDuty integration available
alert_routing: Slack #data-alerts for warnings; PagerDuty for criticals; analytics-eng on-call rotation

Output

## Tier 1: Source / Freshness Tests
- **stg_stripe__charges freshness:** warn at 4h, error at 8h. Stripe is hourly; an 8h gap means the connector is broken. → PagerDuty.
- **stg_stripe__customers / __subscriptions freshness:** warn at 8h, error at 24h. Less time-critical. → Slack.
- **Daily charge volume bounds:** absolute floor of 50 charges/day (sanity), and relative bound: today's volume not less than 50% of trailing 7-day median or more than 200%. → Slack first hit, PagerDuty if persists 2 days.
- **Schema drift on stg_stripe__* tables:** monitor for added/removed columns or type changes via Elementary. Added columns → ticket (non-blocking, common). Removed columns or type changes → PagerDuty (prior incidents had column additions, but type changes would break downstream).
- **Owner:** analytics engineering on-call.

## Tier 2: Structural / Schema Tests
Apply to all marts:
- `unique` and `not_null` on grain columns: `charge_id` (fct_charges), `subscription_event_id` (fct_subscription_events), `subscription_id` (dim_subscriptions).
- `relationships` test from `fct_charges.customer_id` to `dim_customers.customer_id`. Use a where-clause to permit guest charges if any.
- `accepted_values` on `status`, `currency`, `subscription_state` columns.
- `not_null` on `created_at_utc`, `amount_usd_net_discount`, `customer_id` (where the SQL guarantees it).
- **Owner:** model author at PR time; CI blocks on failure.

## Tier 3: Business Logic Tests (the highest-value tier)
- **Daily reconciliation: sum(fct_charges.amount_usd) by date vs. Stripe API daily summary, last 7 days.** Threshold: 0.5% absolute difference. This is the test that earns the rest of the plan its keep — it catches every silent breakage that schema tests miss. → PagerDuty if outside threshold.
- **MRR computed from dim_subscriptions vs. finance MRR report:** monthly. Threshold: 0.5%. → ticket to finance + analytics-eng.
- **Subscription state machine integrity:** for each subscription_id, the sequence of `subscription_event_type` should be a valid path (e.g., 'created' → 'active' → 'cancelled', not 'cancelled' → 'active'). Custom dbt test. → Slack.
- **Charge amount bounds:** `amount_usd >= 0`, `currency in ('USD','EUR','GBP', ...)`. → CI block.
- **Cross-table: sum of `fct_charges` for a subscription should be ≤ contract value × 1.05 (5% buffer for usage overages).** Catches double-charging or stale dim joins. → ticket.
- **Owner:** analytics engineering writes; business owner (finance) approves the threshold.

## Tier 4: Anomaly / Trend Tests
- **Daily charge count anomaly (Elementary):** alert when today's count is >3σ from rolling 30-day mean. → Slack only, suppressed on weekends and known marketing-event days.
- **Daily new-subscription count anomaly:** same setup. → Slack only.
- **Late-arriving subscription events:** flag any event whose `event_at_utc` is more than 48h before the loader timestamp. Stripe occasionally backfills, but a sudden surge is a quality signal. → ticket.
- **Owner:** monitored by analytics-eng on-call; not actioned outside business hours unless paired with a Tier 1–3 alert.

## Routing and Severity
| Tier | Severity | Channel | Owner | SLA |
| --- | --- | --- | --- | --- |
| Tier 1 critical (freshness, schema-break) | Page | PagerDuty | AE on-call | Acknowledge 30 min |
| Tier 1 warning (volume bound) | Ticket | Slack #data-alerts | AE on-call | Same business day |
| Tier 2 | Block CI | n/a | Model author | At PR review |
| Tier 3 reconciliation | Page | PagerDuty | AE on-call + finance | Acknowledge 1h |
| Tier 3 logic | Ticket | Slack | AE on-call | Same business day |
| Tier 4 | Info | Slack only | AE on-call (review weekly) | Weekly review |

## What NOT to Test
- **Don't test every nullable column for not_null.** Many columns are legitimately null (e.g., `cancelled_at_utc` on active subscriptions). Test what the SQL guarantees, not what would feel safer.
- **Don't test exact equality on derived ratios.** Floating-point drift and ordering produce false positives. Use bounds.
- **Don't anomaly-test metrics that should move day-to-day** (e.g., daily new charges during a marketing campaign window). Use suppression windows or skip.
- **Don't reconcile against the dbt project's own MRR for "validation."** Reconcile against an external source (Stripe API or finance's hand-calculated total). A test that compares the model to itself catches nothing.
- **Don't add a test for every business stakeholder request.** Stakeholder concerns are real but should be addressed in CI / dashboards / docs first; tests are a last resort because they have an ongoing maintenance cost.

Tips for best results

1Tier the tests. Treating every test as equally important guarantees alert fatigue and ignored failures.

2The single highest-leverage test is reconciliation to an external source of truth. If you only have time for one Tier 3 test, that's the one.

3Set thresholds based on the cost of a false positive. A flapping test that wakes someone up at 2am will be silenced and stop catching real problems.

4Don't test the model against itself. Self-consistency tests catch nothing — reconciliation has to be against a different system.

5Schedule a quarterly test review. Tests accumulate; the marginal value of each one decays. Prune tests that haven't fired meaningfully in 6 months.