Review and improve a database data model

advancedclaude-opus-4-5EngineeringArchitecturedatabaseschemaarchitectureengineeringpostgresql

Use case

Use this prompt when designing a schema for a new feature, before a migration to fix an existing schema, or when a data model is causing recurring bugs or performance problems. Data model decisions are expensive to reverse — a review before building saves months of migration work later.

The prompt

You are a database architect reviewing a data model. Analyze the schema below and provide a comprehensive review with prioritized improvements.

**Database system:** {{database}}
**Application context:** {{app_context}} (what the application does, domain description)
**Schema to review:**
```sql
{{schema}}
```
**Key queries this schema must support efficiently:** {{key_queries}}
**Scale requirements:** {{scale}} (row counts, write rate, read rate)
**Known issues:** {{known_issues}} (bugs or performance problems you've already noticed)

Provide a complete data model review:

## 1. Schema Overview
Summarize what the schema models, note the relationships between entities, and identify any domain concepts that are missing from the schema.

## 2. Normalization Analysis
- Is the schema over-normalized (causing excessive joins for common queries)?
- Is it under-normalized (storing the same data in multiple places)?
- Identify any functional dependencies that violate 3NF (data that depends on non-key columns)
- Identify any repeating groups or multi-valued attributes

## 3. Constraints and Data Integrity
For each constraint issue found:
- **Issue:** What invariant is not enforced
- **Risk:** What data corruption or bugs this enables
- **Fix:** The specific constraint or change required

Missing constraint types to check:
- NOT NULL on required fields
- UNIQUE constraints on logically unique fields
- CHECK constraints on bounded values
- Foreign key constraints on relationships
- Consistent use of ON DELETE behavior

## 4. Indexing Analysis
- What indexes are needed for the stated key queries?
- Are there any existing indexes that appear redundant or counterproductive?
- Composite index recommendations with column ordering rationale
- Partial index opportunities

## 5. Naming and Conventions
- Inconsistent naming conventions that will cause confusion
- Ambiguous column names
- Any naming changes that would make the schema clearer

## 6. Design Patterns and Anti-patterns
Call out any recognized anti-patterns:
- EAV (Entity-Attribute-Value) anti-pattern
- Polymorphic associations
- Storing serialized data in text columns (when structured columns are better)
- Using strings for IDs or enums that should be typed differently
- Magic numbers or status codes without a reference table

## 7. Scalability Concerns
At the stated scale:
- What tables will grow fastest and cause problems?
- Are there any hotspot risks (e.g., all writes going to the same table partition)?
- Archiving or partitioning recommendations

## 8. Recommended Schema (Improved Version)
```sql
-- Improved schema with all changes applied
-- Inline comments explaining non-obvious choices
```

## 9. Migration Path
If this is an existing schema: how to migrate to the improved version safely, in what order, with what risks.

Variables

{{{{database}}}}Replace with your {{database}}
{{{{app_context}}}}Replace with your {{app context}}
{{{{schema}}}}Replace with your {{schema}}
{{{{key_queries}}}}Replace with your {{key queries}}
{{{{scale}}}}Replace with your {{scale}}
{{{{known_issues}}}}Replace with your {{known issues}}

Example

Input

database: PostgreSQL 15
app_context: A B2B HR survey platform. Companies run engagement surveys for their employees. Survey results are analyzed by HR managers to understand team sentiment.
schema: |
  CREATE TABLE companies (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255),
    created_at TIMESTAMP
  );

  CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    company_id INT REFERENCES companies(id),
    email VARCHAR(255),
    full_name VARCHAR(255),
    role VARCHAR(50),
    is_active BOOLEAN,
    manager_id INT,
    department VARCHAR(255),
    created_at TIMESTAMP
  );

  CREATE TABLE surveys (
    id SERIAL PRIMARY KEY,
    company_id INT REFERENCES companies(id),
    title VARCHAR(500),
    description TEXT,
    questions TEXT,  -- JSON stored as text
    status VARCHAR(50),
    created_by INT,
    start_date DATE,
    end_date DATE,
    created_at TIMESTAMP
  );

  CREATE TABLE responses (
    id SERIAL PRIMARY KEY,
    survey_id INT REFERENCES surveys(id),
    user_id INT REFERENCES users(id),
    answers TEXT,  -- JSON stored as text
    submitted_at TIMESTAMP,
    created_at TIMESTAMP
  );
key_queries: |
  1. Get all responses for a survey with user details (for HR manager dashboard)
  2. Check if a specific user has completed a specific survey
  3. Get response rate for a survey, broken down by department
  4. Get all active surveys for a company
  5. Check if a survey's end_date has passed (to auto-close)
scale: |
  - companies: ~1K rows, slow growth
  - users: ~500K rows, moderate growth
  - surveys: ~10K rows, moderate growth  
  - responses: ~50M rows, high growth (main volume table)
known_issues: |
  - Querying "has user X completed survey Y" is slow (full scan of responses)
  - JSON parsing of questions/answers in Python is causing bugs when fields are missing
  - Duplicate responses appear occasionally (same user, same survey)

Output

## 1. Schema Overview

The schema models: Companies → Users (employees), Companies → Surveys, Surveys → Responses (user answers). The relationships are correct and the entity model covers the core domain.

**Missing domain concepts:**
- Survey questions are stored as a JSON text blob in `surveys.questions` — questions are not first-class entities. This makes it impossible to query or aggregate by individual question.
- Survey responses store answers as JSON text — same problem. Individual answer values can't be queried without application-layer parsing.
- There's no `teams` or `departments` table — department is a free-text field on `users`, making consistent grouping unreliable.
- No audit trail for survey status changes.

## 2. Normalization Analysis

**Under-normalized:**
- `users.department` is a VARCHAR — if department names change or contain typos, existing user records don't update. This causes the "by-department" breakdown query to produce fragmented results (e.g., "Engineering", "engineering", "Eng" as three separate groups).
- Recommendation: extract `departments` to a reference table with `department_id` foreign key on `users`.

**Appropriately normalized:** The company → user → survey → response hierarchy is correctly normalized.

**JSON blobs:** `surveys.questions` and `responses.answers` stored as `TEXT` are a normalization issue — structured data is hidden from the database. At minimum, use JSONB; ideally, extract to first-class tables.

## 3. Constraints and Data Integrity

**Critical — Missing UNIQUE constraint on responses:**
- **Issue:** No unique constraint prevents the same user from submitting multiple responses to the same survey
- **Risk:** Duplicate responses — this is your known bug
- **Fix:**
  ```sql
  ALTER TABLE responses ADD CONSTRAINT uq_responses_survey_user
    UNIQUE (survey_id, user_id);
  ```

**Missing NOT NULL on required fields:**
- `users.email` — NULL email causes issues with auth and notification lookup
- `users.company_id` — a user without a company is invalid
- `users.is_active` — NULL is not the same as FALSE; should have a default
- `surveys.status` — NULL status makes state machine logic ambiguous
- `surveys.company_id` — a survey without a company is invalid
- Fix: `ALTER TABLE users ALTER COLUMN email SET NOT NULL;` (and similar for each)

**Missing UNIQUE on users.email:**
- **Issue:** Two users can have the same email, causing auth confusion
- **Fix:** `ALTER TABLE users ADD CONSTRAINT uq_users_email UNIQUE (email);`

**Missing CHECK constraint on surveys.status:**
- **Issue:** Any string can be written to `status` — "publihed", "ACTIVE", etc.
- **Fix:**
  ```sql
  ALTER TABLE surveys ADD CONSTRAINT chk_surveys_status
    CHECK (status IN ('draft', 'active', 'closed', 'archived'));
  ```

**Missing ON DELETE behavior:**
- `users.company_id` has no `ON DELETE` — what happens if a company is deleted? Define explicitly (likely `RESTRICT` or `CASCADE`).

**Missing FK on `users.manager_id`:**
- `manager_id` references another user but has no FK constraint. Self-referential FK:
  ```sql
  ALTER TABLE users ADD CONSTRAINT fk_users_manager
    FOREIGN KEY (manager_id) REFERENCES users(id) ON DELETE SET NULL;
  ```

## 4. Indexing Analysis

**Indexes needed for key queries:**
```sql
-- Query 1: Responses for a survey (most frequent dashboard query)
CREATE INDEX idx_responses_survey_id ON responses (survey_id);

-- Query 2: Has user completed survey (currently causing full scan — this is your known slow query)
-- The UNIQUE constraint above creates this index automatically:
-- idx_responses_survey_user already covers (survey_id, user_id)

-- Query 3: Response rate by department (needs join through users)
CREATE INDEX idx_users_company_dept ON users (company_id, department);

-- Query 4: Active surveys for a company
CREATE INDEX idx_surveys_company_status ON surveys (company_id, status)
  WHERE status = 'active';  -- partial index: only active surveys

-- Query 5: Surveys past end_date (for auto-close job)
CREATE INDEX idx_surveys_end_date ON surveys (end_date)
  WHERE status = 'active';  -- partial: only active surveys need this check
```

**responses table at 50M rows:** The primary key index is critical. The `(survey_id, user_id)` unique index covers both the duplicate-prevention constraint and the "has user completed" query — one index serves two purposes.

## 5. Naming and Conventions

- `submitted_at` vs `created_at` on `responses` — these appear redundant. If they're meant to track different events (draft saved vs. submitted), document this. If identical, remove one.
- `full_name` on users — consider splitting to `first_name` / `last_name` for personalization use cases
- `questions` and `answers` column names on surveys/responses are ambiguous — consider `question_definitions` and `answer_payload`

## 6. Design Patterns and Anti-patterns

**JSON stored as TEXT (critical):** Both `surveys.questions` and `responses.answers` use `TEXT` for JSON. This:
- Prevents database-level querying of individual questions/answers
- Cannot be indexed
- Means bugs from missing fields are caught in Python, not at write time

**Immediate fix:** Change to `JSONB`. This enables indexing and operators like `->` without a schema change.

**Better fix (if query patterns require it):** Extract to a `questions` table and an `answers` table. This enables "average score for question 3 across all responses" queries in SQL.

**Status as VARCHAR:** The `surveys.status` field should use a PostgreSQL `ENUM` type or a CHECK constraint (above). VARCHAR allows invalid states.

## 7. Scalability Concerns

- **responses table (50M rows, high growth):** At 500M rows, queries on `survey_id` will still be fast with the index. Consider range partitioning by `submitted_at` year if you need to archive old responses.
- **JSONB on responses.answers:** At 50M rows with large answer payloads, this column could represent significant storage. Profile average payload size; if >2KB average, consider response compression.
- **Department fragmentation:** The free-text department field will cause increasing fragmentation in analytics. This is the most urgent non-constraint issue.

## 8. Recommended Schema

```sql
-- Reference tables
CREATE TYPE survey_status AS ENUM ('draft', 'active', 'closed', 'archived');
CREATE TYPE user_role AS ENUM ('employee', 'manager', 'admin');

CREATE TABLE companies (
  id          BIGSERIAL PRIMARY KEY,
  name        VARCHAR(255) NOT NULL,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE departments (
  id          BIGSERIAL PRIMARY KEY,
  company_id  BIGINT NOT NULL REFERENCES companies(id) ON DELETE CASCADE,
  name        VARCHAR(255) NOT NULL,
  CONSTRAINT uq_departments_company_name UNIQUE (company_id, name)
);

CREATE TABLE users (
  id            BIGSERIAL PRIMARY KEY,
  company_id    BIGINT NOT NULL REFERENCES companies(id) ON DELETE RESTRICT,
  email         VARCHAR(255) NOT NULL,
  first_name    VARCHAR(255),
  last_name     VARCHAR(255),
  role          user_role NOT NULL DEFAULT 'employee',
  is_active     BOOLEAN NOT NULL DEFAULT TRUE,
  manager_id    BIGINT REFERENCES users(id) ON DELETE SET NULL,
  department_id BIGINT REFERENCES departments(id) ON DELETE SET NULL,
  created_at    TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  CONSTRAINT uq_users_email UNIQUE (email)
);

CREATE TABLE surveys (
  id           BIGSERIAL PRIMARY KEY,
  company_id   BIGINT NOT NULL REFERENCES companies(id) ON DELETE CASCADE,
  title        VARCHAR(500) NOT NULL,
  description  TEXT,
  questions    JSONB NOT NULL DEFAULT '[]',  -- JSONB, not TEXT
  status       survey_status NOT NULL DEFAULT 'draft',
  created_by   BIGINT NOT NULL REFERENCES users(id) ON DELETE RESTRICT,
  start_date   DATE,
  end_date     DATE,
  created_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  CONSTRAINT chk_surveys_dates CHECK (end_date IS NULL OR end_date >= start_date)
);

CREATE TABLE responses (
  id           BIGSERIAL PRIMARY KEY,
  survey_id    BIGINT NOT NULL REFERENCES surveys(id) ON DELETE CASCADE,
  user_id      BIGINT NOT NULL REFERENCES users(id) ON DELETE CASCADE,
  answers      JSONB NOT NULL DEFAULT '{}',  -- JSONB, not TEXT
  submitted_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  CONSTRAINT uq_responses_survey_user UNIQUE (survey_id, user_id)
);

-- Indexes
CREATE INDEX idx_surveys_company_status ON surveys (company_id, status) WHERE status = 'active';
CREATE INDEX idx_surveys_end_date ON surveys (end_date) WHERE status = 'active';
CREATE INDEX idx_users_company_dept ON users (company_id, department_id);
CREATE INDEX idx_responses_survey_id ON responses (survey_id);
```

## 9. Migration Path

1. **Add UNIQUE on responses** (blocking bug fix — do first, run in off-hours): Confirm no existing duplicates (`SELECT survey_id, user_id, COUNT(*) FROM responses GROUP BY 1,2 HAVING COUNT(*) > 1`). Deduplicate if found, then add constraint.
2. **Change TEXT to JSONB** for questions and answers: Non-breaking if data is valid JSON; validates on migration. Run with `ALTER TABLE surveys ALTER COLUMN questions TYPE JSONB USING questions::JSONB;`
3. **Add NOT NULL constraints**: Add defaults first, then constraints. Can be done per-column during low-traffic window.
4. **Add departments table + migrate department text**: New table, backfill from distinct values in `users.department`, update `users` with `department_id` FK, then optionally drop the old `department` text column.

Tips for best results

1The data model is the hardest thing to change in a live system. Every hour spent reviewing a schema before it goes to production saves days of migration work later.
2Missing UNIQUE constraints are the #1 source of data integrity bugs — they don't cause loud errors, they cause subtle data quality problems that are hard to diagnose months later.
3Include the key queries the schema must support. A schema that looks clean in isolation may be wrong for its actual query patterns.
4Use TIMESTAMPTZ (timezone-aware) instead of TIMESTAMP for all timestamps unless you have a specific reason not to. Timezone bugs are painful to debug in production data.

Related prompts

Optimize a slow SQL query

intermediate

Analyze a slow SQL query, identify the root cause of performance issues, and produce an optimized version with indexing recommendations and a performance comparison.

Engineeringsqldatabaseperformance

Write a safe database migration script

advanced

Generate a database migration script with rollback, safety checks, batching for large tables, and a step-by-step execution plan that minimizes downtime and risk.

Engineeringdatabasemigrationspostgresql

Review a system design proposal

advanced

Critically evaluate a system design proposal — assessing scalability, reliability, data consistency, security, operational complexity, and cost — with prioritized feedback and alternative approaches.

Engineeringarchitecturesystem-designscalability

Need help implementing this prompt in your workflow?

Book a call

You are a database architect reviewing a data model. Analyze the schema below and provide a comprehensive review with prioritized improvements. **Database system:** {{database}} **Application context:** {{app_context}} (what the application does, domain description) **Schema to review:** ```sql {{schema}} ``` **Key queries this schema must support efficiently:** {{key_queries}} **Scale requirements:** {{scale}} (row counts, write rate, read rate) **Known issues:** {{known_issues}} (bugs or performance problems you've already noticed) Provide a complete data model review: ## 1. Schema Overview Summarize what the schema models, note the relationships between entities, and identify any domain concepts that are missing from the schema. ## 2. Normalization Analysis - Is the schema over-normalized (causing excessive joins for common queries)? - Is it under-normalized (storing the same data in multiple places)? - Identify any functional dependencies that violate 3NF (data that depends on non-key columns) - Identify any repeating groups or multi-valued attributes ## 3. Constraints and Data Integrity For each constraint issue found: - **Issue:** What invariant is not enforced - **Risk:** What data corruption or bugs this enables - **Fix:** The specific constraint or change required Missing constraint types to check: - NOT NULL on required fields - UNIQUE constraints on logically unique fields - CHECK constraints on bounded values - Foreign key constraints on relationships - Consistent use of ON DELETE behavior ## 4. Indexing Analysis - What indexes are needed for the stated key queries? - Are there any existing indexes that appear redundant or counterproductive? - Composite index recommendations with column ordering rationale - Partial index opportunities ## 5. Naming and Conventions - Inconsistent naming conventions that will cause confusion - Ambiguous column names - Any naming changes that would make the schema clearer ## 6. Design Patterns and Anti-patterns Call out any recognized anti-patterns: - EAV (Entity-Attribute-Value) anti-pattern - Polymorphic associations - Storing serialized data in text columns (when structured columns are better) - Using strings for IDs or enums that should be typed differently - Magic numbers or status codes without a reference table ## 7. Scalability Concerns At the stated scale: - What tables will grow fastest and cause problems? - Are there any hotspot risks (e.g., all writes going to the same table partition)? - Archiving or partitioning recommendations ## 8. Recommended Schema (Improved Version) ```sql -- Improved schema with all changes applied -- Inline comments explaining non-obvious choices ``` ## 9. Migration Path If this is an existing schema: how to migrate to the improved version safely, in what order, with what risks.

Example

Input

database: PostgreSQL 15
app_context: A B2B HR survey platform. Companies run engagement surveys for their employees. Survey results are analyzed by HR managers to understand team sentiment.
schema: |
  CREATE TABLE companies (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255),
    created_at TIMESTAMP
  );

  CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    company_id INT REFERENCES companies(id),
    email VARCHAR(255),
    full_name VARCHAR(255),
    role VARCHAR(50),
    is_active BOOLEAN,
    manager_id INT,
    department VARCHAR(255),
    created_at TIMESTAMP
  );

  CREATE TABLE surveys (
    id SERIAL PRIMARY KEY,
    company_id INT REFERENCES companies(id),
    title VARCHAR(500),
    description TEXT,
    questions TEXT,  -- JSON stored as text
    status VARCHAR(50),
    created_by INT,
    start_date DATE,
    end_date DATE,
    created_at TIMESTAMP
  );

  CREATE TABLE responses (
    id SERIAL PRIMARY KEY,
    survey_id INT REFERENCES surveys(id),
    user_id INT REFERENCES users(id),
    answers TEXT,  -- JSON stored as text
    submitted_at TIMESTAMP,
    created_at TIMESTAMP
  );
key_queries: |
  1. Get all responses for a survey with user details (for HR manager dashboard)
  2. Check if a specific user has completed a specific survey
  3. Get response rate for a survey, broken down by department
  4. Get all active surveys for a company
  5. Check if a survey's end_date has passed (to auto-close)
scale: |
  - companies: ~1K rows, slow growth
  - users: ~500K rows, moderate growth
  - surveys: ~10K rows, moderate growth  
  - responses: ~50M rows, high growth (main volume table)
known_issues: |
  - Querying "has user X completed survey Y" is slow (full scan of responses)
  - JSON parsing of questions/answers in Python is causing bugs when fields are missing
  - Duplicate responses appear occasionally (same user, same survey)

Output

## 1. Schema Overview

The schema models: Companies → Users (employees), Companies → Surveys, Surveys → Responses (user answers). The relationships are correct and the entity model covers the core domain.

**Missing domain concepts:**
- Survey questions are stored as a JSON text blob in `surveys.questions` — questions are not first-class entities. This makes it impossible to query or aggregate by individual question.
- Survey responses store answers as JSON text — same problem. Individual answer values can't be queried without application-layer parsing.
- There's no `teams` or `departments` table — department is a free-text field on `users`, making consistent grouping unreliable.
- No audit trail for survey status changes.

## 2. Normalization Analysis

**Under-normalized:**
- `users.department` is a VARCHAR — if department names change or contain typos, existing user records don't update. This causes the "by-department" breakdown query to produce fragmented results (e.g., "Engineering", "engineering", "Eng" as three separate groups).
- Recommendation: extract `departments` to a reference table with `department_id` foreign key on `users`.

**Appropriately normalized:** The company → user → survey → response hierarchy is correctly normalized.

**JSON blobs:** `surveys.questions` and `responses.answers` stored as `TEXT` are a normalization issue — structured data is hidden from the database. At minimum, use JSONB; ideally, extract to first-class tables.

## 3. Constraints and Data Integrity

**Critical — Missing UNIQUE constraint on responses:**
- **Issue:** No unique constraint prevents the same user from submitting multiple responses to the same survey
- **Risk:** Duplicate responses — this is your known bug
- **Fix:**
  ```sql
  ALTER TABLE responses ADD CONSTRAINT uq_responses_survey_user
    UNIQUE (survey_id, user_id);
  ```

**Missing NOT NULL on required fields:**
- `users.email` — NULL email causes issues with auth and notification lookup
- `users.company_id` — a user without a company is invalid
- `users.is_active` — NULL is not the same as FALSE; should have a default
- `surveys.status` — NULL status makes state machine logic ambiguous
- `surveys.company_id` — a survey without a company is invalid
- Fix: `ALTER TABLE users ALTER COLUMN email SET NOT NULL;` (and similar for each)

**Missing UNIQUE on users.email:**
- **Issue:** Two users can have the same email, causing auth confusion
- **Fix:** `ALTER TABLE users ADD CONSTRAINT uq_users_email UNIQUE (email);`

**Missing CHECK constraint on surveys.status:**
- **Issue:** Any string can be written to `status` — "publihed", "ACTIVE", etc.
- **Fix:**
  ```sql
  ALTER TABLE surveys ADD CONSTRAINT chk_surveys_status
    CHECK (status IN ('draft', 'active', 'closed', 'archived'));
  ```

**Missing ON DELETE behavior:**
- `users.company_id` has no `ON DELETE` — what happens if a company is deleted? Define explicitly (likely `RESTRICT` or `CASCADE`).

**Missing FK on `users.manager_id`:**
- `manager_id` references another user but has no FK constraint. Self-referential FK:
  ```sql
  ALTER TABLE users ADD CONSTRAINT fk_users_manager
    FOREIGN KEY (manager_id) REFERENCES users(id) ON DELETE SET NULL;
  ```

## 4. Indexing Analysis

**Indexes needed for key queries:**
```sql
-- Query 1: Responses for a survey (most frequent dashboard query)
CREATE INDEX idx_responses_survey_id ON responses (survey_id);

-- Query 2: Has user completed survey (currently causing full scan — this is your known slow query)
-- The UNIQUE constraint above creates this index automatically:
-- idx_responses_survey_user already covers (survey_id, user_id)

-- Query 3: Response rate by department (needs join through users)
CREATE INDEX idx_users_company_dept ON users (company_id, department);

-- Query 4: Active surveys for a company
CREATE INDEX idx_surveys_company_status ON surveys (company_id, status)
  WHERE status = 'active';  -- partial index: only active surveys

-- Query 5: Surveys past end_date (for auto-close job)
CREATE INDEX idx_surveys_end_date ON surveys (end_date)
  WHERE status = 'active';  -- partial: only active surveys need this check
```

**responses table at 50M rows:** The primary key index is critical. The `(survey_id, user_id)` unique index covers both the duplicate-prevention constraint and the "has user completed" query — one index serves two purposes.

## 5. Naming and Conventions

- `submitted_at` vs `created_at` on `responses` — these appear redundant. If they're meant to track different events (draft saved vs. submitted), document this. If identical, remove one.
- `full_name` on users — consider splitting to `first_name` / `last_name` for personalization use cases
- `questions` and `answers` column names on surveys/responses are ambiguous — consider `question_definitions` and `answer_payload`

## 6. Design Patterns and Anti-patterns

**JSON stored as TEXT (critical):** Both `surveys.questions` and `responses.answers` use `TEXT` for JSON. This:
- Prevents database-level querying of individual questions/answers
- Cannot be indexed
- Means bugs from missing fields are caught in Python, not at write time

**Immediate fix:** Change to `JSONB`. This enables indexing and operators like `->` without a schema change.

**Better fix (if query patterns require it):** Extract to a `questions` table and an `answers` table. This enables "average score for question 3 across all responses" queries in SQL.

**Status as VARCHAR:** The `surveys.status` field should use a PostgreSQL `ENUM` type or a CHECK constraint (above). VARCHAR allows invalid states.

## 7. Scalability Concerns

- **responses table (50M rows, high growth):** At 500M rows, queries on `survey_id` will still be fast with the index. Consider range partitioning by `submitted_at` year if you need to archive old responses.
- **JSONB on responses.answers:** At 50M rows with large answer payloads, this column could represent significant storage. Profile average payload size; if >2KB average, consider response compression.
- **Department fragmentation:** The free-text department field will cause increasing fragmentation in analytics. This is the most urgent non-constraint issue.

## 8. Recommended Schema

```sql
-- Reference tables
CREATE TYPE survey_status AS ENUM ('draft', 'active', 'closed', 'archived');
CREATE TYPE user_role AS ENUM ('employee', 'manager', 'admin');

CREATE TABLE companies (
  id          BIGSERIAL PRIMARY KEY,
  name        VARCHAR(255) NOT NULL,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE departments (
  id          BIGSERIAL PRIMARY KEY,
  company_id  BIGINT NOT NULL REFERENCES companies(id) ON DELETE CASCADE,
  name        VARCHAR(255) NOT NULL,
  CONSTRAINT uq_departments_company_name UNIQUE (company_id, name)
);

CREATE TABLE users (
  id            BIGSERIAL PRIMARY KEY,
  company_id    BIGINT NOT NULL REFERENCES companies(id) ON DELETE RESTRICT,
  email         VARCHAR(255) NOT NULL,
  first_name    VARCHAR(255),
  last_name     VARCHAR(255),
  role          user_role NOT NULL DEFAULT 'employee',
  is_active     BOOLEAN NOT NULL DEFAULT TRUE,
  manager_id    BIGINT REFERENCES users(id) ON DELETE SET NULL,
  department_id BIGINT REFERENCES departments(id) ON DELETE SET NULL,
  created_at    TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  CONSTRAINT uq_users_email UNIQUE (email)
);

CREATE TABLE surveys (
  id           BIGSERIAL PRIMARY KEY,
  company_id   BIGINT NOT NULL REFERENCES companies(id) ON DELETE CASCADE,
  title        VARCHAR(500) NOT NULL,
  description  TEXT,
  questions    JSONB NOT NULL DEFAULT '[]',  -- JSONB, not TEXT
  status       survey_status NOT NULL DEFAULT 'draft',
  created_by   BIGINT NOT NULL REFERENCES users(id) ON DELETE RESTRICT,
  start_date   DATE,
  end_date     DATE,
  created_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  CONSTRAINT chk_surveys_dates CHECK (end_date IS NULL OR end_date >= start_date)
);

CREATE TABLE responses (
  id           BIGSERIAL PRIMARY KEY,
  survey_id    BIGINT NOT NULL REFERENCES surveys(id) ON DELETE CASCADE,
  user_id      BIGINT NOT NULL REFERENCES users(id) ON DELETE CASCADE,
  answers      JSONB NOT NULL DEFAULT '{}',  -- JSONB, not TEXT
  submitted_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  CONSTRAINT uq_responses_survey_user UNIQUE (survey_id, user_id)
);

-- Indexes
CREATE INDEX idx_surveys_company_status ON surveys (company_id, status) WHERE status = 'active';
CREATE INDEX idx_surveys_end_date ON surveys (end_date) WHERE status = 'active';
CREATE INDEX idx_users_company_dept ON users (company_id, department_id);
CREATE INDEX idx_responses_survey_id ON responses (survey_id);
```

## 9. Migration Path

1. **Add UNIQUE on responses** (blocking bug fix — do first, run in off-hours): Confirm no existing duplicates (`SELECT survey_id, user_id, COUNT(*) FROM responses GROUP BY 1,2 HAVING COUNT(*) > 1`). Deduplicate if found, then add constraint.
2. **Change TEXT to JSONB** for questions and answers: Non-breaking if data is valid JSON; validates on migration. Run with `ALTER TABLE surveys ALTER COLUMN questions TYPE JSONB USING questions::JSONB;`
3. **Add NOT NULL constraints**: Add defaults first, then constraints. Can be done per-column during low-traffic window.
4. **Add departments table + migrate department text**: New table, backfill from distinct values in `users.department`, update `users` with `department_id` FK, then optionally drop the old `department` text column.

Tips for best results

1The data model is the hardest thing to change in a live system. Every hour spent reviewing a schema before it goes to production saves days of migration work later.

2Missing UNIQUE constraints are the #1 source of data integrity bugs — they don't cause loud errors, they cause subtle data quality problems that are hard to diagnose months later.

3Include the key queries the schema must support. A schema that looks clean in isolation may be wrong for its actual query patterns.

4Use TIMESTAMPTZ (timezone-aware) instead of TIMESTAMP for all timestamps unless you have a specific reason not to. Timezone bugs are painful to debug in production data.