
Dossier Engine

The Insight

Every regulated industry has the same problem: structured forms that need to be filled with data that already exists somewhere. A bankruptcy attorney types the debtor's name into Form 101, then types the same name into Schedule A/B, Schedule D, Schedule E/F, the Statement of Financial Affairs, and every other form in the package. The data exists — it was entered once. The routing is what's missing.

Dossier solves this with three ideas:

  1. A shared vocabulary (Schema) — Every data point in a domain gets a canonical name. debtor1.first_name is the debtor's first name, regardless of which form asks for it, regardless of which source provided it.

  2. Bindings that route data to forms — Each form declares how its PDF fields map to schema keys. Enter debtor1.first_name once, and bindings carry it to every form that needs it — across 70+ forms in a bankruptcy package.

  3. Data sources that are interchangeable — Manual entry, credit report XML, a case management API, or an LLM reading a pay stub all produce the same thing: schema key = value pairs. The binding engine doesn't care where the data came from.

The engine is the schema + binding resolver + expression engine + PDF filler. It knows nothing about law, bankruptcy, insurance, tax, or any specific domain. All domain knowledge is externalized into three JSON artifacts: schemas, forms (with bindings), and data sources.


How It Works

Data Flow

The Schema sits at the center. DataSources write to it (inbound), Bindings read from it (outbound). Every interaction — manual or automated — produces the same artifact: an Entry.

External data (XML, API, CSV, PDF, manual entry)
    ↓
DataSource (parse rules + field mapping)
    ↓
Schema keys (the shared vocabulary)
    ↓
Entry on Case (batch of key=value with source + timestamp)
    ↓
Binding engine (resolves entries → form fields)
    ↓
Filled PDF

Fill Once, Populate Everywhere

The debtor's first name is entered once. The binding engine routes it to every form:

debtor1.first_name
    → Form 101, field "Debtor1.First name"
    → Schedule A/B, field "Debtor 1 First name"
    → Schedule D, field "Debtor 1 Name"
    → Schedule E/F, field "Debtor 1 Name"
    → Statement of Financial Affairs, field "Debtor 1"
    → Declaration, field "Name of Debtor"
    → ...every form in the package that needs it

This works for every data point. Creditor names cascade to every schedule. Addresses appear on every form that asks. Social security numbers are masked where required. One schema key, many targets.
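The fan-out above can be sketched in a few lines of Python. This is an illustrative model rather than the engine's code: the `BINDINGS` table, the `fill_forms` helper, and the dict-of-dicts output are assumptions, while the form and field names come from the example above.

```python
from collections import defaultdict

# Hypothetical binding table: schema key -> list of (form, PDF field) targets.
BINDINGS = {
    "debtor1.first_name": [
        ("Form 101", "Debtor1.First name"),
        ("Schedule A/B", "Debtor 1 First name"),
        ("Schedule D", "Debtor 1 Name"),
    ],
}


def fill_forms(values, bindings):
    """Route each schema value to every form field bound to its key."""
    filled = defaultdict(dict)
    for key, value in values.items():
        for form, field in bindings.get(key, []):
            filled[form][field] = value
    return dict(filled)


forms = fill_forms({"debtor1.first_name": "Jane"}, BINDINGS)
```

One value goes in and three forms come out; adding a form to the package means adding a target to the table, not re-entering data.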

Any Source, Same Result

All data sources produce the same thing: an Entry with schema key = value pairs.

| Source | What happens | Entry |
|---|---|---|
| Manual entry | Lawyer types 5 fields, clicks save | 1 entry, 5 values, source="manual", auto-confirmed |
| Credit report | MISMO XML parsed by DataSource config | 1 entry, 80 values, source="credit-report", pending review |
| Case management API | REST sync from Clio/MyCase | 1 entry, 12 values, source="case-mgmt", auto-confirmed |
| Document upload | LLM reads a pay stub, maps to schema keys | 1 entry, 5 values, source="pay-stub", pending review |
| Manual correction | Lawyer fixes a creditor name from the credit report | 1 entry, 1 value, overrides the credit report's value |

Current state is the merge of all confirmed entries in timestamp order, with the latest value winning per key. The case activity feed shows every entry: what changed, where it came from, and when. When a manual correction overrides an imported value, both entries stay in the timeline; the correction doesn't erase the original.
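A minimal sketch of the merge rule, assuming an `Entry` shape with `source`, `timestamp`, `values`, and `confirmed` fields. The dataclass, the integer timestamps, and the `current_state` helper are illustrative names, not the engine's actual types.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    source: str       # e.g. "manual", "credit-report"
    timestamp: int    # any sortable timestamp works; ints keep the sketch simple
    values: dict
    confirmed: bool = True


def current_state(entries):
    """Merge confirmed entries oldest-first so the latest value wins per key."""
    state = {}
    for entry in sorted(entries, key=lambda e: e.timestamp):
        if entry.confirmed:
            state.update(entry.values)
    return state


timeline = [
    Entry("credit-report", 1, {"creditor.secured[0].name": "ACME BK",
                               "creditor.secured[0].claim_amount": 1200}),
    Entry("manual", 2, {"creditor.secured[0].name": "Acme Bank"}),   # correction
    Entry("pay-stub", 3, {"income.wages": 4000}, confirmed=False),   # pending review
]
state = current_state(timeline)
```

The correction wins for the creditor name, the untouched claim amount still comes from the credit report, and the pending pay-stub entry contributes nothing until it is confirmed. The timeline itself is never mutated.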

Forms Are Recursive

A form can be a leaf (has a PDF with extractable fields), a composite (groups other forms and routes data between them), or both. Composites contain composites — a Chapter 7 Individual package contains a Petition group, which contains Form 101, Declaration, and other leaf forms.

Chapter 7 Individual (composite)
├── Petition (composite)
│   ├── Form 101 — Voluntary Petition (leaf, 155 fields)
│   ├── Form 101A — Initial Statement (leaf, conditional)
│   └── Declaration (leaf)
├── Schedules (composite)
│   ├── Schedule Summary (leaf)
│   ├── Schedule A/B — Property (leaf)
│   ├── Schedule C — Exemptions (leaf)
│   ├── Schedule D — Secured Creditors (leaf)
│   ├── Schedule E/F — Unsecured Creditors (leaf)
│   ├── Schedule G — Executory Contracts (leaf)
│   ├── Schedule H — Co-debtors (leaf, conditional)
│   ├── Schedule I — Income (leaf)
│   └── Schedule J — Expenses (leaf)
├── SOFA — Statement of Financial Affairs (leaf)
├── Means Test — Form 122A (leaf)
└── ...other forms

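Bindings flow downward only. A composite form's binding can reference any descendant. Cross-form bindings (e.g., income from Schedule I flowing to the Means Test) live on the forms' common ancestor, which in the tree above is Chapter 7 Individual. A form never binds up to its parent or sideways to a sibling; data that appears to move between siblings is routed by the composite that contains them.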
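To make the placement rule concrete, here is a small sketch that finds which composite should own a binding between two leaf forms. The `PARENT` map covers only part of the tree above, and `ancestors` and `binding_owner` are hypothetical helpers, not engine functions.

```python
# Hypothetical parent map for part of the tree above (child -> parent).
PARENT = {
    "Petition": "Chapter 7 Individual",
    "Schedules": "Chapter 7 Individual",
    "Means Test": "Chapter 7 Individual",
    "Form 101": "Petition",
    "Schedule I": "Schedules",
    "Schedule J": "Schedules",
}


def ancestors(form):
    """Return the chain of composites above a form, nearest first."""
    chain = []
    while form in PARENT:
        form = PARENT[form]
        chain.append(form)
    return chain


def binding_owner(source_form, target_form):
    """Nearest composite containing both leaf forms, i.e. where the binding lives."""
    target_chain = set(ancestors(target_form))
    for composite in ancestors(source_form):
        if composite in target_chain:
            return composite
    raise ValueError("forms do not share a package")


owner = binding_owner("Schedule I", "Means Test")
```

Here `owner` is the Chapter 7 Individual package, while a binding between Schedule I and Schedule J would sit one level lower, on the Schedules composite.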

Expression Engine

Bindings can do more than simple routing. The expression engine supports 22 Excel-style functions, including:

  • String: CONCAT(first, ' ', last), UPPER(name), RIGHT(acct, 4), SUBSTITUTE(phone, "-", "")
  • Math: SUM(line1, line2), ROUND(amount, 2), MAX(income, 0), COUNT(creditors)
  • Logical: IF(joint, debtor2_name, ""), AND(employed, income > 0), IN(type, 'Secured', 'Mortgage')
  • Date: TODAY(), YEAR(filed_date), MONTH(opened)
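The function-call style above lends itself to a small tokenizer, parser, and evaluator. The following is a deliberately reduced sketch, not the engine's implementation: it handles only nested function calls, numbers, single-quoted strings, and key references (no infix operators such as `>`), and the five functions shown use assumed Excel-like semantics.

```python
import re

# Token kinds: number, 'single-quoted string', dotted name, or one punctuation char.
TOKEN = re.compile(r"\s*(?:(\d+(?:\.\d+)?)|'([^']*)'|([A-Za-z_][\w.]*)|(\S))")


def _right(text, count):
    text = str(text)
    return text[max(len(text) - int(count), 0):]


FUNCTIONS = {
    "CONCAT": lambda *parts: "".join(str(p) for p in parts),
    "UPPER": lambda text: str(text).upper(),
    "RIGHT": _right,
    "SUM": lambda *nums: sum(nums),
    "IF": lambda cond, then, otherwise: then if cond else otherwise,
}


def tokenize(expr):
    tokens = []
    for match in TOKEN.finditer(expr):
        number, string, name, punct = match.groups()
        if number is not None:
            tokens.append(("num", float(number) if "." in number else int(number)))
        elif string is not None:
            tokens.append(("str", string))
        elif name is not None:
            tokens.append(("name", name))
        else:
            tokens.append(("punct", punct))
    return tokens


def parse(tokens):
    pos = 0

    def expression():
        nonlocal pos
        kind, value = tokens[pos]
        pos += 1
        if kind in ("num", "str"):
            return ("lit", value)
        if kind == "name":
            if pos < len(tokens) and tokens[pos] == ("punct", "("):
                pos += 1
                args = []
                while tokens[pos] != ("punct", ")"):
                    args.append(expression())
                    if tokens[pos] == ("punct", ","):
                        pos += 1  # lenient: a missing comma is tolerated in this sketch
                pos += 1  # consume ")"
                return ("call", value, args)
            return ("ref", value)
        raise SyntaxError(f"unexpected token {value!r}")

    ast = expression()
    if pos != len(tokens):
        raise SyntaxError("trailing tokens")
    return ast


def evaluate(ast, values):
    if ast[0] == "lit":
        return ast[1]
    if ast[0] == "ref":
        return values[ast[1]]
    _, name, args = ast
    return FUNCTIONS[name](*(evaluate(arg, values) for arg in args))


values = {"first": "Jane", "last": "Doe", "acct": "1234567890"}
full_name = evaluate(parse(tokenize("UPPER(CONCAT(first, ' ', last))")), values)
last_four = evaluate(parse(tokenize("RIGHT(acct, 4)")), values)
```

Because `parse` returns a plain tuple tree, the result can be cached per expression string and re-evaluated against new values without re-parsing.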

Conditional bindings apply only when a condition is met:

{
  "source": "debtor2.first_name",
  "targets": ["$b101.Debtor2.First name", "$b106ab.Joint Debtor Name"],
  "condition": "case.is_joint == true"
}
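How such a binding might be applied can be sketched as follows. The `condition_holds` helper understands only the `key == true/false` shape and stands in for the real expression engine; splitting a `$alias.Field name` target at its first dot is also an assumption about the target syntax.

```python
BINDING = {
    "source": "debtor2.first_name",
    "targets": ["$b101.Debtor2.First name", "$b106ab.Joint Debtor Name"],
    "condition": "case.is_joint == true",
}

LITERALS = {"true": True, "false": False}


def condition_holds(condition, values):
    """Evaluate only the `key == literal` shape; a stand-in for the expression engine."""
    key, literal = (part.strip() for part in condition.split("=="))
    return values.get(key) == LITERALS[literal]


def apply_binding(binding, values):
    """Return {(form_alias, field_name): value} when the binding's condition passes."""
    condition = binding.get("condition")
    if condition and not condition_holds(condition, values):
        return {}
    if binding["source"] not in values:
        return {}
    routed = {}
    for target in binding["targets"]:
        # Assumption: "$alias.Field name" splits at the first dot.
        alias, field = target[1:].split(".", 1)
        routed[(alias, field)] = values[binding["source"]]
    return routed


joint = apply_binding(BINDING, {"case.is_joint": True, "debtor2.first_name": "Sam"})
single = apply_binding(BINDING, {"case.is_joint": False, "debtor2.first_name": "Sam"})
```

For the joint case both targets receive the value; for the single filer the binding contributes nothing, so the joint-debtor fields stay blank.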

Computed schema entries derive values from other entries:

{
  "key": "income.total",
  "type": "money",
  "label": "Total Monthly Income",
  "expression": "SUM(income.wages, income.business, income.rental)"
}
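Computed entries can depend on other computed entries, so a resolver needs to evaluate dependencies first and refuse circular definitions. The sketch below assumes each expression has already been reduced to a list of dependency keys plus a function; `income.annual` is a made-up key added only to show chaining, and treating a missing value as 0 is an assumption.

```python
# Hypothetical stand-in for parsed expressions: key -> (dependency keys, function).
COMPUTED = {
    "income.total": (
        ("income.wages", "income.business", "income.rental"),
        lambda *amounts: sum(amounts),
    ),
    "income.annual": (("income.total",), lambda total: total * 12),
}


def resolve(key, values, computed, _visiting=None):
    """Return the value for key, deriving computed keys depth-first."""
    visiting = _visiting if _visiting is not None else set()
    if key not in computed:
        return values.get(key, 0)  # assumption: a missing money value counts as 0
    if key in visiting:
        raise ValueError(f"cycle detected at {key}")
    visiting.add(key)
    deps, fn = computed[key]
    result = fn(*(resolve(dep, values, computed, visiting) for dep in deps))
    visiting.discard(key)
    return result


entered = {"income.wages": 3000, "income.business": 500}
total = resolve("income.total", entered, COMPUTED)
annual = resolve("income.annual", entered, COMPUTED)
```

With wages of 3000 and business income of 500, `total` is 3500 and `annual` is 42000. Two keys defined in terms of each other raise `ValueError` instead of recursing forever.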

DataSource Configuration

A DataSource is a reusable recipe that turns external data into schema key values. It combines parse rules (how to read the source format) with field mapping (how to transform into schema keys).

{
  "name": "Credit Report (Bankruptcy)",
  "type": "api",
  "schemaId": "bankruptcy.individual",
  "singleton": {
    "debtor1.full_name": "CONCAT(borrowers[0].firstName, ' ', borrowers[0].lastName)",
    "debtor1.ssn": "borrowers[0].ssn"
  },
  "collections": {
    "creditors": {
      "source": "liabilities",
      "filter": "AND(IN(status, 'Open'), balance > 0)",
      "classify": {
        "secured": { "rule": "IN(loanType, 'Secured', 'Mortgage')", "targetPrefix": "creditor.secured[]" },
        "unsecured": { "rule": "DEFAULT", "targetPrefix": "creditor.unsecured[]" }
      },
      "fields": {
        "name": "creditorName",
        "account_number": "RIGHT(accountNumber, 4)",
        "claim_amount": "balance"
      }
    }
  }
}
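The `collections` part of this config describes a filter, classify, and map pass over each liability. A rough sketch of that pass follows, with Python callables standing in for the expression strings so the example stays self-contained. The `import_collection` helper, the sample payload, and the expansion of `[]` into a per-class running index are assumptions.

```python
# The collection config above, with callables in place of expression strings.
CONFIG = {
    "source": "liabilities",
    "filter": lambda row: row["status"] in ("Open",) and row["balance"] > 0,
    "classify": [
        ("creditor.secured", lambda row: row["loanType"] in ("Secured", "Mortgage")),
        ("creditor.unsecured", lambda row: True),  # the DEFAULT rule
    ],
    "fields": {
        "name": lambda row: row["creditorName"],
        "account_number": lambda row: row["accountNumber"][-4:],
        "claim_amount": lambda row: row["balance"],
    },
}


def import_collection(payload, config):
    """Filter rows, pick the first matching class, and emit indexed schema keys."""
    values, counters = {}, {}
    for row in payload[config["source"]]:
        if not config["filter"](row):
            continue
        prefix = next(p for p, rule in config["classify"] if rule(row))
        index = counters.get(prefix, 0)
        counters[prefix] = index + 1
        for field, extract in config["fields"].items():
            values[f"{prefix}[{index}].{field}"] = extract(row)
    return values


payload = {"liabilities": [
    {"creditorName": "Acme Bank", "accountNumber": "000012345678",
     "balance": 150000, "status": "Open", "loanType": "Mortgage"},
    {"creditorName": "Visa", "accountNumber": "4111111111111111",
     "balance": 2300, "status": "Open", "loanType": "Revolving"},
    {"creditorName": "Old Loan", "accountNumber": "999",
     "balance": 0, "status": "Closed", "loanType": "Secured"},
]}
values = import_collection(payload, CONFIG)
```

The closed, zero-balance loan is filtered out, the mortgage lands under `creditor.secured[0]`, and the card falls through to the default `creditor.unsecured[0]`. The resulting dict is exactly what an Entry carries.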

DataSource mappings are inbound (external field → schema key). Bindings are outbound (schema key → form field). Both use the same key vocabulary — same expression engine, same [] array syntax.

The schema is the clean boundary. DataSources don't know about PDF fields. Bindings don't know about APIs. Build the key picker once, use it everywhere.

Example Flows

Every path through the system follows the same pattern: something produces entries on a case, and bindings route them to PDF fields.

A. Lawyer fills in debtor information and clicks save

Lawyer types 5 fields, saves → Entry (manual, 5 values, confirmed)
    → Bindings route each value to every form field that needs it
    → No DataSource needed. Source = "manual", auto-confirmed.

B. Credit report fills creditor schedules

MISMO XML → DataSource: credit-report (parse + classify + map)
    → Entry (credit-report, 80 values, pending review)
    → Bindings → Schedule D creditor rows, Schedule E/F creditor rows, ...
    → Lawyer reviews and confirms before values flow to PDFs

C. Case management sync fills debtor demographics

Clio REST API → DataSource: case-mgmt (field mapping)
    → Entry (case-mgmt, 12 values, confirmed)
    → Bindings → Form 101, Schedule A/B, Schedule I, ...
    → Same schema keys, same bindings, different source
    → If the lawyer already typed the name manually, the API value doesn't override it

D. Pay stub upload fills income

Pay stub PDF → DataSource: pay-stub (LLM extraction)
    → Entry (pay-stub, 5 values, pending review)
    → Bindings → Schedule I income fields
    → First time: lawyer confirms the LLM mapping (saved as reusable template)
    → Second time: auto-applied from the saved template

E. Lawyer corrects a value from the credit report

Fixes creditor name → Entry (manual, 1 value, confirmed)
    → Overrides the credit report's value for creditor.secured[0].name
    → Credit report entry still exists in timeline — activity feed shows what changed and why

F. Same data source, different domain (future)

Same MISMO XML → DataSource: credit-report-auto (different schema mapping)
    → Entry (credit-report, N values)
    → Different schema (insurance.auto.claim) → different bindings → insurance forms
    → The DataSource is scoped to a schema, not to forms

The Extraction Pipeline

Building a new domain follows a repeatable pipeline:

  1. Define the schema — Enumerate every data point in the domain. Give each a canonical key, type, label, and group.
  2. Process the forms — Take each government/standard PDF, extract its AcroForm fields (key, type, page, position, rect).
  3. Generate bindings — Map each PDF field to the appropriate schema key. This is where domain knowledge is captured: understanding that "Debtor1.First name" on Form 101 and "Debtor 1 Name" on Schedule D both mean debtor1.first_name.
  4. Compose form packages — Group leaf forms into composites (Petition, Schedules, Chapter 7 Package). Write cross-child bindings that route data between sibling forms.
  5. Configure data sources — Define how external data (credit reports, APIs, documents) maps to schema keys.

The result: populate a few schema keys from any source, and the binding engine cascades the values to every form that needs them.


What's Been Built

Bankruptcy Domain

Schemas:

  • bankruptcy.individual — ~1,100 schema keys covering debtor identity, income, expenses, assets, liabilities, creditors (secured, unsecured, priority), executory contracts, co-debtors, prior filings, and administrative data
  • bankruptcy.nonindividual — ~640 schema keys covering entity information, officers, revenue, assets, liabilities, and corporate-specific data
  • 12 shared administrative keys (case.*, attorney.*)

Forms processed:

  • 69 federal leaf forms — Every fillable PDF in the bankruptcy form set. Fields extracted, bindings generated, schema keys mapped.
  • Local forms from IL and GA — State-specific local bankruptcy forms processed with the same pipeline, extending the federal form set.
  • 19 composite forms — Chapter packages (Ch.7 Individual, Ch.7 Non-Individual, Ch.13 Individual, etc.), Petition group, Schedules group, and other logical groupings.

What the bindings encode:

The bindings are the captured domain knowledge. They encode:

  • debtor1.first_name appears on 40+ forms under different field names
  • Creditor arrays in the schema map to repeating table rows on Schedule D, E/F, and the creditor matrix
  • The means test (Form 122A) uses income values from Schedule I and expense values from Schedule J
  • Summary totals on the Schedule Summary aggregate values from the individual schedules
  • Conditional forms are included or excluded based on case data (e.g., Schedule H only when case.has_codebtors == true)
  • Cross-form sync ensures the same creditor list stays consistent between Schedules and the Chapter 13 Plan
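As an illustration of the second point above, creditor arrays mapping to repeating table rows, the sketch below expands indexed schema keys into per-row PDF field names. The `Creditor {row} Name` templates are invented for the example (real AcroForm field names vary by form), and 1-based row numbering is an assumption.

```python
import re

INDEXED_KEY = re.compile(r"^(?P<prefix>[\w.]+)\[(?P<index>\d+)\]\.(?P<field>\w+)$")


def expand_rows(values, array_prefix, field_templates):
    """Map keys like prefix[0].name onto per-row PDF field names."""
    filled = {}
    for key, value in values.items():
        match = INDEXED_KEY.match(key)
        if not match or match["prefix"] != array_prefix:
            continue
        template = field_templates.get(match["field"])
        if template:
            # Assumption: PDF table rows are numbered from 1.
            filled[template.format(row=int(match["index"]) + 1)] = value
    return filled


# Hypothetical row templates for a secured-creditor table.
TEMPLATES = {"name": "Creditor {row} Name", "claim_amount": "Creditor {row} Amount"}

case_values = {
    "creditor.secured[0].name": "Acme Bank",
    "creditor.secured[0].claim_amount": 150000,
    "creditor.secured[1].name": "Auto Finance",
    "creditor.unsecured[0].name": "Visa",
}
rows = expand_rows(case_values, "creditor.secured", TEMPLATES)
```

Only the secured creditors are expanded here; the unsecured entry would be picked up by a separate binding with its own prefix and templates.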

Engine implementation:

  • PDF field extraction (AcroForm field discovery with position and type metadata)
  • Expression engine: tokenizer → parser → AST → evaluator, 22 functions, AST caching
  • Binding resolver with condition evaluation and cycle detection
  • PDF filler (pdf-lib AcroFields) + multi-form export (merged PDF or ZIP)
  • DataSource framework — credit report parsing (live), CSV import, Clio API mapping config, pay-stub / document upload (extraction path scaffolded, LLM mapping planned)
  • Client portal (packages/portal) — tenant-branded intake app with embeddable widget (bubble, panel, full-page), dashboard for invited clients, and four published intake configs (Atlas, DebtStoppers, Greenfield, individual self-file)
  • Public intake + portal routes (/intake/:slug, /portal/:tenantSlug) — tokenized invites, multipart file uploads, rate-limited, write directly to entries on a case

Why It Generalizes

The engine has no concept of "law" or "bankruptcy." It knows five things:

  1. Schemas — vocabularies of typed data points with dotted key notation
  2. Forms — PDFs with extractable AcroForm fields, composable into packages
  3. Bindings — routes from schema keys to form fields, with expressions and conditions
  4. DataSources — recipes for importing external data into schema keys
  5. Entries — batches of key=value changes with source tracking

Swap the schema, forms, and bindings — you have a different domain. The engine, server, database, and API routes are unchanged.

Domain-Agnostic Infrastructure

The platform's infrastructure is fully abstract:

  • Database tables know nothing about law, insurance, or tax. They store schemas (JSONB entries), forms (JSONB fields/bindings), cases (JSONB references/dates), entries (JSONB values), and filings (JSONB snapshots).
  • ~45 API routes handle CRUD for cases, entries, filings, contacts, tasks, notes, events, billing, activity, attachments, data sources — all tenant-scoped and role-aware, all abstract.
  • Schema UI config shapes the client app per domain: status labels, party roles, reference fields, date fields, document checklists, event types, and the label for "Case" (which becomes "Claim" or "Return" or "Application" in other domains).
  • Case management (tasks, notes, calendar, billing, activity, attachments, contacts) is universal across all domains without modification.

What Changes Per Domain

| To target... | What you build | Code changes |
|---|---|---|
| Another law type (immigration, family, PI) | New schemas + forms + bindings + UI config | None |
| Tax preparation | New schemas + forms. Expression engine handles calculations. | None (maybe DataSource for tax tables) |
| Insurance claims | New schemas + forms + 1-2 tables for payouts/settlements | Minimal: new routes for claim financials |
| Real estate closings | New schemas + forms + 1-2 tables for escrow management | Minimal: new routes for escrow |
| Healthcare credentialing | New schemas + forms + 1 table for credential expiry | Minimal: new routes for re-credentialing |
| Government permits | New schemas + forms + 1-2 tables for inspection workflows | Minimal: new routes for inspections |

For any industry where structured forms need to be filled with data from a shared vocabulary, the engine works as-is. The only question is whether the domain needs concepts beyond the core model (cases, entries, filings, contacts, tasks, billing) — and if so, it's 1-2 new tables, not a rewrite.

Cross-Domain Concept Mapping

The core concepts translate directly across industries:

| Dossier Concept | Law | Insurance | Real Estate | Tax | Healthcare | Government |
|---|---|---|---|---|---|---|
| Case | Bankruptcy case | Claim | Transaction | Return | Provider app | Permit app |
| Schema | Data vocabulary (1,100 keys) | Claim fields | Transaction fields | Tax data | Provider info | Application data |
| Form | Court forms | ACORD forms | Closing docs | IRS forms | Credentialing apps | Application forms |
| Filing | Court filing | Claim submission | County recording | IRS e-file | Board submission | Agency submission |
| Binding | Schema → form fields | Same | Same | Same | Same | Same |
| Validation | Means test, schedule totals | Coverage limits | Loan-to-value | Tax calculations | License expiry | Zoning compliance |
| Contact | Debtor, Attorney, Trustee | Claimant, Adjuster | Buyer, Seller, Agent | Taxpayer, CPA | Provider, Payer | Applicant, Inspector |
| DataSource | Credit report, CSV | Policy system | MLS, title search | W-2 import | NPDB, license DB | GIS, prior permits |
| Status workflow | Intake → Filed → Discharged | Reported → Settled → Closed | Listed → Closing → Recorded | Intake → Filed → Accepted | Submitted → Approved → Enrolled | Submitted → Approved → Issued |

Concepts that do not exist in the core model but that specific domains would need:

  • Insurance: Claim payout/settlement tracking (reserves, subrogation)
  • Real estate: Escrow management (trust accounting, disbursement)
  • Tax: Tax calculation engine (brackets, phase-outs — beyond simple expressions)
  • Healthcare: License/credential expiry tracking (recurring re-verification cycles)
  • Government: Multi-stage inspection workflows (sequential pass/fail gates)

None of these are needed for law verticals. Switching between bankruptcy, immigration, family law, personal injury (PI), or estate work requires zero code changes.

Next Verticals

Ranked by volume and Dossier fit (detailed analysis in next-verticals.md):

  1. Immigration (10/10 fit) — 8-13M USCIS form receipts/year. ~100+ federal fillable PDFs. Same architecture: federal forms, fillable PDFs, schema → bindings → AcroFields. No local forms needed.

  2. Family Law (8/10 fit) — 1.5-2M matters/year. Very form-dense (15-30 forms per contested case). State-by-state build, but schema overlaps with bankruptcy (asset/debt inventories, financial disclosures).

  3. Eviction (7/10 fit) — 3.6M filings/year. Simple forms (3-7 per case) but massive volume. Good for bulk automation.

  4. Probate (7/10 fit) — 2.6M filings/year. Schema overlaps with bankruptcy (asset/debt inventories, creditor lists). State variation is the main cost.

  5. Workers' Compensation (6/10 fit) — 2.5M claims/year. IAIABC standards provide a natural schema. Attorney-side market is open.

For detailed cross-domain analysis including what translates directly, what needs schema config, and what needs new DB/server work, see domain-comparison.md.


Technical Reference

For complete type definitions, expression syntax, scoping rules, database schema, and ID conventions, see Domain Model Reference.

Source: docs/engine.md