Dossier Engine
The Insight
Every regulated industry has the same problem: structured forms that need to be filled with data that already exists somewhere. A bankruptcy attorney types the debtor's name into Form 101, then types the same name into Schedule A/B, Schedule D, Schedule E/F, the Statement of Financial Affairs, and every other form in the package. The data exists — it was entered once. The routing is what's missing.
Dossier solves this with three ideas:
- A shared vocabulary (Schema): Every data point in a domain gets a canonical name. debtor1.first_name is the debtor's first name, regardless of which form asks for it or which source provided it.
- Bindings that route data to forms: Each form declares how its PDF fields map to schema keys. Enter debtor1.first_name once, and bindings carry it to every form that needs it, across 70+ forms in a bankruptcy package.
- Data sources that are interchangeable: Manual entry, credit report XML, a case management API, or an LLM reading a pay stub all produce the same thing: schema key = value pairs. The binding engine doesn't care where the data came from.
The engine is the schema + binding resolver + expression engine + PDF filler. It knows nothing about law, bankruptcy, insurance, tax, or any specific domain. All domain knowledge is externalized into three JSON artifacts: schemas, forms (with bindings), and data sources.
How It Works
Data Flow
The Schema sits at the center. DataSources write to it (inbound), Bindings read from it (outbound). Every interaction — manual or automated — produces the same artifact: an Entry.
External data (XML, API, CSV, PDF, manual entry)
↓
DataSource (parse rules + field mapping)
↓
Schema keys (the shared vocabulary)
↓
Entry on Case (batch of key=value with source + timestamp)
↓
Binding engine (resolves entries → form fields)
↓
Filled PDF
Fill Once, Populate Everywhere
The debtor's first name is entered once. The binding engine routes it to every form:
debtor1.first_name
→ Form 101, field "Debtor1.First name"
→ Schedule A/B, field "Debtor 1 First name"
→ Schedule D, field "Debtor 1 Name"
→ Schedule E/F, field "Debtor 1 Name"
→ Statement of Financial Affairs, field "Debtor 1"
→ Declaration, field "Name of Debtor"
→ ...every form in the package that needs it
This works for every data point. Creditor names cascade to every schedule. Addresses appear on every form that asks. Social security numbers are masked where required. One schema key, many targets.
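The fan-out in the list above can be sketched as a lookup over a flat binding table. This is a minimal illustration, not the engine's resolver: the `route` function, the form ids, and the last-name row are hypothetical, while the first-name field names are taken from the list above.

```python
from collections import defaultdict

# Hypothetical binding table: (schema key, form id, PDF field name).
# The form ids and the last-name row are invented for illustration.
BINDINGS = [
    ("debtor1.first_name", "form-101", "Debtor1.First name"),
    ("debtor1.first_name", "schedule-ab", "Debtor 1 First name"),
    ("debtor1.first_name", "schedule-d", "Debtor 1 Name"),
    ("debtor1.last_name", "form-101", "Debtor1.Last name"),
]


def route(values, bindings):
    """Return {form_id: {pdf_field: value}} for every binding whose key has a value."""
    filled = defaultdict(dict)
    for key, form_id, pdf_field in bindings:
        if key in values:
            filled[form_id][pdf_field] = values[key]
    return dict(filled)


# One value entered once reaches every form that binds to its key.
forms = route({"debtor1.first_name": "Ada"}, BINDINGS)
for form_id, fields in forms.items():
    print(form_id, fields)
```

Keys with no value are skipped, so a partially filled case still produces a partial set of form fields.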
Any Source, Same Result
All data sources produce the same thing: an Entry with schema key = value pairs.
| Source | What happens | Entry |
|---|---|---|
| Manual entry | Lawyer types 5 fields, clicks save | 1 entry, 5 values, source="manual", auto-confirmed |
| Credit report | MISMO XML parsed by DataSource config | 1 entry, 80 values, source="credit-report", pending review |
| Case management API | REST sync from Clio/MyCase | 1 entry, 12 values, source="case-mgmt", auto-confirmed |
| Document upload | LLM reads a pay stub, maps to schema keys | 1 entry, 5 values, source="pay-stub", pending review |
| Manual correction | Lawyer fixes a creditor name from the credit report | 1 entry, 1 value, overrides the credit report's value |
Current state = merge all confirmed entries by timestamp (latest wins per key). The case activity feed shows every entry: what changed, where it came from, and when. A correction doesn't erase the original: when a lawyer overrides a credit report value, both entries stay in the timeline.
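The merge rule can be sketched in a few lines. The `Entry` shape and `current_state` function are assumptions made for illustration. The sketch implements only the stated rule (confirmed entries, latest timestamp wins per key) and does not model any source-priority rule, such as a manual value taking precedence over a later API sync.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    # Hypothetical entry shape: a batch of key=value pairs plus provenance.
    source: str
    timestamp: int  # any sortable timestamp works
    values: dict
    confirmed: bool = True


def current_state(entries):
    """Merge confirmed entries oldest-first so the latest value wins per key."""
    state = {}
    for entry in sorted(entries, key=lambda e: e.timestamp):
        if entry.confirmed:
            state.update(entry.values)
    return state


entries = [
    Entry("credit-report", 1, {"creditors.secured[0].name": "ACME BNK",
                               "creditors.secured[0].claim_amount": 1200}),
    Entry("manual", 2, {"creditors.secured[0].name": "Acme Bank"}),
    Entry("pay-stub", 3, {"income.wages": 3000}, confirmed=False),
]
state = current_state(entries)
print(state)
```

The entry list itself is never modified, which is what keeps the original credit report value visible in the timeline after the correction.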
Forms Are Recursive
A form can be a leaf (has a PDF with extractable fields), a composite (groups other forms and routes data between them), or both. Composites contain composites — a Chapter 7 Individual package contains a Petition group, which contains Form 101, Declaration, and other leaf forms.
Chapter 7 Individual (composite)
├── Petition (composite)
│ ├── Form 101 — Voluntary Petition (leaf, 155 fields)
│ ├── Form 101A — Initial Statement (leaf, conditional)
│ └── Declaration (leaf)
├── Schedules (composite)
│ ├── Schedule Summary (leaf)
│ ├── Schedule A/B — Property (leaf)
│ ├── Schedule C — Exemptions (leaf)
│ ├── Schedule D — Secured Creditors (leaf)
│ ├── Schedule E/F — Unsecured Creditors (leaf)
│ ├── Schedule G — Executory Contracts (leaf)
│ ├── Schedule H — Co-debtors (leaf, conditional)
│ ├── Schedule I — Income (leaf)
│ └── Schedule J — Expenses (leaf)
├── SOFA — Statement of Financial Affairs (leaf)
├── Means Test — Form 122A (leaf)
└── ...other forms
Bindings flow downward only. A composite form's binding can reference any descendant. Cross-form bindings (e.g., income from Schedule I flows to the Means Test) live on the common ancestor. Never up, never sideways.
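The downward-only rule can be sketched as two tree queries: whether a form may own a binding to a given target, and which composite should own a cross-form binding. The `Form` class, the helper names, and the form ids are hypothetical, and the sketch assumes a form may also bind to its own fields.

```python
from dataclasses import dataclass, field


@dataclass
class Form:
    id: str
    children: list = field(default_factory=list)


def descendants(form):
    """Yield the id of every form below `form` in the tree."""
    for child in form.children:
        yield child.id
        yield from descendants(child)


def may_bind(owner, target_id):
    """A binding on `owner` may target the owner itself or anything beneath it."""
    return target_id == owner.id or target_id in set(descendants(owner))


def common_ancestor(root, a, b):
    """Return the deepest form that has both `a` and `b` beneath it, or None."""
    below = set(descendants(root))
    if a not in below or b not in below:
        return None
    for child in root.children:
        deeper = common_ancestor(child, a, b)
        if deeper is not None:
            return deeper
    return root


petition = Form("petition", [Form("form-101"), Form("declaration")])
schedules = Form("schedules", [Form("schedule-i"), Form("schedule-j")])
ch7 = Form("ch7-individual", [petition, schedules, Form("means-test")])

# Schedule I income feeding the Means Test must live on the package itself.
owner = common_ancestor(ch7, "schedule-i", "means-test")
print(owner.id)
```

Under this rule, a binding between Schedule I and Schedule J would live on the Schedules composite, while one between Schedule I and the Means Test has to move up to the Chapter 7 package.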
Expression Engine
Bindings can do more than simple routing. The expression engine supports 22 Excel-style functions:
- String: CONCAT(first, ' ', last), UPPER(name), RIGHT(acct, 4), SUBSTITUTE(phone, "-", "")
- Math: SUM(line1, line2), ROUND(amount, 2), MAX(income, 0), COUNT(creditors)
- Logical: IF(joint, debtor2_name, ""), AND(employed, income > 0), IN(type, 'Secured', 'Mortgage')
- Date: TODAY(), YEAR(filed_date), MONTH(opened)
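To make the function-call syntax concrete, here is a toy tokenizer, parser, and evaluator for a small subset of it. It is a sketch under assumptions, not the engine's implementation: it handles only nested function calls, string and number literals, and key references (no infix operators such as `>`), and the six functions it defines only approximate their Excel namesakes.

```python
import re

TOKEN = re.compile(r"""
    \s*(?:
        (?P<num>\d+(?:\.\d+)?)
      | '(?P<sq>[^']*)'
      | "(?P<dq>[^"]*)"
      | (?P<name>[A-Za-z_][\w.]*)
      | (?P<punct>[(),])
    )""", re.VERBOSE)

# A handful of the listed functions; semantics here are an approximation.
FUNCTIONS = {
    "CONCAT": lambda *parts: "".join(str(p) for p in parts),
    "UPPER": lambda s: str(s).upper(),
    "RIGHT": lambda s, n: str(s)[-int(n):] if int(n) > 0 else "",
    "SUBSTITUTE": lambda s, old, new: str(s).replace(old, new),
    "SUM": lambda *nums: sum(nums),
    "IF": lambda cond, then, otherwise: then if cond else otherwise,
}


def tokenize(text):
    pos, tokens, text = 0, [], text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    return tokens


def parse(tokens):
    """Build a nested-tuple AST: ('lit', v), ('ref', key) or ('call', name, args)."""
    def expr(i):
        kind, text = tokens[i]
        if kind == "num":
            return ("lit", float(text)), i + 1
        if kind in ("sq", "dq"):
            return ("lit", text), i + 1
        if kind == "name" and i + 1 < len(tokens) and tokens[i + 1] == ("punct", "("):
            args, i = [], i + 2
            while tokens[i] != ("punct", ")"):
                arg, i = expr(i)
                args.append(arg)
                if tokens[i] == ("punct", ","):
                    i += 1
            return ("call", text, args), i + 1
        if kind == "name":
            return ("ref", text), i + 1
        raise ValueError(f"unexpected token {text!r}")

    ast, end = expr(0)
    if end != len(tokens):
        raise ValueError("trailing tokens after expression")
    return ast


def evaluate(ast, values):
    if ast[0] == "lit":
        return ast[1]
    if ast[0] == "ref":
        return values[ast[1]]
    _, name, args = ast
    return FUNCTIONS[name](*(evaluate(arg, values) for arg in args))


def run(expression, values):
    return evaluate(parse(tokenize(expression)), values)


print(run("CONCAT(first, ' ', last)", {"first": "Ada", "last": "Lovelace"}))
print(run("RIGHT(acct, 4)", {"acct": "123456789"}))
```

Because parsing is separate from evaluation, the AST for a given expression string could be cached and reused across cases. Note that this toy `IF` evaluates both branches eagerly.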
Conditional bindings apply only when a condition is met:
{
"source": "debtor2.first_name",
"targets": ["$b101.Debtor2.First name", "$b106ab.Joint Debtor Name"],
"condition": "case.is_joint == true"
}
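A rough sketch of how such a binding might be applied follows. The `condition_holds` helper is a deliberately tiny stand-in that understands only `<key> == <literal>`, and `apply_binding` is a hypothetical name; the real engine evaluates conditions with its expression engine.

```python
import json

BINDING = {
    "source": "debtor2.first_name",
    "targets": ["$b101.Debtor2.First name", "$b106ab.Joint Debtor Name"],
    "condition": "case.is_joint == true",
}


def condition_holds(condition, values):
    """Evaluate `<key> == <literal>` only; the literal is parsed as JSON."""
    key, _, literal = condition.partition("==")
    return values.get(key.strip()) == json.loads(literal.strip())


def apply_binding(binding, values):
    """Return {target: value} if the condition holds and the source has a value."""
    condition = binding.get("condition")
    if condition and not condition_holds(condition, values):
        return {}
    if binding["source"] not in values:
        return {}
    return {target: values[binding["source"]] for target in binding["targets"]}


joint = apply_binding(BINDING, {"case.is_joint": True, "debtor2.first_name": "Bo"})
single = apply_binding(BINDING, {"case.is_joint": False, "debtor2.first_name": "Bo"})
print(joint)
print(single)
```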
Computed schema entries derive values from other entries:
{
"key": "income.total",
"type": "money",
"label": "Total Monthly Income",
"expression": "SUM(income.wages, income.business, income.rental)"
}
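Computed entries that depend on other computed entries need to be resolved in dependency order, and a circular definition should be rejected. One way to sketch this uses the standard library's `graphlib` (Python 3.9+). The `income.annual` key, the explicit dependency tuples, and the default of 0 for missing inputs are assumptions made for this example; Python callables stand in for expression strings.

```python
from graphlib import CycleError, TopologicalSorter

# key -> (dependencies, function); callables stand in for expression strings.
COMPUTED = {
    "income.total": (
        ("income.wages", "income.business", "income.rental"),
        lambda *parts: sum(parts),
    ),
    # Hypothetical second-level key, included to show ordering.
    "income.annual": (("income.total",), lambda total: total * 12),
}


def resolve(values, computed):
    """Evaluate computed keys in dependency order; raises CycleError on a cycle."""
    graph = {key: set(deps) for key, (deps, _) in computed.items()}
    result = dict(values)
    for key in TopologicalSorter(graph).static_order():
        if key in computed:
            deps, fn = computed[key]
            # Assumption: a missing input counts as 0.
            result[key] = fn(*(result.get(dep, 0) for dep in deps))
    return result


resolved = resolve(
    {"income.wages": 3000, "income.business": 500, "income.rental": 0},
    COMPUTED,
)
print(resolved["income.total"], resolved["income.annual"])

try:
    resolve({}, {"a": (("b",), lambda b: b), "b": (("a",), lambda a: a)})
    cycle_detected = False
except CycleError:
    cycle_detected = True
print(cycle_detected)
```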
DataSource Configuration
A DataSource is a reusable recipe that turns external data into schema key values. It combines parse rules (how to read the source format) with field mapping (how to transform into schema keys).
{
"name": "Credit Report (Bankruptcy)",
"type": "api",
"schemaId": "bankruptcy.individual",
"singleton": {
"debtor1.full_name": "CONCAT(borrowers[0].firstName, ' ', borrowers[0].lastName)",
"debtor1.ssn": "borrowers[0].ssn"
},
"collections": {
"creditors": {
"source": "liabilities",
"filter": "AND(IN(status, 'Open'), balance > 0)",
"classify": {
"secured": { "rule": "IN(loanType, 'Secured', 'Mortgage')", "targetPrefix": "creditor.secured[]" },
"unsecured": { "rule": "DEFAULT", "targetPrefix": "creditor.unsecured[]" }
},
"fields": {
"name": "creditorName",
"account_number": "RIGHT(accountNumber, 4)",
"claim_amount": "balance"
}
}
}
}
DataSource mappings are inbound (external field → schema key). Bindings are outbound (schema key → form field). Both use the same key vocabulary — same expression engine, same [] array syntax.
The schema is the clean boundary. DataSources don't know about PDF fields. Bindings don't know about APIs. Build the key picker once, use it everywhere.
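The collections block of the configuration above describes a filter, a classification step, and a field mapping. The sketch below hand-translates those three rules into Python to show the shape of the output. The function name, the sample liabilities, and the interpretation of `[]` as "append the next row index" are assumptions; the real engine would evaluate the rules from the config.

```python
def import_creditors(liabilities):
    """Apply the filter, classify and fields rules from the config by hand."""
    values = {}
    next_index = {"creditor.secured": 0, "creditor.unsecured": 0}
    for item in liabilities:
        # filter: AND(IN(status, 'Open'), balance > 0)
        if not (item["status"] == "Open" and item["balance"] > 0):
            continue
        # classify: secured rule first, DEFAULT falls through to unsecured
        if item["loanType"] in ("Secured", "Mortgage"):
            prefix = "creditor.secured"
        else:
            prefix = "creditor.unsecured"
        row = f"{prefix}[{next_index[prefix]}]"
        next_index[prefix] += 1
        # fields: name, RIGHT(accountNumber, 4), balance
        values[f"{row}.name"] = item["creditorName"]
        values[f"{row}.account_number"] = item["accountNumber"][-4:]
        values[f"{row}.claim_amount"] = item["balance"]
    return values


# Made-up sample data shaped like the config's source fields.
liabilities = [
    {"creditorName": "First Mortgage Co", "accountNumber": "0000123456",
     "loanType": "Mortgage", "status": "Open", "balance": 150000},
    {"creditorName": "Card Issuer", "accountNumber": "4111000011112222",
     "loanType": "Revolving", "status": "Open", "balance": 2300},
    {"creditorName": "Old Loan", "accountNumber": "999",
     "loanType": "Secured", "status": "Closed", "balance": 0},
]
values = import_creditors(liabilities)
for key, value in values.items():
    print(key, "=", value)
```

The output is a flat set of schema key = value pairs, the same shape a manual entry would produce, so it can be stored as a single entry pending review.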
Example Flows
Every path through the system follows the same pattern: something produces entries on a case, bindings route them to PDF fields.
A. Lawyer fills in debtor information and clicks save
Lawyer types 5 fields, saves → Entry (manual, 5 values, confirmed)
→ Bindings route each value to every form field that needs it
→ No DataSource needed. Source = "manual", auto-confirmed.
B. Credit report fills creditor schedules
MISMO XML → DataSource: credit-report (parse + classify + map)
→ Entry (credit-report, 80 values, pending review)
→ Bindings → Schedule D creditor rows, Schedule E/F creditor rows, ...
→ Lawyer reviews and confirms before values flow to PDFs
C. Case management sync fills debtor demographics
Clio REST API → DataSource: case-mgmt (field mapping)
→ Entry (case-mgmt, 12 values, confirmed)
→ Bindings → Form 101, Schedule A/B, Schedule I, ...
→ Same schema keys, same bindings, different source
→ If lawyer already typed the name manually, the API value doesn't override
D. Pay stub upload fills income
Pay stub PDF → DataSource: pay-stub (LLM extraction)
→ Entry (pay-stub, 5 values, pending review)
→ Bindings → Schedule I income fields
→ First time: lawyer confirms the LLM mapping (saved as reusable template)
→ Second time: auto-applied from the saved template
E. Lawyer corrects a value from the credit report
Fixes creditor name → Entry (manual, 1 value, confirmed)
→ Overrides the credit report's value for creditors.secured[0].name
→ Credit report entry still exists in timeline — activity feed shows what changed and why
F. Same data source, different domain (future)
Same MISMO XML → DataSource: credit-report-auto (different schema mapping)
→ Entry (credit-report, N values)
→ Different schema (insurance.auto.claim) → different bindings → insurance forms
→ The DataSource is scoped to a schema, not to forms
The Extraction Pipeline
Building a new domain follows a repeatable pipeline:
- Define the schema: Enumerate every data point in the domain. Give each a canonical key, type, label, and group.
- Process the forms: Take each government/standard PDF, extract its AcroForm fields (key, type, page, position, rect).
- Generate bindings: Map each PDF field to the appropriate schema key. This is where domain knowledge is captured: understanding that "Debtor1.First name" on Form 101 and "Debtor 1 Name" on Schedule D both mean debtor1.first_name.
- Compose form packages: Group leaf forms into composites (Petition, Schedules, Chapter 7 Package). Write cross-child bindings that route data between sibling forms.
- Configure data sources: Define how external data (credit reports, APIs, documents) maps to schema keys.
The result: populate a few schema keys from any source, and the binding engine cascades the values to every form that needs them.
What's Been Built
Bankruptcy Domain
Schemas:
- bankruptcy.individual: ~1,100 schema keys covering debtor identity, income, expenses, assets, liabilities, creditors (secured, unsecured, priority), executory contracts, co-debtors, prior filings, and administrative data
- bankruptcy.nonindividual: ~640 schema keys covering entity information, officers, revenue, assets, liabilities, and corporate-specific data
- 12 shared administrative keys (case.*, attorney.*)
Forms processed:
- 69 federal leaf forms — Every fillable PDF in the bankruptcy form set. Fields extracted, bindings generated, schema keys mapped.
- Local forms from IL and GA — State-specific local bankruptcy forms processed with the same pipeline, extending the federal form set.
- 19 composite forms — Chapter packages (Ch.7 Individual, Ch.7 Non-Individual, Ch.13 Individual, etc.), Petition group, Schedules group, and other logical groupings.
What the bindings encode:
The bindings are the captured domain knowledge. They encode:
- debtor1.first_name appears on 40+ forms under different field names
- Creditor arrays in the schema map to repeating table rows on Schedule D, E/F, and the creditor matrix
- The means test (Form 122A) uses income values from Schedule I and expense values from Schedule J
- Summary totals on Schedule A/B Sum aggregate values from individual schedules
- Conditional forms are included or excluded based on case data (e.g., Schedule H only when case.has_codebtors == true)
- Cross-form sync ensures the same creditor list stays consistent between Schedules and the Chapter 13 Plan
Engine implementation:
- PDF field extraction (AcroForm field discovery with position and type metadata)
- Expression engine: tokenizer → parser → AST → evaluator, 22 functions, AST caching
- Binding resolver with condition evaluation and cycle detection
- PDF filler (pdf-lib AcroFields) + multi-form export (merged PDF or ZIP)
- DataSource framework — credit report parsing (live), CSV import, Clio API mapping config, pay-stub / document upload (extraction path scaffolded, LLM mapping planned)
- Client portal (packages/portal): tenant-branded intake app with embeddable widget (bubble, panel, full-page), dashboard for invited clients, and four published intake configs (Atlas, DebtStoppers, Greenfield, individual self-file)
- Public intake + portal routes (/intake/:slug, /portal/:tenantSlug): tokenized invites, multipart file uploads, rate-limited, write directly to entries on a case
Why It Generalizes
The engine has no concept of "law" or "bankruptcy." It knows five things:
- Schemas — vocabularies of typed data points with dotted key notation
- Forms — PDFs with extractable AcroForm fields, composable into packages
- Bindings — routes from schema keys to form fields, with expressions and conditions
- DataSources — recipes for importing external data into schema keys
- Entries — batches of key=value changes with source tracking
Swap the schema, forms, and bindings — you have a different domain. The engine, server, database, and API routes are unchanged.
Domain-Agnostic Infrastructure
The platform's infrastructure is fully abstract:
- Database tables know nothing about law, insurance, or tax. They store schemas (JSONB entries), forms (JSONB fields/bindings), cases (JSONB references/dates), entries (JSONB values), and filings (JSONB snapshots).
- ~45 API routes handle CRUD for cases, entries, filings, contacts, tasks, notes, events, billing, activity, attachments, data sources — all tenant-scoped and role-aware, all abstract.
- Schema UI config shapes the client app per domain: status labels, party roles, reference fields, date fields, document checklists, event types, and the label for "Case" (which becomes "Claim" or "Return" or "Application" in other domains).
- Case management (tasks, notes, calendar, billing, activity, attachments, contacts) is universal across all domains without modification.
What Changes Per Domain
| To target... | What you build | Code changes |
|---|---|---|
| Another law type (immigration, family, PI) | New schemas + forms + bindings + UI config | None |
| Tax preparation | New schemas + forms. Expression engine handles calculations. | None (maybe DataSource for tax tables) |
| Insurance claims | New schemas + forms + 1-2 tables for payouts/settlements | Minimal — new routes for claim financials |
| Real estate closings | New schemas + forms + 1-2 tables for escrow management | Minimal — new routes for escrow |
| Healthcare credentialing | New schemas + forms + 1 table for credential expiry | Minimal — new routes for re-credentialing |
| Government permits | New schemas + forms + 1-2 tables for inspection workflows | Minimal — new routes for inspections |
For any industry where structured forms need to be filled with data from a shared vocabulary, the engine works as-is. The only question is whether the domain needs concepts beyond the core model (cases, entries, filings, contacts, tasks, billing) — and if so, it's 1-2 new tables, not a rewrite.
Cross-Domain Concept Mapping
The core concepts translate directly across industries:
| Dossier Concept | Law | Insurance | Real Estate | Tax | Healthcare | Government |
|---|---|---|---|---|---|---|
| Case | Bankruptcy case | Claim | Transaction | Return | Provider app | Permit app |
| Schema | Data vocabulary (1,100 keys) | Claim fields | Transaction fields | Tax data | Provider info | Application data |
| Form | Court forms | ACORD forms | Closing docs | IRS forms | Credentialing apps | Application forms |
| Filing | Court filing | Claim submission | County recording | IRS e-file | Board submission | Agency submission |
| Binding | Schema → form fields | Same | Same | Same | Same | Same |
| Validation | Means test, schedule totals | Coverage limits | Loan-to-value | Tax calculations | License expiry | Zoning compliance |
| Contact | Debtor, Attorney, Trustee | Claimant, Adjuster | Buyer, Seller, Agent | Taxpayer, CPA | Provider, Payer | Applicant, Inspector |
| DataSource | Credit report, CSV | Policy system | MLS, title search | W-2 import | NPDB, license DB | GIS, prior permits |
| Status workflow | Intake → Filed → Discharged | Reported → Settled → Closed | Listed → Closing → Recorded | Intake → Filed → Accepted | Submitted → Approved → Enrolled | Submitted → Approved → Issued |
Concepts that DON'T exist in the core model but specific domains would need:
- Insurance: Claim payout/settlement tracking (reserves, subrogation)
- Real estate: Escrow management (trust accounting, disbursement)
- Tax: Tax calculation engine (brackets, phase-outs — beyond simple expressions)
- Healthcare: License/credential expiry tracking (recurring re-verification cycles)
- Government: Multi-stage inspection workflows (sequential pass/fail gates)
None of these are needed for law verticals. Switching between bankruptcy, immigration, family law, PI, or estate — zero code changes.
Next Verticals
Ranked by volume and Dossier fit (detailed analysis in next-verticals.md):
Immigration (10/10 fit) — 8-13M USCIS form receipts/year. ~100+ federal fillable PDFs. Same architecture: federal forms, fillable PDFs, schema → bindings → AcroFields. No local forms needed.
Family Law (8/10 fit) — 1.5-2M matters/year. Very form-dense (15-30 forms per contested case). State-by-state build, but schema overlaps with bankruptcy (asset/debt inventories, financial disclosures).
Eviction (7/10 fit) — 3.6M filings/year. Simple forms (3-7 per case) but massive volume. Good for bulk automation.
Probate (7/10 fit) — 2.6M filings/year. Schema overlaps with bankruptcy (asset/debt inventories, creditor lists). State variation is the main cost.
Workers' Compensation (6/10 fit) — 2.5M claims/year. IAIABC standards provide a natural schema. Attorney-side market is open.
For detailed cross-domain analysis including what translates directly, what needs schema config, and what needs new DB/server work, see domain-comparison.md.
Technical Reference
For complete type definitions, expression syntax, scoping rules, database schema, and ID conventions, see Domain Model Reference.
docs/engine.md