Dossier Engine
Schema-driven document assembly. The engine knows nothing about law, insurance, or tax — all domain knowledge lives in schemas, forms, and bindings.
The Insight
Every regulated industry has the same problem: structured forms that need to be filled with data that already exists somewhere. A bankruptcy attorney types the debtor's name into Form 101, then types the same name into Schedule A/B, Schedule D, Schedule E/F, the Statement of Financial Affairs, and every other form in the package. The data exists — it was entered once. The routing is what's missing.
Dossier solves this with three ideas that compose into a general-purpose document assembly engine. The engine itself knows nothing about law, bankruptcy, insurance, tax, or any specific domain. All domain knowledge is externalized into three JSON artifacts: schemas, forms (with bindings), and data sources.
The result: populate a few schema keys from any source, and the binding engine cascades the values to every form that needs them — across 70+ forms in a bankruptcy package, or across any set of forms in any domain.
debtor1.first_name is the debtor's first name, regardless of which form asks for it, regardless of which source provided it.
Enter debtor1.first_name once, and bindings carry it to every form that needs it — across 70+ forms in a bankruptcy package.
How It Works
The Schema sits at the center. DataSources write to it (inbound), Bindings read from it (outbound). Every interaction — manual or automated — produces the same artifact: an Entry (a batch of schema key = value pairs with source tracking).
Fill Once, Populate Everywhere
The debtor's first name is entered once. The binding engine routes it to every form that needs it. This works for every data point — creditor names cascade to every schedule, addresses appear on every form that asks, social security numbers are masked where required. One schema key, many targets.
debtor1.first_name
→ Form 101, field "Debtor1.First name"
→ Schedule A/B, field "Debtor 1 First name"
→ Schedule D, field "Debtor 1 Name"
→ Schedule E/F, field "Debtor 1 Name"
→ Statement of Financial Affairs, field "Debtor 1"
→ Declaration, field "Name of Debtor"
→ ...every form in the package that needs it
Any Source, Same Result
All data sources produce the same thing: an Entry with schema key = value pairs. The binding engine doesn't distinguish between a lawyer typing a name and a credit report XML providing 80 creditor records.
| Source | What Happens | Resulting Entry |
|---|---|---|
| Manual entry | Lawyer types 5 fields, clicks save | 1 entry, 5 values, source="manual", auto-confirmed |
| Credit report | MISMO XML parsed by DataSource config | 1 entry, 80 values, source="credit-report", pending review |
| Case mgmt API | REST sync from Clio/MyCase | 1 entry, 12 values, source="case-mgmt", auto-confirmed |
| Document upload | LLM reads a pay stub, maps to schema keys | 1 entry, 5 values, source="pay-stub", pending review |
| Manual correction | Lawyer fixes a creditor name from the credit report | 1 entry, 1 value, overrides the credit report's value |
Latest wins per key. The case history shows every entry — what changed, where it came from, when. A correction doesn't erase the original; both entries stay in the timeline.
Example Flows
Every path through the system follows the same pattern: something produces entries on a case, bindings route them to PDF fields.
What's Been Built
The first domain is bankruptcy. The extraction pipeline has processed every federal form, generated bindings, and composed packages — proving the engine works end-to-end.
bankruptcy.nonindividual — ~640 keys covering entity information, officers, revenue, assets, liabilities, and corporate-specific data.
12 shared administrative keys (case.*, attorney.*).
Local forms from IL and GA — State-specific local bankruptcy forms processed with the same pipeline.
16 composite forms — Chapter packages (Ch.7 Individual, Ch.7 Non-Individual, Ch.13 Individual, etc.), Petition group, Schedules group, and other logical groupings.
debtor1.first_name appears on 40+ forms under different field names. Creditor arrays map to repeating table rows on Schedule D, E/F, and the creditor matrix. The means test uses income from Schedule I and expenses from Schedule J. Summary totals aggregate values from individual schedules. Conditional forms are included or excluded based on case data.
The Extraction Pipeline
Building a new domain follows a repeatable 5-step pipeline:
2. Process the forms — Take each government/standard PDF, extract its AcroForm fields (key, type, page, position, rect).
3. Generate bindings — Map each PDF field to the appropriate schema key. This is where domain knowledge is captured: understanding that "Debtor1.First name" on Form 101 and "Debtor 1 Name" on Schedule D both mean debtor1.first_name.
4. Compose form packages — Group leaf forms into composites (Petition, Schedules, Chapter 7 Package). Write cross-child bindings that route data between sibling forms.
5. Configure data sources — Define how external data (credit reports, APIs, documents) maps to schema keys.
Engine Implementation
Why It Generalizes
The engine has no concept of "law" or "bankruptcy." It knows five abstract things:
Swap the schema, forms, and bindings — you have a different domain. The engine, server, database, and API routes are unchanged.
What Changes Per Domain
| To target... | What you build | Code changes |
|---|---|---|
| Another law type (immigration, family, PI) | New schemas + forms + bindings + UI config | None |
| Tax preparation | New schemas + forms. Expression engine handles calculations. | None (maybe DataSource for tax tables) |
| Insurance claims | New schemas + forms + 1-2 tables for payouts/settlements | Minimal — new routes for claim financials |
| Real estate closings | New schemas + forms + 1-2 tables for escrow management | Minimal — new routes for escrow |
| Healthcare credentialing | New schemas + forms + 1 table for credential expiry | Minimal — new routes for re-credentialing |
| Government permits | New schemas + forms + 1-2 tables for inspection workflows | Minimal — new routes for inspections |
For any industry where structured forms need to be filled with data from a shared vocabulary, the engine works as-is. The only question is whether the domain needs concepts beyond the core model — and if so, it's 1-2 new tables, not a rewrite.
Cross-Domain Concept Mapping
The core concepts translate directly across industries:
| Dossier Concept | Law | Insurance | Real Estate | Tax | Healthcare | Government |
|---|---|---|---|---|---|---|
| Case | Bankruptcy case | Claim | Transaction | Return | Provider app | Permit app |
| Schema | Data vocabulary (1,100 keys) | Claim fields | Transaction fields | Tax data | Provider info | Application data |
| Form | Court forms | ACORD forms | Closing docs | IRS forms | Credentialing apps | Application forms |
| Filing | Court filing | Claim submission | County recording | IRS e-file | Board submission | Agency submission |
| Binding | Schema → form fields | Same | Same | Same | Same | Same |
| Validation | Means test, schedule totals | Coverage limits | Loan-to-value | Tax calculations | License expiry | Zoning compliance |
| Contact | Debtor, Attorney, Trustee | Claimant, Adjuster | Buyer, Seller, Agent | Taxpayer, CPA | Provider, Payer | Applicant, Inspector |
| DataSource | Credit report, CSV | Policy system | MLS, title search | W-2 import | NPDB, license DB | GIS, prior permits |
Next Verticals
Ranked by volume and Dossier fit:
| Vertical | Fit | Rationale |
|---|---|---|
| Immigration | 10/10 | 8-13M USCIS form receipts/year. ~100+ federal fillable PDFs. Same architecture: federal forms, fillable PDFs, schema → bindings → AcroFields. No local forms needed. |
| Family Law | 8/10 | 1.5-2M matters/year. Very form-dense (15-30 forms per contested case). State-by-state build, but schema overlaps with bankruptcy (asset/debt inventories, financial disclosures). |
| Eviction | 7/10 | 3.6M filings/year. Simple forms (3-7 per case) but massive volume. Good for bulk automation. |
| Probate | 7/10 | 2.6M filings/year. Schema overlaps with bankruptcy (asset/debt inventories, creditor lists). State variation is the main cost. |
| Workers' Compensation | 6/10 | 2.5M claims/year. IAIABC standards provide a natural schema. Attorney-side market is open. |
The 8 Names
Architecture
Four layers, one spine. The Schema sits at the center. DataSources write to it (inbound), Bindings read from it (outbound). Every interaction — manual or automated — produces Entries on a File.
Entities
The domain splits into two sides: Producers create entry values, Consumers use them. The schema key = value pair is the central currency.
schemaId.
Has fields[], children[], bindings[], and validations[].
bankruptcy.individual).
formId + version. Contains entries (batches of values) and tracks which DataSources produced them.
domains/{domain}/data-sources/. Same expression engine as bindings.
source, timestamp, confirmed flag, and a values map of schema key = value pairs.
A file is a timeline of entries.
Schema
The schema defines what data points exist for a domain. Every key has a type, label, hint, and group. Forms bind to schema keys. DataSources map to schema keys. One schema per domain variant.
debtor1.ssn, creditors.secured[].name, creditors.secured[].claim_amount, income.d1.gross_wages, sofa.q1.prior_addresses[], case.district, ...1,100 keys
entity.ein, officer[].name, officer[].title, creditors.secured[].name, revenue.gross_year1, case.district, ...640 keys
vehicle.vin, incident.date, damage[].part, damage[].cost, future domain...
Schema Entry
One data point definition in a schema. The schema entry is the human-facing side — simpler than the raw form fields.
| Property | Type | Description |
|---|---|---|
| key | string | Dotted key — debtor.first_name, assets.total |
| type | text \| money \| date \| boolean \| enum \| number | Data type (not PDF widget type) |
| label | string | Question label shown to user |
| hint | string? | Plain-language explanation |
| expression | string? | If present, this is computed (not entered by user) |
| options | string[]? | Choices for enum type |
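The property table above can be read as a type definition. The following is a minimal sketch, assuming optional properties are simply absent when unused; `userEnteredKeys`, the sample labels, and the `assets.real` / `assets.personal` keys are hypothetical and only illustrate how a computed entry differs from one the user fills in.

```typescript
type SchemaEntryType = "text" | "money" | "date" | "boolean" | "enum" | "number";

// Mirrors the Schema Entry property table.
interface SchemaEntry {
  key: string;
  type: SchemaEntryType;
  label: string;
  hint?: string;
  expression?: string; // present => computed, not entered by the user
  options?: string[];  // choices for the enum type
}

// Entries without an expression are the ones a user is actually asked for.
function userEnteredKeys(entries: SchemaEntry[]): string[] {
  return entries.filter((e) => e.expression === undefined).map((e) => e.key);
}

// Illustrative entries; labels and the expression are made up for the example.
const sampleEntries: SchemaEntry[] = [
  { key: "debtor.first_name", type: "text", label: "First name" },
  {
    key: "assets.total",
    type: "money",
    label: "Total assets",
    expression: "SUM(assets.real, assets.personal)",
  },
];

const askedOfUser = userEnteredKeys(sampleEntries);
```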
Form
One entity replaces the old Form + Blueprint split. A form can be a leaf (has a PDF with fields), a composite (has children), or both. Recursive — composites can contain composites.
Gets filled and printed.
155 fields, 67 bindings
11 children, 17 bindings
15 children, 8 bindings, 335 schema keys
Form entity
A PDF with extracted fields, children, and bindings. All optional — a leaf has fields + filePath, a composite has children, a top-level form has a schemaId.
| Property | Type | Description |
|---|---|---|
| id | string | Deterministic UUID (see ID convention) |
| tenantId | string \| null | Null = platform-level |
| number | string \| null | Official form number — "101", "106E/F" |
| name | string | Full title |
| effectiveDate | string \| null | When this version took effect — "2024-06-22" |
| pages | number | Page count of the PDF (0 for composites without a PDF) |
| schemaId | string \| null | Schema this form binds to (typically set on the top-level form) |
| filePath | string \| null | Path to the PDF file (null for pure composites) |
| children | ChildEntry[] | Child forms in this container (empty for leaf forms) |
| fields | FormField[] | Extracted AcroForm fields (empty for composites) |
| bindings | Binding[] | Bindings ($.field for self, $child.field for children) |
| validations | Validation[] | Form-level validation rules |
FormField on Form
A single field extracted from a PDF's AcroForm dictionary.
| Property | Type | Description |
|---|---|---|
| key | string | Field name from the PDF (e.g. debtor_name) |
| type | text \| checkbox \| dropdown \| radio \| signature \| optionList | AcroForm widget type |
| page | number | 1-indexed page number |
| rect | { x, y, w, h } \| null | Position in PDF points — used for UX highlighting |
| options | string[] \| null | Choices for dropdown/radio/optionList |
| label | string \| null | Human-readable label |
| hint | string \| null | Help text explaining what to fill |
| needsReview | boolean | True if the key is generic and needs human cleanup |
| source | acro \| llm \| manual | How the field was discovered |
ChildEntry on Form
A reference to a child form within a composite form.
| Property | Type | Description |
|---|---|---|
| id | string | UUID of the child form |
| key | string | Alias used in expressions: $key.field |
| position | number | Print order |
| condition | string? | Include/exclude expression: case.has_preparer == true |
Binding
Connects a source expression to one or more target field addresses. The core data-wiring mechanism. Array syntax with [] supports repeating groups.
| Property | Type | Description |
|---|---|---|
| source | string | Expression: debtor.first_name, $b106ab.line55 + $b106ef.total |
| targets | string[] | Target addresses: ["$b106sum.1a", "$b106ab.line63"] |
| condition | string? | Boolean expression — binding only applies when true |
| note | string? | Documentation for the binding author |
Cross-form sync
A composite form's binding can route one schema key to multiple child forms. The debtor's name entered once appears on every form that needs it.
{
  "source": "debtor1.first_name",
  "targets": [
    "$b101.Debtor1.First name",
    "$b106ab.Debtor 1 First name",
    "$b106d.Debtor 1 Name",
    "$b106ef.Debtor 1 Name",
    "$b107.Debtor 1 Name"
  ]
}
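To make the routing concrete, here is a minimal sketch of how one binding could fan a schema value out to its targets. It assumes the source is a plain schema key (no expression evaluation) and that a target address splits at the first dot, since PDF field names such as "Debtor1.First name" contain dots themselves. `parseTarget` and `routeBinding` are hypothetical names, not the engine's API.

```typescript
interface Binding {
  source: string;    // in this sketch: a plain schema key
  targets: string[]; // "$child.field" addresses
}

interface FieldAddress {
  child: string; // child alias; empty string means "$.field" (self)
  field: string; // PDF field name, which may contain dots and spaces
}

// Split at the FIRST dot only, so "$b101.Debtor1.First name"
// becomes child "b101" and field "Debtor1.First name".
function parseTarget(target: string): FieldAddress {
  const dot = target.indexOf(".");
  if (!target.startsWith("$") || dot < 1) {
    throw new Error(`Not a form field address: ${target}`);
  }
  return { child: target.slice(1, dot), field: target.slice(dot + 1) };
}

// Returns child alias -> PDF field name -> value.
function routeBinding(
  binding: Binding,
  schemaValues: Record<string, string>,
): Record<string, Record<string, string>> {
  const routed: Record<string, Record<string, string>> = {};
  const value = schemaValues[binding.source];
  if (value === undefined) return routed; // nothing entered yet
  for (const target of binding.targets) {
    const { child, field } = parseTarget(target);
    const fields = routed[child] ?? {};
    fields[field] = value;
    routed[child] = fields;
  }
  return routed;
}

const nameBinding: Binding = {
  source: "debtor1.first_name",
  targets: [
    "$b101.Debtor1.First name",
    "$b106ab.Debtor 1 First name",
    "$b106d.Debtor 1 Name",
    "$b106ef.Debtor 1 Name",
    "$b107.Debtor 1 Name",
  ],
};

const routedName = routeBinding(nameBinding, { "debtor1.first_name": "Nickie" });
```

One entered value ends up under five child aliases, each keyed by that form's own field name.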
Validation on Form
| Property | Type | Description |
|---|---|---|
| expression | string | Boolean expression that should evaluate to true |
| severity | error \| warning | Errors block export, warnings inform |
| description | string | Human message shown when validation fails |
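A short sketch of how the severity rule might play out at export time. The `passes` callback is a hypothetical stand-in for the expression engine, and the two sample rules are illustrative rather than taken from the real bankruptcy forms.

```typescript
interface Validation {
  expression: string;
  severity: "error" | "warning";
  description: string;
}

interface ValidationReport {
  errors: string[];
  warnings: string[];
  canExport: boolean;
}

// Collect the messages of failed rules; only errors block export.
function runValidations(
  rules: Validation[],
  passes: (expression: string) => boolean,
): ValidationReport {
  const errors: string[] = [];
  const warnings: string[] = [];
  for (const rule of rules) {
    if (passes(rule.expression)) continue;
    (rule.severity === "error" ? errors : warnings).push(rule.description);
  }
  return { errors, warnings, canExport: errors.length === 0 };
}

// Illustrative rules; the stub evaluator below passes only the first one.
const sampleRules: Validation[] = [
  { expression: "LEN(debtor1.ssn) == 9", severity: "error", description: "SSN must have 9 digits" },
  { expression: "COUNT(creditors.secured) > 0", severity: "warning", description: "No secured creditors listed" },
];

const sampleReport = runValidations(sampleRules, (expr) => expr.startsWith("LEN"));
```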
DataSource
A reusable recipe that turns external data into schema key values. Combines parse rules (how to read the source format) with field mapping (how to transform into schema keys). Scoped to a schema, not to individual forms.
DataSource entity
| Property | Type | Description |
|---|---|---|
| id | string | Unique identifier |
| name | string | Human-readable name — "Credit Report (Bankruptcy)" |
| type | api \| document \| import \| manual | How data is acquired |
| schemaId | string | Which schema this DataSource maps to |
| config | object | Parse rules, field mapping, execution config (API auth, XML paths, LLM prompts) |
Inbound vs Outbound
Author: integration developer or platform
Contains: parse rules, field mapping, execution config
Lifecycle: written once, used by every case
Example:
cms.contact.first_name → debtor1.first_name
Author: form builder (platform)
Contains: source key, target field(s), condition
Lifecycle: defined on a form, static
Example:
debtor1.first_name → $b101.Debtor1.First name
The schema is the clean boundary. DataSources don't know about PDF fields. Bindings don't know about APIs. Both use the same key vocabulary — build the key picker once, use it everywhere.
Input Types
| Type | Parse Step | Examples |
|---|---|---|
| api | Deterministic parser (XML→JSON, JSON→JSON) | Credit report (MISMO XML), Clio, PACER |
| document | LLM extraction, then optionally saved as reusable template | Pay stub, tax return, bank statement |
| import | Column/field mapping | CSV, JSON file upload |
| manual | None — direct user input | Schema form entry |
Data Flow
DataSource Structure
{
  "id": "credit-report-bankruptcy",
  "name": "Credit Report (Bankruptcy)",
  "type": "api",
  "schemaId": "bankruptcy.individual",
  "singleton": { // one-to-one mappings
    "debtor1.full_name": "CONCAT(borrowers[0].firstName, ' ', borrowers[0].lastName)",
    "debtor1.ssn": "borrowers[0].ssn"
  },
  "collections": { // one-to-many mappings
    "creditors": {
      "source": "liabilities",
      "filter": "AND(IN(status, 'Open'), balance > 0)",
      "classify": {
        "secured": { "rule": "IN(loanType, 'Secured', 'Mortgage')", "targetPrefix": "creditors.secured[]" },
        "unsecured": { "rule": "DEFAULT", "targetPrefix": "creditors.unsecured[]" }
      },
      "fields": {
        "name": "creditorName",
        "account_number": "RIGHT(accountNumber, 4)",
        "claim_amount": "balance"
      }
    }
  }
}
For document DataSources, the first upload uses an LLM to extract fields. The user reviews and confirms the mapping. The extraction rules are then saved as a reusable template: the next time the same document type is uploaded, it skips the LLM and uses the saved template directly.
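The collections block in the DataSource structure above describes a filter, classify, then map sequence. The sketch below walks through that sequence with plain TypeScript predicates in place of expression strings. The `Liability` shape and `mapCreditors` are hypothetical, the sample rows are illustrative, and the output keys follow the creditors.* naming used in the schema examples.

```typescript
// Assumed shape of one parsed credit-report liability.
interface Liability {
  creditorName: string;
  accountNumber: string;
  loanType: string;
  status: string;
  balance: number;
}

type EntryValues = Record<string, string | number>;

function mapCreditors(liabilities: Liability[]): EntryValues {
  const values: EntryValues = {};
  const nextIndex = { secured: 0, unsecured: 0 };
  for (const item of liabilities) {
    // filter: AND(IN(status, 'Open'), balance > 0)
    if (item.status !== "Open" || item.balance <= 0) continue;
    // classify: secured if the loan type matches, otherwise the DEFAULT bucket
    const bucket =
      item.loanType === "Secured" || item.loanType === "Mortgage" ? "secured" : "unsecured";
    const prefix = `creditors.${bucket}[${nextIndex[bucket]}]`;
    nextIndex[bucket] += 1;
    // fields: name, account_number = RIGHT(accountNumber, 4), claim_amount
    values[`${prefix}.name`] = item.creditorName;
    values[`${prefix}.account_number`] = item.accountNumber.slice(-4);
    values[`${prefix}.claim_amount`] = item.balance;
  }
  return values;
}

// Illustrative rows, not real report data.
const sampleLiabilities: Liability[] = [
  { creditorName: "First Home Bank", accountNumber: "9900112233", loanType: "Mortgage", status: "Open", balance: 180000 },
  { creditorName: "AMEX", accountNumber: "371449635398431", loanType: "CreditCard", status: "Open", balance: 12450 },
  { creditorName: "Old Auto Loan", accountNumber: "5550001", loanType: "Secured", status: "Closed", balance: 0 },
];

const creditorValues = mapCreditors(sampleLiabilities);
```

The resulting map is what would go into the `values` of a single Entry with source "credit-report".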
File & Entry
Everything that produces data on a file — typing, credit reports, API syncs — produces the same thing: an Entry. One event = one entry containing a batch of data point changes. A file is a timeline of entries.
File entity
| Property | Type | Description |
|---|---|---|
| id | string | UUID |
| tenantId | string | Owning tenant |
| formId | string | Top-level form this file is an instance of |
| formVersion | number | Frozen snapshot version |
| name | string | Case name — "Smith Ch7 Filing" |
| status | string | draft \| in-review \| ready \| filed \| closed |
Entry (batch of values)
| Property | Type | Description |
|---|---|---|
| source | string | What produced this entry — "manual", "credit-report", "case-mgmt" |
| timestamp | string | When this entry was created |
| confirmed | boolean | False = pending review (e.g. credit report auto-import) |
| values | Record<string, any> | Map of schema key = value pairs |
creditors.secured[0].claim_amount = 180000
creditors.unsecured[0].name = "AMEX"
creditors.unsecured[0].claim_amount = 12450
...76 more values
File timeline
Nickie Green's Chapter 7 case. Three entries: lawyer types basics, credit report is pulled, lawyer corrects a name.
Three entries, not 86. The credit report is one entry with 80 values (pending confirmation). The lawyer's correction is a separate entry that overrides one value from the credit report. Current state = merge all entries by priority and timestamp, latest wins per key. PDF generation = snapshot of the current state.
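The merge rule can be sketched as a fold over the timeline. This is a simplified illustration: it orders entries by timestamp only (the priority rule is not modeled), it includes unconfirmed entries, and the timestamps and values are made up for the example. `currentState` is a hypothetical name.

```typescript
type EntryValue = string | number | boolean | null;

interface Entry {
  source: string;
  timestamp: string; // ISO 8601 assumed
  confirmed: boolean;
  values: Record<string, EntryValue>;
}

// Apply entries oldest-first so a later entry overwrites earlier values per key.
// The entries themselves are never mutated, so the history stays intact.
function currentState(entries: Entry[]): Record<string, EntryValue> {
  const ordered = [...entries].sort(
    (a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp),
  );
  const merged: Record<string, EntryValue> = {};
  for (const entry of ordered) {
    Object.assign(merged, entry.values);
  }
  return merged;
}

// Illustrative three-entry timeline: manual basics, credit report, correction.
const sampleTimeline: Entry[] = [
  {
    source: "manual",
    timestamp: "2025-01-10T09:00:00Z",
    confirmed: true,
    values: { "debtor1.first_name": "Nickie", "debtor1.last_name": "Green" },
  },
  {
    source: "credit-report",
    timestamp: "2025-01-10T09:05:00Z",
    confirmed: false,
    values: {
      "creditors.unsecured[0].name": "AMEX",
      "creditors.unsecured[0].claim_amount": 12450,
    },
  },
  {
    source: "manual",
    timestamp: "2025-01-10T09:20:00Z",
    confirmed: true,
    values: { "creditors.unsecured[0].name": "American Express" },
  },
];

const mergedState = currentState(sampleTimeline);
```

The correction wins for the creditor name, while the credit report's claim amount is untouched because no later entry sets that key.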
Composition Graph
A composite form contains children — other forms (leaf or composite). Each child has a key (alias used in expressions) and a position (print order). Children can have a condition that controls inclusion.
Bindings only flow downward. A composite form can reference any descendant form's fields via $key.field. Never up, never sideways. Cross-tree bindings live on the common ancestor.
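As an illustration of the composition graph, the sketch below walks a form tree depth-first and lists leaf PDFs in print order, skipping children whose condition does not hold. `lookupForm` and `holds` are hypothetical hooks (loading a form by id, evaluating a condition expression), and the form ids, aliases, and file paths are invented for the example.

```typescript
interface ChildEntry {
  id: string;
  key: string;
  position: number;
  condition?: string;
}

interface FormNode {
  id: string;
  name: string;
  filePath: string | null; // null for pure composites
  children: ChildEntry[];
}

function printOrder(
  root: FormNode,
  lookupForm: (id: string) => FormNode,
  holds: (condition: string) => boolean,
): string[] {
  const pages: string[] = [];
  // Assumption: a form that is both leaf and composite prints its own PDF first.
  if (root.filePath !== null) pages.push(root.name);
  const ordered = [...root.children].sort((a, b) => a.position - b.position);
  for (const child of ordered) {
    if (child.condition !== undefined && !holds(child.condition)) continue;
    pages.push(...printOrder(lookupForm(child.id), lookupForm, holds));
  }
  return pages;
}

const formIndex = new Map<string, FormNode>([
  ["pkg", { id: "pkg", name: "Ch.7 Individual", filePath: null, children: [
    { id: "prep", key: "prep", position: 2, condition: "case.has_preparer == true" },
    { id: "pet", key: "petition", position: 1 },
  ] }],
  ["pet", { id: "pet", name: "Petition", filePath: null, children: [
    { id: "f101", key: "b101", position: 1 },
  ] }],
  ["f101", { id: "f101", name: "Form 101", filePath: "forms/101.pdf", children: [] }],
  ["prep", { id: "prep", name: "Preparer form", filePath: "forms/preparer.pdf", children: [] }],
]);

function lookupSample(id: string): FormNode {
  const found = formIndex.get(id);
  if (found === undefined) throw new Error(`Unknown form: ${id}`);
  return found;
}

const withPreparer = printOrder(lookupSample("pkg"), lookupSample, () => true);
const withoutPreparer = printOrder(lookupSample("pkg"), lookupSample, () => false);
```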
Expression Syntax
All source, condition, and computed expression fields are parsed by the expression engine in packages/core/src/expressions/.
References
| Pattern | Meaning | Example |
|---|---|---|
| $child.field | A child form's field | $b106ab.line55 |
| $.field | Current form's own field (self-ref) | $.line55 |
| group.key | Schema key (dotted notation) | debtor.first_name |
| '...' | String literal | 'installments' |
| bare word | Literal or function name | true, 42, CONCAT |
Disambiguation Rule
Starts with $ → form field reference
Has a . but no $ → schema key
No . and no $ → literal value or function name
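The three rules translate almost directly into code. This sketch assumes the tokenizer has already peeled off quoted string literals and numeric literals such as 12.5, so only bare tokens reach it; `classifyToken` is a hypothetical name.

```typescript
type TokenKind = "formField" | "schemaKey" | "literalOrFunction";

// Order matters: the $ check must come first, because form field
// references also contain a dot.
function classifyToken(token: string): TokenKind {
  if (token.startsWith("$")) return "formField";
  if (token.includes(".")) return "schemaKey";
  return "literalOrFunction";
}
```

For example, `$b106ab.line55` and `$.line55` classify as form fields, `debtor.first_name` as a schema key, and `CONCAT`, `true`, and `42` as literals or function names.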
Operators
+ - * /              // arithmetic (+ also concatenates strings)
== != > < >= <=      // comparison
Functions
23 built-in functions following Excel/Google Sheets naming conventions. Used in bindings, DataSource mappings, validations, and computed fields.
| Category | Function | Example |
|---|---|---|
| String | CONCAT(a, b, ...) | CONCAT(first, ' ', last) |
| String | UPPER(s) | UPPER(name) → "JOHN" |
| String | LOWER(s) | LOWER(name) → "john" |
| String | TRIM(s) | TRIM(name) → strip whitespace |
| String | LEFT(s, n) | LEFT(ssn, 3) → first 3 chars |
| String | RIGHT(s, n) | RIGHT(acct, 4) → last 4 chars |
| String | LEN(s) | LEN(name) → string length |
| String | SUBSTITUTE(s, old, new) | SUBSTITUTE(phone, "-", "") |
| Math | SUM(a, b, ...) | SUM(line1, line2) — skips nulls |
| Math | MAX(a, b) | MAX(income, 0) |
| Math | MIN(a, b) | MIN(balance, limit) |
| Math | COUNT(arr) | COUNT(creditors) → array length |
| Math | ROUND(n, decimals?) | ROUND(amount, 2) → 1234.57 |
| Math | ABS(n) | ABS(income - expenses) |
| Logical | IF(cond, then, else) | IF(joint, d2_name, "") |
| Logical | AND(a, b, ...) | AND(employed, income > 0) |
| Logical | OR(a, b, ...) | OR(has_car, has_house) |
| Logical | NOT(a) | NOT(exempt) |
| Logical | IN(val, a, b, ...) | IN(type, "Secured", "Mortgage") |
| Date | TODAY() | TODAY() → MM/DD/YYYY |
| Date | YEAR(d) | YEAR(opened) → 2020 |
| Date | MONTH(d) | MONTH(opened) → 5 |
| Date | DAY(d) | DAY(filed) → 23 |
Date functions accept YYYY-MM-DD, MM/DD/YYYY, and YYYY-MM.
All functions return null for null input (null propagation).
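Null propagation can be expressed as a wrapper applied to each built-in. The sketch below covers a handful of the functions from the table. `nullPropagating` and `builtins` are hypothetical names, and SUM is left unwrapped because the table says it skips nulls rather than returning null.

```typescript
type Arg = string | number | boolean | null;
type Fn = (...args: Arg[]) => Arg;

// Any null argument short-circuits the call to null.
function nullPropagating(fn: Fn): Fn {
  return (...args) => (args.some((a) => a === null) ? null : fn(...args));
}

// SUM is the documented exception: nulls are skipped, not propagated.
const sumSkippingNulls: Fn = (...args) =>
  args.reduce<number>((acc, a) => (typeof a === "number" ? acc + a : acc), 0);

const builtins = {
  UPPER: nullPropagating((s) => String(s).toUpperCase()),
  LEFT: nullPropagating((s, n) => String(s).slice(0, Math.max(0, Number(n)))),
  RIGHT: nullPropagating((s, n) => {
    const count = Number(n);
    // slice(-0) would return the whole string, so handle 0 explicitly.
    return count <= 0 ? "" : String(s).slice(-count);
  }),
  IN: nullPropagating((val, ...choices) => choices.includes(val)),
  SUM: sumSkippingNulls,
};
```

With this wrapper, `builtins.RIGHT("123456789", 4)` yields the last four characters, `builtins.RIGHT(null, 4)` yields null, and `builtins.SUM(10, null, 5)` adds only the two numbers.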
AST
Expressions are parsed into an AST with 5 node types:
literal   → { value: "hello" | 42 | true }
reference → { key: "debtor.first_name" }             // schema key
formRef   → { child: "b106ab", field: "line55" }     // $child.field
binary    → { op: "+", left: ..., right: ... }
call      → { name: "CONCAT", args: [...] }
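The five node shapes map naturally onto a discriminated union with a recursive evaluator. This is a sketch under assumptions: the `type` tag is an added discriminator, only a few operators and the CONCAT function are covered, and `Scope` and `evaluate` are hypothetical names rather than the contents of packages/core/src/expressions/.

```typescript
type Scalar = string | number | boolean | null;

type Expr =
  | { type: "literal"; value: Scalar }
  | { type: "reference"; key: string }
  | { type: "formRef"; child: string; field: string }
  | { type: "binary"; op: "+" | "-" | "*" | "/" | "=="; left: Expr; right: Expr }
  | { type: "call"; name: string; args: Expr[] };

interface Scope {
  schema: Record<string, Scalar>;                // schema key -> value
  forms: Record<string, Record<string, Scalar>>; // child alias -> field -> value
}

function evaluate(node: Expr, scope: Scope): Scalar {
  switch (node.type) {
    case "literal":
      return node.value;
    case "reference":
      return scope.schema[node.key] ?? null;
    case "formRef":
      return scope.forms[node.child]?.[node.field] ?? null;
    case "binary": {
      const l = evaluate(node.left, scope);
      const r = evaluate(node.right, scope);
      if (node.op === "==") return l === r;
      if (l === null || r === null) return null; // null propagation
      if (node.op === "+") {
        // + adds two numbers and concatenates anything else
        return typeof l === "number" && typeof r === "number" ? l + r : String(l) + String(r);
      }
      if (typeof l !== "number" || typeof r !== "number") return null;
      return node.op === "-" ? l - r : node.op === "*" ? l * r : l / r;
    }
    case "call": {
      const args = node.args.map((a) => evaluate(a, scope));
      if (node.name === "CONCAT") {
        return args.some((a) => a === null) ? null : args.join("");
      }
      throw new Error(`Function not covered by this sketch: ${node.name}`);
    }
  }
}

// $b106ab.line55 + $b106ef.total, with illustrative field values.
const totalExpr: Expr = {
  type: "binary",
  op: "+",
  left: { type: "formRef", child: "b106ab", field: "line55" },
  right: { type: "formRef", child: "b106ef", field: "total" },
};

const sampleScope: Scope = {
  schema: { "debtor.first_name": "Nickie" },
  forms: { b106ab: { line55: 1000 }, b106ef: { total: 250 } },
};

const totalValue = evaluate(totalExpr, sampleScope);
```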
Scoping Rules
Not every reference type is available in every context. This matrix defines what's legal where.
| Context | $child.field | $.field | schema key |
|---|---|---|---|
| Composite form binding source | ✓ | — | ✓ |
| Composite form binding target | ✓ | — | — |
| Composite form condition | ✓ | — | ✓ |
| Child entry condition | ✓ (siblings) | — | ✓ |
| Leaf form computed field | — | ✓ | ✓ |
| Leaf form validation | — | ✓ | ✓ |
A binding on a composite form can reference any form that is a descendant (child, grandchild, etc.) of that form. Never up, never sideways — only down. Cross-tree bindings (e.g. means test results → schedules summary) must live on the common ancestor form.
ID Convention
IDs are deterministic UUIDs that encode the entity type, domain, and a hex-leet name. Readable at a glance.
Type Prefixes
| Prefix | Meaning | Example |
|---|---|---|
| F | Form (leaf or composite) | F0001010-BA01-... — Form 101 |
| S | Schema | S0000001-BA01-... — bankruptcy.individual |
Older IDs may use B (master blueprint) and BB (nested blueprint). These map to composite forms under the merged model.
Group 5: Hex Name (12 chars, read right to left)
NNNNNNNNCCCT
│       │  └── T: filer type (last char)
│       └───── CCC: chapter hint (0C7=Ch7, CD3=Ch13, 000=shared)
└───────────── NNNNNNNN: hex-leet name (zero-padded)
| Filer Type (T) | Meaning |
|---|---|
| 1 | Individual |
| 2 | Non-individual (corporation, LLC, partnership) |
| 0 | Form (no filer type) or shared |
Examples
F0000001-BA01-4000-8000-C40700000001 ← Ch7 Individual (top-level)
F0000005-BA01-4000-8000-C40700000002 ← Ch7 Non-Individual (top-level)
F0000101-BA01-4000-8000-BE7171090001 ← Petition (composite, Individual)
F0000206-BA01-4000-8000-5C4ED0000002 ← Schedules (composite, Non-Individual)
F0001010-BA01-4000-8000-000000000000 ← Form 101 (leaf, no filer suffix)
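Reading an ID back into its parts is a matter of slicing the last group by the NNNNNNNNCCCT layout. The sketch below does only that; it assumes five dash-separated groups with a 12-character final group, and `parseDossierId` is a hypothetical helper.

```typescript
interface ParsedId {
  prefix: string;      // first character: "F" = form, "S" = schema
  hexName: string;     // NNNNNNNN
  chapterHint: string; // CCC
  filerType: string;   // T: "1" individual, "2" non-individual, "0" none/shared
}

function parseDossierId(id: string): ParsedId {
  const groups = id.split("-");
  const lastGroup = groups[4] ?? "";
  if (groups.length !== 5 || lastGroup.length !== 12) {
    throw new Error(`Unexpected ID shape: ${id}`);
  }
  return {
    prefix: id.charAt(0),
    hexName: lastGroup.slice(0, 8),
    chapterHint: lastGroup.slice(8, 11),
    filerType: lastGroup.slice(11),
  };
}

const schedulesId = parseDossierId("F0000206-BA01-4000-8000-5C4ED0000002");
```

For the Schedules composite above, the final character is 2, which the filer type table maps to a non-individual filer.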
Database Tables
PostgreSQL 17 with Drizzle ORM. Core tables store forms, files, and entries.
| Table | Purpose | Key Columns |
|---|---|---|
| forms | All forms (leaf + composite) | id, tenant_id, name, number, schema_id, file_path, fields, children, bindings, validations |
| schemas | Domain schemas | id, name, domain, entries (jsonb) |
| cases | Client matters (top-level) | id, tenant_id, schema_id, form_id, name, status, references (jsonb), dates (jsonb) |
| files | Filed documents within a case | id, case_id, form_id, form_version, name, status, filed_at, snapshot (jsonb) |
| entries | Batches of values per case | case_id, source, timestamp, confirmed, values (jsonb) |
| data_sources | Reusable import recipes | id, name, type, schema_id, config (jsonb) |
fields, bindings, validations, children, and entries are stored as JSONB. This keeps the schema flat — no join tables for these. The structured types (FormField[], Binding[], etc.) live in code.
Example Flows
Every path through the system — manual or automated — follows the same pattern: something produces entries on a file, bindings route them to PDF fields.
insurance.auto.claim). The DataSource is scoped to a schema, not to forms.