Form Pipeline — From PDF to Live Engine
How a court PDF becomes a live, fillable, schema-bound form. The pipeline is the same for every domain (bankruptcy, immigration, …) — only the schema and the forms change.
The end-state of a processed form directory is a stack of small JSON files sitting next to the working PDF. Each file owns one slice of the form's identity (mechanical shape, semantic meaning, schema routing, filing rules). Together they let the runtime render an HTML data-entry UI, route values across many forms, and produce a filled PDF.
forms/b101/
  b101_2024-06-22.pdf              ← working PDF (AcroForm-injected if needed)
  b101_2024-06-22_original.pdf     ← preserved court source (flat / XFA / docx-derived)
  b101_2024-06-22.meta.json        ← form identity (number, name, pages, source kind, schema)
  b101_2024-06-22.fields.json      ← extracted AcroFields (key, type, rect, format, flags, maxLen…)
  b101_2024-06-22.acro.json        ← injection spec (only for flat / XFA / docx)
  b101_2024-06-22.xfa.json         ← XFA template metadata (XFA only)
  b101_2024-06-22.docx.json        ← Word content-control metadata (DOCX only)
  b101_2024-06-22.knowledge.json   ← semantic mapping: each field → meaning + schema domain
  b101_2024-06-22.bindings.json    ← schema path → field targets (generated)
  b101_2024-06-22.validations.json ← expression-based filing rules
Four Source Types
Every PDF the operator processes falls into one of four buckets. The bucket determines what runs first and which side files exist; everything downstream of fields.json is identical.
| Source | What it is | How we make it fillable |
|---|---|---|
| native | PDF already has AcroForm fields embedded by the form author | Nothing — extract fields directly |
| xfa | Adobe XFA form (dynamic XML template, no AcroForm widgets) | Read the XFA template, auto-generate .acro.json, inject AcroFields, re-extract |
| flat | Static PDF with printed blanks / checkboxes / placeholders but no fields | Operator authors .acro.json from a visual read, inject, re-extract |
| docx | Court-issued Microsoft Word document with content controls | LibreOffice converts to PDF, content-control metadata exported to .docx.json, operator authors .acro.json, inject, re-extract |
extract-fields.mjs detects the source kind, writes meta.json with source: "native" | "xfa" | "flat" | "docx", and renames the original to _original.pdf for non-native sources. The court file is never modified — every working PDF is derived from _original.pdf via injection, so re-injecting from a fixed .acro.json is always possible.
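The bucket decision can be sketched as a small pure function. This is an illustration only: the `SourceProbe` shape, `detectSourceKind`, and `originalPdfName` are hypothetical names, and the rule that a PDF carrying both /XFA and AcroForm fields counts as native is an assumption, not something the extractor is confirmed to do.

```typescript
type SourceKind = "native" | "xfa" | "flat" | "docx";

// Hypothetical summary of what a first look at the input file found.
interface SourceProbe {
  extension: string;      // e.g. ".pdf" or ".docx"
  acroFieldCount: number; // AcroForm fields already present in the PDF
  hasXfa: boolean;        // an /XFA entry is present
}

function detectSourceKind(probe: SourceProbe): SourceKind {
  if (probe.extension.toLowerCase() === ".docx") return "docx";
  // XFA bucket: template present, but no AcroForm fields to extract.
  if (probe.hasXfa && probe.acroFieldCount === 0) return "xfa";
  if (probe.acroFieldCount > 0) return "native";
  return "flat";
}

// Non-native sources keep the court file under the _original.pdf suffix.
function originalPdfName(workingPdf: string): string {
  return workingPdf.replace(/\.pdf$/i, "_original.pdf");
}
```

Keeping the decision free of I/O means it can be checked against a handful of probe fixtures without opening a PDF.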
What Each JSON Owns
Each file has one job. The boundaries are deliberate — they let us regenerate downstream artifacts (bindings, validations, filled PDFs) without re-reading the PDF, and they keep court-provided text (helptext, tooltips) separate from operator-authored interpretation (hints, schema mappings).
.meta.json — Form identity
Mechanical, generated by extract-fields.mjs, completed by the operator after the visual read.
{
  "number": "101",
  "name": "Voluntary Petition for Individuals Filing for Bankruptcy",
  "effectiveDate": "2024-06-22",
  "pages": 9,
  "source": "native",
  "schema": "bankruptcy"
}
schema declares which schema's vocabulary the form's bindings/validations reference, making each form folder self-contained. name is set by the operator from the PDF title (the extractor leaves it null).
.fields.json — PDF-native field shape
The source of truth for AcroField keys, positions, and PDF-native metadata. Generated by extract-fields.mjs after AcroFields exist (immediately for native, after injection for the other three sources).
{
  "fields": [
    {
      "key": "Debtor1.First name",
      "type": "text",
      "page": 1,
      "rect": { "x": 172, "y": 407, "w": 195, "h": 13 },
      "options": null,
      "flags": { "required": true },
      "maxLen": 60,
      "format": null,
      "align": "left",
      "defaultValue": null,
      "altName": "Debtor 1 first name (required)"
    },
    {
      "key": "Check Box5",
      "type": "checkbox",
      "page": 1,
      "rect": { "x": 379, "y": 700, "w": 12, "h": 11 },
      "options": ["Presumption", "No Presumption"],
      "widgets": [
        { "onValue": "Presumption", "page": 1, "rect": { "x": 379, "y": 700, "w": 12, "h": 11 } },
        { "onValue": "No Presumption", "page": 1, "rect": { "x": 379, "y": 688, "w": 12, "h": 11 } }
      ]
    }
  ]
}
Field shape (also FormField in @dossier/core):
| Key | Source on the PDF | Meaning |
|---|---|---|
| key | /T field name (or your .acro.json key for injected fields) | Join key with knowledge.json |
| type | text / checkbox / dropdown / radio / signature / optionList | Renderer + filler discriminator |
| page, rect | Widget annotation page + /Rect | Used for live PDF preview overlay |
| options, widgets | Choice /Opt + per-widget on-values | Single field, multiple visual checkboxes |
| flags | /Ff bits | readOnly, required, multiline (text bit 13), comb (text bit 25), combo (choice bit 18), edit (choice bit 19), multiSelect (choice bit 22) |
| maxLen | /MaxLen | Per-character cap for text fields |
| format | /AA/F JavaScript (AFDate_FormatEx, AFNumber_Format, AFPercent_Format, AFSpecial_Format, AFTime_Format) or XFA <picture> | See Field formats below |
| align | /Q (0/1/2) | left / center / right |
| defaultValue | /DV | Pre-filled value to render when empty |
| altName | /TU | Tooltip / accessibility label |
Inheritable keys (/Ff, /MaxLen, /Q, /DV) are walked up the /Parent chain so kid widgets inherit field-level metadata.
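A minimal sketch of that /Parent walk, assuming a simplified in-memory view of a field dictionary. `FieldNode` and `resolveInherited` are hypothetical names for illustration; the extractor's actual data structures are not shown in this document.

```typescript
// Hypothetical, simplified view of a PDF field dictionary.
interface FieldNode {
  entries: Record<string, unknown>; // e.g. { "/Ff": 2, "/MaxLen": 60 }
  parent?: FieldNode;               // the /Parent link, if any
}

const INHERITABLE = ["/Ff", "/MaxLen", "/Q", "/DV"] as const;
type InheritableKey = (typeof INHERITABLE)[number];

// Return the nearest value for `key`, starting at the widget and walking up.
function resolveInherited(node: FieldNode, key: InheritableKey): unknown {
  const seen = new Set<FieldNode>(); // guards against malformed cyclic /Parent chains
  let current: FieldNode | undefined = node;
  while (current && !seen.has(current)) {
    if (key in current.entries) return current.entries[key];
    seen.add(current);
    current = current.parent;
  }
  return undefined; // not set anywhere on the chain
}
```

A kid widget that sets its own /Q keeps it, while a /MaxLen defined only on the parent field is still picked up.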
.acro.json — Injection spec (flat / XFA / docx only)
Tells inject-fields.mjs what AcroFields to stamp onto _original.pdf. For XFA it's auto-generated from the XFA template's positions; for flat and docx the operator authors it from a visual read.
{
  "fields": [
    { "key": "debtor_name", "type": "text", "page": 1, "rect": { "x": 100, "y": 500, "w": 200, "h": 18 } },
    { "key": "executed_date", "type": "text", "page": 1, "rect": { "x": 118, "y": 145, "w": 115, "h": 14 },
      "format": { "kind": "date", "pattern": "mm/dd/yyyy" } },
    { "key": "debtor_ssn", "type": "text", "page": 1, "rect": { "x": 200, "y": 600, "w": 110, "h": 14 },
      "format": { "kind": "ssn" }, "flags": { "comb": true, "required": true }, "maxLen": 9 },
    { "key": "claim_amount", "type": "text", "page": 1, "rect": { "x": 410, "y": 320, "w": 100, "h": 14 },
      "format": { "kind": "number", "decimals": 2, "currency": "$", "prependCurrency": true } },
    { "key": "explanation", "type": "text", "page": 1, "rect": { "x": 80, "y": 200, "w": 460, "h": 60 },
      "flags": { "multiline": true } },
    { "key": "is_amended", "type": "checkbox", "page": 1, "rect": { "x": 450, "y": 700, "w": 12, "h": 12 } }
  ]
}
.acro.json is the only place the operator writes PDF-native shape directly — for native PDFs the original AcroForm authors did this work for us, and extract-fields.mjs reads it into .fields.json. The two files carry the same vocabulary so anything you can author in .acro.json (every entry in the table above) survives extraction unchanged.
Field formats — the recent .acro.json work
The format object is the PDF-native value-format declaration. It comes from Acrobat's /AA/F JavaScript on native PDFs, from XFA <picture> on XFA, and from operator authoring on flat/docx. Eight kinds are supported, all decoded into the same FieldFormat shape:
| format.kind | Source on native PDFs | Visual cue on flat PDFs | Notes |
|---|---|---|---|
| date | AFDate_FormatEx("mm/dd/yyyy") / AFDate_Format(N) | "Date: //____" or "MM / DD / YYYY" mask | pattern is a literal mask: mm/dd/yyyy, m/d/yyyy, yyyy-mm-dd, mm-dd-yyyy, mmm d, yyyy |
| time | AFTime_Format(N) | "HH:MM AM/PM" label | sepStyle 0..3 controls 12/24h preset |
| number | AFNumber_Format(decimals, sepStyle, negStyle, _, currency, prepend) | "$____." or Amount column with comb boxes | sepStyle 0..3 (1,234.56 / 1234.56 / 1.234,56 / 1234,56), negStyle 0..3 (Minus / Red / Parens / ParensRed), currency symbol, prependCurrency true/false |
| percent | AFPercent_Format(decimals, sepStyle) | "%" symbol or "Rate" column | Same sepStyle table; suffix % |
| ssn | AFSpecial_Format(3) | "Social Security Number: --___" | Stored as digits-only, displayed 999-99-9999 |
| phone | AFSpecial_Format(2) | "(__) -" or labeled "Phone" with mask | Stored as digits-only, displayed (999) 999-9999 |
| zip | AFSpecial_Format(0) | "ZIP: _____" | 5 digits |
| zip4 | AFSpecial_Format(1) | "_-" | 5+4, displayed 99999-9999 |
This table is the contract between three layers:
- Extraction / authoring: extract-fields.mjs parses /AA/F JavaScript into a format object; flat-PDF authors hand-write the same object into .acro.json.
- Data entry UI: mapFieldType (packages/client/src/pages/case-tabs/data-derive.ts) maps the kind to a typed input. MaskedInput renders SSN/phone/zip/zip4 with caret-preserving mask handling and stores digits-only. Date / number / percent fall through to typed inputs and store ISO / numeric values.
- PDF filler: formatValueForField (packages/core/src/pdf/format-value.ts) is called before every setText in pdf-filler.ts. It re-applies the format on render so server-rendered or flattened PDFs match what Acrobat would have produced at runtime via /AA/F.
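To make the "stored as digits-only, displayed with a mask" half of the contract concrete, here is a minimal sketch for the four special kinds. It is not the body of formatValueForField; `MASKS` and `applyMask` are hypothetical names, and returning incomplete values unchanged is an assumption made for the example.

```typescript
type MaskKind = "ssn" | "phone" | "zip" | "zip4";

// Display masks taken from the table above; each 9 stands for one digit.
const MASKS: Record<MaskKind, string> = {
  ssn: "999-99-9999",
  phone: "(999) 999-9999",
  zip: "99999",
  zip4: "99999-9999",
};

function applyMask(kind: MaskKind, stored: string): string {
  const digits = stored.replace(/\D/g, ""); // tolerate already-formatted input
  const mask = MASKS[kind];
  const needed = mask.split("").filter((c) => c === "9").length;
  // Assumption: leave incomplete values untouched rather than emit a half-filled mask.
  if (digits.length !== needed) return stored;
  let i = 0;
  return mask.replace(/9/g, () => digits.charAt(i++));
}
```

For example, `applyMask("ssn", "123456789")` yields `123-45-6789`, while a partial value such as `"1234"` is returned as stored.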
flags is the matching contract for boolean PDF behavior:
| flags.* | PDF source | Used by |
|---|---|---|
| readOnly | /Ff bit 1 | UI disables the input |
| required | /Ff bit 2 | Renderer shows the asterisk; generate-bindings.mjs synthesizes a required-field validation per bound schema path |
| multiline | /Ff bit 13 (text only) | Renderer picks <textarea> |
| comb | /Ff bit 25 (text only) | Renderer renders per-character boxes for the maxLen count |
| combo / edit / multiSelect | /Ff bits 18 / 19 / 22 (choice only) | Renderer picks dropdown vs list, allows free-text vs strict, single vs multi |
maxLen, defaultValue, and altName round out the metadata the renderer surfaces (character cap, pre-filled value, tooltip).
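The bit positions in the flags table translate directly into masks. The sketch below assumes the raw /Ff integer and a coarse field kind are already known; `decodeFlags` and `FieldKind` are hypothetical names rather than the extractor's own.

```typescript
type FieldKind = "text" | "choice" | "other";

// The PDF specification numbers flag bits from 1, so bit n is mask 1 << (n - 1).
const bit = (n: number): number => 1 << (n - 1);

function decodeFlags(ff: number, kind: FieldKind): Record<string, boolean> {
  const flags: Record<string, boolean> = {};
  const set = (name: string, n: number): void => {
    if ((ff & bit(n)) !== 0) flags[name] = true;
  };
  set("readOnly", 1);
  set("required", 2);
  if (kind === "text") {
    set("multiline", 13);
    set("comb", 25);
  }
  if (kind === "choice") {
    set("combo", 18);
    set("edit", 19);
    set("multiSelect", 22);
  }
  return flags; // only set flags are emitted, matching { "required": true } above
}
```

Gating the type-specific bits on the field kind matters because the same bit position can carry a different meaning for another field type.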
Field overlay — promotion to the schema layer
A single schema key can be bound to many forms (a debtor's SSN appears on 30+ bankruptcy forms), and each form might author a different format. aggregateBindingsForCase produces a FieldOverlay keyed by schema path:
interface FieldOverlayEntry {
  required?: boolean   // any bound form requires it
  readOnly?: boolean   // every bound form is readOnly
  maxLen?: number      // min across bound forms
  promotedFormatKind?: 'ssn' | 'phone' | 'zip' | 'zip4' | 'date' | 'time' | 'number' | 'percent'
}
promotedFormatKind only fires when the schema has no format of its own and every bound form's format.kind agrees. The schema-level format always wins on conflict — .acro.json is the PDF view, the schema is the canonical view.
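The merge rules in the interface comments can be sketched as follows. This is not the body of aggregateBindingsForCase; `BoundField` and `mergeOverlay` are hypothetical names, and the input is assumed to be the per-form metadata of every field bound to one schema path.

```typescript
type FormatKind = "ssn" | "phone" | "zip" | "zip4" | "date" | "time" | "number" | "percent";

interface FieldOverlayEntry {
  required?: boolean;
  readOnly?: boolean;
  maxLen?: number;
  promotedFormatKind?: FormatKind;
}

// Hypothetical per-form view of one field bound to the schema path.
interface BoundField {
  required?: boolean;
  readOnly?: boolean;
  maxLen?: number | null;
  formatKind?: FormatKind | null;
}

function mergeOverlay(bound: BoundField[], schemaHasFormat: boolean): FieldOverlayEntry {
  const entry: FieldOverlayEntry = {};
  if (bound.length === 0) return entry;
  if (bound.some((f) => f.required === true)) entry.required = true;  // any form requires it
  if (bound.every((f) => f.readOnly === true)) entry.readOnly = true; // every form is readOnly
  const caps = bound.map((f) => f.maxLen).filter((n): n is number => typeof n === "number");
  if (caps.length > 0) entry.maxLen = Math.min(...caps);              // tightest cap wins
  // Promote a format only when the schema has none and all bound forms agree.
  const first = bound[0]?.formatKind;
  if (!schemaHasFormat && first && bound.every((f) => f.formatKind === first)) {
    entry.promotedFormatKind = first;
  }
  return entry;
}
```

Taking the minimum maxLen keeps a value valid on every bound form, and refusing to promote on disagreement avoids picking one form's mask arbitrarily.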
.xfa.json — XFA template metadata (XFA only)
Raw template data extracted from the XFA stream: field names, types, dropdown options, tooltips, validation rules, and a dynamic flag for fields with no fixed position. Used during knowledge authoring as a source of court-provided text: tooltip → helptext, options → note, format masks → hint. Never feeds the runtime directly.
.docx.json — Word content-control metadata (DOCX only)
Field names, types, and dropdown options extracted from the Word content controls. It plays the same role as .xfa.json: it informs .acro.json field naming and enriches knowledge.json.
.knowledge.json — Semantic mapping (operator-authored)
The hard step. The operator visually reads every page and writes the meaning of each AcroField, the schema path it maps to, the printed instruction text from the form, and a paralegal-facing hint. Field keys are the join key with fields.json.
{
  "form": "b101",
  "version": "2024-06-22",
  "title": "Voluntary Petition for Individuals Filing for Bankruptcy",
  "description": "Filed by individuals to initiate a bankruptcy case.",
  "sections": [
    {
      "part": 1,
      "title": "Identify Yourself",
      "pages": [1, 2],
      "lines": [
        {
          "line": "1",
          "question": "Your full name",
          "fields": {
            "Debtor1.First name": {
              "meaning": "Debtor 1 first name",
              "domain": "debtor.first_name",
              "helptext": "Your first name exactly as it appears on your government-issued photo ID.",
              "hint": "The debtor's legal first name. Used across all petition and schedule headers."
            }
          }
        }
      ]
    }
  ],
  "missing_acrofields": []
}
domain values fall into four buckets: a valid schema path (debtor.first_name, creditors[0].name), _computed (totals/summaries), _pagination (continuation indicators), or _extension.{concept}.* (form-specific data not in the schema). Schema paths are validated against flatten-schema.mjs <schema-key> output. PDF-native data — rects, formats, flags — is not in knowledge.json; that's fields.json's job.
.bindings.json — Schema → form fields (generated)
Generated by generate-bindings.mjs from knowledge.json + the current schema. Each field's domain is resolved to a schema path; the script writes one binding per resolved field, plus an extensions[] list (_extension.*) and a skipped[] list (_computed, _pagination). It also enriches fields.json with label (from knowledge meaning) and hint (from knowledge hint). Re-runnable any time the schema changes — the operator never touches the PDF again.
{
  "bindings": [
    { "source": "debtor.first_name", "targets": ["Debtor1.First name"] }
  ],
  "extensions": [
    { "domain": "_extension.service.method", "fields": ["service_method"] }
  ],
  "skipped": []
}
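The routing of domain values into the three lists might look like the sketch below. It is not the code of generate-bindings.mjs: `buildBindings` is a hypothetical name, the input is assumed to be a flat field-key to domain map already pulled out of knowledge.json, and the shape of skipped entries is an assumption that mirrors extensions, since the example above only shows an empty list.

```typescript
interface BindingsFile {
  bindings: { source: string; targets: string[] }[];
  extensions: { domain: string; fields: string[] }[];
  skipped: { domain: string; fields: string[] }[]; // entry shape assumed
}

function buildBindings(domains: Record<string, string>, validPaths: Set<string>): BindingsFile {
  const bound = new Map<string, string[]>();
  const extensions = new Map<string, string[]>();
  const skipped = new Map<string, string[]>();

  for (const [fieldKey, domain] of Object.entries(domains)) {
    let bucket: Map<string, string[]>;
    if (domain === "_computed" || domain === "_pagination") bucket = skipped;
    else if (domain.startsWith("_extension.")) bucket = extensions;
    else if (validPaths.has(domain)) bucket = bound;
    else throw new Error(`Unknown schema path "${domain}" on field "${fieldKey}"`);
    // Group field keys that share a domain into one entry.
    const list = bucket.get(domain) ?? [];
    list.push(fieldKey);
    bucket.set(domain, list);
  }

  const toList = (m: Map<string, string[]>) =>
    Array.from(m, ([domain, fields]) => ({ domain, fields }));
  return {
    bindings: Array.from(bound, ([source, targets]) => ({ source, targets })),
    extensions: toList(extensions),
    skipped: toList(skipped),
  };
}
```

Failing loudly on an unknown path, rather than skipping it, surfaces a stale knowledge.json as soon as the schema is renamed or restructured.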
.validations.json — Expression-based filing rules
Operator-authored expression rules that gate filing. References schema domain paths (not AcroField keys) so they survive form revisions.
{
  "validations": [
    { "expression": "debtor.full_name != ''", "severity": "error", "description": "Debtor name is required" },
    { "expression": "!(creditors[0].name != '') || creditors[0].claim_amount != ''",
      "severity": "error", "description": "Creditor 1 claim amount is required when name is provided" }
  ]
}
Required-field validations are now synthesized automatically by generate-bindings.mjs from flags.required on bound fields, so validations.json is reserved for cross-field logic, conditional requireds, and array-slot consistency rules that can't be derived mechanically.
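A rough sketch of that synthesis step, assuming the bindings list and the set of AcroField keys with flags.required are already in hand. `synthesizeRequired` is a hypothetical name, the expression form copies the `path != ''` style of the example above, and the description wording is invented for illustration.

```typescript
interface Binding {
  source: string;    // schema path
  targets: string[]; // AcroField keys
}

interface Validation {
  expression: string;
  severity: "error";
  description: string;
}

// One validation per schema path that has at least one required bound field.
function synthesizeRequired(bindings: Binding[], requiredKeys: Set<string>): Validation[] {
  return bindings
    .filter((b) => b.targets.some((t) => requiredKeys.has(t)))
    .map((b) => ({
      expression: `${b.source} != ''`,
      severity: "error" as const,
      description: `${b.source} is required`, // wording is an assumption
    }));
}
```

Because the rule is keyed on the schema path rather than the field key, it keeps working when a form revision renames the underlying AcroField.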
Five Paths Through The Pipeline
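The pipeline is a single directed graph; the source kind picks the entry point. Once fields.json exists, every path converges on the same three closing steps (knowledge → bindings → validations). The one exception is the flat informational path, which has no fields and stops after knowledge.json.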
Native AcroForm
extract-fields.mjs
→ meta.json (source: "native") + fields.json (full PDF-native metadata captured)
→ knowledge.json (operator: visual read, semantic mapping)
→ update meta.json (operator: name + schema)
→ generate-bindings.mjs → bindings.json
→ validations.json (operator: cross-field rules)
XFA (/XFA key, no AcroForm fields)
extract-fields.mjs
→ meta.json (source: "xfa") + _original.pdf + .xfa.json + .acro.json (auto from XFA template)
→ [operator reviews .acro.json, adds dynamic fields if any]
→ inject-fields.mjs → working .pdf
→ extract-fields.mjs (re-run) → fields.json
→ knowledge.json (cross-reference .xfa.json for tooltips, options, format masks)
→ update meta.json
→ generate-bindings.mjs → bindings.json
→ validations.json
Flat (printed blanks, no AcroFields, no XFA)
extract-fields.mjs
→ meta.json (source: "flat") + _original.pdf rename
→ [operator: visual read → author .acro.json with rect + format + flags + maxLen]
→ inject-fields.mjs → working .pdf
→ extract-fields.mjs (re-run) → fields.json
→ knowledge.json (semantic notes from the same single visual pass)
→ update meta.json
→ generate-bindings.mjs → bindings.json
→ validations.json
DOCX (court-issued Word document)
extract-fields.mjs
→ .docx.json (content-control field names + types + dropdown options)
→ _original.pdf (LibreOffice headless conversion)
→ meta.json (source: "docx")
→ [operator: visual calibration → author .acro.json using .docx.json field keys]
→ inject-fields.mjs → working .pdf
→ extract-fields.mjs (re-run) → fields.json
→ knowledge.json (cross-reference .docx.json for field names + options)
→ update meta.json
→ generate-bindings.mjs → bindings.json
→ validations.json
The .docx file is preserved alongside _original.pdf — both are court-issued sources.
Flat informational (no fillable elements at all)
extract-fields.mjs
→ meta.json (source: "flat") + _original.pdf rename
→ knowledge.json (0 fields, sections document content structure)
→ update meta.json
→ [no bindings, no validations]
Notice forms, certificates of service templates, and other purely informational PDFs land here. missing_acrofields records that the form has no fillable elements.
Tools
| Script | Purpose |
|---|---|
| domain_tools/scripts/extract-fields.mjs | Detect source kind, extract AcroForm/XFA/DOCX fields, write meta.json + fields.json (+ .acro.json + .xfa.json + .docx.json as applicable). Handles DOCX→PDF via LibreOffice automatically. |
| domain_tools/scripts/inject-fields.mjs | Stamp AcroFields onto _original.pdf from .acro.json to produce the working .pdf. |
| domain_tools/scripts/flatten-schema.mjs | Print every valid dot-notation path in a registered schema. Operator pipes this into a tmpfile and validates knowledge.json domain values against it. |
| domain_tools/scripts/generate-bindings.mjs | Resolve knowledge.json domains against the schema; write .bindings.json; enrich .fields.json with label/hint; synthesize required-field validations. Re-runnable any time. |
| domain_tools/scripts/verify-bindings.mjs | Cross-check every binding against the current schema, fields, and knowledge. |
| domain_tools/scripts/generate-groups.mjs | Build composite-form group definitions from .knowledge.json group files. |

| Operator playbook | Use |
|---|---|
| domain_tools/prompts/process-form.md | Reusable prompt for processing one leaf form end-to-end (knowledge + bindings + validations + meta). |
| domain_tools/prompts/process-group.md | Reusable prompt for building a composite form (children + cross-bindings + validations). |
| domain_tools/prompts/agent_forms_guide.md | Detailed pipeline reference (every step, every file format, verification checklist). |
Why This Layering Survives Schema And Form Changes
The split between PDF-native (.fields.json, .acro.json) and semantic (.knowledge.json) is what makes the rest cheap. When the schema changes — paths added, renamed, restructured — only bindings.json is regenerated. When a new court PDF version drops, only meta.json + fields.json (or .acro.json + injection) are redone, and knowledge.json is updated for any added/removed fields. Validations live on schema paths, not field keys, so they survive both.
Every artifact has exactly one author and one regeneration path. That's why we can keep ~330 knowledge files in lockstep with a 2,200-path schema across five chapters of bankruptcy and have the same scripts handle immigration, tax, or insurance the moment a new schema is registered.