
Form Pipeline — From PDF to Live Engine

How a court PDF becomes a live, fillable, schema-bound form. The pipeline is the same for every domain (bankruptcy, immigration, …) — only the schema and the forms change.

The end-state of a processed form directory is a stack of small JSON files sitting next to the working PDF. Each file owns one slice of the form's identity (mechanical shape, semantic meaning, schema routing, filing rules). Together they let the runtime render an HTML data-entry UI, route values across many forms, and produce a filled PDF.

forms/b101/
  b101_2024-06-22.pdf              ← working PDF (AcroForm-injected if needed)
  b101_2024-06-22_original.pdf     ← preserved court source (flat / XFA / docx-derived)
  b101_2024-06-22.meta.json        ← form identity (number, name, pages, source kind, schema)
  b101_2024-06-22.fields.json      ← extracted AcroFields (key, type, rect, format, flags, maxLen…)
  b101_2024-06-22.acro.json        ← injection spec (only for flat / XFA / docx)
  b101_2024-06-22.xfa.json         ← XFA template metadata (XFA only)
  b101_2024-06-22.docx.json        ← Word content-control metadata (DOCX only)
  b101_2024-06-22.knowledge.json   ← semantic mapping: each field → meaning + schema domain
  b101_2024-06-22.bindings.json    ← schema path → field targets (generated)
  b101_2024-06-22.validations.json ← expression-based filing rules

Four Source Types

Every court form the operator processes falls into one of four buckets. The bucket determines what runs first and which side files exist; everything downstream of fields.json is identical.

Source	What it is	How we make it fillable
native	PDF already has AcroForm fields embedded by the form author	Nothing — extract fields directly
xfa	Adobe XFA form (dynamic XML template, no AcroForm widgets)	Read the XFA template, auto-generate `.acro.json`, inject AcroFields, re-extract
flat	Static PDF with printed blanks / checkboxes / placeholders but no fields	Operator authors `.acro.json` from a visual read, inject, re-extract
docx	Court-issued Microsoft Word document with content controls	LibreOffice converts to PDF, content-control metadata exported to `.docx.json`, operator authors `.acro.json`, inject, re-extract

extract-fields.mjs detects the source kind, writes meta.json with source: "native" | "xfa" | "flat" | "docx", and renames the original to _original.pdf for non-native sources. The court file is never modified — every working PDF is derived from _original.pdf via injection, so re-injecting from a fixed .acro.json is always possible.
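As a rough illustration of how the source kind decides which side files exist, the sketch below lists the artifacts a fully processed, fillable form would have. The function and constant names are hypothetical, the `forms/<id>/<id>_<date>` stem is taken from the b101 listing above, and the preserved `.docx` file for docx sources is left out for brevity.

```typescript
type SourceKind = 'native' | 'xfa' | 'flat' | 'docx';

// Files every processed, fillable form ends up with (suffix appended to the stem).
const COMMON_SUFFIXES: string[] = [
  '.pdf', '.meta.json', '.fields.json',
  '.knowledge.json', '.bindings.json', '.validations.json',
];

// Extra files that exist only for some source kinds.
const EXTRA_SUFFIXES: Record<SourceKind, string[]> = {
  native: [],
  xfa: ['_original.pdf', '.acro.json', '.xfa.json'],
  flat: ['_original.pdf', '.acro.json'],
  docx: ['_original.pdf', '.acro.json', '.docx.json'],
};

// Hypothetical helper: not part of extract-fields.mjs.
function expectedArtifacts(formId: string, effectiveDate: string, source: SourceKind): string[] {
  const stem = `forms/${formId}/${formId}_${effectiveDate}`;
  return [...COMMON_SUFFIXES, ...EXTRA_SUFFIXES[source]].map((suffix) => stem + suffix);
}
```

A check like this could run at the end of processing to confirm that a form directory is complete for its declared source.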

What Each JSON Owns

Each file has one job. The boundaries are deliberate — they let us regenerate downstream artifacts (bindings, validations, filled PDFs) without re-reading the PDF, and they keep court-provided text (helptext, tooltips) separate from operator-authored interpretation (hints, schema mappings).

`.meta.json` — Form identity

Mechanical, generated by extract-fields.mjs, completed by the operator after the visual read.

{
  "number": "101",
  "name": "Voluntary Petition for Individuals Filing for Bankruptcy",
  "effectiveDate": "2024-06-22",
  "pages": 9,
  "source": "native",
  "schema": "bankruptcy"
}

schema declares which schema's vocabulary the form's bindings/validations reference, making each form folder self-contained. name is set by the operator from the PDF title (the extractor leaves it null).

`.fields.json` — PDF-native field shape

The source of truth for AcroField keys, positions, and PDF-native metadata. Generated by extract-fields.mjs after AcroFields exist (immediately for native, after injection for the other three sources).

{
  "fields": [
    {
      "key": "Debtor1.First name",
      "type": "text",
      "page": 1,
      "rect": { "x": 172, "y": 407, "w": 195, "h": 13 },
      "options": null,
      "flags": { "required": true },
      "maxLen": 60,
      "format": null,
      "align": "left",
      "defaultValue": null,
      "altName": "Debtor 1 first name (required)"
    },
    {
      "key": "Check Box5",
      "type": "checkbox",
      "page": 1,
      "rect": { "x": 379, "y": 700, "w": 12, "h": 11 },
      "options": ["Presumption", "No Presumption"],
      "widgets": [
        { "onValue": "Presumption",    "page": 1, "rect": { "x": 379, "y": 700, "w": 12, "h": 11 } },
        { "onValue": "No Presumption", "page": 1, "rect": { "x": 379, "y": 688, "w": 12, "h": 11 } }
      ]
    }
  ]
}

Field shape (also FormField in @dossier/core):

Key	Source on the PDF	Meaning
`key`	`/T` field name (or your `.acro.json` key for injected fields)	Join key with `knowledge.json`
`type`	`text` / `checkbox` / `dropdown` / `radio` / `signature` / `optionList`	Renderer + filler discriminator
`page`, `rect`	Widget annotation page + `/Rect`	Used for live PDF preview overlay
`options`, `widgets`	Choice `/Opt` + per-widget on-values	Single field, multiple visual checkboxes
`flags`	`/Ff` bits	`readOnly`, `required`, `multiline` (text bit 13), `comb` (text bit 25), `combo` (choice bit 18), `edit` (choice bit 19), `multiSelect` (choice bit 22)
`maxLen`	`/MaxLen`	Maximum character count for text fields
`format`	`/AA/F` JavaScript (`AFDate_FormatEx`, `AFNumber_Format`, `AFPercent_Format`, `AFSpecial_Format`, `AFTime_Format`) or XFA `<picture>`	See Field formats below
`align`	`/Q` (0/1/2)	`left` / `center` / `right`
`defaultValue`	`/DV`	Pre-filled value to render when empty
`altName`	`/TU`	Tooltip / accessibility label

Inheritable keys (/Ff, /MaxLen, /Q, /DV) are walked up the /Parent chain so kid widgets inherit field-level metadata.
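The parent walk can be pictured with a simplified field dictionary. This is a minimal sketch, assuming a plain-object stand-in (`FieldDict`) for whatever the PDF library actually exposes; `resolveInherited` is a hypothetical name, not the extractor's real function.

```typescript
// Simplified stand-in for a PDF field or widget dictionary (assumption).
interface FieldDict {
  Ff?: number;
  MaxLen?: number;
  Q?: number;
  DV?: string;
  Parent?: FieldDict;
}

type InheritableKey = 'Ff' | 'MaxLen' | 'Q' | 'DV';

// Return the nearest value for `key`, starting at the widget and walking up /Parent.
function resolveInherited(node: FieldDict, key: InheritableKey): number | string | undefined {
  let current: FieldDict | undefined = node;
  // The depth cap guards against malformed, cyclic /Parent chains.
  for (let depth = 0; current !== undefined && depth < 64; depth++) {
    const value = current[key];
    if (value !== undefined) return value;
    current = current.Parent;
  }
  return undefined;
}
```

A kid widget that only carries `/Q` would still report the `/MaxLen` set on its parent field, while its own `/Q` takes precedence over any inherited one.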

`.acro.json` — Injection spec (flat / XFA / docx only)

Tells inject-fields.mjs what AcroFields to stamp onto _original.pdf. For XFA it's auto-generated from the XFA template's positions; for flat and docx the operator authors it from a visual read.

{
  "fields": [
    { "key": "debtor_name", "type": "text", "page": 1, "rect": { "x": 100, "y": 500, "w": 200, "h": 18 } },

    { "key": "executed_date", "type": "text", "page": 1, "rect": { "x": 118, "y": 145, "w": 115, "h": 14 },
      "format": { "kind": "date", "pattern": "mm/dd/yyyy" } },

    { "key": "debtor_ssn", "type": "text", "page": 1, "rect": { "x": 200, "y": 600, "w": 110, "h": 14 },
      "format": { "kind": "ssn" }, "flags": { "comb": true, "required": true }, "maxLen": 9 },

    { "key": "claim_amount", "type": "text", "page": 1, "rect": { "x": 410, "y": 320, "w": 100, "h": 14 },
      "format": { "kind": "number", "decimals": 2, "currency": "$", "prependCurrency": true } },

    { "key": "explanation", "type": "text", "page": 1, "rect": { "x": 80, "y": 200, "w": 460, "h": 60 },
      "flags": { "multiline": true } },

    { "key": "is_amended", "type": "checkbox", "page": 1, "rect": { "x": 450, "y": 700, "w": 12, "h": 12 } }
  ]
}

.acro.json is the only place the operator writes PDF-native shape directly. For native PDFs the original AcroForm authors did this work for us, and extract-fields.mjs reads it into .fields.json. The two files carry the same vocabulary, so anything you can author in .acro.json (every entry in the field-shape table above) survives extraction unchanged.
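Because .acro.json is hand-authored for flat and docx sources, a quick lint pass before injection can catch slips early. The sketch below is hypothetical and not part of inject-fields.mjs; the specific checks (unique keys, 1-based pages, non-empty rects, `comb` paired with `maxLen`) are assumptions drawn from the examples in this section.

```typescript
interface AcroSpecField {
  key: string;
  type: string;
  page: number;
  rect: { x: number; y: number; w: number; h: number };
  flags?: { comb?: boolean; multiline?: boolean; required?: boolean };
  maxLen?: number;
}

// Hypothetical pre-injection lint; returns human-readable problems.
function lintAcroSpec(fields: AcroSpecField[]): string[] {
  const problems: string[] = [];
  const seenKeys = new Set<string>();
  for (const field of fields) {
    // Keys are the join key with knowledge.json, so they must be unique.
    if (seenKeys.has(field.key)) problems.push(`${field.key}: duplicate key`);
    seenKeys.add(field.key);
    if (field.page < 1) problems.push(`${field.key}: page must be 1 or greater`);
    if (field.rect.w <= 0 || field.rect.h <= 0) problems.push(`${field.key}: rect has no area`);
    // Comb boxes are drawn for the maxLen count, so comb without maxLen is ambiguous.
    if (field.flags?.comb && field.maxLen === undefined) {
      problems.push(`${field.key}: comb needs maxLen`);
    }
  }
  return problems;
}
```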

Field formats — the recent `.acro.json` work

The format object is the PDF-native value-format declaration. It comes from Acrobat's /AA/F JavaScript on native PDFs, from XFA <picture> on XFA, and from operator authoring on flat/docx. Eight kinds are supported, all decoded into the same FieldFormat shape:

`format.kind`	Source on native PDFs	Visual cue on flat PDFs	Notes
`date`	`AFDate_FormatEx("mm/dd/yyyy")` / `AFDate_Format(N)`	"Date: __/__/____" or "MM / DD / YYYY" mask	`pattern` is a literal mask: `mm/dd/yyyy`, `m/d/yyyy`, `yyyy-mm-dd`, `mm-dd-yyyy`, `mmm d, yyyy`
`time`	`AFTime_Format(N)`	"HH:MM AM/PM" label	`sepStyle` 0..3 controls 12/24h preset
`number`	`AFNumber_Format(decimals, sepStyle, negStyle, _, currency, prepend)`	"$____." or `Amount` column with comb boxes	`sepStyle` 0..3 (1,234.56 / 1234.56 / 1.234,56 / 1234,56), `negStyle` 0..3 (Minus / Red / Parens / ParensRed), `currency` symbol, `prependCurrency` true/false
`percent`	`AFPercent_Format(decimals, sepStyle)`	"%" symbol or "Rate" column	Same sepStyle table; suffix `%`
`ssn`	`AFSpecial_Format(3)`	"Social Security Number: ___-__-____"	Stored as digits-only, displayed `999-99-9999`
`phone`	`AFSpecial_Format(2)`	"(___) ___-____" or labeled "Phone" with mask	Stored as digits-only, displayed `(999) 999-9999`
`zip`	`AFSpecial_Format(0)`	"ZIP: _____"	5 digits
`zip4`	`AFSpecial_Format(1)`	"_____-____"	5+4, displayed `99999-9999`

This table is the contract between three layers:

Extraction / authoring — extract-fields.mjs parses /AA/F JavaScript into a format object; flat-PDF authors hand-write the same object into .acro.json.
Data entry UI — mapFieldType (packages/client/src/pages/case-tabs/data-derive.ts) maps the kind to a typed input. MaskedInput renders SSN/phone/zip/zip4 with caret-preserving mask handling and stores digits-only. Date / number / percent fall through to typed inputs and store ISO / numeric values.
PDF filler — formatValueForField (packages/core/src/pdf/format-value.ts) is called before every setText in pdf-filler.ts. It re-applies the format on render so server-rendered or flattened PDFs match what Acrobat would have produced at runtime via /AA/F.
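The extraction and filler ends of this contract can be sketched together. This is a minimal illustration, not the code in extract-fields.mjs or format-value.ts: the function names, the `ParsedFormat` type, and the choice to leave incomplete values untouched are all assumptions. The `AFSpecial_Format` indices, display masks, and `sepStyle` meanings come from the table above.

```typescript
type MaskKind = 'ssn' | 'phone' | 'zip' | 'zip4';
type ParsedFormat = { kind: MaskKind } | { kind: 'date'; pattern: string };

// Index in this list = the AFSpecial_Format argument.
const SPECIAL_KINDS = ['zip', 'zip4', 'phone', 'ssn'] as const;

// Extraction side: turn /AA/F JavaScript into a format object (subset only).
function parseFormatScript(js: string): ParsedFormat | null {
  const special = /AFSpecial_Format\(\s*(\d)\s*\)/.exec(js);
  if (special) {
    const kind = SPECIAL_KINDS[Number(special[1])];
    return kind ? { kind } : null;
  }
  const dateCall = /AFDate_FormatEx\(\s*"([^"]+)"\s*\)/.exec(js);
  if (dateCall) return { kind: 'date', pattern: dateCall[1] };
  return null;
}

// '9' marks a digit slot; every other character is emitted literally.
const DISPLAY_MASKS: Record<MaskKind, string> = {
  ssn: '999-99-9999',
  phone: '(999) 999-9999',
  zip: '99999',
  zip4: '99999-9999',
};

// Filler side: re-apply the mask to a digits-only stored value.
function applyDisplayMask(kind: MaskKind, stored: string): string {
  const digits = stored.replace(/\D/g, '');
  const mask = DISPLAY_MASKS[kind];
  const slots = mask.split('9').length - 1;
  // Leave incomplete or over-long values untouched instead of guessing.
  if (digits.length !== slots) return stored;
  let next = 0;
  return mask.replace(/9/g, () => digits.charAt(next++));
}

// sepStyle: 0 = 1,234.56, 1 = 1234.56, 2 = 1.234,56, 3 = 1234,56 (negStyle ignored here).
function formatNumber(value: number, decimals: number, sepStyle: 0 | 1 | 2 | 3): string {
  const [whole, fraction] = Math.abs(value).toFixed(decimals).split('.');
  const thousands = sepStyle === 0 ? ',' : sepStyle === 2 ? '.' : '';
  const decimalPoint = sepStyle <= 1 ? '.' : ',';
  const grouped = whole.replace(/\B(?=(\d{3})+(?!\d))/g, thousands);
  return (value < 0 ? '-' : '') + grouped + (fraction ? decimalPoint + fraction : '');
}
```

With these definitions, `applyDisplayMask('ssn', '123456789')` yields `123-45-6789` and `formatNumber(1234.5, 2, 2)` yields `1.234,50`.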

flags is the matching contract for boolean PDF behavior:

`flags.*`	PDF source	Used by
`readOnly`	`/Ff` bit 1	UI disables the input
`required`	`/Ff` bit 2	Renderer shows the asterisk; `generate-bindings.mjs` synthesizes a required-field validation per bound schema path
`multiline`	`/Ff` bit 13 (text only)	Renderer picks `<textarea>`
`comb`	`/Ff` bit 25 (text only)	Renderer renders per-character boxes for the `maxLen` count
`combo` / `edit` / `multiSelect`	`/Ff` bits 18 / 19 / 22 (choice only)	Renderer picks dropdown vs list, allows free-text vs strict, single vs multi
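The bit positions in this table are 1-based, which is easy to get wrong when masking. A minimal sketch of the decoding, assuming a hypothetical `decodeFieldFlags` helper and a coarse text/choice/other split of field types:

```typescript
// PDF numbers flag bits from 1, so bit n corresponds to mask 1 << (n - 1).
const bitMask = (n: number): number => 1 << (n - 1);

type FlagFamily = 'text' | 'choice' | 'other';

// Hypothetical decoder: only flags that are set appear in the result.
function decodeFieldFlags(ff: number, family: FlagFamily): Record<string, boolean> {
  const flags: Record<string, boolean> = {};
  const setIf = (flagName: string, n: number): void => {
    if ((ff & bitMask(n)) !== 0) flags[flagName] = true;
  };
  setIf('readOnly', 1);
  setIf('required', 2);
  if (family === 'text') {
    setIf('multiline', 13);
    setIf('comb', 25);
  }
  if (family === 'choice') {
    setIf('combo', 18);
    setIf('edit', 19);
    setIf('multiSelect', 22);
  }
  return flags;
}
```

Gating on the field family matters because the same bit position means different things for different field types; a bit 25 on a choice field is not `comb`.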

maxLen, defaultValue, and altName round out the metadata the renderer surfaces (character cap, pre-filled value, tooltip).

Field overlay — promotion to the schema layer

A single schema path is often bound to many forms (a debtor's SSN appears on 30+ bankruptcy forms), and each form may author a different format. aggregateBindingsForCase reconciles them into a FieldOverlay keyed by schema path:

interface FieldOverlayEntry {
  required?: boolean              // any bound form requires it
  readOnly?: boolean              // every bound form is readOnly
  maxLen?: number                 // min across bound forms
  promotedFormatKind?: 'ssn' | 'phone' | 'zip' | 'zip4' | 'date' | 'time' | 'number' | 'percent'
}

promotedFormatKind only fires when the schema has no format of its own and every bound form's format.kind agrees. The schema-level format always wins on conflict — .acro.json is the PDF view, the schema is the canonical view.
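The merge rules for one schema path can be sketched as follows. This is an illustration of the rules stated here, not the body of aggregateBindingsForCase; `BoundFieldMeta`, `OverlayEntrySketch`, and `mergeOverlayEntry` are hypothetical names, and the per-form metadata is assumed to be already collected into an array.

```typescript
type FormatKind = 'ssn' | 'phone' | 'zip' | 'zip4' | 'date' | 'time' | 'number' | 'percent';

// What one bound form contributes for a given schema path (assumed shape).
interface BoundFieldMeta {
  required?: boolean;
  readOnly?: boolean;
  maxLen?: number;
  formatKind?: FormatKind;
}

interface OverlayEntrySketch {
  required?: boolean;
  readOnly?: boolean;
  maxLen?: number;
  promotedFormatKind?: FormatKind;
}

function mergeOverlayEntry(bound: BoundFieldMeta[], schemaHasFormat: boolean): OverlayEntrySketch {
  const entry: OverlayEntrySketch = {};
  if (bound.length === 0) return entry;
  // required: any bound form requires it.
  if (bound.some((f) => f.required === true)) entry.required = true;
  // readOnly: every bound form is read-only.
  if (bound.every((f) => f.readOnly === true)) entry.readOnly = true;
  // maxLen: the tightest cap across bound forms.
  const caps = bound.map((f) => f.maxLen).filter((n): n is number => n !== undefined);
  if (caps.length > 0) entry.maxLen = Math.min(...caps);
  // Promote a format only when the schema has none and all forms agree.
  const firstKind = bound[0].formatKind;
  if (!schemaHasFormat && firstKind !== undefined && bound.every((f) => f.formatKind === firstKind)) {
    entry.promotedFormatKind = firstKind;
  }
  return entry;
}
```

Taking the minimum `maxLen` is the conservative choice: a value that fits the tightest form fits every other form bound to the same path.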

`.xfa.json` — XFA template metadata (XFA only)

Raw template data extracted from the XFA stream: field names, types, dropdown options, tooltips, validation rules, and a dynamic flag for fields with no fixed position. Used during knowledge authoring as a source of court-provided text: tooltip → helptext, options → note, format masks → hint. Never feeds the runtime directly.

`.docx.json` — Word content-control metadata (DOCX only)

Field names, types, and dropdown options extracted from the Word content controls. Same role as .xfa.json — informs .acro.json field naming and enriches knowledge.json.

`.knowledge.json` — Semantic mapping (operator-authored)

The hard step. The operator visually reads every page and writes the meaning of each AcroField, the schema path it maps to, the printed instruction text from the form, and a paralegal-facing hint. Field keys are the join key with fields.json.

{
  "form": "b101",
  "version": "2024-06-22",
  "title": "Voluntary Petition for Individuals Filing for Bankruptcy",
  "description": "Filed by individuals to initiate a bankruptcy case.",
  "sections": [
    {
      "part": 1,
      "title": "Identify Yourself",
      "pages": [1, 2],
      "lines": [
        {
          "line": "1",
          "question": "Your full name",
          "fields": {
            "Debtor1.First name": {
              "meaning": "Debtor 1 first name",
              "domain": "debtor.first_name",
              "helptext": "Your first name exactly as it appears on your government-issued photo ID.",
              "hint": "The debtor's legal first name. Used across all petition and schedule headers."
            }
          }
        }
      ]
    }
  ],
  "missing_acrofields": []
}
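Since field keys are the join key between the two files, a simple set comparison shows whether the knowledge file and the extracted fields line up. The sketch below is hypothetical: the types are trimmed to what the comparison needs, and the result names (`unmapped`, `unknown`) are illustrative rather than fields of any pipeline file.

```typescript
// Trimmed view of knowledge.json: only the parts needed to collect field keys.
interface KnowledgeSketch {
  sections: { lines: { fields: Record<string, { domain: string }> }[] }[];
}

function knowledgeKeys(knowledge: KnowledgeSketch): string[] {
  const keys: string[] = [];
  for (const section of knowledge.sections) {
    for (const line of section.lines) {
      keys.push(...Object.keys(line.fields));
    }
  }
  return keys;
}

function compareKeys(knowledge: KnowledgeSketch, fieldKeys: string[]): { unmapped: string[]; unknown: string[] } {
  const mapped = new Set(knowledgeKeys(knowledge));
  const extracted = new Set(fieldKeys);
  return {
    // Present in fields.json but never described in knowledge.json.
    unmapped: fieldKeys.filter((key) => !mapped.has(key)),
    // Described in knowledge.json but absent from fields.json.
    unknown: Array.from(mapped).filter((key) => !extracted.has(key)),
  };
}
```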

domain values fall into four buckets: a valid schema path (debtor.first_name, creditors[0].name), _computed (totals/summaries), _pagination (continuation indicators), or _extension.{concept}.* (form-specific data not in the schema). Schema paths are validated against flatten-schema.mjs <schema-key> output. PDF-native data — rects, formats, flags — is not in knowledge.json; that's fields.json's job.
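The four buckets can be told apart with a few string checks. This is a minimal sketch, assuming the valid paths from flatten-schema.mjs are loaded into a set and that array paths appear there with an empty index (`creditors[].name`); that output shape is an assumption, as is the `classifyDomain` name.

```typescript
type DomainBucket = 'schema' | 'computed' | 'pagination' | 'extension' | 'invalid';

function classifyDomain(domain: string, validPaths: ReadonlySet<string>): DomainBucket {
  if (domain === '_computed') return 'computed';
  if (domain === '_pagination') return 'pagination';
  if (domain.startsWith('_extension.')) return 'extension';
  // Assumption: the flattened schema lists array paths with an empty index.
  const normalized = domain.replace(/\[\d+\]/g, '[]');
  return validPaths.has(normalized) ? 'schema' : 'invalid';
}
```

Anything that lands in `invalid` is most likely a typo or a path that was renamed in the schema, which is exactly what the operator wants flagged before bindings are generated.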

`.bindings.json` — Schema → form fields (generated)

Generated by generate-bindings.mjs from knowledge.json + the current schema. Each field's domain is resolved to a schema path; the script writes one binding per resolved field, plus an extensions[] list (_extension.*) and a skipped[] list (_computed, _pagination). It also enriches fields.json with label (from knowledge meaning) and hint (from knowledge hint). Re-runnable any time the schema changes — the operator never touches the PDF again.

{
  "bindings": [
    { "source": "debtor.first_name", "targets": ["Debtor1.First name"] }
  ],
  "extensions": [
    { "domain": "_extension.service.method", "fields": ["service_method"] }
  ],
  "skipped": []
}
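The grouping step behind this file can be sketched in a few lines. This is not generate-bindings.mjs itself: the input is assumed to be a flat map of field key to `domain` already collected from knowledge.json, schema validation and the fields.json enrichment are omitted, and representing `skipped` as a list of field keys is an assumption.

```typescript
interface BindingsSketch {
  bindings: { source: string; targets: string[] }[];
  extensions: { domain: string; fields: string[] }[];
  skipped: string[];
}

function buildBindings(fieldDomains: Record<string, string>): BindingsSketch {
  const bySource = new Map<string, string[]>();
  const byExtension = new Map<string, string[]>();
  const skipped: string[] = [];
  for (const fieldKey of Object.keys(fieldDomains)) {
    const domain = fieldDomains[fieldKey];
    if (domain === '_computed' || domain === '_pagination') {
      skipped.push(fieldKey);
      continue;
    }
    // Several fields may share one domain, so group targets per domain.
    const bucket = domain.startsWith('_extension.') ? byExtension : bySource;
    const list = bucket.get(domain) ?? [];
    list.push(fieldKey);
    bucket.set(domain, list);
  }
  return {
    bindings: Array.from(bySource, ([source, targets]) => ({ source, targets })),
    extensions: Array.from(byExtension, ([domain, fields]) => ({ domain, fields })),
    skipped,
  };
}
```

Because the function reads only domains and field keys, it can be re-run after a schema change without touching the PDF.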

`.validations.json` — Expression-based filing rules

Operator-authored expression rules that gate filing. References schema domain paths (not AcroField keys) so they survive form revisions.

{
  "validations": [
    { "expression": "debtor.full_name != ''", "severity": "error",   "description": "Debtor name is required" },
    { "expression": "!(creditors[0].name != '') || creditors[0].claim_amount != ''",
      "severity": "error", "description": "Creditor 1 claim amount is required when name is provided" }
  ]
}

Required-field validations are now synthesized automatically by generate-bindings.mjs from flags.required on bound fields, so validations.json is reserved for cross-field logic, conditional requireds, and array-slot consistency rules that can't be derived mechanically.
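A rough picture of that synthesis, assuming the set of required AcroField keys has already been read from fields.json: the rule shape and the `!= ''` expression follow the example above, while the function name and the description wording are hypothetical.

```typescript
interface RequiredRule {
  expression: string;
  severity: 'error';
  description: string;
}

// One rule per bound schema path that has at least one required target field.
function synthesizeRequiredRules(
  bindings: { source: string; targets: string[] }[],
  requiredKeys: ReadonlySet<string>,
): RequiredRule[] {
  return bindings
    .filter((binding) => binding.targets.some((key) => requiredKeys.has(key)))
    .map((binding) => ({
      expression: `${binding.source} != ''`,
      severity: 'error' as const,
      description: `${binding.source} is required`,
    }));
}
```

Keying the rule on the schema path rather than the field key keeps it consistent with the hand-written rules in validations.json.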

Five Paths Through The Pipeline

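The pipeline is a single directed graph; the source kind picks the entry point. Once fields.json exists, every path converges on the same three closing steps (knowledge → bindings → validations). The exception is the flat informational path, which has no fillable elements and stops after knowledge.json and meta.json.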

Native AcroForm

extract-fields.mjs
  → meta.json (source: "native") + fields.json   (full PDF-native metadata captured)
→ knowledge.json   (operator: visual read, semantic mapping)
→ update meta.json (operator: name + schema)
→ generate-bindings.mjs  → bindings.json
→ validations.json (operator: cross-field rules)

XFA (`/XFA` key, no AcroForm fields)

extract-fields.mjs
  → meta.json (source: "xfa") + _original.pdf + .xfa.json + .acro.json (auto from XFA template)
→ [operator reviews .acro.json, adds dynamic fields if any]
→ inject-fields.mjs  → working .pdf
→ extract-fields.mjs (re-run)  → fields.json
→ knowledge.json   (cross-reference .xfa.json for tooltips, options, format masks)
→ update meta.json
→ generate-bindings.mjs  → bindings.json
→ validations.json

Flat (printed blanks, no AcroFields, no XFA)

extract-fields.mjs
  → meta.json (source: "flat") + _original.pdf rename
→ [operator: visual read → author .acro.json with rect + format + flags + maxLen]
→ inject-fields.mjs  → working .pdf
→ extract-fields.mjs (re-run)  → fields.json
→ knowledge.json   (semantic notes from the same single visual pass)
→ update meta.json
→ generate-bindings.mjs  → bindings.json
→ validations.json

DOCX (court-issued Word document)

extract-fields.mjs
  → .docx.json (content-control field names + types + dropdown options)
  → _original.pdf (LibreOffice headless conversion)
  → meta.json (source: "docx")
→ [operator: visual calibration → author .acro.json using .docx.json field keys]
→ inject-fields.mjs  → working .pdf
→ extract-fields.mjs (re-run)  → fields.json
→ knowledge.json   (cross-reference .docx.json for field names + options)
→ update meta.json
→ generate-bindings.mjs  → bindings.json
→ validations.json

The .docx file is preserved alongside _original.pdf — both are court-issued sources.

Flat informational (no fillable elements at all)

extract-fields.mjs
  → meta.json (source: "flat") + _original.pdf rename
→ knowledge.json   (0 fields, sections document content structure)
→ update meta.json
→ [no bindings, no validations]

Notice forms, certificate-of-service templates, and other purely informational PDFs land here. missing_acrofields records that the form has no fillable elements.
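The five paths differ only in their opening steps, which a small lookup makes explicit. This is a hypothetical summary helper, not a script in the repo; step labels are shortened from the diagrams above, and the XFA "review" and flat/docx "author" steps are folded into one label.

```typescript
type PipelineSource = 'native' | 'xfa' | 'flat' | 'docx';

function pipelineSteps(source: PipelineSource, hasFillableElements = true): string[] {
  const steps = ['extract-fields.mjs'];
  if (!hasFillableElements) {
    // Flat informational: knowledge documents structure only, no bindings or validations.
    return [...steps, 'knowledge.json', 'update meta.json'];
  }
  if (source !== 'native') {
    // Non-native sources need AcroFields injected before fields.json can exist.
    steps.push('review or author .acro.json', 'inject-fields.mjs', 'extract-fields.mjs (re-run)');
  }
  // Shared closing steps once fields.json exists.
  steps.push('knowledge.json', 'update meta.json', 'generate-bindings.mjs', 'validations.json');
  return steps;
}
```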

Tools

Script	Purpose
`domain_tools/scripts/extract-fields.mjs`	Detect source kind, extract AcroForm/XFA/DOCX fields, write `meta.json` + `fields.json` (+ `.acro.json` + `.xfa.json` + `.docx.json` as applicable). Handles DOCX→PDF via LibreOffice automatically.
`domain_tools/scripts/inject-fields.mjs`	Stamp AcroFields onto `_original.pdf` from `.acro.json` to produce the working `.pdf`.
`domain_tools/scripts/flatten-schema.mjs`	Print every valid dot-notation path in a registered schema. Operator pipes this into a tmpfile and validates `knowledge.json` `domain` values against it.
`domain_tools/scripts/generate-bindings.mjs`	Resolve `knowledge.json` domains against the schema; write `.bindings.json`; enrich `.fields.json` with `label`/`hint`; synthesize required-field validations. Re-runnable any time.
`domain_tools/scripts/verify-bindings.mjs`	Cross-check every binding against the current schema, fields, and knowledge.
`domain_tools/scripts/generate-groups.mjs`	Build composite-form group definitions from `.knowledge.json` group files.

Operator playbook	Use
`domain_tools/prompts/process-form.md`	Reusable prompt for processing one leaf form end-to-end (knowledge + bindings + validations + meta).
`domain_tools/prompts/process-group.md`	Reusable prompt for building a composite form (children + cross-bindings + validations).
`domain_tools/prompts/agent_forms_guide.md`	Detailed pipeline reference (every step, every file format, verification checklist).

Why This Layering Survives Schema And Form Changes

The split between PDF-native (.fields.json, .acro.json) and semantic (.knowledge.json) is what makes the rest cheap. When the schema changes — paths added, renamed, restructured — only bindings.json is regenerated. When a new court PDF version drops, only meta.json + fields.json (or .acro.json + injection) are redone, and knowledge.json is updated for any added/removed fields. Validations live on schema paths, not field keys, so they survive both.

Every artifact has exactly one author and one regeneration path. That's why we can keep ~330 knowledge files in lockstep with a 2,200-path schema across five chapters of bankruptcy and have the same scripts handle immigration, tax, or insurance the moment a new schema is registered.

Source: docs/form-pipeline.md
