Form Pipeline — From PDF to Live Engine

How a court PDF becomes a live, fillable, schema-bound form. The pipeline is the same for every domain (bankruptcy, immigration, …) — only the schema and the forms change.

The end-state of a processed form directory is a stack of small JSON files sitting next to the working PDF. Each file owns one slice of the form's identity (mechanical shape, semantic meaning, schema routing, filing rules). Together they let the runtime render an HTML data-entry UI, route values across many forms, and produce a filled PDF.

forms/b101/
  b101_2024-06-22.pdf              ← working PDF (AcroForm-injected if needed)
  b101_2024-06-22_original.pdf     ← preserved court source (flat / XFA / docx-derived)
  b101_2024-06-22.meta.json        ← form identity (number, name, pages, source kind, schema)
  b101_2024-06-22.fields.json      ← extracted AcroFields (key, type, rect, format, flags, maxLen…)
  b101_2024-06-22.acro.json        ← injection spec (only for flat / XFA / docx)
  b101_2024-06-22.xfa.json         ← XFA template metadata (XFA only)
  b101_2024-06-22.docx.json        ← Word content-control metadata (DOCX only)
  b101_2024-06-22.knowledge.json   ← semantic mapping: each field → meaning + schema domain
  b101_2024-06-22.bindings.json    ← schema path → field targets (generated)
  b101_2024-06-22.validations.json ← expression-based filing rules

Four Source Types

Every PDF the operator processes falls into one of four buckets. The bucket determines what runs first and which side files exist; everything downstream of fields.json is identical.

| Source | What it is | How we make it fillable |
|---|---|---|
| native | PDF already has AcroForm fields embedded by the form author | Nothing: extract fields directly |
| xfa | Adobe XFA form (dynamic XML template, no AcroForm widgets) | Read the XFA template, auto-generate .acro.json, inject AcroFields, re-extract |
| flat | Static PDF with printed blanks / checkboxes / placeholders but no fields | Operator authors .acro.json from a visual read, inject, re-extract |
| docx | Court-issued Microsoft Word document with content controls | LibreOffice converts to PDF, content-control metadata exported to .docx.json, operator authors .acro.json, inject, re-extract |

extract-fields.mjs detects the source kind, writes meta.json with source: "native" | "xfa" | "flat" | "docx", and renames the original to _original.pdf for non-native sources. The court file is never modified — every working PDF is derived from _original.pdf via injection, so re-injecting from a fixed .acro.json is always possible.
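
The detection step can be pictured as a small decision function. The following is a minimal sketch, not the actual extract-fields.mjs logic: the `SourceProbe` shape, the function names, and the order of the checks are all assumptions made for illustration.

```typescript
// Hypothetical sketch of source-kind detection; names and check order are assumptions.
type SourceKind = "native" | "xfa" | "flat" | "docx";

interface SourceProbe {
  fileName: string;        // e.g. "b101_2024-06-22.pdf"
  acroFieldCount: number;  // AcroForm fields found in the document
  hasXfaKey: boolean;      // an /XFA entry is present
}

function classifySource(probe: SourceProbe): SourceKind {
  // A Word document is converted to PDF first, so it is identified by extension.
  if (/\.docx$/i.test(probe.fileName)) return "docx";
  // Embedded AcroForm fields mean there is nothing to inject.
  if (probe.acroFieldCount > 0) return "native";
  // An /XFA key with no AcroForm fields is the XFA bucket.
  if (probe.hasXfaKey) return "xfa";
  // Everything else is a static PDF with printed blanks.
  return "flat";
}

// Only non-native sources keep a preserved copy under the _original.pdf name.
function originalPdfName(base: string, kind: SourceKind): string | null {
  return kind === "native" ? null : `${base}_original.pdf`;
}
```

For example, `classifySource({ fileName: "notice.pdf", acroFieldCount: 0, hasXfaKey: false })` yields `"flat"`, and `originalPdfName("b101_2024-06-22", "flat")` yields `"b101_2024-06-22_original.pdf"`.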


What Each JSON Owns

Each file has one job. The boundaries are deliberate — they let us regenerate downstream artifacts (bindings, validations, filled PDFs) without re-reading the PDF, and they keep court-provided text (helptext, tooltips) separate from operator-authored interpretation (hints, schema mappings).

.meta.json — Form identity

Mechanical, generated by extract-fields.mjs, completed by the operator after the visual read.

{
  "number": "101",
  "name": "Voluntary Petition for Individuals Filing for Bankruptcy",
  "effectiveDate": "2024-06-22",
  "pages": 9,
  "source": "native",
  "schema": "bankruptcy"
}

schema declares which schema's vocabulary the form's bindings/validations reference, making each form folder self-contained. name is set by the operator from the PDF title (the extractor leaves it null).

.fields.json — PDF-native field shape

The source of truth for AcroField keys, positions, and PDF-native metadata. Generated by extract-fields.mjs after AcroFields exist (immediately for native, after injection for the other three sources).

{
  "fields": [
    {
      "key": "Debtor1.First name",
      "type": "text",
      "page": 1,
      "rect": { "x": 172, "y": 407, "w": 195, "h": 13 },
      "options": null,
      "flags": { "required": true },
      "maxLen": 60,
      "format": null,
      "align": "left",
      "defaultValue": null,
      "altName": "Debtor 1 first name (required)"
    },
    {
      "key": "Check Box5",
      "type": "checkbox",
      "page": 1,
      "rect": { "x": 379, "y": 700, "w": 12, "h": 11 },
      "options": ["Presumption", "No Presumption"],
      "widgets": [
        { "onValue": "Presumption",    "page": 1, "rect": { "x": 379, "y": 700, "w": 12, "h": 11 } },
        { "onValue": "No Presumption", "page": 1, "rect": { "x": 379, "y": 688, "w": 12, "h": 11 } }
      ]
    }
  ]
}

Field shape (also FormField in @dossier/core):

| Key | Source on the PDF | Meaning |
|---|---|---|
| key | /T field name (or your .acro.json key for injected fields) | Join key with knowledge.json |
| type | text / checkbox / dropdown / radio / signature / optionList | Renderer + filler discriminator |
| page, rect | Widget annotation page + /Rect | Used for live PDF preview overlay |
| options, widgets | Choice /Opt + per-widget on-values | Single field, multiple visual checkboxes |
| flags | /Ff bits | readOnly, required, multiline (text bit 13), comb (text bit 25), combo (choice bit 18), edit (choice bit 19), multiSelect (choice bit 22) |
| maxLen | /MaxLen | Maximum character count for text fields |
| format | /AA/F JavaScript (AFDate_FormatEx, AFNumber_Format, AFPercent_Format, AFSpecial_Format, AFTime_Format) or XFA <picture> | See Field formats below |
| align | /Q (0/1/2) | left / center / right |
| defaultValue | /DV | Pre-filled value to render when empty |
| altName | /TU | Tooltip / accessibility label |

Inheritable keys (/Ff, /MaxLen, /Q, /DV) are walked up the /Parent chain so kid widgets inherit field-level metadata.
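
The /Parent walk can be sketched as follows. This is an illustration only: `PdfFieldDict` is a simplified stand-in for whatever dictionary object the PDF library exposes, and `resolveInherited` is a hypothetical helper name, not a function from the codebase.

```typescript
// Simplified stand-in for a PDF field dictionary (assumption for illustration).
interface PdfFieldDict {
  Ff?: number;
  MaxLen?: number;
  Q?: number;
  DV?: string;
  Parent?: PdfFieldDict;
}

type InheritableKey = "Ff" | "MaxLen" | "Q" | "DV";

// Return the nearest value for `key`, starting at the widget and walking up /Parent.
function resolveInherited<K extends InheritableKey>(
  dict: PdfFieldDict,
  key: K,
): PdfFieldDict[K] | undefined {
  const seen: PdfFieldDict[] = []; // guards against a malformed, cyclic /Parent chain
  let node: PdfFieldDict | undefined = dict;
  while (node !== undefined && seen.indexOf(node) === -1) {
    const value = node[key];
    if (value !== undefined) return value; // the closest definition wins
    seen.push(node);
    node = node.Parent;
  }
  return undefined;
}
```

A kid widget that defines only /Q therefore picks up /Ff and /MaxLen from its parent field, while its own /Q still overrides the parent's.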

.acro.json — Injection spec (flat / XFA / docx only)

Tells inject-fields.mjs what AcroFields to stamp onto _original.pdf. For XFA it's auto-generated from the XFA template's positions; for flat and docx the operator authors it from a visual read.

{
  "fields": [
    { "key": "debtor_name", "type": "text", "page": 1, "rect": { "x": 100, "y": 500, "w": 200, "h": 18 } },

    { "key": "executed_date", "type": "text", "page": 1, "rect": { "x": 118, "y": 145, "w": 115, "h": 14 },
      "format": { "kind": "date", "pattern": "mm/dd/yyyy" } },

    { "key": "debtor_ssn", "type": "text", "page": 1, "rect": { "x": 200, "y": 600, "w": 110, "h": 14 },
      "format": { "kind": "ssn" }, "flags": { "comb": true, "required": true }, "maxLen": 9 },

    { "key": "claim_amount", "type": "text", "page": 1, "rect": { "x": 410, "y": 320, "w": 100, "h": 14 },
      "format": { "kind": "number", "decimals": 2, "currency": "$", "prependCurrency": true } },

    { "key": "explanation", "type": "text", "page": 1, "rect": { "x": 80, "y": 200, "w": 460, "h": 60 },
      "flags": { "multiline": true } },

    { "key": "is_amended", "type": "checkbox", "page": 1, "rect": { "x": 450, "y": 700, "w": 12, "h": 12 } }
  ]
}

.acro.json is the only place the operator writes PDF-native shape directly — for native PDFs the original AcroForm authors did this work for us, and extract-fields.mjs reads it into .fields.json. The two files carry the same vocabulary so anything you can author in .acro.json (every entry in the table above) survives extraction unchanged.

Field formats — the recent .acro.json work

The format object is the PDF-native value-format declaration. It comes from Acrobat's /AA/F JavaScript on native PDFs, from XFA <picture> on XFA, and from operator authoring on flat/docx. Eight kinds are supported, all decoded into the same FieldFormat shape:

| format.kind | Source on native PDFs | Visual cue on flat PDFs | Notes |
|---|---|---|---|
| date | AFDate_FormatEx("mm/dd/yyyy") / AFDate_Format(N) | "Date: __/__/____" or "MM / DD / YYYY" mask | pattern is a literal mask: mm/dd/yyyy, m/d/yyyy, yyyy-mm-dd, mm-dd-yyyy, mmm d, yyyy |
| time | AFTime_Format(N) | "HH:MM AM/PM" label | sepStyle 0..3 controls 12/24h preset |
| number | AFNumber_Format(decimals, sepStyle, negStyle, _, currency, prepend) | "$____.__" or Amount column with comb boxes | sepStyle 0..3 (1,234.56 / 1234.56 / 1.234,56 / 1234,56), negStyle 0..3 (Minus / Red / Parens / ParensRed), currency symbol, prependCurrency true/false |
| percent | AFPercent_Format(decimals, sepStyle) | "%" symbol or "Rate" column | Same sepStyle table; suffix % |
| ssn | AFSpecial_Format(3) | "Social Security Number: ___-__-____" | Stored as digits-only, displayed 999-99-9999 |
| phone | AFSpecial_Format(2) | "(___) ___-____" or labeled "Phone" with mask | Stored as digits-only, displayed (999) 999-9999 |
| zip | AFSpecial_Format(0) | "ZIP: _____" | 5 digits |
| zip4 | AFSpecial_Format(1) | "_____-____" | 5+4, displayed 99999-9999 |

This table is the contract between three layers:

  1. Extraction / authoring: extract-fields.mjs parses /AA/F JavaScript into a format object; flat-PDF authors hand-write the same object into .acro.json.
  2. Data entry UI: mapFieldType (packages/client/src/pages/case-tabs/data-derive.ts) maps the kind to a typed input. MaskedInput renders SSN/phone/zip/zip4 with caret-preserving mask handling and stores digits-only. Date / number / percent fall through to typed inputs and store ISO / numeric values.
  3. PDF filler: formatValueForField (packages/core/src/pdf/format-value.ts) is called before every setText in pdf-filler.ts. It re-applies the format on render so server-rendered or flattened PDFs match what Acrobat would have produced at runtime via /AA/F.
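
To make the first and third layers concrete, here is a minimal sketch covering only the date and AFSpecial_Format kinds. It is not the real extractor or formatValueForField: `SketchFormat`, `parseFormatAction`, and `displaySpecial` are hypothetical names, and the regular expressions assume the simple call shapes shown in the table.

```typescript
// Hypothetical, reduced version of the FieldFormat shape (date + special kinds only).
type SpecialKind = "ssn" | "phone" | "zip" | "zip4";
type SketchFormat = { kind: "date"; pattern: string } | { kind: SpecialKind };

// AFSpecial_Format(N): 0 = zip, 1 = zip4, 2 = phone, 3 = ssn.
const SPECIAL_KINDS = ["zip", "zip4", "phone", "ssn"] as const;

// Layer 1: turn an /AA/F JavaScript string into a format object.
function parseFormatAction(js: string): SketchFormat | null {
  const special = /AFSpecial_Format\(\s*(\d)\s*\)/.exec(js);
  if (special) {
    const kind = SPECIAL_KINDS[Number(special[1])];
    return kind ? { kind } : null;
  }
  const date = /AFDate_FormatEx\(\s*"([^"]+)"\s*\)/.exec(js);
  if (date) return { kind: "date", pattern: date[1] ?? "" };
  return null; // other kinds are out of scope for this sketch
}

// Layer 3: re-apply the display mask to a digits-only stored value.
function displaySpecial(kind: SpecialKind, stored: string): string {
  const d = stored.replace(/\D/g, "");
  if (kind === "ssn" && d.length === 9) {
    return `${d.slice(0, 3)}-${d.slice(3, 5)}-${d.slice(5)}`;
  }
  if (kind === "phone" && d.length === 10) {
    return `(${d.slice(0, 3)}) ${d.slice(3, 6)}-${d.slice(6)}`;
  }
  if (kind === "zip4" && d.length === 9) {
    return `${d.slice(0, 5)}-${d.slice(5)}`;
  }
  if (kind === "zip" && d.length === 5) return d;
  return stored; // wrong digit count: leave the value untouched
}
```

Returning the stored value unchanged on a digit-count mismatch is a design choice of this sketch, so a partially entered value is never silently reshaped.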

flags is the matching contract for boolean PDF behavior:

| flags.* | PDF source | Used by |
|---|---|---|
| readOnly | /Ff bit 1 | UI disables the input |
| required | /Ff bit 2 | Renderer shows the asterisk; generate-bindings.mjs synthesizes a required-field validation per bound schema path |
| multiline | /Ff bit 13 (text only) | Renderer picks <textarea> |
| comb | /Ff bit 25 (text only) | Renderer renders per-character boxes for the maxLen count |
| combo / edit / multiSelect | /Ff bits 18 / 19 / 22 (choice only) | Renderer picks dropdown vs list, allows free-text vs strict, single vs multi |
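
Decoding these bits is mechanical once the 1-based numbering is accounted for. The sketch below is illustrative: `decodeFlags`, `ffMask`, and the `fieldClass` parameter are hypothetical names, and the real extractor may structure this differently.

```typescript
interface DecodedFlags {
  readOnly: boolean;
  required: boolean;
  multiline?: boolean;
  comb?: boolean;
  combo?: boolean;
  edit?: boolean;
  multiSelect?: boolean;
}

// PDF flag positions are 1-based, so bit N corresponds to the mask 1 << (N - 1).
function ffMask(bit: number): number {
  return 1 << (bit - 1);
}

function decodeFlags(ff: number, fieldClass: "text" | "choice" | "other"): DecodedFlags {
  const has = (bit: number): boolean => (ff & ffMask(bit)) !== 0;
  const flags: DecodedFlags = { readOnly: has(1), required: has(2) };
  if (fieldClass === "text") {
    // Bits 13 and 25 only carry meaning on text fields.
    flags.multiline = has(13);
    flags.comb = has(25);
  }
  if (fieldClass === "choice") {
    // Bits 18, 19 and 22 only carry meaning on choice fields.
    flags.combo = has(18);
    flags.edit = has(19);
    flags.multiSelect = has(22);
  }
  return flags;
}
```

Gating the type-specific bits on `fieldClass` matters because the same bit position means different things on different field types.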

maxLen, defaultValue, and altName round out the metadata the renderer surfaces (character cap, pre-filled value, tooltip).

Field overlay — promotion to the schema layer

A single schema key is often bound to many forms (a debtor's SSN appears on 30+ bankruptcy forms), and each form might author a different format. To reconcile them, aggregateBindingsForCase produces a FieldOverlay keyed by schema path:

interface FieldOverlayEntry {
  required?: boolean              // any bound form requires it
  readOnly?: boolean              // every bound form is readOnly
  maxLen?: number                 // min across bound forms
  promotedFormatKind?: 'ssn' | 'phone' | 'zip' | 'zip4' | 'date' | 'time' | 'number' | 'percent'
}

promotedFormatKind only fires when the schema has no format of its own and every bound form's format.kind agrees. The schema-level format always wins on conflict — .acro.json is the PDF view, the schema is the canonical view.
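
The merge rules in the interface comments can be sketched for a single schema path as follows. This is not aggregateBindingsForCase itself: `BoundFieldMeta`, `OverlaySketch`, and `aggregateOverlay` are hypothetical names, and treating a form with no format as a disagreement is an assumption.

```typescript
type FormatKind = "ssn" | "phone" | "zip" | "zip4" | "date" | "time" | "number" | "percent";

// Metadata of one form field bound to the schema path (hypothetical shape).
interface BoundFieldMeta {
  required?: boolean;
  readOnly?: boolean;
  maxLen?: number;
  formatKind?: FormatKind;
}

interface OverlaySketch {
  required?: boolean;
  readOnly?: boolean;
  maxLen?: number;
  promotedFormatKind?: FormatKind;
}

function aggregateOverlay(bound: BoundFieldMeta[], schemaHasFormat: boolean): OverlaySketch {
  const entry: OverlaySketch = {};
  if (bound.length === 0) return entry;

  // required: any bound form requires it.
  if (bound.some((f) => f.required === true)) entry.required = true;
  // readOnly: every bound form is readOnly.
  if (bound.every((f) => f.readOnly === true)) entry.readOnly = true;

  // maxLen: the tightest cap across bound forms that declare one.
  const caps = bound
    .map((f) => f.maxLen)
    .filter((n): n is number => typeof n === "number");
  if (caps.length > 0) entry.maxLen = Math.min(...caps);

  // Promote a format kind only if the schema has none and all forms agree.
  const first = bound[0]?.formatKind;
  if (!schemaHasFormat && first !== undefined && bound.every((f) => f.formatKind === first)) {
    entry.promotedFormatKind = first;
  }
  return entry;
}
```

Taking the minimum maxLen keeps a value valid on every form it is routed to, at the cost of being stricter than some individual forms require.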

.xfa.json — XFA template metadata (XFA only)

Raw template data extracted from the XFA stream: field names, types, dropdown options, tooltips, validation rules, dynamic flag for fields with no fixed position. Used during knowledge authoring as a source of court-provided text: tooltip → helptext, options → note, format masks → hint. Never feeds the runtime directly.

.docx.json — Word content-control metadata (DOCX only)

Field names, types, and dropdown options extracted from the Word content controls. Same role as .xfa.json — informs .acro.json field naming and enriches knowledge.json.

.knowledge.json — Semantic mapping (operator-authored)

The hard step. The operator visually reads every page and writes the meaning of each AcroField, the schema path it maps to, the printed instruction text from the form, and a paralegal-facing hint. Field keys are the join key with fields.json.

{
  "form": "b101",
  "version": "2024-06-22",
  "title": "Voluntary Petition for Individuals Filing for Bankruptcy",
  "description": "Filed by individuals to initiate a bankruptcy case.",
  "sections": [
    {
      "part": 1,
      "title": "Identify Yourself",
      "pages": [1, 2],
      "lines": [
        {
          "line": "1",
          "question": "Your full name",
          "fields": {
            "Debtor1.First name": {
              "meaning": "Debtor 1 first name",
              "domain": "debtor.first_name",
              "helptext": "Your first name exactly as it appears on your government-issued photo ID.",
              "hint": "The debtor's legal first name. Used across all petition and schedule headers."
            }
          }
        }
      ]
    }
  ],
  "missing_acrofields": []
}

domain values fall into four buckets: a valid schema path (debtor.first_name, creditors[0].name), _computed (totals/summaries), _pagination (continuation indicators), or _extension.{concept}.* (form-specific data not in the schema). Schema paths are validated against flatten-schema.mjs <schema-key> output. PDF-native data — rects, formats, flags — is not in knowledge.json; that's fields.json's job.
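
A validation pass over domain values could look like the sketch below. The function name, the `"invalid"` result, and the `isSchemaPath` predicate are assumptions; in practice the predicate would be backed by the path list that flatten-schema.mjs prints, and how array indices such as `[0]` appear in that list is not specified here.

```typescript
type DomainBucket = "schema" | "computed" | "pagination" | "extension" | "invalid";

// Classify one knowledge.json domain value into its bucket.
function classifyDomain(domain: string, isSchemaPath: (path: string) => boolean): DomainBucket {
  if (domain === "_computed") return "computed";
  if (domain === "_pagination") return "pagination";
  if (/^_extension\./.test(domain)) return "extension";
  // Anything else must be a path the schema actually defines.
  return isSchemaPath(domain) ? "schema" : "invalid";
}

// Example predicate over a tiny, made-up path list.
const examplePaths = ["debtor.first_name", "creditors[0].name"];
const isExamplePath = (path: string): boolean => examplePaths.indexOf(path) !== -1;
```

With this predicate, `classifyDomain("debtor.first_name", isExamplePath)` returns `"schema"`, while a typo such as `"debtor.frist_name"` returns `"invalid"` and can be reported before bindings are generated.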

.bindings.json — Schema → form fields (generated)

Generated by generate-bindings.mjs from knowledge.json + the current schema. Each field's domain is resolved to a schema path; the script writes one binding per resolved field, plus an extensions[] list (_extension.*) and a skipped[] list (_computed, _pagination). It also enriches fields.json with label (from knowledge meaning) and hint (from knowledge hint). Re-runnable any time the schema changes — the operator never touches the PDF again.

{
  "bindings": [
    { "source": "debtor.first_name", "targets": ["Debtor1.First name"] }
  ],
  "extensions": [
    { "domain": "_extension.service.method", "fields": ["service_method"] }
  ],
  "skipped": []
}
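
The grouping that produces this file can be sketched as below. This is a simplified illustration, not generate-bindings.mjs: it takes a flat map of field key to domain rather than the nested knowledge.json, `buildBindings` and `isSchemaPath` are hypothetical names, and the shape of skipped[] entries (plain field keys) is an assumption because the example above shows only an empty list.

```typescript
interface BindingsFileSketch {
  bindings: { source: string; targets: string[] }[];
  extensions: { domain: string; fields: string[] }[];
  skipped: string[]; // assumed to hold field keys
}

function buildBindings(
  fieldDomains: Record<string, string>,
  isSchemaPath: (path: string) => boolean,
): BindingsFileSketch {
  const bySource: Record<string, string[]> = {};
  const byExtension: Record<string, string[]> = {};
  const skipped: string[] = [];

  for (const fieldKey of Object.keys(fieldDomains)) {
    const domain = fieldDomains[fieldKey] ?? "";
    if (domain === "_computed" || domain === "_pagination") {
      skipped.push(fieldKey);
    } else if (/^_extension\./.test(domain)) {
      (byExtension[domain] = byExtension[domain] || []).push(fieldKey);
    } else if (isSchemaPath(domain)) {
      // Several fields may share one schema path; they become targets of one binding.
      (bySource[domain] = bySource[domain] || []).push(fieldKey);
    } else {
      throw new Error(`Unknown domain "${domain}" on field "${fieldKey}"`);
    }
  }

  return {
    bindings: Object.keys(bySource).map((source) => ({ source, targets: bySource[source] ?? [] })),
    extensions: Object.keys(byExtension).map((domain) => ({ domain, fields: byExtension[domain] ?? [] })),
    skipped,
  };
}
```

Because the function reads only the field-to-domain mapping and the schema, it can be re-run after any schema change without reopening the PDF.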

.validations.json — Expression-based filing rules

Operator-authored expression rules that gate filing. References schema domain paths (not AcroField keys) so they survive form revisions.

{
  "validations": [
    { "expression": "debtor.full_name != ''", "severity": "error",   "description": "Debtor name is required" },
    { "expression": "!(creditors[0].name != '') || creditors[0].claim_amount != ''",
      "severity": "error", "description": "Creditor 1 claim amount is required when name is provided" }
  ]
}

Required-field validations are now synthesized automatically by generate-bindings.mjs from flags.required on bound fields, so validations.json is reserved for cross-field logic, conditional requireds, and array-slot consistency rules that can't be derived mechanically.
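
As an illustration of that synthesis, the sketch below emits one rule per bound schema path that has at least one required target field. It is an assumption-laden sketch rather than the generate-bindings.mjs implementation: `synthesizeRequired` is a hypothetical name, and the generated expression and description simply mirror the hand-written examples above.

```typescript
interface RequiredRuleSketch {
  expression: string;
  severity: "error";
  description: string;
}

function synthesizeRequired(
  bindings: { source: string; targets: string[] }[],
  fields: { key: string; flags?: { required?: boolean } }[],
): RequiredRuleSketch[] {
  // Keys of every field whose PDF-native flags mark it as required.
  const requiredKeys = fields
    .filter((f) => f.flags?.required === true)
    .map((f) => f.key);

  return bindings
    .filter((b) => b.targets.some((t) => requiredKeys.indexOf(t) !== -1))
    .map((b) => ({
      expression: `${b.source} != ''`, // assumed rule syntax, copied from the examples above
      severity: "error" as const,
      description: `${b.source} is required`,
    }));
}
```

The rule is keyed by schema path, not field key, so a path bound to several required fields still yields a single validation.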


Five Paths Through The Pipeline

The pipeline is a single directed graph; the source kind picks the entry point. Once fields.json exists, every path converges on the same three closing steps (knowledge → bindings → validations).

Native AcroForm

extract-fields.mjs
  → meta.json (source: "native") + fields.json   (full PDF-native metadata captured)
→ knowledge.json   (operator: visual read, semantic mapping)
→ update meta.json (operator: name + schema)
→ generate-bindings.mjs  → bindings.json
→ validations.json (operator: cross-field rules)

XFA (/XFA key, no AcroForm fields)

extract-fields.mjs
  → meta.json (source: "xfa") + _original.pdf + .xfa.json + .acro.json (auto from XFA template)
→ [operator reviews .acro.json, adds dynamic fields if any]
→ inject-fields.mjs  → working .pdf
→ extract-fields.mjs (re-run)  → fields.json
→ knowledge.json   (cross-reference .xfa.json for tooltips, options, format masks)
→ update meta.json
→ generate-bindings.mjs  → bindings.json
→ validations.json

Flat (printed blanks, no AcroFields, no XFA)

extract-fields.mjs
  → meta.json (source: "flat") + _original.pdf rename
→ [operator: visual read → author .acro.json with rect + format + flags + maxLen]
→ inject-fields.mjs  → working .pdf
→ extract-fields.mjs (re-run)  → fields.json
→ knowledge.json   (semantic notes from the same single visual pass)
→ update meta.json
→ generate-bindings.mjs  → bindings.json
→ validations.json

DOCX (court-issued Word document)

extract-fields.mjs
  → .docx.json (content-control field names + types + dropdown options)
  → _original.pdf (LibreOffice headless conversion)
  → meta.json (source: "docx")
→ [operator: visual calibration → author .acro.json using .docx.json field keys]
→ inject-fields.mjs  → working .pdf
→ extract-fields.mjs (re-run)  → fields.json
→ knowledge.json   (cross-reference .docx.json for field names + options)
→ update meta.json
→ generate-bindings.mjs  → bindings.json
→ validations.json

The .docx file is preserved alongside _original.pdf — both are court-issued sources.

Flat informational (no fillable elements at all)

extract-fields.mjs
  → meta.json (source: "flat") + _original.pdf rename
→ knowledge.json   (0 fields, sections document content structure)
→ update meta.json
→ [no bindings, no validations]

Notice forms, certificate-of-service templates, and other purely informational PDFs land here. missing_acrofields records that the form has no fillable elements.


Tools

| Script | Purpose |
|---|---|
| domain_tools/scripts/extract-fields.mjs | Detect source kind, extract AcroForm/XFA/DOCX fields, write meta.json + fields.json (+ .acro.json + .xfa.json + .docx.json as applicable). Handles DOCX→PDF via LibreOffice automatically. |
| domain_tools/scripts/inject-fields.mjs | Stamp AcroFields onto _original.pdf from .acro.json to produce the working .pdf. |
| domain_tools/scripts/flatten-schema.mjs | Print every valid dot-notation path in a registered schema. Operator pipes this into a tmpfile and validates knowledge.json domain values against it. |
| domain_tools/scripts/generate-bindings.mjs | Resolve knowledge.json domains against the schema; write .bindings.json; enrich .fields.json with label/hint; synthesize required-field validations. Re-runnable any time. |
| domain_tools/scripts/verify-bindings.mjs | Cross-check every binding against the current schema, fields, and knowledge. |
| domain_tools/scripts/generate-groups.mjs | Build composite-form group definitions from .knowledge.json group files. |

| Operator playbook | Use |
|---|---|
| domain_tools/prompts/process-form.md | Reusable prompt for processing one leaf form end-to-end (knowledge + bindings + validations + meta). |
| domain_tools/prompts/process-group.md | Reusable prompt for building a composite form (children + cross-bindings + validations). |
| domain_tools/prompts/agent_forms_guide.md | Detailed pipeline reference (every step, every file format, verification checklist). |

Why This Layering Survives Schema And Form Changes

The split between PDF-native (.fields.json, .acro.json) and semantic (.knowledge.json) is what makes the rest cheap. When the schema changes — paths added, renamed, restructured — only bindings.json is regenerated. When a new court PDF version drops, only meta.json + fields.json (or .acro.json + injection) are redone, and knowledge.json is updated for any added/removed fields. Validations live on schema paths, not field keys, so they survive both.

Every artifact has exactly one author and one regeneration path. That's why we can keep ~330 knowledge files in lockstep with a 2,200-path schema across five chapters of bankruptcy and have the same scripts handle immigration, tax, or insurance the moment a new schema is registered.

Source: docs/form-pipeline.md