Extraction reviews

Matcher can extract transaction candidates from documents and propose field mappings using AI — but AI output is never authoritative. Nothing is reconciled until a human approves it. This guide covers the human-in-the-loop (HITL) extraction-review queue, AI mapping proposals, and the related job actions.

The document-extraction lane is gated by a global kill-switch and a per-tenant opt-in. A tenant that has not opted in receives 403 before any document bytes are stored or egressed.
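The two gates can be pictured as a pre-check that runs before the request body is read. The sketch below is illustrative only: the function name, its arguments, the order of the two checks, and the mapping of the kill-switch to 503 are assumptions rather than Matcher's implementation. Only the 403 for a tenant that has not opted in is stated in this guide.

```python
from typing import Optional, Set


def extraction_gate_status(
    extraction_enabled: bool,    # hypothetical global kill-switch flag
    opted_in_tenants: Set[str],  # hypothetical per-tenant opt-in registry
    tenant_id: str,
) -> Optional[int]:
    """Return the HTTP status to reject with, or None if the upload may proceed."""
    # Assumption: the deployment-wide switch is checked first and maps to 503.
    if not extraction_enabled:
        return 503
    # Documented: a tenant that has not opted in receives 403.
    if tenant_id not in opted_in_tenants:
        return 403
    # Only now would the document bytes be read and stored.
    return None
```

Because the check returns before any body handling, a rejected request never reaches storage or an external model.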

Enqueue a document for extraction

Upload a source document (PDF) to run deterministic and AI extraction. The resulting transaction candidates are queued in a review; nothing is reconciled yet.

curl -X POST "https://api.matcher.example.com/v1/imports/contexts/{contextId}/sources/{sourceId}/extract-document" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/pdf" \
  --data-binary @statement.pdf

The response (202 Accepted) returns the queued review ID, the candidate count, and a status that is always PENDING_REVIEW on enqueue:

{
  "reviewId": "550e8400-e29b-41d4-a716-446655440000",
  "candidateCount": 12,
  "status": "PENDING_REVIEW"
}
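For clients that do not shell out to curl, the same call can be sketched with Python's standard library. The helper names build_extract_request and parse_enqueue_response are hypothetical; the URL, headers, and response fields are the ones shown above.

```python
import json
import urllib.request

BASE_URL = "https://api.matcher.example.com"


def build_extract_request(
    context_id: str, source_id: str, pdf_bytes: bytes, token: str
) -> urllib.request.Request:
    """Build (but do not send) the extract-document request."""
    url = (
        f"{BASE_URL}/v1/imports/contexts/{context_id}"
        f"/sources/{source_id}/extract-document"
    )
    return urllib.request.Request(
        url,
        data=pdf_bytes,  # raw PDF bytes, the equivalent of curl --data-binary
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/pdf",
        },
    )


def parse_enqueue_response(body: bytes) -> dict:
    """Decode the 202 body and check that the status is PENDING_REVIEW."""
    payload = json.loads(body)
    if payload["status"] != "PENDING_REVIEW":
        raise ValueError(f"unexpected status on enqueue: {payload['status']}")
    return payload
```

Passing the request to urllib.request.urlopen sends it; the returned body can then go through parse_enqueue_response.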

The review queue

List reviews

Returns a cursor-paginated list of extraction reviews for a context, optionally filtered by lifecycle status.

curl -X GET "https://api.matcher.example.com/v1/imports/contexts/{contextId}/extraction-reviews?status=PENDING_REVIEW&limit=50" \
  -H "Authorization: Bearer $TOKEN"

Query parameters: status (PENDING_REVIEW, APPROVED, REJECTED), limit (1–200), and cursor.
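A client typically walks the queue by following the cursor until it runs out. The sketch below assumes the list response has an items array and a nextCursor field; neither name is confirmed by this guide, so adjust them to the actual payload. fetch_page is a hypothetical callable that performs the GET with the given query parameters and returns the decoded JSON.

```python
from typing import Callable, Dict, Iterator, Optional

VALID_STATUSES = {"PENDING_REVIEW", "APPROVED", "REJECTED"}


def iter_reviews(
    fetch_page: Callable[[Dict[str, str]], dict],
    status: Optional[str] = None,
    limit: int = 50,
) -> Iterator[dict]:
    """Yield every review, following the cursor page by page."""
    # Validate locally what the API would otherwise reject with 400.
    if status is not None and status not in VALID_STATUSES:
        raise ValueError(f"invalid status filter: {status}")
    if not 1 <= limit <= 200:
        raise ValueError("limit must be between 1 and 200")

    params: Dict[str, str] = {"limit": str(limit)}
    if status is not None:
        params["status"] = status

    while True:
        page = fetch_page(dict(params))
        yield from page.get("items", [])      # assumed field name
        cursor = page.get("nextCursor")       # assumed field name
        if not cursor:
            return
        params["cursor"] = cursor
```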

Get one review

curl -X GET "https://api.matcher.example.com/v1/imports/contexts/{contextId}/extraction-reviews/{reviewId}" \
  -H "Authorization: Bearer $TOKEN"

A review carries its lifecycle, the proposed candidates, provenance, and linkage state:

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "contextId": "550e8400-e29b-41d4-a716-446655440000",
  "sourceId": "550e8400-e29b-41d4-a716-446655440000",
  "status": "PENDING_REVIEW",
  "candidates": [
    {
      "source": "text_layer",
      "fields": [
        { "canonicalKey": "amount", "value": "100.50", "confidence": 0.95, "page": 1 },
        { "canonicalKey": "date", "value": "2025-06-01", "confidence": 0.9, "page": 1 }
      ]
    }
  ],
  "version": 1,
  "createdAt": "2025-01-15T10:30:00Z",
  "updatedAt": "2025-01-15T10:30:00Z"
}

Each candidate declares the lane that produced it: text_layer (PDF text, higher trust) or vision (OCR/vision model, lower trust). Field values are verbatim tokens — money stays a string, never a parsed amount.
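A reviewer-side tool might use the lane and confidence to decide which fields to highlight. The following is a minimal sketch, assuming per-lane thresholds that are entirely hypothetical (this guide does not define any), with a stricter bar for the lower-trust vision lane.

```python
from decimal import Decimal
from typing import List, Tuple

# Hypothetical thresholds; tune them to your own review policy.
LANE_THRESHOLDS = {"text_layer": 0.92, "vision": 0.97}


def fields_needing_attention(review: dict) -> List[Tuple[int, str]]:
    """Return (candidate index, canonicalKey) for fields below the lane threshold."""
    flagged = []
    for index, candidate in enumerate(review["candidates"]):
        threshold = LANE_THRESHOLDS[candidate["source"]]
        for field in candidate["fields"]:
            if field["confidence"] < threshold:
                flagged.append((index, field["canonicalKey"]))
    return flagged


def amount_for_display(value: str) -> Decimal:
    """Parse a verbatim amount token exactly, for display or comparison only.

    Works only for tokens that use a period as the decimal separator.
    Decimal keeps the digits as written, which a float would not guarantee.
    """
    return Decimal(value)
```

Applied to the sample review above, only the date field (confidence 0.9) falls under the text_layer threshold.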

Approve or reject

Approve

Approving a PENDING_REVIEW review runs the single deterministic handoff into the normal ingestion pipeline (dedup + outbox + match-trigger) and links the resulting job to the review. This is the only path from an AI candidate to a reconciled transaction, and it runs only on explicit human approval.

curl -X POST "https://api.matcher.example.com/v1/imports/contexts/{contextId}/extraction-reviews/{reviewId}/approve" \
  -H "Authorization: Bearer $TOKEN"

{
  "reviewId": "550e8400-e29b-41d4-a716-446655440000",
  "ingestionJobId": "550e8400-e29b-41d4-a716-446655440000",
  "candidateCount": 12
}

Reject

Rejecting discards the candidates — nothing is ingested. The body is optional; an empty body is a valid “reject with no reason”.

curl -X POST "https://api.matcher.example.com/v1/imports/contexts/{contextId}/extraction-reviews/{reviewId}/reject" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "reason": "poor scan quality, re-upload" }'

The approving/rejecting principal is recorded for audit.
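The lifecycle behaves like a small state machine, and a 409 signals a transition it does not allow. The sketch below models that on the client side. It assumes APPROVED and REJECTED are terminal and that reject, like approve, applies only to PENDING_REVIEW reviews; the guide states this explicitly only for approve.

```python
# Assumed transition table: only PENDING_REVIEW has outgoing transitions.
ALLOWED = {
    "PENDING_REVIEW": {"APPROVED", "REJECTED"},
    "APPROVED": set(),
    "REJECTED": set(),
}

ACTION_TARGET = {"approve": "APPROVED", "reject": "REJECTED"}


class InvalidTransition(Exception):
    """Client-side analogue of the API's 409 response."""


def next_status(current: str, action: str) -> str:
    """Return the status after applying action, or raise InvalidTransition."""
    target = ACTION_TARGET[action]
    if target not in ALLOWED.get(current, set()):
        raise InvalidTransition(f"cannot {action} a review in state {current}")
    return target
```

Checking the transition locally lets a UI disable the approve and reject buttons for reviews that are already decided instead of waiting for a 409.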

Mapping proposals

Before you declare a field map by hand, ask the advisor to inspect a representative sample and propose a config-only mapping. It is advisory and side-effect-free: producing a proposal persists nothing. You confirm the result through the existing field-map declaration path.

curl -X POST "https://api.matcher.example.com/v1/imports/contexts/{contextId}/sources/{sourceId}/mapping-proposal" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sample": "id;value;ccy;posted_at\nA1;10,50;BRL;2025-06-01\n",
    "format": "csv",
    "hints": { "locale": "pt-BR", "has_header": "true" }
  }'

The response carries the proposed field map, source dialect, and a per-field breakdown with confidence and rationale:

{
  "mapping": { "amount": "value", "external_id": "id" },
  "dialect": {
    "encoding": "utf-8",
    "delimiter": "semicolon",
    "decimalStyle": "comma",
    "dateStyle": "iso"
  },
  "fields": [
    { "canonicalKey": "amount", "sourceColumn": "value", "confidence": 0.92, "rationale": "numeric column with comma decimal" }
  ]
}

The response never carries parsed values, amounts, or transactions.
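Since the proposal carries no parsed data, a client that wants to preview the mapping has to apply it to its own sample. A minimal sketch of that preview follows; preview_mapping and the DELIMITERS table are hypothetical, only the "semicolon" dialect value appears in this guide, and the sample is assumed to have a header row. Values stay verbatim strings.

```python
import csv
import io
from typing import Dict, List

# Only "semicolon" is shown in this guide; the other names are assumptions.
DELIMITERS = {"semicolon": ";", "comma": ",", "tab": "\t"}


def preview_mapping(sample: str, proposal: dict) -> List[Dict[str, str]]:
    """Re-key each sample row by canonical key, leaving values untouched."""
    delimiter = DELIMITERS[proposal["dialect"]["delimiter"]]
    reader = csv.DictReader(io.StringIO(sample), delimiter=delimiter)
    return [
        # proposal["mapping"] maps canonical key -> source column name
        {canonical: row[column] for canonical, column in proposal["mapping"].items()}
        for row in reader
    ]
```

With the request and response shown above, the single sample row previews as amount "10,50" and external_id "A1"; the comma decimal is left as written, since interpreting it is the ingestion pipeline's job.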

Fetch from an external transport

Trigger a manual fetch-and-ingest that lists every object matching the supplied transport coordinates (SFTP today) and streams each into the trusted-content ingestion pipeline. The body carries connection coordinates plus an opaque credential reference — never a secret.

curl -X POST "https://api.matcher.example.com/v1/imports/contexts/{contextId}/sources/{sourceId}/fetch" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "kind": "sftp",
    "host": "sftp.bank.example",
    "port": 22,
    "path": "outbound/returns",
    "glob": "*.ret",
    "credentialRef": "cred-handle-123",
    "format": "br/cnab240/febraban-base"
  }'

The response (202 Accepted) returns a per-file outcome in fetch order. Per-file intake failures are reported without failing the batch:

{
  "files": [
    { "name": "statement-2025-06.ret", "ingestionJobId": "550e8400-...", "transactionCount": 42 }
  ]
}

A transport-level failure (endpoint unreachable or credential rejected) returns 503.
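Because the batch succeeds even when individual files fail, callers should inspect every entry in files. This guide does not show what a failed entry looks like, so the sketch below rests on an assumption: a file that was not ingested has no ingestionJobId. Verify that against real responses before relying on it.

```python
from typing import Dict, List, Union


def summarize_fetch(response: dict) -> Dict[str, Union[List[str], int]]:
    """Split per-file outcomes into ingested and failed, in fetch order."""
    ingested: List[str] = []
    failed: List[str] = []
    transactions = 0
    for entry in response["files"]:
        # Assumption: failed intakes omit ingestionJobId.
        if entry.get("ingestionJobId"):
            ingested.append(entry["name"])
            transactions += entry.get("transactionCount", 0)
        else:
            failed.append(entry["name"])
    return {"ingested": ingested, "failed": failed, "transactions": transactions}
```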

Inspect job errors

After an import, list the stored per-row parse/normalization errors for a job (capped at 100 per job) to explain failed or partially-failed imports.

curl -X GET "https://api.matcher.example.com/v1/imports/contexts/{contextId}/jobs/{jobId}/errors" \
  -H "Authorization: Bearer $TOKEN"

{
  "items": [ ... ],
  "totalErrors": 137,
  "storedErrors": 100,
  "errorCap": 100,
  "truncated": true
}

totalErrors is the uncapped failure total, storedErrors is the number of errors actually stored (at most errorCap), and truncated is true when totalErrors exceeds storedErrors.
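The counters make it easy to tell a user how many errors are not available for inspection. A small sketch, using only the fields from the response above; the function name is hypothetical.

```python
def unstored_error_count(summary: dict) -> int:
    """Return how many row errors occurred but were not stored."""
    total = summary["totalErrors"]
    stored = summary["storedErrors"]
    if stored > summary["errorCap"]:
        raise ValueError("storedErrors exceeds errorCap")
    missing = total - stored
    # truncated should agree with the counters; fail loudly if it does not.
    if summary["truncated"] != (missing > 0):
        raise ValueError("truncated flag is inconsistent with the counters")
    return missing
```

For the example response, 137 total errors with 100 stored leaves 37 that cannot be listed.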

Response codes

Status	Meaning
`200`	Review, list, mapping proposal, or job errors returned
`202`	Document enqueued / fetch accepted
`400`	Invalid input (empty body, bad status filter, invalid pagination)
`403`	Tenant not opted into document extraction
`404`	Review or job not found
`409`	Invalid review state transition
`422`	No candidates could be extracted / mapping sample rejected downstream
`503`	Extraction, review, proposal, or fetch not enabled on this deployment; also returned for a transport-level fetch failure (endpoint unreachable or credential rejected)
