Documents

Document channels (folder tree), per-channel pipelines and metadata extraction, and the PaddleOCR-VL parsing pipeline driven by openkms-cli.

Document channels and CRUD

Feature Status Description
Document overview Dashboard at /documents with channel count, document count (from API stats), quick actions
Channel management Create channels at /documents/channels (tree structure); rename, description, move, merge, delete; settings per channel
Document channel view Browse documents by channel at /documents/channels/:channelId; list from GET /api/documents?channel_id=
Channel settings Per-channel pipeline, auto-process, metadata extraction (model + schema, supports object_type/list[object_type]), manual labels config at /documents/channels/:channelId/settings; tabbed UI (General, Processing, Metadata extraction, Manual Labels)
Document upload Upload to channel via modal (choose files, drag-and-drop); POST /api/documents/upload with channel_id; stores file to S3. Accepted types: PDF, PNG/JPG/JPEG/WEBP, DOCX, PPTX, XLSX. XLSX: preview rows + markdown built at upload (status=completed or failed if unreadable); no VLM pipeline. DOCX/PPTX (and other pipeline types): status=uploaded until processed
Document processing Process button on list/detail → POST /api/jobs. XLSX jobs call run_spreadsheet_preview (no channel pipeline required). Other extensions use the channel’s Paddle pipeline when configured; auto-process enqueues the same job for non-XLSX files
Document status Status badge (uploaded/pending/running/completed/failed) on document list and detail
Document detail View parsed Markdown at /documents/view/:id; Document Information: 3-column stats (Type, Size, Uploaded | Status, Markdown, File hash | Version panel with Versions + conditional Save version when working copy changed since last snapshot); METADATA section includes Lineage & lifecycle below Extract (collapsed by default; expands for series, relationships, lifecycle, dates, and read-only Applicable); right panel: Markdown | Page Index (refresh parses markdown to tree; Page Index hidden for XLSX); XLSX: left panel Workbook with sheet tabs and scrollable grid from parsing_result; explicit versions (document_versions) not created on routine save; scrollable layout (min-height 720px)
Document markdown edit Edit/View toggle, textarea for markdown, Save (PUT /markdown; rebuilds page index), Restore from S3 (POST /restore-markdown; rebuilds page index); POST /rebuild-page-index for manual rebuild from current markdown; Save in the panel header uses the same outlined control style as Edit/Cancel with accent emphasis (icon + label)
Document versions User-triggered checkpoints: POST /documents/{id}/versions snapshots current markdown and metadata (optional tag in API); list, preview, restore (POST .../versions/{vid}/restore); optional save-current before restore; Save as version modal (optional tag)
Document metadata extraction Single METADATA section on detail page; Extract button uses channel's LLM; configurable schema per channel (key, label, type: text/date/enum/object_type/list[object_type], description); object_type_extraction_max_instances limits instance count for extraction
Document info & metadata edit Edit document name and channel (PUT /api/documents/{id}); Edit metadata fields inline (PUT /metadata); Move document to channel via modal
Document metadata (unified) All metadata (extracted + manual) is stored in a single metadata JSONB; manual labels are configured in the channel settings Manual Labels tab (type: object_type or list[object_type]); object-instance pickers in METADATA section
Channel description Channel description shown on channel page; stored in document_channels.description
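
The upload and processing rows above imply a routing rule keyed on file extension. The following is a minimal Python sketch of that rule, not the backend's code: the function names (`initial_status`, `processing_route`) and the route label `"channel_pipeline"` are hypothetical, and returning `None` when the channel has no pipeline configured is an assumption.

```python
from pathlib import PurePosixPath
from typing import Optional

# Accepted upload types as listed in the Document upload row.
ACCEPTED_EXTENSIONS = {"pdf", "png", "jpg", "jpeg", "webp", "docx", "pptx", "xlsx"}


def file_extension(filename: str) -> str:
    """Return the lowercase extension without the leading dot."""
    return PurePosixPath(filename).suffix.lower().lstrip(".")


def initial_status(filename: str, spreadsheet_readable: bool = True) -> str:
    """Status right after upload (hypothetical helper).

    XLSX is previewed at upload time, so it ends as completed or failed;
    every other accepted type waits as uploaded until it is processed.
    """
    ext = file_extension(filename)
    if ext not in ACCEPTED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {ext!r}")
    if ext == "xlsx":
        return "completed" if spreadsheet_readable else "failed"
    return "uploaded"


def processing_route(filename: str, channel_has_pipeline: bool) -> Optional[str]:
    """Pick what a processing job would run (hypothetical helper)."""
    ext = file_extension(filename)
    if ext not in ACCEPTED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {ext!r}")
    if ext == "xlsx":
        # No channel pipeline is required for spreadsheets.
        return "run_spreadsheet_preview"
    # Assumption: without a configured pipeline there is nothing to run.
    return "channel_pipeline" if channel_has_pipeline else None
```

For example, `initial_status("slides.pptx")` yields `"uploaded"`, while `processing_route("report.xlsx", channel_has_pipeline=False)` still yields `"run_spreadsheet_preview"`.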

Document parsing (PaddleOCR-VL)

  • PaddleOCR-VL with mlx-vlm-server as VLM backend
  • Supports: PDF, PNG, JPG, JPEG, WEBP; DOCX and PPTX are converted to PDF with LibreOffice (soffice / libreoffice) in the worker/CLI, then parsed like PDF (Docker worker image installs writer + impress)
  • Output: Markdown, layout detection, parsing result JSON
  • Configurable: server URL, model, max concurrency
  • Channel metadata extraction during pipeline: if the extraction LLM errors (e.g. HTTP 502), the CLI logs a warning and still completes the parse so the document can reach completed; use Extract on the document page when the model is healthy
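
The last point, that an extraction failure is logged but does not fail the parse, can be sketched as follows. This is an illustration of the pattern only, assuming hypothetical `parse` and `extract` callables; it is not the CLI's actual code.

```python
import logging
from typing import Any, Callable, Optional, Tuple

logger = logging.getLogger(__name__)


def parse_then_extract(
    parse: Callable[[], Any],
    extract: Callable[[Any], dict],
) -> Tuple[Any, Optional[dict]]:
    """Run the parse, then attempt metadata extraction without letting it fail the run.

    `parse` and `extract` are hypothetical stand-ins for the parsing step
    and the LLM-backed extraction step.
    """
    # A parse failure is a real failure, so it propagates to the caller.
    result = parse()
    try:
        metadata = extract(result)
    except Exception as exc:  # e.g. the extraction LLM answering HTTP 502
        logger.warning("Metadata extraction failed; keeping parse result: %s", exc)
        metadata = None
    return result, metadata
```

When `metadata` comes back as `None`, the document can still reach completed, and extraction can be retried later from the document page.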

openkms-cli

  • CLI at openkms-cli/ built with Typer (≥0.9.0)
  • Tests: openkms-cli/tests/; run pip install -e ".[dev]" && pytest tests/ (covers VLM defaults merge and fetch wiring with mocks, parser restructure, and bbox/layout helpers; no Paddle install required)
  • Configuration: openkms_cli/settings.py (CliSettings, pydantic-settings) lists every supported env var via validation_alias; parse/pipeline/auth read through get_cli_settings(); Typer no longer duplicates env via envvar=
  • Parse: openkms-cli parse run <input> [--output dir] [--vlm-url ...]; inputs: PDF, images, DOCX, PPTX (LibreOffice conversion); VLM URL/model/key can follow GET /internal-api/models/document-parse-defaults when OPENKMS_API_URL is set, needed OPENKMS_VLM_* values are missing, and CLI auth succeeds; when OPENKMS_VLM_MODEL is set in the environment, the CLI sends ?model_name=... so the backend returns that vl/ocr row's URL and key, or the default row if there is no match
  • Pipeline: openkms-cli pipeline list (list supported pipelines); openkms-cli pipeline run --input s3://.../original.<ext> – S3 or local input (stored key preserves extension); optional --s3-prefix (defaults to file hash), --skip-upload
  • Metadata extraction: when the channel has extraction_model_id and extraction_schema, the worker passes --extract-metadata --extraction-model-name <model_name>; the CLI fetches the model config from GET /api/models/config-by-name, extracts via pydantic-ai, and writes the result with PUT /api/documents/{id}/metadata; an LLM failure does not fail the pipeline after a successful parse
  • Uses PaddleOCR-VL for parsing (optional: pip install openkms-cli[parse]); pipeline needs pip install openkms-cli[pipeline]; extraction needs pip install openkms-cli[metadata]; PageIndex tree built-in (md_to_tree uses # headings)
  • Output structure matches backend: {file_hash}/original.{ext}, result.json, markdown.md, page_index.json (when pageindex installed), layout_det_*, block_*, markdown_out/*
  • Backend integration: subprocess-invokable for async jobs
  • Extensible: developers can add new Typer subapps in app.py
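
Because the CLI is subprocess-invokable, a worker could assemble the pipeline command from the flags listed above. The sketch below is one way to do that under stated assumptions: the helper names (`build_pipeline_argv`, `run_pipeline`) are hypothetical, the example S3 URI is a placeholder, and it assumes the extraction flags are passed to `pipeline run`.

```python
import subprocess
from typing import List, Optional


def build_pipeline_argv(
    input_uri: str,
    *,
    s3_prefix: Optional[str] = None,
    skip_upload: bool = False,
    extraction_model_name: Optional[str] = None,
    executable: str = "openkms-cli",
) -> List[str]:
    """Build the argument list for `openkms-cli pipeline run` (hypothetical helper)."""
    argv = [executable, "pipeline", "run", "--input", input_uri]
    if s3_prefix is not None:
        # When omitted, the CLI defaults the prefix to the file hash.
        argv += ["--s3-prefix", s3_prefix]
    if skip_upload:
        argv.append("--skip-upload")
    if extraction_model_name:
        # Assumption: these flags belong to the same `pipeline run` call.
        argv += ["--extract-metadata", "--extraction-model-name", extraction_model_name]
    return argv


def run_pipeline(argv: List[str], timeout_s: Optional[float] = None) -> subprocess.CompletedProcess:
    """Run the CLI and capture its output; the caller inspects returncode."""
    return subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout_s, check=False
    )
```

Passing the arguments as a list (rather than a shell string) avoids quoting problems with file names. A call such as `build_pipeline_argv("s3://example-bucket/abc123/original.pdf", skip_upload=True)` only builds the list; nothing runs until `run_pipeline` is called.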