Document upload + parsing via PaddleOCR-VL; DOCX/PPTX converted with LibreOffice then parsed like PDF; XLSX preview (openpyxl) at upload + run_spreadsheet_preview job on re-process; store in S3/MinIO under {file_hash}/
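The content-addressed storage layout above can be sketched as follows; sha256 as the hash and the exact artifact file names (beyond `original.pdf`, which the pipeline CLI references) are assumptions of this sketch, not confirmed details.

```python
import hashlib

def file_hash(data: bytes) -> str:
    # Content hash used as the S3/MinIO key prefix; sha256 is an assumption.
    return hashlib.sha256(data).hexdigest()

def object_keys(data: bytes) -> dict[str, str]:
    # Illustrative artifact names under {file_hash}/ -- the real layout may differ.
    prefix = file_hash(data)
    return {
        "original": f"{prefix}/original.pdf",
        "markdown": f"{prefix}/document.md",
    }
```

Keying by content hash means re-uploading an identical file lands on the same prefix, which makes dedup and idempotent re-processing cheap.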
Document detail view with Markdown, layout images, block images; loads files via backend proxy
Document list by channel: GET /api/documents?channel_id=
Delete document: DELETE /api/documents/{id}
Document info & metadata: Edit name (PUT /api/documents/{id}), edit metadata (PUT /metadata), Extract via pydantic-ai Agent + StructuredDict
Document markdown: Edit and save (PUT /markdown; rebuilds page index in S3), restore from S3 (POST /restore-markdown; rebuilds page index), optional POST /rebuild-page-index (also triggered from Page Index tab refresh); detail page shows Save/Cancel in panel header when editing (not View toggle); Page Index tab has refresh control (tooltip: parse markdown to tree)
Document versions: document_versions table; explicit snapshots of markdown + metadata (POST /versions, GET /versions, GET /versions/{id}, POST /versions/{id}/restore); version checkpoint uses JSON field tag (DB column tag); UI: version column in Document Information (3-column stats), Save version when working copy newer than last snapshot, optional tag in Save as version modal, Versions modal as a table (Version / Tag / Saved / Actions); list/preview/restore with optional save-current-first; not created on routine markdown/metadata save
Document lifecycle & lineage: series_id, effective_from / effective_to, lifecycle_status on documents; document_relationships (supersedes, amends, implements, see_also); PATCH /lifecycle, GET/POST/DELETE /relationships; document API is_current_for_rag (computed: currently applicable for normal KB answers/indexing); default KB semantic search and kb-index (lifecycle_index_mode default current_only) respect that unless opted out; document detail Lineage & lifecycle under the METADATA block, collapsed by default (expand loads relationships)
Document metadata (unified): extracted metadata and manual labels stored in single metadata JSONB; channel extraction_schema supports object_type and list[object_type]; label_config (Manual Labels tab) maps keys to Master Data object types with type (object_type | list[object_type]); single METADATA section on document detail
Authentication: OPENKMS_AUTH_MODE=oidc (default, OIDC via issuer discovery + JWKS) or local (PostgreSQL users, /api/auth/*, CLI HTTP Basic); backend verifies JWT Bearer or session; GET /api/auth/public-config (no auth) exposes auth_mode and allow_signup only; GET /internal-api/models/document-parse-defaults (auth; optional model_name) supplies VLM base_url, model_name, and provider api_key for openkms-cli; SPA OIDC uses oidc-client-ts (VITE_OIDC_ISSUER); frontend resolves local vs OIDC from the API with VITE_AUTH_MODE as fallback; Vite proxy for /api and /internal-api in dev
User profile: /profile shows current user from GET /api/auth/me (is_admin, roles, resolved permissions, header menu). User settings/settings: personal API keys (POST/GET/DELETE /api/auth/api-keys). Console → Users & Roles: /api/admin/users with console:users; Permission management (/console/permission-management): All under Roles edits the security_permissions catalog; a named role uses checkboxes + Save role permissions (draft, no per-click API); GET /api/admin/permission-reference includes operation_key_hints; overview nudge when catalog is only all; Data security (/console/data-security/*) remains local-user–centric; group data scopes behind OPENKMS_ENFORCE_GROUP_DATA_SCOPES
Route protection: Home (/) is always reachable without sign-in (static marketing content via HomeStaticLanding); all other MainLayout routes require authentication. 401 responses whose body indicates invalid/expired JWT clear SPA session via authAwareFetch / AuthContext so the same gate appears instead of raw API error JSON
Knowledge Map & home hub: SQLAlchemy app.models.knowledge_map (KnowledgeMapNode, KnowledgeMapResourceLink → taxonomy_nodes / taxonomy_resource_links); API app.api.knowledge_map at GET /api/taxonomy/nodes/tree, node PATCH (move/reorder/edit) + link CRUD; GET /api/home/hub (taxonomy summary field in JSON + scoped document relationship work items + placeholder share requests); SPA KnowledgeMap.tsx at /knowledge-map (legacy /taxonomy redirects; sidebar above Glossaries; Tree + Node details panels with scoped refer-tos; New node modal); signed-in Home.tsx with taxonomy:read centers KnowledgeMapForceGraph (react-force-graph-2d, wiki-style pan/zoom; tree + links APIs; term → /knowledge-map?node=, resource → channel/wiki/articles); MainLayout applies app-content--home on / for hub padding; permissions taxonomy:read / taxonomy:write; feature toggle key taxonomy (Console label: Knowledge Map)
Knowledge Bases: Full CRUD, documents, FAQs (manual + LLM-generated), chunks (pgvector), semantic search with hybrid filters (metadata_filters) and optional include_historical_documents, Q&A proxy, settings (chunk_config incl. lifecycle_index_mode, faq_prompt, metadata_keys); doc_metadata propagated from documents to FAQs/chunks per metadata_keys; openkms-cli pipeline run --pipeline-name kb-index; QA Agent service (FastAPI + LangGraph)
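The `doc_metadata` propagation per `metadata_keys` amounts to whitelisting: only keys named in the KB settings are copied from the document onto its FAQs and chunks. A minimal sketch, with a hypothetical function name:

```python
def propagate_metadata(doc_metadata: dict, metadata_keys: list[str]) -> dict:
    # Keep only the keys the KB's metadata_keys setting whitelists;
    # keys missing from the document are silently skipped.
    return {k: doc_metadata[k] for k in metadata_keys if k in doc_metadata}
```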
Wiki spaces: wiki_spaces, wiki_pages, wiki_files, wiki_space_documents (+ access_group_wiki_spaces); API /api/wiki-spaces (scoped like KBs when OPENKMS_ENFORCE_GROUP_DATA_SCOPES); PageIndex; GET /api/wiki-spaces/{id}/graph; vault mirror + POST .../import/vault; paginated page list (15); GET/POST/DELETE /api/wiki-spaces/{id}/documents for channel document links (GET list: linked_at + linked document updated_at for SPA “last updated”); embedded agent: POST/GET/DELETE/PATCH /api/agent/conversations, .../messages (list by wiki space, conversation delete/title optional, GFM + auto-scroll in SPA; LangGraph read-only tools; OPENKMS_AGENT_MODEL_ID or default LLM on Models (/models)) — wiki_agent_prototype.md; openkms-cli wiki put / sync / upload-file
openkms-cli tests: openkms-cli/tests/ — pip install -e ".[dev]" && pytest tests/ (VLM defaults merge + mocked fetch; parser _restructure_pages_after_predict and layout/bbox helpers; no Paddle in test env)
Console: System settings (/console/settings) — system_settings table (system_name, default_timezone, api_base_url_note); GET /api/public/system (unauthenticated) returns trimmed system_name only; GET/PUT /api/system/settings with console:settings; sidebar title is blank until that public response, then shows openKMS when the name is empty or whitespace; users, feature toggles, object types, link types, data sources, datasets, permission management, data security (groups + resource scopes); entry gated by console:* permissions or JWT admin; per-page permissions (e.g. console:feature_toggles)
Evaluation (experimental, feature toggle): query + expected answer pairs per KB; topic column; CSV import (topic, query, answer); items list paginated (GET .../items with offset/limit, default limit 10); run types search_retrieval (hybrid search + judge) and qa_answer (KB agent + judge); persisted evaluation_runs / evaluation_run_items; list/get/delete/compare runs in API and dataset detail UI; sidebar link when evaluationDatasets enabled
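The CSV import above takes (topic, query, answer) columns; a minimal parser sketch using the stdlib, assuming the header names match those column names exactly (the real importer's validation rules are not specified here):

```python
import csv
import io

def parse_eval_csv(text: str) -> list[dict]:
    # Expect a header row with topic, query, answer; DictReader makes
    # column order irrelevant. Rows missing query or answer are dropped.
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        if row.get("query") and row.get("answer"):
            rows.append({
                "topic": row.get("topic", ""),
                "query": row["query"],
                "answer": row["answer"],
            })
    return rows
```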
Glossaries: CRUD glossaries, terms with bilingual (EN/CN) support, definition, synonyms, AI suggestion (translation + definition + synonyms), search (EN, CN, definition, synonyms), export/import; dev.sh ensures pgvector on start; backend README + dev setup doc: pgvector install, Docker/PGDG, $libdir/vector troubleshooting
Objects & Links: ontology layer (object types, link types, instances); schema in Console; user-facing browse at /ontology (overview), /objects, /links; feature toggle objectsAndLinks
Data Sources: Console → Data Sources (PostgreSQL/Neo4j connections, encrypted creds). Datasets & object/link schema admin: Ontology sidebar (/ontology/datasets, /ontology/object-types, /ontology/link-types); ontology:read/ontology:write can use the same APIs as console:datasets / console:object_types / console:link_types where wired with require_any_permission.
Docs site: mkdocs.yml (Material theme) + .github/workflows/docs.yml publish docs/ to GitHub Pages at https://yingrui.github.io/openKMS/ on every push to main that touches docs/**, mkdocs.yml, or the workflow; reader-friendly entry pages (index.md, overview.md, quickstart.md, operations/docker.md, developer/setup.md) sit on top of the existing canonical references (architecture.md, functionalities.md, development_plan.md, security.md, tech_debt.md); docs/agents.md documents where each kind of doc edit goes, mirroring .cursor/rules/*.mdc. Folder rename docs/for developer/ → docs/developer/ to keep URLs space-free.
Pages | Documents; linked-docs picker; Wiki Copilot wired to /api/agent (persisted conversations; read tools; list/delete conversations, markdown + auto-scroll in panel); wiki-skills vendored via git subtree at third-party/wiki-skills, SKILL.md content in LangGraph system prompt
wiki_space_documents + agent_* tables; link/unlink/list; SPA uses API (not sessionStorage) for links
Tool visibility while streaming: astream_events (v2) → NDJSON tool_start / tool_end / tool_error (paired by run_id); wiki panel shows compact terminal-style rows interleaved with streamed text (not all tools then all text) and expandable I/O
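Pairing `tool_start` / `tool_end` / `tool_error` by `run_id` can be folded client- or server-side roughly as below; the event field names other than the three type labels and `run_id` are assumptions of this sketch:

```python
import json

def fold_tool_events(ndjson_lines: list[str]) -> dict[str, dict]:
    # Pair tool_start / tool_end / tool_error events by run_id so the UI can
    # render one compact row per tool call, interleaved with streamed text.
    calls: dict[str, dict] = {}
    for line in ndjson_lines:
        event = json.loads(line)
        call = calls.setdefault(event["run_id"], {"status": "running"})
        if event["type"] == "tool_start":
            call["input"] = event.get("input")
        elif event["type"] == "tool_end":
            call["output"] = event.get("output")
            call["status"] = "done"
        elif event["type"] == "tool_error":
            call["error"] = event.get("error")
            call["status"] = "error"
    return calls
```

Keeping one dict entry per `run_id` is what lets a start and its matching end arrive with streamed text in between without losing the pairing.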
Design for backend integration: subprocess-invokable
Pipeline CLI: openkms-cli pipeline list (list supported pipelines), openkms-cli pipeline run --input s3://.../original.pdf (optional --s3-prefix, --skip-upload; local input supported)
Backend async job spawns CLI for document parsing (offload from API process) – via procrastinate
Pipeline metadata extraction: when channel has extraction_model_id and extraction_schema, worker passes --extract-metadata --extraction-model-name; CLI fetches config from backend config-by-name, extracts via pydantic-ai, PUTs metadata to backend
PageIndex: pipeline builds markdown→tree via built-in md_to_tree (# headings); backend GET /documents/{id}/page-index and GET /documents/{id}/section; frontend Markdown | Page Index toggle; QA agent LangGraph page_index skill (read TOC, select section, extract content)
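A heading-based markdown-to-tree builder of the kind `md_to_tree` describes can be sketched as below (a sketch of the idea, not the pipeline's actual implementation): deeper `#` headings nest under the nearest shallower one.

```python
def md_to_tree(markdown: str) -> dict:
    # Build a nested TOC from '#' headings. The synthetic root (level 0)
    # guarantees the stack never empties while popping to find a parent.
    root = {"title": "", "level": 0, "children": []}
    stack = [root]
    for line in markdown.splitlines():
        if not line.startswith("#"):
            continue
        level = len(line) - len(line.lstrip("#"))
        node = {"title": line[level:].strip(), "level": level, "children": []}
        while stack[-1]["level"] >= level:
            stack.pop()  # climb up to the nearest shallower heading
        stack[-1]["children"].append(node)
        stack.append(node)
    return root
```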
Pipeline checkpoint: after successful S3 upload, when --document-id and API auth (OIDC token or local Basic) are available, CLI PUTs parsed markdown then POST /api/documents/{id}/versions with tag: "Pipeline" (after optional metadata extraction)
Object types: schema with name, description, properties (string/number/boolean)
Object instances: CRUD under /api/object-types/{id}/objects (admin write)
Link types: schema with source/target object types
Link instances: CRUD under /api/link-types/{id}/links (admin write)
Console Object Types page: CRUD object types and properties; Edit dialog wider; property name/type read-only when editing; key_property (primary key) selector; is_master_data and display_property for document labels
Console Link Types page: CRUD link types
Ontology overview page (/ontology) – all object types and link types on one page
User-facing Objects list (/objects), Object type detail with instances (/objects/:typeId)
User-facing Links list (/links), Link type detail with instances (/links/:typeId)
Search filter on object instances
Feature toggle objectsAndLinks (gates sidebar and routes)
4b. Data Sources (Console) & Datasets / schema (Ontology)
Data sources: CRUD for PostgreSQL and Neo4j connections; credentials encrypted (Fernet)
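Fernet credential encryption at rest is a symmetric roundtrip; a minimal sketch with the `cryptography` package, leaving key management (env var vs. external KMS) out of scope and the function names hypothetical:

```python
from cryptography.fernet import Fernet

def encrypt_credentials(secret: str, key: bytes) -> bytes:
    # Fernet = AES-128-CBC + HMAC with a random IV per token, so the same
    # secret encrypts to a different token each time.
    return Fernet(key).encrypt(secret.encode())

def decrypt_credentials(token: bytes, key: bytes) -> str:
    # Raises InvalidToken if the key is wrong or the token was tampered with.
    return Fernet(key).decrypt(token).decode()
```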
Test connection: POST /api/data-sources/{id}/test
Neo4j delete all: POST /api/data-sources/{id}/neo4j-delete-all; confirmation modal in Console
Datasets: CRUD for PostgreSQL tables (schema + table) linked to data sources
List tables: GET /api/datasets/from-source/{id} for table picker
Console Data Sources page: table, Add/Edit modal, Test button
Datasets UI under Ontology (/ontology/datasets, detail /ontology/datasets/:id); legacy /console/datasets redirects
Dataset detail: click dataset → Data tab (rows with pagination, page size selector) and Metadata tab (column info)
Dataset rows/metadata API: GET /api/datasets/{id}/rows, GET /api/datasets/{id}/metadata
seed_mock_insurance_data.py: mock diseases, insurance products, relationships for demo datasets
Object types link to datasets (dataset_id); instance_count uses dataset row count when linked
Link types: cardinality (one-to-one, one-to-many, many-to-many) and optional dataset_id for many-to-many
Link types FK mapping: source_key_property, target_key_property, source_dataset_column, target_dataset_column
Many-to-many with dataset: connections read from junction table; link_count and list links from dataset
Many-to-one/one-to-many: link_count from source object type dataset where FK column is not null
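For the many-to-many case, reading connections from a junction-table dataset can be sketched as below; column handling and NULL semantics here are assumptions, with `link_count` simply the number of valid pairs:

```python
def links_from_junction(rows: list[dict], source_col: str, target_col: str):
    # Each junction row yields one (source_key, target_key) link; rows with a
    # NULL on either side are skipped, mirroring the FK-not-null counting rule.
    links = [
        (r[source_col], r[target_col])
        for r in rows
        if r.get(source_col) is not None and r.get(target_col) is not None
    ]
    return links, len(links)
```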
Index to Neo4j: Object Types and Link Types pages; Index Objects/Links buttons when Neo4j data source exists; POST /api/object-types/index-to-neo4j, POST /api/link-types/index-to-neo4j
Ontology sidebar: Ontology top-level next to Glossaries; indented subnav (Datasets, Object types, Link types, Objects, Links, Object Explorer) when on those routes; schema admin at /ontology/datasets, /ontology/object-types, /ontology/link-types
Objects & Links visible when Neo4j data source exists (hasNeo4jDataSource in feature toggles)
Object Explorer: graph view at /object-explorer (react-force-graph-2d, Cypher execution, object/link type selection)
Objects page: instances and instance_count from Neo4j; Console Object Types: counts from datasets
Links page: instances and link_count from Neo4j; Console Link Types: counts from datasets
API params: count_from_neo4j on GET /api/object-types, /object-types/{id}, /api/link-types, /link-types/{id}
GET /api/auth/public-config + SPA uses API-reported mode (compatibility with local vs central IdP); optional mismatch banner vs VITE_AUTH_MODE
Profile page /profile and fetchAuthMe; user Settings/settings for personal API keys; user admin APIs /api/admin/users + Console Users (console:users, local only)
Protect backend routes with JWT Bearer or session (or local Basic for CLI)
Operation permissions: security_permissions (catalog), security_roles, security_role_permissions, user_security_roles; require_permission + GET /api/auth/permission-catalog (from DB) + admin CRUD /api/admin/security-permissions; OIDC resolves permissions by matching JWT realm role names to security_roles.name; Console sidebar and APIs use granular console:* keys; JWT realm admin / local-cli bypass
Pattern-based access control: optional OPENKMS_ENFORCE_PERMISSION_PATTERNS_STRICT middleware; default frontend_route_patterns / backend_api_patterns per catalog key (Alembic); SPA canAccessPath from permission-catalog union; docs and .env.example updated
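A `canAccessPath`-style check over the union of route patterns can be sketched as below; glob-style matching via `fnmatch` is an assumption of this sketch, since the real matcher's pattern syntax is not specified here:

```python
from fnmatch import fnmatch

def can_access_path(path: str, granted_patterns: set[str]) -> bool:
    # granted_patterns is the union of frontend_route_patterns for every
    # catalog key the user's roles resolve to; any match grants access.
    return any(fnmatch(path, pattern) for pattern in granted_patterns)
```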
Access groups: access_groups, junctions for channels/KBs/wiki/evaluation/datasets/object_types/link_types; data resources (data_resources, access_group_data_resources) with /api/admin/data-resources + Console Data resources page; group scopes include data_resource_ids; OPENKMS_ENFORCE_GROUP_DATA_SCOPES unions legacy ID lists with resource predicates (data_scope + data_resource_policy) for local non-admin filters (no OIDC enforcement in phase 1); OIDC: console may still manage groups, scopes, and data resources; PUT group members remains local-only
Upload decoupled from parsing: stores file only, status=uploaded
Document status field: uploaded → pending → running → completed/failed
run_pipeline task: spawns openkms-cli pipeline run as subprocess; wait capped by OPENKMS_PIPELINE_TIMEOUT_SECONDS (default 1800); when channel has extraction config (model_name from ApiModel), renders extraction args into template and runs metadata extraction in CLI
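The subprocess spawn with the capped wait can be sketched as below; the return-value convention on timeout is an assumption of this sketch (real code would mark the job failed through the job store):

```python
import os
import subprocess

def run_pipeline_subprocess(cmd: list[str]) -> int:
    # Spawn openkms-cli outside the API process; the wait is capped by
    # OPENKMS_PIPELINE_TIMEOUT_SECONDS (default 1800) so a hung parse
    # cannot block the worker forever.
    timeout = int(os.environ.get("OPENKMS_PIPELINE_TIMEOUT_SECONDS", "1800"))
    try:
        return subprocess.run(cmd, timeout=timeout).returncode
    except subprocess.TimeoutExpired:
        return -1  # sentinel: timed out; caller records the job as failed
```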
Jobs API: list, detail, create, retry, delete (/api/jobs)
Jobs.tsx: real API, status filter, create job, retry, delete
JobDetail.tsx: full job detail page with timing, document link, pipeline info, rendered command, event log
Process button on document list and detail for uploaded/failed docs
Reset status button for pending/failed docs (resets to uploaded if no active jobs)
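The status lifecycle and the two buttons above imply a small transition table; this sketch is inferred from the lines above, so treat the exact allowed set as an assumption:

```python
# uploaded -> pending -> running -> completed/failed; Process re-queues
# uploaded/failed docs, Reset returns pending/failed docs to uploaded.
TRANSITIONS = {
    "uploaded": {"pending"},
    "pending": {"running", "uploaded"},
    "running": {"completed", "failed"},
    "failed": {"pending", "uploaded"},
    "completed": set(),
}

def can_transition(current: str, target: str, has_active_jobs: bool = False) -> bool:
    if target == "uploaded" and has_active_jobs:
        return False  # reset is only allowed when no job is still active
    return target in TRANSITIONS.get(current, set())
```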
Pipeline command template system: {variable} placeholders resolved at runtime
Template variables API: GET /api/pipelines/template-variables
Sonner toast system for project-wide notifications
Worker entry point: backend/worker.py
Model-aware command template: {vlm_url}, {model_name} resolved from linked ApiModel
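Resolving `{variable}` placeholders such as `{vlm_url}` and `{model_name}` at runtime can be sketched as below; leaving unknown placeholders intact rather than raising is an assumption of this sketch:

```python
import re

def render_command(template: str, variables: dict[str, str]) -> str:
    # Substitute {name} placeholders from the variables dict; unknown
    # placeholders pass through unchanged instead of raising KeyError
    # (which plain str.format would do).
    return re.sub(
        r"\{(\w+)\}",
        lambda m: variables.get(m.group(1), m.group(0)),
        template,
    )
```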
Add/remove documents to/from knowledge base (join table kb_documents)
FAQ CRUD (manual create/edit/delete)
FAQ generation from documents via LLM (POST /faqs/generate returns preview; POST /faqs/batch saves selected; UI: review step with remove unqualified before save)
Chunk model with pgvector embeddings
pgvector extension enabled in database.py
Semantic search over chunks and FAQs (POST /search)
QA proxy to external agent service (POST /ask)
KB settings: agent URL, embedding model (embedding_model_id → Models / api_models; not backend OPENKMS_EMBEDDING_*), chunking config, FAQ generation prompt
openkms-cli pipeline run --pipeline-name kb-index: chunk documents, generate embeddings, bulk insert to pgvector
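The chunking step of kb-index can be sketched as a fixed-size sliding window; the real `chunk_config` (size, overlap, splitting strategy) is KB-specific and may be structure-aware, so the numbers and strategy here are illustrative only:

```python
def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    # Overlapping windows so sentences spanning a chunk boundary are still
    # retrievable from at least one chunk.
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```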
run_kb_index procrastinate task for background indexing
Frontend: KnowledgeBaseList with real CRUD (create, edit, delete)
Frontend: KnowledgeBaseDetail with Documents, FAQs, Chunks, Search, Q&A, Settings tabs
QA Agent Service project (qa-agent/): FastAPI + LangGraph, retrieves via backend search API (no DB access)
Batch document selection for FAQ generation in the UI (modal with doc picker, review generated FAQs, remove unqualified, save)
Re-index button triggers a job via procrastinate (the Settings tab currently only saves config)
Before commit: Update docs/architecture.md, docs/development_plan.md, docs/functionalities.md to reflect changes. See .cursor/rules/docs-before-commit.mdc.