Obsidian_vault_pipeline
A automatic pipeline that processing obsidian vault by llm from pinbord, obsidian Clipper etc.
schema_version: "1.0.0" note_id: readme_en-5d661efc title: "Obsidian Vault Pipeline" description: "An auditable knowledge state runtime for Obsidian" date: 2026-04-07 type: meta
Obsidian Vault Pipeline
Auditable knowledge state runtime for Obsidian Vaults
Capture â Compile â Reuse
Current document version: v0.13.0
Primary docs:
- Architecture â durable knowledge model (six core terms; 250-line cap)
- Runtime â pipeline stages and CLIs by stage
- Packs â Core / Domain Pack / Workflow Profile
- Product Surfaces â UI / MCP / CLI / export
- Glossary â every other domain term, mapped back to the six core
- Milestone (įŽäŊ䏿)
- Active Backlog
- įŽäŊ䏿: æļæ
What This Is
Obsidian Vault Pipeline is not a loose collection of scripts, and it is not only RAG over Markdown. It is a local knowledge state runtime built around an Obsidian vault:
- Capture receives Pinboard, Clippings, raw Markdown, papers, GitHub repos, and web pages while keeping source lifecycle traceable.
- Compile turns material into deep dives, candidates, claims, evidence, relations, contradictions, registry rows, and graph rows.
- Reuse projects compiled knowledge into reader atlas pages, object pages, graph views, briefings, search, context packs, writing prompts, and the operator workbench.
Internally the runtime executes six pipeline stages: Ingest â Interpret â Absorb â Refine â Normalize â Derive (see RUNTIME). The product narrative is Capture â Compile â Reuse. The state model â Sources, Candidates, Canonical State, Projections, Access Surfaces, with Governance as the cross-cutting control plane â is documented in ARCHITECTURE.
The current release wires those layers into the actual runtime:
ovp --fullruns throughknowledge_indexby defaultovp --incrementalis the daily incremental entry point, including recent Pinboard + Clippings and downstream stagesovp --full --with-refineinsertsrefinebefore the final derived refreshovp-autopilotruns real-timeabsorb -> moc -> knowledge_indexovp-autopilot --with-refineaddsrefineto that pathovp-uiprovides a local UI. The default/entry is now a reader-first Knowledge Library, the operator dashboard lives under/ops, object pages expose source/backlink context, and/graph(also/map) renders a reader-facing knowledge map.
Why The Architecture Looks Like This
This repository started as a set of Obsidian automation scripts, but that model stopped scaling once the system grew:
- the main runtime and individual scripts drifted apart
- concepts, links, Atlas, graph, and retrieval indexes were tightly coupled without a clean truth boundary
- new domains like media, medical, or engineering research could not be modeled safely with a concept-only core
The current architecture is the direct answer to those failures:
- Capture â Compile â Reuse explains the product value
- The state model (Source / Candidate / Canonical State / Projection / Access Surface, with Governance as cross-cutting control) makes the trust boundary explicit; see ARCHITECTURE
- The six-stage runtime makes orchestration, identity normalization, and projection rebuilds explicit; see RUNTIME
research-techis the first standard built-in domain packdefault-knowledgeis retained as a compatibility pack for older vaults- The Pack API turns future domains into installable packs rather than hardcoded branches; see PACKS
So the project is no longer just a Vault automation repo. It is now:
a reader-first, evidence-backed knowledge atlas over an auditable knowledge state runtime
with:
research-techas the first explicit built-in standard packdefault-knowledgeretained as the default compatibility packknowledge.dbas a Projection (rebuildable from Canonical State, never authoritative)- vault markdown + registry + evidence chains as Canonical State (the long-term trust boundary)
Current Roadmap
OVP is evolving from a personal Zettelkasten into a typed knowledge platform â reader-first for humans, programmable for agents, extensible through domain packs.
- active backlog:
BACKLOG.md - current milestone:
MILESTONE.md - current merged roadmap rationale:
docs/plans/2026-04-29-consolidated-product-roadmap.md - reader product-shape note:
docs/plans/2026-04-29-reader-product-shape-and-backlog-reconciliation.md
Current milestone sequence:
| Milestone | Status | Meaning |
|---|---|---|
| M0âM3 | Done | Foundation, operator workbench, roadmap consolidation, reader-first atlas |
| M4 KSR Safety And Hot-Path Hardening | Done | projection labels, hot-path audit, wiring evals, evidence spans, candidate risk, JSONL streaming, projection lifecycle hardening |
| M5 Context Pack And Operational Runtime | Done | session snapshots, context budget, runtime state, runtime-state API, action queue health |
| M5a Quality And Dedup Hardening | Done | concept dedup pipeline integration, promote semantic guard, historical data cleanup |
| M8 Type Unification And Extraction Quality | Active | unified object kind taxonomy, Layer 1 entity_type, body-size-aware extraction, quote-grounding, single-pass LLM refactor |
| M9 Pack As Domain Ontology | Next | pack-defined object kind specs, typed relation constraints, schema registry |
| M10 Operational Knowledge Layer | Later | action types, permissions, cross-entity aggregation, decision memory |
| M11 Source Authority And Cross-Source Identity | Done | typed source-authority providers, entity layer (twitter_author / github_project / github_user / person / organization), runtime resolver, refresh wrapper, db backup (PRs #112â#124) |
| M12 Extraction-Time Entity Prime And Auto-Wikilink | Done | entity_aliases view, LLM extractor primed with known entities, auto-wikilink CLI (PRs #126â#128) |
| M13 Synthesis Layer (Crystal) | Done | Louvain communities + LLM-synthesized crystals + contradiction crystals + append-only versioning (PRs #130â#133, closes the L3 gap with NM 0.8) |
| M14 Intake Hardening (BL-058) | Done | URL preservation through deep-dive, deprecate legacy 13-section LLM rewrite, global URL dedup across the active staging chain (Clippings + 4 50-Inbox stages), audit-event stage field, fidelity-sample + prompt-ab measurement CLIs (PRs #170â#172, v0.13.0) |
| M20 Cognitive Surface | Done | USER.md + OVP_RULES.md + LLM context loader (BL-075), QUEUE â GENERATED task dispatcher (BL-076), /digest daily synthesis (BL-077), thin-note Reader shell (PR #206) |
| M21 Anchored Inquiry Surface | Done | Provider profiles + loader (BL-081), inquiry transcript schema + fileops (BL-082), two-layer context binder (BL-083), ovp-ask headless handler + CLI + write-back (BL-084), chats projection (BL-085), Reader /chat page (BL-086), "Ask about this" entry buttons (BL-087), /chats history list (BL-088) â PRs #210â#217, v0.21.0 |
Recent major changes (PRs #98â#124):
- JSONL streaming hardening, advisory file locks, runtime-state API fixes (#98â#100)
- Concept dedup pipeline + promote semantic guard, historical Evergreen cleanup (#101)
- Typed StepResult contracts + 4 pipeline guardrails (#109â#111)
- Liberate evergreen extractor prompt (#112) â no more 3-5 cap on atomic units per article
- Source authority subsystem (#113/#114): typed SignalProvider Protocol, domain/author whitelists, GitHub/arXiv/Twitter/Substack signals, yaml overrides, LLM-judge for new domains
- Entity layer (#115/#119/#120/#121/#123): twitter_author + github_project + github_user backfills, identity merge with person/organization split, runtime resolver â 1497 entities total on the OVP vault (521 twitter + 922 github + 54 person/organization), ~$0.10 one-shot
- Operational glue (#117/#122):
ovp-backup-dbSQLite online-backup snapshots,ovp-refresh-source-authoritychained refresh + launchd plist - 12 entity-layer review fixes (#124): read-side write side effects, identity-merge backlinks, lock race, append-only history, GitHub bare profile URLs, etc.
Domain Packs
The core runtime is now being formalized as a pack-aware platform.
- Built-in standard pack:
research-tech - Default compatibility pack:
default-knowledge - Runtime selection is exposed through
--packand--profile - Third-party packs can be discovered through the
ovp.packsentry point group or theOVP_PACK_MANIFESTSmanifest list
Examples:
ovp-packs
ovp-doctor --pack research-tech --json
ovp --pack research-tech --profile full
ovp-autopilot --pack research-tech --profile autopilot --yes
ovp --pack default-knowledge --profile full
Pack API documentation for third-party developers lives in:
docs/pack-api/README.mddocs/pack-api/manifest-and-hooks.mddocs/pack-api/dogfooding-with-media-pack.md
Platform Architecture
From a platform perspective, the system now has three layers:
- Core Platform
- Domain Pack
- Workflow Profile
1. Core Platform
Core owns the cross-domain pieces that must remain stable:
- runtime / vault layout
- CLI orchestration
- autopilot / queue / watcher
- canonical identity helpers
- registry framework
- derived
knowledge.db - graph / lint / audit infrastructure
- plugin / pack loading
- base evidence schema contracts
2. Domain Pack
A pack is not just a prompt bundle. It defines domain semantics:
- object kinds
- workflow profiles
- discovery boundaries
- absorb / refine / lint rules
- schemas / templates / prompt resources
The built-in packs are:
research-tech: the explicit technical research pack and the default workflow packdefault-knowledge: the compatibility layer
Future domains such as media or medical should arrive as external pack projects.
3. Workflow Profile
A workflow profile is an executable DAG under a pack.
The built-in profiles currently shipped are:
research-tech/fullresearch-tech/autopilotdefault-knowledge/full
Research-Tech Operational Surface
research-tech is no longer only an internal pack. It now has a minimal operational surface:
ovp-doctorreports default workflow pack, pack roles, operator docs, recipes, and optional vault healthovp-exportexports minimal compiled artifacts:object-pagetopic-overviewevent-dossiercontradictions
ovp-truthreads object / contradiction / neighborhood truth rows directly fromknowledge.dbovp-uilaunches a local UI. The default/entry is the reader-first Knowledge Library; the operator dashboard lives under/ops.docs/research-tech/RESEARCH_TECH_SKILLPACK.mddocs/research-tech/RESEARCH_TECH_VERIFY.mddocs/recipes/research-tech/*.md
Examples:
ovp-doctor --pack research-tech --json
ovp-truth objects --vault-dir /path/to/vault
ovp-ui --vault-dir /path/to/vault --port 8787
ovp-export --pack research-tech --target topic-overview --output-path /tmp/topic.md
default-knowledge/autopilot
That is why the default workflow path now runs:
ovp --full
ovp-autopilot --yes
You can still select packs explicitly:
ovp --pack research-tech --profile full
ovp-autopilot --pack research-tech --profile autopilot --yes
# compatibility path
ovp --pack default-knowledge --profile full
Plugin Design
The plugin / pack surface is no longer only a design memo. There is now a minimal working integration path.
Two discovery modes are supported:
- Python entry point group:
ovp.packs - Explicit manifest list:
OVP_PACK_MANIFESTS=/path/a.yaml:/path/b.yaml
The minimum third-party loading chain is:
- provide a manifest
- declare
entrypoints.pack - return a
BaseDomainPack - pass
api_versionvalidation - select it through
--pack/--profile
Hard boundaries currently enforced by core:
- a pack cannot turn semantic retrieval into canonical identity
- a pack cannot treat
knowledge.dbas Canonical State - a pack cannot bypass audit/logging
- all Projections must remain rebuildable
Runtime Model
Canonical State Boundary (full definition: ARCHITECTURE)
The system keeps a hard boundary:
- Canonical State: vault markdown + concept registry + evidence + audit
- Projections: Atlas, MOC, graph,
knowledge.db, lint, daily delta, crystals - not Canonical State:
knowledge.db
knowledge.db is a Projection. It stores:
- page FTS
- structured links
- mirrored raw sidecars
- timeline / audit events
- deterministic section embeddings
- read-only query / serve surfaces
It is rebuildable and does not own canonical identity resolution.
The Six Pipeline Stages (full description: RUNTIME)
| Stage | Responsibility | Representative commands | Can the LLM make major decisions here? |
|---|---|---|---|
| Ingest | Normalize incoming material | ovp --step pinboard ovp --step clippings ovp-article | No |
| Interpret | Produce deep interpretations | ovp-article ovp-github ovp-paper | Yes, with constrained output |
| Absorb | Compile interpretations into lifecycle actions | ovp-absorb ovp-evergreen | Yes, but only through structured results |
| Refine | Cleanup and breakdown existing notes | ovp-cleanup ovp-breakdown | Yes, but execution is controlled |
| Normalize | Maintain registry / aliases / identity merges / contradiction detection (formerly Canonical) | ovp-rebuild-registry ovp-merge-identities ovp-link-entities ovp-resolve-contradictions | No |
| Derive | Build Projections â retrieval / graph / crystals / lint | ovp-knowledge-index ovp-graph ovp-synthesize-community-crystals ovp-lint | No |
What ovp --full Actually Runs
Default full pipeline:
pinboard
â pinboard_process
â clippings
â articles
â quality
â fix_links
â absorb
â dedup
â note_type_normalize
â registry_sync
â moc
â knowledge_index
With refine enabled:
pinboard
â pinboard_process
â clippings
â articles
â quality
â fix_links
â absorb
â dedup
â note_type_normalize
â registry_sync
â moc
â refine
â knowledge_index
Important details:
absorbshells toovp_pipeline.commands.absorband emitspromoted_slugsfor downstream stepsdedupruns post-absorb concept deduplication scoped to recently promoted slugs (trigram-Jaccard similarity)note_type_normalizenormalizes note_type metadata across Evergreen filesrefineis a batch wrapper overcleanup + breakdownknowledge_indexalways runs last soknowledge.dbreflects final canonical state--step evergreenand--from-step evergreenare still accepted and map toabsorb
What ovp-autopilot Actually Runs
Default real-time path:
interpretation
â quality
â absorb
â moc
â knowledge_index
â auto_commit(optional)
Enable refine explicitly:
ovp-autopilot --watch=inbox --with-refine --yes
That changes the path to:
interpretation
â quality
â absorb
â moc
â refine
â knowledge_index
â auto_commit(optional)
Refine is not hidden or missing. It is wired in, but opt-in by default to avoid silent real-time structural rewrites of the whole knowledge base.
Command Overview
Daily entry points
| Command | Purpose |
|---|---|
ovp --check | Validate runtime configuration |
ovp --full | Run the full daily pipeline |
ovp --full --with-refine | Run full pipeline plus cleanup/breakdown |
ovp --step absorb | Run only the absorb layer |
ovp --step refine | Run only the batch refine layer |
ovp --from-step absorb | Resume from absorb onward |
Content processors
| Command | Purpose |
|---|---|
ovp-article --process-inbox --vault-dir <vault> | Process raw documents |
ovp-github --process-single <file> --vault-dir <vault> | Process GitHub inputs |
ovp-paper --process-single <file> --vault-dir <vault> | Process paper inputs |
Absorb / Refine / Canonical
| Command | Purpose |
|---|---|
ovp-absorb --recent 7 --json | Absorb recent deep dives |
ovp-absorb --file <source.md> --dry-run --json | Preview source lifecycle routing before moving or processing source material |
ovp-evergreen --recent 7 --json | Compatibility alias for ovp-absorb |
ovp-concept-dedup --vault-dir <vault> --threshold 0.82 | Find and propose concept deduplication clusters |
ovp-concept-dedup --vault-dir <vault> --apply | Apply deduplication proposal (archive losers, rewrite wikilinks) |
ovp-cleanup --all --json | Generate cleanup proposals |
ovp-cleanup --all --write --json | Apply deterministic cleanup |
ovp-breakdown --all --json | Generate breakdown proposals |
ovp-breakdown --all --write --json | Apply incremental breakdown |
ovp-rebuild-registry --json | Reconcile evergreen notes and registry |
ovp-promote-candidates review | Review candidate lifecycle |
ovp-moc --scan --vault-dir <vault> | Refresh MOC / Atlas |
Derived layer
| Command | Purpose |
|---|---|
ovp-knowledge-index --json | Rebuild knowledge.db |
ovp-knowledge-index --search "query" --json | Run FTS search |
ovp-knowledge-index --query "question" --json | Run embedding chunk query |
ovp-knowledge-index --get slug --json | Read a canonical page |
ovp-knowledge-index --stats --json | Read index stats |
ovp-knowledge-index --audit-recent --json | Read recent audit events |
ovp-knowledge-index --tools-json | Emit tool discovery JSON |
ovp-knowledge-index --serve | Start read-only stdio JSONL service |
ovp-graph daily today --vault-dir <vault> | Build daily graph delta |
ovp-lint --check --vault-dir <vault> | Run structure/link checks |
Operations
| Command | Purpose |
|---|---|
ovp-runtime-state --vault-dir <vault> --write --json | Build the operational runtime state projection from repair markers, workflow actions, pipeline events, and reuse events; writes 60-Logs/runtime-state/current.{json,md} |
GET /api/runtime-state | Local read endpoint for the provider-facing runtime-state projection; prefers the materialized 60-Logs/runtime-state/current.json and falls back to rebuild when missing |
POST /api/runtime-state | Refresh and write the materialized runtime-state projection |
Context packs
| Command | Purpose |
|---|---|
ovp-working-memory --vault-dir <vault> | Write the daily budgeted context pack to 60-Logs/working-memory/YYYY-MM-DD.md and emit trusted reuse events for selected objects |
ovp-prime --vault-dir <vault> --session-id <id> | Write an OVP Prime session snapshot to 60-Logs/session-snapshots/<id>.md, refresh latest.md, and emit ovp_prime reuse events |
AutoPilot
| Command | Purpose |
|---|---|
ovp-autopilot --watch=inbox --parallel=1 --yes | Default real-time pipeline |
ovp-autopilot --watch=inbox,pinboard --yes | Watch multiple sources |
ovp-autopilot --with-refine --yes | Add refine to the real-time path |
ovp-autopilot --no-commit --yes | Disable auto-commit |
Directory Layout
vault/
âââ 50-Inbox/
â âââ 01-Raw/
â âââ 02-Pinboard/
â âââ 03-Processed/
âââ 10-Knowledge/
â âââ Evergreen/
â âââ Atlas/
â âââ Atlas-Index.md
â âââ concept-registry.jsonl
â âââ alias-index.json
âââ 20-Areas/
â âââ {AI-Research, Investing, Programming, Tools}/Topics/YYYY-MM/
âââ 60-Logs/
â âââ pipeline.jsonl
â âââ refine-mutations.jsonl
â âââ transactions/
â âââ quality-reports/
â âââ daily-deltas/
â âââ working-memory/
â âââ session-snapshots/
â âââ runtime-state/
â âââ knowledge.db
âââ 70-Archive/
What knowledge.db Provides
knowledge.db is a rebuildable local derived index. It currently includes:
pages_indexpage_ftspage_linksraw_datatimeline_eventsaudit_eventspage_embeddings
It exists to power:
- keyword retrieval
- embedding retrieval
- canonical page reads
- audit browsing
- tool discovery and read-only serving
Default discovery now routes through this layer:
ovp-queryusesknowledge.dbby default- keyword retrieval uses FTS5 BM25
- semantic retrieval uses local deterministic embeddings
- QMD is no longer the default runtime dependency; it is opt-in via
--engine qmd
Quick Start
curl -fsSLO https://raw.githubusercontent.com/fakechris/obsidian_vault_pipeline/main/scripts/install-user.sh
less install-user.sh
bash install-user.sh
mkdir -p my-vault
cd my-vault
ovp --check
ovp --full
If you prefer the explicit PyPI two-step flow:
python3 -m pip install --user obsidian-vault-pipeline
python3 -m ovp_pipeline.installer
If your Python installation enforces PEP 668, prefer:
pipx install obsidian-vault-pipeline
The installer prefers a writable, safe bin directory that is already on PATH; if none is available, it falls back to ~/.local/bin. It does not edit your shell configuration.
If you want to see the refine layer explicitly:
ovp --full --with-refine
If you want a daemon:
ovp-autopilot --watch=inbox --parallel=1 --yes
Configuration
Put .env in the vault root:
AUTO_VAULT_API_KEY=your_key_here
AUTO_VAULT_API_BASE=https://api.minimaxi.com/anthropic
AUTO_VAULT_MODEL=anthropic/MiniMax-M2.7-highspeed
# Optional
PINBOARD_TOKEN=username:token
HTTP_PROXY=http://127.0.0.1:7897
Design Principles
- identity consistency before feature growth
- vault files + registry define Canonical State
knowledge.dbis a Projection, never an additional Canonical State- absorb is part of daily automation; refine is powerful and opt-in by default
- Wiki, MOC, dashboard, briefing, graph, reader pages, and context packs are projections that carry explicit projection metadata and must trace back to source/evidence
- reader-facing UI should explain knowledge first, then expose operator/debug detail
- docs must describe what actually ships, not a future architecture sketch
Related Resources
This document targets: v0.9.3
How to Install
- Download the template file from GitHub
- Move it anywhere in your vault
- Open it in Obsidian â done!
Stats
Stars
120
Forks
13
License
MIT
Last updated 5d ago
Categories