Obsidian_vault_pipeline

An automatic pipeline that uses an LLM to process an Obsidian vault, taking input from Pinboard, Obsidian Clipper, and similar sources.


schema_version: "1.0.0"
note_id: readme_en-5d661efc
title: "Obsidian Vault Pipeline"
description: "An auditable knowledge state runtime for Obsidian"
date: 2026-04-07
type: meta

Obsidian Vault Pipeline

License: MIT · Python 3.10+ · PyPI

Auditable knowledge state runtime for Obsidian Vaults
Capture → Compile → Reuse

🇨🇳 简体中文

Current document version: v0.13.0

Primary docs:

What This Is

Obsidian Vault Pipeline is not a loose collection of scripts, and it is not only RAG over Markdown. It is a local knowledge state runtime built around an Obsidian vault:

  • Capture receives Pinboard, Clippings, raw Markdown, papers, GitHub repos, and web pages while keeping source lifecycle traceable.
  • Compile turns material into deep dives, candidates, claims, evidence, relations, contradictions, registry rows, and graph rows.
  • Reuse projects compiled knowledge into reader atlas pages, object pages, graph views, briefings, search, context packs, writing prompts, and the operator workbench.

Internally the runtime executes six pipeline stages: Ingest → Interpret → Absorb → Refine → Normalize → Derive (see RUNTIME). The product narrative is Capture → Compile → Reuse. The state model — Sources, Candidates, Canonical State, Projections, Access Surfaces, with Governance as the cross-cutting control plane — is documented in ARCHITECTURE.

The current release wires those layers into the actual runtime:

  • ovp --full runs through knowledge_index by default
  • ovp --incremental is the daily incremental entry point, including recent Pinboard + Clippings and downstream stages
  • ovp --full --with-refine inserts refine before the final derived refresh
  • ovp-autopilot runs real-time absorb -> moc -> knowledge_index
  • ovp-autopilot --with-refine adds refine to that path
  • ovp-ui provides a local UI. The default / entry is now a reader-first Knowledge Library, the operator dashboard lives under /ops, object pages expose source/backlink context, and /graph (also /map) renders a reader-facing knowledge map.

Why The Architecture Looks Like This

This repository started as a set of Obsidian automation scripts, but that model stopped scaling once the system grew:

  • the main runtime and individual scripts drifted apart
  • concepts, links, Atlas, graph, and retrieval indexes were tightly coupled without a clean truth boundary
  • new domains like media, medical, or engineering research could not be modeled safely with a concept-only core

The current architecture is the direct answer to those failures:

  • Capture → Compile → Reuse explains the product value
  • The state model (Source / Candidate / Canonical State / Projection / Access Surface, with Governance as cross-cutting control) makes the trust boundary explicit; see ARCHITECTURE
  • The six-stage runtime makes orchestration, identity normalization, and projection rebuilds explicit; see RUNTIME
  • research-tech is the first standard built-in domain pack
  • default-knowledge is retained as a compatibility pack for older vaults
  • The Pack API turns future domains into installable packs rather than hardcoded branches; see PACKS

So the project is no longer just a Vault automation repo. It is now:

a reader-first, evidence-backed knowledge atlas over an auditable knowledge state runtime

with:

  • research-tech as the first explicit built-in standard pack
  • default-knowledge retained as the default compatibility pack
  • knowledge.db as a Projection (rebuildable from Canonical State, never authoritative)
  • vault markdown + registry + evidence chains as Canonical State (the long-term trust boundary)

Current Roadmap

OVP is evolving from a personal Zettelkasten into a typed knowledge platform — reader-first for humans, programmable for agents, extensible through domain packs.

  • active backlog: BACKLOG.md
  • current milestone: MILESTONE.md
  • current merged roadmap rationale: docs/plans/2026-04-29-consolidated-product-roadmap.md
  • reader product-shape note: docs/plans/2026-04-29-reader-product-shape-and-backlog-reconciliation.md

Current milestone sequence:

| Milestone | Status | Meaning |
| --- | --- | --- |
| M0–M3 | Done | Foundation, operator workbench, roadmap consolidation, reader-first atlas |
| M4 KSR Safety And Hot-Path Hardening | Done | projection labels, hot-path audit, wiring evals, evidence spans, candidate risk, JSONL streaming, projection lifecycle hardening |
| M5 Context Pack And Operational Runtime | Done | session snapshots, context budget, runtime state, runtime-state API, action queue health |
| M5a Quality And Dedup Hardening | Done | concept dedup pipeline integration, promote semantic guard, historical data cleanup |
| M8 Type Unification And Extraction Quality | Active | unified object kind taxonomy, Layer 1 entity_type, body-size-aware extraction, quote-grounding, single-pass LLM refactor |
| M9 Pack As Domain Ontology | Next | pack-defined object kind specs, typed relation constraints, schema registry |
| M10 Operational Knowledge Layer | Later | action types, permissions, cross-entity aggregation, decision memory |
| M11 Source Authority And Cross-Source Identity | Done | typed source-authority providers, entity layer (twitter_author / github_project / github_user / person / organization), runtime resolver, refresh wrapper, db backup (PRs #112–#124) |
| M12 Extraction-Time Entity Prime And Auto-Wikilink | Done | entity_aliases view, LLM extractor primed with known entities, auto-wikilink CLI (PRs #126–#128) |
| M13 Synthesis Layer (Crystal) | Done | Louvain communities + LLM-synthesized crystals + contradiction crystals + append-only versioning (PRs #130–#133, closes the L3 gap with NM 0.8) |
| M14 Intake Hardening (BL-058) | Done | URL preservation through deep-dive, deprecate legacy 13-section LLM rewrite, global URL dedup across the active staging chain (Clippings + 4 50-Inbox stages), audit-event stage field, fidelity-sample + prompt-ab measurement CLIs (PRs #170–#172, v0.13.0) |
| M20 Cognitive Surface | Done | USER.md + OVP_RULES.md + LLM context loader (BL-075), QUEUE → GENERATED task dispatcher (BL-076), /digest daily synthesis (BL-077), thin-note Reader shell (PR #206) |
| M21 Anchored Inquiry Surface | Done | Provider profiles + loader (BL-081), inquiry transcript schema + fileops (BL-082), two-layer context binder (BL-083), ovp-ask headless handler + CLI + write-back (BL-084), chats projection (BL-085), Reader /chat page (BL-086), "Ask about this" entry buttons (BL-087), /chats history list (BL-088) (PRs #210–#217, v0.21.0) |

Recent major changes (PRs #98–#124):

  • JSONL streaming hardening, advisory file locks, runtime-state API fixes (#98–#100)
  • Concept dedup pipeline + promote semantic guard, historical Evergreen cleanup (#101)
  • Typed StepResult contracts + 4 pipeline guardrails (#109–#111)
  • Liberate evergreen extractor prompt (#112) — no more 3-5 cap on atomic units per article
  • Source authority subsystem (#113/#114): typed SignalProvider Protocol, domain/author whitelists, GitHub/arXiv/Twitter/Substack signals, yaml overrides, LLM-judge for new domains
  • Entity layer (#115/#119/#120/#121/#123): twitter_author + github_project + github_user backfills, identity merge with person/organization split, runtime resolver — 1497 entities total on the OVP vault (521 twitter + 922 github + 54 person/organization), ~$0.10 one-shot
  • Operational glue (#117/#122): ovp-backup-db SQLite online-backup snapshots, ovp-refresh-source-authority chained refresh + launchd plist
  • 12 entity-layer review fixes (#124): read-side write side effects, identity-merge backlinks, lock race, append-only history, GitHub bare profile URLs, etc.

Domain Packs

The core runtime is now being formalized as a pack-aware platform.

  • Built-in standard pack: research-tech
  • Default compatibility pack: default-knowledge
  • Runtime selection is exposed through --pack and --profile
  • Third-party packs can be discovered through the ovp.packs entry point group or the OVP_PACK_MANIFESTS manifest list

Examples:

ovp-packs
ovp-doctor --pack research-tech --json
ovp --pack research-tech --profile full
ovp-autopilot --pack research-tech --profile autopilot --yes
ovp --pack default-knowledge --profile full

Pack API documentation for third-party developers lives in:

  • docs/pack-api/README.md
  • docs/pack-api/manifest-and-hooks.md
  • docs/pack-api/dogfooding-with-media-pack.md

Platform Architecture

From a platform perspective, the system now has three layers:

  1. Core Platform
  2. Domain Pack
  3. Workflow Profile

1. Core Platform

Core owns the cross-domain pieces that must remain stable:

  • runtime / vault layout
  • CLI orchestration
  • autopilot / queue / watcher
  • canonical identity helpers
  • registry framework
  • derived knowledge.db
  • graph / lint / audit infrastructure
  • plugin / pack loading
  • base evidence schema contracts

2. Domain Pack

A pack is not just a prompt bundle. It defines domain semantics:

  • object kinds
  • workflow profiles
  • discovery boundaries
  • absorb / refine / lint rules
  • schemas / templates / prompt resources

The built-in packs are:

  • research-tech: the explicit technical research pack and the default workflow pack
  • default-knowledge: the compatibility layer

Future domains such as media or medical should arrive as external pack projects.

3. Workflow Profile

A workflow profile is an executable DAG under a pack.

The built-in profiles currently shipped are:

  • research-tech/full
  • research-tech/autopilot
  • default-knowledge/full

Research-Tech Operational Surface

research-tech is no longer only an internal pack. It now has a minimal operational surface:

  • ovp-doctor reports default workflow pack, pack roles, operator docs, recipes, and optional vault health
  • ovp-export exports minimal compiled artifacts:
    • object-page
    • topic-overview
    • event-dossier
    • contradictions
  • ovp-truth reads object / contradiction / neighborhood truth rows directly from knowledge.db
  • ovp-ui launches a local UI. The default / entry is the reader-first Knowledge Library; the operator dashboard lives under /ops.
  • docs/research-tech/RESEARCH_TECH_SKILLPACK.md
  • docs/research-tech/RESEARCH_TECH_VERIFY.md
  • docs/recipes/research-tech/*.md

Examples:

ovp-doctor --pack research-tech --json
ovp-truth objects --vault-dir /path/to/vault
ovp-ui --vault-dir /path/to/vault --port 8787
ovp-export --pack research-tech --target topic-overview --output-path /tmp/topic.md
default-knowledge/autopilot is also shipped as a built-in profile.

Because research-tech is the default workflow pack, the default workflow path now runs:

ovp --full
ovp-autopilot --yes

You can still select packs explicitly:

ovp --pack research-tech --profile full
ovp-autopilot --pack research-tech --profile autopilot --yes
# compatibility path
ovp --pack default-knowledge --profile full

Plugin Design

The plugin / pack surface is no longer only a design memo. There is now a minimal working integration path.

Two discovery modes are supported:

  1. Python entry point group: ovp.packs
  2. Explicit manifest list: OVP_PACK_MANIFESTS=/path/a.yaml:/path/b.yaml

The minimum third-party loading chain is:

  1. provide a manifest
  2. declare entrypoints.pack
  3. return a BaseDomainPack
  4. pass api_version validation
  5. select it through --pack/--profile

Hard boundaries currently enforced by core:

  • a pack cannot turn semantic retrieval into canonical identity
  • a pack cannot treat knowledge.db as Canonical State
  • a pack cannot bypass audit/logging
  • all Projections must remain rebuildable

Runtime Model

Canonical State Boundary (full definition: ARCHITECTURE)

The system keeps a hard boundary:

  • Canonical State: vault markdown + concept registry + evidence + audit
  • Projections: Atlas, MOC, graph, knowledge.db, lint, daily delta, crystals
  • not Canonical State: knowledge.db

knowledge.db is a Projection. It stores:

  • page FTS
  • structured links
  • mirrored raw sidecars
  • timeline / audit events
  • deterministic section embeddings
  • read-only query / serve surfaces

It is rebuildable and does not own canonical identity resolution.

The Six Pipeline Stages (full description: RUNTIME)

| Stage | Responsibility | Representative commands | Can the LLM make major decisions here? |
| --- | --- | --- | --- |
| Ingest | Normalize incoming material | ovp --step pinboard, ovp --step clippings, ovp-article | No |
| Interpret | Produce deep interpretations | ovp-article, ovp-github, ovp-paper | Yes, with constrained output |
| Absorb | Compile interpretations into lifecycle actions | ovp-absorb, ovp-evergreen | Yes, but only through structured results |
| Refine | Clean up and break down existing notes | ovp-cleanup, ovp-breakdown | Yes, but execution is controlled |
| Normalize | Maintain registry / aliases / identity merges / contradiction detection (formerly Canonical) | ovp-rebuild-registry, ovp-merge-identities, ovp-link-entities, ovp-resolve-contradictions | No |
| Derive | Build Projections: retrieval / graph / crystals / lint | ovp-knowledge-index, ovp-graph, ovp-synthesize-community-crystals, ovp-lint | No |
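
The stage table can also be read as data. The sketch below encodes the stage order and the "can the LLM make major decisions" column from the table; the Stage enum and the helper name are illustrative assumptions and do not come from the codebase.

```python
from enum import Enum


class Stage(Enum):
    # Declared in execution order: Ingest → ... → Derive.
    INGEST = "Ingest"
    INTERPRET = "Interpret"
    ABSORB = "Absorb"
    REFINE = "Refine"
    NORMALIZE = "Normalize"
    DERIVE = "Derive"


# Mirrors the last column of the table (qualifiers such as
# "with constrained output" are collapsed to True here).
LLM_MAJOR_DECISIONS = {
    Stage.INGEST: False,
    Stage.INTERPRET: True,
    Stage.ABSORB: True,
    Stage.REFINE: True,
    Stage.NORMALIZE: False,
    Stage.DERIVE: False,
}


def deterministic_stages():
    """Stages where the LLM makes no major decisions, in pipeline order."""
    return [stage.value for stage in Stage if not LLM_MAJOR_DECISIONS[stage]]


print(deterministic_stages())
```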

What ovp --full Actually Runs

Default full pipeline:

pinboard
→ pinboard_process
→ clippings
→ articles
→ quality
→ fix_links
→ absorb
→ dedup
→ note_type_normalize
→ registry_sync
→ moc
→ knowledge_index

With refine enabled:

pinboard
→ pinboard_process
→ clippings
→ articles
→ quality
→ fix_links
→ absorb
→ dedup
→ note_type_normalize
→ registry_sync
→ moc
→ refine
→ knowledge_index
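
The two step lists differ only in where refine is inserted. A minimal sketch of that planning logic, assuming a hypothetical plan_steps helper (the real orchestrator may be structured differently), could look like this:

```python
FULL_STEPS = [
    "pinboard", "pinboard_process", "clippings", "articles", "quality",
    "fix_links", "absorb", "dedup", "note_type_normalize", "registry_sync",
    "moc", "knowledge_index",
]


def plan_steps(with_refine=False, from_step=None):
    """Return the ordered step names for a full run."""
    steps = list(FULL_STEPS)
    if with_refine:
        # refine goes immediately before the final derived refresh.
        steps.insert(steps.index("knowledge_index"), "refine")
    if from_step is not None:
        # Resume from the named step onward, as --from-step does.
        steps = steps[steps.index(from_step):]
    return steps


print(" → ".join(plan_steps(with_refine=True)))
```

Keeping knowledge_index last in every variant is the invariant to preserve, so the derived index always sees the final state.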

Important details:

  • absorb shells out to ovp_pipeline.commands.absorb and emits promoted_slugs for downstream steps
  • dedup runs post-absorb concept deduplication scoped to recently promoted slugs (trigram-Jaccard similarity)
  • note_type_normalize normalizes note_type metadata across Evergreen files
  • refine is a batch wrapper over cleanup + breakdown
  • knowledge_index always runs last so knowledge.db reflects final canonical state
  • --step evergreen and --from-step evergreen are still accepted and map to absorb
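
For readers unfamiliar with the similarity measure named in the dedup step, here is a small sketch of trigram-Jaccard similarity. It is a generic implementation, not the project's code: the normalization (lowercasing, whitespace collapsing) and the function names are assumptions.

```python
def trigrams(text):
    """Set of 3-character substrings after simple normalization."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}


def trigram_jaccard(a, b):
    """Jaccard similarity of the two trigram sets, in [0.0, 1.0]."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        # Strings shorter than three characters have no trigrams;
        # treat them as dissimilar rather than dividing by zero.
        return 0.0
    return len(ta & tb) / len(ta | tb)


print(trigram_jaccard("knowledge graph", "knowledge graphs"))
```

Pairs scoring above a chosen threshold (the ovp-concept-dedup example elsewhere in this document uses --threshold 0.82) would be proposed as duplicate candidates.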

What ovp-autopilot Actually Runs

Default real-time path:

interpretation
→ quality
→ absorb
→ moc
→ knowledge_index
→ auto_commit (optional)

Enable refine explicitly:

ovp-autopilot --watch=inbox --with-refine --yes

That changes the path to:

interpretation
→ quality
→ absorb
→ moc
→ refine
→ knowledge_index
→ auto_commit (optional)

Refine is not hidden or missing. It is wired in but stays opt-in, so that real-time runs do not silently restructure the whole knowledge base.

Command Overview

Daily entry points

| Command | Purpose |
| --- | --- |
| ovp --check | Validate runtime configuration |
| ovp --full | Run the full daily pipeline |
| ovp --full --with-refine | Run full pipeline plus cleanup/breakdown |
| ovp --step absorb | Run only the absorb layer |
| ovp --step refine | Run only the batch refine layer |
| ovp --from-step absorb | Resume from absorb onward |

Content processors

| Command | Purpose |
| --- | --- |
| ovp-article --process-inbox --vault-dir <vault> | Process raw documents |
| ovp-github --process-single <file> --vault-dir <vault> | Process GitHub inputs |
| ovp-paper --process-single <file> --vault-dir <vault> | Process paper inputs |

Absorb / Refine / Canonical

| Command | Purpose |
| --- | --- |
| ovp-absorb --recent 7 --json | Absorb recent deep dives |
| ovp-absorb --file <source.md> --dry-run --json | Preview source lifecycle routing before moving or processing source material |
| ovp-evergreen --recent 7 --json | Compatibility alias for ovp-absorb |
| ovp-concept-dedup --vault-dir <vault> --threshold 0.82 | Find and propose concept deduplication clusters |
| ovp-concept-dedup --vault-dir <vault> --apply | Apply deduplication proposal (archive losers, rewrite wikilinks) |
| ovp-cleanup --all --json | Generate cleanup proposals |
| ovp-cleanup --all --write --json | Apply deterministic cleanup |
| ovp-breakdown --all --json | Generate breakdown proposals |
| ovp-breakdown --all --write --json | Apply incremental breakdown |
| ovp-rebuild-registry --json | Reconcile evergreen notes and registry |
| ovp-promote-candidates review | Review candidate lifecycle |
| ovp-moc --scan --vault-dir <vault> | Refresh MOC / Atlas |

Derived layer

| Command | Purpose |
| --- | --- |
| ovp-knowledge-index --json | Rebuild knowledge.db |
| ovp-knowledge-index --search "query" --json | Run FTS search |
| ovp-knowledge-index --query "question" --json | Run embedding chunk query |
| ovp-knowledge-index --get slug --json | Read a canonical page |
| ovp-knowledge-index --stats --json | Read index stats |
| ovp-knowledge-index --audit-recent --json | Read recent audit events |
| ovp-knowledge-index --tools-json | Emit tool discovery JSON |
| ovp-knowledge-index --serve | Start read-only stdio JSONL service |
| ovp-graph daily today --vault-dir <vault> | Build daily graph delta |
| ovp-lint --check --vault-dir <vault> | Run structure/link checks |

Operations

| Command | Purpose |
| --- | --- |
| ovp-runtime-state --vault-dir <vault> --write --json | Build the operational runtime state projection from repair markers, workflow actions, pipeline events, and reuse events; writes 60-Logs/runtime-state/current.{json,md} |
| GET /api/runtime-state | Local read endpoint for the provider-facing runtime-state projection; prefers the materialized 60-Logs/runtime-state/current.json and falls back to rebuild when missing |
| POST /api/runtime-state | Refresh and write the materialized runtime-state projection |
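
The read behavior described for GET /api/runtime-state (prefer the materialized file, rebuild when it is missing) follows a common pattern. The sketch below shows that pattern only; read_runtime_state and the rebuild callable are hypothetical stand-ins, and the JSON contents are made up for the demo.

```python
import json
import tempfile
from pathlib import Path


def read_runtime_state(vault_dir, rebuild):
    """Return the materialized projection if present, else rebuild it."""
    current = Path(vault_dir) / "60-Logs" / "runtime-state" / "current.json"
    if current.is_file():
        return json.loads(current.read_text(encoding="utf-8"))
    # `rebuild` is a hypothetical callable standing in for the real rebuild path.
    return rebuild()


with tempfile.TemporaryDirectory() as vault:
    # No materialized file yet: the fallback is used.
    fallback = read_runtime_state(vault, rebuild=lambda: {"source": "rebuilt"})

    state_dir = Path(vault) / "60-Logs" / "runtime-state"
    state_dir.mkdir(parents=True)
    (state_dir / "current.json").write_text('{"source": "materialized"}', encoding="utf-8")
    # Materialized file exists: it wins over the rebuild.
    materialized = read_runtime_state(vault, rebuild=lambda: {"source": "rebuilt"})

print(fallback["source"], materialized["source"])
```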

Context packs

| Command | Purpose |
| --- | --- |
| ovp-working-memory --vault-dir <vault> | Write the daily budgeted context pack to 60-Logs/working-memory/YYYY-MM-DD.md and emit trusted reuse events for selected objects |
| ovp-prime --vault-dir <vault> --session-id <id> | Write an OVP Prime session snapshot to 60-Logs/session-snapshots/<id>.md, refresh latest.md, and emit ovp_prime reuse events |

AutoPilot

| Command | Purpose |
| --- | --- |
| ovp-autopilot --watch=inbox --parallel=1 --yes | Default real-time pipeline |
| ovp-autopilot --watch=inbox,pinboard --yes | Watch multiple sources |
| ovp-autopilot --with-refine --yes | Add refine to the real-time path |
| ovp-autopilot --no-commit --yes | Disable auto-commit |

Directory Layout

vault/
├── 50-Inbox/
│   ├── 01-Raw/
│   ├── 02-Pinboard/
│   └── 03-Processed/
├── 10-Knowledge/
│   ├── Evergreen/
│   └── Atlas/
│       ├── Atlas-Index.md
│       ├── concept-registry.jsonl
│       └── alias-index.json
├── 20-Areas/
│   └── {AI-Research, Investing, Programming, Tools}/Topics/YYYY-MM/
├── 60-Logs/
│   ├── pipeline.jsonl
│   ├── refine-mutations.jsonl
│   ├── transactions/
│   ├── quality-reports/
│   ├── daily-deltas/
│   ├── working-memory/
│   ├── session-snapshots/
│   ├── runtime-state/
│   └── knowledge.db
└── 70-Archive/
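
To check an existing vault against this layout, a short script can report which directories are missing. This is only a sketch: the directory list is copied from the tree above, the helper name is hypothetical, and ovp --check remains the supported way to validate a runtime.

```python
import tempfile
from pathlib import Path

# Directories taken from the layout above (files and per-area topics omitted).
EXPECTED_DIRS = [
    "50-Inbox/01-Raw",
    "50-Inbox/02-Pinboard",
    "50-Inbox/03-Processed",
    "10-Knowledge/Evergreen",
    "10-Knowledge/Atlas",
    "20-Areas",
    "60-Logs",
    "70-Archive",
]


def missing_dirs(vault_dir):
    """Return the expected directories that do not exist under vault_dir."""
    root = Path(vault_dir)
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]


with tempfile.TemporaryDirectory() as vault:
    before = missing_dirs(vault)
    for rel in EXPECTED_DIRS:
        (Path(vault) / rel).mkdir(parents=True, exist_ok=True)
    after = missing_dirs(vault)

print(len(before), len(after))
```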

What knowledge.db Provides

knowledge.db is a rebuildable local derived index. It currently includes:

  • pages_index
  • page_fts
  • page_links
  • raw_data
  • timeline_events
  • audit_events
  • page_embeddings

It exists to power:

  • keyword retrieval
  • embedding retrieval
  • canonical page reads
  • audit browsing
  • tool discovery and read-only serving

Default discovery now routes through this layer:

  • ovp-query uses knowledge.db by default
  • keyword retrieval uses FTS5 BM25
  • semantic retrieval uses local deterministic embeddings
  • QMD is no longer the default runtime dependency; it is opt-in via --engine qmd
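
To make the keyword path concrete, the sketch below builds a tiny in-memory FTS5 table and ranks matches with SQLite's built-in bm25() function. The table name page_fts comes from the list above, but the column names, the sample rows, and both helper functions are assumptions; the real schema of knowledge.db may differ. It also assumes the local SQLite build has FTS5 enabled.

```python
import sqlite3


def build_index(pages):
    """Create an in-memory FTS5 table from a {slug: body} mapping."""
    conn = sqlite3.connect(":memory:")
    # Column names are illustrative; slug is stored but not tokenized.
    conn.execute("CREATE VIRTUAL TABLE page_fts USING fts5(slug UNINDEXED, body)")
    conn.executemany("INSERT INTO page_fts (slug, body) VALUES (?, ?)", pages.items())
    return conn


def search(conn, query, limit=5):
    """Return matching slugs, best BM25 match first."""
    rows = conn.execute(
        # bm25() returns lower values for better matches, so sort ascending.
        "SELECT slug FROM page_fts WHERE page_fts MATCH ? ORDER BY bm25(page_fts) LIMIT ?",
        (query, limit),
    )
    return [slug for (slug,) in rows]


conn = build_index({
    "projection-basics": "A projection is rebuildable from canonical state.",
    "pinboard-intake": "Bookmarks arrive from Pinboard and are staged in the inbox.",
})
print(search(conn, "rebuildable"))
```

Because the table lives in memory and is filled from source data on every run, the sketch also mirrors the rule that the index is derived and can be thrown away and rebuilt.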

Quick Start

curl -fsSLO https://raw.githubusercontent.com/fakechris/obsidian_vault_pipeline/main/scripts/install-user.sh
less install-user.sh
bash install-user.sh

mkdir -p my-vault
cd my-vault

ovp --check
ovp --full

If you prefer the explicit PyPI two-step flow:

python3 -m pip install --user obsidian-vault-pipeline
python3 -m ovp_pipeline.installer

If your Python installation enforces PEP 668, prefer:

pipx install obsidian-vault-pipeline

The installer prefers a writable, safe bin directory that is already on PATH; if none is available, it falls back to ~/.local/bin. It does not edit your shell configuration.

If you want to see the refine layer explicitly:

ovp --full --with-refine

If you want a daemon:

ovp-autopilot --watch=inbox --parallel=1 --yes

Configuration

Put .env in the vault root:

AUTO_VAULT_API_KEY=your_key_here
AUTO_VAULT_API_BASE=https://api.minimaxi.com/anthropic
AUTO_VAULT_MODEL=anthropic/MiniMax-M2.7-highspeed

# Optional
PINBOARD_TOKEN=username:token
HTTP_PROXY=http://127.0.0.1:7897
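
If you want to inspect these settings from your own scripts, a minimal KEY=VALUE parser is enough for a file shaped like the example above. This sketch is not how OVP itself loads configuration; load_env_file is a hypothetical helper, and it deliberately ignores quoting and variable expansion.

```python
import tempfile
from pathlib import Path


def load_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '=' or ':'.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


SAMPLE = """AUTO_VAULT_API_KEY=your_key_here
AUTO_VAULT_MODEL=anthropic/MiniMax-M2.7-highspeed

# Optional
PINBOARD_TOKEN=username:token
"""

with tempfile.TemporaryDirectory() as vault:
    env_path = Path(vault) / ".env"
    env_path.write_text(SAMPLE, encoding="utf-8")
    settings = load_env_file(env_path)

print(sorted(settings))
```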

Design Principles

  • identity consistency before feature growth
  • vault files + registry define Canonical State
  • knowledge.db is a Projection, never an additional Canonical State
  • absorb is part of daily automation; refine is powerful but opt-in
  • Wiki, MOC, dashboard, briefing, graph, reader pages, and context packs are projections that carry explicit projection metadata and must trace back to source/evidence
  • reader-facing UI should explain knowledge first, then expose operator/debug detail
  • docs must describe what actually ships, not a future architecture sketch

Related Resources


This document targets: v0.9.3
