v1.2.2 — Apache 2.0

One pipeline.
Every document.
Any modality.

PolyDoc turns the messy reality of enterprise data — PDFs, slides, spreadsheets, emails, audio, video, and web pages — into a clean, searchable, RAG-ready knowledge base.


Built on the open-source AI stack

PyTorch · Milvus · LangChain · FastAPI · Dask · Hugging Face
17+ file formats · 8 LLM backends · 3 PDF strategies · 100% open source
Why PolyDoc

Everything you need to build serious RAG.

No more glue code between five different libraries. PolyDoc is one cohesive pipeline from raw bytes to grounded answers.

Massively multimodal

One unified MultimodalSample format handles text, images, audio, video, and tables — same API for every input.
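To illustrate the idea (the field names here are illustrative, not PolyDoc's actual MultimodalSample definition), a unified sample type looks roughly like this:

from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class Sample:
    # One record per extracted unit, whatever the source modality.
    id: str
    text: Optional[str] = None      # extracted or transcribed text
    image: Optional[bytes] = None   # raw image payload, if any
    audio: Optional[bytes] = None   # raw audio payload, if any
    metadata: dict[str, Any] = field(default_factory=dict)  # source path, page, timestamps

Downstream stages branch on which fields are populated, not on file extension.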

Distributed by default

Built on Dask plus PyTorch multiprocessing. Scale from your laptop to a multi-GPU cluster without changing a line of code.
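The Dask pattern underneath is the standard one; process_file below is a hypothetical stand-in for a PolyDoc extraction call:

from dask.distributed import Client

client = Client()  # in-process local cluster; pass a scheduler address to use a real one

def process_file(path: str) -> dict:
    # Stand-in for one extraction call on one file.
    with open(path, errors="ignore") as f:
        return {"path": path, "text": f.read()}

futures = client.map(process_file, ["report.txt", "notes.txt"])
results = client.gather(futures)  # same code, laptop or cluster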

Hybrid retrieval

Dense embeddings, SPLADE sparse, and ColPali visual retrieval — use one or stack them all in a single Milvus query.
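On the Milvus side, stacking retrievers in one query uses the standard hybrid-search API (pymilvus 2.4+). The collection and field names below are assumptions, and the query vectors are placeholders:

from pymilvus import AnnSearchRequest, Collection, RRFRanker, connections

connections.connect(uri="http://localhost:19530")
collection = Collection("polydoc_chunks")  # assumed collection name

dense_vec = [0.1] * 768           # placeholder dense query embedding
sparse_vec = {1042: 0.8, 7: 0.3}  # placeholder SPLADE term weights

hits = collection.hybrid_search(
    reqs=[
        AnnSearchRequest(data=[dense_vec], anns_field="dense_vector",
                         param={"metric_type": "IP"}, limit=10),
        AnnSearchRequest(data=[sparse_vec], anns_field="sparse_vector",
                         param={"metric_type": "IP"}, limit=10),
    ],
    rerank=RRFRanker(),  # fuse the two result lists
    limit=5,
    output_fields=["text"],
)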


Pluggable everything

Swap LLMs, embedders, retrievers, and vector stores by editing a single YAML field. OpenAI, Anthropic, Mistral, Cohere, vLLM, AWS — all there.
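The config keys below are illustrative, not PolyDoc's actual schema, but they show the shape of the swap — change one field, get a different backend:

import yaml

config = yaml.safe_load("""
rag:
  llm:
    backend: openai        # one-field swap: anthropic, mistral, cohere, vllm, ...
    model: gpt-4o-mini
""")
print(config["rag"]["llm"]["backend"])  # -> openai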

Production-ready

FastAPI services, Docker images on GHCR (CPU + GPU), incremental state tracking, full CI, and a streaming HTTP index API.
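PolyDoc's actual routes aren't reproduced here; as a minimal sketch, an index endpoint in FastAPI looks like this:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Doc(BaseModel):
    id: str
    text: str

@app.post("/index")
def index_doc(doc: Doc) -> dict:
    # Stand-in: embed doc.text and upsert into the vector store here.
    return {"indexed": doc.id}

Run it with uvicorn and POST new documents into the store as they arrive.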

Live web search

Augment local knowledge with iterative DuckDuckGo or Tavily lookups — automatically fused into RAG context.
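The DuckDuckGo half of that, stripped down (using the duckduckgo_search package; the fusion step is sketched here as plain string concatenation):

from duckduckgo_search import DDGS

with DDGS() as ddgs:
    hits = ddgs.text("polydoc hybrid retrieval", max_results=5)

# Fuse into the prompt alongside locally retrieved chunks.
web_context = "\n".join(f"{h['title']}: {h['body']}" for h in hits)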

Five stages. One CLI. Zero glue code.

Every stage is independent, configurable via YAML, and composable. Run them locally, distributed, or as long-lived API services.

01 / Process

Extract

Pull text, metadata, images, and audio from any file or URL. Ten built-in processors, easy to extend.
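Extension works by subclassing. The base-class shape below is an assumption about the interface, not PolyDoc's exact signature:

import csv

class Processor:  # assumed interface; the real base class may differ
    def process(self, path: str) -> list[dict]:
        raise NotImplementedError

class CsvProcessor(Processor):
    def process(self, path: str) -> list[dict]:
        # Turn each CSV row into one text sample.
        with open(path, newline="") as f:
            return [{"text": ", ".join(row)} for row in csv.reader(f)]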

02 / Postprocess

Refine

Chunk, deduplicate, translate, and tag. Run NER over your corpus. Filter low-quality content.
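PolyDoc's own chunker isn't shown in this overview; as a generic illustration of the step, a recursive character splitter from LangChain does the same job:

from langchain_text_splitters import RecursiveCharacterTextSplitter

document_text = "... " * 1000  # stand-in for one extracted document
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_text(document_text)  # list[str], ready for dedup and tagging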

03 / Index

Embed

Build a hybrid Milvus index — dense + SPLADE sparse — locally or on a remote standalone instance.
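What lands in Milvus is roughly one row per chunk carrying both vector fields; the schema and values below are illustrative:

from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # or a remote standalone instance
client.insert(
    collection_name="polydoc_chunks",  # assumed collection name
    data=[{
        "id": 1,
        "text": "Q3 churn fell two points.",
        "dense_vector": [0.1] * 768,           # from the dense embedder
        "sparse_vector": {1042: 0.8, 7: 0.3},  # SPLADE term -> weight map
    }],
)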

04 / Retrieve

Search

Hybrid retrieval with optional rerankers. Live HTTP API for streaming new docs into your store.

05 / RAG

Answer

LangChain LCEL pipeline routes context to the LLM of your choice. Run as CLI, batch, or service.
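The LCEL shape is standard LangChain; the retriever below is stubbed and the prompt and model choice are arbitrary, not PolyDoc's actual chain:

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

def retrieve(question: str) -> str:
    # Stub: swap in the real hybrid Milvus retriever.
    return "chunk 1 ... chunk 2 ..."

prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n{context}\n\nQuestion: {question}"
)

chain = (
    {"context": retrieve, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model="gpt-4o-mini")  # swap backends here
    | StrOutputParser()
)
answer = chain.invoke("What changed in Q3?")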

From zero to answers in four commands.

Install with uv, point at a folder of documents, and watch PolyDoc do the rest. Each stage writes structured JSONL so you can inspect and debug as you go.

  • One CLI for the entire pipeline
  • YAML configs for every stage — version-controllable
  • Drop in a Python Processor subclass for new file types
  • Use it as a library, a CLI, or a long-lived API service
~/your-project

# Install (CPU build)
$ uv pip install "polydoc[all,cpu]"

# 1. Extract from your files
$ polydoc process -c configs/process.yaml

# 2. Chunk, clean, tag
$ polydoc postprocess -c configs/post.yaml \
    -i outputs/merged/merged_results.jsonl

# 3. Build the hybrid index
$ polydoc index -c configs/index.yaml \
    -f outputs/post/results.jsonl

# 4. Ask a question
$ polydoc rag -c configs/rag.yaml

→ ✓ Answer grounded in 6 retrieved chunks.
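Between steps, the intermediate files are plain newline-delimited JSON, so inspecting them is one loop (path taken from the commands above):

import json

with open("outputs/post/results.jsonl") as f:
    for line in f:
        record = json.loads(line)
        print(record.get("text", "")[:80])  # peek at each chunk
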
File support

If it's a document, PolyDoc handles it.

Out-of-the-box processors for the formats you actually have. No "convert to PDF first" workflows.

.pdf .docx .pptx .xlsx .md .txt .eml .html .mp4 .mov .avi .mkv .mp3 .wav .aac URLs + your own

Ready to ingest the world?

Spin up PolyDoc with Docker, or install via uv and run your first RAG query in under five minutes.