PolyDoc turns the messy reality of enterprise data — PDFs, slides, spreadsheets, emails, audio, video, and web pages — into a clean, searchable, RAG-ready knowledge base.
Built on the open-source AI stack
No more glue code between five different libraries. PolyDoc is one cohesive pipeline from raw bytes to grounded answers.
One unified MultimodalSample format handles text, images, audio, video, and tables — same API for every input.
Built on Dask plus PyTorch multiprocessing. Scale from your laptop to a multi-GPU cluster without changing a line of code.
Dense embeddings, SPLADE sparse, and ColPali visual retrieval — use one or stack them all in a single Milvus query.
Swap LLMs, embedders, retrievers, and vector stores by editing a single YAML field. OpenAI, Anthropic, Mistral, Cohere, vLLM, AWS — all there.
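As a rough illustration of the single-field swap, a provider change could look like the fragment below. The key names here are hypothetical, not PolyDoc's actual config schema — consult the shipped `configs/` for the real layout.

```yaml
# Hypothetical config sketch — key names are illustrative only.
rag:
  llm:
    provider: openai        # swap to: anthropic, mistral, cohere, vllm, aws
    model: gpt-4o-mini
  embedder:
    provider: huggingface
    model: BAAI/bge-m3
```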
FastAPI services, Docker images on GHCR (CPU + GPU), incremental state tracking, full CI, and a streaming HTTP index API.
Augment local knowledge with iterative DuckDuckGo or Tavily lookups — automatically fused into RAG context.
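The fusion step can be pictured as interleaving local evidence with web snippets before prompting. This is a minimal sketch of the idea, not PolyDoc's actual fusion logic; the function name and signature are assumptions.

```python
from itertools import zip_longest

# Sketch: interleave local chunks and web snippets so neither source
# dominates the prompt, dropping exact duplicates along the way.
def fuse_context(local_chunks: list[str], web_snippets: list[str], max_items: int = 6) -> str:
    seen: set[str] = set()
    fused: list[str] = []
    for pair in zip_longest(local_chunks, web_snippets):
        for item in pair:
            if item and item not in seen:
                seen.add(item)
                fused.append(item)
    return "\n\n".join(fused[:max_items])

context = fuse_context(
    ["Local: PolyDoc indexes PDFs."],
    ["Web: Hybrid retrieval combines dense and sparse vectors."],
)
```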
Every stage is independent, configurable via YAML, and composable. Run them locally, distributed, or as long-lived API services.
Pull text, metadata, images, and audio from any file or URL. Ten built-in processors, easy to extend.
Chunk, deduplicate, translate, and tag. Run NER over your corpus. Filter low-quality content.
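The chunk-and-deduplicate step boils down to something like the sketch below: fixed-size overlapping windows, then a content-hash pass to drop exact repeats. PolyDoc's real stage is YAML-configurable and richer than this; the helper names are illustrative.

```python
import hashlib

# Sketch only: fixed-size overlapping character windows.
def chunk(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# Sketch only: exact dedup via SHA-256 of each chunk's bytes.
def dedupe(chunks: list[str]) -> list[str]:
    seen: set[str] = set()
    out: list[str] = []
    for c in chunks:
        h = hashlib.sha256(c.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(c)
    return out

chunks = dedupe(chunk("lorem ipsum " * 50))  # highly repetitive input collapses
```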
Build a hybrid Milvus index — dense + SPLADE sparse — locally or on a remote standalone instance.
Hybrid retrieval with optional rerankers. Live HTTP API for streaming new docs into your store.
LangChain LCEL pipeline routes context to the LLM of your choice. Run as CLI, batch, or service.
Install with uv, point at a folder of documents, and watch PolyDoc do the rest. Each stage writes structured JSONL so you can inspect and debug as you go.
Processor subclass for new file types.
# Install (CPU build)
$ uv pip install "polydoc[all,cpu]"
# 1. Extract from your files
$ polydoc process -c configs/process.yaml
# 2. Chunk, clean, tag
$ polydoc postprocess -c configs/post.yaml \
-i outputs/merged/merged_results.jsonl
# 3. Build the hybrid index
$ polydoc index -c configs/index.yaml \
-f outputs/post/results.jsonl
# 4. Ask a question
$ polydoc rag -c configs/rag.yaml
→ ✓ Answer grounded in 6 retrieved chunks.
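Because every stage emits structured JSONL, inspection is a few lines of Python. The field names below (`text`, `source`) are illustrative — check your own `outputs/*.jsonl` for the actual schema.

```python
import io
import json

# Stand-in for an actual outputs/*.jsonl file: one JSON object per line.
sample_jsonl = io.StringIO(
    '{"text": "Q3 revenue grew 12%.", "source": "report.pdf"}\n'
    '{"text": "See slide 4.", "source": "deck.pptx"}\n'
)

records = [json.loads(line) for line in sample_jsonl]
sources = {r["source"] for r in records}  # which files produced output?
```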
Out-of-the-box processors for the formats you actually have. No "convert to PDF first" workflows.
Spin up PolyDoc with Docker, or install via uv and run your first RAG query in under five minutes.