microsoft / MarkItDown
python · tool-use wrappers · gpt-4o · claude-sonnet-4-6 · gemini-pro · Verified
Microsoft's Python utility for converting documents (PDF, DOCX, PPTX, etc.) into Markdown for LLMs.
125.3k stars · 195.0k downloads/wk · v0.1.1 · MIT
MarkItDown is a lightweight Microsoft utility that converts a wide range of document formats — PDF, DOCX, PPTX, XLSX, HTML, images, and more — into clean Markdown ready for LLM ingestion.
🎯 Use Cases
- RAG Document Ingestion: Normalize many input formats into Markdown.
- LLM Preprocessing: Feed clean text to chat or extraction prompts.
- CLI / Scripting: One-shot conversion in pipelines.
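For the ingestion and scripting use cases above, a batch conversion step might look roughly like the sketch below. It assumes the `markitdown` package is installed and relies on its documented `MarkItDown().convert(path).text_content` call; the helper name `convert_folder`, the suffix list, and the `<name>.<ext>.md` output naming are illustrative choices, not part of MarkItDown.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Optional


def _default_converter() -> Callable[[Path], str]:
    # Imported lazily so the helper can be exercised with a stub converter
    # when the markitdown package is not installed.
    from markitdown import MarkItDown

    md = MarkItDown()
    return lambda path: md.convert(str(path)).text_content


def convert_folder(
    src_dir: str,
    out_dir: str,
    suffixes: Iterable[str] = (".pdf", ".docx", ".pptx", ".xlsx", ".html"),
    convert: Optional[Callable[[Path], str]] = None,
) -> List[Path]:
    """Convert every matching file under src_dir into a .md file in out_dir.

    `convert` is a hypothetical injection point: any callable mapping a
    Path to Markdown text. It defaults to MarkItDown.
    """
    convert = convert or _default_converter()
    wanted = {s.lower() for s in suffixes}
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    for path in sorted(Path(src_dir).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in wanted:
            continue
        # Keep the original extension in the name so report.pdf and
        # report.docx do not overwrite each other.
        target = out / (path.name + ".md")
        target.write_text(convert(path), encoding="utf-8")
        written.append(target)
    return written
```

For a single file, the package's command-line entry point (`markitdown input.pdf > output.md`) covers the same ground without any Python code.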
✨ Features
- Many input formats (PDF, Office, HTML, images, audio)
- Plugin architecture for extending conversions
- CLI and Python API
- Designed specifically for LLM-friendly Markdown output
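One reason Markdown output suits LLM pipelines is that headings give natural chunk boundaries. The sketch below is a downstream helper, not part of MarkItDown: `split_by_headings` is a hypothetical name, and it only recognizes ATX headings (`#` through `######`), so a `#` line inside a fenced code block would be misread as a heading.

```python
import re
from typing import List, Tuple

# ATX heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> List[Tuple[str, str]]:
    """Split Markdown into (heading, body) pairs, one per ATX heading.

    Text before the first heading is returned with an empty heading.
    """
    chunks: List[Tuple[str, str]] = []
    title, body = "", []

    def flush() -> None:
        # Emit the current section unless it is completely empty.
        if title or any(line.strip() for line in body):
            chunks.append((title, "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Each pair can then be embedded or passed to a prompt on its own, with the heading kept as lightweight context.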
👍 Pros
- Wide format coverage from a single tool
- Microsoft-maintained with active development
- Easy to drop into existing pipelines
👎 Cons & Limitations
- PDF parsing quality varies by document
- Complex layouts may need a heavier parser (e.g. LlamaParse)
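Because extraction quality varies, a pipeline may want to detect weak output and route those documents to a heavier parser. The following is a minimal sketch under stated assumptions: `looks_sparse` and `convert_with_fallback` are hypothetical helpers, the thresholds are arbitrary starting points rather than recommended values, and both converters are passed in as plain callables so no particular parser API is implied.

```python
from typing import Callable, Tuple


def looks_sparse(markdown: str, min_chars: int = 200,
                 min_alnum_ratio: float = 0.5) -> bool:
    """Crude check: very short output, or output that is mostly punctuation."""
    text = markdown.strip()
    if len(text) < min_chars:
        return True
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return True
    alnum = sum(ch.isalnum() for ch in visible)
    return alnum / len(visible) < min_alnum_ratio


def convert_with_fallback(
    path: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    is_poor: Callable[[str], bool] = looks_sparse,
) -> Tuple[str, str]:
    """Return (markdown, source), where source is "primary" or "fallback"."""
    markdown = primary(path)
    if is_poor(markdown):
        # Primary output looks unusable; try the heavier parser instead.
        return fallback(path), "fallback"
    return markdown, "primary"
```

Here `primary` could wrap MarkItDown and `fallback` a layout-aware parser; a heuristic this simple will not catch scrambled reading order or broken tables, so spot-checking a sample of converted documents is still worthwhile.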