
PDF to Markdown Conversion Guide

Convert PDF research papers into clean Markdown so AI tools can read, quote, and analyze them more reliably.

PDFs were built for human eyes, not machine reasoning.

Overview

Converting PDF documents to text formats is essential for effective AI workflows in research. Raw PDFs often contain complex layouts, images, and formatting that can interfere with AI processing, embeddings, and analysis. This guide covers multiple methods for converting PDFs to clean, AI-friendly text formats.

Key benefits:

Better AI Processing: Clean text without layout artifacts or formatting issues
Cost Efficiency: Pre-convert PDFs once instead of processing them repeatedly
Token Optimization: Clean text typically consumes fewer tokens than having a model process the PDF directly
Embedding Quality: Consistent text extraction improves embedding accuracy
Workflow Integration: Text files work seamlessly with all AI tools and models

Common use cases:

Research literature analysis
AI-assisted content summarization
Embedding creation for semantic search
Large language model context preparation

PDF conversion methods

Method 1: MinerU MCP (recommended)

Use for Claude Code workflows, batch processing, and high accuracy.

Info: Parse PDFs directly within Claude without context-switching. See the full MinerU MCP guide for detailed setup and usage.

MinerU MCP integrates document parsing directly into your AI workflow:

90%+ accuracy with VLM mode for complex layouts
Batch processing up to 200 documents at once
109 languages supported via OCR
Table and formula recognition

Quick setup:

claude mcp add mineru-mcp -e MINERU_API_KEY=your-key -- npx mineru-mcp

Then simply ask Claude: "Parse this PDF with VLM mode: [URL]"

Pros: integrated workflow, high accuracy, batch capable. Cons: requires an API key from mineru.net.

Method 2: MinerU web interface (free tier)

Use for quick one-off conversions and testing before MCP setup.

Use MinerU without MCP setup via their web interface:

Visit MinerU or GitHub
Use the web interface for quick conversions
Download converted markdown

Free tier: limited daily conversions. Pros: no setup required, good quality. Cons: daily limits, manual download process.

Method 3: Mistral OCR script (offline batch)

Use for very large offline batch jobs and scripted workflows.

For bulk processing outside of Claude (100+ papers at once), use the Mistral OCR script approach.

Trade-off: MinerU MCP is better for integrated Claude workflows. Mistral script is better for large offline batch jobs.

Method 4: manual copy-paste (fallback)

Use for emergency single documents when other methods are unavailable.

Open PDF in a PDF reader
Select and copy text from each page
Paste into a text editor or Markdown file
Save with .md or .txt extension

Limitations: time-consuming, loses layout, and prone to manual errors. Prefer MinerU whenever it is available.

Mistral OCR API setup (optional)

Use Mistral OCR when you want hosted OCR with strong layout handling and Markdown output. The maintained script is scripts/ocr/mistral_batch_ocr.py.

Step 1: get a Mistral API key

Visit the Mistral AI Console
Create an account or sign in
Navigate to the API Keys section
Click "Create new key"
Copy your API key and store it securely

Check current pricing and limits in the Mistral console before large batches.

Step 2: environment setup

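Set your API key as an environment variable.

macOS/Linux (current shell session):

export MISTRAL_API_KEY="your_api_key_here"

Windows (Command Prompt):

set MISTRAL_API_KEY=your_api_key_here

To persist the key across sessions on macOS/Linux with zsh:

echo 'export MISTRAL_API_KEY="your_api_key_here"' >> ~/.zshrc
source ~/.zshrc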
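
Before launching a large batch, it can help to confirm that the key is actually visible to Python. The following is a minimal sketch, not part of the maintained script; the function name load_mistral_api_key is hypothetical, and treating the placeholder string as "unset" is an assumption made for illustration.

```python
import os


def load_mistral_api_key(env=None):
    """Return the Mistral API key from the environment, or raise a clear error.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    key = env.get("MISTRAL_API_KEY", "").strip()
    # Assumption: the guide's placeholder value counts as "not configured".
    if not key or key == "your_api_key_here":
        raise RuntimeError(
            "MISTRAL_API_KEY is not set. Export it in your shell before running OCR."
        )
    return key
```

Calling load_mistral_api_key() at the top of a wrapper script makes a missing key fail fast, before any PDFs are uploaded.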

Step 3: run the batch OCR script

pip install mistralai
python scripts/ocr/mistral_batch_ocr.py readings/pdfs readings/markdown

The script:

Recursively processes PDFs and preserves folder structure
Uses Mistral batch jobs against /v1/ocr
Writes Markdown from the page.markdown field of current OCR responses, falling back to page.text for older response shapes
Cleans up request/result JSONL files unless --keep-work-files is set
Skips large PDFs above --max-size-mb
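
The folder-mirroring and size-skipping behavior listed above can be sketched as follows. This is an illustration only, not the code in scripts/ocr/mistral_batch_ocr.py: the function name plan_conversions and the 50 MB default are assumptions, and the real script's --max-size-mb default may differ.

```python
from pathlib import Path


def plan_conversions(pdf_root, md_root, max_size_mb=50.0):
    """Map each PDF under pdf_root to a mirrored .md path under md_root.

    Returns (jobs, skipped): jobs is a list of (pdf_path, md_path) pairs,
    skipped is a list of PDFs larger than max_size_mb.
    """
    pdf_root, md_root = Path(pdf_root), Path(md_root)
    jobs, skipped = [], []
    # rglob walks subfolders; the "*.pdf" pattern is case-sensitive on Linux.
    for pdf in sorted(pdf_root.rglob("*.pdf")):
        if pdf.stat().st_size > max_size_mb * 1024 * 1024:
            skipped.append(pdf)
            continue
        # Keep the sub-path so session-1/a.pdf becomes session-1/a.md.
        relative = pdf.relative_to(pdf_root)
        jobs.append((pdf, md_root / relative.with_suffix(".md")))
    return jobs, skipped
```

For example, plan_conversions("readings/pdfs", "readings/markdown") returns the output path each PDF would be written to, which is useful for a dry run before spending API credits.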

# Include extracted image payloads if needed
python scripts/ocr/mistral_batch_ocr.py readings/pdfs readings/markdown --include-images

# Keep JSONL files for debugging
python scripts/ocr/mistral_batch_ocr.py readings/pdfs readings/markdown --keep-work-files
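
When debugging with kept JSONL files, the page.markdown versus page.text fallback mentioned in the script description can be reproduced with a small helper. This is a sketch under assumptions: it expects one OCR response already parsed into a dict with a "pages" list, and the name pages_to_markdown is hypothetical rather than taken from the maintained script.

```python
def pages_to_markdown(response):
    """Join per-page OCR output into one Markdown string.

    Prefers each page's "markdown" field and falls back to "text",
    skipping pages whose content is empty or whitespace only.
    """
    parts = []
    for page in response.get("pages", []):
        content = page.get("markdown") or page.get("text") or ""
        if content.strip():
            parts.append(content.strip())
    # Separate pages with a blank line and end the file with a newline.
    return "\n\n".join(parts) + "\n"
```

Joining pages with a blank line keeps paragraph boundaries intact for downstream chunking; a different separator (such as a page marker) may suit citation-heavy workflows better.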

Local Baidu/PaddleOCR option

Use local PaddleOCR when you need high-volume OCR without metered hosted API calls. PaddleOCR is Baidu's open-source OCR toolkit; current model families such as PP-OCRv5 and PP-OCRv6 can run locally after model/runtime download.

The maintained script is scripts/ocr/paddle_unlimited_ocr.py. "Unlimited" means there is no per-page API meter; throughput is limited by your machine, installed PaddlePaddle runtime, and model downloads.

pip install paddleocr
# Install PaddlePaddle for your platform from the official Paddle install page.

python scripts/ocr/paddle_unlimited_ocr.py readings/pdfs readings/markdown --lang en
python scripts/ocr/paddle_unlimited_ocr.py readings/pdfs readings/markdown --lang ch --ocr-version PP-OCRv5

Use this route for:

Sensitive documents that should not leave the machine
Very large batches where hosted OCR cost is the constraint
Chinese or multilingual OCR experiments where PaddleOCR models are a good fit

Avoid it when you need turnkey setup, because local runtime and model installation can be more fragile than a hosted API.
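
Because the local install is the fragile part, a quick preflight check can save a failed batch run. The sketch below only tests whether the two packages are importable; it does not verify GPU support or model downloads. The function name missing_modules is hypothetical; "paddle" and "paddleocr" are the import names of PaddlePaddle and PaddleOCR.

```python
import importlib.util


def missing_modules(names=("paddle", "paddleocr")):
    """Return the module names from `names` that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Install before running local OCR:", ", ".join(missing))
    else:
        print("PaddlePaddle and PaddleOCR are importable.")
```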

Alternative API options

Google Document AI

Use when you already work in Google Cloud.

Visit Google Cloud Console
Enable Document AI API
Create a processor for OCR
Use Python client library for batch processing
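
As a rough illustration of the last step, the sketch below sends a single PDF to an existing OCR processor using the google-cloud-documentai client. It assumes the package is installed, application default credentials are configured, and the processor was already created; the function names and parameters are placeholders. It uses a synchronous request, which is subject to page limits, and returns plain text rather than Markdown. Larger jobs would go through the client's batch_process_documents method with Cloud Storage inputs instead.

```python
def documentai_endpoint(location):
    """Regional API endpoint, e.g. 'us' -> 'us-documentai.googleapis.com'."""
    return f"{location}-documentai.googleapis.com"


def ocr_pdf(pdf_path, project_id, location, processor_id):
    """Send one PDF to an OCR processor and return the extracted plain text."""
    # Imported lazily so this module loads even without the client installed.
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(api_endpoint=documentai_endpoint(location))
    )
    # Full resource name of the processor created in the console.
    name = client.processor_path(project_id, location, processor_id)
    with open(pdf_path, "rb") as handle:
        raw = documentai.RawDocument(
            content=handle.read(), mime_type="application/pdf"
        )
    result = client.process_document(
        request=documentai.ProcessRequest(name=name, raw_document=raw)
    )
    return result.document.text
```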

Azure Form Recognizer (now Azure AI Document Intelligence)

Use in enterprise Azure environments.

Visit Azure Portal
Create Cognitive Services resource
Use Form Recognizer service
REST API or SDK integration

Best practices and tips

File organization

your-project/
├── pdfs/           # Original PDFs
│   ├── session-1/
│   ├── session-2/
│   └── articles/
└── markdown/       # Converted text files
    ├── session-1/
    ├── session-2/
    └── articles/
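
With this layout, a short script can report which PDFs still lack a converted counterpart. The following is a minimal sketch that assumes the pdfs/ and markdown/ folder names shown above and a .md extension for converted files; the function name unconverted_pdfs is hypothetical.

```python
from pathlib import Path


def unconverted_pdfs(project_root):
    """List PDFs under pdfs/ that have no matching .md file under markdown/."""
    root = Path(project_root)
    pdf_dir, md_dir = root / "pdfs", root / "markdown"
    missing = []
    for pdf in sorted(pdf_dir.rglob("*.pdf")):
        # Mirror the sub-path: pdfs/session-1/x.pdf -> markdown/session-1/x.md
        expected = md_dir / pdf.relative_to(pdf_dir).with_suffix(".md")
        if not expected.exists():
            missing.append(pdf.relative_to(root).as_posix())
    return missing
```

Running unconverted_pdfs("your-project") after each batch gives a quick list of papers that were skipped or failed.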