PDF to Markdown Conversion Guide
Convert PDF research papers into clean Markdown so AI tools can read, quote, and analyze them more reliably.
PDFs were built for human eyes, not machine reasoning.
Overview
Converting PDF documents to text formats is essential for effective AI workflows in research. Raw PDFs often contain complex layouts, images, and formatting that can interfere with AI processing, embeddings, and analysis. This guide covers multiple methods for converting PDFs to clean, AI-friendly text formats.
Key benefits:
- Better AI Processing: Clean text without layout artifacts or formatting issues
- Cost Efficiency: Pre-convert PDFs once instead of processing them repeatedly
- Token Optimization: Plain text usually consumes fewer tokens than sending the original PDF to a model
- Embedding Quality: Consistent text extraction improves embedding accuracy
- Workflow Integration: Text files work seamlessly with all AI tools and models
Common use cases:
- Research literature analysis
- AI-assisted content summarization
- Embedding creation for semantic search
- Large language model context preparation
PDF conversion methods
Method 1: MinerU MCP (recommended)
Use for Claude Code workflows, batch processing, and high accuracy.
Info
Parse PDFs directly within Claude without context-switching. See the full MinerU MCP guide for detailed setup and usage.
MinerU MCP integrates document parsing directly into your AI workflow:
- 90%+ accuracy with VLM mode for complex layouts
- Batch processing up to 200 documents at once
- 109 languages supported via OCR
- Table and formula recognition
Quick setup:
claude mcp add mineru-mcp -e MINERU_API_KEY=your-key -- npx mineru-mcp
Then ask Claude: "Parse this PDF with VLM mode: [URL]"
Pros: integrated workflow, high accuracy, batch capable. Cons: requires an API key from mineru.net.
Method 2: MinerU web interface (free tier)
Use for quick one-off conversions and testing before MCP setup.
Use MinerU without MCP setup via its web interface.
Free tier: limited daily conversions. Pros: no setup required, good quality. Cons: daily limits, manual download process.
Method 3: Mistral OCR script (batch offline)
Use for very large offline batch jobs and scripted workflows.
For bulk processing outside of Claude (100+ papers at once), use the Mistral OCR script approach.
Trade-off: MinerU MCP is better for integrated Claude workflows. The Mistral script is better for large offline batch jobs.
Method 4: manual copy-paste (fallback)
Use for emergency single documents when other methods are unavailable.
- Open PDF in a PDF reader
- Select and copy text from each page
- Paste into a text editor or Markdown file
- Save with a .md or .txt extension
Limitations: Time-consuming, layout issues, manual errors. Use MinerU instead.
Mistral OCR API setup (optional)
Use Mistral OCR when you want hosted OCR with strong layout handling and Markdown output. The maintained script is scripts/ocr/mistral_batch_ocr.py.
Step 1: get Mistral API key
- Visit Mistral AI Console
- Create an account or sign in
- Navigate to API Keys section
- Click "Create new key"
- Copy and save your API key securely
Check current pricing and limits in the Mistral console before large batches.
Step 2: environment setup
Set your API key as an environment variable:
export MISTRAL_API_KEY="your_api_key_here"
Step 3: run the batch OCR script
pip install mistralai
python scripts/ocr/mistral_batch_ocr.py readings/pdfs readings/markdown
The script:
- Recursively processes PDFs and preserves folder structure
- Uses Mistral batch jobs against /v1/ocr
- Writes page.markdown from current OCR responses, with a fallback for older page.text shapes
- Cleans up request/result JSONL files unless --keep-work-files is set
- Skips large PDFs above --max-size-mb
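The discovery, size-skip, and page-text fallback behaviours listed above can be sketched in a few lines. This is an illustrative sketch only, not the code of scripts/ocr/mistral_batch_ocr.py: the names plan_conversions and page_text are hypothetical, the size limit is passed in explicitly rather than assumed, and page_text assumes each OCR page arrives as a plain dict.

```python
from pathlib import Path


def plan_conversions(pdf_root, md_root, max_size_mb):
    """Return (jobs, skipped).

    jobs is a list of (pdf_path, markdown_target) pairs that mirror the
    input folder structure; skipped lists PDFs larger than max_size_mb.
    Hypothetical helper, not part of the maintained script.
    """
    pdf_root, md_root = Path(pdf_root), Path(md_root)
    limit_bytes = max_size_mb * 1024 * 1024
    jobs, skipped = [], []
    # Note: "*.pdf" is matched case-sensitively on most Linux filesystems.
    for pdf in sorted(pdf_root.rglob("*.pdf")):
        if pdf.stat().st_size > limit_bytes:
            skipped.append(pdf)
            continue
        # readings/pdfs/a/b.pdf -> readings/markdown/a/b.md
        target = md_root / pdf.relative_to(pdf_root).with_suffix(".md")
        jobs.append((pdf, target))
    return jobs, skipped


def page_text(page):
    """Prefer the 'markdown' field, falling back to an older 'text' field.

    Assumes each page is a dict; the real response objects may differ.
    """
    return page.get("markdown") or page.get("text") or ""
```

Calling plan_conversions("readings/pdfs", "readings/markdown", max_size_mb=...) before a large batch gives a quick preview of what would be converted and what would be skipped.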
# Include extracted image payloads if needed
python scripts/ocr/mistral_batch_ocr.py readings/pdfs readings/markdown --include-images
# Keep JSONL files for debugging
python scripts/ocr/mistral_batch_ocr.py readings/pdfs readings/markdown --keep-work-files
Local Baidu/PaddleOCR option
Use local PaddleOCR when you need high-volume OCR without metered hosted API calls. PaddleOCR is Baidu's open-source OCR toolkit; current model families such as PP-OCRv5 and PP-OCRv6 can run locally after model/runtime download.
The maintained script is scripts/ocr/paddle_unlimited_ocr.py. "Unlimited" means there is no per-page API meter; throughput is limited by your machine, installed PaddlePaddle runtime, and model downloads.
pip install paddleocr
# Install PaddlePaddle for your platform from the official Paddle install page.
python scripts/ocr/paddle_unlimited_ocr.py readings/pdfs readings/markdown --lang en
python scripts/ocr/paddle_unlimited_ocr.py readings/pdfs readings/markdown --lang ch --ocr-version PP-OCRv5
Use this route for:
- Sensitive documents that should not leave the machine
- Very large batches where hosted OCR cost is the constraint
- Chinese or multilingual OCR experiments where PaddleOCR models are a good fit
Avoid it when you need turnkey setup, because local runtime and model installation can be more fragile than a hosted API.
Alternative API options
Google Document AI
Use when you already work in Google Cloud.
- Visit Google Cloud Console
- Enable Document AI API
- Create a processor for OCR
- Use Python client library for batch processing
Azure Form Recognizer (now Azure AI Document Intelligence)
Use in enterprise Azure environments.
- Visit Azure Portal
- Create Cognitive Services resource
- Use Form Recognizer service
- REST API or SDK integration
Best practices and tips
File organization
your-project/
├── pdfs/               # Original PDFs
│   ├── session-1/
│   ├── session-2/
│   └── articles/
└── markdown/           # Converted text files
    ├── session-1/
    ├── session-2/
    └── articles/
Quality control
- Spot check: Review converted files for accuracy
- Complex layouts: Some academic PDFs may need manual review
- Images and tables: OCR may not capture complex visual elements
- Languages: Ensure API supports your document languages
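A spot check can be partly automated by flagging conversions that look suspicious before reading them by hand. The sketch below is one possible heuristic, not an established quality metric: the function name flag_suspect_files and both thresholds are assumptions chosen for illustration and should be tuned to your own documents.

```python
from pathlib import Path


def flag_suspect_files(md_root, min_chars=500, max_bad_ratio=0.01):
    """Return [(path, reason)] for converted files worth a manual look.

    Heuristics (illustrative thresholds, not calibrated values):
    - very short output often means the OCR step produced almost nothing
    - many U+FFFD replacement characters suggest encoding damage
    """
    suspects = []
    for md in sorted(Path(md_root).rglob("*.md")):
        text = md.read_text(encoding="utf-8", errors="replace")
        if len(text.strip()) < min_chars:
            suspects.append((md, "very short output"))
            continue
        bad_ratio = text.count("\ufffd") / max(len(text), 1)
        if bad_ratio > max_bad_ratio:
            suspects.append((md, "many undecodable characters"))
    return suspects
```

Files that pass these checks can still contain misread tables or formulas, so this narrows the manual review rather than replacing it.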
Cost optimization
- Batch processing: Convert all PDFs at once rather than individually
- File size limits: Be aware of API size restrictions
- Free tiers: Use free options for small projects
- Caching: Store converted files to avoid re-processing
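The caching point can be made concrete with a small check that skips PDFs whose Markdown output already exists and is up to date. This is a minimal sketch that compares file modification times; the helper name needs_conversion is hypothetical, and a content hash would be a more robust choice if files are copied or synced in ways that reset timestamps.

```python
from pathlib import Path


def needs_conversion(pdf, target):
    """Return True if target is missing or older than the source PDF."""
    pdf, target = Path(pdf), Path(target)
    if not target.exists():
        return True
    # Re-convert only when the PDF changed after the last conversion.
    return pdf.stat().st_mtime > target.stat().st_mtime
```

Filtering the batch with this check before submitting it keeps repeated runs from paying for pages that were already converted.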
Workflow integration
- Version control: Track both original PDFs and converted text
- Backup: Keep original PDFs as source of truth
- Naming: Maintain consistent file naming conventions
- Metadata: Preserve citation information alongside converted text
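One way to keep citation information next to the converted text is a small JSON sidecar file per document. The sketch below is an assumption about how this could be done, not a convention the scripts in this guide follow: the helper name write_citation_sidecar, the .meta.json suffix, and the field names in the usage example are all hypothetical.

```python
import json
from pathlib import Path


def write_citation_sidecar(md_path, citation):
    """Write citation metadata to <name>.meta.json beside the Markdown file."""
    md_path = Path(md_path)
    sidecar = md_path.with_name(md_path.stem + ".meta.json")
    sidecar.write_text(
        json.dumps(citation, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return sidecar
```

For example, write_citation_sidecar("markdown/articles/example.md", {"title": "Example Title", "authors": ["A. Author"], "source_pdf": "pdfs/articles/example.pdf"}) would create markdown/articles/example.meta.json, which keeps the Markdown body itself free of extra headers.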
Troubleshooting
Common issues
"API Key Not Found"
- Verify the environment variable is set: echo $MISTRAL_API_KEY
- Restart your terminal/command prompt
- Check for typos in variable name
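The checks above can also be run from Python before starting a batch. This is a minimal sketch: the helper name require_api_key is hypothetical, and the similarity cutoff used to suggest misspelled variable names is an arbitrary choice.

```python
import difflib
import os


def require_api_key(name="MISTRAL_API_KEY", env=None):
    """Return the key, or raise with a hint about similarly named variables."""
    env = os.environ if env is None else env
    value = (env.get(name) or "").strip()
    if value:
        return value
    # Look for likely typos such as MISTRAL_APIKEY (cutoff is an assumption).
    others = [key for key in env if key != name]
    near = difflib.get_close_matches(name, others, n=3, cutoff=0.8)
    hint = f" Similar variables found: {', '.join(near)}." if near else ""
    raise RuntimeError(f"{name} is not set or is empty.{hint}")
```

Calling require_api_key() at the top of a script fails fast with a readable message instead of a later authentication error.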
"File Too Large"
- Default limit is 36MB per PDF
- Split large documents or use the --max-size-mb parameter
- Consider alternative conversion methods for very large files
"Processing Failed"
- Check PDF file integrity
- Some PDFs may have copy protection
- Try alternative conversion methods
"Rate Limits Exceeded"
- APIs have rate limits (requests per minute/hour)
- Implement delays between requests
- Consider paid plans for higher limits
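"Implement delays between requests" is commonly done with exponential backoff: wait longer after each failed attempt, with a little random jitter. The sketch below is a generic pattern, not tied to any particular OCR client; the helper name call_with_backoff and all default values are assumptions, and in practice retry_on should be narrowed to the rate-limit exception your client actually raises.

```python
import random
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying with exponentially growing delays on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error.
            # Delays grow as 1x, 2x, 4x... of base_delay, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% jitter avoids many clients retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

Wrapping each request as call_with_backoff(lambda: send_request(pdf)), where send_request is whatever function submits one document, keeps the retry policy in one place.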
Integration with case study workflow
Recommended usage pattern
- Initial Setup: Convert all PDFs at project start
- Ongoing: Convert new PDFs as they're added
- Processing: Use converted Markdown files for all AI workflows
- Storage: Maintain both original PDFs and converted text
Case study references
- AI-assisted literature analysis
- Human-AI synthesis workflows
- Agentic workflow design
- API Keys Guide: For Mistral setup
- Model Reference Guide: For compatible AI models
Next steps
- Get Started: Choose your preferred conversion method
- Test: Convert a few sample PDFs
- Scale: Process your full document collection
- Integrate: Use converted files in your AI workflows
- Monitor: Track conversion quality and costs