
Overview

Converting PDF documents to text formats is essential for effective AI workflows in research. Raw PDFs often contain complex layouts, images, and formatting that can interfere with AI processing, embeddings, and analysis. This guide covers multiple methods for converting PDFs to clean, AI-friendly text formats.

Key Benefits:
  • Better AI Processing: Clean text without layout artifacts or formatting issues
  • Cost Efficiency: Pre-convert PDFs once instead of processing them repeatedly
  • Token Optimization: Text formats use fewer tokens than PDF processing
  • Embedding Quality: Consistent text extraction improves embedding accuracy
  • Workflow Integration: Text files work seamlessly with all AI tools and models
Common Use Cases:
  • Research literature analysis
  • AI-assisted content summarization
  • Embedding creation for semantic search
  • Large language model context preparation
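To make the token-optimization point concrete, a common rule of thumb is roughly 4 characters per token for English text. The sketch below uses that heuristic; the ratio and the function name are illustrative assumptions, not an exact tokenizer.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using the ~4-characters-per-token heuristic
    for English text (an approximation, not an exact tokenizer)."""
    return max(1, round(len(text) / chars_per_token))

sample = "Converting PDFs to clean text keeps prompts compact."
print(estimate_tokens(sample))  # a rough estimate, not an exact count
```

Clean extracted text avoids the extra overhead of layout artifacts, so estimates like this tend to be closer to the actual token count a model sees.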

PDF Conversion Methods

Method 1: Manual Copy-Paste (Free, Limited)

Best for: Small numbers of documents, simple layouts
  1. Open PDF in a PDF reader
  2. Select and copy text from each page
  3. Paste into a text editor or Markdown file
  4. Save with .md or .txt extension
Limitations: Time-consuming, layout issues, manual errors

Method 2: MinerU (Free Tier Available)

Best for: Good balance of quality and ease

MinerU is an open-source tool that provides high-quality PDF text extraction with layout preservation.

Setup:
  1. Visit MinerU or GitHub
  2. Use the web interface for quick conversions
  3. For batch processing, install locally
Free Tier: Limited daily conversions
Pros: Good quality, preserves structure
Cons: Daily limits, requires internet for web version

Method 3: Official APIs (Paid, High Quality)

Best for: Batch processing, high volume, automated workflows

Several AI companies offer OCR APIs specifically designed for document processing:
  • Mistral OCR
  • Google Document AI
  • Azure Form Recognizer
  • AWS Textract

Mistral OCR API Setup (Optional)

The case study provides a ready-to-use script (batch_ocr.py) that leverages Mistral’s OCR API for high-quality PDF conversion.

Step 1: Get Mistral API Key

  1. Visit Mistral AI Console
  2. Create an account or sign in
  3. Navigate to API Keys section
  4. Click “Create new key”
  5. Copy and save your API key securely
Pricing: ~$0.001 per page (varies by document complexity)
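At that rate, a back-of-the-envelope budget check is easy to sketch. The per-page rate below is the approximation quoted above; actual billing varies by document complexity, so check Mistral's pricing page before a large run.

```python
def estimate_ocr_cost(num_pages: int, price_per_page: float = 0.001) -> float:
    """Rough cost estimate at ~$0.001/page (the approximate rate quoted
    above; actual pricing varies and may change)."""
    return num_pages * price_per_page

# e.g. a 500-page reading list
print(f"${estimate_ocr_cost(500):.2f}")  # about $0.50 at the quoted rate
```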

Step 2: Environment Setup

Set your API key as an environment variable:
export MISTRAL_API_KEY="your_api_key_here"
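Before running the batch script, you can confirm the variable is visible to Python with a quick check like this sketch (it only tests presence and never prints the secret itself):

```python
import os

def check_api_key(var: str = "MISTRAL_API_KEY") -> bool:
    """Return True if the environment variable is set and non-empty.
    Deliberately never prints or logs the key's value."""
    return bool(os.environ.get(var, "").strip())

if check_api_key():
    print("MISTRAL_API_KEY is set")
else:
    print("MISTRAL_API_KEY is not set - run the export command above first")
```

Remember that environment variables set with export only apply to the current shell session; add the line to your shell profile to make it persistent.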

Step 3: Using the Batch OCR Script

The case study provides batch_ocr.py for automated conversion:
# Basic usage
python batch_ocr.py input_folder output_folder

# Example
python batch_ocr.py readings/pdfs readings/markdown
Script Features:
  • ✅ Recursive folder processing
  • ✅ Maintains original folder structure
  • ✅ Batch processing for efficiency
  • ✅ Automatic cleanup of temporary files
  • ✅ Markdown output with page breaks
File Size Limits: Default 36MB per PDF (configurable)
Copy this script into your workspace (e.g., batch_ocr.py) or ask your AI assistant to generate it for you. It requires the mistralai Python SDK (pip install mistralai) and reads from an input folder of PDFs, writes Markdown outputs, and cleans up temporary batch files automatically.
import os
import json
import base64
import time
import argparse
from pathlib import Path
from mistralai import Mistral


def encode_pdf_to_base64(pdf_path):
    """Encode a PDF file to base64."""
    try:
        with open(pdf_path, "rb") as pdf_file:
            return base64.b64encode(pdf_file.read()).decode('utf-8')
    except Exception as e:
        print(f"Error encoding {pdf_path}: {e}")
        return None


def find_pdf_files(input_folder, max_size_mb=36):
    """Recursively find all PDF files in the input folder that are within size limit."""
    pdf_files = []
    skipped_files = []
    max_size_bytes = max_size_mb * 1024 * 1024  # Convert MB to bytes

    for root, dirs, files in os.walk(input_folder):
        for file in files:
            if file.lower().endswith('.pdf'):
                full_path = os.path.join(root, file)
                relative_path = os.path.relpath(full_path, input_folder)

                # Check file size
                try:
                    file_size = os.path.getsize(full_path)
                    if file_size <= max_size_bytes:
                        pdf_files.append((full_path, relative_path))
                    else:
                        size_mb = file_size / (1024 * 1024)
                        skipped_files.append((relative_path, size_mb))
                        print(f"Skipping {relative_path} ({size_mb:.1f}MB) - exceeds {max_size_mb}MB limit")
                except OSError as e:
                    print(f"Error checking size of {relative_path}: {e}")

    if skipped_files:
        print(f"\nSkipped {len(skipped_files)} file(s) due to size limit:")
        for path, size in skipped_files[:5]:  # Show first 5
            print(f"  - {path} ({size:.1f}MB)")
        if len(skipped_files) > 5:
            print(f"  ... and {len(skipped_files) - 5} more")

    return pdf_files


def create_batch_file(pdf_files, batch_file_path, input_folder):
    """Create a JSONL batch file for OCR processing."""
    entries = []
    with open(batch_file_path, 'w') as file:
        for index, (pdf_path, relative_path) in enumerate(pdf_files):
            print(f"Encoding PDF {index + 1}/{len(pdf_files)}: {relative_path}")
            base64_pdf = encode_pdf_to_base64(pdf_path)

            if base64_pdf:
                entry = {
                    "custom_id": f"{index}|{relative_path}",  # Store relative path in custom_id
                    "body": {
                        "document": {
                            "type": "document_url",
                            "document_url": f"data:application/pdf;base64,{base64_pdf}"
                        },
                        "include_image_base64": False  # We don't need image base64 for text extraction
                    }
                }
                file.write(json.dumps(entry) + '\n')
                entries.append(entry)
            else:
                print(f"Failed to encode {pdf_path}, skipping...")

    return len(entries)


def create_output_structure(input_folder, output_folder, pdf_files):
    """Create the output folder structure matching the input."""
    for _, relative_path in pdf_files:
        # Get the directory part of the relative path
        relative_dir = os.path.dirname(relative_path)
        if relative_dir:
            output_dir = os.path.join(output_folder, relative_dir)
            os.makedirs(output_dir, exist_ok=True)


def process_batch_results(results_file, output_folder):
    """Process the batch results and save markdown files."""
    processed_count = 0

    with open(results_file, 'r') as f:
        for line in f:
            if line.strip():
                try:
                    result = json.loads(line)
                    custom_id = result.get('custom_id', '')

                    # Extract index and relative path from custom_id
                    if '|' in custom_id:
                        index, relative_path = custom_id.split('|', 1)

                        # Change extension from .pdf to .md
                        md_relative_path = os.path.splitext(relative_path)[0] + '.md'
                        output_path = os.path.join(output_folder, md_relative_path)

                        # Extract the OCR content
                        response = result.get('response', {})
                        if response and response.get('status_code') == 200:
                            body = response.get('body', {})
                            pages = body.get('pages', [])

                            # Create output directory if needed
                            os.makedirs(os.path.dirname(output_path), exist_ok=True)

                            # Write markdown content
                            with open(output_path, 'w', encoding='utf-8') as md_file:
                                for i, page in enumerate(pages):
                                    md_file.write(f"# Page {page.get('index', i) + 1}\n\n")
                                    # Mistral OCR returns each page's content in the 'markdown' field
                                    md_file.write(page.get('markdown', ''))
                                    md_file.write("\n\n---\n\n")

                            processed_count += 1
                        else:
                            status = response.get('status_code') if response else 'Unknown'
                            print(f"Failed to process {relative_path}: status {status}")
                except json.JSONDecodeError as e:
                    print(f"Error parsing result line: {e}")

    return processed_count


def main():
    parser = argparse.ArgumentParser(description="Batch OCR PDF conversion using Mistral OCR API.")
    parser.add_argument("input_folder", type=str, help="Path to folder containing PDF files.")
    parser.add_argument("output_folder", type=str, help="Path to folder where Markdown files will be saved.")
    parser.add_argument("--api-key", type=str, default=None, help="Mistral API key (optional if env var is set).")
    parser.add_argument("--max-size", type=int, default=36, help="Maximum PDF size in MB (default: 36).")

    args = parser.parse_args()

    # Resolve absolute paths
    input_folder = Path(args.input_folder).resolve()
    output_folder = Path(args.output_folder).resolve()

    if not input_folder.exists() or not input_folder.is_dir():
        print(f"Input folder does not exist or is not a directory: {input_folder}")
        return

    print(f"Input folder: {input_folder}")
    print(f"Output folder: {output_folder}")

    # Ensure output folder exists
    output_folder.mkdir(parents=True, exist_ok=True)

    # Resolve API key
    api_key = args.api_key or os.getenv("MISTRAL_API_KEY")
    if not api_key:
        print("Error: No API key provided. Use --api-key or set MISTRAL_API_KEY environment variable.")
        return

    # Initialize Mistral client
    try:
        client = Mistral(api_key=api_key)
        print("Mistral client initialized successfully.")
    except Exception as e:
        print(f"Failed to initialize Mistral client: {e}")
        return

    # Find all PDF files
    print(f"\nSearching for PDF files in: {input_folder}")
    pdf_files = find_pdf_files(str(input_folder), max_size_mb=args.max_size)

    if not pdf_files:
        print("No PDF files found in the input folder.")
        return

    print(f"Found {len(pdf_files)} PDF file(s)")

    # Create output folder structure
    create_output_structure(str(input_folder), str(output_folder), pdf_files)

    # Create batch file
    batch_file_path = "ocr_batch_requests.jsonl"
    print(f"\nCreating batch file: {batch_file_path}")
    num_entries = create_batch_file(pdf_files, batch_file_path, str(input_folder))

    if num_entries == 0:
        print("No valid entries created for batch processing.")
        return

    print(f"Created batch file with {num_entries} entries")

    # Upload batch file
    print("\nUploading batch file...")
    try:
        with open(batch_file_path, "rb") as f:
            batch_data = client.files.upload(
                file={
                    "file_name": batch_file_path,
                    "content": f
                },
                purpose="batch"
            )
        print(f"Batch file uploaded successfully. ID: {batch_data.id}")
    except Exception as e:
        print(f"Error uploading batch file: {e}")
        return

    # Create batch job
    print("\nCreating batch job...")
    try:
        created_job = client.batch.jobs.create(
            input_files=[batch_data.id],
            model="mistral-ocr-latest",
            endpoint="/v1/ocr",
            metadata={"job_type": "batch_ocr_processing"}
        )
        print(f"Batch job created. ID: {created_job.id}")
    except Exception as e:
        print(f"Error creating batch job: {e}")
        return

    # Monitor job progress
    print("\nMonitoring job progress...")
    while True:
        try:
            retrieved_job = client.batch.jobs.get(job_id=created_job.id)

            status = retrieved_job.status
            total = retrieved_job.total_requests
            failed = retrieved_job.failed_requests
            succeeded = retrieved_job.succeeded_requests

            print(f"\rStatus: {status} | Total: {total} | Succeeded: {succeeded} | Failed: {failed}", end='', flush=True)

            if status not in ["QUEUED", "RUNNING"]:
                print()  # New line
                break

            time.sleep(5)  # Check every 5 seconds

        except Exception as e:
            print(f"\nError checking job status: {e}")
            return

    # Download results
    if retrieved_job.status in ["SUCCEEDED", "SUCCESS"] and retrieved_job.output_file:
        print(f"\nJob completed successfully! Downloading results...")
        try:
            # Download the results file
            results_response = client.files.download(file_id=retrieved_job.output_file)
            results_file_path = "ocr_batch_results.jsonl"

            # Handle httpx.Response object
            if hasattr(results_response, 'iter_bytes'):
                results_content = b''.join(results_response.iter_bytes())
            else:
                # Fallback for other response types
                results_content = results_response.content if hasattr(results_response, 'content') else bytes(results_response)

            with open(results_file_path, "wb") as f:
                f.write(results_content)

            print(f"Results downloaded to: {results_file_path}")

            # Process results and create markdown files
            print("\nProcessing results and creating markdown files...")
            processed = process_batch_results(results_file_path, str(output_folder))
            print(f"\nSuccessfully processed {processed} files")

            # Clean up temporary files
            if os.path.exists(batch_file_path):
                os.remove(batch_file_path)
            if os.path.exists(results_file_path):
                os.remove(results_file_path)

        except Exception as e:
            print(f"Error downloading or processing results: {e}")
    else:
        print(f"\nJob failed with status: {retrieved_job.status}")
        if getattr(retrieved_job, "errors", None):
            print(f"Errors: {retrieved_job.errors}")


if __name__ == "__main__":
    main()

Alternative API Options

Google Document AI

Best for: Google ecosystem integration
  1. Visit Google Cloud Console
  2. Enable Document AI API
  3. Create a processor for OCR
  4. Use Python client library for batch processing

Azure Form Recognizer

Best for: Enterprise environments
  1. Visit Azure Portal
  2. Create Cognitive Services resource
  3. Use Form Recognizer service
  4. REST API or SDK integration

Best Practices and Tips

File Organization

your-project/
├── pdfs/           # Original PDFs
│   ├── session-1/
│   ├── session-2/
│   └── articles/
└── markdown/       # Converted text files
    ├── session-1/
    ├── session-2/
    └── articles/
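The convention above (mirrored folders, .pdf swapped for .md) can be captured in a small path-mapping helper. This is a sketch using the standard library; the function name is illustrative.

```python
from pathlib import Path

def markdown_path_for(pdf_path: Path, pdf_root: Path, md_root: Path) -> Path:
    """Map pdfs/session-1/paper.pdf -> markdown/session-1/paper.md,
    preserving the folder layout shown above. Computes paths only;
    it does not create any files."""
    relative = pdf_path.relative_to(pdf_root)
    return (md_root / relative).with_suffix(".md")

src = Path("pdfs/session-1/paper.pdf")
print(markdown_path_for(src, Path("pdfs"), Path("markdown")))
```

Keeping this mapping in one place means the same convention is applied whether files are converted in a batch or one at a time.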

Quality Control

  • Spot Check: Review converted files for accuracy
  • Complex Layouts: Some academic PDFs may need manual review
  • Images/Tables: OCR may not capture complex visual elements
  • Languages: Ensure API supports your document languages

Cost Optimization

  • Batch Processing: Convert all PDFs at once rather than individually
  • File Size Limits: Be aware of API size restrictions
  • Free Tiers: Use free options for small projects
  • Caching: Store converted files to avoid re-processing

Workflow Integration

  • Version Control: Track both original PDFs and converted text
  • Backup: Keep original PDFs as source of truth
  • Naming: Maintain consistent file naming conventions
  • Metadata: Preserve citation information alongside converted text

Troubleshooting

Common Issues

“API Key Not Found”
  • Verify environment variable is set: echo $MISTRAL_API_KEY
  • Restart your terminal/command prompt
  • Check for typos in variable name
“File Too Large”
  • Default limit is 36MB per PDF
  • Split large documents or use --max-size parameter
  • Consider alternative conversion methods for very large files
“Processing Failed”
  • Check PDF file integrity
  • Some PDFs may have copy protection
  • Try alternative conversion methods
“Rate Limits Exceeded”
  • APIs have rate limits (requests per minute/hour)
  • Implement delays between requests
  • Consider paid plans for higher limits
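A standard way to implement those delays is exponential backoff. The sketch below is generic, not tied to any particular SDK; in practice you would catch your client's specific rate-limit exception rather than a bare Exception.

```python
import time

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry a zero-argument callable with exponential backoff.
    A generic sketch - narrow the except clause to your API client's
    rate-limit error in real code."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; let the caller handle it
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            print(f"Request failed, retrying in {delay:.0f}s...")
            time.sleep(delay)
```

Wrapping individual API calls this way usually rides out short rate-limit windows without manual intervention.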

Integration with Case Study Workflow

  1. Initial Setup: Convert all PDFs at project start
  2. Ongoing: Convert new PDFs as they’re added
  3. Processing: Use converted Markdown files for all AI workflows
  4. Storage: Maintain both original PDFs and converted text

Case Study References

  • AI-powered literature analysis
  • Human-AI synthesis workflows
  • Agentic workflow design
  • API Keys Guide: For Mistral setup
  • Model Reference Guide: For compatible AI models

Advanced Options

The batch_ocr.py script can be customized for specific needs:
  • Change output format (currently Markdown)
  • Modify file size limits
  • Add custom metadata extraction
  • Integrate with other APIs
For offline processing or sensitive documents:
  • Tesseract OCR: Free, open-source
  • OCRmyPDF: PDF-specific OCR tool
  • PyMuPDF: Python PDF processing library
  • Adobe Acrobat: High-quality OCR
  • ABBYY FineReader: Enterprise OCR solution
  • Readiris: User-friendly OCR software

Next Steps

  1. Get Started: Choose your preferred conversion method
  2. Test: Convert a few sample PDFs
  3. Scale: Process your full document collection
  4. Integrate: Use converted files in your AI workflows
  5. Monitor: Track conversion quality and costs
Remember: Converting PDFs once saves time and costs in the long run compared to processing them repeatedly in AI workflows.