i-am-ai

OCR has always been the unglamorous first step in document AI workflows. You can have the fanciest RAG pipeline or the smartest agent, but if your document ingestion layer can't reliably extract text from PDFs, scanned forms, tables, or screenshots, everything downstream falls apart.

PaddleOCR 3.5 just shipped with a feature that should make life easier for teams building document AI on Hugging Face-native infrastructure: you can now run PaddleOCR models—including the PP-OCRv5 series and PaddleOCR-VL 1.5 document parsing models—with Transformers as the inference backend.

The pitch is simple. Set engine="transformers" and PaddleOCR handles the OCR pipeline orchestration while Transformers runs the supported models. If you're already using PyTorch and Transformers for the rest of your stack, this integration removes friction.

What Actually Changed

PaddleOCR 3.5 introduced a new inference-engine abstraction. Developers now pick a backend via the engine parameter and configure backend-specific settings through engine_config.

In practice:

PaddleOCR continues to manage the multi-step pipelines behind OCR and document parsing tasks (text detection, recognition, layout analysis, etc.).
Transformers becomes one of the supported runtime options for executing the models that power those pipelines.
You get access to Transformers' device placement, dtype controls, and attention implementation options through engine_config.
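
To make that division of labor concrete, here is a toy sketch of the pattern. It is not PaddleOCR's actual internals: a pipeline object owns the step ordering, while an interchangeable backend object executes each model. Every name below (`Backend`, `EchoBackend`, `ToyOCRPipeline`) is hypothetical and exists only for illustration.

```python
from typing import Protocol


class Backend(Protocol):
    """Hypothetical interface: anything that can run a named model on an input."""

    def run(self, model_name: str, data: object) -> object: ...


class EchoBackend:
    """Stand-in backend that records which models it was asked to run."""

    def __init__(self, label: str) -> None:
        self.label = label
        self.calls: list[str] = []

    def run(self, model_name: str, data: object) -> object:
        self.calls.append(model_name)
        # A real backend would execute the model here; this one just tags the data.
        return f"{self.label}:{model_name}({data})"


class ToyOCRPipeline:
    """Owns the step order (detect, then recognize); the backend owns execution."""

    def __init__(self, backend: Backend) -> None:
        self.backend = backend

    def predict(self, image: str) -> object:
        boxes = self.backend.run("text_detection", image)
        return self.backend.run("text_recognition", boxes)


# Swapping the backend leaves the pipeline logic untouched.
pipeline = ToyOCRPipeline(EchoBackend("transformers"))
print(pipeline.predict("page.png"))
```

The point of the sketch is only that the caller's code does not change when the backend does, which is the property the `engine` parameter is meant to provide.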

The stack now looks like this:

Layer	Role	Examples
Application	What you're building	RAG, agents, document analytics
Model	OCR and parsing capabilities	PP-OCRv5, PaddleOCR-VL 1.5
Inference Backend	Runtime executing the models	Paddle static/dynamic, Transformers

This release is infrastructure work. PaddleOCR still provides the models and task pipelines. The new part is that Transformers becomes a first-class backend option for running them, which matters if you're already living in the Hugging Face ecosystem.

Why This Actually Matters for Document AI

Let's be real: most of the hard work in document AI happens before the LLM sees anything.

You need to turn messy PDFs, scanned receipts, complex page layouts, tables, charts, and handwritten notes into clean structured data. If this ingestion layer is brittle, your downstream RAG system retrieves the wrong context, your agent misses key information, or your analytics pipeline produces garbage.

PaddleOCR has been quietly solving this problem for years with models like PP-OCRv5 for general OCR and PaddleOCR-VL 1.5 for full document parsing (layout detection, table extraction, formula recognition).

What's new is that these capabilities now integrate more naturally with Transformers-centered stacks. If your team is already using Transformers for model serving, experimentation, or artifact management via the Hub, you can now bring PaddleOCR's OCR and document parsing into that same environment without switching paradigms.

Less integration friction means you can focus on the actual application layer—whether that's RAG, search, analytics, or agentic workflows.

Quick Start

Install PaddleOCR 3.5, PaddleX, Transformers, and a PyTorch build matching your hardware. For CUDA 12.6:

python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
python -m pip install "paddleocr==3.5.0" "paddlex==3.5.2" "transformers>=5.4.0"

Run from the command line:

paddleocr ocr \
  -i https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_ocr_002.png \
  --device gpu:0 \
  --engine transformers

Or use the Python API:

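from paddleocr import PaddleOCR

# Build the OCR pipeline once and reuse it across inputs.
pipeline = PaddleOCR(
    device="gpu:0",
    engine="transformers",  # run the supported models with Transformers
    # Skip the optional preprocessing steps for this quick test.
    use_doc_orientation_classify=False,
    use_doc_unwarping=False,
    use_textline_orientation=False,
    engine_config={"dtype": "float32"},
)

# predict accepts a local path or a URL to an image.
results = pipeline.predict(
    "https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_ocr_002.png"
)

# Each item holds the OCR output for one input image or page.
for result in results:
    print(result)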
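
Printing the result is fine for a smoke test, but downstream code usually wants plain text. The sketch below assumes each result can be read as a mapping with parallel `rec_texts` and `rec_scores` lists. Those key names and the `min_score` threshold are assumptions to check against the result objects your installed version actually returns. It runs on plain dictionaries so the logic can be tested without a model.

```python
from typing import Iterable, Mapping


def extract_text(results: Iterable[Mapping], min_score: float = 0.5) -> list[str]:
    """Return one string per result, keeping only lines at or above min_score."""
    pages = []
    for result in results:
        # Assumed keys: parallel lists of recognized strings and confidences.
        texts = result.get("rec_texts", [])
        scores = result.get("rec_scores", [])
        kept = [t for t, s in zip(texts, scores) if s >= min_score]
        pages.append("\n".join(kept))
    return pages


# Synthetic stand-in for OCR output (assumed structure, not real model output).
fake_results = [
    {
        "rec_texts": ["Invoice 1042", "T0tal: ???", "Total: 99.00"],
        "rec_scores": [0.98, 0.31, 0.95],
    },
]
print(extract_text(fake_results))
```

Filtering on confidence before the text reaches a retriever or an agent is one simple way to keep low-quality lines out of the downstream context.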

You can tune backend behavior through engine_config:

engine_config = {
    "dtype": "bfloat16",
    "device_type": "gpu",
    "device_id": 0,
    "attn_implementation": "sdpa",
}

The optimal config depends on your model, hardware, and deployment constraints. The demo Space uses float32 for broad compatibility, but you'll want to experiment for production.
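
One way to keep that experimentation manageable is to build `engine_config` from a couple of hardware facts instead of hard-coding it. This is a minimal sketch: the keys mirror the example above, while the helper name, the `"cpu"` value for `device_type`, and the policy of falling back to `float32` are assumptions rather than documented PaddleOCR behavior.

```python
def build_engine_config(has_gpu: bool, bf16_supported: bool, device_id: int = 0) -> dict:
    """Pick a dtype and device for the Transformers backend (hypothetical helper)."""
    if not has_gpu:
        # Assumed CPU setting; float32 is the conservative choice without a GPU.
        return {"dtype": "float32", "device_type": "cpu"}
    return {
        # Use bfloat16 only when the hardware reports support for it.
        "dtype": "bfloat16" if bf16_supported else "float32",
        "device_type": "gpu",
        "device_id": device_id,
        "attn_implementation": "sdpa",
    }


print(build_engine_config(has_gpu=True, bf16_supported=False))
```

With PyTorch installed, `torch.cuda.is_available()` and `torch.cuda.is_bf16_supported()` can supply the two flags.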

When Should You Use This?

Use the Transformers backend when you want PaddleOCR's OCR and document parsing capabilities to fit naturally into a Hugging Face-centered workflow.

This makes sense if you're:

Building RAG, document AI, or agent applications on top of PyTorch/Transformers infrastructure
Already using the Hub for model discovery, versioning, or deployment
Prioritizing developer experience and ecosystem integration over absolute throughput

If raw OCR throughput is the primary concern, PaddleOCR's default paddle_static backend is typically faster.
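
"Typically faster" is worth verifying on your own documents. The following is a minimal, backend-agnostic timing sketch. `time_predict` is a hypothetical helper that accepts any callable, so you could pass a function wrapping `pipeline.predict` for each engine and compare the numbers. A wrapper around `predict` should consume the returned results (for example with `list(...)`) in case they are produced lazily.

```python
import statistics
import time
from typing import Callable, Sequence


def time_predict(
    predict: Callable[[str], object],
    inputs: Sequence[str],
    warmup: int = 1,
    repeats: int = 3,
) -> float:
    """Return the median seconds per full pass over inputs."""
    for _ in range(warmup):
        # Warm-up passes absorb one-time costs such as model loading.
        for item in inputs:
            predict(item)
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        for item in inputs:
            predict(item)
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Dummy callable standing in for a real OCR pipeline.
seconds = time_predict(lambda path: path.upper(), ["a.png", "b.png"])
print(f"median pass: {seconds:.6f}s")
```

Using the median over several passes makes the comparison less sensitive to a single slow run.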

This isn't about replacing one backend with another. It's about flexibility. Use PaddleOCR for battle-tested OCR and document parsing models, and pick the inference backend that fits your stack.

The Bigger Picture

Document ingestion is still the unsexy foundation of most AI applications that touch the real world. If your system can't reliably turn a messy scanned invoice or a complex research paper into structured data, no amount of clever prompting or fancy agent orchestration will save you.

PaddleOCR has been solving this problem at scale for years. The Transformers backend integration doesn't change the core models or their capabilities—it just makes them easier to use in environments where Hugging Face tooling is already the default.

That's a pragmatic move. Most teams building document AI today are already using Transformers for something—whether it's embeddings, rerankers, or the downstream LLM itself. Being able to run your entire document processing pipeline on a consistent backend reduces operational complexity.

Try the live demo on Spaces or explore PaddleOCR models on the Hub.

Resources

PaddleOCR 3.5 Adds Transformers Backend: OCR Gets a Hugging Face-Native Option
