Kreuzberg v4.3.0 Launches with PaddleOCR Integration and Groundbreaking Document Processing Benchmarks
Kreuzberg, the open-source document intelligence framework, has released version 4.3.0 with native Rust integration of PaddleOCR, significantly enhancing OCR performance for East Asian languages. Simultaneously, the team unveiled comprehensive, reproducible benchmarks comparing Kreuzberg against leading tools like Apache Tika and Unstructured, demonstrating superior speed, efficiency, and lower resource consumption.

Kreuzberg, the open-source polyglot document intelligence framework written in Rust, has released v4.3.0 alongside a new set of performance benchmarks aimed at document processing in AI and enterprise workflows. The update adds native support for PaddleOCR, a high-accuracy optical character recognition system, through a fully integrated Rust backend, enabling text extraction from complex documents in Chinese, Japanese, Korean, German, French, and English. This is particularly relevant for teams deploying multilingual AI pipelines, where traditional OCR tools often struggle with East Asian scripts.
Complementing the software release, Kreuzberg’s team published an interactive, fully reproducible benchmark suite at kreuzberg.dev/benchmarks, comparing its performance against seven widely used document extraction tools: Apache Tika, Docling, Unstructured, pdfplumber, PyMuPDF4LLM, MarkItDown, and MinerU. All tests were executed in a standardized Linux environment via GitHub Actions, so that every tool is measured under the same conditions for throughput, cold start latency, memory usage, CPU load, success rates, and extraction quality. According to the published results, Kreuzberg outperforms the other tools on most of these measures, with document processing times often measured in milliseconds rather than seconds and a much smaller installation footprint, which matters for containerized and edge deployments.
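To make these metrics concrete, the following is a minimal Python sketch of how throughput, success rate, and peak memory might be measured for any extraction function. It is not the project’s benchmark harness: `fake_extract` and `benchmark` are hypothetical names introduced here for illustration, and the stand-in extractor only decodes bytes. Note that `tracemalloc` tracks Python heap allocations only, so a real harness measuring a native Rust library would need process-level memory statistics instead.

```python
import time
import tracemalloc
from typing import Callable, Iterable


def fake_extract(data: bytes) -> str:
    """Hypothetical stand-in for a real extractor; it only decodes UTF-8 bytes."""
    return data.decode("utf-8")


def benchmark(extract: Callable[[bytes], str], docs: Iterable[bytes]) -> dict:
    """Run `extract` over `docs`; report throughput, success rate, and peak heap."""
    docs = list(docs)
    successes = 0
    tracemalloc.start()
    start = time.perf_counter()
    for doc in docs:
        try:
            extract(doc)
            successes += 1
        except Exception:
            pass  # a failed document counts against the success rate
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "docs": len(docs),
        "success_rate": successes / len(docs) if docs else 0.0,
        "docs_per_second": len(docs) / elapsed if elapsed > 0 else float("inf"),
        "peak_python_heap_mib": peak_bytes / (1024 * 1024),
    }


if __name__ == "__main__":
    # Two valid documents and one with invalid UTF-8 that should fail.
    sample = [b"hello", "caf\u00e9".encode("utf-8"), b"\xff\xfe broken"]
    print(benchmark(fake_extract, sample))
```

Swapping `fake_extract` for a call into an actual extraction library, and running the same document set against each tool, gives directly comparable figures as long as the environment is held constant.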
One of the most notable changes in v4.3.0 is the elimination of LibreOffice as a dependency for legacy formats like .doc and .ppt. Previously, handling these formats required LibreOffice to be installed on the system, which complicated deployment and increased container sizes. Kreuzberg now handles them natively, reducing system dependencies and improving reliability across environments ranging from cloud servers to local developer machines. This move aligns with the project’s broader mission of simplifying document ingestion pipelines for developers working in Rust, Python, JavaScript (Node, Bun, WASM), Ruby, Java, Go, PHP, Elixir, and C#.
The integration of PaddleOCR represents more than a technical upgrade; it’s a strategic response to growing demand for multilingual document understanding in global AI systems. PaddleOCR’s state-of-the-art models, particularly strong in handling dense, non-Latin scripts, were previously accessible only through Python-based wrappers, forcing developers to bridge ecosystems or sacrifice performance. Kreuzberg’s native Rust implementation eliminates this friction, allowing for faster inference, better memory management, and tighter integration with other Rust-based AI components.
According to the project’s documentation, the new benchmarks show Kreuzberg achieving up to 3.7x higher throughput than Apache Tika on batched PDF processing, with cold start times under 200 ms, roughly half the figure reported for Unstructured. Memory consumption stays below 200 MB even under heavy load, while extraction accuracy remains above 95% across common formats. These metrics make Kreuzberg a compelling alternative for organizations scaling document AI pipelines where latency and cost are critical.
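For readers who want to check figures like these themselves, here is a minimal Python sketch of the two calculations involved: timing a cold start in a fresh process, and expressing a throughput comparison as a ratio. The function names `cold_start_seconds` and `speedup` are hypothetical, and the numbers in the usage lines are illustrative inputs, not measurements from the benchmark suite.

```python
import subprocess
import sys
import time


def cold_start_seconds(code: str, runs: int = 3) -> float:
    """Best wall-clock time to launch a fresh interpreter and run `code`."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # A new process each time, so no imports or caches carry over in-process.
        subprocess.run([sys.executable, "-c", code], check=True)
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(candidate_docs_per_s: float, baseline_docs_per_s: float) -> float:
    """Throughput ratio; values above 1.0 mean the candidate is faster."""
    return candidate_docs_per_s / baseline_docs_per_s


if __name__ == "__main__":
    # Illustrative only: time importing a standard-library module.
    print(f"cold start: {cold_start_seconds('import json') * 1000:.1f} ms")
    # Illustrative inputs: 37 docs/s against a 10 docs/s baseline.
    print(f"speedup: {speedup(37.0, 10.0):.1f}x")
```

Replacing the `import json` snippet with the import and first call of the library under test gives a rough cold start figure; operating system file caches still warm up between runs, which is one reason a fixed CI environment is useful for this kind of comparison.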
As AI-driven document processing becomes foundational to compliance, legal tech, healthcare records, and financial automation, Kreuzberg’s transparent benchmarking and minimal-dependency architecture position it as a serious contender in the document intelligence space. The project’s MIT license and active GitHub community further encourage adoption and contribution, with developers invited to submit enhancements, report edge cases, or expand language support.
The name ‘Kreuzberg’ is shared with a historically vibrant district in Berlin. The team has not explicitly confirmed a connection, but the neighborhood’s multicultural fabric echoes the framework’s commitment to supporting global languages and open collaboration.
For developers and enterprises seeking a fast, reliable, and language-inclusive document processor, Kreuzberg v4.3.0 offers a compelling, production-ready solution. The combination of performance benchmarks and native PaddleOCR support sets a new standard for transparency and capability in open-source document intelligence tools.


