Automated document analysis is undergoing a major transformation. Thanks to advances in large language models (LLMs), optical character recognition (OCR) is no longer limited to extracting text: it now aims to interpret, understand and process documents intelligently. At ICDAR 2024 (International Conference on Document Analysis and Recognition), held in early September in Athens, researchers and companies shared their progress on the subject, among them Luminess, represented by François Wieckowiak, who is preparing a CIFRE thesis. These exchanges open up perspectives for the future development of Intelligent Document Processing (IDP).
I. OCR and LLM: from text recognition to document intelligence
OCR, once seen as a standalone technology, is now being rethought in light of LLMs, models capable of analyzing and understanding text at an unprecedented level. As NVIDIA's Thomas Breuel demonstrated in his keynote, the LLM era marks a turning point for document analysis. Three approaches stand out in this new era:
- The classic OCR and LLM combination: text extraction via OCR is followed by contextual interpretation by an LLM. This method makes it possible to perform advanced tasks such as question answering on the content of a document (e.g. answering specific questions based on the extracted text). The approach is powerful, but it depends on the performance of the OCR;
- Emerging multimodal models: able to analyze both image and text, these systems overcome some of the limitations of traditional approaches. The TiLT model, presented at the conference, makes it possible not only to extract the text, but also to "understand" the document by taking into account layout and graphic elements. For example, it can differentiate a header from a footer, or understand the hierarchy of headings in a document;
- Complete elimination of OCR: models like Donut or GPT-4 Vision are able to process visual documents directly, analyzing their content without any prior text extraction. A technological feat, but one that raises questions about industrial adoption.
Are such approaches viable in production environments with limited resources?
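To make the first approach concrete, here is a minimal sketch of an OCR-then-LLM question-answering chain. The names `answer_question`, `build_prompt`, `fake_ocr` and `fake_llm` are illustrative assumptions, not an API presented at the conference; the two stand-in functions only mimic what a real OCR engine and a real LLM would return.

```python
from typing import Callable


def build_prompt(ocr_text: str, question: str) -> str:
    """Assemble the text sent to the LLM: extracted text first, then the question."""
    return f"Document text:\n{ocr_text}\n\nQuestion: {question}\nAnswer:"


def answer_question(
    image: bytes,
    question: str,
    run_ocr: Callable[[bytes], str],
    ask_llm: Callable[[str], str],
) -> str:
    """Two-stage chain: the LLM only ever sees what the OCR stage extracted."""
    ocr_text = run_ocr(image)
    return ask_llm(build_prompt(ocr_text, question))


# Hypothetical stand-ins so the sketch runs without any external service.
def fake_ocr(image: bytes) -> str:
    return "INVOICE 2024-017\nTotal due: 1,250.00 EUR"


def fake_llm(prompt: str) -> str:
    # Toy "model": returns whatever follows the colon on the line starting with "Total".
    for line in prompt.splitlines():
        if line.startswith("Total"):
            return line.split(":", 1)[1].strip()
    return "unknown"


print(answer_question(b"", "What is the total due?", fake_ocr, fake_llm))
# prints: 1,250.00 EUR
```

Because the LLM receives only the OCR output, any character misread in the first stage is carried into the answer, which is the dependency noted in the first approach.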
II. Inspiring Use Cases: The Future of OCR at Your Fingertips
Advances in OCR and LLMs are translating into concrete results. Several studies presented at the conference highlighted innovative applications:
- Recognition of food product labels: a study compared the OCR + LLM approach with the GPT-4 Vision model. The latter outperformed traditional OCR by 15% in precision by processing images directly, highlighting the effectiveness of visual analysis;
- Improved layout handling: LAPDoc demonstrated that using a spatial format to represent the layout, and thus taking the document structure into account, improves LLM performance by 20% compared to raw OCR outputs;
- Handwriting recognition: in resource-constrained production environments, models such as CRNN (Convolutional Recurrent Neural Network) + CTC (Connectionist Temporal Classification) have shown that less complex solutions can be just as effective, or even preferable, in some contexts. These models achieved 95% accuracy on handwritten text samples while requiring 50% fewer computational resources than LLMs.
These examples illustrate not only the potential of new technologies, but also the need to choose solutions tailored to the specific needs of each application.
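The idea of a spatial format, mentioned in the layout item above, can be illustrated with a small sketch: OCR words are placed on a character grid according to their bounding-box position, so that columns and alignment survive in the plain text passed to the LLM. This is one possible rendering under simple assumptions (fixed `char_width` and `line_height`, a hypothetical `Word` type), not LAPDoc's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: int  # left edge of the bounding box, in pixels
    y: int  # top edge of the bounding box, in pixels


def render_spatial(words: list[Word], char_width: int = 10, line_height: int = 20) -> str:
    """Render OCR words as text whose whitespace mirrors the page layout."""
    # Group words into rows by quantizing their vertical position.
    rows: dict[int, list[Word]] = {}
    for word in words:
        rows.setdefault(word.y // line_height, []).append(word)

    lines = []
    for row in sorted(rows):
        line = ""
        for word in sorted(rows[row], key=lambda w: w.x):
            col = word.x // char_width
            # Keep at least one space between words that would otherwise collide.
            if line and col <= len(line):
                col = len(line) + 1
            line = line.ljust(col) + word.text
        lines.append(line)
    return "\n".join(lines)


words = [
    Word("Invoice", 0, 0), Word("No.", 200, 0), Word("42", 240, 0),
    Word("Total", 0, 40), Word("99.00", 200, 40),
]
print(render_spatial(words))
```

In the printed result, "No." and "99.00" start in the same column, a cue that a raw OCR word stream would not preserve.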
III. The promise… and the limits of LLMs in OCR
Despite the enthusiasm they generate, these new technologies present challenges.
One of the main points raised at the conference was the difficulty of diagnosing errors in hybrid systems. When an anomaly occurs, is it the OCR that misread a character? Or the LLM that failed to correctly contextualize the information?
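One simple way to approach this question, assuming a labelled reference value is available for the field being checked, is to test whether the expected value survived the OCR stage at all. The function `attribute_error` below is a hypothetical sketch of that heuristic, not a method presented at the conference.

```python
def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so that formatting noise is ignored."""
    return " ".join(text.lower().split())


def attribute_error(expected: str, ocr_text: str, llm_answer: str) -> str:
    """Return "ok", "ocr" or "llm" depending on where the expected value was lost."""
    if _normalize(expected) == _normalize(llm_answer):
        return "ok"
    if _normalize(expected) not in _normalize(ocr_text):
        # The value never reached the LLM: the OCR stage is the likely culprit.
        return "ocr"
    # The value was present in the OCR output, yet the final answer is wrong.
    return "llm"


# The OCR read the digit 0 as the letter O, so the amount is already wrong upstream.
print(attribute_error("1250.00", "Total: 125O.00 EUR", "125O.00"))  # prints: ocr
# The OCR output is correct, but the model picked the wrong field.
print(attribute_error("1250.00", "Net: 1000.00 Total: 1250.00", "1000.00"))  # prints: llm
```

The heuristic is deliberately coarse: it only works for values that should appear verbatim in the document, and it cannot separate the two stages when the answer requires reasoning over several fields.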
In addition, collecting and accessing training data remains problematic. In specific areas, such as scientific or legal documents, access to annotated, high-quality data is still too limited. This is hampering the adoption of these technologies in sectors where precision is essential.
Another challenge lies in cost management. LLMs, due to their complexity and the computing power they require, are resource-hungry. According to a recent study, the cost of using an LLM can be up to 10 times higher than that of a traditional OCR system for similar tasks.
In a production context, this raises a key question: is it always necessary to deploy such complex models? Or could lighter, but sufficiently effective, approaches suffice?
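To get a feel for what the cost ratio cited above implies, the sketch below compares three deployment scenarios in relative units, where one unit is the cost of sending one page through a traditional OCR system. The ratio of 10 is the upper bound quoted above; the page volumes and the 80/20 split are purely hypothetical.

```python
def relative_cost(ocr_pages: int, llm_pages: int, llm_cost_ratio: int = 10) -> int:
    """Total cost, in units of "one page through traditional OCR"."""
    return ocr_pages + llm_pages * llm_cost_ratio


total_pages = 100_000  # hypothetical monthly volume

all_ocr = relative_cost(total_pages, 0)
all_llm = relative_cost(0, total_pages)
hybrid = relative_cost(80_000, 20_000)  # hypothetical split: 20% of pages go to the LLM

print(all_ocr, all_llm, hybrid)
# prints: 100000 1000000 280000
```

Under these assumptions, sending only a fifth of the volume to the LLM costs about 2.8 times the OCR-only baseline, instead of 10 times for an LLM-only deployment.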
IV. New playgrounds for Luminess: which path to follow?
For Luminess, these technological advances represent both opportunities and challenges. It is clear that the integration of LLMs into IDP can radically transform the way we approach document automation. However, the technological choice will always depend on the specific needs of our customers.
For example, for standard invoice processing, lightweight solutions such as OCR combined with intelligent layout techniques could perfectly meet cost and resource constraints. On the other hand, for the analysis of complex contracts requiring a deep understanding of the legal context, the use of an LLM would be justified.
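Such a choice can be expressed as a simple routing rule. The sketch below is only an illustration of the idea: the pipeline names, the `ROUTES` table and the decision to fall back to the lightweight pipeline for unknown document types are all assumptions, not a description of an existing Luminess system.

```python
LIGHTWEIGHT = "ocr_with_layout"  # hypothetical name for the OCR + layout pipeline
LLM_BASED = "llm"                # hypothetical name for the LLM-based pipeline

# Mapping taken from the two examples above; real deployments would need more types.
ROUTES = {
    "invoice": LIGHTWEIGHT,
    "contract": LLM_BASED,
}


def choose_pipeline(doc_type: str) -> str:
    """Pick a processing pipeline from the document type."""
    # Assumption: unknown types default to the cheaper pipeline.
    return ROUTES.get(doc_type.strip().lower(), LIGHTWEIGHT)


print(choose_pipeline("Invoice"))   # prints: ocr_with_layout
print(choose_pipeline("contract"))  # prints: llm
```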
Conclusion: a transformation in progress
The ICDAR 2024 conference showed that the future of document recognition is no longer limited to text extraction. For Luminess, it is about finding the right balance between technological performance and operational constraints. LLMs offer enormous potential, but must be adapted to the realities of production.
In the coming months, Luminess plans to launch a pilot project integrating LLMs for processing complex documents in the banking sector, while optimizing its existing OCR solutions for more routine tasks. This hybrid approach will allow the benefits and challenges of these new technologies to be concretely evaluated in a real production environment.
This thinking is at the heart of Luminess's future innovations, as we continue to push the limits of IDP and offer ever more effective, tailored solutions to our customers.
By Tony Bonnet and François Wieckowiak