Document AI

Document AI, also known as Document Intelligence, is a field of technology that applies machine learning (ML) techniques, including natural language processing (NLP), to build computer models capable of analyzing documents in a manner akin to human review.¹

Through NLP, computer systems are able to understand relationships and contextual nuances in document contents, which facilitates the extraction of information and insights. Additionally, this technology enables the categorization and organization of the documents themselves.²

The applications of Document AI extend to processing and parsing a variety of semi-structured documents, such as forms, tables, receipts, invoices, tax forms, contracts, loan agreements, and financial reports.

Key features

Document AI uses machine learning to extract information from both printed and digital documents. It recognizes images, text, and characters in various languages, which helps in extracting insights from unstructured documents. This can improve the speed and quality of decision-making in document analysis, and automating data extraction and validation can make document analysis processes more efficient. Since the early 2020s, the integration of large language models has extended Document AI beyond extraction toward generative tasks, including the automated drafting of forms, contracts, and document summaries.³
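As an illustration of what automated validation of extracted data can look like, the sketch below checks a set of fields extracted from an invoice for completeness and internal consistency. It is a minimal example under stated assumptions: the function `validate_invoice` and the field names `invoice_number`, `total`, and `line_items` are hypothetical and do not belong to any particular Document AI product.

```python
from decimal import Decimal


def validate_invoice(fields):
    """Return a list of problems found in extracted invoice fields.

    `fields` is a hypothetical dict produced by an extraction step, with
    amounts stored as strings, e.g. {"total": "30.00", ...}.
    """
    problems = []

    # Completeness check: every required field must be present.
    for name in ("invoice_number", "total", "line_items"):
        if name not in fields:
            problems.append(f"missing field: {name}")
    if problems:
        return problems

    # Consistency check: the line items should add up to the stated total.
    # Decimal avoids the rounding surprises of binary floating point.
    line_sum = sum((Decimal(a) for a in fields["line_items"]), Decimal("0"))
    if line_sum != Decimal(fields["total"]):
        problems.append(
            f"total {fields['total']} does not match line item sum {line_sum}"
        )
    return problems


extracted = {
    "invoice_number": "A-1",
    "total": "30.00",
    "line_items": ["10.00", "20.00"],
}
print(validate_invoice(extracted))
```

An empty list means the extracted fields passed both checks. A document that fails could be routed to a human reviewer instead of being processed automatically.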

Example

A business letter contains information in the form of text, as well as other types of information, such as the position of the text. For instance, a typical letter contains two addresses before the body of the text. The address at the very top (sometimes aligned to the right) is the sender address. This is normally followed by the date of the letter, with the place of writing. After this, the receiver address is listed.

The distinction between the sender address and the receiver address is conveyed solely by the position of the address on the page, i.e. there is no textual indication such as "Sender:" in front of the addresses.
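The role of position can be illustrated with a toy heuristic. The sketch below assumes that an earlier step (for example OCR with layout analysis) has already found the address blocks and their page coordinates. The names `TextBlock` and `label_addresses` are hypothetical, and real systems learn such rules from data instead of hard-coding them.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    text: str
    x: float  # left edge, normalized to 0..1
    y: float  # top edge, normalized to 0..1 (0 is the top of the page)


def label_addresses(blocks):
    """Label the topmost address block as sender and the next as receiver.

    Only the vertical position is used; the text itself is never inspected.
    """
    ordered = sorted(blocks, key=lambda b: b.y)
    roles = {}
    if ordered:
        roles["sender"] = ordered[0]
    if len(ordered) > 1:
        roles["receiver"] = ordered[1]
    return roles


blocks = [
    TextBlock("Jane Roe, 2 Oak Ave", x=0.1, y=0.30),
    TextBlock("John Doe, 1 Main St", x=0.6, y=0.05),
]
roles = label_addresses(blocks)
print(roles["sender"].text)
print(roles["receiver"].text)
```

Swapping the `y` values of the two blocks swaps their roles even though the text is unchanged, which is the point of the example: the label comes from the layout, not from the words.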

Data dimensions and ML architecture

Data is often divided into spatial data and time-series data. The former includes things like images, maps, and graphs, while the latter includes signals such as stock prices or voice recordings. Document AI combines text data, which is sequential and therefore has a time-like dimension, with other types of data, such as the position of an address in a business letter, which is spatial.

Historically, machine learning analyzed spatial data with convolutional neural networks and temporal data with recurrent neural networks. The transformer architecture is agnostic to the type of dimension, which makes it easier to combine the two. Document AI is an example of this combination.
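One way to picture how a transformer-style model can receive both kinds of dimension as input is to add an embedding for each token's place in the reading order to embeddings for its horizontal and vertical position on the page. The following is a minimal sketch under stated assumptions, not the architecture of any particular model: the class `LayoutEmbedder`, the grid size, and the random (untrained) embedding tables are all hypothetical, and a real system would learn these tables during training.

```python
import random

DIM = 8    # embedding width (illustrative)
GRID = 10  # number of buckets per page axis (illustrative)


def make_table(rows, dim, rng):
    """Create a table of random vectors standing in for learned embeddings."""
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]


def bucket(coord, grid=GRID):
    """Map a normalized page coordinate in [0, 1] to an integer bucket."""
    return min(int(coord * grid), grid - 1)


class LayoutEmbedder:
    def __init__(self, vocab, dim=DIM, grid=GRID, max_len=32, seed=0):
        rng = random.Random(seed)
        self.vocab = {word: i for i, word in enumerate(vocab)}
        self.grid = grid
        self.word = make_table(len(vocab), dim, rng)  # what the token says
        self.seq = make_table(max_len, dim, rng)      # sequential dimension
        self.x = make_table(grid, dim, rng)           # spatial: horizontal
        self.y = make_table(grid, dim, rng)           # spatial: vertical

    def embed(self, tokens):
        """Embed a list of (word, x, y) tuples given in reading order."""
        vectors = []
        for i, (word, x, y) in enumerate(tokens):
            parts = (
                self.word[self.vocab[word]],
                self.seq[i],
                self.x[bucket(x, self.grid)],
                self.y[bucket(y, self.grid)],
            )
            # Element-wise sum: one vector carries text, order and layout.
            vectors.append([sum(values) for values in zip(*parts)])
        return vectors


embedder = LayoutEmbedder(vocab=["Main", "Street"])
top = embedder.embed([("Main", 0.7, 0.05)])[0]
lower = embedder.embed([("Main", 0.7, 0.25)])[0]
print(top != lower)
```

The same word at the same place in the sequence yields a different input vector when it sits lower on the page, so the layers that follow can use layout as well as wording.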

Benchmarks

Several public datasets are used to evaluate Document AI systems.

FUNSD (Form Understanding in Noisy Scanned Documents) contains 199 annotated forms with token- and block-level labels for form understanding tasks.⁴

CORD (Consolidated Receipt Dataset) supports key information extraction from receipts.⁵

DocVQA contains approximately 50,000 questions over 12,000 document images for layout-aware visual question answering.⁶
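Extraction results on datasets of this kind are often summarized with precision, recall, and F1 over the extracted fields. The sketch below shows one simple field-level variant using exact matching. It is an assumption made for illustration: the function `field_f1` is hypothetical, and each benchmark defines its own official metric and matching rules.

```python
def field_f1(predicted, gold):
    """Compute F1 over (field name, value) pairs using exact matching.

    `predicted` and `gold` are dicts mapping a field name to its value.
    """
    pred_pairs = set(predicted.items())
    gold_pairs = set(gold.items())
    true_positives = len(pred_pairs & gold_pairs)

    precision = true_positives / len(pred_pairs) if pred_pairs else 0.0
    recall = true_positives / len(gold_pairs) if gold_pairs else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


predicted = {"total": "12.50", "date": "2021-01-15"}
gold = {"total": "12.50", "date": "2021-01-16", "store": "ACME"}
print(field_f1(predicted, gold))
```

In this example one of two predicted fields is correct (precision 1/2) and one of three gold fields is found (recall 1/3), so the F1 score is 0.4 up to floating-point rounding.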

Common uses

Document AI systems are used to automate document processing and information extraction in business and financial workflows, including invoice and receipt processing, data entry automation, anomaly detection, mortgage processing, loan portfolio monitoring, credit risk management, and fraud detection, such as identifying counterfeit currency and fraudulent checks.

They are also applied in regulatory compliance and contract analysis, including assessing changes in legal and regulatory documents. In real estate, Document AI supports document classification and structured information extraction for standardized processing and analytics.⁷ With the adoption of generative AI, Document AI systems can also generate and pre-fill structured documents such as contracts or business forms from natural language prompts.⁸

References

[1] Cui, Lei; Xu, Yiheng; Lv, Tengchao; Wei, Furu (2021). "Document AI: Benchmarks, Models and Applications". arXiv:2111.08609 [cs.CL].
[2] "Why Digitizing Documents has been Accelerated by COVID-19 Pandemic". eWEEK. 15 January 2021. Retrieved 2021-02-11.
[3] "Document AI Custom Extractor GA release". Google Cloud Blog. 10 January 2024. Retrieved 2026-05-20.
[4] Jaume, Guillaume; Kemal Ekenel, Hazim; Thiran, Jean-Philippe (September 2019). "FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents". 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW). 2: 1–6. doi:10.1109/ICDARW.2019.10029.
[5] Park, Seunghyun; Shin, Seung (2019). CORD: A Consolidated Receipt Dataset for Post-OCR Parsing (PDF). Document Intelligence Workshop at NeurIPS.
[6] Mathew, Minesh; Karatzas, Dimosthenis; Jawahar, C. V. (2021). DocVQA: A Dataset for VQA on Document Images. IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 2200–2209. arXiv:2007.00398.
[7] Bodenbender, Mario; Kurzrock, Björn-Martin; Müller, Philipp Maximilian (April 2019). "Broad application of artificial intelligence for document classification, information extraction and predictive analytics in real estate". Journal of General Management. 44 (3): 170–179. doi:10.1177/0306307018823113. ISSN 0306-3070.
[8] "Drafting documents from natural-language prompts". aiPDF. Retrieved 2026-05-20.
