Extracting Structured Data from PDF using Azure Document Intelligence and pdfplumber

 

https://www.pexels.com/photo/pile-of-covered-books-159751/

In this blog, we explore how to extract structured data from PDF files using both Azure Document Intelligence and the open-source library pdfplumber. Azure Document Intelligence is a cloud-based service that uses AI to understand and extract data from documents, while pdfplumber is a Python library for reading the text, tables, and metadata inside PDF files.

Extracting structured data, such as tables or forms, from PDF documents is a common need for further analysis or processing. We'll demonstrate how to achieve this using both approaches. Since PDF files can be large and complex, it's worth considering a balanced approach that optimizes for both cost and accuracy when choosing between these tools.

Our goal is to explore whether pdfplumber can effectively handle simpler PDF files, such as manuals and scientific data sheets, while reserving Azure Document Intelligence for more complex scenarios. While pdfplumber might have some limitations compared to Azure Document Intelligence when dealing with intricate layouts, it can offer a practical and cost-effective solution for straightforward PDF extraction tasks.

We'll also test the extracted text with a large language model (LLM) to explore how well it can interpret and utilize the structured data obtained from the PDFs. While this is an exploratory experiment designed for learning purposes, it provides valuable insights into the capabilities and potential applications of these tools.

The source for this blog can be found [here]

Using Azure Document Intelligence

To extract structured data using Azure Document Intelligence, we'll start by setting up the necessary environment and dependencies. You'll want to have the Azure SDK installed and configured with your Azure credentials.
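As a rough illustration of that setup, here is a minimal sketch. It assumes the `azure-ai-formrecognizer` package (`DocumentAnalysisClient`) and the `prebuilt-layout` model, which may differ from the SDK version and model used in the linked source. The environment variable names are hypothetical placeholders.

```python
import os
from typing import Mapping, Tuple


def load_settings(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Read the endpoint and key from the environment.

    The variable names are hypothetical; use whatever your deployment defines.
    """
    try:
        return env["DOCUMENTINTELLIGENCE_ENDPOINT"], env["DOCUMENTINTELLIGENCE_API_KEY"]
    except KeyError as missing:
        raise RuntimeError(f"Missing setting: {missing.args[0]}") from None


def analyze_layout(pdf_path: str):
    """Send a local PDF to the prebuilt layout model and return the AnalyzeResult."""
    # Imported here so the helper above can be used without the SDK installed.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    endpoint, key = load_settings()
    client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))
    with open(pdf_path, "rb") as pdf_file:
        # Assumption: the layout model is enough to recover text and tables.
        poller = client.begin_analyze_document("prebuilt-layout", document=pdf_file)
        return poller.result()
```

Calling `analyze_layout("manual.pdf")` (a placeholder file name) would block until the service finishes and then return the result object discussed next.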

After receiving the AnalyzeResult from Azure Document Intelligence, we can parse the tables and extract the structured data. The result provides a rich set of information, including tables, key-value pairs, and text content.
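To make the table parsing concrete, the sketch below turns one table into a list of rows. It relies only on the `row_count`, `column_count`, and per-cell `row_index`, `column_index`, and `content` attributes that tables in the result expose. The sample table is a made-up stand-in, not output from a real document.

```python
from types import SimpleNamespace
from typing import List


def table_to_grid(table) -> List[List[str]]:
    """Place each cell's content at its (row_index, column_index) position."""
    grid = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
    for cell in table.cells:
        grid[cell.row_index][cell.column_index] = cell.content
    return grid


# Stand-in object using the same attribute names as an entry in result.tables.
sample_table = SimpleNamespace(
    row_count=2,
    column_count=2,
    cells=[
        SimpleNamespace(row_index=0, column_index=0, content="Part"),
        SimpleNamespace(row_index=0, column_index=1, content="Weight"),
        SimpleNamespace(row_index=1, column_index=0, content="A-1"),
        # No cell is reported for (1, 1), so that position stays empty.
    ],
)
sample_grid = table_to_grid(sample_table)
```

With a real result, the same function could be applied as `[table_to_grid(t) for t in result.tables]`. Positions with no reported cell are left as empty strings in this sketch.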

We then format the content into a structured text format that can be easily interpreted by an LLM. From there, we can ask the LLM to answer several questions based on the extracted data. One question even asks the LLM to respond in JSON format based on the structured data extracted from the PDF.
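One possible shape for that formatting step is sketched below. It assumes Markdown tables as the structured text format, which is our choice for illustration and not necessarily the format used in the linked source. The prompt layout and function names are likewise hypothetical.

```python
from typing import Sequence


def grid_to_markdown(grid: Sequence[Sequence[str]]) -> str:
    """Render a grid as a Markdown table, treating the first row as the header."""
    if not grid:
        return ""

    def format_row(row: Sequence[str]) -> str:
        # Collapse line breaks inside cells so each row stays on one line.
        return "| " + " | ".join(cell.replace("\n", " ").strip() for cell in row) + " |"

    divider = "| " + " | ".join("---" for _ in grid[0]) + " |"
    lines = [format_row(grid[0]), divider]
    lines.extend(format_row(row) for row in grid[1:])
    return "\n".join(lines)


def build_prompt(
    page_text: str,
    tables: Sequence[Sequence[Sequence[str]]],
    question: str,
) -> str:
    """Combine the extracted text, the rendered tables, and one question."""
    parts = ["Document text:", page_text.strip()]
    for number, grid in enumerate(tables, start=1):
        parts.extend([f"Table {number}:", grid_to_markdown(grid)])
    parts.extend(["Question:", question])
    return "\n\n".join(parts)
```

For the JSON question, the `question` argument could simply ask for the answer as a JSON object. Nothing in the prompt builder needs to change for that.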

Using pdfplumber

Let's see how pdfplumber handles the same PDF file. pdfplumber offers a straightforward API to access the content of PDF files, including text, tables, and metadata.
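A minimal sketch of that API follows, using `pdfplumber.open`, `page.extract_text()`, and `page.extract_tables()`. The `normalize_table` helper is our own addition, not part of pdfplumber. It assumes that cells may be `None` and that rows with no content at all can be dropped.

```python
from typing import Dict, List, Optional, Sequence


def normalize_table(rows: Sequence[Sequence[Optional[str]]]) -> List[List[str]]:
    """Replace None cells with "", flatten line breaks, and drop fully empty rows."""
    cleaned = []
    for row in rows:
        cells = [(cell or "").replace("\n", " ").strip() for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned


def extract_pages(pdf_path: str) -> List[Dict[str, object]]:
    """Collect the text and tables of every page in a PDF."""
    # Imported here so normalize_table can be used without pdfplumber installed.
    import pdfplumber

    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            pages.append(
                {
                    "page_number": page.page_number,
                    "text": page.extract_text() or "",
                    "tables": [normalize_table(t) for t in page.extract_tables()],
                }
            )
    return pages
```

Whether dropping empty rows is appropriate depends on the document. For some layouts a blank row may carry meaning, so treat that step as optional.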

After extracting the text and tables from pdfplumber (which conveniently requires minimal additional formatting), we can prompt the LLM with the same set of questions as before.
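To keep the comparison fair, the same questions can be run against both extractions with one small harness. The sketch below is hypothetical: `ask` stands for whatever function sends a prompt to your LLM and returns its reply, and it is passed in so that no particular LLM client is assumed.

```python
from typing import Callable, Dict, Mapping, Sequence


def ask_questions(
    context: str,
    questions: Sequence[str],
    ask: Callable[[str], str],
) -> Dict[str, str]:
    """Send each question together with the extracted context and collect the answers."""
    answers = {}
    for question in questions:
        prompt = f"{context}\n\nQuestion: {question}"
        answers[question] = ask(prompt)
    return answers


def compare_extractions(
    contexts: Mapping[str, str],
    questions: Sequence[str],
    ask: Callable[[str], str],
) -> Dict[str, Dict[str, str]]:
    """Run the same questions against each named extraction."""
    return {
        name: ask_questions(context, questions, ask)
        for name, context in contexts.items()
    }
```

Passing a dictionary such as `{"azure": azure_text, "pdfplumber": plumber_text}` (both names are placeholders) returns the answers keyed by extraction method, which makes differences easy to spot.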

Demo

Feel free to watch the demo (no audio) of this experiment. For the best viewing experience, we recommend watching in full screen mode.




The demo shows Azure Document Intelligence extracting text (including tables) from a PDF document, and the LLM using the extracted text to answer two questions.

It also shows that we can do the same with pdfplumber. However, the tables extracted by pdfplumber contain some extra lines. Otherwise, the results from the LLM are identical.

Conclusion

In this blog, we've explored how to extract structured data from PDF files using both Azure Document Intelligence and pdfplumber. Azure Document Intelligence offers a powerful and comprehensive solution for complex documents, while pdfplumber provides a practical and cost-effective alternative for simpler PDF files. By thoughtfully combining these tools, we can efficiently extract structured data and work with large language models for further analysis and processing.




