Discarding some pages from PDF documents in an Azure Blob Storage container

Image by https://www.pexels.com/@pixabay/

I was looking for a way to discard some pages from a PDF document and generate a new PDF file without them. There are many applications I could download for this task, and there are online tools too. The online tools are not an option because my PDF documents contain confidential information. Desktop applications on my MacBook Pro are not a good fit either, because the documents live in an Azure Blob Storage container. Furthermore, there are many PDF documents to process, and I may need to repeat the same task in the future.

Hence, I settled on the following approach:



  1. Read the PDF files from a blob container,
  2. Get the splitting information from a file,
  3. Split the PDF document and write it back to a destination blob container.
The Python code below uses a Blob Storage connection string; a Shared Access Signature (SAS) token would work too (I have written a few blog posts on SAS tokens).
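As a minimal sketch of the SAS token alternative, the service client can be built from an account URL plus the token instead of a connection string. The environment variable names STORAGE_ACCOUNT_NAME and BLOB_SAS_TOKEN and both helper names are my assumptions for illustration; they are not part of the setup described in this post.

```python
import os


def build_account_url(account_name: str) -> str:
    # Standard public-cloud blob endpoint; sovereign clouds use other suffixes.
    return f"https://{account_name}.blob.core.windows.net"


def get_service_client_with_sas():
    # Imported lazily so the URL helper above works without the Azure SDK installed.
    from azure.storage.blob import BlobServiceClient

    account_name = os.environ["STORAGE_ACCOUNT_NAME"]  # hypothetical variable name
    sas_token = os.environ["BLOB_SAS_TOKEN"]           # hypothetical variable name

    # The SAS token string is passed as the credential.
    return BlobServiceClient(
        account_url=build_account_url(account_name),
        credential=sas_token,
    )
```

The returned client offers the same get_container_client call as one created from a connection string, so the rest of the script would not need to change.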

Dependencies

Here are the dependencies:
PyPDF2
python-dotenv

azure-identity==1.12.0
azure-storage-blob==12.14.1

Environment Parameters

BLOB_STORAGE_CONN_STR=
IN_CONTAINER_NAME=
OUT_CONTAINER_NAME=
Put these in a .env file and fill in the values.
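If one of these values is missing, os.getenv returns None and the failure only shows up later inside the Azure SDK. A small check right after load_dotenv() can make that easier to spot. This is only a sketch; the helper name require_env is my own and not part of any library.

```python
import os
from typing import Dict, Iterable

REQUIRED_VARS = ("BLOB_STORAGE_CONN_STR", "IN_CONTAINER_NAME", "OUT_CONTAINER_NAME")


def require_env(names: Iterable[str] = REQUIRED_VARS) -> Dict[str, str]:
    """Return the named environment variables, raising if any is missing or empty."""
    values = {name: os.getenv(name, "") for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return values
```

Calling require_env() once at startup either returns all three settings as a dictionary or stops with a message that names the missing ones.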

Split PDF information

Put the split information in a JSON file named data.json. The format is:

[
    {
        "file": "<file name>",
        "pages": [<page to keep>, ..]
    }
]

Each entry in the pages array is either an integer for a single page (for example 2) or a string defining an inclusive range (for example "3-10", meaning pages 3 through 10). Page numbers start at 1.

Example:

[
    {
        "file": "sample1.pdf",
        "pages": [10]
    },
    {
        "file": "sample2.pdf",
        "pages": [1, "5-7"]
    }
]
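Since data.json is written by hand, it can help to check the pages entries before touching any PDF. The sketch below expands a pages array into 1-based page numbers and rejects malformed entries. The function name expand_page_spec and the exact rules (no page below 1, no reversed ranges) are my assumptions, not something the script in this post enforces.

```python
from typing import List, Union

PageSpec = Union[int, str]


def expand_page_spec(spec: List[PageSpec]) -> List[int]:
    """Expand entries such as [1, "5-7"] into 1-based page numbers, with checks."""
    pages: List[int] = []
    for entry in spec:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(entry, bool) or not isinstance(entry, (int, str)):
            raise ValueError(f"Unsupported page entry: {entry!r}")
        if isinstance(entry, int):
            start = end = entry
        else:
            first, sep, last = entry.partition("-")
            if not sep:
                raise ValueError(f"Range must look like 'start-end': {entry!r}")
            start, end = int(first), int(last)
        if start < 1 or end < start:
            raise ValueError(f"Invalid page entry: {entry!r}")
        pages.extend(range(start, end + 1))
    return pages


print(expand_page_spec([1, "5-7"]))  # [1, 5, 6, 7]
```

An entry such as "7-5" or 0 raises a ValueError here instead of silently producing an empty or wrong selection.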

Python Code

import io
import os
import json
from typing import List
from PyPDF2 import PdfReader, PdfWriter

from azure.storage.blob import BlobServiceClient

from dotenv import load_dotenv

load_dotenv()

def get_container_clients():
    BLOB_STORAGE_CONN_STR = os.getenv("BLOB_STORAGE_CONN_STR")
    IN_CONTAINER_NAME = os.getenv("IN_CONTAINER_NAME")
    OUT_CONTAINER_NAME = os.getenv("OUT_CONTAINER_NAME")
    
    service_client = BlobServiceClient.from_connection_string(BLOB_STORAGE_CONN_STR)
    in_client = service_client.get_container_client(container=IN_CONTAINER_NAME)
    out_client = service_client.get_container_client(container=OUT_CONTAINER_NAME)
    return in_client, out_client

def get_split_info():
    map_name_to_pages = {}
    with open("./data.json") as f:
        data = json.load(f)
        for d in data:
            map_name_to_pages[d["file"]] = d["pages"]

    return map_name_to_pages

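def pages_to_keep(page_info: List) -> List[int]:
    """Expand 1-based page numbers and "start-end" ranges into 0-based indices."""
    pages = []

    for p in page_info:
        if isinstance(p, str):
            # A string such as "3-10" is an inclusive range of 1-based pages.
            start, end = p.split("-")
            for i in range(int(start), int(end) + 1):
                pages.append(i - 1)
        else:
            # A plain integer is a single 1-based page number.
            pages.append(p - 1)

    return pages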

if __name__ == "__main__":
    map_name_to_pages = get_split_info()

    in_container_client, out_container_client = get_container_clients()
    
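    for name in in_container_client.list_blob_names():
        # Skip blobs that have no entry in data.json before downloading them.
        if name not in map_name_to_pages:
            continue

        blob = in_container_client.download_blob(name).readall()
        pdf = PdfReader(io.BytesIO(blob))
        pages = pages_to_keep(map_name_to_pages[name])

        pdf_writer = PdfWriter()

        # Copy only the pages to keep into the new document.
        for page_num in pages:
            pdf_writer.add_page(pdf.pages[page_num])

        with io.BytesIO() as bytes_stream:
            pdf_writer.write(bytes_stream)
            # Rewind so the upload starts from the beginning of the stream.
            bytes_stream.seek(0)

            # Pass overwrite=True here if the destination blob may already exist.
            out_container_client.upload_blob(name, bytes_stream)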
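
One thing the script does not guard against is a page number beyond the end of the document, in which case indexing pdf.pages would be expected to fail partway through a file. A minimal sketch of a bounds check follows; the helper name split_valid_indices is my own, and it works on the 0-based indices that pages_to_keep returns.

```python
from typing import List, Tuple


def split_valid_indices(indices: List[int], page_count: int) -> Tuple[List[int], List[int]]:
    """Separate 0-based indices into those inside the document and those outside."""
    valid = [i for i in indices if 0 <= i < page_count]
    invalid = [i for i in indices if not 0 <= i < page_count]
    return valid, invalid


valid, invalid = split_valid_indices([0, 4, 5, 6], page_count=6)
print(valid)    # [0, 4, 5]
print(invalid)  # [6]
```

In the main loop, page_count would be len(pdf.pages), and a non-empty invalid list could be logged or treated as an error before any page is copied.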

Conclusion

I hope this saves you some time if you need to do something similar.


 

