Image by https://www.pexels.com/@pixabay/
I wanted to discard some pages from a PDF document and generate a new PDF file without them. Plenty of downloadable applications can do this, and so can online services. Online services are not an option because my PDF documents contain confidential information. Desktop applications on my MacBook Pro are not a good fit either, because the documents live in an Azure Blob Storage container. Furthermore, I have many PDF documents to process, and I may need to repeat the task in the future.
So I wrote a script that does the following:
- Read the PDF files from a blob container,
- Get the splitting information from a file,
- Split the PDF document and write it back to a destination blob container.
The script is written in Python and authenticates with a Blob Storage connection string. A Shared Access Signature (SAS) token would work too (I have written a few blog posts on SAS tokens).
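For the SAS alternative, here is a minimal sketch, assuming a container-level SAS token. The function names, the account name, and the token value are hypothetical and only for illustration. `ContainerClient.from_container_url` comes from azure-storage-blob and is imported lazily so the URL helper can be tried without the SDK installed.

```python
def build_container_url(account_name: str, container_name: str, sas_token: str) -> str:
    """Build a container URL with the SAS token appended as the query string."""
    # A copied token sometimes carries a leading "?", so strip it first
    token = sas_token.lstrip("?")
    return f"https://{account_name}.blob.core.windows.net/{container_name}?{token}"


def get_container_client_from_sas(account_name: str, container_name: str, sas_token: str):
    """Hypothetical helper: return a ContainerClient authenticated by the SAS token."""
    # Imported here so build_container_url works without the Azure SDK installed
    from azure.storage.blob import ContainerClient

    return ContainerClient.from_container_url(
        build_container_url(account_name, container_name, sas_token)
    )
```

With this approach the connection string is not needed, but the token must grant read and list permission on the input container and write permission on the output container.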
Dependencies
Here are the dependencies
PyPDF2
python-dotenv
azure-identity==1.12.0
azure-storage-blob==12.14.1
Environment Parameters
BLOB_STORAGE_CONN_STR=
IN_CONTAINER_NAME=
OUT_CONTAINER_NAME=
Put them in a .env file. The script loads it with python-dotenv.
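A missing entry in the .env file only surfaces later as a less obvious error from the Azure client, so it can help to check the three settings up front. The following is a minimal sketch of that check. `read_settings` and `REQUIRED_SETTINGS` are hypothetical names and not part of the original script.

```python
import os
from typing import Dict, Mapping, Sequence

# The three parameters listed above
REQUIRED_SETTINGS = ("BLOB_STORAGE_CONN_STR", "IN_CONTAINER_NAME", "OUT_CONTAINER_NAME")


def read_settings(
    environ: Mapping[str, str] = os.environ,
    names: Sequence[str] = REQUIRED_SETTINGS,
) -> Dict[str, str]:
    """Return the required settings, or raise if any is missing or empty."""
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing settings: " + ", ".join(missing))
    return {name: environ[name] for name in names}
```

Calling `read_settings()` right after `load_dotenv()` would stop the run before any blob is touched.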
Split PDF information
Put the split information in a JSON file named data.json, in the directory the script runs from. The format is:
[
  { "file": "<file name>", "pages": [<page to keep>, ...] }
]
Each entry in the pages array is either an integer (for example, 2) or a string that defines an inclusive range (for example, "3-10", which means pages 3 through 10). Page numbers start at 1.
Example (note that JSON does not allow a trailing comma after the last entry):
[
  { "file": "sample1.pdf", "pages": [10] },
  { "file": "sample2.pdf", "pages": [1, "5-7"] }
]
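To make the format concrete, here is a small sketch that expands the example entries into the zero-based page indices a PDF library expects. `expand_pages` is a hypothetical name, and the rejection of reversed ranges is an added assumption, not something the original script does.

```python
import json
from typing import List, Union


def expand_pages(specs: List[Union[int, str]]) -> List[int]:
    """Turn 1-based page numbers and "a-b" ranges into 0-based indices."""
    indices: List[int] = []
    for spec in specs:
        if isinstance(spec, str):
            first, last = (int(part) for part in spec.split("-"))
            if first > last:
                # Assumption: a reversed range is treated as a mistake
                raise ValueError(f"Invalid range: {spec}")
            indices.extend(range(first - 1, last))
        else:
            indices.append(spec - 1)
    return indices


sample = """
[
  { "file": "sample1.pdf", "pages": [10] },
  { "file": "sample2.pdf", "pages": [1, "5-7"] }
]
"""
plan = {entry["file"]: expand_pages(entry["pages"]) for entry in json.loads(sample)}
print(plan)  # {'sample1.pdf': [9], 'sample2.pdf': [0, 4, 5, 6]}
```

So sample1.pdf keeps only page 10, and sample2.pdf keeps pages 1, 5, 6, and 7.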
Python Code
import io
import json
import os
from typing import List, Union

from azure.storage.blob import BlobServiceClient
from dotenv import load_dotenv
from PyPDF2 import PdfReader, PdfWriter

load_dotenv()


def get_container_clients():
    """Return the (input, output) container clients named in the .env file."""
    conn_str = os.getenv("BLOB_STORAGE_CONN_STR")
    in_container_name = os.getenv("IN_CONTAINER_NAME")
    out_container_name = os.getenv("OUT_CONTAINER_NAME")
    service_client = BlobServiceClient.from_connection_string(conn_str)
    return (
        service_client.get_container_client(container=in_container_name),
        service_client.get_container_client(container=out_container_name),
    )


def get_split_info():
    """Map each file name in data.json to its list of pages to keep."""
    with open("./data.json") as f:
        data = json.load(f)
    return {d["file"]: d["pages"] for d in data}


def pages_to_keep(page_info: List[Union[int, str]]) -> List[int]:
    """Expand 1-based page numbers and "a-b" ranges into 0-based indices."""
    pages = []
    for p in page_info:
        if isinstance(p, str):
            first, last = p.split("-")
            pages.extend(range(int(first) - 1, int(last)))
        else:
            pages.append(p - 1)
    return pages


if __name__ == "__main__":
    map_name_to_pages = get_split_info()
    in_container_client, out_container_client = get_container_clients()
    for name in in_container_client.list_blob_names():
        # Skip blobs with no entry in data.json before downloading them
        if name not in map_name_to_pages:
            continue
        blob = in_container_client.download_blob(name).readall()
        pdf = PdfReader(io.BytesIO(blob))
        pdf_writer = PdfWriter()
        for page_num in pages_to_keep(map_name_to_pages[name]):
            pdf_writer.add_page(pdf.pages[page_num])
        # Write the new PDF to memory, then upload it under the same name
        with io.BytesIO() as bytes_stream:
            pdf_writer.write(bytes_stream)
            bytes_stream.seek(0)
            out_container_client.upload_blob(name, bytes_stream)
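The script indexes `pdf.pages` directly, so a page number beyond the end of a document may raise an error partway through a run, and a page number of 0 becomes index -1, which may select the last page instead. Below is a minimal sketch of a check that could run before the pages are added. `out_of_range` is a hypothetical helper, not part of the original script.

```python
from typing import List


def out_of_range(indices: List[int], page_count: int) -> List[int]:
    """Return the 0-based indices that do not exist in a page_count-page document."""
    return [i for i in indices if i < 0 or i >= page_count]
```

For example, calling `out_of_range(pages, len(pdf.pages))` and skipping the file when the result is non-empty would leave the remaining documents unaffected by one bad entry in data.json.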
Conclusion
I hope this saves you some time if you need to do something similar.