PII Anonymizer

 

https://www.pexels.com/photo/woman-covering-her-face-with-corn-leaves-906024/

I was playing around with Presidio Anonymizer. Quote

The Presidio anonymizer is a Python based module for anonymizing detected PII text entities with desired values. Presidio anonymizer supports both anonymization and deanonymization by applying different operators. Operators are built-in text manipulation classes which can be easily extended.

I am particular interested in creating operators for it. As a result, I have created a python library. pii_anonymizer.

For sample, I have a text

Her name is Mary Ann. My name is James Bond. My phone number is 212-555-5555.
My credit card is 5548364515335857. Again my name is James Bond and number is
212-555-5555.

and I want to anonymize it to

Her name is Monique Hamilton. My name is Jesse Townsend. My phone number is <phone_number_1>.
My credit card is XXXXXXXXXXXXXXXX. Again my name is Jesse Townsend and number is
<phone_number_1>.

  • The names are replaced. When "James Bond" appeared twice, we replaced with the same name. Names are generated with Faker library.
  • Credit card number is masked, I want to have the option to change mask character e.g. "******" or this case "XXXXXX"
  • Phone numbers are labeled consistently.
Here is the sample code. Also found here.
import asyncio

from faker import Faker

from pii_anonymizer.generators.label_generator import LabelGenerator
from pii_anonymizer.generators.mask_generator import MaskGenerator
from pii_anonymizer.generators.name_generator import NameGenerator
from pii_anonymizer.hosting import container
from pii_anonymizer.protocols.i_text_analyzer import ITextAnalyzer
from pii_anonymizer.protocols.i_text_anonymizer import ITextAnonymizer

# sample code to show how to use the text analyzer and text anonymizer
# there are 3 entities in the text: PERSON, PHONE_NUMBER, CREDIT_CARD

text = """Her name is Mary Ann. My name is James Bond. My phone number is 212-555-5555.
My credit card is 5548364515335857. Again my name is James Bond and number is
212-555-5555."""
print("Original text:")
print(text)
print()

# seed the faker so that the generated data is consistent
Faker.seed(100)

# set the mask character for the CREDIT_CARD entity default is "*" for masking
MaskGenerator.mask_char_mapping["CREDIT_CARD"] = "X"


async def main():
    # get the text analyzer from the DI container
    text_analyzer = container[ITextAnalyzer]

    analyzed_result = await text_analyzer.analyze(
        text=text,
        entities=["PERSON", "PHONE_NUMBER", "CREDIT_CARD"],
        language="en",
    )

    print("Analyzed result:")
    print(analyzed_result)
    print()

    # get the text anonymizer from the DI container
    text_anonymizer = container[ITextAnonymizer]

    anonymized_result = await text_anonymizer.anonymize(
        text=text,
        analyzer_results=analyzed_result,
        operators={
            "PERSON": NameGenerator(),
            "PHONE_NUMBER": LabelGenerator(),
            "CREDIT_CARD": MaskGenerator(),
        },
    )

    print("Anonymized result:")
    print(anonymized_result)
    print()

    print("Anonymized text:")
    print(anonymized_result.text)
    print()


if __name__ == "__main__":
    loop = asyncio.new_event_loop()
    loop.run_until_complete(main())

The library is created with Dependency Injection for the different services. The intention is that we can switch to use other anonymizer in future if needed,









Comments

Popular posts from this blog

Happy New Year, 2024 from DALL-E

Bing Search as tool for OpenAI

OpenAI: Functions Feature in 2023-07-01-preview API version