PII Anonymizer

 

https://www.pexels.com/photo/woman-covering-her-face-with-corn-leaves-906024/

I was playing around with Presidio Anonymizer. To quote its description:

The Presidio anonymizer is a Python based module for anonymizing detected PII text entities with desired values. Presidio anonymizer supports both anonymization and deanonymization by applying different operators. Operators are built-in text manipulation classes which can be easily extended.

I am particularly interested in creating operators for it. As a result, I have created a Python library, pii_anonymizer.

For example, take this text:

Her name is Mary Ann. My name is James Bond. My phone number is 212-555-5555.
My credit card is 5548364515335857. Again my name is James Bond and number is
212-555-5555.

and I want to anonymize it to

Her name is Monique Hamilton. My name is Jesse Townsend. My phone number is <phone_number_1>.
My credit card is XXXXXXXXXXXXXXXX. Again my name is Jesse Townsend and number is
<phone_number_1>.

  • Names are replaced with fake names generated by the Faker library. "James Bond" appears twice and is replaced with the same name both times.
  • The credit card number is masked. I want the option to change the mask character, e.g. "******" by default or "XXXXXX" in this case.
  • Phone numbers are labeled consistently: the same number gets the same label every time it appears.
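Before the library itself, the three behaviors above can be illustrated in plain Python. This is only a minimal sketch of the idea, not how pii_anonymizer is implemented: `ConsistentReplacer`, `mask`, and the fixed pool of made-up names (standing in for Faker) are all hypothetical names introduced for illustration.

```python
class ConsistentReplacer:
    """Return the same replacement every time the same original value is seen."""

    def __init__(self, make_value):
        # make_value(n) builds the replacement for the n-th distinct value (1-based).
        self._make_value = make_value
        self._seen = {}

    def replace(self, original):
        # Only generate a new replacement the first time a value shows up.
        if original not in self._seen:
            self._seen[original] = self._make_value(len(self._seen) + 1)
        return self._seen[original]


def mask(original, mask_char="*"):
    """Replace every character with mask_char, keeping the original length."""
    return mask_char * len(original)


# Hypothetical stand-in for Faker: a small fixed pool of made-up names.
# The pool cycles, so names repeat after three distinct inputs.
fake_names = ["Alex Doe", "Sam Roe", "Kim Poe"]
names = ConsistentReplacer(lambda n: fake_names[(n - 1) % len(fake_names)])
phones = ConsistentReplacer(lambda n: f"<phone_number_{n}>")

print(names.replace("James Bond"))     # first distinct name
print(names.replace("Mary Ann"))       # second distinct name
print(names.replace("James Bond"))     # same replacement as the first call
print(phones.replace("212-555-5555"))  # numbered label for the phone number
print(mask("5548364515335857", "X"))   # masked with a custom character
```

The key design point is the per-value cache: a replacement is generated once and looked up afterwards, which is what keeps repeated names and phone numbers consistent across the text.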
Here is the sample code using pii_anonymizer. It can also be found here.
import asyncio

from faker import Faker

from pii_anonymizer.generators.label_generator import LabelGenerator
from pii_anonymizer.generators.mask_generator import MaskGenerator
from pii_anonymizer.generators.name_generator import NameGenerator
from pii_anonymizer.hosting import container
from pii_anonymizer.protocols.i_text_analyzer import ITextAnalyzer
from pii_anonymizer.protocols.i_text_anonymizer import ITextAnonymizer

# sample code to show how to use the text analyzer and text anonymizer
# there are 3 entities in the text: PERSON, PHONE_NUMBER, CREDIT_CARD

text = """Her name is Mary Ann. My name is James Bond. My phone number is 212-555-5555.
My credit card is 5548364515335857. Again my name is James Bond and number is
212-555-5555."""
print("Original text:")
print(text)
print()

# seed the faker so that the generated data is consistent
Faker.seed(100)

# set the mask character for the CREDIT_CARD entity (the default mask character is "*")
MaskGenerator.mask_char_mapping["CREDIT_CARD"] = "X"


async def main():
    # get the text analyzer from the DI container
    text_analyzer = container[ITextAnalyzer]

    analyzed_result = await text_analyzer.analyze(
        text=text,
        entities=["PERSON", "PHONE_NUMBER", "CREDIT_CARD"],
        language="en",
    )

    print("Analyzed result:")
    print(analyzed_result)
    print()

    # get the text anonymizer from the DI container
    text_anonymizer = container[ITextAnonymizer]

    anonymized_result = await text_anonymizer.anonymize(
        text=text,
        analyzer_results=analyzed_result,
        operators={
            "PERSON": NameGenerator(),
            "PHONE_NUMBER": LabelGenerator(),
            "CREDIT_CARD": MaskGenerator(),
        },
    )

    print("Anonymized result:")
    print(anonymized_result)
    print()

    print("Anonymized text:")
    print(anonymized_result.text)
    print()


if __name__ == "__main__":
    # asyncio.run creates the event loop, runs main(), and closes the loop afterwards
    asyncio.run(main())

The library is built with dependency injection for its different services. The intention is that we can switch to another anonymizer in the future if needed.
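To make the swapping idea concrete, here is a minimal sketch of the pattern in plain Python. It is not the library's actual container or interfaces: `Anonymizer`, `DigitMasker`, `PassThrough`, `Container`, and `redact` are hypothetical names, and the sketch is synchronous for brevity, whereas the sample code above is async.

```python
import re
from typing import Protocol


class Anonymizer(Protocol):
    """Hypothetical interface that calling code depends on."""

    def anonymize(self, text: str) -> str: ...


class DigitMasker:
    """One implementation: replace every digit with '*'."""

    def anonymize(self, text: str) -> str:
        return re.sub(r"\d", "*", text)


class PassThrough:
    """Another implementation: return the text unchanged."""

    def anonymize(self, text: str) -> str:
        return text


class Container:
    """Tiny DI container: maps an interface to a factory for its implementation."""

    def __init__(self):
        self._factories = {}

    def __setitem__(self, interface, factory):
        self._factories[interface] = factory

    def __getitem__(self, interface):
        # Build a fresh instance each time the interface is requested.
        return self._factories[interface]()


container = Container()
container[Anonymizer] = DigitMasker


def redact(text: str) -> str:
    # Calling code asks for the interface and never names a concrete class.
    service: Anonymizer = container[Anonymizer]
    return service.anonymize(text)


print(redact("My phone number is 212-555-5555."))

# Swapping the implementation is a one-line change to the registration.
container[Anonymizer] = PassThrough
print(redact("My phone number is 212-555-5555."))
```

Because `redact` only knows about the interface, replacing the underlying anonymizer does not require touching any calling code.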








