I was playing around with Presidio Anonymizer. Quote
The Presidio anonymizer is a Python based module for anonymizing detected PII text entities with desired values. Presidio anonymizer supports both anonymization and deanonymization by applying different operators. Operators are built-in text manipulation classes which can be easily extended.
I am particular interested in creating operators for it. As a result, I have created a python library. pii_anonymizer.
For sample, I have a text
Her name is Mary Ann. My name is James Bond. My phone number is 212-555-5555. My credit card is 5548364515335857. Again my name is James Bond and number is 212-555-5555.
and I want to anonymize it to
Her name is Monique Hamilton. My name is Jesse Townsend. My phone number is <phone_number_1>. My credit card is XXXXXXXXXXXXXXXX. Again my name is Jesse Townsend and number is <phone_number_1>.
- The names are replaced. When "James Bond" appeared twice, we replaced with the same name. Names are generated with Faker library.
- Credit card number is masked, I want to have the option to change mask character e.g. "******" or this case "XXXXXX"
- Phone numbers are labeled consistently.
Here is the sample code. Also found here.
import asyncio from faker import Faker from pii_anonymizer.generators.label_generator import LabelGenerator from pii_anonymizer.generators.mask_generator import MaskGenerator from pii_anonymizer.generators.name_generator import NameGenerator from pii_anonymizer.hosting import container from pii_anonymizer.protocols.i_text_analyzer import ITextAnalyzer from pii_anonymizer.protocols.i_text_anonymizer import ITextAnonymizer # sample code to show how to use the text analyzer and text anonymizer # there are 3 entities in the text: PERSON, PHONE_NUMBER, CREDIT_CARD text = """Her name is Mary Ann. My name is James Bond. My phone number is 212-555-5555. My credit card is 5548364515335857. Again my name is James Bond and number is 212-555-5555.""" print("Original text:") print(text) print() # seed the faker so that the generated data is consistent Faker.seed(100) # set the mask character for the CREDIT_CARD entity default is "*" for masking MaskGenerator.mask_char_mapping["CREDIT_CARD"] = "X" async def main(): # get the text analyzer from the DI container text_analyzer = container[ITextAnalyzer] analyzed_result = await text_analyzer.analyze( text=text, entities=["PERSON", "PHONE_NUMBER", "CREDIT_CARD"], language="en", ) print("Analyzed result:") print(analyzed_result) print() # get the text anonymizer from the DI container text_anonymizer = container[ITextAnonymizer] anonymized_result = await text_anonymizer.anonymize( text=text, analyzer_results=analyzed_result, operators={ "PERSON": NameGenerator(), "PHONE_NUMBER": LabelGenerator(), "CREDIT_CARD": MaskGenerator(), }, ) print("Anonymized result:") print(anonymized_result) print() print("Anonymized text:") print(anonymized_result.text) print() if __name__ == "__main__": loop = asyncio.new_event_loop() loop.run_until_complete(main())
The library is created with Dependency Injection for the different services. The intention is that we can switch to use other anonymizer in future if needed,
Comments
Post a Comment