Images by kevin-ku-92347 from pexels.com
Very often, we need to generate data for testing. I use Faker quite a bit because I often work on Machine Learning and Data Science projects. Unfortunately, I ended up rewriting much the same Faker code to generate data for each project. I have long wanted to write a wrapper around it that lets me generate different data for each column of a pandas DataFrame. I have been procrastinating on this idea for years, and I have decided to put an end to that. So, this is the inception of a Faker-enabled Data Generator, FDG.
MVP
A definition file that specifies the column names and the kind of data we wish to generate:
- A list of column definitions: each column's name and the Faker function to call for it.
- A list of row definitions, each with:
  - count: the number of rows to generate
  - locale (optional, defaults to "en_US")
  - seed (optional): useful if we want to generate the same set of data on each run
  - columns (optional): any columns to be generated with a different function from the one in the column definitions above
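As a rough illustration of how these optional fields might be resolved, the sketch below fills in the defaults for a row definition and picks the generator for a column. The helper names `normalize_row` and `pick_generator` are hypothetical and are not part of Faker or pandas, and the default seed of 0 is an assumption.

```python
DEFAULT_LOCALE = "en_US"


def normalize_row(row: dict) -> dict:
    """Return a copy of a row definition with the optional fields filled in."""
    return {
        "count": row["count"],                      # required
        "locale": row.get("locale") or DEFAULT_LOCALE,
        "seed": row.get("seed") or 0,               # assumed default seed
        "columns": row.get("columns") or {},        # per-row overrides
    }


def pick_generator(column: dict, row: dict) -> dict:
    """Prefer the row-level override for a column, else the column definition."""
    return row["columns"].get(column["name"]) or column


row = normalize_row({"count": 20, "columns": {"locale": {"text": "ja-JP"}}})
print(row["locale"], row["seed"])               # en_US 0
print(pick_generator({"name": "locale"}, row))  # {'text': 'ja-JP'}
print(pick_generator({"name": "job", "func": "faker.job()"}, row))
```

The row-level override wins only when it exists; every other column falls back to its own definition.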
Example
{
  "columns": [
    { "name": "fname", "func": "faker.first_name()" },
    { "name": "lname", "func": "faker.last_name()" },
    { "name": "job", "func": "faker.job()" },
    { "name": "locale" }
  ],
  "rows": [
    {
      "count": 100,
      "locale": "en_US",
      "seed": 1234,
      "columns": { "locale": { "text": "en-US" } }
    },
    {
      "count": 20,
      "locale": "ja_JP",
      "seed": 1234,
      "columns": { "locale": { "text": "ja-JP" } }
    }
  ]
}
With this metadata, we can generate a pandas DataFrame like this.
- 4 columns
- 100 rows of English locale
- 20 rows of Japanese locale
     fname        lname      job                       locale
0    Tammy        Alexander  Actor                     en-US
1    Christopher  Bender     Television floor manager  en-US
2    Austin       Steele     Chief Technology Officer  en-US
3    Katie        Young      Airline pilot             en-US
4    Valerie      Smith      Youth worker              en-US
..   ...          ...        ...                       ...
115  零           斉藤       寿司職人                  ja-JP
116  篤司         太田       アニメーター              ja-JP
117  晃           中村       調理師                    ja-JP
118  明美         阿部       占い師                    ja-JP
119  和也         鈴木       建築家                    ja-JP

[120 rows x 4 columns]
Source Code
To keep it simple, I have all the source code in a folder.
- requirements.txt
- fake_data_builder.py
- main.py
- metadata.json
- metadata_simple.json
requirements.txt
This file contains the dependencies.
pandas==1.3.3
faker==8.14.0
metadata.json & metadata_simple.json
These files contain a complex and a simple data definition, respectively. We shall use them later.
metadata.json
{
  "columns": [
    { "name": "fname", "func": "faker.first_name()" },
    { "name": "lname", "func": "faker.last_name()" },
    { "name": "home-address", "func": "faker.address()" },
    { "name": "date_of_birth", "func": "faker.date_of_birth()" },
    { "name": "email-address", "func": "faker.ascii_free_email()" },
    { "name": "phone-number", "func": "faker.phone_number()" },
    { "name": "company", "func": "faker.company()" },
    { "name": "job", "func": "faker.job()" },
    { "name": "enrolL_date", "func": "faker.date_between(start_date='-10y')" },
    { "name": "employee_id", "func": "faker.bothify('????-########')" },
    { "name": "locale" }
  ],
  "rows": [
    {
      "count": 100,
      "locale": "en_US",
      "seed": 1234,
      "columns": { "locale": { "text": "en-US" } }
    },
    {
      "count": 20,
      "locale": "ja_JP",
      "seed": 1234,
      "columns": { "locale": { "text": "ja-JP" } }
    }
  ]
}
metadata_simple.json
{
  "columns": [
    { "name": "language", "text": "en_US" },
    { "name": "customer_name", "func": "faker.company()" },
    { "name": "transcript", "func": "' '.join(faker.paragraphs())" }
  ],
  "rows": [
    { "count": 5, "locale": "en_US", "seed": 1234 }
  ]
}
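In metadata.json the `locale` column has neither `text` nor `func` of its own, so it relies on every row supplying an override. A small pre-flight check could catch a row that forgets to do so. The sketch below uses only the standard library; `find_unresolved` is a hypothetical helper, and the inline JSON is a made-up, trimmed-down definition.

```python
import json


def find_unresolved(metadata: dict) -> list:
    """Return (row_index, column_name) pairs that have no 'text' or 'func'."""
    problems = []
    for index, row in enumerate(metadata["rows"]):
        overrides = row.get("columns") or {}
        for column in metadata["columns"]:
            # Row-level override first, then the column definition itself.
            source = overrides.get(column["name"]) or column
            if not (source.get("text") or source.get("func")):
                problems.append((index, column["name"]))
    return problems


metadata = json.loads("""
{
  "columns": [
    {"name": "job", "func": "faker.job()"},
    {"name": "locale"}
  ],
  "rows": [
    {"count": 2, "columns": {"locale": {"text": "en-US"}}},
    {"count": 2}
  ]
}
""")
print(find_unresolved(metadata))  # [(1, 'locale')]
```

The second row definition is reported because nothing tells the builder how to fill `locale` for it.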
fake_data_builder.py
This is the builder class. Most of the logic is encapsulated here. The implementation is very simple, though: it reads the metadata and then generates a pandas DataFrame accordingly.
"""Fake Data Builder."""
import json

import pandas as pd
from faker import Faker


class FakeDataBuilder:
    """Fake Data Builder."""

    def __init__(self, metadata_filename: str):
        """Create an instance of builder.

        Args:
            metadata_filename (str): metadata filename (JSON format)
        """
        with open(metadata_filename) as metadata_file:
            metadata = json.load(metadata_file)
        self.rows = metadata["rows"]
        self.columns = metadata["columns"]

    def __build_rows(
        self,
        count: int,
        column_generator: dict,
        locale: str,
        seed: int,
    ):
        # Fall back to "en_US" when the row definition has no locale.
        faker = Faker(locale or "en_US")
        # Always seed, so that repeated runs produce the same data.
        Faker.seed(seed if seed else 0)
        return [self.__build_row(faker, column_generator) for _ in range(count)]

    def __get_column_val(
        self,
        faker: Faker,
        col: dict,
        column_generator: dict,
    ):
        col_name = col["name"]
        # A row-level override takes precedence over the column definition.
        coln = (
            column_generator.get(col_name)
            if column_generator and column_generator.get(col_name)
            else col
        )
        # "text" is a literal value; "func" is evaluated with `faker` in scope.
        # eval() runs arbitrary code, so only load metadata files you trust.
        return coln.get("text") if coln.get("text") else eval(coln["func"])

    def __build_row(self, faker, column_generator: dict):
        data = {}
        for col in self.columns:
            try:
                data[col["name"]] = self.__get_column_val(
                    faker,
                    col,
                    column_generator,
                )
            except AttributeError as err:
                raise AttributeError(
                    f"Unrecognized column func, {col.get('func')}."
                ) from err
        return data

    def build(self):
        """Build the dataframe.

        Returns:
            pandas.DataFrame: Pandas dataframe containing the data.
        """
        column_names = [x["name"] for x in self.columns]
        records = []
        for row in self.rows:
            records.extend(
                self.__build_rows(
                    row["count"],
                    row.get("columns"),
                    row.get("locale"),
                    row.get("seed"),
                )
            )
        # Build the frame once from all generated records.
        return pd.DataFrame(records, columns=column_names)
main.py
This file is used to test the builder class.
"""Test code."""
from fake_data_builder import FakeDataBuilder

# or you can choose "./metadata_simple.json"
builder = FakeDataBuilder("./metadata.json")
df = builder.build()
print(df)
Once you have all the files in a folder, you can test it out with:
pip install -r requirements.txt
python main.py
and you will see a dump of the pandas DataFrame.
Note:
The list of Faker functions can be found in the Faker documentation.
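Because the builder passes each `func` string to `eval`, a metadata file can run any Python expression. One possible way to narrow that, shown here only as a sketch and not as a real sandbox, is to give `eval` an explicit namespace that exposes just the `faker` object. `StubFaker` and `evaluate_func` are hypothetical names; the stub stands in for a real `Faker` instance so that the example runs without extra dependencies.

```python
class StubFaker:
    """Stand-in for a Faker instance (hypothetical, for illustration only)."""

    def first_name(self) -> str:
        return "Tammy"

    def paragraphs(self) -> list:
        return ["One.", "Two."]


def evaluate_func(expression: str, faker) -> object:
    """Evaluate a metadata 'func' string with only `faker` in scope."""
    # An empty __builtins__ hides open(), __import__() and friends from the
    # expression. This limits accidents; it is not a security boundary.
    return eval(expression, {"__builtins__": {}}, {"faker": faker})


stub = StubFaker()
print(evaluate_func("faker.first_name()", stub))            # Tammy
print(evaluate_func("' '.join(faker.paragraphs())", stub))  # One. Two.

try:
    evaluate_func("open('secret.txt')", stub)
except NameError as err:
    print("blocked:", err)
```

String methods such as `' '.join(...)` still work because they are attributes of the literal, not builtins, so the expressions used in metadata_simple.json would keep working under this scheme.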