Generating Test Data with Faker

Images by kevin-ku-92347 from pexels.com


We often need to generate data for testing. I use Faker quite a bit because I often work on Machine Learning and Data Science projects. Unfortunately, I ended up rewriting the data-generation code for each project. I have long wanted to write a wrapper around Faker that lets me generate different data for each column in a pandas DataFrame. I procrastinated on this idea for years, and I have decided to put an end to that. So, this is the inception of a Faker-enabled Data Generator, FDG.

MVP

A definition file defines the column names and the kind of data that we wish to generate. It contains:
  1. A list of column names, with the Faker function to call for each column.
  2. A list of row definitions, each with:
    1. count - the number of rows to generate
    2. locale (optional, defaults to "en_US")
    3. seed (optional) - useful if we want to generate the same set of data on each run.
    4. columns (optional) - any columns to be generated with a different function from the one in the columns definition above.
Example
{
    "columns": [
        { "name": "fname", "func": "faker.first_name()" },
        { "name": "lname", "func": "faker.last_name()" },
        { "name": "job", "func": "faker.job()" },
        { "name": "locale" }
    ],
    "rows": [
        {
            "count": 100,
            "locale": "en_US",
            "seed": 1234,
            "columns": {
                "locale": {
                    "text": "en-US"
                }
            }
        },
        {
            "count": 20,
            "locale": "ja_JP",
            "seed": 1234,
            "columns": {
                "locale": {
                    "text": "ja-JP"
                }
            }
        }
    ]
}
With this metadata, we can generate a pandas DataFrame like this.
  1. 4 columns
  2. 100 rows of English locale
  3. 20 rows of Japanese locale
           fname      lname                       job locale
0          Tammy  Alexander                     Actor  en-US
1    Christopher     Bender  Television floor manager  en-US
2         Austin     Steele  Chief Technology Officer  en-US
3          Katie      Young             Airline pilot  en-US
4        Valerie      Smith              Youth worker  en-US
..           ...        ...                       ...    ...
115            零         斉藤                      寿司職人  ja-JP
116           篤司         太田                    アニメーター  ja-JP
117            晃         中村                       調理師  ja-JP
118           明美         阿部                       占い師  ja-JP
119           和也         鈴木                       建築家  ja-JP

[120 rows x 4 columns]
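
Because a column such as "locale" can be left without a "func" and filled in per row group, it is easy to forget an override. The following is a minimal sketch of a check for that, using only the standard library. The function name `validate_definition` and its messages are my own assumptions for illustration; they are not part of FDG.

```python
import json


def validate_definition(definition: dict) -> list:
    """Return a list of problems found in a definition (empty if none)."""
    problems = []
    columns = definition.get("columns", [])
    rows = definition.get("rows", [])

    if not columns:
        problems.append("'columns' must list at least one column")
    if not rows:
        problems.append("'rows' must list at least one row definition")

    for col in columns:
        if "name" not in col:
            problems.append(f"column without a name: {col}")

    for index, row in enumerate(rows):
        count = row.get("count")
        if not isinstance(count, int) or count < 0:
            problems.append(f"rows[{index}]: 'count' must be a non-negative integer")

        overrides = row.get("columns") or {}
        for col in columns:
            name = col.get("name")
            # A row-level override, when present, replaces the column default.
            source = overrides.get(name) or col
            # Every column needs either a literal "text" or a "func" to call.
            if not (source.get("text") or source.get("func")):
                problems.append(f"rows[{index}]: column '{name}' has no 'text' or 'func'")

    return problems


# A shortened definition in which the second row group forgets the override.
definition = json.loads("""
{
    "columns": [
        { "name": "fname", "func": "faker.first_name()" },
        { "name": "locale" }
    ],
    "rows": [
        { "count": 100, "locale": "en_US", "columns": { "locale": { "text": "en-US" } } },
        { "count": 20, "locale": "ja_JP" }
    ]
}
""")

for problem in validate_definition(definition):
    print(problem)  # prints: rows[1]: column 'locale' has no 'text' or 'func'
```

Running a check like this before building the DataFrame would report the missing override up front instead of failing part-way through generation.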

Source Code

To keep it simple, I have all the source code in a folder.
- requirements.txt
- fake_data_builder.py
- main.py
- metadata.json
- metadata_simple.json

requirements.txt

This file contains the dependencies.
pandas==1.3.3
faker==8.14.0

metadata.json & metadata_simple.json

These files contain a complex and a simple data definition, respectively. We shall use them later.

metadata.json

{
    "columns": [
        { "name": "fname", "func": "faker.first_name()" },
        { "name": "lname", "func": "faker.last_name()" },
        { "name": "home-address", "func": "faker.address()" },
        { "name": "date_of_birth", "func": "faker.date_of_birth()" },
        { "name": "email-address", "func": "faker.ascii_free_email()" },
        { "name": "phone-number", "func": "faker.phone_number()" },
        { "name": "company", "func": "faker.company()" },
        { "name": "job", "func": "faker.job()" },
        {
            "name": "enroll_date",
            "func": "faker.date_between(start_date='-10y')"
        },
        {
            "name": "employee_id",
            "func": "faker.bothify('????-########')"
        },
        {
            "name": "locale"
        }
    ],
    "rows": [
        {
            "count": 100,
            "locale": "en_US",
            "seed": 1234,
            "columns": {
                "locale": {
                    "text": "en-US"
                }
            }
        },
        {
            "count": 20,
            "locale": "ja_JP",
            "seed": 1234,
            "columns": {
                "locale": {
                    "text": "ja-JP"
                }
            }
        }
    ]
}
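
The "employee_id" column uses Faker's bothify, which replaces each "?" in the pattern with a random letter and each "#" with a random digit. To show what the pattern '????-########' produces, here is a standard-library imitation of that behaviour. It is not Faker's implementation, and `bothify_sketch` is a name I made up for this illustration.

```python
import random
import string


def bothify_sketch(pattern: str, rng: random.Random) -> str:
    """Replace '?' with a random ASCII letter and '#' with a random digit."""
    out = []
    for char in pattern:
        if char == "?":
            out.append(rng.choice(string.ascii_letters))
        elif char == "#":
            out.append(rng.choice(string.digits))
        else:
            out.append(char)  # any other character is kept as-is
    return "".join(out)


rng = random.Random(1234)  # fixed seed, so repeated runs give the same ID
employee_id = bothify_sketch("????-########", rng)
print(employee_id)  # four letters, a dash, then eight digits
```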

metadata_simple.json

{
    "columns": [
        { "name": "language", "text": "en_US" },
        { "name": "customer_name", "func": "faker.company()" },
        { "name": "transcript", "func": "' '.join(faker.paragraphs())" }
    ],
    "rows": [
        {
            "count": 5,
            "locale": "en_US",
            "seed": 1234
        }
    ]
}

fake_data_builder.py

This is the builder class. Most of the logic is encapsulated here, though the implementation is very simple: it reads the metadata and then generates a pandas DataFrame accordingly.
"""Fake Data Builder."""

import json
import pandas as pd
from faker import Faker


class FakeDataBuilder:
    """Fake Data Builder."""

    def __init__(self, metadata_filename: str):
        """Create an instance of builder.

        Args:
            metadata_filename (str): metadata filename (JSON format)
        """
        with open(metadata_filename) as metadata_file:
            metadata = json.load(metadata_file)
            self.rows = metadata["rows"]
            self.columns = metadata["columns"]

    def __build_rows(
        self,
        df: pd.DataFrame,
        count: int,
        column_generator: dict,
        locale: str,
        seed: int,
    ):
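        # Fall back to Faker's default locale (en_US) when none is given,
        # so that `faker` is always defined.
        faker = Faker(locale) if locale else Faker()

        # Seed the shared generator; row groups without a seed use 0.
        Faker.seed(seed if seed else 0)

        # Build all rows first and concatenate once. DataFrame.append is
        # slow in a loop and was removed in pandas 2.0.
        new_rows = [
            self.__build_row(faker, column_generator) for _ in range(count)
        ]
        return pd.concat(
            [df, pd.DataFrame(new_rows, columns=df.columns)],
            ignore_index=True,
        )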

    def __get_column_val(
        self,
        faker: Faker,
        col: dict,
        column_generator: dict,
    ):
        col_name = col["name"]
        coln = (
            column_generator.get(col_name)
            if column_generator and column_generator.get(col_name)
            else col
        )

        return coln.get("text") if coln.get("text") else eval(coln["func"])

    def __build_row(self, faker, column_generator: dict):
        data = {}
        for col in self.columns:
            try:
                data[col["name"]] = self.__get_column_val(
                    faker,
                    col,
                    column_generator,
                )
            except AttributeError as err:
                # col.get avoids a KeyError for columns that define no "func".
                raise AttributeError(
                    f"Unrecognized column func, {col.get('func')}."
                ) from err
        return data

    def build(self):
        """Build the dataframe.

        Returns:
            pandas.DataFrame: pandas DataFrame containing the data.
        """
        column_names = [x["name"] for x in self.columns]
        df = pd.DataFrame(columns=column_names)

        for row in self.rows:
            df = self.__build_rows(
                df,
                row["count"],
                row.get("columns"),
                row.get("locale"),
                row.get("seed"),
            )

        return df
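
The builder passes each "func" string to eval(), which runs whatever code the metadata file contains. That is fine for files you write yourself. If the metadata could come from elsewhere, one possible alternative is to parse the string and allow only a single method call on faker with literal arguments. The sketch below is an assumption of mine, not part of FDG: `call_func_string` is a hypothetical helper, and `StubFaker` is a stand-in so the example runs without Faker installed.

```python
import ast


def call_func_string(expr: str, faker):
    """Evaluate a string like "faker.job()" without using eval().

    Only a single public method call on `faker` with literal arguments
    is accepted; anything else raises ValueError.
    """
    node = ast.parse(expr, mode="eval").body
    if not (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and isinstance(node.func.value, ast.Name)
        and node.func.value.id == "faker"
        and not node.func.attr.startswith("_")
        and all(kw.arg is not None for kw in node.keywords)
    ):
        raise ValueError(f"Unsupported column func, {expr}.")

    # literal_eval accepts only literals, so the arguments cannot run code.
    args = [ast.literal_eval(arg) for arg in node.args]
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return getattr(faker, node.func.attr)(*args, **kwargs)


class StubFaker:
    """Hypothetical stand-in for a Faker instance."""

    def job(self):
        return "Actor"

    def date_between(self, start_date="-30y"):
        return f"date after {start_date}"


stub = StubFaker()
print(call_func_string("faker.job()", stub))
print(call_func_string("faker.date_between(start_date='-10y')", stub))
```

The trade-off is flexibility: an expression such as "' '.join(faker.paragraphs())" from metadata_simple.json is not a plain method call on faker, so this sketch would reject it.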

main.py

This file is created to test the builder class.
"""Test code."""

from fake_data_builder import FakeDataBuilder

builder = FakeDataBuilder("./metadata.json")  # or you can choose "./metadata_simple.json"
df = builder.build()
print(df)

Once you have all the files in a folder, you can test it out like this:
pip install -r requirements.txt
python main.py
You will then see a dump of the pandas DataFrame.

screenshot of terminal
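
To switch between metadata.json and metadata_simple.json without editing main.py, the file name could be taken from the command line. This is a rough sketch of such a variant. The argument names and the `--output` option are my own assumptions; the builder is imported inside `main` so that the argument parsing can be tried on its own.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser(
        description="Generate fake data from a metadata file."
    )
    parser.add_argument(
        "metadata",
        nargs="?",
        default="./metadata.json",
        help="path to the metadata file (JSON format)",
    )
    parser.add_argument(
        "--output",
        default=None,
        help="optional CSV file to write instead of printing",
    )
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Imported here so parse_args works even without pandas and Faker installed.
    from fake_data_builder import FakeDataBuilder

    df = FakeDataBuilder(args.metadata).build()
    if args.output:
        df.to_csv(args.output, index=False)
    else:
        print(df)


# Parsing only; call main() from the script entry point to build the data.
args = parse_args(["./metadata_simple.json"])
print(args.metadata, args.output)  # prints: ./metadata_simple.json None
```

With `main()` called at the end of the script, `python main.py ./metadata_simple.json --output data.csv` would write the generated rows to data.csv.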


Note:
The list of Faker functions can be found in the Faker documentation.
