Image by kevin-ku-92347 from pexels.com
We often need to generate test data. I use Faker quite a bit since I often work on Machine Learning and Data Science projects. Unfortunately, I ended up rewriting similar code to generate data for each project. I have long wanted to write a wrapper around it that lets me generate different data for each column in a pandas DataFrame. I have been procrastinating on this idea for years, and I have decided to put an end to it. So, this is the inception of a Faker-enabled Data Generator, FDG.
MVP
A definition file defines the column names and the kind of data that we wish to generate.
- A list of column definitions: the column name and the Faker function to call for that column.
- A list of row definitions, each with:
  - count: the number of rows to generate
  - locale (optional, defaults to "en_US")
  - seed (optional): useful if we want to generate the same set of data on each run
  - columns (optional): any columns to be generated with a different function from the column definitions above
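The override rule in the last bullet can be sketched in a few lines. This is only an illustration of the idea, not the builder itself; `resolve_column` is a hypothetical helper name, and the sample data is shaped like the definition file described above.

```python
from typing import Optional


def resolve_column(col: dict, row_overrides: Optional[dict]) -> dict:
    """Return the row-level override for a column if one exists, else the column itself."""
    override = (row_overrides or {}).get(col["name"])
    return override if override else col


# Column definitions and one row group, shaped like the definition file.
columns = [
    {"name": "fname", "func": "faker.first_name()"},
    {"name": "locale"},  # no generator of its own; each row group supplies one
]
row = {"count": 100, "locale": "en_US", "columns": {"locale": {"text": "en-US"}}}

# Effective definition per column for this row group.
effective = [resolve_column(col, row.get("columns")) for col in columns]
print(effective)
```

Here `fname` keeps its own definition, while `locale` picks up the fixed text supplied by the row group.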
Example
{
  "columns": [
    { "name": "fname", "func": "faker.first_name()" },
    { "name": "lname", "func": "faker.last_name()" },
    { "name": "job", "func": "faker.job()" },
    { "name": "locale" }
  ],
  "rows": [
    {
      "count": 100,
      "locale": "en_US",
      "seed": 1234,
      "columns": {
        "locale": {
          "text": "en-US"
        }
      }
    },
    {
      "count": 20,
      "locale": "ja_JP",
      "seed": 1234,
      "columns": {
        "locale": {
          "text": "ja-JP"
        }
      }
    }
  ]
}
With this metadata, we can generate a pandas DataFrame like this.
- 4 columns
- 100 rows of English locale
- 20 rows of Japanese locale
           fname      lname                       job locale
0          Tammy  Alexander                     Actor  en-US
1    Christopher     Bender  Television floor manager  en-US
2         Austin     Steele  Chief Technology Officer  en-US
3          Katie      Young             Airline pilot  en-US
4        Valerie      Smith              Youth worker  en-US
..           ...        ...                       ...    ...
115            零         斉藤                      寿司職人  ja-JP
116           篤司         太田                    アニメーター  ja-JP
117            晃         中村                       調理師  ja-JP
118           明美         阿部                       占い師  ja-JP
119           和也         鈴木                       建築家  ja-JP
[120 rows x 4 columns]
Source Code
To keep it simple, I have all the source code in a folder.
- requirements.txt
- fake_data_builder.py
- main.py
- metadata.json
- metadata_simple.json
requirements.txt
This file contains the dependencies.
pandas==1.3.3
faker==8.14.0
metadata.json & metadata_simple.json
These files contain a complex and a simple data definition, respectively. We shall use them later.
metadata.json
{
  "columns": [
    { "name": "fname", "func": "faker.first_name()" },
    { "name": "lname", "func": "faker.last_name()" },
    { "name": "home-address", "func": "faker.address()" },
    { "name": "date_of_birth", "func": "faker.date_of_birth()" },
    { "name": "email-address", "func": "faker.ascii_free_email()" },
    { "name": "phone-number", "func": "faker.phone_number()" },
    { "name": "company", "func": "faker.company()" },
    { "name": "job", "func": "faker.job()" },
    {
      "name": "enroll_date",
      "func": "faker.date_between(start_date='-10y')"
    },
    {
      "name": "employee_id",
      "func": "faker.bothify('????-########')"
    },
    {
      "name": "locale"
    }
  ],
  "rows": [
    {
      "count": 100,
      "locale": "en_US",
      "seed": 1234,
      "columns": {
        "locale": {
          "text": "en-US"
        }
      }
    },
    {
      "count": 20,
      "locale": "ja_JP",
      "seed": 1234,
      "columns": {
        "locale": {
          "text": "ja-JP"
        }
      }
    }
  ]
}
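Note that the `locale` column has neither a `func` nor a `text` of its own, so every row group has to supply one under `columns`. A quick check like the one below could catch a definition that forgets to do so before any data is generated. It is a minimal sketch under that assumption; `find_undefined_columns` is a hypothetical helper and not part of the builder.

```python
def find_undefined_columns(metadata: dict) -> list:
    """List (row_index, column_name) pairs that have neither "func" nor "text"."""
    problems = []
    for index, row in enumerate(metadata["rows"]):
        overrides = row.get("columns") or {}
        for col in metadata["columns"]:
            # A row-level definition wins over the column-level one.
            effective = overrides.get(col["name"]) or col
            if not (effective.get("func") or effective.get("text")):
                problems.append((index, col["name"]))
    return problems


metadata = {
    "columns": [
        {"name": "fname", "func": "faker.first_name()"},
        {"name": "locale"},
    ],
    "rows": [
        {"count": 100, "columns": {"locale": {"text": "en-US"}}},
        {"count": 20},  # no override, so "locale" has no generator here
    ],
}

print(find_undefined_columns(metadata))  # [(1, 'locale')]
```

An empty list means every column can be generated for every row group.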
metadata_simple.json
{
  "columns": [
    { "name": "language", "text": "en_US" },
    { "name": "customer_name", "func": "faker.company()" },
    { "name": "transcript", "func": "' '.join(faker.paragraphs())" }
  ],
  "rows": [
    {
      "count": 5,
      "locale": "en_US",
      "seed": 1234
    }
  ]
}
fake_data_builder.py
This is the builder class. Most of the logic is encapsulated here. The implementation is very simple, though: it reads the metadata and then generates a pandas DataFrame accordingly.
"""Fake Data Builder."""
import json

import pandas as pd
from faker import Faker


class FakeDataBuilder:
    """Fake Data Builder."""

    def __init__(self, metadata_filename: str):
        """Create an instance of builder.

        Args:
            metadata_filename (str): metadata filename (JSON format)
        """
        with open(metadata_filename) as metadata_file:
            metadata = json.load(metadata_file)
        self.rows = metadata["rows"]
        self.columns = metadata["columns"]

    def __build_rows(
        self,
        df: pd.DataFrame,
        count: int,
        column_generator: dict,
        locale: str,
        seed: int,
    ):
        # Faker() defaults to en_US, so a missing locale no longer leaves
        # `faker` undefined.
        faker = Faker(locale) if locale else Faker()
        # Seed the shared generator; fall back to 0 when no seed is given.
        if seed:
            Faker.seed(seed)
        else:
            Faker.seed(0)
        for _ in range(count):
            # DataFrame.append works with the pinned pandas 1.3.3; it was
            # removed in pandas 2.0.
            df = df.append(
                self.__build_row(faker, column_generator),
                ignore_index=True,
            )
        return df

    def __get_column_val(
        self,
        faker: Faker,
        col: dict,
        column_generator: dict,
    ):
        col_name = col["name"]
        # A row-level definition, when present, overrides the column's own.
        coln = (
            column_generator.get(col_name)
            if column_generator and column_generator.get(col_name)
            else col
        )
        # The parameter must stay named `faker`: the "func" strings refer to
        # it when they are evaluated.
        return coln.get("text") if coln.get("text") else eval(coln["func"])

    def __build_row(self, faker, column_generator: dict):
        data = {}
        for col in self.columns:
            try:
                data[col["name"]] = self.__get_column_val(
                    faker,
                    col,
                    column_generator,
                )
            except AttributeError as err:
                # col.get avoids a KeyError for columns defined without "func".
                raise AttributeError(
                    f"Unrecognized column func, {col.get('func')}."
                ) from err
        return data

    def build(self):
        """Build the dataframe.

        Returns:
            pandas.DataFrame: Pandas DataFrame containing the data.
        """
        column_names = [x["name"] for x in self.columns]
        df = pd.DataFrame(columns=column_names)
        for row in self.rows:
            df = self.__build_rows(
                df,
                row["count"],
                row.get("columns"),
                row.get("locale"),
                row.get("seed"),
            )
        return df
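The builder evaluates each `func` string with `eval`, which is fine for definition files you write yourself but will run any Python expression placed in the file. If the definitions come from somewhere less trusted, one possible alternative for the simplest case is to look the method up by name instead. The sketch below is an assumption-laden illustration, not part of the builder: `call_simple_func` is a hypothetical helper, and it only accepts no-argument calls such as `faker.first_name()`.

```python
import re

# Matches definitions such as "faker.first_name()": one method, no arguments.
_SIMPLE_CALL = re.compile(r"^faker\.([A-Za-z]\w*)\(\)$")


def call_simple_func(faker, func: str):
    """Call a no-argument method named in `func` on `faker` without using eval."""
    match = _SIMPLE_CALL.match(func.strip())
    if match is None:
        raise ValueError(f"Unsupported column func, {func}.")
    # getattr raises AttributeError for an unknown method name, which matches
    # the error the builder already reports for unrecognized functions.
    return getattr(faker, match.group(1))()
```

With a real instance, `call_simple_func(Faker("en_US"), "faker.job()")` would return a job title. Definitions that pass arguments or combine calls, such as `faker.date_between(start_date='-10y')` or `' '.join(faker.paragraphs())`, are rejected by this sketch and would still need `eval` or a richer parser.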
main.py
This file is created to test the builder class.
"""Test code."""
from fake_data_builder import FakeDataBuilder
builder = FakeDataBuilder("./metadata.json")  # or you can choose "./metadata_simple.json"
df = builder.build()
print(df)
Once all the files are in one folder, we can test it out with
pip install -r requirements.txt
python main.py
and you will see a dump of the pandas DataFrame.
Note:
The list of available Faker functions can be found in the Faker documentation.

