
Description
This Python project provides a robust solution for generating synthetic datasets tailored for Master Data Management (MDM) testing and data integration projects.
- The user runs a simple CLI command and supplies the desired size of the test dataset
- The dataset is then split into two subsets in a 20/80 weight ratio; the 80% subset is left untouched
- From the 20% subset, records are randomly picked and given familiar edits (Jennifer replaced by Jenn, David by Dave, Street by St., Avenue by Ave., Apartment by Apt. or #)
- When these altered records are added back to the 20% subset, the original records are carefully marked with an indicator, which helps in later validation
- In the end several datasets are produced, but the user is presented with two main ones: {Data_Set_3_LOAD1.csv} with good, trusted records and {Data_Set_3_LOAD2.csv} with mixed, corrupted data
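The split-and-edit steps above can be sketched in Python roughly as follows. This is an illustrative sketch only: the column names ("name", "address"), the indicator column, and the function names are assumptions, not the tool's actual API, and pandas is used purely for convenience.

```python
import pandas as pd

# Familiar edits applied to the 20% subset (examples from the description).
ABBREVIATIONS = {
    "Jennifer": "Jenn",
    "David": "Dave",
    "Street": "St.",
    "Avenue": "Ave.",
    "Apartment": "Apt.",
}

def corrupt(value: str) -> str:
    """Apply the first matching familiar abbreviation to a field."""
    for full, short in ABBREVIATIONS.items():
        if full in value:
            return value.replace(full, short)
    return value

def split_and_corrupt(df: pd.DataFrame, seed: int = 42):
    """Split 20/80, corrupt a random sample of the 20% subset,
    and mark original records with an indicator column."""
    twenty = df.sample(frac=0.2, random_state=seed)
    eighty = df.drop(twenty.index)            # the 80% subset stays untouched
    twenty = twenty.assign(is_original="Y")   # indicator used in later validation
    picked = twenty.sample(frac=0.5, random_state=seed)
    altered = picked.assign(
        name=picked["name"].map(corrupt),
        address=picked["address"].map(corrupt),
        is_original="N",
    )
    # The mixed subset contains marked originals plus their altered twins.
    return eighty, pd.concat([twenty, altered], ignore_index=True)
```

The indicator column makes it possible to check, after a match/merge run, whether an MDM system correctly paired each altered record with its original.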
The result is a realistic test bed for assessing record matching, data cleansing, and standardization workflows. The ability to test how systems handle inconsistencies and merge similar records (the core MDM use case) is invaluable for improving data quality, ensuring seamless integration, and refining entity resolution processes. This tool is especially useful for evaluating record-matching algorithms and validating data governance strategies in enterprise environments.
Usage
Help options:
mockdatagen --help
Generate 10 records with no display on screen:
mockdatagen --number 10 --print N
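The flags shown above might be wired up with argparse along these lines (a sketch under assumed option semantics; the actual option handling in mockdatagen may differ):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of the mockdatagen CLI flags (assumed, not the actual source)."""
    parser = argparse.ArgumentParser(
        prog="mockdatagen",
        description="Generate synthetic datasets for MDM testing",
    )
    parser.add_argument("--number", type=int, default=10,
                        help="number of records to generate")
    parser.add_argument("--print", choices=["Y", "N"], default="Y",
                        dest="display",
                        help="whether to display generated records on screen")
    return parser
```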

High Level Conceptual Data Flow Diagram:

Release history
Version 1.0.0 - Date 6/14/2025 { Run CLI Command like mockdatagen --number 10 --print N }
Version 1.1.0 - Date 6/15/2025 { Added unit test cases, Pylint for quality, and GitHub Actions }
======================= Developers Notes ======================
Dynamically update the Pylint & coverage badges via shell script:
pylint_badge.sh
Tree:
tree /F /A > tree_output.txt
Pylint local run:
uv run pylint myapp > pylint_report.txt || true
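The badge script presumably reads the score out of pylint_report.txt; a minimal Python sketch of that parsing step (the function name is hypothetical, but the "rated at N/10" summary line is standard Pylint output):

```python
import re

def extract_pylint_score(report_text: str) -> float:
    """Pull the numeric score from Pylint's summary line, e.g.
    'Your code has been rated at 9.50/10'."""
    match = re.search(r"rated at (-?\d+(?:\.\d+)?)/10", report_text)
    if match is None:
        raise ValueError("no Pylint score found in report")
    return float(match.group(1))
```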