
Description
This Python project provides a robust solution for generating synthetic datasets tailored for Master Data Management (MDM) testing and data integration projects.
- The user runs a simple CLI command and supplies the desired size of the test dataset
- The dataset is then split into two subsets in a 20/80 weight ratio; the 80% subset is left untouched
- From the 20% subset, records are randomly picked and given familiar edits (Jennifer replaced by Jenn, David by Dave, Street by St., Avenue by Ave., Apartment by Apt. or #)
- When these altered records are added back to the 20% subset, the original records are carefully marked with an indicator, which helps in later validation
- In the end several datasets are produced, but the user is presented with two main ones: {Data_Set_3_LOAD1.csv} with good, trusted records and {Data_Set_3_LOAD2.csv} with mixed, corrupted data
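The split-and-edit steps above can be sketched in Python roughly as follows. This is an illustrative sketch only: the column names ("name", "address"), the indicator column, and the function names are assumptions, not the tool's actual API, and pandas is used purely for convenience.

```python
import pandas as pd

# Familiar edits applied to the 20% subset (examples from the description).
ABBREVIATIONS = {
    "Jennifer": "Jenn",
    "David": "Dave",
    "Street": "St.",
    "Avenue": "Ave.",
    "Apartment": "Apt.",
}

def corrupt(value: str) -> str:
    """Apply the first matching familiar abbreviation to a field."""
    for full, short in ABBREVIATIONS.items():
        if full in value:
            return value.replace(full, short)
    return value

def split_and_corrupt(df: pd.DataFrame, seed: int = 42):
    """Split 20/80, corrupt a random sample of the 20% subset,
    and mark original records with an indicator column."""
    twenty = df.sample(frac=0.2, random_state=seed)
    eighty = df.drop(twenty.index)            # the 80% subset stays untouched
    twenty = twenty.assign(is_original="Y")   # indicator used in later validation
    picked = twenty.sample(frac=0.5, random_state=seed)
    altered = picked.assign(
        name=picked["name"].map(corrupt),
        address=picked["address"].map(corrupt),
        is_original="N",
    )
    # The mixed subset contains marked originals plus their altered twins.
    return eighty, pd.concat([twenty, altered], ignore_index=True)
```

The indicator column makes it possible to check, after a match/merge run, whether an MDM system correctly paired each altered record with its original.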
The result is a realistic test bed for assessing record matching, data cleansing, and standardization workflows. The ability to test how systems handle inconsistencies and merge similar records (the core MDM use case) is invaluable for improving data quality, ensuring seamless integration, and refining entity resolution processes. This tool is especially useful for evaluating record-matching algorithms and validating data governance strategies in enterprise environments.
Usage
Help options:
mockdatagen --help
Generate 10 records with no display on screen:
mockdatagen --number 10 --print N
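The flags shown above might be wired up with argparse along these lines (a sketch under assumed option semantics; the actual option handling in mockdatagen may differ):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of the mockdatagen CLI flags (assumed, not the actual source)."""
    parser = argparse.ArgumentParser(
        prog="mockdatagen",
        description="Generate synthetic datasets for MDM testing",
    )
    parser.add_argument("--number", type=int, default=10,
                        help="number of records to generate")
    parser.add_argument("--print", choices=["Y", "N"], default="Y",
                        dest="display",
                        help="whether to display generated records on screen")
    return parser
```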

High Level Conceptual Data Flow Diagram:

Release history
Version 1.0.0 - Date 6/14/2025 { Run CLI Command like mockdatagen --number 10 --print N }
Version 1.1.0 - Date 6/15/2025 { Added unit test cases, Pylint for quality, and GitHub Actions }
======================= Developers Notes ======================
Dynamically update the Pylint & coverage badges via shell script:
pylint_badge.sh
Tree:
tree /F /A > tree_output.txt
Pylint local run:
uv run pylint myapp > pylint_report.txt || true
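The badge script presumably reads the score out of pylint_report.txt; a minimal Python sketch of that parsing step (the function name is hypothetical, but the "rated at N/10" summary line is standard Pylint output):

```python
import re

def extract_pylint_score(report_text: str) -> float:
    """Pull the numeric score from Pylint's summary line, e.g.
    'Your code has been rated at 9.50/10'."""
    match = re.search(r"rated at (-?\d+(?:\.\d+)?)/10", report_text)
    if match is None:
        raise ValueError("no Pylint score found in report")
    return float(match.group(1))
```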