Sunbelt Computer Software

GEn - Synthetic Dataset Generator

This project provides a modular, Python-based synthetic dataset generator designed specifically for Jupyter Notebook environments. It features a user-friendly GUI built with ipywidgets that lets users select data fields across multiple domains and generate a logically consistent, relational dataset that can be exported as a CSV file.

Features

Interactive GUI: Use checkboxes, sliders, and buttons directly in your Jupyter Notebook to configure your dataset.
Relational Integrity: Later fields reference previously generated values in the same row to maintain logical realism (e.g., shipping dates always follow order dates).
Extensive Data Domains: Supports Demographics, Health Metrics, Geography, Products, Commerce, and IT/System Data.
Instant Export: Automatically generates a timestamped CSV file containing your customized dataset.
Preview Integration: Displays a sample of the generated DataFrame directly within the notebook output.
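
As a rough illustration of the Relational Integrity feature, the sketch below generates an order date and a shipping date from an existing registration date. It is not taken from GEn.ipynb: the function name generate_order_dates, the day ranges, and the "Shipped" status value are assumptions made for the example.

```python
import random
from datetime import date, timedelta


def generate_order_dates(rng, registered, status):
    """Return (order_date, shipping_date) derived from the registration date.

    Hypothetical helper: the real notebook may use different ranges and names.
    """
    # The order is placed on or after the registration date, never before it.
    order_date = registered + timedelta(days=rng.randint(0, 365))
    # Cancelled orders never ship, so they carry no shipping date.
    if status == "Cancelled":
        return order_date, None
    # Shipping happens 1 to 7 days after the order (assumed range).
    shipping_date = order_date + timedelta(days=rng.randint(1, 7))
    return order_date, shipping_date


rng = random.Random(42)  # seeded so the example is repeatable
registered = date(2023, 1, 15)
order_date, shipping_date = generate_order_dates(rng, registered, "Shipped")
cancelled_order, cancelled_ship = generate_order_dates(rng, registered, "Cancelled")
```

Because each later value is computed from an earlier one in the same row, the ordering holds by construction instead of being checked after the fact.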

Requirements

To use the generator, you need Python installed along with the following libraries:

pandas
numpy
Faker
ipywidgets
IPython

You can install these dependencies using pip and the provided requirements.txt file:

pip install -r requirements.txt

Usage

Open a Jupyter Notebook.
Copy and paste the generator class and UI code into a single cell.
Run the cell to display the interactive widget.
Expand the categories to select the specific fields you need for your dataset.
Adjust the "Row Count" slider to define the size of your dataset (from 10 to 10,000 rows).
Click "Generate & Export". The script will create the data, save it as a CSV in your current working directory, and display a preview.
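
The export step in the last item can be pictured with the sketch below. It uses only the standard library so that it runs anywhere; the notebook itself most likely writes the CSV through pandas. The function name export_rows, the "dataset" prefix, and the timestamp format are assumptions, not the actual naming scheme.

```python
import csv
import tempfile
from datetime import datetime
from pathlib import Path


def export_rows(rows, directory=".", prefix="dataset"):
    """Write a list of dict rows to a timestamped CSV file and return its path."""
    if not rows:
        raise ValueError("nothing to export")
    # Assumed filename pattern: dataset_YYYYMMDD_HHMMSS.csv
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = Path(directory) / f"{prefix}_{stamp}.csv"
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path


rows = [
    {"Customer ID": 1, "Price": 9.99},
    {"Customer ID": 2, "Price": 24.5},
]

# A temporary directory stands in for the current working directory here.
with tempfile.TemporaryDirectory() as tmp:
    out = export_rows(rows, directory=tmp)
    written_name = out.name
    written_lines = out.read_text(encoding="utf-8").splitlines()
```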

Data Domains and Fields

The generator currently supports the following fields grouped by domain:

Demographics: Customer ID, Name, Age, Gender, Education Level, Occupation, Email, Registered Date
Health Metrics: Height (cm), Weight (kg), BMI, Blood Type, Heart Rate (bpm)
Geography: Country, City, State, Zip Code, Latitude, Longitude, Timezone
Products: SKU, Product Name, Category, Model, Size, Color, Ratings, Review Count
Commerce: Price, Quantity, Discount (%), Total, Order Status, Order Date, Shipping Date, Shipping Carrier, Payment Method
IT/System Data: IP Address, MAC Address, User Agent, OS, UUID

The Golden Rules (Relational Logic)

The generator keeps each row realistic by enforcing the following rules across fields:

Temporal Integrity: Registered dates precede order dates. Shipping dates logically follow order dates based on order status (e.g., Cancelled orders have no shipping date).
Identity Correlation: Email addresses are derived directly from the generated Name.
Financial Accuracy: The Total price is strictly calculated from Price, Quantity, and Discount.
Biological Consistency: BMI is mathematically derived from Height and Weight. Health metrics adjust based on Age and BMI.
Product Constraints: Product models, sizes, and names are bound to the selected product Category.
Geographical Specifics: Payment methods adapt to regional preferences.
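
Three of the rules above (Identity Correlation, Financial Accuracy, Biological Consistency) come down to deriving one field from others. The sketch below shows one plausible way to do that. The helper names, the example.com domain, and the rounding are assumptions, and the discount is assumed to be a percentage applied to Price times Quantity. The BMI formula itself is the standard weight in kilograms divided by height in metres squared.

```python
def derive_email(name, domain="example.com"):
    """Build an email address from a full name (assumed first.last format)."""
    local = ".".join(name.lower().split())
    return f"{local}@{domain}"


def compute_total(price, quantity, discount_pct):
    """Total after a percentage discount (assumed formula)."""
    return round(price * quantity * (1 - discount_pct / 100), 2)


def compute_bmi(height_cm, weight_kg):
    """Standard BMI: weight in kg divided by the square of height in metres."""
    height_m = height_cm / 100
    return round(weight_kg / height_m ** 2, 1)


email = derive_email("Ada Lovelace")
total = compute_total(20.0, 3, 10)
bmi = compute_bmi(180, 81)
```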

Extensibility

The code is organized around a single class, UniversalGenerator. To add a custom data domain, register its fields in the self.domains dictionary in the __init__ method and add the corresponding generation logic to the _generate_row method.
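
The sketch below shows the general shape of such an extension on a stripped-down stand-in class. MiniGenerator, the "Loyalty" domain, the "Loyalty Tier" field, and the signature of _generate_row are all hypothetical; the real UniversalGenerator in GEn.ipynb may structure these differently.

```python
import random


class MiniGenerator:
    """Hypothetical, minimal stand-in for UniversalGenerator."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        # Domain name -> list of field names offered in the UI.
        self.domains = {
            "Commerce": ["Price", "Quantity"],
            # Step 1: register the custom domain and its fields.
            "Loyalty": ["Loyalty Tier"],
        }

    def _generate_row(self, selected):
        row = {}
        if "Price" in selected:
            row["Price"] = round(self.rng.uniform(5, 500), 2)
        if "Quantity" in selected:
            row["Quantity"] = self.rng.randint(1, 5)
        # Step 2: add the generation logic for the new field. It reads
        # values produced earlier in the same row to stay consistent.
        if "Loyalty Tier" in selected:
            spend = row.get("Price", 0) * row.get("Quantity", 1)
            row["Loyalty Tier"] = "Gold" if spend >= 500 else "Standard"
        return row

    def generate(self, selected, count):
        return [self._generate_row(selected) for _ in range(count)]


gen = MiniGenerator(seed=1)
rows = gen.generate(["Price", "Quantity", "Loyalty Tier"], 5)
```

Placing the new field's logic after the fields it depends on mirrors the way later fields reference previously generated values in the same row.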

