Sunbelt Computer Software

🛡️ AI-Based Network Intrusion Detection System (NIDS)

A production-ready machine learning system for detecting network intrusions in real time,
trained on the NSL-KDD dataset using a RandomForest classifier.

🔍 Project Overview

This system detects network intrusions (attacks) in real time by analysing network connection features. It classifies each connection as one of two labels:

Label	Meaning
`0` — Normal	Legitimate network traffic
`1` — Attack	Intrusion / malicious activity

Attack categories in NSL-KDD:

DoS — Denial of Service (e.g., smurf, neptune)
Probe — Network scanning (e.g., ipsweep, portsweep)
R2L — Remote-to-Local attacks (e.g., guess_passwd, ftp_write)
U2R — User-to-Root attacks (e.g., buffer_overflow, rootkit)
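
As a rough illustration of how the four categories collapse into the binary label, the sketch below maps the example attack names listed above to a category and then to `0`/`1`. The dictionary covers only the names mentioned here, and the helper names (`ATTACK_CATEGORY`, `to_category`, `to_binary_label`) are assumptions for illustration, not taken from `preprocess.py`.

```python
# Hypothetical mapping covering only the example attacks named above.
ATTACK_CATEGORY = {
    "smurf": "DoS",
    "neptune": "DoS",
    "ipsweep": "Probe",
    "portsweep": "Probe",
    "guess_passwd": "R2L",
    "ftp_write": "R2L",
    "buffer_overflow": "U2R",
    "rootkit": "U2R",
}


def to_category(raw_label: str) -> str:
    """Return the attack category, 'Normal', or 'Unknown' for unmapped names."""
    if raw_label == "normal":
        return "Normal"
    return ATTACK_CATEGORY.get(raw_label, "Unknown")


def to_binary_label(raw_label: str) -> int:
    """Binary target: 0 = Normal, 1 = Attack (any non-normal label)."""
    return 0 if raw_label == "normal" else 1


for name in ("normal", "neptune", "rootkit"):
    print(name, to_category(name), to_binary_label(name))
```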

🏗️ Architecture

NSL-KDD Dataset
      │
      ▼
┌─────────────────┐
│  preprocess.py  │  ← Load, clean, one-hot encode, binarise labels
└────────┬────────┘
         │  X_train, y_train
         ▼
┌─────────────────┐
│    train.py     │  ← RandomForestClassifier, evaluate, save model
└────────┬────────┘
         │  nids_model.pkl + feature_columns.pkl
         ▼
┌─────────────────┐     ┌──────────────────┐
│   predict.py    │────▶│  FastAPI (api.py) │  /predict endpoint
└────────┬────────┘     └──────────────────┘
         │
         ▼
┌─────────────────┐
│  Streamlit UI   │  ← Interactive dashboard with charts
└─────────────────┘

📁 Project Structure

nids/
├── data/
│   ├── KDDTrain+.txt          # NSL-KDD training set (download manually)
│   └── KDDTest+.txt           # NSL-KDD test set (download manually)
│
├── model/
│   ├── nids_model.pkl         # Saved RandomForest model (generated after training)
│   └── feature_columns.pkl    # Saved feature schema (generated after training)
│
├── src/
│   ├── __init__.py
│   ├── preprocess.py          # Data loading, encoding, preprocessing pipeline
│   ├── train.py               # Model training and evaluation
│   └── predict.py             # Inference module with feature alignment
│
├── app/
│   ├── __init__.py
│   ├── app.py                 # Streamlit dashboard
│   └── api.py                 # FastAPI REST API
│
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
└── README.md

📊 Dataset

NSL-KDD is an improved version of the KDD Cup 1999 dataset, widely used for evaluating NIDS.

Download Instructions

Visit: https://www.unb.ca/cic/datasets/nsl.html
Download NSL-KDD.zip
Extract and place files in the data/ directory:

data/
├── KDDTrain+.txt
└── KDDTest+.txt

Note: The files should be the raw .txt versions without headers
(41 feature columns + label + difficulty level = 43 columns total).
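
A quick way to sanity-check the 43-column layout is to split one row into its parts. The sketch below uses only the standard library and a made-up record; the function name `split_record` is an assumption and the project's own loader in `preprocess.py` may work differently.

```python
import csv
import io

N_FEATURES = 41  # feature columns, per the note above


def split_record(row):
    """Split one 43-field row into (features, label, difficulty)."""
    if len(row) != N_FEATURES + 2:
        raise ValueError(f"expected {N_FEATURES + 2} fields, got {len(row)}")
    return row[:N_FEATURES], row[N_FEATURES], int(row[N_FEATURES + 1])


# Synthetic, made-up record used only to exercise the function.
sample_line = ",".join(
    ["0", "tcp", "http", "SF", "215", "45076"] + ["0"] * 35 + ["normal", "21"]
)

reader = csv.reader(io.StringIO(sample_line))
features, label, difficulty = split_record(next(reader))
print(len(features), label, difficulty)
```

A row with any other field count raises `ValueError`, which is a cheap way to catch a file that was saved with a header or in a different format.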

Dataset Statistics

Split	Records	Normal	Attack
Train	125,973	67,343 (53.5%)	58,630 (46.5%)
Test	22,544	9,711 (43.1%)	12,833 (56.9%)
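
The percentages in the table follow directly from the record counts. A small check, with `class_shares` as a hypothetical helper name:

```python
def class_shares(normal: int, attack: int):
    """Return (total, normal %, attack %) with percentages rounded to 1 decimal."""
    total = normal + attack
    return total, round(100 * normal / total, 1), round(100 * attack / total, 1)


# Counts copied from the table above.
print("Train:", class_shares(67_343, 58_630))
print("Test: ", class_shares(9_711, 12_833))
```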

⚙️ Setup & Installation

Prerequisites

Python 3.11 (required)
pip 23+
VS Code (recommended)
Git

1. Clone / Create the project

# If using git
git clone https://github.com/yourname/nids.git
cd nids

# OR create the directory manually and place all files

2. Create a virtual environment

# Windows
python -m venv venv
venv\Scripts\activate

# macOS/Linux
python3.11 -m venv venv
source venv/bin/activate

3. Install dependencies

pip install --upgrade pip
pip install -r requirements.txt

4. Place dataset files

# Create data directory
mkdir -p data

# Copy downloaded NSL-KDD files here
# data/KDDTrain+.txt
# data/KDDTest+.txt

🖥️ Step-by-Step Run Guide (VS Code)

Step 1 — Open project in VS Code

code .

Step 2 — Select Python interpreter

Press Ctrl+Shift+P → Python: Select Interpreter
Choose the one inside your venv folder

Step 3 — Open integrated terminal

Press Ctrl+` to open terminal, then activate venv:

# Windows
venv\Scripts\activate

# macOS/Linux
source venv/bin/activate

Step 4 — Download dataset

Download KDDTrain+.txt and KDDTest+.txt from the NSL-KDD website and place in data/.

Step 5 — Train the model

python src/train.py

Expected output:

[INFO] Loading dataset from: data/KDDTrain+.txt
[INFO] Loaded 125973 records
[INFO] Training RandomForestClassifier...
[INFO] Training complete.
[INFO] Accuracy  : 0.9978  (99.78%)
[INFO] Precision : 0.9981
[INFO] Recall    : 0.9971
[INFO] F1 Score  : 0.9976
[INFO] Model saved → model/nids_model.pkl
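
For reference, the four logged metrics can all be derived from a binary confusion matrix. The sketch below uses the standard definitions with Attack as the positive class; the counts in the example are made up, and `binary_metrics` is a hypothetical helper rather than the code in `train.py`.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts, for illustration only.
metrics = binary_metrics(tp=90, fp=10, tn=880, fn=20)
for name, value in metrics.items():
    print(f"{name:<9}: {value:.4f}")
```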

Step 6 — Test prediction (optional)

python src/predict.py

Step 7 — Launch Streamlit dashboard

streamlit run app/app.py

Browser will open at: http://localhost:8501

Step 8 — Launch FastAPI (optional, separate terminal)

uvicorn app.api:app --host 0.0.0.0 --port 8000 --reload

API docs at: http://localhost:8000/docs

🚀 Running the Project

Train Model

python src/train.py

Run Prediction Test

python src/predict.py

Streamlit Dashboard

streamlit run app/app.py

FastAPI Server

uvicorn app.api:app --reload --port 8000

🎛️ Streamlit Dashboard

The dashboard provides:

Feature input form — 20+ network traffic feature inputs
Quick load buttons — Load example Normal or Attack records
Real-time prediction — Instant Normal/Attack classification
Confidence metrics — Probability scores for both classes
Gauge chart — Visual attack probability meter
Bar chart — Class probability comparison
Feature summary table — Key features at a glance
System log — Terminal-style result output

🌐 FastAPI Endpoints

`GET /health`

Check if the model is loaded and ready.

`POST /predict`

Predict a single network traffic record.

Request body:

{
  "duration": 0,
  "protocol_type": "tcp",
  "service": "http",
  "flag": "SF",
  "src_bytes": 215,
  "dst_bytes": 45076,
  "logged_in": 1,
  ...
}

Response:

{
  "prediction": "Normal",
  "label": 0,
  "confidence": 98.5,
  "attack_probability": 1.5
}
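
A minimal client sketch for this endpoint, using only the standard library. The URL comes from the run guide above. The record contains only the fields shown in the request example, so whether the API accepts a partial record like this is an assumption. `build_predict_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/predict"  # port taken from the run guide


def build_predict_request(record: dict, url: str = API_URL) -> urllib.request.Request:
    """Wrap a record as a JSON POST request without sending it."""
    body = json.dumps(record).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 5.0) -> dict:
    """Send the request and decode the JSON response (needs a running API)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


# Only the fields shown in the request example above; the rest are elided there.
record = {
    "duration": 0,
    "protocol_type": "tcp",
    "service": "http",
    "flag": "SF",
    "src_bytes": 215,
    "dst_bytes": 45076,
    "logged_in": 1,
}
request = build_predict_request(record)
# result = send(request)  # uncomment once uvicorn is serving app.api:app
```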

`POST /predict/batch`

Predict multiple records at once.

{
  "records": [ {...}, {...} ]
}
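
When many records need scoring, it can help to split them into several batch bodies of the shape shown above. This is a sketch only: `batch_payloads` is a hypothetical helper, and the `max_batch` default is an arbitrary assumption, since the API's actual limits are not documented here.

```python
def batch_payloads(records: list, max_batch: int = 100):
    """Yield {"records": [...]} bodies holding at most max_batch records each."""
    if max_batch < 1:
        raise ValueError("max_batch must be at least 1")
    for start in range(0, len(records), max_batch):
        yield {"records": records[start:start + max_batch]}


# Five placeholder records split into bodies of at most two records.
placeholder_records = [{"duration": i} for i in range(5)]
payloads = list(batch_payloads(placeholder_records, max_batch=2))
print([len(p["records"]) for p in payloads])
```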

`GET /model/info`

Returns model metadata (type, n_estimators, features).

Interactive API Docs

Visit http://localhost:8000/docs for Swagger UI.

📈 Model Performance

Typical results on the NSL-KDD dataset are listed in the Metric / Training Set / Test Set table that follows the License section.

Results may vary slightly with the random state and dataset version.

Model configuration:

RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",
    class_weight="balanced",
    n_jobs=-1,
    random_state=42
)
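
To make `class_weight="balanced"` concrete: scikit-learn documents the weight for each class as `n_samples / (n_classes * count_of_that_class)`. The sketch below applies that formula to the full training-set counts from the Dataset Statistics table. If `train.py` holds out part of the data before fitting, the real counts and weights will differ slightly. `balanced_class_weights` is a hypothetical helper name.

```python
def balanced_class_weights(class_counts: dict) -> dict:
    """Weights as n_samples / (n_classes * class_count), per the scikit-learn formula."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# 0 = Normal, 1 = Attack; counts from the training row of the statistics table.
weights = balanced_class_weights({0: 67_343, 1: 58_630})
for label, weight in weights.items():
    print(label, round(weight, 4))
```

The minority class (Attack, in the training split) gets a weight slightly above 1 and the majority class a weight slightly below 1, so the correction here is mild.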

🐳 Docker Deployment

Build and run with Docker Compose

# Build images
docker-compose build

# Start both services
docker-compose up -d

# View logs
docker-compose logs -f

# Stop services
docker-compose down

Services:

Streamlit → http://localhost:8501
FastAPI → http://localhost:8000

Build manually

# Build image
docker build -t nids:latest .

# Run Streamlit
docker run -p 8501:8501 -v $(pwd)/data:/app/data -v $(pwd)/model:/app/model nids:latest

# Run FastAPI
docker run -p 8000:8000 -v $(pwd)/data:/app/data -v $(pwd)/model:/app/model nids:latest \
  uvicorn app.api:app --host 0.0.0.0 --port 8000

Important: Mount data/ and model/ as volumes so the container can access
your dataset and trained model.

🔬 Technical Deep Dive

Feature Engineering

The NSL-KDD dataset has 41 features:

9 basic features — connection properties (duration, bytes, protocol)
13 content features — content within connections (hot indicators, login attempts)
9 time-based traffic features — last 2 seconds of connections
10 host-based traffic features — same host connections in last 100 connections

Categorical features (one-hot encoded):

protocol_type → tcp, udp, icmp (3 values)
service → http, ftp, smtp, etc. (70 values)
flag → SF, S0, REJ, etc. (11 values)
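
The effect of one-hot encoding on the feature count can be sketched in plain Python. The `column_value` naming below mirrors the default of `pandas.get_dummies`, but whether `preprocess.py` uses that exact scheme is an assumption, as is the helper name `one_hot`. The total assumes every category listed above gets its own column.

```python
CATEGORICAL = ("protocol_type", "service", "flag")


def one_hot(record: dict, categorical=CATEGORICAL) -> dict:
    """Replace each categorical field with a single '<field>_<value>' indicator."""
    encoded = {}
    for key, value in record.items():
        if key in categorical:
            encoded[f"{key}_{value}"] = 1
        else:
            encoded[key] = value
    return encoded


# 41 raw features minus 3 categorical ones, plus one column per category value.
n_numeric = 41 - len(CATEGORICAL)
n_encoded = n_numeric + 3 + 70 + 11

print(one_hot({"duration": 0, "protocol_type": "tcp", "flag": "SF"}))
print(n_encoded)
```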

Feature Alignment

A critical challenge in production ML systems is ensuring that inference inputs match the exact column schema used during training. This project solves it with:

feature_columns.pkl — saved list of columns after one-hot encoding
preprocess_single_record() — aligns single records to training schema:
- Adds missing one-hot columns as 0
- Drops one-hot columns for categories not seen during training
- Enforces column order with reindex()
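
The three alignment steps above can be sketched without pandas. This is a simplified stand-in for the idea, not the project's `preprocess_single_record()`: `align_to_schema` is a hypothetical name, and the short column list is made up in place of the real contents of `feature_columns.pkl`.

```python
def align_to_schema(encoded: dict, feature_columns: list) -> list:
    """Return values in training column order.

    Columns missing from the record become 0, and keys that are not in the
    training schema (unseen categories) are dropped.
    """
    return [encoded.get(column, 0) for column in feature_columns]


# Made-up, shortened schema standing in for the saved column list.
feature_columns = [
    "duration",
    "src_bytes",
    "protocol_type_icmp",
    "protocol_type_tcp",
    "protocol_type_udp",
]

# An already one-hot encoded record with one category the schema never saw.
encoded_record = {
    "protocol_type_tcp": 1,
    "src_bytes": 215,
    "duration": 0,
    "service_unseen": 1,
}

aligned = align_to_schema(encoded_record, feature_columns)
print(aligned)
```

Iterating over the saved schema rather than over the record means the output always has the training width and order, whatever the input contains.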

Why RandomForest?

Handles mixed feature types well (numeric + categorical after encoding)
Robust to outliers — important for network traffic data
Feature importance — built-in interpretability
Parallelisable — fast training with n_jobs=-1
Consistent results — established benchmark on NSL-KDD (>99% accuracy)

🔧 Troubleshooting

`FileNotFoundError: KDDTrain+.txt`

→ Download the NSL-KDD dataset and place files in data/

`FileNotFoundError: nids_model.pkl`

→ Run python src/train.py first to train and save the model

`ModuleNotFoundError`

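→ Ensure your virtual environment is activated (source venv/bin/activate on macOS/Linux, venv\Scripts\activate on Windows) and that dependencies are installed with pip install -r requirements.txt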

Streamlit: `No module named 'src'`

→ Run Streamlit from the project root: streamlit run app/app.py (not from inside app/)

Feature mismatch error

→ Delete model/feature_columns.pkl and model/nids_model.pkl, retrain with python src/train.py

Low performance on test set

→ This is expected. The NSL-KDD test set (KDDTest+.txt) contains attack types that never appear
in the training set, so the model has no examples to learn them from. Test accuracy of roughly 77-99% is expected, depending on configuration.
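
One way to see how much of the test set consists of novel attack types is to compare the raw label columns of the two files. The sketch below works on plain lists of label strings. The labels in the example are placeholders, not real dataset contents, and `unseen_labels` / `unseen_share` are hypothetical helper names.

```python
def unseen_labels(train_labels, test_labels) -> list:
    """Labels that occur in the test set but never in the training set."""
    return sorted(set(test_labels) - set(train_labels))


def unseen_share(train_labels, test_labels) -> float:
    """Fraction of test records whose label was never seen in training."""
    if not test_labels:
        return 0.0
    known = set(train_labels)
    return sum(label not in known for label in test_labels) / len(test_labels)


# Placeholder labels for illustration only.
train = ["normal", "neptune", "smurf", "normal"]
test = ["normal", "neptune", "hypothetical_new_attack", "hypothetical_new_attack"]

print(unseen_labels(train, test))
print(unseen_share(train, test))
```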

👨‍💻 Author

Final Year Major Project — AI-Based Network Intrusion Detection System
Built with Python 3.11 · scikit-learn · Streamlit · FastAPI

📜 License

MIT License — free for academic and educational use.

Typical model performance on NSL-KDD:

Metric	Training Set	Test Set
Accuracy	~99.8%	~99.2%
Precision	~99.8%	~99.1%
Recall	~99.7%	~99.3%
F1 Score	~99.8%	~99.2%
