workoff-13/NIDS_PROJECT

🛡️ AI-Based Network Intrusion Detection System (NIDS)

A production-ready machine learning system for detecting network intrusions in real time,
trained on the NSL-KDD dataset using a RandomForest classifier.

Python scikit-learn Streamlit FastAPI


🔍 Project Overview

This system detects network intrusions (attacks) in real time by analysing network connection features. It classifies each connection as one of two classes:

| Label      | Meaning                        |
|------------|--------------------------------|
| 0 (Normal) | Legitimate network traffic     |
| 1 (Attack) | Intrusion / malicious activity |

Attack categories in NSL-KDD:

  • DoS — Denial of Service (e.g., smurf, neptune)
  • Probe — Network scanning (e.g., ipsweep, portsweep)
  • R2L — Remote-to-Local attacks (e.g., guess_passwd, ftp_write)
  • U2R — User-to-Root attacks (e.g., buffer_overflow, rootkit)
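
The mapping from raw NSL-KDD label names to the binary target and the four categories can be sketched as follows. This is a minimal illustration, not necessarily how preprocess.py does it: the helper names `to_binary_label` and `to_category` are hypothetical, and the lookup table lists only the attack names mentioned above (the dataset contains more).

```python
# Hypothetical helpers; only the attack names mentioned in this README are listed.
ATTACK_CATEGORY = {
    "smurf": "DoS", "neptune": "DoS",
    "ipsweep": "Probe", "portsweep": "Probe",
    "guess_passwd": "R2L", "ftp_write": "R2L",
    "buffer_overflow": "U2R", "rootkit": "U2R",
}


def to_binary_label(raw_label: str) -> int:
    """Return 0 for normal traffic and 1 for any attack name."""
    return 0 if raw_label.strip().lower() == "normal" else 1


def to_category(raw_label: str) -> str:
    """Return the attack category, or "Unknown" for names not in the table."""
    name = raw_label.strip().lower()
    if name == "normal":
        return "Normal"
    return ATTACK_CATEGORY.get(name, "Unknown")


print(to_binary_label("normal"), to_binary_label("neptune"))  # 0 1
print(to_category("rootkit"))                                 # U2R
```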

🏗️ Architecture

NSL-KDD Dataset
      │
      ▼
┌─────────────────┐
│  preprocess.py  │  ← Load, clean, one-hot encode, binarize labels
└────────┬────────┘
         │  X_train, y_train
         ▼
┌─────────────────┐
│    train.py     │  ← RandomForestClassifier, evaluate, save model
└────────┬────────┘
         │  nids_model.pkl + feature_columns.pkl
         ▼
┌─────────────────┐     ┌──────────────────┐
│   predict.py    │────▶│  FastAPI (api.py) │  /predict endpoint
└────────┬────────┘     └──────────────────┘
         │
         ▼
┌─────────────────┐
│  Streamlit UI   │  ← Interactive dashboard with charts
└─────────────────┘

📁 Project Structure

nids/
├── data/
│   ├── KDDTrain+.txt          # NSL-KDD training set (download manually)
│   └── KDDTest+.txt           # NSL-KDD test set (download manually)
│
├── model/
│   ├── nids_model.pkl         # Saved RandomForest model (generated after training)
│   └── feature_columns.pkl    # Saved feature schema (generated after training)
│
├── src/
│   ├── __init__.py
│   ├── preprocess.py          # Data loading, encoding, preprocessing pipeline
│   ├── train.py               # Model training and evaluation
│   └── predict.py             # Inference module with feature alignment
│
├── app/
│   ├── __init__.py
│   ├── app.py                 # Streamlit dashboard
│   └── api.py                 # FastAPI REST API
│
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
└── README.md

📊 Dataset

NSL-KDD is an improved version of the KDD Cup 1999 dataset, widely used for evaluating NIDS.

Download Instructions

  1. Visit: https://www.unb.ca/cic/datasets/nsl.html
  2. Download NSL-KDD.zip
  3. Extract and place files in the data/ directory:
data/
├── KDDTrain+.txt
└── KDDTest+.txt

Note: The files should be the raw .txt versions without headers
(41 feature columns + label + difficulty level = 43 columns total).
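
A quick way to confirm that a downloaded file has the expected layout is to check the column count of a line. The sketch below assumes comma-separated fields with the label and difficulty level in the last two positions; `parse_record` is a hypothetical helper, and the sample line is synthetic rather than copied from the dataset.

```python
import csv
import io

EXPECTED_COLUMNS = 43  # 41 features + label + difficulty level


def parse_record(line: str) -> dict:
    """Split one raw line and separate features, label, and difficulty."""
    row = next(csv.reader(io.StringIO(line)))
    if len(row) != EXPECTED_COLUMNS:
        raise ValueError(f"expected {EXPECTED_COLUMNS} columns, got {len(row)}")
    return {"features": row[:41], "label": row[41], "difficulty": int(row[42])}


# Synthetic line for illustration only: 6 leading fields, 35 zeros, label, difficulty.
sample = ",".join(
    ["0", "tcp", "http", "SF", "215", "45076"] + ["0"] * 35 + ["normal", "21"]
)
record = parse_record(sample)
print(record["label"], record["difficulty"], len(record["features"]))  # normal 21 41
```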

Dataset Statistics

| Split | Records | Normal         | Attack         |
|-------|---------|----------------|----------------|
| Train | 125,973 | 67,343 (53.5%) | 58,630 (46.5%) |
| Test  | 22,544  | 9,711 (43.1%)  | 12,833 (56.9%) |
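
The percentages follow directly from the record counts. A small check, using only the counts listed above (`class_shares` is a hypothetical helper):

```python
def class_shares(normal: int, attack: int) -> tuple[float, float]:
    """Return the percentage of normal and attack records, rounded to 0.1."""
    total = normal + attack
    return round(100 * normal / total, 1), round(100 * attack / total, 1)


print(class_shares(67_343, 58_630))  # (53.5, 46.5)
print(class_shares(9_711, 12_833))   # (43.1, 56.9)
```

Note that the class balance flips between the two files: normal traffic is the majority in the training set and the minority in the test set.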

⚙️ Setup & Installation

Prerequisites

  • Python 3.11 (required)
  • pip 23+
  • VS Code (recommended)
  • Git
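
If you are unsure which interpreter is active, a short check like the following can help. It is only a sketch: `is_supported` is a hypothetical helper, and it treats 3.11 as an exact requirement because that is how the prerequisite is stated here.

```python
import sys

REQUIRED = (3, 11)  # version named in the prerequisites


def is_supported(version_info=sys.version_info) -> bool:
    """True when the major and minor version match the required one."""
    return tuple(version_info[:2]) == REQUIRED


if not is_supported():
    print(f"Warning: running Python {sys.version_info[0]}.{sys.version_info[1]}, "
          f"expected {REQUIRED[0]}.{REQUIRED[1]}")
```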

1. Clone / Create the project

# If using git
git clone https://github.com/workoff-13/NIDS_PROJECT.git
cd NIDS_PROJECT

# OR create the directory manually and place all files

2. Create a virtual environment

# Windows
python -m venv venv
venv\Scripts\activate

# macOS/Linux
python3.11 -m venv venv
source venv/bin/activate

3. Install dependencies

pip install --upgrade pip
pip install -r requirements.txt

4. Place dataset files

# Create data directory
mkdir -p data

# Copy downloaded NSL-KDD files here
# data/KDDTrain+.txt
# data/KDDTest+.txt

🖥️ Step-by-Step Run Guide (VS Code)

Step 1 — Open project in VS Code

code .

Step 2 — Select Python interpreter

  • Press Ctrl+Shift+P and run "Python: Select Interpreter"
  • Choose the one inside your venv folder

Step 3 — Open integrated terminal

Press Ctrl+` to open the terminal, then activate the virtual environment:

# Windows
venv\Scripts\activate

# macOS/Linux
source venv/bin/activate

Step 4 — Download dataset

Download KDDTrain+.txt and KDDTest+.txt from the NSL-KDD website and place them in data/.

Step 5 — Train the model

python src/train.py

Expected output:

[INFO] Loading dataset from: data/KDDTrain+.txt
[INFO] Loaded 125973 records
[INFO] Training RandomForestClassifier...
[INFO] Training complete.
[INFO] Accuracy  : 0.9978  (99.78%)
[INFO] Precision : 0.9981
[INFO] Recall    : 0.9971
[INFO] F1 Score  : 0.9976
[INFO] Model saved → model/nids_model.pkl
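
For reference, the four reported metrics can all be derived from the confusion-matrix counts. The sketch below is a dependency-free illustration on toy labels, not the project's evaluation code; it assumes the attack class (1) is treated as the positive class, and `binary_metrics` is a hypothetical name.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 with class 1 (attack) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels for illustration: 3 true positives, 3 true negatives,
# 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(binary_metrics(y_true, y_pred))
```

For an intrusion detector, recall is the share of attacks that were caught, and precision is the share of raised alerts that were real attacks.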

Step 6 — Test prediction (optional)

python src/predict.py

Step 7 — Launch Streamlit dashboard

streamlit run app/app.py

Browser will open at: http://localhost:8501

Step 8 — Launch FastAPI (optional, separate terminal)

uvicorn app.api:app --host 0.0.0.0 --port 8000 --reload

API docs at: http://localhost:8000/docs


🚀 Running the Project

Train Model

python src/train.py

Run Prediction Test

python src/predict.py

Streamlit Dashboard

streamlit run app/app.py

FastAPI Server

uvicorn app.api:app --reload --port 8000

🎛️ Streamlit Dashboard

The dashboard provides:

  • Feature input form — 20+ network traffic feature inputs
  • Quick load buttons — Load example Normal or Attack records
  • Real-time prediction — Instant Normal/Attack classification
  • Confidence metrics — Probability scores for both classes
  • Gauge chart — Visual attack probability meter
  • Bar chart — Class probability comparison
  • Feature summary table — Key features at a glance
  • System log — Terminal-style result output

🌐 FastAPI Endpoints

GET /health

Check if the model is loaded and ready.

POST /predict

Predict a single network traffic record.

Request body:

{
  "duration": 0,
  "protocol_type": "tcp",
  "service": "http",
  "flag": "SF",
  "src_bytes": 215,
  "dst_bytes": 45076,
  "logged_in": 1,
  ...
}

Response:

{
  "prediction": "Normal",
  "label": 0,
  "confidence": 98.5,
  "attack_probability": 1.5
}
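
The example response suggests how the fields relate: `attack_probability` is the model's probability for class 1 in percent, and `confidence` is the probability of whichever class was predicted. The sketch below reproduces that relationship under those assumptions; the function name `format_prediction`, the 0.5 threshold, and the rounding are guesses, not taken from api.py.

```python
def format_prediction(p_attack: float, threshold: float = 0.5) -> dict:
    """Build a response dict from the attack-class probability (0.0 to 1.0)."""
    label = 1 if p_attack >= threshold else 0
    # Confidence is the probability of the predicted class.
    confidence = p_attack if label == 1 else 1.0 - p_attack
    return {
        "prediction": "Attack" if label == 1 else "Normal",
        "label": label,
        "confidence": round(confidence * 100, 2),
        "attack_probability": round(p_attack * 100, 2),
    }


print(format_prediction(0.015))
```

With scikit-learn, the attack-class probability would typically come from the second column of `predict_proba`.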

POST /predict/batch

Predict multiple records at once.

{
  "records": [ {...}, {...} ]
}

GET /model/info

Returns model metadata (type, n_estimators, features).

Interactive API Docs

Visit http://localhost:8000/docs for Swagger UI.
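
Outside Swagger UI, the /predict endpoint can be called from a script. The following sketch uses only the standard library; the helper names are hypothetical, and the record contains just the fields shown in the request body above, so the API may reject it if it requires the remaining NSL-KDD features.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/predict"  # endpoint documented in this README


def build_predict_request(record: dict, url: str = API_URL) -> urllib.request.Request:
    """Create a JSON POST request without sending it."""
    body = json.dumps(record).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(record: dict) -> dict:
    """Send the record and decode the JSON response (server must be running)."""
    with urllib.request.urlopen(build_predict_request(record), timeout=10) as resp:
        return json.load(resp)


# Partial record: only the fields shown in the request body example.
record = {
    "duration": 0,
    "protocol_type": "tcp",
    "service": "http",
    "flag": "SF",
    "src_bytes": 215,
    "dst_bytes": 45076,
    "logged_in": 1,
}
request = build_predict_request(record)
print(request.get_method(), request.full_url)  # POST http://localhost:8000/predict
```

Call `predict(record)` once the FastAPI server is up; it is not invoked here so that the snippet runs without a live server.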


📈 Model Performance

Typical results on NSL-KDD dataset:

| Metric    | Training Set | Test Set |
|-----------|--------------|----------|
| Accuracy  | ~99.8%       | ~99.2%   |
| Precision | ~99.8%       | ~99.1%   |
| Recall    | ~99.7%       | ~99.3%   |
| F1 Score  | ~99.8%       | ~99.2%   |

Results may vary slightly due to random state and dataset version.

Model configuration:

RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",
    class_weight="balanced",
    n_jobs=-1,
    random_state=42
)
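
To make `class_weight="balanced"` concrete: scikit-learn documents this mode as weighting each class by `n_samples / (n_classes * count_of_that_class)`. The sketch below applies that formula to the training-set counts from the Dataset Statistics section. It assumes the model is fitted on the full KDDTrain+ file; if train.py holds out part of it, the exact weights will differ. `balanced_class_weights` is a hypothetical helper, not a scikit-learn function.

```python
def balanced_class_weights(class_counts: dict) -> dict:
    """Weights as n_samples / (n_classes * count), per the 'balanced' mode."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}


# Counts for KDDTrain+ as listed in this README: 0 = normal, 1 = attack.
weights = balanced_class_weights({0: 67_343, 1: 58_630})
print({label: round(w, 3) for label, w in weights.items()})  # {0: 0.935, 1: 1.074}
```

Because the training classes are already close to balanced, the two weights stay near 1.0 and the setting has only a mild effect here.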

🐳 Docker Deployment

Build and run with Docker Compose

# Build images
docker-compose build

# Start both services
docker-compose up -d

# View logs
docker-compose logs -f

# Stop services
docker-compose down

Services: the Streamlit dashboard (port 8501) and the FastAPI server (port 8000).

Build manually

# Build image
docker build -t nids:latest .

# Run Streamlit
docker run -p 8501:8501 -v $(pwd)/data:/app/data -v $(pwd)/model:/app/model nids:latest

# Run FastAPI
docker run -p 8000:8000 -v $(pwd)/data:/app/data -v $(pwd)/model:/app/model nids:latest \
  uvicorn app.api:app --host 0.0.0.0 --port 8000

Important: Mount data/ and model/ as volumes so the container can access
your dataset and trained model.


🔬 Technical Deep Dive

Feature Engineering

The NSL-KDD dataset has 41 features:

  • 9 basic features: properties of a single connection (duration, bytes, protocol)
  • 13 content features: derived from the connection payload (hot indicators, login attempts)
  • 9 time-based traffic features: computed over connections in the last 2 seconds
  • 10 host-based traffic features: computed over the last 100 connections to the same host

Categorical features (one-hot encoded):

  • protocol_type → tcp, udp, icmp (3 values)
  • service → http, ftp, smtp, etc. (70 values)
  • flag → SF, S0, REJ, etc. (11 values)
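
One-hot encoding replaces each categorical column with one 0/1 column per known value. The sketch below is a dependency-free illustration: the `column_value` naming mirrors the default of `pandas.get_dummies`, but whether preprocess.py uses that function is an assumption, and the category lists are shortened to the values named above.

```python
CATEGORICAL = ("protocol_type", "service", "flag")


def one_hot(record: dict, categories: dict) -> dict:
    """Copy numeric fields and expand each categorical field into 0/1 columns."""
    encoded = {k: v for k, v in record.items() if k not in CATEGORICAL}
    for column in CATEGORICAL:
        for value in categories[column]:
            encoded[f"{column}_{value}"] = 1 if record[column] == value else 0
    return encoded


# Shortened category lists for illustration; the full sets are larger.
categories = {
    "protocol_type": ["tcp", "udp", "icmp"],
    "service": ["http", "ftp", "smtp"],
    "flag": ["SF", "S0", "REJ"],
}
row = one_hot(
    {"duration": 0, "protocol_type": "tcp", "service": "http", "flag": "SF"},
    categories,
)
print(row["protocol_type_tcp"], row["protocol_type_udp"], row["service_http"])  # 1 0 1

# With the full value sets, the 38 numeric columns plus 3 + 70 + 11 one-hot
# columns give the width of the encoded feature matrix.
print((41 - 3) + 3 + 70 + 11)  # 122
```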

Feature Alignment

A critical challenge in production ML systems is ensuring that inference inputs match the exact column schema used during training. This project solves it with:

  1. feature_columns.pkl — saved list of columns after one-hot encoding
  2. preprocess_single_record() — aligns single records to training schema:
    • Adds missing one-hot columns as 0
    • Drops one-hot columns for categories not seen during training
    • Enforces column order with reindex()
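
The three alignment steps can be sketched without any dependencies. This is an illustration of the idea rather than the code in predict.py: `align_to_schema` is a hypothetical name, and the column list stands in for the contents of feature_columns.pkl.

```python
def align_to_schema(encoded: dict, feature_columns: list) -> list:
    """Return values in training order, filling absent columns with 0.

    Columns that are not in feature_columns (for example a one-hot column
    for a category unseen in training) are dropped, because only the
    training columns are looked up.
    """
    return [encoded.get(column, 0) for column in feature_columns]


# Stand-in for the saved training schema.
feature_columns = ["duration", "protocol_type_tcp", "protocol_type_udp", "service_http"]

# An encoded record with a different order, a missing column, and an unseen one.
encoded = {
    "service_http": 1,
    "duration": 5,
    "service_new_service": 1,   # unseen during training, will be dropped
    "protocol_type_tcp": 1,     # protocol_type_udp is missing, will become 0
}
print(align_to_schema(encoded, feature_columns))  # [5, 1, 0, 1]
```

With a pandas DataFrame, `df.reindex(columns=feature_columns, fill_value=0)` performs the same three steps in one call.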

Why RandomForest?

  • Handles mixed feature types well (numeric + categorical after encoding)
  • Robust to outliers — important for network traffic data
  • Feature importance — built-in interpretability
  • Parallelisable — fast training with n_jobs=-1
  • Consistent results — established benchmark on NSL-KDD (>99% accuracy)

🔧 Troubleshooting

FileNotFoundError: KDDTrain+.txt

→ Download the NSL-KDD dataset and place files in data/

FileNotFoundError: nids_model.pkl

→ Run python src/train.py first to train and save the model
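
Both FileNotFoundError cases can be caught up front with a small preflight check run from the project root. The sketch below is hypothetical (the helper `missing_files` is not part of the project) and assumes the paths from the Project Structure section, with both dataset files treated as required for training.

```python
from pathlib import Path

# Paths taken from the Project Structure section of this README.
REQUIRED_FOR_TRAINING = ["data/KDDTrain+.txt", "data/KDDTest+.txt"]
REQUIRED_FOR_INFERENCE = ["model/nids_model.pkl", "model/feature_columns.pkl"]


def missing_files(root, relative_paths):
    """Return the relative paths that do not exist as files under root."""
    root = Path(root)
    return [p for p in relative_paths if not (root / p).is_file()]


if __name__ == "__main__":
    for name, paths in [("training", REQUIRED_FOR_TRAINING),
                        ("inference", REQUIRED_FOR_INFERENCE)]:
        missing = missing_files(".", paths)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{name}: {status}")
```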

ModuleNotFoundError

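→ Ensure your virtual environment is activated (source venv/bin/activate on macOS/Linux, venv\Scripts\activate on Windows) and that dependencies are installed with pip install -r requirements.txt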

Streamlit: No module named 'src'

→ Run Streamlit from the project root: streamlit run app/app.py (not from inside app/)

Feature mismatch error

→ Delete model/feature_columns.pkl and model/nids_model.pkl, retrain with python src/train.py

Low performance on test set

→ This is expected. The NSL-KDD test set (KDDTest+.txt) contains attack types that do not
appear in KDDTrain+.txt, so accuracy on it is lower than on data held out from the training file.
Depending on configuration, test accuracy ranges from roughly 77% to 99%.


👨‍💻 Author

Final Year Major Project — AI-Based Network Intrusion Detection System
Built with Python 3.11 · scikit-learn · Streamlit · FastAPI


📜 License

MIT License — free for academic and educational use.
