A production-ready machine learning system for detecting network intrusions in real time,
trained on the NSL-KDD dataset using a RandomForest classifier.
- Project Overview
- Architecture
- Project Structure
- Dataset
- Setup & Installation
- Step-by-Step Run Guide (VS Code)
- Running the Project
- Streamlit Dashboard
- FastAPI Endpoints
- Model Performance
- Docker Deployment
- Technical Deep Dive
- Troubleshooting
This system detects network intrusions (attacks) in real time by analysing network connection features. It classifies traffic as:
| Label | Meaning |
|---|---|
| 0 (Normal) | Legitimate network traffic |
| 1 (Attack) | Intrusion / malicious activity |
Attack categories in NSL-KDD:
- DoS — Denial of Service (e.g., smurf, neptune)
- Probe — Network scanning (e.g., ipsweep, portsweep)
- R2L — Remote-to-Local attacks (e.g., guess_passwd, ftp_write)
- U2R — User-to-Root attacks (e.g., buffer_overflow, rootkit)
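All four categories collapse into the single "Attack" class for this project. The mapping can be sketched as follows, using only the example attack names listed above. The dictionary and function names are illustrative and are not taken from preprocess.py, and the real dataset contains more attack names than this partial table covers.

```python
# Illustrative, deliberately partial mapping built from the examples above.
ATTACK_CATEGORY = {
    "smurf": "DoS", "neptune": "DoS",
    "ipsweep": "Probe", "portsweep": "Probe",
    "guess_passwd": "R2L", "ftp_write": "R2L",
    "buffer_overflow": "U2R", "rootkit": "U2R",
}


def to_binary_label(raw_label: str) -> int:
    """Return 0 for normal traffic and 1 for any attack name."""
    return 0 if raw_label.strip().lower() == "normal" else 1


def to_category(raw_label: str) -> str:
    """Return the attack category, or 'Unknown' for names not in the table."""
    name = raw_label.strip().lower()
    if name == "normal":
        return "Normal"
    return ATTACK_CATEGORY.get(name, "Unknown")
```

Because the binary label only tests for "normal", an attack name that is missing from the category table is still labelled 1.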
  NSL-KDD Dataset
         │
         ▼
┌─────────────────┐
│  preprocess.py  │  ← Load, clean, one-hot encode, label binary
└────────┬────────┘
         │  X_train, y_train
         ▼
┌─────────────────┐
│    train.py     │  ← RandomForestClassifier, evaluate, save model
└────────┬────────┘
         │  nids_model.pkl + feature_columns.pkl
         ▼
┌─────────────────┐     ┌──────────────────┐
│   predict.py    │────▶│ FastAPI (api.py) │  /predict endpoint
└────────┬────────┘     └──────────────────┘
         │
         ▼
┌─────────────────┐
│  Streamlit UI   │  ← Interactive dashboard with charts
└─────────────────┘
nids/
├── data/
│ ├── KDDTrain+.txt # NSL-KDD training set (download manually)
│ └── KDDTest+.txt # NSL-KDD test set (download manually)
│
├── model/
│ ├── nids_model.pkl # Saved RandomForest model (generated after training)
│ └── feature_columns.pkl # Saved feature schema (generated after training)
│
├── src/
│ ├── __init__.py
│ ├── preprocess.py # Data loading, encoding, preprocessing pipeline
│ ├── train.py # Model training and evaluation
│ └── predict.py # Inference module with feature alignment
│
├── app/
│ ├── __init__.py
│ ├── app.py # Streamlit dashboard
│ └── api.py # FastAPI REST API
│
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
└── README.md
NSL-KDD is an improved version of the KDD Cup 1999 dataset, widely used for evaluating NIDS.
- Visit: https://www.unb.ca/cic/datasets/nsl.html
- Download NSL-KDD.zip
- Extract and place the files in the data/ directory:
data/
├── KDDTrain+.txt
└── KDDTest+.txt
Note: The files should be the raw .txt versions without headers
(41 feature columns + label + difficulty level = 43 columns total).
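The 43-column layout can be checked before training. The sketch below uses only the standard library `csv` module and is not the project's loader (preprocess.py may well use pandas instead); `read_records` is a hypothetical helper name, and the sample row is synthetic rather than a real dataset record.

```python
import csv
import io

EXPECTED_COLUMNS = 43  # 41 features + label + difficulty level


def read_records(handle):
    """Yield (features, label, difficulty) from a header-less NSL-KDD file."""
    for line_no, row in enumerate(csv.reader(handle), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != EXPECTED_COLUMNS:
            raise ValueError(
                f"line {line_no}: expected {EXPECTED_COLUMNS} columns, got {len(row)}"
            )
        yield row[:41], row[41], int(row[42])


# Synthetic row for demonstration only: 4 + 37 feature values, label, difficulty.
sample = ",".join(["0", "tcp", "http", "SF"] + ["0"] * 37 + ["normal", "20"]) + "\n"
features, label, difficulty = next(read_records(io.StringIO(sample)))
```

For the real file, pass `open("data/KDDTrain+.txt", newline="")` instead of the in-memory sample.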
| Split | Records | Normal | Attack |
|---|---|---|---|
| Train | 125,973 | 67,343 (53.5%) | 58,630 (46.5%) |
| Test | 22,544 | 9,711 (43.1%) | 12,833 (56.9%) |
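The percentages in the table follow directly from the record counts, and the same counts show what the `class_weight="balanced"` setting in the model configuration would do. The sketch below applies scikit-learn's documented heuristic, n_samples / (n_classes * class_count), by hand. It assumes the model is fitted on the full training split; if train.py holds out part of the data, the actual weights will differ slightly.

```python
# Counts copied from the table above.
COUNTS = {
    "train": {"normal": 67_343, "attack": 58_630},
    "test": {"normal": 9_711, "attack": 12_833},
}


def class_shares(counts):
    """Percentage of records per class, rounded to one decimal place."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


def balanced_weights(counts):
    """Per-class weight: n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    return {name: total / (len(counts) * n) for name, n in counts.items()}


train_shares = class_shares(COUNTS["train"])
test_shares = class_shares(COUNTS["test"])
train_weights = balanced_weights(COUNTS["train"])
```

Because the training split is close to balanced, both weights stay near 1: the minority attack class is weighted slightly above 1 and the normal class slightly below.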
- Python 3.11 (required)
- pip 23+
- VS Code (recommended)
- Git
# If using git
git clone https://github.com/yourname/nids.git
cd nids
# OR create the directory manually and place all files

# Windows
python -m venv venv
venv\Scripts\activate
# macOS/Linux
python3.11 -m venv venv
source venv/bin/activate

pip install --upgrade pip
pip install -r requirements.txt

# Create data directory
mkdir -p data
# Copy downloaded NSL-KDD files here
# data/KDDTrain+.txt
# data/KDDTest+.txt

code .

- Press Ctrl+Shift+P → Python: Select Interpreter
- Choose the one inside your venv folder
Press Ctrl+` to open terminal, then activate venv:
# Windows
venv\Scripts\activate
# macOS/Linux
source venv/bin/activate

Download KDDTrain+.txt and KDDTest+.txt from the NSL-KDD website and place in data/.

python src/train.py

Expected output:
[INFO] Loading dataset from: data/KDDTrain+.txt
[INFO] Loaded 125973 records
[INFO] Training RandomForestClassifier...
[INFO] Training complete.
[INFO] Accuracy : 0.9978 (99.78%)
[INFO] Precision : 0.9981
[INFO] Recall : 0.9971
[INFO] F1 Score : 0.9976
[INFO] Model saved → model/nids_model.pkl
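The four reported metrics are related: F1 is the harmonic mean of precision and recall, so the logged values can be cross-checked. The sketch below shows the standard definitions. The function names are illustrative and the code is not taken from train.py.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def metrics_from_counts(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Binary metrics with Attack (label 1) as the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


# Cross-check against the precision and recall printed in the log above.
logged_f1 = round(f1_from(0.9981, 0.9971), 4)
```

Here `logged_f1` agrees with the F1 Score line in the expected output.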
python src/predict.py

streamlit run app/app.py

Browser will open at: http://localhost:8501

uvicorn app.api:app --host 0.0.0.0 --port 8000 --reload

API docs at: http://localhost:8000/docs

python src/train.py
python src/predict.py
streamlit run app/app.py
uvicorn app.api:app --reload --port 8000

The dashboard provides:
- Feature input form — 20+ network traffic feature inputs
- Quick load buttons — Load example Normal or Attack records
- Real-time prediction — Instant Normal/Attack classification
- Confidence metrics — Probability scores for both classes
- Gauge chart — Visual attack probability meter
- Bar chart — Class probability comparison
- Feature summary table — Key features at a glance
- System log — Terminal-style result output
Check if the model is loaded and ready.
Predict a single network traffic record.
Request body:
{
"duration": 0,
"protocol_type": "tcp",
"service": "http",
"flag": "SF",
"src_bytes": 215,
"dst_bytes": 45076,
"logged_in": 1,
...
}

Response:
{
"prediction": "Normal",
"label": 0,
"confidence": 98.5,
"attack_probability": 1.5
}

Predict multiple records at once.
{
"records": [ {...}, {...} ]
}

Returns model metadata (type, n_estimators, features).
Visit http://localhost:8000/docs for Swagger UI.
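A client call to the /predict endpoint might look like the sketch below, which uses only the standard library. The record contains just the fields shown in the request body above; the elided fields are left out, and whether the API accepts such a partial record is an assumption. The helper names `build_request` and `predict` are illustrative.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/predict"

# Only the fields visible in the example request body; the rest are omitted.
record = {
    "duration": 0,
    "protocol_type": "tcp",
    "service": "http",
    "flag": "SF",
    "src_bytes": 215,
    "dst_bytes": 45076,
    "logged_in": 1,
}


def build_request(payload: dict, url: str = API_URL) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(payload: dict) -> dict:
    """Send the record to a running API instance and decode the JSON reply."""
    with urllib.request.urlopen(build_request(payload), timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API running locally, `predict(record)` should return a dictionary shaped like the response example above.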
Typical results on NSL-KDD dataset:
Results may vary slightly due to random state and dataset version.
Model configuration:
RandomForestClassifier(
n_estimators=100,
max_features="sqrt",
class_weight="balanced",
n_jobs=-1,
random_state=42
)

# Build images
docker-compose build
# Start both services
docker-compose up -d
# View logs
docker-compose logs -f
# Stop services
docker-compose down

Services:
- Streamlit → http://localhost:8501
- FastAPI → http://localhost:8000
# Build image
docker build -t nids:latest .
# Run Streamlit
docker run -p 8501:8501 -v $(pwd)/data:/app/data -v $(pwd)/model:/app/model nids:latest
# Run FastAPI
docker run -p 8000:8000 -v $(pwd)/data:/app/data -v $(pwd)/model:/app/model nids:latest \
  uvicorn app.api:app --host 0.0.0.0 --port 8000

Important: Mount data/ and model/ as volumes so the container can access
your dataset and trained model.
The NSL-KDD dataset has 41 features:
- 9 basic features — connection properties (duration, bytes, protocol)
- 13 content features — content within connections (hot indicators, login attempts)
- 9 time-based traffic features — last 2 seconds of connections
- 10 host-based traffic features — same host connections in last 100 connections
Categorical features (one-hot encoded):
- protocol_type → tcp, udp, icmp (3 values)
- service → http, ftp, smtp, etc. (70 values)
- flag → SF, S0, REJ, etc. (11 values)
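One-hot encoding replaces each categorical column with one indicator column per value, which changes the width of the feature matrix. The sketch below works through that arithmetic and shows the expansion for a single value. It assumes all 38 non-categorical features are kept as numeric columns and that columns are named `<column>_<value>` (the default naming of pandas `get_dummies`); the project's actual encoder may differ.

```python
CATEGORY_COUNTS = {"protocol_type": 3, "service": 70, "flag": 11}
TOTAL_FEATURES = 41

# 41 raw features minus the 3 categorical ones leaves 38 numeric columns.
numeric_features = TOTAL_FEATURES - len(CATEGORY_COUNTS)
# 38 numeric columns + (3 + 70 + 11) indicator columns.
encoded_width = numeric_features + sum(CATEGORY_COUNTS.values())


def one_hot(column: str, value: str, known_values: list[str]) -> dict[str, int]:
    """Expand one categorical value into <column>_<value> indicator columns."""
    return {f"{column}_{known}": int(known == value) for known in known_values}


protocol_columns = one_hot("protocol_type", "udp", ["icmp", "tcp", "udp"])
```

Under these assumptions the encoded matrix has 122 columns, and exactly one indicator per categorical feature is set to 1 for any record.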
A critical challenge in production ML systems is ensuring that inference inputs match the exact column schema used during training. This project solves it with:
- feature_columns.pkl: saved list of columns after one-hot encoding
- preprocess_single_record(): aligns single records to the training schema:
  - Adds missing one-hot columns as 0
  - Removes unseen categories
  - Enforces column order with reindex()
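The three alignment steps can be illustrated in plain Python. This is a sketch of the idea, not the body of preprocess_single_record(): `align_to_schema` is a hypothetical helper, and the short column list stands in for the contents of feature_columns.pkl. With pandas, `DataFrame.reindex(columns=feature_columns, fill_value=0)` gives the same result for a one-row frame.

```python
def align_to_schema(encoded: dict[str, float], feature_columns: list[str]) -> list[float]:
    """Return values in training-column order.

    Columns missing from the record become 0, and columns that were not
    present at training time (unseen categories) are dropped.
    """
    return [encoded.get(column, 0) for column in feature_columns]


# Stand-in for the saved training schema (the real list is much longer).
feature_columns = [
    "duration",
    "src_bytes",
    "protocol_type_icmp",
    "protocol_type_tcp",
    "protocol_type_udp",
]

# A one-hot encoded record with a different key order and an unseen category.
encoded_record = {
    "src_bytes": 215,
    "duration": 0,
    "protocol_type_tcp": 1,
    "service_unseen_value": 1,
}

aligned = align_to_schema(encoded_record, feature_columns)
# aligned == [0, 215, 0, 1, 0]
```

Iterating over the saved column list rather than over the record is what guarantees the order and width the model was trained on.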
- Handles mixed feature types well (numeric + categorical after encoding)
- Robust to outliers — important for network traffic data
- Feature importance — built-in interpretability
- Parallelisable: fast training with n_jobs=-1
- Consistent results: established benchmark on NSL-KDD (>99% accuracy)
→ Download the NSL-KDD dataset and place files in data/
→ Run python src/train.py first to train and save the model
→ Ensure your virtual environment is activated: source venv/bin/activate
→ Run Streamlit from the project root: streamlit run app/app.py (not from inside app/)
→ Delete model/feature_columns.pkl and model/nids_model.pkl, retrain with python src/train.py
→ This is normal — NSL-KDD test set (KDDTest+.txt) contains novel attack types
not seen in training. ~77-99% test accuracy is expected depending on configuration.
Final Year Major Project — AI-Based Network Intrusion Detection System
Built with Python 3.11 · scikit-learn · Streamlit · FastAPI
MIT License — free for academic and educational use.
