This project demonstrates how to segment customers using the RFM (Recency, Frequency, Monetary) model combined with K-Means clustering. It helps identify patterns in customer behavior and groups them into actionable segments such as "Loyal Customers", "At Risk", and "Need Attention".
All customer data used in this project comes from OTA (Online Travel Agency) transactional data from Kaggle. The original data contained no customer IDs, so randomized IDs were generated.
- Source: Proprietary OTA customer booking data
- Anonymized with randomized CustomerID
- Time Period: 1 year snapshot
Fields used:
- Recency: Days since last booking
- Frequency: Number of bookings
- Monetary: Total value of bookings
After loading the dataset, I:
- Removed duplicates and nulls
- Calculated RFM values
- Applied square root transformation to reduce skew
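The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook code: the `bookings` table, its column names (`BookingDate`, `Amount`), the `_sqrt` column suffix, and the choice of snapshot date are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical booking records; column names are assumptions for illustration.
bookings = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2, 3, 3],
    "BookingDate": pd.to_datetime([
        "2023-01-10", "2023-06-01", "2023-03-15", "2023-03-15",
        "2023-11-20", "2023-02-05", None,
    ]),
    "Amount": [120.0, 80.0, 200.0, 200.0, 50.0, 300.0, 40.0],
})

# Remove exact duplicate rows, then rows with missing values.
clean = bookings.drop_duplicates().dropna()

# Assumed reference date: one day after the latest booking in the snapshot.
snapshot_date = clean["BookingDate"].max() + pd.Timedelta(days=1)

# One row per customer with the three RFM values.
rfm_df = clean.groupby("CustomerID").agg(
    Recency=("BookingDate", lambda d: (snapshot_date - d.max()).days),
    Frequency=("BookingDate", "count"),
    Monetary=("Amount", "sum"),
).reset_index()

# Square-root transform to reduce right skew.
for col in ["Recency", "Frequency", "Monetary"]:
    rfm_df[f"{col}_sqrt"] = np.sqrt(rfm_df[col])
```

Keeping the raw columns next to the transformed ones makes it easy to report cluster averages in the original units later.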
Each customer was scored on a scale of 1 to 4 based on R, F, and M quartiles using pd.qcut. These scores were summed into an RFM_Score.
rfm_df['RFM_Score'] = rfm_df['R_Score'] + rfm_df['F_Score'] + rfm_df['M_Score']
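For reference, quartile scoring with `pd.qcut` might look like the sketch below. The sample values are made up, and two details are assumptions rather than something the project states: the label order is reversed for Recency (so that recent customers get the higher score), and Frequency is ranked before binning to avoid duplicate bin edges when many customers share the same count.

```python
import pandas as pd

# Hypothetical RFM values for eight customers, for illustration only.
rfm_df = pd.DataFrame({
    "Recency":   [5, 40, 12, 200, 90, 30, 150, 60],
    "Frequency": [9, 2, 7, 1, 3, 5, 1, 4],
    "Monetary":  [900, 150, 700, 50, 300, 450, 80, 350],
})

# Assumption: low recency is good, so the lowest quartile gets score 4.
rfm_df["R_Score"] = pd.qcut(rfm_df["Recency"], q=4, labels=[4, 3, 2, 1]).astype(int)

# Ranking first breaks ties, which can otherwise make qcut fail on repeated bin edges.
rfm_df["F_Score"] = pd.qcut(
    rfm_df["Frequency"].rank(method="first"), q=4, labels=[1, 2, 3, 4]
).astype(int)
rfm_df["M_Score"] = pd.qcut(rfm_df["Monetary"], q=4, labels=[1, 2, 3, 4]).astype(int)

# Combined score ranges from 3 (worst) to 12 (best).
rfm_df["RFM_Score"] = rfm_df["R_Score"] + rfm_df["F_Score"] + rfm_df["M_Score"]
```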
To determine the best number of clusters for K-Means, I used:
- Elbow Method
- Silhouette Score
from sklearn.metrics import silhouette_score
- Based on results, I chose k = 4.
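A rough sketch of how both diagnostics can be computed in one loop is shown here. It runs on synthetic blobs from `make_blobs` as a stand-in for the scaled RFM matrix, so the numbers it produces say nothing about the real data; the range of k values and `n_init=10` are also assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled RFM matrix (not the project data).
X_scaled, _ = make_blobs(
    n_samples=300, centers=4, n_features=3, cluster_std=0.6, random_state=42
)

inertias, silhouettes = {}, {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = model.fit_predict(X_scaled)
    inertias[k] = model.inertia_  # within-cluster sum of squares, plotted for the elbow
    silhouettes[k] = silhouette_score(X_scaled, labels)  # higher means better separation

# Candidate k with the highest silhouette score; the elbow plot is read by eye.
best_k = max(silhouettes, key=silhouettes.get)
```

The two methods do not always agree, so the final choice of k usually also weighs how interpretable the resulting segments are.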
I standardized the RFM sqrt values and applied KMeans clustering.
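from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# X holds the square-root-transformed R, F, and M columns
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans = KMeans(n_clusters=4, random_state=42)
rfm_df['Cluster'] = kmeans.fit_predict(X_scaled)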
Before running the clustering algorithms I checked whether the square-root transformation really reduced skewness and brought the three R ∙ F ∙ M features onto comparable shapes.
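One simple way to run such a check is to compare the sample skewness of each feature before and after the transform. The sketch below uses synthetic right-skewed data as a placeholder for the real RFM table; the distributions and the `skew_report` name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data standing in for the real RFM table.
rng = np.random.default_rng(42)
rfm_df = pd.DataFrame({
    "Recency": rng.exponential(scale=60, size=1000),
    "Frequency": rng.poisson(lam=2, size=1000) + 1,
    "Monetary": rng.lognormal(mean=5, sigma=1, size=1000),
})

cols = ["Recency", "Frequency", "Monetary"]

# Side-by-side skewness: values closer to 0 indicate a more symmetric shape.
skew_report = pd.DataFrame({
    "raw": rfm_df[cols].skew(),
    "sqrt": np.sqrt(rfm_df[cols]).skew(),
})
print(skew_report.round(2))
```

Histograms of the raw and transformed columns give the same information visually.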
Once the optimal k was chosen I calculated the mean Recency,
Frequency, and Monetary values for each cluster and counted how many
customers fell into every group.
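A cluster profile of this kind can be built with a single `groupby`. The table below is a hypothetical example with three clusters, and the `cluster_profile` and `Customers` names are chosen for illustration.

```python
import pandas as pd

# Hypothetical clustered RFM table, for illustration only.
rfm_df = pd.DataFrame({
    "Recency":   [5, 10, 200, 180, 60, 50],
    "Frequency": [9, 8, 1, 2, 4, 5],
    "Monetary":  [900, 800, 50, 70, 300, 400],
    "Cluster":   [0, 0, 1, 1, 2, 2],
})

# Mean R, F, M per cluster plus the number of customers in each cluster.
cluster_profile = (
    rfm_df.groupby("Cluster")
    .agg(
        Recency=("Recency", "mean"),
        Frequency=("Frequency", "mean"),
        Monetary=("Monetary", "mean"),
        Customers=("Recency", "size"),
    )
    .round(1)
)
print(cluster_profile)
```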
I assigned human-readable labels to each cluster based on their average RFM behavior:
cluster_labels = {
    0: 'Champions',
    1: 'At Risk',
    2: 'Loyal Customers',
    3: 'Need Attention'
}
rfm_df['Segment'] = rfm_df['Cluster'].map(cluster_labels)
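K-Means assigns cluster numbers arbitrarily, so a hard-coded mapping like the one above only holds for a specific fitted model. As an alternative sketch (not the approach used in this project), the mapping can be derived from the cluster averages. The per-cluster means, the sort order, and the ranking of segment names from best to worst are all assumptions here.

```python
import pandas as pd

# Hypothetical per-cluster means; the cluster IDs are deliberately out of order.
cluster_means = pd.DataFrame(
    {
        "Recency": [12.0, 210.0, 45.0, 120.0],
        "Frequency": [8.5, 1.2, 5.0, 2.4],
        "Monetary": [950.0, 60.0, 480.0, 150.0],
    },
    index=pd.Index([2, 0, 3, 1], name="Cluster"),
)

# Assumed ordering: most recent first, then most frequent, then highest spend.
ordered = cluster_means.sort_values(
    ["Recency", "Frequency", "Monetary"], ascending=[True, False, False]
).index.tolist()

# Assumed ranking of segment names from best to worst.
names = ["Champions", "Loyal Customers", "Need Attention", "At Risk"]
cluster_labels = dict(zip(ordered, names))
```

The resulting dictionary can then be passed to `rfm_df['Cluster'].map(...)` in the same way as the manual one.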
I aggregated cluster statistics and visualized them using bar plots.
cluster_summary = rfm_df.groupby('Segment')[['Recency', 'Frequency', 'Monetary']].mean().round(1)
- Python (Pandas, Scikit-learn, Seaborn, Matplotlib)
- JupyterLab (GitHub Codespaces)
git clone https://github.com/13Saksham/rfm-customer-segmentation-python.git
cd rfm-customer-segmentation-python
- Kaggle Datasets
- Concepts inspired by DataCamp, StackOverflow, and self-practice
