Can we create a data processing pipeline and machine learning classifier that will predict whether a bank client will purchase a subscription to a bank term deposit (BTD)?
This project makes use of the UCI bank marketing data set to train and evaluate the classifiers defined in this repository.
- Set up Docker Desktop
- Pull the Docker image for this repo:
docker pull pauldhein/telemarket-model:latest
This experiment processes a CSV dataset into train/dev/test folds after first converting the label column to binary and all string-valued columns to categorical data. Categorical data columns are also one-hot encoded to aid in training. The script can be run with the following command:
docker run -it pauldhein/telemarket-model:latest python dataset_processor.py -p ../data/bank-full.csv
The `-p` flag specifies the full path of the CSV file to be processed.
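
The processing steps described above can be sketched roughly as follows. This is a minimal illustration with pandas, not the code in `dataset_processor.py`: the function name `process_dataset`, the label column `y` with `yes`/`no` values, the toy data, and the 80/10/10 split ratios are all assumptions made for the example.

```python
import pandas as pd


def process_dataset(df, label_col="y", seed=0):
    """Binarize the label, one-hot encode string columns, split into train/dev/test."""
    df = df.copy()
    # Convert the label to binary (assumes the positive class is spelled "yes").
    df[label_col] = (df[label_col] == "yes").astype(int)
    # Treat every non-numeric feature column as categorical and one-hot encode it.
    cat_cols = [
        c for c in df.columns
        if c != label_col and not pd.api.types.is_numeric_dtype(df[c])
    ]
    df = pd.get_dummies(df, columns=cat_cols, dtype=int)
    # Shuffle, then cut into three parts (the 80/10/10 ratios are an assumption).
    df = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_train = int(0.8 * len(df))
    n_dev = int(0.1 * len(df))
    train = df.iloc[:n_train]
    dev = df.iloc[n_train:n_train + n_dev]
    test = df.iloc[n_train + n_dev:]
    return train, dev, test


# Hypothetical toy rows standing in for the bank marketing CSV.
toy = pd.DataFrame({
    "age": [30, 41, 25, 52, 37, 46, 29, 58, 33, 44],
    "job": ["admin", "services", "admin", "retired", "services",
            "admin", "student", "retired", "services", "admin"],
    "y": ["no", "yes", "no", "no", "yes", "no", "no", "yes", "no", "no"],
})
train, dev, test = process_dataset(toy)
print(list(train.columns))
```

One-hot encoding before splitting keeps the column set identical across the three parts, which matters when a rare category would otherwise appear in only one of them.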
This experiment performs a grid search over specified hyperparameter sets for five different classification models. Extra processing is done during the experiment to:
- Account for class imbalance in the labels
- Select the top 20 features to improve classification using a mutual information statistic
- Scale the feature values to a common range using min-max scaling
The experiment can be run using the following command:
docker run -it pauldhein/telemarket-model:latest python model_trainer.py -p ../data/bank-full
The `-p` flag provides the prefix path of the dataset to be used for training and evaluation.
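
As a rough illustration of how those three steps can be combined with a grid search, here is a sketch using scikit-learn. It is not the code in `model_trainer.py`: the synthetic data, the use of `class_weight="balanced"` to handle imbalance, the step order, the hyperparameter grid, and the F1 scoring choice are all assumptions, and only one of the five model types is shown.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in for the processed data (about 88% negatives).
X, y = make_classification(n_samples=400, n_features=30, n_informative=8,
                           weights=[0.88], random_state=0)

pipeline = Pipeline([
    # Keep the 20 features with the highest mutual information with the label.
    ("select", SelectKBest(score_func=partial(mutual_info_classif, random_state=0), k=20)),
    # Rescale each kept feature to the [0, 1] range.
    ("scale", MinMaxScaler()),
    # Reweight classes inversely to their frequency (one way to handle imbalance).
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Hypothetical grid; the real experiment searches its own hyperparameter sets.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)
search.fit(X, y)

best_params = search.best_params_
n_selected = int(search.best_estimator_.named_steps["select"].get_support().sum())
print(best_params, n_selected)
```

Placing selection and scaling inside the `Pipeline` means both are refit on each cross-validation training split, so no information from the held-out split leaks into the preprocessing.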
This experiment verifies the performance of the various saved classifier configurations on either development data or test data. The experiment can be run using the following command:
docker run -it pauldhein/telemarket-model:latest python model_evaluator.py -t -d
The `-t` and `-d` flags select evaluation on the test set and the development set, respectively.
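
The five metrics reported for each model can be computed with scikit-learn as in the sketch below. The labels, predictions, and scores are made-up toy values, and the helper name `summarize` is hypothetical; how `model_evaluator.py` computes its numbers (for example, whether AUC-ROC uses scores or hard predictions) is not stated here.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)


def summarize(y_true, y_pred, y_score):
    """Return the five metrics for one classifier as a dict."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        # AUC-ROC is computed here from continuous scores, not hard labels.
        "auc_roc": roc_auc_score(y_true, y_score),
    }


# Toy example: 3 positives, 5 negatives.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.2, 0.6, 0.8, 0.1, 0.3, 0.4, 0.05]

metrics = summarize(y_true, y_pred, y_score)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```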
Model Name | Accuracy | Precision | Recall | F1 | AUC-ROC |
---|---|---|---|---|---|
Logistic Regression | 0.8623 | 0.4324 | 0.8677 | 0.5772 | 0.8647 |
Random Forest | 0.9053 | 0.5870 | 0.4238 | 0.4922 | 0.6938 |
Naive Bayes | 0.8371 | 0.3616 | 0.6592 | 0.4670 | 0.7589 |
Linear SVM | 0.2765 | 0.0501 | 0.3161 | 0.0865 | 0.2939 |
Neural Network | 0.8555 | 0.4242 | 0.9350 | 0.5836 | 0.8904 |
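
Because F1 is the harmonic mean of precision and recall, the F1 column can be re-derived from the two columns before it as a quick sanity check. The short sketch below does this with the values copied from the table; the helper name `f1_from` is only illustrative.

```python
# (precision, recall, reported F1) copied from the results table.
reported = {
    "Logistic Regression": (0.4324, 0.8677, 0.5772),
    "Random Forest": (0.5870, 0.4238, 0.4922),
    "Naive Bayes": (0.3616, 0.6592, 0.4670),
    "Linear SVM": (0.0501, 0.3161, 0.0865),
    "Neural Network": (0.4242, 0.9350, 0.5836),
}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {name: round(f1_from(p, r), 4) for name, (p, r, _) in reported.items()}
for name, (_, _, f1) in reported.items():
    print(f"{name}: reported {f1:.4f}, recomputed {recomputed[name]:.4f}")
```

The check also highlights the trade-off in the table: Random Forest has the highest accuracy but a lower F1 than Logistic Regression and the Neural Network, whose much higher recall outweighs their lower precision.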