AI applied to Credit Risk — Home Credit

Douglas Trajano
7 min read · Oct 6, 2019

This article is part of my final project in the Data Scientist Nanodegree at Udacity!

Udacity — Data Scientist nanodegree

The Capstone Project challenges us to test all the data science skills we learned in the program by solving a real-world problem.

I thought for a few days about which project I would like to do. Then I took the most amazing, ultra-advanced technology to organize the work. A sheet of paper! \o/

Some notes on paper

I chose a dataset from Kaggle that was provided by Home Credit for a competition.

Home Credit

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data — including telco and transactional information — to predict their clients’ repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they’re challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

This description was taken from here.

The strategy

Home Credit provided several CSV files for the competition. I’ll use only the dataset that summarizes all the information about customers, but to get better results we would need to look at the other datasets as well.

You can see the description of each file in my Kaggle Notebook.

The dataset has 122 features describing 307,511 loans. Let’s see some graphs about the principal features of this dataset.

Loan repaid or not

Whether or not the loan will be repaid is what we are trying to predict. Note that 92% of the dataset has TARGET: 0 and only 8% has TARGET: 1.

It’s an imbalanced-class problem, and we need to handle it when building the models.

To solve this problem, I broke the project into 5 steps.

As you can see, we have a classification problem here. I’ll start by creating some plots to understand the data. Then I’ll apply some techniques to transform the data, check for missing values, etc. (a small sketch of this step appears right after these steps).

We’ll probably need to reduce the dataset; we can do that by understanding which features are important and which are not.

With the dataset prepared, we’ll create some models. I’ll start with simple algorithms, like Logistic Regression and Random Forest, and try more complex algorithms if needed.

If possible, we’ll apply GridSearchCV to find the best parameters, using cross-validation.

To evaluate our models we’ll use the F1-Score and the Kaggle Score, which can work better for our imbalanced-class problem.

The Kaggle Score is provided when we submit our predictions to the competition.
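To make the data preparation step concrete, here is a minimal sketch, assuming median imputation for numeric columns and simple label encoding for categorical ones (the actual project may have made different choices):

import pandas as pd

# application_train.csv is the competition's main file
df = pd.read_csv("application_train.csv")

for col in df.columns:
    if df[col].dtype == "object":
        # encode categories as integer codes; missing values become -1
        df[col] = df[col].astype("category").cat.codes
    else:
        # fill numeric gaps with the column median
        df[col] = df[col].fillna(df[col].median())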

Data understanding

Now that we have a clear strategy that I think can work for this problem, let’s look at some information about the dataset.

I really like to use pandas-profiling for a quick analysis of datasets. This dataset is so big that a Kaggle Notebook could take 83 hours to run the report.

I ran the ProfileReport on AWS SageMaker using an ml.m5.24xlarge instance (96 vCPUs, 384 GiB of memory).
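For reference, a minimal sketch of generating the report, assuming the pandas-profiling API of the time (ProfileReport plus to_file):

import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("application_train.csv")
profile = ProfileReport(df, title="Home Credit Dataset Overview")
profile.to_file("dataset_overview.html")  # writes an HTML report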

You can see the complete Dataset overview here.

Dataset Overview

Let’s continue by looking at the distribution of some features.

Contract type
Family status
Applicant’s occupation

PCA

Principal Component Analysis (PCA) is a technique used to emphasize variation and bring out strong patterns in a dataset.

We can use it as a dimensionality reduction technique.
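A minimal sketch with scikit-learn, assuming df is the prepared, fully numeric dataframe from the earlier sketch (PCA needs scaled data without missing values):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = df.drop(columns=["TARGET"])
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

pca = PCA(n_components=0.95)  # keep enough components for 95% of the variance
X_reduced = pca.fit_transform(X_scaled)
print(X.shape[1], "->", X_reduced.shape[1], "components")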

PCA

Relevant features

Using the Pearson correlation, we can check which features are most important for understanding the TARGET.
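A minimal sketch of that check, again on the prepared dataframe:

# rank features by the absolute Pearson correlation with TARGET
corr = df.select_dtypes("number").corr()["TARGET"].drop("TARGET")
print(corr.abs().sort_values(ascending=False).head(15))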

Relevant features

Now that we know which features are relevant to the TARGET, let’s look at some graphs for better visualization.

You will see several graphs that show how the TARGET is distributed over the relevant features.

Seaborn’s pairplot for relevant features

Modeling

I tried some algorithms that can work with imbalanced classes, because we have 92% of the dataset for TARGET: 0 and only 8% for TARGET: 1.

Logistic Regression

Logistic Regression is a simple machine learning algorithm. It’s very similar to Linear Regression; let’s see the difference between the two below.

Linear Regression and Logistic Regression from ODSC’s post.
Classification Report — Logistic Regression (basic parameters)
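A minimal sketch of this baseline; the train/test split parameters are my assumptions, not necessarily what the project used:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X = df.drop(columns=["TARGET"])
y = df["TARGET"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

logreg = LogisticRegression(max_iter=1000)  # basic parameters
logreg.fit(X_train, y_train)
print(classification_report(y_test, logreg.predict(X_test)))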

We can see that our results are not good enough using LogisticRegression.

We got 68.98% accuracy and a 52.68% macro F1-Score. :(

Random Forest

I really like this algorithm; it’s my favorite, and I always try to solve classification problems with it.

This algorithm creates a model from many decision trees, each one trying to understand a small piece of the dataset.
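A minimal sketch with basic parameters, reusing the split from the Logistic Regression sketch:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))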

Random Forest Simplified from williamkoehrsen’s post.

Random Forest works a little better than Logistic Regression with basic parameters.

Classification Report — Random Forest (basic parameters)

LGBMClassifier

It’s a newer machine learning algorithm that creates a gradient boosting model.
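A minimal sketch with basic parameters, reusing the same split:

from lightgbm import LGBMClassifier
from sklearn.metrics import classification_report

lgbm = LGBMClassifier(n_estimators=100, random_state=42)
lgbm.fit(X_train, y_train)
print(classification_report(y_test, lgbm.predict(X_test)))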

LightGBM from pushkarmandot’s post

LGBMClassifier obtained similar results to Random Forest with basic parameters.

Classification Report — LGBMClassifier (basic parameters)

Keras

Keras is a great library for Deep Learning models. I used the Sequential model from Keras to create a Multilayer Perceptron (MLP) model.
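A minimal sketch of such an MLP; the layer sizes and training settings are illustrative assumptions, not the exact architecture from the project:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(64, activation="relu", input_shape=(X_train.shape[1],)),
    Dense(32, activation="relu"),
    Dense(1, activation="sigmoid"),  # probability of TARGET: 1
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train.values, y_train.values, epochs=10, batch_size=256, validation_split=0.1)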

Keras Sequential model from subhamoy.paul986’s post.
Classification Report — Keras Model

Keras didn’t perform well on this dataset, but I think the problem is my basic skills with this kind of model.

We got 72.54% accuracy and a 23.80% macro F1-Score.

Refinement

We got better results using Random Forest and LGBMClassifier than with Logistic Regression and Keras, so I decided to apply GridSearchCV to find the best parameters for each of these two models.

Random Forest

parameters = {
    "n_estimators": [10, 30, 50, 100],
    "criterion": ["gini", "entropy"],
    "max_depth": [10, 20, 30],
    "min_samples_leaf": [50, 100, 200]
}
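A minimal sketch of the search itself; the macro-F1 scoring and 3-fold CV are my assumptions. Swapping the estimator and the grid gives the LGBMClassifier search below:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    parameters,
    scoring="f1_macro",  # macro F1 respects the class imbalance
    cv=3,
    n_jobs=-1,
)
grid.fit(X_train, y_train)
print(grid.best_params_)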
Classification Report — Random Forest (GridSearchCV)

We got 78.92% accuracy (an improvement of 6.94 percentage points) and a 57.65% macro F1-Score (an improvement of 3.46 points) with the best estimator.

LGBMClassifier

parameters = {
    "n_estimators": [100, 300, 500, 1000],
    "boosting_type": ["gbdt", "dart", "goss"],
    "max_depth": [10, 20, 30]
}
Classification Report — LGBMClassifier (GridSearchCV)

We got 81.14% accuracy (an improvement of 9.66 percentage points) and a 58.28% macro F1-Score (an improvement of 4.05 points).

Results

After finding the proper hyperparameters for each model, I sent the predictions to Kaggle. Below is the Kaggle Score for each algorithm I tried here.

The Kaggle Score for this competition uses the ROC AUC metric.
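A minimal sketch of checking it locally before submitting, using the predicted probability of TARGET: 1 (an sklearn-style predict_proba is assumed):

from sklearn.metrics import roc_auc_score

proba = grid.best_estimator_.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, proba))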

Kaggle Score for each algorithm

We can see that Random Forest performs better than LGBMClassifier in the real world. This was my first experience with LGBMClassifier, while I have always worked with Random Forest on this type of problem. I think we could try more parameters and better data preparation to get more out of this algorithm.

Improvements

To improve this study, I think we could use the other CSV files provided by Home Credit; that could result in better scores because we have files with bureau data. Bureau data contains applications from previous loans that the client got from other institutions and that were reported to the Credit Bureau.

Another idea is to test other algorithms and to try to get more data for TARGET: 1, perhaps by applying a data augmentation (oversampling) technique.
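For example, SMOTE from imbalanced-learn synthesizes new minority-class examples; this is a suggestion of mine, not something the project used:

import pandas as pd
from imblearn.over_sampling import SMOTE

# oversample only the training data, never the test set
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(pd.Series(y_res).value_counts(normalize=True))  # roughly balanced now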

Conclusion

To summarize, we started by understanding the data and applying techniques to handle missing values, PCA, relevant features, etc. Then we created some models and, to finish, we submitted prediction files to get the real score from the Kaggle competition.

This is a complex problem, and we need more experimentation, and perhaps more features, to find a better solution and achieve better scores.

I’m happy to have created this study because I learned more about credit risk and about data science tools like TensorFlow and Keras. I also worked with an algorithm I had never used before (LGBMClassifier).

GitHub Repo

Kaggle Notebook

Thank you for reading, and feel free to comment with your opinions and suggestions about this work.
