Abstract of a scientific article on medical technologies; author of the work: Shen Kenneth


Field of science:
  • Medical technologies
  • Year of publication: 2019
    Journal: European journal of economics and management sciences

    Scientific article on the topic 'MAXIMIZING PROFITABILITY THROUGH MODEL SIMPLICITY AND CLUSTER ANALYSIS'

    Shen Kenneth, The Wardlaw-Hartridge School, NJ

    Edison, NJ

    MAXIMIZING PROFITABILITY THROUGH MODEL SIMPLICITY AND CLUSTER ANALYSIS

    Abstract. Machine-learning techniques were used to construct forecasting models of consumer credit risk. Using mimic data from the consumer credit risk domain, binary logistic regression was used to build models that predict the likelihood of default. The goal was to develop a model with as few predictors as possible without falling below a percent concordant of 65%. This paper compares a 4-variable model and a 12-variable model on simplicity and profitability. Using the selected model, cluster analysis was then performed to maximize the estimated profitability. The 4-variable model achieves a profit of $122,340.69 per 1000 accounts, with a KS of 0.542. The 12-variable model achieves a profit of $126,062.48 per 1000 accounts, with a KS of 0.606. The profit difference per 1000 accounts is only $3,721.79. The Cluster1 segment of the 4-variable model achieves a profit of $143,616.62 per 1000 accounts and is determined to be the best segment.

    Keywords: Machine-learning, logistic regression, consumer credit risk, K-means model.

    1. Introduction

    When developing a product or service, a predictive statistical model is needed to maximize its profitability. Such a model should predict the likelihood of default as accurately as possible, but a model with too many predictors can cost a company both time and money. It takes time to collect data, so it is reasonable to assume that a more complex model would cost additional time. Data collection also costs money, and the more variables there are in a model, the more data need to be acquired. The purposes of this paper are to explore the cost of simplicity and how a predictive statistical model can be used to increase a company's profitability.

    2. Method

    2.1 Logistic Regression

    • The binary response variable GoodBad was defined to predict the likelihood of default;

    • Before building the models, random samples taken from the dataset were partitioned into two independent files: a training dataset and a validation dataset;

    • Models were developed and tested using the backward selection option of the PROC LOGISTIC procedure;

    • Through the process of model development and validation, the 4-variable and 12-variable models were selected for comparison.
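To make the scoring step concrete, the following is a minimal Python sketch of how a fitted binary logistic model turns predictor values into a predicted probability. It is not the SAS code used in the paper. The intercept and coefficients are copied from Table 1 (4-variable model); the dictionary keys, the function name `default_probability`, and the two applicants are hypothetical. Treating the modeled event as "default" is an assumption based on the signs of the estimates.

```python
import math

# Estimates copied from Table 1 (4-variable model).
# Assumption: the modeled event is "default" (GoodBad = bad).
INTERCEPT = -0.05250
COEFFICIENTS = {
    "x1_bankcard_utilization": 0.00058,
    "x2_highest_single_utilization": 0.00099,
    "x3_collection_dollars_12m": 0.00005,
    "x4_pct_trades_never_delinquent": -0.03620,
}


def default_probability(applicant):
    """Logistic score: 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    linear = INTERCEPT + sum(
        COEFFICIENTS[name] * applicant[name] for name in COEFFICIENTS
    )
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical applicants, invented for illustration only.
clean_history = {
    "x1_bankcard_utilization": 30,
    "x2_highest_single_utilization": 45,
    "x3_collection_dollars_12m": 0,
    "x4_pct_trades_never_delinquent": 95,
}
troubled_history = {
    "x1_bankcard_utilization": 90,
    "x2_highest_single_utilization": 100,
    "x3_collection_dollars_12m": 2000,
    "x4_pct_trades_never_delinquent": 40,
}

p_clean = default_probability(clean_history)
p_troubled = default_probability(troubled_history)
```

Under these assumptions, the applicant with a lower share of never-delinquent trades receives the higher predicted probability, which follows from the negative sign of the X4 estimate.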

    2.2 Model Comparison

    • ROC curves, gains tables and KS statistics were generated for each model;

    • Data were classified into four categories, ERROR1, ERROR2, VALID1 and VALID2, using a selected cutoff probability;

    • Profitability reports were generated for each model using a profitability function.
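The gains table and KS statistic listed above can be sketched as follows. This is a minimal Python illustration, not the SAS code used in the paper. The function name `gains_table`, the dictionary keys, and the toy data are hypothetical. The sketch assumes that accounts are ranked from highest to lowest predicted default probability and that the KS value in each bin is the cumulative share of defaults captured minus the cumulative share of good accounts captured.

```python
def gains_table(scores, is_bad, n_bins=10):
    """Build one row per score bin, from highest to lowest scores."""
    n = len(scores)
    total_bad = sum(is_bad)
    total_good = n - total_bad
    if n < n_bins or total_bad == 0 or total_good == 0:
        raise ValueError("need enough accounts and both outcomes present")

    ranked = sorted(zip(scores, is_bad), key=lambda pair: pair[0], reverse=True)
    rows = []
    cum_bad = 0
    cum_n = 0
    for b in range(n_bins):
        # Integer slicing keeps the bins as equal in size as possible.
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        bads = sum(flag for _, flag in chunk)
        cum_bad += bads
        cum_n += len(chunk)
        cum_good = cum_n - cum_bad
        rows.append({
            "bin": b + 1,
            "default": bads,
            "cum_default": cum_bad,
            "mean_default": bads / len(chunk),
            "cum_default_rate": cum_bad / cum_n,
            "capture_rate": cum_bad / total_bad,
            "min_score": chunk[-1][0],
            "max_score": chunk[0][0],
            "mean_score": sum(s for s, _ in chunk) / len(chunk),
            "ks": cum_bad / total_bad - cum_good / total_good,
        })
    return rows


# Toy data, invented for illustration: 10 accounts split into 5 bins.
toy_scores = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
toy_is_bad = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
toy_rows = gains_table(toy_scores, toy_is_bad, n_bins=5)
model_ks = max(row["ks"] for row in toy_rows)
```

The model-level KS is taken as the largest per-bin value, which is how the Total row of the gains tables in this paper appears to report it.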

    2.3 Cluster Analysis using K-means model

    • Data were standardized using the PROC STDIZE procedure with the range method;

    • K-means was used to partition the data into 3 clusters. The K-means method identifies 3 centroids and then allocates every data point to the nearest cluster, while keeping the distances between the points and their cluster centroids as small as possible;

    • Canonical discriminant analysis was performed using the PROC CANDISC procedure;

    • The most profitable subpopulation to target was identified.
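The standardization and clustering steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the FASTCLUS procedure itself: range standardization is taken to mean (x - min) / (max - min) per variable, the initial centroids are supplied by the caller (FASTCLUS selects its own seeds), and the function names and toy data are hypothetical.

```python
def range_standardize(rows):
    """Rescale every column to [0, 1] using (x - min) / (max - min)."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    # A constant column has zero range; divide by 1.0 instead.
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return [
        [(value - low) / span for value, low, span in zip(row, lows, spans)]
        for row in rows
    ]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, initial_centroids, max_iterations=100):
    """Plain K-means: assign to nearest centroid, then recompute means."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy data, invented for illustration: three well-separated groups.
raw = [[0, 0], [0, 1], [1, 0],
       [10, 10], [10, 11], [11, 10],
       [20, 0], [20, 1], [21, 0]]
scaled = range_standardize(raw)
labels, centroids = k_means(scaled, [scaled[0], scaled[3], scaled[6]])
```

Standardizing first keeps variables with large numeric ranges, such as dollar amounts, from dominating the distance calculation.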

    3. Result

    3.1 4-Variable Model

    A logistic regression model using 4 variables was established as shown below (see Table 1).

    Table 1.- Analysis of maximum likelihood estimates for 4-variable model

    | Parameter | Parameter Description | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
    |---|---|---|---|---|---|---|
    | Intercept | | 1 | -0.05250 | 0.25820 | 0.04 | 0.8389 |
    | X1 | Utilization of all revolving bankcard trades | 1 | 0.00058 | 0.00025 | 5.23 | 0.0223 |
    | X2 | Highest utilization on any single bank revolving trade | 1 | 0.00099 | 0.00039 | 6.46 | 0.0111 |
    | X3 | Total collection / charge off / repossession dollars within 12 months | 1 | 0.00005 | 0.00002 | 10.57 | 0.0011 |
    | X4 | Percent of trades never delinquent or derogatory | 1 | -0.03620 | 0.00310 | 136.36 | <.0001 |

    3.2 4-Variable Model Performance

    All 4 variables are significant at the 0.05 level. The model achieves a percent concordant of 68.8 and an area under the curve of 0.688 (see Table 2 and Figure 1). The gains table (see Table 3) is tabulated below; KS reaches 0.542.

    Table 2.- Association of predicted probabilities and observed responses for 4-variable model

    | Measure | Value | Measure | Value |
    |---|---|---|---|
    | Percent Concordant | 68.8 | Somers' D | 0.377 |
    | Percent Discordant | 31.1 | Gamma | 0.378 |
    | Percent Tied | 0.2 | Tau-a | 0.147 |
    | Pairs | 2,082,730 | c | 0.689 |

    Table 3.- Gains table for 4-variable model

    | Decile | Default | Cum Default | Mean Default | Cum Default Rate | Default Capture Rate | Min Score | Max Score | Mean Score | KS |
    |---|---|---|---|---|---|---|---|---|---|
    | 1 | 80 | 80 | 0.327 | 0.327 | 0.439 | 0.172 | 1.000 | 0.356 | 0.366 |
    | 2 | 46 | 127 | 0.190 | 0.259 | 0.693 | 0.082 | 0.170 | 0.116 | 0.533 |
    | 3 | 20 | 147 | 0.080 | 0.199 | 0.801 | 0.049 | 0.082 | 0.062 | 0.542 |
    | 4 | 6 | 153 | 0.025 | 0.155 | 0.835 | 0.038 | 0.049 | 0.043 | 0.470 |
    | 5 | 9 | 162 | 0.038 | 0.132 | 0.885 | 0.031 | 0.038 | 0.034 | 0.416 |
    | 6 | 15 | 177 | 0.062 | 0.120 | 0.969 | 0.029 | 0.031 | 0.029 | 0.399 |
    | 7 | 3 | 180 | 0.011 | 0.105 | 0.984 | 0.027 | 0.029 | 0.028 | 0.307 |
    | 8 | 2 | 182 | 0.009 | 0.093 | 0.996 | 0.026 | 0.027 | 0.026 | 0.212 |
    | 9 | 1 | 183 | 0.003 | 0.083 | 0.999 | 0.025 | 0.026 | 0.025 | 0.108 |
    | 10 | 0 | 183 | 0.000 | 0.074 | 1.000 | 0.025 | 0.025 | 0.025 | 0.000 |
    | Total | 183 | 183 | 0.074 | 0.074 | 1.000 | 0.025 | 1.000 | 0.074 | 0.542 |

    Figure 1. ROC curve for 4-variable model

    3.3 4-Variable Model Profitability Calculation

    Assume that an account is classified as bad when its predicted default probability is greater than a given cutoff, and as good otherwise. GoodBad can then be assigned to the scored data. For example, if the cutoff probability of 0.116, the mean score of the second decile, is used, the model development data can be classified into 4 categories: ERROR1, ERROR2, VALID1 and VALID2. The profitability is listed in the table below (see Table 4).

    Table 4.- Profitability table for 4-variable model

    | Outcome type | Percentage | n | Profit | Profit per 1000 accounts |
    |---|---|---|---|---|
    | ERROR1 | 17% | 571 | ($105,833.14) | ($185,347.01) |
    | ERROR2 | 10% | 327 | $0.00 | $0.00 |
    | VALID1 | 9% | 295 | $0.00 | $0.00 |
    | VALID2 | 64% | 2078 | $506,009.54 | $243,507.96 |
    | Total | 100% | 3271 | $400,176.39 | $122,340.69 |

    Here ERROR1 is the category in which accounts are assigned to be good but are actually bad, so $105,833.14 is lost on 571 accounts, equivalent to a loss of $185,347.01 per 1000 accounts. ERROR2 is the category in which accounts are assigned to be bad but are actually good, so money is neither lost nor earned. VALID1 is the category in which accounts are assigned to be bad and are actually bad, so a loss is successfully avoided. VALID2 is the category in which accounts are assigned to be good and are actually good, so $506,009.54 is earned on 2078 accounts, equivalent to $243,507.96 per 1000 accounts. This is a winning business: $400,176.39 is earned on the total 3271 accounts or, equivalently, $122,340.69 per 1000 accounts.
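The four outcome categories and the per-1000 normalization can be written out as a short Python sketch. The function names are hypothetical, and the paper's profitability function is not given, so the profit totals below are simply copied from Table 4 rather than recomputed from account-level data.

```python
def outcome_type(predicted_probability, actually_bad, cutoff):
    """Classify one scored account into the four categories of the paper."""
    assigned_bad = predicted_probability > cutoff
    if not assigned_bad and actually_bad:
        return "ERROR1"  # assigned good, actually bad: money is lost
    if assigned_bad and not actually_bad:
        return "ERROR2"  # assigned bad, actually good: nothing lost or earned
    if assigned_bad and actually_bad:
        return "VALID1"  # assigned bad, actually bad: loss avoided
    return "VALID2"      # assigned good, actually good: profit earned


def profit_per_1000(total_profit, n_accounts):
    """Scale a profit total to a base of 1000 accounts."""
    return total_profit / n_accounts * 1000


# (n, profit) pairs copied from Table 4 (4-variable model, cutoff 0.116).
TABLE_4 = {
    "ERROR1": (571, -105833.14),
    "ERROR2": (327, 0.00),
    "VALID1": (295, 0.00),
    "VALID2": (2078, 506009.54),
}

total_n = sum(n for n, _ in TABLE_4.values())
total_profit = sum(profit for _, profit in TABLE_4.values())
overall_per_1000 = profit_per_1000(total_profit, total_n)
```

With the Table 4 inputs, the overall figure agrees with the reported $122,340.69 per 1000 accounts up to rounding.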

    3.4 4-Variable Model K-means Cluster and Profitability

    Customers were partitioned into 3 clusters using the SAS procedure FASTCLUS with the K-means method.

    Figure 2: Plot of canonical variables identified by cluster value

    The resulting plot (see Figure 2) illustrates the spatial separation of the clusters calculated in the FASTCLUS procedure. Here blue circles represent Cluster1, which is assumed to be the best segment in profit.

    If the same cutoff probability of 0.116 at the second decile is applied to the Cluster1 segment, the profitability table below is obtained (see Table 5). Higher profit can be achieved. Cluster1 is determined to be the best segment.

    Table 5.- Profitability table for Cluster1 segment

    | Outcome type | Proportion | n | Profit | Profit per 1000 accounts |
    |---|---|---|---|---|
    | ERROR1 | 0.1972097 | 523 | ($96,893.07) | ($185,264.00) |
    | ERROR2 | 0.0343137 | 91 | $0.00 | $0.00 |
    | VALID1 | 0.0286576 | 76 | $0.00 | $0.00 |
    | VALID2 | 0.739819 | 1962 | $477,764.34 | $243,508.84 |
    | Total | 1 | 2652 | $380,871.27 | $143,616.62 |

    3.5 12-Variable Model

    A logistic regression model using 12 variables was established as shown below (see Table 6).

    Table 6.- Analysis of maximum likelihood estimates for 12-variable model

    | Parameter | Parameter Description | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
    |---|---|---|---|---|---|---|
    | Intercept | | 1 | -1.63090 | 0.38640 | 17.81 | <.0001 |
    | X1 | Utilization of all revolving bankcard trades | 1 | 0.00138 | 0.00032 | 19.16 | <.0001 |
    | X2 | Highest utilization on any single bank revolving trade | 1 | 0.00716 | 0.00170 | 17.66 | <.0001 |
    | X3 | Total collection / charge off / repossession dollars within 12 months | 1 | 0.00006 | 0.00002 | 10.45 | 0.0012 |
    | X4 | Percent of trades never delinquent or derogatory | 1 | -0.01380 | 0.00441 | 9.82 | 0.0017 |
    | X5 | Trades open greater than or equal to 1-year payment ratio | 1 | -0.01310 | 0.00435 | 9.11 | 0.0025 |
    | X6 | Inquiries in last 6 months | 1 | 0.08310 | 0.02800 | 8.83 | 0.003 |
    | X7 | Aggregate utilization of revolving trades | 1 | 0.00777 | 0.00270 | 8.31 | 0.0039 |
    | X8 | Aggregate credit limit on revolving trades | 1 | 0.00000 | 0.00000 | 7.44 | 0.0064 |
    | X9 | Number of 30 DPD trades reported within 2 years | 1 | 0.16170 | 0.06240 | 6.71 | 0.0096 |
    | X10 | Number of 30-180 DPD within 6 months | 1 | 0.07540 | 0.02930 | 6.62 | 0.0101 |
    | X11 | Number of revolving trades with high utilization | 1 | 0.03660 | 0.01600 | 5.24 | 0.0221 |
    | X12 | The average credit limit of trades | 1 | -0.00001 | 0.00001 | 4.94 | 0.0263 |

    3.6 12-Variable Model Performance

    All 12 variables are significant at the 0.05 level. The model achieves a percent concordant of 72 and an area under the curve of 0.72 (see Table 7 and Figure 3).

    Table 7.- Association of predicted probabilities and observed responses for 12-variable model

    | Measure | Value | Measure | Value |
    |---|---|---|---|
    | Percent Concordant | 72 | Somers' D | 0.44 |
    | Percent Discordant | 28 | Gamma | 0.44 |
    | Percent Tied | 0 | Tau-a | 0.171 |
    | Pairs | 2,082,730 | c | 0.72 |
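The association measures in this table are linked by simple pair counting, which the following Python sketch illustrates. It is an exact pairwise count written for clarity and is not necessarily how PROC LOGISTIC computes these values; the function name and toy data are hypothetical. A pair consists of one bad and one good account, and it is concordant when the bad account has the higher predicted default probability. As a check against the table above: with 72% concordant, 28% discordant and 0% tied, c = 0.72 + 0.5 × 0 = 0.72 and Somers' D = 0.72 - 0.28 = 0.44.

```python
def association_measures(scores, is_bad):
    """Count concordant, discordant and tied (bad, good) pairs."""
    bad = [s for s, flag in zip(scores, is_bad) if flag]
    good = [s for s, flag in zip(scores, is_bad) if not flag]
    pairs = len(bad) * len(good)
    if pairs == 0:
        raise ValueError("need at least one bad and one good account")

    concordant = sum(1 for sb in bad for sg in good if sb > sg)
    tied = sum(1 for sb in bad for sg in good if sb == sg)
    discordant = pairs - concordant - tied

    return {
        "pairs": pairs,
        "percent_concordant": 100.0 * concordant / pairs,
        "percent_discordant": 100.0 * discordant / pairs,
        "percent_tied": 100.0 * tied / pairs,
        # c equals the area under the ROC curve.
        "c": (concordant + 0.5 * tied) / pairs,
        "somers_d": (concordant - discordant) / pairs,
    }


# Toy data, invented for illustration only.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
toy_is_bad = [1, 1, 0, 1, 0, 0]
toy_measures = association_measures(toy_scores, toy_is_bad)
```

The quadratic loop is fine for a sketch; with roughly two million pairs, as in this paper, a rank-based computation would be the more practical choice.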

    [Plot: ROC Curves for Comparisons. Horizontal axis: 1 - Specificity. Legend, ROC curve (area): 4-Variable Model (0.6886); 12-Variable Model (0.7199).]

    Figure 3. ROC curve for comparison

    The ROC curves above show very little difference between the two models, especially when 1 - Specificity is between 0.25 and 0.75. Will this difference have a big impact on profitability? To explore this further, the gains table for the 12-variable model is tabulated below (see Table 8).

    Table 8.- Gains table for 12-variable model

    | Decile | Default | Cum Default | Mean Default | Cum Default Rate | Default Capture Rate | Min Score | Max Score | Mean Score | KS |
    |---|---|---|---|---|---|---|---|---|---|
    | 1 | 97 | 97 | 0.398 | 0.398 | 0.532 | 0.207 | 1 | 0.400 | 0.467 |
    | 2 | 42 | 139 | 0.170 | 0.284 | 0.760 | 0.101 | 0.207 | 0.143 | 0.606 |
    | 3 | 17 | 156 | 0.067 | 0.211 | 0.851 | 0.062 | 0.101 | 0.079 | 0.595 |
    | 4 | 12 | 168 | 0.050 | 0.171 | 0.918 | 0.038 | 0.062 | 0.047 | 0.560 |
    | 5 | 7 | 175 | 0.029 | 0.143 | 0.957 | 0.023 | 0.038 | 0.030 | 0.493 |
    | 6 | 4 | 179 | 0.015 | 0.121 | 0.977 | 0.015 | 0.023 | 0.019 | 0.408 |
    | 7 | 1 | 180 | 0.004 | 0.105 | 0.982 | 0.010 | 0.015 | 0.012 | 0.305 |
    | 8 | 1 | 181 | 0.004 | 0.092 | 0.988 | 0.006 | 0.010 | 0.008 | 0.203 |
    | 9 | 2 | 183 | 0.008 | 0.083 | 0.999 | 0.003 | 0.006 | 0.005 | 0.108 |
    | 10 | 0 | 183 | 0.001 | 0.074 | 1 | 0.000 | 0.003 | 0.002 | 0.000 |
    | Total | 183 | 183 | 0.074 | 0.074 | 1 | 0.000 | 1 | 0.074 | 0.606 |

    The gains and lift charts show only a small advantage of the 12-variable model over the simpler one. KS reaches 0.606.

    3.7 12-Variable Model Profitability Calculation

    As in the 4-variable model profitability calculation, an account is classified as bad when its predicted default probability is greater than a given cutoff, and as good otherwise. GoodBad can then be assigned to the scored data. For example, if the cutoff probability of 0.143, the mean score of the second decile, is used, the model development data can be classified into 4 categories: ERROR1, ERROR2, VALID1 and VALID2. The profitability is listed in the table below (see Table 9).

    Table 9.- Profitability table for 12-variable model

    | Outcome type | Proportion | n | Profit | Profit per 1000 accounts |
    |---|---|---|---|---|
    | ERROR1 | 0.16631 | 544 | ($98,427.90) | ($180,933.65) |
    | ERROR2 | 0.0963008 | 315 | $0.00 | $0.00 |
    | VALID1 | 0.0984408 | 322 | $0.00 | $0.00 |
    | VALID2 | 0.6389483 | 2090 | $510,778.27 | $244,391.52 |
    | Total | 1 | 3271 | $412,350.37 | $126,062.48 |

    Here ERROR1 is the category in which accounts are assigned to be good but are actually bad, so $98,427.90 is lost on 544 accounts, equivalent to a loss of $180,933.65 per 1000 accounts. ERROR2 is the category in which accounts are assigned to be bad but are actually good, so money is neither lost nor earned. VALID1 is the category in which accounts are assigned to be bad and are actually bad, so a loss is successfully avoided. VALID2 is the category in which accounts are assigned to be good and are actually good, so $510,778.27 is earned on 2090 accounts, equivalent to $244,391.52 per 1000 accounts. This is also a winning business: $412,350.37 is earned on the total 3271 accounts or, equivalently, $126,062.48 per 1000 accounts.

    4. Discussion

    The profit difference per 1000 accounts is $3,721.79. The 12-variable model appears to have only a small advantage over the 4-variable model. However, the cost in terms of time and money also needs to be taken into consideration: the 12-variable model pays off only if collecting and maintaining the 8 additional predictors costs less than $3,721.79 per 1000 accounts. The choice between the 4-variable model and the 12-variable model therefore depends on how much the added complexity costs when the number of predictors is increased from 4 to 12.

    5. Conclusion

    This paper built two logistic regression models to predict the likelihood of default. The two models were evaluated and compared on concordance, AUC, KS, simplicity, and profitability. No recommendation is made on which model is the better choice for a company, but the final profitability that each model can deliver is calculated. The choice will depend on the cost and incremental complexity of implementing the models. The analysis also completed an unsupervised clustering process that identifies the most profitable cluster segment to target.

    References:

    1. Credit Default Risk Prediction. Available at: URL: https://repods.io/en/blog/Credit-default-risk-prediction

    2. Modern Machine Learning Algorithms: Strengths and Weaknesses. Available at: URL: https://elitedatascience.com/machine-learning-algorithms

    3. Peng C. J., Lee K. L., Ingersoll G. M. An Introduction to Logistic Regression Analysis and Reporting. The Journal of Educational Research, 96 (1), P. 3-14.

    4. Fawcett T. An introduction to ROC analysis [J]. Pattern recognition letters, 2006; 27 (8): 861-874.

    5. Stokes M., Davis C. S. Categorical Data Analysis Using the SAS System, SAS Institute Inc., 2001.

    6. SAS/STAT® 15.1 User's Guide: The FASTCLUS Procedure. Available at: URL: https://support.sas.com/documentation/onlinedoc/stat/151/fastclus.pdf

