---
title: "Developing a Credit Scorecard"
author: "Shichen Xie, Michael Thomas"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Developing a Credit Scorecard}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Traditional Credit Scoring Using Logistic Regression

After installing `scorecard` by following the instructions in the [README](https://github.com/ShichenXie/scorecard#Installation), load the package into your environment.

```r
library(scorecard)
```

### Data Preparation

Let's use the *germancredit* dataset for this demonstration.

```r
data("germancredit")
str(germancredit)
```

The `var_filter` function drops variables whose missing rate is too high (> 95% by default), whose information value (IV) is too low (< 0.02 by default), or whose identical value rate is too high (> 95% by default).

```r
dt_f <- var_filter(germancredit, y = "creditability")
```

### Split Data into Train / Test Sets

As with most traditional modeling approaches, a subset of the observations should be held out from the data used to train the scorecard model and assigned to the *test* set instead. The `split_df` function performs this sampling and returns the *train* and *test* datasets.

```r
dt_list <- split_df(dt_f, y = "creditability", ratios = c(0.6, 0.4), seed = 30)
label_list <- lapply(dt_list, function(x) x$creditability)
```

### Weight-of-Evidence (WoE) Binning

Weight-of-Evidence binning groups the values of each continuous or categorical independent variable into bins chosen to separate the two classes of the dependent variable as clearly as possible. The `woebin` function applies this technique across all independent variables in a single call.

```r
bins <- woebin(dt_f, y = "creditability")
# woebin_plot(bins)
```

The user can also adjust bin breaks interactively with the `woebin_adj` function.
```r
# breaks_adj <- woebin_adj(dt_f, y = "creditability", bins = bins)
```

Furthermore, the user can set the bin breaks manually via the `breaks_list = list()` argument of the `woebin` function. Note the use of *%,%* as a separator to combine two classes of a categorical independent variable into a single bin.

```r
breaks_adj <- list(
  age.in.years = c(26, 35, 40),
  other.debtors.or.guarantors = c("none", "co-applicant%,%guarantor")
)
bins_adj <- woebin(dt_f, y = "creditability", breaks_list = breaks_adj)
```

Once the WoE bins are established for all desired independent variables, apply the binning logic to the training and test datasets.

```r
dt_woe_list <- lapply(dt_list, function(x) woebin_ply(x, bins_adj))
```

### Logistic Regression Example

A logistic regression fitted to the WoE-transformed training data supplies the coefficients from which the scorecard is built.

```r
m1 <- glm(creditability ~ ., family = binomial(), data = dt_woe_list$train)
# vif(m1, merge_coef = TRUE)
# summary(m1)

# Select a formula-based model by AIC (or by LASSO for large dataset)
m_step <- step(m1, direction = "both", trace = FALSE)
m2 <- eval(m_step$call)
# vif(m2, merge_coef = TRUE)
# summary(m2)
```

If oversampling is a concern, the following code chunk can be uncommented and run to adjust for it by reweighting the observations.

```r
# Read documentation on handling oversampling (support.sas.com/kb/22/601.html)
# library(data.table)
# p1 <- 0.03 # bad probability in population
# r1 <- 0.3 # bad probability in sample dataset
# dt_woe <- copy(dt_woe_list$train)[, weight := ifelse(creditability == 1, p1/r1, (1-p1)/(1-r1) )][]
# fmla <- as.formula(paste("creditability ~", paste(names(coef(m2))[-1], collapse = "+")))
# m3 <- glm(fmla, family = binomial(), data = dt_woe, weights = weight)
```

### Evaluating Model Performance Using KS & ROC

The `perf_eva` function provides model accuracy statistics (such as mse, rmse, logloss, r2, ks, auc, gini) and plots (such as ks, lift, gain, roc, lz, pr, f1, density).
```r
# First, get probabilistic predictions
pred_list <- lapply(dt_woe_list, function(x) predict(m2, x, type = 'response'))

# Then evaluate model accuracy
perf <- perf_eva(pred = pred_list, label = label_list)
```

### Create Scorecard

Once the model has been selected, scorecards can be created via the `scorecard` function. Note that the defaults are 600 target points, target odds of 1/19, and 50 points to double the odds. See `?scorecard` for more information on the function and its arguments.

The scorecard can then be applied to the original data using the `scorecard_ply` function. Lastly, a chart of Population Stability Index (PSI) statistics can be rendered via the `perf_psi` function.

```r
# Build the card
card <- scorecard(bins_adj, m2)

# Obtain Credit Scores
score_list <- lapply(dt_list, function(x) scorecard_ply(x, card))

# Analyze the PSI
perf_psi(score = score_list, label = label_list)
```
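
To make the default scaling parameters mentioned above more concrete, here is a minimal base R sketch of the conventional "points to double the odds" arithmetic. It assumes the common convention that the score falls as the odds of a bad outcome rise; the names `a`, `b` and `score_from_odds` are illustrative only and are not part of the package API, and the package may compute its points differently internally.

```r
# Assumed scaling convention: score = a - b * log(odds of bad)
points0 <- 600     # target points
odds0   <- 1 / 19  # target odds
pdo     <- 50      # points to double the odds

b <- pdo / log(2)              # points per unit of log-odds
a <- points0 + b * log(odds0)  # offset so that odds0 maps to points0

# Hypothetical helper: convert odds of bad into a score
score_from_odds <- function(odds) a - b * log(odds)

score_from_odds(odds0)      # target odds give the target points (600)
score_from_odds(2 * odds0)  # doubling the odds of bad costs pdo points (550)
score_from_odds(odds0 / 2)  # halving the odds of bad gains pdo points (650)
```

Under this convention, changing `pdo` rescales how quickly the score moves with the odds, while `points0` and `odds0` together fix where the scale is anchored.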