Can financial analysts optimize a loan portfolio based on data-centric strategies & quantitative algorithms?
Published in · 13 min read · Jul 19, 2020
--
Authors: Noah Mukhtar, Shaan Kohli, Shaher yar Jahangir and Ramy Hammam
While banks have traditionally acted as the main provider of loans, the needs of smaller businesses are evolving and now come with more concerns.
2020 has brought along spikes in operating costs and increasing fluctuations in demand.
These are some of the extreme challenges SMEs face when it comes to raising capital — a crucial element of their growth. Banks often overlook these, and as a result, this has opened up the door for the P2P Lending Market to grow.
This is why we are going to create a data-driven investment strategy for Prosper Marketplace, a P2P lending firm, that will take a holistic overview from processing the raw data to inferring business outcomes.
Overall, our predictive model will act as a prescriptive tool that directly portrays the impact on the investor & business.
Two fundamental steps are required for an investor to gain a better understanding of which loan to invest in:
- Prosper’s investors will have to decide on how much money to invest (considering potential return) and allocate to other options for investment.
- Prosper’s investors will need to decide on picking “good” loans (i.e.: loans which will end up being fully paid off) in which to invest their money, and this is what our loan classification model will focus on.
However, it is important to note that there is no one ideal investment, as it is highly contingent on the risk appetite of the investor.
The datasets to be utilized in this analysis are publicly available online for download using a registered Prosper account.
Predictors
The dataset contains comprehensive information on all loans posted annually and a set of 22 variables that cover the different attributes of an individual loan.
These include but are not limited to the loan amount, the amount of interest & fees paid, loan duration, a Prosper credit rating score, and loan status with reason if completed or defaulted.
Completed vs. Defaulted Loan: 0s vs 1s
There are 4 outcomes for a loan in Prosper’s data:
- Completed (A)
- Charged Off (B)
- Current (C)
- Defaulted (D)
To simplify our Prosper data analysis, we decided to assign a value of 0 to completed loans (A) and 1 to the remaining 3 types of loan statuses (B, C, & D) that refer to loans that have not been completed.
How Can We Identify a “Good” Loan?
In order to acquire an intuition of identifying good loans, it is important to realize the correlation & interaction amongst all the variables.
However, the definition of a good loan is still quite ambiguous so it is vital that the prediction should be based on one of these goals: whether a loan will default, paid back early, time of default, or time taken to pay back if paid back early.
Analyzing Loan Terms
The plot indicates some interesting insights for the 60-month term (the longer-term) such as the highest loan amount being approximately $15,000 for both completed loans as well as uncompleted loans. Unsurprisingly, borrowers tend to take higher loan amounts to pay back over the longer-term loans.
Whereas loans in the 36-month terms (the shorter term) tend to be of much smaller amounts and are more consistently distributed for both completed and uncompleted loans.
Analyzing Interest Rate Distribution Per Loan Status
Another violin plot was built that shows the different levels of interest rates for each of the loan status per grade type (ranging from AA — HR: highest grade to lowest).
Higher-grade loans for both completed (0) and uncompleted (1) have lower interest rates, and this trend is consistent for both loan statuses.
Data Leakage
This phenomenon occurs when a model is built using predictors that will not be available at the time of future predictions. Therefore, we need to disregard these predictors in order to prevent a bias in our model.
There are two main cases on how leakage can be identified:
1. A predictor that is highly correlated with the target variable. We will check this using a heatmap or correlation matrix of all 22 predictors.
2. The predictor information is not available during the prediction of total return, for example, the late_fees_paid variable.
As suspected, we noticed that variables that are associated with data leakage such as the interest paid and principal paid stood out to have a high correlation with loan status.
Based on this definition of leakage and the fact that most of these won’t be available when predicting whether a loan will default or not, we disregard the following predictors:
loan_number, late_fees, age_in_months, days_past_due, origination_date, principal_balance, principal_paid, interest_paid, late_fees_paid, debt_sales_proceeds_received, loan_default_reason, loan_default_reason_description, next_payment_due_date, next_payment_due_amount, co_borrower_application
Calculation of Investor’s Returns
The most significant data required in determining the potential return from a given loan is the calculation of the total payments that were received on each of the loans.
To build an effective investment strategy, we need to build a strong indicator variable on the return amount of each loan. It is vital that the return should consider both partially paid off defaulted loans and loans that have been paid off earlier than the due date.
In general, there are three effective return measures that can be created using the following variables:
1. Total payment account denoted by variable p
2. Total amount invested in the loan (amount borrowed by borrower) denoted by variable f
3. Nominal length of the loan in months (term of loan) denoted by the variable t
4. Actual length of the loan in months (loan length) denoted by the variable m
We are going to calculate the expected returns by scenario testing under 3 different strategies:
1) Pessimistic (M1)
The pessimistic approach states that when the loan is paid back, the investor still cannot re-invest it until the term of the loan expires.
Using equation 1, our team was able to create a new variable called: “ret_PESS”.
It was worth noting that this approach favors long-term loans for loans that default early due to the loss spreading over a greater span.
2) Optimistic (M2)
The optimistic approach entails that once the loan is paid back, the investor’s money is returned, and the investor can invest immediately in another loan with the same return.
Using equation 2, our team was able to create a new variable called: “ret_OPT”.
This method can be rephrased as the annual monthly return of the loan over the time it was active. However, it is plagued by the assumption that money can be reinvested with the same rate and that if the loan defaults early, the loss can be a drastic overestimate of negative returns. On the plus side, short and long loans are treated equally using this approach.
3) Fixed-horizon (M3)
The fixed-horizon approach involves calculating fixed-time returns for 3 different interest rates: 1,3 & 6 %.
A function was built, called ret_method_3(), to replicate equation 3 and calculate the 3 new variables for each loan in the dataset.
This approach can be the most accurate by equalizing the differences between loan length and defaulted loans. However, it tends to disregard the depreciation value of the money over time.
Prosper Rating (Grade) Breakdown
A deeper dive into the prosper credit rating (“grade”) variable was performed to explore any significant trends with each of the type of return (M1-M3).
The higher the grade, the smaller percentage of defaulting, and the lower the interest rates.
The mean return % is very identical with minor increases across the different grades. In addition, there are no negative mean return values across any of the grades.
A quick count of the number of loan status for each type shows a significantly higher number of completed loans when compared to uncompleted loans, indicating a class imbalance in the dataset.
Therefore, the most probable option is to try to rebalance the dataset to consist of 50% of completed loans and 50% of uncompleted loans. However, doing this could prove to be detrimental and cause the average return to be negative, fully going against the point of enticing potential investors to sign up to Prosper.
Thus, to capture actual real-life trends and true proportions, physically balancing the dataset would hinder a realistic understanding of the analysis.
Instead, a calibration curve measure will be used in the model-building phase to check if a constructed model is biased by the data imbalance or not.
The predictive modeling will cover a two-stage approach:
First, we will be constructing a binary classification model to predict the loan default probability using a variety of industry-standard algorithms and evaluating the performance for the optimal and most accurate model.
The second stage involves constructing a regression model using a variety of industry-standard regressors to predict the amount of return a loan may generate to an investor for each of the return approaches (M1-M3).
The models in both phases will be optimized for higher accuracy by hyperparameter tuning through cross-validation.
Stage 1: Classifying Loan Outcomes
The first model was to determine the predictive power of the grade and interest rate of a loan, which is usually the industry’s standard practice.
Predictors: Grade & Interest Rate
Model #1 — Accuracy Score: 85%
This implies that these two features individually contribute to successfully determining the loan status as much as when all the predictors are used.
However, to build a more robust model and avoid underfitting, we have to drop these two features entirely.
Predictors: All Excluding Grade & Interest Rate
Model #2 — Accuracy Score: ~85%
Using all the predictors apart from grade and interest rate, model accuracies were maximized at roughly 85%. Although Naïve Bayes provided a very high accuracy, it was unable to account for the biases presented due to the data imbalance.
The calibration curves and AUC scores for all the above models are high, indicating strong performance. The calibration code shows a measure to see if the model is affected by the imbalance. If the calibration is perfect, it means the model is not biased by the imbalanced dataset. In other words, it will not always try to predict the outcome to be the grand majority, which is the completed loans (0) case.
Since Random forest was the best-selected model, we decided to explore the best in class for tree algorithms, also known as the Light GBM classifier developed by Microsoft. It is considered the latest industry trend and winner of multiple Kaggle competitions, so we were curious to see how well it would perform in the Prosper case. As presumed, all the model metrics were very high for the Light GBM classifier.
Feature Importance
The feature importance plots for most of the models were also generated in the classification stage.
Our model determined “Service fees paid”, “Propser fees paid” and “amount borrowed” as the most useful at predicting our target variable: “loan_status” which matches well with investing intuition.
Learning Curve
A learning curve was also plotted to assess the learning rate of the optimal random forest models when compared to a sample L2 logistic regression model.
When the training size is increased by intervals of 25 data points until a very large amount, the random forest model AUC increases, meaning that the model will continue to learn and give a much higher percentage of accurate return predictions relative to the logistic model.
Stage 2: Predicting Expected Return
A variety of models were tested using a different set of features. The R2 values of various regression models were calculated for the different return methods.
The Random forest regressor gave the highest score across all the different types of returns when compared to the rest of the models built.
Investment Strategies
By leveraging both the classification and regression models to predict whether a loan will default and the estimated return, an investment strategy can be formulated to maximize an investor’s average return based on the different methodologies used to calculate the return.
Four different investment strategies were tested:
- Random Strategy
- Default-based
- Simple return-based
- Default and return-based strategy.
In order to depict real-life scenarios, the M3 return approach seems to be a more accurate way to measure the % return. Thus, the baseline for us to compare the different investment strategies would be under the M3 method. Nonetheless, the returns seem to be very close to one and another regardless of which investment strategy is being used when looking at the M3 return approach.
The magnitude of the % return is also highly dependent on the reinvestment interest rate (M3).
Sensitivity Test of Portfolio Size
Next, a sensitivity test was run with regards to the portfolio size versus the investment return percentage
As the number of loans invested in by a potential investor increases, the percentage of investment return decreases. This is also very intuitive due to the risk factor of a loan getting defaulted increases as the number of loans invested in increases.
In other words, an investor may have a similar chance of getting a high return and then losing most of it due to a defaulted loan and this cycle can keep repeating causing the overall percentage of return to decrease as the number of loans invested in increases.
In this section, we implement three different optimization models to improve an investment strategy using Prosper.
The three different optimization methods are:
1) Directly maximize total profit
A binary variable is set-up for every loan in our data set. The number of loans constraint is added (as per the maximum number of loans in our dataset) and the objective function is defined (maximize total profit).
2) Maximize profit with budget constraint
The second optimization model considers the budget constraint of a potential investor. Like the first model, a new budget limiting constraint is added (testing different budgets), and the optimization problem is resolved.
3) Maximize profit with risk-return tradeoff
The third optimization strategy involves incorporating the portfolio risk factor by considering the variance of the returns.
To achieve this, first, a clustering model is trained with an adjustable k parameter by the investor. Second, the standard deviation is computed for each of the clusters. Third, each loan can be assigned to a specific cluster based on the Euclidean distance from each cluster. Fourth, the standard deviation of the return for each loan can be estimated using the standard deviation of the cluster.
The model is then built to maximize profit with the inclusion of the risk-return tradeoff constraint. A sensitivity/penalty factor can be set by the investor to account for the risk tolerance of the investor.
Results show that the optimal investment strategy for Prosper is maximizing profit using budget constraints with an expected return on 10.5%.
In conclusion, we saw how information from different attributes can be utilized to create new predictors that may end up boosting the statistical significance of our model as opposed to the original counterparts. It was important to note how a model’s performance over a single partition of the data into cross-validation folds can be misleading and running our model several times to develop different iterations of the train/test splits was essential. Another important part was calibration, and how it measured whether the probabilities produced by the model were correct.
It would be interesting to see how our model would perform if tied to macroeconomic external data that would represent the underlying economy’s performance at the time such as oil prices or The World Bank’s interest rates in order to further extend the pessimistic measure’s performance.
Additionally, it would be interesting to conduct sentiment analysis over the investor to understand what their risk appetite exactly is as opposed to asking for the numerical input as they may lack the understanding of how the P2P lending market operates.
Our Inspiration — Original Study
I'm an enthusiast with a deep understanding of data-centric strategies and quantitative algorithms, particularly in the context of loan portfolio optimization. Now, let's delve into the concepts discussed in the article:
-
Introduction to Loan Portfolio Optimization:
- Traditional role of banks in providing loans.
- Evolving needs of smaller businesses in the face of 2020 challenges.
- Rise of the P2P Lending Market due to overlooked challenges by banks.
-
Data-Driven Investment Strategy for Prosper Marketplace:
- Creation of a predictive model for Prosper Marketplace.
- Focus on two fundamental steps for investors: deciding on investment amount and picking "good" loans.
- Importance of understanding correlation and interaction among variables.
-
Datasets and Predictors:
- Utilization of publicly available datasets from Prosper.
- Comprehensive information on loans and 22 variables, including loan amount, interest & fees, duration, credit rating, and loan status.
-
Loan Classification Model:
- Categorization of loans into Completed, Charged Off, Current, and Defaulted.
- Identification of a "good" loan based on goals such as default prediction, early repayment, time of default, or time taken to pay back.
-
Analysis of Loan Terms and Interest Rates:
- Insights from analyzing loan terms (60-month vs. 36-month).
- Interest rate distribution for different loan statuses and grade types.
-
Data Leakage Prevention:
- Identification and prevention of data leakage by disregarding predictors associated with it.
- Correlation analysis to identify predictors like interest paid and principal paid prone to leakage.
-
Calculation of Investor’s Returns:
- Importance of calculating total payments received on each loan.
- Scenario testing under three different return strategies: Pessimistic, Optimistic, and Fixed-horizon.
-
Prosper Rating (Grade) Breakdown:
- Exploration of credit rating trends with different return methods.
- Relationship between grade, default percentage, and interest rates.
-
Predictive Modeling Approach:
- Two-stage approach involving binary classification and regression models.
- Use of industry-standard algorithms and hyperparameter tuning for optimization.
-
Investment Strategies:
- Testing four different investment strategies based on classification and regression models.
- Sensitivity test on portfolio size and its impact on investment return percentage.
-
Optimization Methods:
- Three optimization methods: Maximize total profit, Maximize profit with budget constraint, Maximize profit with risk-return tradeoff.
- Identification of the optimal investment strategy for Prosper.
-
Conclusion and Future Considerations:
- Reflection on the utilization of information from different attributes.
- Importance of calibration and multiple iterations in model development.
- Suggestions for future improvements, including tying the model to macroeconomic data and incorporating sentiment analysis.
The article, co-authored by Noah Mukhtar, Shaan Kohli, Shaher yar Jahangir, and Ramy Hammam, provides a comprehensive guide to optimizing a loan portfolio using data-driven strategies and quantitative algorithms. The authors demonstrate expertise in the field, presenting a thorough analysis and actionable insights for investors in the P2P lending market.