# Innovative exploration
###### tags: `modellicity`
# Modellicity PD-Wizard
PD Wizard is a proprietary, in-house software application that enables the user to build a binary classification model (most commonly a probability of default, PD, model in the credit risk space) and generate the corresponding documentation. This involves the following stages:
(1) Data extraction
(2) Data treatment
(3) Variable selection
(4) Model selection, calibration, and training
(5) Model backtesting
(6) Model documentation: methodology and process reporting, deployment specs
The following sections describe the key areas of scientific study and innovation that we explored while building this application.
## Direct AUROC optimization
In our experience, the most important key performance indicator (KPI) used in the credit risk industry for probability of default (PD) models is the area under the receiver operating characteristic curve (AUROC). However, PD models do not directly (or indirectly) optimize for this KPI; rather, the objective that the modelling algorithm typically optimizes is the cross-entropy loss (equivalently, the likelihood function).
At the start of the project we explored approaches to optimize AUROC more directly. We found a paper from Google ([1]), rederived the theorems pertaining to AUROC, and replicated the approach. However, we found that the results were very sensitive to the learning-rate hyperparameter; we thus decided to explore an alternative approach.
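To make the idea of a rank-based objective concrete, below is a minimal sketch of a smooth pairwise surrogate for AUROC. This is a simpler stand-in than the constrained-optimization formulation of [1]; the function name and synthetic data are illustrative assumptions.

```python
# Minimal sketch: a smooth pairwise surrogate for AUROC.  This is NOT the exact
# formulation of [1]; it only illustrates optimizing a rank-based objective.
import numpy as np
from sklearn.metrics import roc_auc_score

def pairwise_auroc_surrogate(scores, labels):
    """Mean sigmoid of (positive score - negative score) over all pos/neg pairs."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diffs = pos[:, None] - neg[None, :]              # all positive/negative score gaps
    return (1.0 / (1.0 + np.exp(-diffs))).mean()     # smooth approximation of AUROC

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)
scores = labels + rng.normal(scale=1.0, size=500)    # informative but noisy scores
print(pairwise_auroc_surrogate(scores, labels))      # smooth surrogate value...
print(roc_auc_score(labels, scores))                 # ...versus the exact AUROC
```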
A less direct approach we explored was a variation of k-fold cross-validation. Beginning with a logistic regression model, we split the dataset into a training set and a backtesting set. We then re-split the training dataset into training and validation subsets k-1 times via k-fold cross-validation. Setting a maximum threshold of 5% on the absolute difference between training and validation AUROC, we picked the model configuration with the greatest AUROC value. Although the results were promising (i.e., the backtesting dataset had notably better AUROC values than under the traditional approach), we found the operation computationally expensive. We were also concerned that if we included other modelling frameworks (e.g., neural networks, ensemble models) to create an effective champion-challenger approach, we would need to focus intensively on minimizing computational time. We therefore took an Agile approach: focus on the industry standard and complete the application first before revisiting this idea.
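The selection loop itself could look roughly like the sketch below. The synthetic dataset is illustrative, the 5% gap threshold comes from the description above, and ranking surviving candidates by validation AUROC is our assumption.

```python
# Sketch of the AUROC-gap selection loop; dataset and ranking rule are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, train_test_split

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
X_tr, X_back, y_tr, y_back = train_test_split(X, y, test_size=0.3, random_state=0)

best_model, best_auc = None, -np.inf
for fold_tr, fold_val in KFold(n_splits=5, shuffle=True, random_state=0).split(X_tr):
    model = LogisticRegression(solver="newton-cg").fit(X_tr[fold_tr], y_tr[fold_tr])
    auc_tr = roc_auc_score(y_tr[fold_tr], model.predict_proba(X_tr[fold_tr])[:, 1])
    auc_val = roc_auc_score(y_tr[fold_val], model.predict_proba(X_tr[fold_val])[:, 1])
    # Keep only configurations whose train/validation AUROC gap is at most 5%.
    if abs(auc_tr - auc_val) <= 0.05 and auc_val > best_auc:
        best_model, best_auc = model, auc_val

if best_model is not None:   # evaluate the selected configuration on the held-out backtest set
    print(roc_auc_score(y_back, best_model.predict_proba(X_back)[:, 1]))
```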
[1] Eban, Elad, et al. "Scalable Learning of Non-Decomposable Objectives." AISTATS 2017.
## General extension of the pandas module
The [`pandas`](https://pandas.pydata.org/) module is a fast, powerful, flexible, and easy-to-use open source data analysis and manipulation tool built on top of the Python programming language. While this module provides many useful components for our work, certain elements are lacking, which we have extended and enhanced internally.
### Extending core DataFrame and Series objects
`pd.DataFrame` and `pd.Series` are the core objects in the `pandas` module used to manipulate and process data. While these objects possess a number of features for manipulating data, they lack several capabilities that we have had to extend and improve upon.
#### Capabilities for cleaning data (missing values, removing outliers, etc.)
These capabilities are largely lacking from `pandas`. Model development requires a considerable amount of data cleaning and treatment. Issues such as missing values, outliers, and non-numeric variables (e.g., categorical, datetime) cannot be fed directly into most modelling packages (e.g., you cannot pass string data into `sklearn`'s models).
Identifying these key issues is therefore essential. The `pandas` module is fairly limited in its univariate analytical tools (e.g., generating a long-form table showing percent missing per variable), so we created class extensions to generate the tables and analytics we need in the appropriate format.
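As an illustration of this kind of extension, the sketch below registers a custom DataFrame accessor (the name `profile`, the method name, and the IQR outlier rule are illustrative assumptions, not our actual class design) that produces a long-form univariate summary.

```python
# Hedged sketch: a custom DataFrame accessor producing a long-form univariate
# summary (percent missing, dtype, IQR-based outlier count).  Names are illustrative.
import numpy as np
import pandas as pd

@pd.api.extensions.register_dataframe_accessor("profile")
class ProfileAccessor:
    def __init__(self, df: pd.DataFrame):
        self._df = df

    def univariate_summary(self) -> pd.DataFrame:
        rows = []
        for col in self._df.columns:
            s = self._df[col]
            row = {"variable": col, "dtype": str(s.dtype), "pct_missing": s.isna().mean()}
            if pd.api.types.is_numeric_dtype(s):
                q1, q3 = s.quantile([0.25, 0.75])
                iqr = q3 - q1
                row["n_outliers"] = int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())
            rows.append(row)
        return pd.DataFrame(rows)

df = pd.DataFrame({"income": [50_000, 60_000, np.nan, 1_000_000],
                   "region": ["ON", "QC", "ON", None]})
print(df.profile.univariate_summary())
```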
Although we have successfully implemented the key analytics we need, some of them do not scale well to large datasets (e.g., >500 MB). We have explored improvements (e.g., parallelization, eliminating deep copying in certain places, avoiding repeated calculations via memoization); however, we have not yet reached a satisfactory level of performance for some of these functions on large datasets.
## Novel modelling approaches
### Logit regression modelling for percentage-type target variables
Early in the project, we considered including non-binary target variables restricted between 0 and 1, motivated by our goal of having the software application also model loss given default (LGD) and exposure at default (EAD). We explored three candidate models:
1. Beta regression model: we discarded this candidate due to convergence issues in R and the lack of suitable Python packages for our test datasets.
2. Linear regression on the logit of the target variable: we explored two main approaches, a proxy approach (applying the logistic function to the fitted logit estimate) and an exact approach (using probability theory and calculus to derive the target estimate from the logit estimate). For the latter, we searched the literature (online journals and blogs, PhD practitioners, and a full-time statistics professor from the University of Waterloo) and found no research on this topic. We therefore derived and calculated the values using probability theory, calculus, and available computational packages, and found that the exact approach yielded better results than the proxy on our test datasets (a sketch of the proxy variant follows this list).
3. Direct optimization: structuring the model equivalently to a one-layer neural network with a sigmoid activation and sum-of-squared-errors loss, we applied a global optimization search algorithm in R. The results were marginally better than the exact approach in 2. The main concerns with this approach are computational time and the lack of clarity on how many iterations are enough to reach a global or near-global optimum; more research is needed to determine the most suitable conditions for applying this type of model.
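As referenced in item 2, here is a minimal sketch of the proxy variant: regress the logit of the target, then map predictions back through the logistic function (the exact approach additionally corrects for the fact that the mean of a logistic transform is not the logistic transform of the mean). The synthetic data and clipping constant are illustrative assumptions.

```python
# Hedged sketch of the proxy variant of approach 2 on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_logit = X @ np.array([0.8, -0.5, 0.3]) + rng.normal(scale=0.3, size=1000)
y = 1.0 / (1.0 + np.exp(-true_logit))             # percentage-type target in (0, 1)

eps = 1e-6                                        # keep targets strictly inside (0, 1)
y_clipped = np.clip(y, eps, 1 - eps)
logit_y = np.log(y_clipped / (1 - y_clipped))

model = LinearRegression().fit(X, logit_y)
y_hat = 1.0 / (1.0 + np.exp(-model.predict(X)))   # proxy estimate of E[y | X]
print(np.mean((y - y_hat) ** 2))                  # in-sample mean squared error
```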
Although we applied this approach for one of our clients, for our internal project we decided to focus on binary classification before considering other types of target variables, and so we tabled the research on finding effective models for datasets with this type of target variable.
### Extension and improvement of sklearn's logistic regression modelling
During the modelling phase of the binary classifier, we applied `sklearn`'s "newton-cg" solver to estimate the parameters of the logistic regression model. As we wanted to significantly reduce the computational time of training the model (while maintaining the global optimization achieved by the "newton-cg" approach), we decided to dig deep into the inner workings of the library.
After rederiving the key linear algebra theorems behind the conjugate-gradient method and confirming that the `sklearn` implementation was consistent with (and in general a near-exact approximation of) the exact approach, we explored ways to enhance its speed via `numpy` and `pandas`. Although we were able to simplify parts of the original approach, the speed-up was only marginal. Moreover, as we wanted to move on and complete the rest of the software application, we decided to rely on `sklearn` for the time being.
We did, however, include an additional layer to allow for manual calibration (i.e., changing the intercept term so that the training dataset's average predicted event rate is as close as possible to a pre-set target). This is common practice in the credit risk industry, where margins of conservatism are added to ensure that the model meets regulatory requirements. As `sklearn` does not provide this feature, we wrote our own function using a gradient descent approach. Although this adds computational time, we determined it was a must-have feature; we will need to revisit the model training module once we have completed the application and addressed handling large datasets before the training step.
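A minimal sketch of this kind of intercept recalibration is shown below: shift the intercept so that the average predicted probability matches a pre-set target rate, via gradient descent on the squared gap. The function name, learning rate, and tolerance are illustrative assumptions, not our exact in-house implementation.

```python
# Hedged sketch: gradient descent on the intercept shift so that the mean
# predicted probability matches a target event rate.
import numpy as np

def calibrate_intercept(linear_pred, target_rate, lr=5.0, tol=1e-10, max_iter=10_000):
    """Return a shift c such that mean(sigmoid(linear_pred + c)) ~= target_rate."""
    c = 0.0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(linear_pred + c)))
        gap = p.mean() - target_rate
        if gap ** 2 < tol:
            break
        c -= lr * 2.0 * gap * (p * (1.0 - p)).mean()   # gradient of the squared gap
    return c

# Usage (illustrative): linear_pred = model.decision_function(X_train);
# shift the fitted intercept by calibrate_intercept(linear_pred, target_rate=0.05).
```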
## Weight-of-evidence and information value from scratch
One of the most important activities in probability of default (PD) modelling is weight-of-evidence (WOE) binning. This is used as a data treatment tool (e.g., missing value treatment, outlier treatment for numeric variables), a feature engineering tool (e.g., creating income buckets whose trend is more in line with default behaviour), and a tool for univariate analysis (i.e., via the information value (IV) key performance indicator). As we were unable to find any Python packages that directly create the WOE bins, we had to create our own class and methods.
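For reference, the per-bin WOE and overall IV we compute follow the standard definitions; a minimal sketch on an already-binned variable is shown below. The column names and the sign convention (WOE = ln(% non-events / % events)) are illustrative assumptions.

```python
# Minimal sketch of the standard WOE / IV calculation for a binned variable.
# Assumes every bin contains both events and non-events.
import numpy as np
import pandas as pd

def woe_iv(bins, target):
    """bins: bin label per observation; target: 1 = default (event), 0 = non-event."""
    tab = pd.crosstab(bins, target)
    dist_good = tab[0] / tab[0].sum()      # share of non-events in each bin
    dist_bad = tab[1] / tab[1].sum()       # share of events in each bin
    woe = np.log(dist_good / dist_bad)
    iv = ((dist_good - dist_bad) * woe).sum()
    return pd.DataFrame({"woe": woe, "iv_contribution": (dist_good - dist_bad) * woe}), iv

# Usage (illustrative column names):
# woe_table, iv = woe_iv(df["income_bin"], df["default_flag"])
```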
### Weight-of-evidence numeric
For numeric variables, it is very important that WOE trends are monotonic and flow in an intuitive direction (e.g., income should have a decreasing WOE value if default is the target). Starting from a publicly available function on GitHub, we used the Spearman rank correlation coefficient to determine whether the trend was monotonic. However, computing this value was computationally expensive, so we revamped the function using `numpy` calls.
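The monotonicity search itself typically reduces the bin count until the Spearman correlation between bin-average value and bin-average event rate is exactly ±1; a sketch follows (the starting bin count and column names are illustrative assumptions, and this is the straightforward version rather than our revamped `numpy` one).

```python
# Sketch: shrink the number of quantile bins until the Spearman rank correlation
# between bin-average value and bin-average event rate is perfectly monotonic.
import pandas as pd
from scipy.stats import spearmanr

def monotonic_quantile_bins(x: pd.Series, y: pd.Series, max_bins: int = 20):
    for n_bins in range(max_bins, 1, -1):
        binned = pd.qcut(x, q=n_bins, duplicates="drop")
        grouped = (pd.DataFrame({"x": x, "y": y, "bin": binned})
                   .groupby("bin", observed=True).mean())
        rho, _ = spearmanr(grouped["x"], grouped["y"])
        if abs(rho) == 1.0:                    # trend is perfectly monotonic
            return binned, n_bins
    return pd.qcut(x, q=2, duplicates="drop"), 2
```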
Additionally, we wanted left-closed, right-open intervals, which are not available in `pandas`' `qcut` function. We therefore dug into the source code of the `pandas` library and extracted the essential elements to apply the desired interval binning. However, because we wanted to ensure that the bin edges (which are essentially quantiles) were unique, we needed to recalculate these edges repeatedly, which was computationally expensive. We searched for theoretically and computationally optimal solutions and came up empty. We did derive a candidate solution (via a percentile-versus-quantile matching approach), but realized that testing it extensively would be time consuming and that the benefits were unclear (it is still an O(n) problem), so we accepted the current approach for now and may explore parallelization as an alternative later on.
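A simplified sketch of left-closed, right-open quantile binning with unique edges is shown below; using a single `np.unique` pass is an illustrative simplification of the repeated edge recalculation described above.

```python
# Simplified sketch of left-closed, right-open quantile binning with unique
# edges (pandas' qcut only produces right-closed intervals).
import numpy as np
import pandas as pd

def qcut_left_closed(x, n_bins):
    edges = np.unique(np.quantile(x, np.linspace(0, 1, n_bins + 1)))
    edges[-1] = np.nextafter(edges[-1], np.inf)   # keep the maximum inside the last bin
    return pd.cut(x, bins=edges, right=False)     # intervals of the form [a, b)

binned = qcut_left_closed(np.arange(100, dtype=float), n_bins=4)
print(pd.Series(binned).value_counts(sort=False))
```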
Another item we needed to modify was the rounding scheme, more specifically decimal ceiling and flooring for certain bin edges (which is not directly available in the standard numeric libraries). We originally used the `numpy` rounding scheme, but found that it failed some of our in-house unit tests. Through research and experimentation, we found that the `numpy` rounding scheme was not always reliable for our purposes, so we resorted to a more fundamental approach. Although this approach is somewhat more time consuming and may be problematic for very large numbers, we decided it was sufficient for the majority of cases and could be re-explored at a later time.
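One way such a "more fundamental" approach can be realized is via exact decimal arithmetic in Python's `decimal` module; the sketch below is illustrative of that idea rather than our exact implementation.

```python
# Hedged sketch: decimal-place ceiling / flooring via exact decimal arithmetic
# instead of binary floating point rounding.
from decimal import ROUND_CEILING, ROUND_FLOOR, Decimal

def ceil_to_decimals(x: float, places: int) -> float:
    quantum = Decimal(1).scaleb(-places)          # e.g. places=2 -> Decimal('0.01')
    return float(Decimal(str(x)).quantize(quantum, rounding=ROUND_CEILING))

def floor_to_decimals(x: float, places: int) -> float:
    quantum = Decimal(1).scaleb(-places)
    return float(Decimal(str(x)).quantize(quantum, rounding=ROUND_FLOOR))

print(ceil_to_decimals(0.12301, 2), floor_to_decimals(0.12999, 2))   # 0.13 0.12
```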
Although we settled on a quantile approach, we did explore early on an optimal binning strategy that maximizes the information value (IV) KPI. However, we realized that this endeavour would be very computationally demanding (it is essentially a grid search), and adding a monotonicity constraint would require additional effort. Since monotonicity was the higher priority, we decided to implement the unique-quantile approach, move on to completing the application, and revisit this method at a later stage.
A final area we explored was incorporating business judgment to automatically filter out variables that do not have an intuitive direction with respect to the target variable (e.g., income buckets are expected to have a decreasing default rate, so if the direction is opposite, we would either eliminate income as a key driver or seek rebinning). Our preliminary approach was to build a data dictionary that includes the expected direction of all drivers. We currently apply this approach manually for our clients, but found it challenging to incorporate into the current PD Wizard framework because it requires a lot of upfront work and prior knowledge of the dataset's contents. As we are still streamlining the model development process and optimizing the calculations using multiple public and private datasets, we decided to incorporate this feature at a much more mature stage of the software application. It will likely be optional, since direction may not always be of interest for certain key drivers (e.g., sentiment-type variables, which could go either way). A second reason for making this feature optional is that the relationship may not always be monotonic in practice (e.g., the lifetime of a loan tends to have an inverted-U relationship with default), and more sophisticated models (e.g., neural networks and ensemble models) may capture this complexity more accurately.
### Weight-of-evidence categorical
For non-numeric variables, if the user restricts the number of bins to be used, say k, PD Wizard simply finds the k-1 most common values and creates a catch-all bin for all the other values (missing values are given a separate bin if they exist). An area we explored was finding the optimal combination of k categories that would yield the highest information value (IV). As this is a combinatorics problem, with the added difficulty of explainability when combining certain bins (e.g., combining ratings like "Good" and "Bad" in a rating system of "Unrated", "Very Bad", "Bad", "Good", "Excellent"), we decided to postpone this feature, as it would require prior information (e.g., a list of combinations that would not make sense) to avoid nonsensical groupings.
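A minimal sketch of the top-(k-1) plus catch-all scheme described above follows; the bin labels "OTHER" and "MISSING" are illustrative assumptions.

```python
# Sketch: keep the k-1 most frequent categories, send the rest to a catch-all
# bin, and keep missing values in their own bin.
import pandas as pd

def bin_categorical(s, k):
    top = s.value_counts(dropna=True).index[: k - 1]
    binned = s.where(s.isin(top) | s.isna(), other="OTHER")   # preserve NaN for its own bin
    return binned.fillna("MISSING")

ratings = pd.Series(["Good", "Bad", "Good", None, "Excellent", "Unrated", "Good"])
print(bin_categorical(ratings, k=3).value_counts())
```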
# |toqito>
toqito is an open source library for studying various objects in quantum information, namely, states, channels, and measurements. toqito focuses on providing numerical tools to study problems pertaining to entanglement theory, nonlocal games, and other aspects of quantum information that are often associated with computer science.
## Tools for entanglement detection of a quantum state
Determining whether a quantum state is entangled or not is useful for a number of applications in quantum information, specifically for cryptographic tasks, quantum state distinguishability, quantum communication, etc.
In general, determining whether a quantum state is entangled is an NP-hard problem. For certain cases, there are less computationally intensive tests that may be applied if the state possesses specific properties.
As all quantum states can be represented by the class of density matrices (that is, matrices that are positive semidefinite with trace equal to one), one can provide a description of the quantum state as a density matrix to determine whether the state is entangled or separable.
There are a number of such tests in the literature for special cases. For instance, for bipartite quantum systems of combined dimension at most six (i.e., 2⊗2 and 2⊗3 systems), checking whether the quantum state has positive partial transpose is necessary and sufficient to determine whether it is entangled [arXiv:9604005](https://arxiv.org/abs/quant-ph/9604005), [arXiv:9605038](https://arxiv.org/abs/quant-ph/9605038). The rank of the matrix can also be used in certain cases to determine whether the state is entangled [arXiv:0002089](https://arxiv.org/abs/quant-ph/0002089).
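As a concrete illustration of the positive-partial-transpose (PPT) criterion, the sketch below is written in plain `numpy` (deliberately not relying on |toqito>'s own functions): it partially transposes a two-qubit density matrix and checks for a negative eigenvalue, which certifies entanglement.

```python
# Hedged sketch (plain numpy, not the toqito API): the PPT criterion for a
# two-qubit state.  A negative eigenvalue of the partial transpose certifies
# entanglement; for 2x2 and 2x3 systems, positivity certifies separability.
import numpy as np

def partial_transpose(rho, dims=(2, 2)):
    d_a, d_b = dims
    # Reshape into a 4-index tensor and transpose the second subsystem's indices.
    return rho.reshape(d_a, d_b, d_a, d_b).transpose(0, 3, 2, 1).reshape(d_a * d_b, d_a * d_b)

# Maximally entangled Bell state |phi+> = (|00> + |11>) / sqrt(2).
phi_plus = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
rho = np.outer(phi_plus, phi_plus)

eigenvalues = np.linalg.eigvalsh(partial_transpose(rho))
print(eigenvalues.min() < -1e-12)   # True: the Bell state is entangled
```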
There are also other criteria by which to check whether a quantum state is entangled, namely the realignment criterion [arXiv:0205017](https://arxiv.org/abs/quant-ph/0205017), the cross-norm criterion [arXiv:0709.3766](https://arxiv.org/abs/0709.3766), the separability-from-spectrum criterion [arXiv:1309.2006](https://arxiv.org/abs/1309.2006), and many others [arXiv:0204159](https://arxiv.org/abs/quant-ph/0204159), [arXiv:9806094](https://arxiv.org/abs/quant-ph/9806094), etc.
In the event that none of the less computationally intensive tests determines whether the given state is entangled, it is possible to run an algorithm that checks whether the state possesses a symmetric extension [arXiv:0707.4398](https://arxiv.org/abs/0707.4398). This check is NP-hard in general, but in practice the entanglement of many low-dimensional quantum states of interest can be determined in a reasonable amount of time.
The goal of this research area is to combine the known results in the literature and run a battery of entanglement detection tests on the quantum state to determine whether it is entangled or not. The end result will be a computational tool that researchers can use to specify a quantum state and obtain an answer as to whether the state is entangled or not.
## Entanglement cost of distinguishing two-qubit ensembles
The problem of distinguishing a quantum state from an ensemble of quantum states is a core problem in the field of quantum information. This task has implications toward quantum cryptographic and quantum communication tasks. It also has connections to the foundations of quantum mechanics.
The setting of quantum state distinguishability involves two spatially separated parties (typically referred to as "Alice" and "Bob") who are unable to communicate with each other. They are both provided with a randomly selected quantum state from a third party (typically referred to as the "referee"), and Alice and Bob are tasked with determining which state they were provided. They want to maximize their probability of correctly identifying this state.
It is possible for Alice and Bob to make use of quantum resources in the form of a shared quantum state to aid them in their task. They can perform measurements on their respective quantum subsystems, which they may then use to maximize their probability of selecting the correct state. The class of measurements that Alice and Bob use on their subsystems determines the maximal probability with which they can succeed in their task. This quantity is referred to as the entanglement cost.
Using |toqito> one may phrase these scenarios as computations and obtain the maximal probability of success. A closed-form formula for the entanglement cost of four Bell states is known and is provided in [arXiv:1408.6981](https://arxiv.org/abs/1408.6981).
We are presently using |toqito> to extend this result to general two-qubit ensembles, which encapsulate a wider class of quantum states. Numerically, the tools that |toqito> offers have suggested a closed-form entanglement cost for these more general sets of states.
The intent of this task is to extend the scope of the numerical tools and to publish a result in the literature that proves this conjecture, using the |toqito> software to verify certain numerical properties of the problem.
## Upper bounds on the quantum value of a nonlocal game
Determining what tasks can be performed more efficiently when given access to quantum resources is an emerging field of research. Nonlocal games serve as an abstract mechanism to conceptualize such tasks.
In general, it is NP-hard to determine how well one can succeed at a given task when using quantum resources. There are heuristic techniques in the literature that enable one to compute these quantities; these computations are guaranteed to eventually converge to the desired value, but for certain problems they remain intractable, and improving them is an active area of research.
A technique known as the NPA hierarchy (named after the initials of its authors, Navascués, Pironio, and Acín) [arXiv:0803.4290](https://arxiv.org/abs/0803.4290) is one such technique that allows one to compute upper bounds on the success of quantum tasks. These bounds are obtained via a hierarchy of semidefinite programs that eventually converge to the true quantum value of a given nonlocal game.
Implementing this technique in |toqito> is an ongoing task and a significant undertaking. Having this computational tool will be incredibly useful for studying nonlocal games, quantum computational complexity, Bell inequalities, and entanglement.