# Innovative exploration

###### tags: `modellicity`

# Modellicity PD-Wizard

PD Wizard is a proprietary in-house software application that enables the user to build a binary classification model (most commonly probability of default, PD, models in the credit risk space) and generate the corresponding documentation. This involves the following stages:

1. Data extraction
2. Data treatment
3. Variable selection
4. Model selection, calibration, and training
5. Model backtesting
6. Model documentation: methodology and process reporting, deployment specs

The following sections describe the key areas of scientific study and innovation that we explored while building this application.

## Direct AUROC optimization

In our experience, the most important key performance indicator (KPI) used in the credit risk industry is the area under the receiver operating characteristic curve (AUROC) for probability of default (PD). However, PD models do not directly (or indirectly) optimize for this KPI; instead, the modelling algorithm typically aims to optimize the cross-entropy loss function or the maximum likelihood function. At the start of the project we explored approaches to optimize AUROC more directly. We found a paper written by Google researchers ([1]), rederived its theorems pertaining to AUROC, and replicated the approach. However, we found that the results were very sensitive to the learning-rate hyperparameter; we thus decided to explore an alternative approach.

A less direct approach we explored was a variation of k-fold cross-validation, sketched below. Beginning with a logistic regression model, we split the dataset into training and backtesting portions. We then resplit the training dataset into training and validation datasets k-1 times via the k-fold cross-validation method. Setting a maximum threshold (5%) on the absolute difference between the training and validation AUROC, we picked the model configuration that had the greatest AUROC value. Although the results were promising (i.e., the backtesting dataset had notably better AUROC values than under the traditional approach), we found the operation computationally expensive. We were also concerned that if we included other modelling frameworks (e.g., neural networks, ensemble models) to create an effective champion-challenger approach, we would have to focus intensively on minimizing computational time. We thus decided to apply an Agile approach: focus on the industry standard, complete the application first, and revisit this approach afterwards.

[1] Eban, Elad et al. Scalable Learning of Non-decomposable Objectives. AISTATS 2017.
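A minimal sketch of this selection procedure, assuming `sklearn`-style models and numpy arrays; the helper name `select_by_auroc`, the candidate grid, and the use of standard k folds (rather than k-1 resplits) are illustrative simplifications, not the production code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, train_test_split

def select_by_auroc(X, y, candidate_params, k=5, max_gap=0.05, seed=0):
    """Pick the candidate with the highest validation AUROC whose
    train/validation AUROC absolute gap stays within `max_gap`."""
    best_params, best_auroc = None, -np.inf
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    for params in candidate_params:
        for train_idx, val_idx in kf.split(X):
            model = LogisticRegression(**params).fit(X[train_idx], y[train_idx])
            auc_train = roc_auc_score(y[train_idx],
                                      model.predict_proba(X[train_idx])[:, 1])
            auc_val = roc_auc_score(y[val_idx],
                                    model.predict_proba(X[val_idx])[:, 1])
            if abs(auc_train - auc_val) <= max_gap and auc_val > best_auroc:
                best_params, best_auroc = params, auc_val
    return best_params, best_auroc

# Usage: split off a backtesting set first, then select on the training portion.
# X_train, X_back, y_train, y_back = train_test_split(X, y, test_size=0.3, stratify=y)
# params, auc = select_by_auroc(X_train, y_train, [{"C": 1.0}, {"C": 0.1}])
```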
## General extension of the pandas module

The [`pandas`](https://pandas.pydata.org/) module in Python is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool built on top of the Python programming language. While this module provides many useful components for our work, certain elements are lacking, and we have extended and enhanced them internally.

### Extending the core DataFrame and Series objects

The `pd.DataFrame` and `pd.Series` entities are the core objects in the `pandas` module used to manipulate and process data. While these objects do possess a number of data-manipulation features, they lack several core capabilities that we have had to extend and improve upon.

#### Capabilities for cleaning data (missing values, removing outliers, etc.)

These capabilities are lacking from `pandas`. Model development requires a considerable amount of data cleaning and treatment. Issues such as missing values, outliers, and non-numeric variables (e.g., categorical, datetime, etc.) cannot be fed directly into most modelling packages (e.g., you cannot feed string data into `sklearn`'s models), so identifying these issues is essential. The `pandas` module is fairly limited in its univariate analytical tools (e.g., generating a long-form table showing percent missing, etc.), and thus we created class extensions to generate the tables and analytics we need in the appropriate format; a sketch follows below. Although we have successfully completed the key analytics needed, some of them do not scale well for large datasets (e.g., >500 MB). We have explored improvements (e.g., parallelization, eliminating deep copying in certain places, avoiding repeated calculations via memoization techniques); however, we have not yet reached a satisfactory level of performance for some of these functions on large datasets.
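A minimal sketch of the kind of univariate summary table we generate, written here as a standalone function rather than our internal class extension; the name `univariate_summary` and the 1.5 x IQR outlier rule are illustrative assumptions:

```python
import numpy as np
import pandas as pd

def univariate_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Long-form table with one row per column: dtype, percent missing,
    and percent of numeric values outside 1.5 * IQR (a common outlier rule)."""
    rows = []
    for col in df.columns:
        s = df[col]
        pct_missing = s.isna().mean() * 100
        pct_outlier = np.nan
        if pd.api.types.is_numeric_dtype(s):
            q1, q3 = s.quantile([0.25, 0.75])
            iqr = q3 - q1
            mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
            pct_outlier = mask.mean() * 100
        rows.append({"variable": col, "dtype": str(s.dtype),
                     "pct_missing": pct_missing, "pct_outlier": pct_outlier})
    return pd.DataFrame(rows)
```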
## Novel modelling approaches

### Logit regression modelling for percentage-type target variables

Early in our project, we considered including non-binary target variable types restricted between 0 and 1 (e.g., loss given default and exposure at default models). The motivation was that we wanted the software application to also model loss given default (LGD) and exposure at default (EAD). We explored three candidate models:

1. Beta regression model: we discarded this candidate due to convergence issues when using R and the non-availability of existing packages when using Python on our test datasets.
2. Linear regression on the logit of the target variable: there were two main approaches we explored: a proxy approach (taking the logistic of the logit estimate) and an exact approach (using probability theory and calculus to derive the target estimate from the logit estimate). For the latter approach, we researched the literature (i.e., online journals and blogs, PhD practitioners, and a full-time statistics professor from the University of Waterloo) and found no research on this topic. We thus derived and calculated the values using probability theory, calculus, and available computational packages. We found that the exact approach yielded better results than the proxy on our testing datasets.
3. Direct optimization: structuring the model equivalently to a one-layer neural network with a sigmoid activation and sum-of-squared-error loss, we applied a global optimization search algorithm in R. We found the results to be marginally better than the exact approach in 2. The main concerns with this approach were the computational time and the lack of clarity on how many iterations would be enough to reach a global or near-global optimum. More research is therefore needed to determine the most suitable conditions for applying this type of model.

Although we applied this approach for one of our clients, for our internal project we decided to focus on binary classification before considering other types of target variables, so we tabled the progress and research on finding effective models for datasets with this type of target variable.

### Extension and improvement of sklearn's logistic regression modelling

During the modelling phase of the binary classifier, we applied `sklearn`'s "newton-cg" solver for estimating the parameters of the logistic regression model. As we wanted to significantly reduce the computational time of training the model (while still maintaining the global optimization achieved by the "newton-cg" approach), we dug deep into the inner workings of the library. After rederiving the key linear algebra theorems of the conjugate-gradient method and confirming that the `sklearn` implementation was consistent with (and in general a near-exact approximation of) the exact approach, we explored ways to enhance the speed via `numpy` and `pandas`. Although we were able to simplify the original approach, we found the boost to be only marginal. Moreover, as we wanted to move on and complete the rest of the software application, we decided to rely on `sklearn` for the time being.

We did, however, include an additional layer to allow for manual calibration (i.e., shifting the intercept term so that the training dataset's average predicted event rate is as close as possible to a preset target); a sketch follows below. This is common practice in the credit risk industry, where margins of conservatism are added to ensure that the model meets regulatory requirements. As `sklearn` does not provide this feature, we wrote our own function using a gradient descent approach. Although we noted that this adds computational time, we determined that this was a must-have feature; we will need to revisit the model training module once we have completed the application and addressed handling large datasets before the training step.
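A minimal sketch of the intercept recalibration, assuming a fitted `sklearn` `LogisticRegression`; the function name, the fixed-step gradient descent on the squared gap, and the default hyperparameters are illustrative assumptions, not our production code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def calibrate_intercept(model: LogisticRegression, X, target_rate: float,
                        lr: float = 1.0, n_iter: int = 200) -> float:
    """Shift the intercept so the mean predicted probability on X
    approaches `target_rate`, via gradient descent on the squared gap."""
    z = X @ model.coef_.ravel()  # linear scores without the intercept
    b = float(model.intercept_[0])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(z + b)))
        gap = p.mean() - target_rate
        # d/db of mean(sigmoid(z + b)) is mean(p * (1 - p))
        grad = 2.0 * gap * (p * (1.0 - p)).mean()
        b -= lr * grad
    model.intercept_[0] = b
    return b
```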
## Weight-of-evidence and information value from scratch

One of the most important activities in probability of default (PD) modelling is weight-of-evidence (WOE) binning. It is used as a data treatment tool (e.g., missing value treatment, outlier treatment for numeric variables), a feature engineering tool (e.g., creating income buckets whose trend is more in line with default behaviour), and a univariate analysis tool (i.e., via the information value (IV) key performance indicator). As we were unable to find any Python packages that directly create the WOE bins, we had to create our own class and methods; a sketch of the core WOE/IV computation follows at the end of this section.

### Weight-of-evidence numeric

For numeric variables, it is very important that WOE trends are monotonic and flow in an intuitive direction (e.g., income should have a decreasing WOE value if default is the target). Starting from a publicly available function on GitHub, we used the Spearman rank coefficient to determine whether the trend was monotonic. However, we found calculating this value computationally expensive, so we revamped the function using `numpy` calls. Additionally, we wanted left-closed, right-open intervals, which are not available in `pandas`' `qcut` function. We therefore dug deep into the source code of the `pandas` library and extracted the essential elements to apply the desired interval binning. However, as we wanted to ensure that the bin edges (which were essentially quantiles) were unique, we needed to recalculate these edges repeatedly, which was computationally expensive. We searched for theoretically and computationally optimal solutions and came up dry. We did derive a candidate solution (via a percentile-vs-quantile matching approach), but realized that testing it extensively would be time-consuming and it was unclear whether the benefits would be significant (it remains an O(n) problem), so we decided to accept the current approach for now and potentially explore parallelization as an alternative solution later on.

Another item we needed to modify was the rounding scheme, more specifically decimal ceiling and floor rounding for certain bin edges (which does not exist in the standard libraries). We originally used the `numpy` rounding scheme, but found that it failed some of our in-house unit tests. Through research and experimentation, we found that the `numpy` rounding scheme was not always reliable for us, so we had to resort to a more fundamental approach. Although this approach is a little more time-consuming and may be problematic for very big numbers, we decided that it was sufficient for the majority of cases and could be re-explored at a later time.

Although we settled on a quantile approach, we did explore early on an optimal binning strategy that maximizes the information value (IV) KPI. However, we realized that this endeavour would be very computationally demanding (it is basically a grid search), and adding a monotonicity constraint would require additional effort. Since monotonicity was the higher priority, we decided to implement the unique-quantile approach, move on to completing the application, and revisit this method at a later stage.

A final area we explored was incorporating business judgment to automatically filter out variables that do not have an intuitive direction with respect to the target variable (e.g., income buckets are expected to have a decreasing default rate, so if the direction is opposite, we would either eliminate income as a key driver or seek rebinning). Our preliminary approach was to build a data dictionary that included the expected direction of all drivers. We currently apply this approach manually for our clients but found it challenging to incorporate into the current PD Wizard framework because it requires a lot of upfront work and prior knowledge of the elements of the dataset. As we are still streamlining the model development process and optimizing the calculations using multiple public and private datasets, we decided to incorporate this feature at a much more mature state of the software application. It will likely be optional, since direction may not always be of interest for certain key drivers (e.g., sentiment-type variables that could go either way). A second reason for making this feature optional is that direction may not always be monotonic in practice (e.g., the lifetime of a loan tends to have an inverted-U relationship with default), and more sophisticated models (e.g., neural networks and ensemble models) may capture this complexity more accurately.

### Weight-of-evidence categorical

For non-numeric variables, if the user restricts the number of bins to, say, k, PD Wizard simply finds the k-1 most common values and creates a catch-all bin for all the other values (note: missing values are given a separate bin if they exist). An area we explored was finding an optimal combination of k categories that would yield the highest information value (IV). As this is a combinatorics problem, with the added difficulty of explainability when combining certain bins (e.g., combining ratings like "Good" and "Bad" in a rating system of "Unrated", "Very Bad", "Bad", "Good", "Excellent"), we decided to postpone this feature, as it would require prior information (e.g., a list of combinations that would not make sense) to avoid nonsensical groupings.
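A minimal sketch of the core WOE/IV computation for an already-binned variable, together with the monotonicity check used for numeric bins; the function names and the additive smoothing constant for empty bins are illustrative assumptions:

```python
import numpy as np
import pandas as pd

def woe_iv(bins: pd.Series, target: pd.Series, eps: float = 0.5):
    """WOE per bin and total IV for a binary target (1 = event/default).
    `eps` is a small additive smoothing constant guarding against empty bins."""
    tab = pd.crosstab(bins, target).reindex(columns=[0, 1], fill_value=0)
    events = tab[1] + eps
    non_events = tab[0] + eps
    dist_event = events / events.sum()
    dist_non_event = non_events / non_events.sum()
    woe = np.log(dist_non_event / dist_event)
    iv = ((dist_non_event - dist_event) * woe).sum()
    return woe, iv

def is_monotonic(woe: pd.Series) -> bool:
    """For numeric bins (ordered by edge), WOE should move in one direction."""
    diffs = np.diff(woe.to_numpy())
    return bool(np.all(diffs >= 0) or np.all(diffs <= 0))
```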
# |toqito>

toqito is an open-source library for studying various objects in quantum information, namely states, channels, and measurements. toqito focuses on providing numerical tools to study problems pertaining to entanglement theory, nonlocal games, and other aspects of quantum information that are often associated with computer science.

## Tools for entanglement detection of a quantum state

Determining whether a quantum state is entangled is useful for a number of applications in quantum information, specifically for cryptographic tasks, quantum state distinguishability, quantum communication, etc. In general, determining whether a quantum state is entangled is an NP-hard problem. For certain cases, there are less computationally intensive tests that may be applied if the state possesses specific properties. As all quantum states can be represented by density matrices (that is, matrices that are positive semidefinite with trace equal to one), one can provide a description of the quantum state as a density matrix and ask whether the state is entangled or separable.

There are a number of such tests in the literature for special cases. For instance, for bipartite quantum systems of combined dimension at most 6 (i.e., 2 x 2 and 2 x 3 systems), having a positive partial transpose is necessary and sufficient for separability, so this check decides entanglement [arXiv:9604005](https://arxiv.org/abs/quant-ph/9604005), [arXiv:9605038](https://arxiv.org/abs/quant-ph/9605038). The rank of the matrix can also be used in certain cases to determine whether the state is entangled [arXiv:0002089](https://arxiv.org/abs/quant-ph/0002089). There are also other criteria for checking whether a quantum state is entangled, namely the realignment criterion [arXiv:0205017](https://arxiv.org/abs/quant-ph/0205017), the cross-norm criterion [arXiv:0709.3766](https://arxiv.org/abs/0709.3766), the separability-from-spectrum criterion [arXiv:1309.2006](https://arxiv.org/abs/1309.2006), and many others [arXiv:0204159](https://arxiv.org/abs/quant-ph/0204159), [arXiv:9806094](https://arxiv.org/abs/quant-ph/9806094), etc. In the event that none of the less computationally intensive tests decides whether the given state is entangled, it is possible to run an algorithm that computes whether the state possesses a symmetric extension [arXiv:0707.4398](https://arxiv.org/abs/0707.4398). This check is NP-hard in general, but in practice many low-dimensional quantum states of interest can be decided in a reasonable amount of time.

The goal of this research area is to combine the known results in the literature and run a battery of entanglement detection tests on a quantum state to determine whether it is entangled. The end result will be a computational tool that researchers can use to specify a quantum state and obtain an answer as to whether the state is entangled or not.
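As a small illustration of the kind of check involved, the sketch below applies the PPT criterion to a Bell state using toqito-style helpers; the module paths `toqito.states.bell` and `toqito.state_props.is_ppt` reflect the library as we understand it and should be checked against the current API:

```python
import numpy as np
from toqito.states import bell          # maximally entangled Bell states
from toqito.state_props import is_ppt   # positive-partial-transpose check

# Density matrix of the Bell state |phi+> = (|00> + |11>) / sqrt(2).
phi = bell(0)
rho = phi @ phi.conj().T

# For a two-qubit system, PPT holds if and only if the state is separable,
# so a failed PPT check certifies entanglement.
print(is_ppt(rho))  # False -> the Bell state is entangled
```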
## Entanglement cost of distinguishing two-qubit ensembles

The problem of distinguishing a quantum state from an ensemble of quantum states is a core problem in the field of quantum information. This task has implications for quantum cryptographic and quantum communication tasks, and it also has connections to the foundations of quantum mechanics. The setting of quantum state distinguishability involves two spatially separated parties (typically referred to as "Alice" and "Bob") who are unable to communicate with each other. They are both provided with a randomly selected quantum state from a third party (typically referred to as the "referee"), and Alice and Bob are tasked with determining which state they were provided. They want to maximize their probability of correctly identifying this state.

It is possible for Alice and Bob to make use of quantum resources, in the form of a shared quantum state, to aid them in their task. They can perform measurements on their respective quantum subsystems, which they may then use to maximize their probability of selecting the correct state. The class of measurements that Alice and Bob use on their subsystems determines the maximal probability with which they can succeed in their task; this quantity is referred to as the entanglement cost. Using |toqito>, one may phrase these scenarios as computations and obtain the maximal probability of success.

A closed-form formula for the entanglement cost of the four Bell states is known and is provided in [arXiv:1408.6981](https://arxiv.org/abs/1408.6981). We are presently using |toqito> to extend this result to general two-qubit ensembles, which encapsulate a wider class of quantum states. Numerically, the tools that |toqito> offers have suggested a closed-form entanglement cost for these more general sets of states. The intent of this task is to extend the scope of the numerical tools and to publish a result in the literature that proves this closed form, using the |toqito> software to verify certain numerical properties of the problem.

## Upper bounds on the quantum value of a nonlocal game

Determining which tasks can be performed more efficiently when given access to quantum resources is an emerging field of research. Nonlocal games serve as an abstract mechanism for conceptualizing such tasks. In general, it is NP-hard to determine how well one can succeed in a given task when using quantum resources. There are heuristic techniques in the literature that enable one to compute these quantities. These computations are guaranteed to eventually converge to the desired value; however, for certain problems, solving them remains intractable and is itself a wide field of research.

A technique known as the NPA hierarchy (named after the initials of its authors) [arXiv:0803.4290](https://arxiv.org/abs/0803.4290) is one such heuristic that allows one to compute upper bounds on the success of quantum tasks. This quantity is obtained via a hierarchy of semidefinite programs that eventually converge to the true quantum value of a given nonlocal game. Implementing this technique in |toqito> is an ongoing task that involves a large implementation undertaking. Having this computational tool will be incredibly useful for studying the field of nonlocal games, quantum computational complexity, and the study of Bell inequalities and entanglement.
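To make the notion of a game's quantum value concrete, the sketch below computes the classical and quantum values of the CHSH game (the canonical XOR nonlocal game, whose quantum value is known exactly) using toqito's `XORGame` class; the class location and method names reflect the library as we understand it and should be checked against the current API:

```python
import numpy as np
from toqito.nonlocal_games.xor_game import XORGame

# CHSH game: questions x, y drawn uniformly; Alice and Bob win when
# the XOR of their answers equals x AND y.
prob_mat = np.array([[1 / 4, 1 / 4],
                     [1 / 4, 1 / 4]])   # referee's question distribution
pred_mat = np.array([[0, 0],
                     [0, 1]])           # required XOR of answers per (x, y)

chsh = XORGame(prob_mat, pred_mat)
print(chsh.classical_value())  # 0.75
print(chsh.quantum_value())    # cos^2(pi/8) ~ 0.8536 (Tsirelson's bound)
```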
