# TFM 2021-2022
### Digital phenotyping for the detection and monitoring of mental disease
#### Agata Lapedriza (alapedriza@uoc.edu), Mercedes Balcells (merche@mit.edu)
Depression and other mental disorders have increased significantly in recent years. In particular, depressive disorders are the leading cause of disability in most developed countries for people aged 15-44, and suicide rates have risen dramatically. This has motivated an interest in developing systems for the early detection of these mental disorders. Currently, the diagnosis of mental disorders and the tracking of patients are mainly based on clinical assessments of self-reported symptoms, which involve filling out surveys or face-to-face interviews. Unfortunately, these procedures have limitations in terms of scalability, and the diagnosis often does not happen until the patient's condition is critical. At such a late point, treatment is less effective than when the diagnosis happens earlier.
We offer a scholarship to do part-time research (15 hours a week) for 6 months, aimed at master's students in Data Science, Artificial Intelligence, Computer Science or related areas.
During the internship the student will work on a project related to Digital Phenotyping. The research carried out during the internship may be part of the Master's Thesis.
The approximate gross salary is 6 € per hour (the exact salary is stipulated by the university where the student is enrolled).
Important dates:
- Deadline to express interest: January 31st, 2022.
- Starting date: February 2022.
If you are interested in this position please send an e-mail to Agata Lapedriza (alapedriza@uoc.edu) expressing your interest. During the internship the student will be
supervised by Agata Lapedriza and Merche Balcells.
Requirements:
+ Advanced knowledge of the Python language and the main libraries related to Data Analysis and Machine Learning, such as NumPy, Pandas, or Scikit-learn.
+ Willingness to learn and interest in technology and Artificial Intelligence.
+ Interest in gaining research experience and joining a research group.
+ Good level of English.
+ Knowledge of Machine Learning; knowledge of Deep Learning will be valued.
+ Knowledge of any Deep Learning framework, such as TensorFlow, will also be valued.
### User Lifetime Revenue Prediction
#### Arnau Escapa (arnau.escapa@socialpoint.es, SocialPoint)
Nowadays, video game developers record every virtual action performed by their players. As each player can remain in the game for years, this results in an exceptionally rich dataset that can be used to understand and predict player behavior. Predicting long-term revenue is a key business problem, since these predictions are the compass used to evaluate and drive marketing efforts. The goal is to have a decent revenue prediction just a few days after a user registers in the game.
The project consists in testing different approaches to the problem:
+ Multilevel models: a standard approach to analyzing clustered and longitudinal data in the social, behavioral and medical sciences.
+ Deep Learning models: https://arxiv.org/abs/1811.12799
+ Parametric models: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.526.3400&rep=rep1&type=pdf
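As a rough, hedged sketch of the kind of baseline any of these approaches would be compared against (the file name, column names and the 7-day/365-day horizons are assumptions, not part of the project), a simple regression from early-activity features to long-term revenue could look like:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical table: one row per player, features from the first 7 days
# and the revenue observed after 365 days as the target.
df = pd.read_csv("players_day7_features.csv")  # assumed file name
features = ["sessions_d7", "revenue_d7", "levels_completed_d7", "ads_watched_d7"]
target = "revenue_d365"

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df[target], test_size=0.2, random_state=0
)

model = GradientBoostingRegressor(random_state=0)
model.fit(X_train, y_train)

print("MAE on held-out players:", mean_absolute_error(y_test, model.predict(X_test)))
```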
### Causality Modeling
#### Arnau Escapa (arnau.escapa@socialpoint.es, SocialPoint)
Nowadays, video game developers record every virtual action performed by their players. Even with such a rich dataset, it is often hard to draw clear conclusions about why players behave the way they do.
The project consists in applying Causal Models in order to find concrete events that cause relevant in-game events, such as making a first payment, watching a video ad, becoming an engaged user, or quitting the game.
Reference: https://towardsdatascience.com/introduction-to-causality-in-machine-learning-4cee9467f06f
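As a minimal sketch of one possible starting point, not the specific method the project will settle on (the file and column names are hypothetical), an inverse-propensity-weighted estimate of the effect of a candidate event on a later in-game outcome could be computed as follows:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical per-player table: confounders, a binary treatment
# (e.g. watched_video_ad) and a binary outcome (e.g. made_first_payment).
df = pd.read_csv("player_events.csv")  # assumed file name
confounders = ["days_active", "sessions", "level_reached"]
treatment, outcome = "watched_video_ad", "made_first_payment"

# Estimate propensity scores P(treatment | confounders).
ps_model = LogisticRegression(max_iter=1000).fit(df[confounders], df[treatment])
ps = ps_model.predict_proba(df[confounders])[:, 1]

t, y = df[treatment].to_numpy(), df[outcome].to_numpy()
# Inverse-propensity-weighted estimate of the average treatment effect.
ate = np.mean(t * y / ps) - np.mean((1 - t) * y / (1 - ps))
print("Estimated effect of the event on the outcome:", ate)
```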
### Wine catalog and recommender
#### Mireia Ribera (ribera@ub.edu) with support of Santi Seguí in the recommender part
This project combines a recommender system with a visualization for exploring wines through their many features: origin, type, grape, etc.
It is advisable to have taken the Recommender Systems and Natural Language Processing courses. Information Visualization is mandatory. HTML and JavaScript knowledge is a must.
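As a minimal sketch of the recommender side (assuming the Kaggle wine reviews dataset listed below, with 'title' and 'description' columns; the file name is a placeholder), a content-based baseline over tasting notes could be:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical extract of the Kaggle wine reviews dataset.
wines = pd.read_csv("winemag-data.csv")  # assumed columns: 'title', 'description'

tfidf = TfidfVectorizer(stop_words="english", max_features=5000)
X = tfidf.fit_transform(wines["description"].fillna(""))

def similar_wines(idx, k=5):
    """Return the k wines whose tasting notes are most similar to wine `idx`."""
    sims = cosine_similarity(X[idx], X).ravel()
    best = sims.argsort()[::-1][1:k + 1]  # skip the wine itself
    return wines.iloc[best][["title"]].assign(similarity=sims[best])

print(similar_wines(0))
```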
+ Chen, Bernard, et al. "The computational wine wheel 2.0 and the TriMax triclustering in wineinformatics." Industrial Conference on Data Mining. Springer, Cham, 2016.
+ Flanagan, Brendan, et al. "Predicting and visualizing wine characteristics through analysis of tasting notes from viewpoints." International Conference on Human-Computer Interaction. Springer, Cham, 2015.
+ Sheridan, Emma (2019). Un-Bottling the Data. https://towardsdatascience.com/un-bottling-the-data-2da3187fb186
+ Datasets: Wine Enthusiast, Guía Peñín, Kaggle wine reviews (https://www.kaggle.com/zynicide/wine-reviews)
### The Ersilia Model Hub: a repository of AI/ML models for infectious and neglected tropical diseases
#### Jordi Vitrià (jordi.vitria@ub.edu), Miquel Duran (Ersilia)
The Ersilia Open Source Initiative (EOSI; https://ersilia.io) is a small non-profit organisation aimed at strengthening biomedical research capacity in low- and middle-income countries (LMIC). The current biomedical research system concentrates most of the scientific potential in high-income countries (HIC), which jointly account for over 90% of the publications worldwide. With such a small scientific and innovation workforce, LMIC largely depend on the solutions devised in the Global North, which are oftentimes unable to meet the needs on the ground. The effect is aggravated by the lack of involvement of local researchers in studies focused on endemic diseases. We are convinced that AI/ML can be key to establishing more egalitarian North-South collaborations. Compared to other technologies, digital assets provide cost-effective solutions that can be adapted to underfunded settings, which gives LMIC a unique opportunity to ‘decolonise’ their science and set up sustainable and world-class research programs. In this context, we created EOSI to promote implementation and build capacity in AI/ML, especially in Sub-Saharan Africa.
We are currently building the Ersilia Model Hub, the first repository of open-source, ready-to-use AI/ML models for drug discovery in infectious and neglected tropical diseases. We expect Ersilia to become a reference AI/ML resource for scientists and companies tackling diseases of the Global South. According to our roadmap, we will reach 500 models by March 2022. Ersilia’s models are powered by the Chemical Checker (CC), a data-driven drug-discovery resource focused on ‘transfer learning’. The CC technology was published in the journal Nature Biotechnology (Duran-Frigola et al., 2020) and was validated in the internationally recognised DREAM Challenge competition (CTD-squared Pan Cancer Activity Prediction), where it ranked among the top-performing teams. Three important additions to Ersilia are (1) confidence estimation and explainability of predictions (Jiménez-Luna et al., 2020), (2) privacy-preserving AI/ML options, and (3) making models lighter, so that they can run on the bare-minimum computational resources available to some of our collaborators in LMIC.
We are seeking expert support to move forward with Ersilia. We adhere to the principles of Open Science, which means we work in the open-source domain, aim at peer-reviewed publications and care deeply about scientific credit and authorship. At the moment, we are growing a community of users and contributors thanks to the Code for Science & Society incubator.
**References**
+ Duran-Frigola, M., Pauls, E., Guitart-Pla, O., Bertoni, M., Alcalde, V., Amat, D., Juan-Blanco, T., and Aloy, P. (2020). Extending the small-molecule similarity principle to all levels of biology with the Chemical Checker. Nat. Biotechnol. 38, 1087–1096.
+ Jiménez-Luna, J., Grisoni, F., and Schneider, G. (2020). Drug discovery with explainable artificial intelligence. Nature Machine Intelligence 2, 573–584.
### Ground-based Cloud Classification with Deep Learning (CNN vs Transformers)
#### Jordi Vitrià (jordi.vitria@ub.edu), Gerard Gómez (UB)
Clouds play an essential role in the circulation of water vapour and affect the earth’s energy balance. In the study of **weather forecasting and climate change**, clouds are regarded as a core factor.
<center>
<img width="450" src="https://i.imgur.com/g9BCeF3.jpg"></center>
<center>
Cloud Atlas
</center>
Traditional cloud observation (https://cloudatlas.wmo.int/en/home.html) depends heavily on the observers’ experience and is therefore **time-consuming**. We propose to develop a neural network for accurate ground-based cloud classification. To this end, we will explore CNN as well as **Transformer architectures**.
Depending on the performance of the neural network, it will be deployed in the field at the Observatori Fabra (http://www.fabra.cat/) in Tibidabo, Barcelona.
<center>
<img width="450" src="https://i.imgur.com/auicCIo.jpg">
</center>
<center>
Example of Cirrus fibratus radiatus from Observatori Fabra. 27/10/2021
</center>
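For context, the two families of architectures can be fine-tuned and compared under exactly the same training loop. The sketch below uses the timm library; the number of cloud classes, the model names and the dummy tensors are placeholders rather than project decisions.

```python
import timm
import torch
import torch.nn as nn

NUM_CLASSES = 10  # assumed number of cloud genera; adjust to the actual dataset

# Two candidate backbones to compare: a CNN and a Vision Transformer.
cnn = timm.create_model("resnet50", pretrained=True, num_classes=NUM_CLASSES)
vit = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=NUM_CLASSES)

criterion = nn.CrossEntropyLoss()

def train_step(model, images, labels, optimizer):
    """One supervised fine-tuning step, identical for both architectures."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Example with dummy tensors (replace with a real DataLoader of sky images).
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, NUM_CLASSES, (8,))
opt = torch.optim.AdamW(cnn.parameters(), lr=1e-4)
print(train_step(cnn, images, labels, opt))
```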
---
### 2 MSc projects in collaboration with ESADE
#### Carlos Carrasco (carlos.carrasco@esade.edu)(ESADE)
You can find the information about these projects in Campus Virtual.
*Note: You can choose among these projects, but we will assign only 2 projects from this list.*
+ DEEPFAKES: FAKE NEWS THAT COULD DESTROY DEMOCRACY
+ Detecting Fake News with Natural Language Processing
+ Dethrone the king - Youtubers, Celebrities and scandals in social networks.
+ Distance Matters
+ Google Street View
+ Misinformation Virality
+ Online Videogames
+ Personalized Routes
+ Storm the Capitol
---
### Pattern Recognition in mice behaviours using machine learning.
#### Eloi Puertas (epuertas@ub.edu), Mercé Masana i Nadal (UB)
This project is already assigned.
---
### Exploring Machine Learning DevOps alternatives.
#### Eloi Puertas Prats (epuertas@ub.edu)
One of the most difficult stages in the data science pipeline is putting trained models into production and making them work in real-life applications. Machine Learning DevOps (MLOps) tries to solve these problems by following lessons learned from software engineering and by deploying models through automation. In this project we will explore different alternatives for creating a fully automated DevOps pipeline in a classic data science project: from collecting data from different sources (bots, crontabs, ...), to training models online, deploying them, and finally putting them into production as a containerized web application.
The main goals of the project are:
+ Full automation of the data science pipeline: no further human interaction is needed once the project is deployed.
+ Daily model building: the model is retrained every day when new training data is available.
+ Dashboard for controlling which models are in production at any time.
+ Data version control: like a version control system, but for the data fetched by the application.
+ Continuous machine learning: training new models continuously and pushing them directly to production, retrieving the metrics obtained during training.
+ Connectors to the TensorFlow visualization toolkit (TensorBoard).
The principal tools we are going to use are GitHub, GitHub Actions and Python scripting.
Platforms to explore: https://cml.dev/, https://dvc.org/, https://studio.iterative.ai/
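As a hedged sketch of the kind of retraining entry point such a pipeline would invoke on its daily schedule (the file names and the scikit-learn model are placeholders), the script simply retrains, evaluates and persists the model together with its metrics, so that tools like DVC and CML can version and report them:

```python
# train.py - minimal daily retraining entry point (a sketch, not the final design).
import json

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Data freshly fetched by the collectors (bots, crontabs, ...); path is assumed.
df = pd.read_csv("data/latest.csv")
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Persist the model and its metrics; DVC can track both files, and CML can
# post metrics.json as a report in the GitHub Actions run.
joblib.dump(model, "model.joblib")
with open("metrics.json", "w") as f:
    json.dump({"accuracy": accuracy_score(y_test, model.predict(X_test))}, f)
```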
Requirements: 1 student; strong skills with GitHub and Python scripting.
Refs:
+ Sergios Karagiannakos, Deep Learning in Production (https://leanpub.com/DLProd)
---
### Crowd learning from any type of information
#### Jerónimo Hernández González (jeronimo.hernandez@ub.edu), Aritz Pérez (BCAM) (aperez@bcamath.org)
Crowdsourcing approaches to machine learning aim to obtain data from a community of collaborators who are not required to be experts or knowledgeable about the project in which they participate. Among other tasks, this type of (altruistic or paid) worker has commonly been used to label datasets. Different methods have been proposed for learning from this type of data [1], frequently assuming a model of the workers’ behavior. Most of these models assume that the information provided by the workers is just a label, although several studies show that it is possible to take advantage of vaguer supervision information. In this work, the student will address the problem of learning from crowdsourced data where the information provided by workers can take any form. An appropriate model for the workers in this new framework needs to be conceived. The objective is to propose a complete learning technique that leads to robust classifiers when dealing with this type of data, compared with the standard approach to learning from crowds.
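For reference, the standard baseline the proposed technique would be compared against, majority voting over the crowd labels plus a rough per-worker reliability estimate, can be sketched as follows (the annotation matrix is a toy example):

```python
import numpy as np

# Toy annotation matrix: rows = items, columns = workers,
# entries = class labels, -1 where a worker did not label the item.
A = np.array([[0,  0, 1, -1],
              [1,  1, 1,  0],
              [0, -1, 0,  0]])
n_classes = 2

def majority_vote(A, n_classes):
    """Consensus label per item: the class with the most worker votes."""
    votes = np.stack([(A == c).sum(axis=1) for c in range(n_classes)], axis=1)
    return votes.argmax(axis=1)

consensus = majority_vote(A, n_classes)

# Rough per-worker reliability: agreement with the consensus on labeled items.
mask = A != -1
agreement = ((A == consensus[:, None]) & mask).sum(axis=0) / mask.sum(axis=0)
print(consensus, agreement)
```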
Requirements:
+ Python
+ Knowledge of PGMs and Bayesian inference (valuable)
+ No. students: 1
Bibliography:
+ [1] Raykar, V. C., Yu, S., Zhao, L. H., Hermosillo Valadez, G., Florin, C., Bogoni, L., & Moy, L. (2010). Learning From Crowds. Journal of Machine Learning Research, 11, 1297–1322.
---
### Mental health estimation with data coming from multiple sources
#### Jerónimo Hernández González (jeronimo.hernandez@ub.edu)
Data collection is arguably one of the main factors constraining the use of machine learning in many domains, including health care. It is a laborious task that takes up physicians’ time, which is why it is considered expensive. In other cases, when analyzing diseases with low prevalence, it is simply hard to obtain enough cases for a robust study from a single hospital. Thus, in practice, in many studies the data is collected by several hospitals or health care centers. This comes with a cost: the data is not always collected in exactly the same way. The benefit of having more data for learning is therefore traded off against the accuracy of the method on data from each individual center.
In this work, the student will work with a real-world application where the objective is to predict a mental health issue in a dataset of patients from 5 different health care centers. The objective is to use specific techniques from multiple-source learning [1] in order to improve on the performance of a classifier learned from the whole dataset, as well as to carefully select the performance metrics that should be taken into account.
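As a minimal sketch of the per-center evaluation this setting calls for (the file and column names are assumptions), leave-one-center-out validation with scikit-learn could look like:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical pooled dataset with a 'center' column identifying the source.
df = pd.read_csv("mental_health_multicenter.csv")
X = df.drop(columns=["label", "center"])
y, centers = df["label"], df["center"]

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=centers):
    clf = LogisticRegression(max_iter=1000).fit(X.iloc[train_idx], y.iloc[train_idx])
    score = balanced_accuracy_score(y.iloc[test_idx], clf.predict(X.iloc[test_idx]))
    print(f"held-out center {centers.iloc[test_idx].iloc[0]}: balanced accuracy {score:.3f}")
```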
Requirements: Python
No. students: 1
Bibliography:
+ [1] Crammer, K., Kearns, M., & Wortman, J. (2008). Learning from Multiple Sources. Journal of Machine Learning Research, 9, 1757–1774.
---
### Topic modeling for improving recommendations
#### Paula Gómez (paula.gomez@ub.edu) (UB)
Topic modeling is an unsupervised machine learning technique that’s capable of scanning a set of documents, detecting word and phrase patterns within them, and automatically clustering word groups and similar expressions that best characterise a set of documents.
In this project, we propose working with TV3 (the Catalan public broadcaster) on their text metadata (synopses, subtitles, ...) in order to find appropriate tags from unsupervised clusters, so that they end up helping to improve their **recommender system**.
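As a minimal sketch of the unsupervised step (the synopses below are illustrative placeholders, not TV3 data), topic modeling with LDA and inspection of the top words per topic as candidate tags could be:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder list of programme synopses (in practice, TV3 metadata).
synopses = [
    "Un documental sobre la cuina tradicional catalana",
    "Una sèrie policíaca ambientada a Barcelona",
    "Un debat sobre política i economia",
]

vectorizer = CountVectorizer(max_features=10000)
X = vectorizer.fit_transform(synopses)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Inspect the top words of each topic as candidate tags.
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:5]
    print(f"topic {k}:", [terms[i] for i in top])
```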

Bibliography:
+ Spatial topic modeling in online social media for location recommendation https://dl.acm.org/doi/abs/10.1145/2507157.2507174
+ Collaborative topic modeling for recommending scientific articles https://dl.acm.org/doi/abs/10.1145/2020408.2020480
---
### Mitigating and leveraging popularity bias in recommender systems
#### Paula Gómez (paula.gomez@ub.edu) (UB)
Recommender systems usually face popularity bias issues: from the data perspective, items exhibit an uneven (usually long-tail) distribution of interaction frequency; from the method perspective, collaborative filtering methods are prone to amplifying the bias by over-recommending popular items. In this project, we propose to research **causality methods** in order to help mitigate and leverage popularity bias in a recommender system.
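As a small illustration of the data-side diagnosis, not the causal methods to be researched, one can quantify the long-tail distribution on synthetic interactions and apply a simple inverse-popularity re-weighting baseline:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic interaction log: item ids drawn from a long-tailed (Zipf-like) distribution.
interactions = rng.zipf(a=2.0, size=10000)
interactions = interactions[interactions <= 1000]

items, counts = np.unique(interactions, return_counts=True)
popularity = counts / counts.sum()
print("share of interactions held by the top 1% of items:",
      np.sort(popularity)[::-1][: max(1, len(items) // 100)].sum())

# Simple mitigation baseline: re-rank by dividing model scores by popularity^alpha.
scores = rng.random(len(items))          # stand-in for recommender scores
alpha = 0.5
adjusted = scores / popularity ** alpha
```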

Bibliography:
+ Causal intervention for leveraging popularity bias https://arxiv.org/pdf/2105.06067.pdf
+ Managing Popularity Bias in Recommender Systems with Personalized Re-ranking https://arxiv.org/pdf/1901.07555.pdf
---
### Deep Learning with Noisy Labels
#### Petia Radeva (petia.ivanova@ub.edu), Ricardo Marques (UB)
One of the key reasons why deep neural networks (DNNs) have been so successful in image classification is the availability of massive labeled datasets such as COCO and ImageNet. However, it is time-consuming and expensive to collect such high-quality manual annotations. A single image often requires agreement from multiple annotators to reduce label error. On the other hand, there exist other, less expensive sources of labeled data, such as search engines, social media websites, or reducing the number of annotators per image. However, those low-cost approaches introduce low-quality annotations with label noise. Many studies have shown that label noise can significantly affect the accuracy of the learned classifiers. In this Master's thesis, we will address the problem of how to effectively train on noisily labeled datasets.
The prominent issue in training DNNs on noisily labeled data is that DNNs often overfit to the noise, which leads to performance degradation. We will address this issue by optimizing for model parameters that are less prone to overfitting and more robust against label noise. Specifically, we will combine self-learning approaches with a self-distillation neural network to form hypotheses about which labels are noisy and avoid the degradation of the algorithm due to them. The key idea of our method is that a noise-tolerant model should be able to consistently learn the underlying knowledge from the data despite different levels of label noise.
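For reference, a common noise-robust baseline from the literature is a soft "bootstrapped" cross-entropy that mixes the given labels with the model's own predictions. It is not the self-distillation method described above, but it illustrates the kind of loss the project would compare against:

```python
import torch
import torch.nn.functional as F

def soft_bootstrap_loss(logits, noisy_targets, beta=0.95):
    """Cross-entropy against a mixture of the (possibly noisy) labels and the
    model's own predictions; beta=1 recovers standard cross-entropy."""
    log_p = F.log_softmax(logits, dim=1)
    p = log_p.exp().detach()                      # model beliefs, no gradient
    y = F.one_hot(noisy_targets, num_classes=logits.size(1)).float()
    mixed = beta * y + (1.0 - beta) * p
    return -(mixed * log_p).sum(dim=1).mean()

# Example usage with dummy logits and labels.
logits = torch.randn(8, 10, requires_grad=True)
labels = torch.randint(0, 10, (8,))
loss = soft_bootstrap_loss(logits, labels)
loss.backward()
```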
Requirements: This project should ideally be developed by a team of 3 people. If the team has fewer members, the project will be properly rescaled, considering only some of the goals.
BIBLIOGRAPHY:
+ Xiao, et al., Learning from Massive Noisy Labeled Data for Image Classification, CVPR’2019.
---
### Applying Machine Learning to an Epidemiological Study of the Mediterranean Diet
#### Petia Radeva (petia.ivanova@ub.edu), Ramon Estruch/Rosa Casas
Despite the evident link between diet and non-communicable diseases, interventions able to alter dietary habits and improve public health and individuals’ wellbeing have achieved limited impact. Since one of the main reasons for this failure is the multifactorial individual response, personalized medicine tools using nutritional, medical, phenotypic, exposome and genetic information about the individuals may help increase the effectiveness of healthy dietary guidelines and nutritional recommendations.
In this project we will explore two rich datasets from two clinical studies, Predimed-Plus and WAHA, and apply machine learning (ML) techniques in order to extract which variables are directly related to cognitive function and cardiometabolic health in adulthood and old age. Predimed-Plus involved 507 subjects from the Hospital Clinic site, and WAHA involved 657 subjects from the Barcelona and Loma Linda cohorts. We will apply ML approaches to analyze data about diet (including nutrient and non-nutrient intake), anthropometric measurements, quality of life and other lifestyle-related factors (e.g. physical activity, sleep), as well as genomic, metabolomic and microbiomic data, to achieve the goal of effective precision nutrition. Given the Predimed-Plus and WAHA variables, an ML-based predictive model will be trained to predict cognitive and cardiometabolic health across the whole population. The population-level predictive model will use ML algorithms (support vector machines, random forests, gradient boosting, etc.) trained on the variables of the Predimed-Plus and WAHA studies in order to predict a person's health outcomes and generate proper insights from them.
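As a hedged sketch of the population-level predictive model (the file name, variable names and the binary outcome are placeholders for the actual study variables), a cross-validated scikit-learn pipeline could look like:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical tabular extract of the study variables (names are placeholders).
df = pd.read_csv("predimed_waha_variables.csv")
X = df.drop(columns=["cognitive_outcome"])   # assumed binary health outcome
y = df["cognitive_outcome"]

# The scaler is redundant for tree models but needed if SVMs are swapped in.
model = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0))
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print("cross-validated AUC:", scores.mean())
```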
The project will be developed in close collaboration with Hospital Clinic (Dr. Ramon Estruch) and the Faculty of Pharmacy and Nutrition Science (Dr. Rosa Maria Lamuela).
Requirements: This project should ideally be developed by a team of 3 people. If the team has fewer members, the project will be properly rescaled, considering only some of the goals.
BIBLIOGRAPHY:
+ Chatterjee, A., Gerdes, M. W., & Martinez, S. G. (2020). Identification of risk factors associated with obesity and overweight—A machine learning overview. Sensors, 20(9), 2734.
---
### A Computational Approach to Understand Animal Behaviour
#### Ignasi Cos (ignasi.cos@ub.edu)(UB)
Animal behaviour remains largely unknown because of two main operational problems: first, our limited ability to observe and record freely moving animals in their environment; second, our poor understanding of the variety of animal behaviour itself. In this context, recent techniques to record kinematic data from flying birds offer a promising avenue to overcome these hurdles, as accelerometers can provide a continuous, real-time flow of quantitative data about the kinematics of movement encompassing each behaviour. Furthermore, behaviours as we conceptualize them may occur at different and sometimes conflicting time scales, ranging from one second (pecking), to a few seconds (diving, taking off, landing, deglutition), to hours (flying over the ocean), thus defining quite a fuzzy boundary between what we view as a simple action and what may be considered a behaviour. Related to this, previous work defined and created a sensible pipeline for the generation of ethograms (catalogues of the different behaviours displayed by a species) across different temporal scales, tested with tri-axial acceleration recordings from the Red-Billed Tropicbird (Phaethon aethereus). This pipeline is currently available and is based on a step-by-step procedure: first, a method of temporal segmentation by means of an adaptive threshold and window; second, a method of grouping and aligning those behaviours based on a cross-correlation metric, resulting in groups of segments that correspond to macroscopically identifiable behaviours; third, the use of a recurrent neural network trained with fragments of these groups to classify previously unobserved recordings.
Based on this work, the duties of the candidate will be to assess the use of coherence and similarity metrics for the grouping of animal behaviours, and to improve the design of a recurrent neural network classifier so as to capture the temporal dynamics of these time series and optimize behavioural separability as defined by these metrics. Data validation will be performed by combining behavioural clusters with independent GPS and pressure data, supporting or refuting the consistency between the behaviours identified by our network and current ethological knowledge about this species.
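As a small sketch of the similarity metric at the heart of the grouping step (synthetic accelerometer-like segments; normalised cross-correlation maximised over lags), one possible starting point is:

```python
import numpy as np

def xcorr_similarity(a, b):
    """Maximum normalised cross-correlation between two 1-D segments,
    searched over all lags; values near 1 mean similar up to a time shift."""
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    c = np.correlate(a, b, mode="full") / min(len(a), len(b))
    return c.max()

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 200)
seg1 = np.sin(2 * np.pi * 5 * t) + 0.1 * rng.standard_normal(t.size)  # periodic segment
seg2 = np.roll(seg1, 30)                                              # time-shifted copy
seg3 = rng.standard_normal(t.size)                                    # unrelated segment
print(xcorr_similarity(seg1, seg2), xcorr_similarity(seg1, seg3))
```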
---
### Approximate Nearest Neighbor Search with Neural architectures.
#### David Buchaca (davidbuchaca@ub.edu) (UB)
Approximate Nearest Neighbor (ANN) search is a crucial technique that allows fast search over vectors generated from embeddings produced by deep learning models.
Traditional search techniques based on the presence of certain words are not suitable for searching in the "embedding space".
This project will involve investigating different approximate nearest neighbor techniques and implementing one that allows filters to be applied during search.
Some key ideas involve computing cosine distances by replacing the matrix product with a "matrix multiplication without multiplication" (https://arxiv.org/pdf/2106.10860.pdf), and building an index that can be trained online while examples are stored in a database.
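For context, building and querying an HNSW index with one of the libraries listed below (hnswlib) takes only a few lines; this is a plain usage sketch with random vectors, before any of the project's extensions (filters, online training) are added:

```python
import hnswlib
import numpy as np

dim, n = 128, 10_000
vectors = np.random.rand(n, dim).astype(np.float32)   # stand-in for embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(vectors, np.arange(n))
index.set_ef(50)   # trade-off between recall and query speed

labels, distances = index.knn_query(vectors[:3], k=5)
print(labels)
```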
Ideal number of people for the project: 1 or 2 motivated students.
Relevant material
https://matsui528.github.io/cvpr2020_tutorial_retrieval/
https://www.jstage.jst.go.jp/article/mta/6/1/6_2/_pdf/
http://vldb.org/pvldb/vol14/p1964-wang.pdf
https://arxiv.org/pdf/1806.09823.pdf
https://arxiv.org/abs/2106.10860
https://big-ann-benchmarks.com
Relevant Libraries
https://github.com/nmslib/hnswlib
https://github.com/jina-ai/pqlite
https://github.com/spotify/annoy
---
### Non-discrimination in deep learning: Application to trustworthy brain image quantification
#### Karim Lekadir (karim.lekadir@ub.edu), Carla Sendra
In the age of big data and digital health, there has been excitement about the extraordinary
opportunities that emerging technologies may offer in tomorrow’s healthcare. However, as these technologies have been developed and pilot tested, concerns have arisen regarding their ethical, societal and legal implications. For example, a study published in Science (Obermeyer et al., 2019) received a lot of media attention when it demonstrated that a risk assessment algorithm widely used for patient referral in the US discriminated against Black patients. The authors estimated that “remedying this disparity would increase the percentage of Black patients receiving additional help from 17.7% to 46.5%”. Hence, in recent years, researchers, organisations, and opinion leaders have expressed the need for new solutions for developing AI
algorithms that are free of bias and fair across sex, age, ethnic and other population groups. The goal of this project is to implement machine learning techniques for brain image quantification that are fair and non-discriminative.
In a first step, the student will implement state-of-the-art deep neural networks for the
segmentation of brain magnetic resonance images (MRI). Subsequently, the student will
implement a number of criteria and metrics to estimate the level of bias in the obtained
segmentation results, such as by using statistical parity, group fairness, equalised odds and
predictive equality. In parallel, the student will perform an in-depth literature review on fairness in statistics and machine learning, building on experiences in various fields beyond medicine and imaging. Finally, a number of bias correction measures will be analysed and implemented to address the biases and discriminations identified by using the standard deep learning techniques.
In particular, the student will consider (1) pre-processing approaches that improve the training dataset through re-sampling (under- or over-sampling), data augmentation (image synthesis using adversarial learning) or sample weighting to neutralise discriminatory effects; (2) in-processing approaches that modify the learning algorithm in order to remove discrimination during the model training process, such as by adding explicit constraints to the loss functions to minimise the performance difference between subgroups of individuals; and (3) post-processing approaches that correct the outputs of the AI algorithm depending on the individual's group, such as by using the equalised odds post-processing technique.
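As a small illustration of the fairness metrics mentioned above (computed here on dummy binary predictions; in the project they would be derived from the segmentation results per population group), statistical parity and equalised-odds gaps can be measured as follows:

```python
import numpy as np

def statistical_parity_gap(y_pred, group):
    """|P(y_hat=1 | group=0) - P(y_hat=1 | group=1)| for a binary prediction."""
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def equalized_odds_gap(y_true, y_pred, group):
    """Largest gap in TPR and FPR between the two groups."""
    gaps = []
    for label in (1, 0):   # TPR is computed on positives, FPR on negatives
        rates = []
        for g in (0, 1):
            mask = (group == g) & (y_true == label)
            rates.append(y_pred[mask].mean())
        gaps.append(abs(rates[0] - rates[1]))
    return max(gaps)

# Dummy example: predictions, ground truth and a binary protected attribute.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)
y_pred = rng.integers(0, 2, 1000)
group = rng.integers(0, 2, 1000)
print(statistical_parity_gap(y_pred, group), equalized_odds_gap(y_true, y_pred, group))
```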
The results of this project are expected to be published as a scientific paper in a peer-reviewed publication.
---
### Multi-modal deep learning: Application to integrative modelling of electrocardiography and cardiac imaging
#### Karim Lekadir (karim.lekadir@ub.edu)
The field of machine learning has seen important developments and applications in medicine over the last years. In particular, powerful deep learning techniques such as convolutional neural networks have been developed to process complex medical data types such as medical images, genetic data or time-series data (e.g. electrocardiography). However, these developments have often taken place in a specialised fashion, and there is a lack of deep learning implementations that can seamlessly integrate medical data of different types. Yet, such integrative approaches would make it possible to produce richer models that describe the complex biological and anatomical patterns of health and disease.
In cardiology, for example, cardiac structure and electrical activity are two important, inter-linked aspects of cardiac health and disease. Cardiac morphology is typically measured using imaging modalities such as magnetic resonance imaging (MRI). The electrocardiogram (ECG) is a non-stationary physiological signal that represents the electrical activity of the heart. It is widely used to identify patterns or abnormalities in cardiac rhythms and waveforms. For both cardiac MRI and ECG signals, there have been many deep learning developments in the recent state of the art (e.g. Somani et al. EP Europace 2021, Campello et al. TMI 2021). However, there has not been any work investigating deep learning for the integrative analysis of both cardiac structure (cardiac MRI) and cardiac electrical activity (ECG).
The goal of this project is to develop new deep learning methods to encode the associations between electrocardiography signals and dynamic cardiac images. This will make it possible to develop new methods that predict cardiac structural dynamics over a heartbeat as a function of the electrocardiographic signals. This will be achieved thanks to access to a large database of ECG and cardiac MRI datasets (>10,000s) from the H2020 euCanSHare project (www.eucanshare.eu), which is coordinated by the University of Barcelona.
Concretely, the student will first review the state of the art for generative models and similar applications with multimodal data in medical imaging. Then the student will design and implement a methodology in close collaboration with the supervisors. In particular, the aim is to extract meaningful patterns from electrocardiographic data that can be leveraged for inferring functional changes in the heart. Since the imaging space is very high-dimensional compared to the ECG one, a prior shape will be provided as the initial condition for the model. A successful latent representation derived from this multimodal data may later be used in downstream tasks, such as anomaly detection or diagnosis.
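As an architectural sketch only (the tensor shapes, channel counts and the late-fusion strategy are assumptions to be revisited with the supervisors), a two-branch model that encodes an ECG segment and an MRI frame into a shared latent space might start from something like:

```python
import torch
import torch.nn as nn

class ECGEncoder(nn.Module):
    """1-D convolutional encoder for a 12-lead ECG segment (B, 12, T)."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(12, 32, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(64, latent_dim)

    def forward(self, x):
        return self.fc(self.conv(x).squeeze(-1))

class MRIEncoder(nn.Module):
    """2-D convolutional encoder for a single cardiac MRI frame (B, 1, H, W)."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, latent_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

class JointEncoder(nn.Module):
    """Fuses both modalities into a single latent code by concatenation."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.ecg, self.mri = ECGEncoder(latent_dim), MRIEncoder(latent_dim)
        self.head = nn.Linear(2 * latent_dim, latent_dim)

    def forward(self, ecg, mri):
        return self.head(torch.cat([self.ecg(ecg), self.mri(mri)], dim=1))

# Dummy forward pass with placeholder shapes.
z = JointEncoder()(torch.randn(4, 12, 500), torch.randn(4, 1, 128, 128))
print(z.shape)  # (4, 64)
```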
Number of students: 2 students.
---
### “Do’s & don’ts” in machine and deep learning for medical applications
#### Oliver Diaz (oliver.diaz@ub.edu)
There are a multitude of machine and deep learning methods in the literature that achieve outstanding and promising results in various medical domains, such as medical imaging. Most of the time, the literature focuses on outperforming the state of the art, as the community strives for more accurate models. However, there are no all-purpose metrics or standard benchmark datasets to perform a reliable comparison and evaluation of new machine learning methods. Depending on the application and datasets, by choosing an inappropriate evaluation or comparison metric, one can (inadvertently) exploit the lack of well-established standards to achieve remarkable results.
The goal of this project is to investigate, implement and evaluate bad practices and common mistakes (including tricks and unintentional errors) that are often used to artificially increase the performance of machine/deep models. This will lead to guidelines and recommendations on best practices that should be consistently applied for developing trustworthy machine and deep learning models.
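As a concrete, hedged example of the kind of pitfall the project will document, fitting a preprocessing or feature-selection step on the full dataset before cross-validation leaks test information and inflates scores; keeping every fitted step inside the cross-validation pipeline removes the leak:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2000))        # pure noise features
y = rng.integers(0, 2, 100)                 # random labels: true accuracy ~0.5

# WRONG: selecting features on the full dataset before cross-validation.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()

# RIGHT: selection happens inside each training fold only.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky accuracy ~{leaky:.2f} vs honest accuracy ~{honest:.2f}")
```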
The specific contributions and benefits to the community and to the students are as follows:
+ The students will propose a list of not-to-do's when applying machine and deep learning in medical applications and show their risks through examples.
+ Learn a broad spectrum of applications such as classification, regression, segmentation, prediction and prognosis using deep and machine learning.
+ Work with imaging and non-imaging datasets of different organs and modalities such as brain and cardiac MRI, Covid-19 chest X-ray images, breast cancer mammography, longitudinal exposome and health data, and many more.
+ Publish the results in a high-impact international journal where the students will be co-authors and their experimental results will be included in the analysis.
Number of students: 2 students.
---
### Image dataset balancing in deep learning: Application to breast cancer detection
#### Oliver Diaz (oliver.diaz@ub.edu)
Breast cancer is now the most common cancer worldwide, surpassing lung cancer in 2020 for the first time. It has become a major public health burden that calls for new approaches to accelerate early detection and personalised treatment. Deep learning models trained from large breast cancer imaging datasets, in particular from screening mammography (MMG), have the potential to improve breast cancer detection in real-world practice. However, the training of robust and generalisable deep learning models from MMG data faces an important technical challenge: existing MMG databases in clinical practice are highly imbalanced, as the prevalence of benign tumours (class 1: healthy subjects) is much higher in the screening population than that of malignant tumours (class 2: pathological cases). Yet, deep learning models are known to lack accuracy and robustness when the training data is imbalanced, as they produce classifications that are biased towards the majority class.
This project proposes to develop unbiased and generalisable tools for breast cancer detection by developing new methods that can balance and augment the training data before the estimation of the deep learning models. Concretely, the student will investigate, implement and evaluate an emerging type of neural network called class-conditional generative adversarial networks (cGANs), which enable image synthesis, i.e. the creation of artificial yet realistic new instances of the minority class, namely MMG images with malignant breast cancer. The student will adapt, test and compare the proposed class-conditional GAN-based data augmentation with traditional over-sampling and under-sampling techniques. The project will also define and run several experiments based on large multi-centre European mammography datasets (from Spain, Portugal and the UK).
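As a minimal sketch of the traditional over-sampling baseline that the cGAN-based augmentation will be compared against (the dataset below is a synthetic placeholder), class-balanced mini-batches can be drawn with a weighted sampler:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Placeholder imbalanced dataset: 95% "healthy" (0) and 5% "malignant" (1) labels.
labels = np.concatenate([np.zeros(950, dtype=np.int64), np.ones(50, dtype=np.int64)])
images = torch.randn(len(labels), 1, 64, 64)        # stand-in for MMG images
dataset = TensorDataset(images, torch.from_numpy(labels))

# Weight each sample by the inverse frequency of its class.
class_counts = np.bincount(labels)
weights = 1.0 / class_counts[labels]
sampler = WeightedRandomSampler(torch.from_numpy(weights).double(),
                                num_samples=len(labels), replacement=True)

loader = DataLoader(dataset, batch_size=32, sampler=sampler)
x, y = next(iter(loader))
print("malignant fraction in the batch:", y.float().mean().item())  # roughly 0.5
```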
This project’s contributions will be compiled into the open-access MediGAN toolbox (https://medigan.readthedocs.io) developed at the University of Barcelona, which will be shared with the medical imaging community and clinical practice.
The research will be part of EuCanImage (www.eucanimage.eu), a large-scale European project coordinated by the University of Barcelona, which is building a European cancer imaging platform for enhanced artificial intelligence in clinical oncology.
---
### Endoluminal image classification
#### Santi Seguí (santi.segui@ub.edu)
Capsule endoscopy is a procedure that uses a tiny wireless camera to take pictures of the digestive tract. A wireless capsule endoscopy (WCE) camera sits inside a vitamin-sized capsule that the patient swallows and takes more than 50,000 images, which are sent to an external device to be analyzed.
Although this is an amazing non-invasive product that allows full visualization of the entire endoluminal tract, its application is limited by one main problem: the diagnosis, i.e. the visualization of more than 60,000 images, is a hard and tedious task that must be done by experts. So, WCE needs AI to become a real clinical procedure. In this project, the student will study and apply self-supervised models (using deep learning) to the WCE problem.
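As a toy illustration of the self-supervised idea (rotation prediction is only one of many possible pretext tasks and is not necessarily the one that will be chosen for WCE), the model learns from unlabeled frames by predicting which rotation was applied:

```python
import torch
import torch.nn as nn

def rotation_batch(images):
    """Rotate each image by 0/90/180/270 degrees and return the rotation label."""
    rotated, labels = [], []
    for k in range(4):
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

# Tiny CNN backbone with a 4-way rotation head (placeholder architecture).
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 4),
)

frames = torch.randn(8, 3, 64, 64)          # stand-in for unlabeled WCE frames
x, y = rotation_batch(frames)
loss = nn.CrossEntropyLoss()(backbone(x), y)
loss.backward()
```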
Requirements: 1-2 students
---
### RecSys Challenge Competition
#### Santi Seguí (santi.segui@ub.edu), Pere Gilabert
Every year, in the scope of the International RecSys Conference (https://recsys.acm.org/recsys22/), a new real-world RecSys challenge is proposed. Each year the challenge is organized by a different company (Twitter 2021; Twitter 2020; Trivago 2019; Spotify 2018). The goal of this Master's Thesis is not to win the challenge (that would be amazing) but to study it and participate in it. Last year, a master's student participated, finishing in 9th place. Information on the challenge is expected to be released by the end of the year, and the challenge will begin in mid-February.
Requirements: 1-2 students
---