---
title: Data Study Group Catch-up Presentations
---
:sparkles::mag::sparkles: **Data Study Group Catch-up Presentations**:sparkles::mag::sparkles:
Please leave your constructive thoughts, comments and questions for each group presentation below.
Add your name to your feedback if you feel comfortable so the teams can reach out to you and discuss further.
Feedback could include:
What you found interesting?
What might they have not thought about/suggestions?
:cake::birthday:**CatsAi**:birthday::cake:
* Very nice intuition around buyer behaviours from your EDA!
* Excellent presentation Tatiana and Prakhar! (Thank you :) )
* Great talk! How tight is the relationship between temperature and sales? Can you see some levelling off?
* Like the key findings! How did you identify these features? Maybe PCA/ ICA might help (Ben M)?
* **Answer**: PCA/ICA would be a nice way to look at feature importance, but since we are trying to incorporate an explainability element in the project, PCA-processed features would be more difficult to interpret.
* You mention 1/3 of regions have a particularly sweet tooth, what did you look at for this? Sweet vs Savory? :+1:
* How do you know it's 1/3 of the regions? Do you have info on individual shoppers?
* **Answer:** Indeed, we have data at the individual level, but it is difficult to visualise, so we use the region each shopper belongs to and look at the total sales for each category in each region.
* For explanations, would you be able to do something like calculating likelihoods of different explanation models and then compare e.g. a Bayes factor? (Anurag) :+1:
* Could you please expand on the last point of the Key Findings? Why *should* we get those products (or were they the ones being bought more often in those dates)? Also, why did regions 3-6 look empty in the histograms? (Alvaro - DSG Team)
**Answer for the first part:** It was meant as a small joke, actually. What we meant was that on rainy days people tend to buy more `Savoury Ready to Proof` and `Bread Ready to Bake` goods. From a machine learning 'prediction' perspective, I would put it as something you *would* buy rather than something you *should*, because that's how consumer behaviour looks on rainy days. :+1: A.
**Answer for the second part**: The regions 3-6 looked empty because the overall sales in the first two regions were *much* higher than the others. We also did a percentage sales by category (different savoury and sweet bread types) and found areas which were more interested in particular categories (not included in the presentation today). :+1: (Think density plots (or normalisation by total sales per region) for final figures A.) - Thank you, we'll note that down :)
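The normalisation suggested above (sales per category as a share of each region's total) could be sketched roughly as follows. The region names, categories and numbers are hypothetical and only illustrate why low-volume regions stop looking "empty" once shares are plotted instead of raw totals.

```python
from collections import defaultdict


def category_shares(sales):
    """Turn raw {(region, category): amount} totals into within-region shares."""
    region_totals = defaultdict(float)
    for (region, _category), amount in sales.items():
        region_totals[region] += amount

    shares = {}
    for (region, category), amount in sales.items():
        total = region_totals[region]
        # Guard against regions with no sales at all.
        shares[(region, category)] = amount / total if total else 0.0
    return shares


# Hypothetical numbers, for illustration only (not the challenge data).
toy_sales = {
    ("region_1", "sweet"): 800.0,
    ("region_1", "savoury"): 200.0,
    ("region_3", "sweet"): 3.0,
    ("region_3", "savoury"): 7.0,
}
shares = category_shares(toy_sales)
```

On a shared raw-sales axis `region_3` would be barely visible next to `region_1`, whereas the shares put both regions on the same 0 to 1 scale.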
* Thank you for your presentation! Can you tell us more about counterfactual explanations? What do they add compared to SHAP (or other post-hoc explanation methods)?
* **Answer:** Counterfactual explanations describe causal relationships between the features and the prediction. They therefore offer a clear understanding of the predictions, and one can infer things like "if the temperature changes by 2 degrees, the sales would change by 15%". More details: https://christophm.github.io/interpretable-ml-book/counterfactual.html and a neat software package: https://github.com/interpretml/DiCE
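As a rough illustration of the "what would have to change" style of question in the answer above, here is a minimal sketch of a counterfactual search over a single feature. It is not the team's method and does not use the DiCE package; `toy_sales_model`, its coefficients and the grid search are all assumptions made for the example, and the result is a what-if statement about the model rather than about the real world.

```python
def toy_sales_model(temperature_c, is_rainy):
    """Hypothetical stand-in for a trained predictor (not the team's model)."""
    return 100.0 + 7.0 * temperature_c - 20.0 * (1 if is_rainy else 0)


def temperature_counterfactual(model, temperature_c, is_rainy, target,
                               step=0.1, max_shift=15.0):
    """Smallest temperature shift (on a grid of `step`) whose prediction reaches `target`.

    Returns None if no shift within +/- max_shift reaches the target.
    """
    n_steps = int(round(max_shift / step))
    for k in range(n_steps + 1):  # try the smallest shifts first
        for sign in (1, -1):
            shift = sign * k * step
            if model(temperature_c + shift, is_rainy) >= target:
                return shift
    return None


baseline = toy_sales_model(10.0, False)
# "How much warmer would it need to be for predicted sales to rise by 15%?"
shift = temperature_counterfactual(toy_sales_model, 10.0, False,
                                   target=1.15 * baseline)
```

Real counterfactual tools search over several features at once and add constraints such as plausibility and sparsity; the one-dimensional grid here only conveys the idea.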
:recycle::leaves:**Greenvest Solutions**:leaves::recycle:
* Luke, Gim, awesome presentation! :100:
* Are you considering traditional seasonal-trend estimation methods as an intermediate step (in complexity) when compared to LSTM?
* How do you intend to determine how much time dependence the LSTM needs? :+1:
* Would something like autocorrelation inform this decision?
**Answer to above two questions (somewhat):** Definitely something we have started to look into. We have decomposed the time series into trend and seasonal components and applied a Fourier transform to the time series to capture the timescales of the dynamics. Autocorrelation is definitely helpful for deciding how to use the LSTM.
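One way autocorrelation could inform the LSTM input window is sketched below. The sample autocorrelation is standard; the `suggest_lookback` heuristic and its `threshold` value are assumptions for illustration, not something the team described, and the sine series is synthetic.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a non-negative integer `lag`."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0 or lag >= n:
        return 0.0
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return cov / var


def suggest_lookback(series, max_lag, threshold=0.2):
    """Hypothetical heuristic: largest lag whose |autocorrelation| exceeds `threshold`."""
    significant = [lag for lag in range(1, max_lag + 1)
                   if abs(autocorrelation(series, lag)) > threshold]
    return max(significant) if significant else 1


# Synthetic series with a 24-step cycle (e.g. hourly data with a daily pattern).
series = [math.sin(2 * math.pi * t / 24) for t in range(240)]
acf_one_cycle = autocorrelation(series, 24)   # strongly positive
acf_half_cycle = autocorrelation(series, 12)  # strongly negative
lookback = suggest_lookback(series, max_lag=48)
```

Peaks in the autocorrelation at multiples of the cycle length suggest the window should cover at least one full cycle; the same timescales should also show up in the Fourier spectrum mentioned in the answer.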
* Would more turbines affect the wind direction/strength?
* Locally, the wind farm would change the wind streamlines (treating wind as a fluid) and therefore the velocity and pressure distribution across the wind farm area. But it would have little effect on areas far away.
* Can you give an intuition on why LSTMs are suitable for time series modeling? [Excellent presentation btw!]
**Answer:** For time-series such as the weather forecast, where there are multiple-scale correlations (due to local vs global weather systems), LSTM is a good first-step to capture both short-term time-correlation and long-term time-correlation.
* Super slick presentation thanks guys :leaves:
:microscope::hospital:**CRUK**:hospital::microscope:
* Great explanation! This isn't much of a question, but a former lab member worked on network inference for a while and came up with an interesting method based on Partial Information Decomposition. Might throw up some helpful interactions: https://www.cell.com/cell-systems/pdfExtended/S2405-4712(17)30386-1 (Happy to chat - Ivan Croydon-Veleslavov)
* Thanks! It looks really interesting. (+1 for this! (JA), +1 too (TY))
* Love the idea of the convolutions, do you think you have enough samples?
* Probably not but we are trying out the idea and then the challenge owner can decide whether they want to increase the data amount, e.g. by using public databases, and pursue the idea further.
* For the Bayesian network approach, I am wondering how sensitive the learnt network structure is to the choice of prior distribution. What are your thoughts on this?
* We are still in the process of structure learning. We intend to create the structure of the model from the available data and then improve it by adding or pruning some dependencies based on elicited knowledge from our experts and/or from gene ontologies. Then, we will be able to populate the model and test the posterior probabilities (Ali).
* Great explanation! Can you explain how you evaluate the effect of knocking off a gene? As in how do you find out what exactly is the outcome (probably more of a biology question)?
* As a first step, you check the difference (i.e. differential expression) relative to the control for all of these 17k genes. Hopefully you also see (think of it as a sanity test) that the gene you knocked out is the most significantly differentially expressed, followed by the rest of the genes. But the question is always: does this gene regulate this other gene directly or indirectly? Happy to discuss more on that :D
:+1: Would be nice to discuss if you have time (Divya Balasubramanian - CatsAI) :)
* Have you looked into Graph Convolution Neural Networks for modelling the interaction between genes?
* Yes, the plan is to use the pixel intensity on the images after biological grouping to draw a graph where the pixel location is the node location and the pixel intensity is the node size. The distance between pixels gives the edge.
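The pixel-to-graph construction described in the last answer might look roughly like this. It is only a sketch: the `max_distance` cut-off for connecting pixels is an assumption (the answer does not say which pixel pairs get an edge), and the 2x2 "image" is made up.

```python
import math


def image_to_graph(image, max_distance=1.0):
    """Build a graph from a 2D intensity grid.

    Nodes are pixel locations (row, col) with the pixel intensity as node size.
    Edges connect pixels at most `max_distance` apart (an assumed cut-off) and
    carry the Euclidean distance between the two pixels.
    """
    nodes = {}
    for r, row in enumerate(image):
        for c, intensity in enumerate(row):
            nodes[(r, c)] = intensity

    edges = {}
    coords = list(nodes)
    for i, a in enumerate(coords):
        for b in coords[i + 1:]:
            d = math.dist(a, b)
            if d <= max_distance:
                edges[(a, b)] = d
    return nodes, edges


# Hypothetical 2x2 intensity grid, for illustration only.
toy_image = [
    [0.1, 0.9],
    [0.4, 0.0],
]
nodes, edges = image_to_graph(toy_image)
```

The pairwise loop is quadratic in the number of pixels, which is fine for a sketch but would need a neighbourhood-based construction for full-size images.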
:battery::bulb:**Strathclyde**:bulb::battery:
* More of an engineering question, but could you use large-scale battery storage to "cushion" your system and slow down/reduce your risk of a crash? And how would this affect your simulation/modelling?
* Yes! Large-scale batteries can indeed be used to reduce the risk of a blackout. It is important to note that blackouts usually include the loss of lines, which limits the power transfer to a particular area. Batteries can help because they can be installed in a more dispersed manner, reducing the dependence on faraway generators. The modelling of batteries is well understood and there are simulation packages that can reliably simulate them. It's just that the utilities are yet to use these models for their simulations.
* To what extent could your system be described as a random projection from a low-dimensional dynamical system? Do factor analysis type methods for dynamical systems work well? https://arxiv.org/abs/1608.06315
* Interesting paper. I do not think the grid dynamics can be described by a 'random' projection. But there is ongoing research on approximating the dynamics with a lower-dimensional system by using non-linear features. The hope is that the features learnt by the auto-encoders/LSTMs are essentially these non-linear maps that project the system to a low-dimensional representation with minimal error.
* How are you planning on generalizing the results of ML models built on simulated data? +1
* Great question. The hope is to use the learnt features for anomaly detection (using auto-encoders) or time series prediction (using LSTMs) to understand how the grid behaves for different systems. The hope is that the features learnt by the auto-encoders/LSTMs are essentially these non-linear maps that project the system to a low-dimensional representation with minimal error.
* Would you expect noise in the input data to affect the performance of the ML model, if it has been trained with "clean" simulated data?
* At present we are not injecting noisy data into our models, so we cannot really say how they perform with noisy data. The noise can be simulated as an additive normal variable applied before the data is pushed into the model for training, but this increases the complexity and reduces the convergence rate of the ML models. This is a very valid question and is always asked by the power grid utilities of us researchers :).
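The additive normal noise mentioned in the last answer could be injected along these lines before training. This is a minimal sketch: the function name, the `sigma` value and the toy signal are assumptions, and a realistic noise level would have to come from the measurement characteristics of the grid data.

```python
import random


def add_gaussian_noise(signal, sigma, seed=None):
    """Return a copy of `signal` with zero-mean Gaussian noise of std `sigma` added."""
    rng = random.Random(seed)  # local generator so the global random state is untouched
    return [x + rng.gauss(0.0, sigma) for x in signal]


# Hypothetical "clean" simulated signal (e.g. a per-unit voltage trace).
clean = [1.0, 1.01, 0.99, 1.0, 1.02]
noisy = add_gaussian_noise(clean, sigma=0.01, seed=42)
```

Fixing `seed` keeps experiments reproducible; training on several noise levels and evaluating on held-out noisy data would give a first answer to how robust the model is.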