:small_blue_diamond: **April 2021 Data Study Group Presentations** :small_blue_diamond:

Welcome to the Data Study Group presentation session. You will be invited to share your comments and questions with each group after their presentation. Please record your questions in the sections below.

----

:clock1: :clock9: **Agenda** :clock9: :clock1:

10:05am - Opening remarks
10:10am - AMRC
10:40am - CityMaaS
11:10am - Comfort Break
11:20am - DWP
11:50am - Odin Vision
12:20pm - Comfort Break
12:30pm - Entale
13:00pm - Closing remarks

-----

**Questions & Comments**

---

:nut_and_bolt: **AMRC Questions & Comments** :nut_and_bolt:

*Leave your feedback and comments below:*

* What do you think were the main reasons for the difficulties with generating synthetic data?
* For the sensor data, any thoughts on how to ensure the synthetic data generated is 'good'? It is not like the photo data, which could be checked by viewing it.
* Does your solution require a batch of data, or can it classify using a live feed?
* Is classification accuracy the only metric you looked at? Are there others that would work well for your problem domain?
* Was the goal to classify current faults based on current performance data? Or was it predicting future failure based on truncated time series data ending before the fault? Current faults.
* What architecture did you try for the neural networks?
* Since this is time-series data, did you consider LSTMs?

----

:city_sunrise: **CityMaaS Questions & Comments** :city_sunrise:

*Leave your feedback and comments below:*

* Have you accounted for bad crowdsourced data? Trolling, inaccuracies, intentional sabotage, etc.?
* The initial results point to a very imbalanced dataset - how did you account for that?
* Why is the F score for the latitude larger than for the longitude?
* Route recommendation seems like a classic genetic algorithm optimization problem - is that something you're looking at?
* What is the algorithm used for finding alternate routes that avoid elevation problems or obstacles?
* Suggestion - you could try fault-tolerant algorithms for live updates/suggestions of the routes

----

:pound: **DWP Questions & Comments** :pound:

*Leave your feedback and comments below:*

* Out of the two methods of generating synthetic data, which one produces results closer to the real data? - Both are relatively similar, but we'll have full details in our report once we get the chance to run a full test battery.
* Did you get the chance to measure what the largest computational cost was when generating synthetic data? (Did your code spend most of its time in random number generation functions?) Unfortunately no, but that's a very interesting question.
* Could you explain more about the importance of kernels (and the difficulty in choosing kernels)? Yep, here's the original paper: http://www.gatsby.ucl.ac.uk/~gretton/mmd/mmd.htm. The result of the distance metric is entirely based on the kernel transformation chosen, so in order to produce comparable results for multivariate columns, you need to carefully pick the right one (see the sketch after this list). [Thanks!]
* How do you think the bias was introduced for Dataset A Vendor C? That's somewhat beyond our scope, since we don't have all the details about how they generated the set - but we can see that their distributions aren't as close as Vendor B's, so this could be due to training on a restricted sample which happens to include biased results.
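Below is a minimal sketch (not the DSG code) of the squared-MMD statistic from the Gretton et al. paper linked above. The RBF kernel and its bandwidth are illustrative assumptions; swapping the kernel changes the distance, which is exactly the choice discussed in the answer.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2 * a @ b.T
    )
    return np.exp(-sq_dists / (2 * bandwidth**2))

def mmd2(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD between samples x and y."""
    return (
        rbf_kernel(x, x, bandwidth).mean()
        + rbf_kernel(y, y, bandwidth).mean()
        - 2 * rbf_kernel(x, y, bandwidth).mean()
    )

# Toy check: 'real' data vs. a synthetic sample from a nearby distribution.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 3))
synthetic = rng.normal(0.1, 1.0, size=(200, 3))
print(mmd2(real, synthetic))  # small value -> distributions look similar
```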
---

:microscope: **Odin Vision Questions & Comments** :microscope:

*Leave your feedback and comments below:*

* Did you explore the correlation between the latent space dimensions and the polyp labels?
    * Hope we answered this question in the Q&A session. Just to reiterate, we haven't explored this avenue within the DSG, but we would like to add a classifier on the features derived from the VAEs. Examining the classification performance will allow us to understand how useful the derived features are.
* I really enjoyed your presentation! It's the first time I've grasped how interpretability can work, so thanks for taking the time to make the presentation clear and visual. So interesting!
    * Glad you enjoyed it, that's great to hear!
    * +1 - thank you for the very kind comment! We have to make ML models interpretable if we want to use them for important things.
    * +1 - This is music to our ears and makes all of the hard work totally worth it.
* What are the challenges in bridging the attribution study and the uncertainty estimation? I can imagine these two methods combined will be very useful and could complement each other.

---

:books: **Entale Questions & Comments** :books:

*Leave your feedback and comments below:*

* What's the connection between Wimbledon and Github (word cloud)? Or is this a spurious correlation?
    * *Response: Likely to be a spurious correlation.*
    * *Response: Thanks for the question! Although it seems weird, remember that the network-based approach takes nothing but the connectivity information as input. The output indicates only the topological pattern of the network, which we expect to be a good approximation of semantic topics (could be wrong though!). Rather than treating the observation as spurious, I think it could also be a potential 'discovery' of a new correlation. One way to verify is to look into the data and check the associated podcasts. Another possibility is that the resolution of the clustering shown in the presentation could be misleading - if you look at a higher resolution of the clustering, meaning nodes in the same communities are split further, then these less interpretable clusters may disappear.*
* How are named entities detected in the episodes?
    * *Response: The named entities were given by Entale, who used an in-house tool. But these could also be obtained using entity recognition models, e.g. https://spacy.io/api/entityrecognizer (see the sketch below).*
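A minimal sketch of that spaCy alternative; the `en_core_web_sm` model and the sample text are illustrative choices (Entale's in-house tool is separate and not shown here):

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Hypothetical snippet of an episode description or transcript.
text = (
    "In this episode we talk about Wimbledon, developer culture at Github, "
    "and an interview recorded in London."
)

doc = nlp(text)
for ent in doc.ents:
    # Each detected entity comes with a predicted label such as ORG or GPE.
    print(ent.text, ent.label_)
```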
* Have you considered applying a perturbation across the embedding to push the recommendation away from the obvious results?
    * *Response: Currently we haven't, but as mentioned in the future work, "Approximate Nearest Neighbors" is one of the motivations towards it.*
* How do you balance the tradeoff between recommending podcasts very similar to what people have listened to and new podcasts that are very different from the user's history? I'm curious about the topic discovery algorithm in the recommendation system.
    * *Response: Thanks for the question. Currently the approach we are following here is a weighted average (simple, or exponential) of the history, which should help capture the effect of the history - we are experimenting with different weighting schemes. The other approach we are following is the network-based approach built on entities. We believe that merging the results from both approaches should help in capturing both aspects, the history and the variability. There are other approaches we are experimenting with: 1) instead of taking the most similar recommendations, we sample from the candidate recommendations, which should add some variance to the recommendations and move the user out of their nearest chamber; 2) trying "Approximate Nearest Neighbors", which we believe should also help in pushing the user gradually out of the nearby space. (A toy sketch of the weighting-plus-sampling idea appears at the end of this section.)*
* There was a slide which outlined an example of listening history - European football and combat sports with some weighting to them. The recommendations seemed to vary quite a lot, with American sports (NBA, etc., if I remember) - I was expecting similar shows (punditry from ex-sports hosts, or fan pods). I'd be worried that users would bounce from Entale if they felt recommendations didn't align with their listening history.
    * *Response: I think this is a good point to bring up. My thinking is that while the user history indicates they like sports (football and UFC in particular), they have only listened to 4 episodes, which isn't a lot in general, and perhaps the uncertainty is high at the beginning of the user's journey. We did find that users' recommendations became more specific as they listened to more. But in addition to that, we also in some ways want to give the user recommendations that allow them to explore different areas/topics. Controlling that, and balancing between recommendations that are very similar to the user history and recommendations that push the user to explore, is a difficult challenge that we looked at, and further study is needed. We also want to highlight that this is one particular recommender we have developed. Perhaps the rabbit hole recommender discussed will be able to better identify episodes that have a direct entity relation to the previous podcasts, where it could be in a better position to pick up the specific football- or MMA-related entities you've mentioned here.*
* Just some thoughts I had around members of the communities: some of them are firm center members and some are more corner/border members who are flexible in listening to different areas. Maybe these border members could be a source for a direction to go - they form connections between clusters. For example, some might listen to cooking channels, and then a subgroup of that community is also interested in fitness. A bit like an opinion leader in network communities. Great talk!
    * *Response: This is an interesting point and reminds me of one of the seminal works in the network science domain - "The Strength of Weak Ties" (Mark S. Granovetter - https://www.jstor.org/stable/2776392?seq=1). I believe this will be an interesting phenomenon to observe or to consider for the "Rabbit Hole" approach. Thanks!!*
* One of the data sets provided contains links extracted directly from podcast descriptions (placed there by the publisher). These are likely to be super relevant to the podcast and could form part of a rabbit hole journey. Do you have any thoughts on how to approach integrating these?
    * *Response: This is an excellent point! We haven't used those yet, but they could be added as entities related to the episode and then fed into the system further on - be it the topic model, the network-based approach, or some other approach. Thanks for pointing this out.*
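A toy sketch (illustrative only, not the DSG implementation) of the two ideas from the tradeoff response above: build the user profile as an exponentially weighted average of the episode-embedding history, then sample a recommendation from the top-k candidates instead of always taking the most similar one. All data and the `decay`/`k` values here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative stand-ins: embeddings for a catalogue of episodes and for the
# user's listening history (most recent episode last). Real embeddings would
# come from the topic model or the network-based approach described above.
catalogue = rng.normal(size=(1000, 64))
history = rng.normal(size=(4, 64))

# Exponentially weighted average of the history: recent episodes count more.
decay = 0.7  # one possible weighting scheme
weights = decay ** np.arange(len(history))[::-1]  # oldest gets decay**3
profile = (weights[:, None] * history).sum(axis=0) / weights.sum()

# Cosine similarity between the user profile and every catalogue episode.
sims = catalogue @ profile / (
    np.linalg.norm(catalogue, axis=1) * np.linalg.norm(profile)
)

# Instead of argmax, sample from the top-k candidates so recommendations get
# some variance and the user is nudged out of their nearest "chamber".
k = 20
top_k = np.argsort(sims)[-k:]
probs = np.exp(sims[top_k]) / np.exp(sims[top_k]).sum()  # softmax over top-k
recommendation = rng.choice(top_k, p=probs)
print("recommended episode index:", recommendation)
```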