---
tags: dsh
---

# DARE national data research infrastructure report response

:::spoiler TOC
[TOC]
:::

## Demonstrating trustworthiness

#### 1. To what extent do you feel the recommendations accurately reflect the current challenges in this area?

- [ ] Completely – all major existing issues are addressed
- [x] (for us mostly) Partially – some could be better addressed, but most of the major issues are covered
- [ ] Inadequately – there are clear gaps in the issues addressed
- [ ] Don’t know

#### 2. Please explain your answer, addressing any challenges within this area that you feel are missing from or not sufficiently addressed in the recommendations.

* The focus on transparency and engaging with the public on keeping data safe is really good.
* The "state of use" register is a really good thing as a tangible output of this.
* Most large orgs anonymise datasets, but what we care about is misuse of access to data.
* Something missing: has anyone done any threat models/analysis for TREs? Somebody should do one.
* Everything they have is very original and very powerful.
* Something missing: with a certain number of steps to depersonalise a dataset, plus other controls to stop people accessing it, data can be functionally anonymous. We can be clear about how difficult it is to reidentify people (even if we wanted to). There is complexity here that can be discussed in a very tangible way, if we're explicit about threat models and about how different measures restrict possible effects. For the mechanisms we use to protect data (can we describe them in a way people can understand?), it should be easy for a project or TRE to say "we do a, b, c" and have people easily understand it. There is no common framework for talking about this at the moment, or for seeing what others are doing.
* X
* Final response:

#### 3. Are you aware of any initiatives not already mentioned in the report that are currently working on solving some of the issues covered in this area? Please comment.
* X
* X
* X
* X
* X
* X
* X
* Final response:

#### 4. Are there any recommendations you feel should be prioritised in this area? Please explain your answer.

* 4, then 1, then 6 – especially if we can leverage this into a central user accreditation model.
* X
* X
* X
* X
* X
* X
* Final response:

## Access and accreditation of researchers

#### 1. To what extent do you feel the recommendations accurately reflect the current challenges in this area?

- [ ] Completely – all major existing issues are addressed
- [x] (mostly) Partially – some could be better addressed, but most of the major issues are covered
- [ ] Inadequately – there are clear gaps in the issues addressed
- [ ] Don’t know

#### 2. Please explain your answer, addressing any challenges within this area that you feel are missing from or not sufficiently addressed in the recommendations.

* The key thing is transferability: for data with a similar risk profile, you should be able to carry forward the benefit of having gone through the process.
* The data access & standardisation recommendations seem very sensible.
* There are some places where they've latched onto specific technologies and given them too much importance.
* X
* X
* X
* X
* Final response:

#### 3. Are you aware of any initiatives not already mentioned in the report that are currently working on solving some of the issues covered in this area? Please comment.

*
* X
* X
* X
* X
* X
* X
* Final response:

#### 4. Are there any recommendations you feel should be prioritised in this area? Please explain your answer.

* The technical authentication work reads as an afterthought – their number 1 is not the number 1 priority. Doing 1 without 2 and 3 is pointless.
* X
* X
* X
* X
* X
* X
* Final response:

## Accreditation of research environments

#### 1. To what extent do you feel the recommendations accurately reflect the current challenges in this area?
- [ ] Completely – all major existing issues are addressed
- [x] Partially – some could be better addressed, but most of the major issues are covered
- [ ] Inadequately – there are clear gaps in the issues addressed
- [ ] Don’t know

#### 2. Please explain your answer, addressing any challenges within this area that you feel are missing from or not sufficiently addressed in the recommendations.

* A lot of these things (especially in 1) represent policies/procedures for protecting data, without the IG process as a whole being explicitly addressed. Maybe make more of a link between e.g. people, processes, the Five Safes...
* Data access is part of the picture. They talk about the Five Safes as principles you need to tune depending on the data.
* They've got it quite right in this area.
* DEA accreditation is quite heavyweight – do they talk about scale? If they recognise tiering, with different controls being appropriate, it would be appropriate to recognise different levels of accreditation. The bar for health-related data is quite high. There is a pathway up through DSPT, Cyber Essentials, ISO 27001... Expanding on what flexibility means would be useful.
* Mutual recognition is important – if this data can be held by an org with X accreditation, then any org with X accreditation should be able to hold the data; that would be a plus.
* DSPT is the de facto baseline.
* X
* Final response:

#### 3. Are you aware of any initiatives not already mentioned in the report that are currently working on solving some of the issues covered in this area? Please comment.

* Turing's data sensitivity classification model
* X
* X
* X
* X
* X
* X
* Final response:

#### 4. Are there any recommendations you feel should be prioritised in this area? Please explain your answer.

* X
* X
* X
* X
* X
* X
* X
* Final response:

## Data and discovery

#### 1. To what extent do you feel the recommendations accurately reflect the current challenges in this area?
- [ ] Completely – all major existing issues are addressed
- [x] (actually) Partially – some could be better addressed, but most of the major issues are covered
- [ ] Inadequately – there are clear gaps in the issues addressed
- [ ] Don’t know

#### 2. Please explain your answer, addressing any challenges within this area that you feel are missing from or not sufficiently addressed in the recommendations.

* A UKRI-wide registry for datasets is a lot of work. These things need to be recommended, but there is a lot to do!
* Is this a place where DARE should be supporting, not leading?
* The PET bit isn't in the more general DARE stuff.
* You need someone to describe the differences between different datasets.
* They should have good ways to identify pollution attacks on datasets.
* Making distributions available, so that as an accredited researcher you can go into any TRE and see the metadata.
* Who are the people accessing the data, and what are they trying to do with it? This is always much less sensitive than even a strongly deidentified dataset.
* PETs are about safe data; what people aren't doing very well is considering what levers are pulled in production – data needs to be secure enough for safe environments, not to go out in public. PETs are ways of getting safe inputs or safe outputs, which might require different things.
* Some of these are open research questions – tabular data is well handled; time-series and network data are not handled very well.
* There are ways to check whether someone has written code to take data away, and you can track this.
* Datasets are frequently updated; we are increasingly getting datasets that are more streaming in nature.
* Our PETs: synthetic data work going on at ATI with sensitive data, private federated learning, homomorphic encryption, etc.
* Something missing from the entire report: what are the timelines for their recommendations (e.g. an MVP for IoTs)? Different timeframes exist for different recommendations.
* What would DARE back in the immediate next few years?
* Final response:

#### 3. Are you aware of any initiatives not already mentioned in the report that are currently working on solving some of the issues covered in this area? Please comment.

* There should be lots we can reference here – the OpenData folk, the FAIR data folk, the reproducibility folk. The research support facility for the MLTC consortia is trying to do this.
* X
* X
* X
* X
* X
* X
* Final response:

#### 4. Are there any recommendations you feel should be prioritised in this area? Please explain your answer.

* Point 3 – there's a spectrum of privacy and maturity; none of these things should stop us pulling them into the mix and thinking about the access decisions we make.
* X
* X
* X
* X
* X
* X
* Final response:

## Core federation services

#### 1. To what extent do you feel the recommendations accurately reflect the current challenges in this area?

- [ ] Completely – all major existing issues are addressed
- [x] Partially – some could be better addressed, but most of the major issues are covered
- [ ] Inadequately – there are clear gaps in the issues addressed
- [ ] Don’t know

#### 2. Please explain your answer, addressing any challenges within this area that you feel are missing from or not sufficiently addressed in the recommendations.

* Transferability of these results into different industries: competing orgs with conflicting goals may not work, and you will have to at least mitigate some of these problems. If someone were to take this and naively apply it without clearly agreeing what their goals are, it may not work.
* X
* While we want all data subjects to know that their data is being used for public good and not maliciously, other orgs could be doing this better. There are problems with finance, regulatory gaps, etc. The reason health trusts aren't pooling their data is to protect individuals' rights; that's not the incentive for finance data, where the reason not to pool is that we don't want competitors to have it.
* The idea of federated services where a third party determines who gets access is really powerful – it prevents the sharing failure.
* COVID – repurposing of collected data by government and industry; there's something there about integrating data from lots of sources and using that to inform government. A benevolent use case, but it still has challenges and problems across industries/sectors.
* A lot of the considerations here aren't unique to TREs: we have siloed clusters of compute and data and aren't very good at connecting them, and the UX/onboarding/familiarisation cost is quite high. These aren't unique problems that DARE needs to solve – they haven't solved them, so don't deserve to lead on them. The challenges around sensitivity of the data are one aspect that's really important; factors around training from a research perspective are also important. Different environments, different access requirements, inability to combine or reuse data – all of this comes up more broadly.
* It's about connecting/augmenting what's there already. These are conversations that have been had, and lessons learned, elsewhere; we should connect into this work. The needle hasn't moved much on this, though, so maybe it's also about recognising that it's a hard problem and that stuff can be learned from what came before.
* Sandpit environments – wrapping a more flexible, independently configurable TRE around things; the conversation can be about how to operate around this.
* Final response:

#### 3. Are you aware of any initiatives not already mentioned in the report that are currently working on solving some of the issues covered in this area? Please comment.

* ICO's Regulatory Sandbox
* DCMS research data cloud, Future of Compute review
* X
* X
* X
* X
* X
* Final response:

#### 4. Are there any recommendations you feel should be prioritised in this area? Please explain your answer.
* Recommendation 3 – could there also be a call for projects to connect existing TREs and wider infrastructure? Discovery can then lead to number 2.
* X
* X
* X
* X
* X
* X
* Final response:

## Capability and capacity

#### 1. To what extent do you feel the recommendations accurately reflect the current challenges in this area?

- [ ] Completely – all major existing issues are addressed
- [x] Partially – some could be better addressed, but most of the major issues are covered
- [ ] Inadequately – there are clear gaps in the issues addressed
- [ ] Don’t know

#### 2. Please explain your answer, addressing any challenges within this area that you feel are missing from or not sufficiently addressed in the recommendations.

* It's not often enough that investment is made, alongside hardware and technical stuff, in things like getting people to learn how to use it, people around the environment to support it, and upskilling people who are coming in to use the environments. For TREs, what is the upskilling doing? They are doing it well for data governance (e.g. NHS data security training). Centrally run TREs charge for the people that run them as a service. Is there combined messaging to government around strategic investments, not just about technology?
* With automation, you need to understand the thing that's being done: switch away from needing people who are pressing buttons (and are they correctly applying the model?) towards systematising it – but you still need expertise in the systematisers. This means making better use of expertise, because less of the work is about pressing buttons and more is about thinking through what the effective controls are.
* Re the salary point – it's a pipe dream if we're trying to match Google/finance salaries. We don't want them to say that salaries are not as competitive/appealing as they used to be.
* X
* X
* X
* X
* Final response:

#### 3. Are you aware of any initiatives not already mentioned in the report that are currently working on solving some of the issues covered in this area? Please comment.
* The RSE community's work to formalise it as a profession (though with quite an academic focus) – the RSE Society.
* The Software Sustainability Institute, plus orgs that are working in the open so anyone can contribute.
* X
* X
* X
* X
* X
* Final response:

#### 4. Are there any recommendations you feel should be prioritised in this area? Please explain your answer.

* Recommendation 4 – we can suggest our own Turing work to support this.
* For 'TRE in a box' we can reference our own, which we're open sourcing soon.
* Describing these roles more effectively, as to exactly what they are.
* X
* X
* X
* X
* Final response:

## Funding and incentives

#### 1. To what extent do you feel the recommendations accurately reflect the current challenges in this area?

- [ ] Completely – all major existing issues are addressed
- [ ] Partially – some could be better addressed, but most of the major issues are covered
- [ ] Inadequately – there are clear gaps in the issues addressed
- [ ] Don’t know

#### 2. Please explain your answer, addressing any challenges within this area that you feel are missing from or not sufficiently addressed in the recommendations.

* We should really strongly support the call for funding for teams like Turing. Where things aren't unique to the health space we should call that out (as with the compute thing).
* All for a call to support national sensitive data research infrastructure.
* TPS is keen for there to be support for maintenance costs across other types of projects/tools/infrastructure, not just this.
* More and more, the code we do our work with is similar work with similar requirements. Even at the enabling level it's software.
* Can we connect things up more coherently?
* If we have a connected way to do this research, there will be tools (synthetic data, pollution testing) that come out of it. Not just code/metadata standards, but also people and processes, e.g. accrediting people, processes, etc. Plus the trustedness – all these pieces pull together as a good checklist for any open infrastructure project.
* If you don't run long-term investments – so it's always new teams and new projects – you don't get the accumulated returns. One point is having funds for experimentation, not separate from the wider infrastructure. Running it over many decades, you need to manage, maintain, and work on new problems/governance issues as they arise; funding needs to cover all of this plus experimentation. The DARE projects wouldn't have done anything if they hadn't built on what was already there. Do you stop funding projects and start funding people? We're funding things as projects and not as long-term infrastructure. There's a more general point to make here about how research is funded.
* We know of plenty of other things that need to be funded long-term and centrally, e.g. email systems at universities – maintenance should be seen similarly. Another example is Cambridge lab leases. For the Large Hadron Collider or genome projects, no one would refuse to fund these for long periods at a time. For orgs that curate a lot of data, there are lots of examples where continuity of funding is accepted. Why is data not done the same way?
* Final response:

#### 3. Are you aware of any initiatives not already mentioned in the report that are currently working on solving some of the issues covered in this area? Please comment.

* The above projects – they couldn't have done what they did without knowing there was continuity of funding.
* X
* X
* X
* X
* X
* X
* Final response:

#### 4. Are there any recommendations you feel should be prioritised in this area? Please explain your answer.

* Don't compare what they've proposed in DARE with the Goldacre report – there are small differences that matter to people in privacy. They are data research, not NHS. If there were operational things that go out to GPs/paramedics, there are things we should think about.
* X
* X
* X
* X
* X
* X
* Final response:

## Final Questions

#### Are there any other thoughts or comments you would like to share on the draft report and recommendations?

* Synthetic data is mentioned in a few places; it solves a lot of issues around handling sensitive data, but there are loads of potential ethical issues around it that could surface afterwards.
* X
* X
* X
* X
* X
* X
* Final response:
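As a purely illustrative aside on the synthetic-data point above (a toy sketch under our own assumptions, not anything described in the DARE report): even the most naive form of synthetic tabular data – resampling each column independently from the real data – shows concretely where both the privacy benefit and the residual utility/ethics questions come from.

```python
import random

# Illustrative only: the function name, fields, and records below are
# invented for this sketch. Independent per-column resampling preserves
# each column's marginal distribution but deliberately destroys the
# joint structure between columns - which is why it can reduce linkage
# risk, and also why questions about utility and about plausible-but-
# false records remain.

def naive_synthetic(records, n, seed=0):
    """Return n synthetic rows, sampling each field independently
    from the values observed for that field in `records`."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    fields = list(records[0].keys())
    columns = {f: [r[f] for r in records] for f in fields}
    return [{f: rng.choice(columns[f]) for f in fields} for _ in range(n)]

# Toy example records (entirely made up)
real = [
    {"age_band": "30-39", "region": "North East", "condition": "asthma"},
    {"age_band": "60-69", "region": "London", "condition": "diabetes"},
    {"age_band": "30-39", "region": "London", "condition": "asthma"},
]
fake = naive_synthetic(real, 5)
```

Real synthetic-data generators (e.g. Bayesian-network or GAN-based approaches) also model the correlations between columns, which is exactly where the harder privacy and ethics trade-offs appear.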