# The Turing Way - Project Design Chapter based on Data Study Group
## Life cycle of a project overview

## Overview
This case study gives an overview of the various elements that a project manager must consider when embarking on a new data science project, from the proposal phase to the research phase, including:
* Scoping & defining
* Data preparation
* Recruiting talent
* Data access and security
* Ethics assessment
* Impact planning
### Who can help?
- Daisy Parry
- Jules Manser
- Will be opened for review by anyone who is interested
## Elements of project design
# Scoping & Definition
### Initial Considerations
The items below must be considered at the conception stage so that the feasibility of the project can be assessed and the project designed effectively. A great research question sadly doesn’t always equate to a great project.
#### What/ How
* What are the research objectives of the project? Do the research objectives make sense? For example, have many similar projects taken place before? If so, is this piece of research still necessary because it examines a new or novel aspect?
* What data will be used for this project? How will this data be collected, how will permission be obtained to use it, how will it be cleaned/ merged and formatted in preparation for the research? Will the data contain sensitive information? If so, how will the data be stored and accessed to ensure it remains secure?
* What methods of analysis is the project proposing? What research resources will be required to support them?
* How will the results be used when the project has concluded, to ensure that all the hard work doesn’t end up in the drawer?
#### Who
* Who will be involved in preparing, conducting and presenting the findings? Are they in place already? If not, how will they be recruited?
* Are your institution and researchers the right people for the job? Are there existing groups that should be invited to collaborate, or consulted, before work begins?
#### Why
* What will the impact be on the immediate stakeholders of the project? What will the impact be on the research community? What are the possible benefits and harms to wider society? What are the ethical implications of the project and how can these be mitigated?
### Idea to Proposal
When an idea is first conceived, it is unlikely to arrive as a perfectly formed research question. It will likely enter a scoping phase, which will tease out a sensible and feasible version of the question and its sub-questions.
For example, if the key question is fairly general or wide-reaching, it will need to be broken down into sub-questions that can then be tackled empirically.
The proposed data might then not be suitable for answering the adjusted question, so further re-framing might be necessary until the question is one the data can be expected to answer.
### The Initial Proposal
The project proposal (or a similar document) is one of the first items that will be produced. This will likely be used to pitch the project to collaborators and funders, and to generally spread the word of its conception to stakeholders and the research community.
An initial review of the data should take place before the proposal is written, to identify obvious issues. If there are major issues, these should be addressed, or at least mitigation plans defined, before the proposal is finalised.
The initial proposal does not need to be perfect and should be considered a document for iteration.
The proposal should map out key elements of the project as much as is possible at the time of writing. The exercise of writing the proposal will likely reveal gaps and that is ok!
The project proposal will likely be written by the scientific researchers involved in the project, with peer review encouraged. It's not expected that the full team will be in place by this stage and the proposal may well be used to attract talent to the project.
Equally, the project partner could have authorship. In the context of Data Study Groups, it is the organisation proposing the project that writes the project proposal. In such cases we advise that the proposal is subject to academic review.
Concretely, a data science project proposal should include:
* A description of the project and the challenge it seeks to solve.
* Background information: Why is solving this challenge beneficial? What are the main difficulties, and what approaches have been considered or tried before (if applicable)? Why is the project necessary to solving the problem, and what is the scientific basis for this?
* Data information: What data will be used for the project? It might be useful to include information on each dataset, including a data inventory, size, variable descriptions, a description of the data collection mechanism, the level of data sensitivity/confidentiality, etc. How will this data need to be handled? Are there any known issues with working with data of this kind and, if so, how will these be mitigated? How will permission be obtained to use this data?
* Project impact: What is the expected impact of the project on its direct stakeholders, wider society and the research community? What would channels of follow-up work look like after the project, and what is the intended use of any findings?
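The data information item above can be easier to review if each dataset is recorded in a consistent structure. The following is a minimal sketch of one way to do that, not a prescribed template: the `DatasetRecord` class, its field names and the example dataset are all hypothetical and should be adapted to the project.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetRecord:
    """One entry in a proposal's data inventory (field names are illustrative)."""

    name: str
    size_description: str   # e.g. approximate number of rows or file size
    collection_method: str  # how the data was, or will be, collected
    sensitivity: str        # level agreed with the data provider
    permission_route: str   # e.g. "data sharing agreement"
    variables: List[str] = field(default_factory=list)
    known_issues: List[str] = field(default_factory=list)

    def open_questions(self) -> List[str]:
        """Return the names of required fields that are still empty."""
        required = (
            "size_description",
            "collection_method",
            "sensitivity",
            "permission_route",
            "variables",
        )
        return [key for key in required if not getattr(self, key)]


# Hypothetical dataset, used only to show how gaps surface.
record = DatasetRecord(
    name="customer_survey",
    size_description="",  # not yet known
    collection_method="online questionnaire",
    sensitivity="contains personal information",
    permission_route="",  # agreement not yet drafted
    variables=["age_band", "region", "response"],
)
print(record.open_questions())
```

Listing the empty fields makes the gaps in the proposal explicit, which fits the idea that the proposal is a document for iteration.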
# Data Preparation
Considerations of the data suitability, readiness and collection should begin at the time of project proposal development. If the data is unsuitable, incomplete or won't be ready in time then the whole project could be compromised. Data preparation should begin as soon as the project question is finalised.
Initial data readiness will be different for every project. In some cases the data may not exist and devising a method of collection will be the first action. In other cases, pre-curated datasets may be used, with the key action being obtaining the rights to use them.
Data is research ready when it has been collected, constructed, cleaned and checked for gaps, when potential sensitivities have been considered (and mitigated), and when you have the right to use and work on the datasets.
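As an illustration of what "checked for gaps" can mean in practice, the sketch below counts missing values per column in a small CSV sample. It is one possible starting point rather than a complete data quality check. The function name, the set of markers treated as missing and the sample data are all assumptions made for this example.

```python
import csv
import io
from typing import Dict

# Assumption: these strings mean "missing". Adjust for each dataset.
MISSING_MARKERS = {"", "na", "n/a", "null", "none"}


def missing_fraction_by_column(csv_text: str) -> Dict[str, float]:
    """Return, for each column, the fraction of rows whose value looks missing."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    missing = {column: 0 for column in columns}
    total_rows = 0
    for row in reader:
        total_rows += 1
        for column in columns:
            # Short rows yield None for absent fields, so treat None as empty.
            value = (row.get(column) or "").strip().lower()
            if value in MISSING_MARKERS:
                missing[column] += 1
    if total_rows == 0:
        return {column: 0.0 for column in columns}
    return {column: count / total_rows for column, count in missing.items()}


# Hypothetical sample data.
sample = """id,age,postcode
1,34,AB1
2,,CD2
3,NA,
4,51,EF3
"""
report = missing_fraction_by_column(sample)
print(report)
```

A report like this only shows where values are absent. Whether the gaps matter still has to be judged against the research question.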
## Key Considerations
* Data Readiness: If the data doesn't yet exist, a method of collection must be devised. If the raw data exists but has yet to be curated for purpose, a methodology for this must be established. If the data is nearly ready, then it must be checked for gaps and cleanliness.
* Data Appropriateness: Data should be highly relevant to the research question. Even if the data looks suitable, there may be additional datasets or resources that already exist that can enrich and improve the scope of research.
* Data Quantity: The dataset must be large enough to run analysis and experimentation effectively. If not enough data points exist, a method of obtaining more should be pursued. This could be by obtaining complementary datasets, generating synthetic data or collecting new data. Equally, researchers may be faced with an abundance of data, so much that meaningful analysis becomes hard. In such cases researchers may need to edit and refine what data will be used.
* Data Sensitivity: How sensitive is the data that will be used? The more sensitive the dataset is, the more restrictions will be needed to protect it during the research phase, which inhibits researchers' ease of access and experimentation. If the data contains personal information or sensitive commercial information then it is likely to be highly sensitive. In such cases it may be possible to reduce the sensitivity of the data by either removing or anonymising the sensitive fields. Even if the data is not especially sensitive, the project as a whole may produce sensitive results, so it may be worth taking the same measures to reduce sensitivity as much as possible. This reduces the security measures that will be necessary and minimises the negative implications of a data breach.
* Data Completeness/Reliability: The data should be checked for missing observations and unreliable data points. Any incompleteness or unreliability must be assessed for its impact on the project, along with what can be done to minimise missing values and maximise overall reliability.
* Data Permissions/Legal Considerations: It is essential that you have the right to use the data. If the data has been generated in house and involves no human data, then further permission may not be necessary. However, in many cases the data will come from a collaborator or from a third-party data provider. In these cases, at minimum a data sharing agreement should be enacted so that both parties are protected. Depending on the organisations involved, this may be a straightforward process, or may take many iterations and discussions with legal teams. Data sharing agreements should therefore be discussed from the outset and be one of the first actions when considering data preparation. If the data contains personal information, such as patient data, then you must be able to prove that the subjects have consented to their data being collected and used.
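The Data Sensitivity consideration above mentions removing or anonymising fields. The sketch below shows one way this could look: some columns are dropped outright and others are replaced with a keyed hash (HMAC-SHA256). Note that keyed hashing is pseudonymisation rather than full anonymisation, because the remaining columns may still allow re-identification and anyone holding the key can recompute the mapping. The function name, column names and key are hypothetical.

```python
import hashlib
import hmac
from typing import Dict, Iterable, List


def reduce_sensitivity(
    rows: Iterable[Dict[str, str]],
    drop_columns: Iterable[str],
    pseudonymise_columns: Iterable[str],
    secret_key: bytes,
) -> List[Dict[str, str]]:
    """Drop some columns and replace others with a keyed hash of their value."""
    drop = set(drop_columns)
    pseudonymise = set(pseudonymise_columns)
    cleaned = []
    for row in rows:
        new_row = {}
        for column, value in row.items():
            if column in drop:
                continue  # remove the field entirely
            if column in pseudonymise:
                # Same input and key give the same token, so records stay linkable.
                digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
                new_row[column] = digest.hexdigest()
            else:
                new_row[column] = value
        cleaned.append(new_row)
    return cleaned


# Hypothetical records. The key must be kept secret and stored apart from the data.
rows = [
    {"patient_id": "P001", "name": "Ada Example", "age_band": "30-39"},
    {"patient_id": "P001", "name": "Ada Example", "age_band": "30-39"},
    {"patient_id": "P002", "name": "Bob Example", "age_band": "40-49"},
]
safer = reduce_sensitivity(rows, ["name"], ["patient_id"], b"replace-with-a-real-secret")
print(safer[0].keys())
```

Whether a measure like this is enough depends on the dataset and should be agreed with the data provider. It does not replace a sensitivity assessment of the project as a whole.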
# Recruiting Talent
The team needed to work on each project will vary from project to project.
In most cases, a project will require a Principal Investigator (or similar role). They will likely be a senior academic with expertise in the project area and experience leading on similar investigations. They will act as a champion of the project and as its scientific sounding board.
A PI may be in place naturally from the start of the project, or they may need to be recruited as the project comes to fruition. For larger projects a full-time research team may need to be recruited; for smaller projects, support from PhD candidates may be more appropriate.
# Data Access & Security
It is essential that third party providers have confidence that their data will be handled appropriately with the necessary security measures, and equally that researchers are protected while working on the project.
The sensitivity of the data, or its conditions of use, will dictate how it can be shared with researchers and how they can work on it.
When assessing the data’s sensitivity, it is important to consider the project as a whole and wider uses of the data. For example, publicly available satellite images might not seem sensitive, but if the project then seeks to extract a list of properties with specific features, this list could become quite sensitive, especially if the occupants have not consented to their addresses being on such a list. Another example is Twitter posts: these are public and can be scraped by anyone, but if you were to compile a list of users with certain political beliefs, this would become extremely sensitive information.
## Data Transfer
If a project is considered sensitive, security measures will be required when the data is transferred from a third party to the research manager.
Tools such as Azure Storage Explorer could be used, in which a secure one-way upload link is sent to the device uploading the data. This link is usable only from that device's specific public-facing IP address.
Alternatively, the data could be transferred by hand.
It is likely that each institution will have its preferred method of transferring data of this nature, so a discussion may be required to meet the system and security needs of each party.
## Data Storage & Access
When deciding how to store the data, managers should consider who will need access and what they will need to do with the data in terms of research. This may impact the storage method used.
Non-sensitive data may not require the full security measures that sensitive data does. However, it should still be stored securely, so that only those who need access have it. This is good practice for handling data.
Sensitive data, once transferred, should be placed in a secure location with appropriate security measures. This could be a research environment such as a 'Safe Haven' or an equivalent trusted secure platform. Depending on the storage method, this may need to be deployed or set up beforehand, so be sure to establish early on whether it is needed.
Following the Turing’s 'Safe Haven' Model of secure research environments, there are various security measures that can be implemented depending on the sensitivity of the contents. These can include restrictions on internet access, packages, and copy and paste functions from within the environment. The more restrictions in place from within the environment, the slower it will be for researchers to work on the project. It is therefore essential that only necessary measures are in place. Researchers and the data provider should agree beforehand on how sensitive the project is, as well as what security measures should be present.
# Ethics Assessment
In most academic institutions it is necessary for incoming projects to go through some kind of ethics application and approval process. Even if this is not a requirement, the principles of ethics assessment should be applied to the project.
The ethics application and approval process should be integrated into the thinking about the project, rather than treated as administrative box ticking. The ethics process is to be completed by the academics working on the project, who have scientific insight, rather than by research project managers.
Academics must collate the required information. If more information is required for a full ethics assessment, it is the academics' responsibility to obtain this from collaborators.
The core ethical questions to be considered are below. The function of each is to identify risks and inform how the research plans to minimise or eliminate those risks:
* What are the ethical implications of the project?
* How was the data collected?
* How was consent obtained for collecting the data?
* Will any issues of privacy & security arise from the project or resulting outputs?
* How does the project plan on keeping the data private & secure?
# Impact planning
It is important to consider what will happen post project so that the work doesn't end up in the drawer. This should be done as the project is being designed, as it might influence the tools used. Start with the end in mind so that when the project finishes, plans are in place to do it justice, ensuring findings will be put to meaningful use.
Channels for ensuring longevity might include:
* Planning follow-on projects: Could the work undertaken lead on to a bigger follow-on project? If so, who might this be relevant to? Are there relevant research groups or academics that could be approached?
* Publishing: Can the work be published after the project is complete? If the project contains sensitive information and data, can a version be published that omits the sensitive information? This will maximise impact on the scientific research community, but be careful that redaction does not alter the overall narrative. Again, this is a consideration at the project proposal stage.
* Exposure: How will the work be publicised to reach the widest audience? Are there relevant newsletters or bulletins that it could be included in, or social media groups that it could be shared via?
* Formatting: If the work or code can be shared, could the format be designed so that it is as compatible as possible with a wide audience? For example, by using widely used and accessible tools rather than specific paid-for versions.