**Data Study Group Report Checklist**

**Must-have sections:**

- There should be an executive summary, right at the start. It should be technically correct, but readable by a non-data-scientist.
- There should be a section which explains in technical detail how the data scientific approaches were derived from the domain questions, and what these approaches are.
- There should be a section which explains the data tables that were available: for each table, the sample size, rows (samples), and columns (variables). If there are multiple tables, it should be explained how the tables are linked. This section should also explain how the data was cleaned, or how extracts were prepared for subsequent experiments, with consistently used reference names if there are many (one way to document such extracts is sketched at the end of this checklist).
- Optimally, there is a section on exploratory data analysis (EDA). If this was not recorded, this should be clearly stated, and a full EDA suggested as future work.
- There should be a longer section on limitations. It is a good idea to split this into: data limitations, i.e., limitations of the data with regard to the questions, with suggestions on how to remedy them, e.g., by acquiring more data; limitations of the analyses as they were performed; and limitations that arise through omission, e.g., forgetting to record something. The latter is entirely defensible, since DSG week is hectic, but it should be noted that the work ought to be properly re-done when there is more time and leisure, e.g., in a sub-section “results pending corroboration” or “results not recorded”.
- There should be a section on future work: what would the group suggest doing if a follow-up happens? Usually the analyses will not be entirely complete, so consider how they would be completed. It is also important to reflect on the challenge questions that were not addressed at all during the week. Similarly, consider how the use case problem could be addressed with different data, if it were available.
- Further sections (usually in the middle) should describe the different approaches. It is a good idea not to split or order these by individual, but by scientific approach.

**Executive summary**

- Challenge questions should be concordant with those set out in the 2-page description. They need not be identical and can be shortened, but they should not differ substantially.
- “Approach” and “findings” should reference the full set of challenge questions – avoid tacitly ignoring a key question.
- If the group ended up not working on certain questions, for whatever reason – e.g., it was found that the data could not answer them, or that a re-formulated question would be better – mention this explicitly and explain: this is valuable information.

**Section 2 (or 3) – “explaining the approach”**

- There should be a section which explains how the domain questions map onto data scientific approaches – this is very important work, and sometimes one of the key informative outputs.
- Optimally, this should recount the full process of brainstorming, planning, and selecting avenues of exploration – all of this is valuable insight generated during DSG week.
- The section should present problems separately from solutions – e.g., explain the supervised learning task separately from the specific algorithms that solve it.
- Citations for approaches should be given: a textbook reference for a common off-the-shelf approach (e.g., “supervised learning”), or a peer-reviewed academic publication for a more specialized approach.
- If an approach is novel and/or experimental, inspired by the challenge, that is fine – indeed a desired outcome if such an approach is found to be necessary, since it highlights a need for research. However, if an experimental approach is used, it should be expressly highlighted, and the “limitations” and “future work” sections should point out that further work and peer-reviewed publication are necessary for best scientific practice.

**Reporting of experiments/approaches**

- Some approaches – e.g., visualization, unsupervised learning, GUI design – will not fit the template “experiment” section’s format well. Check whether the structure is appropriate.
- All experiment/approach sections should state clearly which data they were run on, referencing the data views/extracts from the “data” section.
- All experiment/approach sections should clearly and centrally state the purpose, the problem to be addressed, the scientific hypotheses to be checked (if experimental), or the type of hypotheses to be generated (if exploratory).
- “Experiments” should be sections describing the common workflow of (1) method building and (2) evaluation with respect to a task (e.g., supervised learning).
- The following need to be described in a benchmark/performance estimation experiment (a minimal sketch illustrating points (i)–(vii) is given at the end of this checklist): (i) the data on which it is conducted; (ii) precisely which contender methods/algorithms are used, including tuning; (iii) which of the methods/algorithms are uninformed baselines and which are state-of-the-art methods (if applicable); (iv) how exactly test/training sets/splits are selected, if applicable; (v) the performance criterion with respect to which the methods are evaluated; (vi) how confidence intervals were computed (if applicable); (vii) how precisely the methods were compared.
- If any of the above is unknown, state this clearly (e.g., under “limitations of analyses” or “results pending corroboration”).
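**Illustrative sketch: documenting data extracts**

The snippet below is a minimal, hypothetical sketch (Python/pandas) of how table dimensions and a named, cleaned extract could be recorded so that experiment sections can reference the extract consistently. The file names, the linking key `patient_id`, and the extract name `extract_A` are illustrative assumptions, not part of any specific challenge.

```python
import pandas as pd

# Hypothetical raw tables; the file names and the linking key "patient_id"
# are assumptions for illustration only.
patients = pd.read_csv("patients.csv")  # one row per patient
visits = pd.read_csv("visits.csv")      # one row per visit, linked via patient_id

# Record table dimensions for the "data" section of the report.
for name, table in {"patients": patients, "visits": visits}.items():
    n_rows, n_cols = table.shape
    print(f"{name}: {n_rows} rows (samples), {n_cols} columns (variables)")

# Example cleaning step, and a named extract ("extract_A") that experiment
# sections can reference by this name.
visits_clean = visits.dropna(subset=["visit_date"])
extract_A = visits_clean.merge(patients, on="patient_id", how="left")
extract_A.to_csv("extract_A.csv", index=False)
```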
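**Illustrative sketch: benchmark/performance estimation experiment**

The sketch below shows one possible shape of a benchmark experiment covering points (i)–(vii), using scikit-learn on a stand-in public dataset. The dataset, metric, contenders, and tuning grid are placeholders, not recommendations; in a DSG report each would be replaced by the named data extract and the methods actually used.

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     cross_val_score)

# (i) Data: a stand-in public dataset; in a report this would be a named extract.
X, y = load_breast_cancer(return_X_y=True)

# (ii)/(iii) Contenders: an uninformed baseline vs. a tuned random forest.
baseline = DummyClassifier(strategy="most_frequent")
contender = GridSearchCV(                      # tuning is part of the method
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300]},
    cv=3,
)

# (iv) Splits: 5-fold stratified cross-validation, repeated twice.
splits = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)

# (v) Performance criterion: classification accuracy.
results = {}
for name, model in {"baseline": baseline, "random_forest": contender}.items():
    scores = cross_val_score(model, X, y, cv=splits, scoring="accuracy")
    # (vi) Confidence interval: rough normal approximation over fold scores
    # (fold scores are not independent, so this is indicative only).
    mean, half_width = scores.mean(), 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
    results[name] = scores
    print(f"{name}: accuracy {mean:.3f} +/- {half_width:.3f}")

# (vii) Comparison: paired t-test on per-fold scores of the two methods.
stat, p_value = ttest_rel(results["random_forest"], results["baseline"])
print(f"paired t-test p-value: {p_value:.4f}")
```

Reporting the per-fold scores alongside the summary makes it easy for readers to recompute intervals or apply a different comparison procedure later; any of these details that were not recorded during the week belong under “limitations of analyses” or “results pending corroboration”.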