# Introduction to Active Data Management
## Sign-in on the day
Please enter your name, preferred pronoun, and department/faculty.
* Brian Ballsun-Stanton, He/Him, Faculty of Arts
* Hussain Syed Gowhor, MRes Student, Faculty of Arts, He/him
* Cyril Laplane, @CyLap, He/Him, Faculty of Science and Engineering
* Sara King, she/her, AARNet
* Richard Miller, he/him, Faculty of Science and Engineering
* Denise Treacher, she/her, MQ eResearch, DVCR
* Omid Ghasemi, He/Him, Faculty of Medicine, Health and Human Sciences
* Juliet Lum, she/her, HDR Support and Development, DVCR
* Philippa Smith, she/her, GenIMPACT, MQBS
* Heidi Worsley, she/her, GenIMPACT
* Andy Gibson, he/him, Postdoc, Department of Linguistics, Faculty of Medicine, Health and Human Sciences
* Sarah West, she/her, GenIMPACT, MQBS
* Haae Bae, she/her, GenIMPACT, MQBS
* Natalie Hart, she/her, GenIMPACT, MQBS
* David Meng, He/Him, Postdoc, Faculty of Medicine, Health and Human Sciences
* Sarah Holmes, she/her, HDR student, Department of Security Studies and Criminology
* Jayamala Parmar, she/her, GenIMPACT, MQBS.
* Mike Williams, he/him, IT Partner for FMHHS
* Odette Subijano, she/her, eResearch (DVCR's Office), Project Manager
* Kristina Kopychynski, FoArts (Education)
* Ioanna Anastasopoulou, she/her, Faculty of Medicine, Health and Human Sciences
* Xiaofei (Clara) Dong, she, PhD student, cognitive science
* Louise Dodd, Project Coordinator, AHH, Faculty of Medicine, Health and Human Sciences
# Icebreaker summaries
* Brian Ballsun-Stanton
* Quest: To explore metaresearch and digital humanities
* Airspeed of an unladen swallow: African or European? (and something of a geek)
* Andy Gibson Researcher
* Post Doc in the Linguistics Department looking at how children's speech changes from preschool through to starting school and the effect of community diversity on speech patterns
* Christine Yates
* Reference Librarian - often working on the information desk in the library so somebody we all get a chance to say hello to, once we are back on campus! Has 4 grandchildren from a few months up to 5 years old :+1: :) :) :+1: :smile_cat:
* Cyril Laplane
* works in physics and his favourite colour is blue!
* David Meng
* likes playing soccer and video games
* David likes the default mute on Zoom; he likes playing soccer and video games but doesn't have much time for hobbies
* Denise Treacher
* Has 2 girls, they are always hungry!
* Has a dog and a cat
* Haae Bae
* early walks trying to stay healthy, meeting new friends
* Heidi Worsley likes to eat therefore I have to cook and shop
* having troubles with Zoom atm (keeps freezing) - and is rebooting her machine (sounds like a fearsome quest)
* Hussain Syed Gowhor
* 3 daughters lockdown and studying
* Ioanna Anastasopoulou
* Painting
* Dancing
* Opera
* Jayamala Parmar
* loves long chats and cooking!
* juggling a busy lifestyle and loves reading books but going to discover audiobooks!!
* Juliet Lum working with PhD students, she registered Cloudstor quite a while ago, and now want to know how to use it.
* and she can help Phd students, wonderful!!!
* Kristina Kopychynski
* Enjoys photography and cocktail making
* The zoom backgrounds are growing on her
* Mike Williams
* has a 10 week old alaskan malamute puppy
* playing more piano and doing family tree research during lockdown
* Natalie Hart
* is home with 3 LOUD children
* supersmart and great to have a good old banter with :)
* Odette Subijano
* play piano
* but hasn't been able to play in a while
* Omid Ghasemi
* loves video games
* Philippa Smith
* Philippa is very organised and very helpful!!
* Richard Miller
* Richard is in IT and works in Science and Engineering
* Sara King
* works at AARNET and loved pong as a kid
* Sarah Holmes
* Likes decorating cakes
* And running? Sorry my Zoom is a bit crackly lol
* Sarah West
* Louise Dodds
* coordinator in AHH
* likes to run
* Xiaofei (Clara) Dong is a 3rd year PhD candidate in Cog Sci.
* Clara wants to know how to use CloudStor.
# Jargon busting (Time: 10 mins)
https://mq-software-carpentry.github.io/intro-active-data-management/01-introduction/index.html
In breakout rooms, look at the list of terms on the shared document.
Are you familiar with these terms in this context? What are the ones that trip you up? Think of a way to remember what that word or term means in this context that might help others understand it better. How could you define a term (or two!) above to make it easier to understand? Write your definitions down in the etherpad - we can add to this list as we go and keep it as a resource for the future.
End: 9:45
Other jargon
* RDM = Research data management
Breakout room 1
* computational notebooks (eg Jupyter Notebooks)
* ?? using a virtual notebook? any different to notepad? combo of command shell with word processor document
* cloud computing
* computing as a service (using someone else's compute - possibly for a fee)
* using a remote super-computer to do your calculations
* 'shared infrastructure' that can be in your own site (on-premise cloud) or somewhere else (off-premise cloud); typically pricing is on a user-pays (on demand) basis which differentiates the delivery and consumption model from having to buy and maintain your own kit to more like a utility
* cloud storage
* storage of files and easy sharing amongst colleagues - although I have an issue with its spelling "Cloudstor"!!!
* storage as a service (using someone else's storage)
* https://www.cbronline.com/wp-content/uploads/2016/11/datacentre2-300x300.jpg
Breakout room 2
* sensitive data
* Sensitive data is information that must be protected against unauthorized access.
* E.g. Personal information, Health information, Education records ([source](https://www.upguard.com/blog/sensitive-data#examples))
* there's a MQU url for this ?
* data movement/transfer
* The act of physically moving data, either through networks or through "sneakernet"
* "Do not underestimate the bandwidth of a volvo filled with tapes"
* https://en.wikiquote.org/wiki/Andrew_S._Tanenbaum
* active data
* Data that is currently being edited and updated (consumed).
* https://www.webopedia.com/TERM/A/active_data.html
* etherpad
* Etherpad is a highly customizable Open Source online editor providing collaborative editing in really real-time (Etherpad is a particular tool, but the term is often used to refer to all collaborative text editing environments, e.g. HackMD, Google Docs, OnlyOffice etc.).
Breakout room 3
* collaborative editing
* a way to enable "multi-user, even simultaneous" editing of a common file/document (e.g. HackMD & Overleaf https://www.overleaf.com/); 'etherpad' is also a term used for this type of collaborative space
* sync
* When you sync a device you synchronize it with data on your computer. When you sync a device with your computer, it typically updates both device and the computer with the most recent information (https://techterms.com/definition/sync)
* upload/download - transmitting data to a destination (another computer) - there or back - putting stuff somewhere or taking it from somewhere
* GitHub
* code sharing between individuals, groups, able to edit on-line as well, can learn to code, 'reddit-like' posts and discussion forums https://github.com/
Breakout room 4
* research data management
* A plan for collecting, sharing and preserving data
* processes for collecting and organising data and supporting files so that they may be understood and shared by professional colleagues
* collaborative document
* document authored by multiple people
* a document where multiple people can contribute
* repository
* A place that hosts/saves your project (including documents, data, code, etc.)
* a system to store active and archive data which can easily be retrieved
* immutable versioned and time-stamped storage of research materials distinct from active working storage
* research data
* Any kind of data or evidence that you show to support your hypotheses
* supporting material documenting procedures and tools used to collect data
# Research Data Management
https://mq-software-carpentry.github.io/intro-active-data-management/02-RDM/index.html
https://ardc.edu.au/wp-content/uploads/2020/01/What-is-research-data.pdf
* Interview recordings and transcripts
* Citations
* Quotes and passages from books
* simulation results
* images
* Patient questionnaires
* Linked data e.g. hospital, Medicare records
* Weather Station data
* Instrumentation data (mass spec, microscopes)
* Reaction time data from experiments
* Statistical analysis scripts
* Vowel formant tracks / pitch tracks
* Time-aligned annotated transcripts
* Video footage
* Performance in some behavioral tasks (e.g., reaction time)
* Audio
* Social media - posts, images, video, metadata, profile id's, links used to access specific sites, terms of reference for relevant sites.
* info published by unis on their websites
* student survey responses
* Public sign up to participate in research (when attending events at the AHH)
* Data - personal information - about external organisation employees
* All kinds of data from research activities...
# research data lifecycles
Where are you at?
Is something missing? Is something unexpectedly there?
Pre-registration - where does that fit?
Ethics - fit in DMP or within data storage cycle?
Is there an even bigger circle at an institutional level ?
where does sharing of data, analyses, scripts etc go? is that archive or publication?
What do you mean by description? How do we describe our data?
Is there a dedicated team at MQ who can help the researchers at every step of the DMP and the lifecycle?
Horror story from UNSW:
12 PhD students lost their data when the office sprinklers went off overnight. None had backups.
## Challenge - A video on data sharing (20 minutes in breakout rooms (10:40), 10 minutes discussion)
In breakout rooms, watch this video in small groups: Data Sharing and Management Snafu in 3 Short Acts
https://www.youtube.com/watch?v=N2zK3sAtr-4
Discuss: Have you run into any of these scenarios? What happened? How should this have gone? Write your room’s conclusions in the etherpad.
Share: When we get back together, each room should share some of its worst horror stories (with the names changed to protect the innocent).
* Relevant comics:
* https://xkcd.com/1360/
* https://xkcd.com/927/
* https://xkcd.com/743/
* Breakout room 1
* When I tried to publish my work, the editor asked for the methods I used to correct the artifacts in my data. I then realised I had not saved the processed data at every single step, so I had to do the whole process again.
* So hard to get access to data when the institutional system for storing and managing data changes. Lots of "no, we do not have access to that data anymore. Goodbye!" responses: grrr!
* Original metadata text was written in Chinese, and the file wouldn't open or read correctly when back in Australia.
* Breakout room 2
* Code - figuring out what old code lines are on older work is a challenge. Historically did not call things sensible titles. Notes to self need to be written as if to somebody who knows nothing about the project.
* Legacy formats... importance of using stable formats. E.g. Physical formats: minidiscs!, CD-roms, floppy disks haha, but also digital file formats that require old operating systems etc...
* https://xkcd.com/743/
* A PhD student only had their work in print form, not electronically. They lost it, and a reward was put out to try and get it back.
* PhD student only had a copy on a hard drive and the hard drive failed. He did have warranty on computer but unable to retrieve data.
* Example of some column names -- good luck figuring out what this data is about:
ua_6km (ncl6, ncl7, ncl8) float32 ...
va_6km (ncl9, ncl10, ncl11) float32 ...
s06 (ncl12, ncl13, ncl14) float32 ...
ua_3km (ncl15, ncl16, ncl17) float32 ...
va_3km (ncl18, ncl19, ncl20) float32 ...
s03 (ncl21, ncl22, ncl23) float32 ...
* legacy formats are a drama
* Breakout room 3
* Well, I have wondered about all these beautiful in situ images I have sitting on a CD somewhere
* Lots of missing documentation that is required, duplications, collecting all the necessary data
* I've experienced a similar situation: I asked for example scripts but got a reply saying they had lost access to the online server and needed a few months to get access back
* Breakout room 4
* In my masters research, I really struggled to find the right kind of data. There was a time limit, I could not get ethics approval, and I had little time to collect primary data. I understood how difficult it is to do research based solely on secondary data. I would have avoided this problem if I had adopted a data management plan and had access to data sharing services. - By Hussain.
* Shared code without any comments (it is impossible to find out what is happening there)
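One way to catch the "good luck figuring out what this data is about" problem from breakout room 2 is to keep a small data dictionary next to the data and check it for gaps. The sketch below is only an illustration: the `data_dictionary` structure and the `undocumented` helper are made-up names, and the descriptions are deliberately left empty because these notes do not say what the variables mean.

```python
# Hypothetical data dictionary for the variable names quoted above.
# Descriptions are None on purpose: the notes do not record their meaning.
data_dictionary = {
    "ua_6km": None,
    "va_6km": None,
    "s06": None,
    "ua_3km": None,
    "va_3km": None,
    "s03": None,
}


def undocumented(dictionary):
    """Return the variable names that still lack a plain-language description."""
    return [
        name
        for name, description in dictionary.items()
        if not (description and description.strip())
    ]


missing = undocumented(data_dictionary)
print("Variables needing a description:", ", ".join(missing))
```

Running a check like this before sharing a dataset gives you a to-do list of names to explain, written for somebody who knows nothing about the project.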
Which of these does NOT count as active research data? Put a +1 in the shared document next to which one you think is right!
A database
A research publication +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1
Field notes +1
Audio and video tapes
## Arguments for choosing your collaboration service
To check if my data are relevant and not obsolete.
To ensure the reliability and validity of my data
https://limesurvey.mq.edu.au/index.php/653544/newtest/Y?WorkshopName=IntroDataManagement
# Active data and files
https://mq-software-carpentry.github.io/intro-active-data-management/03-File-manipulation/index.html
## Challenge
Where is your data now? How do you store, share, sync, protect and back up your files? Could this be done differently?
* Andy Gibson
* Cyril Laplane
* OneDrive
* David Meng
* Cloudstor, Dropbox, Onedrive, GoogleDrive, Desktop (Uni), laptop, external HDD
* Sensitive data stored & shared via Cloudstor
* Haae Bae
* online storage e.g. onedrive, memory sticks, laptop, physical copies (questionnaires) are kept in a cabinet in the office - only authorised people have keys to access the questionnaires
* Heidi Worsley share drive, h drive, hard drive, emailllll
* Hussain Syed Gowhor
* Ioanna Anastasopoulou
* Jayamala Parmar
* Juliet Lum
* Kristina Kopychynski
* Mike Williams
* Natalie Hart
* Omid Ghasemi
* Philippa Smith
* personal data is on multiple hard drives (one ideally located elsewhere). I don't like the lack of privacy and control of Google, Apple products
* Richard Miller
* Sarah Holmes
* Cloudstor/Onedrive, personal data is on a combination of Google, Apple, Samsung
* Sarah West
* Xiaofei (Clara) Dong
* currently on my pc and hard drive, trying to upload to cloudstor
## Logging into cloudstor
1. Go to the AARNet website: https://www.aarnet.edu.au/
2. Click on 'Log In and Tools' in the top righthand corner of the page.
3. Select 'CloudStor'.
4. Choose your organisation and click on 'Login at AARNet'.
5. Sign-in with your credentials - user name and password - and click 'Login'.
## Homework
https://mq-software-carpentry.github.io/intro-active-data-management/03-File-manipulation/index.html Challenge - Organising a directory (15 minutes)
# Day 2
# Discussing file manipulation homework
Some extras that have come up you might be interested in checking out:
Text editors
https://en.wikipedia.org/wiki/List_of_text_editors
Markdown
https://www.markdownguide.org/
GitHub
https://github.com/
Git
https://librarycarpentry.org/lc-git/
Git for writers
https://opensource.com/article/19/4/write-git
**NOTE** I have no affiliation with Udemy but this was a course on special during the new year and it helped me work a few things out (there are probs free ones out there too) - Sara
https://www.udemy.com/course/git-and-github-for-writers/
# Backups
Tape is still king for back up (but not at home)
https://en.wikipedia.org/wiki/Tape_drive
Hot tip: Test your back up before you are in a crisis.
Backblaze https://www.backblaze.com/
SpiderOak https://spideroak.com/
Duplicati https://www.duplicati.com/ - You can use an external HD for this
To back up sensitive and highly sensitive data talk to your local friendly IT peeps
Since Macquarie supports onedrive and Cloudstor, is it possible to use one as a sync system and the other as a backup system? YES! Be careful about how you do it and plan for it well.
Back up on Cloudstor? https://support.aarnet.edu.au/hc/en-us/articles/333757013916-Cloudstor-File-Recovery
So, how do you decipher and 'unencrypt' a certain backed up file you're looking for in your Firedrill?
You need a key (password) management protocol as part of your DMP.
https://en.wikipedia.org/wiki/The_Princess_Bride_(film)
Examples of three copies:
Data on your computer AND in your sync client is for active data.
The second place for your data is a result of an automatic back up plan
The third place is an external hard disk (do **not** use USB thumb drives for persistent data storage) Remember: hard drive not USB drive. Failure rates are something you want to look up when you buy a hard drive.
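A minimal sketch of the "automatic copy, then test it" idea, using only the Python standard library. This is not a replacement for the backup tools listed above; the function names (`snapshot`, `verify`, `sha256_of`) and the timestamped folder naming are assumptions made for illustration. It copies a folder to a second location and then runs a small fire drill by comparing checksums of every file.

```python
import hashlib
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(source, backup_root):
    """Copy source into a new timestamped folder under backup_root."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S-%f")
    target = Path(backup_root) / f"{Path(source).name}-{stamp}"
    shutil.copytree(source, target)
    return target


def verify(source, copy):
    """Return relative paths of files that are missing or differ in the copy."""
    source, copy = Path(source), Path(copy)
    problems = []
    for original in sorted(p for p in source.rglob("*") if p.is_file()):
        relative = original.relative_to(source)
        duplicate = copy / relative
        if not duplicate.is_file() or sha256_of(original) != sha256_of(duplicate):
            problems.append(str(relative))
    return problems


# Fire drill on throwaway folders, so nothing real is touched.
with tempfile.TemporaryDirectory() as scratch:
    project = Path(scratch) / "project"
    project.mkdir()
    (project / "notes.txt").write_text("interview 01 transcript")
    copy = snapshot(project, Path(scratch) / "backups")
    print("Problems found:", verify(project, copy))
```

An empty list from `verify` means every file in the source has an identical twin in the copy. It does not prove you can decrypt or restore from a real backup service, so a proper fire drill still needs to go through that service's own restore process.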
When making a back up plan, articulate what you care about. Define the things you want to protect. Then figure out how you are going to do it, checking you have an understanding of your legal/moral requirements, e.g. Privacy Act, Records Act. Be explicit about who has access to encryption keys. Who needs access? How do they get access? Make a checklist. This way you will know if you have succeeded.
e.g.:
* I have sensitive audio recordings - I want to back them up
* I need to have strong protection against unauthorised access
* Need to be preserved in perpetuity
* Not shared with the team, so don't need access for others, but needs to be encrypted
* Does ethics need to be able to invisibly audit? E.g. in case of ethical challenge or lawsuit (i.e. university to access without telling you). NB: If you're using an authorised sync client, audits can use your active sync client for this purpose, not your backups
Back up plan might be different for different subparts of project... e.g. audio might be like the above checklist, but excel spreadsheets, publication etc. might have a different plan.
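If it helps to make the checklist concrete, here is one possible way to write a plan down so that gaps are easy to spot. It is a sketch only: the `BackupPlan` class and all of its field names are invented for this example and are not a Macquarie or AARNet template.

```python
from dataclasses import dataclass, field


@dataclass
class BackupPlan:
    """One checklist per kind of data; field names are illustrative only."""

    data: str
    threats: list = field(default_factory=list)
    encrypted: bool = False
    key_holders: list = field(default_factory=list)
    retention: str = ""
    last_restore_test: str = ""  # date of the most recent fire drill, if any

    def gaps(self):
        """List the checklist items that are still unanswered."""
        missing = []
        if not self.threats:
            missing.append("no threats articulated")
        if self.encrypted and not self.key_holders:
            missing.append("encrypted, but nobody is named as key holder")
        if not self.retention:
            missing.append("no retention period")
        if not self.last_restore_test:
            missing.append("restore never tested")
        return missing


# The audio example from the checklist above, with two answers still missing.
audio = BackupPlan(
    data="sensitive audio recordings",
    threats=["unauthorised access", "disk failure"],
    encrypted=True,
    retention="in perpetuity",
)
for gap in audio.gaps():
    print("TODO:", gap)
```

A second `BackupPlan` for spreadsheets or publications would have different answers, which is the point: one plan per subpart of the project.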
Think of it as 'professional paranoia'!
If you don't articulate the threat your efforts are potentially going to be wasted or left to chance.
Common threats:
* spilling coffee on computer, losing laptop, laptop squeals then passes out, unauthorised access to your data, fire, sprinklers, flood, team member leaves the project, PhD student drops out
Can you recover from these? How?
What are the policies, standards and laws that affect the way I can use and store my data?
Look at each piece of data - does this one thing have a special risk?
Shared folders - one way to protect team data is to set up a team folder, or a 'group allocation'. That way it doesn't belong to an individual and the ownership is for the whole team. That allocation belongs to the institution rather than an individual. Plus you get a team storage allocation separate to your own individual allocation of 1TB
How do you create a team in CloudStor? email Richard!
Think about cost. What's the cost of a breach? (Hint: a lot!) What about your data management costs? What about the costs of where the data is stored? What happens if they share it or the business goes bust? Multiply costs by percentage of failure. Risk analysis principles apply.
Your budget is that the insurance (back up) must not cost more than replacing your data.
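The "multiply costs by percentage of failure" step can be sketched as a tiny risk register. Every number and threat name below is hypothetical and only there to show the arithmetic; `expected_annual_loss` is a made-up helper, not part of any tool mentioned in the workshop.

```python
def expected_annual_loss(threats):
    """Sum cost x yearly probability over every threat in the register."""
    return sum(cost * probability for cost, probability in threats.values())


# Hypothetical register: (cost in dollars if it happens, probability per year).
threats = {
    "laptop lost or stolen": (20_000, 0.02),
    "office sprinklers go off": (60_000, 0.005),
    "team member leaves with the only copy": (10_000, 0.05),
}
replacement_cost = 60_000  # assumed cost of recollecting everything
annual_backup_cost = 300   # assumed yearly cost of the back up

loss = expected_annual_loss(threats)
print(f"Expected loss per year: ${loss:,.0f}")
print("Back up costs less than the expected loss:", annual_backup_cost <= loss)
print("Back up costs less than replacement:", annual_backup_cost <= replacement_cost)
```

The last line is the budget rule from above taken literally. The line before it is a stricter comparison against expected loss, which some teams may prefer.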
Some nighttime reading: https://en.wikipedia.org/wiki/Command_and_Control_(book)
**Excellent tip:** If you choose not to sync your computer to Cloudstor you can use Cloudstor as a back up :)
# Jargon busting
What words, terms, phrases and concepts in this workshop can we explain better?
eg back up - a copy of a file or other item of data made in case the original is lost or damaged AND the procedure for making backup copies of files or other items of data. Also see: https://en.wikipedia.org/wiki/Backup - In information technology, a backup, or data backup is a copy of computer data taken and **stored elsewhere** so that it may be used to restore the original after a data loss event.
https://en.wikipedia.org/wiki/Glossary_of_backup_terms
Restore
- The act of accessing, decrypting, and getting data from a specific point in time out of your backup system.
- "Bring a file's state to a previous point in time. e.g. undeleting it"
Tape
- Magnetic data storage tape, not audio-tape
https://www.youtube.com/watch?v=kiNWOhl00Ao
# Backup plan challenge
Group: Make, document, and share a backup plan. Use a variety of resources and share those too. Think, pair, share
Room 1:
We planned for the plan! Found some interesting resources:
https://documents.uow.edu.au/content/groups/public/@web/@raid/documents/doc/uow226355.pdf
https://www.ait.com/tech-corner/10836-4-steps-to-create-your-backup-plan
https://www.povertyactionlab.org/sites/default/files/documents/Data_Security_Procedures_December.pdf
https://www.icpsr.umich.edu/icpsrweb/content/datamanagement/dmp/framework.html
https://libraries.mit.edu/data-management/plan/write/
https://azure.microsoft.com/en-au/services/backup/
https://aws.amazon.com/glacier/
https://iapp.org/resources/article/data-breach-cost-calculators/
https://www.arc.gov.au/policies-strategies/strategy/research-data-management
https://www.nhmrc.gov.au/sites/default/files/documents/attachments/Management-of-Data-and-Information-in-Research.pdf
Room 2:
Backup Plan:
Video files:
Data Survey responses (sensitive - anonymised) not original
video / recording - high school students (Category: Sensitive)
Risks:
Loss of data means ? Start again ?
possible data breach via shared camcorder
HDD failure 1% pa
risk of being sued
reputational risk
Impact:
hard to repeat - would need new subjects (Cost: $60K)
Copies:
camcorder (borrowed ) + google drive
3 Copies:
Cloudstor
private SD card
google drive
cloudstor2
## Room 3:
* Data sources:
* Physical device which stores all lab data. (NAS) Only place where the data is saved.
* Partial copies of that data exist elsewhere.
* Analysis is version controlled on github
* Papers on overleaf
* Risks:
* Students have access to the not-backed up NAS. All people in the lab have access.
* Risk: Students may accidentally delete files
* Risk: Students may have a laptop infected with ransomware.
* Risk: People may access other people's files.
* Risk: fire or fire response in the lab damages the NAS
* Risk: power outage (for longer than an hour) or internet outage prevents access
* Costs of replacement:
* Replacing the active data, can be retaken relatively quickly. 1 month of work. (9 people, 1 month of work, 600 dollars a day) = $108,000
* Chance of losing the data: 5 drives, 2 drives need to fail at the same time, at .05 per drive per year = .25% per year
* Expected loss: roughly $250 a year
* Backup strategy
* Set up automatic backup using a command line tool to one of MQ's cloud services (free)
* Buy a second NAS, put it in a different location. Backup automatically monthly
* Test strategy
* Every few months, disconnect the NAS, try to access our data.
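Room 3's cost figures can be reproduced with a few lines of arithmetic. Two things below are assumptions rather than something stated in the notes: 20 working days in a month (this is what makes 9 people at $600 a day come to $108,000), and reading the .25% as two drives failing independently in the same year, each with a .05 chance.

```python
people = 9
working_days_per_month = 20  # assumption; it reproduces the $108,000 above
day_rate = 600               # dollars per person per day, from the notes

replacement_cost = people * working_days_per_month * day_rate

drive_failure_per_year = 0.05  # from the notes
# Assumption: data is lost only when two drives fail independently in one year.
loss_probability = drive_failure_per_year ** 2

expected_loss = replacement_cost * loss_probability
print(f"Replacement cost: ${replacement_cost:,}")
print(f"Chance of losing the data in a year: {loss_probability:.2%}")
print(f"Expected loss per year: ${expected_loss:,.0f}")
```

Under these assumptions the expected loss works out to $270 a year, in the same ballpark as the room's rounded figure. With five drives in the NAS the true chance of any two failing together is higher than this simple estimate, so treat it as a lower bound.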
# Collaboration
Stop emailing Word documents! Please for the love of everything good! Stop it! :D
A safe and secure way to share files is Filesender (via Cloudstor): https://support.aarnet.edu.au/hc/en-us/sections/115000260773-CloudStor-FileSender
Overleaf
Macquarie subscribes so if you sign up with your Macquarie email you will get a premier account: https://www.overleaf.com/
More about Overleaf: https://www.digital-science.com/products/overleaf/
What about Github?
Could you talk about differences between overleaf and GitHub for version control of paper ?
Overleaf interacts with GitHub
https://www.overleaf.com/blog/195-new-collaborate-online-and-offline-with-overleaf-and-git-beta
And Google docs?
Since google docs has pretty good history function, the named versions are major versions right?
Google doc is document centric - while git allows commits against entire folders
What's Paperpile?
https://paperpile.com/
https://manubot.org/
EndNote? Maybe not for collaboration? (personal view)
OnlyOffice? No citation manager but useful for other things (it's in Cloudstor)
Mendeley? (Owned by Elsevier - supported by Macquarie)
Do you think that should extend to pdfs? e.g. we have a shared endnote library, but you can’t sync pdfs between users.
Think about it!
Zotero has Word365 integration. https://www.zotero.org/
Is Zotero easy to learn? Ask a librarian! They are awesome and supportive and can help you with these things.
# OSF
Open Science Framework
https://osf.io/
Macquarie has membership. You can keep your data there; citations are created for that data so others can cite it, and a DOI can be minted for the data too. You can specify Australian storage. You can also include Jupyter notebooks to show the steps in your analysis.
You can also licence your data. If you want people to reuse your data you must have a licence.
You can create a peer review link - this will blind the paper so reviewers cannot see the names of the contributors so it can be used for double blind peer review.
Why use OSF? To create a data supplement for my paper. To give reviewers access to my data, and my analysis, in addition to the paper.
Do you include a URL or DOI?
DOI once the paper is public.
Pros and cons of osf vs github for data sharing?
You can upload a GitHub repo on OSF
TOP Guidelines
https://cos.io/top/
Pre-registrations can be placed here: https://cos.io/prereg/
Andy: if you want a simple option for pre-registration: aspredicted.org provides a very quick and easy interface and requires less detail than osf for completing a pre-registration
# SUMMARY
For active data management do these things:
Make a plan
Follow the plan
Cover every phase of the data lifecycle
Address risks and your back up strategy