# Response to "Are We at Risk of Losing the Current Generation of Climate Researchers to Data Science?"
[**Link to Jain et al paper**](https://agupubs.onlinelibrary.wiley.com/doi/epdf/10.1029/2022AV000676)
[**Link to our draft paper outline**](https://docs.google.com/document/d/13msVgNJ8SiPbdKNhZcDEZadHqdOEzAi1LNhJbFhY_-o/edit?usp=sharing)
## Meeting Aug 4, 2022
To join the video meeting, click this link: https://meet.google.com/yqt-tdkm-qcu
Otherwise, to join by phone, dial +1 669-241-7529 and enter this PIN: 562 998 071#
### Attendees
- Deepak Cherian / NCAR / dcherian@ucar.edu
- Julius Busecke / LDEO / julius@ldeo.columbia.edu
- Paige Martin / LDEO / pmartin@ldeo.columbia.edu
- Daniel Rothenberg / Waymo / darothen@waymo.com
- Shane Elipot / University of Miami / selipot@miami.edu
- Max Grover / Argonne National Lab / mgrover@anl.gov
### Agenda
- Do we have a clear goal?
- Let's offer an alternate perspective (rather than rebuttal)
- We will likely be seen as a Pangeo group, so let's embrace that (Shane: it does not have to be this way, I'd like to bring in the perspective from inter-agency programs like US CLIVAR in the USA)
- Ideas from discussion
- There is a lot of great "non-traditional" research going on outside of academia
- e.g. CarbonPlan, important for society, inspired by data-intensive activities that have a new or additional type of impact
- "bigger tent of climate science" these days
- ~~Article's~~ "Our" positive experience seems very US-centric
- some haven't had access to the data approach, and may feel overwhelmed
- Computational scientists
- not just in climate science
- has existed for decades [history of comp sci at argonne](https://www.amazon.com/Hybrid-Zone-Computers-Laboratory-1946-1992/dp/0988744902)
- Prescribed vs curiosity-driven research
- putting types of research into classes
- adds a sense of "elitism" that scientists should just do science and not "lowly" activities like data management
- there may be a correlation vs causation issue: bad experiences may be correlated with lots of data management, but that is not necessarily the cause of the lack of curiosity-driven research
- emphasize community building as a solution
- paper ignored solutions involving education
- community as a "temporary refuge"
- We agree a lot with what changes are needed but encourage a broader framing of the motivation for these changes.
- cleaning data is an investment - a learning tool that is part of the training/education
- but we admit that the education system is behind and doesn't teach enough computational tools!
- access to education/community --> positive thing!; if not, it's a drag
- title sets the stage immediately of "us vs them"
- should we really diminish the amount of data we are using? No!
- let's say: we are not going to get around big data
- the solutions they suggest could be improved (could be our approach in our paper)
- reframe our paper, and then offer alternate solutions
- Do we agree on the points outlined in the "summary of views" below?
- take 10 minutes to read through the summary
- What are our next steps?
- who wants to "drive" an outline?
- What outlet are we targeting? (Response to Article, Blogpost, ???)
## Summary of views
### Goal
Can we write a constructive optimistic critique that advances an alternate view of "data-intensive" research?
- Important that this response should not be viewed as the "Pangeo" response. "oh you all like doing this stuff, no wonder you disagree..."
- embrace this; also valid members of the community.
- Pangeo is funded... things like Earthcube, Cyberinfrastructure funding
- Data is one of the foci of US inter-agency programs; funding agencies have a recent focus on data and software, working groups on CMIP, uncertainty quantification, data science, etc.
- there is an implicit framing of "what is climate research" and a value judgement; non-traditional climate science, esp. translational research; there is a bigger trend that emphasizes technical skills
- Should we avoid the other parts of the paper (incremental "prescribed research" instead of "curiosity driven research") and keep a narrow focus
- playing with data is curiosity driven research
- no we should tackle this.
- do we want to address the inequity aspect?
- "urgent need to create equitable globally accessible computational and storage solutions, similar to the Centre for Data Analysis (CEDA) or JASMIN in the UK, which allow researchers to perform data analysis at the site of data, without the need for processing or storage on local systems." (Sec 3.3, sharing of resources)
- or cloud...
- It's hard when the paper points out the solutions in some form (Section 3). We can encourage a broader framing of these points.
- Sharing of Resources
- Dedicated Software Engineers and Data Managers
- Streamlining Post-Production Activities
- Improving the Link Between Model Developers and Data Analysts
- Work Culture Improvements
- We should emphasize community as a solution (somehow ignored; though I think the paper mentions "forums")
- Time devoted to data management and software skills should be considered an investment and hands-on training
- We should also emphasize education aspects
### Concerns with current framing
- perpetuates the notion that data [management] and software development is not “real science”
- It is depressing to read the viewpoint that “an undue focus on data-intensive activities” is something distracting from “foundational research activities.”
- seems to dismiss the role of data-driven investigations, or put them in a lower-class of scientific activity.
- that the authors are somewhat dismissive of the sort of semi-structured scientific research which involves working with large datasets to formulate hypotheses to evaluate, rather than theory-driven or alternative approaches.
- strange assumption that any of the students, postdocs, etc. tasked with data science should want to become academic researchers.
- The authors openly call for universities and research labs to fund data and software engineers. That’s great… but it seems like the intention is to offload the “lesser” work of data management activities to these support staff. It kind of reminds me of tensions with lab technicians in the bench-lab and field experiment world; these are critical staff that in many cases were actually scientists in the same research domain but chose to specialize in something other than basic R&D.
### What is our alternate view?
- avoid exclusionary us vs them framing of the scientific enterprise.
- its in the title.. creating a wedge and two different classes;
- many, but not all, like to work at the intersections of various subfields
- a growing movement of people interested in working in climate science/solutions (e.g. [tech workers](https://twitter.com/i/events/1551976905577938949)). We must be welcoming!
- important since the authors seem to call for "institutional support for large-scale data analysis": such contributions should not be minimized.
- We need better acknowledgement (?) of other workers in the science enterprise (eg data managers, software developers, lab technicians). But not sure how?
- Lets not forget we can learn from others:
- From the software development perspective, it’s perplexing that scientists are generally averse to learning things that will make their lives easier. The sheer number of man-hours lost to bad software architecture, bad versioning practices, etc., comes into blinding view when one invests whole-heartedly in these skills. It is frustrating to see so many scientists complain of the time it takes to do basic research, while refusing to learn the skills necessary to solve this problem.
- let's not slow down data collection
- so many more ways to respond
- We don't think they're offering the right questions or solutions with potential. We could come up with a new set of solutions as a constructive response.
- We must take a broader view of "data in science". Cheaper sensors, and cheaper compute mean that an ever-increasing size and number of datasets is guaranteed.
- We must take a broader view of "increasing size": there has been a large increase in the availability of a large number of datasets. Surely many scientific insights remain to be discovered at the intersection of these datasets.
- data as a fourth paradigm of scientific discovery (think this is more AI/ML type framing)
- the ability to easily manipulate datasets of large size, or a large number of datasets opens the possibility of computational learning ([Barba](https://lorenabarba.com/blog/computational-thinking-i-do-not-think-it-means-what-you-think-it-means/)).
- **Do we have an example of research findings that were driven by data?**
- We need to take better advantage of online collaboration opportunities / build and participate in communities:
- open science or "open-source science"
- reduce "toil" / crowdsourcing fixes & analysis techniques ("peer production")
- ocean glider Standard Operating Procedures 'living document'
- xMIP
- teach each other analysis techniques and skills
- **All of this results from some form of community effort.** Maybe the emphasis needs to be on creating these communities for everyone (because clearly the authors did not benefit from e.g. the pangeo community as much as some of us; see Max's point beneath)
- links to NASA open-source science stuff.
- there is a lot of funding effort in this space. we could consider listing a number of them.
- TOPS, Pythia, hackweeks, ESGF 2.0 efforts?
- all these are outside the standard training of grad students
- Skills, aptitudes, and preferences vary widely in the ECR community;
- we must learn to use this to our advantage
- more flexible career paths and expectations
- more funding...
- end with an optimistic vision:
- technical skills should not limit the questions we can ask of a dataset. "Anyone should be empowered to ask science questions of huge datasets like CMIP" (paraphrasing Ryan A)
- more funding is great, but it should not treat data engineering as something outsourced by "scientists" to "data practitioners"
- The idea is not to just produce more and more data and analyze it with standard methods, but instead we want to be able to 'observe' these amazingly complex datasets like we would the real world. In this sense good tools and technical expertise encourage more 'curiosity-driven' research.
- see "computational learning" idea above
- **computation should be empowering, not discouraging**
- their email contact is "datafatigue@yess.org"
- What can initiatives like pangeo do better to reach more ECRs?
## Initial Thoughts
### James Munroe
(copy pasted from Twitter by Deepak)
- While I agree the offloading of data management tasks to precariously employed ECRs is problematic, I am concerned this article perpetuates the notion that data and software development is not “real science”
- By that I mean it promotes a hierarchy of science results where the “interpretations” of model results (and thus corresponding papers) are first class results and all the work that those science results are based on are “just” software development.
- While I agree that ECRs need the freedom and support to be able to pursue scientific interpretation, other workers in the science enterprise (eg data managers, software developers) are restricted to term limited, project based roles with no career path to senior roles.
### Deepak Cherian
I've started going down the ["computational thinking/learning" thread](https://lorenabarba.com/blog/computational-thinking-i-do-not-think-it-means-what-you-think-it-means/)
I think the science vs software framing ignores the possibility of computational "research" learning; which is a real loss. You can choose to frame your work as "data crunching" or "data playing". It makes a difference.
### Julius Busecke
- I generally agree that ECS should not spend their time on 'toil-only' tasks, but some data science skills will be required in science going forward. Sure we can stop CMIPing, but should we ignore new high res satellites, ever growing Argo/Glider observations? Big data is not just modelling!
- This might be biased, but I think that the authors ignore the possibility of 'crowdsourcing fixes to reduce toil'. The idea here being that we want to avoid **repeated toil** at all cost. However, some data wrangling is always necessary. And in that sense, while "It is not just about creating a few more libraries or learning a few more analysis tools to make research more efficient.", doing so in a modular and community-driven effort (ideas at the core of so many pangeo efforts) can do exactly that: make science more efficient!
- In a sense this will require more 'data science' skills (by which I really mean dev skills) from scientists, not less!
- I am not at all a fan of the 'us vs them' framing. I have seen that in many iterations in my scientific career and always strongly opposed it. It used to be 'observationalists' vs 'modellers', now it's 'real scientists (theoreticians?)' vs 'data scientists'. I think pitting different skillsets against each other is never useful. The real scientific insight IMO will always emerge at the intersection of skills and opinions.
- I fully agree on the idea that science should be as creative as possible! But I disagree that this can only be done from a 'theory-first' point of view. I usually use the argument "these datasets are lying around and shall be analyzed" in a completely opposite way - as a motivation - to develop tools and methods that enable scientists to explore very large data sets interactively and creatively.
- I think I struggle most with this sentence: *"There are growing movements to democratize research data and infrastructure via cloud computing technology to support the development of climate research in under-resourced regions, for example, Group on Earth Observation—Planetary Computer Program, Amazon Sustainability Data Initiative, and analysis-ready CMIP6 data on the cloud with Pangeo. However, these initiatives require new skills that are not yet included in the training of ECRs, thus still struggling to achieve their purpose."*
- What are the skills that are needed for the pangeo efforts (which go far beyond providing data!) exactly? What software are the authors envisioning ECRs should use? Again I find myself agreeing very much with the outlined problem here, but the conclusions are completely orthogonal to what I would think about this issue. Everyone can use the pangeo stack on their computer and the switch to the cloud is comparably easy, because the interface and the tools are exactly the same.
- But I do not want to just note things I disagree with. I believe this is an incredibly important and timely discussion, and the authors raise very good issues. With regard to our community I think this is also an opportunity to practice some introspection: I believe the pangeo community has a lot to offer towards these issues, but evidently we need to improve our outreach, work harder to engage an even broader community, and focus more on offering teaching for the "pangeo-tools".
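
The "crowdsourcing fixes to reduce toil" idea above can be sketched in a few lines. This is a hedged illustration only: the `harmonize` function and `RENAMES` table below are hypothetical stand-ins for the kind of shared, reviewed cleanup code that community projects like xMIP maintain, not their actual API.

```python
import numpy as np
import xarray as xr

# Hypothetical community-maintained fix table: modeling centers name the
# same dimensions differently; the fix is written (and reviewed) once,
# instead of being re-invented by every researcher.
RENAMES = {"latitude": "lat", "longitude": "lon", "nav_lat": "lat", "nav_lon": "lon"}

def harmonize(ds: xr.Dataset) -> xr.Dataset:
    """Apply shared dimension-name fixes so downstream analysis code
    can assume a single convention."""
    return ds.rename({old: new for old, new in RENAMES.items() if old in ds.dims})

# Toy dataset mimicking one center's naming convention.
ds = xr.Dataset({"tas": (("latitude", "longitude"), np.zeros((3, 4)))})
print(sorted(harmonize(ds).dims))
```

Because the same xarray interface works on a local toy dataset and on cloud-hosted Zarr stores, the downstream analysis code is unchanged whether the data lives on a laptop or in the cloud, which is the point made above about the Pangeo stack.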
### Stephan Hoyer
It is depressing to read the viewpoint that "an undue focus on data-intensive activities" is something distracting from "foundational research activities." It reminds me of my days in physics grad school, interacting with some people (including myself at times), who viewed experimental or applied science as a little beneath themselves ("high energy theory or bust").
In contrast, I'm a strong believer in data as a [fourth paradigm](https://www.datanami.com/2019/04/15/is-data-science-the-fourth-pillar-of-the-scientific-method/) of scientific discovery, in addition to experiment, theory and computation. Just like how time spent building experiments or a computer model is an important part of the scientific process, so too is looking at data. It might actually be a useful exercise to look back at the early days of computational science for precedents, when there were probably similar complaints about the risk of losing scientists to computer modeling.
That said, I agree with the message that it is essential to invest in training for ECRs in this new skill set, as well as professional support and tool building. We need institutional support for large-scale data analysis just like support for building a new satellite or super-computers (but [don't fund software that doesn't exist](https://peekaboo-vision.blogspot.com/2020/01/dont-fund-software-that-doesnt-exist.html)). If current trends are any indication, data analysis (including machine learning) is only going to become more prevalent in climate science in the future as model output gets bigger, sensors get cheaper and data-driven approaches are more and more successful in other areas of science.
### Max Grover
* I think this article does a great job of bringing up relevant issues in our discipline - but I would argue that **climate science has always been a "data science" field**, and much of the burden of the technical work has traditionally been placed on ECRs
* The main dismissal of the efforts of the Pangeo community is that not many people know how to use the tools - this is an argument for **more effort on outreach and education**
* Investment in both **tool development** and **education** (tutorials, blog posts, workshops, etc.) is required here
### Shane Elipot
Overall, yes, the paper brings up a lot of good points, mostly about the difficulties we face when dealing with large data but (probably repeating many of the points above) ...
- Unless I misunderstand it, the title is misleading: an ECR spending an unsatisfying and frustrating amount of time wrangling model data is not going to necessarily become a data scientist. It's not even suggested in the paper I believe.
- The paper seems to dismiss the role of data-driven investigations, or put them in a lower class of scientific activity. The truth is that "let's analyze that bunch of data sitting there" can lead to great discoveries or research ideas that can eventually be reframed as hypothesis-driven proposals and projects. If they must.
- I have seen people on twitter using this paper to support their (wrong) ideas that data and code are not science. I would like to pinpoint/isolate in the article what could lead to such reckless conclusions.
- Framing data science as a lower class of science (or not even as science) is dismissing the way many of us work in climate science or related fields. We spent much more time collecting, curating, processing, and analyzing data than writing the associated results. This is therefore dismissing many scientists as well as the national and international programs funding such activities (e.g. NSF EarthCube)
- Pangeo is mentioned twice in the paper, and specifically it is written that *However, these efforts are largely community-driven, and therefore professionalizing these initiatives and supporting them financially would be needed to ensure that they continue to develop and fulfill their purpose in the longer term.* Isn't this ignoring that Pangeo and other projects are actually funded by various funding programs?
### Daniel Rothenberg
I had to read this article a few times before really understanding how the arguments the authors offer all line up. I have some specific comments, but I think, first, it's important to highlight the dissonance in the authors' usage of the term "data scien[ce]". The definition kind of meanders a bit, but I think the big throwback to Emanuel (2020) is an indication that the authors are somewhat dismissive of the sort of semi-structured scientific research which involves working with large datasets to formulate hypotheses to evaluate, rather than theory-driven or alternative approaches. They lament that this sort of work can be difficult to plan ahead of time and sell to funding agencies, and that the outcomes may look very different than some reference output for ECRs.
It's tied up in what I find to be a very impolitick perspective towards support staff. The authors openly call for universities and research labs to fund data and software engineers. That's great... but it seems like the intention is to offload the "lesser" work of data management activities to these support staff. It kind of reminds me of tensions with lab technicians in the bench-lab and field experiment world; these are critical staff that in many cases _were actually scientists in the same research domain_ but chose to specialize in something other than basic R&D.
If I were to write a rebuttal to this article, I'd probably focus on the following points:
- If dealing with the ever-growing corpus of data involved in climate research is a barrier for ECRs to advance their careers, then we might simply be failing to properly train these scientists for a modern day research career in this field. Supporting resources are one thing; equipping the domain experts to advance these challenges is another option.
- There may be some self-selection bias here. I've anecdotally noted that many of my ECR climate friends with very strong software and data management skills have opted to jump to the private sector, either within the geosciences or beyond. Maybe the challenge here is, by extension of my last point, a pipeline challenge - those most equipped to work effectively in the modern data-heavy climate research paradigm might simply be choosing other career opportunities. We've absolutely noticed this in the weather community, and it's becoming a critical workforce issue.
- If data management activities aren't attractive because they aren't incentivized in the tenure-track rat race, then once again we need to ensure that we're modernizing our institutions to value and reward the body of activities that comprise contemporary scientific research. I wonder, if publishing and maintaining code for a critical model or tool was highly valued by a funding or tenure committee, would the authors think differently of the balance of time an ECR might choose to invest in this activity?
### Chris Dupuis
I agree that this is an important conversation to have. I can agree with a lot of what they're saying from a science perspective; I definitely run into a lot of the same issues. But, having been on the "data science" side as well (by which I assume they mean analysis, software dev, and IT), this paper is kind of annoying. I think they're not really grappling with the fundamental problems here, which I think are mostly a function of science policy and academia's feudalistic structure.
Snippets below:
-------------
The traditional career path of a research scientist rewards specific personality traits. Among them are the traits we usually associate with researchers: curiosity, persistence, strong work ethic, etc. However, the education environment from as early as the undergraduate level also selects for other traits, like tolerance of complexity, subordination to hierarchy, confrontation avoidance, conformity, and a willingness to accept ridiculous life circumstances, like having to move across the nation (or world) every few years.
These traits can often have negative consequences. A roomful of scientists who happily accept complexity will have no trouble belaboring a simple point until it’s barely intelligible. This is a plausible scenario in science because by the time an ECR reaches a level where they might have a say in research direction, almost everyone who is willing to put the brakes on has been selected out. I believe this happens at every step in the pipeline: many are alienated by the increasing bureaucracy and security measures; many others are pushed out by intransigence. Some are assigned to work with hopelessly messy codebases, and are then told not to fix them in any meaningful way.
This tolerance for complexity leads to absurdities like CMIP’s explosion of experiments and the resulting data. It’s obvious to me that CMIP needs to be narrowly defined, but I think we can agree that the chances of that happening are slim.
In other words, academic science produces System 2 thinkers and usually removes System 1 thinkers. While System 2 thought processes can help answer theoretical questions, they are in direct conflict with the kinds of problems data science tries to solve. In particular, machine learning consists of System 1 problems and solutions exclusively. Further, many of these personality traits are correlated, and any one of them can be a reason why a System 1 thinker would leave traditional academic research.
Software development is also somewhat in the System 1 camp. Research shows [citations] that programming activates parts of the brain associated with language skills. Language skills are generally deprioritized in science education, despite being generally accepted as important. However, this is evidence that language skills directly impact the work that scientists do.
https://medicalxpress.com/news/2020-06-language-brain-scans-reveal-coding.html
https://www.developer-tech.com/news/2021/jan/04/brain-activity-coding-differs-processing-language-maths/
If we accept the framing of Jain et al., it’s no surprise that ECRs are having trouble with data science: we’ve filtered out all the potential researchers with those skills who aren’t also excellent System 2 thinkers!
--------------------
Jain et al. have proposed some generally reasonable ideas to solve this issue. However, I think these can be reimagined to say that we need to professionalize data science in scientific fields. There are a couple of incongruities though:
* There is a strange assumption that any of the students, postdocs, etc. tasked with data science should want to become academic researchers. In the United States, about 80% of postdocs will never achieve tenure, to say nothing of Ph.D. students. Those in danger of being left behind are reasonable to diversify their skill sets. Jain et al. seem to be somewhat aware of this, but the overproduction of postdoctoral positions as exploitable labor suggests we have an unreasonable hierarchy. There is only one solution: the postdoc should be abolished.
* I agree with Jain et al. that undue emphasis on data science is placed on those whose passion is researching novel ideas. However, I think they misread this situation: the blame lies more with funding and publishing mechanisms which overwhelmingly prioritize incrementalism over radical (if risky) endeavors. I believe the growth of data science is an adaptation to this toxic environment, where data science offers an outlet for the creativity and novelty that is generally not rewarded in institutional science. Pouring more money into fundamental research will not solve this problem. Instead, funding specifically demarcated for data science can remove this work from the core research efforts.
One caveat, however, is that in places where something like this arrangement exists, a common pitfall occurs: scientists assign this work to their data scientists, without relinquishing any control over code, data, infrastructure, or their repositories. This creates a situation in which the data scientist assumes all the responsibility and consequences of the decisions made, with few or no decision-making rights. This lack of control (among other reasons) makes such jobs undesirable, and talented data scientists have lucrative alternatives.
The academic lifestyle (data science included) is also unappealing or untenable for many aspiring scientists. Without professional positions, we are consigned to a semi-nomadic life, making relationships difficult to maintain. We are frequently unable to afford the basics of life, like starting a family or decent healthcare in the U.S., and are under constant threat of imminent poverty. These days, it is accepted as normal in academia to not have reliable employment until one is in their 40s. In most industries, it's not unusual to establish a life-long career by 25 or 30. This extreme devaluation of younger people's work can change very quickly for someone with data science skills if they have the heart to leave academia.
All of this is to say, as a data scientist, why should we help you?
-------------
Jain et al. state that data science can displace scientific activities like posing scientific questions and hypotheses. I don’t believe the relationship has to be antagonistic. Compilers and interpreters are proof systems at their core, although the proofs available depend on which language features are selected for each language. Therefore, it is not uncommon to prove or disprove some new idea by trying to implement something and running into a fatal design flaw. This is, notably, far faster than running an idea through a publication process.
From the software development perspective, it’s perplexing that scientists are generally averse to learning things that will make their lives easier. The sheer number of man-hours lost to bad software architecture, bad versioning practices, etc., comes into blinding view when one invests whole-heartedly in these skills. It is frustrating to see so many scientists complain of the time it takes to do basic research, while refusing to learn the skills necessary to solve this problem. It is not difficult to discover which skills are necessary and how to learn them, with a little curiosity.
I imagine this situation exists largely because educational institutions simply do not teach science students (at any level) to do any of this, or that they should care. Many current scientists and staff did not learn how to program until graduate school or later, much less how to properly use object-oriented design, functional programming, or version control, or what they even are.
----------
Section 3.1 is in direct contradiction to everything before it. What?
----------
Re: Julius on the "theory first" approach. I agree completely, I almost never do science this way. Where's the exploration? The part where you just throw different functions at data and see what falls out?
I hate this stupid notion that somehow, numerical modelling is superior. It doesn't even make sense: if we want to decide which branch of science is "real" science, numerical modelling has nothing on observational science.
## Discussion