scikit-learn user survey STAGE 2 - DRAFT

### SCIKIT-LEARN USER SURVEY QUESTIONNAIRE STAGE TWO

**WELCOME TO THE SCIKIT-LEARN SURVEY**

This study is being conducted by the scikit-learn survey team and should take approximately ?? minutes to complete. Your participation is voluntary, and your responses will remain completely confidential and will be used for analysis only in anonymized form.

Please check the box below to indicate that you have read this statement in its entirety, that your questions about the research study have been answered to your satisfaction, and that you voluntarily agree to participate in the study. You may print a copy of this consent form if you wish.

[x] I have read this statement in its entirety and affirm the stated conditions.

### USING AND CONTRIBUTING TO SCIKIT-LEARN

1. How often do you use scikit-learn? [Multiple choice]
    * Daily or almost daily
    * A few times per week
    * A few times per month
    * Less than once per month
    * Other (please specify)
2. How often do you contribute to the scikit-learn community? [Multiple choice]
    * Daily or almost daily
    * A few times per week
    * A few times per month
    * Less than once per month
    * Contributed once (or a few isolated times) in my lifetime
    * Never (Skip to "Are you interested in contributing to scikit-learn?")
3. What initially led you to contribute to scikit-learn? Please select all that apply. [Multiple choice]
    * I am an expert user of scikit-learn
    * I was invited by someone from the project core team
    * Sprint (development or for newcomers)
    * Search engines / technical review / forum / blog / advertising
    * Social network / friend / colleague
    * School / university / conference
    * Other (please specify)
4. In what way(s) have you contributed to scikit-learn? [Please select all that apply]
    * Code maintenance and development
    * Community coordination (e.g., organising local meetups, conferences, development sprints, newcomers sprints)
    * DevOps
    * Answering user questions (e.g., on StackOverflow, user forums, GitHub, etc.)
    * Developing educational content & narrative documentation (e.g., tutorials)
    * Writing technical documentation (e.g., docstrings, user guide, reference guide)
    * Fundraising
    * Project management
    * Translating content
    * Website design and development
    * Other (please specify)
5. Do you feel part of the scikit-learn community? [Multiple choice]
    * Yes, definitely
    * Yes, somewhat
    * Neutral
    * No, not really
    * No, not at all
    * Not sure
6. Are you interested in contributing to scikit-learn?
    * Yes
    * No
7. In what ways would you be interested in contributing to scikit-learn? Please select all that apply. If you are ready to start immediately, please join the [scikit-learn contributor community Discord channel](https://discord.com/channels/731163543038197871/918532179888336926).
    * Code maintenance and development
    * Community coordination
    * DevOps
    * Responding to GitHub issues
    * Developing educational content & narrative documentation (e.g., tutorials)
    * Writing technical documentation (e.g., docstrings, user guide, reference guide)
    * Fundraising
    * Project management
    * Translating content
    * Website design and development
    * Other (please specify)

### PROJECT FUTURE DIRECTION AND PRIORITIES

8. How strongly do you agree with the following statements? [answers: strongly agree, agree, neither agree nor disagree, disagree, strongly disagree, I don't know]
    * Core contributors respond promptly to my issues.
    * Core contributors' responses to my issues are helpful.
    * The scikit-learn documentation is comprehensive and easy to understand.
    * The scikit-learn community actively contributes to improving and expanding the library.
    * Scikit-learn models are versatile enough to cover most of my use cases.
    * Scikit-learn is updated frequently enough to stay current with the latest advancements in the field.
9. How strongly do you agree with the following statements? [answers: strongly agree, agree, neither agree nor disagree, disagree, strongly disagree, I don't know]
    * The learning curve for beginners in scikit-learn is steep.
    * Certain machine learning techniques are underrepresented in scikit-learn. [Conditional question] If agree: Please state what is lacking. [Free text]
    * The issue resolution process could be more transparent and efficient.
    * The lack of certain features or functionalities limits the scope of scikit-learn for specific use cases.
10. If scikit-learn had one extra full-time team member, what would you like them to focus on? Please drag and drop the following items into order of priority, with 1 being the highest priority.
    * Performance
    * Reliability
    * Packaging
    * New features
    * Technical documentation
    * Educational materials
    * Website redesign
    * Other (please specify)
11. Please expand on your answer about the priorities for scikit-learn.
12. What single immediate change to scikit-learn would bring the most value to you as a scikit-learn user? [Free text]
13. How strongly do you agree with the following statements? [answers: strongly agree, agree, neither agree nor disagree, disagree, strongly disagree, I don't know]
    * The scikit-learn core contributors should do more to explore collaborations with other machine learning libraries to enhance interoperability. [Conditional question] If agree: Which libraries should the project collaborate with, and for what purpose? [Free text]
    * The scikit-learn core contributors should invest in educational initiatives to lower the entry barrier for new users.
    * The scikit-learn core contributors should enhance community engagement through events, tutorials, or mentorship programs.
14. How strongly do you agree with the following statements? [answers: strongly agree, agree, neither agree nor disagree, disagree, strongly disagree, I don't know]
    * Insufficient funding or resources may hinder the development and maintenance of scikit-learn.
    * Evolving the library with changes that break backwards compatibility could prevent me from using recent versions because updating my code base is too expensive or complex.
    * Inadequate community diversity and inclusivity may limit the perspectives and ideas contributing to scikit-learn.
    * Scikit-learn has a risk of obsolescence compared to other rising libraries and tools.
15. What opportunities do you see for scikit-learn? [optional] [Free text]
16. What threats do you see for scikit-learn? [optional] [Free text]

### TECHNICAL QUESTIONS

**Project**

17. What share of your tasks falls into each of the following categories? (State in steps of 10: 10, 20, ..., 100%)
    * Regression: ... %
    * Classification: ... %
    * Forecasting: ... %
    * Outlier/anomaly detection: ... %
    * Dimensionality reduction: ... %
    * Clustering: ... %

**Other**

18. What metrics do you use to evaluate your models? [Drop down menu] [Multiple choice]
    * Accuracy
    * Precision & Recall
    * F1 Score
    * ROC-AUC
    * MAE
    * MSE
    * Other (please specify)
19. What visualizations do you use to evaluate your models? [Drop down menu] [Multiple choice]
    * Confusion matrix
    * Reliability diagram
    * ROC curve
    * Precision-Recall curve
    * Feature importance
    * Residual plots
    * Learning curves
    * Other (please specify)
20. How often do you update scikit-learn? (as soon as there's a new release, ..., still pinned to 0.1)
21. How badly do breaking changes impact your workflow? [Free text]
22. Do you catch breaking changes before they happen (i.e., pay attention to future warnings) or after (i.e., when it breaks your code)?

**Data**

23. What type of data do you work with? (Images, tabular, text, relational, time series, etc.)
24. For tabular data, how big are your datasets in bytes (median size)?
25. For tabular data, how big are your datasets in number of records (median size)?
26. For tabular data, how big are your datasets in number of columns (median size)?
27. Which DataFrame libraries do you use?
28. Do you use a data catalogue?
    * Tabular data (single table)
    * Tabular data (multiple joined tables to perform feature engineering)
    * Time series
    * Text data
    * Images
    * Sound / signal
    * Other

**Modeling**

29. Which modules do you regularly use? [Multiple choice] - TO ADD
30. For each estimator in the following list, please check the option that applies: a) heard about it; b) have used it at least once; c) use it regularly (several times a year).
    1. `ARDRegression` [ a ], [ b ], [ c ]
    2. `AdaBoostClassifier` [ a ], [ b ], [ c ]
    3. `AdaBoostRegressor` [ a ], [ b ], [ c ]
    4. `BaggingClassifier` [ a ], [ b ], [ c ]
    5. `BayesianRidge` [ a ], [ b ], [ c ]
    6. `BernoulliNB` [ a ], [ b ], [ c ]
    7. `Birch` [ a ], [ b ], [ c ]
    8. `CCA` [ a ], [ b ], [ c ]
    9. `DecisionTreeClassifier` [ a ], [ b ], [ c ]
    10. `DecisionTreeRegressor` [ a ], [ b ], [ c ]
    11. `ElasticNet` [ a ], [ b ], [ c ]
    12. `ExtraTreeClassifier` [ a ], [ b ], [ c ]
    13. `ExtraTreesClassifier` [ a ], [ b ], [ c ]
    14. `ExtraTreesRegressor` [ a ], [ b ], [ c ]
    15. `GaussianNB` [ a ], [ b ], [ c ]
    16. `GaussianProcessClassifier` [ a ], [ b ], [ c ]
    17. `GaussianProcessRegressor` [ a ], [ b ], [ c ]
    18. `GradientBoostingClassifier` [ a ], [ b ], [ c ]
    19. `GradientBoostingRegressor` [ a ], [ b ], [ c ]
    20. `HistGradientBoostingClassifier` [ a ], [ b ], [ c ]

    These estimators cover a range of machine learning tasks, including classification, regression, and clustering. For the complete API and more details on each estimator, you can refer to the [scikit-learn documentation](https://scikit-learn.org/stable/).
31. Is calibration of probabilistic classifiers a concern for you?
    * Yes
    * No
32. Is calibration of regressors a concern for you?
    * Yes
    * No
33. Are uncertainty estimates for prediction important to you?
    * Yes
    * No
34. Is cost-sensitive learning a concern for you?
    * Yes
    * No
35. Do you use scikit-learn pipelines (`sklearn.pipeline.Pipeline` class or `make_pipeline` function)?
    * Yes
    * No
36. Do you use scikit-learn column transformers (`sklearn.compose.ColumnTransformer` class or `make_column_transformer` function)?
    * Yes
    * No
37. Do you need feature importances for your particular use cases?
    * Yes
    * No
38. Do you use `sample_weight`?
    * Yes
    * No
39. Do you find the metadata routing feature helpful?
    * Yes
    * No
    * I'm not sure what it is
40. Do you need nested cross-validation (e.g., cross-validation of a hyperparameter tuning procedure)?
    * Yes
    * No
41. [Conditional question] If yes, ADD ADDITIONAL QUESTION
42. In what proportion do you use non-Euclidean metrics? [Free text]
43. What methods do you use most often? [Please select all that apply.] - TO ADD

**Deployment**

44. What is your typical usage of scikit-learn?
    * Draft investigation
    * Proof of concept
    * Predictions in a production service, e.g., a live web service
    * Batch processing of production data
45. If you deploy in production, what provider do you use?
    * Azure
    * AWS
    * GCP
    * On premises
    * Other (please specify)
46. Do you use any of the following tools to assist you when using scikit-learn?
    * SageMaker
    * Domino Data Lab
    * Google Colab
    * Dataiku
    * Databricks
    * JupyterHub
    * Custom tooling
    * Other
47. Are you satisfied with your current MLOps tools?
    * Yes
    * No
    * Prefer not to answer
48. What is your current budget for MLOps tools?
    * $0 per user per month
    * $10 per user per month
    * $50 per user per month
    * $100 per user per month
    * $200 per user per month
    * $500 per user per month
    * More than $500 per user per month
49. How often do you use an accelerator (GPU)?
50. In what kinds of use cases have you found scikit-learn's computational performance to be an issue?
51. Assuming the notion of "ML project" covers data preparation, feature computing, training an estimator, and bringing it to production, what is the typical size of the ML projects you have been working on?
    * Less than 100 lines of code
    * 100 to 1,000 lines of code
    * 1,000 to 10,000 lines of code
    * More than 10,000 lines of code
52. Would you say your programming practices when working on ML projects include the following? [Multiple choice]
    * Unit testing
    * Continuous integration
    * Continuous deployment
    * Code versioning
    * Data versioning
    * Model versioning
    * Logging and monitoring
53. How many ML projects do you have in your organization?
54. For model registry and experiment tracking, do you use any of these tools?
    * MLflow
    * DVC
    * Weights & Biases
    * Neptune
    * Custom tool
    * Other
55. For scheduling, do you use any of these tools?
    * Airflow
    * Metaflow (Outerbounds)
    * Dagster
    * Coiled
    * Custom tool
    * Other

### VOLUNTEER FOR INTERVIEW

56. Would you like to volunteer for a short conversation with the scikit-learn team to discuss your responses in more detail?
    * Yes
    * No
57. [Conditional question] If yes, please provide your email address. [Free text]

### SURVEY FEEDBACK

58. How do you feel about the length of the survey?
    * Appropriate in length
    * Too long
    * Too short
59. How easy or difficult was this survey to complete?
    * Easy
    * Difficult
    * Neither easy nor difficult
60. How important do you believe open source software will be for the future development of AI?
    * Extremely important
    * Very important
    * Moderately important
    * Slightly important
    * Not at all important
61. How do you perceive the influence of large technology companies on the direction and development of open source AI projects?
    * Extremely negative
    * Very negative
    * Neutral
    * Very positive
    * Extremely positive
62. In your opinion, what are the biggest challenges facing the open source AI community in the next 5-10 years? (Select up to 3)
    * Sustainability and funding of open source projects
    * Balancing the interests of individual contributors, academic institutions, and corporate sponsors
    * Ensuring diversity and inclusivity in the open source AI community
    * Keeping pace with the rapid advancements in ML/AI research and development
    * Maintaining high-quality documentation and user support
    * Addressing ethical concerns and potential misuse of open source ML/AI tools
    * Other (please specify)

### ADDITIONAL QUESTIONS

What percentage of your scikit-learn usage falls into the following categories?

* For profit / company work
* For research / open source
* For learning
* Other (please specify in the field below)