## Conda for (data) scientists

<!-- Put the link to this slide here so people can follow -->
slides: https://hackmd.io/@samumantha/conda-slides

---

| Time (CET) | Episode |
| -------- | -------- |
| 9-10 | Getting started, Working with Environments |
| **10-10.15** | **Break** |
| 10.15-11.15 | Working with Environments, Using packages and channels |
| **11.15-11.30** | **Break** |
| 11.30-12 | Using packages and channels, Sharing environments |

---

## Getting started with conda

episode: https://carpentries-incubator.github.io/introduction-to-conda-for-data-scientists/01-getting-started-with-conda/index.html

---

## Packages

```import something```

Note:
* Python can do a lot
* but not alone
* external packages
* import package

* **Module**: a collection of functions and variables, as in a script
* **Package**: a collection of modules with an `__init__.py` file (can be empty), as in a directory with scripts
* **Library**: a collection of packages with related functionality

Library/Package are often used interchangeably.
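
A minimal sketch of the module/package distinction above, assuming nothing beyond the standard library. The names `demo_pkg`, `greetings` and `hello` are made up for illustration: the directory with an (empty) `__init__.py` plays the role of the package, the single `.py` file inside it is the module.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(root: Path) -> None:
    """Create a hypothetical package 'demo_pkg' holding one module 'greetings'."""
    pkg_dir = root / "demo_pkg"
    pkg_dir.mkdir()
    # An (empty) __init__.py marks the directory as a package.
    (pkg_dir / "__init__.py").write_text("")
    # A module is a single file with functions and variables.
    (pkg_dir / "greetings.py").write_text(
        "DEFAULT_NAME = 'world'\n"
        "\n"
        "def hello(name=DEFAULT_NAME):\n"
        "    return f'Hello, {name}!'\n"
    )


def load_greetings():
    """Build the demo package in a temporary directory and import its module."""
    with tempfile.TemporaryDirectory() as tmp:
        build_demo_package(Path(tmp))
        sys.path.insert(0, tmp)  # make the package importable
        try:
            importlib.invalidate_caches()
            module = importlib.import_module("demo_pkg.greetings")
        finally:
            sys.path.remove(tmp)
    return module


if __name__ == "__main__":
    greetings = load_greetings()
    print(greetings.hello())      # Hello, world!
    print(greetings.__package__)  # demo_pkg
```

Real packages are installed into an environment rather than built on the fly; the temporary directory only keeps the sketch self-contained.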
:lightning: "Module" means something different on HPC systems

---

## Dependencies

"[Something] relying on [something else] to work."

Note:
* not alone
* do not reinvent the wheel
* building on the shoulders of giants
* updated versions may create issues

---

From [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html):

![](https://i.imgur.com/jBKPj1d.png)

---

From [scikit-learn documentation](https://scikit-learn.org/0.19/developers/advanced_installation.html):

![](https://i.imgur.com/sJk7NM1.png)

---

"I found this package that would solve all my problems, but it needs [some package] 1.3 while all other packages I need rely on [same package] 2.5" :sob:

---

"I found this package that would solve all my problems, but it needs [Python] 2.7 while all other packages I need rely on [Python] 3.6" :sob:

---

## Environments

Note:
* in the beginning it is all straightforward: we install on our computer
* multiple versions become necessary
* multiple computers
* or environments

---

### Environment management system

* Multiple versions
* Portability
* Rights

Note:
An environment management system solves a number of problems commonly encountered by (data) scientists.
* multiple versions
* make old code work
* 'it works on my machine'
* same packages -> same results
* project-specific setup
* resolve dependency issues by being able to use multiple versions
* projects are self-contained and reproducible by capturing all package dependencies in a single requirements file
* allows you to install packages on a host on which you do not have admin privileges
* Conda is not the only way, check out other environment management systems too!

---

### Package management system

* Dependencies
* Updates

Note:
* simplifies the process of installing software by…
* identifying and installing compatible versions of software + dependencies.
* handling the process of updating software
* Conda is not the only way, check out other package management systems too, such as pip

---

### Why use package and environment management systems?

* Same package, different version
* Dependency hell (updates)
* Reproducibility

Note:
* Installing software is hard.
* Installing scientific software is often even more challenging (no software devs).

Drawbacks of system-wide installation:
* It can be difficult to figure out what software is required for any particular research project.
* It is often impossible to install different versions of the same software package at the same time.
* Updating software required for one project can often “break” the software installed for another project.

-> Installing software system-wide creates complex dependencies between your research projects that shouldn’t really exist!

Wouldn't it be great to install per research project?

---

## Discussion :bulb:

What are some of the potential benefits from installing software separately for each project? What are some of the potential costs?

Note:
hackmd

---

## Conda

Note:
Conda is an open source package and environment management system that runs on Windows, macOS and Linux.
* Conda can quickly install, run, and update packages and their dependencies.
* Conda can create, save, load, and switch between project-specific software environments on your local computer.
* Although created for Python, Conda can package and distribute software for any language such as R, Ruby, Lua, Scala, Java, JavaScript, C, C++, FORTRAN.
* Conda as a package manager helps you find and install packages.
* Conda is also an environment manager: with just a few commands, you can set up a totally separate environment to run a different version of Python, while continuing to run your usual version of Python in your normal environment.

---

![](https://carpentries-incubator.github.io/introduction-to-conda-for-data-scientists/fig/miniconda_vs_anaconda.png)

Note:
Conda is a tool for managing environments and installing packages. Miniconda combines Conda with Python and a small number of core packages; Anaconda includes Miniconda as well as a large number of the most widely used Python packages. Continuum Analytics.

---

## Why conda?

* Avoid building from source
* Takes care of dependencies
* OS independent
* Combined package *and* environment management system

Note:
* Conda solves both package and environment management problems.
* Conda provides prebuilt packages, no compilers needed.
* TensorFlow is an example of a tool that is near impossible to install from source; Conda makes this a single step.
* Conda is cross-platform.
* Allows sharing environments.
* pip (another package management system) can be used inside conda.
* Anaconda: commonly used data science libraries and tools, such as TensorFlow, are built using optimised, hardware-specific libraries (such as NVIDIA’s CUDA), which provides a speedup without having to change any of your code.

---

## Keypoints

* Conda:
  * platform agnostic
  * open source
  * package and environment management system
  * not only for Python

Note:
* Conda is a platform agnostic, open source package and environment management system.
* Using a package and environment management tool facilitates portability and reproducibility of (data) science workflows.
* Conda solves both the package and environment management problems and targets multiple programming languages. Other open source tools solve either one or the other, or target only a particular programming language.
* Anaconda is not only for Python
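
A minimal sketch related to the reproducibility point above, assuming Python 3.8 or newer: it reports which interpreter, environment and package versions a script actually runs against. The function name `describe_environment` and the package list are made up for illustration; `CONDA_DEFAULT_ENV` is the variable that `conda activate` sets, and it is simply missing outside of conda.

```python
import os
import sys
from importlib import metadata


def describe_environment(packages):
    """Return a small report on the interpreter and selected package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "prefix": sys.prefix,  # directory of the active environment
        # Set by `conda activate`; None when no conda environment is active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        "packages": versions,
    }


if __name__ == "__main__":
    # The package names here are only examples.
    report = describe_environment(["numpy", "pandas"])
    for key, value in report.items():
        print(f"{key}: {value}")
```

Running the same script in two different environments makes the "same package, different version" situation visible without changing any project code.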