---
title: Conda for (data) scientists - Intro
tags: Conda, datascience, software carpentry
slideOptions:
  spotlight:
    enabled: false
  theme: "moon"
---
## Conda for (data) scientists
<!-- Put the link to this slide here so people can follow -->
slides: https://hackmd.io/@samumantha/conda-slides
---
| Time (CET) | Episode |
| -------- | -------- |
|9-10 | Getting started, Working with Environments |
|**10-10.15**|**Break** |
|10.15-11.15|Working with Environments, Using packages and channels |
|**11.15-11.30**|**Break** |
|11.30-12|Using packages and channels, Sharing environments |
---
## Getting started with conda
episode: https://carpentries-incubator.github.io/introduction-to-conda-for-data-scientists/01-getting-started-with-conda/index.html
---
## Packages
```import something```
Note:
* Python can do a lot
* not alone
* external packages
* import package
* **Module**: a collection of functions and variables, as in a script
* **Package**: a collection of modules with an `__init__.py` file (can be empty), as in a directory with scripts
* **Library**: a collection of packages with related functionality
Library/Package are often used interchangeably.
:lightning: "Module" means something different on HPC systems
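
A minimal sketch of the module/package distinction, using only the standard library. The helper `is_package` is a hypothetical name introduced for illustration; it relies on the fact that imported packages carry a `__path__` attribute while plain modules do not.

```python
import json          # a package: a directory containing __init__.py
import json.decoder  # a module living inside the json package
import math          # a single module


def is_package(module):
    """Return True if the imported object is a package (it has a __path__)."""
    return hasattr(module, "__path__")


for mod in (json, json.decoder, math):
    kind = "package" if is_package(mod) else "module"
    print(f"{mod.__name__}: {kind}")
```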
---
## Dependencies
"[Something] relying on [something else] to work."
Note:
* not alone
* not reinvent the wheel
* building on shoulders of giants
* updated versions may create issues
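
How differing version needs turn into a problem can be sketched in a few lines. This is a toy illustration, not how any real tool works; the package names, the dependency name `shared_lib`, and the helper `find_conflicts` are all made up.

```python
# Hypothetical pins: each package demands one exact version of a shared dependency.
requirements = {
    "package_a": {"shared_lib": "1.3"},
    "package_b": {"shared_lib": "2.5"},
    "package_c": {"shared_lib": "2.5"},
}


def find_conflicts(requirements):
    """Return the dependencies that are pinned to more than one version."""
    wanted = {}
    for package, pins in requirements.items():
        for dependency, version in pins.items():
            wanted.setdefault(dependency, {}).setdefault(version, []).append(package)
    # Keep only dependencies for which at least two different versions are requested.
    return {dep: versions for dep, versions in wanted.items() if len(versions) > 1}


conflicts = find_conflicts(requirements)
for dependency, versions in conflicts.items():
    for version, packages in versions.items():
        print(f"{dependency} {version} is needed by {', '.join(packages)}")
```

In a single system-wide installation only one version of `shared_lib` can be present, so one of these packages would break.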
---
From [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html):

---
From [scikit-learn documentation](https://scikit-learn.org/0.19/developers/advanced_installation.html)

---
"I found this package that would solve all my problems, but it needs [some package] 1.3 while all other packages I need rely on [same package] 2.5" :sob:
---
"I found this package that would solve all my problems, but it needs [Python] 2.7 while all other packages I need rely on [Python] 3.6" :sob:
---
## Environments
Note:
* in the beginning everything is straightforward: we install on our computer
* then multiple versions become necessary
* option: multiple computers
* or: environments
---
### Environment management system
* Multiple versions
* Portability
* Rights
Note:
An environment management system solves a number of problems commonly encountered by (data) scientists.
* multi version
* make old code work
* 'it works on my machine'
* same packages -> same results
* Project specific setup
* resolve dependency issues by being able to use multiple versions
* projects are self-contained and reproducible by capturing all package dependencies in a single requirements file
* Allow you to install packages on a host on which you do not have admin privileges.
* Conda is not the only way, check out also other environment management systems!
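
To see which environment a script is actually running in, a small check like the following can help. It is a sketch under the assumption that the environment was activated with `conda activate`, which sets the `CONDA_DEFAULT_ENV` variable; outside conda that entry is simply `None`. The function name `describe_environment` is a hypothetical helper.

```python
import os
import sys


def describe_environment():
    """Collect a few facts that identify the currently active Python environment."""
    return {
        "python_version": "{}.{}.{}".format(*sys.version_info[:3]),
        "interpreter": sys.executable,
        "prefix": sys.prefix,
        # Set by `conda activate`; None when no conda environment is active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

Running this in two different environments should show different interpreters and prefixes, which is the "multiple versions side by side" idea in practice.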
---
### Package management system
* Dependencies
* Updates
Note:
* simplifies the process of installing software by…
* identifying and installing compatible versions of software + dependencies.
* handling the process of updating software
* Conda is not the only way, check out also other package management systems, such as pip
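
"Identifying compatible versions" can be pictured with a toy resolver. This is only an illustration of the idea and not conda's actual solver; the version numbers, the constraint format (minimum inclusive, maximum exclusive) and the function names are assumptions made for the example.

```python
def parse(version):
    """Turn '2.5' into (2, 5) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def newest_compatible(available, constraints):
    """Pick the newest available version that satisfies every constraint.

    constraints is a list of (minimum_inclusive, maximum_exclusive) version strings.
    Returns None when no version satisfies all of them.
    """
    candidates = [
        version
        for version in available
        if all(parse(low) <= parse(version) < parse(high) for low, high in constraints)
    ]
    return max(candidates, key=parse) if candidates else None


available = ["1.3", "2.0", "2.5", "3.0"]
print(newest_compatible(available, [("2.0", "3.0"), ("1.3", "2.6")]))  # 2.5
print(newest_compatible(available, [("1.0", "2.0"), ("2.5", "3.0")]))  # None
```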
---
### Why use package and environment management systems?
* Same package, different version
* Dependency hell (updates)
* Reproducibility
Note:
* Installing software is hard.
* Installing scientific software is often even more challenging, and most researchers are not software developers.
Drawbacks of system-wide installation:
* It can be difficult to figure out what software is required for any particular research project.
* It is often impossible to install different versions of the same software package at the same time.
* Updating software required for one project can often “break” the software installed for another project.
-> installing software system-wide creates complex dependencies between your research projects that shouldn’t really exist!
Wouldn't it be great to install software separately for each research project?
---
## Discussion :bulb:
What are some of the potential benefits from installing software separately for each project? What are some of the potential costs?
Note:
hackmd
---
## Conda
Note:
Conda is an open source package and environment management system that runs on Windows, macOS and Linux.
* Conda can quickly install, run, and update packages and their dependencies.
* Conda can create, save, load, and switch between project-specific software environments on your local computer.
* Although created for Python, Conda can package and distribute software for any language, such as R, Ruby, Lua, Scala, Java, JavaScript, C, C++, FORTRAN.
* Conda as a package manager helps you find and install packages.
* Conda as an environment manager lets you, with just a few commands, set up a totally separate environment to run a different version of Python, while continuing to run your usual version of Python in your normal environment.
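
The "few commands" could look roughly like the ones assembled below. The sketch only builds and prints the command lines, so it runs without conda installed; the environment name `legacy` and the helper function names are hypothetical, while `conda create --name`, `--yes` and `conda run --name` are existing conda options.

```python
import shlex


def conda_create_command(env_name, python_version):
    """Command that creates a separate environment with a specific Python version."""
    return ["conda", "create", "--name", env_name, f"python={python_version}", "--yes"]


def conda_run_command(env_name, *args):
    """Command that runs a program inside the named environment."""
    return ["conda", "run", "--name", env_name, *args]


# Hypothetical environment name, chosen for illustration only.
commands = [
    conda_create_command("legacy", "2.7"),
    conda_run_command("legacy", "python", "--version"),
]
for command in commands:
    print(shlex.join(command))
```

To actually execute such a command from Python, the list could be passed to `subprocess.run`, assuming conda is on the `PATH`.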
---

Note:
Conda is a tool for managing environments and installing packages. Miniconda combines Conda with Python and a small number of core packages; Anaconda includes Miniconda as well as a large number of the most widely used Python packages. All are provided by Anaconda, Inc. (formerly Continuum Analytics).
---
## Why conda?
* Avoid building from source
* takes care of dependencies
* OS independent
* combined package *and* environment management system
Note:
* Conda solves both package and environment management problems
* Conda provides prebuilt packages, so no compilers are needed.
* TensorFlow is an example of a tool that is very hard to install from source, while Conda makes it a single step.
* Conda is cross-platform
* allows sharing environments
* pip (another package management system) can be used inside conda environments
* Anaconda: commonly used data science libraries and tools, such as TensorFlow, built using optimised, hardware-specific libraries (such as NVIDIA’s CUDA), which provide a speedup without having to change any of your code.
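
Sharing an environment usually means sharing a small text file that describes it. The sketch below writes such a description in the `environment.yml` layout (name, channels, dependencies, optional `pip:` subsection). The function name, the environment name `demo` and the package `example-package` are placeholders for illustration.

```python
def environment_yaml(name, channels, conda_packages, pip_packages=()):
    """Build the text of an environment.yml file from plain Python lists."""
    lines = [f"name: {name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    lines += [f"  - {package}" for package in conda_packages]
    if pip_packages:
        # pip itself must be installed by conda before it can install anything.
        if "pip" not in conda_packages:
            lines.append("  - pip")
        lines.append("  - pip:")
        lines += [f"      - {package}" for package in pip_packages]
    return "\n".join(lines) + "\n"


text = environment_yaml(
    name="demo",
    channels=["conda-forge"],
    conda_packages=["python=3.6", "pandas"],
    pip_packages=["example-package"],  # placeholder for a pip-only package
)
print(text)
```

Saved as `environment.yml`, such a file can be turned back into an environment with `conda env create --file environment.yml`.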
---
## Keypoints
* Conda:
  * platform agnostic
  * open source
  * package and environment management system
  * not only for Python
Note:
* Conda is a platform agnostic, open source package and environment management system.
* Using a package and environment management tool facilitates portability and reproducibility of (data) science workflows.
* Conda solves both the package and environment management problems and targets multiple programming languages. Other open source tools solve either one or the other, or target only a particular programming language.
* Anaconda is not only for Python