
Conda for (data) scientists

slides: https://hackmd.io/@samumantha/conda-slides


Time (CET) Episode
9-10 Getting started, Working with Environments
10-10.15 Break
10.15-11.15 Working with Environments, Using packages and channels
11.15-11.30 Break
11.30-12 Using packages and channels, Sharing environments

Getting started with conda

episode: https://carpentries-incubator.github.io/introduction-to-conda-for-data-scientists/01-getting-started-with-conda/index.html


Packages

import something

Note:

  • Python can do a lot

  • not alone

  • external packages

  • import package

  • Module: a collection of functions and variables, as in a script

  • Package: a collection of modules with an __init__.py file (can be empty), as in a directory with scripts

  • Library: a collection of packages with related functionality

Library/Package are often used interchangeably.
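
To make the module/package distinction concrete, here is a minimal sketch that builds a throwaway package on disk and imports a module from it. The names `mypackage`, `greetings` and `hello` are hypothetical and chosen only for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(root: Path) -> None:
    """Create a tiny package: a directory with __init__.py and one module."""
    package_dir = root / "mypackage"                # hypothetical package name
    package_dir.mkdir()
    (package_dir / "__init__.py").write_text("")    # the file may be empty
    (package_dir / "greetings.py").write_text(      # a module: functions in a script
        "def hello(name):\n"
        "    return f'Hello, {name}!'\n"
    )


def demo() -> str:
    """Import the module from the temporary package and call its function."""
    with tempfile.TemporaryDirectory() as tmp:
        build_demo_package(Path(tmp))
        sys.path.insert(0, tmp)                     # make the package importable
        try:
            importlib.invalidate_caches()
            module = importlib.import_module("mypackage.greetings")
            return module.hello("conda")
        finally:
            # Clean up so the demo leaves no trace in the running interpreter.
            sys.path.remove(tmp)
            sys.modules.pop("mypackage.greetings", None)
            sys.modules.pop("mypackage", None)


if __name__ == "__main__":
    print(demo())
```

External packages such as pandas follow the same layout; they are just installed into the environment instead of being created by hand.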

Note: a "module" on an HPC system means something different.


Dependencies

"[Something] relying on [something else] to work."

Note:

  • not alone
  • not reinvent the wheel
  • building on shoulders of giants
  • updated versions may create issues

From pandas documentation:


From scikit-learn documentation
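
The dependencies a package declares can also be inspected from Python itself. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `declared_dependencies` is hypothetical, and the result depends on what happens to be installed in the current environment.

```python
from __future__ import annotations

from importlib import metadata


def declared_dependencies(package_name: str) -> list[str]:
    """Return the requirement strings a package declares, or [] if unknown."""
    try:
        requirements = metadata.requires(package_name)
    except metadata.PackageNotFoundError:
        return []                        # not installed in this environment
    return list(requirements or [])      # requires() gives None if nothing is declared


if __name__ == "__main__":
    for name in ("pandas", "scikit-learn"):
        deps = declared_dependencies(name)
        print(name, "->", deps if deps else "not installed or no declared dependencies")
```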


"I found this package that would solve all my problems, but it needs [some package] 1.3 while all other packages I need rely on [same package] 2.5"



"I found this package that would solve all my problems, but it needs [Python] 2.7 while all other packages I need rely on [Python] 3.6"

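
The two situations quoted above are the same problem: one installation can hold only one version of a given package (or of Python). A minimal sketch of that check, assuming exact version pins only; real package managers such as Conda solve the harder problem of version ranges. The function name and the demo data are hypothetical.

```python
from collections import defaultdict


def find_conflicts(requirements: dict) -> dict:
    """Return the packages that are pinned to more than one version.

    `requirements` maps a project (or package) name to the exact versions
    it pins, e.g. {"analysis": {"numpy": "1.3"}}.
    """
    pinned = defaultdict(set)
    for pins in requirements.values():
        for package, version in pins.items():
            pinned[package].add(version)
    # A single installation can hold only one version of each package.
    return {pkg: versions for pkg, versions in pinned.items() if len(versions) > 1}


if __name__ == "__main__":
    demo = {
        "shiny_new_package": {"some_package": "1.3"},
        "everything_else": {"some_package": "2.5", "other_package": "0.9"},
    }
    print(find_conflicts(demo))
```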


Environments

Note:

  • in the beginning everything is straightforward: we install software directly on our computer
  • then multiple versions become necessary
  • one option: multiple computers
  • or environments

Environment management system

  • Multiple versions
  • Portability
  • Rights

Note:
An environment management system solves a number of problems commonly encountered by (data) scientists.

  • multi version

  • make old code work

  • 'it works on my machine'

  • same packages -> same results

  • Project specific setup

  • resolve dependency issues by being able to use multiple versions

  • projects are self-contained and reproducible by capturing all package dependencies in a single requirements file

  • Allow you to install packages on a host on which you do not have admin privileges.

  • Conda is not the only way, check out also other environment management systems!
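
To see which environment a piece of Python code is actually running in, a small sketch like the following can help. The function name `describe_environment` is hypothetical. It relies on `sys.prefix`, `sys.base_prefix` and on the `CONDA_DEFAULT_ENV` variable that `conda activate` sets.

```python
import os
import sys


def describe_environment() -> dict:
    """Collect a few facts about the Python environment running this code."""
    return {
        "executable": sys.executable,                       # interpreter in use
        "prefix": sys.prefix,                               # root directory of the environment
        "in_venv": sys.prefix != sys.base_prefix,           # True inside a venv/virtualenv
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),   # name of the active Conda env, else None
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running this in two different environments should print two different `prefix` values, which is a quick way to check "it works on my machine" claims.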


Package management system

  • Dependencies
  • Updates

Note:

  • simplifies the process of installing software by…

  • identifying and installing compatible versions of software + dependencies.

  • handling the process of updating software

  • Conda is not the only way, check out also other package management systems, such as pip


Why use package and environment management systems?

  • Same package, different version
  • Dependency hell (updates)
  • Reproducibility

Note:

  • Installing software is hard.
  • Installing scientific software is often even more challenging, and most researchers are not software developers.

Drawbacks of system-wide installation:

  • It can be difficult to figure out what software is required for any particular research project.
  • It is often impossible to install different versions of the same software package at the same time.
  • Updating software required for one project can often “break” the software installed for another project.

-> installing software system-wide creates complex dependencies between your research projects that shouldn’t really exist!

Wouldn't it be great to install software separately for each research project?
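
One way to find out what software an installation currently contains is to take a snapshot of it. The sketch below lists every installed Python distribution as `name==version` using only the standard library. The function name `installed_packages` is hypothetical, and the output covers Python packages only, not other software.

```python
from importlib import metadata


def installed_packages() -> list:
    """Return sorted 'name==version' strings for every installed distribution."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:                                   # skip entries with broken metadata
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    for pin in installed_packages():
        print(pin)
```

In a system-wide installation this list mixes the needs of all projects; in a per-project environment it describes just that project.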


Discussion

What are some of the potential benefits from installing software separately for each project? What are some of the potential costs?

Note:
hackmd


Conda

Note:
Conda is an open source package and environment management system that runs on Windows, macOS and Linux.

  • Conda can quickly install, run, and update packages and their dependencies.

  • Conda can create, save, load, and switch between project specific software environments on your local computer.

  • Although created for Python, Conda can package and distribute software for any language such as R, Ruby, Lua, Scala, Java, JavaScript, C, C++, FORTRAN.

  • Conda as a package manager helps you find and install packages.

  • Conda as an environment manager lets you set up, with just a few commands, a totally separate environment to run a different version of Python, while continuing to run your usual version of Python in your normal environment.
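
As a rough illustration of "just a few commands", the sketch below assembles the `conda create` and `conda install` command lines as Python lists and prints them without running anything. The helper names and the environment name `legacy-project` are hypothetical, and whether a given Python version is available depends on your platform and channels.

```python
def conda_create_command(env_name: str, packages: list) -> list:
    """Build the argument list for creating a new environment."""
    return ["conda", "create", "--name", env_name, "--yes", *packages]


def conda_install_command(env_name: str, packages: list) -> list:
    """Build the argument list for installing packages into an existing environment."""
    return ["conda", "install", "--name", env_name, "--yes", *packages]


if __name__ == "__main__":
    # Printed only; paste into a terminal with Conda installed to run them.
    print(" ".join(conda_create_command("legacy-project", ["python=2.7"])))
    print(" ".join(conda_install_command("legacy-project", ["numpy"])))
    print("conda activate legacy-project")
```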


Note:
Conda is a tool for managing environments and installing packages. Miniconda combines Conda with Python and a small number of core packages; Anaconda includes Miniconda as well as a large number of the most widely used Python packages. Conda was originally developed by Continuum Analytics (now Anaconda, Inc.).


Why conda?

  • Avoid building from source
  • takes care of dependencies
  • OS independent
  • combined package and environment management system

Note:

  • Conda solves both package and environment management problems

  • Conda provides prebuilt packages, so no compilers are needed.

    • TensorFlow is an example of a tool that is very hard to install from source, but Conda makes this a single step.
  • Conda is cross platform

  • allows sharing environments

  • pip (another package management system) can be used inside a Conda environment

  • Anaconda: provides commonly used data science libraries and tools, such as TensorFlow, built using optimised, hardware-specific libraries (such as NVIDIA’s CUDA), which provide a speedup without requiring any changes to your code.


Keypoints

  • Conda:
    • platform agnostic
    • open source
    • package and environment management system
    • not only for Python

Note:

  • Conda is a platform agnostic, open source package and environment management system.

  • Using a package and environment management tool facilitates portability and reproducibility of (data) science workflows.

  • Conda solves both the package and environment management problems and targets multiple programming languages. Other open source tools solve either one or the other, or target only a particular programming language.

  • Conda is not only for Python
