Kickstart
simo.tuomisto@aalto.fi
Looking for contributors/co-authors: add your name if you contributed. All original text and images are under CC-BY 4.0.

As shown in the short introduction to scientific computing, a typical research project might look like this:
Figure 1 - A common pipeline of scientific computing
A single project can require a huge number of skills and tools:
Skills learned during one project are often reused in subsequent projects as well.
This brings an interesting problem:
I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.
This means that even bad habits can be reused in subsequent projects.
Figure 2 - xkcd #1739: Fixing Problems
Thus you should always keep a critical eye on the tools you're using and on whether they are good tools for the job at hand.
In the next sections we look through some tools and skills that might help you on your journey.
However, remember that these are just suggestions: if your current tools work for you, you may already be using the right tool for the task.
A small journey is a situation where you want to get a quick glimpse of a possible project.
Maybe you want to explore a new dataset, to test out a new algorithm or to see if you can use a new programming language.
For these kinds of projects you should invest in the following:
There are many programming languages that can be used for general scientific computing. These kinds of languages can do (among other things):
Figure 3 - Arbitrary definition of a generic scientific programming language
Python is the most popular language for general scientific computing. The main features are:
Matlab is a commercial numerical computing language. It is quite popular in signal processing and in laboratories. The main features are:
When working with Matlab it is good to remember that it is a commercial product and not everyone has access to a licence.
Julia is a newer language that has been designed for high-performance computing. The main features are:
R is a language for statistical computing and data analysis. It is especially popular in the statistics and bioinformatics communities. The main features are:
There are plenty of good editors and IDEs. Here's a list of some of the most popular ones.
Generic IDEs:
Python IDEs:
Julia IDEs:
R IDEs:
Even when you're just experimenting, it is a good idea to write your code as a script or as a Jupyter notebook. This way your ideas will not disappear when you close your console.
Working like this might seem self-evident, but it is important to keep in mind what commands you have written. Saving your code is the first step of documenting your work.
IDEs commonly have an interactive console and an editor window where you can write your scripts. You can run commands in the console itself, but it is better practice to write your code as a script and then execute the code in the interactive console.
Figure 4 - Writing a script with an IDE
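For example, instead of typing commands directly into the console, you could save them into a small script. The sketch below is purely illustrative; the file name, data and plotting choices are made up:

```python
# analyze.py - a hypothetical example script
import numpy as np
import matplotlib.pyplot as plt

# Generate some example data instead of loading a real dataset
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 100)
y = np.sin(x) + 0.1 * rng.standard_normal(100)

# A simple calculation whose result we want to keep track of
print(f"Mean of y: {y.mean():.3f}")

# Save the plot so the exploration can be revisited later
plt.plot(x, y, label="noisy sine")
plt.legend()
plt.savefig("noisy_sine.png")
```

Running the script again later repeats exactly the same steps, which is something an interactive console session cannot guarantee.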
Jupyter notebooks work a bit differently from typical scripts. A notebook is split into cells that can contain code or documentation. Code cells are executed by a kernel, and kernels are available for many different languages.
Figure 5 - Writing a notebook
For more on Jupyter, see for example:
A new project is something that takes more than a few days to complete. Even a single weekend off or a switch to a different project can break your train of thought, so it is important to record your work.
Maybe your initial exploration looked promising and now you want to try it out for real.
For these kinds of projects you should invest in the following:
Version control is an invaluable tool whether you're doing scientific research or software engineering and Git is by far the most popular version control tool available.
Version control tools such as Git track changes on a line-by-line basis and record them as commits. This allows you to revert changes, merge commits made by collaborators and keep code up to date across multiple systems.
Thus, if you do not use version control yet, start learning Git immediately. There are plenty of good resources, such as:
There are many providers for centralized repository storage. For public projects:
Whenever you're starting a new project, it is a good idea to start by creating a new repository for the project.
Commenting your code and documenting your project will help you and your collaborators keep track of the different pieces of the project.
You might think that you can keep all of the moving pieces in your memory, but that puts unnecessary pressure on it. It's better to remember the big picture and, whenever you need the details, look at the comments to refresh your memory.
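In Python, for example, a comment or docstring right next to the function records the details so you do not have to remember them. This is a minimal sketch with a made-up function:

```python
def normalize(values):
    """Scale a sequence of numbers to the range [0, 1].

    Raises ValueError if all values are equal, because the range
    would then be zero.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("cannot normalize: all values are equal")
    # Keep the computation explicit so the intent stays readable
    return [(v - lo) / (hi - lo) for v in values]

print(normalize([2, 4, 6, 8]))  # [0.0, 0.333..., 0.666..., 1.0]
```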
There are plenty of tools that help with commenting:
It is also good to remember that no one likes commenting and documenting.
All projects will utilize programs that are not part of the project. Keeping track of these requirements from the start is very important, as it allows you to:
There are various ways of keeping track of your requirements. Below are a few examples:

- Write an installation file (e.g. INSTALL.md) with instructions on how to install the program.
- Use a requirements file such as requirements.txt (pip) or environment.yml (conda) that works with a package manager, or keep a list of the packages you have installed.

The most important thing is to keep yourself up to date on what your code uses and to record it somewhere.
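For example, a requirements.txt for a Python project is simply a list of the packages (and, optionally, version constraints) your code needs. The packages and version pins below are purely illustrative:

```text
# requirements.txt (illustrative example)
numpy>=1.24
scipy>=1.10
matplotlib>=3.7
```

With pip, `pip install -r requirements.txt` installs everything listed, and `pip freeze` prints the exact versions currently installed so you can record them.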
If I have seen further it is by standing on the sholders [sic] of Giants.
No single person can know everything and no one has time to implement every feature themselves. Thus using existing packages and frameworks is imperative for effective scientific computing.
If you start writing your own function for, say, calculating an integral or doing a least-squares fit, ask yourself whether someone else has ever needed the same function. If the answer is yes, most likely someone has already implemented the functionality in a scientific computing package or framework.
By using packages made by others you avoid bugs in your own implementations and you make your code easier to read for other people.
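For example, instead of implementing numerical integration or least-squares fitting yourself, SciPy already provides well-tested routines. This is a minimal sketch; the integrand and data below are made up:

```python
import numpy as np
from scipy import integrate, optimize

# Numerical integration: integrate sin(x) from 0 to pi (exact answer is 2)
value, error = integrate.quad(np.sin, 0, np.pi)
print(f"Integral: {value:.6f} (estimated error {error:.1e})")

# Least-squares fit: fit a line to some noisy, made-up data
def line(x, a, b):
    return a * x + b

x = np.linspace(0, 5, 20)
y = 2.0 * x + 1.0 + 0.2 * np.random.default_rng(0).standard_normal(20)
params, covariance = optimize.curve_fit(line, x, y)
print(f"Fitted slope {params[0]:.2f} and intercept {params[1]:.2f}")
```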
Below are a few examples of frameworks that might interest you:
When you're starting on a long-term project, you need to plan ahead for possible future needs.
Maybe you'll want to scale up your computations in an HPC environment. Maybe you'll want to share your project with the wider world.
Like building a house, you'll want to have the project on a solid foundation.
Keeping the following ideas at the back of your head will help you in creating such a foundation:
The command line (also known as a shell) can be used to run programs without a graphical interface. This might seem like an old way of running programs, but in many cases it is the most efficient way.
One of the main advantages of the command line is that it can be used on all kinds of different systems as long as they have the same shell, especially on ones that do not have a graphical interface.
Another advantage is that when you run a program through the command line, you do not need the IDE that was used to create it. This makes it easier to port the program to other systems and share it with other users.
These features are essential on high-performance computing (HPC) systems, which usually do not have a graphical interface and where you want to focus on running the code and nothing else.
Most popular Unix-style shells are:
To learn more on command line usage, see:
Most languages also have libraries or tools for creating easy command-line interfaces. Here are a few examples:
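In Python, for instance, the built-in argparse module turns a script into a command-line program with just a few lines. This is a minimal sketch; the script and argument names are made up:

```python
# cli_example.py - a hypothetical command-line interface built with argparse
import argparse

parser = argparse.ArgumentParser(description="Run a small analysis")
parser.add_argument("input_file", help="path to the input data file")
parser.add_argument("--iterations", type=int, default=10,
                    help="number of iterations to run (default: 10)")
args = parser.parse_args()

print(f"Would process {args.input_file} for {args.iterations} iterations")
```

Running `python cli_example.py data.csv --iterations 5` would then parse the arguments without any graphical interface.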
Typically, a project has some parts that are used all the time and others that are used only rarely. Most scientific programs share this structure: some parts of the code are called again and again, while other parts are called only once or twice.
Thus, when you reach a point where you want to do more, it is important to find out what actually takes the time and to focus your efforts there.
Figure 6: xkcd #1205: Is It Worth the Time?
All languages have tools for profiling. They will help you figure out where the bottlenecks might be. Some basic profilers are listed below.
After you have profiled which parts of your program take the most time, you can fix possible bugs in the code itself or use specialized libraries to optimize the code.
However, one should try to avoid premature optimization. If the code does not do what you want it to do, running it fast won't help.
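In Python, for example, the built-in cProfile module gives a quick overview of where the time goes. This is a minimal sketch with a deliberately naive, made-up function:

```python
import cProfile

def slow_sum(n):
    # Deliberately naive: sums numbers one by one in a Python loop
    total = 0
    for i in range(n):
        total += i
    return total

def main():
    for _ in range(200):
        slow_sum(100_000)

# Print per-function statistics sorted by cumulative time
cProfile.run("main()", sort="cumulative")
```

The output lists how many times each function was called and how much of the total time it accounted for, which tells you where optimization effort would actually pay off.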
What will happen to the project if I'm not going to update it any more? Will anyone else use this after me? Should I make my project public?
It is good to ask such questions when you start a new project. Knowing what the end goal of the project will be helps you make choices throughout the project.
In many cases, opening up the code in a shared repository and developing it publicly is the best option. This will help you design the code not only for yourself, but for others as well.
If fully public development is not an option, a private repository shared with your colleagues is a good alternative. Getting feedback from your colleagues and minimizing the burden of contributing is always a good idea.
This is a very broad topic and there is no single good answer, but for more information you can check out the following:
A single person cannot know everything about everything. Thus, whether you're starting a project, designing it, working on it or sharing it with others, you'll encounter questions you do not know the answers to.
Scientific computing is a field where sharing information is crucial:
At every step of the way, we work with other people, in one way or another. When you encounter a problem, someone else might already have a solution. When you find a solution, someone else might need it to solve their problem.
Sharing information and asking for help are probably the most important skills you might learn when doing scientific computing.
There are many sources of help/information available:
For help:
For information: