Big Data. Bash Refresher

by Boris

Unix/OS/bash

Introduction

Here you will become familiar with the tools that the "red eyes" community (a.k.a. Linuxoids) uses every day.

In manuals, the $ symbol usually marks the beginning of a shell command. For example:

$ echo hello

That means the author expects that you will type "echo hello" in a bash-compatible shell and, probably, execute it.

$ man

man - a command-line utility that displays the manual pages of commands and library functions in the console.

Usage

$ man man

Outcome

manual for "man" command

Output

  "man is the system's manual pager.  Each page argument given to  man  is
   normally  the  name of a program, utility or function.  The manual page
   associated with each of these arguments is then found and displayed."

(pay ATTENTION to RED WORDS)

Practice

execute commands like:

$ man ssh
$ man 1 strcpy
$ man 3 strcpy # c library function documentation.
$ man rm # Everything after # is a comment and is not executed: rm -r /

Look this command up yourself:

$ sudo rm -rf /

Do you want to execute this? Hint: you do not!

UNIX shell

A shell is a text user interface (TUI) for accessing an operating system’s services. It has many implementations: the original Unix shell, Bourne shell, bash, ksh, csh, zsh, fish, etc.

The default shell on Ubuntu is bash; you reach it through the Terminal application. Hotkey: Ctrl + Alt + T

Questions

  • What happens if you press Ctrl + Shift + T in an open terminal?
  • How does the state of the terminal change if you invoke a command like $ gedit
  • and if you invoke $ gedit &
  • What happens to gedit if you close the terminal after $ gedit &
    Attention: there are several ways to close a terminal (Alt + F4, Ctrl + D, the exit command, etc.) and, as it turns out, at least two different behaviours. Describe your actions and their consequences. For curious kittens :3 look up the signals SIGKILL, SIGSTOP, SIGTERM

Processes

Explore these commands on your own:

$ ps 
$ pstree
$ kill

Advanced. Addition

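A foreground process blocks the shell while it runs; a background process does not. Appending & to a command runs it in the background.

$ gedit &

A foreground process can be suspended with Ctrl+Z; bg then resumes it in the background and fg brings it back to the foreground.

$ jobs # display the list of jobs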
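The background/foreground behaviour described above can be tried without a GUI. The following is a minimal sketch, assuming sleep as a harmless stand-in for gedit; the variable names (bg_pid, was_running, exit_status) are made up for illustration.

```shell
# Start a long-running command in the background (sleep stands in for gedit).
sleep 30 &
bg_pid=$!                 # $! holds the PID of the most recent background job

# The shell is not blocked, so we can check on the process right away.
# kill -0 sends no signal; it only tests whether the process exists.
if kill -0 "$bg_pid" 2>/dev/null; then
    was_running=yes
else
    was_running=no
fi

# Ask the process to terminate (SIGTERM is the default signal of kill).
kill "$bg_pid"

# wait collects the exit status: 128 + 15 = 143 for a process ended by SIGTERM.
exit_status=0
wait "$bg_pid" 2>/dev/null || exit_status=$?

echo "was running: $was_running, exit status: $exit_status"
```

In an interactive terminal you would normally type jobs to see the job number and then kill %1 instead of keeping the PID in a variable.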

Advanced practice #1

Try to terminate the process started by the command:

$ gedit & 

Basic commands

Shell - File system commands

pwd - Print name of current/working directory.
mkdir <dirname> - Make directory.
cd <path> - Change directory.
rm <filenames> - Remove files.
rm -r <dirname> - Remove a directory recursively, with everything inside it.
ls - List the content of a directory.
mv <old_path> <new_path> - Move (or rename) a file.
cat <filenames> - Concatenate files to stdout.

Shell - File System - Special Characters

~ - represents the home directory
. - represents the current directory
.. - represents the parent directory of the current directory

Examples:

$ cd .. # go to the parent directory
$ ls . # list the files in the current directory
$ cd ~ # go to the home directory. What is the path of the home directory?
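The file system commands and special characters above can be combined into one short session. This is only a sketch: it works inside a throwaway directory created with mktemp, and the names sandbox, demo, a.txt and b.txt are invented for the example.

```shell
start_dir=$PWD                  # remember where we started
sandbox=$(mktemp -d)            # throwaway directory, nothing in ~ is touched
cd "$sandbox"

mkdir demo                      # make a directory
cd demo                         # enter it
echo "first line" > a.txt       # create a file by redirecting echo's output
mv a.txt b.txt                  # move (rename) the file
content=$(cat b.txt)            # read the file back
cd ..                           # .. is the parent directory, i.e. the sandbox
listing=$(ls .)                 # . is the current directory
rm -r demo                      # remove the directory and everything in it
after_rm=$(ls .)

echo "content=$content listing=$listing after_rm=[$after_rm]"

cd "$start_dir"                 # leave the sandbox before deleting it
rm -r "$sandbox"
```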

Advanced. Streams and Pipelines.

Standard streams are preconnected communication channels of programs. They are:

  • stdin - standard input, the data going into the program,
  • stdout - standard output, where the program writes its normal output,
  • stderr - standard error, where the program writes error messages.

Usage in python

This code redirects everything printed to stdout into the system's null device (/dev/null), which discards it. That saves a person's PC from one kind of memory leak (critical if the code is checked by a contest system like pcms by S.Protasov).

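import os
import sys


class HidePrints:
    """Context manager that silences print() by pointing sys.stdout at os.devnull."""

    def __enter__(self):
        self._original_stdout = sys.stdout
        sys.stdout = open(os.devnull, 'w')

    def __exit__(self, exc_type, exc_val, exc_tb):
        sys.stdout.close()
        sys.stdout = self._original_stdout

# ...

with HidePrints():
    executeCodeWithFloodyOutput('drop')  # placeholder for any call that prints a lot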

Advanced practice #2

Redirect output (for example, errors) to a file on your PC (like a custom log file).

Bash redirect

It is possible to redirect streams to or from files with > and <

$ ls > list.txt # Save list of files in current directory to list.txt
$ head -n 3 < file.txt # Display the first 3 lines of file.txt.

It is possible to redirect the output of one program to the input of another with | (the pipe symbol)

$ ls | sort -r | tail -n 3

The command above takes the list of files, sorts it in reverse order, and displays the last 3 entries.
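Redirection works per stream, which is easiest to see with a command that writes to both stdout and stderr. The sketch below assumes a made-up helper function emit and temporary file names; 2> redirects stderr and 2>&1 sends stderr wherever stdout currently goes.

```shell
workdir=$(mktemp -d)

# Hypothetical helper: one line to stdout, one line to stderr.
emit() {
    echo "normal output"
    echo "error output" >&2
}

emit > "$workdir/out.txt" 2> "$workdir/err.txt"   # split the two streams
emit > "$workdir/all.txt" 2>&1                    # stderr follows stdout into one file

# A pipe connects stdout of one command to stdin of the next.
last=$(printf 'b\na\nc\n' | sort -r | tail -n 1)

out=$(cat "$workdir/out.txt")
err=$(cat "$workdir/err.txt")
all_lines=$(wc -l < "$workdir/all.txt")

echo "out=$out err=$err all_lines=$all_lines last=$last"
rm -r "$workdir"
```

The order matters in the second call: > all.txt is applied first, then 2>&1 copies that destination to stderr.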

Exercises

Ex1

Create directory “week1” in home directory.

$ mkdir ~/week1
$ cd ~/week1

List entries in /usr/bin that contain “gcc” in reverse alphabetical order. Save results in “~/week1/ex1.txt”.

Hint: use $ grep utility (https://docs.oracle.com/cd/E19455-01/806-2902/6jc3b36dn/index.html)

Ex2

Execute command

$ history -c # to clear history of your bash commands

Then execute some commands

## for example
$ echo "hello" 
$ mkdir ~/hello
$ cd ~/hello
$ ping -c 4 8.8.8.8 >> log.txt 
$ cp log.txt ..
$ cd ..
$ cat log.txt
$ rm -r hello
$ rm log.txt
$ echo done

Then save history to “~/week1/ex2.txt”.

$ history | cut -c 8- > ~/week1/ex2.txt

Ex2 (Continue)

Change the ex2.txt file so that it can be run from the console. Such a file is called a Bash script (often with the .sh extension). Use a guide like this one: http://omgenomics.com/writing-bash-script/ (steps 1-5)

Execute your script.
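For orientation, this is roughly what the life cycle of a script looks like: write the file, mark it executable, run it. It is a minimal sketch with an invented file name (hello.sh) in a temporary directory; the shebang here is #!/bin/sh for portability, while a script that relies on bash-only features would start with #!/bin/bash.

```shell
script_dir=$(mktemp -d)
script="$script_dir/hello.sh"      # hypothetical file name

# Write a two-line script. The first line (shebang) names the interpreter.
# The quoted 'EOF' keeps $1 literal so it is expanded when the script runs.
cat > "$script" <<'EOF'
#!/bin/sh
echo "hello from a script, argument: $1"
EOF

chmod +x "$script"                 # mark the file as executable
result=$("$script" world)          # run it with one argument
echo "$result"

rm -r "$script_dir"
```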

~/.bashrc

.bashrc is a shell script that Bash runs whenever it is started interactively. It initializes an interactive shell session. You can put any command in that file that you could type at the command prompt.

You put commands here to set up the shell for use in your particular environment, or to customize things to your preferences. Aliases that you want to always be available are a common thing to put in .bashrc.

Alias

In computing, alias is a command in various command line interpreters (shells) which enables a replacement of a word by another string. For example:

$ alias copy='cp'

then you can use

$ copy file file2 #instead of cp file file2

Self study

Can we use these tools to execute Windows scripts on Linux?

Variables

Variables are key-value pairs in bash. They are used like this:

$ echo $PATH

where $PATH is replaced by the value of the variable PATH.

ENVIRONMENT

When a program is invoked it is given an array of strings called the environment. This is a list of name-value pairs, of the form name=value.

Executed commands inherit the environment. The $ export command allows parameters and functions to be added to the environment, and $ unset removes them.

$ export KEK=$HOME"/kek"
$ mkdir $KEK

Then open a new Terminal and paste:

$ echo $KEK

As a result, it should print just an empty string. This is because the environment does not keep its last state: an exported variable lives only in the shell session that set it and in the processes started from that session. To make the default state repeatable, we add such commands to the ~/.bashrc file. You can do it in an editor (the file is hidden, like all files whose names start with a dot in Unix) or append to the end of the file with the command:

$ echo "export KEK=$HOME\"/kek\"" >> ~/.bashrc

The next time you open Terminal, the variable KEK will be accessible again.
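The difference between a plain shell variable and an exported one shows up as soon as a child process is started. A small sketch, assuming the made-up names DEMO_LOCAL and DEMO_SHARED; sh -c starts a child shell, and the single quotes delay expansion until the child runs.

```shell
DEMO_LOCAL="not exported"          # plain shell variable, stays in this shell
export DEMO_SHARED="exported"      # environment variable, passed to children

# ${NAME:-<unset>} prints <unset> when the child does not see the variable.
child_local=$(sh -c 'echo "${DEMO_LOCAL:-<unset>}"')
child_shared=$(sh -c 'echo "${DEMO_SHARED:-<unset>}"')

unset DEMO_SHARED                  # remove it from the shell and the environment
after_unset=$(sh -c 'echo "${DEMO_SHARED:-<unset>}"')

echo "$child_local / $child_shared / $after_unset"
```

The child sees only the exported variable, and nothing flows back: a variable set in a child never appears in the parent shell.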

More materials here

The PATH Environment Variable

The PATH environment variable has a special format. Let's see what it looks like:

$ echo $PATH

It's essentially a colon-separated list of directories. When you execute a command, the shell searches through each of these directories, one by one, until it finds a directory where the executable exists. We can find ls in /bin, right? /bin is the second item in the PATH variable. So let's remove /bin from PATH. We can do this by using the export command:

$ export PATH=/usr/local/bin:/usr/bin:/sbin:/usr/sbin:

Make sure that the variable is set correctly:

$ echo $PATH

Now, if we try to run ls, the shell no longer knows to look in /bin!

$ ls # -bash: ls: command not found

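As expected, ls can no longer be found. (On distributions where /bin is a symbolic link to /usr/bin, such as recent Ubuntu releases, ls is still found through /usr/bin.) More details here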
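The same lookup can be observed without breaking your shell by adding a directory to PATH instead of removing one. This is a sketch under assumptions: hello_path_demo is an invented command name, and the temporary directory must allow executing files.

```shell
bindir=$(mktemp -d)

# A tiny executable with a made-up name.
printf '#!/bin/sh\necho "found via PATH"\n' > "$bindir/hello_path_demo"
chmod +x "$bindir/hello_path_demo"

old_path=$PATH
# command -v prints where the shell would find a command, or fails if it cannot.
before=$(command -v hello_path_demo || echo "not found")

PATH="$bindir:$PATH"               # prepend the directory to the search list
after=$(command -v hello_path_demo)
output=$(hello_path_demo)

PATH=$old_path                     # restore the original search list
rm -r "$bindir"

echo "before=$before"
echo "after=$after"
echo "output=$output"
```

Saving the old value first (old_path) makes the experiment reversible in the current session.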

Ex 3

Modify your ~/.bashrc file to define aliases like the one above (at least an alias "copy" for "cp").
Provide your modifications to ~/.bashrc.
Can "one of my friends" use information from your ~/.bashrc to hack you?

Read on your own

What is each default Linux folder responsible for? rus
Why is everything in Linux a file? For example, $ ls /proc shows you all processes and information about them. Is Windows the same?

P.S. "Holy Grail" of sysadmins or just logs

Every program produces some output (at least a return code). It helps programmers understand what happens inside the program. The best practice is to use logs. There is no universal way to organize or store logs. Some programs store them in system folders (e.g. the dpkg tool, which installs packages in Linux, uses the /var/log/dpkg.log file).

Pay attention to this link

$ cat  /var/log/dpkg.log # use tail and a pipe to shorten it by yourself :3

Use logs, analyse logs, google error messages from logs, share logs with friends, subscribe to logs, put likes on logs. Don't rush to tg before you read the logs ;)
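A typical first pass over a log is "show me the newest lines" and "show me only the errors". The sketch below uses an invented log file (app.log) with invented entries, so it does not depend on what is installed on your machine.

```shell
log_dir=$(mktemp -d)
log="$log_dir/app.log"             # hypothetical log file

# Invented sample entries, oldest first.
cat > "$log" <<'EOF'
2024-01-01 10:00:00 INFO service started
2024-01-01 10:00:05 ERROR cannot open config
2024-01-01 10:00:09 INFO retrying
2024-01-01 10:00:12 ERROR cannot open config
2024-01-01 10:00:15 INFO giving up
EOF

recent=$(tail -n 2 "$log")                       # the two newest entries
error_count=$(grep -c 'ERROR' "$log")            # number of lines mentioning ERROR
last_error=$(grep 'ERROR' "$log" | tail -n 1)    # newest error only

echo "errors: $error_count"
echo "last error: $last_error"
rm -r "$log_dir"
```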

Networks

What the TCP/IP stack looks like.


The idea behind it is a layered architecture. Each layer has its own set of commands (interface), error checking, delivery addresses, and useful data part (payload).

Data frame (restored from raw bits) can be represented like

or like this

That means each website you visit is delivered to your machine in packets like this. Packet size varies from ~60 bytes to ~64 KB. The Moodle dashboard weighs ~3.4 MB for me. Think about these values.
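One way to think about these values is to count packets. The estimate below is a rough sketch with loud assumptions: 1 MB is taken as 1,000,000 bytes, 64 KB as 65535 bytes (the largest total length an IPv4 packet can declare), and header overhead is ignored, so the whole packet is treated as useful data.

```shell
page_bytes=3400000        # ~3.4 MB, assuming 1 MB = 1,000,000 bytes
max_packet=65535          # ~64 KB, largest size an IPv4 packet can declare
min_packet=60             # ~60 bytes, the small end of the range above

# Integer ceiling division: (a + b - 1) / b
fewest=$(( (page_bytes + max_packet - 1) / max_packet ))
most=$(( (page_bytes + min_packet - 1) / min_packet ))

echo "between $fewest and $most packets"   # prints: between 52 and 56667 packets
```

Real traffic falls between these bounds, because packet sizes on a typical link are limited well below 64 KB and part of every packet is headers.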

DS students should be familiar with this because their (your) work often depends on virtualization technologies such as virtual machines, remote desktop (ssh), and private/virtual networks.

It is too hard to squeeze a one-semester Tanenbaum-based course into a single page, so study these topics on your own:

  • IP
  • Port
  • DNS
  • Bridges (device and network virtualization)
  • NAT (Network Address Translation)

And then answer the questions.
Useful links:
THE MOST IMPORTANT FOR U, VirtualBox guidelines
Tanenbaum (use index)
Good slides

Exercises

Now you are ready to complete 13 questions from moodle.

Network tools

wget/curl

wget and curl are command-line programs that let you fetch a URL. Unlike a web browser,
which fetches and executes entire pages, wget and curl give you control over exactly which
URLs you fetch and when you fetch them.

$ wget http://innopolis.ru
$ curl http://innopolis.ru

This will fetch the resource and either write it to a file (wget) or to the screen (curl)

ping

ping is a standard command-line utility for checking that another computer is responsive. It is widely used for network troubleshooting and comes pre-installed on Windows, Linux, and Mac. While ping has various options, simply issuing the command

ping www.bing.com

will cause your computer to send a small number of ICMP ping requests to the remote computer (here www.bing.com), each of which should elicit an ICMP ping response.

traceroute

traceroute is a standard command-line utility for discovering the Internet paths that your computer uses. It is widely used for network troubleshooting. It comes pre-installed on Windows and Mac, and can be installed using your package manager on Linux. On Windows, it is called “tracert”. It has various options, but simply issuing the command

traceroute university.innopolis.ru

will cause your computer to find and print the path to the remote computer (here university.innopolis.ru).

If you are on Linux / Mac and behind a NAT (as most home users or virtual machine users are) then use the -I option (that is a capital i) with traceroute, e.g.,

traceroute -I university.innopolis.ru

This will cause traceroute to send ICMP probes like tracert instead of its usual UDP probes; ICMP probes are better able to pass through NAT boxes.

ifconfig

ifconfig stands for "interface configuration". It is used to view and change the configuration of the network interfaces on your system. It helps you explore the addresses and names of the networks and virtual networks of your machine. In

$ man ifconfig

you can also find the options for modifying network interfaces with this tool.

ssh

SSH a.k.a. Secure Shell or Secure Socket Shell, is a network protocol that gives users, particularly system administrators, a secure way to access a computer over an unsecured network. SSH also refers to the suite of utilities that implement the SSH protocol. Secure Shell provides strong authentication and encrypted data communications between two computers connecting over an open network such as the internet. SSH is widely used by network administrators for managing systems and applications remotely, allowing them to log into another computer over a network, execute commands and move files from one computer to another.

The most basic use of SSH is for connecting to a remote host for a terminal session. The form of that command is:

$ ssh UserName@SSHserver.example.com
# or 
$ ssh UserName@192.168.0.1 # where 192.168.0.1 is the IP address of the SSH server.

Other tools

Look into $ netcat (the most useful for future work), $ netstat and $ scp on your own.

Read on your own

  • Socket connection
  • OSI model
  • Default ports
  • localhost
  • /etc/network

Shared folders

Shared folders are the easiest way to share files between your host and guest machines. Only losers use TG to share files between their machines. There is no silver bullet that works for every hypervisor, so there are no guides here, but keep shared folders in mind when you need to pass files around.