# TF on Ubuntu 20.04 from scratch (AWS)
### Starting an instance
This instance uses a basic Ubuntu 20.04 image (no drivers, editors, or Xorg).
Note: I use [`aws-vault`](https://github.com/99designs/aws-vault) to store the credentials for my AWS accounts on my computer (so I run `aws-vault exec hal -- bash` first) and [`Bash-my-AWS`](https://bash-my-aws.org/) as a nicer command-line interface to AWS.
Let's check which AMI to use:
``` bash
$ aws ec2 describe-images --owners aws-marketplace --filters Name=name,Values='*ubuntu-focal-20.04*' | jq '.Images[] | [.CreationDate, .ImageId, .Name, .Description]'
[
"2020-04-23T18:35:13.000Z",
"ami-017cb7c5fb425a889",
"SupportedImages ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-serv-a5202e59-fcb2-4185-ac3e-cfb34ca880c0-ami-06f88aeefe25dc6ba.4",
"Ubuntu Server 20.04 LTS (Ubuntu 20.04 LTS ) (Ubuntu 20) Focal Fossa"
]
[
"2020-04-24T09:28:04.000Z",
"ami-0652b0a864db01553",
"ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200423-aced0818-eef1-427a-9e04-8ba38bada306-ami-068663a3c619dd892.4",
"Canonical, Ubuntu, 20.04 LTS, amd64 focal image build on 2020-04-23"
]
[
"2020-04-24T09:32:55.000Z",
"ami-099eed573ea1c101f",
"ubuntu-minimal/images/hvm-ssd/ubuntu-focal-20.04-amd64-minimal-2020-d5944ad4-5199-4cf3-ab4c-c2c4598f880b-ami-0f84c9a9348f9f857.4",
"Canonical, Ubuntu Minimal, 20.04 LTS, amd64 focal image build on 2020-04-23"
]
[
"2020-04-24T09:39:36.000Z",
"ami-0aba5b7c9025fd3fd",
"ubuntu/images/hvm-ssd/ubuntu-focal-20.04-arm64-server-20200423-3ba3581b-86f4-4bf1-a9a5-c2e11fe9408d-ami-00579fbb15b954340.4",
"Canonical, Ubuntu, 20.04 LTS, arm64 focal image build on 2020-04-23"
]
[
"2020-04-27T17:11:24.000Z",
"ami-0db29aadadd4a4cca",
"ubuntu-pro/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200423-ae7ed378-8838-4fcf-842d-d1d09b34f116-ami-0118f3de163338756.4",
"ubuntu-pro/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200423"
]
```
I'll use the minimal one, `ami-099eed573ea1c101f`.
``` bash
$ aws ec2 run-instances --image-id ami-099eed573ea1c101f --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=16}' --instance-type p3.2xlarge --key-name paolieri --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=Marco}]'
```
Let's connect (here I'm using Bash-my-AWS):
``` bash
$ instances
i-0cbce25e73d063cc6 ami-099eed573ea1c101f p3.2xlarge running Marco 2020-06-14T06:49:49.000Z us-west-2b vpc-411bdc39
$ instances | grep Marco | instance-ssh-details
i-0cbce25e73d063cc6 paolieri 34.222.239.184 Marco
$ ssh -i ~/.ssh/aws_hal.pem ubuntu@34.222.239.184
```
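Every reboot below needs a fresh `ssh`, so the IP lookup can be folded into the command. A minimal sketch, assuming the `instance-ssh-details` columns are instance ID, key name, public IP, and Name tag, as in the output above; `instance_ip` is a hypothetical helper name, not part of Bash-my-AWS.

``` bash
# Hypothetical helper (not part of Bash-my-AWS): print the public IP from
# `instance-ssh-details` output read on stdin.
# Assumption: columns are instance ID, key name, public IP, Name tag.
instance_ip() {
  # Only the first line is used; the IP is the third whitespace-separated field.
  awk 'NR == 1 { print $3 }'
}

# Example with the line shown above:
echo "i-0cbce25e73d063cc6 paolieri 34.222.239.184 Marco" | instance_ip   # → 34.222.239.184
```

On the laptop this becomes `ssh -i ~/.ssh/aws_hal.pem ubuntu@"$(instances | grep Marco | instance-ssh-details | instance_ip)"`. Keep in mind that without an Elastic IP the public address changes after a stop/start (a reboot keeps it).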
### NVIDIA drivers and `coolgpus`
We're connected, so let's install the NVIDIA driver (the `nvidia-driver-440` package also brings in `nvidia-smi`, `gcc-9`, and Xorg):
``` bash
$ sudo apt-get update
$ sudo apt-get install nvidia-driver-440
```
Let's reboot and check whether Xorg is running:
``` bash
$ sudo reboot
Connection to 34.222.239.184 closed by remote host.
Connection to 34.222.239.184 closed.
$ ssh -i ~/.ssh/aws_hal.pem ubuntu@34.222.239.184
Welcome to Ubuntu 20.04 LTS (GNU/Linux 5.4.0-1009-aws x86_64)
$ ps aux | grep Xorg
root 881 1.1 0.0 1396048 48960 tty1 Sl+ 07:08 0:00 /usr/lib/xorg/Xorg vt1 -displayfd 3 -auth /run/user/119/gdm/Xauthority -background none -noreset -keeptty -verbose 3
```
Xorg is running, and we don't want that, so let's switch the default boot target from graphical to multi-user (text mode) and reboot:
``` bash
$ sudo systemctl set-default multi-user
$ sudo reboot
$ ssh -i ~/.ssh/aws_hal.pem ubuntu@34.222.239.184
$ ps aux | grep Xorg
ubuntu 900 0.0 0.0 5188 732 pts/0 S+ 07:18 0:00 grep --color=auto Xorg
```
Excellent, it's not running anymore on startup.
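The `ps aux | grep Xorg` check always prints at least one line, the `grep` itself (as in the second output above), so the output has to be read by eye. On the live machine, `pgrep -x Xorg` (no output and a non-zero exit status when nothing matches) and `systemctl get-default` (which should now print `multi-user.target`) are more direct. To check saved `ps aux` output in a script, a minimal sketch is to look only at the command column; `xorg_in_ps` is a hypothetical helper name.

``` bash
# Hypothetical helper: succeed if saved `ps aux` output (stdin) shows an Xorg server.
# Field 11 of `ps aux` is the command; only an executable named exactly "Xorg"
# counts, so the `grep --color=auto Xorg` line is ignored.
xorg_in_ps() {
  awk '{
    n = split($11, parts, "/")          # "/usr/lib/xorg/Xorg" -> last part "Xorg"
    if (n > 0 && parts[n] == "Xorg") found = 1
  } END { exit !found }'
}

# Live usage: ps aux | xorg_in_ps && echo "Xorg is running"
```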
Can we see the GPU? Yup:
``` bash
$ nvidia-smi
Sun Jun 14 07:20:06 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64 Driver Version: 440.64 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:00:1E.0 Off | 0 |
| N/A 29C P0 38W / 300W | 0MiB / 16160MiB | 5% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
```
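Note the `CUDA Version: 10.2` in the banner: it is the newest CUDA runtime this driver supports, not a CUDA installation on the machine (we haven't installed one). As a rule of thumb, any CUDA toolkit brought in later, for example through conda, should be no newer than that. A minimal sketch of that check; `driver_cuda_version` and `version_le` are hypothetical helper names, and `sort -V` assumes GNU coreutils.

``` bash
# Hypothetical helpers; `sort -V` assumes GNU coreutils.

# Print the "CUDA Version" from nvidia-smi's banner read on stdin, e.g. "10.2".
driver_cuda_version() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Succeed if version $1 <= version $2 (version-aware ordering: 10.10 > 10.2).
version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Live usage:
#   drv=$(nvidia-smi | driver_cuda_version)
#   version_le 10.1 "$drv" && echo "a cudatoolkit 10.1 should work with driver CUDA $drv"
```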
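Now, will `coolgpus` work? No: it starts its own X server, but then cannot find any fan to control. That is not too surprising, since the `nvidia-smi` output above shows `N/A` in the Fan column (this data-center V100 is passively cooled).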
``` bash
$ sudo apt-get install wget emacs-nox
$ cd /usr/local/bin
$ sudo wget https://raw.githubusercontent.com/andyljones/coolgpus/master/coolgpus
$ sudo chmod +x coolgpus
$ sudo emacs coolgpus # change python to python3, then save (C-x C-s) and quit (C-x C-c)
$ sudo coolgpus --kill
Killing all running X servers, including 1626
Awaiting X server shutdown
All X servers killed
Starting xserver: Xorg :0 -once -config /tmp/cool-gpu-00000000:00:1E.0xkn_4ho5/xorg.conf
X.Org X Server 1.20.8
X Protocol Version 11, Revision 0
Build Operating System: Linux 4.4.0-179-generic x86_64 Ubuntu
Current Operating System: Linux ip-172-31-38-39 5.4.0-1009-aws #9-Ubuntu SMP Sun Apr 12 19:46:01 UTC 2020 x86_64
Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.4.0-1009-aws root=PARTUUID=89296dce-01 ro console=tty1 console=ttyS0 nvme_core.io_timeout=4294967295 panic=-1
Build Date: 21 May 2020 08:22:15AM
xorg-server 2:1.20.8-2ubuntu2.1 (For technical support please see http://www.ubuntu.com/support)
Current version of pixman: 0.38.4
Before reporting problems, check http://wiki.x.org
to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
(++) from command line, (!!) notice, (II) informational,
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.0.log", Time: Sun Jun 14 07:53:12 2020
(++) Using config file: "/tmp/cool-gpu-00000000:00:1E.0xkn_4ho5/xorg.conf"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
libEGL warning: DRI2: failed to authenticate
ERROR: Error resolving target specification 'fan:0' (No targets match target specification), specified in assignment '[fan:0]/GPUTargetFanSpeed=30'.
Released fan speed control for GPU at :0
Terminating xserver for display :0
Traceback (most recent call last):
File "/usr/local/bin/coolgpus", line 242, in <module>
run()
File "/usr/local/bin/coolgpus", line 239, in run
manage_fans(displays)
File "/usr/local/bin/coolgpus", line 225, in manage_fans
set_speed(display, s)
File "/usr/local/bin/coolgpus", line 212, in set_speed
assign(display, '[fan:0]/GPUTargetFanSpeed='+str(int(target)))
File "/usr/local/bin/coolgpus", line 208, in assign
log_output(['nvidia-settings', '-a', command, '-c', display])
File "/usr/local/bin/coolgpus", line 91, in log_output
raise ValueError('Command ' + ' '.join(command) + ' crashed with return code ' + str(p.returncode) + ':')
ValueError: Command nvidia-settings -a [fan:0]/GPUTargetFanSpeed=30 -c :0 crashed with return code 1:
```
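Before fighting with `coolgpus`, it may be quicker to ask the driver whether there is any fan to control at all. A minimal sketch, assuming that a fanless GPU reports a non-numeric value (such as `[N/A]`) for the `fan.speed` query field; `any_gpu_fan` is a hypothetical helper name.

``` bash
# Hypothetical helper: succeed if any GPU reports a numeric fan speed.
# Intended input (one line per GPU):
#   nvidia-smi --query-gpu=fan.speed --format=csv,noheader,nounits
# Assumption: fanless GPUs report something non-numeric, such as "[N/A]".
any_gpu_fan() {
  grep -Eq '^[[:space:]]*[0-9]+([[:space:]]*%)?[[:space:]]*$'
}

# Live usage:
#   nvidia-smi --query-gpu=fan.speed --format=csv,noheader,nounits | any_gpu_fan \
#     || echo "no controllable fans; coolgpus has nothing to do here"
```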
### Using Miniconda and TF
First, we need to install conda:
``` bash
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/.miniconda
$ $HOME/.miniconda/condabin/conda update -y -n base -c defaults conda
$ $HOME/.miniconda/condabin/conda init bash zsh
$ $HOME/.miniconda/condabin/conda config --set auto_activate_base false
(log out and reconnect with ssh)
```
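`conda init` writes its setup into `~/.bashrc`, which non-interactive shells (for example, a script started with `ssh host ./train.sh`) may not read: Ubuntu's default `~/.bashrc` returns early when the shell is not interactive. A minimal sketch for scripts, assuming the `$HOME/.miniconda` prefix used above: source the `etc/profile.d/conda.sh` hook that ships with the install, which defines the `conda` shell function. `load_conda` is a hypothetical helper name.

``` bash
# Hypothetical helper: make `conda activate` available in a non-interactive script.
# Assumes the install prefix used above unless another one is passed as $1.
load_conda() {
  local prefix="${1:-$HOME/.miniconda}"
  local hook="$prefix/etc/profile.d/conda.sh"
  if [ -f "$hook" ]; then
    # conda.sh defines the `conda` shell function (including `conda activate`).
    . "$hook"
  else
    echo "conda hook not found at $hook" >&2
    return 1
  fi
}

# Example script body:
#   load_conda || exit 1
#   conda activate tf2
#   python fmnist.py
```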
Now, let's try to create an empty Python 3.8 conda environment:
``` bash
$ conda create -n py38 python=3.8
Collecting package metadata (current_repodata.json): done
Solving environment: done
## Package Plan ##
environment location: /home/ubuntu/.miniconda/envs/py38
added / updated specs:
- python=3.8
The following packages will be downloaded:
package | build
---------------------------|-----------------
certifi-2020.4.5.1 | py38_0 156 KB
libffi-3.3 | he6710b0_1 50 KB
pip-20.0.2 | py38_3 1.7 MB
python-3.8.3 | hcff3b4d_0 49.1 MB
readline-8.0 | h7b6447c_0 356 KB
setuptools-47.1.1 | py38_0 517 KB
wheel-0.34.2 | py38_0 51 KB
------------------------------------------------------------
Total: 51.9 MB
The following NEW packages will be INSTALLED:
_libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
ca-certificates pkgs/main/linux-64::ca-certificates-2020.1.1-0
certifi pkgs/main/linux-64::certifi-2020.4.5.1-py38_0
ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.33.1-h53a641e_7
libedit pkgs/main/linux-64::libedit-3.1.20181209-hc058e9b_0
libffi pkgs/main/linux-64::libffi-3.3-he6710b0_1
libgcc-ng pkgs/main/linux-64::libgcc-ng-9.1.0-hdf63c60_0
libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-9.1.0-hdf63c60_0
ncurses pkgs/main/linux-64::ncurses-6.2-he6710b0_1
openssl pkgs/main/linux-64::openssl-1.1.1g-h7b6447c_0
pip pkgs/main/linux-64::pip-20.0.2-py38_3
python pkgs/main/linux-64::python-3.8.3-hcff3b4d_0
readline pkgs/main/linux-64::readline-8.0-h7b6447c_0
setuptools pkgs/main/linux-64::setuptools-47.1.1-py38_0
sqlite pkgs/main/linux-64::sqlite-3.31.1-h62c20be_1
tk pkgs/main/linux-64::tk-8.6.8-hbc83047_0
wheel pkgs/main/linux-64::wheel-0.34.2-py38_0
xz pkgs/main/linux-64::xz-5.2.5-h7b6447c_0
zlib pkgs/main/linux-64::zlib-1.2.11-h7b6447c_3
Proceed ([y]/n)? y
Downloading and Extracting Packages
libffi-3.3 | 50 KB | ################################################################################################################################################## | 100%
readline-8.0 | 356 KB | ################################################################################################################################################## | 100%
wheel-0.34.2 | 51 KB | ################################################################################################################################################## | 100%
python-3.8.3 | 49.1 MB | ################################################################################################################################################## | 100%
pip-20.0.2 | 1.7 MB | ################################################################################################################################################## | 100%
setuptools-47.1.1 | 517 KB | ################################################################################################################################################## | 100%
certifi-2020.4.5.1 | 156 KB | ################################################################################################################################################## | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
# $ conda activate py38
#
# To deactivate an active environment, use
#
# $ conda deactivate
$ python --version
-bash: python: command not found
$ python3 --version
Python 3.8.2
$ which python3
/usr/bin/python3
$ conda activate py38
(py38) $ python --version
Python 3.8.3
(py38) $ python3 --version
Python 3.8.3
(py38) $ which python
/home/ubuntu/.miniconda/envs/py38/bin/python
(py38) $ which python3
/home/ubuntu/.miniconda/envs/py38/bin/python3
(py38) $ which pip
/home/ubuntu/.miniconda/envs/py38/bin/pip
(py38) $ pip install click
Collecting click
Downloading click-7.1.2-py2.py3-none-any.whl (82 kB)
|████████████████████████████████| 82 kB 1.4 MB/s
Installing collected packages: click
Successfully installed click-7.1.2
(py38) $ python
Python 3.8.3 (default, May 19 2020, 18:47:26)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import click
>>> click
<module 'click' from '/home/ubuntu/.miniconda/envs/py38/lib/python3.8/site-packages/click/__init__.py'>
```
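Why does `which python3` change after `conda activate py38`? Roughly, activation puts the environment's `bin` directory in front of the others on `PATH`, and the shell runs the first matching executable it finds. A toy demo of just that lookup rule, with fake `python3` scripts in a temporary directory (no conda involved; `demo_path_lookup` is a hypothetical name):

``` bash
# Toy demo (no conda involved): the first directory on PATH that contains a
# `python3` wins. `conda activate py38` relies on the same rule by putting
# ~/.miniconda/envs/py38/bin in front of /usr/bin.
demo_path_lookup() {
  local root env_bin sys_bin
  root=$(mktemp -d)
  env_bin="$root/envs/py38/bin"   # stands in for ~/.miniconda/envs/py38/bin
  sys_bin="$root/usr/bin"         # stands in for /usr/bin
  mkdir -p "$env_bin" "$sys_bin"

  # Two fake interpreters that only say which one they are.
  printf '#!/bin/sh\necho system\n' > "$sys_bin/python3"
  printf '#!/bin/sh\necho env\n' > "$env_bin/python3"
  chmod +x "$sys_bin/python3" "$env_bin/python3"

  # Subshells keep the PATH changes local to each lookup.
  ( PATH="$sys_bin"; python3 )            # before activation
  ( PATH="$env_bin:$sys_bin"; python3 )   # after activation: env dir comes first

  rm -rf "$root"
}

demo_path_lookup   # prints "system", then "env"
```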
Nice, but so far we could have done the same with `poetry`, `pyenv`, `pipenv`, or `venv`.
Now let's create a conda environment with TF2 and matching versions of `cudatoolkit` and `cudnn` (binary system libraries shipped as packages!):
``` bash
(press Ctrl-D to leave the Python interpreter, then deactivate the py38 environment)
(py38) $ conda deactivate
$ conda create -n tf2 python=3.8 tensorflow-gpu
Collecting package metadata (current_repodata.json): done
Solving environment: done
## Package Plan ##
environment location: /home/ubuntu/.miniconda/envs/tf2
added / updated specs:
- python=3.8
- tensorflow-gpu
The following packages will be downloaded:
package | build
---------------------------|-----------------
_tflow_select-2.1.0 | gpu 2 KB
absl-py-0.9.0 | py38_0 165 KB
astunparse-1.6.3 | py_0 17 KB
blas-1.0 | mkl 6 KB
blinker-1.4 | py38_0 22 KB
c-ares-1.15.0 | h7b6447c_1001 89 KB
cachetools-4.1.0 | py_1 15 KB
cffi-1.14.0 | py38he30daa8_1 225 KB
chardet-3.0.4 | py38_1003 174 KB
click-7.1.2 | py_0 71 KB
cryptography-2.9.2 | py38h1ba5d50_0 556 KB
cudatoolkit-10.1.243 | h6bb024c_0 347.4 MB
cudnn-7.6.5 | cuda10.1_0 179.9 MB
cupti-10.1.168 | 0 1.4 MB
gast-0.3.3 | py_0 14 KB
google-auth-1.14.1 | py_0 58 KB
google-auth-oauthlib-0.4.1 | py_2 20 KB
google-pasta-0.2.0 | py_0 46 KB
grpcio-1.27.2 | py38hf8bcb03_0 1.3 MB
h5py-2.10.0 | py38h7918eee_0 1.1 MB
hdf5-1.10.4 | hb1b8bf9_0 3.9 MB
intel-openmp-2020.1 | 217 780 KB
keras-preprocessing-1.1.0 | py_1 37 KB
libgfortran-ng-7.3.0 | hdf63c60_0 1006 KB
libprotobuf-3.12.3 | hd408876_0 2.9 MB
markdown-3.1.1 | py38_0 116 KB
mkl-2020.1 | 217 129.0 MB
mkl-service-2.3.0 | py38he904b0f_0 62 KB
mkl_fft-1.0.15 | py38ha843d7b_0 159 KB
mkl_random-1.1.1 | py38h0573a6f_0 341 KB
numpy-1.18.1 | py38h4f9e942_0 5 KB
numpy-base-1.18.1 | py38hde5b4d6_1 4.2 MB
oauthlib-3.1.0 | py_0 91 KB
opt_einsum-3.1.0 | py_0 54 KB
protobuf-3.12.3 | py38he6710b0_0 648 KB
pyasn1-0.4.8 | py_0 57 KB
pyasn1-modules-0.2.7 | py_0 68 KB
pyjwt-1.7.1 | py38_0 33 KB
pyopenssl-19.1.0 | py38_0 88 KB
pysocks-1.7.1 | py38_0 28 KB
requests-2.23.0 | py38_0 93 KB
requests-oauthlib-1.3.0 | py_0 23 KB
rsa-4.0 | py_0 29 KB
scipy-1.4.1 | py38h0b6359f_0 14.8 MB
tensorboard-2.2.1 | pyh532a8cf_0 2.4 MB
tensorboard-plugin-wit-1.6.0| py_0 630 KB
tensorflow-2.2.0 |gpu_py38hb782248_0 4 KB
tensorflow-base-2.2.0 |gpu_py38h83e3d50_0 179.3 MB
tensorflow-estimator-2.2.0 | pyh208ff02_0 254 KB
tensorflow-gpu-2.2.0 | h0d30ee6_0 3 KB
termcolor-1.1.0 | py38_1 8 KB
urllib3-1.25.8 | py38_0 170 KB
werkzeug-1.0.1 | py_0 240 KB
wrapt-1.12.1 | py38h7b6447c_1 50 KB
------------------------------------------------------------
Total: 874.0 MB
The following NEW packages will be INSTALLED:
_libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
_tflow_select pkgs/main/linux-64::_tflow_select-2.1.0-gpu
absl-py pkgs/main/linux-64::absl-py-0.9.0-py38_0
astunparse pkgs/main/noarch::astunparse-1.6.3-py_0
blas pkgs/main/linux-64::blas-1.0-mkl
blinker pkgs/main/linux-64::blinker-1.4-py38_0
c-ares pkgs/main/linux-64::c-ares-1.15.0-h7b6447c_1001
ca-certificates pkgs/main/linux-64::ca-certificates-2020.1.1-0
cachetools pkgs/main/noarch::cachetools-4.1.0-py_1
certifi pkgs/main/linux-64::certifi-2020.4.5.1-py38_0
cffi pkgs/main/linux-64::cffi-1.14.0-py38he30daa8_1
chardet pkgs/main/linux-64::chardet-3.0.4-py38_1003
click pkgs/main/noarch::click-7.1.2-py_0
cryptography pkgs/main/linux-64::cryptography-2.9.2-py38h1ba5d50_0
cudatoolkit pkgs/main/linux-64::cudatoolkit-10.1.243-h6bb024c_0
cudnn pkgs/main/linux-64::cudnn-7.6.5-cuda10.1_0
cupti pkgs/main/linux-64::cupti-10.1.168-0
gast pkgs/main/noarch::gast-0.3.3-py_0
google-auth pkgs/main/noarch::google-auth-1.14.1-py_0
google-auth-oauth~ pkgs/main/noarch::google-auth-oauthlib-0.4.1-py_2
google-pasta pkgs/main/noarch::google-pasta-0.2.0-py_0
grpcio pkgs/main/linux-64::grpcio-1.27.2-py38hf8bcb03_0
h5py pkgs/main/linux-64::h5py-2.10.0-py38h7918eee_0
hdf5 pkgs/main/linux-64::hdf5-1.10.4-hb1b8bf9_0
idna pkgs/main/noarch::idna-2.9-py_1
intel-openmp pkgs/main/linux-64::intel-openmp-2020.1-217
keras-preprocessi~ pkgs/main/noarch::keras-preprocessing-1.1.0-py_1
ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.33.1-h53a641e_7
libedit pkgs/main/linux-64::libedit-3.1.20181209-hc058e9b_0
libffi pkgs/main/linux-64::libffi-3.3-he6710b0_1
libgcc-ng pkgs/main/linux-64::libgcc-ng-9.1.0-hdf63c60_0
libgfortran-ng pkgs/main/linux-64::libgfortran-ng-7.3.0-hdf63c60_0
libprotobuf pkgs/main/linux-64::libprotobuf-3.12.3-hd408876_0
libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-9.1.0-hdf63c60_0
markdown pkgs/main/linux-64::markdown-3.1.1-py38_0
mkl pkgs/main/linux-64::mkl-2020.1-217
mkl-service pkgs/main/linux-64::mkl-service-2.3.0-py38he904b0f_0
mkl_fft pkgs/main/linux-64::mkl_fft-1.0.15-py38ha843d7b_0
mkl_random pkgs/main/linux-64::mkl_random-1.1.1-py38h0573a6f_0
ncurses pkgs/main/linux-64::ncurses-6.2-he6710b0_1
numpy pkgs/main/linux-64::numpy-1.18.1-py38h4f9e942_0
numpy-base pkgs/main/linux-64::numpy-base-1.18.1-py38hde5b4d6_1
oauthlib pkgs/main/noarch::oauthlib-3.1.0-py_0
openssl pkgs/main/linux-64::openssl-1.1.1g-h7b6447c_0
opt_einsum pkgs/main/noarch::opt_einsum-3.1.0-py_0
pip pkgs/main/linux-64::pip-20.0.2-py38_3
protobuf pkgs/main/linux-64::protobuf-3.12.3-py38he6710b0_0
pyasn1 pkgs/main/noarch::pyasn1-0.4.8-py_0
pyasn1-modules pkgs/main/noarch::pyasn1-modules-0.2.7-py_0
pycparser pkgs/main/noarch::pycparser-2.20-py_0
pyjwt pkgs/main/linux-64::pyjwt-1.7.1-py38_0
pyopenssl pkgs/main/linux-64::pyopenssl-19.1.0-py38_0
pysocks pkgs/main/linux-64::pysocks-1.7.1-py38_0
python pkgs/main/linux-64::python-3.8.3-hcff3b4d_0
readline pkgs/main/linux-64::readline-8.0-h7b6447c_0
requests pkgs/main/linux-64::requests-2.23.0-py38_0
requests-oauthlib pkgs/main/noarch::requests-oauthlib-1.3.0-py_0
rsa pkgs/main/noarch::rsa-4.0-py_0
scipy pkgs/main/linux-64::scipy-1.4.1-py38h0b6359f_0
setuptools pkgs/main/linux-64::setuptools-47.1.1-py38_0
six pkgs/main/noarch::six-1.15.0-py_0
sqlite pkgs/main/linux-64::sqlite-3.31.1-h62c20be_1
tensorboard pkgs/main/noarch::tensorboard-2.2.1-pyh532a8cf_0
tensorboard-plugi~ pkgs/main/noarch::tensorboard-plugin-wit-1.6.0-py_0
tensorflow pkgs/main/linux-64::tensorflow-2.2.0-gpu_py38hb782248_0
tensorflow-base pkgs/main/linux-64::tensorflow-base-2.2.0-gpu_py38h83e3d50_0
tensorflow-estima~ pkgs/main/noarch::tensorflow-estimator-2.2.0-pyh208ff02_0
tensorflow-gpu pkgs/main/linux-64::tensorflow-gpu-2.2.0-h0d30ee6_0
termcolor pkgs/main/linux-64::termcolor-1.1.0-py38_1
tk pkgs/main/linux-64::tk-8.6.8-hbc83047_0
urllib3 pkgs/main/linux-64::urllib3-1.25.8-py38_0
werkzeug pkgs/main/noarch::werkzeug-1.0.1-py_0
wheel pkgs/main/linux-64::wheel-0.34.2-py38_0
wrapt pkgs/main/linux-64::wrapt-1.12.1-py38h7b6447c_1
xz pkgs/main/linux-64::xz-5.2.5-h7b6447c_0
zlib pkgs/main/linux-64::zlib-1.2.11-h7b6447c_3
Proceed ([y]/n)? y
[...]
#
# To activate this environment, use
#
# $ conda activate tf2
#
# To deactivate an active environment, use
#
# $ conda deactivate
$ wget https://pastebin.com/raw/7FEsnMw3 -O fmnist.py
$ conda activate tf2
(tf2) $ python fmnist.py
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
32768/29515 [=================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
26427392/26421880 [==============================] - 5s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
8192/5148 [===============================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz
4423680/4422102 [==============================] - 1s 0us/step
2020-06-14 08:21:54.239273: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-06-14 08:21:54.267503: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-14 08:21:54.268518: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:00:1e.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2020-06-14 08:21:54.268828: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-14 08:21:54.270920: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-14 08:21:54.272822: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-14 08:21:54.273155: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-14 08:21:54.275103: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-14 08:21:54.276050: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-14 08:21:54.280268: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-14 08:21:54.280442: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-14 08:21:54.281498: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-14 08:21:54.282419: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-06-14 08:21:54.282840: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-06-14 08:21:54.309605: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2300035000 Hz
2020-06-14 08:21:54.310182: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5581bda05040 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-14 08:21:54.310214: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-06-14 08:21:54.310516: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-14 08:21:54.311507: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:00:1e.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2020-06-14 08:21:54.311569: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-14 08:21:54.311593: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-14 08:21:54.311610: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-14 08:21:54.311631: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-14 08:21:54.311649: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-14 08:21:54.311670: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-14 08:21:54.311692: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-14 08:21:54.311759: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-14 08:21:54.312695: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-14 08:21:54.313611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-06-14 08:21:54.313660: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-14 08:21:54.897172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-14 08:21:54.897217: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108] 0
2020-06-14 08:21:54.897227: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0: N
2020-06-14 08:21:54.898070: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-14 08:21:54.899075: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-14 08:21:54.900023: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-14 08:21:54.900951: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14762 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
2020-06-14 08:21:54.903953: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5581c1270160 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-06-14 08:21:54.903981: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
Epoch 1/10
2020-06-14 08:21:56.333036: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
1875/1875 [==============================] - 3s 2ms/step - loss: 3.2816 - accuracy: 0.6954
Epoch 2/10
1875/1875 [==============================] - 3s 2ms/step - loss: 0.6911 - accuracy: 0.7492
Epoch 3/10
1875/1875 [==============================] - 3s 2ms/step - loss: 0.5915 - accuracy: 0.7861
Epoch 4/10
1875/1875 [==============================] - 3s 2ms/step - loss: 0.5403 - accuracy: 0.8070
Epoch 5/10
1875/1875 [==============================] - 3s 2ms/step - loss: 0.5136 - accuracy: 0.8206
Epoch 6/10
1875/1875 [==============================] - 3s 1ms/step - loss: 0.5027 - accuracy: 0.8281
Epoch 7/10
1875/1875 [==============================] - 3s 1ms/step - loss: 0.4890 - accuracy: 0.8325
Epoch 8/10
1875/1875 [==============================] - 3s 2ms/step - loss: 0.4708 - accuracy: 0.8378
Epoch 9/10
1875/1875 [==============================] - 3s 2ms/step - loss: 0.4744 - accuracy: 0.8383
Epoch 10/10
1875/1875 [==============================] - 3s 2ms/step - loss: 0.4665 - accuracy: 0.8400
```
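The interleaved log is noisy, but one line settles whether TF actually used the GPU: `Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 ...)`. A minimal sketch to check a saved log for it; TF writes its log to stderr, hence the `2>&1` when saving. `tf_log_used_gpu` is a hypothetical helper name, and the pattern assumes the TF 2.2 log format shown above.

``` bash
# Hypothetical helper: succeed if a saved TensorFlow log (stdin) shows that a
# GPU device was created. Pattern based on the TF 2.2 line shown above.
tf_log_used_gpu() {
  grep -q 'Created TensorFlow device (.*device:GPU:[0-9]'
}

# Live usage (TF logs go to stderr, so merge it into the saved file):
#   python fmnist.py 2>&1 | tee run.log
#   tf_log_used_gpu < run.log && echo "TF used the GPU"
```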
While that is running, from another SSH connection:
``` bash
$ nvidia-smi
Sun Jun 14 08:22:41 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64 Driver Version: 440.64 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:00:1E.0 Off | 0 |
| N/A 32C P0 39W / 300W | 15265MiB / 16160MiB | 12% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 4254 C python 15265MiB |
+-----------------------------------------------------------------------------+
```
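Note that the Python process holds 15265MiB of the 16160MiB even though this model is tiny: by default TF reserves nearly all GPU memory up front, so `Memory-Usage` says little about what the model needs. Setting the environment variable `TF_FORCE_GPU_ALLOW_GROWTH=true` before starting Python should make TF allocate memory on demand instead. For logging usage over time, `nvidia-smi` also has a CSV query mode; a minimal sketch that turns it into percentages (`gpu_mem_pct` is a hypothetical helper name):

``` bash
# Hypothetical helper: turn nvidia-smi CSV query output (stdin) into percentages.
# Intended input, one line per GPU, e.g. "0, 15265, 16160":
#   nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits
gpu_mem_pct() {
  # Fields: GPU index, used MiB, total MiB; integer percentage, rounded down.
  awk -F', *' '{ printf "GPU %s: %d%% of %s MiB\n", $1, int($2 * 100 / $3), $3 }'
}

echo "0, 15265, 16160" | gpu_mem_pct   # → GPU 0: 94% of 16160 MiB
```

Adding `-l 5` to the `nvidia-smi` call repeats the query every 5 seconds.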
Can we do the same with TF 1.15 and the old standalone Keras?
We just need to ask for a different version of TF/Keras (the available versions are listed [here](https://anaconda.org/anaconda/tensorflow-gpu/files) and [here](https://anaconda.org/anaconda/keras/files), under "Versions"), and conda pulls in the matching versions of `cudatoolkit`, `cudnn`, and `cupti`.
While trying this, I discovered that TF 1.15 does not support Python 3.8, so I had to use Python 3.7:
``` bash
$ conda create -n tf1 python=3.7 tensorflow-gpu=1.15 keras
Collecting package metadata (current_repodata.json): done
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: done
## Package Plan ##
environment location: /home/ubuntu/.miniconda/envs/tf1
added / updated specs:
- keras
- python=3.7
- tensorflow-gpu=1.15
The following NEW packages will be INSTALLED:
_libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
_tflow_select pkgs/main/linux-64::_tflow_select-2.1.0-gpu
absl-py pkgs/main/linux-64::absl-py-0.9.0-py37_0
astor pkgs/main/linux-64::astor-0.8.0-py37_0
blas pkgs/main/linux-64::blas-1.0-mkl
c-ares pkgs/main/linux-64::c-ares-1.15.0-h7b6447c_1001
ca-certificates pkgs/main/linux-64::ca-certificates-2020.1.1-0
certifi pkgs/main/linux-64::certifi-2020.4.5.1-py37_0
cudatoolkit pkgs/main/linux-64::cudatoolkit-10.0.130-0
cudnn pkgs/main/linux-64::cudnn-7.6.5-cuda10.0_0
cupti pkgs/main/linux-64::cupti-10.0.130-0
gast pkgs/main/linux-64::gast-0.2.2-py37_0
google-pasta pkgs/main/noarch::google-pasta-0.2.0-py_0
grpcio pkgs/main/linux-64::grpcio-1.27.2-py37hf8bcb03_0
h5py pkgs/main/linux-64::h5py-2.10.0-py37h7918eee_0
hdf5 pkgs/main/linux-64::hdf5-1.10.4-hb1b8bf9_0
intel-openmp pkgs/main/linux-64::intel-openmp-2020.1-217
keras pkgs/main/linux-64::keras-2.3.1-0
keras-applications pkgs/main/noarch::keras-applications-1.0.8-py_0
keras-base pkgs/main/linux-64::keras-base-2.3.1-py37_0
keras-preprocessi~ pkgs/main/noarch::keras-preprocessing-1.1.0-py_1
ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.33.1-h53a641e_7
libedit pkgs/main/linux-64::libedit-3.1.20181209-hc058e9b_0
libffi pkgs/main/linux-64::libffi-3.3-he6710b0_1
libgcc-ng pkgs/main/linux-64::libgcc-ng-9.1.0-hdf63c60_0
libgfortran-ng pkgs/main/linux-64::libgfortran-ng-7.3.0-hdf63c60_0
libprotobuf pkgs/main/linux-64::libprotobuf-3.12.3-hd408876_0
libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-9.1.0-hdf63c60_0
markdown pkgs/main/linux-64::markdown-3.1.1-py37_0
mkl pkgs/main/linux-64::mkl-2020.1-217
mkl-service pkgs/main/linux-64::mkl-service-2.3.0-py37he904b0f_0
mkl_fft pkgs/main/linux-64::mkl_fft-1.0.15-py37ha843d7b_0
mkl_random pkgs/main/linux-64::mkl_random-1.1.1-py37h0573a6f_0
ncurses pkgs/main/linux-64::ncurses-6.2-he6710b0_1
numpy pkgs/main/linux-64::numpy-1.18.1-py37h4f9e942_0
numpy-base pkgs/main/linux-64::numpy-base-1.18.1-py37hde5b4d6_1
openssl pkgs/main/linux-64::openssl-1.1.1g-h7b6447c_0
opt_einsum pkgs/main/noarch::opt_einsum-3.1.0-py_0
pip pkgs/main/linux-64::pip-20.0.2-py37_3
protobuf pkgs/main/linux-64::protobuf-3.12.3-py37he6710b0_0
python pkgs/main/linux-64::python-3.7.7-hcff3b4d_5
pyyaml pkgs/main/linux-64::pyyaml-5.3.1-py37h7b6447c_0
readline pkgs/main/linux-64::readline-8.0-h7b6447c_0
scipy pkgs/main/linux-64::scipy-1.4.1-py37h0b6359f_0
setuptools pkgs/main/linux-64::setuptools-47.1.1-py37_0
six pkgs/main/noarch::six-1.15.0-py_0
sqlite pkgs/main/linux-64::sqlite-3.31.1-h62c20be_1
tensorboard pkgs/main/noarch::tensorboard-1.15.0-pyhb230dea_0
tensorflow pkgs/main/linux-64::tensorflow-1.15.0-gpu_py37h0f0df58_0
tensorflow-base pkgs/main/linux-64::tensorflow-base-1.15.0-gpu_py37h9dcbed7_0
tensorflow-estima~ pkgs/main/noarch::tensorflow-estimator-1.15.1-pyh2649769_0
tensorflow-gpu pkgs/main/linux-64::tensorflow-gpu-1.15.0-h0d30ee6_0
termcolor pkgs/main/linux-64::termcolor-1.1.0-py37_1
tk pkgs/main/linux-64::tk-8.6.8-hbc83047_0
webencodings pkgs/main/linux-64::webencodings-0.5.1-py37_1
werkzeug pkgs/main/noarch::werkzeug-0.16.1-py_0
wheel pkgs/main/linux-64::wheel-0.34.2-py37_0
wrapt pkgs/main/linux-64::wrapt-1.12.1-py37h7b6447c_1
xz pkgs/main/linux-64::xz-5.2.5-h7b6447c_0
yaml pkgs/main/linux-64::yaml-0.1.7-had09818_2
zlib pkgs/main/linux-64::zlib-1.2.11-h7b6447c_3
Proceed ([y]/n)? y
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
# $ conda activate tf1
#
# To deactivate an active environment, use
#
# $ conda deactivate
$ emacs fmnist.py # to change 'from tensorflow import keras' to 'import keras'
$ conda activate tf1
(tf1) $ python fmnist.py
Using TensorFlow backend.
WARNING:tensorflow:From /home/ubuntu/.miniconda/envs/tf1/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
2020-06-14 08:50:03.393864: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-06-14 08:50:03.421339: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-14 08:50:03.422355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
2020-06-14 08:50:03.422643: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-06-14 08:50:03.423990: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-06-14 08:50:03.425281: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-06-14 08:50:03.425647: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-06-14 08:50:03.427206: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-06-14 08:50:03.428370: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-06-14 08:50:03.432195: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-14 08:50:03.432350: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-14 08:50:03.433335: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-14 08:50:03.434256: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-06-14 08:50:03.434772: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-06-14 08:50:03.457561: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300035000 Hz
2020-06-14 08:50:03.458053: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55c057d2bb00 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-14 08:50:03.458080: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-06-14 08:50:03.458284: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-14 08:50:03.459213: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
2020-06-14 08:50:03.459254: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-06-14 08:50:03.459275: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-06-14 08:50:03.459290: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-06-14 08:50:03.459319: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-06-14 08:50:03.459348: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-06-14 08:50:03.459368: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-06-14 08:50:03.459402: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-14 08:50:03.459479: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-14 08:50:03.460429: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-14 08:50:03.461323: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-06-14 08:50:03.461370: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-06-14 08:50:04.067131: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-14 08:50:04.067173: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-06-14 08:50:04.067191: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-06-14 08:50:04.067430: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-14 08:50:04.068411: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-14 08:50:04.069352: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-14 08:50:04.070305: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14919 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
2020-06-14 08:50:04.072418: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55c05a596390 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-06-14 08:50:04.072442: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
WARNING:tensorflow:From /home/ubuntu/.miniconda/envs/tf1/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:422: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.
Epoch 1/10
2020-06-14 08:50:05.153780: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
60000/60000 [==============================] - 4s 71us/step - loss: 3.7436 - accuracy: 0.6998
Epoch 2/10
60000/60000 [==============================] - 4s 66us/step - loss: 0.6869 - accuracy: 0.7567
Epoch 3/10
60000/60000 [==============================] - 4s 65us/step - loss: 0.5983 - accuracy: 0.7910
Epoch 4/10
60000/60000 [==============================] - 4s 65us/step - loss: 0.5425 - accuracy: 0.8092
```
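The `emacs` edit in the transcript above can also be scripted. This helps when the same change has to be repeated in every new environment. Below is a minimal sketch with GNU `sed`. It assumes the import appears exactly as `from tensorflow import keras` on its own line. The function name `use_standalone_keras` is made up for illustration:

```bash
# Hypothetical helper: switch a script from tf.keras to standalone Keras.
# Assumes the import is written exactly as 'from tensorflow import keras'
# on its own line; any other spelling of the import is left untouched.
use_standalone_keras() {
  local script="$1"
  # -i.bak edits the file in place and keeps the original as "$script.bak"
  sed -i.bak 's/^from tensorflow import keras$/import keras/' "$script"
}
```

For example, `use_standalone_keras fmnist.py` rewrites the import in place. It also keeps the untouched original as `fmnist.py.bak`, so switching back is easy.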
Nice, it works, but I'm still not satisfied. It's still using CUDA 10.0 (note the `libcudart.so.10.0` lines in the log), because TF 1.15 is built against CUDA 10.0.
So, let's redo everything with TF 1.12 (this time conda wants Python 3.6):
```
$ conda create -n tf112 python=3.6 tensorflow-gpu=1.12 keras
Collecting package metadata (current_repodata.json): done
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: done
## Package Plan ##
environment location: /home/ubuntu/.miniconda/envs/tf112
added / updated specs:
- keras
- python=3.6
- tensorflow-gpu=1.12
The following packages will be downloaded:
package | build
---------------------------|-----------------
absl-py-0.9.0 | py36_0 167 KB
astor-0.8.0 | py36_0 46 KB
certifi-2020.4.5.1 | py36_0 155 KB
cudatoolkit-9.2 | 0 233.9 MB
cudnn-7.6.5 | cuda9.2_0 142.7 MB
cupti-9.2.148 | 0 1.5 MB
grpcio-1.27.2 | py36hf8bcb03_0 1.3 MB
h5py-2.10.0 | py36h7918eee_0 1.0 MB
keras-base-2.3.1 | py36_0 495 KB
markdown-3.1.1 | py36_0 116 KB
mkl-service-2.3.0 | py36he904b0f_0 219 KB
mkl_fft-1.0.15 | py36ha843d7b_0 155 KB
mkl_random-1.1.1 | py36h0573a6f_0 327 KB
numpy-1.18.1 | py36h4f9e942_0 5 KB
numpy-base-1.18.1 | py36hde5b4d6_1 4.2 MB
pip-20.0.2 | py36_3 1.7 MB
protobuf-3.12.3 | py36he6710b0_0 644 KB
python-3.6.10 | h7579374_2 29.7 MB
pyyaml-5.3.1 | py36h7b6447c_0 180 KB
scipy-1.4.1 | py36h0b6359f_0 14.6 MB
setuptools-47.1.1 | py36_0 514 KB
tensorboard-1.12.2 | py36he6710b0_0 3.0 MB
tensorflow-1.12.0 |gpu_py36he74679b_0 4 KB
tensorflow-base-1.12.0 |gpu_py36had579c0_0 102.7 MB
tensorflow-gpu-1.12.0 | h0d30ee6_0 3 KB
termcolor-1.1.0 | py36_1 8 KB
wheel-0.34.2 | py36_0 51 KB
------------------------------------------------------------
Total: 539.3 MB
The following NEW packages will be INSTALLED:
_libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
_tflow_select pkgs/main/linux-64::_tflow_select-2.1.0-gpu
absl-py pkgs/main/linux-64::absl-py-0.9.0-py36_0
astor pkgs/main/linux-64::astor-0.8.0-py36_0
blas pkgs/main/linux-64::blas-1.0-mkl
c-ares pkgs/main/linux-64::c-ares-1.15.0-h7b6447c_1001
ca-certificates pkgs/main/linux-64::ca-certificates-2020.1.1-0
certifi pkgs/main/linux-64::certifi-2020.4.5.1-py36_0
cudatoolkit pkgs/main/linux-64::cudatoolkit-9.2-0
cudnn pkgs/main/linux-64::cudnn-7.6.5-cuda9.2_0
cupti pkgs/main/linux-64::cupti-9.2.148-0
gast pkgs/main/noarch::gast-0.3.3-py_0
grpcio pkgs/main/linux-64::grpcio-1.27.2-py36hf8bcb03_0
h5py pkgs/main/linux-64::h5py-2.10.0-py36h7918eee_0
hdf5 pkgs/main/linux-64::hdf5-1.10.4-hb1b8bf9_0
intel-openmp pkgs/main/linux-64::intel-openmp-2020.1-217
keras pkgs/main/linux-64::keras-2.3.1-0
keras-applications pkgs/main/noarch::keras-applications-1.0.8-py_0
keras-base pkgs/main/linux-64::keras-base-2.3.1-py36_0
keras-preprocessi~ pkgs/main/noarch::keras-preprocessing-1.1.0-py_1
ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.33.1-h53a641e_7
libedit pkgs/main/linux-64::libedit-3.1.20181209-hc058e9b_0
libffi pkgs/main/linux-64::libffi-3.3-he6710b0_1
libgcc-ng pkgs/main/linux-64::libgcc-ng-9.1.0-hdf63c60_0
libgfortran-ng pkgs/main/linux-64::libgfortran-ng-7.3.0-hdf63c60_0
libprotobuf pkgs/main/linux-64::libprotobuf-3.12.3-hd408876_0
libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-9.1.0-hdf63c60_0
markdown pkgs/main/linux-64::markdown-3.1.1-py36_0
mkl pkgs/main/linux-64::mkl-2020.1-217
mkl-service pkgs/main/linux-64::mkl-service-2.3.0-py36he904b0f_0
mkl_fft pkgs/main/linux-64::mkl_fft-1.0.15-py36ha843d7b_0
mkl_random pkgs/main/linux-64::mkl_random-1.1.1-py36h0573a6f_0
ncurses pkgs/main/linux-64::ncurses-6.2-he6710b0_1
numpy pkgs/main/linux-64::numpy-1.18.1-py36h4f9e942_0
numpy-base pkgs/main/linux-64::numpy-base-1.18.1-py36hde5b4d6_1
openssl pkgs/main/linux-64::openssl-1.1.1g-h7b6447c_0
pip pkgs/main/linux-64::pip-20.0.2-py36_3
protobuf pkgs/main/linux-64::protobuf-3.12.3-py36he6710b0_0
python pkgs/main/linux-64::python-3.6.10-h7579374_2
pyyaml pkgs/main/linux-64::pyyaml-5.3.1-py36h7b6447c_0
readline pkgs/main/linux-64::readline-8.0-h7b6447c_0
scipy pkgs/main/linux-64::scipy-1.4.1-py36h0b6359f_0
setuptools pkgs/main/linux-64::setuptools-47.1.1-py36_0
six pkgs/main/noarch::six-1.15.0-py_0
sqlite pkgs/main/linux-64::sqlite-3.31.1-h62c20be_1
tensorboard pkgs/main/linux-64::tensorboard-1.12.2-py36he6710b0_0
tensorflow pkgs/main/linux-64::tensorflow-1.12.0-gpu_py36he74679b_0
tensorflow-base pkgs/main/linux-64::tensorflow-base-1.12.0-gpu_py36had579c0_0
tensorflow-gpu pkgs/main/linux-64::tensorflow-gpu-1.12.0-h0d30ee6_0
termcolor pkgs/main/linux-64::termcolor-1.1.0-py36_1
tk pkgs/main/linux-64::tk-8.6.8-hbc83047_0
werkzeug pkgs/main/noarch::werkzeug-1.0.1-py_0
wheel pkgs/main/linux-64::wheel-0.34.2-py36_0
xz pkgs/main/linux-64::xz-5.2.5-h7b6447c_0
yaml pkgs/main/linux-64::yaml-0.1.7-had09818_2
zlib pkgs/main/linux-64::zlib-1.2.11-h7b6447c_3
Proceed ([y]/n)? y
[..]
#
# To activate this environment, use
#
# $ conda activate tf112
#
# To deactivate an active environment, use
#
# $ conda deactivate
```
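The CUDA version depends on which `cudatoolkit` build conda picks for the requested TensorFlow. That means it can be checked before downloading half a gigabyte. Below is a rough sketch. It assumes `conda create --dry-run` prints the same kind of package plan as the one above. The helper name `cuda_from_plan` is hypothetical:

```bash
# Hypothetical helper: read a conda package plan on stdin and print the
# version of the cudatoolkit package it would install (first match only).
cuda_from_plan() {
  grep -o 'cudatoolkit-[0-9][0-9.]*' | head -n 1 | cut -d- -f2
}
```

Usage: `conda create --dry-run -n tf112 python=3.6 tensorflow-gpu=1.12 keras | cuda_from_plan`. On the plan shown above it picks out `9.2`.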
Note that it's now using CUDA 9.2. I also had to change the loss to `'sparse_categorical_crossentropy'` because of the different Keras version.
```
$ python fmnist.py
Using TensorFlow backend.
/home/ubuntu/.miniconda/envs/older/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/ubuntu/.miniconda/envs/older/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/ubuntu/.miniconda/envs/older/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/ubuntu/.miniconda/envs/older/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/ubuntu/.miniconda/envs/older/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/ubuntu/.miniconda/envs/older/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
Epoch 1/10
2020-06-14 09:04:12.308259: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-06-14 09:04:12.920291: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-14 09:04:12.921233: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
totalMemory: 15.78GiB freeMemory: 15.34GiB
2020-06-14 09:04:12.921263: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2020-06-14 09:04:13.303980: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-14 09:04:13.304037: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2020-06-14 09:04:13.304047: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2020-06-14 09:04:13.304182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14839 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
60000/60000 [==============================] - 6s 100us/step - loss: 2.4673 - acc: 0.1294
Epoch 2/10
[..]
```
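Most of this output is TensorFlow start-up logging. The lines worth comparing between the two runs are the epoch summaries. Their metric name also differs: `accuracy` in the TF 1.15 run and `acc` in the TF 1.12 run. Below is a small filter sketch. It assumes the summary lines keep the format shown above. `epoch_metrics` is a hypothetical name:

```bash
# Hypothetical helper: print "loss accuracy" for every epoch-summary line of
# a Keras training log read on stdin. Summary lines are recognised by their
# "/step" timing field; both metric spellings seen above are accepted
# ("accuracy:" in the TF 1.15 run, "acc:" in the TF 1.12 run).
epoch_metrics() {
  sed -nE '/\/step/ s/.* - loss: ([0-9.]+) - acc(uracy)?: ([0-9.]+).*/\1 \3/p'
}
```

Usage: run `python fmnist.py 2>&1 | tee run.log`, then `epoch_metrics < run.log` prints one `loss accuracy` pair per epoch. To avoid most of the noise in the first place, set the environment variable `TF_CPP_MIN_LOG_LEVEL=1` before running. This hides TensorFlow's C++ `I tensorflow/...` INFO lines.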
### Freezing and restoring a conda environment
``` bash
(older) $ conda env export > older.yml
(older) $ conda deactivate # or CTRL-d
$ conda env create -n acopy -f older.yml
Collecting package metadata (repodata.json): done
Solving environment: done
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
# $ conda activate acopy
#
# To deactivate an active environment, use
#
# $ conda deactivate
$ conda activate acopy
(acopy) $ conda env export
name: acopy
channels:
- defaults
dependencies:
- _libgcc_mutex=0.1=main
- _tflow_select=2.1.0=gpu
- absl-py=0.9.0=py36_0
- astor=0.8.0=py36_0
- blas=1.0=mkl
- c-ares=1.15.0=h7b6447c_1001
- ca-certificates=2020.1.1=0
- certifi=2020.4.5.1=py36_0
- cudatoolkit=9.2=0
- cudnn=7.6.5=cuda9.2_0
- cupti=9.2.148=0
- gast=0.3.3=py_0
- grpcio=1.27.2=py36hf8bcb03_0
- h5py=2.10.0=py36h7918eee_0
- hdf5=1.10.4=hb1b8bf9_0
- intel-openmp=2020.1=217
- keras=2.2.4=0
- keras-applications=1.0.8=py_0
- keras-base=2.2.4=py36_0
- keras-preprocessing=1.1.0=py_1
- ld_impl_linux-64=2.33.1=h53a641e_7
- libedit=3.1.20181209=hc058e9b_0
- libffi=3.3=he6710b0_1
- libgcc-ng=9.1.0=hdf63c60_0
- libgfortran-ng=7.3.0=hdf63c60_0
- libprotobuf=3.12.3=hd408876_0
- libstdcxx-ng=9.1.0=hdf63c60_0
- markdown=3.1.1=py36_0
- mkl=2020.1=217
- mkl-service=2.3.0=py36he904b0f_0
- mkl_fft=1.0.15=py36ha843d7b_0
- mkl_random=1.1.1=py36h0573a6f_0
- ncurses=6.2=he6710b0_1
- numpy=1.18.1=py36h4f9e942_0
- numpy-base=1.18.1=py36hde5b4d6_1
- openssl=1.1.1g=h7b6447c_0
- pip=20.0.2=py36_3
- protobuf=3.12.3=py36he6710b0_0
- python=3.6.10=h7579374_2
- pyyaml=5.3.1=py36h7b6447c_0
- readline=8.0=h7b6447c_0
- scipy=1.4.1=py36h0b6359f_0
- setuptools=47.1.1=py36_0
- six=1.15.0=py_0
- sqlite=3.31.1=h62c20be_1
- tensorboard=1.12.2=py36he6710b0_0
- tensorflow=1.12.0=gpu_py36he74679b_0
- tensorflow-base=1.12.0=gpu_py36had579c0_0
- tensorflow-gpu=1.12.0=h0d30ee6_0
- termcolor=1.1.0=py36_1
- tk=8.6.8=hbc83047_0
- werkzeug=1.0.1=py_0
- wheel=0.34.2=py36_0
- xz=5.2.5=h7b6447c_0
- yaml=0.1.7=had09818_2
- zlib=1.2.11=h7b6447c_3
prefix: /home/ubuntu/.miniconda/envs/acopy
```
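You can confirm that the copy really matches the original by comparing the two exports. First drop the lines that necessarily differ (`name:` and `prefix:`). Below is a sketch in bash; `export_body` is a hypothetical helper:

```bash
# Hypothetical helper: drop the environment-specific "name:" and "prefix:"
# lines from a `conda env export` listing so two exports can be diffed.
export_body() {
  grep -v -e '^name:' -e '^prefix:'
}
```

Usage: `diff <(conda env export -n older | export_body) <(conda env export -n acopy | export_body) && echo identical`. Note that the export pins build strings such as `py36_0`, which are platform-specific. When the YAML has to be recreated on another OS, `conda env export --no-builds` omits them, at the cost of a less exact copy.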
### Terminating the instance
``` bash
$ instances
i-088e8c600ed61650f ami-0987fcabe779f2491 p3.2xlarge stopped sourya 2020-03-28T05:29:08.000Z us-west-2b vpc-411bdc39
i-0ee96511fbb4328f9 ami-0987fcabe779f2491 p3.2xlarge stopped SNN 2020-05-10T16:13:26.000Z us-west-2a vpc-411bdc39
i-01a3fac4ddea59c76 ami-0cc039c2244660e0c p3.2xlarge stopped tianyi 2020-06-10T06:12:17.000Z us-west-2b vpc-411bdc39
i-0cbce25e73d063cc6 ami-099eed573ea1c101f p3.2xlarge running Marco 2020-06-14T06:49:49.000Z us-west-2b vpc-411bdc39
$ instances | grep Marco | instance-terminate
You are about to terminate the following instances:
i-0cbce25e73d063cc6 ami-099eed573ea1c101f p3.2xlarge running Marco 2020-06-14T06:49:49.000Z us-west-2b vpc-411bdc39
Are you sure you want to continue? y
i-0cbce25e73d063cc6 PreviousState=running CurrentState=shutting-down
```
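`instance-terminate` destroys the instance. To keep its disk (and the conda environments on it) for later, stop the instance instead with the plain AWS CLI. Below is a sketch. It assumes the Bash-my-AWS `instances` columns shown above (ID first, Name tag fifth) and a Name without spaces. `instance_ids_named` is a hypothetical helper:

```bash
# Hypothetical helper: read the Bash-my-AWS `instances` listing on stdin and
# print the ID(s) of the instances whose Name tag (5th column) equals $1.
# Assumes the column layout shown above: id, AMI, type, state, name, ...
instance_ids_named() {
  awk -v name="$1" '$5 == name { print $1 }'
}
```

Usage: `aws ec2 stop-instances --instance-ids $(instances | instance_ids_named Marco)`. A stopped instance is no longer billed for compute, only for its EBS storage. `aws ec2 start-instances` brings it back.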