# TF on Ubuntu 20.04 from scratch (AWS)

### Starting an instance

This instance is using a basic Ubuntu 20.04 (no drivers, editors, Xorg).

Note: I use [`aws-vault`](https://github.com/99designs/aws-vault) to store the credentials to AWS accounts on my computer (so I run `aws-vault exec hal -- bash` at the beginning) and [`Bash-my-AWS`](https://bash-my-aws.org/) as a nicer command-line interface to AWS.

Let's check which AMI to use:

``` bash
$ aws ec2 describe-images --owners aws-marketplace --filters Name=name,Values='*ubuntu-focal-20.04*' | jq '.Images[] | [.CreationDate, .ImageId, .Name, .Description]'
[
  "2020-04-23T18:35:13.000Z",
  "ami-017cb7c5fb425a889",
  "SupportedImages ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-serv-a5202e59-fcb2-4185-ac3e-cfb34ca880c0-ami-06f88aeefe25dc6ba.4",
  "Ubuntu Server 20.04 LTS (Ubuntu 20.04 LTS ) (Ubuntu 20) Focal Fossa"
]
[
  "2020-04-24T09:28:04.000Z",
  "ami-0652b0a864db01553",
  "ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200423-aced0818-eef1-427a-9e04-8ba38bada306-ami-068663a3c619dd892.4",
  "Canonical, Ubuntu, 20.04 LTS, amd64 focal image build on 2020-04-23"
]
[
  "2020-04-24T09:32:55.000Z",
  "ami-099eed573ea1c101f",
  "ubuntu-minimal/images/hvm-ssd/ubuntu-focal-20.04-amd64-minimal-2020-d5944ad4-5199-4cf3-ab4c-c2c4598f880b-ami-0f84c9a9348f9f857.4",
  "Canonical, Ubuntu Minimal, 20.04 LTS, amd64 focal image build on 2020-04-23"
]
[
  "2020-04-24T09:39:36.000Z",
  "ami-0aba5b7c9025fd3fd",
  "ubuntu/images/hvm-ssd/ubuntu-focal-20.04-arm64-server-20200423-3ba3581b-86f4-4bf1-a9a5-c2e11fe9408d-ami-00579fbb15b954340.4",
  "Canonical, Ubuntu, 20.04 LTS, arm64 focal image build on 2020-04-23"
]
[
  "2020-04-27T17:11:24.000Z",
  "ami-0db29aadadd4a4cca",
  "ubuntu-pro/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200423-ae7ed378-8838-4fcf-842d-d1d09b34f116-ami-0118f3de163338756.4",
  "ubuntu-pro/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200423"
]
```

I'll use the minimal one, `ami-099eed573ea1c101f`.

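
Picking the AMI by eye works for a handful of results, but the choice can also be scripted. As a rough sketch (the `newest_ami` helper is hypothetical, not part of the AWS CLI or Bash-my-AWS, and it assumes `jq` is installed), the following reads the `describe-images` JSON on stdin, keeps the images whose name matches a pattern, and prints the ID of the most recently created one:

```bash
#!/usr/bin/env bash

# newest_ami PATTERN
# Hypothetical helper (an assumption for illustration, not an AWS or
# Bash-my-AWS command). Reads `aws ec2 describe-images` JSON on stdin and
# prints the ImageId of the newest image whose Name matches the regex
# PATTERN. Prints nothing when no image matches.
newest_ami() {
  local pattern="$1"
  jq -r --arg pat "$pattern" '
    [.Images[] | select(.Name | test($pat))]  # keep matching images
    | sort_by(.CreationDate)                  # ISO 8601 dates sort as strings
    | last                                    # newest one (null if none)
    | .ImageId // empty
  '
}

# Usage (needs AWS credentials, so it is left commented out here):
#   aws ec2 describe-images --owners aws-marketplace \
#     --filters Name=name,Values='*ubuntu-focal-20.04*' \
#     | newest_ami 'amd64-minimal'
```

In the listing above only the Ubuntu Minimal image has `amd64-minimal` in its name, so that pattern should select the same AMI that was picked by hand; it is still worth comparing the result with the listing before launching anything.
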
``` bash
$ aws ec2 run-instances --image-id ami-099eed573ea1c101f --block-device-mapping DeviceName=/dev/sda1,Ebs={VolumeSize=16} --instance-type p3.2xlarge --key-name paolieri --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=Marco}]'
```

Let's connect (here I'm using Bash-my-AWS):

``` bash
$ instances
i-0cbce25e73d063cc6 ami-099eed573ea1c101f p3.2xlarge running Marco 2020-06-14T06:49:49.000Z us-west-2b vpc-411bdc39
$ instances | grep Marco | instance-ssh-details
i-0cbce25e73d063cc6 paolieri 34.222.239.184 Marco
$ ssh -i ~/.ssh/aws_hal.pem ubuntu@34.222.239.184
```

### NVIDIA drivers and `coolgpus`

We're connected, so let's install the NVIDIA drivers (this also brings in `nvidia-smi`, `gcc-9` and Xorg):

``` bash
$ sudo apt-get update
$ sudo apt-get install nvidia-driver-440
```

Let's reboot and check whether Xorg is running:

``` bash
$ sudo reboot
Connection to 34.222.239.184 closed by remote host.
Connection to 34.222.239.184 closed.
$ ssh -i ~/.ssh/aws_hal.pem ubuntu@34.222.239.184
Welcome to Ubuntu 20.04 LTS (GNU/Linux 5.4.0-1009-aws x86_64)
$ ps aux | grep Xorg
root 881 1.1 0.0 1396048 48960 tty1 Sl+ 07:08 0:00 /usr/lib/xorg/Xorg vt1 -displayfd 3 -auth /run/user/119/gdm/Xauthority -background none -noreset -keeptty -verbose 3
```

Xorg is running, and we don't want that, so we disable the graphical mode and reboot:

``` bash
$ sudo systemctl set-default multi-user
$ sudo reboot
$ ssh -i ~/.ssh/aws_hal.pem ubuntu@34.222.239.184
$ ps aux | grep Xorg
ubuntu 900 0.0 0.0 5188 732 pts/0 S+ 07:18 0:00 grep --color=auto Xorg
```

Excellent, it's not running anymore on startup. Can we see the GPU? Yup:

``` bash
$ nvidia-smi
Sun Jun 14 07:20:06 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64       Driver Version: 440.64       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   29C    P0    38W / 300W |      0MiB / 16160MiB |      5%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

Now, will `coolgpus` work? No: once started, it can't find any fans (note that the `Fan` column of `nvidia-smi` above already reads `N/A`).

``` bash
$ sudo apt-get install wget emacs-nox
$ cd /usr/local/bin
$ sudo wget https://raw.githubusercontent.com/andyljones/coolgpus/master/coolgpus
$ sudo chmod +x coolgpus
$ sudo emacs coolgpus # change python to python3, then C-x C-s, C-x C-c
$ sudo coolgpus --kill
Killing all running X servers, including 1626
Awaiting X server shutdown
All X servers killed
Starting xserver: Xorg :0 -once -config /tmp/cool-gpu-00000000:00:1E.0xkn_4ho5/xorg.conf
X.Org X Server 1.20.8
X Protocol Version 11, Revision 0
Build Operating System: Linux 4.4.0-179-generic x86_64 Ubuntu
Current Operating System: Linux ip-172-31-38-39 5.4.0-1009-aws #9-Ubuntu SMP Sun Apr 12 19:46:01 UTC 2020 x86_64
Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.4.0-1009-aws root=PARTUUID=89296dce-01 ro console=tty1 console=ttyS0 nvme_core.io_timeout=4294967295 panic=-1
Build Date: 21 May 2020 08:22:15AM
xorg-server 2:1.20.8-2ubuntu2.1 (For technical support please see http://www.ubuntu.com/support)
Current version of pixman: 0.38.4
Before reporting problems, check http://wiki.x.org to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting, (++) from command line, (!!)
notice, (II) informational, (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.0.log", Time: Sun Jun 14 07:53:12 2020
(++) Using config file: "/tmp/cool-gpu-00000000:00:1E.0xkn_4ho5/xorg.conf"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
libEGL warning: DRI2: failed to authenticate
ERROR: Error resolving target specification 'fan:0' (No targets match target specification), specified in assignment '[fan:0]/GPUTargetFanSpeed=30'.
Released fan speed control for GPU at :0
Terminating xserver for display :0
Traceback (most recent call last):
  File "/usr/local/bin/coolgpus", line 242, in <module>
    run()
  File "/usr/local/bin/coolgpus", line 239, in run
    manage_fans(displays)
  File "/usr/local/bin/coolgpus", line 225, in manage_fans
    set_speed(display, s)
  File "/usr/local/bin/coolgpus", line 212, in set_speed
    assign(display, '[fan:0]/GPUTargetFanSpeed='+str(int(target)))
  File "/usr/local/bin/coolgpus", line 208, in assign
    log_output(['nvidia-settings', '-a', command, '-c', display])
  File "/usr/local/bin/coolgpus", line 91, in log_output
    raise ValueError('Command ' + ' '.join(command) + ' crashed with return code ' + str(p.returncode) + ':')
ValueError: Command nvidia-settings -a [fan:0]/GPUTargetFanSpeed=30 -c :0 crashed with return code 1:
u
```

### Using Miniconda and TF

First, we need to install conda:

``` bash
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/.miniconda
$ .miniconda/condabin/conda update -y -n base -c defaults conda
$ .miniconda/condabin/conda init bash zsh
$ .miniconda/condabin/conda config --set auto_activate_base false
# log out and reconnect with ssh
```

Now, let's try to create an empty Python 3.8 conda environment:

``` bash
$ conda create -n py38 python=3.8
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/ubuntu/.miniconda/envs/py38
added / updated specs: - python=3.8 The following packages will be downloaded: package | build ---------------------------|----------------- certifi-2020.4.5.1 | py38_0 156 KB libffi-3.3 | he6710b0_1 50 KB pip-20.0.2 | py38_3 1.7 MB python-3.8.3 | hcff3b4d_0 49.1 MB readline-8.0 | h7b6447c_0 356 KB setuptools-47.1.1 | py38_0 517 KB wheel-0.34.2 | py38_0 51 KB ------------------------------------------------------------ Total: 51.9 MB The following NEW packages will be INSTALLED: _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main ca-certificates pkgs/main/linux-64::ca-certificates-2020.1.1-0 certifi pkgs/main/linux-64::certifi-2020.4.5.1-py38_0 ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.33.1-h53a641e_7 libedit pkgs/main/linux-64::libedit-3.1.20181209-hc058e9b_0 libffi pkgs/main/linux-64::libffi-3.3-he6710b0_1 libgcc-ng pkgs/main/linux-64::libgcc-ng-9.1.0-hdf63c60_0 libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-9.1.0-hdf63c60_0 ncurses pkgs/main/linux-64::ncurses-6.2-he6710b0_1 openssl pkgs/main/linux-64::openssl-1.1.1g-h7b6447c_0 pip pkgs/main/linux-64::pip-20.0.2-py38_3 python pkgs/main/linux-64::python-3.8.3-hcff3b4d_0 readline pkgs/main/linux-64::readline-8.0-h7b6447c_0 setuptools pkgs/main/linux-64::setuptools-47.1.1-py38_0 sqlite pkgs/main/linux-64::sqlite-3.31.1-h62c20be_1 tk pkgs/main/linux-64::tk-8.6.8-hbc83047_0 wheel pkgs/main/linux-64::wheel-0.34.2-py38_0 xz pkgs/main/linux-64::xz-5.2.5-h7b6447c_0 zlib pkgs/main/linux-64::zlib-1.2.11-h7b6447c_3 Proceed ([y]/n)? 
y Downloading and Extracting Packages libffi-3.3 | 50 KB | ################################################################################################################################################## | 100% readline-8.0 | 356 KB | ################################################################################################################################################## | 100% wheel-0.34.2 | 51 KB | ################################################################################################################################################## | 100% python-3.8.3 | 49.1 MB | ################################################################################################################################################## | 100% pip-20.0.2 | 1.7 MB | ################################################################################################################################################## | 100% setuptools-47.1.1 | 517 KB | ################################################################################################################################################## | 100% certifi-2020.4.5.1 | 156 KB | ################################################################################################################################################## | 100% Preparing transaction: done Verifying transaction: done Executing transaction: done # # To activate this environment, use # # $ conda activate py38 # # To deactivate an active environment, use # # $ conda deactivate $ python --version -bash: python: command not found $ python3 --version Python 3.8.2 $ which python3 /usr/bin/python3 $ conda activate py38 (py38) $ python --version Python 3.8.3 (py38) $ python3 --version Python 3.8.3 (py38) $ which python /home/ubuntu/.miniconda/envs/py38/bin/python (py38) $ which python3 /home/ubuntu/.miniconda/envs/py38/bin/python3 (py38) $ which pip /home/ubuntu/.miniconda/envs/py38/bin/pip (py38) $ pip install click Collecting click Downloading 
click-7.1.2-py2.py3-none-any.whl (82 kB)
     |████████████████████████████████| 82 kB 1.4 MB/s
Installing collected packages: click
Successfully installed click-7.1.2
(py38) $ python
Python 3.8.3 (default, May 19 2020, 18:47:26)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import click
>>> click
<module 'click' from '/home/ubuntu/.miniconda/envs/py38/lib/python3.8/site-packages/click/__init__.py'>
```

Nice, but we could do the same with `poetry`, `pyenv`, `pipenv`, or `venv`. Now, let's create a conda environment with TF2 and matching versions of `cudatoolkit` and `cudnn` (binary system libraries in a package!):

```
CTRL-d (to exit the py38 conda)
$ conda create -n tf2 python=3.8 tensorflow-gpu
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/ubuntu/.miniconda/envs/tf2

  added / updated specs:
    - python=3.8
    - tensorflow-gpu

The following packages will be downloaded:

    package | build
    ---------------------------|-----------------
    _tflow_select-2.1.0 | gpu 2 KB
    absl-py-0.9.0 | py38_0 165 KB
    astunparse-1.6.3 | py_0 17 KB
    blas-1.0 | mkl 6 KB
    blinker-1.4 | py38_0 22 KB
    c-ares-1.15.0 | h7b6447c_1001 89 KB
    cachetools-4.1.0 | py_1 15 KB
    cffi-1.14.0 | py38he30daa8_1 225 KB
    chardet-3.0.4 | py38_1003 174 KB
    click-7.1.2 | py_0 71 KB
    cryptography-2.9.2 | py38h1ba5d50_0 556 KB
    cudatoolkit-10.1.243 | h6bb024c_0 347.4 MB
    cudnn-7.6.5 | cuda10.1_0 179.9 MB
    cupti-10.1.168 | 0 1.4 MB
    gast-0.3.3 | py_0 14 KB
    google-auth-1.14.1 | py_0 58 KB
    google-auth-oauthlib-0.4.1 | py_2 20 KB
    google-pasta-0.2.0 | py_0 46 KB
    grpcio-1.27.2 | py38hf8bcb03_0 1.3 MB
    h5py-2.10.0 | py38h7918eee_0 1.1 MB
    hdf5-1.10.4 | hb1b8bf9_0 3.9 MB
    intel-openmp-2020.1 | 217 780 KB
    keras-preprocessing-1.1.0 | py_1 37 KB
    libgfortran-ng-7.3.0 | hdf63c60_0 1006 KB
    libprotobuf-3.12.3 | hd408876_0 2.9 MB
    markdown-3.1.1 | py38_0 116 KB
    mkl-2020.1 | 217
129.0 MB mkl-service-2.3.0 | py38he904b0f_0 62 KB mkl_fft-1.0.15 | py38ha843d7b_0 159 KB mkl_random-1.1.1 | py38h0573a6f_0 341 KB numpy-1.18.1 | py38h4f9e942_0 5 KB numpy-base-1.18.1 | py38hde5b4d6_1 4.2 MB oauthlib-3.1.0 | py_0 91 KB opt_einsum-3.1.0 | py_0 54 KB protobuf-3.12.3 | py38he6710b0_0 648 KB pyasn1-0.4.8 | py_0 57 KB pyasn1-modules-0.2.7 | py_0 68 KB pyjwt-1.7.1 | py38_0 33 KB pyopenssl-19.1.0 | py38_0 88 KB pysocks-1.7.1 | py38_0 28 KB requests-2.23.0 | py38_0 93 KB requests-oauthlib-1.3.0 | py_0 23 KB rsa-4.0 | py_0 29 KB scipy-1.4.1 | py38h0b6359f_0 14.8 MB tensorboard-2.2.1 | pyh532a8cf_0 2.4 MB tensorboard-plugin-wit-1.6.0| py_0 630 KB tensorflow-2.2.0 |gpu_py38hb782248_0 4 KB tensorflow-base-2.2.0 |gpu_py38h83e3d50_0 179.3 MB tensorflow-estimator-2.2.0 | pyh208ff02_0 254 KB tensorflow-gpu-2.2.0 | h0d30ee6_0 3 KB termcolor-1.1.0 | py38_1 8 KB urllib3-1.25.8 | py38_0 170 KB werkzeug-1.0.1 | py_0 240 KB wrapt-1.12.1 | py38h7b6447c_1 50 KB ------------------------------------------------------------ Total: 874.0 MB The following NEW packages will be INSTALLED: _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main _tflow_select pkgs/main/linux-64::_tflow_select-2.1.0-gpu absl-py pkgs/main/linux-64::absl-py-0.9.0-py38_0 astunparse pkgs/main/noarch::astunparse-1.6.3-py_0 blas pkgs/main/linux-64::blas-1.0-mkl blinker pkgs/main/linux-64::blinker-1.4-py38_0 c-ares pkgs/main/linux-64::c-ares-1.15.0-h7b6447c_1001 ca-certificates pkgs/main/linux-64::ca-certificates-2020.1.1-0 cachetools pkgs/main/noarch::cachetools-4.1.0-py_1 certifi pkgs/main/linux-64::certifi-2020.4.5.1-py38_0 cffi pkgs/main/linux-64::cffi-1.14.0-py38he30daa8_1 chardet pkgs/main/linux-64::chardet-3.0.4-py38_1003 click pkgs/main/noarch::click-7.1.2-py_0 cryptography pkgs/main/linux-64::cryptography-2.9.2-py38h1ba5d50_0 cudatoolkit pkgs/main/linux-64::cudatoolkit-10.1.243-h6bb024c_0 cudnn pkgs/main/linux-64::cudnn-7.6.5-cuda10.1_0 cupti pkgs/main/linux-64::cupti-10.1.168-0 gast 
pkgs/main/noarch::gast-0.3.3-py_0 google-auth pkgs/main/noarch::google-auth-1.14.1-py_0 google-auth-oauth~ pkgs/main/noarch::google-auth-oauthlib-0.4.1-py_2 google-pasta pkgs/main/noarch::google-pasta-0.2.0-py_0 grpcio pkgs/main/linux-64::grpcio-1.27.2-py38hf8bcb03_0 h5py pkgs/main/linux-64::h5py-2.10.0-py38h7918eee_0 hdf5 pkgs/main/linux-64::hdf5-1.10.4-hb1b8bf9_0 idna pkgs/main/noarch::idna-2.9-py_1 intel-openmp pkgs/main/linux-64::intel-openmp-2020.1-217 keras-preprocessi~ pkgs/main/noarch::keras-preprocessing-1.1.0-py_1 ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.33.1-h53a641e_7 libedit pkgs/main/linux-64::libedit-3.1.20181209-hc058e9b_0 libffi pkgs/main/linux-64::libffi-3.3-he6710b0_1 libgcc-ng pkgs/main/linux-64::libgcc-ng-9.1.0-hdf63c60_0 libgfortran-ng pkgs/main/linux-64::libgfortran-ng-7.3.0-hdf63c60_0 libprotobuf pkgs/main/linux-64::libprotobuf-3.12.3-hd408876_0 libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-9.1.0-hdf63c60_0 markdown pkgs/main/linux-64::markdown-3.1.1-py38_0 mkl pkgs/main/linux-64::mkl-2020.1-217 mkl-service pkgs/main/linux-64::mkl-service-2.3.0-py38he904b0f_0 mkl_fft pkgs/main/linux-64::mkl_fft-1.0.15-py38ha843d7b_0 mkl_random pkgs/main/linux-64::mkl_random-1.1.1-py38h0573a6f_0 ncurses pkgs/main/linux-64::ncurses-6.2-he6710b0_1 numpy pkgs/main/linux-64::numpy-1.18.1-py38h4f9e942_0 numpy-base pkgs/main/linux-64::numpy-base-1.18.1-py38hde5b4d6_1 oauthlib pkgs/main/noarch::oauthlib-3.1.0-py_0 openssl pkgs/main/linux-64::openssl-1.1.1g-h7b6447c_0 opt_einsum pkgs/main/noarch::opt_einsum-3.1.0-py_0 pip pkgs/main/linux-64::pip-20.0.2-py38_3 protobuf pkgs/main/linux-64::protobuf-3.12.3-py38he6710b0_0 pyasn1 pkgs/main/noarch::pyasn1-0.4.8-py_0 pyasn1-modules pkgs/main/noarch::pyasn1-modules-0.2.7-py_0 pycparser pkgs/main/noarch::pycparser-2.20-py_0 pyjwt pkgs/main/linux-64::pyjwt-1.7.1-py38_0 pyopenssl pkgs/main/linux-64::pyopenssl-19.1.0-py38_0 pysocks pkgs/main/linux-64::pysocks-1.7.1-py38_0 python 
pkgs/main/linux-64::python-3.8.3-hcff3b4d_0 readline pkgs/main/linux-64::readline-8.0-h7b6447c_0 requests pkgs/main/linux-64::requests-2.23.0-py38_0 requests-oauthlib pkgs/main/noarch::requests-oauthlib-1.3.0-py_0 rsa pkgs/main/noarch::rsa-4.0-py_0 scipy pkgs/main/linux-64::scipy-1.4.1-py38h0b6359f_0 setuptools pkgs/main/linux-64::setuptools-47.1.1-py38_0 six pkgs/main/noarch::six-1.15.0-py_0 sqlite pkgs/main/linux-64::sqlite-3.31.1-h62c20be_1 tensorboard pkgs/main/noarch::tensorboard-2.2.1-pyh532a8cf_0 tensorboard-plugi~ pkgs/main/noarch::tensorboard-plugin-wit-1.6.0-py_0 tensorflow pkgs/main/linux-64::tensorflow-2.2.0-gpu_py38hb782248_0 tensorflow-base pkgs/main/linux-64::tensorflow-base-2.2.0-gpu_py38h83e3d50_0 tensorflow-estima~ pkgs/main/noarch::tensorflow-estimator-2.2.0-pyh208ff02_0 tensorflow-gpu pkgs/main/linux-64::tensorflow-gpu-2.2.0-h0d30ee6_0 termcolor pkgs/main/linux-64::termcolor-1.1.0-py38_1 tk pkgs/main/linux-64::tk-8.6.8-hbc83047_0 urllib3 pkgs/main/linux-64::urllib3-1.25.8-py38_0 werkzeug pkgs/main/noarch::werkzeug-1.0.1-py_0 wheel pkgs/main/linux-64::wheel-0.34.2-py38_0 wrapt pkgs/main/linux-64::wrapt-1.12.1-py38h7b6447c_1 xz pkgs/main/linux-64::xz-5.2.5-h7b6447c_0 zlib pkgs/main/linux-64::zlib-1.2.11-h7b6447c_3 Proceed ([y]/n)? y [...] 
# # To activate this environment, use # # $ conda activate tf2 # # To deactivate an active environment, use # # $ conda deactivate $ wget https://pastebin.com/raw/7FEsnMw3 -O fmnist.py $ conda activate tf2 (tf2) $ python fmnist.py Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz 32768/29515 [=================================] - 0s 0us/step Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz 26427392/26421880 [==============================] - 5s 0us/step Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz 8192/5148 [===============================================] - 0s 0us/step Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz 4423680/4422102 [==============================] - 1s 0us/step 2020-06-14 08:21:54.239273: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1 2020-06-14 08:21:54.267503: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-06-14 08:21:54.268518: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: pciBusID: 0000:00:1e.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0 coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s 2020-06-14 08:21:54.268828: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1 2020-06-14 08:21:54.270920: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10 2020-06-14 08:21:54.272822: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library 
libcufft.so.10 2020-06-14 08:21:54.273155: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10 2020-06-14 08:21:54.275103: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10 2020-06-14 08:21:54.276050: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10 2020-06-14 08:21:54.280268: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7 2020-06-14 08:21:54.280442: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-06-14 08:21:54.281498: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-06-14 08:21:54.282419: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0 2020-06-14 08:21:54.282840: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA 2020-06-14 08:21:54.309605: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2300035000 Hz 2020-06-14 08:21:54.310182: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5581bda05040 initialized for platform Host (this does not guarantee that XLA will be used). 
Devices: 2020-06-14 08:21:54.310214: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version 2020-06-14 08:21:54.310516: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-06-14 08:21:54.311507: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: pciBusID: 0000:00:1e.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0 coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s 2020-06-14 08:21:54.311569: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1 2020-06-14 08:21:54.311593: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10 2020-06-14 08:21:54.311610: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10 2020-06-14 08:21:54.311631: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10 2020-06-14 08:21:54.311649: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10 2020-06-14 08:21:54.311670: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10 2020-06-14 08:21:54.311692: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7 2020-06-14 08:21:54.311759: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-06-14 08:21:54.312695: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS 
had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-06-14 08:21:54.313611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0 2020-06-14 08:21:54.313660: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1 2020-06-14 08:21:54.897172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix: 2020-06-14 08:21:54.897217: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108] 0 2020-06-14 08:21:54.897227: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0: N 2020-06-14 08:21:54.898070: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-06-14 08:21:54.899075: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-06-14 08:21:54.900023: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-06-14 08:21:54.900951: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14762 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0) 2020-06-14 08:21:54.903953: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5581c1270160 initialized for platform CUDA (this does not guarantee that XLA will be used). 
Devices:
2020-06-14 08:21:54.903981: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
Epoch 1/10
2020-06-14 08:21:56.333036: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
1875/1875 [==============================] - 3s 2ms/step - loss: 3.2816 - accuracy: 0.6954
Epoch 2/10
1875/1875 [==============================] - 3s 2ms/step - loss: 0.6911 - accuracy: 0.7492
Epoch 3/10
1875/1875 [==============================] - 3s 2ms/step - loss: 0.5915 - accuracy: 0.7861
Epoch 4/10
1875/1875 [==============================] - 3s 2ms/step - loss: 0.5403 - accuracy: 0.8070
Epoch 5/10
1875/1875 [==============================] - 3s 2ms/step - loss: 0.5136 - accuracy: 0.8206
Epoch 6/10
1875/1875 [==============================] - 3s 1ms/step - loss: 0.5027 - accuracy: 0.8281
Epoch 7/10
1875/1875 [==============================] - 3s 1ms/step - loss: 0.4890 - accuracy: 0.8325
Epoch 8/10
1875/1875 [==============================] - 3s 2ms/step - loss: 0.4708 - accuracy: 0.8378
Epoch 9/10
1875/1875 [==============================] - 3s 2ms/step - loss: 0.4744 - accuracy: 0.8383
Epoch 10/10
1875/1875 [==============================] - 3s 2ms/step - loss: 0.4665 - accuracy: 0.8400
```

While that is running, from another SSH connection:

``` bash
$ nvidia-smi
Sun Jun 14 08:22:41 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64       Driver Version: 440.64       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   32C    P0    39W / 300W |  15265MiB / 16160MiB |     12%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      4254      C   python                                     15265MiB |
+-----------------------------------------------------------------------------+
```

Can we do the same but with TF 1.15 and the old Keras? We need to ask for a different version of TF/Keras (available versions [here](https://anaconda.org/anaconda/tensorflow-gpu/files) and [here](https://anaconda.org/anaconda/keras/files) by clicking on "Versions"), and conda will pull in the right versions of `cudatoolkit`, `cudnn` and `cupti`. While trying this I discovered that TF 1.15 is not compatible with Python 3.8, so I had to use 3.7 or lower:

```
$ conda create -n tf1 python=3.7 tensorflow-gpu=1.15 keras
Collecting package metadata (current_repodata.json): done
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done Solving environment: done ## Package Plan ## environment location: /home/ubuntu/.miniconda/envs/tf1 added / updated specs: - keras - python=3.7 - tensorflow-gpu=1.15 The following NEW packages will be INSTALLED: _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main _tflow_select pkgs/main/linux-64::_tflow_select-2.1.0-gpu absl-py pkgs/main/linux-64::absl-py-0.9.0-py37_0 astor pkgs/main/linux-64::astor-0.8.0-py37_0 blas pkgs/main/linux-64::blas-1.0-mkl c-ares pkgs/main/linux-64::c-ares-1.15.0-h7b6447c_1001 ca-certificates pkgs/main/linux-64::ca-certificates-2020.1.1-0 certifi pkgs/main/linux-64::certifi-2020.4.5.1-py37_0 cudatoolkit pkgs/main/linux-64::cudatoolkit-10.0.130-0 cudnn pkgs/main/linux-64::cudnn-7.6.5-cuda10.0_0 cupti pkgs/main/linux-64::cupti-10.0.130-0 gast pkgs/main/linux-64::gast-0.2.2-py37_0 google-pasta pkgs/main/noarch::google-pasta-0.2.0-py_0 grpcio pkgs/main/linux-64::grpcio-1.27.2-py37hf8bcb03_0 h5py pkgs/main/linux-64::h5py-2.10.0-py37h7918eee_0 hdf5 pkgs/main/linux-64::hdf5-1.10.4-hb1b8bf9_0 intel-openmp pkgs/main/linux-64::intel-openmp-2020.1-217 keras pkgs/main/linux-64::keras-2.3.1-0 keras-applications pkgs/main/noarch::keras-applications-1.0.8-py_0 keras-base pkgs/main/linux-64::keras-base-2.3.1-py37_0 keras-preprocessi~ pkgs/main/noarch::keras-preprocessing-1.1.0-py_1 ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.33.1-h53a641e_7 libedit pkgs/main/linux-64::libedit-3.1.20181209-hc058e9b_0 libffi pkgs/main/linux-64::libffi-3.3-he6710b0_1 libgcc-ng pkgs/main/linux-64::libgcc-ng-9.1.0-hdf63c60_0 libgfortran-ng pkgs/main/linux-64::libgfortran-ng-7.3.0-hdf63c60_0 libprotobuf pkgs/main/linux-64::libprotobuf-3.12.3-hd408876_0 libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-9.1.0-hdf63c60_0 markdown pkgs/main/linux-64::markdown-3.1.1-py37_0 mkl pkgs/main/linux-64::mkl-2020.1-217 mkl-service pkgs/main/linux-64::mkl-service-2.3.0-py37he904b0f_0 mkl_fft 
pkgs/main/linux-64::mkl_fft-1.0.15-py37ha843d7b_0 mkl_random pkgs/main/linux-64::mkl_random-1.1.1-py37h0573a6f_0 ncurses pkgs/main/linux-64::ncurses-6.2-he6710b0_1 numpy pkgs/main/linux-64::numpy-1.18.1-py37h4f9e942_0 numpy-base pkgs/main/linux-64::numpy-base-1.18.1-py37hde5b4d6_1 openssl pkgs/main/linux-64::openssl-1.1.1g-h7b6447c_0 opt_einsum pkgs/main/noarch::opt_einsum-3.1.0-py_0 pip pkgs/main/linux-64::pip-20.0.2-py37_3 protobuf pkgs/main/linux-64::protobuf-3.12.3-py37he6710b0_0 python pkgs/main/linux-64::python-3.7.7-hcff3b4d_5 pyyaml pkgs/main/linux-64::pyyaml-5.3.1-py37h7b6447c_0 readline pkgs/main/linux-64::readline-8.0-h7b6447c_0 scipy pkgs/main/linux-64::scipy-1.4.1-py37h0b6359f_0 setuptools pkgs/main/linux-64::setuptools-47.1.1-py37_0 six pkgs/main/noarch::six-1.15.0-py_0 sqlite pkgs/main/linux-64::sqlite-3.31.1-h62c20be_1 tensorboard pkgs/main/noarch::tensorboard-1.15.0-pyhb230dea_0 tensorflow pkgs/main/linux-64::tensorflow-1.15.0-gpu_py37h0f0df58_0 tensorflow-base pkgs/main/linux-64::tensorflow-base-1.15.0-gpu_py37h9dcbed7_0 tensorflow-estima~ pkgs/main/noarch::tensorflow-estimator-1.15.1-pyh2649769_0 tensorflow-gpu pkgs/main/linux-64::tensorflow-gpu-1.15.0-h0d30ee6_0 termcolor pkgs/main/linux-64::termcolor-1.1.0-py37_1 tk pkgs/main/linux-64::tk-8.6.8-hbc83047_0 webencodings pkgs/main/linux-64::webencodings-0.5.1-py37_1 werkzeug pkgs/main/noarch::werkzeug-0.16.1-py_0 wheel pkgs/main/linux-64::wheel-0.34.2-py37_0 wrapt pkgs/main/linux-64::wrapt-1.12.1-py37h7b6447c_1 xz pkgs/main/linux-64::xz-5.2.5-h7b6447c_0 yaml pkgs/main/linux-64::yaml-0.1.7-had09818_2 zlib pkgs/main/linux-64::zlib-1.2.11-h7b6447c_3 Proceed ([y]/n)? 
y

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate tf1
#
# To deactivate an active environment, use
#
#     $ conda deactivate

$ emacs fmnist.py # to change 'from tensorflow import keras' to 'import keras'
$ conda activate tf1
(tf1) $ python fmnist.py
Using TensorFlow backend.
WARNING:tensorflow:From /home/ubuntu/.miniconda/envs/tf1/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
2020-06-14 08:50:03.393864: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-06-14 08:50:03.421339: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-14 08:50:03.422355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
2020-06-14 08:50:03.422643: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-06-14 08:50:03.423990: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-06-14 08:50:03.425281: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-06-14 08:50:03.425647: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-06-14 08:50:03.427206: I
tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0 2020-06-14 08:50:03.428370: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0 2020-06-14 08:50:03.432195: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7 2020-06-14 08:50:03.432350: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-06-14 08:50:03.433335: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-06-14 08:50:03.434256: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0 2020-06-14 08:50:03.434772: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA 2020-06-14 08:50:03.457561: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300035000 Hz 2020-06-14 08:50:03.458053: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55c057d2bb00 initialized for platform Host (this does not guarantee that XLA will be used). 
Devices: 2020-06-14 08:50:03.458080: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version 2020-06-14 08:50:03.458284: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-06-14 08:50:03.459213: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53 pciBusID: 0000:00:1e.0 2020-06-14 08:50:03.459254: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0 2020-06-14 08:50:03.459275: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0 2020-06-14 08:50:03.459290: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0 2020-06-14 08:50:03.459319: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0 2020-06-14 08:50:03.459348: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0 2020-06-14 08:50:03.459368: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0 2020-06-14 08:50:03.459402: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7 2020-06-14 08:50:03.459479: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-06-14 08:50:03.460429: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA 
node, so returning NUMA node zero 2020-06-14 08:50:03.461323: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0 2020-06-14 08:50:03.461370: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0 2020-06-14 08:50:04.067131: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix: 2020-06-14 08:50:04.067173: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0 2020-06-14 08:50:04.067191: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N 2020-06-14 08:50:04.067430: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-06-14 08:50:04.068411: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-06-14 08:50:04.069352: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-06-14 08:50:04.070305: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14919 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0) 2020-06-14 08:50:04.072418: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55c05a596390 initialized for platform CUDA (this does not guarantee that XLA will be used). 
Devices:
2020-06-14 08:50:04.072442: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
WARNING:tensorflow:From /home/ubuntu/.miniconda/envs/tf1/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:422: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

Epoch 1/10
2020-06-14 08:50:05.153780: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
60000/60000 [==============================] - 4s 71us/step - loss: 3.7436 - accuracy: 0.6998
Epoch 2/10
60000/60000 [==============================] - 4s 66us/step - loss: 0.6869 - accuracy: 0.7567
Epoch 3/10
60000/60000 [==============================] - 4s 65us/step - loss: 0.5983 - accuracy: 0.7910
Epoch 4/10
60000/60000 [==============================] - 4s 65us/step - loss: 0.5425 - accuracy: 0.8092
```

Nice, it works, but I'm still not satisfied: it's still using CUDA 10.0. The reason is that the TF 1.15 package is built against CUDA 10.0 (note `cudatoolkit-10.0.130` in the package plan and `libcudart.so.10.0` in the log). So, let's redo everything with TF 1.12 (this time it wants Python 3.6):

``` bash
$ conda create -n tf112 python=3.6 tensorflow-gpu=1.12 keras
Collecting package metadata (current_repodata.json): done
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done Solving environment: done ## Package Plan ## environment location: /home/ubuntu/.miniconda/envs/tf112 added / updated specs: - keras - python=3.6 - tensorflow-gpu=1.12 The following packages will be downloaded: package | build ---------------------------|----------------- absl-py-0.9.0 | py36_0 167 KB astor-0.8.0 | py36_0 46 KB certifi-2020.4.5.1 | py36_0 155 KB cudatoolkit-9.2 | 0 233.9 MB cudnn-7.6.5 | cuda9.2_0 142.7 MB cupti-9.2.148 | 0 1.5 MB grpcio-1.27.2 | py36hf8bcb03_0 1.3 MB h5py-2.10.0 | py36h7918eee_0 1.0 MB keras-base-2.3.1 | py36_0 495 KB markdown-3.1.1 | py36_0 116 KB mkl-service-2.3.0 | py36he904b0f_0 219 KB mkl_fft-1.0.15 | py36ha843d7b_0 155 KB mkl_random-1.1.1 | py36h0573a6f_0 327 KB numpy-1.18.1 | py36h4f9e942_0 5 KB numpy-base-1.18.1 | py36hde5b4d6_1 4.2 MB pip-20.0.2 | py36_3 1.7 MB protobuf-3.12.3 | py36he6710b0_0 644 KB python-3.6.10 | h7579374_2 29.7 MB pyyaml-5.3.1 | py36h7b6447c_0 180 KB scipy-1.4.1 | py36h0b6359f_0 14.6 MB setuptools-47.1.1 | py36_0 514 KB tensorboard-1.12.2 | py36he6710b0_0 3.0 MB tensorflow-1.12.0 |gpu_py36he74679b_0 4 KB tensorflow-base-1.12.0 |gpu_py36had579c0_0 102.7 MB tensorflow-gpu-1.12.0 | h0d30ee6_0 3 KB termcolor-1.1.0 | py36_1 8 KB wheel-0.34.2 | py36_0 51 KB ------------------------------------------------------------ Total: 539.3 MB The following NEW packages will be INSTALLED: _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main _tflow_select pkgs/main/linux-64::_tflow_select-2.1.0-gpu absl-py pkgs/main/linux-64::absl-py-0.9.0-py36_0 astor pkgs/main/linux-64::astor-0.8.0-py36_0 blas pkgs/main/linux-64::blas-1.0-mkl c-ares pkgs/main/linux-64::c-ares-1.15.0-h7b6447c_1001 ca-certificates pkgs/main/linux-64::ca-certificates-2020.1.1-0 certifi pkgs/main/linux-64::certifi-2020.4.5.1-py36_0 cudatoolkit pkgs/main/linux-64::cudatoolkit-9.2-0 cudnn pkgs/main/linux-64::cudnn-7.6.5-cuda9.2_0 cupti pkgs/main/linux-64::cupti-9.2.148-0 gast 
pkgs/main/noarch::gast-0.3.3-py_0 grpcio pkgs/main/linux-64::grpcio-1.27.2-py36hf8bcb03_0 h5py pkgs/main/linux-64::h5py-2.10.0-py36h7918eee_0 hdf5 pkgs/main/linux-64::hdf5-1.10.4-hb1b8bf9_0 intel-openmp pkgs/main/linux-64::intel-openmp-2020.1-217 keras pkgs/main/linux-64::keras-2.3.1-0 keras-applications pkgs/main/noarch::keras-applications-1.0.8-py_0 keras-base pkgs/main/linux-64::keras-base-2.3.1-py36_0 keras-preprocessi~ pkgs/main/noarch::keras-preprocessing-1.1.0-py_1 ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.33.1-h53a641e_7 libedit pkgs/main/linux-64::libedit-3.1.20181209-hc058e9b_0 libffi pkgs/main/linux-64::libffi-3.3-he6710b0_1 libgcc-ng pkgs/main/linux-64::libgcc-ng-9.1.0-hdf63c60_0 libgfortran-ng pkgs/main/linux-64::libgfortran-ng-7.3.0-hdf63c60_0 libprotobuf pkgs/main/linux-64::libprotobuf-3.12.3-hd408876_0 libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-9.1.0-hdf63c60_0 markdown pkgs/main/linux-64::markdown-3.1.1-py36_0 mkl pkgs/main/linux-64::mkl-2020.1-217 mkl-service pkgs/main/linux-64::mkl-service-2.3.0-py36he904b0f_0 mkl_fft pkgs/main/linux-64::mkl_fft-1.0.15-py36ha843d7b_0 mkl_random pkgs/main/linux-64::mkl_random-1.1.1-py36h0573a6f_0 ncurses pkgs/main/linux-64::ncurses-6.2-he6710b0_1 numpy pkgs/main/linux-64::numpy-1.18.1-py36h4f9e942_0 numpy-base pkgs/main/linux-64::numpy-base-1.18.1-py36hde5b4d6_1 openssl pkgs/main/linux-64::openssl-1.1.1g-h7b6447c_0 pip pkgs/main/linux-64::pip-20.0.2-py36_3 protobuf pkgs/main/linux-64::protobuf-3.12.3-py36he6710b0_0 python pkgs/main/linux-64::python-3.6.10-h7579374_2 pyyaml pkgs/main/linux-64::pyyaml-5.3.1-py36h7b6447c_0 readline pkgs/main/linux-64::readline-8.0-h7b6447c_0 scipy pkgs/main/linux-64::scipy-1.4.1-py36h0b6359f_0 setuptools pkgs/main/linux-64::setuptools-47.1.1-py36_0 six pkgs/main/noarch::six-1.15.0-py_0 sqlite pkgs/main/linux-64::sqlite-3.31.1-h62c20be_1 tensorboard pkgs/main/linux-64::tensorboard-1.12.2-py36he6710b0_0 tensorflow 
pkgs/main/linux-64::tensorflow-1.12.0-gpu_py36he74679b_0
  tensorflow-base    pkgs/main/linux-64::tensorflow-base-1.12.0-gpu_py36had579c0_0
  tensorflow-gpu     pkgs/main/linux-64::tensorflow-gpu-1.12.0-h0d30ee6_0
  termcolor          pkgs/main/linux-64::termcolor-1.1.0-py36_1
  tk                 pkgs/main/linux-64::tk-8.6.8-hbc83047_0
  werkzeug           pkgs/main/noarch::werkzeug-1.0.1-py_0
  wheel              pkgs/main/linux-64::wheel-0.34.2-py36_0
  xz                 pkgs/main/linux-64::xz-5.2.5-h7b6447c_0
  yaml               pkgs/main/linux-64::yaml-0.1.7-had09818_2
  zlib               pkgs/main/linux-64::zlib-1.2.11-h7b6447c_3

Proceed ([y]/n)? y
[..]
#
# To activate this environment, use
#
#     $ conda activate tf112
#
# To deactivate an active environment, use
#
#     $ conda deactivate
```

Note that it now uses CUDA 9.2. I also had to change the loss in `fmnist.py` to `'sparse_categorical_crossentropy'` because of the different Keras version. The run below was recorded in an environment named `older` rather than `tf112`, as the paths in the warnings show.

``` bash
$ python fmnist.py
Using TensorFlow backend.
/home/ubuntu/.miniconda/envs/older/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/ubuntu/.miniconda/envs/older/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/ubuntu/.miniconda/envs/older/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)]) /home/ubuntu/.miniconda/envs/older/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_quint16 = np.dtype([("quint16", np.uint16, 1)]) /home/ubuntu/.miniconda/envs/older/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint32 = np.dtype([("qint32", np.int32, 1)]) /home/ubuntu/.miniconda/envs/older/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. np_resource = np.dtype([("resource", np.ubyte, 1)]) Epoch 1/10 2020-06-14 09:04:12.308259: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA 2020-06-14 09:04:12.920291: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-06-14 09:04:12.921233: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53 pciBusID: 0000:00:1e.0 totalMemory: 15.78GiB freeMemory: 15.34GiB 2020-06-14 09:04:12.921263: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0 2020-06-14 09:04:13.303980: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix: 2020-06-14 09:04:13.304037: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2020-06-14 09:04:13.304047: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2020-06-14 09:04:13.304182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14839 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
60000/60000 [==============================] - 6s 100us/step - loss: 2.4673 - acc: 0.1294
Epoch 2/10
[..]
```

### Freezing and restoring a conda environment

``` bash
(older) $ conda env export > older.yml
(older) $ conda deactivate # or CTRL-d
$ conda env create -n acopy -f older.yml
Collecting package metadata (repodata.json): done
Solving environment: done
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate acopy
#
# To deactivate an active environment, use
#
#     $ conda deactivate

$ conda activate acopy
(acopy) $ conda env export
name: acopy
channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _tflow_select=2.1.0=gpu
  - absl-py=0.9.0=py36_0
  - astor=0.8.0=py36_0
  - blas=1.0=mkl
  - c-ares=1.15.0=h7b6447c_1001
  - ca-certificates=2020.1.1=0
  - certifi=2020.4.5.1=py36_0
  - cudatoolkit=9.2=0
  - cudnn=7.6.5=cuda9.2_0
  - cupti=9.2.148=0
  - gast=0.3.3=py_0
  - grpcio=1.27.2=py36hf8bcb03_0
  - h5py=2.10.0=py36h7918eee_0
  - hdf5=1.10.4=hb1b8bf9_0
  - intel-openmp=2020.1=217
  - keras=2.2.4=0
  - keras-applications=1.0.8=py_0
  - keras-base=2.2.4=py36_0
  - keras-preprocessing=1.1.0=py_1
  - ld_impl_linux-64=2.33.1=h53a641e_7
  - libedit=3.1.20181209=hc058e9b_0
  - libffi=3.3=he6710b0_1
  - libgcc-ng=9.1.0=hdf63c60_0
  - libgfortran-ng=7.3.0=hdf63c60_0
  - libprotobuf=3.12.3=hd408876_0
  - libstdcxx-ng=9.1.0=hdf63c60_0
  - markdown=3.1.1=py36_0
  - mkl=2020.1=217
  - mkl-service=2.3.0=py36he904b0f_0
  - mkl_fft=1.0.15=py36ha843d7b_0
  - mkl_random=1.1.1=py36h0573a6f_0
  - ncurses=6.2=he6710b0_1
  -
numpy=1.18.1=py36h4f9e942_0
  - numpy-base=1.18.1=py36hde5b4d6_1
  - openssl=1.1.1g=h7b6447c_0
  - pip=20.0.2=py36_3
  - protobuf=3.12.3=py36he6710b0_0
  - python=3.6.10=h7579374_2
  - pyyaml=5.3.1=py36h7b6447c_0
  - readline=8.0=h7b6447c_0
  - scipy=1.4.1=py36h0b6359f_0
  - setuptools=47.1.1=py36_0
  - six=1.15.0=py_0
  - sqlite=3.31.1=h62c20be_1
  - tensorboard=1.12.2=py36he6710b0_0
  - tensorflow=1.12.0=gpu_py36he74679b_0
  - tensorflow-base=1.12.0=gpu_py36had579c0_0
  - tensorflow-gpu=1.12.0=h0d30ee6_0
  - termcolor=1.1.0=py36_1
  - tk=8.6.8=hbc83047_0
  - werkzeug=1.0.1=py_0
  - wheel=0.34.2=py36_0
  - xz=5.2.5=h7b6447c_0
  - yaml=0.1.7=had09818_2
  - zlib=1.2.11=h7b6447c_3
prefix: /home/ubuntu/.miniconda/envs/acopy
```

### Terminating the instance

``` bash
$ instances
i-088e8c600ed61650f ami-0987fcabe779f2491 p3.2xlarge stopped sourya 2020-03-28T05:29:08.000Z us-west-2b vpc-411bdc39
i-0ee96511fbb4328f9 ami-0987fcabe779f2491 p3.2xlarge stopped SNN 2020-05-10T16:13:26.000Z us-west-2a vpc-411bdc39
i-01a3fac4ddea59c76 ami-0cc039c2244660e0c p3.2xlarge stopped tianyi 2020-06-10T06:12:17.000Z us-west-2b vpc-411bdc39
i-0cbce25e73d063cc6 ami-099eed573ea1c101f p3.2xlarge running Marco 2020-06-14T06:49:49.000Z us-west-2b vpc-411bdc39
$ instances | grep Marco | instance-terminate
You are about to terminate the following instances:
i-0cbce25e73d063cc6 ami-099eed573ea1c101f p3.2xlarge running Marco 2020-06-14T06:49:49.000Z us-west-2b vpc-411bdc39
Are you sure you want to continue? y
i-0cbce25e73d063cc6 PreviousState=running CurrentState=shutting-down
```
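
Note that `instance-terminate` discards the instance, whereas a stopped instance (like the `stopped` entries in the listing) keeps its EBS volume and can be started again later. If you'd rather choose between the two with the plain AWS CLI, here is a minimal sketch. It assumes that the Name tag is the fifth whitespace-separated column of the `instances` listing (as in the output above) and that it contains no spaces. `ids_by_name` and `ec2_command` are hypothetical helpers, not part of Bash-my-AWS; they only print `aws ec2 stop-instances` / `aws ec2 terminate-instances` commands, so you can review them before running anything.

``` bash
# Hypothetical helper: read an `instances` listing on stdin and print the
# instance IDs (column 1) whose Name (assumed to be column 5) equals $1.
ids_by_name() {
  awk -v name="$1" '$5 == name { print $1 }'
}

# Hypothetical helper: print (not run) the AWS CLI command that applies
# an action ("stop" or "terminate") to one instance ID.
ec2_command() {
  case "$1" in
    stop|terminate)
      echo "aws ec2 $1-instances --instance-ids $2"
      ;;
    *)
      echo "unknown action: $1" >&2
      return 1
      ;;
  esac
}
```

For example, `instances | ids_by_name Marco | while read -r id; do ec2_command stop "$id"; done` prints the stop command for the instance used here; pipe the result to `bash` once it looks right.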