# Trying out llama.cpp on the SCC
So apparently you don't need a GPU to run a reasonable LLM. Facebook recently released Llama, a set of freely available LLM weights, and a few months ago I came across https://github.com/ggerganov/llama.cpp, which promises to run these models on a laptop without a GPU. So let's try it out!
## TL;DR
The Llama model weights are now stored in `/projectnb/aclab/llama`. Please don't delete/modify them, but feel free to use them in any way you like!
1. `git clone https://github.com/ggerganov/llama.cpp`
2. `module load gcc openblas`
3. Modify the `Makefile` as follows (a line prefixed with `-` is deleted; a line prefixed with `+` is added):
```
diff --git a/Makefile b/Makefile
index 08e2503..2a6bd1a 100644
--- a/Makefile
+++ b/Makefile
@@ -119,7 +119,7 @@ ifdef LLAMA_OPENBLAS
ifneq ($(shell grep -e "Arch Linux" -e "ID_LIKE=arch" /etc/os-release 2>/dev/null),)
LDFLAGS += -lopenblas -lcblas
else
- LDFLAGS += -lopenblas
+ LDFLAGS += -lopenblas -lgfortran -L/share/pkg.7/openblas/0.3.17/install/lib
endif
endif
ifdef LLAMA_BLIS
```
4. Run `make LLAMA_OPENBLAS=1` (add `-B` if you are remaking, or do a `make clean` first).
5. Spin up an interactive session: `qrsh -P aclab -pe omp 16 -l mem_per_core=8G`
6. `./main -s 1685050909 -c 2048 -t 16 -m /projectnb/aclab/llama/7B/ggml-model-q4_0.bin --prompt "Once upon a time there was"`
7. Watch the magic (hopefully)!
## Step 1: building
As a first step, I'll get a CPU interactive job on the SCC (yes, this thing says it can run on a laptop, but let's walk before we run): `qrsh -P aclab -pe omp 8 -l mem_per_core=8G`. Next, we clone the repo: `git clone https://github.com/ggerganov/llama.cpp`. Easy enough.
Now we need to compile. So we do `module load gcc/12.2.0` (edit from the future: originally I just did `module load gcc`, but later, while debugging a problem, I switched to loading the most recent `gcc`. I'm pretty sure it made no difference - you can use the default `gcc` version, but the version needs to be consistent, so pick one and stick with it) and then: `make`.
This ran with no problems!
But then I realized that we should probably compile with BLAS to make things a bit faster. So, `module load openblas` and then `make LLAMA_OPENBLAS=1 -B` (the `-B` forces make to rebuild everything). This had a problem:
```
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: x86_64
I UNAME_M: x86_64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_OPENBLAS -I/usr/local/include/openblas -I/usr/include/openblas
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS: -lopenblas
I CC: gcc (GCC) 8.3.0
I CXX: g++ (GCC) 8.3.0
/share/pkg.7/gcc/8.3.0/install/bin/gcc -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_OPENBLAS -I/usr/local/include/openblas -I/usr/include/openblas -c ggml.c -o ggml.o
/share/pkg.7/gcc/8.3.0/install/bin/g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -c llama.cpp -o llama.o
/share/pkg.7/gcc/8.3.0/install/bin/g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -c examples/common.cpp -o common.o
/share/pkg.7/gcc/8.3.0/install/bin/g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/main/main.cpp ggml.o llama.o common.o -o main -lopenblas
/bin/ld: cannot find -lopenblas
collect2: error: ld returned 1 exit status
make: *** [main] Error 1
```
Uh oh, it seems like the linker cannot find the BLAS library. From the messages above, it looks like it's searching for openblas in `/usr/local/include/openblas`, and indeed that directory does not exist. However, my `$PATH` variable contains `/share/pkg.7/openblas/0.3.17/install/bin`, and doing an `ls` on `/share/pkg.7/openblas/0.3.17/install`, I do find an `include` directory there. So maybe we should be looking in `/share/pkg.7/openblas/0.3.17/install/include`. Let's edit the Makefile accordingly: there is a line
```
ifdef LLAMA_OPENBLAS
CFLAGS += -DGGML_USE_OPENBLAS -I/usr/local/include/openblas -I/usr/include/openblas
ifneq ($(shell grep -e "Arch Linux" -e "ID_LIKE=arch" /etc/os-release 2>/dev/null),)
LDFLAGS += -lopenblas -lcblas
else
LDFLAGS += -lopenblas
endif
endif
```
So, let's change that to:
```
ifdef LLAMA_OPENBLAS
CFLAGS += -DGGML_USE_OPENBLAS -I/share/pkg.7/openblas/0.3.17/install/include -I/usr/include/openblas
```
I'm not sure about the `/usr/include`, but let's give it a go... Unfortunately, it didn't change anything.
After a bit of investigation, I realized I was being dumb: if the problem were really an inability to find the header files in `include`, then there would likely be a compiler error rather than a linker error. So, there is also a `/share/pkg.7/openblas/0.3.17/install/lib`. Let's tell the linker about that:
```
ifdef LLAMA_OPENBLAS
CFLAGS += -DGGML_USE_OPENBLAS -I/share/pkg.7/openblas/0.3.17/install/include -I/usr/include/openblas
ifneq ($(shell grep -e "Arch Linux" -e "ID_LIKE=arch" /etc/os-release 2>/dev/null),)
LDFLAGS += -lopenblas -lcblas
else
LDFLAGS += -lopenblas -L/share/pkg.7/openblas/0.3.17/install/lib
endif
endif
```
Ok, now I get:
```
/share/pkg.7/gcc/12.2.0/install/bin/g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/main/main.cpp ggml.o llama.o common.o -o main -lopenblas -L/share/pkg.7/openblas/0.3.17/install/lib
/share/pkg.7/openblas/0.3.17/install/lib/libopenblas.so: undefined reference to `_gfortran_concat_string'
/share/pkg.7/openblas/0.3.17/install/lib/libopenblas.so: undefined reference to `_gfortran_etime'
collect2: error: ld returned 1 exit status
```
So, this seems like progress. Some googling suggests I need to tell the linker about `-lgfortran`, so let's add in a line for that:
```
ifdef LLAMA_OPENBLAS
CFLAGS += -DGGML_USE_OPENBLAS -I/share/pkg.7/openblas/0.3.17/install/include -I/usr/include/openblas
ifneq ($(shell grep -e "Arch Linux" -e "ID_LIKE=arch" /etc/os-release 2>/dev/null),)
LDFLAGS += -lopenblas -lcblas
else
LDFLAGS += -lopenblas -lgfortran -L/share/pkg.7/openblas/0.3.17/install/lib
endif
endif
```
...and that worked!
## Step 2: downloading a model
It looks like the git repo does not come with the models (makes sense, they are pretty large). So, let's see how we can go about downloading one.
Looks like step 1 is to download facebook's llama model, as described here: https://huggingface.co/docs/transformers/main/model_doc/llama.
So, I fill out a form. Now, I guess we wait. Hopefully they mostly auto-approve .edu email addresses without too much lag time. It is now 2:20pm on May 24th, 2023. After 5 minutes I still haven't gotten any email, so maybe it will be a while. I'll go for a run and check back. Otherwise, we may have to come back tomorrow.
Ok, it's tomorrow and still nothing. I don't really feel like waiting around, and a bit of googling suggested that there is a non-trivial chance that the email, if it comes, will be very late. So, I'm going to follow the instructions here: https://github.com/shawwn/llama-dl to download weights that someone else got from facebook and then released. The legality of this is a little dubious to me, but I'm not doing anything that I think facebook would find objectionable, like trying to make money off of it (plenty of people online assert that only the person who initially released the weights has any legal responsibility, but I strongly suspect none of them are lawyers).
Now we just `git clone https://github.com/shawwn/llama-dl.git`, and then modify the top of `llama.sh` to specify arguments for the SCC's scheduler:
```
#!/bin/bash -l
#$ -P aclab
#$ -pe omp 8
#$ -l mem_per_core=8G
```
and change the line with `TARGET_FOLDER` to `TARGET_FOLDER="/projectnb/aclab/llama"`. Probably we don't need 8 cores, but whatever. Then `qsub llama.sh`. Started at 10:59am, finished downloading and verifying checksums for all four Llama model sizes (7B, 13B, 30B, 65B) sometime between 12:30 and 12:37pm, so a bit under 2 hours, just as advertised!
## Step 3: preparing the model
The next step is to run some conversion scripts, so I set up a Python environment:
```
module load python3 gcc/12.2.0 openblas
[ ! -d "env" ] && python -m venv env
source env/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```
and then run `python convert.py /projectnb/aclab/llama/7B/`. This took about a minute, and eventually ended with the output `Wrote /projectnb/aclab/llama/7B/ggml-model-f16.bin`, so I guess it converts the pytorch checkpoint into whatever format the `ggml` C++ code uses.
Next, we need to quantize the model with 4-bit quantization. This is supposedly the magic that makes it run fast on the CPU: `./quantize /projectnb/aclab/llama/7B/ggml-model-f16.bin /projectnb/aclab/llama/7B/ggml-model-q4_0.bin q4_0`
As an aside, looking at the code, it seems like most of the actual arithmetic possibly happens at ordinary 16- or 32-bit floating point precision: the weights appear to be unpacked from 4 bits back to full values, multiplied, and then packed back down. So I guess the real gain is maybe from being able to fit more of the model in the cache at a time? Or maybe I missed the part in the code that directly multiplies 4-bit weights.
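To make that concrete, here is a little numpy sketch of my understanding of the `q4_0` layout (an assumption from skimming `ggml.c`, not a faithful reimplementation - in particular, the nibble ordering within a block may differ): each block of 32 weights stores a single fp16 scale plus 16 bytes of packed 4-bit values.
```
import numpy as np

QK = 32  # weights per q4_0 block (as I read it from ggml.c; an assumption)

def dequantize_q4_0_block(d, qs):
    """Recover 32 weights from one q4_0 block: an fp16 scale `d` plus
    16 bytes `qs` of packed 4-bit values; weight = (q - 8) * d."""
    lo = qs & 0x0F                                     # low nibbles
    hi = qs >> 4                                       # high nibbles
    q = np.concatenate([lo, hi]).astype(np.int8) - 8   # centered in [-8, 7]
    return q.astype(np.float32) * np.float32(d)

# e.g., one block of random packed bytes with scale 0.05:
block = np.random.randint(0, 256, QK // 2, dtype=np.uint8)
print(dequantize_q4_0_block(np.float16(0.05), block))
```
If that's roughly right, the multiplies do happen at full precision after unpacking, and the win is mostly memory: about 4.5 bits per weight (2 bytes of scale + 16 bytes of nibbles per 32 weights) instead of 16 bits means much more of the model fits in cache, and far less memory bandwidth is needed per token.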
The quantization took about a minute to run. Afterwards I see:
```
(env) [cutkosky@scc-wb4 llama.cpp]$ ls -l /projectnb/aclab/llama/7B
total 30025881
-rw-r--r-- 1 cutkosky aclab 100 Mar 5 04:36 checklist.chk
-rw-r--r-- 1 cutkosky aclab 13476939516 Mar 5 04:42 consolidated.00.pth
-rw-r--r-- 1 cutkosky aclab 13477814912 May 25 12:32 ggml-model-f16.bin
-rw-r--r-- 1 cutkosky aclab 3791725184 May 25 12:38 ggml-model-q4_0.bin
-rw-r--r-- 1 cutkosky aclab 101 Mar 5 04:36 params.json
```
So, looks like the quantized model is present in `ggml-model-q4_0.bin`.
Alright, moment of truth! Let's run: `./main -m /projectnb/aclab/llama/7B/ggml-model-q4_0.bin --prompt "Once upon a time there was"`:
```
(env) [cutkosky@scc-wb4 llama.cpp]$ ./main -m /projectnb/aclab/llama/7B/ggml-model-q4_0.bin --prompt "Once upon a time there was"
main: build = 588 (ac7876a)
main: seed = 1685032894
llama.cpp: loading model from /projectnb/aclab/llama/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: mem required = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size = 256.00 MB
system_info: n_threads = 14 / 28 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
Once upon a time there was
```
Well, it seems to be *really* slow. After about 5 minutes we've moved to:
```
Once upon a time there was a family that lived in an old house with many rooms.
One night the parents decided to move their kids from their old home in Los Angeles
```
So, I guess it's working, but I was expecting something at least plausibly interactive. Maybe I need to give the system more memory? Currently this is running on an interactive job invoked via `qrsh -P aclab -pe omp 8 -l mem_per_core=8G`, so it should have 64GB of RAM and 8 cores, which I figured would be enough. Maybe these cores are really slow? Maybe I just need more CPUs?
Had a few meetings, and came back again. I increased the CPU count: `qrsh -P aclab -pe omp 16 -l mem_per_core=8G` and tried again (this also doubles the memory, since I specify memory per core rather than total memory). Success! It started generating text very quickly, and within a minute I had the following:
```
(env) [cutkosky@scc-pg3 llama.cpp]$ ./main -m /projectnb/aclab/llama/7B/ggml-model-q4_0.bin --prompt "Once upon a time there was"
main: build = 588 (ac7876a)
main: seed = 1685049400
llama.cpp: loading model from /projectnb/aclab/llama/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: mem required = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size = 256.00 MB
system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
Once upon a time there was a lovely little cottage in the middle of nowhere, surrounded by green fields. It belonged to a very old lady called Granny Smith, who lived there all alone with her four cats: Tiddles, Purrs, Squeeze and Mews.
The day before she had been out foraging with her dogs in the nearby woods. She came home and sat down at her kitchen table to a delicious meal of carrot soup, blackcurrant sponge cake and freshly-made apple juice. While she was enjoying her lunch, a knock sounded on the door. Granny Smith got up from her seat and went out onto the veranda, where there were six baskets full of delicious apples that she had collected.
She opened the front door to discover a woman standing in the doorway. She was carrying an empty basket, which she placed on the floor. The old lady greeted her with a warm smile and invited her inside for coffee. This made the woman very happy indeed, as Granny Smith’s apple juice is known to be one of the best around!
"Thank you, thank you!" said the woman, accepting the tray. "I know how much work goes into making this drink."
It would have been rude for them not to say it was delicious. So Granny Smith and her visitor chatted happily while they drank their coffee and ate their biscuits with caramel apple icing and melted dark-chocolate pearls. They talked about all sorts of things - Granny Smith’s dogs, the weather, the fact that they both liked cats, even though they were very different from each other - until finally it was time for them to say goodbye.
The woman looked at her watch and said that she had a long walk ahead of her in order to reach the road leading back home again. She promised Granny Smith that she would be back soon to collect more apples, and left.
Granny Smith sat down in her chair once more, sipped some coffee, nibbled on an apple, and looked out through the window at the sunset. This was her favourite time of day, because she could watch the shadows gradually changing colour as the light faded. And it wasn’t long before she heard another knock at the door, only
```
Then it stopped and waited for a while (a minute or so, I think) before continuing:
```
this time there was no tray.
"You’re very welcome!" Granny Smith smiled and watched as the young woman began to unpack her basket and set out various things on the kitchen table: a pile of brown paper bags; a plastic container full of different kinds of apples; various knives, scissors, measuring cups, and other utensils; two pairs of white rubber gloves, which Granny Smith found particularly amusing.
"Don’t you have any more?" she asked. "The ones you gave me yesterday seem to be doing their work."
"Oh yes," said the woman, and began to carefully wrap up the empty brown paper bags. Once they were all in order, she put them into Granny Smith’s basket, along with a big bag of sugar - that, according to her, was very important. "Now we’ll have apple cider, shall we?"
They both agreed, and before long they had two glasses full of delicious red liquid sitting on the table in front of them. The woman took one sip from hers and said, "Mmmm!"
Granny Smith took a sip too, and thought
```
Then it seems to have stopped again. My guess is that it is swapping memory out to disk or something during these pauses. So, I am going to set the `--mlock` flag to force everything to stay in RAM:
```
(env) [cutkosky@scc-pg3 llama.cpp]$ ./main --mlock -m /projectnb/aclab/llama/7B/ggml-model-q4_0.bin --prompt "Once upon a time there was"
main: build = 588 (ac7876a)
main: seed = 1685049744
llama.cpp: loading model from /projectnb/aclab/llama/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: mem required = 5407.71 MB (+ 1026.00 MB per state)
...................................................................................................
llama_init_from_file: kv self size = 256.00 MB
system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
Once upon a time there was a little girl who lived in a cottage by the sea.
Her name was Ava, and she had the most beautiful voice. She sang all day long, even while she slept. The birds loved to hear her sing, but they got tired of hearing the same tune over and over again. They decided it would be funny to change the music. They began by adding a few extra notes here and there until the song became so crazy that Ava had no idea what was going on. The birds stopped singing long enough for her to hear them talking. “What are you doing?” she asked.
The birds looked at each other, as if trying to find an answer to this question. “We’re singing!” they finally said. [end of text]
llama_print_timings: load time = 1249.47 ms
llama_print_timings: sample time = 141.36 ms / 157 runs ( 0.90 ms per token)
llama_print_timings: prompt eval time = 757.81 ms / 7 tokens ( 108.26 ms per token)
llama_print_timings: eval time = 22358.98 ms / 156 runs ( 143.33 ms per token)
llama_print_timings: total time = 23801.44 ms
```
Hmmm, ok, that one ran without any pause, but it also wasn't very long. Let's try again.
```
(env) [cutkosky@scc-pg3 llama.cpp]$ ./main --mlock -m /projectnb/aclab/llama/7B/ggml-model-q4_0.bin --prompt "Once upon a time there was"
main: build = 588 (ac7876a)
main: seed = 1685049806
llama.cpp: loading model from /projectnb/aclab/llama/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: mem required = 5407.71 MB (+ 1026.00 MB per state)
...................................................................................................
llama_init_from_file: kv self size = 256.00 MB
system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
Once upon a time there was an extremely busy professional who had no time to go and get her hair done but desperately needed a new look. When she finally found the time, it took three hours to do her hair because of all the products used in order to get the perfect shine and color. Afterwards, she felt quite dismayed that all this effort was gone into and just left her looking like everyone else.
But then she discovered the latest trend – ombre highlights! These are a range of highlighting techniques that can be applied to your whole head or in sections. They are ideal for those who want to achieve an eye-catching look but don’t have the time to go through all the preliminary steps.
What can you expect from a professional ombre job?
Professional hairdressers will use products that make the color and the shine long lasting. These specialist techniques will give your hair an intense amount of volume, texture and definition. Once you have had them applied, you should be able to leave the salon knowing that you have a great looking head of hair.
These days, ombre highlights are very popular as it is a quick way to get a new look without needing to go through all the preliminary steps such as dying your hair. There are a huge number of different shades and styles available to choose from – you can even have one color in the top half and another at the bottom!
When choosing the right colors for your ombre highlights, it is important that you think carefully about which colors you want. The more color choices you pick, the longer it will take for your hair to dry between sections. You should also consider whether or not you are going to get a cut with your dye job – this takes time and can mean that your ombre highlights last longer than you expect!
As soon as you have decided on your shades, it is important that you book an appointment to get your colors done professionally. You should also allow yourself to be in the salon for at least three hours so that all of the highlighting can be carried out properly. If you need to go home during this time, make sure that you have arranged a lift!
If you are having some sections highlighted (such as your fringe), then it is important that you get these done first – ombre techniques use very volatile dyes and they will stain anything they come
```
Ok, we've paused again... It's now 5:25. At 5:26 we started generating text again:
```
into contact with. This means that you should avoid drinking coffee, eating chocolate or getting any hair product through the dyed areas until all of your colors have been applied.
If you want to get your ombre highlights done professionally, then it is important that you book an appointment ahead of time so that you can make sure you can get it on a convenient day and time. If you do not have any appointments already made, then you might find yourself waiting for a long time before you can finally sit down in the salon chair!
It is also worthwhile giving your colorist as much information about what look you are trying to achieve with your ombre highlights so that they can advise you on which shades will be best for your locks. It is also important that you give them accurate measurements and details of your current hair color – this way, they can plan out how long it will take them to do the job properly!
It is important that you avoid washing or swimming in pools while you have ombre highlights as these can cause the dye on your locks to fade. If you are planning a big night out with your pals and want to make sure that your tresses look
```
Now we pause again (it's still 5:26). At 5:28 (almost 5:29) we got some final text:
```
fabulous for it, then booking a blow dry appointment at your local salon is a great way of ensuring that they will be in tip-top condition! [end of text]
llama_print_timings: load time = 1258.48 ms
llama_print_timings: sample time = 716.08 ms / 797 runs ( 0.90 ms per token)
llama_print_timings: prompt eval time = 186618.80 ms / 521 tokens ( 358.19 ms per token)
llama_print_timings: eval time = 118292.87 ms / 794 runs ( 148.98 ms per token)
llama_print_timings: total time = 306395.70 ms
```
OK, the timings say the average generation rate was roughly 2 tokens/sec overall. Apparently a token is roughly 3/4 of a word (at least with the GPT tokenizer: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them; presumably the Llama one is similar), so this should translate to roughly 3 words every two seconds. It might actually be a better UI to cache the words and only print them at this slower speed, giving the model some buffer time to hide the long pauses.
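Here is a quick sketch of that UI idea (hypothetical code, not anything in llama.cpp): run generation in a background thread and print at a steady ~2 tokens/sec, so fast bursts of generation build up a cushion that absorbs the pauses.
```
import queue
import sys
import threading
import time

def paced_print(generate, tokens_per_sec=2.0):
    """Print tokens at a steady rate while generation runs in the background.
    `generate(q)` is a hypothetical producer that puts decoded token strings
    on the queue, then None when finished."""
    q = queue.Queue()
    threading.Thread(target=generate, args=(q,), daemon=True).start()
    while (tok := q.get()) is not None:  # blocks only if the cushion is empty
        sys.stdout.write(tok)
        sys.stdout.flush()
        time.sleep(1.0 / tokens_per_sec)

# Toy producer: fast bursts separated by a long stall, like what we see here.
def fake_generate(q):
    for burst in range(2):
        for i in range(20):
            q.put(f"tok{burst}-{i} ")
        time.sleep(5)  # the kind of pause we want to hide
    q.put(None)

paced_print(fake_generate)
```
With the numbers above, 20 buffered tokens give about 10 seconds of smooth output, enough to paper over a 5-second stall.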
However, I still don't know what's causing the pause. It's not swapping from disk now. Let's try increasing the batch size with the `-b` flag. I don't totally understand what this means in this context: isn't the batch size 1, since there is only one prompt? Maybe it's going to predict ahead some tokens somehow? Anyway, apparently the default is 512, so let's try 1024.
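My best guess at what `-b`/`n_batch` actually controls (an assumption - `model_forward` below is a hypothetical stand-in, not a real llama.cpp function): the *prompt* is evaluated in chunks of up to `n_batch` tokens per forward pass, which makes for larger, more BLAS-friendly matrix multiplies; it has nothing to do with evaluating multiple prompts at once.
```
def model_forward(chunk):
    """Hypothetical stand-in for one transformer forward pass."""
    print(f"one forward pass over {len(chunk)} prompt tokens")

def eval_prompt(prompt_tokens, n_batch=512):
    """Feed the prompt through the model in chunks of up to n_batch tokens.
    Bigger chunks mean bigger matrix multiplies, which is where BLAS helps."""
    for i in range(0, len(prompt_tokens), n_batch):
        model_forward(prompt_tokens[i : i + n_batch])

eval_prompt(list(range(521)))  # a 521-token prompt -> passes of 512 and 9
```
(Also, note that the log below still reports `n_batch = 512` despite the `-b 1024`, so it looks like this build caps it at 512 anyway.)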
```
(env) [cutkosky@scc-pg3 llama.cpp]$ ./main -b 1024 --mlock -m /projectnb/aclab/llama/7B/ggml-model-q4_0.bin --prompt "Once upon a time there was"
main: build = 588 (ac7876a)
main: seed = 1685050513
llama.cpp: loading model from /projectnb/aclab/llama/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: mem required = 5407.71 MB (+ 1026.00 MB per state)
...................................................................................................
llama_init_from_file: kv self size = 256.00 MB
system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
Once upon a time there was a beautiful, young widow who fell in love with the best man.
It seems a little far-fetched; but if the truth be told, it’s exactly what happened to me (albeit sans the best man). Let me explain…
When we were both just twenty, my future husband and I met on holiday. He was a handsome young Frenchman, the son of two teachers who taught at our school in Paris. My parents, like many immigrants from the Middle East at that time, had moved to France for work and then started a family there.
I was his first girlfriend outside our school; he wasn’t mine. At that age, we were all very naive. We were both students when we met; by the time I knew him better (and became the object of his affections), I had changed schools and was no longer a student.
We spent three perfect weeks together in Paris that summer holiday—just the two of us, in love, with our whole lives ahead of us. By the end, we’d promised to write to each other, be friends forever. And that, of course, is exactly what happened… for ten years.
Until one day he wrote me a letter and told me his life had changed—he was getting married.
He had met this beautiful, young widow at university in Paris, fallen head over heels in love with her, proposed to her and married her without telling anyone (including me). He’d never dreamed I would find out. It was a huge shock when I read his letter: I felt like my heart was being ripped from my chest… literally!
I went crazy. I threw everything away that reminded me of him, destroyed our friendship forever and swore I’d never get over him, no matter how hard he tried to contact me again. (He couldn’t, because she wouldn’t let him.)
We hadn’t been in touch for ten years; now we were barely friends at all—except that my husband knew every detail of our love story.
And it seemed to make a difference. The first time he tried to call me again he got through to me, and the second time he did too… but only when I was out—my mother answered the phone each time. He told her how much he loved me; she never once told him that I had dumped
```
Still pausing here at 5:36. At 5:38 we started generating again:
```
him for another man (her partner).
I was so hurt by his betrayal of our friendship. My husband knew all about it, but my friend still lived in denial—he’d married the woman who had been his boss at work and had left Paris to settle down with her family in a tiny village outside Paris. His wife had never really wanted children; he was thirty-five, she was forty-eight (old enough to be my mother) when they were finally allowed to have their first baby—a beautiful, little girl who is now a teenager.
I can’t forgive him… so I don’t visit them, and I never mention his name to anyone. I still can’t get over what he did to me, even though my husband tries to convince me that he was young and in love—I can still see us together all those years ago at college.
My husband is right, we should have grown up by now. But he doesn’t know that my marriage isn’t perfect either… my husband had never loved me the way I deserved to be loved until he found out about our old friend.
I had always been in love with him and his wife
```
Paused again at 5:39 (by the way, this story seems deliciously dramatic, but it doesn't *quite* make sense to me...). At 5:40 we start again:
```
—we were all very young when we first met, but it was more than just a schoolboy crush, as though it were meant to be! Even now, when I see him at a party or on social media, my heart skips a beat. It’s a deep, unconditional love that has endured for so long… we know each other’s secrets and still have such close bonds that no one could ever break.
The problem is, our friend’s wife knows about it—she even warned me off! She knows everything I’m writing in this letter, she was the first to tell him what he had done to me, but I can never forgive her for it… He has a wonderful family now and lots of friends, who are all so happy together. But we were once very close—he would have married me one day if not for the fact that his wife was pregnant! [end of text]
llama_print_timings: load time = 1266.36 ms
llama_print_timings: sample time = 858.06 ms / 956 runs ( 0.90 ms per token)
llama_print_timings: prompt eval time = 187510.29 ms / 521 tokens ( 359.90 ms per token)
llama_print_timings: eval time = 143867.22 ms / 953 runs ( 150.96 ms per token)
llama_print_timings: total time = 333056.57 ms
```
Ok, I notice it's only using 8 threads. Let's try 16 with the `-t 16` option:
```
main: build = 588 (ac7876a)
main: seed = 1685050909
llama.cpp: loading model from /projectnb/aclab/llama/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: mem required = 5407.71 MB (+ 1026.00 MB per state)
...................................................................................................
llama_init_from_file: kv self size = 256.00 MB
system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
Once upon a time there was this cute little girl. She was in the eighth grade, studying hard to get into high school, when her father passed away. The family fell apart and she had to fend for herself.
She started working at the age of 14, and then went back to the same school as before, where everyone treated her as a ‘daughter of the poor’. “I am not interested in being charity case,” she said with tears in her eyes.
Her parents were illiterate. But their daughter did well at school till eighth grade and wanted to go ahead studying further. She was also working hard, but her family could not afford it.
She got married at the age of 16 years to an elderly man who already had a wife. Her husband’s family discriminated against her and she was abused by them. She bore this in silence as they were paying dowry from her own money, while her father was alive. After her marriage she could not go out with men or boys for any reason.
She was forced to get a divorce three years later – the in-laws blamed her for her husband’s death which happened just after their separation. The ex-husband and his family started abusing her physically, verbally and mentally.
“They used to threaten me that they would kill my father if I did not give them more money. They also said that they would make my mother a widow within two days of my marriage,” she says.
Even after the divorce, she was abused and her family was threatened. She could see no way out except to run away from home with her three children. That is when we got to know about her, in January 2016.
Her husband has still not remarried and now he lives in his wife’s house. Her mother is mentally ill but she still takes care of her. We have been helping the family ever since – with rations and money.
A few months ago, we asked them to apply for our scholarship programme for children. The husband said that this could not be done till his divorce was finalised from his wife’s house. He felt like he was doing a favour by allowing his former wife and in-laws to continue living in the family house which belonged to him.
“If I had my own house, I would have given it
```
Stopped at 5:43. Man, this story seems very depressing, so I just hit Ctrl-C.
Ok, new theory: perhaps it has to do with the context length. The model has a context of size `N=512`, and it might generate until the context fills up, then reset (maybe carrying over half of the previous context, I don't know). So, let's try increasing the context size with `-c 1024`:
```
(env) [cutkosky@scc-pg3 llama.cpp]$ ./main -c 1024 -t 16 -b 1024 --mlock -m /projectnb/aclab/llama/7B/ggml-model-q4_0.bin --prompt "Once upon a time there was"
main: build = 588 (ac7876a)
main: seed = 1685051394
llama.cpp: loading model from /projectnb/aclab/llama/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 1024
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: mem required = 5407.71 MB (+ 1026.00 MB per state)
...................................................................................................
llama_init_from_file: kv self size = 512.00 MB
system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 1024, n_batch = 512, n_predict = -1, n_keep = 0
Once upon a time there was a little boy who loved his father very much. The little boy’s name was Avery. His dad, Andrew, worked as a mechanic at the local car lot.
Avery and his family were living in their own home. They had food on the table every day and new clothes to wear in the fall. But, there was something they wanted very much; but, it was too expensive for them to buy. There was not enough money left over after paying all of their bills each month.
Avery and his family looked forward to spending time together as a family, but sometimes Dad had to work late at the car lot or mom had to go grocery shopping on days that were not her regular day off from work. They didn’t have enough money for all of them to do things together like they wanted.
One night Andrew and Avery’s mom were talking about how they could make more money. They started looking into different ways to earn more, but found it would be too hard for Andrew to work overtime each week because he had his regular job at the car lot every day of the week. He also didn’t want to ask for a raise at work because that might cause him to lose his job.
So, instead they talked about how Avery could earn money to help out with all of their bills and allow them to have more fun as a family together. Andrew told Avery that he should start a lemonade stand outside in front of their house during the nice summer months. He would have to buy his own supplies, but it looked like this could be something Avery would be able to do with little help from mom and dad.
When Avery’s friends found out about his lemonade stand they wanted to come over for fresh cold lemonade, and cookies that he made himself. They each paid him ten cents per cup of lemonade and one dollar for a cookie. That was two dollars! Avery had never seen so much money in his life. He ran into the house and told Andrew and mom all about it.
When Andrew got home from work, he asked Avery how business was going at the stand. Avery told him that he made lots of money, but now he wanted to know what they could do with all of this extra money. “What would you like?” Andrew asked.
Avery thought for a minute and then said, “I’d really like to get my own bike.” Andrew smiled at his son and told him it was possible if Avery kept working hard outside in front of their house during the nice weather. He had also seen a boy that lived on the other side of the street who worked very hard at his lemonade stand.
“Wow, I’d love to get my own bike too!” said Avery’s best friend Samantha and he went off to tell his mom about this new plan for getting their bikes. Andrew and Avery’s mom were happy that they had a plan to help out with the expenses.
“We can do it, but you need to work as hard at your stand each week as Samuel does,” said Andrew encouragingly. The two boys both agreed on this and started making plans for how their stands would look so different from one another.
The next day Avery went outside and put up his lemonade stand signs in front of the house, but he wasn’t sure if it was really going to work out. He could have just asked Samuel directly about selling lemonade at his stand, but then they might not have had as many customers.
Soon enough Avery had made two bags full of lemonade with all of the money from each week’s sales. When he went inside and counted it up, he realized that each bag of money was equal to $50. The boys were amazed! They ran into their parents’ room excitedly and told them all about how they had made more than enough to get their bikes in just one summer season.
Both families were very happy for Avery, Samuel, Andrew, and his mom when they saw how much money the two lemonade stands had made. They asked if there was anything else either boy could do to earn even more so that they would each have a bike of their own.
Andrew knew that the boys would love riding together on their bikes so he started thinking about what other things they could possibly do together to make money. He told Avery and Samuel about some extra work around town, but both boys already had summer jobs at different local businesses. They liked those jobs, but were also interested in finding out more about Andrew’s new plan.
Andrew asked the boys if either of them would like to have a yard sale together where they could sell things that they no longer needed or
```
Pause at 5:52. You know, it feels like it took longer to reach the pause this time. Maybe I'll try setting the seed to match a previous run (say 1685049400, from the first attempt):
```
(env) [cutkosky@scc-pg3 llama.cpp]$ ./main -s 1685049400 -c 1024 -t 16 -b 1024 --mlock -m /projectnb/aclab/llama/7B/ggml-model-q4_0.bin --prompt "Once upon a time there was"
main: build = 588 (ac7876a)
main: seed = 1685049400
llama.cpp: loading model from /projectnb/aclab/llama/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 1024
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: mem required = 5407.71 MB (+ 1026.00 MB per state)
...................................................................................................
llama_init_from_file: kv self size = 512.00 MB
system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 1024, n_batch = 512, n_predict = -1, n_keep = 0
Once upon a time there was a lovely little cottage in the middle of nowhere, surrounded by green fields. It belonged to a very old lady called Granny Smith, who lived there all alone with her four cats: Tiddles, Purrs, Squeeze and Mews.
The day before she had been out foraging with her dogs in the nearby woods. She came home and sat down at her kitchen table to a delicious meal of carrot soup, blackcurrant sponge cake and freshly-made apple juice. While she was enjoying her lunch, a knock sounded on the door. Granny Smith got up from her seat and went out onto the veranda, where there were six baskets full of delicious apples that she had collected.
She opened the front door to discover a woman standing in the doorway. She was carrying an empty basket, which she placed on the floor. The old lady greeted her with a warm smile and invited her inside for coffee. This made the woman very happy indeed, as Granny Smith’s apple juice is known to be one of the best around!
"Thank you, thank you!" said the woman, accepting the tray. "I know how much work goes into making this drink."
It would have been rude for them not to say it was delicious. So Granny Smith and her visitor chatted happily while they drank their coffee and ate their biscuits with caramel apple icing and melted dark-chocolate pearls. They talked about all sorts of things - Granny Smith’s dogs, the weather, the fact that they both liked cats, even though they were very different from each other - until finally it was time for them to say goodbye.
The woman looked at her watch and said that she had a long walk ahead of her in order to reach the road leading back home again. She promised Granny Smith that she would be back soon to collect more apples, and left.
Granny Smith sat down in her chair once more, sipped some coffee, nibbled on an apple, and looked out through the window at the sunset. This was her favourite time of day, because she could watch the shadows gradually changing colour as the light faded. And it wasn’t long before she heard another knock at the door, only this time there were a lot more baskets than last time!
Granny Smith opened the door to reveal nine people standing in front of her, each carrying one of those same empty baskets! She invited them into her house for coffee, but they said that they just wanted to drink their juice and eat some biscuits. They sipped their apple juice from the same glasses Granny Smith used - a special occasion always deserves something more than plastic cups! And then they all tucked in together.
"Thank you," they said, when it was time to leave again. "Granny Smith, your apple juice is just as good as I remembered!"
Before Granny Smith could respond, a huge thunderclap and flash of lightning lit up the sky outside her front door. The wind started howling around the house - in fact there were so many trees swaying that they almost knocked over her garden table! It looked like a gale was on its way. She ran to close all the windows, but before she’d even had time to sit down again, another knock sounded at the door.
"Who could be coming here in such terrible weather?" thought Granny Smith, as she opened it again. This time there were twelve people out there - four men, four women and four children. All of them carried empty baskets over their shoulders!
Granny Smith invited them to come inside for a hot drink or biscuits with apple juice icing and pearls. But they all refused - they just wanted to sit down on her front steps with an orange-juice smoothie instead, and share the story of how the trees had once been their home.
"It was a very long time ago," said one, "When I was young and my hair was still black."
Granny Smith could tell that these people were all rather old, but they seemed to be quite healthy and sprightly! They told Granny Smith all about how they had lived in the trees for many years, until a man called Logan came along, chopped down most of their homes and built houses on top.
"I remember that day," said one of them, "It was very windy like this."
Their stories sounded quite sad to Granny Smith - but she could only imagine how much they had missed when the trees were all cut down! She was
```
Paused at 5:56, but this definitely went way further than the first time (which didn't complete the line starting with "Granny Smith sat down in her chair once more"). This time we started generating text again at 5:59. It makes sense that things are slower to start up again with a larger context, I guess:
```
She was glad that these people still believed in the magic of her apple juice.
"But you know," said one of them, "Things are getting better!"
"Oh really?" asked Granny Smith, surprised. "How’s that? And how do you know anyway?"
The first person spoke again - "Well, a few years ago we moved out into the trees again - into houses Logan made for us out of tree branches." This second family had all grown up in the trees, and remembered them well! Granny Smith was very pleased that they’d come back to her - but they seemed to be in quite a hurry.
"We need to go now," said one of them, "There’s another man coming who wants Logan’s house!"
Granny Smith went with them all the way to the end of her garden path and waved them off as they hurried away into the dark. Then she walked back inside again - but she wasn’t happy that a new group of people had come along so quickly! She tried to get their baskets out from underneath the kitchen table, but one of them knocked over a glass cup full of apple juice!
Granny Smith could smell it now - her magic apple juice was going all over the floor! [end of text]
llama_print_timings: load time = 940.97 ms
llama_print_timings: sample time = 1190.72 ms / 1293 runs ( 0.92 ms per token)
llama_print_timings: prompt eval time = 178922.15 ms / 519 tokens ( 344.74 ms per token)
llama_print_timings: eval time = 157951.48 ms / 1292 runs ( 122.25 ms per token)
llama_print_timings: total time = 339012.90 ms
```
Looks like average generation time is roughly the same, so the same UI trick of artificially slowing down the printing is still probably a reasonable idea.
But now I recall that we've changed a few things. Let's make sure that it was in fact the context increase by regenerating that depressing story with seed 1685050909:
```
(env) [cutkosky@scc-pg3 llama.cpp]$ ./main -s 1685050909 -c 1024 -t 16 -b 1024 --mlock -m /projectnb/aclab/llama/7B/ggml-model-q4_0.bin --prompt "Once upon a time there was"
main: build = 588 (ac7876a)
main: seed = 1685050909
llama.cpp: loading model from /projectnb/aclab/llama/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 1024
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: mem required = 5407.71 MB (+ 1026.00 MB per state)
...................................................................................................
llama_init_from_file: kv self size = 512.00 MB
system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 1024, n_batch = 512, n_predict = -1, n_keep = 0
Once upon a time there was this cute little girl. She was in the eighth grade, studying hard to get into high school, when her father passed away. The family fell apart and she had to fend for herself.
She started working at the age of 14, and then went back to the same school as before, where everyone treated her as a ‘daughter of the poor’. “I am not interested in being charity case,” she said with tears in her eyes.
Her parents were illiterate. But their daughter did well at school till eighth grade and wanted to go ahead studying further. She was also working hard, but her family could not afford it.
She got married at the age of 16 years to an elderly man who already had a wife. Her husband’s family discriminated against her and she was abused by them. She bore this in silence as they were paying dowry from her own money, while her father was alive. After her marriage she could not go out with men or boys for any reason.
She was forced to get a divorce three years later – the in-laws blamed her for her husband’s death which happened just after their separation. The ex-husband and his family started abusing her physically, verbally and mentally.
“They used to threaten me that they would kill my father if I did not give them more money. They also said that they would make my mother a widow within two days of my marriage,” she says.
Even after the divorce, she was abused and her family was threatened. She could see no way out except to run away from home with her three children. That is when we got to know about her, in January 2016.
Her husband has still not remarried and now he lives in his wife’s house. Her mother is mentally ill but she still takes care of her. We have been helping the family ever since – with rations and money.
A few months ago, we asked them to apply for our scholarship programme for children. The husband said that this could not be done till his divorce was finalised from his wife’s house. He felt like he was doing a favour by allowing his former wife and in-laws to continue living in the family house which belonged to him.
“If I had my own house, I would have given it to you,” he said.
The girl’s father had given up and stopped attending school. Her elder brother had already dropped out from school due to financial reasons. The only sibling who was still studying hard was the younger sister. She wanted to study further but her family could not afford this – and she could not work as she was a full-time mother of three children.
The girl has no one in the world except for us! All our help to her was from our own funds, which we raise by helping destitute women and their families who come to us for help. We have never asked anyone for sponsorship or charity before – we wanted to give this young woman a chance at life herself.
“I am so grateful that you came into my life. I would like to pursue a Masters in English Literature from the University of Delhi,” says this young woman, who will be 21 years old next month. Her younger sister is also studying with us and we are helping her complete her schooling.
You can make a contribution to help girls like her study further. [end of text]
llama_print_timings: load time = 940.91 ms
llama_print_timings: sample time = 671.57 ms / 735 runs ( 0.91 ms per token)
llama_print_timings: prompt eval time = 445.14 ms / 7 tokens ( 63.59 ms per token)
llama_print_timings: eval time = 87155.05 ms / 734 runs ( 118.74 ms per token)
llama_print_timings: total time = 89013.92 ms
```
It actually decided to finish, but more importantly, it clearly got further than before, so the context size was definitely the important part.
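Since this theory panned out, here is my understanding of what the `main` example does when the window fills up (a sketch from skimming `examples/main/main.cpp`; the exact rule may differ):
```
def swap_context(tokens, n_ctx, n_keep=0):
    """When the context window is full, keep the first n_keep tokens,
    drop the older half of everything after them, and return what must
    be fed back through the model."""
    if len(tokens) < n_ctx:
        return tokens                            # still room, nothing to do
    n_left = len(tokens) - n_keep
    kept = tokens[:n_keep]                       # e.g., the original prompt
    recent = tokens[len(tokens) - n_left // 2:]  # newest half of the rest
    return kept + recent                         # gets re-evaluated from scratch
```
This would also explain the timing reports: the `-c 512` runs above show `prompt eval time ... / 521 tokens`, which matches the 7-token prompt plus two swaps re-evaluating ~257 tokens each, and at ~350 ms per prompt-eval token each swap costs about a minute and a half - right in line with the observed pauses.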
## Trying a bigger model
Ok, things seem to work. Let's try the 13B parameter model. I run the same conversion and quantization steps on the 13B weights and start generation:
```
(env) [cutkosky@scc-pg3 llama.cpp]$ ./main -c 2048 -t 16 -m /projectnb/aclab/llama/13B/ggml-model-q4_0.bin --prompt "Once upon a time there was"
main: build = 588 (ac7876a)
main: seed = 1685056036
llama.cpp: loading model from /projectnb/aclab/llama/13B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 9031.70 MB (+ 1608.00 MB per state)
.
llama_init_from_file: kv self size = 1600.00 MB
system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 0
Once upon a time there was a very bad man named Jabba the Hutt. He ran a planet called Tatooine. The other planets of the galaxy were scared that Jabba would take over because he owned the only place where ships could get fuel.
Jabba had an enormous body, a small head, and giant eyes that were always watching everything that happened around him. He dressed in flowing robes that matched his skin color, and he wore jewels on his fingers, neck, and ears. Jabba’s favorite food was fried eel.
Jabba had an enormous body, a small head, and giant eyes that were always watching everything that happened around him. He dressed in flowing robes that matched his skin color, and he wore jewels on his fingers, neck, and ears. Jabba’s favorite food was fried eel.
Jabba had four bounty hunters looking for Han Solo. The bounty hunters were from different planets in the galaxy, but they all came to Tatooine because everyone knew that’s where Han would go next. One of them, a man with a green face and bright yellow eyes named Greedo, caught up with him at a cantina.
Greedo didn’t say anything. He just pointed his blaster at Han.
Jabba was so angry he threw the plate across the room. Then he told Greedo to kill Han because Han had shot first. Greedo raised his blaster and pulled the trigger, but nothing happened! The gun was broken!
Han Solo took out a knife and cut off Greedo’s head. Then he ran outside and found Chewbacca. The two friends got into the Millennium Falcon and started to flee from Tatooine, even though Jabba’s men were chasing them in their own ship!
Han Solo jumped over to a different ship, but he didn’t get inside before he was shot at by one of Jabba’s men. A blaster bolt hit the Falcon and blew Han into space. But then Chewbacca got on top of the Falcon and caught him! The two friends got back inside and started to flee from Tatooine, even though Jabba’s men were chasing them in their own ship!
Han Solo jumped over to a different ship, but he didn’t get inside before he was shot at by one of Jabba’s men. A blaster bolt hit the Falcon and blew Han into space. But then Chewbacca got on top of the Falcon and caught him! The two friends jumped back into the Falcon and started to flee from Tatooine, even though Jabba’s men were chasing them in their own ship!
Han Solo jumped over to a different ship, but he didn’t get inside before he was shot at by one of Jabba’s men. A blaster bolt hit the Falcon and blew Han into space. But then Chewbacca got on top of the Falcon and caught him! The two friends jumped back into the Falcon, started it up, and took off from Tatooine—even though J
```
Hmm, this seems kind of bad: it's stuck in a loop, so I hit Ctrl-C to stop it at this point. It looks like we can try some options: `--repeat_penalty 1.15 --repeat_last_n 128` (as I understand it, these scale down the logits of any token that already appeared in the last 128 tokens, which should discourage exactly these verbatim loops):
```
(env) [cutkosky@scc-pg3 llama.cpp]$ ./main -s 1685056036 --repeat_penalty 1.15 --repeat_last_n 128 -c 2048 -t 16 -m /projectnb/aclab/llama/13B/ggml-model-q4_0.bin --prompt "Once upon a time there was"
main: build = 588 (ac7876a)
main: seed = 1685056036
llama.cpp: loading model from /projectnb/aclab/llama/13B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 9031.70 MB (+ 1608.00 MB per state)
.
llama_init_from_file: kv self size = 1600.00 MB
system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 128, repeat_penalty = 1.150000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 0
Once upon a time there was a very bad man named Jabba the Hutt. He ran a terrible place where people like you and me were forced to work as his slaves, doing whatever evil he ordered us to do…
But it doesn’t matter how bad or evil someone is when they are standing on your tail…it only matters that you escape their grasp! That is what Anakin Skywalker did by freeing himself from slavery in “Star Wars: Episode I – The Phantom Menace” (1999). But it wasn’t just him who escaped Jabba’s clutches; his mother Shmi also got away from her son’s former captor. She was able to save herself because of a little help from her youngest child, and the boy she loved more than anything. Soon, though, this little one would become a man capable of doing incredible things as he freed everyone from slavery under Emperor Palpatine’s evil reign…
A young Anakin Skywalker and Qui-Gon Jinn on Tatooine in “Star Wars: Episode I – The Phantom Menace” (1999). ©20th Century Fox Film Corp./Lucasfilm Ltd. All Rights Reserved.
As you are probably aware by now, we have been celebrating Star Wars with a new story each day during the month of May! We had so much fun that we decided to continue our celebration into June with some posts about characters who appear across multiple films within this epic franchise.
Today’s subject is young Anakin Skywalker from “Star Wars: The Phantom Menace”, and how his life before he became Darth Vader was shaped by two women – one who made him what he would become and another who saved him when no one else could do it…
Anakin Skywalker as an older boy in “Star Wars: Episode II – Attack of the Clones” (2002). ©2018 Lucasfilm Ltd./Lucasfilm/Disney CoMPany, All Rights Reserved.
We first meet our future Dark Lord of the Sith as a young child living on Tatooine with Shmi Skywalker at their home which is situated near Mos Espa’s spaceport terminal. We learn that this Jedi Padawan named Qui-Gon Jinn has been sent by his master to find and train a certain person who he feels may be “The Chosen One.” This prophecy that everyone keeps talking about says that there will one day come someone so powerful that they can bring balance back into the Force between good and evil…
But why would anyone believe in such a myth? Because it seems that it has already happened once, when an even greater Jedi Knight named Darth Bane brought peace during the Sith Wars! (If you are not familiar with these stories or have forgotten them, please check out our posts for this week as well as the ones posted last month!)
Shmi Skywalker explains to Qui-Gon how her son Anakin was born. In “Star Wars: Episode I – The Phantom Menace” (1999). ©20th Century Fox Film Corp./Lucasfilm Ltd. All Rights Reserved.
Anyways, Shmi tells us that she believes in both sides of life and therefore doesn’t really care about such a prophecy; she just wants her children to be safe no matter who they are. She says that one day some space travelers came by their home searching for someone with knowledge of ships but found only her instead…they asked if she knew anyone on Tatooine because they were looking for someone named “Skywalker.” And so it was during this conversation at the spaceport terminal that Qui-Gon Jinn first learned of Anakin Skywalker as well! The boy is apparently being kept from his mother, which worries Qui-Gon greatly – but he will find out why soon enough…
Young Anakin Skywalker fights to free himself from slavery in “Star Wars: Episode I – The Phantom Menace” (1999). ©20th Century Fox Film Corp./Lucasfilm Ltd. All Rights Reserved.
Qui-Gon and Obi-Wan Kenobi take young Anakin with them when they leave Tatooine for Naboo so that the Jedi Council can decide whether or not this boy is truly “The Chosen One.” While staying on a ship en route to the planet, little Anakin has an accident which causes him to fall into one of the vessel’s reactors…although he survives, his mother Shmi does not. Now in mourning she will never see her beloved son again…
But perhaps if she had lived longer then things might have turned out differently? What do you think about all of these new details we are learning about both young and adult versions of Darth Vader/Anakin Skywalker? Let us know your thoughts in the comments below! And may The Force be with you always… [end of text]
llama_print_timings: load time = 1369.46 ms
llama_print_timings: sample time = 2038.93 ms / 1106 runs ( 1.84 ms per token)
llama_print_timings: prompt eval time = 828.39 ms / 7 tokens ( 118.34 ms per token)
llama_print_timings: eval time = 239800.51 ms / 1105 runs ( 217.01 ms per token)
llama_print_timings: total time = 243606.85 ms
```
Ok, the repeats are gone. I'm not sure how great this is, though: I'd almost expect it to be verbatim from some fan website, except some of the details are so dramatically incorrect that I refuse to believe any fan site would make these mistakes (Qui-Gon is not a padawan, he was not sent to Tatooine to find the chosen one, and Shmi doesn't fall into a reactor and die on a spaceship - I'm pretty sure she stays on Tatooine). Anyway, it printed fast, so it's looking good!
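For the record, "fast" is relative: the eval-time lines say this 13B model runs at about 217 ms per token, versus about 119 ms per token for the earlier 7B run:
```
# tokens per second for the 13B and 7B runs, from the eval-time lines above
echo "scale=1; 1000/217.01; 1000/118.74" | bc
# -> 4.6
# -> 8.4
```
So roughly half the speed, which tracks with the model being roughly twice the size.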