章劉軒瑋
A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language models with many parameters, and are trained with self-supervised learning on a vast amount of text.
As machine learning algorithms process numbers rather than text, the text must be converted to numbers. In the first step, a vocabulary is decided upon, then integer indices are arbitrarily but uniquely assigned to each vocabulary entry, and finally, an embedding is associated to the integer index.
Tokenization also compresses the datasets. Because LLMs generally require input to be an array that is not jagged, the shorter texts must be "padded" until they match the length of the longest one. How many tokens are, on average, needed per word depends on the language of the dataset.
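As a toy illustration of these steps, the sketch below maps words from a tiny, invented vocabulary to integer indices and pads the sequence to a fixed length; the vocabulary, ids, and pad token are made up for the example, and real tokenizers work on subword units rather than whole words.

```c
#include <stdio.h>
#include <string.h>

/* Toy example, not a real tokenizer: a fixed five-entry vocabulary,
 * a lookup from word to integer id, and padding to a fixed length. */
#define VOCAB_SIZE 5
#define MAX_LEN    8
#define PAD_ID     0

static const char *vocab[VOCAB_SIZE] = {"<pad>", "the", "cat", "sat", "down"};

static int token_to_id(const char *tok) {
    for (int i = 0; i < VOCAB_SIZE; i++)
        if (strcmp(vocab[i], tok) == 0)
            return i;
    return PAD_ID; /* unknown words fall back to the pad id in this toy */
}

int main(void) {
    const char *words[] = {"the", "cat", "sat"};
    int n = 3, ids[MAX_LEN];

    for (int i = 0; i < MAX_LEN; i++)
        ids[i] = (i < n) ? token_to_id(words[i]) : PAD_ID; /* pad the tail */

    for (int i = 0; i < MAX_LEN; i++)
        printf("%d ", ids[i]);  /* prints: 1 2 3 0 0 0 0 0 */
    printf("\n");
    return 0;
}
```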
Teach the model the general structure, grammar, and meaning of language by training it on large-scale text data.
The model is trained on massive datasets such as web pages, books, and Wikipedia.
The training tasks often involve self-supervised learning, for example predicting the next token in a sequence or reconstructing masked tokens.
Through this training, the model learns patterns, word relationships, and contextual dependencies in the language.
Focus the model on a specific task (e.g., translation, question answering, or sentiment analysis) to improve its performance in that domain.
The model is further trained using labeled datasets tailored to the target task, such as parallel corpora for translation, question-answer pairs, or sentiment-labeled reviews.
The model’s parameters are adjusted based on these task-specific datasets, refining its understanding.
This stage is more efficient as it builds on the foundational knowledge acquired during pre-training.
Further optimize the model to produce outputs aligned with specific requirements, such as user preferences, content quality, or ethical standards.
A Reward Model is introduced to evaluate the quality of the model's outputs. Reinforcement learning algorithms, most commonly Proximal Policy Optimization (PPO), are then used to adjust the model's behavior.
The model learns to maximize rewards, improving the quality of its generated content.
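A common way to write this objective (a standard KL-regularized formulation used in PPO-style RLHF, given here only as a representative example) is

$$
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big]
\;-\;
\beta\,\mathrm{KL}\!\left(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\right)
$$

where $\pi_\theta$ is the model being tuned, $r_\phi$ the reward model, $\pi_{\mathrm{ref}}$ the frozen pre-RL model, and $\beta$ a coefficient that keeps the tuned model close to the reference.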
Article Writing and Creative Content: LLMs can automatically generate articles, news reports, technical documents, or creative writing (e.g., stories, poetry). This is highly useful in content creation or media industries.
Ad Copy and Marketing: LLMs can automatically create catchy ad copy or social media content based on specific needs, saving time in content writing.
LLMs can perform efficient language translation, supporting multilingual tasks such as cross-border business communication, international collaboration, etc.
Compared to traditional translation tools, LLMs can better understand complex context and generate more natural translations.
As we've explored, LLMs have diverse and impactful applications in areas such as content generation and question-answering systems. However, in training large language models, one key operation that consumes a significant amount of computational resources is matrix-vector multiplication, also known as the fully-connected or linear layer in deep learning. This operation applies the learned parameters across the model and often accounts for over 70% of the total computation during training.
In the next section, we'll use a model based on llama2.c, Andrej Karpathy's open-source project that implements the Llama 2 architecture released by Meta (in the same minimalist spirit as his open-source GPT variants), to further explore how such operations are optimized in the context of modern language models.
llama2.c is a minimalistic implementation of the Llama 2 architecture, focusing on simplicity and educational value. It provides a full-stack solution for training and inference using a small Llama 2 model in pure C. The repository allows loading models trained on the TinyStories dataset and supports running them interactively with a C-based inference engine. It emphasizes ease of use, with the ability to run models with parameter sizes up to 42M efficiently on personal hardware.
The `matmul` function in llama2.c performs matrix-vector multiplication, a fundamental operation in neural network computations (a sketch of the routine is shown below).
For each row of the weight matrix w, it computes the dot product with the vector x and stores the result in xout.
OpenMP parallelization is used to divide row-wise computations across multiple cores, enhancing performance.
This function is a computational bottleneck in model inference.
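The routine is essentially the following (a short sketch that closely mirrors the llama2.c implementation, where `w` holds a d×n matrix in row-major order):

```c
/* W (d,n) @ x (n,) -> xout (d,): one dot product per output element. */
void matmul(float *xout, float *x, float *w, int n, int d) {
    int i;
    #pragma omp parallel for private(i)   /* rows are independent */
    for (i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];    /* accumulate the dot product */
        }
        xout[i] = val;
    }
}
```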
Let's take a small example to illustrate the matrix-vector multiplication and its result:
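With d = 3 and n = 4 (the numbers below are chosen purely for illustration):

$$
W x =
\begin{pmatrix}
1 & 2 & 3 & 4 \\
5 & 6 & 7 & 8 \\
9 & 10 & 11 & 12
\end{pmatrix}
\begin{pmatrix} 1 \\ 1 \\ 2 \\ 0 \end{pmatrix}
=
\begin{pmatrix} 1 + 2 + 6 + 0 \\ 5 + 6 + 14 + 0 \\ 9 + 10 + 22 + 0 \end{pmatrix}
=
\begin{pmatrix} 9 \\ 25 \\ 41 \end{pmatrix}
$$

Each output entry is the dot product of one row of W with x, which is exactly what the inner loop of `matmul` accumulates.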
`matmul` in assembly code:
- `101fc`-`10294`: outer loop control
- `10208`-`10250`: inner loop calculation
Although each row of the matrix multiplication can be computed in parallel (thanks to OpenMP), the total computation still involves d rows, and each row involves n operations. The parallelization only reduces the time for individual row computations, not the overall complexity of processing d rows. Hence, even with parallelism, the overall time complexity remains O(d×n); the reduction in time only happens in the constant factor, not in the big-O complexity.
In matrix multiplication, the inner loop performs multiplication and accumulation one pair of data at a time, requiring multiple CPU cycles per operation. This approach is inefficient because each multiplication and addition operation is done sequentially, leading to high overhead from repeated memory access, data loading, and computation. As a result, the CPU spends excessive time on individual operations, reducing the overall performance.
To address the inefficiencies of sequential operations in matrix multiplication, I propose defining custom Matrix-Vector Multiplication (MVM) instructions inspired by the RISC-V Vector Extension (RVV). These instructions would focus on parallelizing computation, enabling operations such as simultaneous data loading, vectorized multiplication, and accumulation. Specifically, the design could include:
- Perform multiple `w[i * n + j] * x[j]` operations with accumulation in a single instruction, reducing loop overhead.

After adding such instructions, if each vector operation processes l elements in parallel:

- Inner-loop iterations per row: ⌈n/l⌉
- Total operations: d × ⌈n/l⌉
- Time complexity: O(d × ⌈n/l⌉)

The vectorized instructions theoretically reduce the complexity by a factor of l, enhancing performance by decreasing loop iterations and memory accesses.
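In plain C, the intended execution model for a single row looks like the strip-mined loop below, written with l = 4 and the four lanes spelled out explicitly; the proposed instructions would perform each group of four as one operation (this is a sketch of the idea, not compiler output):

```c
/* Strip-mined dot product: process 4 elements per iteration, so the main
 * loop runs floor(n/4) times and a scalar tail handles the remainder,
 * giving ceil(n/4) groups per row overall. */
float dot_row(const float *w_row, const float *x, int n) {
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    int j = 0;
    for (; j + 4 <= n; j += 4) {            /* one "vflw + vmul + vadd" group */
        acc[0] += w_row[j + 0] * x[j + 0];
        acc[1] += w_row[j + 1] * x[j + 1];
        acc[2] += w_row[j + 2] * x[j + 2];
        acc[3] += w_row[j + 3] * x[j + 3];
    }
    float val = acc[0] + acc[1] + acc[2] + acc[3];  /* horizontal sum */
    for (; j < n; j++)                       /* scalar tail when n % 4 != 0 */
        val += w_row[j] * x[j];
    return val;
}
```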
`vflw`:
Its mnemonic representation would resemble:
Use the `vflw` instruction to load data directly from memory into a vector register; the idea is to hold four words in the vector register at once for computation.
`vfsw`:
Its mnemonic representation would resemble:
`vmul`:
Each corresponding element from `v2` and `v3` is multiplied, and the result is stored in the corresponding position in `v1`.
`vadd`:
The `vadd` instruction performs element-wise addition between two vector registers and stores the result in a destination vector register.
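Since the encodings are not fixed yet, the intended semantics of the four instructions can be summarized as a behavioral model in C; the `vec4` type, the fixed width of four lanes, and the operand order are assumptions made for this sketch:

```c
/* Behavioral sketch of the proposed 4-wide instructions (not real RVV ops). */
typedef struct { float e[4]; } vec4;

/* vflw vd, offset(rs1): load four consecutive floats from memory. */
static inline vec4 vflw(const float *addr) {
    vec4 v;
    for (int i = 0; i < 4; i++) v.e[i] = addr[i];
    return v;
}

/* vfsw vs, offset(rs1): store four floats from a vector register to memory. */
static inline void vfsw(float *addr, vec4 v) {
    for (int i = 0; i < 4; i++) addr[i] = v.e[i];
}

/* vmul vd, vs1, vs2: element-wise multiplication. */
static inline vec4 vmul(vec4 a, vec4 b) {
    vec4 r;
    for (int i = 0; i < 4; i++) r.e[i] = a.e[i] * b.e[i];
    return r;
}

/* vadd vd, vs1, vs2: element-wise addition. */
static inline vec4 vadd(vec4 a, vec4 b) {
    vec4 r;
    for (int i = 0; i < 4; i++) r.e[i] = a.e[i] + b.e[i];
    return r;
}
```

With these primitives, one row of `matmul` becomes a sequence of `vflw`, `vmul`, and `vadd` steps followed by a horizontal sum, matching the strip-mined loop shown earlier.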
In this first step, the default RISC-V toolchain is compiled, without any modifications to the instruction set.
Cloning the riscv-gnu-toolchain repository and its submodules:
Around 7GB are needed to download all repositories.
The toolchain is built in `/opt/riscv_custom`:
GCC cross-compiler version can be checked:
To test the implementation, we first add the non-default modulo (`mod`) instruction to RV32I. Its mnemonic representation would resemble:
The opcode syntax would be:
The `rv_i` file is modified as follows:
The `rv_i` file is located in the `riscv-opcodes/extensions` directory, which contains opcode definitions for RISC-V instruction extensions.
Then, the opcode file is processed to obtain the MATCH and MASK values:
This command generates the opcode representations in several formats, such as SystemVerilog, Chisel, and C (in the `encoding.out.h` file).
Now, binutils needs to be aware of the new instruction. `riscv-gnu-toolchain/binutils/include/opcode/riscv-opc.h` is updated as follows:
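The edit follows the `DECLARE_INSN` pattern already used in that header; the numeric values below are placeholders, and the real MATCH/MASK constants must be copied from the generated `encoding.out.h`:

```c
/* Placeholder values: copy the real constants from encoding.out.h. */
#define MATCH_MOD 0x200006b
#define MASK_MOD  0xfe00707f

/* Added alongside the other DECLARE_INSN entries. */
DECLARE_INSN(mod, MATCH_MOD, MASK_MOD)
```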
The related C file (`riscv-gnu-toolchain/binutils/opcodes/riscv-opc.c`) has to be modified as well:
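A plausible entry in the `riscv_opcodes[]` table is shown below; field names and flags vary between binutils versions, so treat this as a sketch consistent with the field descriptions that follow:

```c
/* name, xlen, isa/class, operands, match, mask, match_func, pinfo */
{"mod", 0, INSN_CLASS_I, "d,s,t", MATCH_MOD, MASK_MOD, match_opcode, 0},
```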
- `name`: name of the instruction.
- `xlen`: width of an integer register in bits.
- `isa`: ISA extension.
- `operands`: based on the operand parsing available in `riscv-gnu-toolchain/riscv-binutils/gas/config/tc-riscv.c`.
- `match_func`: pointer to the function that recovers the funct7, funct3, and opcode fields of the instruction.
The final step is to recompile the toolchain so that the newly implemented custom instruction is available.
Here is a sample C program using the freshly implemented `mod` instruction:
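A minimal test program along these lines could use inline assembly; it assumes the `mod` mnemonic takes `rd, rs1, rs2` and only requires the modified binutils to recognize the new opcode:

```c
#include <stdio.h>

int main(void) {
    int a = 5, b = 3, c;
    /* Emit the custom instruction directly via inline assembly. */
    asm volatile ("mod %0, %1, %2" : "=r"(c) : "r"(a), "r"(b));
    printf("%d mod %d = %d\n", a, b, c);
    return 0;
}
```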
Compile the C file and verify the presence of the mod instruction in the objdump output.
We can observe the mod instruction at line 89 in the objdump output.
Two tools need to be installed:
RISCV tools path
Spike install
PK install
Describe the behavior of the new instruction by adding a file at `riscv-isa-sim/riscv/insns/mod.h`.
The `mod.h` file will be:
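A behavioral description consistent with Spike's per-instruction headers could be a single line, assuming signed remainder semantics and ignoring the division-by-zero case:

```c
/* riscv-isa-sim/riscv/insns/mod.h: write the signed remainder of rs1 / rs2
 * into rd (sketch only; no special handling of a zero divisor). */
WRITE_RD(sext_xlen(sreg_t(RS1) % sreg_t(RS2)));
```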
In `riscv-isa-sim/riscv/encoding.h`, add `MATCH_MOD` and `MASK_MOD`, as was done for the compiler:
Then, the Makefile needs to compile the `mod` instruction. In `riscv-isa-sim/riscv/riscv.mk.in`:
The last file to be modified is `riscv-isa-sim/disasm/disasm.cc`, where instruction types are defined:
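With a standard R-type layout, registering the instruction there is presumably a one-liner using the helper macros already present in that file (the exact macro name is an assumption):

```c
DEFINE_RTYPE(mod);
```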
The last step is to rebuild the simulator and test the program.