OpenAI's ChatGPT has awakened a collective awareness of what Large
Language Models (LLMs) are capable of. With that awakening comes a daily
march of LLM news: new products, new features, new models, new
capabilities, (and new worries). It seems we're in the early stages of a
Cambrian explosion of LLMs and LLM-powered tools; it's not yet clear how
LLMs will affect and influence our professional and personal lives, but
it seems clear that they will, in some way.
Since LLMs are here to stay, it's worthwhile to take some time to
understand how these models work from a first-principles perspective.
Starting with the mechanics can help foster durable intuitions that will
inform our usage of these models now and in the future. (Especially if
the future is one where LLMs are a staple of the data scientist's
toolbox, as common as an lm()
function call).
And what better way is there to learn than by doing. So with that
preamble, in this post we'll walk through an implementation of an LLM,
LLaMA (Touvron et al. 2023)
specifically, in TensorFlow and Keras, with the goal being to develop
understanding first, capability second.
Why LLaMA? With the sheer volume of LLM-related content and news out
there, it can seem daunting to know where to get started. Almost weekly
it seems there's a new model announced. Browsing some hubs of LLM
activity (HuggingFace,
TFHub,
reddit,
HackerNews) muddies the waters even
more. How to pick a specific model?
Of the many LLM-related news items in the past months, one that stands
head-and-shoulders above the crowd is the release of
LLaMA,
a modern, foundational LLM made available to the public by Meta AI in
February 2023. On common benchmarks, LLaMA outperforms OpenAI's GPT-3,
while being significantly smaller (though still large).
LLaMA is a great starting place because it is a simple and modern
architecture, has excellent performance on benchmarks, and is open. The
model architecture has had just a few new ideas incorporated into it since
the original Transformer architecture first described in
"Attention Is All You Need"
published from Google (Vaswani et al. 2017). Four different sizes of
LLaMA have been released: 7 billion and 13 billion parameter models
trained on 1 trillion tokens, and 33 billion and 65 billion parameter
models trained on 1.4 trillion tokens. This is an enormous amount of
training data these models have seen–the largest 65B model has been
trained on approximately the "Chinchilla
compute-optimal" (Hoffmann et al. 2022)
number of tokens, while the smaller LLaMAs are substantially
beyond that optimum. In this blog post we'll focus on the smallest, 7B
parameter LLaMA model, which you can comfortably load locally and run on
CPU with only 64 GB of RAM.
While not strictly necessary, to follow along locally you'll probably
want to acquire the pretrained LLaMA weights one
way or
another. Note, the
weights do come with their own license, which you can preview
here.
So, without further ado, let's get started.
Setup
First, we'll want to install the required R and Python packages, and
configure a virtual environment:
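What you install will vary a bit by machine; here is a minimal sketch of one possible setup (the package list and environment name are assumptions, not a prescribed configuration):
# (sketch) R packages used throughout the post
install.packages(c("tensorflow", "keras", "tfautograph", "reticulate",
                   "envir", "glue", "jsonlite", "purrr", "zeallot",
                   "magrittr", "withr"))

# (sketch) a Python virtual environment with the Python-side dependencies
reticulate::virtualenv_create("./.venv", packages = c(
  "tensorflow", "tensorflow-text", "numpy", "torch"))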
With that out of the way, let’s load some packages and prepare our R
session:
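Something along these lines works (a sketch; the Dense alias, the seq_len0() helper, and the non-converting numpy import are assumptions inferred from how they are used later in the post):
library(magrittr)    # %>% and %<>%
library(zeallot)     # %<-% destructuring assignment
library(purrr)
library(glue)
library(envir)       # import_from()
library(withr)       # with_options()
library(reticulate)  # %py_class%
library(tensorflow)
library(tfautograph) # tf_assert()
library(keras)

use_virtualenv("./.venv")

# Python modules we call into directly
np      <- import("numpy", convert = FALSE)
tf_text <- import("tensorflow_text")

# assumed helpers, inferred from later usage
Dense    <- keras$layers$Dense          # convenience alias used by FeedForward
seq_len0 <- function(x) seq_len(x) - 1L # 0-based counterpart of seq_len()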
If you’ve acquired the pretrained weights, it’ll be convenient to
convert them from the torch checkpoint format to something that’s more
framework agnostic (you only need to do this once, of course):
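Here is a sketch of one way to do it, assuming the torch checkpoint lives at "7B/consolidated.00.pth" and that each tensor is written out to an .npy file named after its state-dict key (the naming scheme the weights_path() calls below rely on):
torch <- import("torch")

state_dict <- torch$load("7B/consolidated.00.pth", map_location = "cpu")

for (name in names(state_dict)) {
  # e.g. "layers.0.attention.wq.weight" -> "7B/layers.0.attention.wq.weight.npy"
  tensor <- state_dict[[name]]$to(dtype = torch$float32)
  np$save(file.path("7B", paste0(name, ".npy")), tensor$numpy())
}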
We’ll also define a helper function so we can avoid having to retype the
full path to our weights:
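A sketch of such a helper, assuming the converted .npy files live under a local directory and that glue-style interpolation happens in the caller's frame (which is how calls like weights_path("7B/layers.{block_id}.attention.{name}.weight.npy") are written later):
weights_path <- function(filename) {
  normalizePath(file.path(
    "~/llama-weights",   # adjust to wherever you stored the converted weights
    glue(filename, .envir = parent.frame())),
    mustWork = TRUE)
}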
And load the model configuration parameters specific to the 7B LLaMA,
which we’ll use to build the model.
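The released weights ship with a params.json for each model size; reading it in can look like this (a sketch):
params <- jsonlite::read_json(weights_path("7B/params.json"),
                              simplifyVector = TRUE)
str(params)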
List of 6
$ dim : int 4096
$ multiple_of: int 256
$ n_heads : int 32
$ n_layers : int 32
$ norm_eps : num 1e-06
$ vocab_size : int -1
Tokenizer
The first component to LLaMA is the tokenizer, which converts text to a
sequence of integers. The LLaMA model uses the
SentencePiece tokenizer from
Google. SentencePiece is available as a TensorFlow graph operation
through
tf_text.SentencepieceTokenizer
,
and also as a Keras layer in
keras_nlp.tokenizers.SentencepieceTokenizer
.
By choice of a coin flip, we'll use the lower-level tf_text
interface.
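Loading it might look something like this (a sketch; the file location and reading approach are assumptions, though add_bos = TRUE matters, as discussed below):
tokenizer <- tf_text$SentencepieceTokenizer(
  model = tf$io$gfile$GFile(weights_path("tokenizer.model"), "rb")$read(),
  add_bos = TRUE,
  add_eos = FALSE
)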
Let’s test it out with a prompt:
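A sketch of the calls that produce the output below:
prompt <- "The best way to attract bees"
(tokens <- tokenizer$tokenize(prompt))
tokenizer$detokenize(tokens)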
tf.Tensor([ 1 450 1900 982 304 13978 367 267], shape=(8), dtype=int32)
tf.Tensor(b'The best way to attract bees', shape=(), dtype=string)
Let’s define a show_tokens()
helper function and play with the
tokenizer a little.
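Here's a sketch of such a helper (how the SentencePiece word-boundary marker "▁" is handled, and the tensor-to-string conversion, are assumptions):
show_tokens <- function(what) {
  token_ids <-
    if (is.character(what)) tokenizer$tokenize(what)$numpy() else as.integer(what)
  tokens <- vapply(token_ids, function(id) {
    # assumes as.character() converts a scalar string tensor to an R string
    piece <- tokenizer$id_to_string(as_tensor(id, dtype = "int32")) |> as.character()
    gsub("\u2581", "", piece)   # drop the leading word-boundary marker
  }, character(1))
  names(tokens) <- token_ids
  tokens
}

show_tokens(prompt)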
1 450 1900 982 304 13978 367 267
"" "The" "best" "way" "to" "attract" "be" "es"
Note that “bees” is two tokens. Not every token corresponds to a word.
For example, one non-word token we can reliably expect to show up in a
tokenizer trained on a corpus of English text is “ing.” However, when the
“ing” token shows up will not always follow your intuitions, because
common words get their own token id, even if they can be decomposed into
multiple tokens.
1 2348
"" "ing"
1 1985
"" "working"
1 8525 292
"" "flex" "ing"
1 2113 9292
"" "won" "king"
Another thing to note about the tokenizer is that each token sequence
starts with token id 1. This is a special beginning-of-sequence
token that we requested be added when we loaded the tokenizer with
add_bos = TRUE. There are two other such special tokens that we will
encounter later: an end-of-sequence special token with id 2, and an
unknown-token with id 0.
[1] "<unk>"
[1] "<s>"
[1] "</s>"
1 0 2
"" " ⁇ " ""
Overall, there are 32,000 tokens.
[1] 32000
One last observation is that the more frequently encountered tokens are
assigned lower ids.
50 51 52 53 54 55 56 57 58 59
"/" "0" "1" "2" "3" "4" "5" "6" "7" "8"
100 101 102 103 104 105 106 107 108 109
"a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
1000 1001 1002 1003 1004 1005 1006 1007 1008 1009
"ied" "ER" "stat" "fig" "me" "von" "inter" "roid" "ater" "their"
10000 10001 10002 10003 10004 10005 10006 10007
"ång" "citep" "Ill" "rank" "sender" "beim" "рак" "compat"
10008 10009
"occurs" "diese"
20000 20001 20002 20003 20004 20005 20006 20007
"admit" "Comment" "стя" "Vien" "ці" "permut" "cgi" "crít"
20008 20009
"Console" "ctic"
31990 31991 31992 31993 31994 31995 31996 31997 31998 31999
"ὀ" "げ" "べ" "边" "还" "黃" "왕" "收" "弘" "给"
Moving on, the next step after tokenization is embedding. An embedding
layer is effectively a dictionary lookup that converts an integer (token
id) to a 1d float array. For this we can use the standard keras
Embedding
layer.
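Instantiating it and loading the pretrained embedding weights can look like this (a sketch; the weights file name follows the state-dict naming convention assumed in the conversion step above):
tok_embeddings <- keras$layers$Embedding(
  input_dim = tokenizer$vocab_size(),
  output_dim = params$dim,
  embeddings_initializer =
    \(...) np$load(weights_path("7B/tok_embeddings.weight.npy"))
)

tok_embeddings(tf$constant(100L))  # embedding for a single token id (100 is "a", per the vocabulary peek above)
prompt %>% tokenizer$tokenize() %>% tok_embeddings()  # (8, 4096)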
<tf.Tensor: shape=(4096), dtype=float32, numpy=…>
<tf.Tensor: shape=(8, 4096), dtype=float32, numpy=…>
Once it’s tokenized and embedded, the input then passes through the bulk
of the model, a sequence of repeating TransformerBlock
layers. The 7B
model has 32 of these TransformerBlock
layers, while the 65B model has
80 of them.
[1] 32
[1] 80
Here is what the transformer block looks like:
TransformerBlock(keras$layers$Layer) %py_class% {
  initialize <- function(attn_head_size, attn_n_heads,
                         norm_eps = k_epsilon(), ...,
                         block_id = NULL) {
    super$initialize(...)

    self$attention <- Attention(attn_head_size, attn_n_heads,
                                block_id = block_id)

    self$feed_forward <- FeedForward(
      hidden_dim = 4 * attn_head_size * attn_n_heads,
      block_id = block_id)

    self$attention_norm <- RMSNorm(eps = norm_eps,
                                   block_id = block_id,
                                   feeds_into = "attention")
    self$feed_forward_norm <- RMSNorm(eps = norm_eps,
                                      block_id = block_id,
                                      feeds_into = "ffn")
  }

  call <- function(x) {

    # norm and attention
    x2 <- x %>%
      self$attention_norm() %>%
      self$attention()

    x <- x + x2 # add residual

    # norm and swiglu
    x2 <- x %>%
      self$feed_forward_norm() %>%
      self$feed_forward()

    x <- x + x2 # residual again

    x
  }
}
While there is not a lot of code, there are a lot of ideas packed in
there. This block forms the main trunk of the model, so it’s worth
taking the time to go through it slowly.
We implement the TransformerBlock
as a subclassed
keras.layers.Layer
. This gives us some niceties like the ability to
compose with other Keras layers, but these are mostly irrelevant to the
purpose of this blog post; we could just as easily implement this as,
for example, a vanilla R6 class. Our TransformerBlock
class has two
methods: initialize
, called when we first create the block, and
call
, called when we run the forward pass of the block.
In initialize
, we create 4 layers: an Attention
layer, a
FeedForward
layer, and 2 RMSNorm
layers. We’ll take a close look at
each of these soon, but even before we do so, we can see how they fit
together by looking at the TransformerBlock$call()
method.
The call
method has a few simple ideas. In no particular order, the
first one to observe is the composition pattern of adding residuals.
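Roughly, in pseudo code:
# pseudo code (not runnable): a residual, or skip, connection
x <- x + ...   # `...` stands for the learnable layers in between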
This is a common pattern that helps with model training, and especially
to help with the vanishing gradient
problem. It's
a skip-connection within the otherwise linear sequence of matrix
transformations. It reinjects information (during the forward pass), and
gradients (during back propagation), back into the trunk. You can think
of these residual connections as freeing the learnable layers in-between
(the ... in the pseudo code) from the burden of having to
"pass-through" or "preserve" information in x, allowing the weights to
instead focus on learning transformations that are, (in corporatese
vernacular), value-adding.
The next composition pattern to note is the repeating usage of a
normalization layer.
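Here is that pattern, excerpted from the call() method above:
x2 <- x %>%
  self$attention_norm() %>%
  self$attention()
# ...
x2 <- x %>%
  self$feed_forward_norm() %>%
  self$feed_forward()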
There are many kinds of normalization layers, but to slightly
overgeneralize, they can all be thought of as a stabilizer that helps
with training. Like their deep-learning cousins the regularizers, their
main function is to keep values passing through in a sensible range–in
the ball park of (-1, 1), typically. We'll take a closer look at
RMSNorm
soon.
Stripped of two tricks that are mostly there to help the model train,
residuals and normalization, the core of the TransformerBlock
is just
this:
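In pseudo code, roughly:
# pseudo code (not runnable): the essence of the block
x %>%
  self$attention() %>%
  self$feed_forward()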
In a moment we'll see that feed_forward is a slightly fancier
variation of a conventional sequence of Dense layers. Before we get
there we can safely skip ahead to distill the following intuition: a
TransformerBlock
is basically an Attention
layer followed by a few
(fancy) dense layers, with some simple composition patterns (tricks)
that help with training. Attention
is the heart of the model: it’s the
most interesting, and also the most involved.
With the framing in place, let’s go through and take a closer look at
RMSNorm
, FeedForward
, and then with the foundation in place, we’ll
turn our attention to Attention
.
RMSNorm
RMSNorm(keras$layers$Layer) %py_class% {
  initialize <-
    function(eps = 1e-6, ..., block_id = NULL, feeds_into = NULL) {
      super$initialize(...)
      self$eps <- eps
      self$block_id <- block_id
      self$feeds_into <- feeds_into
    }

  build <- function(input_shape) {
    # input_shape == (batch_size, seqlen, params$dim)
    # self$w will broadcast over batch_size and seqlen dims.
    # w_shape == (1, 1, params$dim)
    w_shape <- rep(1L, length(input_shape))
    w_shape[length(input_shape)] <- as.integer(input_shape) |> tail(1L)

    # define a local function that will load
    # the pretrained weights if we supplied `block_id` and `feeds_into`
    import_from({self}, block_id, feeds_into)
    initializer <- if (is.null(block_id))
      "ones"
    else if (block_id >= 0) {
      \(...) weights_path("7B/layers.{block_id}.{feeds_into}_norm.weight.npy") |>
        np$load() |> np$expand_dims(0:1)
    } else if (block_id == -1)
      # load weights for the final output normalization layer, which is not
      # part of a TransformerBlock
      \(...) weights_path("7B/norm.weight.npy") |>
        np$load() |> np$expand_dims(0:1)

    self$w <- self$add_weight(shape = w_shape,
                              initializer = initializer,
                              trainable = TRUE)
  }

  rrms <- function(x) {
    # reciprocal root mean square along the last axis
    x %>%                                             # (batch_size, seqlen, n_features)
      tf$math$square() %>%
      tf$reduce_mean(axis = -1L, keepdims = TRUE) %>% # (batch_size, seqlen, 1)
      tf$math$add(self$eps) %>%                       # for numerical stability
      tf$math$rsqrt()
  }

  call <- function(x) {
    x * self$rrms(x) * self$w
  }
}
RMSNorm()
has a single trainable tensor w
. In the forward pass, each
value in the input is multiplied by the reciprocal-root-mean-square of
all the values in the feature axis and by w
. Certainly a mouthful, but
just a simple sequence of arithmetic transformations in the end,
designed for the express purpose of adjusting the range of values
passing through.
Let’s kick the tires on it:
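A quick check along these lines (a sketch; the test values are an assumption chosen to match the output shown below):
norm <- RMSNorm()
m <- matrix(c(0, 1, 2, 3), nrow = 2)   # rows: (0, 2) and (1, 3)

norm(m)                                # our layer
# the same computation, written directly with TF ops
x <- tf$convert_to_tensor(m, dtype = "float32")
x * tf$math$rsqrt(tf$reduce_mean(x^2, axis = -1L, keepdims = TRUE) + 1e-6)
# and once more in plain R
m / sqrt(rowMeans(m^2) + 1e-6)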
tf.Tensor(
[[0. 1.4142132 ]
[0.44721353 1.3416406 ]], shape=(2, 2), dtype=float32)
tf.Tensor(
[[0. 1.4142137 ]
[0.44721362 1.3416408 ]], shape=(2, 2), dtype=float32)
tf.Tensor(
[[0. 1.4142137]
[0.4472136 1.3416408]], shape=(2, 2), dtype=float32)
FeedForward
Next up is FeedForward()
FeedForward(keras$layers$Layer) %py_class% {

  initialize <- function(hidden_dim, multiple_of = 256L,
                         ..., block_id = NULL) {
    super$initialize()

    if(!is.null(multiple_of)) {
      hidden_dim <- hidden_dim %>%
        { as.integer( . * (2/3)) } %>%
        { (. + multiple_of - 1) %/% multiple_of } %>%
        { . * multiple_of }
    }

    self$hidden_dim <- hidden_dim
    self$block_id <- block_id
  }

  build <- function(input_shape) {
    output_dim <- input_shape |> as.integer() |> tail(1)

    if(is.null(self$block_id))
      load_weight <- \(...) NULL
    else
      load_weight <- \(name) \(...) np$load(weights_path(
        "7B/layers.{self$block_id}.feed_forward.{name}.weight.npy"))$`T`

    self$w1 <- Dense(self$hidden_dim, use_bias = FALSE,
                     kernel_initializer = load_weight("w1"))
    self$w2 <- Dense(output_dim, use_bias = FALSE,
                     kernel_initializer = load_weight("w2"))
    self$w3 <- Dense(self$hidden_dim, use_bias = FALSE,
                     kernel_initializer = load_weight("w3"))

    super$build(input_shape)
  }

  call <- function(x) {
    import_from({self}, w1, w2, w3)
    import_from(tf$nn, silu)
    x %>%
      { silu(w1(.)) * w3(.) } %>% # SwiGLU
      w2()
  }
}
FeedForward
consists of three Dense
layers. initialize
does some
simple arithmetic, munging on the input value hidden_dim
to ensure the
size is a performant multiple of 256, and build
is mostly boilerplate
for creating the layers and loading the weights.
The novelty of FeedForward()
is in the call()
method, where rather
than composing the Dense
layers in a conventional sequential model
with, say, ReLU activations in between and maybe some dropout, the
layers are composed to form a "SwiGLU" unit. The publication by Shazeer (2020)
of SwiGLU and other variations on GLU is an exemplar of the kinds
of explorations and improvements around the Transformer architecture
since its initial publication in
2017; a steady accretion of
improvements that has brought us to today. The FeedForward$call() is
just a single SwiGLU followed by a linear projection. In its essence,
it's a clever composition of three (learned) linear projections, an
element-wise multiplication, and a silu() activation
function.
Perhaps the most surprising observation to make here is the relative
dearth of activation functions, or even non-linearities, not just in
FeedForward, but overall. The silu() in this feed-forward, the
reciprocal-root-mean-square in RMSNorm(), and a softmax() in
Attention() are the only non-linear transformations in the whole
sequence of TransformerBlocks. Everything else is a linear
transformation!
Attention
Finally, let's turn our attention to Attention().
Attention(keras$layers$Layer) %py_class% {
  initialize <- function(head_size, n_heads,
                         ..., block_id = NULL) {
    super$initialize(...)

    self$head_size <- head_size
    self$n_heads <- n_heads

    if (is.null(block_id))
      load_weight <- function(name) NULL
    else
      load_weight <- \(name) \(...) np$load(weights_path(
        "7B/layers.{block_id}.attention.{name}.weight.npy"))$`T`

    Dense <- function(name) keras$layers$Dense(
      units = n_heads * head_size,
      use_bias = FALSE,
      kernel_initializer = load_weight(name)
    )

    self$wq <- Dense("wq")
    self$wk <- Dense("wk")
    self$wv <- Dense("wv")
    self$wo <- Dense("wo")
  }

  call <- function(x) {
    c(batch_size, seqlen, n_features) %<-% tf$unstack(tf$shape(x))

    # 1. project (linear transform) x into
    #    query, key, and value tensors
    # 2. reshape q k v, splitting out the last dim (n_features)
    #    into n_heads independent subspaces,
    #    each with size head_size.
    #    (n_features == head_size * n_heads)
    split_heads_shape <- c(batch_size, seqlen,
                           self$n_heads, self$head_size)
    q <- x |> self$wq() |> tf$reshape(split_heads_shape)
    k <- x |> self$wk() |> tf$reshape(split_heads_shape)
    v <- x |> self$wv() |> tf$reshape(split_heads_shape)

    # embed positional information in query and key
    # (bsz, seqlen, n_heads, head_size)
    q %<>% apply_rotary_embedding()
    k %<>% apply_rotary_embedding()

    # reshape:
    #   move heads out of the last 2 axes,
    #   so later matmuls are performed across the subspaces (heads)
    #   between (seqlen, head_size) axes
    v <- tf$transpose(v, c(0L, 2L, 1L, 3L)) # (bsz, n_heads, seqlen, head_size)
    q <- tf$transpose(q, c(0L, 2L, 1L, 3L)) # (bsz, n_heads, seqlen, head_size)
    k <- tf$transpose(k, c(0L, 2L, 3L, 1L)) # (bsz, n_heads, head_size, seqlen)

    # calculate and normalize attention scores
    scores <- q %*% k                       # (bsz, n_heads, seqlen, seqlen)
    scores <- scores / sqrt(self$head_size) # scale

    # apply causal mask, so the model can't "look ahead" during training
    mask <- make_mask(seqlen, dtype = scores$dtype)
    scores %<>% { . + mask }

    scores <- tf$nn$softmax(scores, axis = -1L)

    # adjust values tensor with attention scores
    # scores (bsz, n_heads, seqlen, seqlen)
    # v      (bsz, n_heads, seqlen, head_size)
    output <- scores %*% v # (bsz, n_heads, seqlen, head_size)

    # combine heads back into a single features dim,
    # so Attention output_shape==input_shape
    output <- output |>
      tf$transpose(c(0L, 2L, 1L, 3L)) |> # (bsz, seqlen, n_heads, head_size)
      tf$reshape(tf$shape(x))            # (bsz, seqlen, n_heads * head_size)

    # one more trainable linear projection for good luck
    output <- self$wo(output) # (bsz, seqlen, n_heads * head_size)

    output
  }
}
Attention in LLaMA is similar but not identical to the Attention
described in the original Transformers paper (and available as a Keras
builtin under keras$layers$MultiHeadAttention()). The core novelty is
the addition of the apply_rotary_embedding() function, which we'll
describe shortly. The additional novelty is balanced by the simplicity
that comes from the fact that the layer is performing self-attention:
we don't need to pass in different query, key, and value tensors (or
reason about what that means), since the same input serves all three
roles. Note that the conventional MultiHeadAttention() layer is covered
quite thoroughly in the 2nd Edition of Deep Learning with R,
including a full implementation of attention in base R.
To develop an understanding of the mechanics in a layer like this, it's
helpful to temporarily unsee some of the minutiae that can act as a fog
obscuring the essence of the operation. In this instance, if we
temporarily strip out the transpose()s and reshape()s (as clever and
important as they are), this is what's left:
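In pseudo code, roughly (a sketch, with the head-splitting reshapes and transposes elided):
# pseudo code (not runnable): Attention$call() stripped down
q <- x |> self$wq()
k <- x |> self$wk()
v <- x |> self$wv()

q %<>% apply_rotary_embedding()
k %<>% apply_rotary_embedding()

scores <- softmax((q %*% k) / sqrt(head_size) + mask)
output <- (scores %*% v) |> self$wo()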
Returning to the transpose()
s and reshapes()
, you can observe that
their purpose is to make it so that the attention calculations are
performed across n_heads
independent subspaces, rather than in a
single larger space. The same reasoning drives this decision as that
driving usage of depthwise-separable convolutions in image models.
Empirically, for a fixed compute budget, factoring features into
independent subspaces performs better than doing the same core
operations in a single larger feature space. As with all things, there is
a balance to strike between n_heads
(the number of subspaces) and
head_dim
(the size of each subspace). The LLaMA authors have struck
the balance like this at the various model sizes:
# A tibble: 4 × 3
llama_size n_heads head_dim
<chr> <int> <int>
1 7B 32 128
2 13B 40 128
3 30B 52 128
4 65B 64 128
Next let's turn our attention to the causal attention mask.
make_mask <- function(seqlen, dtype = k_floatx()) {
  x <- tf$range(seqlen)
  mask <- tf$where(x[, tf$newaxis] < x[tf$newaxis, ],
                   tf$constant(-Inf, dtype = dtype),
                   tf$constant(0, dtype = dtype))

  # broadcast over batch and heads dim
  mask[tf$newaxis, tf$newaxis, , ] # (1, 1, seqlen, seqlen)
}
The mask is a strictly upper triangular matrix filled with -Inf
values. Adding the mask to the attention scores prevents the model from
being able to "look ahead" and see the attention score for a token
pairing it hasn't seen yet at a particular position in the sequence.
This need for a mask is best thought of as a vestige from training,
an apparatus that the model needed to learn with and now it can't function without.
During training, gradients are calculated for predictions from all
token positions in a sequence, including predictions for positions where the correct
answer is right there, as the very next token in the same sequence. The mask
prevents the model from being able to cheat and look ahead into the future,
something it won't be able to do once we're running it for inference.
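For example, here is the mask for a sequence of length 5:
make_mask(seqlen = 5L)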
tf.Tensor(
[[[[  0. -inf -inf -inf -inf]
   [  0.   0. -inf -inf -inf]
   [  0.   0.   0. -inf -inf]
   [  0.   0.   0.   0. -inf]
   [  0.   0.   0.   0.   0.]]]], shape=(1, 1, 5, 5), dtype=float32)
Rotary Position Embedding
Next let's turn our attention to apply_rotary_embedding(). This core
innovation was published by Su et al. (2022) in the paper titled
"RoFormer: Enhanced Transformer with Rotary Position Embedding".
Some context:
- The bare Attention() mechanism doesn't leave any possibility for a
  token's position in a sequence to affect the attention scores, since
  only token-pairs are scored. Attention treats its input like a
  bag-of-tokens.
- The position of a token in a sequence is clearly important, and the
  attention layer should have access to that information.
- The absolute position of a token in a sequence is less important
  than the relative position between tokens. (Especially so for long
  sequences.)
Which leads us into the complex plane. If we imagine the features as
complex numbers, we can rotate them, and we can calculate angles between
them. From the RoFormer paper:
Specifically, incorporating the relative position embedding is
straightforward: simply rotate the affine-transformed word embedding
vector by amount of angle multiples of its position index and thus
interprets the intuition behind Rotary Position Embedding
Expanding slightly: the rotation matrix is designed so that
subsequently, after rotating our q and k token sequence embeddings
the same way, the angle between token features is a function of the
relative distance between those tokens in the token sequence. The
relative angle between two tokens is invariant to the absolute
position of those tokens in the full sequence.
In short, the rotation injects positional information. The meaning or
interpretability of that positional information, or how it is meant to
be used, or even extracted from the result of q %*% k, is left to the
model to learn.
Here is the code:
apply_rotary_embedding <- function(x) {
  c(., seqlen, ., head_size) %<-%
    tf$unstack(tf$shape(x))

  rotation_matrix <- compute_rotation_matrix(seqlen, head_size)

  x %>%
    view_as_complex() %>%
    { . * rotation_matrix } %>%
    view_as_real()
}

compute_rotation_matrix <-
  function(seqlen, feature_dim, theta = 10000) {
    # `feature_dim` here is going to be attention$head_size
    # `seqlen` is going to match the token sequence length.

    t <- tf$range(seqlen, dtype = tf$float32)
    freqs <- tf$range(start = 0, limit = 1, delta = 1 / (feature_dim %/% 2),
                      dtype = tf$float32)
    tf_assert(tf$size(freqs) == feature_dim %/% 2)
    freqs <- 1.0 / (theta ^ freqs)

    # outer product; (seqlen, head_size/2)
    freqs <- tf$einsum('a,b->ab', t, freqs)

    rot_mat <- tf$complex(tf$cos(freqs), tf$sin(freqs))

    # the positional embedding will be broadcast across batch and heads dim
    rot_mat[tf$newaxis, , tf$newaxis, ] # (1, seqlen, 1, head_size/2)
  }

view_as_complex <- function(x) {
  tf$complex(x[all_dims(), `::2`],
             x[all_dims(), `2::2`])
}

view_as_real <- function(x) {
  # xs = (..., f); xs2 = (..., f*2)
  xs <- tf$shape(x)
  xs2 <- tf$concat(list(xs[1:(length(xs) - 1)],
                        xs[length(xs), drop = FALSE] * 2L),
                   axis = 0L)

  x2 <- tf$stack(list(Re(x), Im(x)), axis = -1L)

  # (..., f, 2) -> (..., f*2)
  tf$reshape(x2, xs2)
}
As you can see, to imagine the embedding features as existing in the
complex plane, we merely treat adjacent pairs of floats in the
underlying array as the real and imaginary part of a complex number. We
rotate the embeddings in the complex plane, then go back to imagining
the features as existing in the real plane. Again, the job of
interpreting the meaning of the features after rotation is left to the
model to learn.
We can quickly confirm that the rotary embeddings only rotate features
and don’t scale them:
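One way to check (a sketch; the sizes and tolerance are arbitrary): every entry of the rotation matrix should have magnitude 1.
rot <- compute_rotation_matrix(seqlen = 512L, feature_dim = 128L)
tf$reduce_all(abs(Mod(rot) - 1) < 1e-5)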
tf.Tensor(True, shape=(), dtype=bool)
There is one more trick to observe before moving on: because of some of
the mathematical properties of the rotation matrix, it’s possible to
avoid doing a full complex multiply operation and still arrive at the
same result. Also, since the rotation matrix never changes, it makes
sense to only compute it once and cache it, like so:
precomputed_rotation_matrix <- compute_rotation_matrix(
  seqlen = 2048L,                              # LLaMA max seqlen
  feature_dim = with(params, dim %/% n_heads)  # head_size
)

apply_rotary_embedding_faster <- function(x) {

  rotate_every_two <- function(x) {
    x1 <- x[all_dims(), `::2`]
    x2 <- x[all_dims(), `2::2`]
    x_ <- tf$stack(list(-x2, x1), axis = -1L)
    tf$reshape(x_, tf$shape(x))
  }

  repeat_each_twice <- function(x) {
    tf$`repeat`(x, 2L, axis = -1L)
  }

  seqlen <- tf$shape(x)[2]
  rot <- precomputed_rotation_matrix[, NA:seqlen, , ]

  cos <- Re(rot) |> repeat_each_twice()
  sin <- Im(rot) |> repeat_each_twice()

  (x * cos) + (rotate_every_two(x) * sin)
}
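And a quick check (a sketch, using an arbitrary random input) that the fast path agrees with the original implementation:
x <- tf$random$normal(as.integer(c(1, 10, params$n_heads, params$dim %/% params$n_heads)))
tf$reduce_all(abs(apply_rotary_embedding(x) - apply_rotary_embedding_faster(x)) < 1e-4)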
tf.Tensor(True, shape=(), dtype=bool)
Finally, note that the rotary positional embeddings are applied within
each Attention
layer. This is different from the original Transformer
implementation, where a positional embedding was only added once at the
head of the model. Similar to residual connections, you can think of the
presence of these repeated injections of positional information as
relieving the remaining trainable layers from the burden of allocating
some of their weights to the task of “passing through” or “preserving”
the positional information for later layers.
Positional embeddings are a rich topic that also comes up in other
deep learning architectures, like denoising diffusion (Falbel and Keydana 2023),
so time spent understanding them better is time well
spent. For the purposes of this blog post we've covered the points
needed and we'll move on to tying all the pieces together. To go deeper and
develop a more mathematically informed understanding of RoPE, two excellent
starting points are:
- The original paper by Su et al. (2022)
- This blog post by Biderman et al. (2021)
Tying it all together
With Tokenizer, Embedding, TransformerBlock (RMSNorm,
Attention, FeedForward, and apply_rotary_embedding) all covered,
it's time to tie all the pieces together into a Transformer model. We
could do this using %py_class% like with the other layers above, but
it's just as easy to move over to using the Keras functional API at this
point.
layer_transformer_block <- create_layer_wrapper(TransformerBlock)
layer_rms_norm <- create_layer_wrapper(RMSNorm)

# input to the model will be output from the tokenizer
input <- layer_input(shape(NA)) #, dtype = "int32")

x <- input |>
  tok_embeddings()  # instantiated earlier in the blogpost

for(block_id in seq_len0(params$n_layers))
  x <- x |>
    layer_transformer_block(attn_head_size = params$dim %/% params$n_heads,
                            attn_n_heads = params$n_heads,
                            norm_eps = params$norm_eps,
                            block_id = block_id)

# final output projection into logits of output tokens
x <- x |>
  layer_rms_norm(block_id = -1, eps = params$norm_eps) |>
  layer_dense(
    tokenizer$vocab_size(), use_bias = FALSE,
    kernel_initializer = \(...) np$load(weights_path("7B/output.weight.npy"))$`T`
  )

# slice out the logits for the last token
with_options(c(tensorflow.extract.warn_negatives_pythonic = FALSE), {
  output <- x[, -1, ]
})

llama <- keras_model(input, output) %>%
  compile(jit_compile = TRUE)
The input to the model is tokenized text and the output is the
(unnormalized) probabilities for each token in tokenizer$vocab_size()
being the next token in the sequence.
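For example, running the model on our prompt from earlier (a sketch):
prompt <- "The best way to attract bees"
token_ids <- prompt %>% tokenizer$tokenize() %>% tf$expand_dims(0L)  # add a batch dim
(logits <- llama(token_ids))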
tf.Tensor(
[[2.4503722e+00 3.4463339e+00 1.3200411e+01 ... 4.8804146e01
1.3277926e+00 9.9985600e03]], shape=(1, 32000), dtype=float32)
Sampling strategies for selecting a token from the token logits is a
rich topic, (also covered thoroughly in the Deep Learning with
R book), but this blog post is long enough
already. So for now, let's just take the argmax().
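For example (a sketch):
(next_token_id <- tf$argmax(logits, axis = -1L, output_type = tf$int32))
# assumes as.character() converts a scalar string tensor to an R string
next_token_id %>% tokenizer$detokenize() %>% as.character()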
tf.Tensor([304], shape=(1), dtype=int32)
[1] "to"
Let's run it for a few tokens and let LLaMA finish the sentence:
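A naive greedy decoding loop might look something like this (a sketch; the number of new tokens is arbitrary):
token_ids <- prompt %>% tokenizer$tokenize() %>% tf$expand_dims(0L)

for (i in 1:20) {
  next_id   <- tf$argmax(llama(token_ids), axis = -1L, output_type = tf$int32)
  token_ids <- tf$concat(list(token_ids, next_id[, tf$newaxis]), axis = -1L)
}

token_ids[1, ] %>% tokenizer$detokenize() %>% as.character()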
The best way to attract bees to your garden is to plant a
variety of flowers that bloom at different times.
Wrapping up
In this blog post we’ve walked through the LLaMA architecture
implemented in R TensorFlow, including how to load pretrained weights,
and then run the model to generate a sentence. Note, much of the code in
this blog post is tailored for didactic purposes. While the
implementation of the LLaMA architecture covered in this blog post is
appropriate for training, there are a few modifications you’ll want to
make before doing a lot of text generation. Those include things like:
- In the Attention layer, caching the k and v tensors. Then,
  after the first forward pass with the initial prompt, only feeding
  the model the one new token from the sampler(), rather than
  feeding the model all the tokens of the full prompt on each forward
  pass.
- Only generating the causal mask make_mask() and rotary_matrix
  slices once per forward pass, instead of within each Attention
  call.
- Updating the TransformerBlock to be cache-aware and to pass
  through the appropriate arguments to Attention().
- Wrapping all the additional bookkeeping logic in a custom
  TransformerDecoder() class.
The changes required to implement these optimizations for inference
balloon the code size and are mostly about bookkeeping, so we won’t go
through them in this blog post. However, you can find a fuller
implementation of LLaMA in R TensorFlow, including a cache-aware
generate()
method that only feeds the model one token at a time during
the main inference loop, (and compiles to XLA!),
here.
That's all for now. Thanks for reading and happy travels to all
exploring this exciting LLM terrain!
Photo by Sébastien Goldberg on Unsplash
Biderman, Stella, Sid Black, Charles Foster, Leo Gao, Eric Hallahan, Horace He, Ben Wang, and Phil Wang. 2021.
"Rotary Embeddings: A Relative Revolution." blog.eleuther.ai/rotary-embeddings/.
Falbel, Daniel, and Sigrid Keydana. 2023.
"Posit AI Blog: De-noising Diffusion with Torch." https://blogs.rstudio.com/tensorflow/posts/2023-04-13-denoising-diffusion/.
Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, et al. 2022.
"Training Compute-Optimal Large Language Models." https://arxiv.org/abs/2203.15556.
Shazeer, Noam. 2020.
"GLU Variants Improve Transformer." https://arxiv.org/abs/2002.05202.
Su, Jianlin, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2022.
"RoFormer: Enhanced Transformer with Rotary Position Embedding." https://arxiv.org/abs/2104.09864.
Touvron, Hugo, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, et al. 2023.
"LLaMA: Open and Efficient Foundation Language Models." https://doi.org/10.48550/ARXIV.2302.13971.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017.
"Attention Is All You Need." https://arxiv.org/abs/1706.03762.