NumPy-style broadcasting for R TensorFlow customers

We develop, practice, and deploy TensorFlow fashions from R. However that doesn’t imply we don’t make use of documentation, weblog posts, and examples written in Python. We glance up particular performance within the official TensorFlow API docs; we get inspiration from different folks’s code.

Relying on how snug you might be with Python, there’s an issue. For instance: You’re alleged to know the way broadcasting works. And maybe, you’d say you’re vaguely accustomed to it: So when arrays have completely different shapes, some components get duplicated till their shapes match and … and isn’t R vectorized anyway?

Whereas such a world notion may fit on the whole, like when skimming a weblog submit, it’s not sufficient to know, say, examples within the TensorFlow API docs. On this submit, we’ll attempt to arrive at a extra actual understanding, and examine it on concrete examples.

Talking of examples, listed here are two motivating ones.

Broadcasting in motion

The primary makes use of TensorFlow’s matmul to multiply two tensors. Would you wish to guess the end result – not the numbers, however the way it comes about on the whole? Does this even run with out error – shouldn’t matrices be two-dimensional (rank-2 tensors, in TensorFlow communicate)?

a <- tf$fixed(keras::array_reshape(1:12, dim = c(2, 2, 3)))
a 
# tf.Tensor(
# [[[ 1.  2.  3.]
#   [ 4.  5.  6.]]
# 
#  [[ 7.  8.  9.]
#   [10. 11. 12.]]], form=(2, 2, 3), dtype=float64)

b <- tf$fixed(keras::array_reshape(101:106, dim = c(1, 3, 2)))
b  
# tf.Tensor(
# [[[101. 102.]
#   [103. 104.]
#   [105. 106.]]], form=(1, 3, 2), dtype=float64)

c <- tf$matmul(a, b)

Second, here’s a “actual instance” from a TensorFlow Likelihood (TFP) github issue. (Translated to R, however protecting the semantics).
In TFP, we will have batches of distributions. That, per se, is no surprise. However take a look at this:

library(tfprobability)
d <- tfd_normal(loc = c(0, 1), scale = matrix(1.5:4.5, ncol = 2, byrow = TRUE))
d
# tfp.distributions.Regular("Regular", batch_shape=[2, 2], event_shape=[], dtype=float64)

We create a batch of 4 regular distributions: every with a distinct scale (1.5, 2.5, 3.5, 4.5). However wait: there are solely two location parameters given. So what are their scales, respectively?
Fortunately, TFP builders Brian Patton and Chris Suter defined the way it works: TFP truly does broadcasting – with distributions – identical to with tensors!

We get again to each examples on the finish of this submit. Our essential focus can be to elucidate broadcasting as finished in NumPy, as NumPy-style broadcasting is what quite a few different frameworks have adopted (e.g., TensorFlow).

Earlier than although, let’s rapidly overview just a few fundamentals about NumPy arrays: index or slice them (indexing usually referring to single-element extraction, whereas slicing would yield – nicely – slices containing a number of components); learn how to parse their shapes; some terminology and associated background.
Although not sophisticated per se, these are the sorts of issues that may be complicated to rare Python customers; but they’re usually a prerequisite to efficiently making use of Python documentation.

Said upfront, we’ll actually prohibit ourselves to the fundamentals right here; for instance, we gained’t contact advanced indexing which – identical to tons extra –, will be seemed up intimately within the NumPy documentation.

Few information about NumPy

Primary slicing

For simplicity, we’ll use the phrases indexing and slicing kind of synonymously any further. The essential machine here’s a slice, particularly, a begin:cease construction indicating, for a single dimension, which vary of components to incorporate within the choice.

In distinction to R, Python indexing is zero-based, and the tip index is unique:

import numpy as np
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

x[1:7] 
# array([1, 2, 3, 4, 5, 6])

x[5:] 
# array([5, 6, 7, 8, 9])

x[:7]
# array([0, 1, 2, 3, 4, 5, 6])

x[:] 
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

x = np.array([[1, 2], [3, 4], [5, 6]])
x
# array([[1, 2],
#        [3, 4],
#        [5, 6]])

x[1, :] 
# array([3, 4])

x[1] 
# array([3, 4])

x[1, ] 
# array([3, 4])

While the second one sure looks a bit like R, the mechanism is different. Technically, these start:stop things are parts of a Python tuple – that list-like, but immutable data structure that can be written with or without parentheses, e.g., 1,2 or (1,2) –, and whenever we have more dimensions in the array than elements in the tuple NumPy will assume we meant : for that dimension: Just select everything.

We can see that moving on to three dimensions. Here is a 2 x 3 x 1-dimensional array:

x = np.array([[[1],[2],[3]], [[4],[5],[6]]])
x
# array([[[1],
#         [2],
#         [3]],
# 
#        [[4],
#         [5],
#         [6]]])

x.shape
# (2, 3, 1)

x[0,]
#array([[1],
#       [2],
#       [3]])

x[0, ...]
#array([[1],
#       [2],
#       [3]])

We stop here with our selection of essential (yet confusing, possibly, to infrequent Python users) Numpy indexing features; re. “possibly confusing” though, here are a few remarks about array creation.

Syntax for array creation

Creating a more-dimensional NumPy array is not that hard – depending on how you do it. The trick is to use reshape to tell NumPy exactly what shape you want. For example, to create an array of all zeros, of dimensions 3 x 4 x 2:

np.zeros(24).reshape(4, 3, 2)

c1 = np.array([[[0, 0, 0]]])
c2 = np.array([[[0], [0], [0]]]) 
c3 = np.array([[[0]], [[0]], [[0]]])

c1.shape # (1, 1, 3)
c2.shape # (1, 3, 1)
c3.shape # (3, 1, 1)

but we’d like to be able to “parse” internally without executing the code. One way to think about it would be processing the brackets like a state machine, every opening bracket moving one axis to the right and every closing bracket moving back left by one axis. Let us know if you can think of other – possibly more helpful – mnemonics!

In the very last sentence, we on purpose used “left” and “right” referring to the array axes; “out there” though, you’ll also hear “outmost” and “innermost”. Which, then, is which?

A bit of terminology

In common Python (TensorFlow, for example) usage, when talking of an array shape like (2, 6, 7), outmost is left and innermost is right. Why?
Let’s take a simpler, two-dimensional example of shape (2, 3).

a = np.array([[1, 2, 3], [4, 5, 6]])
a
# array([[1, 2, 3],
#        [4, 5, 6]])

Computer memory is conceptually one-dimensional, a sequence of locations; so when we create arrays in a high-level programming language, their contents are effectively “flattened” into a vector. That flattening could occur “by row” (row-major, C-style, the default in NumPy), resulting in the above array ending up like this

1 2 3 4 5 6

or “by column” (column-major, Fortran-style, the ordering used in R), yielding

1 4 2 5 3 6

for the above example.

Now if we see “outmost” as the axis whose index varies the least often, and “innermost” as the one that changes most quickly, in row-major ordering the left axis is “outer”, and the right one is “inner”.

Just as a (cool!) aside, NumPy arrays have an attribute called strides that stores how many bytes have to be traversed, for each axis, to arrive at its next element. For our above example:

c1 = np.array([[[0, 0, 0]]])
c1.shape   # (1, 1, 3)
c1.strides # (24, 24, 8)

c2 = np.array([[[0], [0], [0]]]) 
c2.shape   # (1, 3, 1)
c2.strides # (24, 8, 8)

c3 = np.array([[[0]], [[0]], [[0]]])
c3.shape   # (3, 1, 1) 
c3.strides # (8, 8, 8)

For array c3, every element is on its own on the outmost level; so for axis 0, to jump from one element to the next, it’s just 8 bytes. For c2 and c1 though, everything is “squished” in the first element of axis 0 (there is just a single element there). So if we wanted to jump to another, nonexisting-as-yet, outmost item, it’d take us 3 * 8 = 24 bytes.

At this point, we’re ready to talk about broadcasting. We first stay with NumPy and then, examine some TensorFlow examples.

NumPy Broadcasting

What happens if we add a scalar to an array? This won’t be surprising for R users:

a = np.array([1,2,3])
b = 1
a + b

array([2, 3, 4])

Technically, this is already broadcasting in action; b is virtually (not physically!) expanded to shape (3,) in order to match the shape of a.

How about two arrays, one of shape (2, 3) – two rows, three columns –, the other one-dimensional, of shape (3,)?

a = np.array([1,2,3])
b = np.array([[1,2,3], [4,5,6]])
a + b

array([[2, 4, 6],
       [5, 7, 9]])

The one-dimensional array gets added to both rows. If a were length-two instead, would it get added to every column?

a = np.array([1,2,3])
b = np.array([[1,2,3], [4,5,6]])
a + b

ValueError: operands could not be broadcast together with shapes (2,) (2,3)

So now it is time for the broadcasting rule. For broadcasting (virtual expansion) to happen, the following is required.

We align array shapes, starting from the right.

   # array 1, shape:     8  1  6  1
   # array 2, shape:        7  1  5

Starting to look from the right, the sizes along aligned axes either have to match exactly, or one of them has to be 1: In which case the latter is broadcast to the one not equal to 1.
If on the left, one of the arrays has an additional axis (or more than one), the other is virtually expanded to have a 1 in that place, in which case broadcasting will happen as stated in (2).

Stated like this, it probably sounds incredibly simple. Maybe it is, and it only seems complicated because it presupposes correct parsing of array shapes (which as shown above, can be confusing)?

Here again is a quick example to test our understanding:

a = np.zeros([2, 3]) # shape (2, 3)
b = np.zeros([2])    # shape (2,)
c = np.zeros([3])    # shape (3,)

a + b # error

a + c
# array([[0., 0., 0.],
#        [0., 0., 0.]])

All in accord with the rules. Maybe there’s something else that makes it confusing?
From linear algebra, we are used to thinking in terms of column vectors (often seen as the default) and row vectors (accordingly, seen as their transposes). What now is

, of shape – as we’ve seen a few times by now – (2,)? Really it’s neither, it’s just some one-dimensional array structure. We can create row vectors and column vectors though, in the sense of 1 x n and n x 1 matrices, by explicitly adding a second axis. Any of these would create a column vector:

# start with the above "non-vector"
c = np.array([0, 0])
c.shape
# (2,)

# way 1: reshape
c.reshape(2, 1).shape
# (2, 1)

# np.newaxis inserts new axis
c[ :, np.newaxis].shape
# (2, 1)

# None does the same
c[ :, None].shape
# (2, 1)

# or construct directly as (2, 1), paying attention to the parentheses...
c = np.array([[0], [0]])
c.shape
# (2, 1)

c = np.array([[0], [0]])
c.shape
# (2, 1)

a = np.zeros([2, 3])
a.shape
# (2, 3)
a + c
# array([[0., 0., 0.],
#       [0., 0., 0.]])

a = np.zeros([3, 2])
a.shape
# (3, 2)
a + c
# ValueError: operands could not be broadcast together with shapes (3,2) (2,1)

a = np.array([0.0, 10.0, 20.0, 30.0])
a.shape
# (4,)

b = np.array([1.0, 2.0, 3.0])
b.shape
# (3,)

a[:, np.newaxis] * b
# array([[ 0.,  0.,  0.],
#        [10., 20., 30.],
#        [20., 40., 60.],
#        [30., 60., 90.]])

TensorFlow

If by now, you’re feeling less than enthusiastic about hearing a detailed exposition of how TensorFlow broadcasting differs from NumPy’s, there is good news: Basically, the rules are the same. However, when matrix operations work on batches – as in the case of matmul and friends – , things may still get complicated; the best advice here probably is to carefully read the documentation (and as always, try things out).

Before revisiting our introductory matmul example, we quickly check that really, things work just like in NumPy. Thanks to the tensorflow R package, there is no reason to do this in Python; so at this point, we switch to R – attention, it’s 1-based indexing from here.

First check – (4, 1) added to (4,) should yield (4, 4):

a <- tf$ones(shape = c(4L, 1L))
a
# tf.Tensor(
# [[1.]
#  [1.]
#  [1.]
#  [1.]], form=(4, 1), dtype=float32)

b <- tf$fixed(c(1, 2, 3, 4))
b
# tf.Tensor([1. 2. 3. 4.], form=(4,), dtype=float32)

a + b
# tf.Tensor(
# [[2. 3. 4. 5.]
# [2. 3. 4. 5.]
# [2. 3. 4. 5.]
# [2. 3. 4. 5.]], form=(4, 4), dtype=float32)

And second, after we add tensors with shapes (3, 3) and (3,), the 1-d tensor ought to get added to each row (not each column):

a <- tf$fixed(matrix(1:9, ncol = 3, byrow = TRUE), dtype = tf$float32)
a
# tf.Tensor(
# [[1. 2. 3.]
#  [4. 5. 6.]
#  [7. 8. 9.]], form=(3, 3), dtype=float32)

b <- tf$fixed(c(100, 200, 300))
b
# tf.Tensor([100. 200. 300.], form=(3,), dtype=float32)

a + b
# tf.Tensor(
# [[101. 202. 303.]
#  [104. 205. 306.]
#  [107. 208. 309.]], form=(3, 3), dtype=float32)

Now again to the preliminary matmul instance.

Again to the puzzles

The documentation for matmul says,

The inputs should, following any transpositions, be tensors of rank >= 2 the place the inside 2 dimensions specify legitimate matrix multiplication dimensions, and any additional outer dimensions specify matching batch dimension.

So right here (see code slightly below), the inside two dimensions look good – (2, 3) and (3, 2) – whereas the one (one and solely, on this case) batch dimension reveals mismatching values 2 and 1, respectively.
A case for broadcasting thus: Each “batches” of a get matrix-multiplied with b.

a <- tf$fixed(keras::array_reshape(1:12, dim = c(2, 2, 3)))
a 
# tf.Tensor(
# [[[ 1.  2.  3.]
#   [ 4.  5.  6.]]
# 
#  [[ 7.  8.  9.]
#   [10. 11. 12.]]], form=(2, 2, 3), dtype=float64)

b <- tf$fixed(keras::array_reshape(101:106, dim = c(1, 3, 2)))
b  
# tf.Tensor(
# [[[101. 102.]
#   [103. 104.]
#   [105. 106.]]], form=(1, 3, 2), dtype=float64)

c <- tf$matmul(a, b)
c
# tf.Tensor(
# [[[ 622.  628.]
#   [1549. 1564.]]
# 
#  [[2476. 2500.]
#   [3403. 3436.]]], form=(2, 2, 2), dtype=float64)

Let’s rapidly examine this actually is what occurs, by multiplying each batches individually:

tf$matmul(a[1, , ], b)
# tf.Tensor(
# [[[ 622.  628.]
#   [1549. 1564.]]], form=(1, 2, 2), dtype=float64)

tf$matmul(a[2, , ], b)
# tf.Tensor(
# [[[2476. 2500.]
#   [3403. 3436.]]], form=(1, 2, 2), dtype=float64)

Is it too bizarre to be questioning if broadcasting would additionally occur for matrix dimensions? E.g., might we strive matmuling tensors of shapes (2, 4, 1) and (2, 3, 1), the place the 4 x 1 matrix could be broadcast to 4 x 3? – A fast take a look at reveals that no.

To see how actually, when coping with TensorFlow operations, it pays off overcoming one’s preliminary reluctance and really seek the advice of the documentation, let’s strive one other one.

Within the documentation for matvec, we’re instructed:

Multiplies matrix a by vector b, producing a * b.
The matrix a should, following any transpositions, be a tensor of rank >= 2, with form(a)[-1] == form(b)[-1], and form(a)[:-2] capable of broadcast with form(b)[:-1].

In our understanding, given enter tensors of shapes (2, 2, 3) and (2, 3), matvec ought to carry out two matrix-vector multiplications: as soon as for every batch, as listed by every enter’s leftmost dimension. Let’s examine this – up to now, there isn’t a broadcasting concerned:

# two matrices
a <- tf$fixed(keras::array_reshape(1:12, dim = c(2, 2, 3)))
a
# tf.Tensor(
# [[[ 1.  2.  3.]
#   [ 4.  5.  6.]]
# 
#  [[ 7.  8.  9.]
#   [10. 11. 12.]]], form=(2, 2, 3), dtype=float64)

b = tf$fixed(keras::array_reshape(101:106, dim = c(2, 3)))
b
# tf.Tensor(
# [[101. 102. 103.]
#  [104. 105. 106.]], form=(2, 3), dtype=float64)

c <- tf$linalg$matvec(a, b)
c
# tf.Tensor(
# [[ 614. 1532.]
#  [2522. 3467.]], form=(2, 2), dtype=float64)

Doublechecking, we manually multiply the corresponding matrices and vectors, and get:

tf$linalg$matvec(a[1,  , ], b[1, ])
# tf.Tensor([ 614. 1532.], form=(2,), dtype=float64)

tf$linalg$matvec(a[2,  , ], b[2, ])
# tf.Tensor([2522. 3467.], form=(2,), dtype=float64)

The identical. Now, will we see broadcasting if b has only a single batch?

b = tf$fixed(keras::array_reshape(101:103, dim = c(1, 3)))
b
# tf.Tensor([[101. 102. 103.]], form=(1, 3), dtype=float64)

c <- tf$linalg$matvec(a, b)
c
# tf.Tensor(
# [[ 614. 1532.]
#  [2450. 3368.]], form=(2, 2), dtype=float64)

Multiplying each batch of a with b, for comparability:

tf$linalg$matvec(a[1,  , ], b)
# tf.Tensor([ 614. 1532.], form=(2,), dtype=float64)

tf$linalg$matvec(a[2,  , ], b)
# tf.Tensor([[2450. 3368.]], form=(1, 2), dtype=float64)

It labored!

Now, on to the opposite motivating instance, utilizing tfprobability.

Broadcasting in all places

Right here once more is the setup:

library(tfprobability)
d <- tfd_normal(loc = c(0, 1), scale = matrix(1.5:4.5, ncol = 2, byrow = TRUE))
d
# tfp.distributions.Regular("Regular", batch_shape=[2, 2], event_shape=[], dtype=float64)

What’s going on? Let’s examine location and scale individually:

d$loc
# tf.Tensor([0. 1.], form=(2,), dtype=float64)

d$scale
# tf.Tensor(
# [[1.5 2.5]
#  [3.5 4.5]], form=(2, 2), dtype=float64)

Simply specializing in these tensors and their shapes, and having been instructed that there’s broadcasting occurring, we will cause like this: Aligning each shapes on the suitable and increasing loc’s form by 1 (on the left), we now have (1, 2) which can be broadcast with (2,2) – in matrix-speak, loc is handled as a row and duplicated.

That means: Now we have two distributions with imply (0) (considered one of scale (1.5), the opposite of scale (3.5)), and in addition two with imply (1) (corresponding scales being (2.5) and (4.5)).

Right here’s a extra direct technique to see this:

d$imply()
# tf.Tensor(
# [[0. 1.]
#  [0. 1.]], form=(2, 2), dtype=float64)

d$stddev()
# tf.Tensor(
# [[1.5 2.5]
#  [3.5 4.5]], form=(2, 2), dtype=float64)

Puzzle solved!

Summing up, broadcasting is straightforward “in principle” (its guidelines are), however might have some training to get it proper. Particularly together with the truth that features / operators do have their very own views on which elements of its inputs ought to broadcast, and which shouldn’t. Actually, there isn’t a manner round trying up the precise behaviors within the documentation.

Hopefully although, you’ve discovered this submit to be begin into the subject. Possibly, just like the writer, you are feeling such as you may see broadcasting occurring wherever on the planet now. Thanks for studying!