
Linear Algebra in Python

A thorough Linear Algebra bootcamp for the machine learning practitioner

Guangyuan (Frank) Li · Published in The Startup · Jan 24, 2021


As a data scientist or machine learning practitioner, how good is your linear algebra?

Whether your answer to this question is positive or negative, hopefully, after reading this post and practicing a bit, you will be able to grasp most of the Linear Algebra you need for your day-to-day work.

Well… I know it may sound a bit exaggerated. How is that possible? The reason is that we do not need to know, or actually calculate, everything by hand. Off-the-shelf packages like Python's NumPy or MATLAB have already done a lot of the hard work for us under the hood. But in order to use these tools correctly, we still need a comprehensive understanding of Linear Algebra itself, and that is exactly what I am going to cover in this article.

The code will be posted on my Github repository: https://github.com/frankligy/ScPyT

Concepts you need to know in Linear Algebra

It’s key to speak the language of Linear Algebra, which allows you to better communicate with Linear Algebra experts and to understand more advanced concepts. I have always found that the most difficult part for beginners delving into the linear algebra world is its somewhat counterintuitive terminology. So I decided to explain everything clearly from the start. The concepts I will cover are based on my own learning notes from Prof. Gilbert Strang's MIT OpenCourseWare and the Deep Learning book written by Dr. Ian Goodfellow et al., but I add my own understanding and distill them into a coherent whole.

I will provide a brief explanation of each concept and, where helpful, introduce the related NumPy functions so you can play with them while reading, but I would also like to refer you to Wikipedia whenever you want to know more.

Let’s get started!

Vector space, subspace, span, column space, row space, null space, left-null space, rank, basis, orthogonal matrix, symmetric matrix

Imagine we have a matrix A. In Python, A.T means the transpose of A, and @ means matrix multiplication.

import numpy as np

a = np.random.randn(2,3)
a
# array([[ 0.39,  0.54, -0.71],
#        [-1.84, -0.6 ,  0.53]])
a.T
# array([[ 0.39, -1.84],
#        [ 0.54, -0.6 ],
#        [-0.71,  0.53]])
b = np.random.randn(3,2)
b
# array([[ 1., -0.],
#        [ 1., -0.],
#        [-3.,  1.]])
a @ b
# array([[-1.09762048,  0.96911979],
#        [-1.97859822, -3.15604207]])

What is a span? The span of a set of vectors is the set of all linear combinations of these vectors. Think about the vectors (0,1) and (1,0): the span of these two vectors is the whole x-y plane.

Then think about the x-y-z 3D space (the vector space R3). It has a basis: (1,0,0), (0,1,0), and (0,0,1), since every element of the vector space can be written as a linear combination of the basis elements.

Another concept is the subspace: the span of the vectors (1,0,0) and (0,1,0) constitutes a subspace of the x-y-z 3D vector space R3 (a subspace is a subset of a larger vector space that is itself a vector space).

The column space of A is the span of all column vectors of A, and the row space of A is the span of all row vectors of A. If A @ x = 0, the set of all solutions x constitutes the null space of A. If A.T @ x = 0, the set of all solutions x constitutes the left-null space of A. These four spaces keep occurring in lots of linear algebra tutorials.
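As a quick illustration of these spaces, here is a minimal sketch (it assumes SciPy is installed; scipy.linalg.null_space returns an orthonormal basis of the null space, and the matrix below is a made-up example):

import numpy as np
from scipy.linalg import null_space

# a 2x3 matrix whose second row is a multiple of the first, so rank(A) = 1
A = np.array([[1., 2., 3.],
              [2., 4., 6.]])

ns = null_space(A)              # orthonormal basis of the null space of A
ns.shape                        # (3, 2): two basis vectors, since 3 - rank(A) = 2
np.allclose(A @ ns, 0)          # True: every basis vector solves A @ x = 0

left_ns = null_space(A.T)       # basis of the left-null space (null space of A.T)
np.allclose(A.T @ left_ns, 0)   # True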

How many linearly independent column vectors are there in matrix A? This is the rank of A. To better understand linear independence: for instance, (1,2,3) and (10,20,30) are linearly dependent, because (10,20,30) is a multiple of (1,2,3)! Then how do we compute the rank of a matrix?

x = np.random.rand(2,3)
np.linalg.matrix_rank(x)
# 2
# So, x has two linearly independent column vectors

If two vectors' inner product is 0 and they are both of unit length, they are orthonormal vectors. If matrix A's rows and columns are orthonormal vectors, then A is an orthogonal matrix. In math, this means A.T @ A = A @ A.T = I.

A symmetric matrix satisfies A = A.T. Also, the terms symmetric matrix and orthogonal matrix only apply to real-valued matrices; for complex matrices we have the counterparts, namely the Hermitian matrix and the unitary matrix. We will cover them in a later part.
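These definitions are easy to check numerically. Here is a minimal sketch (the Q factor of a QR decomposition, covered later in this post, is used simply as a convenient source of an orthogonal matrix):

import numpy as np

# an orthogonal matrix: A.T @ A = A @ A.T = I
q, _ = np.linalg.qr(np.random.randn(3,3))   # the Q factor of a QR decomposition is orthogonal
np.allclose(q.T @ q, np.eye(3))             # True

# a symmetric matrix: A = A.T
a = np.random.randn(3,3)
sym = a + a.T                               # a matrix plus its transpose is symmetric
np.allclose(sym, sym.T)                     # True

# a Hermitian matrix (the complex counterpart): equal to its conjugate transpose
c = np.random.randn(3,3) + 1j * np.random.randn(3,3)
herm = c + c.conj().T
np.allclose(herm, herm.conj().T)            # True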

Inverse, determinant

You can compute the determinant of any square matrix.

x = np.random.rand(4,4)
np.linalg.det(x)
# 0.08437697073565216

Only an invertible square matrix (determinant != 0) has an inverse. If A @ B = I, then B is A's inverse matrix.
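For instance, a minimal sketch (np.random.rand almost surely returns an invertible matrix; in real code you would check that the determinant is not just non-zero but also not too close to zero):

import numpy as np

A = np.random.rand(4,4)
np.linalg.det(A)                    # non-zero (almost surely), so A is invertible
A_inv = np.linalg.inv(A)
np.allclose(A @ A_inv, np.eye(4))   # True: A @ A_inv = I, so A_inv is A's inverse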

Gaussian Elimination, row echelon form, reduced echelon form, leading coefficient, pivot, elementary row operation

Gaussian Elimination can be used to (a) solve a linear system Ax = b, (b) compute the inverse matrix, (c) compute the rank, and (d) compute the determinant (for details please refer to Wikipedia), via a series of elementary row operations: (1) swap two rows, (2) scale a row, (3) add a multiple of one row to another.

There are two other concepts that we should be familiar with,

Row echelon form: the first non-zero element from the left in each row (the leading coefficient, or pivot) is always strictly to the right of the pivot in the row above.

Reduced row echelon form: a row echelon form whose pivots are 1 and whose pivot columns are zero everywhere else.

The process of Gaussian Elimination aims to obtain a row echelon form, which is a simplified form of any linear system. The ultimate goal is a reduced row echelon form, which allows you to solve Ax = b just by looking at the form itself.
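In practice you rarely carry out the elimination by hand: np.linalg.solve performs an LU-based elimination under the hood. A minimal sketch for Ax = b (the system below is made up for illustration):

import numpy as np

A = np.array([[ 2.,  1., -1.],
              [-3., -1.,  2.],
              [-2.,  1.,  2.]])
b = np.array([8., -11., -3.])

x = np.linalg.solve(A, b)   # solves A @ x = b via LU factorization (Gaussian elimination)
x                           # array([ 2.,  3., -1.])
np.allclose(A @ x, b)       # True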

Inner product, outer product, hadamard product, Projection, Gram-Schmidt process

First, let’s understand the inner product, outer product, Hadamard product of two 1D arrays:

a = np.array([4,5,6])
b = np.array([7,8,9])
np.inner(a,b)   # 122
np.outer(a,b)
# array([[28, 32, 36],
#        [35, 40, 45],
#        [42, 48, 54]])
a * b           # element-wise or Hadamard product
# array([28, 40, 54])

How do we project one vector (a) onto another vector (b)?

# Projection of a onto b
a = np.array([4,5,6])
b = np.array([7,8,9])
proj_b_a = np.inner(a,b) / np.inner(b,b) * b
# array([4.40206186, 5.03092784, 5.65979381])

The Gram-Schmidt process orthonormalizes a set of vectors (v1, v2, v3 … vn) into a set (u1, u2, u3 … un) that spans the same space, in which every vector has unit length and the vectors are mutually orthogonal. In a sense, it transforms a set of vectors into a set of orthonormal vectors spanning the same vector space. It is very helpful since, in quite a lot of cases, your matrix is required to be orthogonal. Hence, the Gram-Schmidt process can be your go-to approach when it comes to orthonormalizing anything (a set of vectors or a matrix).
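Below is a minimal, unoptimized sketch of the classical Gram-Schmidt process written for this article (in practice, np.linalg.qr, covered next, is the more numerically stable way to obtain an orthonormal set):

import numpy as np

def gram_schmidt(vectors):
    # orthonormalize a list of linearly independent vectors (classical Gram-Schmidt)
    basis = []
    for v in vectors:
        # subtract the projections of v onto every vector already in the basis
        w = v - sum(np.inner(v, u) * u for u in basis)
        basis.append(w / np.linalg.norm(w))   # scale to unit length
    return np.array(basis)

U = gram_schmidt([np.array([3., 1.]), np.array([2., 2.])])
np.allclose(U @ U.T, np.eye(2))   # True: the resulting vectors (rows of U) are orthonormal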

LU decomposition, QR decomposition, Cholesky decomposition.

LU decomposition aims to decompose a matrix (it need not be square) into a lower triangular matrix (L, entries above the diagonal are 0) and an upper triangular matrix (U, entries below the diagonal are 0). It is related to Gaussian elimination but has better computational efficiency. In order to make the LU decomposition work, sometimes we need to reorder the rows of the matrix using a permutation matrix P.

import scipy.linalg

a = np.random.randn(3,4)
p, l, u = scipy.linalg.lu(a)
# p
# array([[0., 1., 0.],
#        [1., 0., 0.],
#        [0., 0., 1.]])
# l
# array([[ 1.        ,  0.        ,  0.        ],
#        [-0.71426919,  1.        ,  0.        ],
#        [-0.85039495,  0.12237106,  1.        ]])
# u
# array([[-2.09219711, -0.48959089,  0.81707073,  0.77602155],
#        [ 0.        , -1.53448255, -1.72785249,  0.04775144],
#        [ 0.        ,  0.        ,  1.5052947 , -0.27769281]])

QR decomposition aims to decompose a matrix into an orthogonal matrix (Q) and an upper triangular matrix (R). It is used in the QR algorithm and for solving linear least squares problems. Also, the resulting matrix Q is sometimes exactly what we want from the Gram-Schmidt process.

a = np.random.randn(3,4)
q, r = np.linalg.qr(a)
# q
# array([[-0.47569189,  0.71339517,  0.51457221],
#        [-0.82969021, -0.55818202,  0.00685511],
#        [ 0.29211536, -0.42367461,  0.85741964]])
# r
# array([[-1.34089268, -1.73408897, -0.07436536,  0.78464807],
#        [ 0.        , -1.66272812,  0.63477604, -1.60036506],
#        [ 0.        ,  0.        ,  0.43098896, -0.31316029]])

Cholesky decomposition, instead, decomposes a Hermitian (or symmetric) positive-definite matrix into a lower triangular matrix L and its conjugate transpose (or transpose) L.H. This again is used in a lot of algorithms for numerical convenience.

x = np.diagflat([[1,2], [3,4]])
L = np.linalg.cholesky(x)
# L
# array([[1.        , 0.        , 0.        , 0.        ],
#        [0.        , 1.41421356, 0.        , 0.        ],
#        [0.        , 0.        , 1.73205081, 0.        ],
#        [0.        , 0.        , 0.        , 2.        ]])

Eigen-decomposition, diagonalization, Characteristic polynomial.

I will illustrate Eigen-decomposition using a square matrix: an n*n matrix has n eigenvectors and n cognate eigenvalues (the figure shows the 3*3 case; the code example below uses a 4*4 matrix).

Eigen-decomposition (Image by Author)

ei = np.random.randn(4,4)
w, v = np.linalg.eig(ei)
# w: eigenvalues
# array([-2.21516912+1.65582705j, -2.21516912-1.65582705j,
#         1.45929568+0.99548974j,  1.45929568-0.99548974j])
# v: eigenvectors, shape (4,4)
# array([[-0.1070701 -0.40447805j, -0.1070701 +0.40447805j,
#         -0.03773179+0.56113399j, -0.03773179-0.56113399j],
#        [ 0.15294619-0.0172865j ,  0.15294619+0.0172865j ,
#          0.65763758+0.j        ,  0.65763758-0.j        ],
#        [ 0.12433426+0.47215025j,  0.12433426-0.47215025j,
#          0.3236955 +0.37453337j,  0.3236955 -0.37453337j],
#        [-0.75023815+0.j        , -0.75023815-0.j        ,
#          0.00770123+0.07813095j,  0.00770123-0.07813095j]])

Diagonalization has a lot to do with Eigen-Decomposition. Let us denote the original matrix as A; then by doing Eigen-Decomposition (see the figure above), A can be decomposed into three parts: V (the eigenvector matrix) times D times V.I (the inverse of V; think about multiplying both sides of A @ V = V @ D by V.I). Since D is a diagonal matrix, we call this process the diagonalization of the original matrix A. In general, if a matrix is symmetric (as in the figure below), then it is diagonalizable.

A symmetric matrix is diagonalizable (Image by Author)
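We can verify the diagonalization numerically. A minimal sketch (a symmetric matrix is constructed on purpose so that the decomposition is guaranteed to exist and to be real):

import numpy as np

a = np.random.randn(4,4)
A = a + a.T                    # a symmetric matrix is always diagonalizable
w, V = np.linalg.eig(A)        # eigenvalues w, eigenvectors as the columns of V
D = np.diag(w)                 # diagonal matrix with the eigenvalues on its diagonal

np.allclose(A, V @ D @ np.linalg.inv(V))   # True: A = V @ D @ V.I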

A square matrix A (n*n) has a characteristic polynomial, p(t) = det(A - t*I), which is a polynomial with respect to t; its roots are exactly the eigenvalues of A.
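NumPy can recover the coefficients of this polynomial for a square matrix via np.poly. A quick sketch (the 2*2 matrix below is made up for illustration):

import numpy as np

A = np.array([[2., 1.],
              [1., 2.]])
np.poly(A)             # array([ 1., -4.,  3.]): p(t) = t^2 - 4t + 3 = (t - 1)(t - 3)
np.roots(np.poly(A))   # array([3., 1.]): the roots are exactly the eigenvalues of A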

Moore-Penrose pseudo-inverse, hermitian, conjugate transpose, full-rank matrix

We know that only an invertible square matrix has an inverse. What about a non-square matrix? For that we have the pseudo-inverse.

The Moore-Penrose pseudo-inverse (A+, n*m) of a rectangular matrix (A, m*n) is defined as the matrix A+ that meets four particular conditions:
(1) AA+A = A
(2) A+AA+ = A+
(3) (AA+)* = AA+
(4) (A+A)* = A+A

The symbol * means conjugate transpose: for a complex matrix (meaning the entries may be complex numbers), the conjugate transpose first transposes the matrix and then conjugates every entry (flipping the sign of the imaginary part and leaving the real part untouched).

a = np.random.randn(3,4)
np.linalg.pinv(a)
# array([[-0.09988152,  0.23637842,  0.29312192],
#        [-1.07695892,  0.49075645, -0.36409413],
#        [ 0.03471007,  0.57286174, -0.13192266],
#        [ 0.83630477, -0.20498149,  0.30882525]])

A hermitian matrix is a complex square matrix whose conjugate transpose is equal to itself. Its real value counterpart is a symmetric matrix.

If A is a full-rank matrix (rank = min(m, n)), A+ can be expressed as:
(1) A+ = (A* A)^(-1) A*, when A has independent columns
(2) A+ = A* (A A*)^(-1), when A has independent rows
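We can check the independent-columns formula against np.linalg.pinv. A minimal sketch for a real-valued matrix, where the conjugate transpose is simply .T:

import numpy as np

A = np.random.randn(5,3)                       # 5x3, almost surely full column rank (rank 3)
pinv_formula = np.linalg.inv(A.T @ A) @ A.T    # A+ = (A* A)^(-1) A* for independent columns
np.allclose(pinv_formula, np.linalg.pinv(A))   # True
np.allclose(pinv_formula @ A, np.eye(3))       # True: here A+ acts as a left inverse of A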

Singular Value Decomposition (SVD), singular matrix, complex unitary matrix

Only a square matrix has eigenvalues; a rectangular matrix can have singular values. A square matrix whose determinant is 0 is a singular matrix, or in other words, it is non-invertible.

A complex matrix is unitary if its conjugate transpose happens to be its inverse: U*U = UU* = I. The real-valued counterpart is the orthogonal matrix.

So a rectangular matrix A can be decomposed into three components: U, S, and Vh, where the diagonal of the S matrix contains the singular values of A. The column vectors of U are called the left singular vectors and the column vectors of V are called the right singular vectors. This decomposition is called Singular Value Decomposition (SVD). SVD makes it really convenient to compute the pseudo-inverse, the matrix rank, and cross decompositions. Furthermore, SVD is also the default solver for PCA in quite a lot of machine learning packages (e.g. scikit-learn). Understanding SVD can be thought of as a milestone in your journey through Linear Algebra.

svd = np.random.rand(4,5)
u, s, vh = np.linalg.svd(svd)
# u, shape (4,4)
# array([[ 0.66581103,  0.61992005, -0.21849383, -0.3530655 ],
#        [ 0.47769768,  0.06749606,  0.56339472,  0.67069784],
#        [ 0.44673911, -0.66240419,  0.31234136, -0.51389467],
#        [ 0.35906096, -0.41516755, -0.73300048,  0.40177286]])
# s contains the singular values (the diagonal of S, where S would be of shape (4,5))
# array([2.35262119, 0.87561858, 0.31537598, 0.043907  ])
# vh, shape (5,5)
# array([[ 0.29743919,  0.569923  ,  0.3576598 ,  0.52216343,  0.43144237],
#        [-0.61102456,  0.15085062,  0.50163189, -0.46496796,  0.36886763],
#        [ 0.49893385,  0.39660661, -0.29349238, -0.68317228,  0.2022525 ],
#        [ 0.52538762, -0.60314147,  0.5557502 , -0.15677528,  0.16355866],
#        [-0.11494248, -0.3624299 , -0.47481454,  0.14088043,  0.78111244]])
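To see why SVD is so convenient, here is a minimal sketch that rebuilds the matrix from its three factors and then recovers both the rank and the pseudo-inverse from the singular values:

import numpy as np

A = np.random.rand(4,5)
u, s, vh = np.linalg.svd(A)

S = np.zeros((4,5))          # build the 4x5 matrix S with the singular values on its diagonal
np.fill_diagonal(S, s)
np.allclose(A, u @ S @ vh)   # True: A = U @ S @ Vh

np.sum(s > 1e-10)            # 4, the rank: the number of non-zero singular values

S_pinv = np.zeros((5,4))     # pseudo-inverse: invert the non-zero singular values...
np.fill_diagonal(S_pinv, 1 / s)
np.allclose(vh.T @ S_pinv @ u.T, np.linalg.pinv(A))   # ...and transpose the factors. True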

Lp Norm, Frobenius norm, einsum

What are the L1 and L2 norms? Basically, they are different metrics for measuring a vector's distance from the origin. Let's define the more general Lp norm:

Lp norm of a vector x: ||x||p = (Σi |xi|^p)^(1/p)
Frobenius norm of a matrix A: ||A||F = sqrt(Σi Σj |aij|^2)

It is worth noting that you usually use the L2 norm to express the magnitude of a vector; this can be seen in the projection section above, where the denominator np.inner(b,b) is the squared L2 norm of b.

For a matrix, we usually encounter Frobenius norm like above.

a = np.array([4,5,6])
np.linalg.norm(a, ord=3)       # L3 norm of vector a
# 7.398636222991409
x = np.random.rand(2,3)
np.linalg.norm(x, ord='fro')   # Frobenius norm of matrix x
# 1.309085506183174

Finally, I’d like to introduce the Einstein summation function (einsum) in the NumPy package. This somewhat magical function provides a generic solution for nearly all the basic matrix operations you will encounter, including the inner product, outer product, matrix multiplication, trace, diagonal, Hadamard product, summation, row sums, column sums, and transpose.

Basically, this function takes two kinds of arguments: the first is called the “subscripts” string and the rest are the “operands”. Let us look at a matrix multiplication example to understand it:

x = np.random.rand(2,3)
y = np.random.rand(5,3)
np.einsum('ij,kj -> ik', x, y)
# array([[0.3676856 , 0.33156855, 0.39793874, 0.70939856, 0.8353566 ],
#        [0.19219345, 0.19312881, 1.36739081, 1.0606612 , 1.15039307]])

As we can see, we use i and j to represent the dimensions of the operand x, and k and j to represent the dimensions of the operand y, and we specify the output matrix to be of dimension i * k. The einsum function automatically figures out that this is a matrix multiplication problem. With that understood, let's delve into all the interesting operations the einsum function can perform. A basic rule is that any dimension that appears in the input subscripts but not in the output subscript is summed over.

# einsum
x = np.random.rand(2,3)
np.einsum('ij -> ji', x)   # transpose
np.einsum('ij ->', x)      # sum of all elements
np.einsum('ij -> i', x)    # row sums (sum over the column index j)
np.einsum('ij -> j', x)    # column sums (sum over the row index i)

x = np.random.rand(2,3)
y = np.random.rand(5,3)
np.einsum('ij,kj -> ik', x, y)   # matrix multiplication (x @ y.T)

a = np.array([4,5,6])
b = np.array([7,8,9])
np.einsum('i,i ->', a, b)      # inner product
np.einsum('i,j -> ij', a, b)   # outer product
np.einsum('i,i -> i', a, b)    # Hadamard product

y = np.random.rand(5,5)      # diagonal and trace require a square matrix
np.einsum('ii -> i', y)      # diagonal
np.einsum('ii ->', y)        # trace

Singular, invertible, positive definite, positive semidefinite

This set of concepts always confuses beginners, so let's tease them apart. When we ask whether a square matrix is invertible, i.e. whether we can call np.linalg.inv() on it, we need to know whether the matrix is singular.

A singular matrix is characterized by having an eigenvalue equal to 0, and its determinant is 0. A positive-definite matrix has all eigenvalues greater than 0; a positive-semidefinite matrix has all eigenvalues non-negative.
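A quick numerical check of these definitions (a minimal sketch; A @ A.T is used because it is always positive-semidefinite, and adding the identity shifts every eigenvalue up by 1, which makes it positive-definite):

import numpy as np

a = np.random.randn(4,4)
psd = a @ a.T          # positive-semidefinite by construction
pd = psd + np.eye(4)   # every eigenvalue shifted up by 1: positive-definite

np.all(np.linalg.eigvalsh(psd) >= -1e-10)   # True: all eigenvalues are non-negative
np.all(np.linalg.eigvalsh(pd) > 0)          # True: all eigenvalues are strictly positive

singular = np.array([[1., 2.],
                     [2., 4.]])             # rank 1: one eigenvalue is 0
np.isclose(np.linalg.det(singular), 0)      # True: determinant is 0, so np.linalg.inv would fail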

How important is Linear Algebra?

Not everyone needs to know Linear Algebra, and to what extent you should understand it depends heavily on your type of work. But it is reasonable to forecast that machine learning and other statistical models will become more and more accessible, and by then, simply knowing how to run a machine learning model might not be enough to make you stand out. So I would suggest equipping yourself with the necessary knowledge now, to make sure you stay competitive in the foreseeable future.

Thanks for reading! Next time I will cover statistical models in Python. If you like this article, follow me on Medium; thank you so much for your support. Connect with me on Twitter or LinkedIn, and please let me know if you have any questions or what kind of NumPy tutorials you would like to see in the future!

Github Repository:

https://github.com/frankligy/ScPyT
