ScPyT

Revisiting NumPy and ndarray

Lesser-known but useful knowledge and tricks about the ndarray object and the NumPy package

Guangyuan(Frank) Li

Published in

Towards Data Science

8 min read · Jan 16, 2021

I recently started a hard task: reading the whole NumPy documentation, in particular the API reference of the latest release of NumPy (1.19). The reason is simple. I am a bioinformatics PhD student, and my research focuses on developing computational tools for a wide spectrum of biological problems. As my project progressed, I found that my lack of NumPy knowledge greatly hindered my ability to find good solutions quickly and accurately, and I wasted a huge amount of time searching for particular commands on sites like Stack Overflow. To make my life easier, I asked myself: why not delve directly into the official NumPy documentation, which could fill most of my knowledge gaps and save me a lot of time?

So I read the documentation and, to be completely honest with you, I learned A LOT from it. Now I want to share the tips and tricks I picked up over the past few weeks, and hopefully you will find them useful.

I decided to name this series Scientific Computing in Python because I will also cover linear algebra and statistical modeling. This is a huge topic, so I have to divide it into several parts:

Revisiting the NumPy package and the ndarray object (this article)
Linear Algebra in Python
Statistical modeling in Python (stay tuned)

All the content I summarize here comes from NumPy’s official API reference. What I hope makes this article valuable is that I distilled the material based on my own working experience and organized it in a way that should be easy to follow and understand. You can find all the code on my GitHub page:

https://github.com/frankligy/ScPyT

Now, let’s start the journey!

How is an ndarray stored in memory?

Let’s create one ndarray:

import numpy as np
a = np.array([[1,2,3],[4,5,6]])

One attribute of ndarray that you might not know of:

a.strides
# (24, 8)

When you access the strides attribute, it returns the tuple (24, 8). What does it mean? To understand it, we have to know how an ndarray is stored in memory.

At the low level, data have to be stored in a flat 1D buffer. The 2D and 3D arrays that we take for granted and use every day actually rely on a careful memory layout. Let’s use the array a as an example.

This 2D array is flattened and stored in memory row by row, as 1, 2, 3, 4, 5, 6. As you may have noticed, each item in the array is an integer, which by default on most 64-bit platforms takes 8 bytes, hence 64 bits. Now you know what np.int64 means, right? It means each integer takes 64 bits of space in memory.

Now let’s go back to our a.strides output, (24, 8). The first value says that along axis=0 (moving from one row to the next), stepping one unit means jumping 24 bytes in memory: 8 bytes per integer times the three elements in each row. Concretely, jumping from 1 to 4 takes 24 bytes. The second value, 8, says that along axis=1 (moving from one column to the next), stepping one unit takes 8 bytes, since those items are adjacent in memory.
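The stride arithmetic can be checked directly. The snippet below is a minimal sketch, assuming the array is created with an explicit dtype=np.int64 so the numbers don’t depend on the platform’s default integer. The helper byte_offset is a hypothetical name for illustration, not a NumPy function.

```python
import numpy as np

# Fix the dtype explicitly so each item is 8 bytes on every platform.
a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)

def byte_offset(arr, index):
    """Hypothetical helper: byte offset of arr[index] from the start of the buffer."""
    return sum(i * s for i, s in zip(index, arr.strides))

offset_of_4 = byte_offset(a, (1, 0))   # one step along axis 0
offset_of_6 = byte_offset(a, (1, 2))   # one step along axis 0, two along axis 1

# Reading the flat buffer at that byte offset recovers the same value as a[1, 2].
raw = a.tobytes()
value_at_offset = np.frombuffer(raw, dtype=np.int64, count=1, offset=offset_of_6)[0]

# A transpose is a view that just swaps the strides; no data is moved.
transposed_strides = a.T.strides

print(a.strides, offset_of_4, offset_of_6, value_at_offset, transposed_strides)
```

The offset of any element is simply the sum of index times stride over all axes, which is why operations like transposing can be done without copying data.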

Do you understand the dtype object in NumPy?

When you create an ndarray, a dtype object is automatically inferred and associated with the array. For instance,

d = np.array([1,2,3])
d.dtype
# dtype('int64')

Its dtype is an ‘int64’ object, meaning the elements in the array are all integers and each integer occupies 8 bytes, or 64 bits, in memory. There is a more explicit way to write this:

d.dtype == np.dtype('<i8')
# True

‘<i8’ is a more explicit way to define an ‘int64’ dtype object, but how do we interpret it? It breaks down into three parts:

‘<’ or ‘>’ is the byte order: ‘<’ means little-endian and ‘>’ means big-endian. It describes how the bytes of each item are laid out at the low level, and in most cases it doesn’t affect how we understand the dtype object.
‘i’ means integer. Some other frequently used codes are worth memorizing: ‘?’ means boolean, ‘u’ means unsigned integer, ‘f’ means float, ‘O’ means Python object, and ‘U’ means Unicode string.
‘8’ is the number of bytes each item takes in memory. To generalize, ‘<f8’ should be ‘float64’, right? The one place where special care is needed is Unicode strings: there, the third part is not the number of bytes but the length of the string. For instance, ‘<U8’ means a Unicode string of length 8.
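To see the three-part notation in action, here is a small sketch that decodes a handful of dtype strings into their kind, item size, and name. The list of codes and the variable names are my own choices for illustration.

```python
import numpy as np

# A few dtype strings in the "byte order + kind + size" notation.
codes = ['<i8', '<f8', '<u2', '?', '<U8']

# For each code, collect (kind character, bytes per item, long name).
summary = {}
for code in codes:
    dt = np.dtype(code)
    summary[code] = (dt.kind, dt.itemsize, dt.name)

for code, info in summary.items():
    print(code, info)

# The same strings can be used anywhere a dtype is accepted, e.g. in astype().
as_float32 = np.array([1, 2, 3]).astype('<f4')
print(as_float32.dtype.name)
```

Note that for ‘<U8’ the item size is 32 bytes rather than 8, which is the Unicode exception described in the list above.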

To further understand dtype, let’s use the ‘<U8’ example again:

d_type = np.dtype('<U8')
d_type.byteorder
# '='
d_type.itemsize
# 32
d_type.name
# 'str256'

So a dtype object has three easy-to-access attributes for its byte order, its item size, and a shortcut name. (The byte order shows as ‘=’, meaning native, because ‘<’ matches the native order of a little-endian machine.) Let’s take a closer look. U8 means a string of up to 8 characters. NumPy stores each character as a fixed-width 4-byte Unicode code point (UCS-4/UTF-32), not as variable-length UTF-8, so 8 characters occupy 32 bytes. str256 means this string takes up 256 bits. By comparison, with ASCII encoding each character takes 1 byte; for instance, the letter A is number 65 in the ASCII table, which fits in 1 byte (8 binary bits).
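A quick way to convince yourself of the 4-bytes-per-character layout is to look at the raw bytes. This is a minimal sketch; the example strings and variable names are made up for illustration.

```python
import numpy as np

u8 = np.dtype('<U8')
bytes_per_char = u8.itemsize // 8          # 32 bytes spread over 8 characters

# Even a one-character string reserves the full 32 bytes.
short = np.array(['A'], dtype='<U8')
first_char_bytes = short.tobytes()[:4]     # 'A' is code point 65, padded to 4 bytes

# Strings longer than the declared length are silently truncated.
names = np.array(['Alexandria'], dtype='<U8')
truncated = str(names[0])

# The byte-string kind 'S' uses 1 byte per character instead.
s8 = np.dtype('S8')

print(bytes_per_char, short.nbytes, first_char_bytes, truncated, s8.itemsize)
```

The silent truncation is worth remembering: if you assign longer strings into a ‘<U8’ array, anything past the eighth character is dropped without a warning.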

Structured array and Record array

As a side note, I picked these examples because I use them all the time in my daily work. So please don’t view them as fancy tricks meant to impress you; they are genuinely useful commands in lots of real-world situations.

Let’s create a structured array (the NumPy documentation’s name for what is sometimes called a structural array):

sa = np.array([('John',[88,95,100]),('Mary',[77,88,68])],
              dtype=[('student','<U8'),('grades','<i4',(3,))])

Suppose we want to use an array to represent the following situation: there are two students, “John” and “Mary”. John’s math, English, and biology grades are 88, 95, and 100, respectively, and Mary’s are 77, 88, and 68. We can lay this structure out as a table:

student | grades (Math, English, Biology)
John | 88, 95, 100
Mary | 77, 88, 68

Here we have two top-level fields, “student” and “grades”. The “grades” field is itself a 1D array of shape (3,), whose entries correspond to “Math”, “English”, and “Biology”, respectively. In the structured array sa, we define the field names (column names) and the dtype of each column, plus an optional shape for the field. Each row/observation is then given as a tuple.

The benefit of a structured array lies not only in organizing data compactly but also in letting you access each field by name:

sa['student']
# array(['John', 'Mary'], dtype='<U8')

There is another object similar to the structured array, called a record array. We first convert the structured array sa to a record array ra:

ra = sa.view(np.recarray)
type(ra)
# numpy.recarray

The added benefit of a record array is that you can access each field as an attribute:

ra.student
# array(['John', 'Mary'], dtype='<U8')

Other than that, structured arrays and record arrays look pretty similar.
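To show why field access is handy in practice, here is a short sketch that reuses the sa array from above and computes a few things from the “grades” field. The derived variable names (mean_per_student, top_math, and so on) are my own, chosen for illustration.

```python
import numpy as np

sa = np.array([('John', [88, 95, 100]), ('Mary', [77, 88, 68])],
              dtype=[('student', '<U8'), ('grades', '<i4', (3,))])

# Each record packs a 32-byte name and three 4-byte integers.
record_size = sa.dtype.itemsize

# A field with shape (3,) comes back as a regular 2D array: one row per student.
grades = sa['grades']
mean_per_student = grades.mean(axis=1)

# Fields can be combined: the student with the highest math grade (column 0).
top_math = str(sa['student'][grades[:, 0].argmax()])

# The record array is a view on the same data, not a copy.
ra = sa.view(np.recarray)
same_names = np.array_equal(ra.student, sa['student'])
shares = np.shares_memory(ra, sa)

print(record_size, grades.shape, mean_per_student, top_math, same_names, shares)
```

Because ra is only a view, converting between the two forms is cheap, and changes made through one are visible through the other.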

Slicing and Indexing

There are three ways to index and slice an ndarray. According to the official documentation, they are:

basic indexing
advanced indexing, called advanced slicing below (numeric, boolean)
field accessing

I’ve already covered field access in the section on structured and record arrays. Of the remaining two, basic indexing means using either an integer or a slice object to index the ndarray:

b = np.array([[1,2,3,4,5,6,7,8,9,10],
             [4,5,6,7,8,9,20,11,12,13],
             [1,2,3,4,5,6,7,8,9,9]])
b0 = b[1:3,4:7]  # basic indexing

Here 1:3 is Python slice notation, equivalent to slice(1,3,1). Using a single integer like 1 or 3 is also basic indexing. The caveat is that basic indexing with slices only creates a view on the original array b; it doesn’t copy the data. To illustrate, we change a value in the sliced array b0:

b0[0,1] = 99

Then let’s see what happens to the original b array:

[Figure: modifying a “view” changes the original array]

Do you see the obvious “99”? It now sits at b[1,5], where the original value was 9. We accidentally changed the original array b, which can have detrimental effects, so please be aware of this. Whenever you use basic indexing with slices, you are creating a “view” instead of a full “copy”.
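If you are ever unsure whether you are holding a view or a copy, you can ask NumPy. The sketch below rebuilds the same b array and uses the .base attribute and np.shares_memory to check, then shows .copy() as one way to avoid the side effect. The variable names are illustrative only.

```python
import numpy as np

b = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
              [4, 5, 6, 7, 8, 9, 20, 11, 12, 13],
              [1, 2, 3, 4, 5, 6, 7, 8, 9, 9]])

b0 = b[1:3, 4:7]                       # basic indexing returns a view

# The slice notation is shorthand for explicit slice objects.
same_as_slices = np.array_equal(b0, b[slice(1, 3), slice(4, 7)])

is_view = b0.base is b                 # a view remembers the array it came from
shares = np.shares_memory(b, b0)

# An explicit copy is independent: writing to it leaves b alone.
safe = b[1:3, 4:7].copy()
safe[0, 1] = 99
before = int(b[1, 5])                  # still the original value

# Writing through the view changes b itself.
b0[0, 1] = 99
after = int(b[1, 5])

print(same_as_slices, is_view, shares, before, after)
```

Views are what make slicing cheap, so they are a feature, not a bug; calling .copy() is only needed when you intend to modify the result independently.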

In contrast, what is advanced slicing? Advanced slicing lets you pass, for each dimension, the explicit indices of the elements you want to pull out.

b1 = b[:,(4,7)]  # (4,7) is advanced slicing

To make the difference stand out: (4,7) is advanced slicing, but 4:7 is basic indexing because it is a slice.

Advanced slicing always creates a “copy”, so you don’t need to worry about the side effects that basic indexing may cause.

Advanced slicing also accepts boolean arrays, which is a very convenient feature of NumPy. Python’s built-in list object doesn’t support boolean indexing.
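Here is a short sketch of both flavors on the same b array (in its original, unmodified state): index-based selection returning a copy, and boolean masks. The names mask, large, and capped are my own, chosen for illustration.

```python
import numpy as np

b = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
              [4, 5, 6, 7, 8, 9, 20, 11, 12, 13],
              [1, 2, 3, 4, 5, 6, 7, 8, 9, 9]])

# Picking columns 4 and 7 by index returns a copy, so b stays untouched.
b1 = b[:, (4, 7)]
b1[0, 0] = -1
untouched = int(b[0, 4])

# A boolean mask has the same shape as b and selects where it is True.
mask = b > 10
large = b[mask]                        # 1D result, in row-major order

# A 1D mask over rows keeps whole rows: here, rows whose column 6 exceeds 10.
rows = b[b[:, 6] > 10]

# Masks also work on the left-hand side, e.g. to cap values in a copy.
capped = b.copy()
capped[capped > 10] = 10

# A plain Python list rejects a boolean list as an index.
plain = [1, 2, 3]
try:
    plain[[True, False, True]]
    list_supports_mask = True
except TypeError:
    list_supports_mask = False

print(b1.shape, untouched, large, rows.shape, capped.max(), list_supports_mask)
```

Note the difference between reading and writing: b[mask] on the right-hand side gives you a copy, while capped[mask] = value on the left-hand side assigns in place.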

Useful NumPy functions you need to know

Here’s a list of functions I labeled as “need to know” after reading through the documentation, either because I have seen them used in real-world tasks or because I think knowing them can save you a lot of effort.

create arrays: (empty, ones, zeros, full and the associated full_like)
identity arrays: (eye, identity)
numeric range: (arange, linspace, logspace, meshgrid)
building matrix: (diag, tri, tril, triu, vander)
reshape: (reshape, swapaxes, transpose, flatten, ravel, newaxis, squeeze, expand_dims, tile, repeat)
join and splitting: (concatenate, stack, column_stack, split, array_split)
functional programming: (apply_along_axis)
indexing: (where)
logical: (logical_xor)
math: (deg2rad, hypot, sinh, trunc, round, lcm, gcd, etc.)
random: (rand, randn, randint, choice, shuffle, beta, etc.)
statistics: (corrcoef, cov, histogram, quantile, var, etc.)

I recommend checking their examples in the NumPy documentation and internalizing their usage.
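As a taste of a few functions from the list, here is a small sampler. It is only a sketch on a toy array m of my own choosing, and it covers a handful of the functions rather than all of them.

```python
import numpy as np

# arange takes a step size; linspace takes the number of points (endpoint included).
steps = np.arange(0, 10, 2)
points = np.linspace(0, 1, 5)

m = np.arange(6).reshape(2, 3)         # [[0, 1, 2], [3, 4, 5]]

# flatten always copies; ravel returns a view when the layout allows it.
flat_copy = m.flatten()
flat_view = m.ravel()

# newaxis inserts a length-1 axis and squeeze removes it again.
expanded = m[:, np.newaxis, :]
squeezed = np.squeeze(expanded)

# stack adds a new axis; concatenate joins along an existing one.
stacked = np.stack([m, m])
joined = np.concatenate([m, m], axis=0)

# where(condition, x, y) picks x where the condition holds, otherwise y.
zeroed = np.where(m > 2, 0, m)

# apply_along_axis runs a 1D function over every row (axis=1).
row_ranges = np.apply_along_axis(lambda r: r.max() - r.min(), 1, m)

print(steps, points, expanded.shape, stacked.shape, joined.shape, zeroed, row_ranges)
```

The flatten/ravel pair is the same view-versus-copy distinction from the indexing section, so the same caution about accidental modification applies.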

Conclusion:

Next time, I will walk you through the linear algebra module in Python, followed by statistical modeling. Thanks for reading! If you like this article, follow me on Medium; thank you so much for your support. Connect with me on Twitter or LinkedIn, and please let me know if you have any questions or what kind of NumPy tutorials you would like to see in the future!

GitHub Repository: https://github.com/frankligy/ScPyT