An Introduction to NumPy
Numerical Python, or NumPy, is a library for the Python programming language that supports large, multi-dimensional arrays and matrices, and comes with a vast collection of high-level mathematical functions to operate on these arrays.
Matrix-sig, a special interest group, was founded in 1995 with the aim of developing an array computing package compatible with the Python programming language. In the same year, Jim Hugunin developed a generalised matrix implementation package, Numeric. Later, in 2005, Travis Oliphant incorporated features of NumArray and C-API into the Numeric code to create NumPy (Numerical Python), which was developed as a part of the SciPy project. NumPy was then separated from SciPy so that users need not install the large SciPy package just to get an array object.
In the rest of this article, ‘>>>’ represents the Python interpreter prompt; statements without this prompt show the output of the code.
NumPy is a BSD-new licensed library for the Python programming language. It comes with Python distributions like Anaconda, Enthought Canopy and Pyzo. It can also be installed using package managers like dnf or pip. In Ubuntu and Debian systems, use the following command to install NumPy:
sudo apt-get install python-numpy
Once NumPy is installed in the system, we need to import it to use the functionalities provided by it:
import numpy as n
The above command will create an alias ‘n’ while importing the package. Once you import the package, it will be active till you exit from the interpreter. If the programs are saved in files, you must import NumPy in each file.
NumPy, SciPy, Pandas and Scikit-learn
In Python, many packages provide support for scientific applications. NumPy, like Matlab, is used for efficient array computation. It also provides vectorised mathematical functions like sin() and cos(). SciPy assists us in scientific computing by providing methods for integration, interpolation, signal processing, statistics and linear algebra. Pandas helps in data analysis, statistics and visualisation. We use Scikit-learn when we want to train a machine learning algorithm. It may seem that NumPy offers less than these other packages, but the beauty is that all of them rely on NumPy internally.
The soul of NumPy is ‘ndarray’, an n-dimensional array.
You can see from the code segment given below that whenever you apply the function ‘type’ on any NumPy array, it will return the type numpy.ndarray irrespective of the type of data stored in it.
>>>import numpy as n
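As a minimal sketch of the point made above, the following snippet builds three small arrays (the names and values are illustrative assumptions, not the article's original listing) and shows that type() reports the same container type for each, while dtype differs:

```python
import numpy as n

# Illustrative arrays holding different kinds of data.
ints = n.array([1, 2, 3])        # integer elements
floats = n.array([1.5, 2.5])     # floating-point elements
words = n.array(["ab", "cd"])    # string elements

# The container type is numpy.ndarray in every case; only dtype changes.
for arr in (ints, floats, words):
    print(type(arr), arr.dtype)
```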
Unlike the list data structure in Python, ndarray holds elements of the same data type only. A few attributes of the ndarray object are listed below. Examples shown with the description of each attribute refer to the array ‘a’, whose definition is as follows:
ndarray.ndim: This displays the number of dimensions of the array. The array ‘a’ in the example is two-dimensional, and hence the output is as follows:
ndarray.shape: This displays the dimensions of the array, as shown below:
ndarray.size: This displays the total number of elements in the array, as shown below:
ndarray.dtype: This displays the data type of elements, depending on the type of data stored in it. Built-in types include int, bool, float, complex, bytes, str, unicode, buffer; all others are referred to as objects. Default dtype of ndarray is float64.
In the case of strings, the dtype will be displayed as dtype(‘S#’), where # represents the length of the string:
ndarray.itemsize: This displays the size of each element in bytes. In the example, ‘a’ contains integers and the size of the integers is 8 bytes. Hence, the output is as follows:
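The attributes described above can be tried out together. The array below is a hypothetical stand-in for ‘a’ (a two-dimensional array of 64-bit integers), chosen only to match the description:

```python
import numpy as n

# Hypothetical stand-in for the array 'a': two rows, three columns.
a = n.array([[1, 2, 3], [4, 5, 6]], dtype=n.int64)

print(a.ndim)      # number of dimensions
print(a.shape)     # length along each dimension
print(a.size)      # total number of elements
print(a.dtype)     # data type of the elements
print(a.itemsize)  # size of each element in bytes
```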
We have already seen in the above examples how ndarrays are created from a Python list using array(). A tuple can also be used in place of a list. It is possible to specify explicitly the data type of elements in the array, as shown below:
>>>e
array([['1', '2', '3'],
       ['3', '4', '5']], dtype='|S2')
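A small sketch of array() with an explicit dtype is given here. The variable names are assumptions made for illustration; under Python 3, elements of an ‘S2’ array are byte strings:

```python
import numpy as n

# The same data supplied as a list and as a tuple, converted to float.
from_list = n.array([[1, 2, 3], [3, 4, 5]], dtype=float)
from_tuple = n.array(((1, 2, 3), (3, 4, 5)), dtype=float)

# Fixed-width strings of up to two bytes each.
e = n.array([['1', '2', '3'], ['3', '4', '5']], dtype='S2')

print(from_list.dtype, from_tuple.dtype, e.dtype)
```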
There are other ways too for generating arrays. A few of them are listed below (italicised words in the description of each function denote the arguments to the functions).
ones(shape[,dtype, order]): This returns an array of given dimensions and type filled with 1s. ‘Order’ in the option set specifies whether to store the data in rows or columns. An example is given below.
>>>a
array(['1', '1', '1'],
If the specified dtype is Sn, the array generated by ones() will contain strings of length 1, whatever the value of ‘n’. But, at a later point, we can replace ‘1’ with a string of length up to ‘n’.
empty(shape[,dtype, order]): This returns an array of given dimensions and type without initialising the entries. In the code segment given below, the specified dtype is ‘S1’. Hence the array ‘a’ may be modified later to store strings of length 1.
>>>a
array(['', '', ''],
full(shape,fill_value[,dtype,order]): This returns an array of given dimensions filled with ‘fill_value’. If dtype is not explicitly specified, older NumPy releases generate a float array with a warning message, whereas recent releases infer the dtype from ‘fill_value’. A sample statement is given below:
>>>a
array([[2, 2, 2],
       [2, 2, 2],
       [2, 2, 2]])
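The three functions above can be compared side by side. This is only a sketch; the shapes and variable names are assumptions picked for illustration:

```python
import numpy as n

ones_int = n.ones((2, 3), dtype=int)   # 2x3 array of integer 1s
col_major = n.ones((2, 3), order='F')  # same values, stored column by column

scratch = n.empty(3)                   # entries are arbitrary until assigned
scratch[:] = 0.0                       # so fill them before reading

twos = n.full((3, 3), 2)               # 3x3 array filled with 2

print(ones_int)
print(twos)
```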
fromstring(string[,dtype,count,sep]): This returns a 1-D array initialised with ‘string’. NumPy takes ‘count’ elements of type ‘dtype’ from ‘string’ and generates an array. ‘string’ is interpreted as binary data if ‘sep’, a string, is not provided, and as ASCII text otherwise. In binary mode, the size of the input string must be a multiple of the element size. Unless specified, the array created using this function will be of dtype ‘float64’, which requires 8 bytes per element. Recent NumPy releases deprecate the binary mode in favour of frombuffer(). Consider the example given below:
The above statement intends to generate an array from the string ‘123’. But it will generate an error message since its length is not a multiple of 8.
The above statement will successfully generate an array as given below:
>>>a
array([ 6.82132005e-38])
Consider the example given below:
>>>a=n.fromstring('123456',dtype='S2',count=2)
Here, dtype is specified as ‘S2’. So the array ‘a’ will contain elements of length 2.
>>>a
array(['12', '34'],
We can see that ‘a’ contains only two elements since the count given in fromstring() is 2.
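For readers on a current NumPy release, here is a hedged sketch of the same ideas. It swaps in frombuffer() for the binary case, since the binary mode of fromstring() is deprecated; the input byte strings are assumptions made for illustration. The float64 value is read explicitly as little-endian (‘<f8’) so that the result does not depend on the machine:

```python
import numpy as n

# Binary data: 8 bytes are reinterpreted as a single little-endian float64.
raw = n.frombuffer(b'12345678', dtype='<f8')

# Fixed-width byte strings of length 2; count limits the result to two items.
pairs = n.frombuffer(b'123456', dtype='S2', count=2)

# Text mode: with a separator, fromstring() parses numbers from ASCII text.
parsed = n.fromstring('1 2 3', dtype=int, sep=' ')

print(raw, pairs, parsed)
```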
loadtxt(fname[,dtype][,comments][,skiprows][,delimiter] [,converters][,usecols] ..... ): This returns an array containing elements formed from the data in the file. The contents of an input file, say loadtxt.txt, are given below:
#this is comment line
abc def ghi
jkl mno pqr
Use the function shown below:
>>>n.loadtxt('/home/abc/Desktop/loadtxt.txt',dtype='S3')
array([['abc', 'def', 'ghi'],
       ['jkl', 'mno', 'pqr']], dtype='|S3')
We can see in the output that the comment statement has automatically been eliminated. Before applying this function, we must make sure that all rows in the file contain an equal number of strings. We can specify in the comments option of loadtxt() which character will mark the beginning of the comments. By default, it is the ‘#’ symbol. The skiprows option will help to skip the first ‘skiprows’ lines in the input file.
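The behaviour described above can be reproduced without creating a file. In this sketch an in-memory io.StringIO object stands in for loadtxt.txt (an assumption made to keep the example self-contained), and dtype=str is used so that it runs under Python 3:

```python
import io
import numpy as n

# In-memory stand-in for the file loadtxt.txt.
text = "#this is comment line\nabc def ghi\njkl mno pqr\n"

# The comment line is dropped; the remaining rows form a 2x3 array.
words = n.loadtxt(io.StringIO(text), dtype=str)

# skiprows=2 skips the comment line and the first data row.
last_row = n.loadtxt(io.StringIO(text), dtype=str, skiprows=2)

print(words)
print(last_row)
```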
arange([start], stop[, step,][,dtype]): This returns an array containing elements within a range.
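A few illustrative calls to arange() are sketched here; the ranges are arbitrary choices. Note that ‘stop’ itself is never included in the result:

```python
import numpy as n

first_five = n.arange(5)        # start defaults to 0, step to 1
evens = n.arange(2, 10, 2)      # from 2 up to (not including) 10, step 2
halves = n.arange(0, 2, 0.5)    # a fractional step gives a float array

print(first_five, evens, halves)
```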
There are many more functions that help in generating arrays. They are documented on the official site, scipy.org.
Functions associated with arrays
Tasks such as finding the trace of a matrix, sorting elements, locating the indices of non-zero elements and multiplying two matrices take lengthy programs when written in C. With NumPy, each of these tasks can be finished with a single statement. A few of the functions used for this are described below. A majority of the functions associated with an array return an array.
nonzero(a): This returns a tuple containing the indices of non-zero elements in the array.
(array([0, 1]), array([2, 0]))
We can see in the definition of ‘a’ that indices of nonzero elements in it are [0,2] and [1,0]. Each element in the result of nonzero() is an array containing the index position of the non-zero element in that dimension. In this case, the first array contains row numbers and the second array contains column numbers of non-zero elements. If we had a third dimension in the input array, the tuple would have contained one more element showing the positions of nonzero elements in that dimension.
>>>a[n.nonzero(a)]
array([2, 3])
The above code shows us how to retrieve the non-zero elements from the array.
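The nonzero() discussion can be tried with the sketch below. The array is a hypothetical one, chosen so that its non-zero elements sit at [0,2] and [1,0] and have the values 2 and 3, as in the output shown above:

```python
import numpy as n

# Hypothetical array with non-zero entries at [0,2] and [1,0].
a = n.array([[0, 0, 2],
             [3, 0, 0]])

rows, cols = n.nonzero(a)   # one index array per dimension
values = a[rows, cols]      # same result as a[n.nonzero(a)]

print(rows, cols, values)
```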
transpose(a[, axes]): This returns a new ndarray after performing a permutation on dimensions. The code segment given below shows a 3D array and its transpose.
>>>a
array([[[1, 2, 3],
        [4, 5, 6]],
       [[7, 8, 9],
        [0, 1, 2]]])
>>>n.transpose(a)
array([[[1, 7],
        [4, 0]],
       [[2, 8],
        [5, 1]],
       [[3, 9],
        [6, 2]]])
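To see what the permutation of dimensions means in practice, the sketch below transposes the same 3D array with and without the ‘axes’ argument. The particular axes tuple (1, 0, 2) is an illustrative choice:

```python
import numpy as n

a = n.array([[[1, 2, 3], [4, 5, 6]],
             [[7, 8, 9], [0, 1, 2]]])

# Without axes, the order of dimensions is reversed: (2, 2, 3) -> (3, 2, 2),
# so t[k, j, i] equals a[i, j, k].
t = n.transpose(a)

# With axes=(1, 0, 2), only the first two dimensions are swapped.
swapped = n.transpose(a, (1, 0, 2))

print(t.shape, swapped.shape)
```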
sum(a [, axis][,dtype][,out][,keepdims]): This returns the sum of elements along the given axis. ‘out’ in the option list is an ndarray into which the result should be written. keepdims is a Boolean value which, if set to ‘True’, retains the reduced axis in the result as a dimension of size one. An example with a 3D array is given below. In this case, the axis takes values from 0 to 2.
>>>a
array([[[1, 2, 3],
        [4, 5, 6]],
       [[7, 8, 9],
        [0, 1, 2]]])
>>>n.sum(a,axis=0)
array([[ 8, 10, 12],
       [ 4,  6,  8]])
>>>n.sum(a,axis=1)
array([[ 5,  7,  9],
       [ 7,  9, 11]])
>>> n.sum(a,axis=2,keepdims=True)
array([[[ 6],
        [15]],
       [[24],
        [ 3]]])
prod(a [, axis][,dtype][,out][,keepdims]): This returns the product of elements along the given axis.
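The options of sum() and prod() can be exercised on the same 3D array. This is a sketch; the variable names and the pre-allocated ‘out’ buffer are assumptions made for illustration:

```python
import numpy as n

a = n.array([[[1, 2, 3], [4, 5, 6]],
             [[7, 8, 9], [0, 1, 2]]])

# keepdims=True leaves the reduced axis in place with size one.
kept = n.sum(a, axis=2, keepdims=True)

# 'out' receives the result in an array allocated beforehand.
out = n.empty((2, 3), dtype=a.dtype)
n.sum(a, axis=1, out=out)

# prod() multiplies instead of adding.
prods = n.prod(a, axis=2)

print(kept.shape, out, prods)
```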
There are functions like argmax, min, argmin, ptp, clip, conj, round, trace, cumsum, mean, var, std, cumprod, all and any, which make scientific computations easier. There are also functions for array conversion, shape manipulation, and item selection and manipulation. Readers who wish to dig deeper can visit the official SciPy site.
Operations on arrays
Arithmetic operations: Arithmetic operators like ‘+’, ‘-’, ‘*’, ‘/’ and ‘%’ can be applied directly on NumPy arrays. It is to be noted that all operations are element-wise operations. The result of 2D array multiplication is shown below:
>>>c*d
array([[1, 6],
       [6, 4]])
If you wish to perform matrix multiplication, use the function dot() as shown below or generate matrices using the matrix function and use the ‘*’ operator on them.
>>> c
array([[1, 2],
       [3, 4]])
>>> d
array([[1, 3],
       [2, 1]])
>>> n.dot(c,d)
array([[ 5,  5],
       [11, 13]])
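The difference between element-wise and matrix multiplication is sketched below. The values of ‘c’ and ‘d’ are assumptions, reconstructed so that they agree with the outputs shown in this section; the ‘@’ operator is available in Python 3.5 and later as another way to write the matrix product:

```python
import numpy as n

# Assumed values, chosen to match the outputs shown in this section.
c = n.array([[1, 2], [3, 4]])
d = n.array([[1, 3], [2, 1]])

elementwise = c * d            # multiplies matching positions
matrix_product = n.dot(c, d)   # rows of c times columns of d
same_product = c @ d           # operator form of the matrix product

print(elementwise)
print(matrix_product)
```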
Relational operations: NumPy allows one to compare two arrays using relational operators. The result will be a Boolean array, i.e., an element in a resultant array is set to ‘True’ only if the condition is satisfied. An example is shown below:
>>>c
array([[1, 2],
       [3, 4]])
>>>d
array([[1, 3],
       [2, 1]])
>>>c==d
array([[ True, False],
       [False, False]], dtype=bool)
Logical operations: Logical operations can be performed on arrays using built-in functions supported by NumPy. Functions like logical_or(), logical_not(), logical_and(), etc can be used for this purpose. The code segment given below shows the results of the XOR operation.
>>>n.logical_xor(c,d)
array([[ True, False],
       [False,  True]], dtype=bool)
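Relational and logical operations can be tried with the sketch below. All array values here are assumptions made for illustration; ‘p’ and ‘q’ are hypothetical Boolean arrays and are not the operands used in the listing above:

```python
import numpy as n

c = n.array([[1, 2], [3, 4]])
d = n.array([[1, 3], [2, 1]])

equal = (c == d)    # True only where the elements match
greater = (c > d)   # element-wise comparison

# Hypothetical Boolean operands for the logical functions.
p = n.array([True, False, True])
q = n.array([True, True, False])

xor = n.logical_xor(p, q)    # True where exactly one operand is True
both = n.logical_and(p, q)   # True where both operands are True

print(equal, greater, xor, both)
```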
Indexing and slicing arrays
The NumPy array index starts at 0. Let ‘a’ be a 2D array. ‘a[i][j]’ represents the (j+1)th element in the (i+1)th row. Equivalently, you can write it as ‘a[i,j]’. ‘a[3,:]’ represents all elements in the 4th row. ‘a[i:i+2, :]’ represents all the elements in the (i+1)th and (i+2)th rows, since the end index of a slice is excluded.
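These indexing rules can be checked on a small array. The 5x4 array and the value of ‘i’ below are assumptions chosen for illustration:

```python
import numpy as n

# Rows are [0..3], [4..7], [8..11], [12..15], [16..19].
a = n.arange(20).reshape(5, 4)
i = 1

element = a[i, 2]          # same element as a[i][2]
fourth_row = a[3, :]       # all elements in the 4th row
two_rows = a[i:i+2, :]     # rows i and i+1 only; row i+2 is excluded

print(element, fourth_row, two_rows)
```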
An attractive feature of NumPy arrays is their support for Boolean indexing. The example given below illustrates it.
>>>d
array([ True,  True, False, False,  True], dtype=bool)
>>>c[d]
array([1, 4, 2])
Here, ‘d’ is a Boolean array whose element is set to ‘True’ if the corresponding element in ‘c’ has a value less than 5. Accessing array ‘c’ using ‘d’, i.e., c[d], will fetch an element in ‘c’ only if the element in the corresponding index position in ‘d’ is ‘True’. I will give one more example. An array ‘a’ is defined as follows:
We can see that the statement given below will retrieve all elements in array ‘a’ which are even numbers.
>>>a[a%2==0]
array([2, 4, 6])
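Both Boolean indexing examples can be reproduced with the sketch below. The contents of ‘c’ and ‘a’ are assumptions, picked so that the results agree with the outputs shown above:

```python
import numpy as n

# Assumed contents; only the elements smaller than 5 matter for the result.
c = n.array([1, 4, 7, 9, 2])
d = c < 5                 # Boolean mask, True where the element is below 5
small = c[d]              # keeps elements whose mask entry is True

a = n.array([1, 2, 3, 4, 5, 6])
evens = a[a % 2 == 0]     # the mask is built inline

print(d, small, evens)
```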
Integer overflow in Python
Python 2 supports two types of integers: int and long. int is a fixed-width C type, which limits the range of values it can take, while long has arbitrary precision, so its maximum value is limited only by the available memory. If int is not enough, the value is automatically promoted to long. Python 3 has a single integer type with arbitrary precision. So there is no question of overflow in integer operations in pure Python. But we cannot restrict our use to pure Python, since scientific computation needs packages in the PyData stack (e.g., NumPy, Pandas and SciPy). The PyData stack uses C type integers, which have fixed precision. The default integer type typically uses 64 bits for representation, so the maximum value an integer can take is 2^63-1. The overflow condition is shown below:
>>>a=n.array([2**63-1,4],dtype=int)
>>>a
array([9223372036854775807, 4])
>>>a+1
array([-9223372036854775808, 5])
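The contrast between NumPy's fixed-width integers and Python's own integers can be seen in this sketch. It requests int64 explicitly (an assumption made so that the result does not depend on the platform's default integer size):

```python
import numpy as n

largest = n.iinfo(n.int64).max             # 2**63 - 1

a = n.array([2**63 - 1, 4], dtype=n.int64)
wrapped = a + 1                            # first element wraps around silently

python_sum = int(a[0]) + 1                 # pure Python integers do not overflow

print(largest, wrapped, python_sum)
```

One way to guard against this is to compare values with n.iinfo(dtype).max before doing the arithmetic.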
To conclude, NumPy not only makes computation easier, but also makes the program run faster. It provides multidimensional arrays and tools to play with arrays.