datafresh: (T) NumPy in Detail

In my post previous there was some examples contained matrices or other data structures of higher dimensionality—just one-dimensional vectors. To understand how NumPy treats objects with dimensions greater than one, we need to develop at least a superficial understanding for the way NumPy is implemented. It is misleading to think of NumPy as a “matrix package for Python” (although it’s commonly used as such). I find it more helpful to think of NumPy as a wrapper and access layer for underlying C buffers. These buffers are contiguous blocks of C memory, which—by their nature—are one-dimensional data structures. All elements in those data structures must be of the same size, and we can specify almost any native C type (including C structs) as the type of the individual elements. The default type corresponds to a C double and that is what we use in the examples that follow, but keep in mind that other choices are possible. All operations that apply to the data overall are performed in C and are therefore very fast.

To interpret the data as a matrix or other multidimensional data structure, the shape or layout is imposed during element access. The same 12-element data structure can therefore be interpreted as a 12-element vector or a 3×4 matrix or a 2×2×3 tensor—the shape comes into play only through the way we access the individual elements. (Keep in mind that although reshaping a data structure is very easy, resizing is not.) The encapsulation of the underlying C data structures is not perfect: when choosing the

types of the atomic elements, we specify C data types not Python types. Similarly, some features provided by NumPy allow us to manage memory manually, rather than have the memory be managed transparently by the Python runtime. This is an intentional design decision, because NumPy has been designed to accommodate large data structures—large enough that you might want (or need) to exercise a greater degree of control over the way memory is managed. For this reason, you have the ability to choose types that take up less space as elements in a collection (e.g., C float elements rather than the default double). For the same reason, all ufuncs accept an optional argument pointing to an (already allocated) location where the results will be placed, thereby avoiding the need to claim additional memory themselves. Finally, several access and structuring routines return a view (not a copy!) of the same underlying data. This does pose an aliasing problem that you need to watch out for.

The next listing quickly demonstrates the concepts of shape and views. Here, I assume that the commands are entered at an interactive Python prompt (shown as >>> in the listing). Output generated by Python is shown without a prompt:

Let’s step through this. We create two vectors of 12 elements each. Then we reshape the first one into a 3 × 4 matrix. Note that the shape property is a data member—not an accessory function! For the second vector, we create a view in the form of a 3 × 4 matrix. Now d1 and the newly created view of d2 have the same shape, so we can combine them (by forming their sum, in this case). Note that even though reshape() is a member function, it does not change the shape of the instance itself but instead returns a new view object: d2 is still a one-dimensional vector. (There is also a standalone version of this function, so we could also have written view = np.reshape( d2, (3,4) ). The presence of such redundant functionality is due to the desire to maintain backward compatibility with both of NumPy’s ancestors.)

We can now access individual elements of the data structures, depending on their shape. Since both d1 and view are matrices, they are indexed by a pair of indices (in the order [row,col]). However, d2 is still a one-dimensional vector and thus takes only a single index. (We will have more to say about indexing in a moment.) Finally, we examine some diagnostics regarding the shape of the data structures, emphasizing their precise semantics. The shape is a tuple, giving the number of elements in each dimension. The size is the total number of elements and corresponds to the value returned by len() for the entire data structure. Finally, ndim gives the number of dimensions (i.e., d.ndim == len(d.shape)) and is equivalent to the “rank” of the entire data structure. (Again, the redundant functionality exists to maintain backward compatibility.)

Finally, let’s take a closer look at the ways in which we can access elements or larger subsets of an ndarray. In the previous listing we saw how to access an individual element by fully specifying an index for each dimension. We can also specify larger sub-arrays of a data structure using two additional techniques, known as slicing and advanced indexing. The following listing shows some representative examples. (Again, consider this an interactive Python session.)

We first create a 12-element vector and reshape it into a 3 × 4 matrix as before. Slicing uses the standard Python slicing syntax start:stop:step, where the start position is inclusive but the stopping position is exclusive. (In the listing, I use only the simplest form of slicing, selecting all available elements.) There are two potential “gotchas” with slicing. First of all, specifying an explicit subscripting index (not a slice!) reduces the corresponding dimension to a scalar. Slicing, though, does not reduce the dimensionality of the data structure. Consider the two extreme cases: in the expression d[0,1], indices for both dimensions are fully specified, and so we are left with a scalar. In contrast, d[0:1,1:2] is sliced in both dimensions. Neither dimension is removed, and the resulting object is still a (two-dimensional) matrix but of smaller size: it has shape 1 × 1. The second issue to watch out for is that slices return views, not copies. Besides slicing, we can also index an ndarray with a vector of indices, by an operation called “advanced indexing.” The previous listing showed two simple examples. In the first we use a Python list object, which contains the integer indices (i.e., the positions) of the desired columns and in the desired order, to select a subset of columns. In the second example, we form an ndarray of Boolean entries to select only those rows for which the Boolean evaluates to True. In contrast to slicing, advanced indexing returns copies, not views. This completes our overview of the basic capabilities of the NumPy module. NumPy is easy and convenient to use for simple use cases but can get very confusing otherwise.

datafresh

Tuesday, June 9, 2015

(T) NumPy in Detail

No comments:

Lastest Posts

(T) Using shared variables and key-value pairs

Posts More Readen