None of the examples in my previous post contained matrices or other data
structures of higher dimensionality—just one-dimensional vectors. To understand
how NumPy treats objects with dimensions greater than one, we need to
develop at least a superficial understanding of the way NumPy is
implemented. It is misleading to think of NumPy as a “matrix
package for Python” (although it’s commonly used as such). I find
it more helpful to think of NumPy as a wrapper and access layer for
underlying C buffers. These buffers are contiguous blocks of C
memory, which—by their nature—are one-dimensional data
structures. All elements in those data structures must be of the same
size, and we can specify almost any native C type (including C
structs) as the type of the individual elements. The default type
corresponds to a C double and that is what we use in the examples
that follow, but keep in mind that other choices are possible. All
operations that apply to the data as a whole are performed in C and are
therefore very fast.
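For instance, a freshly created array defaults to 64-bit floating-point elements (a C double), but the element type can be chosen explicitly. The following lines are only a sketch of this idea (np.zeros() is just one of several array-creation routines):
>>> import numpy as np
>>> np.zeros(4).dtype                      # default element type: C double
dtype('float64')
>>> np.zeros(4, dtype=np.float32).dtype    # a 32-bit C float instead
dtype('float32')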
To interpret the data as
a matrix or other multidimensional data structure, the shape or
layout is imposed during element access. The same 12-element data
structure can therefore be interpreted as a 12-element vector or a
3×4 matrix or a 2×2×3 tensor—the shape comes into play only
through the way we access the individual elements. (Keep in mind that
although reshaping a data structure is very easy, resizing is not.)
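A quick sketch of this point (I use np.arange() to create the twelve elements; only the shapes matter here):
>>> import numpy as np
>>> d = np.arange(12.0)           # one contiguous block of 12 doubles
>>> d.shape
(12,)
>>> d.reshape((3, 4)).shape       # the same data, seen as a 3x4 matrix
(3, 4)
>>> d.reshape((2, 2, 3)).shape    # ... or as a 2x2x3 tensor
(2, 2, 3)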
The encapsulation of the underlying C data structures is not perfect:
when choosing the
types of the atomic
elements, we specify C data types, not Python types. Similarly, some
features provided by NumPy allow us to manage memory manually, rather
than have the memory be managed transparently by the Python runtime.
This is an intentional design decision, because NumPy has been
designed to accommodate large data structures—large enough that you
might want (or need) to exercise a greater degree of control over the
way memory is managed. For this reason, you have the ability to
choose types that take up less space as elements in a collection
(e.g., C float elements rather than the default double). For the same
reason, all ufuncs accept an optional argument pointing to an
(already allocated) location where the results will be placed,
thereby avoiding the need to claim additional memory themselves.
Finally, several access and structuring routines return a view (not a
copy!) of the same underlying data. This does pose an aliasing
problem that you need to watch out for.
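Both points can be sketched in a few lines (the names a, buf, and v are arbitrary, and np.multiply() stands in for any ufunc; exact output formatting varies slightly between NumPy versions):
>>> import numpy as np
>>> a = np.arange(5.0)
>>> buf = np.empty(5)               # pre-allocated result buffer
>>> np.multiply(a, 2.0, out=buf)    # the ufunc writes into buf instead of allocating
array([0., 2., 4., 6., 8.])
>>> v = a[1:3]                      # a view of a, not a copy...
>>> v[0] = 99.0                     # ...so modifying v also modifies a
>>> a
array([ 0., 99.,  2.,  3.,  4.])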
The next listing quickly
demonstrates the concepts of shape and views. Here, I assume that the
commands are entered at an interactive Python prompt (shown as >>>
in the listing). Output generated by Python is shown without a
prompt:
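(The listing below is only a sketch of such a session; I create the vectors with np.arange(), and the exact output formatting depends on the NumPy version.)
>>> import numpy as np
>>> d1 = np.arange(12.0)
>>> d2 = np.arange(12.0)
>>> d1.shape = (3, 4)             # shape is a data member: d1 itself is reshaped
>>> view = d2.reshape((3, 4))     # returns a view; d2 itself is unchanged
>>> d1 + view                     # both operands now have the same shape
array([[ 0.,  2.,  4.,  6.],
       [ 8., 10., 12., 14.],
       [16., 18., 20., 22.]])
>>> d1[0, 1]                      # matrices take a pair of indices [row, col]
1.0
>>> view[2, 3]
11.0
>>> d2[5]                         # d2 is still a one-dimensional vector
5.0
>>> d1.shape, d1.size, d1.ndim
((3, 4), 12, 2)
>>> d2.shape, d2.size, d2.ndim
((12,), 12, 1)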
Let’s step through
this. We create two vectors of 12 elements each. Then we reshape the
first one into a 3 × 4 matrix. Note that the shape property is a
data member—not an accessor function! For the second vector, we
create a view in the form of a 3 × 4 matrix. Now d1
and the newly created view of d2 have
the same shape, so we can combine them (by forming their sum, in this
case). Note that even though reshape()
is a member function, it does not change the shape of the instance
itself but instead returns a new view object: d2
is still a one-dimensional vector. (There is also a standalone
version of this function, so we could also have written view
= np.reshape(d2, (3, 4)). The presence of such redundant
functionality is due to the desire to maintain backward compatibility
with both of NumPy’s ancestors.)
We can now access
individual elements of the data structures, depending on their shape.
Since both d1 and view are matrices,
they are indexed by a pair of indices (in the order [row,col]).
However, d2 is still a one-dimensional
vector and thus takes only a single index. (We will have more to say
about indexing in a moment.) Finally, we examine some diagnostics
regarding the shape of the data structures, emphasizing their precise
semantics. The shape is a tuple, giving the number of elements in
each dimension. The size is the total number of elements. (Note that len(),
applied to the entire data structure, returns only the length of the first
dimension, so it equals size only for one-dimensional arrays.) Finally, ndim
gives the number of dimensions (i.e., d.ndim ==
len(d.shape)) and is equivalent to the “rank”
of the entire data structure. (Again, the redundant functionality
exists to maintain backward compatibility.)
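To make the distinction concrete (again only a sketch, reusing the 3 × 4 shape from above):
>>> import numpy as np
>>> d1 = np.arange(12.0).reshape((3, 4))
>>> len(d1), d1.size, d1.ndim, len(d1.shape)
(3, 12, 2, 2)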
Finally, let’s take a
closer look at the ways in which we can access elements or larger
subsets of an ndarray. In the previous listing we saw how to access
an individual element by fully specifying an index for each
dimension. We can also specify larger sub-arrays of a data structure
using two additional techniques, known as slicing and advanced
indexing. The following listing shows some representative examples.
(Again, consider this an interactive Python session.)
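(The listing below is only a sketch; the specific index values are my own choices, and output formatting may differ slightly between NumPy versions.)
>>> import numpy as np
>>> d = np.arange(12.0).reshape((3, 4))
>>> d[0, :]                   # slice: the entire first row, still one-dimensional
array([0., 1., 2., 3.])
>>> d[0, 1]                   # fully specified index: a scalar
1.0
>>> d[0:1, 1:2]               # sliced in both dimensions: a 1x1 matrix
array([[1.]])
>>> d[:, [2, 0]]              # advanced indexing: columns 2 and 0, in that order
array([[ 2.,  0.],
       [ 6.,  4.],
       [10.,  8.]])
>>> d[np.array([True, False, True]), :]    # Boolean indexing: keep rows 0 and 2
array([[ 0.,  1.,  2.,  3.],
       [ 8.,  9., 10., 11.]])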
We first create a
12-element vector and reshape it into a 3 × 4 matrix as before.
Slicing uses the standard Python slicing syntax start:stop:step,
where the start position is inclusive but the stopping position is
exclusive. (In the listing, I use only the simplest form of slicing,
selecting all available elements.) There are two potential “gotchas”
with slicing. First of all, specifying an explicit subscripting index
(not a slice!) eliminates the corresponding dimension entirely.
Slicing, though, does not reduce the dimensionality of the data
structure. Consider the two extreme cases: in the expression d[0,1],
indices for both dimensions are fully specified, and so we are left
with a scalar. In contrast, d[0:1,1:2]
is sliced in both dimensions. Neither dimension is removed, and the
resulting object is still a (two-dimensional) matrix but of smaller
size: it has shape 1 × 1. The second issue to watch out for is that
slices return views, not copies. Besides slicing, we can also index
an ndarray with a vector of indices, by an operation called “advanced
indexing.” The previous listing showed two simple examples. In the
first we use a Python list object, which contains the integer indices
(i.e., the positions) of the desired columns and in the desired
order, to select a subset of columns. In the second example, we form
an ndarray of Boolean entries to select only those rows for which the
Boolean evaluates to True. In contrast to slicing, advanced indexing
returns copies, not views. This completes our overview of the basic
capabilities of the NumPy module. NumPy is easy and convenient to use
for simple use cases but can get very confusing otherwise.