
(T) NumPy in Detail

None of the examples in my previous post contained matrices or other data structures of higher dimensionality; they used only one-dimensional vectors. To understand how NumPy treats objects with more than one dimension, we need at least a superficial understanding of the way NumPy is implemented. It is misleading to think of NumPy as a “matrix package for Python” (although it’s commonly used as such). I find it more helpful to think of NumPy as a wrapper and access layer for underlying C buffers. These buffers are contiguous blocks of C memory, which by their nature are one-dimensional data structures. All elements in those data structures must be of the same size, and we can specify almost any native C type (including C structs) as the type of the individual elements. The default type corresponds to a C double, and that is what we use in the examples that follow, but keep in mind that other choices are possible. All operations that apply to the data overall are performed in C and are therefore very fast.
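A minimal sketch of the buffer idea, assuming a recent NumPy installation. The variable names and the two-field record layout are hypothetical and chosen only for illustration:

```python
import numpy as np

# np.zeros defaults to float64, which corresponds to a C double.
a = np.zeros(12)
print(a.dtype)                   # float64
print(a.itemsize)                # 8 (bytes per element)

# Every element has the same size, so the buffer holds itemsize * size bytes.
print(a.nbytes)                  # 96

# A freshly created array occupies one contiguous block of memory.
print(a.flags["C_CONTIGUOUS"])   # True

# A structured dtype plays the role of a C struct (hypothetical fields).
point = np.dtype([("x", np.float64), ("id", np.int32)])
records = np.zeros(3, dtype=point)
print(records.dtype.names)       # ('x', 'id')
```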

To interpret the data as a matrix or other multidimensional data structure, the shape or layout is imposed during element access. The same 12-element data structure can therefore be interpreted as a 12-element vector or a 3×4 matrix or a 2×2×3 tensor; the shape comes into play only through the way we access the individual elements. (Keep in mind that although reshaping a data structure is very easy, resizing is not.)

The encapsulation of the underlying C data structures is not perfect: when choosing the types of the atomic elements, we specify C data types, not Python types. Similarly, some features provided by NumPy allow us to manage memory manually, rather than have the memory be managed transparently by the Python runtime. This is an intentional design decision, because NumPy has been designed to accommodate large data structures, large enough that you might want (or need) to exercise a greater degree of control over the way memory is managed. For this reason, you have the ability to choose types that take up less space as elements in a collection (e.g., C float elements rather than the default double). For the same reason, all ufuncs accept an optional argument pointing to an (already allocated) location where the results will be placed, thereby avoiding the need to claim additional memory themselves. Finally, several access and structuring routines return a view (not a copy!) of the same underlying data. This does pose an aliasing problem that you need to watch out for.
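The three memory-related points above (smaller element types, preallocated output, and aliasing through views) can be sketched as follows. The array names and values are assumptions made for the example:

```python
import numpy as np

# Same number of elements, half the memory with C float instead of double.
big = np.zeros(12, dtype=np.float64)
small = np.zeros(12, dtype=np.float32)
print(big.nbytes, small.nbytes)       # 96 48

# The same 12-element buffer seen as a 3x4 matrix and as a 2x2x3 array.
data = np.arange(12, dtype=np.float64)
m = data.reshape(3, 4)
t = data.reshape(2, 2, 3)
print(m.shape, t.shape)               # (3, 4) (2, 2, 3)

# Aliasing: m is a view, so writing through it changes data as well.
print(np.shares_memory(data, m))      # True
m[0, 0] = 100.0
print(data[0])                        # 100.0

# A ufunc can write into an already allocated array via the out argument.
result = np.empty_like(data)
np.sqrt(data, out=result)
print(result[0])                      # 10.0
```

If the aliasing is unwanted, calling `.copy()` on the view gives an independent array at the cost of extra memory.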

The next listing quickly demonstrates the concepts of shape and views. Here, I assume that the commands are entered at an interactive Python prompt (shown as >>> in the listing). Output generated by Python is shown without a prompt:

Let’s step through this. We create two vectors of 12 elements each. Then we reshape the first one into a 3 × 4 matrix. Note that the shape property is a data member, not an accessor function! For the second vector, we create a view in the form of a 3 × 4 matrix. Now d1 and the newly created view of d2 have the same shape, so we can combine them (by forming their sum, in this case). Note that even though reshape() is a member function, it does not change the shape of the instance itself but instead returns a new view object: d2 is still a one-dimensional vector. (There is also a standalone version of this function, so we could also have written view = np.reshape( d2, (3,4) ). The presence of such redundant functionality is due to the desire to maintain backward compatibility with both of NumPy’s ancestors.)
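The steps just described can be sketched as a plain script rather than a prompt session. The names d1, d2, and view come from the walkthrough; filling the vectors with the values 0 through 11 via np.linspace, and the names total and view2, are my assumptions:

```python
import numpy as np

# Two vectors with 12 elements each (the fill values 0..11 are an assumption).
d1 = np.linspace(0, 11, 12)
d2 = np.linspace(0, 11, 12)

# Reshape the first vector in place by assigning to the shape attribute.
d1.shape = (3, 4)
print(d1.shape)      # (3, 4)

# Create a 3x4 view onto the second vector; d2 itself keeps its shape.
view = d2.reshape((3, 4))
print(d2.shape)      # (12,)
print(view.shape)    # (3, 4)

# d1 and view now have the same shape, so they can be added elementwise.
total = d1 + view
print(total[2, 3])   # 22.0

# Standalone equivalent of the member function.
view2 = np.reshape(d2, (3, 4))
print(view2.shape)   # (3, 4)
```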

We can now access individual elements of the data structures, depending on their shape. Since both d1 and view are matrices, they are indexed by a pair of indices (in the order [row,col]). However, d2 is still a one-dimensional vector and thus takes only a single index. (We will have more to say about indexing in a moment.) Finally, we examine some diagnostics regarding the shape of the data structures, emphasizing their precise semantics. The shape is a tuple, giving the number of elements in each dimension. The size is the total number of elements. Note that len() returns only the number of elements along the first dimension, so it agrees with size only for a one-dimensional array. Finally, ndim gives the number of dimensions (i.e., d.ndim == len(d.shape)) and is equivalent to the “rank” of the entire data structure. (Again, the redundant functionality exists to maintain backward compatibility.)
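A short sketch of element access and the shape diagnostics, again assuming vectors filled with 0 through 11. It also shows that len() reports the length of the first axis only, which differs from size for anything that is not one-dimensional:

```python
import numpy as np

# Same setup as in the walkthrough (fill values are an assumption).
d1 = np.linspace(0, 11, 12)
d1.shape = (3, 4)
d2 = np.linspace(0, 11, 12)
view = d2.reshape((3, 4))

# Matrices take a [row, col] pair; the vector takes a single index.
print(d1[0, 1])      # 1.0
print(view[0, 1])    # 1.0
print(d2[1])         # 1.0

# Diagnostics for the 3x4 matrix.
print(d1.shape)      # (3, 4)
print(d1.size)       # 12
print(d1.ndim)       # 2
print(len(d1))       # 3 (first dimension only)

# Diagnostics for the one-dimensional vector.
print(d2.shape, d2.size, d2.ndim, len(d2))   # (12,) 12 1 12
```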

Finally, let’s take a closer look at the ways in which we can access elements or larger subsets of an ndarray. In the previous listing we saw how to access an individual element by fully specifying an index for each dimension. We can also specify larger sub-arrays of a data structure using two additional techniques, known as slicing and advanced indexing. The following listing shows some representative examples. (Again, consider this an interactive Python session.)

We first create a 12-element vector and reshape it into a 3 × 4 matrix as before. Slicing uses the standard Python slicing syntax start:stop:step, where the start position is inclusive but the stopping position is exclusive. (In the listing, I use only the simplest form of slicing, selecting all available elements.)

There are two potential “gotchas” with slicing. First of all, specifying an explicit subscripting index (not a slice!) reduces the corresponding dimension to a scalar. Slicing, though, does not reduce the dimensionality of the data structure. Consider the two extreme cases: in the expression d[0,1], indices for both dimensions are fully specified, and so we are left with a scalar. In contrast, d[0:1,1:2] is sliced in both dimensions. Neither dimension is removed, and the resulting object is still a (two-dimensional) matrix but of smaller size: it has shape 1 × 1. The second issue to watch out for is that slices return views, not copies.

Besides slicing, we can also index an ndarray with a vector of indices, by an operation called “advanced indexing.” The previous listing showed two simple examples. In the first we use a Python list object, which contains the integer indices (i.e., the positions) of the desired columns and in the desired order, to select a subset of columns. In the second example, we form an ndarray of Boolean entries to select only those rows for which the Boolean evaluates to True. In contrast to slicing, advanced indexing returns copies, not views.

This completes our overview of the basic capabilities of the NumPy module. NumPy is easy and convenient to use for simple use cases but can get very confusing otherwise.
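The slicing and advanced-indexing behavior described above can be sketched as follows. The fill values, the particular columns chosen, and the Boolean mask are assumptions made for the example:

```python
import numpy as np

# A 12-element vector reshaped into a 3x4 matrix (values 0..11 assumed).
d = np.linspace(0, 11, 12)
d.shape = (3, 4)

# A full index in every dimension yields a scalar.
print(d[0, 1])                  # 1.0

# Slicing in every dimension keeps both dimensions.
print(d[0:1, 1:2].shape)        # (1, 1)

# Mixing an index with a slice drops only the indexed dimension.
print(d[:, 1].shape)            # (3,)

# Slices are views: writing through one changes the original matrix.
sub = d[:, 1:3]
sub[0, 0] = -1.0
print(d[0, 1])                  # -1.0

# Advanced indexing with a list of column positions, in the desired order.
cols = d[:, [2, 0]]
print(cols.shape)               # (3, 2)

# The result is a copy: changing it leaves the original untouched.
cols[0, 0] = 99.0
print(d[0, 2])                  # 2.0

# Advanced indexing with a Boolean array selecting rows 0 and 2.
mask = np.array([True, False, True])
rows = d[mask, :]
print(rows.shape)               # (2, 4)
```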

