Friday, October 6, 2017

(A) Data Science in Practice with Python - Sample 2

In this post, I'll explain what a recommender system is, how it works, and show you some code examples. In my previous post I gave a quick introduction:

Sample 2 - Recommender System

WHAT IS A RECOMMENDER SYSTEM? A model that filters information to present users with a curated subset of options they’re likely to find appealing.
HOW DOES IT WORK? Generally via a collaborative approach (considering the user’s previous behavior) or content-based approach (based on discrete assigned characteristics).

Now I'll get into some very important concepts about recommender systems.

Recommender Systems in Detail:

We can say that the goal of a recommender system is to make product or service recommendations to people. Of course, these recommendations should be for products or services they’re more likely to want to buy or consume.

Recommender systems are active information filtering systems that personalize the information reaching a user based on their interests, the relevance of the information, and so on. Recommender systems are widely used for recommending movies, articles, restaurants, places to visit, items to buy, etc.

1. Types of Recommendation Engines

A simple way to think about the different types of recommenders is:
  • Content Filtering: “If you liked this item, you might also like …”
  • Item-Item Collaborative Filtering: “Customers who liked this item also liked …”
  • User-Item Collaborative Filtering: “Customers who are similar to you also liked …”
Confused? So, let's look at some cases in practice:

Case 1: Content-based algorithm
Idea: “If you liked this item, you might also like …”
Based on the similarity of the items being recommended. It generally works well when it's easy to determine the context/properties of each item, for instance when we are recommending the same kind of item, such as movie or song recommendations.
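
To make the content-based idea concrete, here is a minimal sketch, assuming each item is described by a vector of made-up genre attributes; it recommends the items whose attribute vectors are closest (by cosine similarity) to one the user liked. The movie names and vectors are purely illustrative.

```python
import numpy as np

# Hypothetical items described by binary genre attributes: [action, comedy, drama, sci-fi]
items = {
    "Movie A": np.array([1, 0, 0, 1]),
    "Movie B": np.array([1, 0, 1, 1]),
    "Movie C": np.array([0, 1, 1, 0]),
}

def cosine(a, b):
    # Cosine similarity between two attribute vectors
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def similar_items(liked, items, top_n=2):
    # Rank the other items by how similar their attributes are to the liked item
    scores = {name: cosine(items[liked], vec) for name, vec in items.items() if name != liked}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(similar_items("Movie A", items))  # "If you liked Movie A, you might also like ..."
```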

Case 2: Collaborative filtering algorithm
If person A likes items 1, 2, and 3, and person B likes items 2, 3, and 4, then they have similar interests: A should like item 4 and B should like item 1.

This algorithm is based entirely on past behavior and not on context. This makes it one of the most commonly used algorithms, as it does not depend on any additional information. For instance: product recommendations by e-commerce players like Amazon, and merchant recommendations by banks like American Express.

An important point here is that there are several types of collaborative filtering algorithms. Let's look at the two most important:

Item-Item Collaborative filtering:
Idea: “Customers who liked this item also liked …”
It is quite similar to the previous algorithm, but instead of finding look-alike customers, we try to find look-alike items. Once we have the item look-alike matrix, we can easily recommend similar items to a customer who has purchased an item from the store. This algorithm is far less resource-consuming than user-user collaborative filtering. Hence, for a new customer, it takes far less time than user-user collaborative filtering, as we don't need all the similarity scores between customers. And with a fixed number of products, the product-product look-alike matrix is fixed over time.
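
As a toy sketch of this idea (the ratings matrix below is made up, and I use scikit-learn's cosine_similarity as the item-to-item measure, which is only one possible choice):

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy user-item rating matrix (rows = users, columns = items); 0 means "not rated"
ratings = pd.DataFrame(
    [[5, 4, 0, 1],
     [4, 5, 1, 0],
     [1, 0, 5, 4],
     [0, 1, 4, 5]],
    index=["u1", "u2", "u3", "u4"],
    columns=["item1", "item2", "item3", "item4"],
)

# Item-item similarity: compare items by the ratings they received (transpose -> items as rows)
item_sim = pd.DataFrame(
    cosine_similarity(ratings.T),
    index=ratings.columns,
    columns=ratings.columns,
)

# "Customers who liked item1 also liked ..." -> the items most similar to item1
print(item_sim["item1"].drop("item1").sort_values(ascending=False))
```

Note that item_sim only has to be recomputed when the catalog changes, which is exactly why this approach scales better than comparing every pair of customers.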

User-Item Collaborative filtering:
Idea: “Customers who are similar to you also liked …”
Here we find look-alike customers (based on similarity) and offer products that the first customer's look-alikes have chosen in the past. This algorithm is very effective but takes a lot of time and resources, since it requires computing the similarity for every pair of customers. Therefore, for platforms with a large customer base, this algorithm is hard to implement without a very strong parallelization system.
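
A matching sketch for the user-based variant, reusing the same made-up ratings matrix: find the most similar user and suggest items that the look-alike rated but the target user has not seen yet. Again, this is only an illustration, not a production recipe.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Same toy rating matrix as before (rows = users, columns = items); 0 means "not rated"
ratings = pd.DataFrame(
    [[5, 4, 0, 1],
     [4, 5, 1, 0],
     [1, 0, 5, 4],
     [0, 1, 4, 5]],
    index=["u1", "u2", "u3", "u4"],
    columns=["item1", "item2", "item3", "item4"],
)

# User-user similarity: compare users by the ratings they gave
user_sim = pd.DataFrame(cosine_similarity(ratings), index=ratings.index, columns=ratings.index)

def recommend_for(user, top_n=2):
    # The most similar other customer (the "look-alike")
    neighbour = user_sim[user].drop(user).idxmax()
    # Items the neighbour rated that the target user has not rated yet
    unseen = ratings.columns[(ratings.loc[user] == 0) & (ratings.loc[neighbour] > 0)]
    return ratings.loc[neighbour, unseen].sort_values(ascending=False).head(top_n)

print(recommend_for("u1"))  # "Customers who are similar to you also liked ..."
```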

But before we proceed, let me define a couple of terms:
  • Item refers to content whose attributes are used in the recommendation models. These could be movies, documents, books, etc.
  • Attribute refers to a characteristic of an item. A movie's tags or the words in a document are examples.
Recommender System Hands-On:

As I said above, "collaborative filtering" algorithms search large groupings of preference expressions to find similarities to some input preference or preferences. The output of these algorithms is a ranked list of suggestions that is a subset of all possible preferences, hence the "filtering". The "collaborative" part comes from using many other people's preferences to find suggestions for a given user. This can be framed as a search of the space of preferences (for brute-force techniques), as a clustering problem (grouping similarly preferred items), or even as some other predictive model.

Many algorithms have been developed to optimize or solve this problem over sparse or large datasets, and we will discuss a few of them in this post.

The goals of this post are:
  • Understanding how to model preferences from a variety of sources
  • Learning how to compute similarities using distance metrics (a quick sketch follows right after this list)
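
Before loading any real data, here is a minimal sketch of the two similarity measures most often used for this kind of filtering, written in plain Python; the two preference vectors are made up for the example.

```python
import math

def euclidean_distance(a, b):
    # Smaller distance = more similar preference vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Close to 1.0 = vectors point in the same direction (very similar tastes)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

alice = [5, 4, 1, 0]  # hypothetical star ratings over four movies
bob = [4, 5, 0, 1]
print(euclidean_distance(alice, bob), cosine_similarity(alice, bob))
```
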
To demonstrate the techniques in this post, I will use the "MovieLens" database from the University of Minnesota that contains star ratings of moviegoers for their preferred movies. You can find the data here.

I will use the smaller MovieLens 100k dataset (4.7 MB in size / ml-100k.zip) so that the entire model can easily be loaded into memory. Unzip the downloaded data into a directory of your choice.

This dataset consists of:
  • 100,000 ratings (1-5) from 943 users on 1682 movies.
  • Each user has rated at least 20 movies.
  • Simple demographic info for the users (age, gender, occupation, zip)
  • Genre information of movies
Inside the zipped file, the two main files we will be using are as follows:
  • u.data: this contains the user movie ratings; it is the main file and it is tab-delimited
  • u.item: this contains the movie information and other details; it is pipe-delimited
In the u.data file, the first column is the "user ID", the second column is the "movie ID", the third is the "star rating", and the last is the "timestamp".
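
As a quick sketch, assuming you unzipped the archive into a local ml-100k/ folder, u.data can be loaded with pandas like this (the column names are the ones described above; the file itself has no header row):

```python
import pandas as pd

# u.data is tab-delimited with no header row
ratings = pd.read_csv(
    "ml-100k/u.data",
    sep="\t",
    names=["user_id", "movie_id", "rating", "timestamp"],
)
print(ratings.head())
print(ratings["rating"].describe())  # quick sanity check on the 1-5 star ratings
```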

The u.item file contains much more information, including the "ID", "title", "release date", and so on. Interestingly, this file also has a Boolean array indicating the genre(s) of each movie, including (in order) action, adventure, animation, children, comedy, crime, documentary, drama, fantasy, film noir, horror, musical, mystery, romance, sci-fi, thriller, war, and western.
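
A similar sketch for u.item: it is pipe-delimited, has no header, and (in my experience) needs Latin-1 encoding for the accented titles. The column layout assumed here (ID, title, release date, video release date, IMDb URL, then an "unknown" flag followed by the genre flags listed above) comes from the dataset's README, so double-check it against your copy.

```python
import pandas as pd

# Genre flags in the order described above, preceded in the file by an "unknown" flag
genres = ["unknown", "action", "adventure", "animation", "children", "comedy", "crime",
          "documentary", "drama", "fantasy", "film_noir", "horror", "musical", "mystery",
          "romance", "sci_fi", "thriller", "war", "western"]

movies = pd.read_csv(
    "ml-100k/u.item",
    sep="|",
    header=None,
    encoding="latin-1",  # titles contain accented characters
    names=["movie_id", "title", "release_date", "video_release_date", "imdb_url"] + genres,
)
print(movies[["movie_id", "title"] + genres].head())
```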

Well ... let's see all of this in practice, step by step, or better, line by line. For this, I prepared a Jupyter notebook. You can find the notebook here.

I hope you enjoy it!

See you again in the next post ... Bye, bye!

Saturday, July 8, 2017

(A) Data Science in Practice with Python - Sample 1


The top trending term on Twitter and other social networks is “data science”. But ...
  • What’s the data science? 
  • How do real companies use data science to make products, services and operations better? 
  • How does it work? 
  • What does the data science lifecycle look like? 
It's the buzzword of the moment. A lot of people ask me about it, and there are many questions. I'll try to answer all of them through some samples.

Sample 1 - Regression

WHAT IS REGRESSION? This is the best definition I found [Source: Wikipedia] - Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning.
HOW DOES IT WORK? Regression analysis is also used to understand which of the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. However, this can lead to illusions or false relationships, so caution is advisable; for example, correlation does not imply causation.

Sample 2 - Recommender System

WHAT IS A RECOMMENDER SYSTEM? A model that filters information to present users with a curated subset of options they’re likely to find appealing.
HOW DOES IT WORK? Generally via a collaborative approach (considering the user’s previous behavior) or a content-based approach (based on discrete assigned characteristics).

Sample 3 - Credit Scoring

WHAT IS CREDIT SCORING? A model that determines an applicant’s creditworthiness for a mortgage, loan or credit card.
HOW DOES IT WORK? A set of decision management rules evaluates how likely an applicant is to repay debts.

Sample 4 - Dynamic Pricing

WHAT IS DYNAMIC PRICING? Modeling price as a function of supply, demand, competitor pricing and exogenous factors.
HOW DOES IT WORK? Generalized linear models and classification trees are popular techniques for estimating the “right” price to maximize expected revenue.

Sample 5 - Customer Churn

WHAT IS CUSTOMER CHURN? Predicting which customers are going to abandon a product or service.
HOW DOES IT WORK? Data scientists may consider using support vector machines, random forest or k-nearest-neighbors algorithms.

Sample 6 - Fraud Detection

WHAT IS FRAUD DETECTION? Detecting and preventing fraudulent financial transactions from being processed.
HOW DOES IT WORK? Fraud detection is a binary classification problem: “is this transaction legitimate or not?”

This post series will be divided into five parts, and in each one I'll explain the machine learning techniques mentioned above. This is the first post, and I'll show you how sample 1, regression, works. But first, let's start with the question below:

- What is data science?

In my previous post, “Tucson Best Buy Analysis”, you can learn more about it. There I explain what data science is and show some examples of the day-to-day work of a data scientist.

That said, let's talk about sample 1: regression. To explain it, let's start with a simple problem, I could say a “classic problem”: predicting house prices in Russia, a Kaggle challenge that closed on 06/29/2017. The goal is to predict the median value of a house in a particular area. As usual, we have some training data, where the answer is known to us. For our study and a better understanding of this topic, I created a Jupyter notebook. To access the notebook, please click here.
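
To give a taste of what the notebook walks through, here is a minimal regression sketch. The file name and the feature/target column names (area_sq_m, num_rooms, floor, price) are assumptions for illustration only; the real competition files have their own schema, so adjust accordingly.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hypothetical training data: adjust the file name and columns to the real dataset
train = pd.read_csv("train.csv")
X = train[["area_sq_m", "num_rooms", "floor"]].fillna(0)
y = train["price"]

# Hold out 20% of the rows to check how well the model generalizes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("RMSE:", mean_squared_error(y_test, predictions) ** 0.5)
```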

Have fun! See you in the next post ...

Sunday, June 11, 2017

(A) Tucson Best Buy Analysis

“Data! Data! Data!” he cried impatiently. 
“I can’t make bricks without clay.”
—Arthur Conan Doyle

The Ascendance of Data

We live in a world that’s drowning in data. Websites track every user’s every click. Your smartphone is building up a record of your location and speed every second of every day. “Quantified selfers” wear pedometers-on-steroids that are ever recording their heart rates, movement habits, diet, and sleep patterns. Smart cars collect driving habits, smart homes collect living habits, and smart marketers collect purchasing habits. The Internet itself represents a huge graph of knowledge that contains (among other things) an enormous cross-referenced encyclopedia; domain-specific databases about movies, music, sports results, pinball machines, memes, and cocktails; and too many government statistics (some of them nearly true!) from too many governments to wrap your head around.

Buried in these data are answers to countless questions that no one’s ever thought to ask. In this post, we’ll talk about how to find them.

What is Data Science?

There’s a joke that says a data scientist is someone who knows more statistics than a computer scientist and more computer science than a statistician. (I didn’t say it was a good joke.) In fact, some data scientists are—for all practical purposes—statisticians, while others are pretty much indistinguishable from software engineers. Some are machine-learning experts, while others couldn’t machine-learn their way out of kindergarten. Some are PhDs with impressive publication records, while others have never read an academic paper (shame on them, though). In short, pretty much no matter how you define data science, you’ll find practitioners for whom the definition is totally, absolutely wrong.

Nonetheless, we won’t let that stop us from trying. We’ll say that a data scientist is someone who extracts insights from messy data. Today’s world is full of people trying to turn data into insight.

For instance, the US dating site OkCupid asks its members to answer thousands of questions in order to find the most appropriate matches for them. But it also analyzes these results to figure out innocuous-sounding questions you can ask someone to find out how likely they are to sleep with you on the first date.

Facebook asks you to list your hometown and your current location, ostensibly to make it easier for your friends to find and connect with you. But it also analyzes these locations to identify global migration patterns and where the fanbases of different football teams live.

As a large retailer in Brazil, B2W (owner of the brands Americanas.com, Shoptime.com, Submarino.com and Soubarato.com) tracks your purchases and interactions online. And it uses the data to predictively model which of its customers are pregnant, to better market baby-related purchases to them.

In 2012, the Obama campaign employed dozens of data scientists who data-mined and experimented their way to identifying voters who needed extra attention, choosing optimal donor-specific fundraising appeals and programs, and focusing get-out-the-vote efforts where they were most likely to be useful. It is generally agreed that these efforts played an important role in the president’s re-election, which means it is a safe bet that political campaigns of the future will become more and more data-driven, resulting in a never-ending arms race of data science and data collection.

Now, before you start feeling too jaded: some data scientists also occasionally use their skills for good—using data to make government more effective, to help the homeless, and to improve public health. But it certainly won’t hurt your career if you like figuring out the best way to get people to click on advertisements.

Data Science in Practice

So, let's play with some data to understand some of the activities performed by a data scientist.
In this exercise, the idea is to find some options for buying a Hyundai Tucson. There is a site specialized in buying and selling cars and motorbikes (http://www.webmotors.com.br). The cars/motorbikes can be new or used, and many advertisements are published there every day. So many options: color, year, model, price, etc. I visit this site every day looking for a good opportunity.

How it works: I created a Python app to crawl (extract) the data from this site. Then I saved the data in a SQLite database called 'crawler-motors.db'. The next step was to analyze the data, so I created this notebook in Jupyter. The notebook has seven parts. In general, I use pandas for data manipulation and some statistical analysis, and matplotlib to create some graphics. I also use a machine learning approach to predict the sale price with a linear regression model.
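
As a rough sketch of that last step, here is how the idea can be expressed in a few lines; the table name (ads) and the column names (year, mileage, price) are assumptions for illustration, since the real schema lives inside the 'crawler-motors.db' file built by the crawler.

```python
import sqlite3
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load the crawled advertisements from the SQLite database
conn = sqlite3.connect("crawler-motors.db")
ads = pd.read_sql_query("SELECT year, mileage, price FROM ads", conn)
conn.close()

# Predict the sale price from the car's year and mileage with a simple linear regression
X = ads[["year", "mileage"]]
y = ads["price"]
model = LinearRegression().fit(X, y)

new_car = pd.DataFrame({"year": [2015], "mileage": [40000]})
print(model.predict(new_car))  # estimated price for a hypothetical 2015 car with 40,000 km
```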

Tuesday, April 25, 2017

(T) Statistical Computing with Python - Part Two

This is a quick tutorial where I'll show you how to do some statistical programming tasks using Python. For that, it is necessary to have some basic knowledge of Python and to be familiar with statistical programming in a language like R, Stata, SAS, SPSS or Matlab.

Please click here to see this post in pdf.
