“Data! Data! Data!” he cried impatiently.
“I can’t make bricks without clay.”
—Arthur Conan Doyle
The Ascendance of Data
We live in a world that’s drowning in data. Websites track every user’s every click. Your smartphone is building up a record of your location and speed every second of every day. “Quantified selfers” wear pedometers-on-steroids that are ever recording their heart rates, movement habits, diet, and sleep patterns. Smart cars collect driving habits, smart homes collect living habits, and smart marketers collect purchasing habits. The Internet itself represents a huge graph of knowledge that contains (among other things) an enormous cross-referenced encyclopedia; domain-specific databases about movies, music, sports results, pinball machines, memes, and cocktails; and too many government statistics (some of them nearly true!) from too many governments to wrap your head around.
Buried in these data are answers to countless questions that no one’s ever thought to ask. In this post, we’ll talk about how to find them.
What is Data Science?
There’s a joke that says a data scientist is someone who knows more statistics than a computer scientist and more computer science than a statistician. (I didn’t say it was a good joke.) In fact, some data scientists are, for all practical purposes, statisticians, while others are pretty much indistinguishable from software engineers. Some are machine-learning experts, while others couldn’t machine-learn their way out of kindergarten. Some are PhDs with impressive publication records, while others have never read an academic paper (shame on them, though). In short, pretty much no matter how you define data science, you’ll find practitioners for whom the definition is totally, absolutely wrong.
Nonetheless, we won’t let that stop us from trying. We’ll say that a data scientist is someone who extracts insights from messy data. Today’s world is full of people trying to turn data into insight.
For instance, the US dating site OkCupid asks its members to answer thousands of questions in order to find the most appropriate matches for them. But it also analyzes these results to figure out innocuous-sounding questions you can ask someone to find out how likely they are to sleep with you on the first date.
Facebook asks you to list your hometown and your current location, ostensibly to make it easier for your friends to find and connect with you. But it also analyzes these locations to identify global migration patterns and where the fanbases of different football teams live.
As a large retailer in Brazil, B2W (owner of the brands Americanas.com, Shoptime.com, Submarino.com and Soubarato.com) tracks your purchases and interactions online. And it uses the data to predictively model which of its customers are pregnant, to better market baby-related purchases to them.
In 2012, the Obama campaign employed dozens of data scientists who data-mined and experimented their way to identifying voters who needed extra attention, choosing optimal donor-specific fundraising appeals and programs, and focusing get-out-the-vote efforts where they were most likely to be useful. It is generally agreed that these efforts played an important role in the president’s re-election, which means it is a safe bet that political campaigns of the future will become more and more data-driven, resulting in a never-ending arms race of data science and data collection.
Now, before you start feeling too jaded: some data scientists also occasionally use their skills for good—using data to make government more effective, to help the homeless, and to improve public health. But it certainly won’t hurt your career if you like figuring out the best way to get people to click on advertisements.
Data Science in Practice
So, let's play with some data to understand some of the activities a data scientist performs.
In this exercise, the goal is to find some good options for buying a Hyundai Tucson. There is a specialized site for buying and selling cars and motorbikes (http://www.webmotors.com.br), both new and used. Many advertisements are published there every day, with plenty of options to compare: color, year, model, price, and so on. I visit the site every day looking for a good opportunity.
How it works: I wrote a Python app (a crawler) to extract these data from the site and saved them in a SQLite database called 'crawler-motors.db'. The next step was to analyze the data, so I created this notebook in Jupyter. The notebook has seven parts. In general, I use pandas for data manipulation and some statistical analysis, and matplotlib to create charts. I also use a machine learning approach to predict the sale price with a linear regression model.
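To make the pipeline concrete, here is a minimal sketch of the "load from SQLite, then fit a linear regression" idea. It is not the notebook's actual code. The table name `ads`, its columns (`model`, `year`, `price`), and the three demo rows are all assumptions made up for illustration, since the real schema of 'crawler-motors.db' is not shown here. To keep the sketch dependency-free, it uses only the standard library and a hand-written least-squares fit instead of pandas and a regression library.

```python
import sqlite3


def load_ads(conn):
    """Return (year, price) pairs from the hypothetical `ads` table."""
    rows = conn.execute(
        "SELECT year, price FROM ads "
        "WHERE year IS NOT NULL AND price IS NOT NULL"
    ).fetchall()
    return [(float(year), float(price)) for year, price in rows]


def fit_line(pairs):
    """Ordinary least squares for price = slope * year + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, year):
    """Predict a price for a given model year from a fitted line."""
    slope, intercept = model
    return slope * year + intercept


# Demo on an in-memory database with synthetic rows (NOT real ads).
# With the real data, this would be sqlite3.connect("crawler-motors.db").
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ads (model TEXT, year INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO ads VALUES (?, ?, ?)",
    [
        ("Tucson", 2010, 40000.0),
        ("Tucson", 2012, 50000.0),
        ("Tucson", 2014, 60000.0),
    ],
)

model = fit_line(load_ads(conn))
print(predict(model, 2013))
```

A single feature (the year) keeps the example short. A real price model would likely use more columns, such as mileage or trim, and would need to be checked against held-out ads.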