<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.6.2">Jekyll</generator><link href="https://gt987.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://gt987.github.io/" rel="alternate" type="text/html" /><updated>2018-02-17T23:03:32+00:00</updated><id>https://gt987.github.io/</id><title type="html">Liftoff</title><subtitle>It's all about data.</subtitle><author><name>gabriele trevisan</name></author><entry><title type="html">TATA v0.1</title><link href="https://gt987.github.io/webapps/nlp/TATA/" rel="alternate" type="text/html" title="TATA v0.1" /><published>2018-02-07T00:00:00+00:00</published><updated>2018-02-07T00:00:00+00:00</updated><id>https://gt987.github.io/webapps/nlp/TATA</id><content type="html" xml:base="https://gt987.github.io/webapps/nlp/TATA/">&lt;p&gt;This post will describe the development and deployment of TATA, a Twitter Account Toxicity Analyzer.
TATA v0.1 is now online at &lt;a href=&quot;http://www.toxic-tweets.xyz&quot;&gt;http://www.toxic-tweets.xyz&lt;/a&gt;. Check it out!&lt;/p&gt;

&lt;h2 id=&quot;github-repo&quot;&gt;GitHub Repo&lt;/h2&gt;

&lt;p&gt;The code for the analysis is available on my GitHub &lt;a href=&quot;https://github.com/gt987/Twitter-Analyzer&quot;&gt;repo&lt;/a&gt; if you are interested. Enjoy!&lt;/p&gt;</content><author><name>gabriele trevisan</name></author><summary type="html">This post will describe the development and deployment of TATA, a Twitter Account Toxicity Analyzer. TATA v0.1 is now online at http://www.toxic-tweets.xyz. Check it out!</summary></entry><entry><title type="html">A good movie to watch</title><link href="https://gt987.github.io/data%20lens/nlp/A-good-movie-to-watch/" rel="alternate" type="text/html" title="A good movie to watch" /><published>2018-01-28T00:00:00+00:00</published><updated>2018-01-28T00:00:00+00:00</updated><id>https://gt987.github.io/data%20lens/nlp/A%20good%20movie%20to%20watch</id><content type="html" xml:base="https://gt987.github.io/data%20lens/nlp/A-good-movie-to-watch/">&lt;p&gt;This blog post deals with the incredibly common task of interpreting language and extracting its sentiment. Humans are obviously very good at this: if I tell you a story or you read some article, you understand its meaning; if I tell you a joke or I’m sarcastic, you get it. Computers… not so much…&lt;/p&gt;

&lt;p&gt;But for every difficult task there is obviously a big prize at the end. Think about how much unexploited data there is in textual resources. Virtually every industry can benefit from Natural Language Processing (NLP): from healthcare to finance, from ads to traveling.&lt;/p&gt;

&lt;p&gt;But let us start with small steps. The main goal here is rather simple to state: can Deep Learning classify bad or good movies based on their reviews? There are various datasets one can use to tackle this problem. For example, IMDB and Rotten Tomatoes host a lot of reviews that could be used; however, those reviews are written by many different people, so both the text and the grading might be too heterogeneous. For this reason I decided to approach the problem from a different angle and use the reviews written by a single person: the famous critic Roger Ebert. Clearly this does not eliminate the problem, but it might reveal an interesting new side of Roger Ebert.&lt;/p&gt;
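&lt;p&gt;To make the classification target concrete, here is a minimal sketch of how star ratings could be turned into binary labels for such a classifier. The 3-star threshold and the sample ratings are illustrative choices of mine, not necessarily the final setup:&lt;/p&gt;

```python
# Turn star ratings (Ebert uses a 0.0-4.0 scale) into binary labels.
# The 3-star cutoff is an illustrative choice, not the post's final setup.
def to_label(stars):
    """Return 1 for a 'good' movie (3 stars or more), 0 otherwise."""
    return 1 if stars >= 3.0 else 0

ratings = [4.0, 2.5, 3.0, 1.0, 3.5]  # made-up sample ratings
labels = [to_label(r) for r in ratings]
print(labels)  # [1, 0, 1, 0, 1]
```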

&lt;h2 id=&quot;data&quot;&gt;Data&lt;/h2&gt;

&lt;p&gt;Ebert was an incredibly prolific reviewer: nearly 50 years as a critic, with thousands of reviews under his belt. All his reviews (and many others) are available for free on &lt;a href=&quot;https://www.rogerebert.com&quot;&gt;https://www.rogerebert.com&lt;/a&gt;. Unfortunately there is no API to access them, so I had to gently scrape the website; the code is available on my &lt;a href=&quot;https://github.com/gt987/A-good-movie&quot;&gt;repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The fields useful for this project are: the year of publication, the review text, and the rating. Overall these fit in an easily manageable dataset of 50 MB.&lt;/p&gt;

&lt;h2 id=&quot;eda&quot;&gt;EDA&lt;/h2&gt;

&lt;p&gt;Let us start our EDA by considering the two numerical attributes we have: &lt;em&gt;ratings&lt;/em&gt; and &lt;em&gt;year&lt;/em&gt;. Before looking at the data one can expect various distributions for these quantities. For example, one can imagine that the rating is distributed uniformly (unlikely), that it peaks at high or low ratings, or, most likely, that it peaks somewhere in the middle. Regarding the year, we already know that there are no reviews after 2013, so the distribution will be truncated.&lt;/p&gt;

&lt;p&gt;What does the data say? First of all, the distribution of ratings is not what I was expecting. In particular, the distribution looks somewhat Poissonian, but with an unusually high number of movies at 3 stars. It seems like Ebert had a sort of psychological anchor there: not sure about the rating? Give it a 3!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://gt987.github.io/assets/images/Ebert/dist.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
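&lt;p&gt;The spike at 3 stars can be checked directly by tabulating the ratings; a minimal sketch (with made-up sample values, not the real scraped data):&lt;/p&gt;

```python
from collections import Counter

# Made-up sample of star ratings; the real values come from the scraped reviews.
ratings = [3.0, 3.0, 4.0, 2.0, 3.0, 3.5, 1.0, 3.0, 2.5, 3.0]

counts = Counter(ratings)
mode, n = counts.most_common(1)[0]  # most frequent rating and its count
print(mode, n)  # 3.0 5
```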

&lt;p&gt;A more interesting quantity that can be derived from the text is the length of the review. The distribution of the lengths of Ebert’s reviews follows a log-normal distribution very closely, whether length is defined as the number of words or the number of sentences in the review. Take a look.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://gt987.github.io/assets/images/Ebert/length.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
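&lt;p&gt;A quick sanity check for log-normality (sketched here with simulated lengths rather than the real reviews): if the lengths are log-normal, their logarithms should look normal, so for instance the mean and the median of the logs should nearly coincide:&lt;/p&gt;

```python
import math
import random
import statistics

random.seed(0)
# Simulate review lengths (in words) from a log-normal with log-mean 6.5 and
# log-sd 0.5, i.e. a typical length of roughly exp(6.5) ~ 665 words.
# These parameters are illustrative, not fitted to Ebert's reviews.
lengths = [random.lognormvariate(6.5, 0.5) for _ in range(10_000)]

logs = [math.log(x) for x in lengths]
log_mean = statistics.mean(logs)
log_median = statistics.median(logs)
# For a log-normal sample, the mean and median of the logs should nearly agree.
print(round(log_mean, 2), round(log_median, 2))
```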

&lt;p&gt;This quite surprising fact is not unique to Ebert’s style but is instead a generic human behavior, somehow connected with natural language (information theory?). For those interested, you can read more about the occurrence of the log-normal distribution in this &lt;a href=&quot;https://en.wikipedia.org/wiki/Log-normal_distribution#Occurrence_and_applications&quot;&gt;Wikipedia entry&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;nlp-in-python&quot;&gt;NLP in Python&lt;/h2&gt;

&lt;p&gt;To be finished…&lt;/p&gt;

&lt;h2 id=&quot;github-repo&quot;&gt;GitHub Repo&lt;/h2&gt;

&lt;p&gt;The code for the analysis is available on my GitHub &lt;a href=&quot;https://github.com/gt987/A-good-movie&quot;&gt;repo&lt;/a&gt; if you are interested. Enjoy!&lt;/p&gt;</content><author><name>gabriele trevisan</name></author><summary type="html">This blog post deals with the incredibly common task of interpreting language and extracting its sentiment. Humans are obviously very good at this: if I tell you a story or you read some article, you understand its meaning; if I tell you a joke or I’m sarcastic, you get it. Computers… not so much…</summary></entry><entry><title type="html">Cabs vs Uber</title><link href="https://gt987.github.io/data%20lens/Cabs-vs-Uber/" rel="alternate" type="text/html" title="Cabs vs Uber" /><published>2017-11-13T00:00:00+00:00</published><updated>2017-11-13T00:00:00+00:00</updated><id>https://gt987.github.io/data%20lens/Cabs%20vs%20Uber</id><content type="html" xml:base="https://gt987.github.io/data%20lens/Cabs-vs-Uber/">&lt;p&gt;Since the introduction of the iPhone, our lives and habits have changed dramatically, and one of these big changes involves the way we travel around cities. For example, 10 years ago we used to raise our arm to call a cab, nowadays we most likely reach for our phones, and probably in a few years we’ll summon a driverless car using our watch. Moreover, all these changes are also affecting companies that haven’t been able to evolve (think about Nokia) and entire sectors of the economy (sorry drivers…).&lt;/p&gt;

&lt;p&gt;In this post I will take a look at the traditional cab companies operating in NYC and see how they are doing in the era of Uber. Are they doing well, or just dying slowly? To answer these kinds of questions I used the data of the New York City Taxi &amp;amp; Limousine Commission (TLC), which has made publicly available a &lt;a href=&quot;http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml&quot;&gt;gigantic database&lt;/a&gt; of all the rides of the Yellow Cabs, Green Cabs and FHV (For-Hire Vehicle) companies in NYC from 2009 to the present.&lt;/p&gt;

&lt;p&gt;This database can obviously answer a bunch of interesting questions and deserves to be exploited much more than what I will be showing here. To mention some possible directions:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;When and where should vehicles wait to maximize total income?&lt;/li&gt;
  &lt;li&gt;How bad is rush hour?&lt;/li&gt;
  &lt;li&gt;Is it better to take a cab, a Citi Bike, or the metro?&lt;/li&gt;
  &lt;li&gt;Are you tipping too little?&lt;/li&gt;
  &lt;li&gt;…&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;data&quot;&gt;Data&lt;/h2&gt;

&lt;p&gt;The TLC provides a monthly dataset for each of the three categories of vehicles operating in NYC (Yellow, Green and FHV). The average monthly dataset contains around 10^7 rides and weighs around 1 GB. In total (from 2009 to the present), there have been roughly a billion rides (yes, a billion!), for a total of 250 GB. In fact, the dataset is so big that it does not fit on my laptop.&lt;/p&gt;

&lt;p&gt;Fortunately a large part of it can already be found on &lt;a href=&quot;https://cloud.google.com/bigquery/public-data/nyc-tlc-trips&quot;&gt;BigQuery&lt;/a&gt;, and can be accessed via simple SQL queries. The 2017 data is not there, so I downloaded it directly from the TLC. Take a look at my &lt;a href=&quot;https://github.com/gt987/Cabs-vs-Uber&quot;&gt;GitHub repository&lt;/a&gt; for an explanation of how to retrieve the data and play with the notebook.&lt;/p&gt;

&lt;p&gt;The FHV dataset contains the rides of all for-hire vehicles, including Uber’s. The latter can be identified by the license ID (also provided by the TLC). However, at the moment, this data is not as exhaustive as that of the Yellow and Green cabs, as it often contains only the license ID and the pickup time. It is nevertheless sufficient to determine the number of trips made by Uber cars. Notice also that there is an error in how BigQuery retrieves the FHV data for June 2015, which I fixed by looking at the original TLC data.&lt;/p&gt;
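&lt;p&gt;The counting step can be sketched like this (the base-license IDs and the records below are made up for illustration; the real ones come from the TLC files):&lt;/p&gt;

```python
from collections import Counter
from datetime import datetime

# Hypothetical Uber base-license IDs; the real list is published by the TLC.
UBER_BASES = {"B02512", "B02598"}

# An FHV record often carries just a base license ID and a pickup time.
records = [
    ("B02512", "2015-06-01 08:15:00"),
    ("B02598", "2015-06-02 23:40:00"),
    ("B01234", "2015-06-03 11:05:00"),  # some other FHV base
    ("B02512", "2015-07-04 19:30:00"),
]

monthly_uber = Counter()
for base, pickup in records:
    if base in UBER_BASES:
        t = datetime.strptime(pickup, "%Y-%m-%d %H:%M:%S")
        monthly_uber[(t.year, t.month)] += 1

print(dict(monthly_uber))  # {(2015, 6): 2, (2015, 7): 1}
```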

&lt;h2 id=&quot;bye-bye-cabs&quot;&gt;Bye Bye cabs…&lt;/h2&gt;

&lt;p&gt;The main point of this post is summarized in the next figure, where I plot the number of trips made by each company per month. I think there are three main take-home messages:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Uber is overtaking the Yellow Cab company (which lost 30% of its customers in the last 3 years).&lt;/li&gt;
  &lt;li&gt;Uber has changed our habits, as people are now more likely to take a ride.&lt;/li&gt;
  &lt;li&gt;At this pace, the traditional companies will be out of business within 3 years.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;img src=&quot;https://gt987.github.io/assets/images/CabsVsUber/time.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;One may think that, with the falling number of rides and the additional competition, the prices of the Yellow and Green cabs would fall. However, this does not seem to be the case. In fact, looking at the average trip cost in the next plot, prices have been pretty much constant over the past years. (Unfortunately, there is no fare data for Uber.)&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://gt987.github.io/assets/images/CabsVsUber/price.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Notice, however, that this plot shows just the average cost of a trip. One should not conclude that the Green cabs are cheaper, as the kinds of trips they take might be different (therefore biasing the average cost).&lt;/p&gt;

&lt;h2 id=&quot;github-repo&quot;&gt;GitHub Repo&lt;/h2&gt;

&lt;p&gt;The data for these plots are available on the GitHub repository if you are interested in doing your own analysis.&lt;/p&gt;</content><author><name>gabriele trevisan</name></author><summary type="html">Since the introduction of the iPhone, our lives and habits have changed dramatically, and one of these big changes involves the way we travel around cities. For example, 10 years ago we used to raise our arm to call a cab, nowadays we most likely reach for our phones, and probably in a few years we’ll summon a driverless car using our watch. Moreover, all these changes are also affecting companies that haven’t been able to evolve (think about Nokia) and entire sectors of the economy (sorry drivers…).</summary></entry><entry><title type="html">New Neighbors</title><link href="https://gt987.github.io/data%20lens/New-Neighbors/" rel="alternate" type="text/html" title="New Neighbors" /><published>2017-10-11T00:00:00+00:00</published><updated>2017-10-11T00:00:00+00:00</updated><id>https://gt987.github.io/data%20lens/New%20Neighbors</id><content type="html" xml:base="https://gt987.github.io/data%20lens/New-Neighbors/">&lt;p&gt;A couple of weeks ago I noticed that some new tenants had moved into the apartment above mine and I was of course fine with that… until they woke me up at 3 am. Loud music, dancing, jumping and all of that…&lt;/p&gt;

&lt;p&gt;But hey, I live in NYC, the city that never sleeps. Or at least this is what I thought. (Okay, maybe not the first thing that I thought…)&lt;/p&gt;

&lt;p&gt;But is this even true? Are people in NYC really used to parties? In other words:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;when do people typically start to complain about noise/music?&lt;/li&gt;
  &lt;li&gt;which are the noisiest areas in NYC?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To find an answer to these questions, I took a look at the enormous &lt;a href=&quot;https://opendata.cityofnewyork.us&quot;&gt;database&lt;/a&gt; of calls to 311.
The database contains information about the service requests made in NYC since 2010 (roughly 16 million). I needed only a very limited amount of information for the following exploration, but if you want to explore more, you can find my notebook &lt;a href=&quot;https://github.com/gt987/311-Service-Request&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;is-it-too-early-to-call-311&quot;&gt;Is it too early to call 311?&lt;/h2&gt;

&lt;p&gt;I pulled from the database 10^5 calls that were categorized as “residential noise”. Interestingly enough, most of the time people complain about their neighbors playing loud music and dancing (70%), and only sometimes about pounding/banging (25%). Take a look.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://gt987.github.io/assets/images/New Neighbors/calls_dist.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;But the most important thing for my good rest is the distribution of calls over time. When is it ok to call 311?&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://gt987.github.io/assets/images/New Neighbors/calls_hist.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;As expected, most of the calls happen at night, starting (to my surprise) as early as 9 pm, with a peak around 11 pm.
Just for comparison, my beloved neighbors were still making a mess at 4 am! I guess I won’t feel sorry for shutting down the party next time!&lt;/p&gt;
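&lt;p&gt;Finding the peak hour is a one-liner once the timestamps are parsed; a sketch with made-up call times (the real ones come from the 311 dataset):&lt;/p&gt;

```python
from collections import Counter
from datetime import datetime

# Made-up 311 call timestamps; the real dataset provides a creation date per call.
calls = [
    "2017-09-30 23:10:00",
    "2017-09-30 23:45:00",
    "2017-10-01 00:20:00",
    "2017-09-29 21:05:00",
    "2017-09-29 23:55:00",
]

# Histogram of calls by hour of day, then pick the busiest hour.
by_hour = Counter(datetime.strptime(c, "%Y-%m-%d %H:%M:%S").hour for c in calls)
peak_hour = max(by_hour, key=by_hour.get)
print(peak_hour)  # 23
```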

&lt;h2 id=&quot;where-is-the-party&quot;&gt;Where is the party?&lt;/h2&gt;

&lt;p&gt;As a fun side question, one could ask which neighborhoods receive the most attention. To find out, I used k-means clustering with 100 centers and listed the clusters with the highest number of calls. These are the top 3.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://gt987.github.io/assets/images/New Neighbors/noisy.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
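&lt;p&gt;The clustering step can be sketched with a bare-bones version of Lloyd’s k-means on (latitude, longitude) pairs. The points below are made up, and I use k=2 here instead of the 100 centers used on the real calls:&lt;/p&gt;

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's k-means on 2-D points; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize centers from the data
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        for i, (x, y) in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda j: (x - centers[j][0]) ** 2 + (y - centers[j][1]) ** 2,
            )
        # Move each center to the mean of its assigned points.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centers, labels

# Two made-up blobs of call coordinates (roughly Lower East Side vs. Harlem).
points = [(40.715, -73.986), (40.717, -73.984), (40.714, -73.988),
          (40.811, -73.946), (40.813, -73.944)]
centers, labels = kmeans(points, k=2)
print(labels)
```

With real data one would feed in the 10^5 call coordinates and k=100, then rank clusters by size.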

&lt;p&gt;If you live in NYC and you want to sneak into a party, I would suggest trying the Lower East Side!&lt;/p&gt;</content><author><name>gabriele trevisan</name></author><summary type="html">A couple of weeks ago I noticed that some new tenants had moved into the apartment above mine and I was of course fine with that… until they woke me up at 3 am. Loud music, dancing, jumping and all of that…</summary></entry></feed>