

[Kaggle] New York taxi-trip duration

EDA regression clustering scikit-learn numpy pandas

Introduction

This project is a machine learning competition on Kaggle. The data are the trip records of New York City taxis, and the task is to predict the trip duration with a regression model from the given variables: the pickup and drop-off locations (latitude and longitude), the pickup date/time, the number of passengers, etc. The design of the learning algorithm includes preprocessing (feature exploration and data selection), modeling and validation. To improve the prediction, we ran several tests of modeling and feature extraction; here we summarize the results.


Techniques

The procedure in this study consists of feature exploration, clustering and modeling. The data selection deals with the noise that could contaminate the prediction. Having little domain knowledge, we selected the data by visualizing it and inspecting the distribution of each feature. The pickup locations from the training data are shown in the figures below.

1. Feature exploration

The dataset actually has few features, and some of them are noisy from the data-taking. Before training the model, it is better to explore the most useful features and veto the noise. Believing that relevant information, e.g. distance, direction or density, is hidden in the data, we made several statistical studies to extract additional features. Besides what can be derived from the original data, extra knowledge from other studies by domain experts is also added for the model training. These studies, called exploratory data analysis (EDA), improve the final prediction of the trained model. The selections are summarized in the following list, and the details are described in the subsections.

  • 300 < trip duration < 7200 (sec.)
  • Vetoed dates: 2016.01.20 ~ 2016.01.25
  • 0.6 < distance < 11 (km)
  • 5 < speed < 120 (km/hr)
  • 0 < number of passengers < 7
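The cuts above can be sketched in pandas. The columns `trip_duration` (seconds), `passenger_count` and `pickup_datetime` follow the Kaggle column names; `distance_km` and `speed_kmh` are assumed to have been derived beforehand (the original post does not give names for them):

```python
import pandas as pd

def apply_selection(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the EDA cuts listed above.

    `distance_km` and `speed_kmh` are assumed derived columns;
    the rest follow the Kaggle dataset's column names.
    """
    snow = df["pickup_datetime"].between("2016-01-20", "2016-01-25")
    mask = (
        (df["trip_duration"] > 300) & (df["trip_duration"] < 7200)
        & ~snow
        & (df["distance_km"] > 0.6) & (df["distance_km"] < 11.0)
        & (df["speed_kmh"] > 5) & (df["speed_kmh"] < 120)
        & (df["passenger_count"] > 0) & (df["passenger_count"] < 7)
    )
    return df[mask]
```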

1.1. Time and duration

The pickup date-time is recorded in both the training and test datasets, while the drop-off time and duration appear only in the training dataset. We analyzed the date-time by observing the distributions over month, week, day and clock time. The distribution of dates shows a strange drop between January 20th and 25th: a big snowstorm hit New York and caused traffic problems.

[Figure: distribution of pickup dates, showing the drop between January 20th and 25th]

On the other hand, training data with very large durations are discarded, since we believe those entries are outliers for a prediction within New York City. The selection keeps durations longer than 5 minutes (300 s) and shorter than 2 hours.

Since the raw clock value is not a meaningful quantity by itself, an additional transformation maps the time onto an angular value, which represents its continuous cyclic nature (e.g. 23:00 is adjacent to 00:00).
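One common way to realize this angular encoding, and a plausible reading of the transformation described here, is to map each periodic quantity to a point on the unit circle:

```python
import numpy as np
import pandas as pd

def encode_cyclic(values: pd.Series, period: float) -> pd.DataFrame:
    """Map a periodic quantity (hour of day, day of week, ...) onto the
    unit circle so that, e.g., 23:00 and 00:00 end up close together."""
    angle = 2 * np.pi * values / period
    return pd.DataFrame({"sin": np.sin(angle), "cos": np.cos(angle)})
```

For hours of the day the period is 24; the (sin, cos) pair replaces the raw hour as two features.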

1.2. Distance, speed and direction

The locations, speed and direction are expected to be strongly correlated with the duration, especially the distance. We can use the pickup and drop-off coordinates to draw a scatter map of New York City. The feature that quickly shows up is the density of pickup and drop-off locations.

[Figure: scatter map of pickup and drop-off locations in New York City]

Most of the dense regions are in the city center and the two airports. The direction from pickup to drop-off may point out the popular routes, especially between the city center and the airports, which can affect the duration prediction. Thus, we estimated the direction as

[Equation: direction (bearing) from pickup to drop-off]
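The exact expression used in the original figure is not shown on the page; a standard choice for this feature, sketched here as an assumption, is the initial great-circle bearing:

```python
import numpy as np

def bearing(lat1, lon1, lat2, lon2):
    """Initial compass bearing (degrees) from pickup to drop-off.

    Standard great-circle bearing formula; whether the original post
    used exactly this expression is an assumption.
    """
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlon = lon2 - lon1
    y = np.sin(dlon) * np.cos(lat2)
    x = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(dlon)
    return (np.degrees(np.arctan2(y, x)) + 360) % 360
```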

Moreover, the distance between pickup and drop-off is calculated with the haversine formula, i.e.

$$d = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\varphi_2-\varphi_1}{2}\right) + \cos\varphi_1\cos\varphi_2\sin^2\left(\frac{\lambda_2-\lambda_1}{2}\right)}\right)\,,$$

where $r$ is the Earth's radius and $(\varphi_1, \lambda_1)$, $(\varphi_2, \lambda_2)$ are the latitude and longitude of points 1 and 2. From the distribution, we found that a few training trips lie outside New York City, with extremely large distances. Thus, we only keep the data whose distance falls within the 95% coverage of the training dataset, assuming the distance distribution is Gaussian. Finally, since the duration is available in the training dataset, we simply calculated the average speed. The speed helps flag noisy data: only trips with speeds between 5 and 120 km/hr are used for training.
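The haversine distance named above can be computed directly from the coordinates; this is a minimal numpy sketch using the mean Earth radius:

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
```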

1.3. Passenger

According to the distribution of passengers, there is an over-counting issue beyond a normal taxi's capacity. Thus, a selection is applied that keeps trips with more than 0 and fewer than 7 passengers.

1.4. Extra feature - right/left turns

Adding the extra data from the Open Source Routing Machine (OSRM) gives additional information about the traffic along each route. Here we use the turn steps of the taxi ride for model training, i.e. the numbers of right and left turns. Turns are expected to affect the duration, since right and left turns cost different amounts of time. In a right-hand-traffic (RHT) city, e.g. New York City or Taipei, a right turn is much more efficient than a left turn, hence a route with fewer left turns should have a smaller duration for the same distance.
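An OSRM route response lists maneuver steps, each carrying a `modifier` such as "left" or "sharp right". How the original post aggregated them is not shown; a plausible sketch of the turn-count feature, assuming steps parsed from such a response, is:

```python
def count_turns(route_steps):
    """Count left and right turns in a list of OSRM route steps.

    Each step is assumed to be a dict shaped like the `steps` entries
    of an OSRM /route response, e.g. {"maneuver": {"modifier": "left"}}.
    """
    left = right = 0
    for step in route_steps:
        modifier = step.get("maneuver", {}).get("modifier", "")
        if "left" in modifier:        # "left", "slight left", "sharp left"
            left += 1
        elif "right" in modifier:     # "right", "slight right", "sharp right"
            right += 1
    return left, right
```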

2. Clustering

We have studied different clustering algorithms to quantify or label the pickup and drop-off locations, e.g. the k-means algorithm and the density pixels built from our statistical studies. Given the limited competition deadline and manpower, the density-pixels method wasn't completely finished; its performance is similar to that of k-means for the chosen training model. Thus, the final prediction uses k-means for feature extraction.

2.1. K-means

The k-means algorithm is one choice to cluster the data and label the locations. Its important feature is that it partitions the data into k clusters (sets) such that each point has minimum error with respect to its set's mean. When clustering locations by their coordinates, this can represent k regions of different density. However, we didn't optimize the number of clusters; it is set to 100. The following left-hand figure shows the coverage of the regions in prediction, while the right-hand figure shows the clustering results applied to the data.
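The clustering step can be sketched with scikit-learn; the coordinates below are random stand-ins for the dataset's pickup/drop-off (latitude, longitude) columns:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# stand-in coordinates around Manhattan; the real input is the
# (latitude, longitude) columns of the Kaggle dataset
coords = rng.normal(loc=[40.75, -73.98], scale=0.05, size=(5000, 2))

# 100 regions, as in the text; the cluster count was not optimized
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(coords)
labels = kmeans.labels_  # integer region label per trip end
```

The integer labels can be fed straight to a tree model as a categorical-like feature.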

[Figures: k-means region coverage (left) and the clustering result applied to the data (right)]

2.2. Density pixels

This method tries to give a meaningful value to the training models instead of a label. Using physical intuition, we calculate the density of pickup and drop-off coordinates in fixed-size bins on the map, which we call "pixels". Each pixel's value represents how often taxis pick up or drop off in that region. The results can be shown as heat maps in the following figure: the right-hand figures are on a linear scale, the left-hand ones on a log scale; the top figures are for pickups, the bottom ones for drop-offs.
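The pixel densities can be built with a 2-D histogram; the bin count and map extent below are illustrative choices, not the ones used in the original study:

```python
import numpy as np

def density_pixels(lat, lon, bins=200, extent=((40.6, 40.9), (-74.1, -73.7))):
    """Histogram (lat, lon) points on a fixed grid; each bin ("pixel")
    counts how often a pickup or drop-off falls in that region."""
    hist, lat_edges, lon_edges = np.histogram2d(lat, lon, bins=bins, range=extent)
    return hist, lat_edges, lon_edges

def pixel_value(hist, lat_edges, lon_edges, lat, lon):
    """Look up the density of the pixel containing one (lat, lon) point."""
    i = np.searchsorted(lat_edges, lat, side="right") - 1
    j = np.searchsorted(lon_edges, lon, side="right") - 1
    return hist[i, j]
```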

[Figure: pickup (top) and drop-off (bottom) density heat maps, on linear (right) and log (left) scales]

Of course, the bin size could be optimized as a hyperparameter of the clustering; we can expect that either too large or too small a bin size may mislead the model training. An interesting feature we found is that the density difference between pickup and drop-off follows a normal distribution, as shown in the following figure.

[Figure: distribution of the density difference between pickup and drop-off]

It shows that there is a certain frequency of trips between two regions. Thus, we take the absolute value of the difference and compare it with the duration in the following figure.

[Figure: duration versus the absolute density difference]

The correlation tells an easily understood story: long durations most likely come from trips between the city center and unpopular regions, e.g. the countryside, while short durations can happen between two city-center pixels, which share comparably high densities. On the other hand, we also calculated the average speed in each region, which can give an expectation of the duration. However, since this method drifts toward human learning by statistical analysis rather than machine learning, we didn't go deeper in the end.

3. Modeling

Since the provided features are few and the data are noisy, we decided to use tree learning algorithms as the training model. Two tree algorithms, random forest and boosted decision trees (BDT), have been compared; BDT is used for our final fit. As mentioned at the beginning of the clustering section, the k-means algorithm provides the location information, since cluster labels do not need one-hot encoding for tree algorithms. This makes k-means much easier to implement than the density-pixels method.
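The original post does not name a library for the BDT; a minimal sketch with scikit-learn's gradient boosting, on toy stand-in features, could look like the following. Training on log1p of the duration makes the squared-error objective match the RMSLE metric used for scoring:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# toy stand-in features; the real matrix holds the engineered features
# (distance, bearing, cyclic time, k-means labels, turn counts, ...)
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
duration = np.exp(X[:, 0] + 0.1 * rng.normal(size=2000)) * 600  # seconds

# fitting log1p(duration) turns plain RMSE into the RMSLE metric
model = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)
model.fit(X[:1500], np.log1p(duration[:1500]))

pred = np.expm1(model.predict(X[1500:]))  # back to seconds
```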

The public RMSLE score of the model without the OSRM features is 0.39, improved to 0.38 after adding them. The RMSLE, i.e. Root Mean Squared Logarithmic Error, is defined as

$$\epsilon_{\mathrm{RMSLE}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\ln(p_i+1) - \ln(a_i+1)\right)^2}\,,$$

where $n$ is the total number of observations in the dataset, $p_i$ is the prediction, and $a_i$ is the actual value.
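The definition translates directly into numpy:

```python
import numpy as np

def rmsle(pred, actual):
    """Root Mean Squared Logarithmic Error; log1p(x) computes ln(x + 1)."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((np.log1p(pred) - np.log1p(actual)) ** 2))
```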

The model is validated by checking the learning curve between the test and training datasets with n-fold cross validation; it shows no overfitting. The training epochs are shown in the following figure.
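The n-fold validation can be sketched with scikit-learn; the fold count and scoring metric here are illustrative (the post does not state which n was used):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))       # stand-in feature matrix
y = X[:, 0] + 0.1 * rng.normal(size=1000)

# 5-fold cross validation; a large spread between fold scores, or a big
# gap between training and validation error, would signal overfitting
scores = cross_val_score(
    GradientBoostingRegressor(random_state=0), X, y,
    cv=5, scoring="neg_root_mean_squared_error",
)
```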

[Figure: learning curve over the training epochs]

Results

The final $\epsilon_{\mathrm{RMSLE}}$ on the Kaggle private dataset is 0.37, which places in the top 8% of 1257 teams. Obviously, the results could still be improved by further modeling or feature exploration.
