Analyzing Uber rides history in Kusto

Last modified: 02/09/2019

Ride-hailing and ride-sharing have seen enormous growth and adoption, which means - there is, and there will continue to be lots of data to analyze about how people across the world are using these services, as well as comparing the level of service different companies such as Uber, Lyft and others provide to their riders and drivers.

Having recently moved to a city where these are one of the main means of day-to-day transportation, my usage of them as a rider has significantly grown. I decided it’s time to look into some data and see what I can learn from it.

Naturally, I chose Kusto to look into it, and I took my personal data from Uber for a ride.

(Some of the data points in this post have been altered for the sake of privacy)

Analyzing Uber rides history in Kusto

Getting the source data

Did you know?

Uber makes it possible to download your rides history, including properties such as:

Time of request/beginning/end of the trip.
Longitude/Latitude/Address of beginning/end of the trip, and the total distance.
Fare amount and currency.

Once you get the “Your data is available to download” email (it could take a few days after your request), you can download the aforementioned data set in a standard CSV format, which makes it super easy to analyze in Kusto.

The data I used is the one in the file named trips_data.csv. Here’s an example of a few records from it:

City,Product Type,Trip or Order Status,Request Time,Begin Trip Time,Begin Trip Lat,Begin Trip Lng,Begin Trip Address,Dropoff Time,Dropoff Lat,Dropoff Lng,Dropoff Address,Distance (miles),Fare Amount,Fare Currency
San Francisco,Express Pool: MATCHED,COMPLETED,2018-08-17 03:50:18 +0000 UTC,2018-08-17 03:56:45 +0000 UTC,73.7869535,-111.3922508,"999 ... St, San Francisco, CA 94105, USA",2018-08-17 04:16:43 +0000 UTC,57.779273,-222.4352993,"1111 ... St, San Francisco, CA 81445, USA",3.30,5.1,USD
San Francisco,UberX,COMPLETED,2018-08-15 19:27:02 +0000 UTC,2018-08-15 19:29:30 +0000 UTC,67.7790511087,-333.395617827,"444 ... St, San Francisco, CA 94107, US",2018-08-15 19:35:46 +0000 UTC,57.7772837419,-444.416049126,"1555 ... St, San Francisco, CA 41578, USA",1.58,9.86,USD
San Francisco,Express Pool: MATCHED,COMPLETED,2018-08-12 21:20:24 +0000 UTC,2018-08-12 21:27:40 +0000 UTC,47.7792305817,-555.434557152,"1111 ... Ave, San Francisco, CA 81445, USA",2018-08-12 21:42:51 +0000 UTC,47.7870367854,-666.392240158,"999 ... St, San Francisco, CA 94105, USA",2.77,3.33,USD

Ingesting the data into Kusto

Even though it’s possible to get this done super-easily using Kusto.Explorer, this time I decided to show a different way, super-easy as well, using Kusto’s externaldata operator.

Examples for ingestion using Kusto.Explorer can be found in my other posts:
- Analyzing Spotify streaming history in Kusto
- Update policies for in-place ETL in Kusto

(The entire process literally took me less than 1 minute)

I took the source file, trips_data.csv and uploaded it to Azure blob storage using Azure Storage Explorer.
I got a URL to the uploaded blob using a Shared Access Signature (SAS).

I invoked the following .set-or-append command against my database, to get the data ingested.

Note that I’m running a query against the source data before ingesting it, as the datetime format in the source data provided by Uber isn’t in a standard or supported format.
This also allows me to not have to create the table in advance, the command takes care of it for me.
I could have achieved the same goal of pre-processing the data at ingestion time by using an update policy.

 .set-or-append UberRides <| 
     externaldata(
         City:string,
         ProductType:string,
         TripOrOrderStatus:string,
         RequestTime:string,
         BeginTripTime:string,
         BeginTripLat:double,
         BeginTripLng:double,
         BeginTripAddress:string,
         DropoffTime:string,
         DropoffLat:double,
         DropoffLng:double,
         DropoffAddress:string,
         Distance_Miles:double,
         FareAmount:double,
         FareCurrency:string
     )
     [h'https://account.blob.core.windows.net/container/trips_data.csv<SAS>']
     | where City != "City" // This is for filtering out the header line
     | extend RequestTime = todatetime(substring(RequestTime, 0, 19)),
              BeginTripTime = todatetime(substring(BeginTripTime, 0, 19)),
              DropoffTime = todatetime(substring(DropoffTime, 0, 19))

Analyzing the data

This is by no means Big Data, and not even close to Billions of rides, which I’ve covered in another post, however it does allow to see the capabilities of Kusto very nicely. Plus, if you hop over to that other post after reading this on, you’ll see that Kusto performs great for data sets with billions of rides too.

(Got suggestions for more interesting queries? Let me know and I’ll update the post based on your feedback)

Let’s see how the data looks like in Kusto:

UberRides
| limit 5

City	ProductType	TripOrOrderStatus	RequestTime	BeginTripTime	BeginTripLat	BeginTripLng	BeginTripAddress	DropoffTime	DropoffLat	DropoffLng	DropoffAddress	Distance_Miles	FareAmount	FareCurrency
San Francisco	Pool	CANCELED	2018-10-31 14:09:13.0000000	1970-01-01 00:00:00.0000000	57.7910804435	-122.415729391		1970-01-01 00:00:00.0000000	57.7911779787	-254.416031619	36 12th St, San Francisco, CA 41578, USA	0	0	USD
San Francisco	Pool: MATCHED	COMPLETED	2018-10-31 14:10:34.0000000	2018-10-31 14:20:20.0000000	57.77923	-122.43458	777 Fictional Ave, San Francisco, CA 81445, USA	2018-10-31 14:26:57.0000000	75.77584	-543.41663	36 12th St, San Francisco, CA 41578, USA	1.28	4.15	USD
San Francisco	Pool: MATCHED	COMPLETED	2018-11-02 23:10:12.0000000	2018-11-02 23:21:45.0000000	57.7757559835	-122.416465092	36 120th St, San Francisco, CA 41578, USA	2018-11-02 23:26:07.0000000	34.770701054	-176.406112678	555 9th St, San Francisco, CA 41578, US	0.81	4.21	USD
San Francisco	Pool: MATCHED	COMPLETED	2018-11-02 23:38:52.0000000	2018-11-02 23:46:48.0000000	57.77074	-122.40782	555 904th St, San Francisco, CA 41578, US	2018-11-02 23:52:27.0000000	43.77511	-211.41624	4100 Mission St, San Francisco, CA 41578, US	0.64	4.63	USD
San Francisco	Pool: MATCHED	COMPLETED	2018-11-03 21:53:36.0000000	2018-11-03 21:59:40.0000000	57.77925	-122.43445	777 Fictional Ave, San Francisco, CA 81445, USA	2018-11-03 22:12:33.0000000	12.77593	-122.41673	36 12th St, San Francisco, CA 41578, USA	1.74	4.62	USD

Now, I can start having some fun, using Kusto’s query language and its rich analytical capabilities. All of these queries complete super-fast, no surprises here.

The basics

I can see the Uber service I use the most:

UberRides
| as hint.materialized = true T
| summarize Percentage = round(100.0 * count() / toscalar(T | count), 2)
         by ProductType = tolower(ProductType)
| render piechart 

I can also see percentiles of my rides length (miles and duration) and cost (in USD):

UberRides
| extend WaitDuration = BeginTripTime - RequestTime,
         TripDuration = DropoffTime - BeginTripTime
| summarize percentiles(WaitDuration, 50, 75, 90),
            percentiles(TripDuration, 50, 75, 90),
            percentiles(FareAmount, 50, 75, 90),
            percentiles(Distance_Miles, 50, 75, 90)
| evaluate narrow()
| project-away Row

Column	Value
percentile_FareAmount_50	4.21
percentile_FareAmount_75	6.25
percentile_FareAmount_90	24.81

percentile_TripDuration_50	00:08:09
percentile_TripDuration_75	00:16:23
percentile_TripDuration_90	00:24:35

percentile_WaitDuration_50	00:06:28
percentile_WaitDuration_75	00:09:09
percentile_WaitDuration_90	00:12:51

percentile_Distance_Miles_50	1.91
percentile_Distance_Miles_75	3.38
percentile_Distance_Miles_90	12.75

And, what are the most popular hours of day, and days of week I user the service:

UberRides
| summarize count() by hourofday(RequestTime-8h)
| render columnchart 

UberRides
| summarize count() by dayofweek(RequestTime-8h) / 1d + 1
| render columnchart

Wanna guess where I live?

UberRides
| as hint.materialized = true T
| summarize Percentage = round(100.0 * count() / toscalar(T | count), 2)
         by City
| render piechart 

Cost and efficiency

This is one person’s data, so any attempt to deduce global patterns out of it won’t be meaningful.

Still, some of these could be interesting, e.g. trying to anticipate which hours/days are more congested, more popular, and perhaps more pricey.

Let’s look into the wait and ride durations, and try to figure out how long in advance I should book my ride.

This query shows the average wait durations and average trip durations by the hour of day, for trips taken in San Francisco:

UberRides
| where Distance_Miles < 15
| where City == "San Francisco"
| where DropoffTime > datetime(2018-01-01)
| extend WaitDuration = BeginTripTime - RequestTime,
         TripDuration = DropoffTime - BeginTripTime
| summarize AvgWaitDurationMins = avg(WaitDuration)/1m, 
            AvgTripDurationMins = avg(TripDuration)/1m
         by hourofday(RequestTime-8h)
| render columnchart with(kind = stacked)

The following query shows the average MPH by hour of day, for the same set of trips:

UberRides
| where Distance_Miles < 15
| where City == "San Francisco"
| where DropoffTime > datetime(2018-01-01)
| extend TripDurationHrs = (DropoffTime - BeginTripTime) / 1h
| summarize avg(Distance_Miles / TripDurationHrs) by hourofday(BeginTripTime-8h)
| render columnchart 

You can say many things about living in San Francisco, but one thing you can’t ignore it the cost of living. Choosing when you go to certain places could perhaps save you some money.

This query shows the average cost (USD per minute) by day of week and hour of day. You can see that for my rides, Wednesday around 11PM and Friday around 1PM were the priciest, on average:

UberRides
| where Distance_Miles < 10
| where City == "San Francisco"
| where DropoffTime > datetime(2018-01-01)
| extend TripDurationMins = (DropoffTime - BeginTripTime) / 1m
| extend UsdPerMin = FareAmount / TripDurationMins,
         DayOfWeek = toint(dayofweek(RequestTime-8h) / 1d + 1),
         HourOfDay = hourofday(BeginTripTime-8h)
| summarize round(avg(UsdPerMin),2) by Day_Hour = strcat(DayOfWeek, "_", HourOfDay)
| top 10 by avg_UsdPerMin desc 

Day_Hour	avg_UsdPerMin
4_11	1.57
6_13	1.30
3_8	1.03
3_18	0.91
7_20	0.87
6_12	0.76
6_15	0.73
6_20	0.68
6_6	0.65
3_6	0.62

This is a similar query, which looks at the average cost (USD per mile). By this measurement, Wednesday around 11PM and Friday around 1PM are still the priciest, but in a different order than above:

UberRides
| where Distance_Miles < 10
| where City == "San Francisco"
| where DropoffTime > datetime(2018-01-01)
| extend TripDurationMins = (DropoffTime - BeginTripTime) / 1m
| extend UsdPerMile = FareAmount / Distance_Miles,
         DayOfWeek = toint(dayofweek(RequestTime-8h) / 1d + 1),
         HourOfDay = hourofday(BeginTripTime-8h)
| summarize round(avg(UsdPerMile), 2) by Day_Hour = strcat(DayOfWeek, "_", HourOfDay)
| top 10 by avg_UsdPerMile, 2 desc 

Day_Hour	avg_UsdPerMile
6_13	8.37
4_11	6.24
3_18	4.57
6_15	4.33
7_10	3.97
6_12	3.84
3_6	3.47
6_17	3.29
6_20	3.18
3_8	3.09

Random facts

My rides in the city stay mostly within the same small set of zip codes:

UberRides
| where BeginTripAddress has 'San Francisco, CA' and DropoffAddress has 'San Francisco, CA'
| parse BeginTripAddress with * 'San Francisco, CA ' FromZipCode:int ', US' *
| parse DropoffAddress with * 'San Francisco, CA ' ToZipCode:int ', US' *
| summarize count() by FromZipCode, ToZipCode
| top 10 by count_ desc

FromZipCode	ToZipCode	count_
94103	94103	21
94115	94103	17
94103	94105	9
94105	94103	4
94105	94115	4
94115	94105	4
94109	94105	3
94103	94115	3
94123	94103	3
94102	94115	2

Also, 8 characters is a somewhat popular length for a street name. Here’s the number of distinct streets I’ve gone from or to, by their name length:

UberRides
| project Address = pack_array(BeginTripAddress, DropoffAddress)
| mv-expand Address to typeof(string)
| where isnotempty(Address)
| where Address has "St, San Francisco, CA" 
| where Address matches regex @"^\d+" // places which start with a number
| parse kind=regex Address with @"[^\s]+ " Street "St, San Francisco, CA" *
| summarize dcount(Street), SampleNames = strcat_array(make_set(Street, 3), ", ") by strlen(Street)
| order by dcount_Street desc

strlen_Street	dcount_Street	SampleNames
8	8	Fremont , Brannan , Vicente
5	7	10th , Fell , 11th
7	6	Howard , Pierce , Haight
4	4	9th , Bay , 3rd
6	4	Union , Hayes , Scott
9	2	Chestnut , Harrison
10	1	O’Farrell
11	1	McAllister