Big Yellow Taxi

Analyzing ridership data on taxis in 2019

Posted by Aditya Ranganathan on April 23, 2022 · 18 mins read

The dashboard can be accessed from here.

INTRO

This dashboard displays taxi ridership data in a simple and intuitive manner. Users can filter the data by taxi company and/or community area, switch between kilometres and miles and between 12-hour and 24-hour time formats, and choose whether to view rides going to or from a location. The map, table and bar graph update according to the filters applied, and the bar graph to display is chosen from a drop-down menu. The available graphs cover hourly, daily, weekly and monthly taxi rides, binned mileage, binned trip time, and the percentage of rides going to/from a community area. Together these views let users look for patterns in the data and draw their own inferences.

The project was developed in R using RStudio as the development environment and ShinyApps to deploy the dashboard. The dashboard was designed to run on a touch-screen wall with a resolution of 11,520 x 3,240.

INSTALLATION AND PREREQUISITES

  • The latest versions of R and RStudio are required to run the application. R can be installed from here.

RStudio Desktop can be installed from here.

  • This project uses several libraries, which can be installed with the ‘install.packages()’ command. After installing R and RStudio, open RStudio and run the following commands in the console.
    install.packages("lubridate")
    install.packages("DT")
    install.packages("leaflet")
    install.packages("leaflet.providers")
    install.packages("maptools")
    install.packages("viridis")
    install.packages("ggplot2")
    install.packages("rgdal")
    install.packages("dplyr")
    install.packages("tidyr")
    install.packages("stringr")
    install.packages("scales")
    install.packages("shiny")
    install.packages("shinydashboard")
    
  • Once the libraries are installed, open the project you’ve downloaded from GitHub by clicking File -> Open Project and navigating to the project directory.

  • Double click the ‘app.R’ file to open it and click on ‘Run’ in the top bar to run the application.

  • Optionally, if you wish to process the data from scratch, you will also need the raw data files. The main data source file can be downloaded from here.

The shape files for the map component can be downloaded from here. Download and save the shape files.
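The package installation step above can also be scripted so that only missing packages are installed. This is a minimal sketch and not part of the original project; the helper names `missing_packages` and `install_missing` are hypothetical.

```r
# Packages listed in the installation step above.
required <- c("lubridate", "DT", "leaflet", "leaflet.providers", "maptools",
              "viridis", "ggplot2", "rgdal", "dplyr", "tidyr", "stringr",
              "scales", "shiny", "shinydashboard")

# Hypothetical helper: return the packages in `pkgs` that are not yet installed.
missing_packages <- function(pkgs) {
  pkgs[!pkgs %in% rownames(installed.packages())]
}

# Hypothetical helper: install whatever is missing, and do nothing otherwise.
install_missing <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}

# Usage (commented out so that sourcing this file has no side effects):
# install_missing(required)
```

Note that `rgdal` and `maptools` were retired from CRAN in 2023, so on a recent R installation they may need to be installed from the CRAN archive.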

PRE-PROCESSING

The dataset is a single tab-separated values (.tsv) file with 16.5 million rows and 23 columns. For the purposes of this project, we focus on only the 6 columns listed below:

Ridership Dataset

Column                    Data Type
Trip Start Timestamp      string
Trip Seconds              integer
Trip Miles                float
Dropoff Community Area    integer
Pickup Community Area     integer
Company                   string
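Loading just these six columns might look like the sketch below. It is an illustration rather than the project's actual loading code: the helper `read_taxi_tsv` is hypothetical, and a tiny temporary file stands in for the real download so that the example runs on its own.

```r
# The six columns the post keeps out of the 23 available.
keep_cols <- c("Trip Start Timestamp", "Trip Seconds", "Trip Miles",
               "Pickup Community Area", "Dropoff Community Area", "Company")

# Hypothetical helper: read a tab-separated file and keep only `cols`.
# check.names = FALSE preserves the spaces in the column names.
read_taxi_tsv <- function(path, cols = keep_cols) {
  df <- read.delim(path, check.names = FALSE, stringsAsFactors = FALSE)
  df[, cols]
}

# Tiny stand-in file (made-up values) so the sketch runs without the real data.
demo_path <- tempfile(fileext = ".tsv")
writeLines(c(
  paste(c("Trip ID", keep_cols), collapse = "\t"),
  paste(c("a1", "01/15/2019 06:15:00 PM", "600", "2.4", "8", "32", "Flash Cab"),
        collapse = "\t")
), demo_path)

taxi_data <- read_taxi_tsv(demo_path)
str(taxi_data)
```

For the full 16.5 million rows, base `read.delim()` is likely to be slow; a faster reader such as `data.table::fread()` with its `select` argument may be a better fit.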

We first read the ridership data file into R as a data frame (taxi_data) so that we can manipulate its attributes. We then rename the columns for easier management of the data.

taxi_data <- taxi_data %>%
  rename(timestamp = `Trip Start Timestamp`,
         sec = `Trip Seconds`,
         miles = `Trip Miles`,
         pickup = `Pickup Community Area`,
         dropoff = `Dropoff Community Area`,
         company = `Company`)

We then check for missing (NA) values in the dataset columns.

lapply(taxi_data,
       function(x) { 
         length(which(is.na(x)))
         }
       )

We can observe that missing values appear in the Pickup and Dropoff Community Area columns. As per our understanding, these missing values refer to areas outside of Chicago. We will impute 0 (zero) in their place.

taxi_data$pickup[is.na(taxi_data$pickup)] <- 0
taxi_data$dropoff[is.na(taxi_data$dropoff)] <- 0 
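The check-and-impute step can be tried on a small data frame. The sketch below uses made-up values and `colSums(is.na(...))`, which gives the same per-column counts as the `lapply()` call above in a more compact form.

```r
# Toy data with made-up values; NA marks a trip endpoint outside Chicago.
toy <- data.frame(pickup  = c(8, NA, 32),
                  dropoff = c(NA, NA, 76),
                  miles   = c(1.2, 3.4, 5.6))

# Count missing values per column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Impute 0 for missing community areas, as in the post.
toy$pickup[is.na(toy$pickup)] <- 0
toy$dropoff[is.na(toy$dropoff)] <- 0
```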

As per our project requirements, we can reduce the number of data points by filtering on a few conditions. We remove all taxi trips shorter than 0.5 miles or longer than 100 miles.

taxi_data <- filter(taxi_data, miles > 0.5 & miles < 100)

Similarly, we remove all taxi trips shorter than 60 seconds or longer than 5 hours (18,000 seconds).

taxi_data <- filter(taxi_data, sec > 60 & sec < 18000)
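To see the effect of the distance and duration filters together, here is a small sketch on made-up trips. It assumes the duration filter applies to the `sec` column and mirrors the strict comparisons used in the post; switching to `>=` and `<=` would keep trips that sit exactly on a boundary.

```r
library(dplyr)

# Toy trips with made-up values.
toy <- data.frame(miles = c(0.2, 0.5, 12, 150, 3),
                  sec   = c(300, 45, 900, 4000, 20000))

max_sec <- 5 * 60 * 60  # 5 hours expressed in seconds (18000)

# Keep trips that pass both the distance and the duration filter.
kept <- toy %>%
  filter(miles > 0.5, miles < 100,
         sec > 60, sec < max_sec)

print(kept)
```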

Next, we clean up the strings in the company column to make the names more presentable. First, replace all non-alphabetical characters with spaces.

taxi_data$company <- str_replace_all(taxi_data$company, "[^[:alpha:]]", " ")

Then remove the leading and trailing spaces from each name.

taxi_data$company <- trimws(taxi_data$company, which = c("both"))

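Stripping digits also removes meaningful prefixes (for example, ‘312 Medallion Management Corp’ becomes ‘Medallion Management Corp’), so we manually restore some of the company names to ensure no data is lost.

taxi_data <- taxi_data %>%
  mutate(company = recode(company, `Medallion Management Corp` = '312 Medallion Management Corp', 
                          `Star Taxi` = 'Five Star Taxi', 
                          `Seven Taxi` =  'Twentyfour Seven Taxi', 
                          # NOTE: `Star Taxi` is mapped twice (above and here); a value can be
                          # recoded only once, so one of these two mappings has no effect
                          `Star Taxi` = 'STAR Taxi',
                          `Sun Taxi` = 'SUN Taxi',
                          `Checker Taxi Affiliation` = 'CHECKER Taxi Affiliation',
                          `Chicago Taxicab` = 'CHICAGO Taxicab'
                          ))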
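The reason a manual recode is needed becomes clearer when the cleaning steps are run on a few names. In the sketch below, ‘312 Medallion Management Corp’ comes from the recode above, while the other two raw names are illustrative assumptions. The `str_squish()` call is an optional extra, not something the post does, that collapses the double spaces left behind when punctuation is replaced.

```r
library(stringr)

# Example raw names; the last two are assumed for illustration.
raw <- c("312 Medallion Management Corp",
         "5 Star Taxi",
         "Taxicab Insurance Agency, LLC")

# Same two steps as in the post: replace non-letters with spaces, then trim.
cleaned <- trimws(str_replace_all(raw, "[^[:alpha:]]", " "), which = "both")
print(cleaned)

# Optional extra step: collapse repeated internal spaces.
squished <- str_squish(cleaned)
print(squished)
```

Because digits are not alphabetical, numeric prefixes disappear, which is why names such as ‘Medallion Management Corp’ have to be mapped back by hand.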

We’ll create a unique integer code for each taxi company, the idea being that integers use less memory and are faster to process than strings.

taxi_data$company_code <- as.integer(factor(taxi_data$company))

We’ll create a new dataframe that holds just the company name and the corresponding company code, and write it to a .csv file.

companies <- distinct(taxi_data, company_code, .keep_all = TRUE)
companies <- companies[, c("company", "company_code")]
write.csv(companies,"C:/Users/aranga22/Downloads/Academics/Sem 2/424 Visual Data/Projects/424_Project3/ext\\companies.csv", row.names = FALSE)
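The sketch below shows how `as.integer(factor(...))` assigns codes and how a lookup table can translate codes back to names. The company names are illustrative, and the table is kept in memory instead of being written to disk.

```r
# Illustrative company names (assumed for this example).
company <- c("Sun Taxi", "Flash Cab", "Sun Taxi", "City Service")

# factor() sorts the distinct names, and as.integer() returns each
# name's position in that sorted list.
company_code <- as.integer(factor(company))

# One row per company: the lookup table the post writes to companies.csv.
lookup <- unique(data.frame(company = company, company_code = company_code))
lookup <- lookup[order(lookup$company_code), ]
print(lookup)

# Translate codes back to names, e.g. for labels in the dashboard.
decoded <- lookup$company[match(c(3, 1), lookup$company_code)]
print(decoded)
```

Since the codes depend on the sorted set of names present in the data, the lookup table should be regenerated whenever the company cleaning step changes.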

Similarly, manually create a .csv file that holds each community area name and its community code. This can be done by copying the values from the table on Wikipedia.

Let’s deal with the timestamp data. We’ll first convert the timestamp to POSIX format and then extract the individual day, month, year, hour components from it.

taxi_data <- taxi_data %>%
  mutate(timestamp = mdy_hms(timestamp))


taxi_data <- taxi_data %>%
  mutate(year = year(timestamp),
         month = month(timestamp),
         day = day(timestamp),
         hour = hour(timestamp)
  )

We’ll create a date column by combining the year, month and day values.

taxi_data$date <- with(taxi_data, ymd(paste(year, month, day, sep= ' ')))
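The timestamp steps can be checked on a single value. The sketch below assumes the raw timestamps look like `01/15/2019 06:15:00 PM` (month/day/year with an AM/PM marker); that format is an assumption, so adjust the parser if the file differs. It also shows that `as_date()` on the parsed timestamp gives the same date as pasting the components back together.

```r
library(lubridate)

raw_ts <- "01/15/2019 06:15:00 PM"  # assumed input format

# Parse the string into a date-time (UTC by default).
ts <- mdy_hms(raw_ts)

# Extract the components used later in the dashboard.
parts <- list(year = year(ts), month = month(ts),
              day = day(ts), hour = hour(ts))
str(parts)

# Two ways to get the calendar date.
date_from_parts <- ymd(paste(parts$year, parts$month, parts$day, sep = " "))
date_direct <- as_date(ts)
```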

Get the weekday name and month name data from the corresponding columns

# Create week day column
taxi_data$week_day <- lubridate::wday(taxi_data$timestamp, label=TRUE)

# Map month numbers (1-12) to abbreviated names using R's built-in month.abb
name_of_month <- function(taxi_data){
  taxi_data$month_name <- month.abb[taxi_data$month]
  return(taxi_data)
}
taxi_data <- name_of_month(taxi_data)

We remove all unnecessary columns from our dataframe and write the dataset out in chunks, both to stay within file size restrictions and to make the data easier to read and manipulate.

taxi_data <- subset(taxi_data, select = -c(timestamp, year, month, day, company))

# map2() is provided by the purrr package (install.packages("purrr") if needed)
library(purrr)

# Number of output files
no_of_chunks <- 35
# Assign each row a chunk id between 1 and no_of_chunks
f <- ceiling(1:nrow(taxi_data) / nrow(taxi_data) * no_of_chunks)
res <- split(taxi_data, f)
map2(res, paste0("part_", names(res), ".csv"), write.csv)

The output files (‘part_1.csv’, ‘part_2.csv’ and so on) will be the data source used for our application.
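As a rough end-to-end illustration of the chunking idea, the sketch below splits a toy data frame into three files in a temporary folder and then reads them back into one data frame, which is presumably what the application does on start-up. The folder name and the toy data are assumptions, and base R's `Map()` is used in place of `map2()` to avoid the extra dependency.

```r
# Toy data standing in for the processed taxi data.
toy <- data.frame(id = 1:10, miles = seq(1, 10))

no_of_chunks <- 3
# Same chunk-id formula as in the post.
chunk_id <- ceiling(seq_len(nrow(toy)) / nrow(toy) * no_of_chunks)
parts <- split(toy, chunk_id)

# Hypothetical output folder inside the session's temp directory.
out_dir <- file.path(tempdir(), "taxi_chunks")
dir.create(out_dir, showWarnings = FALSE)
paths <- file.path(out_dir, paste0("part_", names(parts), ".csv"))

# Write one CSV per chunk, without row names.
invisible(Map(function(df, p) write.csv(df, p, row.names = FALSE), parts, paths))

# Read all chunks back and stack them into a single data frame.
files <- list.files(out_dir, pattern = "^part_.*\\.csv$", full.names = TRUE)
restored <- do.call(rbind, lapply(files, read.csv))
```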

Optionally, the same code can be used to group the data by taxi company and/or community area and write each group to its own .csv or .tsv file.

These are the data files we’ll be using for our dashboard.

THE DASHBOARD

Dashboard Interface

The dashboard is divided into 4 components:

SIDEBAR

Users can filter the data by adjusting the controls on the sidebar. They can view distances in kilometres or miles and times in 12-hour or 24-hour format. They can drill down into the dataset by selecting a specific community and/or taxi company, or view the data for all taxi companies and/or communities. They can also choose whether to see the rides going to or coming from a community.

Finally, users can choose whether to include or exclude community areas outside of Chicago. When this option changes, the data is re-processed according to the selection and the changes are reflected in the graphs and tables.
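The unit and time-format toggles described above come down to two small conversions. The helpers below are a hypothetical sketch of how they could be written; the function names are not taken from the project's source.

```r
# Hypothetical helper: convert miles to kilometres (1 mile = 1.609344 km).
miles_to_km <- function(miles) {
  miles * 1.609344
}

# Hypothetical helper: label an hour of the day (0-23) in 24-hour or 12-hour style.
hour_label <- function(hour, use_24h = TRUE) {
  if (use_24h) {
    return(sprintf("%02d:00", hour))
  }
  suffix <- ifelse(hour < 12, "AM", "PM")
  h12 <- hour %% 12
  h12[h12 == 0] <- 12  # midnight and noon are shown as 12
  paste(h12, suffix)
}

print(miles_to_km(10))
print(hour_label(c(0, 13, 23), use_24h = FALSE))
print(hour_label(5))
```

Keeping the stored data in miles and 0-23 hours and converting only at display time means the filters never have to re-read the data when a toggle changes.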

MAP

Map

Users can pan around the map of Chicago, zoom in or out, and view each community along with the percentage of taxi rides going to/from it. Selecting a community from the side panel centres the map on that community and opens a pop-up showing its name and the percentage of rides going to/from it. The data can be drilled down further by filtering on taxi company to see the percentage of that company's rides that go to/from a particular community.

BAR GRAPH

There are 7 different bar graphs in which users can look for patterns in the data.

DAILY GRAPH

Daily Plot

Users can see the daily ridership in the various communities in Chicago. Selecting a particular community and taxi company from the side panel shows the number of rides per day for that community and taxi company. The bar graph and table update to reflect the selected filters.

Users can easily mix and match different communities and taxi companies to find patterns in the data and draw reasonable inferences from them.
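The aggregation behind a graph like this could be sketched as follows. This is an illustration on made-up rows, not the dashboard's actual server code; the helper `daily_counts` and its argument names are hypothetical, and the column names follow the pre-processing section.

```r
library(dplyr)
library(ggplot2)

# Toy rides with made-up values.
toy <- data.frame(
  date = as.Date(c("2019-01-01", "2019-01-01", "2019-01-02")),
  pickup = c(8, 32, 8),
  company_code = c(1, 1, 2)
)

# Hypothetical helper: count rides per day, optionally restricted to one
# pickup community area and/or one company code.
daily_counts <- function(df, area_id = NULL, company_id = NULL) {
  if (!is.null(area_id)) df <- filter(df, pickup == area_id)
  if (!is.null(company_id)) df <- filter(df, company_code == company_id)
  count(df, date, name = "rides")
}

all_rides <- daily_counts(toy)
area_8 <- daily_counts(toy, area_id = 8)

# Bar graph of rides per day.
p <- ggplot(all_rides, aes(x = date, y = rides)) +
  geom_col() +
  labs(x = "Date", y = "Number of rides")
```

The hourly, weekly and monthly graphs would follow the same pattern with `hour`, `week_day` or `month_name` in place of `date`.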

HOURLY GRAPH

Hourly Plot

Users can see the hourly ridership in the various communities in Chicago. Selecting a particular community and taxi company from the side panel shows the number of rides in each hour of the day for that community and taxi company. The bar graph and table update to reflect the selected filters.

Users can easily mix and match different communities and taxi companies to find patterns in the data and draw reasonable inferences from them.

WEEKLY GRAPH

Weekly Plot

Users can see the ridership by day of the week in the various communities in Chicago. Selecting a particular community and taxi company from the side panel shows the number of rides on each weekday for that community and taxi company. The bar graph and table update to reflect the selected filters.

Users can easily mix and match different communities and taxi companies to find patterns in the data and draw reasonable inferences from them.

MONTHLY GRAPH

Monthly Plot

Users can see the monthly ridership in the various communities in Chicago. Selecting a particular community and taxi company from the side panel shows the number of rides per month for that community and taxi company. The bar graph and table update to reflect the selected filters.

Users can easily mix and match different communities and taxi companies to find patterns in the data and draw reasonable inferences from them.

BINNED MILEAGE GRAPH

Mileage Plot

Users can see the number of rides in each mileage range in the various communities in Chicago. Selecting a particular community and taxi company from the side panel shows the number of rides in each mileage range for that community and taxi company. The bar graph and table update to reflect the selected filters.

Users can easily mix and match different communities and taxi companies to find patterns in the data and draw reasonable inferences from them.
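Binning of this kind is typically done with base R's `cut()`. The sketch below is a minimal illustration; the break points are assumptions chosen for the example and not the bins the dashboard actually uses. The same approach would apply to trip time, with breaks expressed in seconds or minutes.

```r
# Toy trip distances in miles (made-up values).
toy_miles <- c(0.6, 1.5, 4.2, 9.9, 35, 80)

# Assumed bin edges; the dashboard's real bins may differ.
breaks <- c(0.5, 1, 2, 5, 10, 25, 100)

# Assign each trip to a bin; include.lowest keeps a trip of exactly 0.5 miles.
bins <- cut(toy_miles, breaks = breaks, include.lowest = TRUE)

# Number of rides per bin, including empty bins.
bin_counts <- table(bins)
print(bin_counts)
```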

PERCENTAGE GRAPH

Percentage Plot

Users can see the percentage of rides that go to/from the various communities in Chicago. Selecting a particular community from the side panel shows how its rides are distributed across the other communities, either for all taxis or for a specific taxi company. The bar graph and table update to reflect the selected filters.

Users can easily mix and match different communities and taxi companies to find patterns in the data and draw reasonable inferences from them.
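One way such percentages could be computed is sketched below for rides leaving a selected community. The helper `pct_from` is hypothetical and the rows are made up. The "to" direction would swap the roles of `pickup` and `dropoff`.

```r
library(dplyr)

# Toy rides with made-up community area codes.
toy <- data.frame(pickup  = c(8, 8, 8, 32, 76),
                  dropoff = c(32, 32, 76, 8, 8))

# Hypothetical helper: for rides starting in `area_id`, the share of rides
# ending in each dropoff community, as a percentage.
pct_from <- function(df, area_id) {
  df %>%
    filter(pickup == area_id) %>%
    count(dropoff, name = "rides") %>%
    mutate(pct = 100 * rides / sum(rides))
}

from_8 <- pct_from(toy, 8)
print(from_8)
```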

BINNED TRIP TIME GRAPH

Trip Plot

Users can see the number of rides in each trip-time range in the various communities in Chicago. Selecting a particular community and taxi company from the side panel shows the number of rides in each trip-time range for that community and taxi company. The bar graph and table update to reflect the selected filters.

Users can easily mix and match different communities and taxi companies to find patterns in the data and draw reasonable inferences from them.

TABLES

Table

Users can view the same data shown on the graph in tabular form. Each bar graph has a corresponding table, in which users can sort the columns in ascending or descending order and search for a particular entry.

INTERESTING INSIGHTS

Exploring the data reveals some interesting insights.

Looking into 24/7 Taxi Company, we see a huge drop in taxi rides in Norwood and in many other communities. On the dates in question there were reports of severe snowstorms, which might have prompted cab drivers to stay indoors and also reduced the number of passengers heading out.

We observe the same dips for Nova Taxi Affiliation LLC, where there are huge drops in ridership in April, which can again be attributed to the reports of snowstorms and rainfall at the time.

Taking a less granular view across all taxi companies and all communities, the number of rides dips sharply between 12 am and 5 am, starts rising from 5 am as the day begins for most people, and peaks at around 6 pm.

As can be observed from the monthly graph, there’s a huge decline in the number of taxi rides in O’Hare from November to February. This could be attributed to the onset of winter, when fewer passengers wish to venture out.

NEXT STEPS

Users can access the source code if they wish to dive deeper into the program and create their own plots. They can modify the selection of communities and taxi companies and define their own filters to visualize the data and draw inferences from it.

REFERENCES