Big Yellow Taxi

The dashboard can be accessed from here.

INTRO

This dashboard displays taxi ridership data in a simple and intuitive manner. Users can filter the data by taxi company and/or community area, view distances in kilometres or miles and times in 12-hour or 24-hour format, and choose between rides going to or from a location. The map, table and bar graph update according to the filters applied. A drop-down menu switches between the bar graphs, which show hourly, daily, weekly and monthly taxi rides, binned mileage, binned trip time, and the percentage of rides going to/from a community area. Together these views let users look for patterns in the data and draw their own inferences.

The project was developed in R using RStudio as the development environment and ShinyApps to deploy the dashboard. The dashboard was designed to run on a touch-screen wall with a resolution of 11,520 x 3,240.

INSTALLATION and PREREQUISITES

The latest versions of R and RStudio are required to run the application. R can be installed from here

RStudio Desktop can be installed from here

This project uses several libraries that can be installed with the ‘install.packages()’ command. After installing R and RStudio, open the RStudio application and run the following commands in the console.

install.packages("lubridate")
install.packages("DT")
install.packages("leaflet")
install.packages("leaflet.providers")
install.packages("maptools")
install.packages("viridis")
install.packages("ggplot2")
install.packages("rgdal")
install.packages("dplyr")
install.packages("tidyr")
install.packages("stringr")
install.packages("scales")
install.packages("shiny")
install.packages("shinydashboard")

Once the libraries are installed, open the project you’ve downloaded from GitHub by clicking File -> Open Project and navigating to the project directory.
Double click the ‘app.R’ file to open it and click on ‘Run’ in the top bar to run the application.
Optionally, if you wish to process the data from scratch, you will also need the data files required to run the application. The main data source file can be downloaded from here

The shape files for the map component can be downloaded from here. Download and save the shape files

PRE-PROCESSING

The dataset is a single tab-separated values (.tsv) file consisting of 16.5 million rows and 23 columns. For the purposes of our project, we focus on only the 6 columns that contain the following attributes:

Ridership Dataset

Column		Data Type
Trip Start Timestamp		string
Trip Seconds		integer
Trip Miles		float
Dropoff Community Area		integer
Pickup Community Area		integer
Company		string

We first read the ridership data file into R as a data frame (taxi_data) so that its attributes can be manipulated. We then rename the columns for easier handling:

taxi_data <- taxi_data %>%
  rename(timestamp = `Trip Start Timestamp`,
         sec = `Trip Seconds`,
         miles = `Trip Miles`,
         pickup = `Pickup Community Area`,
         dropoff = `Dropoff Community Area`,
         company = `Company`)
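
The read step that precedes this rename is not shown. The sketch below is one way it could look, assuming base R's read.delim, a hypothetical file name (taxi_trips.tsv), and a toy two-row file in which a single "Fare" column stands in for the columns we do not need. For the real 16.5-million-row file a faster reader, such as data.table::fread with its select argument, may be preferable.

```r
# Toy stand-in for the real file; the file name and the "Fare" column
# are hypothetical and only illustrate skipping unwanted columns.
tsv_path <- file.path(tempdir(), "taxi_trips.tsv")
writeLines(c(
  paste("Trip Start Timestamp", "Trip Seconds", "Trip Miles",
        "Pickup Community Area", "Dropoff Community Area", "Fare",
        "Company", sep = "\t"),
  "01/15/2019 11:45:00 PM\t600\t2.5\t8\t32\t10.25\tFlash Cab",
  "03/02/2019 08:05:00 AM\t1200\t7.1\t\t76\t21.00\tSun Taxi"
), tsv_path)

wanted <- c("Trip Start Timestamp", "Trip Seconds", "Trip Miles",
            "Pickup Community Area", "Dropoff Community Area", "Company")

# Read the header first to learn the column order, then mark every
# unwanted column as "NULL" so read.delim skips it while parsing.
header <- names(read.delim(tsv_path, nrows = 1, check.names = FALSE))
classes <- rep("NULL", length(header))
classes[header %in% wanted] <- NA   # NA lets R detect the column type

# check.names = FALSE keeps the original names (with spaces), which
# the rename() call above expects.
taxi_data <- read.delim(tsv_path, colClasses = classes,
                        check.names = FALSE, stringsAsFactors = FALSE)
```

The empty pickup field in the second toy row is read as NA, which is the case handled in the next step.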

We then count the missing (NA) values in each column of the dataset.

lapply(taxi_data,
       function(x) { 
         length(which(is.na(x)))
         }
       )

Missing values appear in the Pickup and Dropoff Community Area columns. As per our understanding, these correspond to locations outside of Chicago, so we impute 0 (zero) in their place.

taxi_data$pickup[is.na(taxi_data$pickup)] <- 0
taxi_data$dropoff[is.na(taxi_data$dropoff)] <- 0 

As per our project requirements, we can reduce the number of data points by filtering on the given conditions. We first remove all taxi trips shorter than 0.5 miles or longer than 100 miles:

taxi_data <- filter(taxi_data, miles > 0.5 & miles < 100)

Similarly, we remove all taxi trips shorter than 60 seconds or longer than 5 hours (18,000 seconds):

taxi_data <- filter(taxi_data, sec >= 60 & sec <= 18000)
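
The distance and duration rules above can be checked together on a toy data frame. This is a sketch in base R rather than dplyr; the trip values are made up, and keeping trips that sit exactly on a boundary is an assumption.

```r
# Toy trips: rows 1-4 each break one rule, row 5 is valid.
trips <- data.frame(
  miles = c(0.2, 150, 3.0, 4.0, 5.0),
  sec   = c(300, 900, 30, 20000, 600)
)

max_sec <- 5 * 60 * 60  # 5 hours expressed in seconds

# Keep a trip only if it passes both the distance and the duration rule
# (boundary values are kept here, which is an assumption).
keep <- trips$miles >= 0.5 & trips$miles <= 100 &
        trips$sec >= 60 & trips$sec <= max_sec
filtered <- trips[keep, ]
```

Only the last toy trip survives, since each of the others violates exactly one rule.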

We clean up the strings in the company column to make the names more presentable. First, we replace all non-alphabetical characters with spaces:

taxi_data$company <- str_replace_all(taxi_data$company, "[^[:alpha:]]", " ")

Then we remove the leading and trailing spaces from each name:

taxi_data$company <- trimws(taxi_data$company, which = c("both"))
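
The effect of these two steps is easiest to see on a few names. The sketch below uses base R's gsub in place of str_replace_all with the same pattern; the input names are illustrative.

```r
# Illustrative raw names; some start with digits.
raw_names <- c("312 Medallion Management Corp", "5 Star Taxi",
               "24 Seven Taxi", "Flash Cab")

# Replace every non-alphabetic character with a space, then trim the ends.
cleaned <- trimws(gsub("[^[:alpha:]]", " ", raw_names))
print(cleaned)
```

Names that began with digits lose that part entirely, which is why some names need a manual correction afterwards.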

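Stripping non-alphabetical characters also removes the digits in names such as ‘312 Medallion Management Corp’, so we manually restore the affected company names to ensure no information is lost: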

taxi_data <- taxi_data %>%
  mutate(company = recode(company, `Medallion Management Corp` = '312 Medallion Management Corp', 
                          `Star Taxi` = 'Five Star Taxi', 
                          `Seven Taxi` =  'Twentyfour Seven Taxi', 
                          `Star Taxi` = 'STAR Taxi',
                          `Sun Taxi` = 'SUN Taxi',
                          `Checker Taxi Affiliation` = 'CHECKER Taxi Affiliation',
                          `Chicago Taxicab` = 'CHICAGO Taxicab'
                          ))

We’ll create a unique integer code for each taxi company, since integers use less memory than strings and are faster to process.

taxi_data$company_code <- as.integer(factor(taxi_data$company))

We’ll create a new data frame that holds just the company name and the corresponding company code, and write it to a .csv file:

companies <- distinct(taxi_data, company_code, .keep_all = TRUE)
companies <- companies[, c("company", "company_code")]
write.csv(companies,"C:/Users/aranga22/Downloads/Academics/Sem 2/424 Visual Data/Projects/424_Project3/ext\\companies.csv", row.names = FALSE)
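
How the codes and the lookup table fit together can be sketched on toy data. The helper code_to_name is hypothetical and only shows how a code could be translated back to a name later, for example when labelling a filter.

```r
trips <- data.frame(
  company = c("Sun Taxi", "Flash Cab", "Sun Taxi", "City Service"),
  stringsAsFactors = FALSE
)

# factor() sorts the distinct names, so the codes follow that sorted order.
trips$company_code <- as.integer(factor(trips$company))

# One row per company: the same shape as the lookup table written above.
companies <- unique(trips[, c("company", "company_code")])

# Hypothetical helper: translate codes back to names via the lookup table.
code_to_name <- function(code) {
  companies$company[match(code, companies$company_code)]
}
```

Because the codes depend on the sorted set of names, the lookup table has to be regenerated whenever the set of companies changes.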

Similarly, manually create a .csv file that holds each community area name and its community code. This can be done by copying the values from the table given in Wikipedia.

Next we deal with the timestamp data. We first convert the timestamp string to a POSIXct date-time and then extract the individual day, month, year and hour components from it.

taxi_data <- taxi_data %>%
  mutate(timestamp = mdy_hms(timestamp))


taxi_data <- taxi_data %>%
  mutate(year = year(timestamp),
         month = month(timestamp),
         day = day(timestamp),
         hour = hour(timestamp)
  )

We’ll create a date column by concatenating the year, month and day values and parsing the result:

taxi_data$date <- with(taxi_data, ymd(paste(year, month, day, sep= ' ')))
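
The timestamp steps can be tried on two sample values. This sketch assumes the timestamps are month/day/year strings with an AM/PM marker; the sample values are made up. It also shows lubridate's as_date as a shorter route to the same date column.

```r
library(lubridate)

# Two illustrative timestamps (the exact layout of the real column
# is an assumption).
stamps <- mdy_hms(c("01/15/2019 11:45:00 PM", "03/02/2019 08:05:00 AM"))

parts <- data.frame(
  year  = year(stamps),
  month = month(stamps),
  day   = day(stamps),
  hour  = hour(stamps)   # 0-23, so 11 PM becomes 23
)

# as_date() drops the time of day and gives the same calendar date as
# rebuilding it from the year, month and day columns.
parts$date <- as_date(stamps)
rebuilt <- ymd(paste(parts$year, parts$month, parts$day))
```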

Get the weekday name and month name data from the corresponding columns

# Create week day column
taxi_data$week_day <- lubridate::wday(taxi_data$timestamp, label=TRUE)

# Create month name column; month.abb is base R's vector of
# abbreviated month names ('Jan', 'Feb', ..., 'Dec')
name_of_month <- function(taxi_data){
  taxi_data$month_name <- month.abb[taxi_data$month]
  return(taxi_data)
}
taxi_data <- name_of_month(taxi_data)

We remove all unnecessary columns from our data frame and write the dataset out in chunks, both to stay within file size restrictions and to make the data easier to read and manipulate.

taxi_data <- subset(taxi_data, select = -c(timestamp, year, month, day, company))

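library(purrr)  # provides map2()

no_of_chunks <- 35
# Assign each row a chunk number from 1 to no_of_chunks
f <- ceiling(1:nrow(taxi_data) / nrow(taxi_data) * no_of_chunks)
res <- split(taxi_data, f)
# Write each chunk to part_<chunk number>.csv
map2(res, paste0("part_", names(res), ".csv"), write.csv)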

The resulting files (part_1.csv, part_2.csv and so on) will be the data source used for our application.
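
At start-up the application has to stitch the part files back together. The sketch below shows one way to do that in base R, assuming the chunks sit in a single directory; it writes three toy chunks first so that it runs on its own, and the directory name taxi_chunks is hypothetical.

```r
# Write three toy chunks so the sketch is self-contained; in the project
# the pre-processing step produces the part files.
chunk_dir <- file.path(tempdir(), "taxi_chunks")
dir.create(chunk_dir, showWarnings = FALSE)
toy <- data.frame(sec   = c(600, 900, 1200, 300, 450, 700),
                  miles = c(2.5, 3.1, 7.4, 1.2, 1.8, 2.9))
chunks <- split(toy, ceiling(seq_len(nrow(toy)) / nrow(toy) * 3))
for (name in names(chunks)) {
  write.csv(chunks[[name]],
            file.path(chunk_dir, paste0("part_", name, ".csv")),
            row.names = FALSE)
}

# Read every chunk and stack the pieces back into one data frame.
files <- list.files(chunk_dir, pattern = "^part_.*\\.csv$", full.names = TRUE)
taxi_data <- do.call(rbind, lapply(files, read.csv))
```

Passing row.names = FALSE when writing avoids an extra index column appearing when the chunks are read back.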

Users can optionally group the data by taxi company and/or community area and write each group to its own .csv or .tsv file using the same code as above.

These are the data files we’ll be using for our dashboard.

THE DASHBOARD

Dashboard Interface

The dashboard is divided into 4 components:

SIDEBAR

Users can filter the data being observed by adjusting the filters on the sidebar. Users can choose the units in which the data is shown, such as kilometres or miles and 12-hour or 24-hour time. Users can drill down into the dataset by selecting a specific community and/or taxi company, or they can view the data for all taxi companies and/or communities. Users can also choose whether to see the rides going to or coming from a community.

Finally, users can choose whether to include community areas outside of Chicago in the dataset. Whenever this selection changes, all of the data is re-processed accordingly and the changes are reflected in the graphs and tables.
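
The two unit switches amount to small conversions applied before display. The helpers below are a sketch with hypothetical names (miles_to_km, format_hour), not the functions used in app.R.

```r
# Convert miles to kilometres (1 mile = 1.609344 km).
miles_to_km <- function(miles) miles * 1.609344

# Label an hour of day (0-23) in either 24-hour or 12-hour style.
format_hour <- function(hour, twelve_hour = FALSE) {
  if (!twelve_hour) return(sprintf("%02d:00", as.integer(hour)))
  suffix <- ifelse(hour < 12, "AM", "PM")
  hour12 <- hour %% 12
  hour12[hour12 == 0] <- 12   # 0 becomes 12 AM, 12 becomes 12 PM
  paste(hour12, suffix)
}

print(format_hour(c(0, 13, 12), twelve_hour = TRUE))
print(format_hour(c(0, 13)))
```

Keeping the stored data in one unit (miles, 0-23 hours) and converting only for display means the filters never have to re-read the data files.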

MAP

Map

Users can pan around the map of Chicago, zoom in or out, and view data on the various communities, including the percentage of taxi rides going to/from each community. Selecting a community from the side panel centres the map on that community's location and opens a pop-up showing the name of the community and the percentage of rides going to/from it. Users can drill down further by filtering on taxi company to see the percentage of that company's rides going to/from a particular community.
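
The percentages shown on the map can be computed with a frequency table. The sketch below is one possible reading of "percentage of rides going to/from a community": for a selected area, the share of its rides by destination (or by origin). The function names and toy trips are hypothetical.

```r
# Toy trips; 0 marks a ride that starts or ends outside Chicago,
# following the imputation used during pre-processing.
trips <- data.frame(
  pickup  = c(8, 8, 8, 32, 0),
  dropoff = c(32, 32, 76, 8, 8)
)

# Percentage of rides leaving `area`, broken down by destination.
pct_from <- function(trips, area) {
  dest <- trips$dropoff[trips$pickup == area]
  round(100 * prop.table(table(dest)), 1)
}

# Percentage of rides arriving in `area`, broken down by origin.
pct_to <- function(trips, area) {
  origin <- trips$pickup[trips$dropoff == area]
  round(100 * prop.table(table(origin)), 1)
}

print(pct_from(trips, 8))
print(pct_to(trips, 8))
```

Filtering the trips by company code before calling either function gives the per-company drill-down.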

BAR GRAPH

There are 7 different bar graphs in which users can observe patterns in the data.

DAILY GRAPH

Daily Plot

Users can see the daily ridership in the various communities in Chicago. Users can filter the data from the side panel by selecting a particular community and taxi company and observe the number of rides that occur daily for that community and taxi company. The bar graph and table change to reflect the values on that date.
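
The values behind the daily graph are simply ride counts per date after the filters are applied. The sketch below uses base R; the function name daily_rides and the toy trips are hypothetical, and only the company filter is shown.

```r
trips <- data.frame(
  date = as.Date(c("2019-01-01", "2019-01-01", "2019-01-02",
                   "2019-01-03", "2019-01-03", "2019-01-03")),
  company_code = c(1, 2, 1, 1, 1, 2)
)

# Count rides per day, optionally restricted to one company code.
# Days without any matching ride do not appear in the result.
daily_rides <- function(trips, company = NULL) {
  if (!is.null(company)) trips <- trips[trips$company_code == company, ]
  counts <- as.data.frame(table(date = trips$date),
                          responseName = "rides",
                          stringsAsFactors = FALSE)
  counts$date <- as.Date(counts$date)
  counts
}

print(daily_rides(trips))
print(daily_rides(trips, company = 1))
```

The resulting data frame has one row per day, which is a shape that can be handed to a bar chart layer such as ggplot2's geom_col or shown directly in a table.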