CTA Data Visualization

Analyzing ridership data on CTA metros from 2001 to 2021

Posted by Aditya Ranganathan on February 13, 2022 · 10 mins read

The dashboard can be accessed from here.

INTRO

This dashboard is designed to display ridership data in a simple and intuitive manner. Users can view the ridership data in three metro stations: UIC-Halsted, O’Hare Airport and Racine.

The project was developed in R using RStudio as the development enviroment and ShinyApps to deploy the dashboard. The dashboard was designed with the intent of running on a touch-screen wall with a resolution of 5,760 x 3,240.

PREREQUISITES

  • Installation of R and RStudio is necessary to run the application.

  • For this project we’ll be using a couple of libraries that can be installed using ‘install.packages()’
    install.packages("lubridate")
    install.packages("DT")
    install.packages("ggplot2")
    install.packages("dplyr")
    install.packages("tidyr")
    install.packages("scales")
    install.packages("shiny")
    install.packages("shinydashboard")
    
  • Due to the size limitations of ShinyApps, our data file must be less than 5 MB. We can choose to either breakup the file into chunks or extract the relevant data and pass it to the server.

PRE-PROCESSING

The data is sourced from the Chicago Data Portal. The dataset is Tab Separated Value(.tsv) file, consisting of 1.08 million rows of data and contains the following attributes:

Column   Data Type
station_id   integer
stationname   string
date   Date
daytype   character
rides   integer

We first read the data file into the R file as a data frame(cta_station) for manipulating the attributes. We then check for NULL values in the dataset.

ok <- complete.cases(cta_station)
dim(cta_station[ok, ]) == dim(cta_station)

Taking a closer look at the attributes, the date column is a character data type which we need to convert to the date format

cta_station$date <- as.Date(cta_station$date, '%m/%d/%Y')

From the date column, we can engineer new features by individually extracting the year, month and week day and storing them in separate attributes.

cta_station$year <- as.numeric(format(cta_station$date, format='%Y'))
cta_station$month <- as.numeric(format(cta_station$date, format='%m'))
cta_station$day <- as.numeric(format(cta_station$date, format='%d'))
cta_station$week_day <- wday(cta_station$date, label=TRUE)

We can also create a month_name column that holds values of the months as string

name_of_month <- function(df){
  df$month_name[df$month==1] <- 'Jan'
  df$month_name[df$month==2] <- 'Feb'
  df$month_name[df$month==3] <- 'Mar'
  df$month_name[df$month==4] <- 'Apr'
  df$month_name[df$month==5] <- 'May'
  df$month_name[df$month==6] <- 'Jun'
  df$month_name[df$month==7] <- 'Jul'
  df$month_name[df$month==8] <- 'Aug'
  df$month_name[df$month==9] <- 'Sep'
  df$month_name[df$month==10] <- 'Oct'
  df$month_name[df$month==11] <- 'Nov'
  df$month_name[df$month==12] <- 'Dec'
  
  return(df)
  }

By analysing the stationname column, we can observe that there are 148 uniques stations in the dataset.

unique(cta_station$stationname)
length(unique(cta_station$stationname))

For the purposes of this project, we’ll only be focusing on three stations: UIC-Halsted, O’Hare Airport and Racine. So we create subsets of the dataframe for each individual station and merge them.

uic_station <- subset(cta_station, stationname == 'UIC-Halsted')
ohare_station <- subset(cta_station, stationname == 'O'Hare Airport')
racine_station <- subset(cta_station, stationname == 'Racine')

station_data <- do.call("rbind", list(uic_station, ohare_station, racine_station))

This is the data file we’ll be using for our dashboard.

THE DASHBOARD

Dashboard Interface

The dashboard is divided into 4 sections:

THE ABOUT PAGE

About Page

Users can get a brief idea of the interface and how it’s meant to be used. There’s an introduction to the dataset, the link from it was sourced and a brief elaboration of the attributes.

STATION COMPARISON

Table Data

The section is divided into two regions, each with their own set of controls. Users can at a glance compare information between any two stations. Graphs on ridership are plotted on a yearly, daily, monthly and weekly basis. Users can change the station and the year from which the data is being visualized from the dropdowns in the sidepanel. Users can observe data of individual stations or the summarized data of all three stations. This allows users to mix and match graphs from different years allowing them to quickly spot any possible evnts of interest that may have happened at a station

TABLE DATA

Table Data

Users can view the visualization and tabular data of stations in the two regions of the sections. Users can choose the station, year and time frame ie. daily, weekly and monthly view of the station data from the dropdowns in the sidepanel and compare it against the station data in the opposite region. Users can pinpoint certain dates where unexpected spikes and/or declines in ridership may have occurred and infer the possible reasons behind it.

DATES OF INTEREST

These are a set of pre-selected dates which have been observed to have an unusual pattern in ridership. Users can select on the tabs present in the interface to view the 10 dates in question. The description in each tab gives more information about the event in question and the potential impact its had on ridership numbers in the station for that period.

October 2019, O’Hare Airport

DOI1 There’s a significant decline in ridership between September 28th to October 5th, 2019. Upon research, it was discovered that there was heavy rainfall and reports of a flash floods happening around that time which could have led to the possible decline in numbers. Source

11th September 2001, O’Hare Aiport

DOI2 The attack on the Twin Towers had a significant impact on the general populace. Due to the nature of the attack, it could have possibly instilled a fear of travelling in the population and hence the decline in the number riders starting from September.

COVID-19 April 2020, UIC-Halsted

DOI3 Due to the pandemic and strict CDC guidelines, travelling was severely restricted and prompted classes to be held online. In-person classes then resumed on August 2021. These factors have led to a significant decline in riders which can be observed from April 2020 all the way to August 2021.

COVID-19 April 2020, O’Hare Airport

DOI4 The same trend can be observed for the O’Hare Airport station as well. Due to strict CDC guidelines, a number of flights were cancelled and/or grounded. While online classes would have helped keep ridership entries low, domestic and international travel is crucial for the economy which might be the reason why there’s faster growth in numbers for this particular situation.

July 2008, O’Hare Airport

DOI5 There’s a significant gap in ridership numbers at the O’Hare Airport station. Researching these dates found that Hurricane Ike occurred during this time with heavy rainfall and floods primarily the reasons why ridership fell so low. Source

March 2014, O’Hare Airport

DOI6 A train derailed at the O’Hare airport station on March 24th, 2014. Due to maintenance and repair, the station was shutdown for 6 days and reopened on the 30th of March. This explains the gap present in the data. Source

October 2017, O’Hare Airport

DOI7 The Chicago Cubs were playing against the Washington Nationals during this time which could have contributed to the high spikes in ridership that we can observe for the month of October.

November 2008, O’Hare Airport

DOI8 There’s a high spike in ridership number in the month of Novemebr. Research shows that Barack Obama addressed the people at Grant Park on November 8th 2008, which could be a possible reason in the spike ridership. The general ridership numbers start declining from October due to the onset of winter, which makes this spike in numbers all the more noticable. Source

Summer Break, UIC-Halsted

DOI9 On observation of ridership numbers at UIC-Halsted, a pattern emerges where there’s a decline in ridership during the months of May to July. This could be attributed due to lack of students, since most of them would be on break at the time.

November 2016, O’Hare

DOI10 It was a historic moment for the people of Chicago when cubs won World Series on November 2nd, 2016. The high spikes in ridership numbers could be attributed to the popularity of the match considering it’s the first time since 1908 that the Cubs had reached the finals. Source

NEXT STEPS

Users can access the source code if they wish to dive deep into the program and come up with their plots that they wish to visualize. Users can modify the selection of stations and choose their own to visualize and infer data from.