The dashboard can be accessed from here.
This dashboard is designed to display ridership data in a simple and intuitive manner. Users can view the ridership data in three metro stations: UIC-Halsted, O’Hare Airport and Racine.
The project was developed in R using RStudio as the development enviroment and ShinyApps to deploy the dashboard. The dashboard was designed with the intent of running on a touch-screen wall with a resolution of 5,760 x 3,240.
Installation of R and RStudio is necessary to run the application.
install.packages("lubridate")
install.packages("DT")
install.packages("ggplot2")
install.packages("dplyr")
install.packages("tidyr")
install.packages("scales")
install.packages("shiny")
install.packages("shinydashboard")
The data is sourced from the Chicago Data Portal. The dataset is Tab Separated Value(.tsv) file, consisting of 1.08 million rows of data and contains the following attributes:
Column | Data Type | |
---|---|---|
station_id | integer | |
stationname | string | |
date | Date | |
daytype | character | |
rides | integer |
We first read the data file into the R file as a data frame(cta_station) for manipulating the attributes. We then check for NULL values in the dataset.
ok <- complete.cases(cta_station)
dim(cta_station[ok, ]) == dim(cta_station)
Taking a closer look at the attributes, the date column is a character data type which we need to convert to the date format
cta_station$date <- as.Date(cta_station$date, '%m/%d/%Y')
From the date column, we can engineer new features by individually extracting the year, month and week day and storing them in separate attributes.
cta_station$year <- as.numeric(format(cta_station$date, format='%Y'))
cta_station$month <- as.numeric(format(cta_station$date, format='%m'))
cta_station$day <- as.numeric(format(cta_station$date, format='%d'))
cta_station$week_day <- wday(cta_station$date, label=TRUE)
We can also create a month_name column that holds values of the months as string
name_of_month <- function(df){
df$month_name[df$month==1] <- 'Jan'
df$month_name[df$month==2] <- 'Feb'
df$month_name[df$month==3] <- 'Mar'
df$month_name[df$month==4] <- 'Apr'
df$month_name[df$month==5] <- 'May'
df$month_name[df$month==6] <- 'Jun'
df$month_name[df$month==7] <- 'Jul'
df$month_name[df$month==8] <- 'Aug'
df$month_name[df$month==9] <- 'Sep'
df$month_name[df$month==10] <- 'Oct'
df$month_name[df$month==11] <- 'Nov'
df$month_name[df$month==12] <- 'Dec'
return(df)
}
By analysing the stationname column, we can observe that there are 148 uniques stations in the dataset.
unique(cta_station$stationname)
length(unique(cta_station$stationname))
For the purposes of this project, we’ll only be focusing on three stations: UIC-Halsted, O’Hare Airport and Racine. So we create subsets of the dataframe for each individual station and merge them.
uic_station <- subset(cta_station, stationname == 'UIC-Halsted')
ohare_station <- subset(cta_station, stationname == 'O'Hare Airport')
racine_station <- subset(cta_station, stationname == 'Racine')
station_data <- do.call("rbind", list(uic_station, ohare_station, racine_station))
This is the data file we’ll be using for our dashboard.
The dashboard is divided into 4 sections:
Users can get a brief idea of the interface and how it’s meant to be used. There’s an introduction to the dataset, the link from it was sourced and a brief elaboration of the attributes.
The section is divided into two regions, each with their own set of controls. Users can at a glance compare information between any two stations. Graphs on ridership are plotted on a yearly, daily, monthly and weekly basis. Users can change the station and the year from which the data is being visualized from the dropdowns in the sidepanel. Users can observe data of individual stations or the summarized data of all three stations. This allows users to mix and match graphs from different years allowing them to quickly spot any possible evnts of interest that may have happened at a station
Users can view the visualization and tabular data of stations in the two regions of the sections. Users can choose the station, year and time frame ie. daily, weekly and monthly view of the station data from the dropdowns in the sidepanel and compare it against the station data in the opposite region. Users can pinpoint certain dates where unexpected spikes and/or declines in ridership may have occurred and infer the possible reasons behind it.
These are a set of pre-selected dates which have been observed to have an unusual pattern in ridership. Users can select on the tabs present in the interface to view the 10 dates in question. The description in each tab gives more information about the event in question and the potential impact its had on ridership numbers in the station for that period.
There’s a significant decline in ridership between September 28th to October 5th, 2019. Upon research, it was discovered that there was heavy rainfall and reports of a flash floods happening around that time which could have led to the possible decline in numbers. Source
The attack on the Twin Towers had a significant impact on the general populace. Due to the nature of the attack, it could have possibly instilled a fear of travelling in the population and hence the decline in the number riders starting from September.
Due to the pandemic and strict CDC guidelines, travelling was severely restricted and prompted classes to be held online. In-person classes then resumed on August 2021. These factors have led to a significant decline in riders which can be observed from April 2020 all the way to August 2021.
The same trend can be observed for the O’Hare Airport station as well. Due to strict CDC guidelines, a number of flights were cancelled and/or grounded. While online classes would have helped keep ridership entries low, domestic and international travel is crucial for the economy which might be the reason why there’s faster growth in numbers for this particular situation.
There’s a significant gap in ridership numbers at the O’Hare Airport station. Researching these dates found that Hurricane Ike occurred during this time with heavy rainfall and floods primarily the reasons why ridership fell so low. Source
A train derailed at the O’Hare airport station on March 24th, 2014. Due to maintenance and repair, the station was shutdown for 6 days and reopened on the 30th of March. This explains the gap present in the data. Source
The Chicago Cubs were playing against the Washington Nationals during this time which could have contributed to the high spikes in ridership that we can observe for the month of October.
There’s a high spike in ridership number in the month of Novemebr. Research shows that Barack Obama addressed the people at Grant Park on November 8th 2008, which could be a possible reason in the spike ridership. The general ridership numbers start declining from October due to the onset of winter, which makes this spike in numbers all the more noticable. Source
On observation of ridership numbers at UIC-Halsted, a pattern emerges where there’s a decline in ridership during the months of May to July. This could be attributed due to lack of students, since most of them would be on break at the time.
It was a historic moment for the people of Chicago when cubs won World Series on November 2nd, 2016. The high spikes in ridership numbers could be attributed to the popularity of the match considering it’s the first time since 1908 that the Cubs had reached the finals. Source
Users can access the source code if they wish to dive deep into the program and come up with their plots that they wish to visualize. Users can modify the selection of stations and choose their own to visualize and infer data from.