The dashboard can be accessed from here.
This dashboard is designed to display ridership data in a simple and intuitive manner. Users can search for ridership data from any of the stations in the station input bar and the map, bar graph and tables change according to the inputs given. Users can select a particular date to see the activity in a particular station and can switch to the previous/next day using the buttons on the sidepanel. Users can also compare ridership data between two dates by clicking on the comparison option and setting the required dates. The bar graph displays a divergent graph that clearly highlights the difference in ridership values between the two dates. Users can view the plot of ridership over time period. The time period can be slected from the sidepanel and period can be set to Daily, Weekly, Monthly or Yearly.
The project was developed in R using RStudio as the development enviroment and ShinyApps to deploy the dashboard. The dashboard was designed with the intent of running on a touch-screen wall with a resolution of 11,520 x 3,240.
Installation of R and RStudio is necessary to run the application.
install.packages("lubridate")
install.packages("DT")
install.packages("plotly")
install.packages("leaflet")
install.packages("ggplot2")
install.packages("dplyr")
install.packages("tidyr")
install.packages("scales")
install.packages("shiny")
install.packages("shinydashboard")
The data is sourced from the Chicago Data Portal and the corresponding location data is taken from here
The datasets are two Tab Separated Value(.tsv) files, consisting of 1.08 million and 300 rows of data respectively and contains the following attributes:
Ridership Dataset
Column | Data Type | |
---|---|---|
station_id | integer | |
stationname | string | |
date | Date | |
daytype | character | |
rides | integer |
Location Dataset
Column | Data Type | |
---|---|---|
STOP_ID | integer | |
DIRECTION_ID | string | |
STOP_NAME | string | |
STATION_NAME | string | |
STATION_DESCRIPTIVE_NAME | integer | |
MAP_ID | integer | |
ADA | boolean | |
RED | boolean | |
BLUE | boolean | |
G | boolean | |
BRN | boolean | |
P | boolean | |
Pexp | boolean | |
Y | boolean | |
Pnk | boolean | |
O | boolean | |
Location | string |
We first read the ridership data file into the R file as a data frame(cta_station) for manipulating the attributes and location data to lat_long. We then check for NULL values in the datasets.
ok <- complete.cases(cta_station)
dim(cta_station[ok, ]) == dim(cta_station)
ok <- complete.cases(lat_long)
dim(cta_station[ok, ]) == dim(lat_long)
We rename the station_id columns in both datasets for maerging later on
names(cta_station)[1] <- 'station_id'
names(lat_long)[6] <- 'station_id'
Taking a closer look at the attributes, the date column is a character data type which we need to convert to the date format
cta_station$date <- as.Date(cta_station$date, '%m/%d/%Y')
From the date column, we can engineer new features by individually extracting the year, month and week day and storing them in separate attributes.
cta_station$year <- as.numeric(format(cta_station$date, format='%Y'))
cta_station$month <- as.numeric(format(cta_station$date, format='%m'))
cta_station$day <- as.numeric(format(cta_station$date, format='%d'))
cta_station$week_day <- wday(cta_station$date, label=TRUE)
We can also create a month_name column that holds values of the months as string
name_of_month <- function(df){
df$month_name[df$month==1] <- 'Jan'
df$month_name[df$month==2] <- 'Feb'
df$month_name[df$month==3] <- 'Mar'
df$month_name[df$month==4] <- 'Apr'
df$month_name[df$month==5] <- 'May'
df$month_name[df$month==6] <- 'Jun'
df$month_name[df$month==7] <- 'Jul'
df$month_name[df$month==8] <- 'Aug'
df$month_name[df$month==9] <- 'Sep'
df$month_name[df$month==10] <- 'Oct'
df$month_name[df$month==11] <- 'Nov'
df$month_name[df$month==12] <- 'Dec'
return(df)
}
From the STATION_DESCRIPTIVE_NAME, we can engineer a new column ‘line’ that contains information regarding which lines the stations serve.
lat_long$line <- str_extract(lat_long$STATION_DESCRIPTIVE_NAME, "\\(.*\\)")
lat_long$line <- str_remove_all(lat_long$line, "[\\(\\)]")
We extract the relevant columns from lat_long necessary for our work
loc_data = subset(lat_long, select = c('station_id', 'Location', 'line'))
loc_data <- loc_data %>% distinct(station_id, line, .keep_all = TRUE)
By analysing the stationname column, we can observe that there are 148 uniques stations in the dataset.
unique(cta_station$stationname)
length(unique(cta_station$stationname))
We then merge both cta_station and lat_long to give the final dataframe
final_data <- left_join(cta_station, loc_data, on="station_id")
final_data
Handling the missing values and inconsistencies in final_data
# Fill Missing Data
final_data$Location[final_data$stationname == 'Randolph/Wabash'] = "(41.884431, -87.626149)"
final_data$line[final_data$stationname == 'Randolph/Wabash'] = "Green, Orange, Pink, Purple & Brown Lines"
final_data$Location[final_data$stationname == 'Madison/Wabash'] = "(41.882023, -87.626098)"
final_data$line[final_data$stationname == 'Madison/Wabash'] = "Brown, Green, Orange, Pink & Purple Lines"
final_data$Location[final_data$stationname == 'Washington/State'] = "(41.8837, -87.6278)"
final_data$line[final_data$stationname == 'Washington/State'] = "Red Line"
final_data$Location[final_data$stationname == 'Homan'] = "(41.884914, -87.711327)"
final_data$line[final_data$stationname == 'Homan'] = "Green Line"
# Change data regarding Pink Line
final_data$line[final_data$stationname == "54th/Cermak" & final_data$date < "2006-06-25"] <- "Blue Line"
This is the data file we’ll be using for our dashboard.
The dashboard is divided into 4 componenets:
Users can get a brief idea of the interface and how it’s meant to be used. There’s an introduction to the dataset, the link from it was sourced and a brief elaboration of the attributes.
Users can pan around the map of Chicago, zoom in or out and view data on the various stations. Selecting a station from the side panel calibrates the map to that stations locations with a pop-up showing information on the station such as the number of rides and lines that stations serves.
Users can see the daily ridership in each of the stations in the bar graph. Users can select a single date from the side panel and the graph and table change to reflect the values on that date. The lines that each station serves can be found on the bar and hovering above the bar gives further information regarding exact ridership number and line information. On selecting comparsion, users can select 2 dates to be compared and the bar graph displays a divergent bar graph which shows the change in rideship values across different stations. The blue bar indicates decrease in ridership and the red bar indicates increase. The tables accordingly change to reflect the change in ridership values when comparing the two dates.
Users can see the ridership across a period of time which can be selected from the dropdown on the sidepanel. Users have the option of seeing Daily, Weekly, Monthly and Yearly ridership for any selected station. The table below the graph reflects the same values and also changes accordingly.
These are a set of pre-selected stations which have been observed to have an unusual pattern in ridership.
One interesting thing which we observed on visualizing the data was that there are two stations belonging to the Yellow line on 23-Aug-2021 on the map, namely Dempster-Skokie and Oakton-Skokie as shown above.
On selecting the station Dempster Skokie and seeing its yearly ridership, we could see data only from 2012.
Likewise for Oakton-Skokie, the data is available only from 2012.
On choosing any date before 2012, we could only see one stop, Skokie in the Yellow line.
On selecting Skokie and viewing its Yearly Ridership, we could see the below chart.
A quick Google search revealed that in April 2012, Skokie was renamed to Dempster Skokie and another Yellow Line station, Oakton-Skokie was introduced.
While looking at the yearly plots for Dempster Skokie and Oakton-Skokie, we can see that there was a significant dip in ridership in both these stations in 2015 that were not observed in other stations. Below shown are the daily ridership data for Oakton-Skokie and Dempster Skokie in the year 2015 with a harsh dip in may and no data from Jul-Sep 2015.
Oakton-Skokie
Dempster Skokie
Again on looking into it, found that both these stops were out of service, due to an embankment collapse on May 17th as the rail-lines were heavily damaged.
On observing the Chicago loop stations, we can see a difference in the number of stations between 2021 and 2004.
Two stops - Washington/State and Randolph/Wabash marked in red below, are missing.
Washington/State ridership over the years shows that it has no data after 2006 as it was abandoned after.
Likewise for Randolph/Wabash, it stopped functioning after 03-Sep-2017.
Users can access the source code if they wish to dive deep into the program and come up with their plots that they wish to visualize. Users can modify the selection of stations and choose their own to visualize and infer data from.