CTA Map Visualization

Analyzing ridership data on CTA metros from 2001 to 2021

Posted by Aditya Ranganathan on March 14, 2022 · 12 mins read

The dashboard can be accessed from here.

INTRO

This dashboard is designed to display ridership data in a simple and intuitive manner. Users can search for ridership data from any of the stations in the station input bar and the map, bar graph and tables change according to the inputs given. Users can select a particular date to see the activity in a particular station and can switch to the previous/next day using the buttons on the sidepanel. Users can also compare ridership data between two dates by clicking on the comparison option and setting the required dates. The bar graph displays a divergent graph that clearly highlights the difference in ridership values between the two dates. Users can view the plot of ridership over time period. The time period can be slected from the sidepanel and period can be set to Daily, Weekly, Monthly or Yearly.

The project was developed in R using RStudio as the development enviroment and ShinyApps to deploy the dashboard. The dashboard was designed with the intent of running on a touch-screen wall with a resolution of 11,520 x 3,240.

PREREQUISITES

  • Installation of R and RStudio is necessary to run the application.

  • For this project we’ll be using a couple of libraries that can be installed using ‘install.packages()’
    install.packages("lubridate")
    install.packages("DT")
    install.packages("plotly")
    install.packages("leaflet")
    install.packages("ggplot2")
    install.packages("dplyr")
    install.packages("tidyr")
    install.packages("scales")
    install.packages("shiny")
    install.packages("shinydashboard")
    
  • Due to the size limitations of ShinyApps, our data file must be less than 5 MB. We’ve chosen to breakup the file into chunks.

PRE-PROCESSING

The data is sourced from the Chicago Data Portal and the corresponding location data is taken from here

The datasets are two Tab Separated Value(.tsv) files, consisting of 1.08 million and 300 rows of data respectively and contains the following attributes:

Ridership Dataset

Column   Data Type
station_id   integer
stationname   string
date   Date
daytype   character
rides   integer

Location Dataset

Column   Data Type
STOP_ID   integer
DIRECTION_ID   string
STOP_NAME   string
STATION_NAME   string
STATION_DESCRIPTIVE_NAME   integer
MAP_ID   integer
ADA   boolean
RED   boolean
BLUE   boolean
G   boolean
BRN   boolean
P   boolean
Pexp   boolean
Y   boolean
Pnk   boolean
O   boolean
Location   string

We first read the ridership data file into the R file as a data frame(cta_station) for manipulating the attributes and location data to lat_long. We then check for NULL values in the datasets.

ok <- complete.cases(cta_station)
dim(cta_station[ok, ]) == dim(cta_station)

ok <- complete.cases(lat_long)
dim(cta_station[ok, ]) == dim(lat_long)

We rename the station_id columns in both datasets for maerging later on

names(cta_station)[1] <- 'station_id'
names(lat_long)[6] <- 'station_id'

Taking a closer look at the attributes, the date column is a character data type which we need to convert to the date format

cta_station$date <- as.Date(cta_station$date, '%m/%d/%Y')

From the date column, we can engineer new features by individually extracting the year, month and week day and storing them in separate attributes.

cta_station$year <- as.numeric(format(cta_station$date, format='%Y'))
cta_station$month <- as.numeric(format(cta_station$date, format='%m'))
cta_station$day <- as.numeric(format(cta_station$date, format='%d'))
cta_station$week_day <- wday(cta_station$date, label=TRUE)

We can also create a month_name column that holds values of the months as string

name_of_month <- function(df){
  df$month_name[df$month==1] <- 'Jan'
  df$month_name[df$month==2] <- 'Feb'
  df$month_name[df$month==3] <- 'Mar'
  df$month_name[df$month==4] <- 'Apr'
  df$month_name[df$month==5] <- 'May'
  df$month_name[df$month==6] <- 'Jun'
  df$month_name[df$month==7] <- 'Jul'
  df$month_name[df$month==8] <- 'Aug'
  df$month_name[df$month==9] <- 'Sep'
  df$month_name[df$month==10] <- 'Oct'
  df$month_name[df$month==11] <- 'Nov'
  df$month_name[df$month==12] <- 'Dec'
  
  return(df)
  }

From the STATION_DESCRIPTIVE_NAME, we can engineer a new column ‘line’ that contains information regarding which lines the stations serve.

lat_long$line <- str_extract(lat_long$STATION_DESCRIPTIVE_NAME, "\\(.*\\)")
lat_long$line <- str_remove_all(lat_long$line, "[\\(\\)]")

We extract the relevant columns from lat_long necessary for our work

loc_data = subset(lat_long, select = c('station_id', 'Location', 'line'))
loc_data <- loc_data %>% distinct(station_id, line, .keep_all = TRUE)

By analysing the stationname column, we can observe that there are 148 uniques stations in the dataset.

unique(cta_station$stationname)
length(unique(cta_station$stationname))

We then merge both cta_station and lat_long to give the final dataframe

final_data <- left_join(cta_station, loc_data, on="station_id")
final_data

Handling the missing values and inconsistencies in final_data

# Fill Missing Data
final_data$Location[final_data$stationname == 'Randolph/Wabash'] = "(41.884431, -87.626149)"
final_data$line[final_data$stationname == 'Randolph/Wabash'] = "Green, Orange, Pink, Purple & Brown Lines"

final_data$Location[final_data$stationname == 'Madison/Wabash'] = "(41.882023, -87.626098)"
final_data$line[final_data$stationname == 'Madison/Wabash'] = "Brown, Green, Orange, Pink & Purple Lines"

final_data$Location[final_data$stationname == 'Washington/State'] = "(41.8837, -87.6278)"
final_data$line[final_data$stationname == 'Washington/State'] = "Red Line"

final_data$Location[final_data$stationname == 'Homan'] = "(41.884914, -87.711327)"
final_data$line[final_data$stationname == 'Homan'] = "Green Line"

# Change data regarding Pink Line
final_data$line[final_data$stationname == "54th/Cermak" & final_data$date < "2006-06-25"] <- "Blue Line"

This is the data file we’ll be using for our dashboard.

THE DASHBOARD

Dashboard Interface

The dashboard is divided into 4 componenets:

THE ABOUT SECTION

About Page

Users can get a brief idea of the interface and how it’s meant to be used. There’s an introduction to the dataset, the link from it was sourced and a brief elaboration of the attributes.

MAP

Map

Users can pan around the map of Chicago, zoom in or out and view data on the various stations. Selecting a station from the side panel calibrates the map to that stations locations with a pop-up showing information on the station such as the number of rides and lines that stations serves.

BAR GRAPH AND TABLE DATA

Bar Plot Divergent Plot

Users can see the daily ridership in each of the stations in the bar graph. Users can select a single date from the side panel and the graph and table change to reflect the values on that date. The lines that each station serves can be found on the bar and hovering above the bar gives further information regarding exact ridership number and line information. On selecting comparsion, users can select 2 dates to be compared and the bar graph displays a divergent bar graph which shows the change in rideship values across different stations. The blue bar indicates decrease in ridership and the red bar indicates increase. The tables accordingly change to reflect the change in ridership values when comparing the two dates.

PLOT OF RIDERSHIP ACROSS TIME PERIOD

Rider Graph

Users can see the ridership across a period of time which can be selected from the dropdown on the sidepanel. Users have the option of seeing Daily, Weekly, Monthly and Yearly ridership for any selected station. The table below the graph reflects the same values and also changes accordingly.

STATIONS OF INTEREST

These are a set of pre-selected stations which have been observed to have an unusual pattern in ridership.

Dempster-Skokie and Oakton-Skokie

SOI1

One interesting thing which we observed on visualizing the data was that there are two stations belonging to the Yellow line on 23-Aug-2021 on the map, namely Dempster-Skokie and Oakton-Skokie as shown above.

On selecting the station Dempster Skokie and seeing its yearly ridership, we could see data only from 2012. SOI2

Likewise for Oakton-Skokie, the data is available only from 2012. SOI3

On choosing any date before 2012, we could only see one stop, Skokie in the Yellow line. SOI4

On selecting Skokie and viewing its Yearly Ridership, we could see the below chart. SOI5

A quick Google search revealed that in April 2012, Skokie was renamed to Dempster Skokie and another Yellow Line station, Oakton-Skokie was introduced.

Dip in Dempster-Skokie and Oakton-Skokie

While looking at the yearly plots for Dempster Skokie and Oakton-Skokie, we can see that there was a significant dip in ridership in both these stations in 2015 that were not observed in other stations. Below shown are the daily ridership data for Oakton-Skokie and Dempster Skokie in the year 2015 with a harsh dip in may and no data from Jul-Sep 2015.

Oakton-Skokie SOI6

Dempster Skokie SOI7

Again on looking into it, found that both these stops were out of service, due to an embankment collapse on May 17th as the rail-lines were heavily damaged.

Change in Station Count

On observing the Chicago loop stations, we can see a difference in the number of stations between 2021 and 2004.

Two stops - Washington/State and Randolph/Wabash marked in red below, are missing. SOI8

Washington/State ridership over the years shows that it has no data after 2006 as it was abandoned after. SOI9

Likewise for Randolph/Wabash, it stopped functioning after 03-Sep-2017. SOI10

NEXT STEPS

Users can access the source code if they wish to dive deep into the program and come up with their plots that they wish to visualize. Users can modify the selection of stations and choose their own to visualize and infer data from.