Tech Analytics

NYC Transit Equity Analysis - MHC × MTA Inaugural Datathon

Analysis of Fair Fares ridership data across 6 NYC neighborhoods to evaluate expanding eligibility from 120% to 200% of the Federal Poverty Level. The project processed over 10GB of data with optimized SQL queries and identified a 98% correlation between bus and subway usage patterns.

Data Volume: 10GB+
Usage Correlation: 98%
Neighborhood Analysis: 6 NYC Areas
Competition Result: Datathon Participant

Technology Stack

Python, SQL, Tableau, pandas, Data Pipeline, Geospatial Analysis, Public Policy, Big Data Processing

NYC Transit Equity Analysis

This project was developed for the inaugural MHC × MTA Datathon. It analyzes Fair Fares program ridership data to evaluate the potential impact of expanding eligibility from 120% to 200% of the Federal Poverty Level. The analysis processed over 10GB of ridership data across 6 NYC neighborhoods using optimized SQL queries and Python data pipelines.

Project Goals

The primary objective was to understand transit usage patterns and evaluate how expanding Fair Fares eligibility would affect ridership across different NYC neighborhoods. This analysis directly informs policy decisions that impact transportation equity and accessibility for low-income residents.

Big Data Processing Architecture

Data Pipeline Components

Database Layer

  • SQLAlchemy engine for connection management
  • Optimized queries with DATE_TRUNC functions
  • Parameterized queries for security
  • Chunked processing (100,000-row batches)
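
The chunked, parameterized read pattern listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the standard-library sqlite3 module stands in for the SQLAlchemy engine, the `rides` table and its columns are hypothetical, and the batch size is tiny so the example stays readable (the write-up reports 100,000-row batches).

```python
import sqlite3
from collections import Counter

CHUNK_SIZE = 3  # tiny for illustration; the write-up reports 100,000-row batches


def build_demo_db():
    """Create an in-memory table holding a few hypothetical ridership rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rides (neighborhood TEXT, ridership INTEGER)")
    rows = [("A", 10), ("B", 5), ("A", 7), ("C", 4), ("B", 1), ("A", 3), ("C", 6)]
    conn.executemany("INSERT INTO rides VALUES (?, ?)", rows)
    return conn


def total_by_neighborhood(conn, min_ridership, chunk_size=CHUNK_SIZE):
    """Stream rows in fixed-size batches, keeping only running totals in memory."""
    # The filter value is bound as a parameter, never interpolated into the SQL.
    cursor = conn.execute(
        "SELECT neighborhood, ridership FROM rides WHERE ridership >= ?",
        (min_ridership,),
    )
    totals = Counter()
    chunks_seen = 0
    while True:
        batch = cursor.fetchmany(chunk_size)
        if not batch:  # an empty batch means the result set is exhausted
            break
        chunks_seen += 1
        for neighborhood, ridership in batch:
            totals[neighborhood] += ridership
    return dict(totals), chunks_seen


conn = build_demo_db()
totals, chunks_seen = total_by_neighborhood(conn, min_ridership=3)
print(totals, chunks_seen)
```

With pandas and SQLAlchemy, the same idea is usually expressed by passing `params` and `chunksize` to `pandas.read_sql`, which then yields one DataFrame per batch instead of loading the whole result at once.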

Processing Layer

  • pandas DataFrame aggregation
  • Memory-efficient data handling
  • Geographic clustering algorithms
  • Correlation analysis pipelines

SQL Optimization Strategy

Key Optimization Techniques

  • Temporal grouping with DATE_TRUNC for efficient date aggregation
  • Conditional aggregation using CASE WHEN for Fair Fares counting
  • Geographic filtering with IN clauses for neighborhood selection
  • Strategic indexing on timestamp and location columns
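
The four techniques above can be combined in a single query, sketched here under stated assumptions. The table name, column names, neighborhood codes, and sample rows are all hypothetical. To keep the example runnable without a database server, it uses SQLite, which has no DATE_TRUNC, so `strftime('%Y-%m-01', ts)` stands in for PostgreSQL's `DATE_TRUNC('month', ts)`.

```python
import sqlite3

NEIGHBORHOODS = ["A", "B"]  # hypothetical neighborhood codes

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rides (ts TEXT, neighborhood TEXT, fare_class TEXT, ridership INTEGER)"
)
conn.executemany(
    "INSERT INTO rides VALUES (?, ?, ?, ?)",
    [
        ("2024-01-03 08:00:00", "A", "Fair Fares", 40),
        ("2024-01-17 18:00:00", "A", "Full Fare", 160),
        ("2024-02-02 08:00:00", "A", "Fair Fares", 30),
        ("2024-01-05 09:00:00", "B", "Fair Fares", 10),
        ("2024-01-09 09:00:00", "B", "Full Fare", 90),
        ("2024-01-09 09:00:00", "C", "Full Fare", 500),  # excluded by the IN filter
    ],
)
# Index on the timestamp and location columns used for grouping and filtering.
conn.execute("CREATE INDEX idx_rides_ts_hood ON rides (ts, neighborhood)")

# Only the "?" placeholders are interpolated; the values themselves are bound.
placeholders = ", ".join("?" for _ in NEIGHBORHOODS)
query = f"""
    SELECT strftime('%Y-%m-01', ts) AS month,  -- PostgreSQL: DATE_TRUNC('month', ts)
           neighborhood,
           SUM(ridership) AS total_rides,
           SUM(CASE WHEN fare_class = 'Fair Fares' THEN ridership ELSE 0 END)
               AS fair_fares_rides
    FROM rides
    WHERE neighborhood IN ({placeholders})
    GROUP BY month, neighborhood
    ORDER BY month, neighborhood
"""
monthly = conn.execute(query, NEIGHBORHOODS).fetchall()
for row in monthly:
    print(row)
```

The conditional SUM computes the Fair Fares count and the overall count in one pass over the table, which avoids a second filtered query and a join.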

Performance Results

  • Processed 10GB+ datasets in under 45 minutes
  • Memory usage optimized through chunked processing
  • Identified a 98% correlation between bus and subway usage patterns

Key Findings

Usage Correlation Analysis

Bus-Subway Correlation: 98%

Strong positive correlation between bus and subway usage patterns across all analyzed neighborhoods

Manhattan: 99.2%
Brooklyn: 97.8%
Queens: 98.1%
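
The page does not say which correlation measure was used. Assuming it is the Pearson coefficient (so that "98%" would correspond to r ≈ 0.98), the calculation can be sketched as below. The daily counts are made-up illustration values, not project data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical daily ridership counts for one neighborhood.
bus = [120, 150, 170, 160, 200]
subway = [300, 360, 420, 390, 480]
r = pearson_r(bus, subway)
print(f"bus-subway correlation: {r:.3f}")
```

In a pandas pipeline the equivalent call would be `Series.corr`, which defaults to the Pearson method.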

Peak Usage Analysis

Peak Hours: 8 AM / 6 PM

Clear morning and evening rush hour patterns identified across all transportation modes

Morning Rush (7-9 AM): 34%
Evening Rush (5-7 PM): 31%
Off-Peak Hours: 35%
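
A breakdown like the one above can be derived from hourly ridership totals. The sketch below assumes the windows are half-open (7-9 AM covers the hours starting at 7 and 8, 5-7 PM the hours starting at 17 and 18); the page does not specify this, and the hourly counts are hypothetical.

```python
MORNING_HOURS = range(7, 9)    # 7:00-8:59, assumed window boundaries
EVENING_HOURS = range(17, 19)  # 17:00-18:59, assumed window boundaries


def window_shares(hourly_counts):
    """Return each window's percentage share of total ridership.

    hourly_counts maps an hour of day (0-23) to a ride count.
    """
    total = sum(hourly_counts.values())
    if total == 0:
        raise ValueError("no rides recorded")
    morning = sum(hourly_counts.get(h, 0) for h in MORNING_HOURS)
    evening = sum(hourly_counts.get(h, 0) for h in EVENING_HOURS)
    off_peak = total - morning - evening  # everything outside the two windows
    return {
        "morning_rush": 100 * morning / total,
        "evening_rush": 100 * evening / total,
        "off_peak": 100 * off_peak / total,
    }


# Hypothetical hourly counts, not project data.
shares = window_shares({6: 50, 7: 100, 8: 150, 12: 200, 17: 120, 18: 80, 22: 100})
print(shares)
```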

Geographic Fair Fares Adoption

Bronx: 23.4% (Highest Adoption)
Brooklyn: 18.7% (Strong Usage)
Queens: 15.2% (Growing Adoption)
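
The page does not state how adoption rates are computed. One plausible definition, assumed here purely for illustration, is the share of an area's total rides that were paid with Fair Fares. The function name, area labels, and counts below are hypothetical.

```python
def adoption_rates(rides_by_area):
    """Rank areas by Fair Fares share of rides (assumed definition of adoption).

    rides_by_area maps an area name to (fair_fares_rides, total_rides).
    Returns {area: percent}, ordered from highest to lowest share.
    """
    rates = {}
    for area, (fair_fares_rides, total_rides) in rides_by_area.items():
        if total_rides <= 0:
            raise ValueError(f"no rides recorded for {area}")
        rates[area] = round(100 * fair_fares_rides / total_rides, 1)
    # dicts preserve insertion order, so the sorted order is kept.
    return dict(sorted(rates.items(), key=lambda item: item[1], reverse=True))


# Hypothetical counts, not project data.
rates = adoption_rates({
    "Area X": (50, 400),
    "Area Y": (90, 300),
    "Area Z": (10, 200),
})
print(rates)
```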

Key Geographic Insights

  • Bronx shows highest Fair Fares adoption rates, indicating significant need for affordable transit options
  • Brooklyn demonstrates strong usage patterns with potential for expansion
  • Manhattan shows lower adoption rates due to higher average income levels
  • Staten Island presents unique challenges due to limited public transit infrastructure

Policy Recommendations

Expansion Strategy

  1. Gradual Eligibility Expansion: Increase from 120% to 200% FPL in phases
  2. Geographic Prioritization: Focus initial expansion on high-adoption areas
  3. Peak Hour Optimization: Enhanced service during identified peak periods

Implementation Considerations

  1. Budget Impact Assessment: Projected 40% increase in program participants
  2. Infrastructure Readiness: Capacity planning for increased ridership
  3. Continuous Monitoring: Real-time tracking of program effectiveness

Technical Achievements

Big Data Processing

  • Successfully processed 10GB+ of ridership data
  • Implemented efficient memory management strategies
  • Optimized SQL queries for large-scale analytics
  • Developed scalable data pipeline architecture

Statistical Analysis

  • Identified a 98% correlation between bus and subway usage
  • Performed geographic clustering analysis
  • Identified peak usage patterns
  • Generated actionable policy recommendations

Performance Metrics

Data Volume: 10GB+

Massive datasets processed with optimized SQL queries and Python pipelines

Usage Correlation: 98%

98% correlation between bus and subway usage patterns identified

Neighborhood Analysis: 6 NYC Areas

Comprehensive analysis across diverse NYC neighborhoods

Competition Result: Datathon Participant

Participated in inaugural MHC × MTA Datathon