01 — Data & Feature Engineering
A tract-level dataset was assembled by integrating three primary sources: demographic and commuting data from the American Community Survey 2024, NJ Transit infrastructure GIS files, and 2025 TIGER/Line census tract geometries. The result is 2,181 rows with 39 attributes per tract.
ACS Variables
From the ACS, we draw commute mode shares (transit, drive, walk, bike), workforce size, median household income, median age, and demographic shares (pct_black, pct_hispanic, pct_foreign_born). The target variable, transit_share, is the share of employed residents who commute by public transportation.
Transit Infrastructure
Bus stop locations come from NJ Transit's BASEGIS.BUS_STOPS_BY_LINE dataset, which contains over 31,000 stop points statewide. Rail and light rail stations were sourced from separate NJ Transit GIS layers. All spatial operations were performed in EPSG:3424 (NJ State Plane, US Survey Feet) for accurate distance calculations.
Engineered Spatial Features
Three spatial accessibility features were computed to move beyond raw stop counts and capture how accessible transit actually is in practice:
- dist_to_bus: straight-line distance from each tract centroid to the nearest bus stop, via sjoin_nearest in GeoPandas. Units: feet.
- dist_to_rail: same computation for the nearest rail or light rail station. Captures access to the higher-capacity, longer-distance network.
- bus_density_2mi: count of bus stops within a 3,218-meter (~2-mile) buffer of each tract centroid. Captures network coverage depth, not just proximity to a single stop.
Of the three, bus_density_2mi is the strongest predictor in both models, ahead of every demographic variable individually. dist_to_rail ranks second in the Random Forest, while dist_to_bus adds comparatively little once bus density is included.
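The three definitions above can be illustrated without GIS libraries. The sketch below is a brute-force stand-in for the GeoPandas workflow, not the project's code. It assumes centroids and stops are already projected to EPSG:3424, so coordinates are in US survey feet, and it expresses the 2-mile radius as 10,560 ft to match those units. The function name and the sample coordinates are hypothetical.

```python
import math

FEET_PER_MILE = 5280
BUFFER_FT = 2 * FEET_PER_MILE  # 2-mile radius expressed in CRS units (feet)


def spatial_features(centroid, bus_stops, rail_stations, radius_ft=BUFFER_FT):
    """Return dist_to_bus, dist_to_rail (feet) and bus_density_2mi for one tract.

    All points are (x, y) tuples in a projected CRS measured in feet.
    """
    cx, cy = centroid
    bus_dists = [math.hypot(cx - x, cy - y) for x, y in bus_stops]
    rail_dists = [math.hypot(cx - x, cy - y) for x, y in rail_stations]
    return {
        "dist_to_bus": min(bus_dists),
        "dist_to_rail": min(rail_dists),
        "bus_density_2mi": sum(d <= radius_ft for d in bus_dists),
    }


# Hypothetical coordinates, for illustration only.
centroid = (0.0, 0.0)
bus_stops = [(300.0, 400.0), (6000.0, 8000.0), (20000.0, 0.0)]
rail_stations = [(0.0, 15000.0)]

features = spatial_features(centroid, bus_stops, rail_stations)
print(features)
```

The point of the example is the unit handling: in a feet-based CRS, both the nearest-stop distance and the buffer radius must be in feet. For a statewide dataset, a spatial join backed by a spatial index would scale better than this pairwise loop.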
02 — OLS Regression
An Ordinary Least Squares model was estimated with 10 predictors and transit_share as the outcome. OLS is used for interpretability: standardized coefficients show each predictor's direction and relative magnitude on a comparable scale. R² = 0.556 (adj. R² = 0.554) — 56% of tract-level variance in transit use explained by infrastructure access and demographics, F(10, 2154) = 269.9, p < 0.001.
- bus_density_2mi has the largest standardized coefficient (β = 0.074, p < 0.001): network coverage depth is the single strongest OLS predictor.
- dist_to_bus (β = −0.008) and dist_to_rail (β = −0.009) are both negative and significant (p < 0.001): greater distance to transit consistently reduces ridership.
- pct_hispanic (β = −0.032) and pct_foreign_born (β = +0.027) are statistically significant, suggesting that nativity and ethnicity interact with transit use in ways that reflect both need and network dependency. pct_black is not significant (p = 0.45).
- bus_stops carries a negative coefficient (β = −0.011, p < 0.001) — counterintuitive, but expected: once bus density is controlled for, raw stop count is partly a proxy for tract size and existing service saturation, not additional access.
OLS sets the interpretable baseline. Its limitations — no nonlinear terms, no interaction effects, sensitivity to multicollinearity — motivate the Random Forest model.
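The reported fit statistics can be cross-checked against each other with the standard formulas for adjusted R² and the overall F statistic. The sketch below assumes n = 2,165 observations, inferred from the residual degrees of freedom (2,154 = n − 10 − 1). Because R² is rounded to three decimals, the recomputed F only approximates the reported 269.9.

```python
r2 = 0.556        # reported R-squared
p = 10            # number of predictors
df_resid = 2154   # residual degrees of freedom from F(10, 2154)
n = df_resid + p + 1  # implied sample size (assumption: intercept included)

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid

# Overall F statistic: explained variance per predictor over residual variance.
f_stat = (r2 / p) / ((1 - r2) / df_resid)

print(f"n = {n}")
print(f"adjusted R2 = {adj_r2:.3f}")
print(f"F({p}, {df_resid}) = {f_stat:.1f}")
```

The recomputed adjusted R² rounds to 0.554, and the F statistic lands within rounding distance of the reported value, so the three figures are mutually consistent.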
03 — Random Forest
A Random Forest regressor (200 trees, max depth 10) was trained on the same 10 features. Random Forest captures nonlinear relationships and feature interactions that OLS cannot — including the sharp ridership threshold that emerges past a certain distance from transit stops.
In notebook test runs (80/20 split), the model achieves a held-out R² of 0.58. The model stored for prediction was trained on the full dataset; its R² of 0.90 is measured on training data and is inflated by overfitting, so the held-out estimate of ~0.58 is the right benchmark. On that benchmark it still improves on the OLS fit (0.556) by capturing nonlinear relationships and feature interactions. Feature importance scores reveal a stark concentration of predictive power:
- bus_density_2mi accounts for 54.8% of total feature importance — more than all other variables combined. Network coverage depth is the dominant determinant of whether people ride transit.
- dist_to_rail is the 2nd most important feature at 10.8%. Distance to rail remains independently predictive because the bus density metric does not capture rail access.
- dist_to_bus falls to 9th at 3.3% — once bus density is in the model, the nearest individual stop adds little marginal information (the density measure already captures proximity implicitly).
- Demographic variables combined account for ~27% of total importance. pct_hispanic (7.8%) and pct_foreign_born (6.1%) are the most influential demographic predictors, reflecting both transit dependency and residential patterns near transit corridors. Infrastructure accounts for the remaining ~73%.
The model is used to generate predicted_transit_share for all 2,181 tracts, which feeds directly into gap analysis and scenario modeling.
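The evaluation setup described in this section (200 trees, max depth 10, 80/20 split) can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the project notebook: the three feature names are borrowed from the text, but the data-generating formula, the random seeds, and every resulting number are assumptions made for the example. It also assumes impurity-based importances; the text does not say which importance measure was used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
feature_names = ["bus_density_2mi", "dist_to_rail", "pct_hispanic"]

# Synthetic stand-in data: the target depends nonlinearly on the first column.
X = rng.uniform(0.0, 1.0, size=(n, len(feature_names)))
y = np.where(X[:, 0] > 0.5, 0.3, 0.05) + 0.05 * X[:, 1] + rng.normal(0.0, 0.02, n)

# 80/20 split so the model is scored on rows it never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=200, max_depth=10, random_state=0)
model.fit(X_train, y_train)

# Training R2 is optimistic; the held-out score is the one to report.
train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))
print(f"train R2 = {train_r2:.3f}, held-out R2 = {test_r2:.3f}")

# Impurity-based importances sum to 1 across features.
importances = dict(zip(feature_names, model.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.1%}")
```

Reporting both scores side by side makes the gap between training fit and held-out fit visible, which is the distinction the section draws between 0.90 and ~0.58.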
04 — Gap Analysis
The transit gap is computed as:
gap = predicted_transit_share − transit_share
A positive gap identifies suppressed demand — the model expects higher ridership than is observed, meaning infrastructure is the binding constraint. These are not low-demand areas; they are access-constrained areas. A negative gap signals overperformance — tracts where ridership exceeds what the infrastructure profile alone would predict, often due to transit-oriented density, employer subsidies, or strong network connections.
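The sign convention is easy to invert by accident, so a small sketch may help. It applies the formula to three made-up tracts; the tract IDs, the share values, and the label strings are hypothetical.

```python
def classify_gap(predicted_share, actual_share):
    """Return (gap, label) using gap = predicted_transit_share - transit_share."""
    gap = predicted_share - actual_share
    if gap > 0:
        label = "suppressed demand"   # model expects more ridership than observed
    elif gap < 0:
        label = "overperformance"     # observed ridership exceeds the prediction
    else:
        label = "as predicted"
    return gap, label


# Hypothetical tracts: (predicted_transit_share, transit_share)
tracts = {
    "tract_A": (0.18, 0.06),
    "tract_B": (0.10, 0.22),
    "tract_C": (0.05, 0.05),
}

results = {tid: classify_gap(pred, actual) for tid, (pred, actual) in tracts.items()}
for tid, (gap, label) in results.items():
    print(f"{tid}: gap = {gap:+.2f} ({label})")
```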
The scatter below plots actual vs predicted transit share for all 2,165 modeled tracts. Points above the diagonal are in the gap. The color gradient, from dark (overperforming) through neutral to orange-red (highest suppressed demand), makes the opportunity zones immediately visible. These tracts are the spatial input for all three investment scenarios.
05 — The Distance–Ridership Threshold
One of the most actionable findings in this analysis is that the relationship between distance and ridership is not linear. Beyond approximately 1 mile from the nearest bus stop, transit share drops sharply — a threshold effect that Random Forest captures and OLS misses.
The chart below plots dist_to_bus (in miles) against transit_share, colored by bus_density_2mi. The pattern is consistent across the 2,165-tract sample: high-ridership tracts cluster near zero distance in high-density service areas. Past the 1-mile mark, ridership converges toward zero regardless of density.
This finding shapes the targeting logic in Scenario 2 (Targeted Bus Investment): new stops are placed in high-demand, high-gap tracts where distance currently exceeds the threshold — the zone where gap-filling converts most efficiently to new ridership.
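The targeting rule can be sketched as a simple filter. The example assumes dist_to_bus is stored in feet (as in the feature definitions), converts the roughly 1-mile threshold to 5,280 ft, and ranks qualifying tracts by gap. The min_gap cutoff, the record layout, and the sample values are hypothetical; the text does not specify how "high-gap" or "high-demand" is defined, and the demand criterion is left out here.

```python
THRESHOLD_FT = 5280  # approximate 1-mile threshold, in feet


def target_tracts(tracts, threshold_ft=THRESHOLD_FT, min_gap=0.05):
    """Select tracts beyond the distance threshold with a gap of at least min_gap.

    Results are sorted so the largest gaps come first.
    """
    selected = [
        t for t in tracts
        if t["dist_to_bus"] > threshold_ft and t["gap"] >= min_gap
    ]
    return sorted(selected, key=lambda t: t["gap"], reverse=True)


# Hypothetical records, for illustration only.
tracts = [
    {"id": "A", "dist_to_bus": 7000.0, "gap": 0.08},
    {"id": "B", "dist_to_bus": 1200.0, "gap": 0.12},   # already close to a stop
    {"id": "C", "dist_to_bus": 9500.0, "gap": 0.15},
    {"id": "D", "dist_to_bus": 6100.0, "gap": -0.02},  # overperforming tract
]

priority = target_tracts(tracts)
print([t["id"] for t in priority])
```

Tract B is excluded despite its large gap because it already sits within the threshold distance, and tract D is excluded because its gap is negative.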
Methodological Takeaway
Two models, one result: infrastructure access explains ridership better than demographics alone.
OLS provides interpretability and confirms the sign and significance of each variable's effect. Random Forest improves predictive accuracy (held-out R² ≈ 0.58 vs 0.556 for OLS) and surfaces the dominance of bus density (54.8% of feature importance), the independent importance of rail access (10.8%), and a nonlinear distance threshold that OLS cannot model. Gap analysis converts predictions into a spatial investment priority map. The methodology moves directly from data to planning decision by identifying where investment changes behavior.