A Tri-environmental GeoAI Co-design Framework

To Examine the Nexus between People's Health and Combined Exposure to Air Pollution and Heatwaves.

Shaokun Lyu, Qiluo Li, Haiyang Li

January 26, 2025

Introduction

Fundamental Idea

As shown in the title, this project leverages cutting-edge geospatial artificial intelligence (GeoAI) to analyze and understand the complex interactions between health outcomes and environmental exposures.

Data Preparation --- Establish a Tri-environment Dataset

The tri-environment dataset combines three sources: the Built Environment Dataset from OpenStreetMap, which provides insights into urban infrastructure and spatial characteristics; the Natural Environment Dataset, which captures air pollution indicators and vegetation measures such as NDVI; and the Social Environment Dataset, which covers demographic, socioeconomic, and community-level attributes. Together, these datasets offer a comprehensive basis for analyzing how different environments influence health outcomes.

GeoAI Model Establishment

By integrating these datasets, GeoAI models are applied to identify the relative importance and interactions of these attributes, revealing key disparities in environmental exposure and health impacts. To achieve this, we first employ TPOT AutoML to determine the optimal machine learning pipeline for training. Using the best pipeline, we conduct global and local analyses, generating results that highlight both overarching patterns and localized variations in health-environment interactions.

Visualization and Results

SHAP (SHapley Additive exPlanations) analysis is utilized to visualize and interpret the results, offering deeper insights into the contributions of individual features. This comprehensive approach enables a nuanced understanding of how built, natural, and social environments collectively shape public health, particularly in the face of compounding stressors like air pollution and heatwaves.

Data Preparation

Overview of Tri-Environment Data

This research integrates three comprehensive datasets—Built Environment (BE), Natural Environment (NE), and Socioeconomic Environment (SE)—to examine spatial patterns and relationships influencing urban and ecological dynamics. Each dataset represents a unique dimension, enabling a holistic exploration of their combined impacts on environmental and human health outcomes.


Built Environment

The Built Environment refers to the physical and spatial characteristics of urban areas, including infrastructure, buildings, and land use. It plays a crucial role in shaping the living conditions, accessibility, and overall quality of life within a community. This dataset, sourced from OpenStreetMap, provides detailed insights into urban infrastructure and spatial characteristics at the Census Tract level.

Built Environment Variables

Ethnic Diversity (Index): A measure of population diversity using Simpson’s Index, where higher values indicate greater diversity.
- Calculation: 1 - Σ(p_i²), where p_i is the proportion of the population in ethnic group i.
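The Simpson's Index formula above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function name, the group labels, and the population counts are hypothetical, and treating an empty tract as having zero diversity is an assumption.

```python
def simpson_diversity(counts):
    """Return 1 - sum(p_i^2) for a mapping of group -> population count."""
    total = sum(counts.values())
    if total == 0:
        # Assumption: a tract with no recorded population gets an index of 0.
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# Hypothetical tract with four groups (illustrative numbers only).
tract = {"group_a": 500, "group_b": 300, "group_c": 150, "group_d": 50}
index = simpson_diversity(tract)
```

A tract with a single group scores 0, and the score rises toward 1 as the population is spread more evenly over more groups. The same function could in principle be applied to land use or housing type counts.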

Housing Metrics: Proportions of multi-unit houses, public facilities, rented dwellings, and small one-bedroom homes among total buildings.
- Multi-unit Houses (%): Percentage of buildings classified as multi-unit dwellings (e.g., apartments).
- Public Facilities (%): Proportion of public facilities (e.g., libraries, museums) relative to total buildings.
- Small Houses (%): Percentage of one-bedroom dwellings among all housing units.
- Rent (%): Share of rented dwellings among total housing units.

Proximity Indicators: Distances to key facilities such as city centers, public transit stops, commercial areas, public services, road networks, and healthcare facilities.
- Distance to City Centers: Average distance to the nearest city center in meters.
- Distance to Public Transit: Proximity to transit stops, including bus and tram stations.
- Distance to Healthcare: Nearest distance to hospitals, clinics, or pharmacies.
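A proximity indicator of this kind can be computed as the shortest distance from a reference point to a set of facility locations. The sketch below assumes great-circle (haversine) distances measured from a single tract point such as a centroid. The source does not state how distances were actually measured, so both the method and the function names are assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_distance_m(origin, facilities):
    """Distance in meters from origin (lat, lon) to the closest facility."""
    return min(haversine_m(origin[0], origin[1], lat, lon) for lat, lon in facilities)


# Hypothetical coordinates, for illustration only.
nearest = nearest_distance_m((0.0, 0.0), [(1.0, 0.0), (0.0, 0.5)])
```

For large facility sets, a spatial index or a projected coordinate system would likely be preferable to this brute-force scan.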

Land Use and Housing Diversity (Indices): Simpson’s indices measuring diversity in land use and housing types.
- Land Use Diversity (Index): Evaluates the variety of land uses (e.g., parks, residential areas) with a score from 0 to 1.
- Housing Diversity (Index): Reflects heterogeneity in housing types, such as residential and commercial buildings.

Density Metrics: Density of buildings, roads, and railways in a Census Tract.
- Building Density: Building area per square kilometer in the Census Tract.
- Road and Railway Density: Total length of roads and railways divided by census tract area.

Street Design: Counts of street facilities (e.g., lamps, traffic lights) and traffic connectivity (e.g., crossings, roundabouts).
- Street Facilities: Count of lamps, traffic lights, and stop signs per square kilometer.
- Traffic Connectivity: Number of junctions, roundabouts, and other connectivity elements.

Natural Environment

The Natural Environment dataset, derived from Google Earth Engine's Satellite Imagery, highlights critical ecological and atmospheric conditions necessary for understanding environmental sustainability. These data provide insight into air quality, temperature variations, and vegetation health across Census Tracts.

Natural Environment Variables

Air Quality Indicators
- Aerosol Optical Depth (AOD): Measures the column-integrated density of atmospheric aerosols over a Census Tract, including those produced by human activities.
- Sulphur Dioxide (SO₂): The density of atmospheric SO₂ in Census Tracts, indicating industrial and vehicular emissions.
- Nitrogen Dioxide (NO₂): A key indicator of vehicle emissions and industrial activities.
- Carbon Monoxide (CO): Reflects pollution levels from incomplete combustion of fossil fuels.
- Ozone (O₃): The density of atmospheric ozone, which affects air quality and is linked to urban pollution.

Land Surface Temperature (LST): The average summer land surface temperature, indicating heat exposure in urban and rural areas.

Built-Up and Vegetation Indices
- Normalized Difference Built-Up Index (NDBI): Ranges from -1 to 1; higher positive values indicate larger built-up areas. Used to assess urbanization levels.
- Normalized Difference Vegetation Index (NDVI): Measures vegetation health and density, with higher values indicating healthier vegetation. Indicates the balance between urban development and natural preservation.
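Both indices are normalized band differences: NDVI is conventionally (NIR - Red) / (NIR + Red), and NDBI is (SWIR - NIR) / (SWIR + NIR). The sketch below applies these standard definitions to single reflectance values. The function names, the zero-denominator handling, and the sample reflectances are assumptions made for illustration, and the source does not specify which satellite bands were used.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), which lies in [-1, 1] for non-negative inputs."""
    denom = a + b
    if denom == 0:
        # Assumption: report 0 when both bands are zero (e.g. masked pixels).
        return 0.0
    return (a - b) / denom


def ndvi(nir, red):
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    return normalized_difference(nir, red)


def ndbi(swir, nir):
    """Normalized Difference Built-Up Index from SWIR and NIR reflectance."""
    return normalized_difference(swir, nir)


# Hypothetical reflectance values for one vegetated pixel.
vegetated_ndvi = ndvi(nir=0.5, red=0.1)
vegetated_ndbi = ndbi(swir=0.3, nir=0.5)
```

A densely vegetated pixel reflects strongly in the near-infrared, so it tends to have a high NDVI and a negative NDBI; built-up surfaces tend to show the opposite pattern.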

Socioeconomic Environment

The Socioeconomic Environment dataset, compiled from government databases, focuses on demographic and economic factors influencing urban and environmental dynamics. Key variables include:

- Population composition, including diversity and age distribution.
- Key health issues, such as chronic diseases and obesity.
- Access to food resources and disparities in food availability.
- Income, education, and socioeconomic disparities.

GeoAI Approach

Overview of the Approach

This study employs a structured three-step approach to analyze the relationships between independent and dependent variables and to identify critical factors influencing the outcomes. The methodology combines automated machine learning (AutoML), global and local Random Forest models, and SHAP (Shapley Additive Explanations) for interpretive analysis.

Workflow Visualization


Introduction to TPOT

TPOT (Tree-based Pipeline Optimization Tool) is an AutoML tool that automates the process of model selection, feature engineering, and hyperparameter tuning. It uses genetic programming to optimize machine learning pipelines, reducing the effort required to find the best-performing model.

- Model selection (e.g., Random Forest, Gradient Boosting).
- Feature preprocessing and engineering (e.g., scaling, imputation).
- Hyperparameter tuning.

AutoML Process Using TPOT

Input Data:
- Preprocess the dataset by handling missing values and converting columns to numeric.
- Split data into training and testing sets (80:20 split).
Feature Engineering and Combination:
- Generate additional features using Polynomial Features.
- Apply PCA for dimensionality reduction.
- Combine all features into a unified dataset.
Feature Selection and Model Optimization:
- Use Recursive Feature Elimination (RFE) to select the most important features.
- TPOT evaluates various models and identifies Random Forest as the best-performing model.
Final Evaluation:
- Evaluate the optimized Random Forest model on the test set using the R² metric.
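The steps listed above can be approximated with a plain scikit-learn pipeline. The sketch below is not the TPOT search itself and not the project's exported pipeline: it hard-codes Random Forest as the final model (the outcome the text reports), runs on synthetic data, and omits missing-value handling. The polynomial degree, number of PCA components, number of selected features, and tree counts are all assumed values.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the tri-environment predictors and one health outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 2.0 * X[:, 0] - X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=300)

# 80:20 train/test split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Combine polynomial features and PCA components into one feature matrix.
features = FeatureUnion([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("pca", PCA(n_components=3)),
])

pipeline = Pipeline([
    ("features", features),
    # Recursive Feature Elimination keeps the 10 highest-ranked features.
    ("select", RFE(RandomForestRegressor(n_estimators=50, random_state=0),
                   n_features_to_select=10, step=5)),
    ("model", RandomForestRegressor(n_estimators=200, random_state=0)),
])

pipeline.fit(X_train, y_train)
test_r2 = r2_score(y_test, pipeline.predict(X_test))
```

Wrapping every step in one `Pipeline` means the feature construction and selection are fitted on the training split only, which keeps the test-set R² an honest estimate.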

Global Random Forest Analysis

After identifying Random Forest as the best model, a global analysis was conducted to assess the importance of independent variables (SE1, SE2, ..., NE3) for each dependent variable (obesity, highchol, diabetes, chd, bphigh).

Data Integration: The dataset was merged with a geospatial shapefile using a standardized GEOID column to incorporate spatial context.
Model Training: A Random Forest model was trained for each dependent variable using 500 trees (n_estimators=500) and at most 6 candidate features considered at each split (max_features=6).
Global Variable Importance: Feature importance scores were calculated for each variable, quantifying their contributions to predicting dependent variables.
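A minimal sketch of the global step, assuming scikit-learn's `RandomForestRegressor` and its impurity-based `feature_importances_`. The source does not say which importance measure was used. The data are synthetic, and the feature names and the two targets are placeholders borrowed from the naming pattern in the text; the merge with the shapefile on GEOID is omitted.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Placeholder predictor names following the SE/BE/NE pattern used in the text.
feature_names = ["SE1", "SE2", "SE3", "BE1", "BE2", "BE3", "NE1", "NE2", "NE3"]
X = rng.normal(size=(200, len(feature_names)))

# Synthetic outcomes: each depends mainly on one known predictor.
targets = {
    "obesity": 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200),
    "diabetes": 3.0 * X[:, 8] + rng.normal(scale=0.5, size=200),
}

importance = {}
for name, y in targets.items():
    # Hyperparameters taken from the text: 500 trees, 6 features tried per split.
    model = RandomForestRegressor(n_estimators=500, max_features=6, random_state=0)
    model.fit(X, y)
    importance[name] = dict(zip(feature_names, model.feature_importances_))

top_obesity = max(importance["obesity"], key=importance["obesity"].get)
top_diabetes = max(importance["diabetes"], key=importance["diabetes"].get)
```

Impurity-based importances sum to 1 for each model, so they are best read as relative shares within one outcome rather than compared across outcomes.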

Local Random Forest Analysis

To gain deeper insights, the analysis focused on local patterns by constructing Local Random Forest models for each dependent variable.

Buffer-based Neighborhood Selection: For each data point, a buffer was created to identify its local neighborhood. The buffer size was adjusted to include sufficient neighboring points (minimum of 5).
Local Model Training: A Random Forest model was trained using the local data points within the buffer. The model produced a local importance score for each independent variable. Additionally, the local model's performance was evaluated using the R² score, which quantified how well the model explained the variation in the dependent variable within the local neighborhood.
Output Storage: Local importance scores and R² values were appended to the geospatial data and exported to a CSV file for documentation and analysis.
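The local procedure can be sketched as follows. Several details are assumptions because the source does not specify them: the buffer grows by a fixed factor until it holds at least 5 neighbors, distances are Euclidean in projected coordinates, the focal point is included in its own training set, and R² is computed in-sample on the local points. The function names and the synthetic data are hypothetical, and the CSV export is omitted.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score


def neighbors_within_buffer(coords, i, radius, min_points=5, growth=1.5):
    """Grow the buffer around point i until it holds at least min_points neighbors."""
    if radius <= 0:
        raise ValueError("radius must be positive")
    dist = np.linalg.norm(coords - coords[i], axis=1)
    not_self = np.arange(len(coords)) != i
    while True:
        idx = np.flatnonzero((dist <= radius) & not_self)
        # Stop once enough neighbors are found or the buffer covers every point.
        if len(idx) >= min_points or radius >= dist.max():
            return idx, radius
        radius *= growth


def local_random_forest(coords, X, y, radius, min_points=5):
    """Fit one small Random Forest per point on its buffer neighborhood."""
    rows = []
    for i in range(len(coords)):
        idx, used_radius = neighbors_within_buffer(coords, i, radius, min_points)
        local = np.append(idx, i)  # assumption: the focal point is included
        model = RandomForestRegressor(n_estimators=50, random_state=0)
        model.fit(X[local], y[local])
        rows.append({
            "n_local": len(local),
            "radius": used_radius,
            "r2": r2_score(y[local], model.predict(X[local])),  # in-sample R²
            "importance": model.feature_importances_,
        })
    return rows


# Synthetic points in a 10 x 10 area with three predictors.
rng = np.random.default_rng(1)
coords = rng.uniform(0, 10, size=(40, 2))
X = rng.normal(size=(40, 3))
y = X[:, 0] + rng.normal(scale=0.2, size=40)
rows = local_random_forest(coords, X, y, radius=0.5)
```

With neighborhoods this small, an in-sample R² will tend to be optimistic, so out-of-bag or held-out scores may be worth considering where the local sample size allows.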

SHAP Analysis and Visualization

SHAP Value Computation: SHAP was employed to provide an interpretive layer to the Local Random Forest models. For each dependent variable:
- The global Random Forest model was retrained on the relevant dataset.
- A SHAP explainer was initialized, and SHAP values were computed for all features.
Visualization: SHAP summary plots, heatmaps, and waterfall plots were generated to visualize the contributions of individual features to the predictions. These visualizations helped explain the local variability in the dependent variables and identified critical factors driving changes.
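To make the quantity behind these plots concrete, the sketch below computes exact Shapley values by brute force for a tiny hand-written model. It is not how the SHAP library treats tree ensembles, and it is not the project's code: the toy `predict` function, the input, and the all-zero baseline are assumptions, and "absent" features are simply replaced by their baseline value.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of each feature of x relative to a baseline input."""
    n = len(x)

    def value(subset):
        # Features in the subset take their real value; the rest use the baseline.
        z = [x[j] if j in subset else baseline[j] for j in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def predict(z):
    """Toy model with one interaction term (illustration only)."""
    return 2.0 * z[0] - z[1] + z[0] * z[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(predict, x, baseline)
```

The contributions sum to the difference between the prediction for `x` and the baseline prediction, which is the additivity that a waterfall plot displays step by step. The interaction term `z[0] * z[2]` is split equally between the two features involved.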

Results

Local Random Forest R² Map

The resulting map of the Local Random Forest R² values demonstrates strong model performance across the analyzed regions, with the majority of local R² scores exceeding 0.84. This indicates that the Local Random Forest models effectively capture the relationships between independent and dependent variables within the defined spatial neighborhoods. The map shows how R² varies across regions, ranging from 0.703 to 0.983, providing a visual representation of model fit and predictive accuracy. The high R² values suggest that the models explain a large proportion of the variance in the dependent variables, supporting the approach and its ability to provide reliable localized insights.

SHAP Summary for Dependent Variables (Top 1000 Points)

The SHAP summary plots provide insights into the contributions of independent variables to each dependent variable across the top 1000 data points. For all the dependent variables (bphigh, chd, diabetes, highchol, obesity), SE6 and SE2 consistently emerge as the most impactful features, showing strong positive and negative impacts across the datasets. Other variables, such as SE1, NE2, and BE3, also demonstrate varying degrees of influence, indicating their localized importance depending on the dependent variable. The color gradient highlights the feature values (high or low), and the dispersion of points illustrates the magnitude of SHAP values, representing their effect on the model's predictions.

SHAP Waterfall Plot for Dependent Variables (Top 1000)

The SHAP waterfall plots provide a detailed breakdown of how individual features contribute to the model predictions for each dependent variable (bphigh, chd, diabetes, highchol, and obesity) among the top 1000 points. For bphigh, features like SE6 and SE2 dominate, with SE6 positively adding to the prediction while SE2 has a significant negative contribution. Similarly, in the chd plot, NE1 and BE4 positively drive the final prediction, with smaller but meaningful contributions from SE6 and BE5. In the case of diabetes, SE6, SE1, and SE3 emerge as the most impactful variables, highlighting their importance in shaping the prediction outcome. For highchol, SE6 contributes substantially, alongside SE2 and SE3, with their impacts varying between positive and negative. Lastly, for obesity, NE3, BE1, and SE6 play critical roles, adjusting the base value significantly to achieve the final prediction.

Discussion

The Random Forest models, both global and local, demonstrated strong predictive performance across all dependent variables, as indicated by high R² values and meaningful feature importance scores. The SHAP analyses provided detailed insights into the contributions of individual features, with SE6, SE2, and NE3 consistently emerging as influential predictors across multiple dependent variables. The waterfall plots and summary graphs highlight the robustness and interpretability of the models, allowing us to understand not only the overall variable importance but also the localized impacts on specific predictions. The high performance and interpretability of our model make it suitable for applications in health-related policy analysis and decision-making. For example, the local Random Forest models, coupled with SHAP insights, can assist in identifying region-specific factors affecting health outcomes, which can guide targeted interventions. Moving forward, the model can be expanded to include additional variables, such as temporal trends or external socioeconomic factors, to further refine its predictive power and utility.