A Tri-environmental GeoAI Co-design Framework
To Examine the Nexus between People's Health and Combined Exposure to Air Pollution and Heatwaves.
As the title indicates, this project uses geospatial artificial intelligence (GeoAI) to analyze the interactions between health outcomes and combined environmental exposures.
The framework draws on three datasets: the Built Environment dataset from OpenStreetMap, which describes urban infrastructure and spatial characteristics; the Natural Environment dataset, which captures air pollution indicators and vegetation greenness measured by NDVI; and the Social (socioeconomic) Environment dataset, which covers demographic, socioeconomic, and community-level attributes. Together, these datasets offer a comprehensive basis for analyzing how different environments influence health outcomes.
By integrating these datasets, GeoAI models are applied to identify the relative importance and interactions of these attributes, revealing key disparities in environmental exposure and health impacts. To achieve this, we first employ TPOT AutoML to determine the optimal machine learning pipeline for training. Using the best pipeline, we conduct global and local analyses, generating results that highlight both overarching patterns and localized variations in health-environment interactions.
SHAP (SHapley Additive exPlanations) analysis is utilized to visualize and interpret the results, offering deeper insights into the contributions of individual features. This comprehensive approach enables a nuanced understanding of how built, natural, and social environments collectively shape public health, particularly in the face of compounding stressors like air pollution and heatwaves.
This research integrates three comprehensive datasets—Built Environment (BE), Natural Environment (NE), and Socioeconomic Environment (SE)—to examine spatial patterns and relationships influencing urban and ecological dynamics. Each dataset represents a unique dimension, enabling a holistic exploration of their combined impacts on environmental and human health outcomes.
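As a minimal sketch of how three tract-level tables might be combined into one feature table, the example below joins them on a shared Census Tract identifier. The column name GEOID, the feature columns, and all values are hypothetical placeholders, not the project's actual schema or data.

```python
import pandas as pd

# Hypothetical tract-level tables; identifiers are kept as strings so that
# leading zeros in tract codes are preserved.
be = pd.DataFrame({"GEOID": ["001", "002", "003"], "BE1": [0.2, 0.5, 0.1]})
ne = pd.DataFrame({"GEOID": ["001", "002", "003"], "NE1": [12.0, 9.5, 14.2]})
se = pd.DataFrame({"GEOID": ["001", "002", "004"], "SE1": [0.31, 0.18, 0.25]})


def merge_tract_tables(tables, key="GEOID"):
    """Inner-join tract-level tables so each row has every dimension."""
    merged = tables[0]
    for table in tables[1:]:
        # validate= raises an error if a tract appears more than once in a table
        merged = merged.merge(table, on=key, how="inner", validate="one_to_one")
    return merged


features = merge_tract_tables([be, ne, se])
print(features)
```

An inner join keeps only tracts present in all three tables, which avoids missing values at the cost of dropping unmatched tracts; whether the project handles unmatched tracts this way is not stated in the source.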
The Built Environment refers to the physical and spatial characteristics of urban areas, including infrastructure, buildings, and land use. It plays a crucial role in shaping the living conditions, accessibility, and overall quality of life within a community. This dataset, sourced from OpenStreetMap, provides detailed insights into urban infrastructure and spatial characteristics at the Census Tract level.
The Natural Environment dataset, derived from Google Earth Engine's Satellite Imagery, highlights critical ecological and atmospheric conditions necessary for understanding environmental sustainability. These data provide insight into air quality, temperature variations, and vegetation health across Census Tracts.
The Socioeconomic Environment dataset, compiled from government databases, focuses on demographic and economic factors influencing urban and environmental dynamics. Key variables include:
This study uses a structured three-step approach to analyze the relationships between independent and dependent variables and to identify the factors that most influence the outcomes. The methodology combines automated machine learning (AutoML), global and local Random Forest models, and SHAP (SHapley Additive exPlanations) for interpretive analysis.
TPOT (Tree-based Pipeline Optimization Tool) is an AutoML tool that automates the process of model selection, feature engineering, and hyperparameter tuning. It uses genetic programming to optimize machine learning pipelines, reducing the effort required to find the best-performing model.
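The idea of automated pipeline selection can be illustrated with a much simpler stand-in. The sketch below does not use TPOT or genetic programming: it compares two hand-picked scikit-learn pipelines by cross-validated R² and keeps the better one, which is the kind of comparison TPOT automates over a far larger search space. The synthetic data and the candidate list are assumptions made for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a nonlinear signal (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=200)

# Two hand-picked candidates; an AutoML tool would generate and evolve many more.
candidates = {
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "random_forest": make_pipeline(
        RandomForestRegressor(n_estimators=50, random_state=0)
    ),
}


def select_best_pipeline(candidates, X, y, cv=5):
    """Return the candidate name with the highest mean cross-validated R²."""
    scores = {
        name: cross_val_score(pipe, X, y, cv=cv, scoring="r2").mean()
        for name, pipe in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


best_name, cv_scores = select_best_pipeline(candidates, X, y)
print(best_name, cv_scores)
```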
After the AutoML search identified Random Forest as the best-performing model, a global analysis was conducted to assess the importance of the independent variables (SE1, SE2, ..., NE3) for each dependent variable (obesity, highchol, diabetes, chd, bphigh).
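A global importance ranking of this kind could be sketched as follows. The example fits one Random Forest on synthetic data and ranks features by scikit-learn's impurity-based importance; the source does not say which importance measure was used, so that choice is an assumption. The feature labels are borrowed from the text, but the data are random and the dominant feature is built in by construction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
feature_names = ["SE1", "SE2", "NE1", "BE1"]  # labels only; values are synthetic
X = rng.normal(size=(300, len(feature_names)))
# Synthetic outcome in which the second column dominates by construction.
y = 3.0 * X[:, 1] + 0.5 * X[:, 2] + 0.1 * rng.normal(size=300)


def rank_features(X, y, names, seed=0):
    """Fit a global Random Forest and return (name, importance) pairs, largest first."""
    model = RandomForestRegressor(n_estimators=100, random_state=seed).fit(X, y)
    pairs = zip(names, model.feature_importances_)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


ranking = rank_features(X, y, feature_names)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

With several dependent variables, the same function would simply be called once per outcome column.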
To capture local patterns that a single global model can average away, Local Random Forest models were then constructed for each dependent variable.
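One way a local Random Forest could be set up is sketched below: for every location, a separate forest is trained on that location's k nearest neighbours, and its out-of-bag R² is recorded as a local fit score. The neighbourhood definition (k nearest neighbours), the value of k, the use of out-of-bag R², and the synthetic data are all assumptions; the source does not specify how its neighbourhoods or local R² values were computed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import NearestNeighbors


def local_rf_r2(coords, X, y, k=40, n_estimators=30, seed=0):
    """Fit one Random Forest per location on its k nearest neighbours.

    Returns an array with the out-of-bag R² of each local model.
    """
    nn = NearestNeighbors(n_neighbors=k).fit(coords)
    _, idx = nn.kneighbors(coords)  # each neighbourhood includes the point itself
    scores = np.empty(len(coords))
    for i, neighbours in enumerate(idx):
        rf = RandomForestRegressor(
            n_estimators=n_estimators, oob_score=True, random_state=seed
        )
        rf.fit(X[neighbours], y[neighbours])
        scores[i] = rf.oob_score_  # R² estimated on out-of-bag samples
    return scores


# Synthetic example: the effect of the first feature grows from west to east.
rng = np.random.default_rng(1)
coords = rng.uniform(size=(60, 2))
X = rng.normal(size=(60, 3))
y = (1.0 + 2.0 * coords[:, 0]) * X[:, 0] + 0.1 * rng.normal(size=60)

local_r2 = local_rf_r2(coords, X, y, k=40)
print(local_r2.min(), local_r2.max())
```

Mapping the returned scores back onto the tract geometries would give a local R² map of the kind described in the next section.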
Local Random Forest R² Map
The map of Local Random Forest R² values shows strong model fit across the analyzed regions: local R² ranges from 0.703 to 0.983, and most values exceed 0.84. The Local Random Forest models therefore capture the relationships between independent and dependent variables within the defined spatial neighborhoods, and the map shows where that fit is stronger or weaker. The high R² values indicate that the models explain a large share of the variance in the dependent variables, which supports using them for localized analysis.
SHAP Summary for Dependent Variables (Top 1000 Points)
The SHAP summary plots provide insights into the contributions of independent variables to each dependent variable across the top 1000 data points. For all the dependent variables (bphigh, chd, diabetes, highchol, obesity), SE6 and SE2 consistently emerge as the most impactful features, showing strong positive and negative impacts across the datasets. Other variables, such as SE1, NE2, and BE3, also demonstrate varying degrees of influence, indicating their localized importance depending on the dependent variable. The color gradient highlights the feature values (high or low), and the dispersion of points illustrates the magnitude of SHAP values, representing their effect on the model's predictions.
SHAP Waterfall Plot for Dependent Variables (Top 1000)
The SHAP waterfall plots provide a detailed breakdown of how individual features contribute to the model predictions for each dependent variable (bphigh, chd, diabetes, highchol, and obesity) among the top 1000 points. For bphigh, features like SE6 and SE2 dominate, with SE6 positively adding to the prediction while SE2 has a significant negative contribution. Similarly, in the chd plot, NE1 and BE4 positively drive the final prediction, with smaller but meaningful contributions from SE6 and BE5. In the case of diabetes, SE6, SE1, and SE3 emerge as the most impactful variables, highlighting their importance in shaping the prediction outcome. For highchol, SE6 contributes substantially, alongside SE2 and SE3, with their impacts varying between positive and negative. Lastly, for obesity, NE3, BE1, and SE6 play critical roles, adjusting the base value significantly to achieve the final prediction.
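The logic behind a waterfall plot, a base value adjusted feature by feature until it reaches the final prediction, can be reproduced on a small scale. The sketch below does not use the shap library: it computes exact Shapley values by enumerating every feature subset, which is only practical for a handful of features, and replaces absent features with values from a background sample. The model, the data, and the feature labels are synthetic stand-ins.

```python
from itertools import combinations
from math import factorial

import numpy as np
from sklearn.ensemble import RandomForestRegressor


def shapley_values(predict, x, background):
    """Exact Shapley values for one instance x by enumerating feature subsets."""
    n = len(x)

    def value(subset):
        # Expected prediction when features in `subset` are fixed to x
        # and the remaining features vary over the background sample.
        data = background.copy()
        cols = list(subset)
        if cols:
            data[:, cols] = x[cols]
        return predict(data).mean()

    base = value(())
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return base, phi


rng = np.random.default_rng(7)
names = ["SE6", "SE2", "NE3", "BE1"]  # labels only; values are synthetic
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + X[:, 0] * X[:, 2] + 0.1 * rng.normal(size=200)
model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

x = X[0]
base, phi = shapley_values(model.predict, x, X[:50])
prediction = model.predict(x[None, :])[0]

# Waterfall order: start at the base value, add the largest contributions first.
running = base
for j in np.argsort(-np.abs(phi)):
    running += phi[j]
    print(f"{names[j]}: {phi[j]:+.3f} -> {running:.3f}")
```

The contributions sum exactly to the difference between the prediction and the base value, which is the additivity property that a waterfall plot visualizes.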
The Random Forest models, both global and local, showed strong predictive performance across all dependent variables, as indicated by their high R² values. The SHAP analyses detailed the contributions of individual features, with SE6, SE2, and NE3 consistently emerging as influential predictors across multiple dependent variables. The summary and waterfall plots make the models interpretable, showing not only overall variable importance but also the localized impacts on specific predictions.
This combination of performance and interpretability makes the model suitable for health-related policy analysis and decision-making. For example, the local Random Forest models, coupled with SHAP insights, can help identify region-specific factors affecting health outcomes and so guide targeted interventions. Moving forward, the model can be expanded to include additional variables, such as temporal trends or further socioeconomic factors, to refine its predictive power and utility.