The level of atmospheric oxygen, a driver of free radical damage and tumorigenesis, decreases sharply with rising elevation. To understand whether ambient oxygen plays a role in human carcinogenesis, we characterized age-adjusted cancer incidence (compiled by the National Cancer Institute from 2005 to 2009) across counties of the elevation-varying Western United States and compared trends displayed by respiratory cancer (lung) and non-respiratory cancers (breast, colorectal, and prostate). To adjust for important demographic and cancer-risk factors, 8–12 covariates were considered for each cancer. We produced regression models that captured known risks. Models demonstrated that elevation is strongly, negatively associated with lung cancer incidence (
At present, four types of cancer—prostate, breast, lung, and colorectal—exceed 100,000 new cases per year in the United States. Of these cancers, lung cancer carries the worst prognosis and will claim an estimated 159,260 lives in 2014 (
Inspired molecular oxygen (O2) leads to intracellular formation of reactive oxygen species (ROS). This occurs either by spontaneous ionizing radiation or by incomplete reduction of O2 during normal cellular respiration (
The amount of DNA damage and cytotoxicity incurred is influenced both by the effectiveness of oxygen metabolism (
Despite the inability to perform controlled experiments of oxygen toxicity in a human setting, elevation provides a natural experimental platform for examining the effects of oxygen on carcinogenesis. The relation between elevation and barometric pressure, and hence oxygen, is roughly linear at habitable altitudes. Across United States counties, elevation accounts for a 34.9% decrease in oxygen from Imperial County, California (−11 m) to San Juan County, Colorado (3,473 m). From its partial pressure at sea level, oxygen is reduced to 88.7% at 1,000 m, 78.5% at 2,000 m, and 69.2% at 3,000 m (
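The percentages above can be approximated with a short calculation. The sketch below assumes the International Standard Atmosphere troposphere model, whose constants are not given in the text, and assumes that oxygen makes up a constant fraction of air, so that oxygen partial pressure scales with total barometric pressure. The function name `pressure_fraction` is illustrative.

```python
def pressure_fraction(elevation_m: float) -> float:
    """Fraction of sea-level barometric pressure at a given elevation.

    Assumption: International Standard Atmosphere troposphere model.
    The two constants come from that model, not from the study.
    """
    return (1.0 - 2.25577e-5 * elevation_m) ** 5.25588


# Oxygen partial pressure is taken to scale with total pressure.
for meters in (1000, 2000, 3000):
    percent = 100 * pressure_fraction(meters)
    print(f"{meters} m: {percent:.1f}% of sea-level oxygen")
```

Under these assumptions the model agrees closely with the figures quoted for 1,000 m, 2,000 m, and 3,000 m.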
Numerous reports and observations of lower cancer rates at higher elevations appear in the literature of the last four decades (
Building on existing experimental and epidemiological evidence, we designed a study to assess the effect of elevation-dependent ambient oxygen on cancer incidence. We focused on the elevation-varying Western United States, maximizing variation in our exposure of interest while minimizing potential confounding. The recent proliferation of high-resolution, publicly available data enabled a precise ecological evaluation of our hypothesis. We relied on county-level incidence rather than mortality to minimize biases from quality of care and disease progression. To accurately assess oxygen exposure, we incorporated subcounty population dispersion into the calculation of county elevation. We accounted for potential confounding by including important risk and demographic factors and by evaluating a range of environmental variables that covary with elevation. We compared the association of elevation with lung cancer against its association with breast, colorectal, and prostate cancers to discriminate between atmosphere-dependent and atmosphere-independent elevation effects. These steps, combined with a robust and conservative statistical framework, provided a rigorous assessment of our hypothesis: cancer incidence decreases as elevation rises, a trend most pronounced in tissue with direct atmospheric exposure.
From 11 publicly available databases, we compiled US county data on cancer risk factors, environmental features, demographics, and quality control metrics, while avoiding redundancy. When selecting resources, we balanced several considerations including coverage, precision, collection period, and accessibility. To capture the long latency period of cancers and minimize observational error, we preferred collection periods preceding the cancer incidence timeframe and spanning multiple years. Resources were integrated using county FIPS (Federal Information Processing Standards) codes if available and name matching otherwise. Variables collected as cancer incidence predictors are displayed in
The cancers for which each predictor was included are denoted by ‘l’ (lung), ‘b’ (breast), ‘c’ (colorectal), ‘p’ (prostate), ‘all’ (all 4 cancers), and ‘env’ (as an elevation replacement in the environmental analysis). The data collection period, number of counties with non-missing values after quality control, and unweighted mean and standard deviation are also reported.
Predictor | Cancers | Counties | Mean | sd | Years | Units |
---|---|---|---|---|---|---|
Black | all | 259 | 1.9 | 2.6 | 2000 | % |
Education | all | 259 | 24 | 10 | 2006–2010 | % of adults with bachelor’s |
Income | all | 259 | 49 | 11 | 2006–2010 | thousands of US $ |
Metro | all | 259 | 0.43 | 0.5 | 2003 | binary classification |
Obesity | all | 259 | 21 | 3.5 | 2003–2005 | % prevalence |
White | all | 259 | 86 | 11 | 2000 | % |
Elevation | all | 259 | 0.97 | 0.74 | 2000 | kilometers |
Diabetes | c | 259 | 6.6 | 1 | 2004–2008 | % age-adjusted prevalence |
Drinking | b, c | 244 | 15 | 3.8 | 2002–2008 | % binge drinking last 30 days |
Female smoking | b | 258 | 41 | 6.8 | 1997–2003 | % smoked in lifetime |
Male | c, l | 259 | 50 | 1.6 | 2000 | % |
Mammogram | b | 259 | 64 | 6.2 | 2000–2003 | % within last two years |
Meat | c, p | 259 | 67 | 12 | 2006 | lbs per household per year |
Other cancer | b | 259 | 273 | 33 | 2005–2009 | age-adjusted incidence per 100,000 |
Other cancer | c | 258 | 396 | 42 | 2005–2009 | age-adjusted incidence per 100,000 |
Other cancer | l | 255 | 382 | 36 | 2005–2009 | age-adjusted incidence per 100,000 |
Other cancer | p | 259 | 352 | 48 | 2005–2009 | age-adjusted incidence per 100,000 |
Smoking | l, c | 258 | 47 | 5.8 | 1997–2003 | % smoked in lifetime |
Particulate | l, env | 259 | 10 | 1.8 | 2003–2008 | µg/m3 |
Radon | l, env | 258 | 1.7 | 1 | | picocuries per liter |
UVB | env | 259 | 1,072 | 239 | 1996–2005 | kJ/m2 |
Sunlight | env | 259 | 17,060 | 1,939 | 1979–2000 | kJ/m3 |
Precipitation | env | 259 | 1.9 | 1.5 | 1979–2000 | average daily mm |
High temp | env | 259 | 16 | 4.6 | 1979–2000 | °C |
Diurnal temp | env | 259 | 9.5 | 1.6 | 1979–2000 | °C |
Total county cancer incidences (‘All Races (incl. Hisp)’, ‘Both Sexes’, ‘All Ages’) were obtained from the National Cancer Institute (NCI) State Cancer Profiles for the following categories: ‘Lung & Bronchus’, ‘Breast’, ‘Prostate’, ‘Colon & Rectum’, and ‘All Cancer Sites’ (
County-level education, income, and mammogram data were obtained from State Cancer Profiles (
The percent of individuals having reported smoking over 100 cigarettes in their lifetime was downloaded from NCI Small Area Estimates (
Age-adjusted model-based estimates for obesity and diabetes prevalence were downloaded from the Centers for Disease Control (CDC) (
A classification of counties as metropolitan or nonmetropolitan produced in 2003 was obtained from the USDA Economic Research Service (
County averages for minimum and maximum daily temperature (°C), fine particulate matter with an aerodynamic diameter less than 2.5 micrometers (µg/m3), precipitation (mm), and sunlight (kJ/m3) were downloaded from the CDC WONDER database (
Absolute barometric pressure provides a more direct measure of oxygen than does elevation, albeit marginally. However, barometric pressure data is collected at land stations, which are not universally distributed, and frequently reported relative to sea level for weather forecasting. Therefore, we relied on population-weighted elevation to assess atmospheric oxygen exposure.
We calculated county elevation by subdividing a county into census blockgroups, computing the mean elevation for each blockgroup, and calculating the population-weighted average of the blockgroup elevations. On average, US counties contained 66.5 blockgroups with an average population of 1,348.3 persons per blockgroup. By accounting for population dispersion within counties, this method assesses the inhabitants’ exposure to elevation better than population-agnostic methods. With greater computational resources, future researchers may choose to use census blocks as a finer subdivision combined with higher-resolution elevation data. This framework can be generalized to measure exposure to any topographical variable across a geographical area.
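As a minimal sketch of the population-weighted average described above, the following function takes (population, mean elevation) pairs for the blockgroups of one county. The function name and the example county are hypothetical and chosen only to show how the weighted value can differ from a population-agnostic mean.

```python
def population_weighted_elevation(blockgroups):
    """Population-weighted mean elevation of a county.

    blockgroups: iterable of (population, mean_elevation_m) pairs,
    one pair per census blockgroup.
    """
    pairs = list(blockgroups)
    total_population = sum(pop for pop, _ in pairs)
    if total_population == 0:
        raise ValueError("county has no population")
    return sum(pop * elev for pop, elev in pairs) / total_population


# Hypothetical county: most residents live in a valley town, while a
# sparsely populated blockgroup covers high mountain terrain.
county = [(4000, 1500.0), (900, 1600.0), (100, 3000.0)]
weighted = population_weighted_elevation(county)
unweighted = sum(elev for _, elev in county) / len(county)
print(weighted, unweighted)
```

In this example the weighted elevation stays close to where residents actually live, whereas the unweighted mean is pulled upward by the nearly empty mountain blockgroup.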
We restricted the analysis to states in the contiguous United States with elevation spans exceeding 3,000 m. The selected states—AZ, CA, CO, ID, MT, NV, NM, OR, UT, WA, WY—contained 414 counties composing the Western United States. Next, counties were filtered for quality control. Counties with populations below 10,000 were excluded due to high missingness (values were missing for many of the variables) and observational error (values were present but subject to large margins of error, evidenced by source-reported confidence intervals). Counties with high Native American composition or immigration rates were considered potentially problematic: cancer rates among Native Americans are prone to misestimation (
We evaluated the association between elevation and cancer incidence using multiple linear regression. Counties were weighted by the square root of their population, with population capped at 250,000, the point at which measurement uncertainty leveled off at a minimal level. This weighting scheme accounted for the greater measurement uncertainty of low-population counties without granting heavily populated counties an overwhelming influence.
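The weighting scheme can be written in a few lines. This is a sketch of the rule as stated in the text; the names `POPULATION_CAP` and `county_weight` are illustrative.

```python
import math

POPULATION_CAP = 250_000  # cap stated in the text


def county_weight(population: int) -> float:
    """Regression weight: square root of population, capped at 250,000."""
    return math.sqrt(min(population, POPULATION_CAP))


for population in (10_000, 250_000, 1_000_000):
    print(population, county_weight(population))
```

With the 10,000-person minimum applied during quality control, the largest weight is five times the smallest, whereas weighting by raw population would let a county of one million residents outweigh the smallest county a hundredfold.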
To minimize confounding effects, we selected well-established factors to include as covariates with elevation. We avoided excessive collinearity (
We employed two regression methods, best subset and lasso, in parallel. The best subset approach allowed us to force elevation into the model and exhaustively evaluate all possible models while remaining amenable to statistical interpretation. However, evaluating all possible subsets creates the potential for overfitting. Lasso addresses this concern by introducing coefficient shrinkage and variable selection (
Predictors displayed expected correlations such as a strong positive correlation between obesity and diabetes. Collinearity was moderate but pervasive. Elevation covaried with most variables including cancers indicating the need to adjust for covariates while carefully considering collinearity. Besides radon, the correlation of elevation with other predictors did not exceed |
For each cancer, we performed best subset regression by evaluating all predictor subsets that included elevation. Subset sizes ranged from one, where elevation was the sole predictor, up to the total number of included variables. For a given size, the predictor set minimizing the residual weighted sum of squares was computed. To identify an optimal model across subset sizes, we chose the subset whose model minimized the Bayesian Information Criterion (BIC) (
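The procedure can be sketched as an exhaustive search in which elevation is always included. The code below is an illustration rather than the study's implementation: it uses only the Python standard library, assumes the common Gaussian form BIC = n ln(RSS/n) + k ln(n) with k counting the intercept, and runs on made-up data. All function and variable names are hypothetical.

```python
import itertools
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def weighted_rss(columns, y, w):
    """Weighted residual sum of squares of a least-squares fit with intercept."""
    design = [[1.0] * len(y)] + columns  # list of column vectors
    k = len(design)
    xtwx = [[sum(wi * a * b for wi, a, b in zip(w, design[i], design[j]))
             for j in range(k)] for i in range(k)]
    xtwy = [sum(wi * a * yi for wi, a, yi in zip(w, design[i], y))
            for i in range(k)]
    beta = solve(xtwx, xtwy)
    fitted = [sum(beta[i] * design[i][row] for i in range(k))
              for row in range(len(y))]
    return sum(wi * (yi - fi) ** 2 for wi, yi, fi in zip(w, y, fitted))


def best_subset(predictors, y, w, forced="elevation"):
    """Return (bic, names) for the BIC-minimizing subset containing `forced`."""
    n = len(y)
    optional = [name for name in predictors if name != forced]
    best = None
    for size in range(len(optional) + 1):
        for extra in itertools.combinations(optional, size):
            names = (forced,) + extra
            rss = weighted_rss([predictors[name] for name in names], y, w)
            bic = n * math.log(rss / n) + (len(names) + 1) * math.log(n)
            if best is None or bic < best[0]:
                best = (bic, names)
    return best


# Hypothetical toy data (not from the study): 10 counties.
elevation = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 0.2, 1.2, 2.2]
smoking = [40, 45, 50, 42, 48, 52, 44, 46, 41, 49]
metro = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.0, 0.1]
incidence = [60 - 7 * e + 1.5 * s + r
             for e, s, r in zip(elevation, smoking, noise)]
weights = [1.0] * 10

bic, chosen = best_subset(
    {"elevation": elevation, "smoking": smoking, "metro": metro},
    incidence, weights)
print(chosen)
```

Because the search is exhaustive, the number of fits doubles with each additional optional predictor, which remains manageable for the 8–12 covariates considered per cancer.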
We fit a single model for each cancer using lasso regression (
To display the relationship between elevation and cancer incidence while accounting for the effect of covariates, we employed partial regression plots. The
To investigate the potential of a smoking-elevation interaction affecting lung cancer, we partitioned counties into smoking prevalence terciles as follows: high-smoking (49.9–61.9), mid-smoking (44.9–49.9), low-smoking (28.2–44.9). Within each tercile, we regressed lung cancer incidence against elevation for visual examination. For statistical evaluation, we refitted the optimal best subset model with an added interaction term (standardized smoking prevalence × standardized elevation).
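A brief sketch of the two steps follows. The tercile cut points are taken from the ranges above, but the treatment of boundary values is an assumption, since adjacent ranges share an endpoint. Standardization here uses the unweighted sample mean and standard deviation, which is also an assumption, and the function names are illustrative.

```python
import statistics


def smoking_tercile(prevalence, low_cut=44.9, high_cut=49.9):
    """Assign a county to a smoking tercile using the reported cut points.

    Assumption: a value equal to a cut point goes to the higher tercile.
    """
    if prevalence < low_cut:
        return "low"
    if prevalence < high_cut:
        return "mid"
    return "high"


def standardize(values):
    """Center to mean zero and scale to unit sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def interaction_term(smoking, elevation):
    """Product of standardized smoking prevalence and standardized elevation."""
    return [s * e for s, e in zip(standardize(smoking), standardize(elevation))]


print([smoking_tercile(p) for p in (30.0, 47.0, 55.0)])
```

The interaction column is then added as one more predictor when refitting the model; a coefficient near zero on that column corresponds to the approximately parallel slopes expected when no interaction is present.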
To mitigate uncontrolled confounding (i.e., omitted-variable bias), we stratified the lung cancer dataset by state. Health policy and data collection are often enacted at the state level, making stratification by state a sensible choice for maximizing within-strata homogeneity. Within each stratum, lung cancer incidence was regressed against elevation and smoking prevalence. The elevation effect size was estimated across the eleven state-specific models using a fixed-effects meta-analysis, which averaged elevation coefficients weighted by their inverse variances.
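The pooling step of a fixed-effects meta-analysis can be sketched as an inverse-variance weighted mean. The function below is a generic illustration with hypothetical inputs; it takes the state-specific elevation coefficients and their standard errors as given.

```python
import math


def fixed_effects_meta(estimates, standard_errors):
    """Inverse-variance weighted mean and its standard error."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled, pooled_se


# Hypothetical coefficients from two states: the more precise
# estimate (smaller standard error) dominates the pooled value.
pooled, pooled_se = fixed_effects_meta([-6.0, -10.0], [1.0, 2.0])
print(pooled, pooled_se)
```

States with few counties yield imprecise coefficients and therefore contribute little to the pooled estimate, which keeps small strata from distorting the overall effect size.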
We evaluated the association between elevation and lung cancer incidence measured for the following population subgroups: under 65 years old, 65 or older, males, and females. These subgroups were chosen because exposure to risk factors, such as occupational or lifestyle hazards, often segregates by sex or age. For this analysis, we created a dataset with counties that had no missing data for the four subgroup outcomes and the predictors from the optimal best subset model. Proceeding with the best subset covariates, we fit a separate regression model for each population subgroup. We used sex-specific smoking prevalence for the male and female models.
Environmental variables without established cancer risk were not evaluated in the previous regression analyses to avoid problems of collinearity. For lung and breast cancer, we investigated whether substituting elevation with each of seven environmental variables produced a more likely model, which could indicate an indirect elevation-cancer association. For each environmental variable, we performed best subset regression to find the BIC-minimizing set of predictors. The covariates evaluated for each cancer matched those from before (
Analyses were performed using the statistical-computing language
The county-level dataset compiled for this study is available (
Performing best subset regression for each cancer, we found a highly significant, strong negative association between elevation and lung cancer incidence with a standardized coefficient (
The weighted mean and standard deviation for each cancer incidence is reported. For each optimal best subset model, the number of counties (
Cancer | Incidence mean (sd) | Counties | Model size | R² | Elevation p-value | Elevation coefficient | Standardized elevation coefficient | Coefficient (% of mean incidence) |
---|---|---|---|---|---|---|---|---|
Lung | 56.8 (14.4) | 253 | 5 | 70.2% | 1.34 × 10⁻¹⁷ | −7.23 [−9.29, −5.18] | −0.35 [−0.46, −0.25] | −12.7% [−16.35%, −9.11%] |
Breast | 119.3 (16.7) | 243 | 6 | 56.8% | 3.22 × 10⁻³ | −3.63 [−7.06, −0.20] | −0.15 [−0.30, −0.01] | −3.0% [−5.92%, −0.17%] |
Colorectal | 41.9 (6.0) | 243 | 5 | 34.1% | 0.883 | 0.65 [−0.76, 2.06] | 0.08 [−0.09, 0.24] | 1.5% [−1.82%, 4.92%] |
Prostate | 148.6 (23.9) | 259 | 4 | 18.9% | 0.974 | 4.71 [−1.56, 10.97] | 0.14 [−0.05, 0.33] | 3.2% [−1.05%, 7.38%] |
The optimal (BIC-minimizing) models contained five predictors for lung and colorectal cancers, six predictors for breast, and four predictors for prostate cancer (
Elevation coefficients and 99% confidence intervals are plotted across a range of best subset model sizes for each cancer. As determined by BIC, more likely models are shaded darker. The optimal (BIC-minimizing) model for each cancer is denoted by a star. Lung cancer is the only cancer to display consistently negative coefficients and confidence intervals.
Lasso regression performs variable selection that operates well under moderate collinearity and coefficient shrinkage that prevents overfitting. Using a conservative setup of the lasso, we again observed a strong, negative association between elevation and lung cancer incidence with a standardized coefficient of −0.33, changing minimally from the best subset estimate (
The number of predictors (size) and
Cancer | Size | R² | Elevation coefficient | Standardized elevation coefficient | Coefficient (% of mean incidence) |
---|---|---|---|---|---|
Lung | 6 | 67.1% | −6.64 | −0.33 | −11.7% |
Breast | 6 | 51.3% | −0.39 | −0.02 | −0.3% |
Colorectal | 6 | 27.4% | – | – | – |
Prostate | 2 | 7.8% | – | – | – |
Models produced for each cancer by best subset (
(A) Summary of the predictors included in the optimal best subset model for each cancer (see
For lung cancer, both regression methods found previously characterized positive associations with smoking prevalence, percent of black residents (
Points represent counties shaded by their regression weight based on population. Bivariate (red) and partial (blue) regression lines are displayed with 99% confidence bands. (A) Bivariate plot of county lung cancer incidence (age-adjusted per 100,000) and elevation (km). (B) Partial regression plot for elevation based on the optimal best subset lung model. Association sharpens after adjustment for covariates, illustrated by the tighter confidence band and higher
For breast cancer, the best subset regression model captured known positive associations with education and income (
Given the respiratory intersection of oxygen inhalation and smoking, we asked whether elevation associated differently across smoking prevalences. Stratifying counties into smoking terciles, the strong effect of smoking on lung cancer was illustrated by the non-overlapping confidence bands (
Stratification analyses provide no evidence for a smoking-elevation interaction or state-based confounding. (A) Lung cancer incidence (age-adjusted per 100,000) and elevation (km) are plotted for counties stratified into terciles by smoking prevalence. Counties were shaded by their regression weight, and a bivariate regression was fit for each stratum. The non-overlapping confidence bands (95%) illustrate the strong effect of smoking on lung cancer, while the approximately parallel slopes demonstrate the lack of an observable smoking-elevation interaction. (B) Lung cancer was regressed against elevation and smoking for each state. State-specific elevation coefficients are plotted with 95% confidence intervals. Ten of eleven states displayed negative coefficients. State-specific elevation coefficients were meta-analyzed. The resulting 99% confidence interval overlaps the interval from the optimal best subset model shown in blue. The number of counties within each state analysis is indicated in parentheses.
While focusing on the elevation-varying Western United States minimized the risk of regional confounding, possible differences in health policy or practice between higher and lower elevation states could still exist. State-specific models that accounted for smoking found negative elevation coefficients for ten of eleven states (
Disparate habits and lifestyles exist across age groups and sexes. Using subgroup-specific lung cancer incidences, we refit the best subset model asking whether elevation association would change (
Points represent counties shaded by their regression weight based on population. Partial regression plots are displayed for each subgroup with 99% confidence bands. Subgroups displayed similar strongly negative elevation associations, indicating that our findings were not the result of sex or age-based confounding. The elevation coefficient, standardized elevation coefficient, and partial
Rising elevation leads to lower atmospheric pressure, which helps drive a repertoire of interconnected climatic changes, including perturbations in sun exposure, temperature, and precipitation. This trend was apparent in our data, as many environmental variables correlated with elevation, and thus also with lung and breast cancer (
(A) Environmental variables were strongly collinear. Correlation with elevation was high, hence many environmental variables also covaried with lung and breast cancer. (B) To test whether elevation-association with lung and breast cancer was direct or indirect, we substituted each environmental correlate in place of elevation during best subset selection for each cancer. The optimal model for each elevation-replacement was compared to the unreplaced model by approximating a Bayes factor (
We approximated the likelihood (as a Bayes factor,
As a consequence of elevation’s numerous environmental correlates, we speculated that previous ecological studies of lung cancer may have fallen prey to uncontrolled elevation confounding. Radon (
Lung cancer incidence models were fit for three sets of predictors: (model 1) radon/UVB and smoking; (model 2) radon/UVB, smoking, and elevation; and (model 3) radon/UVB and the optimal best subset predictors. The standardized UVB/radon coefficient (
Model | Radon standardized coefficient | Radon p-value | UVB standardized coefficient | UVB p-value |
---|---|---|---|---|
1 | −0.36 [−0.44, −0.28] | 1.90 × 10⁻¹⁵ | −0.18 [−0.28, −0.08] | 3.08 × 10⁻⁴ |
2 | −0.07 [−0.18, 0.04] | 0.120 | 0.05 [−0.04, 0.15] | 0.864 |
3 | −0.03 [−0.13, 0.07] | 0.308 | −0.01 [−0.10, 0.09] | 0.453 |
We attributed a decrease of 25.2 (99% CI [18.0–32.4]) lung cancer cases per 100,000 individuals to the range of elevation across counties of the Western United States, equating to approximately 44% of the mean incidence (56.8). Were the entire United States situated at the elevation of San Juan County, CO (3,473 m), we estimate that 65,496 (99% CI [46,855–84,136]) fewer new lung cancer cases would arise per year (
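The first of these figures can be retraced from numbers reported earlier: the elevation coefficient and its interval from the optimal best subset lung model, the elevations of the lowest and highest counties, and the mean lung cancer incidence. The sketch below assumes the attributed decrease is simply the coefficient multiplied by the elevation span in kilometers; the variable names are illustrative. The nationwide estimate is not reproduced because it depends on population and elevation data not given here.

```python
# Values reported in the text and tables.
coefficient = -7.23             # cases per 100,000 per km of elevation
ci_low, ci_high = -9.29, -5.18  # interval reported for the coefficient
lowest_m, highest_m = -11, 3473 # Imperial County, CA and San Juan County, CO
mean_incidence = 56.8           # weighted mean lung cancer incidence

span_km = (highest_m - lowest_m) / 1000  # elevation range in km
decrease = -coefficient * span_km        # cases per 100,000 across the range
decrease_low = -ci_high * span_km
decrease_high = -ci_low * span_km
share_of_mean = 100 * decrease / mean_incidence
per_km_percent = 100 * coefficient / mean_incidence

print(round(decrease, 1), round(decrease_low, 1), round(decrease_high, 1))
print(round(share_of_mean), round(per_km_percent, 1))
```

Under this assumption the calculation recovers the decrease of 25.2 with interval 18.0 to 32.4, the share of roughly 44% of mean incidence, and the −12.7% change per kilometer reported for lung cancer.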
Prior to covariate adjustment, elevation correlated negatively with lung, breast, and colorectal cancer. Adjusting for demographic and risk factors using best subset and lasso regression, we produced sensible models for each cancer that captured known risk factors. Elevation’s negative association with colorectal cancer disappeared with adjustment for demographics, while association with breast cancer proved to be minimal and could be better represented by several other environmental variables. In contrast, lung cancer’s negative association with elevation was sharpened following multiple regression, failed to be captured by any other environmental variables, and had a remarkably strong effect size and significance. Lung-elevation association was robust to county stratification by smoking and state, as well as to population subgrouping by age and sex.
In summary, lung cancer associated with elevation rather than with oxygen-independent environmental factors, and elevation associated with lung cancer but not with non-respiratory cancers. Together, these points provide substantial evidence for an inhaled carcinogen that is tied directly to elevation and declines as elevation rises. Viewing our findings through the lens of the literature, atmospheric oxygen emerges as the most probable culprit.
Since elevation commonly covaries and its effect size on lung cancer is large, the potential for confounding is high. We identified two reported lung cancer associations—radon (
Studies where lung cancer is incorporated as a predictor rather than outcome may also be susceptible to confounding. Since the collection and availability of tobacco data has historically lagged behind lung cancer data, lung cancer rates have often been adopted as a proxy for smoking prevalence (
The confounding potential of elevation extends to any analysis of lung cancer across an elevation-varying region. To protect against this uncontrolled confounding, we urge future lung cancer studies to strongly consider adjusting for elevation. Fortunately, elevation is well-documented across the globe, and many existing epidemiological datasets contain locality information.
Cross-sectional study designs are susceptible to uncontrolled confounding, where associations arise due to an unmeasured confounding factor. We designed the study to minimize this risk by focusing on a homogeneous and elevation-varying region, filtering error-prone counties, including established covariates, and performing multiple stratifications and subgroupings. Furthermore, while environmental correlates of elevation represented likely confounders, none of the seven pervasive environmental factors we investigated could replace elevation in models of lung cancer. Therefore, our findings gave no indication of uncontrolled confounding for lung cancer. As relevant data become available, follow-up across different regions will provide additional assessment of uncontrolled confounding.
Since we evaluated counties rather than individuals, ecological fallacy was also a concern. However, several of our methods and findings limit this possibility. By focusing on US counties, the smallest population grouping with systematic data available for the target region, we inherently reduced the risk of ecological fallacy. Moreover, further increases in group specificity through population subgrouping did not alter the elevation-lung association. Additionally, we reason that an exposure affecting only a portion of the population must confer a very large risk to produce the strong association observed. We find it unlikely that an extremely damaging risk factor on the individual level, such as smoking, would have evaded detection until now. Therefore, we speculate the causal factor is likely mild in carcinogenicity but universal in exposure and thus amenable to translation from the individual to population level. Follow-up biological and experimental analyses will be critical to understanding the causal factor and potential mechanisms underlying the observed elevation association. If future research confirms oxygen-driven tumorigenesis in the human lung, the present study will join the substantial list of ecological analyses that spurred new insights into cancer etiology (
This study was made possible by excellent county-level resources, many of which have only recently become available. While the trend towards available, accessible, and reusable data is encouraging, barriers remain. For example, the 2006–2010 State Cancer Profiles release is missing incidence for two-thirds of the counties in Washington due to “state legislation and regulations which prohibit the release of county level data to outside entities.” As the world comes online and the number of people with access to informatics tools expands, we see the spread of open data as a vital catalyst for progress.
We suspected misestimated cancer rates for counties with a high Native American percentage and a poor ability of predictors to assess cancer-risk exposure for counties with high immigration rates. To examine whether these counties were problematic, we created a general model of cancer incidence by regressing all-site cancer incidence against eight demographic and health-related covariates (metro, white, black, education, income, obesity, percent male, and smoking). Elevation was not included in the model to prevent opportunistic threshold selection. The regression was fit on Western-US counties with populations of at least 10,000. Absolute residuals are plotted against percent Native American and the 5-year immigration rate for each county (shaded by their population-based regression weight). Loess curves (displayed in blue with 95% confidence bands) indicate that predicted incidence diverged from reported incidence for both native and immigration-rich counties. Exclusion thresholds were selected, above which counties were filtered (red background), corresponding to the values where absolute residuals began trending higher.
Coefficient estimates from are displayed in unstandardized (
Tab-delimited data collected for US counties. Missing values are blank. Source-reported 95% confidence intervals have ‘lower’ and ‘upper’ appended to the corresponding variable name.
The authors declare there are no competing interests.