Identifying highly connected sites for risk-based surveillance and control of cucurbit downy mildew in the eastern United States

View article
Ecology

Introduction

Dispersal properties of a pathogen are fundamental to the development of epidemics at different spatial scales that can range from local to the landscape level. The transmission of invasive plant pathogens and the spread of resultant epidemics influences essential ecosystem services, including biodiversity and food production in agricultural systems (Brown & Hovmøller, 2002; Crowl et al., 2008). Measures that might involve containment and eradication programs can be implemented to reduce the potential impact of these epidemics. However, the planning and implementation of any specific control measure requires an understanding of the mechanics of invasions and the ecological consequences, risks, and dynamics of disease spread. Such control efforts can benefit greatly from epidemic records within a region as they enable an analysis of the overall structure of pathogen dispersal. Information from such analyses can inform the design of control programs for disease epidemics and risk-based surveillance. For example, timely recording of animal movements was fundamental in the containment of the 2011 foot and mouth disease epidemic in the UK, for which retrospective analyses demonstrated that initial spread was influenced by the frequency of animal movement (Ferguson, Donnelly & Anderson, 2001; Kao et al., 2006).

One approach to understand pathogen dispersal and the spread of resultant epidemics is through network analysis, a method that is becoming increasingly popular but still has limited application in plant disease epidemiology (Garrett et al., 2018; Xing et al., 2020). Networks consist of ‘nodes’ and ‘links’, where nodes are the entities of interest (e.g., individual fields or observed sites of disease outbreak), while links connect nodes in various ways, for example, the potential of contact with a pathogen or pathogen transmission between two nodes. Further, networks can be weighted with link weights that are proportional to the probability of transmission. Networks have been used to describe the spread of diseases caused by aerially dispersed plant pathogens such as Podosphaera macularis in hop (Gent, Bhattacharyya & Ruiz, 2019) and Phakopsora pachyrhizi in soybean (Sutrave et al., 2012; Sanatkar et al., 2015). The primary determinants in pathogen dispersal, such as source strength, location of host populations and relevant covariate information, can be formulated as a network spreading model (Firester, Shtienberg & Blank, 2018; Garrett et al., 2018; Gent, Bhattacharyya & Ruiz, 2019; Sutrave et al., 2012). Such models can combine static spatial components, such as field location, and dynamic components of an epidemic, such as wind-based pathogen dispersal, to infer the underlying contact structure of landscape connectivity (With, Gardner & Turner, 1997).

The choice of networks to be studied depends on several factors, for example, the disease of interest and specific questions on the network structure. The latter, in turn, influences the type of network measures to be used in the analysis of pathogen dispersal and disease spread. Static and dynamic networks are common in landscape connectivity analyses. Connections in a static network are fixed links, while connections in a dynamic network change over time. Both static and dynamic networks have been applied in plant disease epidemiology (Sanatkar et al., 2015; Sutrave et al., 2012). In dynamic networks, between-node distances, host availability, wind speed and wind direction, can be formulated as a susceptible-infected (SI) model to describe disease spread (Sutrave et al., 2012). Further, plant diseases display seasonal differences in the occurrence and intensity of epidemics. Thus, an analysis of data from multiple epidemic years is useful in determining if there are recurring patterns that could inform monitoring or disease control measures. Highly connected nodes provide effective surveillance and opportunities for more targeted control to reduce disease spread within the network. An open question still remains regarding which centrality measure is useful for identifying important nodes for surveillance and managing real-world networks (Holme, 2017). Due to inherent differences in pathogen dispersal and disease spread mechanisms, centrality measures used to identify important nodes for surveillance may be specific to different pathosystems (Holme, 2018).

A motivating plant disease example for network analysis to inform surveillance and disease control is cucurbit downy mildew (CDM). A resurgence of the disease occurred around the world in the last 20 years that fundamentally altered cucurbit production and disease management at multiple scales (Holmes et al., 2015; Ojiambo et al., 2015). The resurgence of CDM in Europe and the United States was attributed to the introduction of a new pathotype or species that was previously limited to East Asia (Cohen et al., 2015; Thomas et al., 2017). Fungicides are integral to CDM control due to the lack of cultivars with adequate host resistance and in the absence of control, the disease can result in complete crop loss (Holmes et al., 2015). The disease is caused by an obligate pathogen, Pseudoperonospora cubensis, which exhibits significant long-distance dispersal (Ojiambo & Holmes, 2011). In the continental United States, P. cubensis overwinters below approximately 30-degree latitude in southern Florida and along the Gulf of Mexico on living hosts, and disease outbreaks in northern states rely on aerial dispersal of the pathogen from the south (Ojwang’ et al., 2021). Oospores have been reported in cucurbit fields in the southeastern United States, albeit at a low frequency, however, their role in the epidemiology of CDM is still unclear (Kikway, Keinath & Ojiambo, 2022). Further, while anthropogenic movement of infected transplants could be involved in pathogen dispersal, it is not typically considered due to lack of data.

In 2008, disease surveillance based on a series of sentinel (sites designated for regular monitoring) and non-sentinel (sites not designated for regular monitoring) sites was implemented as part of the CDM ipmPIPE (http://cdm.ipmpipe.org) surveillance system (Ojiambo et al., 2011). Based on the prediction framework developed by Main et al. (2001) and the sentinel site data, an integrated aerobiological model was developed to predict disease occurrence and progression in the eastern United States (Neufeld et al., 2018) to guide growers on when to apply their initial fungicide application. Surveys conducted in Georgia, Michigan, and North Carolina show that the forecasting system resulted in an average reduction of two to three fungicide applications compared to calendar-based application schedules. This reduction in fungicide applications translated to more than $6 million in savings for producers in these three states alone annually (Ojiambo et al., 2011). However, the disease surveillance system is expensive to maintain and thus, there is increasing interest in identifying locations that are critical for pathogen dispersal and disease spread within the region. The latter could facilitate a more targeted surveillance approach by directing the limited resources to locations that are more integral to disease spread and pathogen transmission within the region. These sentinel and non-sentinel sites have been instrumental in understanding the spatio-temporal spread of CDM (Ojiambo & Holmes, 2011; Ojiambo et al., 2017; Ojwang’ et al., 2021).

In this study, we specifically focus on centrality metrics that are directly applicable to CDM surveillance and management to identify highly connected sites. The centrality measures are betweenness (BWC), closeness (CLC), degree (DGC) and eigenvector (EVC), and these metrics have previously been used in network analysis of aerially dispersed plant pathogens and have relevance in describing epidemic spread (Andersen et al., 2019; Gent, Bhattacharyya & Ruiz, 2019; Sanatkar et al., 2015). Our inference of the importance of the highly connected sites is limited to disease records from the existing structure of sentinel and non-sentinel sites within the region. The specific objectives of this study were to: i) determine a centrality measure that is most useful in the surveillance and control of CDM, ii) identify highly connected nodes that are critical for pathogen dispersal and spread of CDM and iii) establish how removal of highly connected nodes influences the spread and containment of CDM in the eastern United States. Portions of this work were previously published as part of a PhD dissertation of the first author (Ojwang’, 2021).

Materials and Methods

Data source

Records of CDM occurrence in the eastern United States from 2008 to 2016 were used in this study. The data were obtained from the CDM ipmPIPE database (http://cdm.ipmpipe.org) that tracks reports of disease occurrence in the United States (Ojiambo et al., 2011). Disease records in the system include reports from a network of regularly monitored sites (sentinel sites) and voluntary reports (non-sentinel sites) submitted by commercial growers, agricultural researchers and the public. Sentinel sites were strategically placed within specific states and planted with different cucurbit host types to monitor CDM occurrence. Sentinel sites were located at research facilities or commercial fields with standard dimensions of 15 m × 61 m and were georeferenced using the Global Positioning System. These sites were planted early and regularly monitored for disease symptoms every 1 to 2 weeks by state collaborators and extension specialists. Cucurbits grown at the sentinel sites were cucumber cv. Straight 8 and Poinsett 76 (Cucumis sativus), cantaloupe cv. Hales Best Jumbo (Cucumis melo), acorn squash cv. Table Ace (Cucurbita pepo), butternut squash cv. Waltham (Cucurbita moschata), giant pumpkin cv. Big Max (Cucurbita maxima), and watermelon cv. Micky Lee (Citrullus lanatus) (Ojiambo et al., 2011). Non-sentinel reports were from locations not designated for regular surveillance but rather voluntary reports from commercial fields, research plots, and home gardens (Table 1). These non-sentinel reports are useful given that, in some epidemic years, CDM was reported earlier in non-sentinel sites than in sentinel sites and thus, they could be informative for inferring sources for disease spread.

Table 1:
States, number of counties in the eastern United States where cucurbit downy mildew was reported, and number of monitoring sites with disease summarized by planting type, during the study period.
Number of
states affected
Number of counties Number of sites by planting type
Year Commercial Home garden Research Sentinela Unspecifiedb Totalc
2008 22 113 68 10 12 59 5 154
2009 24 165 77 26 24 92 1 220
2010 25 118 77 17 24 25 1 144
2011 23 86 57 10 22 28 0 117
2012 25 149 99 20 23 31 0 173
2013 26 179 118 30 23 29 4 204
2014 23 104 53 16 22 20 3 114
2015 27 171 126 15 22 42 4 209
2016 22 107 61 9 19 33 0 122
DOI: 10.7717/peerj.17649/table-1

Notes:

Sentinel planting type refers to fixed plots, planted early and designated for monitoring.
Unspecified refers to reports where the planting type was not stated when the disease was reported in the cucurbit downy mildew monitoring database.
Total number of disease monitoring sites designated as commercial, home garden, research, sentinel and unspecified plot.

Latitudes and longitudes geo-coordinates for sentinel and non-sentinel sites were generated from the customized section of the CDM ipmPIPE website (http://cdm.ipmpipe.org). Where plot data was not available, latitudes and longitudes of county centroids were extracted from US Census Bureau 1990 Gazetteer Files and used as approximate georeferenced points. The compiled data from sentinel and non-sentinel sites included, among other things, the date of first disease symptoms, planting type (sentinel sites, commercial field, research plot, home garden, or unspecified), state, county, and geo-location. A disease case represented a unique combination of host and date of first disease symptoms at a particular location. The total number of disease cases across the study years ranged from 114 to 220, while the number of counties affected ranged from 86 to 179 across epidemic years (Table 1). Correlation analysis was performed to determine whether the number of counties influenced the number of disease reports (Fig. S1) and whether numbers of sites with active surveillance were correlated with the number of counties (Fig. S2) in the region during the study period.

Hourly wind speed and direction at each sentinel site were derived from weather observations from the National Oceanic and Atmospheric Administration Integrated Surface Database (Smith, Lott & Vose, 2011) provided by BASF (Research Triangle Park, Raleigh, NC, USA). Wind measurements were recorded 10 m above the ground and the maximum wind speed was used in this study. Meteorological wind direction is the direction the wind is blowing from, e.g., the wind coming from the north is a northerly wind, and a southerly wind is a wind coming from the south. A raw observation for the meteorological wind direction for a northerly wind is defined as 360°, a southerly wind is 180°, a westerly wind is 270°, and an easterly wind is 90° (Fig. S3). Meteorological wind direction (wd) in degrees was converted to a mathematical direction (md, i.e., the angle as measured in the mathematically conventional way, counterclockwise from the eastward direction) in degrees using the formula:

md={270wd,ifwd270360+(270wd),ifwd>270

The mathematical direction in degrees was subsequently converted to radians (i.e., radians = [degrees × π]/180). The x and y (u and v) components of the hourly wind vectors were then calculated as: x=rcosθ and y=rsinθ, where r is the wind speed in miles per hour and θ is the wind direction in radians (Fig. S3).

Static network analysis

In this study, nodes were a combination of sentinel and non-sentinel sites in the eastern United States. We point out that other locations in the eastern United States that were not monitored in this study may contribute to the risk and spread of CDM. However, the locations where CDM was monitored or reported were available for inclusion in this study. Static networks were constructed for each epidemic year to provide insight into the structure of the spread of CDM in the eastern United States.

The general methodology involved creating a link (l) between a ‘source’ node i at one location and a ‘sink’ node j at another location using a probability that was based on the distance between the two nodes. This probability is given by a connection kernel, which decays with distance such that connections are predominantly localized (Danon et al., 2010). Between-node Euclidean distances were calculated using the Haversine formula (Sinnot, 1984) in the geosphere package (Hijmans, 2017) implemented in the R programming language (R Core Team, 2018). The x and y displacement vectors for two nodes were calculated based on the equirectangular projection as follows:

z=sin2[(φjφi)/2]+cos(φj)cos(φi)sin2[(λjλi)/2]lij=R×2×atan2(z,1z)x=R×(λjλi)cos[(φj+φi)/2]y=R×(φjφi)where φ = latitude (radians), λ = longitude (radians), R = radius of the earth (mean = 6,371 km), and lij = haversine distance between node i to node j.

Links were created using an inverse power-law dispersal kernel y=(lij)b, where y is the probability of transmission from node i to node j (Andersen et al., 2019), lij is the distance between node i and node j, and b is the spread parameter. The parameter b was not estimated in this study but was obtained from a previous study on the isotropic spread of CDM in the eastern United States (Ojiambo et al., 2017) using the same epidemic data from 2008 to 2016 that was used in the present study. Ojiambo et al. (2017) examined how b varied over multiple epidemic years and found that b ranged from 1.61 to 3.36. Thus, a value of b generated for each year from that study was used in the corresponding year examined in the present study to represent isotropic spread through links in the network. In essence, a link was created between node i and node j based on whether it was within a certain distance and if y > τ for 0 < τ < 1, where τ is the threshold probability of pathogen transmission.

Several static networks were created for a range of τ values for uncertainty analysis to determine the influence of τ on link formation as described by Andersen et al. (2019). The range of τ selected was bounded by values that produced a full network and a near-zero probability of link formation (Fig. S4) to facilitate the identification of a network with a giant component (GC), since a network without a GC does not provide much information on the behavior of disease spread. A GC is a connected component whose size is on the same order of magnitude as the size of the whole network. Thus, the value of τ selected to generate the final static network was identified in two stages. First, τ had to result in a network where each node was connected to at least another node (Ferrari, Preisser & Fitzpatrick, 2014). Second, the selected τ also had to have a high proportion of nodes within the GC in the resulting static network. For each epidemic year, the final static network generated using the selected τ value for each epidemic year was used in additional network analyses described below (dynamic networks and error quantification). The degree and the exponent of the degree distribution, γ, for final static networks were estimated in R using the powerLaw package (Gillespie, 2015) as described by Kolaczyk & Csárdi (2020).

Network centrality measures

Centrality measures, betweenness centrality (BWC), closeness centrality (CLC), degree centrality (DGC) and eigenvector centrality (ECV) (Table 2), were calculated using the igraph package in R (Csárdi & Nepusz, 2006) for each static network that was created for different τ values as described below (identification of important nodes). The empirical cumulative distribution functions of BWC, CLC, DGC, and EVC were calculated for each epidemic year to describe the distribution of the generated centrality metrics across all nodes. The cumulative distribution functions of BWC, CLC, DGC, and EVC were obtained using stat_ecdf and visualized using the ggplot2 package in R (Wickham, 2016). The similarity in ranking of nodes among centrality metrics was then assessed using Spearman’s rank-based correlation.

Table 2:
Definition of centrality measures in a network model used to study the spread of cucurbit downy mildew in the eastern United States.
Centrality measure Central node Relevance in epidemic spread
Betweenness (BWC) Acts as a bridge to other nodes Removal of nodes with high betweenness may contain an epidemic
Closeness (CLC) Lies on the shortest path Nodes are able to spread disease through a network
Degree (DGC) Connected to many other nodes Nodes with high degree may be ‘superspreaders’
Eigenvector (ECV) Connected to other high-degree nodes Nodes with neighbors having high degree may be ‘superspreaders’
DOI: 10.7717/peerj.17649/table-2

Identification of important nodes for disease spread within the static network

Analysis of disease outbreaks from 2008 to 2016 was conducted to determine if recurring patterns of disease spread occurred that could help to identify important nodes in the networks. We tallied the number of times a node was observed as infected from 2008 to 2016, herein referred to as the infection frequency. In addition, a new dataset was created with only nodes where the disease occurred in at least one year i.e., infection frequency 1. Two approaches were used to identify nodes potentially important for disease spread that could be useful for risk-based surveillance or disease mitigation: i) selection of nodes based on infection frequency and ii) selection of nodes based on a combination of infection frequency and centrality metrics.

In the first approach, nodes were ranked from highest to lowest based on their infection frequency. In the second approach, a static network was created such that each node was connected to at least another node (Ferrari, Preisser & Fitzpatrick, 2014) using b = 2.11 as estimated previously by Ojiambo et al. (2017) and the centrality metrics were calculated for this network. Centrality metrics were scaled to a value between 0 and 1 and combined with infection frequency in a ratio of 4:1 (frequency:centrality) for each node, to give more weight to infection frequency as described by Sutrave et al. (2012). Nodes were then ranked in decreasing order based on this weighted value. This weighting in the second approach puts more emphasis on nodes where the disease is observed recurrently between years and nodes that either are highly connected and acting as bridges to other nodes (BWC), occur on the shortest path (CLC), or are connected to other potential super-spreaders (DGC and EVC). A sensitivity analysis was conducted with four additional frequency:centrality ratios with different weights. The results of this analysis showed that changing the weights changed the ranks but did not give more weight to the infection frequency (Fig. S5). Further, of all ratios tested, only the 4:1 ratio resulted in consistent results wherein the higher frequency nodes also had higher weights and were ranked higher (Table S1).

For each epidemic year, a range of threshold values (0 < τ < 1) was considered such that bounds for τ produced a range of dense networks and sparse networks. In each year, 20 to 30 individual values of τ were used to construct 20 to 30 networks. Centrality metrics were calculated for each network and the results were ranked in a decreasing order. The top 20 nodes with the highest scores were then selected and a second ranking was done for each node in this set. The number of times a node appeared in the top 20 ranking across all thresholds was recorded to eliminate the nodes that were ranked with higher scores in the dense and sparse networks. The nodes were then ranked in decreasing order. The results across centrality metrics and τ values were combined into a heatmap visualization using the ggplot2 package in R (Wickham, 2016).

Dynamic network model of cucurbit downy mildew

To describe the dynamic process of disease spread occurring on a static network, we modeled the probability of different nodes being infected over a discrete weekly time step, t{1,2,...,T}, in each epidemic year, based on a simplified SI model described by Sutrave et al. (2012) with the following assumptions: i) the pathogen is primarily dispersed by wind, ii) host response to the pathogen is homogeneous and iii) weather is favorable for infection and disease spread. This model combines the static (constant during each year) and the dynamic (time-varying during each year) components of the network and was formulated as:

{αij=(lij)bβij=lijwt|lij|uij=αij×βij}where αij is a constant function of the between-node distance and decays exponentially with distance, βij is the dynamic wind-based infection rate, lij and b are as defined above, lij is the displacement vector between two nodes, wt is the wind vector at time t, and uij is the link weight based on αij and βij between node i and node j at time t.

Given that the probability of a node being infected depends on the number of infected neighbors, the probability ϑi of node i not being infected by its neighbors was calculated as:

ϑi(t)=jNi(1uijpj(t))where pj is the probability of node j being infected at time t, uij ∈ [0,1] is the link weight as defined above, and Ni is a set of neighbors of node i. Given Eq. (4), the probability pi of node i being infected at time t was calculated thus:

pi(t)=1(1pi(t1))ϑi(t)

Values of pi and βij were calculated and updated, respectively, at each weekly time step. All calculations were performed in MATLAB version R2019a (MathWorks Inc., Natick, MA, USA).

Error quantification in the dynamic network model

The observed infection status of a node and the corresponding predicted infection probability of the node were used to quantify the error in the dynamic network model. First, a value of 0 or 1 was assigned to a node that was either non-infected or infected, respectively, in the observed data at each time step t. Secondly, the observed (0 or 1) value for each node was compared to the corresponding infection probability calculated by the model at each time step t. The error was then defined as the absolute difference between the observed and predicted infection probability. The mean error for the infected nodes at time step t was then calculated as Sutrave et al. (2012):

E^in(t)=i=1Nin(t)(1pi(t))Nin(t)where Nin(t) is the total number of infected nodes at time step t, while pi(t) is as defined above. Similarly, the mean error for non-infected nodes for at each time step t was calculated as:

E^hn(t)=i=1Nhn(t)pi(t)Nhn(t)where Nhn(t) is the total number of non-infected nodes at time step t. The total error was obtained by using the expression:

E^=υE^in(t)+(1υ)E^hn(t)where υ is a weighting factor. The ratio υ: (1 − υ) in Eq. (8) was 4:1 such that observed-infected nodes were given four times more weight than the observed-non-infected nodes in evaluating the total error. Here, it was deemed more important to correctly predict infection (i.e., sensitivity) than to correctly predict an absence of infection (i.e., specificity) such that a few nodes incorrectly predicted will have an insignificant effect on the prediction error (Sutrave et al., 2012). All these calculations were performed in MATLAB.

Assessing node importance in disease spread using a dynamic network

The importance of nodes identified as highly connected based on the four centrality measures from the static network analysis, i.e., BWC, CLC, DGC, and EVC, were subsequently evaluated for their impact on disease spread based on link structures of the dynamic network model described above. Nodes identified as most important based on each centrality metric were removed from the networks and the probabilities of disease spreading among the remaining nodes were recalculated in the new dynamic network for each epidemic year as described above. Prediction of disease outbreaks based on all nodes present in the network was subsequently compared to predictions of disease outbreaks when nodes identified as important based on the above centrality measures were removed from the network. This approach of node evaluation is equivalent to intensive disease management, where important nodes are completely removed and the resultant impact of their removal on disease propagation within the network is assessed (Sutrave et al., 2012). A sensitivity analysis was also conducted for a range of υ: (1 − υ) ratios to examine the effect of the choice of the value of the weighting factor υ on the model prediction errors. This analysis showed that increasing the value of υ resulted in negligible changes in prediction errors across all centrality measures and epidemic years (Table S2).

Results

Spatiotemporal dynamics of disease spread in the eastern United States

Observations of disease outbreaks suggested a spatial association between the locations of first and last disease reports. The disease was first observed in a sentinel site in southern Florida in Miami-Dade County in 5 out of 8 epidemic years (Fig. 1). Most of the first disease reports from 2008 to 2016 occurred in February and March in southern Florida or southwestern Texas along the Gulf of Mexico, with reports of initial disease outbreaks being from both sentinel and non-sentinel sites.

Map of locations of disease monitoring.

Figure 1: Map of locations of disease monitoring.

Locations of cucurbit downy mildew outbreaks in the eastern United States from 2008 to 2016. Locations are color-coded based on the week of the year. Shapes represent the surveillance plot type associated with disease reports during the study period. Map Source: ggmap and ggplot.

Subsequent reports of new disease outbreaks progressed northward with time, with new outbreaks occurring later in more northern states (Fig. 1). The first outbreaks of CDM in more northern states (e.g., Michigan, New York, or Wisconsin) occurred considerably later than corresponding reports of first CDM outbreaks in southern states (e.g., Alabama, Georgia or South Carolina). Across all years, the last set of new disease reports occurred in July, August and September across several states within the region (Fig. 1).

The total number of states with CDM ranged from 22 to 27, and the corresponding number of counties ranged from 86 to 179 across the region (Table 1). There was a positive correlation (r = 0.90; P = 0.0002) between the number of disease reports and counties (Fig. S1), with the number of sites increasing with an increasing number of infected counties. However, the correlation between the number of counties where the disease was reported and the number of counties where surveillance was occurring was not significant (r = 0.42; P = 0.2700) (Fig. S2). The linear maximum distance between two disease reports, a measure of the spatial extent of the epidemic, ranged from 2,491 km in 2012 to 3,071 km in 2015.

The number of times that nodes were infected at least once based on combined epidemic data across all years varied from 1 to 6 (Fig. 2). Nodes where the infection frequency was consistently higher than the median frequency (frequency >3) were in Alabama, Maryland, Michigan, North Carolina, Ohio, and South Carolina. Nodes with the highest levels of infection frequency were in Wicomico County in Maryland, Johnson, Lenoir, New Hanover, and Sampson counties in North Carolina, and Sandusky, Huron, and Wayne counties in Ohio, with an infection frequency of 5 and 6 (Fig. 2). The remaining nodes had an infection frequency less than the median and they constituted most of the nodes present in counties scattered throughout the region.

Frequency map of cucurbit downy mildew outbreak.

Figure 2: Frequency map of cucurbit downy mildew outbreak.

Frequency of cucurbit downy mildew outbreaks across all epidemic years from 2008 to 2016 in the eastern United States. Colors represent the frequency (n) of disease cases: red (n = 6), yellow (n = 5), green (n = 4), light blue (n = 3), blue (n = 2) and pink (n = 1). Frequency represents the number of years a node was observed as an infected node (i.e., a location where the disease was reported at least once). Map Source: ggmap and ggplot.

Connectivity threshold and static networks of cucurbit downy mildew

The proportion of nodes in the giant component (GC) and the extent of connectedness in a network were used to select the threshold probability of transmission, τ, to generate the final static networks. For example, for the 2008 epidemic data, networks were more connected at τ = 6.21 × 10−9 (GC = 1.0) than at τ = 1.14 × 10−9 (GC = 0.92), with other threshold values resulting in either highly or sparsely connected networks. Thus, to achieve a balance in connectivity, τ = 6.21 × 10−9 was used to generate the final static network for the epidemic data in 2008 (Fig. 3). Similarly, for the 2009 data, networks were more connected at τ = 7.83 × 10−9 (GC = 0.98) than at τ = 1.12 × 10−8 (GC = 0.95) with the remaining threshold values resulting in either highly or sparsely connected networks. Thus, τ = 7.83 × 10−9 was used to generate the final network for disease records in 2009. This logical approach was used to generate the final networks for disease records for the remaining epidemic years from 2010 to 2016. The corresponding values of τ were 1.0 × 10−19, 4.72 × 10−13, 2.55 × 10−13, 1.0 × 10−14, 2.55 × 10−17, 1.0 × 10−12 and 1.0 × 10−12, respectively (Fig. 3). In summary, the threshold for probability of transmission for the final static networks was very low ranging from (1.0 × 10−19 to 7.8 × 10−9) and the average degree ranged from 12.9 (in 2014) to 52.1 (in 2015). The exponent of the degree distribution (γ) was 2.34 (2008), 1.63 (2009), 2.03 (2010), 1.75 (2011), 1.93 (2013), 1.82 (2014), 2.05 (2015) and 2.14 (2016). Values of γ ≥ 2 indicate that a network is scale-free, i.e., the degrees follow a power-law distribution and the network is characterized by large hubs or nodes with a very high number of links.

Static networks of cucurbit downy mildew epidemics.

Figure 3: Static networks of cucurbit downy mildew epidemics.

Static networks of cucurbit downy mildew epidemics in eastern United States from 2008 to 2016. Closed circles are nodes where disease was reported (either in a sentinel and non-sentinel site) and the lines between two nodes are links for the probability of transmission between two nodes calculated based on the power-law dispersal kernel. Thresholds for probability of pathogen transmission ranged from ranged from 1.0 × 10−19 to 7.8 × 10−9. The initial source of disease outbreak (open square) in 2009, 2011, 2012-2015 was a sentinel site in Miami-Dade County in southern Florida, while the initial source in 2008, 2010 and 2016 was a sentinel site in Collier County in Florida, Alachua County in Florida and Baldwin County in Alabama, respectively. Map Source: ggmap and ggplot.

Centrality measures and selection of important nodes

Betweenness, closeness, degree, and eigenvector centrality metrics varied between epidemic years. Variability among the 20 most important nodes for each of these metrics was also observed for the final static network constructed within a given epidemic year. Overall, variability among the 20 most important nodes within any epidemic year across the entire study was high for BWC. For example, BWC values ranged from 264.5 to 888.3 in 2008 (Table 3), from 1,147.6 to 2,415.7 in 2009 (Table 4), and from 237.6 to 1,718.2 in 2010 (Table 5). The mean values for the 20 most important nodes identified by BWC in these respective years were 441.8, 1,656.9, and 474, with corresponding standard deviations of 441.1, 896.7 and 1,046.9. Variability among the 20 most important nodes as identified by the other centrality metrics was relatively limited (Tables 35), with variability among the nodes identified as important based on CLC being the lowest across the entire study period.

Table 3:
Centrality-based ranking of the twenty most important sites in the cucurbit downy mildew network for the epidemic observed in the eastern United States in 2008.
Betweennessa Closenessa Degreea Eigenvectora
Rank ID State BWC ID State CLC ID State DGC ID State EVC
1 74 MS 888.3 89 NC 0.0034 131 PA 73 128 PA 1.000
2 118 OH 665.3 118 OH 0.0034 52 MD 72 131 PA 0.994
3 135 SC 608.6 125 PA 0.0034 125 PA 72 134 PA 0.989
4 124 OH 534.1 128 PA 0.0034 128 PA 72 125 PA 0.981
5 39 KY 517.2 130 PA 0.0034 130 PA 72 130 PA 0.974
6 141 TN 507.2 124 OH 0.0034 127 PA 71 99 NY 0.963
7 31 GA 500.4 52 MD 0.0034 134 PA 69 127 PA 0.962
8 89 NC 471.1 134 PA 0.0034 99 NY 66 102 NY 0.953
9 137 SC 470.8 86 NC 0.0033 102 NY 65 96 NY 0.943
10 82 NC 416.6 148 VA 0.0033 96 NY 64 97 NY 0.930
11 91 NC 416.6 150 VA 0.0033 129 PA 64 98 NY 0.926
12 139 TN 375.8 131 PA 0.0033 11 DE 63 100 NY 0.902
13 52 MD 372.1 87 NC 0.0033 97 NY 63 52 MD 0.879
14 75 MS 336.7 88 NC 0.0033 98 NY 63 126 PA 0.858
15 125 PA 324.7 127 PA 0.0033 13 DE 62 129 PA 0.856
16 128 PA 305.4 80 NC 0.0033 100 NY 61 111 OH 0.847
17 136 SC 290.5 78 NC 0.0033 10 DE 59 113 OH 0.847
18 33 GA 290.1 79 NC 0.0033 93 NJ 59 117 OH 0.828
19 29 GA 279.0 151 VA 0.0032 94 NJ 59 120 OH 0.820
20 34 GA 264.5 39 KY 0.0032 133 PA 59 101 NY 0.814
Mean 441.8 0.0033 65.4 0.913
SD 441.1 0.0000 9.9 0.132
DOI: 10.7717/peerj.17649/table-3

Notes:

ID, Node identification number.

BWC, Betweenness centrality; CLC, Closeness centrality; DGC, Degree centrality; and EVC, Eigenvector centrality; SD, Standard deviation.

Table 4:
Centrality-based ranking of twenty most important nodes in the cucurbit downy mildew network for the epidemic observed in the eastern United States in 2009.
Betweennessa Closenessa Degreea Eigenvectora
Rank ID State BWC ID State CLC ID State DGC ID State EVC
1 34 GA 2,415.7 122 NC 0.0012 74 MI 35 109 NC 1.000
2 212 VA 2,390.2 132 NC 0.0012 79 MI 35 136 NC 0.979
3 48 KY 2,376.2 134 NC 0.0012 82 MI 33 114 NC 0.979
4 154 OH 2,152.4 129 NC 0.0012 93 MI 33 118 NC 0.966
5 32 GA 2,011.5 124 NC 0.0012 109 NC 33 130 NC 0.960
6 192 TN 1,907.7 135 NC 0.0012 158 OH 33 127 NC 0.960
7 186 SC 1,803.5 205 VA 0.0012 200 VA 33 211 VA 0.937
8 169 PA 1,796.5 212 VA 0.0012 76 MI 32 119 NC 0.913
9 2 AL 1,672.3 48 KY 0.0011 90 MI 32 128 NC 0.906
10 180 SC 1,605.4 163 OH 0.0011 114 NC 32 207 VA 0.898
11 104 MS 1,515.0 164 OH 0.0011 118 NC 32 115 NC 0.891
12 171 PA 1,413.6 165 OH 0.0011 136 NC 32 125 NC 0.887
13 103 MS 1,351.4 133 NC 0.0011 211 VA 32 126 NC 0.884
14 25 FL 1,343.5 192 TN 0.0011 75 MI 31 113 NC 0.882
15 200 VA 1,311.5 123 NC 0.0011 83 MI 31 121 NC 0.872
16 153 OH 1,259.8 169 PA 0.0011 88 MI 31 120 NC 0.869
17 147 NY 1,258.1 171 PA 0.0011 89 MI 31 112 NC 0.869
18 54 KY 1,248.4 183 SC 0.0011 91 MI 31 203 VA 0.867
19 101 MS 1,158.2 207 VA 0.0011 92 MI 31 200 VA 0.864
20 158 OH 1,147.6 203 VA 0.0011 111 NC 31 110 NC 0.850
Mean 1,656.9 0.0011 32.2 0.912
SD 896.7 0.0000 2.8 0.106
DOI: 10.7717/peerj.17649/table-4

Notes:

ID, Node identification number.

BWC, Betweenness centrality; CLC, Closeness centrality; DGC, Degree centrality; and EVC, Eigenvector centrality; SD, Standard deviation.

Table 5:
Centrality-based ranking of twenty most important sites in the cucurbit downy mildew network for the epidemic observed in the eastern United States in 2010.
Betweennessa Closenessa Degreea Eigenvectora
Rank ID State BWC ID State CLC ID State DGC ID State EVC
1 30 KY 1,718.2 30 KY 0.0033 116 OH 56 116 OH 1.000
2 31 KY 1,009.3 31 KY 0.0032 103 OH 54 109 OH 0.998
3 65 MS 691.0 116 OH 0.0032 104 OH 54 106 OH 0.997
4 4 AL 577.1 121 PA 0.0032 105 OH 54 110 OH 0.995
5 139 TX 556.0 105 OH 0.0032 106 OH 54 103 OH 0.995
6 77 NC 486.1 103 OH 0.0032 108 OH 54 113 OH 0.995
7 25 GA 469.3 104 OH 0.0032 109 OH 54 114 OH 0.995
8 74 NC 410.1 108 OH 0.0032 110 OH 54 104 OH 0.995
9 13 FL 404.1 110 OH 0.0032 113 OH 54 108 OH 0.995
10 23 GA 342.0 113 OH 0.0032 114 OH 54 61 MI 0.992
11 26 GA 342.0 114 OH 0.0032 61 MI 53 41 MI 0.983
12 5 AL 331.3 107 OH 0.0032 40 MI 52 53 MI 0.983
13 120 PA 305.2 106 OH 0.0031 41 MI 52 60 MI 0.983
14 130 SC 296.8 109 OH 0.0031 48 MI 52 48 MI 0.983
15 138 TX 282.0 120 PA 0.0031 53 MI 52 105 OH 0.977
16 80 NC 264.0 119 PA 0.0031 60 MI 52 42 MI 0.964
17 67 NC 257.1 115 OH 0.0031 107 OH 52 40 MI 0.960
18 117 PA 253.7 61 MI 0.0031 112 OH 52 111 OH 0.960
19 122 PA 246.7 112 OH 0.0031 122 PA 52 112 OH 0.959
20 140 VA 237.6 111 OH 0.0031 42 MI 51 43 MI 0.959
Mean 474.0 0.0032 53.1 0.983
SD 1,046.9 0.0000 3.5 0.029
DOI: 10.7717/peerj.17649/table-5

Notes:

ID, Node identification number.

BWC, Betweenness centrality; CLC, Closeness centrality; DGC, Degree centrality; and EVC, Eigenvector centrality; SD, Standard deviation.

The distribution of BWC values across the nodes in the examined networks exhibited a power-law distribution. About 85% of the nodes had BWC values <250, with BWC >1,500 being the largest BWC value observed, as shown in the CDF (Fig. S6). In contrast, the distributions of CLC and DGC values were more characteristic of a normal distribution, with the variance of CLC being relatively smaller than that of DGC. The distribution of EVC values followed a Poisson distribution since each other node had an EVC value that was closer to that of one or two other nodes, except for the most important node in each epidemic year (EVC = 1).

Ranking of nodes considered to be important varied among centrality metrics for epidemic years examined (Tables 3 to 5). Spearman’s rank-based correlation coefficients were highest between BWC and CLC, with correlations ranging from 0.43 to 0.74 (Fig. S7). Correlations between BWC and DGC or EVC were relatively lower across the epidemic years except between BWC and DGC in 2016, where r = 0.46 (Fig. S7). The consistency in the rankings of nodes based on centrality measures was summarized as a heatmap to visualize unique nodes within the networks (Fig. 4). Many nodes overlapped in their rankings among the top 20 important nodes (across all thresholds and centralities) in 2010 (Fig. 4A) and 2014 (Fig. 4C) based on BWC and CLC. However, most nodes overlapped across the four centrality measures in 2011 (Fig. 4B). For example, node 117 in Lewis County, West Virginia, appeared more than 20 times in the top 20 rankings based on BWC and CLC. This same node also appeared more than 10 times in the top 20 ranking of nodes based on DGC and EVC.

Illustration of important nodes for disease spread.

Figure 4: Illustration of important nodes for disease spread.

A heatmap representation of the most important nodes (node IDs x-axis) across 20 thresholds and the four centrality measures (y-axis) for 2010 (A), 2011 (B), and 2014 (C) networks. Frequency represents the number of times a node appeared in the top 20 ranked list across all evaluated thresholds. Most nodes overlapped across the four centrality measures in 2011. For example, node 117 in Lewis County in West Virginia appeared more than 20 times in the top 20 ranks based on BWC and CLC. This node also appeared more than ten times in the top 20 ranks based on DGC and EVC.

Infection frequency and centrality selection of important nodes

Identifying important nodes based on infection frequency and centrality measures of static networks showed some similarities and differences based on the examined centrality metric. The ranking of nodes based on BWC and CLC was generally similar across years, while rankings based on EVC differed from all other centrality measures. Based on BWC, nodes that had a frequency >4 had the highest calculated values (combined frequency × centrality), with the largest value being 0.82 for the node in Sandusky County in Ohio that had an infection frequency of 6 (Fig. 5), while the lowest weight was 0.13 for a node in Charleston County in South Carolina. Based on CLC, the largest weight was 0.98 for a node in Sandusky County in Ohio that had a frequency >6, while the lowest weight was 0.198 for a node in Miami-Dade County in Florida. Similarly, the node in Sandusky County in Ohio had the highest weight of 0.93 based on DGC, followed by nodes in Johnston, Lenoir and New Hanover counties in North Carolina, Wicomico County in Maryland, and Huron and Wayne counties in Ohio that had an infection frequency of 5 (Fig. 5). Node ranking based on EVC was comparably different from the ranking based on all other centrality measures. A node in Johnston County in North Carolina had the highest weight of 0.84, followed by nodes in Wicomico County in Maryland, Sampson and Johnston counties in North Carolina and Wayne County in Ohio (Fig. 5).

Node importance based on frequency of disease occurrence and centrality measures.

Figure 5: Node importance based on frequency of disease occurrence and centrality measures.

A depiction of node importance based on a combination of the frequency of cucurbit downy mildew occurrence in the eastern United States and betweenness, closeness, degree or eigenvector network centrality measures. Frequency represents the number of years a node was observed as an infected node based on epidemic years from 2008 to 2016. Frequency of occurrence and centrality measures are weighted based on a ratio of 4:1. Map Source: ggmap and ggplot.

Dynamic network model of disease spread and predicted probability of node infection

The dynamic network model revealed an emerging and evolving network that differed from the static network representation of disease spread (Fig. 6). Generally, similar temporal and spatial patterns were observed in all other years, although the probabilities between nodes in different states and levels of these probabilities differed between years. In all epidemic years, links between nodes closest to the initial disease outbreak (open square) in southern Florida had the highest probabilities of transmission early in the season (i.e., week 10), while the probability of transmission for links between nodes elsewhere in the network was relatively low (Fig. 6). As epidemics progressed in time and space, link probabilities increased for nodes that were more distant from the initial outbreak in more northern latitudes, although probabilities remained relatively low for isolated nodes (Fig. 6).

Dynamic network of the spread of cucurbit downy mildew.

Figure 6: Dynamic network of the spread of cucurbit downy mildew.

Evolving networks resulting from a dynamic network model for the spread of cucurbit downy mildew in the eastern United States in 2008, 2013, 2014 and 2015. Black circles indicate node centroids of disease outbreak, while the open square is initial source of disease outbreak. Lines are links that have been scaled relative to the probability of transmission by time, with darker and thicker lines indicating higher probabilities of transmission. Map Source: ggmap and ggplot.

The probability of infection increased in time and space, with a generally northward expansion of the epidemic front in all years (Fig. 7). Predicted probability of infection increased most during weeks 20 or later. By week 35, the predicted probability increased for most nodes in the eastern United States, with only a relatively few nodes in Illinois and Michigan having a low infection probability.

Prediction of the temporal spread of cucurbit downy mildew.

Figure 7: Prediction of the temporal spread of cucurbit downy mildew.

Prediction of cucurbit downy mildew outbreaks in the eastern United States in 2014 based on cumulative disease outbreaks observed in previous times steps in the same epidemic year. Dark red nodes represent sites predicted to have an outbreak with a high probability. Blue nodes represent sites predicted to have no outbreak with negligible probability of infection, and all other shades from green to dark red represent increasing probability of disease outbreak. A single node in Texas was reported as infected by week 10 in the observed data; thus the county was considered infected with probability of 1 by week 10. Map Source: ggmap and ggplot.

Errors in dynamic model and impact of removal of important nodes on model errors

Based on all nodes in the network, the mean absolute error for the dynamic model generated across weekly time steps and averaged monthly from January to August was lowest in 2015 with a value of 0.09 and highest in 2011 with a value of 0.33. The mean absolute error for the dynamic model across the entire study for all the nodes was 0.21 (Table 6).

Table 6:
Absolute errors for a network model using data from all sites and removal of sites identified as important based on centrality measures used to study the spatio-temporal spread of cucurbit downy mildew in the eastern United States.
Error after removal of important nodes based on centrality measureb
Yeara All nodes Betweenness Closeness Degree Eigenvector
2008 0.18 0.31 0.22 0.21 0.22
2009 0.27 0.39 0.29 0.28 0.33
2010 0.15 0.23 0.20 0.19 0.20
2011 0.33 0.40 0.35 0.34 0.34
2012 0.27 0.33 0.27 0.27 0.27
2013 0.28 0.45 0.30 0.31 0.31
2014 0.26 0.44 0.36 0.37 0.37
2015 0.09 0.12 0.10 0.10 0.10
2016 0.10 0.17 0.09 0.10 0.10
Mean 0.21 0.32 0.24 0.24 0.25
DOI: 10.7717/peerj.17649/table-6

Notes:

For each year, values are means of absolute model errors generated across monthly time steps from January to August.
The 20 most important nodes identified by each centrality measure were removed in the network and the model rerun to calculate the corresponding absolute errors.

Removal of nodes identified as important based on BWC, CLC, DGC, and EVC increased the mean absolute errors, indicating that these nodes were indeed important in the network structure and prediction accuracy. However, the changes in mean absolute errors after node removal varied depending on the specific centrality measure considered. Removal of nodes identified as important by BWC resulted in the largest mean absolute error, 0.32, a 52.4% error rate relative to the base prediction that included all nodes. In contrast, removing nodes identified as important based on CLC, EVC and DGC led to comparatively small increases in mean absolute error (0.24, 0.24 and 0.25, respectively). Thus, model errors due to the removal of nodes identified as important based on BWC were 3 to 4 times higher than errors resulting from the removal of nodes identified as important based on CLC, DGG, or EVC, indicating BWC was superior in identifying important nodes in this data set (Table 6).

The probability of node infection and epidemic progress in the disease network was also affected by removing nodes identified as central in the network. Relative to a network with all nodes present, removing nodes identified as important based on BWC reduced the probability of infection of non-infected nodes in the subsequent time step in all epidemic years (Fig. 8). For example, removing of nodes in counties in north Florida, Georgia, and South Carolina that were identified as important based on BWC arrested the progression of CDM and infection of nodes in north Florida, South Georgia, and South Carolina in 2009 by week 25 (Fig. 8). We observed a similar pattern of infection probability being meaningfully changed in other years as well when node removal was based on BWC, with the precise change in infection probability varying in specific years. In contrast, removing nodes identified as central based on CLC, DGC or EVC had a comparably minor impact on the probability of node infection and epidemic progress in all years (Fig. 8).

Impact of removing important nodes on disease spread in the network.

Figure 8: Impact of removing important nodes on disease spread in the network.

Prediction of cucurbit downy mildew outbreaks in the eastern United States by week 25 for all nodes present in the network (i.e., prediction) compared to prediction when the 20 most important nodes (based on betweenness, closeness, degree, and eigenvector centrality measures) are removed from the network based on data from epidemics in 2008, 2009, 2013 and 2014. Diamond symbols are nodes identified as important based on each centrality metric. The initial source of disease outbreak is represented by a square symbol. Map Source: ggmap and ggplot.

Discussion

Estimating the probability and timing of outbreaks in specific sites and determining where and when the introduction of inoculum can impact the extent of an epidemic, is one of the challenges in predicting the spread of plant diseases and pests (Meentemeyer et al., 2011; Fitzpatrick et al., 2012). The CDM pathogen can be dispersed over long distances and the disease can spread rapidly under favorable environmental conditions (Ojiambo & Holmes, 2011). In this study, networks were formulated based on historical epidemic records of CDM to establish how connectivity of cucurbit fields influences pathogen dispersal and disease spread in the eastern United States. Multiple low- to high-density static networks were initially generated and analyzed, and networks with biologically-plausible structures and topologies were selected for further analysis. The exponent of the degree distributions for most of the examined networks followed a power-law distribution, indicating that static networks of CDM displayed scale-free properties (Pastor-Satorra & Vespignani, 2001), where most nodes had a small number of links, while a smaller number of nodes had a relatively large number of connections. Scale-free connectivity implies the existence of highly connected nodes (hubs) that are responsible for the rapid spread of disease within the network (Jeger et al., 2007). The transmission probability threshold is low or even absent in scale-free networks (Shirley & Rushton, 2005; Pastor-Satorra & Vespignani, 2001) and this may partly explain the low levels of τ observed in the present study. Disease spread in scale-free networks is rapid and models suggest that control of pathogens spreading in such networks should focus on the highly connected sites (Jeger et al., 2007). Thus, targeted sampling of frequently-infected and highly connected sites that are critical in spreading the disease may benefit disease surveillance.

Sites in Florida, Alabama, North, and South Carolina that were infected more frequently in the past may be candidates for disease surveillance. Acquiring the frequency of infection data is a prerequisite, but constant scouting for the disease is expensive. However, once the historical frequency of infection data is available, additional information about network traits is inexpensive to obtain using mathematical models (Sutrave et al., 2012). Network centrality metrics such as BWC, CLC, DGC and EVC can facilitate the identification of such highly connected nodes (Andersen et al., 2019; Gent, Bhattacharyya & Ruiz, 2019) and aid in evaluating strategies for selecting nodes for surveillance (Sanatkar et al., 2015). Based on a complete static network model, these centrality measures were used to identify highly connected sites for the spread of CDM in the eastern United States. Combining past infection frequency with centrality measures improved the identification of important nodes. For example, DGC, BWC, and CLC produced similar rankings with the infection-based frequency for nodes with an infection frequency greater than four. Although EVC produced a different ranking, nodes with a frequency greater than four still had high weights, thus agreeing with the rankings from the other centrality measures. The combination of frequency-based and DGC was useful in selecting sampling nodes for sentinel sites for soybean rust in the United States (Sutrave et al., 2012). DGC is considered the standard measure in network science and is useful for identifying important nodes in static networks of several pathosystems to inform strategic management (Christley et al., 2005; Gent, Bhattacharyya & Ruiz, 2019; Kiss, Green & Kao, 2006; Xing et al., 2020). Unlike other centrality measures, DGC is easier to calculate and does not require assessing the entire network (Christley et al., 2005). In this study, DGC was ineffective in identifying important nodes compared to BWC. Further, BWC rankings were poorly correlated with those of DGC except for the epidemic data collected in 2016.

Betweenness centrality was more useful in identifying the influential nodes in the network as compared to other commonly used metrics. BWC measures the importance of a node by computing how many times a node of interest is on the shortest paths between any two other nodes. This centrality measure has been used to characterize large networks by way of selected nodes since the seminal work by Granovetter (1973). Nodes with high BWC have been used to determine keystone species in food webs, find clusters and communities, and analyze the robustness of networks by identifying sensitive points of failure (Barabási & Bonabeau, 2003; Girvan & Newman, 2002; Vasas & Jordán, 2006). In epidemiology, nodes with high BWC indicate that they are important in disease spread as they act as bridges or ‘hubs’ to other nodes. Thus, greater disease surveillance efforts and treatment should be directed towards these nodes to decrease the risk of pathogen transmission and disease spread within the network (Marquetoux et al., 2016). The observation that BWC was more informative of node importance than other centrality measures emphasizes the need to generate centrality measures that are specific to the disease of interest (Holme, 2018). Invariably, different centrality measures can result in a dissimilar ranking profiles of important nodes for diverse pathosystems, possibly due to the inherent differences in the underlying mechanisms of pathogen dispersal and disease spread, landscape connectivity, or other factors (Dudkina et al., 2023; Holme, 2018; Singer, Thompson & Bonsall, 2022).

The importance of the highly connected sites in disease spread was further evaluated using a dynamic network model. Mean absolute errors and the probability of infection in nodes across the networks were relatively insensitive to removing of nodes identified as central by CLC, DGC, and EVC. In contrast, mean absolute errors and the probability of infection in simulated epidemics were quite sensitive to removing of nodes identified as central based on BWC. This may be related to the physical location of the nodes identified as highly central by the various centrality measures. Removing nodes identified as important based on CLC, DGC and EVC that were located in Pennsylvania, Ohio, and New York did not affect disease progression northward from southern states, whereas removing important nodes in North Carolina largely prevented disease spread. Nodes with high BWC scores were scattered across the region, including in the southern United States. Removing these nodes, reduced disease spread, and in some epidemic years, it entirely halted disease spread from most southern states. Most of the spread of CDM is over relatively short distances of less than 30 km (Ojiambo & Holmes, 2011) as the host is planted from south to north. Since BWC is based on the number of shortest paths that pass through a target node, a target node will have a high BWC score if it appears in many shortest paths. Given the relative short dispersal distances of P. cubensis, it is plausible that BWC may be better at capturing the dynamics of disease transmission for most of the dispersal events that drive the spread of CDM.

Where resources available for control are limited, targeting nodes with high BWC for treatment has also been found to be an effective strategy in impeding epidemics caused by a disease that spreads rapidly (Singer, Thompson & Bonsall, 2022). The most central nodes identified as important based on BWC were sites in Michigan in the Great Lakes region, Ohio in the Midwest, and Maryland, North Carolina, South Carolina, and Virginia along the mid-Atlantic coast. These states are located along the seasonal transport pathway of P. cubensis spores from overwintering locations from the south (Aylor, 2003). Further, most of these states have the largest acreage of cucurbit production in the United States. Thus, a combination of spore transport and host density may be a reason for the location of the most central nodes in the above states. These sites could thus be reasonable targets for more intensive sampling for surveillance of new disease outbreaks within the region. Potentially, more effective disease management in these highly connected sites, such as the strategic deployment of host resistance, could reduce inoculum production that drives infection in neighboring cucurbit fields in the eastern United States.

Static networks capture connectivity patterns at a single point in time, while dynamic models account for changes in network structure over time, allowing for more accurate predictions of disease spread trajectories. Thus, the simpler static network representations are often deficient when compared with a fully dynamic representation, and should thus be used only with caution in epidemiological modelling (Vernon & Keeling, 2009). Unlike the dynamic model used for the spread of soybean rust in the United States (Sutrave et al., 2012), the dynamic model used in this present study incorporated a power-law dispersal gradient characteristic for the long-distance dispersal of plant pathogens. Based on the 2008 and 2009 epidemic data and point-pattern analysis, the dispersal distances for the CDM pathogen were estimated to be up to 390, 737 and 879 km, with 1,000 km being the maximum possible distance of spatial association (Ojiambo & Holmes, 2011). Further, Ojiambo et al. (2017) showed that the spread parameter b varied in different epidemics, with the final epidemic extent ranging from 4.16 × 108 to 6.44 × 108 km2. Thus, different values of b were used in the construction of static networks and in the dynamic network model to account for the difference in spatial spread in each epidemic year. The dynamic network model used in the present study improves on modeling long-distance dispersal by using a flexible threshold for distance to allow for the connectivity of nodes that are further apart (Ferrari, Preisser & Fitzpatrick, 2014). However, the model does not account for differences in environmental factors that are likely to influence pathogen dispersal. In addition, accounting for differences in host susceptibility at the different locations could further improve our ability to generalize the findings reported here to different cucurbit host types. Subsequent studies are also needed to establish how unknown disease sources can be imputed in this network modeling framework and determine how accounting for these unknown sources could influence the network structure and inference made on the location of highly connected sites for disease surveillance reported in this study. Due to the non-random placement of sentinel sites within the monitoring network, these results may not be generalizable and additional studies may be needed to assess how the random placement of sentinel sites could influence the findings reported in this study.

Supplemental Information

Sensitivity analysis of the effect of weighting in the infection frequency:centrality ratio on the frequency and ranking of nodes for different centrality measures.

DOI: 10.7717/peerj.17649/supp-1

Absolute errors for a network model based on removal of nodes identified as important across all centrality measures for different values of the weight factor, υ, in calculation of total error.

DOI: 10.7717/peerj.17649/supp-2

Correlation between total number of reports and number of counties with disease.

Spearman’s correlation between the total number of disease reports and the number of counties where disease was reported based on combined epidemic data from 2008 to 2016.

DOI: 10.7717/peerj.17649/supp-3

Correlation between number of monitoring plots and counties with disease.

Spearman’s correlation between the number of plots with active surveillance and number of counties where disease was reported based on combined epidemic data from 2008 to 2016.

DOI: 10.7717/peerj.17649/supp-4

Conversion of meteorological wind direction to mathematic wind direction.

Graphical illustration of the conversion of meteorological wind direction to mathematical wind direction: (A) A wind bar graph representing raw observations for the meteorological wind direction, i.e., a northerly wind is 360°, a southerly wind is 180°, a westerly wind is 270°, and an easterly wind is 90°. (B) The mathematical convention for the meteorological wind direction, i.e., a northerly wind is 270°, a south wind is 90°, a westerly wind is 0°, and an easterly wind is 180°. (C) The u and v are components of the wind where |r| is the magnitude (or wind speed). Here, we treat |r| as the observed wind speed in miles per hour. A positive u and a negative u represent a westerly wind and an easterly wind, respectively. A positive v and a negative v represent a southerly wind and a northerly wind, respectively. The figures and notes are courtesy of Virginia Weather & Climate Data lecture.

DOI: 10.7717/peerj.17649/supp-5

Static networks constructed using a range of threshold values.

Static networks of cucurbit downy mildew constructed using a range of thresholds values to generate networks bounded by values that produce a full network and a near-zero probability of link formation for epidemics years for the 2009 epidemic. Map Source: ggmap and ggplot.

DOI: 10.7717/peerj.17649/supp-6

Sensitivity analysis to assess the effect of ratios on ranking of nodes.

A sensitivity analysis of the weighting ratio of combining the frequency of occurrence and betweenness, closeness, degree, or eigenvector centrality measures. The frequency represents the years a node was observed as infected based on epidemic years from 2008 to 2016. A) A 4:1 ensures that the high-frequency nodes have high weights and rank higher. B) A 1:1 ratio does not ensure that all the high-frequency nodes have high weights or rank higher (except for closeness for the top 20). C) A 0.5:0.5 ratio does not ensure that all the high-frequency nodes have high weights or rank higher (except for closeness for the top 20). D) A 1:4 ratio produces new ranks and mixed results. E) A 0.65:0.35 ensures that some high-frequency nodes have high weights and rank higher. However, some high-frequency nodes rank lower. Map Source: ggmap and ggplot.

DOI: 10.7717/peerj.17649/supp-7

Cumulative probability distribution of centrality measures for selected epidemic years.

Cumulative probability distribution of centrality metrics of cucurbit downy mildew networks based on disease reports in 2008 (first row), 2011 (second row), 2014 (third row), and 2016 (fourth row) in the eastern United States. Centrality metrics on the horizontal axis are as follows: BWC = betweenness centrality, CLC = closeness centrality, DGC = degree centrality and EVC = eigenvector centrality.

DOI: 10.7717/peerj.17649/supp-8

Correlation between different measures of network centrality.

Correlation between betweenness centrality (BWC), closeness (CLC), degree (DGC), and eigenvector (EVC) centrality metrics for cucurbit downy mildew networks constructed using disease data recorded in specific epidemic years in the eastern United States.

DOI: 10.7717/peerj.17649/supp-9
1 Citation   Views   Downloads