COVID-19 has caused a worldwide health crisis. The World Health Organization (WHO) announced COVID-19 as a pandemic on March 11, 2020. The clinical manifestations of COVID-19 range from influenza-like symptoms to respiratory failure (i.e., diffuse alveolar injury) and its treatment requires advanced respiratory assistance and artificial ventilation. According to the global case statistics from the Center for Systems Science and Engineering (CSSE) of Johns Hopkins University (JHU) (Wang et al., 2020a) (updated August 30, 2020), 24,824,247 confirmed COVID-19 cases, including 836,615 deaths, have been reported so far with pronounced effect in more than 180 countries. COVID-19 can be detected and screened by Reverse Transcription Polymerase Chain Reaction (RT-PCR). However, the shortage of equipment and the strict requirements on the detection environment limit the rapid and accurate screening of suspected cases. Moreover, the sensitivity of RT-PCR is not high enough, resulting in a large number of false-negatives (Ai et al., 2020), which presents early detection and treatment of patients with presumed COVID-19 (Fang et al., 2020). As an important supplement to RT-PCR, CT scans clearly describe the characteristic lung manifestations related to COVID-19 (Chung et al., 2020), the early Ground Glass Opacity (GGO), and late lung consolidation are shown in Fig. 1. Nevertheless, CT scans also show imaging features that are similar to other types of pneumonia, making it difficult to differentiate them. Moreover, the manual depiction of lung infection is a tedious and time-consuming job, which is often influenced by personal bias and clinical experience.
In recent years, deep learning has been gaining popularity in the field of medical imaging due to it’s intelligent and efficient feature extraction ability (Kong et al., 2019; Ye, Gao & Yin, 2019), and has achieved great success. An earliest classic example is the application of deep learning to children’s chest X-rays to detect and distinguish bacterial and viral pneumonia (Kermany et al., 2018; Rajaraman et al., 2018). Also using deep learning methods have been applied to detect various imaging features of chest CT images (Depeursinge et al., 2015; Anthimopoulos et al., 2016). Recently, researchers proposed to detect COVID-19 infections in patients by radiation imaging combined with deep learning technology. Li et al. (2020) proposed a simple Cov-Net deep learning network in combination with a deep learning algorithm, which was used to distinguish COVID-19 and Community-Acquired Pneumonia (CAP) from chest CT scans. Wang & Wong (2020) proposed Covid-Net to detect COVID-19 cases from chest X-ray images, with an accuracy rate of 93.3%. The infection probability of COVID-19 Xu et al. (2020) was calculated from CT scans by adopting a position-oriented attention model that presented accuracy close to 87%. However, the above models rarely involved the segmentation of COVID-19 infection (Chaganti et al., 2020; Shan et al., 2020). The challenges involved in segmentation include: (a) variations in texture, size, and position of the infected areas in CT scans. For example, some infection areas are small, which easily lead to a high probability of false negatives in CT scans. (b) The boundary of GGO is usually of low contrast and fuzzy in appearance, which makes it difficult to distinguish from the healthy regions during the segmentation process. (c) The noise around the infected area is high, which greatly affects the segmentation accuracy and (d) finally the cost and time consumed in obtaining high-quality pixel-level annotation of lung infection in CT scans is high. Therefore, most of the COVID-19 CT scan datasets are focused on diagnosis, and only a few of them provide segmentation labels. However, with the passage of time, the annotated datasets for the segmentation of COVID-19 pulmonary infection were released but due to a lesser amount of data, the phenomenon of overfitting could cause problems while training thus necessitating the need for more segmentation datasets and better algorithms for accurate results.
Therefore, to address the challenges stated above, we propose a new deep learning network called Attention Gate-Dense Network- Improved Dilation Convolution-UNET (ADID-UNET) for the segmentation of COVID-19 from lung infection CT scans. Experimental results on a publicly available dataset illustrate that the proposed model presents reliable segmentation results that are comparable to the ground truths annotated by experts. Also, in terms of performance, the proposed model surpasses other state-of-the-art segmentation models, both qualitatively and quantitatively.
Our contributions in this paper are as follows:
To address the problem that the gradient disappearance in the deep learning network pose, we employ a dense network (Huang et al., 2017) instead of a traditional convolution and max-pooling operations. The dense network extracts dense features and enhances feature propagation through the model. Moreover, the training parameters of the dense network are less, which reduces the size and the computational cost.
To increase the size of the respective field and to compensate for the problems due to blurry edges, an improved dilation convolution (IDC) module is used to connect the encoder and decoder pipelines. The IDC model increases the receptive field of the predicted region providing more edge information, which enhances the edge recognition ability of the model.
Since the edge contrast of GGO is very low, we use the attention gate (AG) instead of simple cropping and copying. This further improves the accuracy of the model to detect the infection areas by learning the characteristics of the infected regions.
Due to the limited number of COVID-19 segmented datasets with segmentation labels, which is less than the minimum number of samples required for training a complex model, we employ data augmentation techniques and expand the dataset on the basis of the collected public datasets.
The rest of the paper is organized as follows: “Related Work” describes the work related to the proposed model. “Methods” introduces the basic structure of ADID-UNET. Details of the dataset, experimental results and discussion are dealt with in “Experimence Results”. Finally, “Conclusion” presents the conclusion.
ADID-UNET model proposed in this paper is based on UNET (Ronneberger, Fischer & Brox, 2015) architecture and therefore, we will discuss the literature related to our work which includes: deep learning and medical image segmentation, improvement of medical image segmentation algorithms, CT scan segmentation, and application of deep learning in segmentation of COVID-19 lesions from lung CT scans.
Deep learning and medical image segmentation
In recent years, deep learning algorithms have become more mature leading to various artificial intelligence (AI) systems based on deep learning algorithms being developed. Also, semantic segmentation using deep learning algorithms (Oktay et al., 2018) has developed rapidly with applications in both natural and medical images. Long, Shelhamer & Darrell (2015) pioneered the use of a fully connected CNN (FCN) to present rough segmentation outputs that were of the input resolution through fractionally strided convolution process also referred as the upsampling or deconvolution. The model was tested on PASCAL VOC, NYUDv2, and SIFT datasets and, presented a Mean Intersection of Union (M-IOU) of 62.7%, 34%, 39.5%, respectively. They also reported that upsampling, part of the in-network, was fast, accurate, and provided dense segmentation predictions. Later through a series of improvements and extensions to FCN (Ronneberger, Fischer & Brox, 2015; Badrinarayanan, Kendall & Cipolla, 2017; Xu et al., 2018), a symmetrical structure composed of encoder and decoder pipelines, called UNET (Ronneberger, Fischer & Brox, 2015), was proposed for biomedical or medical image segmentation. The encoder structure predicted the segmentation area, and then the decoder recovered the resolution and achieved accurate spatial positioning. Also, the UNET used crop and copy operations for the precise segmentation of the lesions. Further, the model achieved good segmentation performance at the International Symposium on Biomedical Imaging (ISBI) challenge (Cardona et al., 2010) with the M-IOU of 0.9203. Moreover, an improved network referred as the SegNet was proposed by Badrinarayanan, Kendall & Cipolla (2017). The model used the first 13 convolution layers of the VGG16 network (Karen & Andrew, 2014) to form an encoder to extract features and predict segmentation regions. Later by using a combination of convolution layers, unpooling and softmax activation function in the decoder, segmentation outputs of input resolution were obtained. When tested with the CamVid dataset (Brostow, Fauqueur & Cipolla, 2009), the M-IOU index of SegNet was nearly 10% higher than that of FCN (Long, Shelhamer & Darrell, 2015). Xu et al. (2018) regarded segmentation as a classification problem in which each pixel was associated with a class label and designed a CNN network composed of three layers of convolution and pooling, a fully connected layer (FC) and softmax function. The model of successfully segmented three-dimensional breast ultrasound (BUS) image datasets was presented into four parts: skin, fibroglandular tissue, mass, and fatty tissue and achieved a recall rate of 88.9%, an accuracy of 90.1%, precision of 80.3% and F1 score of 0.844. According to the aforementioned literature, FCN (Long, Shelhamer & Darrell, 2015) and their improved variants presented accurate segmentation results for both natural or medical images. Therefore, the UNET and variants (Almajalid et al., 2019; Negi et al., 2020), due to its advantages of fast training and high segmentation accuracy are widely used in the field of medical image segmentation.
Improvement of medical image segmentation algorithms
Medical images such as the ultrasound images are generally prone to speckle noise, uneven intensity distribution, and low contrast between the lesions and the backgrounds which affect the segmentation ability of the traditional UNET (Ronneberger, Fischer & Brox, 2015) structure. Therefore, considerable efforts were invested in improving the architecture. Xia & Kulis (2017) proposed a fully unsupervised deep learning network called W-Net model that connects two UNETs to predict and reconstruct the segmentation results. Schlemper et al. (2019) proposed an attention UNET network, which integrated attention modules into the UNET (Ronneberger, Fischer & Brox, 2015) model to achieve spatial positioning and subsequent segmentation. The model presented a segmentation accuracy of 15% higher than the traditional UNET architecture. Zhuang et al. (2019a) combined the goodness of the attention gate system and the dilation convolution module and proposed a hybrid architecture referred as the RDA-UNET. By introducing residual network (He et al., 2016) instead of traditional convolution layers they reported a segmentation accuracy of 97.91% towards the extraction of lesions in breast ultrasound images. Also, the GRA-UNET (Zhuang et al., 2019b) model included a group convolution module in-between the encoder and decoder pipelines to improve the segmentation of the nipple region in breast ultrasound images. Therefore, from the literature, it can be inferred that introducing additional modules like attention gate instead of traditional cropping and copying, inclusion of dilation convolution to increase the receptive fields and use of residual networks can favorably improve the accuracy of the segmentation model. However, these successful segmentation models (Schlemper et al., 2019; Zhuang et al., 2019a; Xia & Kulis, 2017) were rarely tested with CT scans, hence the next section concentrates on the segmentation of CT scans.
CT scan segmentation
CT imaging is a commonly used technology in the diagnosis of lung diseases since lesions can be segmented more intuitively from the chest CT scans. The segmented lesion aid the specialist in the diagnosis and quantification of the lung diseases (Gordaliza et al., 2018). In recent years, most of the classifier models and algorithms based on feature extraction have achieved good segmentation results in chest CT scans. Ye et al. (2009) proposed a shape-based Computer-Aided Detection (CAD) method where a 3D adaptive fuzzy threshold segmentation method combined with chain code was used to estimate infected regions in lung CT scans. In feature-based techniques, due to the low contrast between nodules and backgrounds, the boundary discrimination is unclear leading to inaccurate segmentation results. Therefore, many segmentation techniques based on deep learning algorithms have been proposed. Wang et al. (2017) developed a central focusing convolutional neural network for segmenting pulmonary nodules from heterogeneous CT scans. Jue et al. (2018) designed two deep networks (an incremental and dense multiple resolution residually connected network) to segment lung tumors from CT scans by adding multiple residual flows with different resolutions. Guofeng et al. (2018) proposed a UNET model to segment pulmonary nodules in CT scans which improved the overall segmentation output through the avoidance of overfitting. Compared with other segmentation algorithms such as graph-cut (Ye, Beddoe & Slabaugh, 2009), their model had better segmentation results with a Dice coefficient of 0.73. Recently, Peng et al. (2020) proposed an automatic CT lung boundary segmentation method, called Pixel-based Two-Scan Connected Component Labeling-Convex Hull-Closed Principal Curve method (PSCCL-CH-CPC). The model included the following: (a) the image preprocessing step to extract the coarse lung contour and (b) coarse to finer segmentation algorithm based on the improved principal curve and machine learning model. The model presented good segmentation results with Dice coefficient as high as 96.9%. Agarwal et al. (2020) proposed a weakly supervised lesion segmentation method for CT scans based on an attention-based co-segmentation model (Mukherjee, Lall & Lattupally, 2018). The encoder structure composed of a variety of CNN architectures that includes VGG-16 (Karen & Andrew, 2014), Res-Net101 (He et al., 2016), and an attention gate module between the encoder-decoder pipeline, while decoder composed of upsampling operation. The proposed method first generated the initial lesion areas from the Response Evaluation Criteria in Solid Tumors (RECIST) measurements and then used co-segmentation to learn more discriminative features and refine the initial areas. The paper reported a Dice coefficient of 89.8%. The above literatures suggest that deep learning techniques are effective in segmenting lesions in lung CT scans and many researchers have proposed different deep learning architectures to deal with COVID-19 CT scans. Therefore, in the next section we will further study their related works.
Application of deep learning in segmentation of COVID-19 lesions from lung CT scans
In recent months, COVID-19 has become a hot topic of concern all over the world and CT imaging is considered to be a convincing method to detect COVID-19. However, due to the limited datasets and the time and labor involved in annotations, segmentation datasets related to COVID-19 CT scans are less readily available. But, many researchers have still proposed advanced methods to deal with COVID-19 diagnosis, which also includes segmentation techniques (Fan et al., 2020; Wang et al., 2020b; Yan et al., 2020; Zhou, Canu & Ruan, 2020; Elharrouss et al., 2020; Chen, Yao & Zhang, 2020). On the premise of insufficient datasets with segmentation labels, the Inf-Net network proposed by Fan et al. (2020), combined a semi-supervised learning model and FCN8s network (Long, Shelhamer & Darrell, 2015) with implicit reverse attention and explicit edge attention mechanism to improve the recognition rate of infected areas. The model successfully segmented COVID-19 infected areas from CT scans and reported a sensitivity and accuracy of 72.5% and 96.0%, respectively. Elharrouss et al. (2020) proposed an encoder-decoder-based CNN method for COVID-19 lung infection segmentation based on a multi-task deep-learning based method, which overcame the shortage of labeled datasets, and segmented lung infected regions with a high sensitivity of 71.1%. Wang et al. (2020b) proposed a noise-robust COVID-19 pneumonia lesions segmentation network which included a noise-robust dice loss function along with convolution function, residual network, and Atrous Spatial Pyramid Pooling (ASPP) module. The model was referred as Cople-Net presented automatic segmentation of COVID-19 pneumonia lesions from CT scans. The method proved that the proposed new loss function was better than the existing noise-robust loss functions such as Mean absolute error (MAE) loss (Ghosh, Kumar & Sastry, 2017) and Generalized Cross-Entropy (GCE) loss (Zhang & Sabuncu, 2018) and achieved a Dice coefficient and Relative Volume Error (RVE) of 80.72% and 15.96%, respectively. Yan et al. (2020) employed an encoder-decoder deep CNN structure composed of convolution function, Feature Variation (FV) module (mainly contains convolution, pooling, and sigmoid function), Progressive Atrous Spatial Pyramid Pool (PASPP) module (including convolution, dilation convolution, and addition operation) and softmax function. The convolution function obtained features, FV block enhanced the feature representation ability and the PASPP was used between encoder and decoder pipelines compensated for the various morphologies of the infected regions. The model achieved a good segmentation performance with a Dice coefficient of 0.726 and a sensitivity of 0.751 when tested on the COVID-19 lung CT scan datasets. Zhou, Canu & Ruan (2020) proposed an encoder-decoder structure based UNET model for the segmentation of the COVID-19 lung CT scan. The encoder structure was used to extract features and predict rough lesion areas which composed convolution function and Res-dil block (combines residual block (He et al., 2016) and dilation convolution module). The decoder pipeline was used to restore the resolution of the segmented regions through the upsampling and the attention mechanism between the encoder-decoder framework to capture rich contextual relationships for better feature learning. The proposed method can achieve an accurate and rapid segmentation on COVID-19 lung CT scans with a Dice coefficient, sensitivity, and specificity of 69.1%, 81.1%, and 97.2%, respectively. Further, Chen, Yao & Zhang (2020) proposed a residual attention UNET for automated multi-class segmentation of COVID-19 lung CT scans, which used residual blocks to replace traditional convolutions and upsampling functions to learn robust features. Again, a soft attention mechanism was applied to improve the feature learning capability of the model to segment infected regions of COVID-19. The proposed model demonstrates a good performance with a segmentation accuracy of 0.89 for lesions in COVID-19 lung CT scans. Therefore, the deep learning algorithms are helpful in segmenting the infected regions from COVID-19 lung CT scans which aid the clinicians to evaluate the severity of infection (Tang et al., 2020), large-scale screening of COVID-19 cases (Shi et al., 2020) and quantification of the lung infection (Ye et al., 2020). Table 1 summarizes the deep learning-based segmentation techniques available for COVID-19 lung infections.
|Literature||Data Type||Dataset||Technique||Segmentation results|
|Fan et al. (2020)||CT Scan||100 CT images||Semi supervised CNN||73.9% (DC)|
|FCN8s network||96.0% (Sp)|
|Wang et al. (2020b)||CT Scan||558 CT images||Residual connection||80.7% (DC)|
|Yan et al. (2020)||CT Scan||21,658 CT images||Deep CNN||72.6% (DC)|
|Zhou, Canu & Ruan (2020)||CT Scan||100 CT images||Attention mechanism||69.1% (DC)|
|Res-Net, dilation convolution||81.1% (Sen)|
|Elharrouss et al. (2020)||CT Scan||100 CT images||Encoder-decoder-based CNN||78.6% (Dice)|
|Chen, Yao & Zhang (2020)||CT Scan||110 CT images||Encoder-decoder-based CNN||83.0% (DC)|
|Xu et al. (2020)||CT Scan||110 CT images||CNN||86.7% (ACC)|
|Shuai et al. (2020)||CT Scan||670 CT images||CNN||73.1% (ACC)|
In this section, we first introduce the proposed ADID-UNET network with detailed discussion on the core network components including dense network, improved dilation convolution, and attention gate system. To present realistic comparisons, experimental results are presented at each subsection to illustrate the performance and superiority of the model after adding core components. Further in “Experimence Results” we have presented a summary of the % improvements achieved when compared to the traditional UNET architecture.
ADID-UNET is based on UNET (Ronneberger, Fischer & Brox, 2015) architecture with the following improvements: (a) The dense network proposed by Huang et al. (2017) is used in addition to the convolution modules of encoder and decoder structures, (b) an improved dilation convolution (IDC) is introduced between the frameworks, and (c) the attention gate (AG) system is used instead of the simple cropping and copying operations. The structure of ADID-UNET is shown in Fig. 2. Here fen, fupn, fidc describe the features at the n-th layer of the encoder, decoder, and IDC modules, respectively.
When COVID-19 CT scans are presented to the encoder, the first four layers (each layer has convolutions, rectification, and max pooling functions) extract features (f1–f4) that are passed to dense networks. Here dense networks are used instead of convolution and max-pooling layers to further enhance the features (f5–f6) and in “Dense Network”, we elaborate the need for the dense network and present experimental results to prove its significance. Next, an improved dilation convolution module referred as the IDC model, is used between the encoder–decoder structure to increase the receptive field and gather detailed edge information that assists in extracting the characteristic. The module accepts the feature f6 from the dense networks and after improvement, present fidc them as inputs to the decoder structure. To ensure consistency in the architecture and to avoid losing information, the decoder mirrors the encoder with two dense networks that replace the first two upsampling operations. Further for the better use of the context information between the encoder-decoder pipeline, the AG model is used instead of cropping and copying operations, which aggregates the corresponding layer-wise encoder features with the decoder and presents it to the subsequent upsampling layers. Likewise, the decoder framework presents upsampled features fup1 to fup6 and final feature map (fup6) is presented to the sigmoid activation function to predict and segment the COVID-19 lung infected regions. The following section explains the components of ADID-UNET in detail.
It was presumed that with the increase of network layers, the learning ability of the network will gradually improve, but during the training, for deep networks, the gradient information that is helpful for the generalization may disappear or expand excessively. In literature, the problem is referred as vanishing or explosion of the gradient. As the network begins to converge, due to the disappearance of the gradient the network saturates, resulting in a sharp decline in network performance. Therefore, Zhuang et al. (2019a) introduced residual units proposed by He et al. (2016) into UNET structure to avoid performance degradation during training. The residual learning correction scheme to avoid performance degradation is described in (1): (1)
Here x and y are the input and output vectors of the residual block, Fi is the weight of the corresponding layer. The function is a residue when added to x, avoids vanishing gradient problems, and enables efficient learning.
From (1) the summation of and x in Res-Net (He et al., 2016) avoids the vanishing gradient problems but forwarding the gradient information alone to the proceeding layers may hinder the information flow in the network and the recent work by Huang et al. (2016) illustrated that of Res-Nets discard features randomly during training. Moreover, Res-Nets include large number of parameters, which increases the training time. To solve this problem, Huang et al. (2017) proposed a dense network (as shown in Fig. 3), which directly connects all layers, and thus skillfully obtains all features of the previous layer without convolution.
The dense network is mainly composed of convolution layers, pooling function, multiple dense blocks, and transition layers. Let us consider a network with L layers, and each layer implements a nonlinear transformation Hi. Let x0 represent the input image, i represents layer i, xi−1 is the output of layer i − 1. Hi can be a composite operation, such as batch normalization (BN), rectified linear function (RELU), pooling, or convolution functions. Generally, the output of traditional network in layer i is as follows: (2)
For the residual network, only the identity function from the upper layer is added: (3)
For a dense network, the feature mapping x0, x1,…, xi−1 of all layers before layer i is directly connected, which is represented by Eq. (4): (4)
where denotes the cascade of characteristic graphs and × represents the multiplication operation. Figure 4 shows the forward connection mechanism of the dense network where the output of layers is connected directly to all previous layers.
Generally, a dense network is composed of several dense blocks and transition layers. Here we only use two dense blocks and transition layers to form simple dense networks. Using Eq. (5) to express the dense block: (5) where denotes the cascade of characteristic graphs, βi is the weight of the corresponding layer. In the ADID-UNET model proposed in this paper, the feature f4 (refer to Fig. 2) is fed to the transition layer, which is mainly composed of BN, RELU, and average pooling operation. Later the feature is batch standardized and rectified before convolving with a 1 × 1 kernel function. Again, the filtered outputs go through the same operation and are convoluted with 3 × 3 kernel, before concatenating with the input feature f4. The detailed structure of the two dense blocks and transition layers used in the encoder structure is shown in Fig. 5A. Here w, h correspond to the width and height of the input, respectively, and b represents the number of channels. Besides, s represents the step size of the pooling operation, n represents the number of filtering operations performed by each layer. In our model, n takes values 32, 64, 128, 256, and 512. It should be noted that the output of the first dense layer is the aggregated result of 4 convolution operations (4 × n), which is employed to emphasize the features learning by reducing the loss of features. In the decoding structure, to restore the resolution of the predicted segmentation, a traditional upsampling layer of the UNET (Ronneberger, Fischer & Brox, 2015) is used instead of the transition layer. The detailed structure is shown in Fig. 5B.
For the proposed network, we use only two dense networks mainly (a) to reduce the computation costs and (b) experiments with different layers of dense networks suggest that the use of two dense networks was sufficient since the segmentation results were accurate and comparable to the ground truth. Figure 6 and Table 2 illustrate the qualitative and quantitative comparisons with different numbers of dense network in the encoder-decoder framework.
|Number of dense network||ACC||DC||Sen||Sp||Pc||AUC||F1||Sm||Eα||MAE|
From the analysis of results in Fig. 6 and Table 2, it is found that the effect of using two dense networks in the model is obvious and can present accurate segments of the infected areas that can be inferred directly from the qualitative and quantitative metrics.
Moreover, with high accuracy and a good Dice coefficient, the choice of two dense networks is the best choice in the encoder decoder pipeline. Also, using two dense networks in place of traditional convolutions or residual networks enable global feature propagation, encourage feature reuse, and also solve the gradient disappearance problems associated with deep networks thereby significantly improving the segmentation outcomes.
Improved dilation convolution
Since the encoder pipeline of the UNET structure is analogous to the traditional CNN architecture, the pooling operations involved at each layer propagate either the maximum or the average characteristics of the extracted features, hence connecting the encoder outputs directly to decoder, thus limiting the segmentation accuracy of the network. The RDA-UNET proposed by Zhuang et al. (2019a) utilized a dilation convolution (DC) module between the encoder-decoder pipeline to increase the receptive field and further learn the boundary information accurately. Also, the DC module is often used in many variant UNETs (Chen et al., 2019; Yu & Koltun, 2015) to improve the receptive field, hence, we use the DC module and introduce additional novelty in the DC module.
Equation (6) describes the DC operation between the input image and the kernel .(6)
where α is the RELU function, k is a bias unit and denote the coordinates of the kernel and those of the input images respectively, and r is the dilation rate that controls the size of receptive fields. The size of the receptive field obtained can be expressed as follows: (7) where k_fsize is the convolution kernel size, r is the convolution rate of the dilation and N is the size of the receptive field. As shown in Fig. 7.
Based on our experimental analysis we understand that DC module has a pronounced effect in extracting information for larger objects or lesions and considering that most of the early ground-glass opacity (GGO) or late lung consolidation lesions have smaller areas, we present an improved dilation convolution (IDC) module between the encoder–decoder framework to accurately segment smaller regions.
Figure 8 illustrates the IDC module that consists of several convolution functions with different dilation rates and rectified linear functions (RELU). Our improvements are as follows: (a) combining single strided convolution operations and dilated convolutions with dilation rate such as 2, 4, 8, and 16, respectively. The above combination helps in the extraction of features from both smaller and larger receptive fields thus assisting in the isolation of the small infected COVID-19 regions seen in lung CT scans and (b) referring to the idea of the dense network (Huang et al., 2017), we concatenate the input of the IDC module to its output and use the information of input features to further enhance feature learning. The input of IDC module is the rough segmentation regions obtained by encoder structure. The combination of the original segmentation region features and the accurate features extracted by IDC module not only avoids the loss of useful information, but also provides accurate input for the decoding pipeline, which is conducive to improve the segmentation accuracy of the model. As the inputs advance (left to right in Fig. 8), they get convolved with a 3 × 3 kernel of convolution layers and the dilation rate of IDC is 2, 4, 8, and 16, respectively. From the comparative experiments with the traditional DC model (the dilation rate is the same for both the models), we find that the computational cost and computation time required for the IDC module is less than that of the DC module, as shown in Table 3.
|Method||Total parameters||Trainable parameters||Non-trainable parameters||Train time epoch/(s)||Test time (s)|
From Fig. 9 and Table 4, it is found that the use of layers with convolution and smaller dilation rates at the end along with others ensures the cumulative extraction of features from both smaller and larger receptive fields thus assisting in the isolation of the small infected COVID-19 regions seen in lung CT scans. Also, the performance scores specifically the Dice coefficient is higher (about 3%) for DID-UNET compared to DD-UNET. In summary, the IDC model connected between the encoder–decoder structure, reduces loss of the original features but additionally expands the field of the segmented areas thereby improving the overall segmentation effect.
Although the improved dilation convolution improves the feature learning ability of the network, due to the loss of spatial information in the feature mapping at the end of the encoder structure, the network has difficulties in reducing false prediction for (a) small COVID-19 infected regions and (b) areas with blurry edges with poor contrast between the lesion and background. To solve this problem, we introduce the attention gate (AG) model shown in Fig. 10 mechanism into our model instead of simple cropping and copying. AG model computes the attention coefficient , based on Eq. (8): (8) (9) where n and m represent the feature mapping of the AG module input from the decoder and encoder pipelines, respectively. And pm, pn, pi, pk are the convolution kernels of size 1 × 1. bm,n, bint, bk represent the offset unit. ε1 and ε2 denote the RELU and sigmoid activation function respectively. Here ε2 limits the range between 0 and 1.
Finally, the attention coefficient σ is multiplied by the input feature map fi to present the output go as shown in Eq. (10): (10)
From Fig. 11 and Table 5, results showed that the inclusion AG module improved the performance of the network (ADID-UNET), with segmentation accuracy of almost 97%. Therefore, by introducing the AG model, the network makes full use of the output feature information of encoder and decoder, which greatly reduces the probability of false prediction of small targets, and effectively improves the sensitivity and accuracy of the model.
COVID-19 segmentation dataset collection and processing
Organizing a COVID-19 segmentation dataset is time-consuming and hence there are not many CT scan segmentation datasets. At present, there was only one standard dataset namely the COVID-19 segmentation dataset (MedSeg, 2020), which was composed of 100 axial CT scans from different COVID-19 patients. All CT scans were segmented by radiologists associated with the Italian Association of medicine and interventional radiology. Since the database was updated regularly, on April 13, 2020, another segmented CT scans dataset with segment labels from Radiopaedia was added. The whole datasets that contained both positive and negative slices (373 out of the total of 829 slices have been evaluated by a radiologist as positive and segmented), were selected for training and testing the proposed model.
The dataset consists of 1,838 images with annotated ground truth was randomly divided into 1,318 training samples, 320 validation samples, and 200 test samples. Since the number of training images is less, we expand the training dataset where we first merge the COVID-19 lung CT scans with the ground scene and then perform six affine transformations as mentioned in Krizhevsky, Sutskever & Hinton (2012). Later the transformed image is separated from the new background truth value and added to the training dataset as additional training images. Therefore, the 1,318 images of the training dataset are expanded, and 9,226 images are obtained for training. Figure 12 illustrates the data expansion process.
Segmentation evaluation index
The commonly used evaluation indicators for segmentation such as accuracy (ACC), precision (Pc), Dice coefficient (DC), the area under the curve (AUC), sensitivity (Sen), specificity (Sp) and F1 score (F1) were used to evaluate the performance of the model. These performance indicators are calculated as follows:
(1) For computing accuracy, precision, sensitivity, specificity, and F1 score we generate the confusion matrix where the definitions of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) are shown in Table 6.
|Category||Actual lesion||Actual non-lesion|
|Predicted Lesion||True Position (TP)||False Position (FP)|
|Predicted Non-Lesion||False Negative (FN)||True Negative (TN)|
(1) Accuracy (ACC): A ratio of the number of correctly predicted pixels to the total number of pixels in the image.(11)
(2) Precision (Pc): A ratio of the number of correctly predicted lesion pixels to the total number of predicted lesion pixels.(12)
(3) Sensitivity (Sen): A ratio of the number of correctly predicted lesion pixels to the total number of actual lesion pixels.(13)
(4) F1 score (F1): A measure of balanced accuracy obtained from a combination of precision and sensitivity results.(14)
(5) Specificity (Sp): A ratio of the number of correctly predicted non-lesion pixels to the total number of actual non-lesion pixels.(15)
(6) Dice coefficient (DC): Represents the similarity between the model segment output (Y) and the ground truth (X). The higher the similarity between the lesion and the ground truth, the larger the Dice coefficient and the better the segmentation effect. Dice coefficient is calculated as follows: (16)
Also, we use a Dice coefficient (Dice, 1945) loss (dice_loss) as the training loss of the model, the calculation is as follows: (17)
(7) The area under the curve (AUC): AUC is the area under the receiver operating characteristic (ROC) curve. It represents the degree or the measure of separability and indicates the capability of the model in distinguishing the classes. Higher the AUC better is the segmentation output and hence the model.
In addition to the above widely used indicators, we also introduce the Structural metric (Sm) (Fan et al., 2017), Enhanced alignment metric (Eα) (Fan et al., 2018) and Mean Absolute Error (MAE) (Fan et al., 2020; Elharrouss et al., 2020) to measure the segmentation similarity with respect to the ground truth.
(8) Structural metric (Sm): Measures the structural similarity between the prediction map and ground truth segmented mask, it is more in line with the human visual system than Dice coefficient.(18)
where Sos stands for target perception similarity, Sor stands for regional perceptual similarity, β = 0.5 is a balance factor between Sos and Sor. And Sop stands for the final prediction result and Sgt represents the ground truth.
(9) Enhance alignment metric (Eα): Evaluates the local and global similarity between two binary maps computed based on Eq. (19): (19) where w and h are the width and height of ground truth Sgt, (i,j) denotes the coordinates of each pixel in Sgt. α represents the enhanced alignment matrix: (20)
(10) Mean Absolute Error (MAE): Measures the pixel-wise difference between Sop and Sgt, defined as: (21)
The ADID-UNET proposed in this paper is implemented in Keras framework and is trained and tested by using the workstation with NVIDIA GPU P5000. During the training process, we set the learning rate as , and Adam optimizer was selected as the optimization technique. The 9,226 training samples, 320 verification samples, and 200 test samples were resized to 128 × 128 and trained with a batch size of 32 for 300 epochs. Figures 13 and 14 shows the performance curves obtained for the proposed ADID-UNET during training, validation, and testing.
Segmentation results and discussion
To show the performance of the ADID-UNET model, we used 200 pairs of COVID-19 lung infection CT scans as test data, and the segmentation results are shown in Fig. 15. From the analysis of Fig. 15, it was found that the ADID-UNET model can accurately segment the COVID-19 lung infection areas from the CT scans, especially the smaller infected areas, and the segmentation result is very close to the ground truth. This illustrates the effectiveness of the proposed method for the segmentation of COVID-19 lung infection regions from CT scans. Moreover, we can also see that ADID-UNET can accurately segment the complicated infection areas (single COVID-19 lung infection areas and more complex uneven distribution infection areas) in CT scans, which further proves the power of the model proposed in this paper. In a word, the ADID-UNET model proposed in this paper can effectively and accurately segment COVID-19 lung infection areas with different sizes and uneven distribution, and the visual effect of segmentation is very close to the gold standard.
Further, we also compare the proposed model with other state-of-art segmentation models. From the results (Figs. A1 and A2 and Table 7), we can infer that the ADID-UNET model presents segmentation outputs closer to the ground truth. In contrast, the FCN8s network (Long, Shelhamer & Darrell, 2015) presents more under and over segmented regions. Further RAD-UNET (Zhuang et al., 2019a) presents comparable segmentation results but its effect is less pronounced for smaller segments. Analyzing the segmentation visual results from Figs. A1 and A2, we can clearly find that the ADID-UNET model proposed in this paper can accurately segment the COVID-19 lung infection regions than other state-of-the-art model with results close to the ground truth, which proves the efficacy of the proposed ADID-UNET model.
|Fan et al. (2020)||- - -||0.7390||0.7250||0.9600||- - -||- - -||- - -||0.8000||0.8940||0.0640|
|Elharrouss et al. (2020)||- - -||0.7860||0.7110||0.9930||0.8560||- - -||0.7940||- - -||- - -||0.0760|
|Yan et al. (2020)||- - -||0.7260||0.7510||- - -||0.7260||- - -||- - -||- - -||- - -||- - -|
|Zhou, Canu & Ruan (2020)||- - -||0.6910||0.8110||0.9720||- - -||- - -||- - -||- - -||- - -||- - -|
|Chen, Yao & Zhang (2020)||0.8900||- - -||- - -||0.9930||0.9500||- - -||- - -||- - -||- - -||- - -|
Table 7, presents the performance scores for various indicators mentioned in “Experimence Results”. Here, for ADID-UNET the scores such as the Dice coefficient, precision, F1 score, specificity and AUC are 80.31%, 84.76%, 82.00%, 99.66% and 95.51%, respectively. Further, most of the performance indexes are above 0.8 with the highest segmentation accuracy of 97.01%. The above results clearly indicates that the proposed model presents segmentation outputs closer to ground truth annotations.
The proposed model presents an improved version of the UNET model obtained by the inclusion of modules such as the dense network, IDC and the attention gates to the existing UNET (Ronneberger, Fischer & Brox, 2015) structure. The effectiveness of these additions were experimentally verified in “Methods”. Further, to summarize the effectiveness of the addition of each module to the UNET architecture, Table 8 tabulates the improvement at each stage of the addition. From Table 8, it is found that adding additional components to the UNET (Ronneberger, Fischer & Brox, 2015) structure can obviously improve the overall segmentation accuracy of the network. For example, with the inclusion of the dense networks (D-UNET), the metrics such as Dice coefficien (DC) and AUC reached 79.98% and 93.47%, respectively.
|Improvement of D-UNET||↑0.04%||↑0.13%||↑0.44%||↑0.09%||↑3.49%||↑1.45%||↑0.30%||↑1.28%||↑0.04%||↓0.05%|
|Improvement of DID-UNET||↑0.04%||↑0.25%||↓0.65%||↑0.07%||↑1.78%||↑2.02%||↑0.87%||↑0.47%||↓0.16%||↓0.04%|
|Improvement of ADID-UNET||↑0.05%||↑0.33%||↓0.79%||↑0.09%||↑2.29%||↑2.04%||↑0.46%||↑1.09%||↑0.59%||↓0.06%|
Further, the inclusion of the IDC improved the scores further (DID-UNET). Finally, the proposed model with dense network, IDC and the AG modules (namely ADID-UNET) presented the best performance scores and provided an improvement of 0.05%, 0.33%, 2.29%, 2.04% and 1.09% for metrics such as accuracy, DC, precision, AUC and structural metric respectively when compared to traditional UNET architecture.
Furthermore, from Figs. A1 and A2, it is obvious that ADID-UNET performs better than other well-known segmentation models in terms of visualization. Specifically, ADID-UNET can segment relatively smaller infected regions which is of great significance for clinical accurate diagnosis of COVID-19 infection location. The use of (a) dense networks instead of traditional convolution and max-pooling function, (b) inclusion of improved dilation convolution module between the encoder-decoder pipeline and (c) the presence of attention gate network in the skip connections have presented accurate segmentation outputs for various types of COVID-19 infections (GGO and pulmonary consolidation). However, ADID-UNET still has room for improvement in terms of Dice coefficient and sensitivity and also computational costs which can be researched in future.
The paper proposes a new variant of UNET (Ronneberger, Fischer & Brox, 2015) architecture to accurately segment the COVID-19 lung infections in CT scans. The model, ADID-UNET includes dense networks, improved dilation convolution, and attention gate, which has strong feature extraction and segment capabilities. The experimental results show that ADID-UNET is effective in segmenting small infection regions, with performance metrics such as accuracy, precision and F1 score of 97.01%, 84.76%, and 82.00%, respectively. The segmentation results of the ADID-UNET network can aid the clinicians in faster screening, quantification of the lesion areas and provide an overall improvement in the diagnosis of COVID-19 lung infection.
We describe the abbreviations of this paper in detail, as shown in Table A1.
|D-UNET||Inclusion of Dense networks to the UNET structure|
|AG-UNET||Inclusion of Attention gate module to the UNET structure|
|DA-UNET||Inclusion of both dense networks and attention gate module to the UNET structure|
|IDA-UNET||Inclusion of Improved dilation convolution and Attention Gate module to the UNET structure|
|DID-UNET||Inclusion of dense networks and improved dilation convolution to the UNET structure|
|ADID-UNET||Inclusion of dense networks, Improved dilation convolution and Attention Gate modules to the UNET structure|