Sensitivity of deep learning applied to spatial image steganalysis

In recent years, the traditional approach to spatial image steganalysis has shifted to deep learning (DL) techniques, which have improved the detection accuracy while combining feature extraction and classification in a single model, usually a convolutional neural network (CNN). The main contribution from researchers in this area is new architectures that further improve detection accuracy. Nevertheless, the preprocessing and partition of the database influence the overall performance of the CNN. This paper presents the results achieved by novel steganalysis networks (Xu-Net, Ye-Net, Yedroudj-Net, SR-Net, Zhu-Net, and GBRAS-Net) using different combinations of image and filter normalization ranges, various database splits, and different activation functions for the preprocessing stage, as well as an analysis of the activation maps and of how to report accuracy. These results demonstrate how sensitive steganalysis systems are to changes in any stage of the process, and how important it is for researchers in this field to record and report their work thoroughly. We also propose a set of recommendations for the design of experiments in steganalysis with DL.


INTRODUCTION
In the context of cryptography and information hiding, steganography refers to hiding messages in digital multimedia files (Hassaballah, 2020), and steganalysis consists of detecting whether a file has a hidden message or not (Reinel, Raúl & Gustavo, 2019; Tabares-Soto et al., 2020; Chaumont, 2020). In digital image steganography, a message can be hidden by changing the value of some pixels in the image (spatial domain, see Fig. 1) (Hameed et al., 2019) or by modifying the coefficients of a frequency transform (frequency domain); in both cases the change remains invisible to the human eye. Some of the steganographic algorithms in the spatial domain are HUGO (Pevny, Filler & Bas, 2010), WOW (Holub & Fridrich, 2012), S-UNIWARD (Holub, Fridrich & Denemark, 2014), HILL (Li et al., 2014), and MiPOD (Sedighi et al., 2016).
Pevny, 2011), and by including operations such as cropping, resizing, rotation, and interpolation. In the same year, a CNN able to detect steganographic images in the spatial and frequency domains was proposed by Boroumand, Chen & Fridrich (2019); the main feature of this architecture is the use of residual connections.
Following the most relevant proposals in the steganalysis field, the CNN presented by Zhang et al. (2019) introduced separable convolutions and multi-level average pooling known as Spatial Pyramid Pooling (SPP) (He et al., 2014), which allows the network to process arbitrarily sized images. Tan et al. (2020) sought to decrease the computational cost, storage overheads, and difficulties in training and deployment. The resulting model (i.e., CALPA-Net) improved adaptivity, transferability, and scalability. Furthermore, Wang et al. (2020) proposed a CNN that uses detection mechanisms and joint domains. The authors applied SRM filters and the Discrete Cosine Transform Residual (DCTR) patterns to transform steganographic impacts.
Currently, the GBRAS-Net architecture, presented by Reinel et al. (2021), achieves the highest detection percentages of steganographic images in the spatial domain. In the preprocessing stage, this network keeps the 30 SRM filters and uses a modified TanH activation function. For feature extraction, this CNN uses skip connections together with separable and depthwise convolutions with the ELU activation function. For the classification stage, the CNN applies a softmax directly after global average pooling, removing the fully connected layers. Table 1 shows the performance of the CNN architectures. These results correspond to the most relevant architectures for classifying S-UNIWARD and WOW steganographic images. The payloads used are 0.2 and 0.4 bits per pixel (bpp).
In general, a sensitivity analysis refers to the assessment of how the output of a system, or in this case the performance of a model, is influenced by its inputs (Razavi et al., 2021): not only the training data, but also model hyper-parameters, preprocessing operations, and design choices. Besides assuring the quality of a model (Saltelli et al., 2019), sensitivity analysis can provide an important tool for reporting reproducible results, by explaining the conditions under which those results were achieved (Razavi et al., 2021). In its simplest form, it consists of varying each of the inputs around its possible values and evaluating the results achieved. Given the accelerated growth of DL techniques for steganalysis, it is essential to measure how factors such as image and filter normalization, database partition, and activation function affect the development and performance of algorithms for steganographic image detection. This research was motivated by the lack of detailed documentation of the experimental set-up, the difficulty of reproducing the CNNs, and the variability of reported results. This paper describes the results of a thorough experimentation process in which different CNN architectures were tested under different scenarios to determine how the training conditions affect the results. Similarly, this paper presents an analysis of how researchers can select the results to report, aiming to deliver reproducible and consistent results. These issues are essential to assess the sensitivity of DL algorithms to different training settings and will ultimately contribute to a further understanding of the problems applied to steganalysis and how to approach them.
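The simplest form of sensitivity analysis mentioned above, varying one input at a time around a baseline, can be sketched as follows. This is a minimal illustration and not the experimental code used in this paper: `toy_evaluate` is a hypothetical stand-in for training a CNN and returning its test accuracy, and the factor names and scores are arbitrary placeholders.

```python
from typing import Callable, Dict, List, Tuple


def one_at_a_time(
    evaluate: Callable[[Dict[str, object]], float],
    baseline: Dict[str, object],
    factors: Dict[str, List[object]],
) -> List[Tuple[str, object, float]]:
    """Vary each factor alone around the baseline and record the outcome."""
    results = []
    for name, values in factors.items():
        for value in values:
            config = dict(baseline)   # copy so the baseline stays untouched
            config[name] = value      # change exactly one factor
            results.append((name, value, evaluate(config)))
    return results


def toy_evaluate(config: Dict[str, object]) -> float:
    """Hypothetical placeholder: a real study would train and test a CNN here."""
    score = 0.80
    if config["pixel_range"] == (0, 1):
        score -= 0.10
    if config["activation"] == "3xTanH":
        score += 0.02
    return round(score, 2)


baseline = {"pixel_range": (0, 255), "activation": "3xTanH"}
factors = {
    "pixel_range": [(0, 255), (0, 1)],
    "activation": ["3xTanH", "3xSigmoid"],
}
for name, value, score in one_at_a_time(toy_evaluate, baseline, factors):
    print(f"{name}={value}: {score:.2f}")
```

Because each run changes a single factor, any difference in the score can be attributed to that factor alone; interactions between factors (for example, image and filter normalization together) need combined runs.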
The paper has the following order: The "Materials and Methods" section describes the database, CNN architectures, experiments, training and hyper-parameters, hardware and resources. The "Results" section presents the quantitative results found for each of the scenarios. The "Discussion" section discusses the results presented in terms of their relationship and effect on steganalysis systems. Lastly, the "Conclusions" section presents the conclusions of the paper.

Database
The database used for the experiments was Break Our Steganographic System (BOSSBase 1.01) (Bas, Filler & Pevny, 2011). This database consists of 10,000 cover images of 512 × 512 pixels in Portable Gray Map (PGM) format (8-bit grayscale). For this research, similar to the process presented by Tabares-Soto et al. (2021), the following operations were performed on the images:
All images were resized to 256 × 256 pixels.
A corresponding steganographic image was created for each cover image using S-UNIWARD (Holub, Fridrich & Denemark, 2014) and WOW (Holub & Fridrich, 2012) with payload 0.4 bpp. The implementation of these steganographic algorithms was based on the open-source tool named Aletheia (Lerch, 2020) and the Digital Data Embedding Laboratory at Binghamton University (Binghamton University, 2015).
The images were divided into training, validation, and test sets. The size of each set varied according to the experiment.
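As a quick check of what a 0.4 bpp payload means for the resized images, the sketch below computes the size of the embedded message. The function name is illustrative, and the calculation assumes that the payload is expressed relative to the total number of pixels in the image.

```python
def payload_bits(width: int, height: int, bpp: float) -> float:
    """Message length in bits for a payload given in bits per pixel."""
    return width * height * bpp


bits = payload_bits(256, 256, 0.4)
print(f"{bits:.1f} bits (about {bits / 8:.0f} bytes)")
```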

CNN Architectures
The CNN architectures used in this research, except for GBRAS-Net, were modified according to the strategy described in Tabares to improve the performance of the networks regarding convergence, stability of the training process, and detection accuracy. The modifications involved the following: a preprocessing stage with 30 SRM filters and a modified TanH activation with range [−3, 3], Spatial Dropout before the convolutional layers, Absolute Value followed by Batch Normalization after the convolutional layers, Leaky ReLU activation in convolutional layers, and a classification stage with three fully connected layers. Figure 2 shows two of the six CNN architectures used for the experiments.

Complexity of CNNs
The computational complexity of a CNN has two dimensions, spatial and temporal. The spatial complexity estimates the disk size that the model will occupy after being trained (parameters and feature maps). The time complexity estimates the number of floating-point operations (FLOPs) that the network performs (He & Sun, 2015). Eq. (1) is used to calculate the temporal complexity of a CNN and Eq. (2) is used to calculate the spatial complexity.

$$\text{Time} \sim O\left(\sum_{l=1}^{D} M_l^2 \cdot K_l^2 \cdot C_{l-1} \cdot C_l\right) \quad (1)$$

$$\text{Space} \sim O\left(\sum_{l=1}^{D} K_l^2 \cdot C_{l-1} \cdot C_l + \sum_{l=1}^{D} M_l^2 \cdot C_l\right) \quad (2)$$

where:
D = number of convolutional layers of the network (depth)
l = index of the convolutional layer where the convolution is being performed
M_l = size of one side of the feature map in the l-th convolutional layer
K_l = size of one side of the kernel applied in the l-th convolutional layer
C_{l−1} = number of channels of each convolution kernel at the input of the l-th convolutional layer
C_l = number of convolution kernels at the output of the l-th convolutional layer
For the spatial complexity, the first summation gives the total size of the network parameters and the second summation gives the size of the feature maps. Table 2 lists the spatial and temporal complexities of the CNNs used in this sensitivity analysis.
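The two complexity measures can be evaluated layer by layer from the quantities defined above. The sketch below assumes the standard forms from He & Sun (2015): the time complexity sums M_l² · K_l² · C_{l−1} · C_l over the layers, and the spatial complexity adds the parameter count (K_l² · C_{l−1} · C_l) to the feature-map size (M_l² · C_l). The `ConvLayer` type and the two-layer network are hypothetical and do not correspond to any row of Table 2.

```python
from typing import List, NamedTuple


class ConvLayer(NamedTuple):
    m: int      # side of the feature map (M_l)
    k: int      # side of the kernel (K_l)
    c_in: int   # input channels (C_{l-1})
    c_out: int  # number of kernels / output channels (C_l)


def time_complexity(layers: List[ConvLayer]) -> int:
    """Sum of M_l^2 * K_l^2 * C_{l-1} * C_l over all convolutional layers."""
    return sum(layer.m ** 2 * layer.k ** 2 * layer.c_in * layer.c_out
               for layer in layers)


def space_complexity(layers: List[ConvLayer]) -> int:
    """Parameter count plus feature-map size, summed over all layers."""
    params = sum(layer.k ** 2 * layer.c_in * layer.c_out for layer in layers)
    feature_maps = sum(layer.m ** 2 * layer.c_out for layer in layers)
    return params + feature_maps


# Hypothetical two-layer network, not one of the six architectures studied.
toy = [ConvLayer(m=256, k=5, c_in=1, c_out=30),
       ConvLayer(m=128, k=3, c_in=30, c_out=16)]
print("time :", time_complexity(toy))
print("space:", space_complexity(toy))
```

Both values are counts (multiply-accumulate terms and stored numbers, respectively); converting the spatial count to bytes would additionally require the storage size per value.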

Image normalization
Image normalization is a typical operation in digital image processing that changes the ranges of the pixel values to match the operating region of the activation function. The most used bounds for CNN training are 0 to 255, when the values are integers, and 0 to 1 with floating-point values. The selection of this range affects performance and, depending on the application, one or the other is preferred. The following ranges were tested to demonstrate these effects: [0, 255]: 8-bit integer.
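A change of pixel range can be sketched as a linear rescaling. The example below assumes a min-max mapping from the 8-bit range [0, 255] to a target range; the target ranges used in the calls ([0, 1], [−0.5, 0.5], and [−12, 8]) are the ones mentioned in the Discussion, while the function name and the tiny image are illustrative.

```python
from typing import List, Sequence, Tuple


def rescale(pixels: Sequence[Sequence[float]],
            target: Tuple[float, float],
            source: Tuple[float, float] = (0.0, 255.0)) -> List[List[float]]:
    """Linearly map pixel values from the source range to the target range."""
    s_lo, s_hi = source
    t_lo, t_hi = target
    return [[t_lo + (p - s_lo) * (t_hi - t_lo) / (s_hi - s_lo) for p in row]
            for row in pixels]


image = [[0, 51], [204, 255]]          # tiny stand-in for an 8-bit image
print(rescale(image, (0.0, 1.0)))
print(rescale(image, (-0.5, 0.5)))
print(rescale(image, (-12.0, 8.0)))
```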

SRM filters normalization
As with image normalization, the SRM filter values shown in Fig. 3 impact network performance. To evaluate the effect of different filter values, experiments were performed without normalization and with normalization by a factor of 1/12, which places the filter values in the range [−1, 2/3].
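The effect of the 1/12 factor on the filter range can be verified on a single kernel. The sketch below uses the 5 × 5 high-pass kernel that is commonly cited as part of the SRM filter bank; that this exact kernel is among the 30 filters of Fig. 3 is an assumption, as are the names used in the code. Exact fractions are used so that the bounds −1 and 2/3 appear without rounding.

```python
from fractions import Fraction

# A 5x5 high-pass kernel commonly cited as part of the SRM filter bank
# (assumed here to be among the 30 filters; its values range from -12 to 8).
KV_KERNEL = [
    [-1,  2,  -2,  2, -1],
    [ 2, -6,   8, -6,  2],
    [-2,  8, -12,  8, -2],
    [ 2, -6,   8, -6,  2],
    [-1,  2,  -2,  2, -1],
]


def normalize_filter(kernel, factor=Fraction(1, 12)):
    """Multiply every coefficient by the normalization factor."""
    return [[value * factor for value in row] for row in kernel]


normalized = normalize_filter(KV_KERNEL)
flat = [value for row in normalized for value in row]
print("min:", min(flat), "max:", max(flat))
```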

Database partition
Dividing the database into three sets is good practice for artificial intelligence applications: the training set to adjust network parameters, the validation set to tune network hyper-parameters, and the test set to perform the final evaluation of the CNN performance. There is a default partition (see "Default partition") which most researchers in the field use. As part of the experimentation process developed in this research, the CNNs were tested using three additional database partitions as follows (amounts in image pairs): Train: 2,500, Validation: 2,500, and Test: 5,000.
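A partition expressed in image pairs can be sketched by splitting pair indices, so that each cover image and its stego counterpart always land in the same set. The shuffling with a fixed seed is an assumption made for reproducibility of the example; how the images were actually assigned to each set is not specified here, and the function name is illustrative.

```python
import random
from typing import List, Tuple


def split_pairs(n_pairs: int, n_train: int, n_val: int, n_test: int,
                seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle image-pair indices and cut them into three disjoint sets."""
    if n_train + n_val + n_test != n_pairs:
        raise ValueError("partition sizes must add up to the number of pairs")
    indices = list(range(n_pairs))
    random.Random(seed).shuffle(indices)   # fixed seed keeps the split reproducible
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train, val, test = split_pairs(10_000, 2_500, 2_500, 5_000)
print(len(train), len(val), len(test))
```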

Activation function of the preprocessing stage
The preprocessing stage, which consists of a convolutional layer with 30 SRM filters, includes an activation function that affects model performance on specific steganographic algorithms. As part of the experimentation process, four different activation functions were tested: 3 × TanH, 3 × HardSigmoid, 3 × Sigmoid, and 3 × Softsign.
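The four candidates can be written as scalar functions to compare their output ranges. The sketch assumes that "3 ×" means the output of the base activation is multiplied by 3, and that HardSigmoid is the piecewise-linear form clip(0.2x + 0.5, 0, 1); deep learning libraries differ in the slope they use for this function, so the exact definition should be checked against the framework in use.

```python
import math


def tanh3(x: float) -> float:
    """3 x TanH, bounded in [-3, 3]."""
    return 3.0 * math.tanh(x)


def sigmoid3(x: float) -> float:
    """3 x Sigmoid, bounded in [0, 3]."""
    return 3.0 / (1.0 + math.exp(-x))


def hard_sigmoid3(x: float) -> float:
    """3 x HardSigmoid, assuming the form clip(0.2 * x + 0.5, 0, 1)."""
    return 3.0 * max(0.0, min(1.0, 0.2 * x + 0.5))


def softsign3(x: float) -> float:
    """3 x Softsign, bounded in (-3, 3)."""
    return 3.0 * x / (1.0 + abs(x))


for x in (-10.0, 0.0, 10.0):
    print(x, tanh3(x), sigmoid3(x), hard_sigmoid3(x), softsign3(x))
```

Only 3 × TanH and 3 × Softsign are symmetric around zero, so they preserve the sign of the filter residuals; the two sigmoid variants map every residual to a non-negative value.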

Activation maps analysis
The outputs of a particular layer of a CNN are known as activation maps, and they indicate how well the architecture performs feature extraction. This paper presents a comparative analysis of the activation maps generated by a cover, a stego, and a "cover-stego" image in a trained model. Comparing these maps makes the differences between them visible.
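A "cover-stego" input can be sketched as a pixel-wise difference, under the assumption that the term denotes the cover image minus the stego image, so that only the embedding changes remain. The values below are made up for illustration and are not BOSSBase data; the same subtraction could be applied to two activation maps of equal size.

```python
from typing import List

Image = List[List[int]]


def difference_map(cover: Image, stego: Image) -> Image:
    """Pixel-wise cover minus stego; non-zero entries mark modified pixels."""
    return [[c - s for c, s in zip(c_row, s_row)]
            for c_row, s_row in zip(cover, stego)]


# Toy 3x3 example with made-up values (not BOSSBase data).
cover = [[120, 121, 119], [118, 200, 201], [117, 199, 202]]
stego = [[120, 122, 119], [118, 199, 201], [117, 199, 202]]
diff = difference_map(cover, stego)
changed = sum(value != 0 for row in diff for value in row)
print(diff, "changed pixels:", changed)
```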

Accuracy reporting in steganalysis
One of the characteristics of CNN training in steganalysis is the unstable accuracy and loss values between epochs, leading to highly variable results and training curves. Consequently, an abnormally high accuracy value can be achieved at a given time during the training process. Although it is correct to select the best accuracy under comparison, having more data allows a better understanding of the CNN. For example, in this paper, model accuracy was evaluated using the mean and standard deviation of the top five results from training, validation, and testing.
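The reporting scheme described above can be sketched in a few lines. The per-epoch accuracies are made-up values for illustration, and the use of the sample standard deviation (rather than the population one) is an assumption, since the text does not state which was used.

```python
from statistics import mean, stdev
from typing import List, Tuple


def top_k_summary(accuracies: List[float], k: int = 5) -> Tuple[float, float]:
    """Mean and sample standard deviation of the k highest accuracies."""
    top = sorted(accuracies, reverse=True)[:k]
    return mean(top), stdev(top)


# Made-up per-epoch test accuracies, for illustration only.
per_epoch = [0.70, 0.78, 0.81, 0.79, 0.85, 0.80, 0.82, 0.77]
avg, sd = top_k_summary(per_epoch)
print(f"top-5 accuracy: {avg:.3f} +/- {sd:.3f}")
```

Reporting the pair (mean, SD) alongside the single best value shows whether that best value is representative or an isolated peak of an unstable training curve.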

Training and hyper-parameters
The training batch size was set to 64 images for Xu-Net, Ye-Net, and Yedroudj-Net, and to 32 for SR-Net, Zhu-Net, and GBRAS-Net. The number of training epochs needed to reach convergence is 100, except for Xu-Net, which uses 150 epochs. The spatial dropout rate was 0.1 in all layers. Batch normalization had a momentum of 0.2, epsilon of 0.001, and renorm momentum of 0.4. The stochastic gradient descent optimizer momentum was 0.95, and the learning rate was initialized to 0.005. Except for GBRAS-Net, all layers used a glorot normal initializer and L2 regularization for weights and bias. For the GBRAS-Net architecture, training uses the Adam optimizer with the following configuration: the learning rate is 0.001, beta 1 is 0.9, beta 2 is 0.999, the decay is 0.0, and epsilon is 1e−08. Convolutional layers, except the first layer of preprocessing, use a kernel initializer called glorot uniform. The CNN uses a categorical crossentropy loss for the two classes. The metric used is accuracy. Batch Normalization is configured as in the other CNNs. In the original network, each of the 30 high-pass SRM filters is normalized by its maximum absolute value. The same padding is used on all layers. As shown in Fig. 2, the predictions performed in the last part of the architecture directly use a Softmax activation function.
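The settings listed above can be gathered in one place per network, which makes them easier to report and to reuse. The sketch below is one possible reading of the text, not the authors' configuration file: the function and key names are illustrative, and it assumes that stochastic gradient descent and spatial dropout apply to the five networks other than GBRAS-Net.

```python
from typing import Dict

SGD_NETWORKS = ("Xu-Net", "Ye-Net", "Yedroudj-Net", "SR-Net", "Zhu-Net")


def training_config(network: str) -> Dict[str, object]:
    """Collect the hyper-parameters stated in the text for one network."""
    if network != "GBRAS-Net" and network not in SGD_NETWORKS:
        raise ValueError(f"unknown network: {network}")
    config: Dict[str, object] = {
        "batch_size": 64 if network in ("Xu-Net", "Ye-Net", "Yedroudj-Net") else 32,
        "epochs": 150 if network == "Xu-Net" else 100,
        "batch_norm": {"momentum": 0.2, "epsilon": 0.001, "renorm_momentum": 0.4},
    }
    if network == "GBRAS-Net":
        config["optimizer"] = {"name": "adam", "learning_rate": 0.001,
                               "beta_1": 0.9, "beta_2": 0.999,
                               "decay": 0.0, "epsilon": 1e-08}
    else:
        # Assumed reading: SGD and spatial dropout for the modified networks.
        config["optimizer"] = {"name": "sgd", "momentum": 0.95,
                               "learning_rate": 0.005}
        config["spatial_dropout"] = 0.1
    return config


for name in SGD_NETWORKS + ("GBRAS-Net",):
    print(name, training_config(name))
```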

Hardware and resources
As previously described in Tabares

Image normalization
Image normalization is a typical operation in digital image processing that affects the performance of a CNN. Different types of normalization were applied to the images (cover and stego) of BOSSBase 1.01 with WOW 0.4 bpp. Training and validation were performed with the Xu-Net, Ye-Net, Yedroudj-Net, SR-Net, Zhu-Net, and GBRAS-Net CNNs (see Fig. 2 for Xu-Net and GBRAS-Net), with the default data partition (see "Default partition") and no SRM filter normalization. Table 3 shows the best test accuracy results of the CNNs with different image normalizations for WOW 0.4 bpp. Figure 4, under the title "Image Normalization", shows the accuracy curves of the SR-Net, Zhu-Net, and GBRAS-Net CNNs with WOW 0.4 bpp for different image normalizations.

SRM filters normalization
The SRM filters have an impact on the performance of CNNs for steganalysis. Therefore, filter normalization was performed by multiplying the filters by 1/12. For Table 4, the image normalizations, the distribution of classes within each batch of images, and the data partition were the same as in "Image normalization"; the only difference was the SRM filter normalization. Table 4 shows the best test accuracy results of the CNNs with different image normalizations and normalized filters for WOW 0.4 bpp. Figure 5, under the title "SRM Filters Normalization", shows the accuracy curves for the SR-Net, Zhu-Net, and GBRAS-Net CNNs with WOW 0.4 bpp for different image normalizations and normalized filters.

Database partition
In artificial intelligence, databases are divided into training, validation, and test sets. For steganalysis, a default data partition is used (see "Default partition"). Tables 5 and 6 show the best accuracy results, mean accuracy, and standard deviation (SD) of the best models with different data partitions, image pixel values in the range [0, 255], and no SRM filter normalization. Table 5 and Fig. 6 show the results of the different data partitions with S-UNIWARD 0.4 bpp. Table 6
The activation maps of the first and the last three convolutions of the network are shown in Fig. 11. The activation maps correspond to cover, stego, and steganographic content images. Figure 12 shows the ROC curves with Confidence Interval (CI) for the WOW steganography algorithm. The BOSSBase 1.01 database was used to train the model. These curves correspond to the model presented in Table 1 for GBRAS-Net. The ROC curves show the relationship between the false positive and true positive rates. These curves show the Area Under Curve (AUC) values; higher values indicate that the images were better classified by the computational model, which, in turn, depends on the steganography algorithm and payload.

Accuracy reporting in steganalysis
The results of this experiment were obtained with a data partition of 8,000, 1,000, and 1,000 pairs of images, using the GBRAS-Net and Xu-Net architectures on BOSSBase 1.01, with image pixel values in the range [0, 255] and no SRM filter normalization. Table 8 shows the results of accuracy reporting. The model accuracy was evaluated using the mean and standard deviation of the top five results achieved by the CNN during training, validation, and testing.

DISCUSSION
This study presents results obtained from testing different combinations of image and filter normalization ranges, various database partitions, different activation functions for the preprocessing stage, as well as analysis on activation maps of convolutions and how to report accuracy when training six CNN architectures applied to image steganalysis in the spatial domain. The experiments proposed here show highly variable results, indicating the importance of detailed documentation and reports derived from novel work in this field. Regarding image and SRM filter normalization, as shown in Table 3, the effectiveness of a normalization range depends on the selected CNN, such that SRM normalization (see Table 4) can generate completely different results.
The image normalization experiment demonstrates essential aspects of this analysis. For example, considering the Xu-Net architecture in Table 3, the best result is obtained using images with the original values of the database (i.e., in the range 0 to 255). Given this, one could conclude that there is no need for image normalization in any architecture; however, a different result is observed with the Zhu-Net architecture. Zhu-Net has the best result when the pixels are normalized to the range −12 to 8 (inspired by the minimum and maximum values of the original SRM filters). We recommend using the original pixel values as the first option because it is the best option for most of the CNNs.
When considering the combination of image normalization and filter normalization, the results can be different. For example, for the SR-Net architecture in Table 3, normalizing the pixels between −0.5 and 0.5 generates an accuracy of only 50.2% without filter normalization. Conversely, with normalized SRM filters, as shown in Table 4, the SR-Net CNN reaches an accuracy of up to 81.5%. However, as the normalization experiments show, GBRAS-Net is the architecture that best adapts to changes in data normalization and distributions. We recommend making use of this new architecture.
In the database partition experiment, the architectures' detection accuracy improved as the training set increased and the test set decreased. However, if the test set shrinks considerably, performance on future cases can be affected. In response, recent investigations use the BOWS 2 dataset since it contains more information. With a bigger dataset, the partition provides more information for both training and testing, which can enhance performance. A small test set may be an inadequate representation of the distribution of the images that the network must classify in a production setting; thus, a higher detection accuracy with this partition may not lead to a helpful improvement. Figures 8-10 show that a smaller training set produces highly variable validation and test curves, while a bigger training set generates smoother curves. Furthermore, these curves show how the validation curve can sometimes be higher or lower than the training curve. For this reason, it is better to choose the models from the results obtained on test data, which in turn requires a test set of adequate size and representativeness.
Table 7 shows that using different activation functions changes performance. In Ye-Net, an average accuracy of 84.2% is achieved across WOW and S-UNIWARD with 3 × TanH, and an average accuracy of 83.9% with 3 × HardSigmoid. Although for WOW the best result overall is obtained with the 3 × HardSigmoid activation function, a model intended for detection across several steganographic algorithms is better served by 3 × TanH, as shown by the average accuracy.
Figure 11 shows that the activation maps from the stego image differ from those of the cover image, which indicates a higher activation of the convolutional layer in the presence of the steganographic noise.
Moreover, by comparing the activation maps, it is clear that a good learning process was achieved by extracting relevant features and focusing on borders and texture changes in the images, where the steganographic algorithms are known to embed most of the information. The analysis of the activation maps is an effective tool for researchers to evaluate the learning process and gain an understanding of the features that the CNN recognizes as relevant for the steganalysis task. This shows that GBRAS-Net has an excellent ability to discriminate between images without hidden content and with hidden content.
The design of CNNs allows capturing steganographic content. The first layer (preprocessing), which contains the filters, is responsible for enhancing this noise while decreasing the content of the input image (see Fig. 11, in the Cover and Stego columns for the SRM filters row). The Cover-Stego column in Fig. 11 shows the noise. Adaptive steganography does its job well in adapting to image content; as seen in the image, it embeds at edges and other hard-to-detect places.
As proposed here (Table 8), the main advantage of accuracy reporting is being able to determine the consistency of the results, rather than relying only on the final value or the best one. To obtain these results, a model is saved at each epoch while the architectures are trained. The accuracy of each saved model is then computed on the datasets, which identifies the best models. With this accuracy reporting mode, whoever reproduces a given experiment can see the range of results to expect. As shown here, deep learning is highly sensitive in this problem, so a reproduced CNN may not obtain the same result as the one originally reported.
With all the information shown in this work for spatial image steganalysis using deep learning, we propose a set of recommendations for the design of experiments, listed below:
Recommendation 1: measure CNN sensitivity to data and SRM filter normalizations.
Recommendation 2: measure CNN sensitivity to data distributions.
Recommendation 3: measure CNN sensitivity to data splits.
Recommendation 4: measure CNN sensitivity to activation functions in the preprocessing stage.
Recommendation 5: show activation maps of cover, stego, and steganographic content images.
Recommendation 6: report the top five best epochs with accuracies and their standard deviation.
Finally, the contributions of this paper are listed at a general level:
Sensitivity of the accuracy in detecting steganographic images to different normalizations of the image pixels, on six CNN architectures (see Table 3 and Fig. 4).
Sensitivity of the accuracy in detecting steganographic images to different normalizations of the SRM filters in the preprocessing stage, on six CNN architectures (see Table 4 and Fig. 5).
Sensitivity of the accuracy in detecting steganographic images to the partition of the image set into training, validation, and test (see Tables 5 and 6 and Figs. 8-10).
Sensitivity of the accuracy in detecting steganographic images to different activation functions in the preprocessing stage during training (see Table 7).
The importance of analyzing the activation maps of the different convolutional layers to design new CNN architectures and understand their behavior (see Fig. 11).
The importance of reporting the average and standard deviation of the accuracy in detecting steganographic images to assess the consistency of the results reported in the experiments (see Table 8).
Some possible limitations of the current work, which was developed under the clairvoyant scenario, come from the nature and characteristics of the database: the use of images with fixed resolutions, the specific cameras used to take the pictures, the bit depth of the images, and that all the experiments were performed in the spatial domain.

CONCLUSIONS
As shown by the results presented in this paper, steganalysis detection systems are susceptible to changes in any stage of the process. Factors such as image and filter normalization ranges, database partition, and activation function in the preprocessing stage affect the CNN performance to the point that they determine its success. With this in mind, we present the analysis of the activation maps of convolutions for GBRAS-Net as a valuable tool to assess the CNN training process and its ability to extract distinctive features between cover and stego images. Understanding the behavior of steganalysis systems is key to designing strategies and computational elements to overcome their limitations and improve their performance. For example, taking Ye-Net as a reference with the WOW steganographic algorithm at 0.4 bpp on the BOSSBase 1.01 database, using the unmodified pixel values (between 0 and 255) results in a detection accuracy of 84.8%, while normalizing the image pixels between 0 and 1 yields 72.7% (see Table 3); both figures use the original SRM filter values (between −12 and 8). When the SRM filters are normalized under otherwise identical conditions, the results are 82.6% and 69.6%, respectively (see Table 4). With the same CNN and different partitions of the data set (training, validation, and test), we observe average accuracies between 76.8% and 86.0% for the S-UNIWARD steganographic algorithm with 0.4 bpp (see Table 5). These and the other results presented in this paper highlight the importance of clearly and precisely defining the experiments performed in steganalysis, in order to report the results reliably and make the experiments easier for other researchers to reproduce.
Furthermore, we recommend reporting accuracy values as the mean and standard deviation of the top five results, as it helps account for model consistency and reliability. If possible, we encourage researchers to release a repository with code and data resources to reproduce the results and to report the implementation details thoroughly, taking into account preprocessing and feature extraction techniques, classification process, and hyperparameters.