A destructive active defense algorithm for deepfake face images
Abstract
The harm caused by deepfake face images is increasing. To proactively defend against this threat, this paper innovatively proposes a destructive active defense algorithm for deepfake face images (DADFI). The algorithm adds slight perturbations to original face images to generate adversarial samples; these perturbations are imperceptible to the human eye but cause significant distortions in the outputs of mainstream deepfake models. Firstly, the algorithm generates adversarial samples that maintain high visual fidelity and authenticity. Secondly, in a black-box scenario, the adversarial samples are used to attack deepfake models, enhancing the samples' offensive capability. Finally, destructive attack experiments were conducted on the mainstream face datasets CASIA-FaceV5 and CelebA. The results demonstrate that the proposed DADFI algorithm not only improves the generation speed of adversarial samples but also increases the success rate of active defense. This achievement can effectively reduce the harm caused by deepfake face images.
Cite this as
2024. A destructive active defense algorithm for deepfake face images. PeerJ Computer Science 10:e2356 https://doi.org/10.7717/peerj-cs.2356
Background
Deepfake is a method that uses deep learning technologies such as generative adversarial networks (Brophy et al., 2023) to create synthetic digital content, including images. Fake news (Phan, Nguyen & Hwang, 2023) generated by deepfake can mislead public information judgment and decision-making. Additionally, deepfake-created fake pornographic content (Sha et al., 2023) can infringe on the rights of the individuals depicted, while fake evidence (Shukla & Goh, 2024) produced by deepfake can undermine legal justice. Due to the growing concerns about information authenticity, artificial intelligence ethics, and social trust raised by deepfake technology, an increasing number of scholars are focusing on how to balance technological development with the protection of personal rights.
Given the above challenges and the increasingly complex deepfake problem, researchers are diligently developing deepfake detectors to address these issues and enhance detection robustness. The goal is to identify face images created by continuously evolving high-quality deepfake digital content. Among the current mainstream research on deepfake detection, the more mature approaches are based on passive detection methods. Examples include the method based on physiological signals proposed by D’Amelio et al. (2023), the method based on traditional image forensics proposed by Zhao et al. (2023), and the method based on image tampering traces proposed by Wu, Liao & Ou (2023). A common perspective among these methods is that they utilize characteristic information extracted directly from the deepfake digital content itself to achieve authenticity identification.
Compared with the above methods, active defense can theoretically detect and mitigate deepfake problems earlier by embedding hidden information such as adversarial samples and utilizing strategies like watermark traceability. Additionally, active defense employs adversarial attack methods to generate offensive adversarial samples, which are visually indistinguishable to humans but cause significant distortion in the digital content output by deepfake models. For instance, Yuan et al. (2024) proposed a semi-fragile digital watermark generation method based on deep learning. Semi-fragile watermarks are sensitive to operations like image compression and rotation, and are particularly responsive to image modifications, allowing easy authenticity verification of digital content. Sun et al. (2023b) demonstrated the transferability of digital watermarks between training data and deepfake models, achieving deepfake traceability by embedding digital watermarks in the target digital content. Radanliev & Santos (2023) utilized the Jacobian matrix to identify key pixels in adversarial attacks, exploring adversarial samples with superior attack performance. Dimlioglu & Choromanska (2024) combined the gradient descent heuristic algorithm to enhance the generalization of adversarial samples by uniformly weighting the gradient across regions and models. Ouyang et al. (2023) proposed a cross-model universal attack watermark generation method, addressing conflicts between watermarks generated by different models through the design of attack channels and two-level fusion strategies.
Introduction
Due to the substantial differences among various deepfake models, the aforementioned attack algorithms achieve optimal results primarily in black-box scenarios. Additionally, the adversarial examples generated by these algorithms often lack sufficient transferability and require access to the deepfake model’s output for iterative updates, resulting in relatively low output efficiency of the adversarial examples.
Given the limitations and challenges of existing deepfake active defense detection methods, this paper innovatively proposes a destructive active defense algorithm for deepfake face images (DADFI) to address the low efficiency and poor transferability of current solutions. The specific steps of this algorithm are as follows:
(1) Generate adversarial samples: By extracting multi-dimensional feature information from images using a generative adversarial network (GAN), adversarial samples are generated. The quality of these samples is enhanced through adversarial loss strategies.
(2) Optimize adversarial samples: The adversarial attack algorithm is applied to the deepfake model, conducting black-box attacks in simulated black-box scenarios to optimize the attack performance of the adversarial samples.
(3) Improve attack capabilities: Training with multiple deepfake models is performed to enhance the generalization ability of the adversarial samples, improving their resilience to cross-model attacks.
The innovation of this algorithm lies in its ability to achieve end-to-end generation of adversarial samples based on a GAN. Unlike conventional adversarial attack algorithms that require frequent access to the deepfake model during the generation process, this algorithm eliminates the need to access the original deepfake model once the model training is complete. This approach significantly enhances the efficiency and transferability of the adversarial samples, providing a robust solution to the deepfake threat.
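To make this end-to-end property concrete, the following minimal sketch (written against PyTorch, with purely illustrative names such as `make_adversarial` and a 0.05 perturbation budget that are our assumptions rather than the authors' released code) shows that, once the generator is trained, producing an adversarial sample is a single forward pass that never calls a deepfake model:

```python
# Minimal sketch (not the authors' code): after training, adversarial-sample generation
# needs only the trained generator -- no access to any deepfake model.
import torch

@torch.no_grad()
def make_adversarial(gen: torch.nn.Module, image: torch.Tensor, lim: float = 0.05) -> torch.Tensor:
    """image: (B, 3, H, W) tensor in [0, 1]; returns the perturbed adversarial sample."""
    perturbation = gen(image)                       # slight perturbation Gen(image)
    perturbation = perturbation.clamp(-lim, lim)    # keep the L-infinity norm within `lim` (assumed bound)
    return (image + perturbation).clamp(0.0, 1.0)   # adversarial sample image_adv
```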
The paper is structured as follows: the Introduction provides an overview of the deepfake problem and its increasing impact, the motivation for developing efficient and transferable active defense mechanisms, and introduces the proposed DADFI along with its key innovations. The Related Works section is divided into three parts: research on deepfake based on face images, research on adversarial attacks and deepfake, and research on deepfake based on generative adversarial network, offering a comprehensive review of existing literature in these areas. The Algorithm Design section details the strategies for generating and fine-tuning adversarial samples, including the algorithm for generating adversarial examples and the algorithm for fine-tuning against adversarial examples. The Experiments section describes the models, datasets, and parameters used, along with the measurement methods for experimental results. The paper concludes with the Conclusion section, summarizing the findings.
Algorithm Design
In the current research landscape, the training results of many studies apply only to specific models, and the adversarial attack strategies lack strong effectiveness and transferability. To address this problem, this paper designs a destructive active defense algorithm for deepfake face images (DADFI). The algorithm produces more effective and more transferable adversarial samples, which cause mainstream deepfake face image models (such as StarGAN, StarGAN-v2 (Cai, Li & Zhang, 2023) and STGAN-SAC (Hou & Nayak, 2023)) to output severely distorted fake images. The face image adversarial samples generated by DADFI are almost indistinguishable from the original images to human vision; however, when these adversarial samples are fed into deepfake face image models, the output face images are severely distorted. The defense effect is illustrated in Fig. 1, where the face images come from the public datasets CASIA-FaceV5 (Qi et al., 2023), which contains 2,500 face images of 500 Asian subjects, and CelebA.
Figure 1: The effect achieved by DADFI.
The destructive active defense algorithm (DADFI) designed in this paper first generates adversarial samples and then applies an optimization algorithm that fine-tunes them to improve their attack effectiveness and transferability.
Strategies for adversarial samples generation and fine-tuning
This algorithm innovatively designs an adversarial sample generation module and an adversarial sample fine-tuning module. The generation module contains a generator $Gen$, and the fine-tuning module contains two recognizers $Ide_x$ and $Ide_y$. In addition, $n$ deepfake models $mod_i$ are set, where $i \le n$. Let any face image be $image_i$, its feature space be $fea_{image_i}$, and the total number of features be $sum_{fea}$. The generator $Gen$ extracts the high-dimensional feature $Gen_{image_i}$ of the face image $image_i$ and superimposes this feature onto the original face image to obtain the adversarial sample $image_i^{adv}$, which is composed of a set of feature vectors, that is, $image_i^{adv} = image_i + Gen_{image_i}$. Based on the projected gradient descent algorithm, a basic adversarial perturbation $adv_i$ is generated for the deepfake model $mod_i$. This perturbation is added to the original face image to obtain the fine-tuned adversarial sample $image_i^{\prime adv} = image_i + adv_i$, which makes the image output by the deepfake model exhibit obvious distortion.
In order to obtain adversarial samples that are transferable and universal across models, this paper innovatively uses the projected gradient descent algorithm to implement gradient-based adversarial attacks. The distortion area and distortion effect that the basic adversarial perturbation $adv_i$ produces on the face image depend only on the corresponding deepfake model; in other words, this basic adversarial perturbation is robust across multiple deepfake models. In summary, the high-dimensional feature $Gen_{image_i}$ learns adversarial attack capabilities similar to those of the basic adversarial perturbation $adv_i$, so the adversarial samples acquire characteristics that are common across models.
This algorithm denotes by $mod_i(x)$ the face editing operation of the selected deepfake model, sets the deepfake image generated from the adversarial sample $image_i^{adv}$ as $fake_{image_i} = mod_i(image_i^{adv})$, and sets the deepfake image generated from the fine-tuned adversarial sample $image_i^{\prime adv}$ as $fake'_{image_i} = mod_i(image_i^{\prime adv})$. The two outputs, $y_i^{fake} = fake_{image_i}$ and $y_i^{real} = fake'_{image_i}$, are fed into the recognizer $Ide_y$, and the attack capability of the adversarial sample $image_i^{adv}$ is improved through adversarial training. In addition, the two types of adversarial samples are input into $n$ different deepfake models, and the outputs form a set of distorted deepfake face images $[(fake_{image_1}, fake'_{image_1}), (fake_{image_2}, fake'_{image_2}), \ldots, (fake_{image_n}, fake'_{image_n})]$. Joint training with the recognizers improves the transferability of the adversarial samples. Based on this strategy, feeding the original face images into the trained generative model yields adversarial samples that are essentially indistinguishable from the originals to the human eye, yet cause several different deepfake models to output face images with varying degrees of distortion, as shown in Fig. 2.
Algorithm for generating adversarial examples
When generating adversarial examples, traditional adversarial attack algorithms need to access the deepfake model repeatedly. To solve this problem and improve the efficiency of adversarial sample generation, this paper innovatively uses a generative adversarial network to design an adversarial sample generation module and an adversarial sample identification module: the generation module contains the generator and the identification module contains the recognizer. In this network architecture, the input is the original face image and the output is the generated adversarial sample.
The generator $Gen$ is used to generate a slight perturbation that keeps the adversarial sample visually similar to the original face image, giving the mapping from the original face image to the slight perturbation shown in Eq. (1):

(1) $Gen: image_i \longrightarrow Gen_{image_i}$

where $Gen$ represents the generator, $image_i$ represents any face image input to the generator, and $Gen_{image_i}$ represents the slight adversarial perturbation output by the generator. The adversarial sample $image_i^{adv}$ is obtained by superimposing the perturbation onto the original image, as shown in Eq. (2):

(2) $image_i^{adv} = image_i + Gen_{image_i}$

After obtaining the adversarial samples, the algorithm uses the recognizer $Ide_x$ to distinguish them from the original face images in order to reduce the visual gap between the two. The algorithm innovatively applies adversarial training to reduce the visual impact of the slight perturbations, making the adversarial samples almost visually identical to the original face images. The generator adopts an autoencoder structure. The encoder uses three convolutional layers, with instance normalization after each convolution. The decoder uses three upsampling layers, again with instance normalization after each convolution. In addition, four residual blocks are inserted between the encoder and the decoder; each residual block consists of two 3×3 convolutional layers, with instance normalization after each convolutional layer, and the output of the first convolutional layer in each residual block is passed through an activation function.
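As a reference point, the sketch below shows one way the generator described above could be laid out in PyTorch. The three-layer encoder, four residual blocks, three-layer decoder and per-convolution instance normalization follow the text; the channel widths, kernel sizes, strides and ReLU/Tanh activations are our assumptions, chosen so that a 300×300 input yields a perturbation of the same size.

```python
# A sketch of the generator architecture described in the text; layer widths and strides
# are assumptions, not the authors' released configuration.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch),
        )

    def forward(self, x):
        return x + self.block(x)          # residual (skip) connection

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(     # three convolutions, instance norm after each
            nn.Conv2d(3, 64, 7, stride=1, padding=3), nn.InstanceNorm2d(64), nn.ReLU(True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.InstanceNorm2d(128), nn.ReLU(True),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.InstanceNorm2d(256), nn.ReLU(True),
        )
        self.residuals = nn.Sequential(*[ResidualBlock(256) for _ in range(4)])
        self.decoder = nn.Sequential(     # up-sampling path back to the input resolution
            nn.Upsample(scale_factor=2), nn.Conv2d(256, 128, 3, padding=1), nn.InstanceNorm2d(128), nn.ReLU(True),
            nn.Upsample(scale_factor=2), nn.Conv2d(128, 64, 3, padding=1), nn.InstanceNorm2d(64), nn.ReLU(True),
            nn.Conv2d(64, 3, 7, padding=3), nn.Tanh(),   # bounded perturbation map
        )

    def forward(self, x):                 # x: (B, 3, H, W) face image -> perturbation of same shape
        return self.decoder(self.residuals(self.encoder(x)))
```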
Figure 2: Adversarial example generation and fine-tuning.
The recognizer $Ide_x$ is used to judge whether a generated adversarial sample is real or fake relative to the original face image, and it iterates during adversarial training so that the generator eventually produces a perturbation with minimal visual impact on the original image. The algorithm defines the adversarial loss of the generator $Gen$ and the recognizer $Ide_x$ as $Loss_{Ide_x}$, as shown in Eq. (3):

(3) $Loss_{Ide_x} = E[\log Ide_x(image_i)] + E[\log(1 - Ide_x(image_i + Gen_{image_i}))]$

where $E$ denotes the mathematical expectation. At the same time, to keep the adversarial samples imperceptible and minimize the visual impact of the slight adversarial perturbation on the original face image, the algorithm innovatively designs a hinge loss function $Loss_{gemel}$ to constrain the scale of the perturbation, as shown in Eq. (4):

(4) $Loss_{gemel} = E_{image_i}\,\max(0, \|Gen_{image_i}\|_\infty - lim)$

where $lim$ represents the specified infinity-norm limit on the adversarial perturbation. If $\|Gen_{image_i}\|_\infty$ exceeds $lim$, this term penalizes the perturbation.
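A minimal sketch of the two losses follows, assuming the recognizer outputs a probability in (0, 1) and using a small epsilon to guard the logarithm; both are implementation assumptions not stated in the paper.

```python
# Sketch of Eq. (3) (adversarial loss between Gen and Ide_x) and Eq. (4) (hinge loss on
# the L-infinity norm of the perturbation).
import torch

def adversarial_loss(recognizer, image, perturbation, eps: float = 1e-8) -> torch.Tensor:
    """Eq. (3): E[log Ide_x(image)] + E[log(1 - Ide_x(image + Gen(image)))]."""
    real_term = torch.log(recognizer(image) + eps).mean()
    fake_term = torch.log(1.0 - recognizer(image + perturbation) + eps).mean()
    return real_term + fake_term

def hinge_loss(perturbation, lim: float = 0.05) -> torch.Tensor:
    """Eq. (4): penalize only the part of the L-infinity norm that exceeds `lim`."""
    linf = perturbation.flatten(1).abs().amax(dim=1)      # per-sample L-infinity norm
    return torch.clamp(linf - lim, min=0.0).mean()
```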
Algorithm for fine-tuning against adversarial examples
The adversarial sample fine-tuning algorithm consists of the deepfake model $mod_i$ and the recognizer $Ide_y$, whose network structure is identical to that of the recognizer $Ide_x$. The input of this algorithm consists of two types of adversarial samples: the adversarial samples $image_i^{adv}$ generated by the generator, and the adversarial samples $image_i^{\prime adv}$ obtained by adding the basic adversarial perturbation. Adding the basic adversarial perturbation $adv_i$ significantly distorts the face image output by the deepfake model, as shown in Eqs. (5) and (6):

(5) $adv_{image}^{0} = image_{ori} + adv_i$

where $adv_{image}^{0}$ represents the first slightly perturbed version of the original face image, $image_{ori}$ represents the original face image, and $adv_i$ represents the slight perturbation.

(6) $adv_{image}^{i+1} = sli_{image_{ori}}\{adv_{image}^{i} + \partial(\nabla_{image_{ori}} L(mod_i(adv_{image}^{i}), mod_i(image_{ori})))\}$

where $adv_{image}^{i+1}$ is the next iterate of the $i$-th slight perturbation $adv_{image}^{i}$ of the original face image, and $sli_{image_{ori}}$ projects the result back into the allowed perturbation range around $image_{ori}$. The slight perturbation obtained by iterating the above update is universal for the deepfake model; that is, not only a few face images but all face images in the training dataset can be given slight perturbations that make the deepfake model output distorted face images. This algorithm implements gradient-based adversarial attacks in advance through the projected gradient descent algorithm and generates slight perturbations for multiple deepfake models. The perturbed adversarial samples can then be used as labels to assist the optimization of the adversarial samples $image_i^{adv}$.
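The following sketch illustrates how the basic perturbation of Eqs. (5) and (6) could be produced with projected gradient descent. The random initialization, step size, iteration count and use of the gradient sign are assumptions, and `loss_fn` stands for the distance $L$ between the deepfake outputs.

```python
# Sketch of the basic adversarial perturbation (Eqs. (5)-(6)) via projected gradient descent.
import torch

def basic_perturbation(deepfake, image, loss_fn, eps=0.05, step=0.01, iters=10):
    """Grow a perturbation that maximizes the change in the deepfake model's output."""
    with torch.no_grad():
        target = deepfake(image)                                  # mod_i(image_ori)
    adv = (image + torch.empty_like(image).uniform_(-eps, eps)).detach()   # adv_image_0 (random start assumed)
    for _ in range(iters):
        adv.requires_grad_(True)
        loss = loss_fn(deepfake(adv), target)                     # L(mod_i(adv), mod_i(image_ori))
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() + step * grad.sign()                   # ascend: enlarge the distortion
        adv = image + (adv - image).clamp(-eps, eps)              # sli: project back onto the eps-ball
        adv = adv.clamp(0.0, 1.0)                                 # stay a valid image
    return (adv - image).detach()                                 # the basic perturbation adv_i
```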
Adversarial sample fine-tuning uses an adversarial loss function to improve the aggressiveness of the adversarial samples. The generation of the slight perturbation is related to the model gradient; that is, after different face images receive the same slight perturbation and are input to the deepfake models, the output deepfake images are similar in distortion range and degree of distortion. Therefore, the fine-tuned adversarial sample $image_i^{\prime adv}$ is input to the deepfake model $mod_i$ to obtain the fake image $fake'_{image_i}$, and the adversarial sample $image_i^{adv}$ generated by the generator is input to the deepfake model $mod_i$ to obtain the deepfake image $fake_{image_i}$. The algorithm then judges the authenticity of the images with the recognizer $Ide_y$ and applies adversarial training to make $fake_{image_i}$ approach $fake'_{image_i}$ as closely as possible, which ultimately improves the adversarial strength of the adversarial sample $image_i^{adv}$. The adversarial loss $Loss_{Ide_y}$ of the recognizer $Ide_y$ is shown in Eq. (7):

(7) $Loss_{Ide_y} = E[\log Ide_y(mod_i(image_i^{\prime adv}))] + E[\log(1 - Ide_y(mod_i(image_i^{adv})))]$

Adversarial sample fine-tuning integrates multiple deepfake models to improve the transferability of the adversarial samples. During training, the algorithm uses the adversarial samples to attack multiple deepfake models, which improves both their transferability and their cross-model attack capability. First, the algorithm generates basic adversarial perturbations $adv_i\ (i \in [1, n])$ for the $n$ deepfake models. Then, the adversarial samples $image_i^{adv}$ and $image_i^{\prime adv}$ are input into the different deepfake models, where $image_i^{adv}$ yields the deepfake images $fake_{image_i}$ and $image_i^{\prime adv}$ yields the deepfake images $fake'_{image_i}$; these images are collected into the set $ass$, as shown in Eq. (8):

(8) $ass = [(fake'_{image_1}, fake_{image_1}), (fake'_{image_2}, fake_{image_2}), \ldots, (fake'_{image_n}, fake_{image_n})]$

The algorithm makes $fake_{image_i}$ visually closer to $fake'_{image_i}$ through adversarial training and extends the loss function accordingly, as shown in Eq. (9):

(9) $Loss_{Ide_y} = E\left[\sum_{i=1}^{n} \log Ide_y(mod_i(image_i^{\prime adv}))\right] + E\left[\sum_{i=1}^{n} \log(1 - Ide_y(mod_i(image_i^{adv})))\right]$

After the above improvements, the objective function $Loss$ of the algorithm is composed of the adversarial loss $Loss_{Ide_x}$ of the recognizer $Ide_x$, the adversarial loss $Loss_{Ide_y}$ of the recognizer $Ide_y$ and the hinge loss $Loss_{gemel}$, as shown in Eq. (10):

(10) $Loss = Loss_{Ide_x} + Loss_{Ide_y} + \mu\, Loss_{gemel}$

where $\mu$ represents the weight of the hinge loss. During training, the adversarial loss $Loss_{Ide_x}$ constrains the adversarial samples to remain visually consistent with the original face images, and the hinge loss $Loss_{gemel}$ constrains the magnitude of the slight perturbations; together, these two losses ensure that high-quality adversarial samples with visual effects close to the original face images are generated. Deepfake generally operates in a black-box scenario: the party whose images are deepfaked does not know the network structure or parameters of the deepfake model. Therefore, the best way to defend proactively is to carry out black-box attacks on several deepfake models at the same time.
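The sketch below shows one plausible wiring of Eqs. (7)-(10), reusing the loss helpers sketched earlier. The weight mu = 0.5 comes from the Experiments section, while the function signatures are assumptions.

```python
# Sketch of the fine-tuning loss accumulated over n deepfake models (Eq. (9)) and the
# total objective (Eq. (10)); not the authors' code.
import torch

def fine_tuning_loss(ide_y, deepfake_models, image, gen_pert, base_perts, eps=1e-8):
    """Eq. (9): recognizer Ide_y adversarial loss summed over the deepfake models."""
    adv_gen = image + gen_pert                               # image_adv from the generator
    loss = 0.0
    for deepfake, base_pert in zip(deepfake_models, base_perts):
        adv_ref = image + base_pert                          # fine-tuned sample image'_adv
        loss = loss + torch.log(ide_y(deepfake(adv_ref)) + eps).mean() \
                    + torch.log(1.0 - ide_y(deepfake(adv_gen)) + eps).mean()
    return loss

def total_loss(loss_idex, loss_idey, loss_hinge, mu: float = 0.5):
    """Eq. (10) with the hinge-loss weight mu stated in the Experiments section."""
    return loss_idex + loss_idey + mu * loss_hinge
```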
This algorithm attacks based on queries: it estimates the gradient of the model by querying the differences in the outputs of the deepfake model, which improves the model's generalization ability. The adversarial loss $Loss_{Ide_y}$ simulates adversarial samples attacking multiple deepfake models in a black-box scenario, thereby improving the aggressiveness and transferability of the adversarial samples.
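The paper does not detail the query-based gradient estimate, so the sketch below uses a generic zeroth-order finite-difference estimator over random directions as a stand-in; the query budget and smoothing radius are assumptions.

```python
# Sketch of a query-only (black-box) gradient estimate: `query_fn` maps an image to a
# scalar score (e.g., a distortion measure of the deepfake output) and is the only access
# to the model.
import torch

@torch.no_grad()
def estimate_gradient(query_fn, image, n_queries: int = 20, sigma: float = 1e-3):
    """Estimate d query_fn / d image from output differences alone."""
    grad = torch.zeros_like(image)
    for _ in range(n_queries):
        direction = torch.randn_like(image)
        diff = query_fn(image + sigma * direction) - query_fn(image - sigma * direction)
        grad += diff * direction / (2.0 * sigma)      # central finite difference along `direction`
    return grad / n_queries
```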
Strategies to improve the offensiveness of adversarial examples
In traditional research on increasing the aggressiveness of adversarial samples, most studies use fine-tuning strategies. The algorithm proposed in this paper innovatively uses two strategies to increase or maintain the aggressiveness of samples. These two strategies are using a stronger perturbation strategy and dynamically adjusting the perturbation intensity.
Stronger perturbation strategy
When designing a stronger perturbation strategy, this paper adopts a method of gradually accumulating small perturbations. First, the basic adversarial perturbation $adv_i$ is generated and superimposed on the original face image, as shown in Eq. (11):

(11) $adv_{image}^{0} = image_{ori} + adv_i$

where $adv_{image}^{0}$ represents the first slightly perturbed version of the original face image, $image_{ori}$ represents the original face image, and $adv_i$ represents the slight perturbation. Then, a stronger perturbation is trained iteratively, as shown in Eq. (12):

(12) $adv_{image}^{i+1} = sli_{image_{ori}}\{adv_{image}^{i} + \partial(\nabla_{image_{ori}} L(mod_i(adv_{image}^{i}), mod_i(image_{ori})))\}$

where $adv_{image}^{i+1}$ represents the next iterate of the perturbation of the original face image. Through multiple iterations, universal small perturbations applicable to different deepfake models are generated, so that the face images in all training datasets carry small perturbations that cause the deepfake models to output distorted face images. This perturbation strategy generates small perturbations through the projected gradient descent algorithm to conduct adversarial attacks on multiple deepfake models, improving the universality and attack effect of the perturbation.
Adjust perturbation intensity
In order to improve offensiveness, this paper designs a method for dynamically adjusting the perturbation strength. In the adversarial sample fine-tuning process, the adversarial loss function $Loss_{Ide_y}$ is introduced, which estimates the gradient of the model by querying the output differences of the deepfake model, so that adversarial attacks can be conducted in black-box scenarios, as shown in Eq. (13):

(13) $Loss_{Ide_y} = E[\log Ide_y(mod_i(image_i^{\prime adv}))] + E[\log(1 - Ide_y(mod_i(image_i^{adv})))]$

During training, the fine-tuned adversarial sample $image_i^{\prime adv}$ and the generated adversarial sample $image_i^{adv}$ are input, and distorted images are obtained through multiple different deepfake models. Through adversarial training, $fake_{image_i}$ is made close to $fake'_{image_i}$, and the overall loss function is further optimized, as shown in Eq. (14):

(14) $Loss = Loss_{Ide_x} + Loss_{Ide_y} + \mu\, Loss_{gemel}$

where $\mu$ represents the weight of the hinge loss. Under the constraints of the adversarial loss $Loss_{Ide_x}$ and the hinge loss $Loss_{gemel}$, high-quality adversarial samples are generated that remain visually close to the original face images. In the black-box attack scenario, this method of dynamically adjusting the perturbation strength improves the aggressiveness and transferability of the adversarial samples, thereby enhancing the ability to attack multiple deepfake models.
Based on the above algorithm design, the ultimate goal of this algorithm is to generate adversarial samples that are almost visually identical to the original face images and that, when input to different deepfake models, cause highly distorted face images to be output. The algorithm also innovatively uses a generative adversarial network for optimization: the generator $Gen$ tries to make the objective loss function as small as possible, while the recognizers $Ide_x$ and $Ide_y$ try to make it as large as possible, and adversarial training ends when a Nash equilibrium is reached.
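Putting the pieces together, a plausible alternating training step consistent with this min-max description is sketched below. The optimizers, update order and reuse of the earlier loss sketches are assumptions; backpropagating through the deepfake models presumes locally available surrogates, whereas the paper's black-box setting would substitute the query-based gradient estimate sketched earlier.

```python
# Sketch of one alternating training step: the recognizers maximize the objective
# (minimize its negative), the generator minimizes the full objective of Eq. (10).
# Uses the helper functions adversarial_loss, fine_tuning_loss, hinge_loss and total_loss
# from the earlier sketches.
def train_step(gen, ide_x, ide_y, deepfake_models, base_perts, image,
               opt_gen, opt_idex, opt_idey, lim=0.05, mu=0.5):
    # --- recognizer updates (perturbation detached so only Ide_x / Ide_y are trained here) ---
    pert = gen(image).detach()
    loss_dx = -adversarial_loss(ide_x, image, pert)
    opt_idex.zero_grad()
    loss_dx.backward()
    opt_idex.step()

    loss_dy = -fine_tuning_loss(ide_y, deepfake_models, image, pert, base_perts)
    opt_idey.zero_grad()
    loss_dy.backward()
    opt_idey.step()

    # --- generator update: minimize Loss = Loss_Idex + Loss_Idey + mu * Loss_gemel ---
    pert = gen(image)
    loss_g = total_loss(adversarial_loss(ide_x, image, pert),
                        fine_tuning_loss(ide_y, deepfake_models, image, pert, base_perts),
                        hinge_loss(pert, lim), mu)
    opt_gen.zero_grad()
    loss_g.backward()
    opt_gen.step()
    return loss_g.item()
```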
Experiments
Models, datasets and parameters
The deepfake models selected for the experiments in this paper are StarGAN, StarGAN-v2 and STGAN-SAC. All three models can perform deepfake manipulation of face images. Among them, StarGAN and StarGAN-v2 are used to deepfake hair and age attributes in face images, while STGAN-SAC is used to deepfake accessories in face images.
In addition, the datasets selected for the experiments are the public datasets CASIA-FaceV5 and CelebA. The CASIA-FaceV5 dataset contains a large number of high-definition, diverse and real face images covering different ages, genders and expressions, providing a comprehensive foundation for face recognition and analysis tasks. The CelebA dataset is another large-scale face attribute dataset containing more than 200,000 celebrity images, each annotated with 40 attribute labels; its images span a wide range of poses, backgrounds and lighting conditions. The algorithm proposed in this paper novelly adds the generative adversarial network to the experiments. Images of size 300×300 extracted from the datasets are used as input face images. The hinge loss weight µ is set to 0.5. The number of training epochs and the batch size are set to 100 and 50, respectively.
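For convenience, the stated hyper-parameters can be collected in a single configuration object. The field names below are illustrative; only the values are taken from the text (the 0.05 perturbation bound comes from the validity experiments reported later).

```python
# Illustrative experiment configuration; names are assumptions, values come from the paper.
config = {
    "image_size": (300, 300),     # face images extracted from CASIA-FaceV5 / CelebA
    "hinge_loss_weight_mu": 0.5,  # weight of the hinge loss in Eq. (10)
    "epochs": 100,                # training cycles
    "batch_size": 50,
    "linf_limit": 0.05,           # perturbation bound `lim` used in the validity experiments
}
```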
Finally, a model optimizer was innovatively designed for the experiments. The learning rate of each parameter is dynamically adjusted by calculating first- and second-order moment estimates of the gradient, thereby improving training efficiency and accuracy. The specific steps are as follows:
- Compute the gradient $gra_{para}$ for each parameter $para$.
- Calculate the first-order moment estimate (mean) $\bar{est}_1$ of the gradient, as shown in Eq. (15):
  (15) $\bar{est}_1 = \alpha_1 \bar{est}_1 + (1 - \alpha_1)\, gra_{para}$
  where $\alpha_1$ controls the exponential decay rate of the first-order moment estimate and is set to 0.9 in the experiments.
- Calculate the second-order moment estimate (variance) $\sigma$ of the gradient, as shown in Eq. (16):
  (16) $\sigma = \alpha_2 \sigma + (1 - \alpha_2)\, gra_{para}^2$
  where $\alpha_2$ controls the exponential decay rate of the second-order moment estimate and is set to 0.99 in the experiments.
- Perform bias correction on the first-order and second-order moment estimates, as shown in Eq. (17):
  (17) $cor\bar{est}_1 = \dfrac{\bar{est}_1}{1 - \alpha_1^{i}}, \quad cor\sigma = \dfrac{\sigma}{1 - \alpha_2^{i}}$
  where $i$ represents the current iteration number.
- Update the parameters using the corrected first-order and second-order moment estimates, as shown in Eq. (18):
  (18) $para = para - \gamma \dfrac{cor\bar{est}_1}{\sqrt{cor\sigma} + \tau}$
  where $\gamma$ represents the learning rate and $\tau$ stabilizes the computation; in the experiments $\tau$ is set to $1e{-}6$.
Based on the above, the model optimizer designed for the experiments has a good convergence speed; a compact sketch of the update rule follows.
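The update of Eqs. (15)-(18) is the familiar Adam-style moment-estimation rule. A compact NumPy sketch is given below, with the learning rate value chosen arbitrarily since the text does not specify γ.

```python
# Sketch of the moment-based optimizer of Eqs. (15)-(18); the learning rate default is an assumption.
import numpy as np

class MomentOptimizer:
    def __init__(self, lr=1e-3, alpha1=0.9, alpha2=0.99, tau=1e-6):
        self.lr, self.a1, self.a2, self.tau = lr, alpha1, alpha2, tau
        self.m, self.v, self.t = None, None, 0

    def step(self, para, grad):
        if self.m is None:
            self.m, self.v = np.zeros_like(grad), np.zeros_like(grad)
        self.t += 1
        self.m = self.a1 * self.m + (1 - self.a1) * grad             # Eq. (15): first moment
        self.v = self.a2 * self.v + (1 - self.a2) * grad ** 2        # Eq. (16): second moment
        m_hat = self.m / (1 - self.a1 ** self.t)                     # Eq. (17): bias correction
        v_hat = self.v / (1 - self.a2 ** self.t)
        return para - self.lr * m_hat / (np.sqrt(v_hat) + self.tau)  # Eq. (18): parameter update
```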
Measurement methods for experimental results
First, the experiments innovatively designed the pixel distance $d_{pix}$. When evaluating the quality of the face images generated by the deepfake model, the quality of the generated images can be measured by computing $d_{pix}$ between the original face images and the deepfake face images, as shown in Eq. (19):

(19) $d_{pix} = \sqrt{\sum_{i=1}^{n} (I_{input}[i] - I_{output}[i])^2}$

where $I_{input}[i]$ is the value of the $i$-th pixel of the input original face image and $I_{output}[i]$ is the value of the $i$-th pixel of the output deepfake face image. Clearly, the larger the value of $d_{pix}$, the greater the visual difference between the output deepfake face image and the original face image, i.e., the better the attack effect of the adversarial sample.
The value of $d_{pix}$ evaluates the quality of the face images generated by the deepfake model at a global level. However, if the visual difference between the output deepfake images and the original face images is mainly local, the attack effect should still be judged good even though the $d_{pix}$ value may be very small. For this case, the experiments innovatively add a mask matrix to compute a mask distance. The experiments use an attention-priority mechanism to focus on the modified parts of the face images, because the mask matrix specifies which pixels should be used to compute the distance and which should be ignored. If the difference in the masked region exceeds a certain threshold, the attack effect is also considered good. The mask distance $d_{Md\_pix}$ is computed as shown in Eq. (20):

(20) $d_{Md\_pix} = \sqrt{\sum_{i=1}^{n} M[i]\,(I_{input}[i] - I_{output}[i])^2}$

where $M$ is the mask matrix defined in this experiment. The matrix has the same 300×300 shape as the original face image, and the value of each of its entries is 0 or 1: 0 means the pixel at that position should be ignored, and 1 means the pixel at that position should be used to compute the distance. $M[i]$ is the value of the $i$-th entry of the mask matrix.
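The two distances of Eqs. (19) and (20) can be sketched as follows. The images are assumed to be arrays of identical shape; any normalization used to obtain the score ranges reported in the Results is not specified in the text.

```python
# Sketches of the pixel distance (Eq. (19)) and the masked pixel distance (Eq. (20)).
import numpy as np

def pixel_distance(img_in: np.ndarray, img_out: np.ndarray) -> float:
    """Eq. (19): Euclidean distance over all pixel values."""
    return float(np.sqrt(np.sum((img_in - img_out) ** 2)))

def mask_distance(img_in: np.ndarray, img_out: np.ndarray, mask: np.ndarray) -> float:
    """Eq. (20): Euclidean distance restricted to the pixels selected by the 0/1 mask."""
    return float(np.sqrt(np.sum(mask * (img_in - img_out) ** 2)))
```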
Finally, in order to further evaluate the effect of attacking the deepfake model, the experiments compute the Frechet distance $d_{Frechet}$ from the mean vectors and covariance matrices of the real images and the generated images (Cheng & Huang, 2023), as shown in Eq. (21):

(21) $d_{Frechet} = \|\mu_{input} - \mu_{output}\|_2^2 + \mathrm{Tr}\left(\Sigma_{input} + \Sigma_{output} - 2\sqrt{\Sigma_{input}\Sigma_{output}}\right)$

The lower the $d_{Frechet}$ value, the closer the distributions of the generated images and the real images are in feature space, and the higher the quality of the generated images. Here, $\mu_{input}$ and $\mu_{output}$ are the mean vectors of the real images and the generated images in feature space, and $\Sigma_{input}$ and $\Sigma_{output}$ are their covariance matrices. The term $\|\mu_{input} - \mu_{output}\|_2^2$ is the squared Euclidean distance between the two mean vectors and measures the difference in means, while $\mathrm{Tr}(\Sigma_{input} + \Sigma_{output} - 2\sqrt{\Sigma_{input}\Sigma_{output}})$ measures the difference between the two covariance matrices. Adding these two parts yields the Frechet distance, which is used to evaluate the quality of the generated images.
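A sketch of Eq. (21) computed from precomputed feature-space means and covariances is given below, using SciPy's matrix square root; the choice of feature extractor is not specified in the text.

```python
# Sketch of the Frechet distance of Eq. (21) from feature-space statistics.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu_in, sigma_in, mu_out, sigma_out) -> float:
    """Eq. (21): ||mu_in - mu_out||^2 + Tr(S_in + S_out - 2 * sqrt(S_in S_out))."""
    covmean = sqrtm(sigma_in @ sigma_out)
    if np.iscomplexobj(covmean):            # discard tiny imaginary parts from numerics
        covmean = covmean.real
    diff = mu_in - mu_out
    return float(diff @ diff + np.trace(sigma_in + sigma_out - 2.0 * covmean))
```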
Results
Validity verification
Based on the results of the experiments, this paper designed an objective method to evaluate whether an attack on a deepfake model was successful. The experiments randomly selected 200 face images from the public datasets CASIA-FaceV5 and CelebA as the training set, and randomly selected 2,000 face images from the remaining images in the datasets for all experiments. By computing the pixel distance $d_{pix}$, mask distance $d_{Md\_pix}$ and Frechet distance $d_{Frechet}$, we observe the active defense effect against the deepfake models StarGAN, StarGAN-v2 and STGAN-SAC, as shown in Fig. 3.
First, this experiment randomly selected 200 face images from the CASIA-FaceV5 dataset for testing and modified these original face images according to the face-deepfake characteristics of the deepfake models StarGAN, StarGAN-v2 and STGAN-SAC. The results show that the pixel distance $d_{pix}$ for the three models attacked by the proposed DADFI algorithm is greater than 0.06. The algorithm considers that when the $d_{pix}$ between the original face image and the deepfake image is greater than 0.06, people can easily distinguish the two images visually; in other words, when the $d_{pix}$ of the two images is greater than 0.06, the active defense effect is good. In addition, the mask distance $d_{Md\_pix}$ for the three models attacked by the DADFI algorithm is greater than 0.85; the closer the mask distance $d_{Md\_pix}$ is to 1, the better the active defense effect. At the same time, the Frechet distance $d_{Frechet}$ values of all distorted output images are above 30, indicating that the difference between the distorted images and normal deepfake images is large, which also proves the effectiveness of the active defense. The same experiment was conducted on the CelebA dataset with similar results.
Then, during training on the datasets, the DADFI algorithm constrains the specified infinity-norm limit $lim$ of the adversarial perturbation to less than 0.05, so for normal vision there is essentially no difference between the original face images and their adversarial samples. At the same time, the outputs of the deepfake models StarGAN, StarGAN-v2 and STGAN-SAC differ greatly between the original face images and their adversarial samples, so human vision can easily detect the tampered images. Experimental results show that StarGAN is the most vulnerable to the adversarial attack, and its outputs after being attacked show the most obvious distortion. Although StarGAN-v2 and STGAN-SAC have a certain degree of robustness to the attack, their outputs after the attack are still significantly different from normal deepfake images, and the models cannot achieve the purpose of tampering with the face images.
Figure 3: Experimental results.
Finally, by horizontally comparing modifications of different facial attributes on the same models, it was found that the region tampered with by the deepfake model has little influence on the generated distortion region, i.e., the adversarial attack is robust to the deepfake method used. The above experimental results show that the adversarial samples generated by DADFI can attack deepfake models across models and can protect face images from tampering by deepfake models.
Comparative experiments
First, based on the datasets CASIA-FaceV5 and CelebA, three common adversarial attack algorithms were selected for comparison: MOMA proposed by Sun et al. (2023a), ATS-O2A proposed by Li et al. (2023a) and Adv-Bot proposed by Debicha et al. (2023). For these algorithms, the experiments innovatively use the mask distance $d_{Md\_pix}$ and the Frechet distance $d_{Frechet}$ to observe the active defense effects of the different algorithms against the deepfake models StarGAN, StarGAN-v2 and STGAN-SAC. The experimental results are shown in Figs. 4, 5 and 6.
Figure 4: StarGAN.
Figure 5: StarGAN-v2.
Figure 6: STGAN-SAC.
Then, the experimental results on the three different deepfake models show that the three common adversarial attack algorithms above all have a certain active defense effect. Compared with these algorithms, the DADFI proposed in this paper generates adversarial samples that attack the StarGAN and STGAN-SAC deepfake models more effectively, achieving the goal of cross-model defense against deepfake. In addition, the maximum mask distance $d_{Md\_pix}$ achieved by DADFI across the three models is 1.00 for StarGAN and the minimum is 0.79 for STGAN-SAC, which is still far ahead of the values of the other algorithms. Because a mask distance $d_{Md\_pix}$ closer to 1 indicates a better active defense effect, the active defense effect of DADFI is better than that of the comparison algorithms on all three models. At the same time, the Frechet distance $d_{Frechet}$ values of all distorted images output under DADFI are above 30, with a maximum of 131.50 for StarGAN and a minimum of 39.62 for StarGAN-v2, indicating that the difference between the distorted images and normal deepfake images is large, which also proves the effectiveness of the active defense.
Finally, the DADFI proposed in this paper is based on an end-to-end generative adversarial network: after training, adversarial samples can be obtained without accessing the deepfake model again. Therefore, this paper compares the adversarial sample generation efficiency of the different algorithms. The experimental results are shown in Fig. 7.
Figure 7: The average generation time of adversarial examples.
Experimental results show that, compared with the comparison algorithms MOMA, ATS-O2A and Adv-Bot, DADFI performs better in terms of adversarial sample generation efficiency.
Verification experiments
In order to verify the aggressiveness and transferability of the DADFI proposed in this paper, the following two sets of experiments were designed.
Verification experiment 1: Adjust the number of deepfake models in the adversarial sample fine-tuning algorithm, reducing the three deepfake models used for adversarial training in DADFI to an attack on a single deepfake model, in order to verify the effectiveness of DADFI and its advantage in transferability.
When the number of deepfake models is reduced to 1 and only one of StarGAN, StarGAN-v2 and STGAN-SAC is retained, the attack capability and transferability of the adversarial samples are evaluated through the pixel distance $d_{pix}$. If the pixel distance $d_{pix}$ produced when attacking the selected deepfake model is high while the pixel distance $d_{pix}$ produced when attacking the remaining deepfake models is significantly lower than 0.05, then the single-model attack is highly targeted but its transferability is insufficient. Experimental results show that the DADFI proposed in this paper, which is trained with all three deepfake models and uses the pixel distance $d_{pix}$ to evaluate the distortion effect, obtains values that are all higher than 0.05, and its mask distance $d_{Md\_pix}$ scores are high and close to 1, which clearly indicates improved transferability of the adversarial examples. The pixel distance comparison when only one of the three deepfake models is retained is shown in Fig. 8.
Figure 8: Pixel distance comparison.
Experimental results show that when the number of deepfake models is reduced to 1 and only one of StarGAN, StarGAN-v2 and STGAN-SAC is retained, the pixel distance values are high only for the retained model; the full DADFI trained with all three models therefore compares favorably, demonstrating the better transferability of its adversarial examples.
Verification experiment 2: Remove the adversarial samples fine-tuning algorithm, and determine whether the adversarial samples generated by the generator can make the deepfake model output distorted face images to verify whether the fine-tuning algorithm can improve the aggressiveness of the adversarial samples.
When the adversarial sample fine-tuning algorithm is removed from training and only the adversarial sample generation algorithm is used, the pixel distance $d_{pix}$ of the generated adversarial samples against the deepfake model StarGAN is only 0.0382, and the pixel distances $d_{pix}$ for the other two deepfake models are even lower. Moreover, the mask distance $d_{Md\_pix}$ values for all three models are small, so the adversarial samples are barely aggressive. Comparison with the full DADFI proposed in this paper shows that DADFI effectively improves the aggressiveness of the adversarial samples; the mask distance comparison when one of the three deepfake models is retained is shown in Fig. 9.
Figure 9: Mask distance comparison.
Experimental results show that the mask distance $d_{Md\_pix}$ values obtained by the full DADFI are higher than those obtained when the adversarial sample fine-tuning algorithm is removed and only the generation algorithm is retained, which proves that the fine-tuning algorithm improves the attack performance of the adversarial samples.
The results of verification experiments 1 and 2 show that the DADFI proposed in this paper has strong attack and transferability performance.
Conclusion
This paper proposes a destructive active defense algorithm for deepfake face images (DADFI). Active defense protects face images from being tampered with by deepfake models. The algorithm consists of two sub-algorithms: the adversarial samples generation algorithm and the adversarial samples fine-tuning algorithm. When the adversarial samples generation algorithm generates adversarial samples through generator training, the loss function is used to make the perturbed adversarial samples maintain visual consistency with the original face images, reducing the impact of perturbation on image quality. The adversarial samples fine-tuning algorithm combines existing adversarial attack algorithms to improve the aggressiveness and transferability of adversarial samples through adversarial training, allowing multiple deepfake models to output highly distorted face images. Experimental results show that the DADFI algorithm proposed in this paper can achieve the ability to interfere with the output of deepfake models across models, and the achieved face images distortion effect is close to the adversarial samples generated by the current mainstream adversarial attack algorithms. More importantly, DADFI greatly improves the efficiency of adversarial samples generation. However, despite the promising results, the research does have some limitations in terms of datasets and models. The datasets used for generating and fine-tuning adversarial samples might not cover the full spectrum of real-world face variations, potentially limiting the generalizability of the algorithm. Additionally, the models used in the experiments may not encompass all types of deepfake generation techniques, which means the DADFI algorithm’s performance could vary when faced with unseen or more advanced deepfake models.
Based on the algorithm proposed in this paper, in future work, the authors will solve the following three problems: (1) Study the generation of adversarial samples by local perturbation to reduce the impact on the quality of the original image. (2) Improve the visual consistency of adversarial samples and enhance their similarity with the original image. (3) Optimize the efficiency of adversarial sample generation and improve the practicality of the algorithm.
Additional Information and Declarations
Competing Interests
The authors declare there are no competing interests.
Author Contributions
Yang Yang conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.
Norisma Binti Idris analyzed the data, prepared figures and/or tables, and approved the final draft.
Chang Liu conceived and designed the experiments, performed the experiments, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.
Hui Wu performed the experiments, analyzed the data, authored or reviewed drafts of the article, and approved the final draft.
Dingguo Yu conceived and designed the experiments, analyzed the data, performed the computation work, authored or reviewed drafts of the article, and approved the final draft.
Data Availability
The following information was supplied regarding data availability:
The CASIA-FaceV5 dataset is available from the Institute of Automation Chinese Academy of Sciences at: http://english.ia.cas.cn/db/201610/t20161026_169405.html, and at FigShare: Yang, Yang (2024). CASIA-FaceV5.zip. figshare. Dataset. https://doi.org/10.6084/m9.figshare.26509591.v1.
The CelebA large-scale face dataset is available at: https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
The code (original code (DADFI) & 3rd party models (StarGAN, StarGAN-v2 and STGAN-SAC)) is available at Zenodo: Yang. (2024). A-destructive-active-defense-algorithm-for-deepfake-face-images [Data set]. Zenodo. https://doi.org/10.5281/zenodo.13283088.
Funding
This research was funded by the National Social Science Fund of China (grant no. 22BSH025), the National Natural Science Foundation of China (grant no. 62206241) and the Key Research and Development Program of Zhejiang Province, China (grant no. 2021C03138), the Medium and Long-Term Science and Technology Plan for Radio, Television, and Online Audiovisuals (grant no. 2022AD0400). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.