Addressing human speech characteristics in single-channel speaker extraction networks
- Subject Areas
 - Algorithms and Analysis of Algorithms, Data Mining and Machine Learning, Multimedia, Natural Language and Speech, Neural Networks
 - Keywords
 - Time domain, Speaker extraction, Single channel, Deep learning
 
Abstract
The objective of single-channel speaker extraction is to isolate the clean speech of a target speaker from a mixture of multiple speakers’ utterances. Conventionally, an auxiliary reference network is utilized to extract the speaker’s voiceprint features from the speech signal, which are then fed as input cues to the primary speech extraction network, thereby enhancing the extraction robustness. Nevertheless, existing studies have largely overlooked the sequential characteristics of speech signals, leading to a mismatch between the receptive field of the model and the inherent signal features. Moreover, prevalent speaker extraction architectures fail to account for the spectral distribution properties of speaker speech. To address these limitations, this study presents an extended version of our previous work (DOI 10.36227/techrxiv.23849361.2), which proposes an innovative approach by extending the Convolutional Neural Network Next (ConvNeXt) model, originally designed for image processing, into the Time-Domain ConvNeXt model (TD-ConvNeXt). The TD-ConvNeXt model is integrated with Temporal Convolutional Networks (TCN) blocks to construct the core architecture of the speech extraction network. Additionally, the ConvNeXt model is adapted into a novel Spk block for the auxiliary network, which effectively learns the identity-related features from the reference speech and represents them as embedding vectors. This methodology enables high-fidelity extraction of the target speaker’s speech while maintaining excellent speech quality. The extraction network and the auxiliary reference network are jointly optimized via multi-task learning. Extensive experimental evaluations demonstrate that the proposed model substantially outperforms state-of-the-art methods in single-channel target speech extraction. Across various signal-to-noise ratios (SNRs), the proposed model achieves superior performance in terms of Scale-Invariant Signal-to-Distortion Ratio (SI-SDR), Perceptual Evaluation of Speech Quality (PESQ), and Short-Time Objective Intelligibility (STOI). Specifically, at an SNR of 5 dB, the proposed model attains SI-SDR, PESQ, and STOI scores of 14.83, 3.03, and 0.934, respectively.
Introduction
Speaker speech extraction is a special case of speech separation that aims to extract the speech signal of a target individual from mixed speech signals of multiple people, commonly referred to as the cocktail party problem (Chen et al., 2017). In recent years, deep learning-based methods have made significant progress in both speech quality and extraction speed. These methods can mine deep features from large amounts of data, possess strong nonlinear modeling capabilities, and improve the generalization performance of the network in speech extraction, achieving stronger adaptability and noise resistance (Saijo et al., 2024; Zmolikova et al., 2023; Wang et al., 2023; Tzinis et al., 2022; Karamatli & Kirbiz, 2022; Wang et al., 2022; Chen et al., 2020; Manamperi et al., 2022). However, current work is based on the assumption that mixed speech signals are mutually independent, which is not always valid in practical scenarios. The sounds of speakers commonly affect each other, causing their speech to no longer be independent and affecting the performance of speech extraction algorithms.
To address the low robustness of single-channel target speech extraction, many researchers have made significant progress using various deep learning techniques. Ju et al. (2023) reduce computation via subband processing. Xu et al. (2019) balance speech quality and speaker fidelity with a new loss. Luo & Mesgarani (2019) propose an end-to-end fully convolutional time-domain speech separation network. The VoiceFilter series (Wang et al., 2018, 2020; Rikhye et al., 2021) pioneers speaker-conditioned masking, later optimized for on-device streaming. Delcroix et al. (2020) inject speaker embeddings into time-domain networks, while SpEx/SpEx+ (Xu et al., 2020; Ge et al., 2020) avoid phase issues via multi-scale time-domain convolution. Ge et al. (2021) integrate utterance-level and frame-level references using a multi-stage framework. Zmolikova et al. (2017) propose the SpeakerBeam model, which uses the speaker's voiceprint information as a reference cue so that the extraction network focuses on a single speaker's speech; most subsequent research adopts similar structures. The convolutional neural network next (ConvNeXt) model has also been applied to speech processing (Liu et al., 2022), demonstrating high accuracy, good scalability, and effectiveness in speech separation and extraction. However, current studies pay little attention to the time-series characteristics of speech signals, resulting in a mismatch between the receptive field of the model and the signal characteristics. In addition, commonly used speaker extraction models do not consider the spectral distribution characteristics of the speaker's speech. To address these shortcomings, we propose an improved TD-ConvNeXt structure that processes time-series speech data with a large receptive field, matching the strong front-to-back correlation of speech signals, and estimates the mask of the target speech. We further argue that the network needs the TD-ConvNeXt structure to capture longer speech context and the temporal convolutional network (TCN) structure to obtain more complete speech features; in the backbone of the speech extraction network (the speech extractor), we therefore combine both structures (TCN and TD-ConvNeXt) and verify this hypothesis experimentally. To overcome the drawback of assigning equal weight to every channel of the reference speech, we combine convolutional layers with a channel attention module: we convert the ConvNeXt structure into the Spk structure and add squeeze-and-excitation (SE) blocks to form a new Spk block. This block learns speaker identity features from the reference speech as embedding vectors that stimulate the main extraction network, strengthening the weight of channels that are beneficial for speaker identification and improving extraction accuracy while avoiding the phase estimation problem associated with learning and estimating time-frequency maps. To verify the effectiveness of our proposed system, we used SpEx+ (Ge et al., 2020) as a reference model, set up different extraction networks, and determined the optimal structural parameters of the extraction network as well as the parameters of the auxiliary network. We also discussed different encoder window sizes.
Finally, we compared our proposed model with SpEx+ (Ge et al., 2020), sDPCCN (Han et al., 2022) and other state-of-the-art models, and verified that our model outperforms existing models in the Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) (Roux et al., 2019), Perceptual Evaluation of Speech Quality (PESQ) (Rix et al., 2001), and Short-Time Objective Intelligibility (STOI) (Taal et al., 2011) indicators under different signal-to-noise ratios. Portions of this text were previously published as part of a preprint (Yuan, 2023). The main contributions of this article are:
1. We extend the ConvNeXt block from image processing to a TD-ConvNeXt block suitable for processing temporal information in time-series data, and combine it with the TCN block to form the main body of our speech extraction network.
2. We design a new Spk block that learns speaker identity features from the reference speech as embedding vectors to stimulate the main extraction network.
3. We conduct extensive experiments to determine the optimal structural parameters and architecture of the extraction network, and compare our proposed model with existing advanced speech extraction algorithms, verifying its effectiveness in improving single-channel target speech extraction performance.
This article is organized as follows. In 'The Improvement of Speech Extraction Scheme', we motivate and design the proposed architecture. 'Experiment Methodology' describes the experimental setup. 'Results' reports the experiments that validate the effectiveness of the new model. 'Related Work' discusses related studies, and 'Conclusions' concludes the study.
The improvement of speech extraction scheme
Selective auditory attention, an essential cognitive function of humans, empowers individuals to concentrate on pertinent auditory stimuli while effectively discarding irrelevant distractions (Mesgarani & Chang, 2012). This function encompasses the accurate manipulation of sound data, its organization into harmonious auditory sequences, and the isolation of target sounds by subduing competing auditory signals (Hill & Miller, 2010). Unlike a static and unidirectional information extraction approach, selective auditory attention is malleably influenced by diverse factors, including bottom-up sensory inputs and top-down task-specific goals (Kaya & Elhilali, 2017). Intriguingly, in complex environments, humans are capable of constantly learning to regulate and choose their auditory attention in real-time.
This research aims to enhance the precision of target speech extraction. In our previous work, while we laid the foundation for leveraging deep learning in auditory attention modeling, limitations in extraction accuracy persisted. Consequently, the core objective of this updated study is to more effectively utilize the learning capacity and nonlinear modeling capabilities of deep learning models to mimic human selective auditory attention, thereby achieving more accurate selective target speech extraction.
The speech extraction problem
We formally describe the speech extraction problem as follows. Let the speech signal of the target speaker be denoted as $s(t)$. The mixed speech can then be represented as follows:

$$y(t) = s(t) + \sum_{i=1}^{I} b_i(t) \tag{1}$$

Here, $s(t)$ and $b_i(t)$ denote the speech signal of the target speaker and the interference sources, respectively, $t$ refers to the time frame, and $I$ is the number of interference sources. The objective of speech extraction is to extract $s(t)$ from $y(t)$.
The architecture of the scheme
Drawing inspiration from the design principles of speech extraction network frameworks proposed in previous research (Luo & Mesgarani, 2019), we have devised a novel deep learning-based speech extraction network, as vividly illustrated in Fig. 1. In this figure, Fig. 1A offers a comprehensive block diagram depicting the general process of speaker extraction, while Fig. 1B presents a detailed system flowchart of the proposed network architecture. The proposed network model is composed of five key components: the Mixture Encoder, Reference Speech Encoder, Speech Extractor, Speaker Encoder, and Speech Decoder. The Mixture Encoder and Reference Speech Encoder undertake the crucial task of transforming the time-domain waveform signals of mixed speech and reference speech, respectively, into high-dimensional time-domain feature representations. These representations capture the essential characteristics of the input speech data.
Figure 1: (A) The block diagram of the speaker extraction network. (B) The proposed TD-ConvNeXt network model; blocks in modules with the same function share the same color.
$y(t)$ is the mixed speech and $x(t)$ is the reference speech. 'Emb' is the identifiable embedding vector of the target speech, 'Conv1' is a convolution operation, 'Norm' is layer normalization, and the detail of the Spk block is shown in Fig. 3.

The Speaker Encoder, by contrast, is an evolved design from our earlier work, engineered to differentiate the unique acoustic attributes of the target speaker from those of other speakers more effectively. Whereas the previous work was limited in capturing fine-grained acoustic variations, the updated encoder generates robust identity embedding information through enhanced discriminative learning. This information serves as a distinctive identifier for the target speaker within the network and is disseminated throughout the model, providing a solid foundation for subsequent processing steps. The integration of identity embedding information into the network architecture builds directly on our previous framework and addresses its earlier constraint of suboptimal embedding utilization. Its primary role is to endow the model with the ability to selectively extract the target speech from complex mixtures, a capability that was only partially realized in the previous work and is now refined through optimized information propagation. By leveraging this embedded identity information, which is now integrated more deeply across the model layers, the model identifies and isolates the target voice with higher precision, marking a significant advancement in speech extraction performance.
The Speech Decoder is architected to reconstruct the time-domain waveform signals at varying resolutions. Its operational mechanism involves leveraging the time-domain masking information generated by the core Speech Extractor component, enabling the recognition and subsequent reconstruction of the target speech waveform. This process is crucial for restoring the original speech signal quality after the extraction process. Ensuring the fidelity of the extracted speech necessitates that the Speech Extractor is equipped with the capacity to analyze the intricate speech features and their inherent interdependencies, while simultaneously adapting to the dynamic and complex acoustic environments. To address these requirements, we employ a non-causal convolutional model with adjustable dilation rates to build a long-sequence processing architecture. This design allows the model to effectively capture the long-range contextual information within speech signals, thereby facilitating a more comprehensive and nuanced analysis of the speech content. The detailed design and implementation of the Speech Extractor are further elaborated in ‘Speech Extractor’.
Mixture encoder
The speech encoder utilizes one-dimensional convolutions to perform multi-scale encoding (with convolution kernels of varying lengths) and extract the time-domain features of human speech. This operation bypasses the challenge of phase estimation in the short-time Fourier transform, ultimately improving the performance of speech extraction.
The lengths of the convolution kernels are set to $L_1$, $L_2$ and $L_3$, respectively, and the step size is set to $L_1/2$. The output obtained through the speech encoder is an $N$-dimensional vector of length $K$:

$$\mathbf{w}_i = \mathrm{ReLU}\big(\mathrm{Conv1D}(y, L_i)\big), \quad i \in \{1, 2, 3\} \tag{2}$$

where $\mathbf{w}_i \in \mathbb{R}^{N \times K}$, and $K$ can be calculated from the input length $T$ as $K = 2(T - L_1)/L_1 + 1$. ReLU is the activation function, $\mathrm{Conv1D}(\cdot)$ is the one-dimensional convolution layer, and $y$ is the original mixed speech. The complete output obtained by the multi-scale encoder is a concatenation of the three vectors, $\mathbf{W} = [\mathbf{w}_1; \mathbf{w}_2; \mathbf{w}_3]$.
Similarly, in the proposed network model, the reference speech encoding shares the same convolutional layers and weights with the mixed speech encoding; this twin (weight-sharing) scheme was shown to effectively improve speech separation performance in the SpEx+ model (Ge et al., 2020). The output of the reference speech encoding is obtained analogously as $\mathbf{E} = [\mathbf{e}_1; \mathbf{e}_2; \mathbf{e}_3]$.
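For concreteness, the multi-scale, weight-shared encoding can be sketched compactly in PyTorch. The snippet below is a minimal illustration, assuming $N = 256$ filters and kernels $L_1 = 20$, $L_2 = 80$, $L_3 = 160$ with stride $L_1/2 = 10$; the right-padding used to keep the three scales frame-aligned is our assumption, not the authors' released implementation.

```python
# Minimal sketch of the multi-scale encoder (assumed N = 256 filters).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleEncoder(nn.Module):
    def __init__(self, n_filters=256, kernels=(20, 80, 160)):
        super().__init__()
        self.kernels = kernels
        stride = kernels[0] // 2
        self.convs = nn.ModuleList(
            [nn.Conv1d(1, n_filters, k, stride=stride) for k in kernels])

    def forward(self, wav):
        # wav: (batch, samples) -> (batch, 1, samples)
        x = wav.unsqueeze(1)
        feats = []
        for k, conv in zip(self.kernels, self.convs):
            # Pad longer-kernel branches so every scale yields the same K frames.
            xi = F.pad(x, (0, k - self.kernels[0]))
            feats.append(F.relu(conv(xi)))
        # Concatenate w1, w2, w3 along the channel axis: (batch, 3 * n_filters, K).
        return torch.cat(feats, dim=1), feats

# The same instance (shared weights) encodes both the mixture and the reference
# speech, mirroring the twin/weight-sharing scheme adopted from SpEx+.
encoder = MultiScaleEncoder()
W, (w1, w2, w3) = encoder(torch.randn(2, 32000))   # 4 s of 8 kHz audio
E, _ = encoder(torch.randn(2, 32000))              # reference encoded with the same weights
```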
Reference speech encoder
The reference speech encoder design is depicted in Fig. 2. The reference speech signal is first passed through one-dimensional convolutional encoders (Conv1Ds) to obtain time domain features. Simultaneously, the reference speech signal undergoes short time Fourier transform (STFT) operation to obtain a two-dimensional time-frequency map. The time-frequency map is then convolved using a 2-D convolution (Conv2D) to obtain two-dimensional time-frequency features. The time domain features obtained by Conv1Ds are concatenated along the channel, while the time-frequency domain features from the Conv2D are concatenated along the frequency direction. These features are then combined to create new features.
Figure 2: The reference speech encoder with a mixture of time and time-frequency domains. $\mathrm{Conv1D}_1$, $\mathrm{Conv1D}_2$ and $\mathrm{Conv1D}_3$ are the convolution operations, whose convolution kernels range from short to long, respectively.
$\mathbf{e}_1$, $\mathbf{e}_2$ and $\mathbf{e}_3$ are the time-domain features output by the above convolutions, respectively; STFT is the short-time Fourier transform; $\mathbf{F}$ is the spectral feature after convolution; concat denotes concatenation.

The one-dimensional convolutions use three kernel lengths (20, 80 and 160 samples) with a step size of 10. These kernel lengths correspond to time windows of 2.5, 10 and 20 ms, respectively, at a sampling rate of 8,000 Hz.
As shown in Fig. 2, $\mathrm{Conv1D}_1$ denotes the encoder with a kernel length of 20, while the other two encoders have kernel lengths of 80 and 160. The temporal features are calculated as follows:

$$\mathbf{e}_i = \mathrm{Conv1D}_i(x), \quad i \in \{1, 2, 3\} \tag{3}$$

where $\mathrm{Conv1D}_1$, $\mathrm{Conv1D}_2$ and $\mathrm{Conv1D}_3$ are convolution operations with kernels of varying lengths.
To obtain time-frequency features using the STFT, we set the window length to 20 and the frame shift to 10 to match the kernel length. We apply the STFT to the input reference speech signal $x$:

$$X(t, f) = \mathrm{STFT}(x) \tag{4}$$

We separate its real and imaginary parts to obtain:

$$X(t, f) = X_r(t, f) + jX_i(t, f) \tag{5}$$

Then, we obtain the magnitude spectrum as:

$$|X(t, f)| = \sqrt{X_r(t, f)^2 + X_i(t, f)^2} \tag{6}$$

We input $|X(t, f)|$ into a two-dimensional convolution layer to obtain the time-frequency features:

$$\mathbf{F} = \mathrm{Conv2D}\big(|X(t, f)|\big) \tag{7}$$
After completing these steps, we obtain both temporal and spectral features, concatenate them, and input them into the speaker encoder to learn the embedded vector.
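The mixed time/time-frequency encoding of Fig. 2 can be sketched as follows, reusing the MultiScaleEncoder from the previous sketch. The window (20 samples) and hop (10 samples) follow the text; the Conv2D channel count and the cropping used to align frame counts are illustrative assumptions.

```python
# Rough sketch of the reference encoder combining Conv1D features with STFT features.
import torch
import torch.nn as nn

def reference_features(wav, ms_encoder, conv2d, n_fft=20, hop=10):
    # wav: (batch, samples); ms_encoder returns (concatenated features, per-scale list).
    time_feats, _ = ms_encoder(wav)                       # (batch, 3N, K), Eq. (3)
    spec = torch.stft(wav, n_fft=n_fft, hop_length=hop,
                      win_length=n_fft, return_complex=True)
    mag = spec.abs()                                      # magnitude spectrum, Eq. (6)
    tf = conv2d(mag.unsqueeze(1))                         # Conv2D features, Eq. (7)
    b, c, f, t = tf.shape
    tf = tf.reshape(b, c * f, t)                          # stack frequency into channels
    t_min = min(time_feats.shape[-1], tf.shape[-1])       # align frame counts
    return torch.cat([time_feats[..., :t_min], tf[..., :t_min]], dim=1)

# e.g., with an illustrative 2-D convolution:
# feats = reference_features(ref_wav, encoder, nn.Conv2d(1, 8, 3, padding=1))
```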
Speaker encoder
After the reference speech has been encoded into time-domain features, it must be further processed by the network to obtain a vector that can be used for extraction. The speaker encoder, depicted in blue in Fig. 1B, aims to produce an identifiable embedding vector for the speech extraction network. Unlike traditional voiceprint extraction networks that use the i-vector (Dehak et al., 2010) or x-vector (Snyder et al., 2016) to classify speakers, our speaker encoder concentrates on learning embedding vectors in the time domain that provide references for the speech extraction network.
The speech encoder encodes the reference speech and generates a high-dimensional identity feature vector V. To further enhance the features obtained by the different-scale encoders, layer normalization (LN) and a convolution are applied. The resulting high-dimensional features then undergo channel integration and cross-channel fusion, reducing the channel dimension of the concatenated representation.
To obtain a desirable embedding vector, we have developed the Spk block, which is illustrated in Fig. 3A and explained as follows.
Figure 3: The details of three blocks, (A) Spk block, (B) TCN block, (C) TD-ConvNeXt block.
'⊕' denotes the residual (skip) connection, 'Conv' and 'dConv' are convolution and depth-wise separable convolution, 'gLN' is global layer normalization (Luo & Mesgarani, 2019), 'PReLU' is the parametric rectified linear unit, 'BN' is batch normalization, 'SE' is the channel attention module, and 'Maxpool 1d' is the maximum pooling layer (Hu, Shen & Sun, 2018).

The Spk block: The Spk block first applies a depth-wise separable convolution (dConv) that processes the signal channel-wise, followed by batch normalization (BN). Subsequently, a convolution layer transforms the channel number, followed by a PReLU nonlinearity, and then another convolution layer that keeps the channel number unchanged. Next, the channel attention SE block (Hu, Shen & Sun, 2018) applies an attention weight to each channel, increasing the weight of channels that contribute positively. To prevent the loss of shallow features and the vanishing gradient problem during training, the input and output of the SE block are added through a skip connection, as shown in the figure. Finally, a max pooling layer performs downsampling, which improves the generalization of the learned embedding vector and reduces its length.
In the speaker encoder, the Spk block is used repeatedly (four times in our configuration; see Table 1). A convolutional layer then sets the output dimension to N′, and a global average pooling layer produces the identity embedding vector of length 1 and dimension N′.
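A minimal PyTorch sketch of the Spk block is given below. The channel sizes (128 and 384) and dConv kernel Q = 3 follow configuration 21 in Table 3, the SE module is our own minimal re-implementation, and the exact placement of the skip connection around the SE path follows the textual description rather than the authors' exact layout.

```python
# Minimal sketch of the Spk block of Fig. 3A (channel sizes and skip placement assumed).
import torch
import torch.nn as nn

class SEBlock1d(nn.Module):
    """Squeeze-and-excitation channel attention for 1-D features (Hu, Shen & Sun, 2018)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                        # x: (batch, C, T)
        w = self.fc(x.mean(dim=-1))              # squeeze over time, excite per channel
        return x * w.unsqueeze(-1)

class SpkBlock(nn.Module):
    def __init__(self, in_ch=128, hid_ch=384, q=3):
        super().__init__()
        self.dconv = nn.Conv1d(in_ch, in_ch, q, padding=q // 2, groups=in_ch)
        self.bn = nn.BatchNorm1d(in_ch)
        self.conv1 = nn.Conv1d(in_ch, hid_ch, 1)   # channel expansion
        self.prelu = nn.PReLU()
        self.conv2 = nn.Conv1d(hid_ch, hid_ch, 1)  # channels kept at hid_ch
        self.se = SEBlock1d(hid_ch)
        self.pool = nn.MaxPool1d(3)                # shortens the embedding sequence

    def forward(self, x):                          # x: (batch, in_ch, T)
        y = self.bn(self.dconv(x))
        y = self.conv2(self.prelu(self.conv1(y)))
        y = y + self.se(y)                         # skip connection around the SE block
        return self.pool(y)

# Stacking four Spk blocks as in Table 1 (later blocks take hid_ch as input):
spk_stack = nn.Sequential(SpkBlock(128, 384), *[SpkBlock(384, 384) for _ in range(3)])
```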
The target speaker identity embedding obtained above serves two purposes. First, it guides the speech extraction network to estimate the correct target mask. Second, the identity embedding is passed through a linear layer and softmax for speaker label prediction. Jointly training with the SI-SDR loss and the cross-entropy loss yields a better match between the identity embedding and the speaker identity, enabling the model to identify the target speaker's voice more accurately and achieve better extraction performance.
Speech extractor
In this article, we propose a novel time-domain speech extraction network for obtaining the target speaker's mask. Recently, time-domain models based on the TCN block (Bai, Kolter & Koltun, 2018) have achieved remarkable success in speech separation tasks. However, the receptive field of the dilated convolution is still limited by the convolution kernel size. The ConvNeXt block (Liu et al., 2022), which adopts larger convolution kernels, can theoretically achieve a larger receptive field. However, ConvNeXt was originally designed for image processing and still cannot meet the large-receptive-field requirement of temporal sequence data in speech signal processing. Therefore, we extend the ConvNeXt block into the TD-ConvNeXt block to make it suitable for processing time-series data.
The TD-ConvNeXt block: ConvNeXt (Liu et al., 2022), which is derived from ResNet-50 (He et al., 2016) and draws on the Swin Transformer (Liu et al., 2021), has demonstrated performance superior to the Swin Transformer. To process one-dimensional time-series data, we redesign the ConvNeXt block into the Time-Domain ConvNeXt block (TD-ConvNeXt block), illustrated in Fig. 3C. The nonlinear and normalization layers of the original ConvNeXt are GELU and LN, respectively; we replace them with PReLU and gLN.
Compared with the original ConvNeXt structure, we place the dConv convolution in the first layer, and the TD-ConvNeXt structure has fewer nonlinear and normalization layers. The dConv convolution kernel has a size of $P$ and a dilation factor, and we place gLN and PReLU between the two point-wise convolutions. This design reduces the number of parameters that need to be stored during training, making it easier to build a deeper model. TD-ConvNeXt blocks also form a stacked structure in the speech extractor, as depicted in Fig. 4B. They are stacked $B$ times, with the dilation factor of dConv gradually increasing from $2^0$ to $2^{B-1}$ to enlarge the receptive field of the module.
Figure 4: The internal structure of two modules, (A) the TCN blocks stack, (B) the TD-ConvNext blocks stack.
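For concreteness, a minimal PyTorch sketch of the TD-ConvNeXt block (Fig. 3C) and its dilated stack (Fig. 4B) is given below, assuming c = 160, d = 320 and P = 7 as in configuration 7 of Table 2; GroupNorm(1, C) is used here as a common stand-in for gLN.

```python
# Minimal sketch of the TD-ConvNeXt block and its stack (channel sizes assumed from Table 2).
import torch
import torch.nn as nn

class TDConvNeXtBlock(nn.Module):
    def __init__(self, c=160, d=320, kernel=7, dilation=1):
        super().__init__()
        pad = (kernel - 1) * dilation // 2                  # non-causal "same" padding
        self.dconv = nn.Conv1d(c, c, kernel, padding=pad,
                               dilation=dilation, groups=c)  # depth-wise conv first
        self.conv_in = nn.Conv1d(c, d, 1)                   # point-wise expansion c -> d
        self.norm = nn.GroupNorm(1, d, eps=1e-8)            # gLN between the two convs
        self.act = nn.PReLU()
        self.conv_out = nn.Conv1d(d, c, 1)                  # point-wise projection d -> c

    def forward(self, x):                                   # x: (batch, c, T)
        y = self.dconv(x)
        y = self.conv_out(self.act(self.norm(self.conv_in(y))))
        return x + y                                        # residual connection

def td_convnext_stack(B=8, c=160, d=320, kernel=7):
    # Dilation grows from 2^0 to 2^(B-1) to widen the receptive field.
    return nn.Sequential(*[TDConvNeXtBlock(c, d, kernel, dilation=2 ** i)
                           for i in range(B)])
```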
Considering the excellent performance of the TCN block in processing time series, we incorporate it as one of the components of our proposed model. The block comprises two convolution layers and a dConv layer, as illustrated in Fig. 3B. The first layer is a convolution that changes the number of channels from $a$ to $b$. A PReLU nonlinearity and global layer normalization (gLN) (Luo & Mesgarani, 2019) are then added to accelerate network convergence. The middle layer is a depth-wise separable convolution with a fixed kernel size and a dilation factor; since speech data is a long one-dimensional sequence, enlarging the receptive field helps preserve the integrity of the output speech. A further PReLU + gLN layer nonlinearizes and normalizes the intermediate features. The last convolution layer fuses the features across channels. To prevent the vanishing gradient problem during training, each module in the block is equipped with a skip connection. The TCN blocks are stacked $B$ times, as shown in Fig. 4A, which we refer to as stacked TCN blocks; the dilation factor of each block increases from $2^0$ to $2^{B-1}$.
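A matching sketch of the TCN block (Fig. 3B) and its stack (Fig. 4A) follows, assuming a = 256, b = 512 and a Conv-TasNet-style depth-wise kernel of 3; the kernel size and the GroupNorm stand-in for gLN are assumptions.

```python
# Minimal sketch of the TCN block and its stack (kernel size assumed).
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    def __init__(self, a=256, b=512, kernel=3, dilation=1):
        super().__init__()
        pad = (kernel - 1) * dilation // 2
        self.net = nn.Sequential(
            nn.Conv1d(a, b, 1),                             # 1x1 conv: a -> b channels
            nn.PReLU(), nn.GroupNorm(1, b, eps=1e-8),       # PReLU + gLN
            nn.Conv1d(b, b, kernel, padding=pad,
                      dilation=dilation, groups=b),         # dilated depth-wise conv
            nn.PReLU(), nn.GroupNorm(1, b, eps=1e-8),
            nn.Conv1d(b, a, 1))                             # 1x1 conv fuses channels back to a

    def forward(self, x):                                   # x: (batch, a, T)
        return x + self.net(x)                              # skip connection

def tcn_stack(B=8, a=256, b=512, kernel=3):
    # Dilation again grows from 2^0 to 2^(B-1) across the stacked blocks.
    return nn.Sequential(*[TCNBlock(a, b, kernel, dilation=2 ** i) for i in range(B)])
```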
Once the output vector W of the mixed speech encoder and the embedding vector of the speaker encoder are obtained, normalization is performed, followed by dimension reduction using a convolution layer, which reduces the channel dimension of the feature W to give W′. The feature contains both the target speaker and the interferers, which necessitates integrating the speaker embedding vector to guide the subsequent speech extraction. As shown in purple in Fig. 1B, the speech extractor comprises stacked TD-ConvNeXt blocks and stacked TCN blocks. The speaker embedding vector is fused with each large module to provide continuous stimulus to the speech extractor, thereby guiding the speech extraction process.
Furthermore, the embedding vector of dimension N′ is repeated along the time axis to the same length as W′, and the fused result is then input into each block. Between the stacked TD-ConvNeXt blocks and the stacked TCN blocks, a convolution is used for dimension conversion, and the fused feature is likewise fed to the first block of the stacked TCN blocks. The time-domain feature code obtained after connecting the two basic modules is processed through a combination of convolution and ReLU to obtain a mask similar to the IBM and IRM (Wang, Narayanan & Wang, 2014). Because we employ the multi-scale scheme, the mask can be denoted as $M_i$, $i \in \{1, 2, 3\}$, after training. Consequently, the final speech encoded features can be expressed as follows.
$$S_i = M_i \odot \mathbf{w}_i, \quad i \in \{1, 2, 3\} \tag{8}$$

Here, $\odot$ represents element-wise multiplication between corresponding matrix entries, and $S_i$ denotes the estimated target speaker speech coding at a given scale.
Speech decoder
The speech extraction network fuses the embedding vector from the auxiliary network with the time-domain features of the mixed speech obtained by the encoder, and applies convolution operations to obtain the speech coding of the target speaker. Finally, the speech coding is reconstructed into a time series by the decoder, yielding the reconstructed target speech. This process can be expressed as:
$$\hat{s}_i = \mathrm{deConv1D}_i(S_i), \quad i \in \{1, 2, 3\} \tag{9}$$

Here, $\hat{s}_i$ represents the reconstructed target speaker speech, where $i$ indexes the estimates at different scales. $\mathrm{deConv1D}_i$ refers to the deconvolution (transposed convolution) decoder, whose kernels correspond to the encoder kernels of lengths $L_1$, $L_2$ and $L_3$, respectively. The decoder generates estimated speech at multiple resolutions, and the estimate exhibiting the least distortion serves as the output of the model.
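Equations (8) and (9) amount to masking each scale's mixture encoding and decoding it with a transposed convolution, as in the following sketch; the filter count (N = 256), kernel lengths and module name are our assumptions.

```python
# Minimal sketch of the mask-and-decode step of Eqs. (8) and (9).
import torch
import torch.nn as nn

class MultiScaleDecoder(nn.Module):
    def __init__(self, n_filters=256, kernels=(20, 80, 160)):
        super().__init__()
        stride = kernels[0] // 2
        # One transposed convolution per scale, mirroring the encoder (Eq. 9).
        self.decoders = nn.ModuleList(
            [nn.ConvTranspose1d(n_filters, 1, k, stride=stride) for k in kernels])

    def forward(self, mixture_codes, masks):
        # mixture_codes, masks: lists of (batch, N, K) tensors, one per scale.
        estimates = []
        for dec, w_i, m_i in zip(self.decoders, mixture_codes, masks):
            s_i = m_i * w_i                        # element-wise masking, Eq. (8)
            estimates.append(dec(s_i).squeeze(1))  # waveform estimate at this scale
        return estimates
```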
Multi-task learning
Target speaker speech extraction can be viewed as a multi-task process. First, the speech features of the target in the mixed speech are extracted and reconstructed into clean speech containing only the target speaker. Second, the reference speech is trained as a feature representation of the target speaker that corresponds to the correct speaker label, thereby providing a positive incentive for extracting the target speaker's speech. The overall objective can therefore be divided into two parts: the scale-invariant signal-to-distortion ratio (SI-SDR) is used as the loss for target speech extraction, while the cross-entropy loss is employed for the speaker label. The total loss function can be represented as follows:
$$\mathcal{L} = \mathcal{L}_{\text{SI-SDR}} + \gamma \mathcal{L}_{\text{CE}} \tag{10}$$

where $\gamma$ denotes the weight factor of the cross-entropy loss. The loss functions for the two subtasks are presented separately below.
SI-SDR is a commonly used speech evaluation metric that is more robust than SDR (Roux et al., 2019). For the encoding and decoding operations in the multi-scale scheme, we use convolution kernels of different sizes, so the speech has outputs at different resolution windows. Therefore, the overall SI-SDR loss during training can be expressed as follows:
$$\mathcal{L}_{\text{SI-SDR}} = (1 - \alpha - \beta)\,\rho(\hat{s}_1, s) + \alpha\,\rho(\hat{s}_2, s) + \beta\,\rho(\hat{s}_3, s) \tag{11}$$

In Eq. (11), $\hat{s}_i$ represents the estimate of the target speech obtained under the different resolution windows after extraction, $s$ is the clean target speech, $\rho(\cdot, \cdot)$ denotes the negative SI-SDR, and $\alpha$ and $\beta$ weight the middle- and long-window terms.
The cross-entropy loss function (CE loss) is commonly employed for classification tasks, and can be represented as follows.
$$\mathcal{L}_{\text{CE}} = -\sum_{n=1}^{N_s} y_n \log \hat{y}_n \tag{12}$$

where $N_s$ is the number of speakers, $y_n$ is the true (one-hot) speaker label, and $\hat{y}_n$ is the estimated predicted probability.
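The complete objective of Eqs. (10) to (12) can be sketched as follows, using the SpEx+-style weighting $(1 - \alpha - \beta)$ for the short window and the values $\alpha = \beta = 0.1$, $\gamma = 10$ from Table 1; the SI-SDR normalization details are standard choices rather than the authors' exact code.

```python
# Minimal sketch of the multi-task loss (weights taken from Table 1).
import torch
import torch.nn.functional as F

def si_sdr_loss(estimate, target, eps=1e-8):
    # Zero-mean both signals, project the estimate onto the target and return
    # the negative SI-SDR in dB (so that lower means better).
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    proj = (torch.sum(estimate * target, dim=-1, keepdim=True)
            / (torch.sum(target ** 2, dim=-1, keepdim=True) + eps)) * target
    noise = estimate - proj
    ratio = torch.sum(proj ** 2, dim=-1) / (torch.sum(noise ** 2, dim=-1) + eps)
    return -10.0 * torch.log10(ratio + eps).mean()

def total_loss(est_scales, target, spk_logits, spk_labels,
               alpha=0.1, beta=0.1, gamma=10.0):
    s1, s2, s3 = est_scales                    # estimates from short/middle/long windows
    extraction = ((1 - alpha - beta) * si_sdr_loss(s1, target)
                  + alpha * si_sdr_loss(s2, target)
                  + beta * si_sdr_loss(s3, target))        # Eq. (11)
    speaker = F.cross_entropy(spk_logits, spk_labels)      # Eq. (12)
    return extraction + gamma * speaker                    # Eq. (10)
```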
Experiment methodology
In this section, we conduct numerous experiments to validate the proposed scheme.
The data set
Our experiments use the 100-h LibriSpeech dataset (Panayotov et al., 2015), which comprises 251 speakers and approximately 28,000 clean sentences for the training and validation sets. The test set consists of 40 speakers and 2,620 clean sentences. All original sentences are sampled at 16 kHz, and each utterance lasts approximately 10 s.
To simulate real-world environments, we used the WHAM! (Wichern et al., 2019) noisy dataset as ambient background noise. We randomly selected a clean sentence without background noise from two different speakers. We chose the first speaker as the target source and selected another sentence from the target speaker’s speech as the reference speech. The other speakers were regarded as the interference sources.
The mixed speech was generated by direct stacking. First, environmental noise was mixed with the speech of the other speaker. Then, the result was randomly combined with the target speaker's speech at signal-to-noise ratios (SNRs) of −5, 0, and 5 dB. A total of 20,000 utterances were generated as the training set, with 5,000 selected for the validation set. The test set uses the same three SNRs, and each test utterance is a mixture generated from LibriSpeech and WHAM!. The reference sentence was randomly selected from the target speaker's other utterances and never coincides with the target speaker's utterance in the mixture.
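As a rough illustration of this mixing procedure, the sketch below first stacks the noise onto the interfering speech and then scales the result so that the target-to-interference ratio equals the chosen SNR; the scaling formula is a standard one and is not taken from the authors' data-generation scripts.

```python
# Rough sketch of SNR-controlled mixture generation (assumed scaling convention).
import numpy as np

def mix_at_snr(target, interferer, noise, snr_db):
    # target, interferer and noise are 1-D float arrays trimmed to the same length.
    interference = interferer + noise            # noise stacked onto the interfering speech
    p_target = np.mean(target ** 2)
    p_interf = np.mean(interference ** 2) + 1e-8
    # Scale the interference so that 10*log10(p_target / p_scaled) == snr_db.
    scale = np.sqrt(p_target / (p_interf * 10 ** (snr_db / 10)))
    return target + scale * interference

# e.g., training mixtures at -5, 0 and 5 dB:
# mixture = mix_at_snr(target_wav, interferer_wav, wham_noise, snr_db=0)
```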
Experiment environment
The hardware used for training was an NVIDIA Tesla V100 graphics card with 32 GB of video memory. For the software environment, we used PyTorch 1.10, CUDA 10.2, and Python 3.7.6. The main parameters used in model training are presented in Tables 1–3. The optimizer was Adam (Kingma & Ba, 2014) with a batch size of 10; the initial learning rate and other main parameters are listed in Table 1. The initial $L_1$ was set to 20 samples, representing a time window of 2.5 ms; the original sampling rate of the dataset is 16 kHz, which was resampled to 8 kHz in the experiments. Similarly, the initial window lengths corresponding to $L_2$ and $L_3$ were 10 and 20 ms, respectively. B denotes the number of times the small blocks are stacked into a large block; for instance, the TCN block was stacked eight times into stacked TCN blocks, with the dilation factor of the dConv kernel increasing from $2^0$ to $2^7$. The number of Spk blocks was set to four, and the speech sample length was four seconds. $\alpha$ and $\beta$ denote the weights of the SI-SDR loss under the middle and long window lengths, which were preset to 0.1, and $\gamma$ is the weight factor of the cross-entropy loss, set to 10.
| Parameters | Value | 
|---|---|
| Initial learning rate | |
| Epoch | 100 | 
| Batch size | 10 | 
| $L_1$ (short kernel length) | 20 | 
| $L_2$ (middle kernel length) | 80 | 
| $L_3$ (long kernel length) | 160 | 
| B (number of blocks in a stack) | 8 | 
| Number of Spk blocks | 4 | 
| Sampling rate | 8,000 Hz | 
| Sampling length | 4 s | 
| $\alpha$ (weight of the middle window in the loss) | 0.1 | 
| $\beta$ (weight of the long window in the loss) | 0.1 | 
| $\gamma$ (weight factor of the cross-entropy loss) | 10 | 
| Config. | c (TD-ConvNeXt) | d (TD-ConvNeXt) | P (dConv kernel) | TD-ConvNeXt stacks | a (TCN) | b (TCN) | TCN stacks | No. of parameters | SI-SDR at −5 dB | SI-SDR at 0 dB | SI-SDR at 5 dB | Avg. SI-SDR (dB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SpEx+ | – | – | – | – | 256 | 512 | 4 | 11.2M | 8.31 | 11.02 | 14.40 | 11.24 | 
| 1 | 96 | 192 | 7 | 2 | 256 | 512 | 2 | 7.2M | 8.04 | 10.88 | 14.22 | 11.05 | 
| 2 | 128 | 256 | 7 | 2 | 256 | 512 | 2 | 7.7M | 7.81 | 10.48 | 13.64 | 10.64 | 
| 3 | 160 | 320 | 7 | 2 | 256 | 512 | 2 | 8.4M | 7.79 | 10.43 | 13.63 | 10.62 | 
| 4 | 192 | 384 | 7 | 2 | 256 | 512 | 2 | 9.2M | 8.35 | 11.14 | 14.38 | 11.29 | 
| 5 | 96 | 192 | 7 | 2 | 256 | 512 | 3 | 9.4M | 8.34 | 11.18 | 14.47 | 11.33 | 
| 6 | 128 | 256 | 7 | 2 | 256 | 512 | 3 | 10M | 8.08 | 10.81 | 14.06 | 10.98 | 
| 7 | 160 | 320 | 7 | 2 | 256 | 512 | 3 | 10.7M | 8.60 | 11.47 | 14.80 | 11.62 | 
| 8 | 192 | 384 | 7 | 2 | 256 | 512 | 3 | 11.M | 8.54 | 11.29 | 14.48 | 11.44 | 
| 9 | 96 | 192 | 7 | 4 | 256 | 512 | 3 | 10.1M | 8.29 | 11.04 | 14.30 | 11.21 | 
| 10 | 128 | 256 | 7 | 4 | 256 | 512 | 3 | 11.2M | 8.35 | 11.16 | 14.45 | 11.32 | 
| 11 | 160 | 320 | 7 | 4 | 256 | 512 | 3 | 12.5M | 8.24 | 11.14 | 14.61 | 11.33 | 
| 12 | 192 | 384 | 7 | 4 | 256 | 512 | 3 | 14.1M | 8.20 | 11.13 | 14.50 | 11.28 | 
| 13 | 160 | 320 | 3 | 2 | 256 | 512 | 3 | 10.7M | 8.31 | 11.08 | 14.32 | 11.24 | 
| 14 | 160 | 320 | 11 | 2 | 256 | 512 | 3 | 10.7M | 8.58 | 11.38 | 14.63 | 11.53 | 
| 15 | 160 | 320 | 7 | 2 | 256 | 256 | 3 | 7.3M | 7.54 | 10.37 | 13.82 | 10.58 | 
| Config. | Spk block channels (in) | Spk block channels (expanded) | Q (dConv kernel) | No. of parameters | SI-SDR at −5 dB | SI-SDR at 0 dB | SI-SDR at 5 dB | Avg. SI-SDR (dB) |
|---|---|---|---|---|---|---|---|---|
| SpEx+ | – | – | – | 11.2M | 8.31 | 11.02 | 14.40 | 11.24 | 
| 7 | – | – | – | 10.7M | 8.60 | 11.47 | 14.80 | 11.62 | 
| 18 | 96 | 192 | 3 | 9.4M | 8.62 | 11.32 | 14.52 | 11.49 | 
| 19 | 96 | 384 | 3 | 9.6M | 8.20 | 11.15 | 14.48 | 11.28 | 
| 20 | 128 | 256 | 3 | 9.6M | 8.42 | 11.22 | 14.50 | 11.38 | 
| 21 | 128 | 384 | 3 | 9.7M | 8.83 | 11.58 | 14.88 | 11.76 | 
| 22 | 192 | 384 | 3 | 10.0M | 8.47 | 11.16 | 14.43 | 11.35 | 
| 23 | 128 | 384 | 7 | 9.7M | 8.45 | 11.21 | 14.34 | 11.33 | 
| 24 | 128 | 384 | 11 | 9.7M | 8.45 | 11.25 | 14.51 | 11.40 | 
Results
This section aims to verify the effectiveness of the proposed model, starting from the benchmark model SpEx+ (Ge et al., 2020). The reference model uses stacked TCN blocks to process time-domain signals. However, in our experiments, we found that a combination of TCN block and TD-ConvNeXt block stacking can produce better results.
To evaluate the effectiveness of the proposed model, three speech quality evaluation metrics are used: SI-SDR (Roux et al., 2019), PESQ (Rix et al., 2001) and STOI (Taal et al., 2010). SI-SDR measures the waveform similarity between the reconstructed speech and the original speech, with higher values indicating better restoration. PESQ is an objective speech quality index ranging from −0.5 to 4.5, with higher values indicating better quality. STOI measures short-time speech intelligibility, ranging from 0 to 1, with higher values indicating better clarity.
The influence of different configuration of speech extractor on model performance
The speech extractors comprise different stacking patterns of TD-ConvNeXt blocks and TCN blocks, and the number of modules processed has a significant impact on speech extraction performance. This section mainly investigates the influence of different components of the speech extractor on the model’s performance through experiments, aiming to obtain the best configuration of the speech extractor.
Table 2 presents the SI-SDR results of different model structures and the reference model in an environment with two-person speech mixing and noise. We created the speech extraction model using stacks of different combinations of TD-ConvNeXt blocks and TCN blocks, while keeping the speaker encoder unchanged. It can be observed from Table 2 that configuration 7 has an average SI-SDR that is 0.38 dB higher than that of the benchmark model. In model configurations 9–12, when the number of TD-ConvNeXt block stacks increased from 2 to 4, the system performance leveled off and slightly decreased compared with the 2-stack setting. Therefore, the structure of the speech extractor is based on configuration 7, with the parameters c and d set to 160 and 320, respectively, which correspond to the numbers of convolutional channels of the TD-ConvNeXt block in Fig. 3.
Based on the results presented in Table 2, the best performance is achieved when the dConv kernel size is 7; larger convolution kernels slightly reduce the model's performance. Compared with model 7, the SI-SDR of model 14 is approximately 0.1 dB lower, at the cost of greater computational complexity.
In addition, we changed the configuration parameters a and b of the TCN blocks. However, as seen in model structure 15 in the table, compared with the optimal structure 7, reducing b lowers the number of parameters but the corresponding SI-SDR also decreases. Therefore, in this experiment we keep a = 256 and b = 512 as the optimal configuration parameters.
The influence of different configurations of the speaker encoder
Previous works have mainly focused on the design of the speech extractor, while neglecting the role of the speaker encoder (embedding vector). In this section, based on the optimal speech extractor obtained in the previous experiment, we modify the structure of the speaker encoder to investigate the impact of different configurations on the overall model, aiming to obtain the configuration structure of the speaker encoder with the best performance.
Table 3 presents the results of these experiments, building on the best speech extractor structure obtained in the previous section (configuration 7 in Table 2). The table shows that the models with the Spk block outperform the benchmark model, SpEx+. Moreover, by fine-tuning the channel parameters and the size of the dConv convolution kernel in the Spk block, the average SI-SDR of model 21 is 0.14 dB higher than that of model 7, indicating that under the same conditions the system achieves better performance and stronger interference resistance, i.e., the extracted speech is closer to the real speech.
Effect of encoder window length on model performance
Multi-scale encoders use multiple convolution kernel lengths. The encoded and decoded speech at different lengths carries information at different scales and resolutions, and the fine-grained details of speech at high resolution strongly affect the human auditory experience. Therefore, in addition to SI-SDR, we add PESQ (perceptual speech quality) and STOI (short-time intelligibility) to our evaluation metrics, and we test different combinations of window lengths to verify the model's performance.
Table 4 presents the experimental results for different encoder window lengths. Compared with model 21, model 25 additionally uses a codec with a window length of 40 (windows of 20, 40 and 80 instead of 20, 80 and 160): the SI-SDR slightly improves, the PESQ values are similar, and the STOI improves significantly. Models 25–27 indicate that a higher STOI can be achieved with more short-window codecs.
| Model | $L_1$ | $L_2$ | $L_3$ | SI-SDR at −5 dB | SI-SDR at 0 dB | SI-SDR at 5 dB | PESQ at −5 dB | PESQ at 0 dB | PESQ at 5 dB | STOI at −5 dB | STOI at 0 dB | STOI at 5 dB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 21 | 20 | 80 | 160 | 8.83 | 11.58 | 14.88 | 2.28 | 2.62 | 3.03 | 0.835 | 0.878 | 0.917 | 
| 25 | 20 | 40 | 80 | 8.87 | 11.64 | 14.83 | 2.29 | 2.62 | 3.03 | 0.853 | 0.897 | 0.934 | 
| 26 | 20 | 40 | 160 | 8.45 | 11.35 | 14.53 | 2.26 | 2.61 | 3.01 | 0.846 | 0.892 | 0.928 | 
| 27 | 40 | 80 | 160 | 7.70 | 10.53 | 13.90 | 2.16 | 2.51 | 2.93 | 0.803 | 0.851 | 0.893 | 
The impact of the weight setting of the loss function on the performance of the model
The multi-scale speech extraction model used in this article faces a weight-assignment problem when computing the loss. In the previous experiments, we preset $\alpha = 0.1$ and $\beta = 0.1$, resulting in a weight of 0.8 for the small scale with a window length of 20 and 0.1 each for the middle and long scales. In this section, we fine-tune the loss weights $\alpha$ and $\beta$ to investigate the impact of the weight settings on the model's performance.
Table 5 presents the experimental results based on the best-performing model 25 from the previous section. Models 28 to 30 assign different weights to model 25. The results show that the SI-SDR and STOI of model 25 are higher than those of models 28 to 30. Therefore, we keep $\alpha = \beta = 0.1$ as the loss weight setting.
| Model | $\alpha$ | $\beta$ | SI-SDR at −5 dB | SI-SDR at 0 dB | SI-SDR at 5 dB | PESQ at −5 dB | PESQ at 0 dB | PESQ at 5 dB | STOI at −5 dB | STOI at 0 dB | STOI at 5 dB |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 25 | 0.1 | 0.1 | 8.87 | 11.64 | 14.83 | 2.29 | 2.62 | 3.03 | 0.853 | 0.897 | 0.934 | 
| 28 | 0.2 | 0.1 | 8.46 | 11.22 | 14.41 | 2.29 | 2.64 | 3.04 | 0.824 | 0.866 | 0.904 | 
| 29 | 0.1 | 0.2 | 8.66 | 11.48 | 14.71 | 2.27 | 2.62 | 3.03 | 0.850 | 0.893 | 0.930 | 
| 30 | 0.2 | 0.2 | 8.77 | 11.52 | 14.79 | 2.27 | 2.62 | 3.02 | 0.852 | 0.895 | 0.932 | 
Comparison of the proposed network with the state of the art
Through the experiments described in the previous sections, we obtained the current best-performing network model 25, which we named TDNext. In this section, we reproduced five baseline models, SpEx+ (Ge et al., 2020), SpEx (Xu et al., 2020), TD-SpeakerBeam (Delcroix et al., 2020), SpeakerBeam (Delcroix et al., 2018) and sDPCCN (Han et al., 2022), and compared TDNext with these five baselines using the three evaluation indicators of SI-SDR, PESQ, and STOI to verify the effectiveness of our proposed algorithm.
From Table 6, it can be seen that TDNext outperforms all the baselines in the SI-SDR and STOI metrics at the same SNR levels. These experimental results demonstrate the effectiveness of TDNext in the target speech extraction task.
| SNR (dB) | Models | Domain | SI-SDR (dB) | PESQ | STOI | 
|---|---|---|---|---|---|
| −5 | sDPCCN (Han et al., 2022) | TFD | 6.36 | 1.89 | 0.778 | 
| −5 | SpEx+ (Ge et al., 2020) | TD | 8.31 | 2.28 | 0.841 | 
| −5 | SpEx (Xu et al., 2020) | TD | 7.73 | 2.15 | 0.837 | 
| −5 | TD-SpeakerBeam (Delcroix et al., 2020) | TD | 7.34 | 2.08 | 0.831 | 
| −5 | SpeakerBeam (Delcroix et al., 2018) | FD | 6.02 | 1.86 | 0.768 | 
| −5 | TDNext (Our model) | TD | 8.87 | 2.29 | 0.853 | 
| 0 | sDPCCN (Han et al., 2022) | TFD | 9.68 | 2.20 | 0.854 | 
| 0 | SpEx+ (Ge et al., 2020) | TD | 11.02 | 2.62 | 0.884 | 
| 0 | SpEx (Xu et al., 2020) | TD | 10.38 | 2.49 | 0.878 | 
| 0 | TD-SpeakerBeam (Delcroix et al., 2020) | TD | 9.87 | 2.4 | 0.865 | 
| 0 | SpeakerBeam (Delcroix et al., 2018) | FD | 9.25 | 2.18 | 0.843 | 
| 0 | TDNext (Our model) | TD | 11.64 | 2.62 | 0.897 | 
| 5 | sDPCCN (Han et al., 2022) | TFD | 13.56 | 2.60 | 0.909 | 
| 5 | SpEx+ (Ge et al., 2020) | TD | 14.40 | 3.03 | 0.925 | 
| 5 | SpEx (Xu et al., 2020) | TD | 13.51 | 2.89 | 0.912 | 
| 5 | TD-SpeakerBeam (Delcroix et al., 2020) | TD | 12.86 | 2.77 | 0.903 | 
| 5 | SpeakerBeam (Delcroix et al., 2018) | FD | 13.02 | 2.54 | 0.897 | 
| 5 | TDNext (Our model) | TD | 14.83 | 3.03 | 0.934 | 
Note:
TD/TFD/FD represents Time domain, Time Frequency domain, and Frequency domain respectively.
To visually compare the effectiveness of different models in extracting speech, Fig. 5 shows typical spectrograms of the original mixed speech, the clean speech, and the speech extracted by our proposed model and the comparison models. In the clean speech, there is a high-frequency blank region without speech energy at the beginning, whereas interfering speech energy appears in this region in the spectrogram of the mixed speech. The SpEx+ model and our proposed model perform better, leaving clear blank areas in the silent region and achieving a good noise-reduction effect. Likewise, the clean speech shows clear boundaries between sentences in the middle blank region, and only our model reproduces this separation with distinct boundaries between sentences, which may lead to clearer auditory perception and better speech understanding, as demonstrated in the spectrogram.
Figure 5: Model extraction speech time-frequency diagram, (A) mixed voice, (B) clean voice, (C) speech time-frequency diagram extracted by sDPCCN (Han et al., 2022), (D) speech time-frequency diagram extracted by SpEx+ (Ge et al., 2020), (E) speech time-frequency diagram extracted by TDNext.
Ablation experiment
To further verify the effectiveness of the TD-ConvNeXt block proposed in this article, we conducted an ablation experiment in which the TD-ConvNeXt block and, separately, its dilation factor were removed. The results are presented in Table 7.
| Models | SI-SDR at −5 dB | SI-SDR at 0 dB | SI-SDR at 5 dB | PESQ at −5 dB | PESQ at 0 dB | PESQ at 5 dB | STOI at −5 dB | STOI at 0 dB | STOI at 5 dB |
|---|---|---|---|---|---|---|---|---|---|
| TDNext | 8.87 | 11.64 | 14.83 | 2.29 | 2.62 | 3.03 | 0.853 | 0.897 | 0.934 | 
| W/O D.Conv. | 8.37 | 11.07 | 14.23 | 2.23 | 2.57 | 3.01 | 0.848 | 0.887 | 0.932 | 
| W/O TD. | 8.03 | 10.62 | 13.66 | 2.18 | 2.54 | 2.97 | 0.838 | 0.879 | 0.915 | 
Table 7 shows that removing the dilation factor from the TD-ConvNeXt blocks of the whole network, i.e., reverting to ordinary convolutions (W/O D.Conv), lowers all three indicators. Removing the TD-ConvNeXt blocks entirely (W/O TD) lowers them further. The progressively larger drop confirms the performance contribution of the proposed TD-ConvNeXt block to the whole network.
Related work
In recent years, deep learning has demonstrated enormous potential in speech extraction, leading to the adoption of deep learning techniques in almost all speech extraction methods. Therefore, this article will only focus on summarizing the deep learning-based methods that have been developed in recent years for speech extraction.
Unlike speech separation, speech extraction does not require knowing the number of speakers; the output is a single source signal (Choi et al., 2005). In 2017, Zmolikova et al. (2017) proposed the SpeakerBeam model, which uses the speaker's voiceprint information as an embedding so that the extraction network focuses only on a single speaker. This was the first deep learning model to use embedding vectors to guide target speech extraction, and most subsequent work follows similar ideas. Building on SpeakerBeam, Xu et al. (2019) propose an optimized reconstruction spectrum loss function that introduces a dynamic information error and accounts for changes in time and amplitude. Wang et al. (2018) proposed the VoiceFilter model, which concatenates the speaker's voiceprint information with the mixed speech in the separation network to improve the performance of speech recognition systems. By streamlining the long short-term memory (LSTM) network to speed up training and reduce computation, VoiceFilter-Lite was then proposed (Wang et al., 2020), making on-device operation possible and greatly reducing computational complexity. In 2021, the same team proposed an attention-based speaker embedding technique (Rikhye et al., 2021) that can extract the speech of multiple speakers simultaneously.
For speech encoding, convolutional encoding operations can also be used directly in place of the STFT, avoiding the phase estimation problem. Delcroix et al. (2020) propose the Time-Domain SpeakerBeam (TD-SpeakerBeam), which replaces the STFT with a learnable encoder; compared with previous work, the quality of the extracted target speech improves significantly. SpEx (Xu et al., 2020) proposes an end-to-end time-domain speech extraction network based on Conv-TasNet, using a multi-scale encoder to obtain features at different scales. Building on this work, the authors proposed the SpEx+ (Ge et al., 2020) model, arguing that the mismatch between the speaker encoding and the mixed speech encoding in SpEx degrades performance; a shared-weight encoder is therefore used to map the reference speech and the mixed speech into a similar space, and its effectiveness is verified experimentally. Shortly afterwards, the same team proposed SpEx++ (Ge et al., 2021), arguing that relying on pre-registered speech as the reference is unrealistic in practical environments. Table 8 compares deep learning-based speech extraction methods.
| Models | Architecture | Supervision method | Speaker embedding | Domain focus | 
|---|---|---|---|---|
| SpeakerBeam (Zmolikova et al., 2017) | Extraction network with speaker voiceprint as embedding | Magnitude spectrum reconstruction loss | Direct voiceprint embedding | General extraction | 
| SBF-MTSAL (Xu et al., 2019) | Optimized network with dynamic time-amplitude error loss | Temporal Spectrum Approximation (TSA) loss | Voiceprint embedding with dynamic error term | Loss optimization | 
| VoiceFilter (Wang et al., 2018) | Separation network with voiceprint-mixed speech concatenation | Speaker-conditioned spectrogram masking | Voiceprint concatenated with mixed speech | ASR integration | 
| VoiceFilter-lite (Wang et al., 2020) | LSTM-optimized model for real-time inference | Spectrogram masking with reduced computation | Voiceprint integration for streaming | Mobile/on-device | 
| Multi-user VoiceFilter-Lite (Rikhye et al., 2021) | Attention-based speaker embedding for multi-speaker extraction | Multi-speaker attentive masking | Attentive embedding for simultaneous extraction | Multi-speaker extraction | 
| TD-SpeakerBeam (Delcroix et al., 2020) | Time-domain model with learnable encoder replacing STFT | Time-domain reconstruction loss | Speaker-aware embedding in raw audio domain | Time-domain processing | 
| SpEx (Xu et al., 2020) | Conv-TasNet-based multi-scale time-domain encoder | End-to-end time-domain reconstruction loss | Reference speech encoding for speaker cues | Multi-scale features | 
| SpEx+ (Ge et al., 2020) | Shared weight encoder for reference and mixed speech | Time-domain loss with feature space alignment | Shared encoder for consistent embedding | Embedding alignment | 
| SpEx++ (Ge et al., 2021) | Multi-stage extraction with utterance and frame-level reference signals | Enhanced supervision for real-world scenarios | Utterance/frame-level references | Real-world flexibility | 
Additionally, some works use speaker image information as embedding vectors to extract speech. Pan, Ge & Li (2022) advocate using auxiliary information such as images of the speaker's lip movements as a reference for the speaker. Sato et al. (2021) use the speaker's image and speech information jointly as the reference, which avoids failures when the speaker's face is occluded in practical applications and achieves more robust results.
In recent years, speech-aware objective functions like PESQNet (Xu, Strake & Fingscheidt, 2022) and MOSNet (Lo et al., 2019) have emerged in the field of speech processing. PESQNet is an end-to-end deep neural network that estimates PESQ scores of enhanced speech signals, allowing reference-free loss for real data. It can be trained in a weakly supervised manner, combining with denoising and dereverberation loss terms. MOSNet is a speech quality evaluation model based on a deep neural network. It takes speech signals as input and outputs a mean opinion score (MOS) reflecting human perception of speech quality. It can evaluate speech quality under various conditions and has been widely applied in speech enhancement and recognition. These speech-aware objective functions optimize speech enhancement models and other systems based on human auditory perception, improving speech quality and intelligibility. Compared to traditional loss functions, they better reflect human speech perception. In the future, speech-aware objective functions are expected to become more sophisticated and accurate, providing stronger support for speech processing technologies.
Compared with traditional speech extraction systems, deep learning systems achieve higher extraction accuracy. As the demand for speech extraction technology grows, single-channel speech extraction systems face new challenges, such as improving real-time performance and stability, selectively extracting the speech of multiple people at the same time, and extracting accurately in reverberant and low signal-to-noise-ratio environments; these require further research and exploration.
Conclusions
In this article, we address the issue of the traditional ConvNeXt model’s inability to adapt to the time-series characteristics of speech signals in single-channel target speech extraction. We extend the ConvNeXt structure to TD-ConvNeXt, and mix TCN and TD-ConvNeXt structures to estimate the mask of the target speech. In the auxiliary network, we design a new Spk block model as embedding vectors to stimulate the main extraction network. We utilize STFT in the reference speech encoder to enhance the robustness of the model. Extensive experiments are conducted to verify the significant single-channel target speech extraction performance improvements of our proposed model.