Leveraging Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT) Fusion for Accurate Accent Classification in English
Abstract
This study presents the development of a reliable language-identification system using deep-learning techniques. The primary goal was to extract mel-frequency cepstral coefficients (MFCCs) from audio files and use a transformer-based model to classify the languages. The dataset includes audio recordings in five languages: Spanish, Arabic, Dutch, Pashto, and Russian. The audio files of each language were processed separately to extract relevant features, avoiding the drawbacks of merging multiple files into a single dataset. These features were standardized and used to train a transformer model that handles sequential data and is particularly well suited to audio signals. The model presented in this study was subjected to extensive training and testing, yielding a significant increase in accuracy over traditional methods. Early stopping was used to avoid overfitting, ensuring that the model preserved its generalization ability across a wide range of audio examples. The model was then evaluated using metrics such as confusion matrices and classification reports, which showed that it could reliably distinguish among the five target languages. This study further advances research on language processing by offering a solution that is both efficient and scalable, with potential applications in real-time communication, language-identification systems, and multilingual interfaces.
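The feature-extraction and preprocessing steps summarized above can be sketched as follows. This is a minimal illustration, assuming librosa for MFCC extraction and scikit-learn for standardization; the sample rate, frame length, and the `audio_paths` file index are hypothetical placeholders rather than the study's actual configuration.

```python
# Minimal sketch of the described pipeline: per-language MFCC extraction,
# fixed-length padding, and feature standardization. Assumes librosa and
# scikit-learn; `audio_paths` is a hypothetical {language: [file, ...]} index.
import numpy as np
import librosa
from sklearn.preprocessing import StandardScaler

def extract_mfcc(path, n_mfcc=13, max_frames=200):
    """Load one audio file and return a fixed-length MFCC matrix."""
    signal, sr = librosa.load(path, sr=16000)              # resample to 16 kHz
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    # Pad or truncate along the time axis so every example has the same shape.
    if mfcc.shape[1] < max_frames:
        mfcc = np.pad(mfcc, ((0, 0), (0, max_frames - mfcc.shape[1])))
    return mfcc[:, :max_frames].T                          # (frames, n_mfcc)

# Each language's files are processed separately, as described above.
languages = ["spanish", "arabic", "dutch", "pashto", "russian"]
features, labels = [], []
for idx, lang in enumerate(languages):
    for path in audio_paths[lang]:                         # hypothetical index
        features.append(extract_mfcc(path))
        labels.append(idx)

X = np.stack(features)                                     # (n, frames, n_mfcc)
y = np.array(labels)

# Standardize features before they are fed to the transformer classifier.
scaler = StandardScaler()
X = scaler.fit_transform(X.reshape(len(X), -1)).reshape(X.shape)
```

The early stopping mentioned above would typically be wired in at training time via a standard callback (for example, Keras's `EarlyStopping` with `restore_best_weights=True`), halting training once validation loss stops improving.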