PeerJ Computer Science: Computational Linguistics
https://peerj.com/articles/index.atom?journal=cs&subject=8800
Computational Linguistics articles published in PeerJ Computer Science

CuentosIE: can a chatbot about “tales with a message” help to teach emotional intelligence?
https://peerj.com/articles/cs-1866
2024-02-29
Antonio Ferrández, Rocío Lavigne-Cerván, Jesús Peral, Ignasi Navarro-Soria, Ángel Lloret, David Gil, Carmen Rocamora
In this article, we present CuentosIE (TalesEI: a chatbot of tales with a message to develop Emotional Intelligence), an educational chatbot on emotions that also provides teachers and psychologists with a tool to monitor their students/patients through indicators and data compiled by CuentosIE. The use of “tales with a message” is justified by their simplicity and ease of understanding, thanks to their morals or associated metaphors. The main contributions of CuentosIE are the selection, collection, and classification of a set of highly specialized tales, as well as the provision of tools (searching, reading comprehension, chatting, recommending, and classifying) that are useful both for educating users about emotions and for monitoring their emotional development. A preliminary evaluation of the tool yielded encouraging results, providing an affirmative answer to the question posed in the title of the article.
A bilingual benchmark for evaluating large language models
https://peerj.com/articles/cs-1893
2024-02-29
Mohamed Alkaoud
This work introduces a new benchmark for the bilingual evaluation of large language models (LLMs) in English and Arabic. While LLMs have transformed various fields, their evaluation in Arabic remains limited. This work addresses this gap by proposing a novel evaluation method for LLMs in both Arabic and English, allowing for a direct comparison of performance across the two languages. We build a new evaluation dataset based on the General Aptitude Test (GAT), a standardized test widely used for university admissions in the Arab world, which we use to measure the linguistic capabilities of LLMs. We conduct several experiments to examine the linguistic capabilities of ChatGPT and quantify how much better it is at English than Arabic. We also examine the effect of changing task descriptions from Arabic to English and vice versa. In addition, we find that fastText can surpass ChatGPT in finding Arabic word analogies. We conclude by showing that GPT-4’s Arabic linguistic capabilities are much better than ChatGPT’s and are close to ChatGPT’s English capabilities.
Special issue on analysis and mining of social media data
https://peerj.com/articles/cs-1909
2024-02-29
Arkaitz Zubiaga, Paolo Rosso
This Editorial introduces the PeerJ Computer Science Special Issue on Analysis and Mining of Social Media Data. The special issue called for submissions with a primary focus on the use of social media data, for a variety of fields including natural language processing, computational social science, data mining, information retrieval and recommender systems. Of the 48 abstract submissions that were deemed within the scope of the special issue and were invited to submit a full article, 17 were ultimately accepted. These included a diverse set of articles covering, inter alia, sentiment analysis, detection and mitigation of online harms, analytical studies focused on societal issues and analysis of images surrounding news. The articles primarily use Twitter, Facebook and Reddit as data sources; English, Arabic, Italian, Russian, Indonesian and Javanese as languages; and over a third of the articles revolve around COVID-19 as the main topic of study. This article discusses the motivation for launching such a special issue and provides an overview of the articles published in the issue.
A neural machine translation method based on split graph convolutional self-attention encoding
https://peerj.com/articles/cs-1886
2024-02-28
Fei Wan, Ping Li
With the continuous advancement of deep learning technologies, neural machine translation (NMT) has emerged as a powerful tool for enhancing communication efficiency among members of cross-language collaborative teams. Among the available approaches, leveraging syntactic dependency relations to improve translation performance has become a pivotal research direction. However, current studies often lack in-depth consideration of non-Euclidean spaces when exploring inter-word correlations and fail to effectively address the model complexity arising from dependency relation encoding. To address these issues, we propose a novel approach based on split graph convolutional self-attention encoding (SGSE), aiming to utilize syntactic dependency relationships more comprehensively while reducing model complexity. Specifically, we first extract syntactic dependency relations from the source language and construct a syntactic dependency graph in a non-Euclidean space. We then devise split self-attention networks and syntactic semantic self-attention networks, integrating them into a unified model. Experiments conducted on multiple standard datasets, as well as on datasets covering team-collaboration and enterprise-management scenarios, show that the proposed method significantly enhances translation performance while effectively reducing model complexity. The approach thus has the potential to improve communication among cross-language team members and, in turn, collaborative efficiency.
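The core idea in this abstract, restricting self-attention to the edges of a syntactic dependency graph, can be sketched minimally. This is not the paper's SGSE encoder (which combines graph convolutions with split attention networks); the attention scores and adjacency matrix below are invented purely for illustration:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dependency_masked_attention(scores, adj):
    """Keep attention only along syntactic dependency edges.

    scores[i][j] is the raw attention score from token i to token j;
    adj[i][j] is 1 if tokens i and j are linked in the dependency graph
    (each token also attends to itself). Masked positions are set to
    -inf so they receive zero weight after the softmax."""
    n = len(scores)
    return [
        softmax([scores[i][j] if adj[i][j] else float("-inf") for j in range(n)])
        for i in range(n)
    ]

# Toy three-token sentence: token 1 is the head of tokens 0 and 2.
adj = [[1, 1, 0],
       [1, 1, 1],
       [0, 1, 1]]
scores = [[0.5, 1.0, 2.0],
          [1.0, 0.2, 0.3],
          [2.0, 0.1, 0.4]]
weights = dependency_masked_attention(scores, adj)
# weights[0][2] and weights[2][0] are exactly 0: no dependency edge.
```

Masked positions receive zero weight, so each token distributes its attention only over its syntactic neighbours rather than over every token in the sentence.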
exKidneyBERT: a language model for kidney transplant pathology reports and the crucial role of extended vocabularies
https://peerj.com/articles/cs-1888
2024-02-28
Tiancheng Yang, Ilia Sucholutsky, Kuang-Yu Jen, Matthias Schonlau
Background
Pathology reports contain key information about the patient’s diagnosis as well as important gross and microscopic findings. These information-rich clinical reports offer an invaluable resource for clinical studies, but data extraction and analysis from such unstructured texts is often manual and tedious. While neural information retrieval systems (typically implemented as deep learning methods for natural language processing) are automatic and flexible, they typically require a large domain-specific text corpus for training, making them infeasible for many medical subdomains. Thus, an automated data extraction method for pathology reports that does not require a large training corpus would be of significant value and utility.
Objective
To develop a language model-based neural information retrieval system that can be trained on small datasets and validate it by training it on renal transplant-pathology reports to extract relevant information for two predefined questions: (1) “What kind of rejection does the patient show?”; (2) “What is the grade of interstitial fibrosis and tubular atrophy (IFTA)?”
Methods
Kidney BERT was developed by pre-training Clinical BERT on 3.4K renal transplant pathology reports comprising 1.5M words. Then, exKidneyBERT was developed by extending Clinical BERT’s tokenizer with six technical keywords and repeating the pre-training procedure, thereby extending the model’s vocabulary. All three models (Clinical BERT, Kidney BERT, and exKidneyBERT) were fine-tuned with information retrieval heads.
Results
The model with extended vocabulary, exKidneyBERT, outperformed Clinical BERT and Kidney BERT in both questions. For rejection, exKidneyBERT achieved an 83.3% overlap ratio for antibody-mediated rejection (ABMR) and 79.2% for T-cell mediated rejection (TCMR). For IFTA, exKidneyBERT had a 95.8% exact match rate.
Conclusion
ExKidneyBERT is a high-performing model for extracting information from renal pathology reports. Additional pre-training of BERT language models on specialized small domains does not necessarily improve performance. Extending the BERT tokenizer’s vocabulary library is essential for specialized domains to improve performance, especially when pre-training on small corpora.
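The vocabulary-extension step described in the Methods can be illustrated with a simplified greedy WordPiece-style tokenizer. This is only a sketch: the actual work extends a BERT tokenizer (in the Hugging Face ecosystem, typically `tokenizer.add_tokens(...)` followed by resizing the model's embedding matrix), and the abstract does not list the six renal-pathology keywords, so the term and vocabulary fragments below are hypothetical:

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first subword tokenization (simplified
    WordPiece). Continuation pieces carry the usual '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:  # no matching piece: unknown token
            return ["[UNK]"]
        start = end
    return pieces

# Hypothetical base vocabulary fragments for a clinical tokenizer:
base_vocab = {"glomerul", "##o", "##path", "##y", "tubul", "##ar"}
# Extending the vocabulary with a whole domain term, analogous to
# exKidneyBERT adding six renal-pathology keywords:
extended_vocab = base_vocab | {"glomerulopathy"}

before = wordpiece("glomerulopathy", base_vocab)     # four subword pieces
after = wordpiece("glomerulopathy", extended_vocab)  # one whole-word token
```

With the extended vocabulary the domain term survives as a single token, so its embedding can be learned directly during the repeated pre-training instead of being assembled from generic subword pieces.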
Diffusion models in text generation: a survey
https://peerj.com/articles/cs-1905
2024-02-23
Qiuhua Yi, Xiangfan Chen, Chenwei Zhang, Zehai Zhou, Linan Zhu, Xiangjie Kong
Diffusion models are a class of probabilistic generative models that were first applied to image generation. Recently, they have drawn wide interest in natural language generation (NLG), a sub-field of natural language processing (NLP), due to their capability to generate varied, high-quality text outputs. In this article, we conduct a comprehensive survey on the application of diffusion models in text generation. We divide text generation into three parts (conditional, unconstrained, and multi-mode text generation, respectively) and provide a detailed introduction to each. In addition, considering that autoregressive pre-trained language models (PLMs) have recently dominated text generation, we conduct a detailed comparison between diffusion models and PLMs across multiple dimensions, highlighting their respective advantages and limitations. We believe that integrating PLMs into diffusion models is a valuable research avenue. We also discuss current challenges faced by diffusion models in text generation and propose potential future research directions, such as improving sampling speed to address scalability issues and exploring multi-modal text generation. By providing a comprehensive analysis and outlook, this survey will serve as a valuable reference for researchers and practitioners interested in utilizing diffusion models for text generation tasks.
Categorization of tweets for damages: infrastructure and human damage assessment using fine-tuned BERT model
https://peerj.com/articles/cs-1859
2024-02-16
Muhammad Shahid Iqbal Malik, Muhammad Zeeshan Younas, Mona Mamdouh Jamjoom, Dmitry I. Ignatov
Identification of infrastructure and human damage assessment tweets is beneficial to disaster management organizations as well as to victims during a disaster. Most prior works focused on the detection of informative/situational tweets and on infrastructure damage; only one focused on human damage. This study presents a novel approach for detecting damage assessment tweets involving both infrastructure and human damage. We investigated the potential of the Bidirectional Encoder Representations from Transformers (BERT) model to learn universal contextualized representations, with the aim of demonstrating its effectiveness for binary and multi-class classification of disaster damage assessment tweets. The objective is to exploit pre-trained BERT as a transfer learning mechanism after fine-tuning important hyper-parameters on the CrisisMMD dataset, which covers seven disasters. The effectiveness of fine-tuned BERT is compared against five benchmarks and nine comparable models through exhaustive experiments. The findings show that fine-tuned BERT outperformed all benchmarks and comparable models, achieving state-of-the-art performance with up to a 95.12% macro-F1 score for binary classification and an 88% macro-F1 score for multi-class classification. The improvement in the classification of human damage is particularly promising.
Synergizing language learning: SmallTalk AI in Industry 4.0 and Education 4.0
https://peerj.com/articles/cs-1843
2024-01-31
Chunxiao Zhang, Zhiyan Liu, Aravind B.R., Hariharasudan A
Background
Since Industry 4.0 debuted roughly a decade ago, it is now necessary to examine how it affects various aspects of the discipline. It is the responsibility of the education sector to guarantee that the next generation is equipped mentally, physically, and cognitively to face unforeseen challenges. Numerous educational institutions are outfitted with Industry 4.0 technology-based learning. Industry 4.0 fosters advancements in learning methodologies, especially for language learning. Learners may gain knowledge from their own base, giving them an opportunity for independent study; the majority of subjects can now be learned through Industry 4.0 technologies. This article explores the intersection of Industry 4.0 and education, specifically focusing on the SmallTalk AI tool. It investigates how technological and digital innovations within the context of Industry 4.0 can serve as powerful tools to enhance language learning outcomes.
Methods
This article presents a comprehensive analysis of statistical data and empirical evidence supporting the positive impact of SmallTalk, an Industry 4.0 technology, on language acquisition, particularly speaking. The study also examines participants’ usage through the technology acceptance model (TAM). Furthermore, it examines the challenges and opportunities associated with integrating these innovations into language learning pedagogies, offering insights for educators and policymakers to harness the potential of Industry 4.0 in fostering language proficiency. The research employs quantitative analysis; the data obtained from educational institutions were analyzed using the SPSS and AMOS software.
Results
The results indicate that Industry 4.0 has had an important effect on English language acquisition. This self-supported adaptable system of education facilitates effective student learning. This study also suggests that future research into the utility of Industry 4.0 be conducted elsewhere internationally.
Using artificial intelligence to explore sound symbolic expressions of gender in American English
https://peerj.com/articles/cs-1811
2024-01-17
Alexander Kilpatrick, Aleksandra Ćwiek
This study investigates the extent to which gender can be inferred from the phonemes that make up given names and words in American English. Two extreme gradient boosting classifiers were constructed to classify words according to gender: one using a list of the most common given names (N∼1,000) in North America and the other using the Glasgow Norms (N∼5,500), a corpus of nouns, verbs, adjectives, and adverbs that have each been assigned a psycholinguistic score of how strongly they are associated with male or female behaviour. Both models report significant findings, but the model constructed using given names achieves greater accuracy despite being trained on a smaller dataset, suggesting that gender is expressed more robustly in given names than in other word classes. Feature importance was examined to determine which features contributed to the decision-making process. Feature importance scores revealed a general pattern across both models, but also show that not all word classes express gender in the same way. Finally, the models were reconstructed and tested on the opposite dataset to determine whether they were useful in classifying opposite samples. The results showed that the models were less accurate when classifying opposite samples, suggesting that they are better suited to classifying words of the same class.
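A minimal sketch of the kind of featurization such a model might start from: counting phonemes over a fixed inventory. The phoneme transcriptions and inventory below are hypothetical, and the study trains gradient-boosted classifiers on features of this general kind rather than using raw counts directly:

```python
from collections import Counter

def phoneme_features(phonemes, inventory):
    """Count-based feature vector over a fixed phoneme inventory, the
    kind of representation a gradient-boosted classifier can be trained
    on. Phonemes outside the inventory are ignored."""
    counts = Counter(phonemes)
    return [counts.get(p, 0) for p in inventory]

# Hypothetical ARPAbet-style transcriptions of two given names:
inventory = ["AH", "N", "L", "IY", "K", "S"]
anna = phoneme_features(["AE", "N", "AH"], inventory)
lily = phoneme_features(["L", "IH", "L", "IY"], inventory)
```

Each name becomes a fixed-length numeric vector, which is the form a classifier needs before per-phoneme feature importance scores can be computed.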
STTA: enhanced text classification via selective test-time augmentation
https://peerj.com/articles/cs-1757
2023-12-19
Haoyu Xiong, Xinchun Zhang, Leixin Yang, Yu Xiang, Yaping Zhang
Test-time augmentation (TTA) is a well-established technique that involves aggregating predictions over transformed examples of test inputs during the inference stage. The goal is to enhance model performance and reduce the uncertainty of predictions. Despite its advantages of requiring no additional training or hyperparameter tuning and being applicable to any existing model, TTA is still in its early stages in the field of NLP. This is partly due to the difficulty of discerning the contribution of different transformed samples, which can negatively impact predictions. To address these issues, we propose Selective Test-Time Augmentation (STTA), which aims to select the most beneficial transformed samples for aggregation by identifying reliable samples. Furthermore, we analyze and empirically verify why TTA is sensitive to some text data augmentation methods and reveal why some data augmentation methods lead to erroneous predictions. Through extensive experiments, we demonstrate that STTA is a simple and effective method that produces promising results in various text classification tasks.
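A minimal sketch of selective aggregation at test time. The confidence-threshold criterion below is an assumption for illustration, not the paper's actual selection method, and the class-probability vectors are invented:

```python
def select_and_aggregate(prob_lists, threshold=0.6):
    """Keep only variants whose top class probability reaches `threshold`,
    then average the surviving distributions. Falls back to plain
    averaging when no variant qualifies."""
    reliable = [p for p in prob_lists if max(p) >= threshold]
    pool = reliable if reliable else prob_lists
    n_classes = len(pool[0])
    avg = [sum(p[c] for p in pool) / len(pool) for c in range(n_classes)]
    return avg.index(max(avg)), avg

# Class probabilities for the original input plus three augmented
# variants (e.g., synonym swap, random deletion, back-translation):
variants = [
    [0.80, 0.20],  # original: confident
    [0.75, 0.25],  # reliable augmentation
    [0.45, 0.55],  # unreliable: near-uniform, would dilute the vote
    [0.90, 0.10],  # reliable augmentation
]
label, dist = select_and_aggregate(variants)  # the unreliable row is dropped
```

Plain TTA would average all four rows, letting the near-uniform variant pull the prediction toward the wrong class; selecting reliable variants first keeps the aggregate confident.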