PeerJ Computer Science: World Wide Web and Web Science
Feed: https://peerj.com/articles/index.atom?journal=cs&subject=11900
World Wide Web and Web Science articles published in PeerJ Computer Science

Title: Special issue on analysis and mining of social media data
URL: https://peerj.com/articles/cs-1909
Published: 2024-02-29
Authors: Arkaitz Zubiaga, Paolo Rosso
This Editorial introduces the PeerJ Computer Science Special Issue on Analysis and Mining of Social Media Data. The special issue called for submissions with a primary focus on the use of social media data, for a variety of fields including natural language processing, computational social science, data mining, information retrieval and recommender systems. Of the 48 abstract submissions that were deemed within the scope of the special issue and were invited to submit a full article, 17 were ultimately accepted. These included a diverse set of articles covering, inter alia, sentiment analysis, detection and mitigation of online harms, analytical studies focused on societal issues and analysis of images surrounding news. The articles primarily use Twitter, Facebook and Reddit as data sources; English, Arabic, Italian, Russian, Indonesian and Javanese as languages; and over a third of the articles revolve around COVID-19 as the main topic of study. This article discusses the motivation for launching such a special issue and provides an overview of the articles published in the issue.
Title: A message recovery attack on multivariate polynomial trapdoor function
URL: https://peerj.com/articles/cs-1521
Published: 2023-08-28
Authors: Rashid Ali, Muhammad Mubashar Hussain, Shamsa Kanwal, Fahima Hajjej, Saba Inam
Cybersecurity guarantees the exchange of information through a public channel in a secure way; that is, the data must be protected from unauthorized parties and transmitted to the intended parties with confidentiality and integrity. In this work, we mount an attack on a cryptosystem based on a multivariate polynomial trapdoor function over the field of rational numbers Q. The developers claim that the security of their proposed scheme rests on the fact that a polynomial system consisting of 2n equations (where n is a natural number) in 3n unknowns, constructed by using quasigroup string transformations, has infinitely many solutions, so that finding the exact solution is not possible. We show that the proposed trapdoor function is vulnerable to a Gröbner basis attack: selected polynomials in the corresponding Gröbner basis can be used to recover the plaintext from a given ciphertext without knowledge of the secret key.
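To make the attack idea concrete: with a lexicographic term order, a Gröbner basis triangularizes a polynomial system, so the unknowns can be read off one by one. The following SymPy sketch runs this on a toy system invented for illustration; it is not the attacked cryptosystem, only a minimal demonstration of how selected basis polynomials expose the plaintext variables.

```python
# Minimal Groebner-basis recovery sketch (SymPy). The three "ciphertext"
# equations below are hypothetical stand-ins for the scheme's public system.
from sympy import symbols, groebner, QQ

x1, x2, y1 = symbols('x1 x2 y1')  # x1, x2: plaintext symbols; y1: internal

system = [
    x1 + x2 + y1 - 6,   # hypothetical public equation 1
    x1*x2 + y1 - 7,     # hypothetical public equation 2
    x1 - 2,             # hypothetical public equation 3
]

# A lex-order basis over Q triangularizes the system; its elements such as
# x1 - 2, x2 - 3, y1 - 1 reveal the plaintext with no secret key involved.
gb = groebner(system, x1, x2, y1, order='lex', domain=QQ)
for poly in gb.exprs:
    print(poly)
```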
Title: I-Cubid: a nonlinear cubic graph-based approach to visualize and in-depth browse Flickr image results
URL: https://peerj.com/articles/cs-1476
Published: 2023-08-10
Authors: Umer Rashid, Maha Saddal, Abdur Rehman Khan, Sadia Manzoor, Naveed Ahmad
Existing image search engines allow web users to explore images from grids. This traditional interaction is linear and lookup-based: scanning web search results proceeds horizontally and vertically and cannot support in-depth browsing. This research emphasizes the significance of a multidimensional exploration scheme over traditional grid layouts for visually exploring web image search results. It investigates the implications of visualization and related in-depth browsing via a multidimensional cubic graph representation over a search engine result page (SERP). Furthermore, this research uncovers usability issues in the traditional grid and 3-dimensional web image search space. We provide multidimensional cubic visualization and nonlinear in-depth browsing of web image search results. The proposed approach employs textual annotations and descriptions to represent results in cubic graphs that further support in-depth browsing via a search user interface (SUI) design. It allows nonlinear navigation of web image search results and enables exploring, browsing, visualizing, previewing/viewing, and accessing images in a nonlinear, interactive, and usable way. Usability tests and a detailed statistical significance analysis confirm the efficacy of the cubic presentation over grid layouts. The investigation reveals improvements in overall user satisfaction, screen design, information & terminology, and system capability in exploring web image search results.
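As a rough, hypothetical illustration of what a cubic arrangement buys over a flat grid, the sketch below (not I-Cubid itself) places image results on a 3-D lattice with networkx, so every result can be reached along three axes instead of one scan order; the coordinate scheme and result fields are assumptions for illustration.

```python
# Minimal sketch: arrange image search results in a 3-D cubic lattice so that
# browsing can move along any axis rather than only scanning a flat grid.
import itertools
import networkx as nx

def build_cubic_graph(results, side=3):
    """Map up to side**3 results onto lattice coordinates and connect
    face-adjacent cells, so each image has up to six browse directions."""
    g = nx.Graph()
    coords = list(itertools.product(range(side), repeat=3))
    for (x, y, z), result in zip(coords, results):
        g.add_node((x, y, z), title=result['title'], url=result['url'])
    for (x, y, z) in g.nodes:
        for dx, dy, dz in [(1, 0, 0), (0, 1, 0), (0, 0, 1)]:
            neighbor = (x + dx, y + dy, z + dz)
            if neighbor in g.nodes:
                g.add_edge((x, y, z), neighbor)
    return g

results = [{'title': f'image {i}', 'url': f'https://example.org/img/{i}'}
           for i in range(27)]
cube = build_cubic_graph(results)
# In-depth browsing = walking the lattice instead of paging a linear grid.
print(sorted(cube.neighbors((1, 1, 1))))  # six adjacent results
```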
Title: Web content topic modeling using LDA and HTML tags
URL: https://peerj.com/articles/cs-1459
Published: 2023-07-11
Authors: Hamza H.M. Altarturi, Muntadher Saadoon, Nor Badrul Anuar
An immense volume of digital documents exists online and offline with content that can offer useful information and insights. Topic modeling enhances the analysis and understanding of digital documents by discovering latent semantic structures, or topics, within a set of textual documents. Applications in the Internet of Things, Blockchain, recommender systems, and search engine optimization use topic modeling to handle data mining tasks such as classification and clustering. The usefulness of topic models depends on the quality of the resulting term patterns and topics, and topic coherence is the standard metric for measuring that quality. Previous studies built topic models that generally target conventional documents; these models underperform when applied to web content because of structural differences between conventional and HTML documents. Neglecting the unique structure of web content leads to missing otherwise coherent topics and, therefore, low topic quality. This study proposes an innovative topic model for learning coherent topics from web content. We present the HTML Topic Model (HTM), a web content topic model that takes HTML tags into consideration to understand the structure of web pages. We conducted two series of experiments to demonstrate the limitations of existing topic models and to examine the topic coherence of the HTM against the widely used Latent Dirichlet Allocation (LDA) model and its variants, namely the Correlated Topic Model, Dirichlet Multinomial Regression, the Hierarchical Dirichlet Process, Hierarchical Latent Dirichlet Allocation, the pseudo-document based Topic Model, and Supervised Latent Dirichlet Allocation. The first experiment demonstrates the limitations of existing topic models when applied to web content and, therefore, the essential need for a web content topic model: when applied to web data, overall performance dropped to five times lower on average and, in some cases, approximately 20 times lower than on conventional data. The second experiment evaluates the effectiveness of the HTM model in discovering topics and term patterns of web content. The HTM model achieved an overall 35% improvement in topic coherence compared to LDA.
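The tag-aware intuition can be approximated with off-the-shelf tools: extract tokens per HTML tag with BeautifulSoup, weight prominent tags more heavily, then fit LDA and score coherence with gensim. This is only a hedged baseline sketch of the preprocessing idea, not the authors' HTM model, and the tag weights below are invented.

```python
# Hedged baseline: weight text by its HTML tag before running LDA, then score
# topic coherence. NOT the authors' HTM model; tag weights are assumptions.
from bs4 import BeautifulSoup
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

TAG_WEIGHTS = {'title': 3, 'h1': 3, 'h2': 2, 'p': 1}  # hypothetical weights

def tokens_from_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    tokens = []
    for tag, weight in TAG_WEIGHTS.items():
        for element in soup.find_all(tag):
            # Repeat tokens from prominent tags so they dominate the topics.
            tokens.extend(element.get_text().lower().split() * weight)
    return tokens

pages = ['<html><title>web mining</title><p>topic models for web pages</p></html>',
         '<html><h1>search engines</h1><p>queries and ranking on the web</p></html>']
texts = [tokens_from_html(p) for p in pages]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)

# u_mass coherence is robust on tiny corpora; the paper discusses coherence
# metrics more generally.
cm = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary,
                    coherence='u_mass')
print('topic coherence (u_mass):', cm.get_coherence())
```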
Title: Query sampler: generating query sets for analyzing search engines using keyword research tools
URL: https://peerj.com/articles/cs-1421
Published: 2023-06-07
Authors: Sebastian Schultheiß, Dirk Lewandowski, Sonja von Mach, Nurce Yagci
Search engine queries are the starting point for studies in different fields, such as health or political science. These studies usually aim to make statements about social phenomena. However, the queries used in such studies are often created rather unsystematically and do not correspond to actual user behavior, so the evidential value of the studies must be questioned. We address this problem by developing an approach (query sampler) to sample queries from commercial search engines, using keyword research tools designed to support search engine marketing. This allows us to generate large numbers of queries related to a given topic and to derive information on how often each keyword is searched for, that is, the query volume. We empirically test our approach with queries from two published studies, and the results show that the number of queries and the total search volume could be considerably expanded. Our approach has a wide range of applications for studies that seek to draw conclusions about social phenomena using search engine queries. It can be applied flexibly to different topics and is relatively straightforward to implement, as we provide the code for querying the Google Ads API. Limitations are that the approach needs to be tested with a broader range of topics and thoroughly checked for problems with topic drift and for the role of close variants provided by keyword research tools.
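For reference, a hedged sketch of the underlying mechanism with the official google-ads Python client is shown below: expand a seed topic into candidate queries with per-query search volume. Service and field names follow the client library's published examples, but the credentials file, customer ID, and language/geo constants are placeholders, exact fields may vary across API versions, and the paper provides its own code.

```python
# Hedged sketch of the query-sampler idea: expand a seed topic into a large
# query set with per-query search volume via the Google Ads API.
from google.ads.googleads.client import GoogleAdsClient

client = GoogleAdsClient.load_from_storage('google-ads.yaml')  # your credentials
service = client.get_service('KeywordPlanIdeaService')

request = client.get_type('GenerateKeywordIdeasRequest')
request.customer_id = '1234567890'                      # placeholder
request.language = 'languageConstants/1000'             # English
request.geo_target_constants.append('geoTargetConstants/2840')  # United States
request.keyword_seed.keywords.extend(['vaccination side effects'])  # seed topic

# Each idea is a candidate query with its monthly search volume, so sampling
# can be weighted by how often real users actually issue the query.
for idea in service.generate_keyword_ideas(request=request):
    print(idea.text, idea.keyword_idea_metrics.avg_monthly_searches)
```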
Title: SEMGROMI—a semantic grouping algorithm to identifying microservices using semantic similarity of user stories
URL: https://peerj.com/articles/cs-1380
Published: 2023-05-12
Authors: Fredy H. Vera-Rivera, Eduard Gilberto Puerto Cuadros, Boris Perez, Hernán Astudillo, Carlos Gaona
Microservices is an architectural style for service-oriented distributed computing that is being widely adopted in several domains, including autonomous vehicles, sensor networks, IoT systems, energy systems, telecommunications networks, and telemedicine. When migrating a monolithic system to a microservices architecture, one of the key design problems is "microservice granularity definition", i.e., deciding how many microservices are needed and allocating computations among them. This article describes SEMGROMI, a semantic grouping algorithm that takes user stories, a well-known functional requirements specification technique, and identifies the number and scope of candidate microservices using the semantic similarity of the user stories' textual descriptions, while optimizing for low coupling, high cohesion, and high semantic similarity. In four validation projects (two state-of-the-art projects and two industry projects), the proposed technique was compared with domain-driven design (DDD), the method most frequently used to identify microservices, and with a genetic algorithm previously proposed as part of the Microservices Backlog model. We found that SEMGROMI yields decompositions of user stories into microservices with high cohesion (from a semantic point of view) and low coupling, while also reducing complexity, communication between microservices, and estimated development time. SEMGROMI is therefore a viable option for the design and evaluation of microservices-based applications. It is part of the Microservices Backlog model, which supports evaluating candidate microservices graphically and through metrics to make design-time decisions about the architecture of a microservices-based application.
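A minimal sketch of the general semantic-grouping idea (not the exact SEMGROMI algorithm): embed each user story and cluster by cosine similarity, so each cluster becomes a candidate microservice. The embedding model name and the distance threshold are illustrative assumptions.

```python
# Semantic grouping sketch: user stories -> embeddings -> similarity clusters.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

stories = [
    'As a customer I want to add items to my cart',
    'As a customer I want to remove items from my cart',
    'As a user I want to reset my password',
    'As a user I want to log in with my email',
]

model = SentenceTransformer('all-MiniLM-L6-v2')  # assumed embedding model
embeddings = model.encode(stories)

# Clusters with high intra-cluster similarity give high semantic cohesion;
# fewer cross-cluster references imply lower coupling between microservices.
# (sklearn >= 1.2; older versions use affinity= instead of metric=.)
clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=0.8,
                                     metric='cosine', linkage='average')
labels = clustering.fit_predict(embeddings)
for story, label in zip(stories, labels):
    print(f'microservice {label}: {story}')
```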
Title: SocioPedia+: a visual analytics system for social knowledge graph-based event exploration
URL: https://peerj.com/articles/cs-1277
Published: 2023-03-20
Authors: Tra My Nguyen, Hong-Woo Chun, Myunggwon Hwang, Lee-Nam Kwon, Jae-Min Lee, Kanghee Park, Jason J. Jung
In the era of information explosion, exploring events from social networks has become a crucial task for many applications. To derive comprehensive and thorough insights on social events, visual analytics (VA) systems have been broadly used as a promising solution. However, because of the enormous volume, diversity, and complexity of social data, the number of event exploration tasks that conventional real-time visual analytics systems can support has been limited. In this article, we introduce SocioPedia+, a real-time visual analytics system for social event exploration in the time and space domains. By introducing social knowledge graph analysis as a dimension of the system's multivariate analysis, SocioPedia+ significantly enhances the event exploration process and enables the system to perform the full set of tasks required for visual analytics and social event exploration. Furthermore, SocioPedia+ is optimized for visualizing event analysis at different levels, from macroscopic (event level) to microscopic (knowledge level). The system is implemented and investigated through a detailed case study evaluating its usefulness and visualization effectiveness for event exploration.
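As a hedged sketch of the social-knowledge-graph idea behind such systems: link entities that co-occur in posts, with timestamps and locations as edge attributes, so events can be explored in both time and space. The post fields and entities below are invented sample data, not SocioPedia+'s actual schema.

```python
# Toy social knowledge graph: entities are nodes, co-mentions are weighted
# edges carrying (time, place) attributes for temporal/spatial exploration.
import itertools
import networkx as nx

posts = [
    {'entities': ['flood', 'river', 'city_hall'], 'time': '2023-03-01', 'place': 'Daejeon'},
    {'entities': ['flood', 'evacuation'], 'time': '2023-03-02', 'place': 'Daejeon'},
]

kg = nx.Graph()
for post in posts:
    for a, b in itertools.combinations(post['entities'], 2):
        # Edge weight counts co-mentions; time/place attributes support both
        # macroscopic (event-level) and microscopic (knowledge-level) views.
        if kg.has_edge(a, b):
            kg[a][b]['weight'] += 1
            kg[a][b]['mentions'].append((post['time'], post['place']))
        else:
            kg.add_edge(a, b, weight=1, mentions=[(post['time'], post['place'])])

print(kg.edges(data=True))
```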
Title: Benefits, challenges, and usability evaluation of DeloreanJS: a back-in-time debugger for JavaScript
URL: https://peerj.com/articles/cs-1238
Published: 2023-02-24
Authors: Paul Leger, Felipe Ruiz, Hiroaki Fukuda, Nicolás Cardozo
JavaScript Web applications are a common product in industry. As with most software, Web applications can acquire flaws (known as bugs) whose symptoms appear during development and, even worse, in production. Debuggers are beneficial for detecting bugs. Unfortunately, most JavaScript debuggers (1) only support the "step into/through" feature for walking a program's execution to find a bug, and (2) do not allow developers to go back in time in the application's execution to pinpoint the bug accurately. The second limitation, for example, prevents developers from modifying the value of a variable while the application is running to fix a bug, or from testing whether the same bug is triggered with other values of that variable. Using concepts such as continuations and static analysis, this article presents a usable back-in-time debugger for JavaScript, named DeloreanJS, which enables developers to go back to different execution points and resume the execution of a Web application, improving the understanding of a bug or even allowing experimentation with hypothetical scenarios around it. Using an online, publicly available version, we illustrate the benefits of DeloreanJS through five examples of bugs in JavaScript. Although DeloreanJS is developed for JavaScript, a dynamic prototype-based object model with side effects (mutable variables), we compare our proposal with the state of the art and practice of debuggers in terms of features. For example, modern browsers like Mozilla Firefox ship with a debugger that only supports breakpoints, whereas DeloreanJS provides a graphical user interface with back-in-time features. The aim of this study is to evaluate and compare the usability of DeloreanJS and Mozilla Firefox's debugger using the System Usability Scale. We asked 30 undergraduate students from two computer science programs to solve five tasks. Among the findings, we highlight two results. First, 100% (15) of participants recommended DeloreanJS, whereas only 53% (eight) recommended Firefox's debugger for completing the tasks. Second, whereas the average score for DeloreanJS is 71.6 ("Good"), the average score for Firefox's debugger is 55.8 ("Acceptable").
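DeloreanJS builds on JavaScript continuations; as a language-neutral illustration of the rewind-modify-resume workflow it enables, here is a conceptual Python sketch based on state snapshots (a much cruder mechanism than continuations, and not how DeloreanJS is implemented).

```python
# Conceptual back-in-time sketch: snapshot program state before each step,
# then rewind to a snapshot, change a value, and resume from that point.
import copy

def run(steps, state, snapshots):
    """Execute steps, saving a deep copy of the state before each one."""
    for i, step in enumerate(steps):
        snapshots.append((i, copy.deepcopy(state)))
        step(state)
    return state

steps = [
    lambda s: s.update(total=s['total'] + s['price']),
    lambda s: s.update(total=s['total'] * s['tax']),   # suspected buggy step
]

snapshots = []
state = run(steps, {'total': 0, 'price': 10, 'tax': 1.19}, snapshots)
print('final:', state['total'])             # 11.9

# "Travel back" to just before step 1, test a hypothesis by changing a value,
# and resume execution from that point without restarting the whole program.
index, past = snapshots[1]
past['tax'] = 1.07                          # hypothetical scenario
replayed = run(steps[index:], past, [])
print('replayed:', replayed['total'])       # 10.7
```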
Title: A selective approach to stemming for minimizing the risk of failure in information retrieval systems
URL: https://peerj.com/articles/cs-1175
Published: 2023-01-10
Authors: Gökhan Göksel, Ahmet Arslan, Bekir Taner Dinçer
Stemming is supposed to improve the average performance of an information retrieval system, but in practice, past experimental results show that this is not always the case. In this article, we propose a selective approach to stemming that decides, on a per-query basis, whether stemming should be applied. Our method aims at minimizing the risk of failure caused by stemming when retrieving semantically related documents. The proposed work contributes to the IR literature an application of selective stemming and a set of new features derived from the term frequency distributions of the systems under selection. The method leverages a machine learning technique over both query performance predictors and the derived features. It is comprehensively evaluated using three rule-based stemmers and eight query sets corresponding to four document collections from the standard TREC and NTCIR datasets. The document collections, except for one, consist of Web documents, ranging from 25 million to 733 million documents. The results of the experiments show that the method is capable of making accurate selections that increase the robustness of the system and minimize the risk of failure (i.e., per-query performance losses) across queries. The results also show that the method attains a systematically higher average retrieval performance than the single systems for most query sets.
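The per-query decision can be framed as a straightforward classification problem. The sketch below is a hedged, simplified illustration: two toy features stand in for the paper's query performance predictors and term-frequency-distribution features, and the training labels are hypothetical outcomes of retrieval experiments.

```python
# Selective stemming sketch: featurize each query and let a classifier decide
# whether the stemmed or unstemmed index should serve it.
from nltk.stem import PorterStemmer
from sklearn.ensemble import RandomForestClassifier

stemmer = PorterStemmer()

def features(query):
    terms = query.lower().split()
    # Feature 1: query length; feature 2: how much stemming changes the query,
    # a rough proxy for the risk that stemming shifts its meaning.
    changed = sum(t != stemmer.stem(t) for t in terms)
    return [len(terms), changed / len(terms)]

# Toy training data: label 1 = stemming helped this query, 0 = it hurt.
# Labels are hypothetical ground truth from retrieval experiments.
train_queries = ['running shoes reviews', 'operating systems', 'breaking news', 'java']
labels = [1, 0, 1, 0]

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit([features(q) for q in train_queries], labels)

query = 'information retrieval systems'
use_stemming = clf.predict([features(query)])[0]
print('apply stemming:' if use_stemming else 'skip stemming:', query)
```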
Title: Autonomous schema markups based on intelligent computing for search engine optimization
URL: https://peerj.com/articles/cs-1163
Published: 2022-12-08
Authors: Burhan Ud Din Abbasi, Iram Fatima, Hamid Mukhtar, Sharifullah Khan, Abdulaziz Alhumam, Hafiz Farooq Ahmad
With advances in artificial intelligence and semantic technology, search engines are integrating semantics to address complex search queries and improve the results. This requires identifying well-known concepts or entities and their relationships from web page contents. However, the growth of complex unstructured data on web pages has made concept identification increasingly difficult. Existing research focuses on entity recognition in linguistic structures such as complete sentences and paragraphs, whereas a huge portion of the data on web pages exists as unstructured text fragments enclosed in HTML tags. Ontologies provide schemas to structure the data on the web, but including them in web pages requires additional resources and expertise from organizations or webmasters, which has been a major hindrance to their large-scale adoption. We propose an approach for the autonomous identification of entities from short text present in web pages, to populate semantic models based on a specific ontology model. The proposed approach has been applied to a public dataset containing academic web pages. We employ a long short-term memory (LSTM) deep learning network and the random forest machine learning algorithm to predict entities. The proposed methodology achieves an overall accuracy of 0.94 on the test dataset, indicating potential for automated prediction even with a limited number of training samples per entity, thus significantly reducing the manual workload required in practical applications.
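As a hedged illustration of the LSTM component, the sketch below classifies short text fragments into entity types with Keras. The entity labels, vocabulary size, and tiny training set are invented for illustration; the paper additionally uses a random forest and a specific ontology model.

```python
# LSTM sketch: classify short HTML text fragments into entity types.
import numpy as np
import tensorflow as tf

fragments = ['dr jane doe', 'room 204 cs building', 'intro to databases']
labels = np.array([0, 1, 2])  # 0=person, 1=location, 2=course (hypothetical)

vectorizer = tf.keras.layers.TextVectorization(max_tokens=1000,
                                               output_sequence_length=6)
vectorizer.adapt(fragments)

model = tf.keras.Sequential([
    vectorizer,                                # raw strings -> token ids
    tf.keras.layers.Embedding(input_dim=1000, output_dim=32),
    tf.keras.layers.LSTM(32),                  # sequence model for short text
    tf.keras.layers.Dense(3, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(np.array(fragments), labels, epochs=30, verbose=0)

print(model.predict(np.array(['prof john smith'])).argmax())  # expect 0 (person)
```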