PeerJ Computer Science — articles published in PeerJ Computer Science
Feed: https://peerj.com/articles/index.atom?journal=cs

Adaptations of data mining methodologies: a systematic literature review
https://peerj.com/articles/00267 (published 2020-05-25)
Veronika Plotnikova, Marlon Dumas, Fredrik Milani
The use of end-to-end data mining methodologies such as CRISP-DM, KDD process, and SEMMA has grown substantially over the past decade. However, little is known about how these methodologies are used in practice. In particular, the question of whether data mining methodologies are used ‘as-is’ or adapted for specific purposes has not been thoroughly investigated. This article addresses this gap via a systematic literature review focused on the context in which data mining methodologies are used and the adaptations they undergo. The literature review covers 207 peer-reviewed and ‘grey’ publications. We find that data mining methodologies are primarily applied ‘as-is’. At the same time, we also identify various adaptations of data mining methodologies, and we note that their number is growing rapidly. The dominant adaptation pattern is adjustment of methodologies at a granular level (modifications), followed by extensions of existing methodologies with additional elements. Further, we identify two recurrent purposes for adaptation: (1) adaptations to handle Big Data technologies, tools, and environments (technological adaptations); and (2) adaptations for context-awareness and for integrating data mining solutions into business processes and IT systems (organizational adaptations). The study suggests that standard data mining methodologies do not pay sufficient attention to deployment issues, which play a prominent role when turning data mining models into software products that are integrated into the IT architectures and business processes of organizations. We conclude that refinements of existing methodologies aimed at combining data, technological, and organizational aspects could help to mitigate these gaps.
Influence of tweets and diversification on serendipitous research paper recommender systems
https://peerj.com/articles/cs-273 (published 2020-05-18)
Chifumi Nishioka, Jörn Hauke, Ansgar Scherp
In recent years, a large body of literature has accumulated around the topic of research paper recommender systems. However, since most studies have focused on accuracy, they have overlooked the serendipity of recommendations, which is an important determinant of user satisfaction. Serendipity is concerned with the relevance and unexpectedness of recommendations; serendipitous items are those that positively surprise users. This article examines two key research questions: firstly, whether a user’s tweets can assist in generating more serendipitous recommendations; and secondly, whether diversification of a list of recommended items further improves serendipity. To investigate these issues, an online experiment was conducted in the domain of computer science with 22 subjects. As an evaluation metric, we use the serendipity score (SRDP), in which the unexpectedness of recommendations is inferred using a primitive recommendation strategy. The results indicate that a user’s tweets do not improve serendipity, but they can reflect recent research interests and are typically heterogeneous. In contrast, diversification was found to lead to a greater number of serendipitous research paper recommendations.
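The SRDP metric described above combines relevance with unexpectedness inferred from a primitive recommendation strategy. A minimal sketch of that idea follows; the exact formula in the paper may differ, and the function name and item IDs are illustrative only:

```python
def serendipity_score(recommended, primitive_baseline, relevant):
    """Toy serendipity metric: an item counts as serendipitous when it is
    relevant to the user yet absent from a primitive baseline's list
    (the baseline serves as a proxy for "expected" recommendations)."""
    baseline, relevant_set = set(primitive_baseline), set(relevant)
    serendipitous = [i for i in recommended
                     if i not in baseline and i in relevant_set]
    return len(serendipitous) / len(recommended) if recommended else 0.0

# Two of the four recommended papers are relevant and not in the baseline list.
score = serendipity_score(
    recommended=["p1", "p2", "p3", "p4"],
    primitive_baseline=["p1", "p5"],
    relevant=["p2", "p3", "p6"],
)
print(score)  # 0.5
```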
Adaptive divergence for rapid adversarial optimization
https://peerj.com/articles/cs-274 (published 2020-05-18)
Maxim Borisyak, Tatiana Gaintseva, Andrey Ustyuzhanin
Adversarial Optimization provides a reliable, practical way to match two implicitly defined distributions, one of which is typically represented by a sample of real data and the other by a parameterized generator. Matching the distributions is achieved by minimizing a divergence between them, and estimating the divergence involves a secondary optimization task, which typically requires training a model to discriminate between the distributions. The choice of the model has its trade-off: high-capacity models provide good estimations of the divergence but generally require large sample sizes to be properly trained, while low-capacity models tend to require fewer samples for training but might provide biased estimations. The computational cost of Adversarial Optimization becomes significant when sampling from the generator is expensive; a practical example of such a setting is fine-tuning the parameters of complex computer simulations. In this work, we introduce a novel family of divergences that enables faster optimization convergence, measured by the number of samples drawn from the generator. Varying the capacity of the underlying discriminator model during optimization leads to a significant speed-up: the proposed divergence family uses low-capacity models to compare distant distributions (typically at early optimization steps) and gradually grows the capacity as the distributions become closer to each other, which significantly accelerates the initial stages of optimization. This acceleration was demonstrated on two fine-tuning problems involving the Pythia event generator and two of the most popular black-box optimization algorithms: Bayesian Optimization and Variational Optimization. Experiments show that, given the same budget, adaptive divergences yield results up to an order of magnitude closer to the optimum than the Jensen-Shannon divergence. While we consider physics-related simulations, adaptive divergences can be applied to any stochastic simulation.
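The core mechanism, estimating a divergence through a discriminator drawn from a restricted family, can be illustrated with a toy sketch. This is not the authors' adaptive divergence; it merely shows how a classifier family yields a divergence estimate, here a total-variation lower bound TV >= 2*acc - 1, where acc is the balanced accuracy of the best discriminator in the family:

```python
import random

def tv_estimate(xs, ys, thresholds):
    """Lower-bound the total variation distance between two 1-D samples via
    the best classifier in a family of threshold rules (TV >= 2*acc - 1).
    `thresholds` controls the discriminator's capacity: more candidates give
    a tighter estimate, but need more data to avoid overfitting."""
    best_acc = 0.5
    for t in thresholds:
        # Rule: classify "x >= t" as drawn from `ys`; also try the flipped rule.
        acc = 0.5 * (sum(x < t for x in xs) / len(xs)
                     + sum(y >= t for y in ys) / len(ys))
        best_acc = max(best_acc, acc, 1.0 - acc)
    return 2.0 * best_acc - 1.0

random.seed(0)
near = [random.gauss(0.0, 1.0) for _ in range(2000)]
far = [random.gauss(3.0, 1.0) for _ in range(2000)]
same = [random.gauss(0.0, 1.0) for _ in range(2000)]
grid = [i / 10.0 for i in range(-50, 51)]  # capacity: 101 candidate thresholds
# Distant distributions receive a much larger divergence estimate.
print(tv_estimate(near, far, grid) > tv_estimate(near, same, grid))  # True
```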
SANgo: a storage infrastructure simulator with reinforcement learning support
https://peerj.com/articles/cs-271 (published 2020-05-04)
Kenenbek Arzymatov, Andrey Sapronov, Vladislav Belavin, Leonid Gremyachikh, Maksim Karpov, Andrey Ustyuzhanin, Ivan Tchoub, Artem Ikoev
We introduce SANgo (Storage Area Network in the Go language), a Go-based package for simulating the behavior of modern storage infrastructure. The software is based on the discrete-event modeling paradigm and captures the structure and dynamics of high-level storage system building blocks. The flexible structure of the package allows us to create a model of a real storage system with a configurable number of components. The granularity of the simulated system can be defined depending on the replicated patterns of actual system behavior. Accurate replication enables us to reach the primary goal of our simulator: to explore the stability boundaries of real storage systems. To meet this goal, SANgo offers a variety of interfaces for easy monitoring and tuning of the simulated model. These interfaces allow us to track a number of metrics of components such as storage controllers, network connections, and hard drives. Other interfaces allow altering the parameter values of the simulated system in real time, thus providing the possibility of training a realistic digital twin using, for example, the reinforcement learning (RL) approach. One can train an RL model to reduce discrepancies between simulated and real SAN data, with the external control algorithm adjusting the simulator parameters to make the difference as small as possible. SANgo supports the standard OpenAI gym interface; thus, the software can serve as a benchmark for comparing different learning algorithms.
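Since SANgo exposes the standard OpenAI gym interface (reset/step returning observation, reward, done flag, and an info dict), any gym-compatible agent can drive it. The environment below is a self-contained toy with invented dynamics, shown only to illustrate that interface and the idea of rewarding small simulated-vs-real discrepancies; it is not SANgo's actual API:

```python
class ToyStorageEnv:
    """Minimal environment following the OpenAI gym step/reset convention.
    An agent tunes one simulator parameter so that a simulated latency
    tracks a target taken from 'real' SAN data (dynamics invented here)."""
    def __init__(self, target_latency=5.0):
        self.target = target_latency
        self.param = 0.0

    def reset(self):
        self.param = 0.0
        return self._observe()

    def step(self, action):
        self.param += action                  # adjust a simulator parameter
        simulated_latency = 2.0 * self.param  # toy simulator response
        error = abs(simulated_latency - self.target)
        reward = -error                       # small discrepancy = high reward
        done = error < 0.1
        return self._observe(), reward, done, {}

    def _observe(self):
        return [self.param]

# Standard gym-style control loop with a naive proportional controller.
env = ToyStorageEnv()
obs = env.reset()
done = False
for _ in range(50):
    action = (5.0 - 2.0 * obs[0]) * 0.25      # move toward the target
    obs, reward, done, info = env.step(action)
    if done:
        break
print(done)  # True: the controller matched simulated and target latency
```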
Exact acceleration of complex real-time model checking based on overlapping cycle
https://peerj.com/articles/cs-272 (published 2020-05-04)
Guoqing Wang, Lei Zhuang, Yu Song, Mengyang He, Ding Ma, Ling Ma
When real-time systems are modeled as timed automata, different time scales may lead to substantial fragmentation of the symbolic state space. Exact acceleration solves the fragmentation problem without changing system reachability. The relatively mature exact-acceleration techniques use an appended cycle or a parking cycle and apply to models with a single acceleratable cycle. Using these two techniques on a complex real-time model requires additional states and incurs a large time cost, which reduces acceleration efficiency. In this paper, an exact-acceleration method for complex real-time models based on an overlapping cycle is proposed, as an extension of the parking-cycle technique to new application scenarios. By comprehensively analyzing the accelerating impacts of multiple acceleratable cycles, it is only necessary to add a single overlapping cycle of fixed length, without relying on the windows of the acceleratable cycles. Experimental results show that the proposed timed automaton model is simple and effectively decreases the time cost of exact acceleration. For complex real-time system models, the overlapping-cycle method can accelerate large-scale and concurrent state spaces that the original exact-acceleration theory cannot handle.
A comparative study of machine learning and deep learning algorithms to classify cancer types based on microarray gene expression data
https://peerj.com/articles/cs-270 (published 2020-04-13)
Reinel Tabares-Soto, Simon Orozco-Arias, Victor Romero-Cano, Vanesa Segovia Bucheli, José Luis Rodríguez-Sotelo, Cristian Felipe Jiménez-Varón
Cancer classification is a topic of major interest in medicine since it allows accurate and efficient diagnosis and facilitates a successful outcome of medical treatments. Previous studies have classified human tumors using large-scale RNA profiling and supervised Machine Learning (ML) algorithms to construct a molecular-based classification of carcinoma cells from breast, bladder, adenocarcinoma, colorectal, gastroesophageal, kidney, liver, lung, ovarian, pancreas, and prostate tumors. These datasets are collectively known as the 11_tumor database. Although this database has been used in several works in the ML field, no comparative studies of different algorithms can be found in the literature. Meanwhile, advances in both hardware and software technologies have fostered considerable improvements in the precision of solutions that use ML, such as Deep Learning (DL). In this study, we compare the most widely used algorithms in classical ML and DL to classify the tumors described in the 11_tumor database. We obtained tumor identification accuracies between 90.6% (Logistic Regression) and 94.43% (Convolutional Neural Networks) using k-fold cross-validation. We also show how a tuning process may or may not significantly improve the algorithms’ accuracies. Our results demonstrate an efficient and accurate classification method based on gene expression (microarray data) and ML/DL algorithms, which facilitates tumor type prediction in a multi-cancer-type scenario.
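The k-fold cross-validation protocol behind such comparisons can be sketched in a few lines of plain Python. The nearest-class-mean toy classifier below is purely illustrative, not one of the algorithms compared in the paper:

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(fit, predict, X, y, k=5):
    """Mean accuracy over k train/test splits; each fold is held out once."""
    folds = k_fold_indices(len(X), k)
    accs = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        hits = sum(predict(model, X[j]) == y[j] for j in test_idx)
        accs.append(hits / len(test_idx))
    return sum(accs) / k

# Toy usage: a nearest-class-mean classifier on easily separable 1-D data.
X = [0.0, 0.1, 0.2, 0.9, 1.0, 1.1, 0.05, 0.95, 0.15, 1.05]
y = [0, 0, 0, 1, 1, 1, 0, 1, 0, 1]

def fit(Xs, ys):
    m0 = sum(x for x, t in zip(Xs, ys) if t == 0) / max(1, ys.count(0))
    m1 = sum(x for x, t in zip(Xs, ys) if t == 1) / max(1, ys.count(1))
    return (m0, m1)

def predict(model, x):
    return 0 if abs(x - model[0]) <= abs(x - model[1]) else 1

print(cross_validate(fit, predict, X, y, k=5))  # 1.0 on this separable toy set
```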
A survey on exponential random graph models: an application perspective
https://peerj.com/articles/cs-269 (published 2020-04-06)
Saeid Ghafouri, Seyed Hossein Khasteh
The uncertainty underlying real-world phenomena has attracted attention toward statistical analysis approaches. In this regard, many problems can be modeled as networks; thus, the statistical analysis of networked problems has received special attention from many researchers in recent years. Exponential Random Graph Models, known as ERGMs, are a popular family of statistical methods for analyzing the graphs of networked data. An ERGM is a generative statistical network model whose ultimate goal is to present a subset of networks with particular characteristics as a statistical distribution. In the context of ERGMs, these graph characteristics are called statistics or configurations. Most often they are counts of repeated subgraphs, such as the number of triangles or the number of cycles of an arbitrary length; any other summary of the graph, such as the edge density, can also serve as a statistic. In this review paper, after explaining the building blocks and classic methods of ERGMs, we review newly presented approaches and research papers. Further, we conduct a comprehensive study of the applications of ERGMs in many research areas, which to the best of our knowledge has not been done before. This review can serve as an introduction for scientists from various disciplines who aim to apply ERGMs to networked data in their field of expertise.
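For reference, an ERGM assigns each graph g a probability of the standard exponential-family form, where the s_i are the statistics (configurations) discussed above, the θ_i their weights, and κ(θ) the normalizing constant summing over the set 𝒢 of all graphs under consideration:

```latex
P_\theta(G = g) \;=\; \frac{\exp\left(\sum_i \theta_i \, s_i(g)\right)}{\kappa(\theta)},
\qquad
\kappa(\theta) \;=\; \sum_{g' \in \mathcal{G}} \exp\left(\sum_i \theta_i \, s_i(g')\right)
```

The intractability of κ(θ), a sum over all possible graphs, is what motivates the sampling-based estimation methods surveyed in the paper.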
A new non-monotonic infeasible simplex-type algorithm for Linear Programming
https://peerj.com/articles/cs-265 (published 2020-03-30)
Charalampos P. Triantafyllidis, Nikolaos Samaras
This paper presents a new simplex-type algorithm for Linear Programming with the following two main characteristics: (i) the algorithm computes basic solutions which are neither primal nor dual feasible, nor monotonically improving, and (ii) the sequence of these basic solutions is connected with a sequence of monotonically improving interior points to construct a feasible direction at each iteration. We compare the proposed algorithm with the state-of-the-art commercial CPLEX and Gurobi Primal-Simplex optimizers on a collection of 93 well-known benchmarks. The results are promising, showing that the new algorithm is competitive with the state-of-the-art solvers in the total number of iterations required to converge.
Towards a blockchain-based certificate authentication system in Vietnam
https://peerj.com/articles/cs-266 (published 2020-03-30)
Binh Minh Nguyen, Thanh-Chung Dao, Ba-Lam Do
Anti-forgery information, transaction verification, and smart contracts are functionalities of blockchain technology that can change the traditional business processes of IT applications. These functionalities increase data transparency and users’ trust in new application models, thus helping to resolve many social problems today. In this work, we take advantage of this technology to build a blockchain-based authentication system (called the Vietnamese Educational Certification blockchain, VECefblock) to address the problem of fake certificates in Vietnam. In this direction, we first categorize and analyze blockchain research and application trends to position our contributions in this domain. Our motivating factor is to curb fake certificates in Vietnam by applying blockchain technology where it suits the problem domain. The study proposes a set of blockchain-based application development principles and builds VECefblock step by step through the following procedures: designing the overall architecture along with the business processes and data mapping structure, and implementing a decentralized application that meets specific Vietnamese requirements. To test system functionalities, we used Hyperledger Fabric as the blockchain platform, deployed on the Amazon EC2 cloud. Through performance evaluations, we demonstrated the operability of VECefblock in a practical deployment environment. This experiment also shows the feasibility of our proposal, thus promoting the application of blockchain technology to social problems in general, as well as to certificate management in Vietnam.
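A common building block of blockchain-based certificate systems is anchoring a digest of each certificate on-chain and recomputing it at verification time; an unmodified copy reproduces the stored digest, while any tampering changes it. The schema and field names below are hypothetical, used only to illustrate this tamper-detection idea, not VECefblock's actual data mapping structure:

```python
import hashlib
import json

def certificate_fingerprint(cert):
    """Digest of a certificate's canonical JSON serialization; sorting keys
    and fixing separators makes the byte string deterministic."""
    canonical = json.dumps(cert, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical certificate record; the issuer writes its digest on-chain once.
issued = {"holder": "Nguyen Van A", "degree": "BSc Computer Science", "year": 2019}
on_chain = certificate_fingerprint(issued)

# Verification: recompute the digest of the presented copy and compare.
presented = dict(issued)
print(certificate_fingerprint(presented) == on_chain)  # True: authentic copy

forged = dict(issued, year=2020)
print(certificate_fingerprint(forged) == on_chain)     # False: tampering detected
```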
Legal document similarity: a multi-criteria decision-making perspective
https://peerj.com/articles/cs-262 (published 2020-03-23)
Rupali S. Wagh, Deepa Anand
The vast volume of documents available in legal databases demands effective information retrieval approaches that take into consideration the intricacies of the legal domain. Relevant document retrieval is the backbone of the legal domain, and the concept of relevance in this domain is complex and multi-faceted. In this work, we propose a novel approach to concept-based similarity estimation among court judgments. We use a graph-based method to identify prominent concepts present in a judgment and extract sentences representative of these concepts. The sentences and concepts so mined are used to express and visualize the likeness between concepts in a pair of documents from different perspectives. We also propose to aggregate the different levels of matching so obtained into one measure quantifying the level of similarity between a judgment pair, employing the ordered weighted average (OWA) family of aggregation operators to obtain the similarity value. The experimental results suggest that the proposed concept-based similarity approach is effective in the retrieval of relevant legal documents and performs better than competing techniques. Additionally, the proposed two-level abstraction of similarity enables informative visualization for deeper insights into case relevance.
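The OWA aggregation used to combine the per-concept matching levels weights the sorted scores by position rather than by source, which lets one weight vector emphasize, say, the strongest matches regardless of which concepts produced them. A small sketch; the weight vector here is hypothetical, not the one used in the paper:

```python
def owa(values, weights):
    """Ordered weighted average: sort inputs in descending order, then take
    the weighted sum with position-based (not source-based) weights."""
    assert abs(sum(weights) - 1.0) < 1e-9 and len(values) == len(weights)
    return sum(w * v for w, v in zip(weights, sorted(values, reverse=True)))

# Three per-concept similarity scores between a judgment pair; the weights
# emphasize the strongest matches.
scores = [0.9, 0.4, 0.7]
print(owa(scores, [0.5, 0.3, 0.2]))  # ≈ 0.74 (0.5*0.9 + 0.3*0.7 + 0.2*0.4)
```

Because the weights attach to ranks, setting them to (1, 0, 0) recovers the maximum and (1/n, ..., 1/n) the plain mean, so OWA interpolates between optimistic and neutral aggregation.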