PeerJ Computer Science Preprints: Computational Linguistics
https://peerj.com/preprints/index.atom?journal=cs&subject=8800
Computational Linguistics articles published in PeerJ Computer Science Preprints

LSTM neural network for textual ngrams
https://peerj.com/preprints/27377 | 2018-11-23 | Shaun C. D'Souza
Cognitive neuroscience is the study of how the human brain functions on tasks like decision making, language, perception and reasoning. Deep learning is a class of machine learning algorithms that use neural networks, which are designed to model the responses of neurons in the human brain. Learning can be supervised or unsupervised. Ngram token models are used extensively in language prediction: ngrams are probabilistic models used to predict the next word or token. They are statistical models of word or token sequences, known as language models (LMs), and are essential in creating language prediction models. We explore a broader sandbox ecosystem enabling AI, specifically deep learning applications on unstructured content on the web.
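As a concrete illustration of the ngram language model described in this abstract, the following is a minimal bigram sketch (not taken from the preprint; the corpus and function names are invented for illustration):

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count bigram frequencies and normalize them into
    conditional probabilities P(next_word | word)."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        model[prev] = {w: c / total for w, c in nxt_counts.items()}
    return model

def predict_next(model, word):
    """Return the most probable next token after `word`."""
    return max(model[word], key=model[word].get)

corpus = ["the cat sat", "the cat ran", "the dog sat"]
model = train_bigram_model(corpus)
```

With this toy corpus, `predict_next(model, "the")` returns "cat", since "cat" follows "the" more often than "dog". An LSTM, as in the preprint's title, replaces these fixed-length counts with a learned recurrent state.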

Evaluating social network extraction for classic and modern fiction literature
https://peerj.com/preprints/27263 | 2018-10-08 | Niels Dekker, Tobias Kuhn, Marieke van Erp
The analysis of literary works has experienced a surge in computer-assisted processing. To obtain insights into the community structures and social interactions portrayed in novels, the creation of social networks from novels has gained popularity. Many methods rely on identifying named entities and relations to construct these networks, but many of these tools were not created specifically for the literary domain. Furthermore, studies on information extraction from literature typically focus on 19th-century source material. Because of this, it is unclear whether these techniques are as suitable for modern-day science fiction and fantasy literature as they are for those 19th-century classics. We present a study comparing classic literature to modern literature in terms of the performance of natural language processing tools for the automatic extraction of social networks, as well as their network structure. We find that there are no significant differences between the two sets of novels, but that both are subject to a high amount of variance. Furthermore, we identify several issues that complicate named entity recognition in modern novels, and we present methods to remedy them.
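A common way to build the social networks this abstract describes is to count character co-occurrences within a sentence. The following is a toy sketch of that step, assuming the named entity recognition stage is replaced by a given character list (the sentences and names are invented):

```python
from itertools import combinations
from collections import Counter

def cooccurrence_network(sentences, characters):
    """Build a weighted co-occurrence network: the edge (a, b)
    is incremented whenever both names appear in one sentence."""
    edges = Counter()
    for sentence in sentences:
        present = sorted({c for c in characters if c in sentence})
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return edges

sentences = [
    "Alice met Bob at the harbour.",
    "Bob wrote to Carol.",
    "Alice and Bob argued with Carol.",
]
edges = cooccurrence_network(sentences, {"Alice", "Bob", "Carol"})
```

In practice the character list comes from a named entity recognizer, which is exactly the component the study finds to be fragile on modern novels (nicknames, invented names, and aliases).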

Unbalanced sentiment classification: an assessment of ANN in the context of sampling the majority class
https://peerj.com/preprints/26618 | 2018-03-05 | Rodrigo Moraes, João Francisco Valiati, Wilson Pires Gavião Neto
Many people make their opinions available on the Internet nowadays, and researchers have been proposing methods to automate the task of classifying textual reviews as positive or negative. Standard supervised learning techniques have been adopted to accomplish this task. In practice, positive reviews are abundant in comparison to negative ones. This context poses challenges to learning-based methods, and data undersampling/oversampling are popular preprocessing techniques to overcome the problem. A combination of sampling techniques and learning methods, such as Artificial Neural Networks (ANN) or Support Vector Machines (SVM), has been successfully adopted as a classification approach in many areas, yet the sentiment classification literature has not explored ANN in studies that involve sampling methods to balance data. Even the performance of SVM, which is widely used as a sentiment learner, has rarely been addressed in the context of a preceding sampling method. This paper addresses document-level sentiment analysis with unbalanced data and focuses on empirically assessing the performance of ANN in the context of undersampling the (majority) set of positive reviews. We adopted the performance of SVM as a baseline, since some studies have indicated that SVM are less subject to the class imbalance problem. Results are produced in terms of a traditional bag-of-words model with popular feature selection and weighting methods. Our experiments indicated that SVM are more stable than ANN in highly unbalanced (80%) data scenarios. However, despite the information discarded by random undersampling, ANN outperform SVM or produce comparable results.
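Random undersampling, the preprocessing step this abstract evaluates, can be sketched in a few lines (a minimal illustration, not the authors' experimental code; the function name is invented):

```python
import random

def undersample_majority(samples, labels, seed=0):
    """Random undersampling: keep all minority-class samples and a
    random subset of the majority class of the same size."""
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    n_min = min(len(group) for group in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for y, group in by_class.items():
        chosen = group if len(group) == n_min else rng.sample(group, n_min)
        balanced.extend((s, y) for s in chosen)
    return balanced
```

For a corpus with 8 positive and 2 negative reviews, this yields a balanced set of 4 samples; the discarded 6 positive reviews are exactly the "discarded information" the paper's ANN-vs-SVM comparison is about.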

Taking R to its limits: 70+ tips
https://peerj.com/preprints/26605 | 2018-03-03 | Michail Tsagris, Manos Papadakis
R has many capabilities, most of which are unknown to many users and waiting to be discovered. For this reason we provide tips on how to write really efficient code without having to program in C++, along with programming advice and tips to avoid errors and numerical overflows. This is the first time, to the best of our knowledge, that such a long list of tips has been provided. The tips are categorized, according to their use, into matrices, simple functions, numerical optimization, parallel computing, programming tips, general advice, etc.

Deep learning for conflicting statements detection in text
https://peerj.com/preprints/26589 | 2018-03-01 | Vijay Lingam, Simran Bhuria, Mayukh Nair, Divij Gurpreetsingh, Anjali Goyal, Ashish Sureka
Background. Automatic contradiction detection, or conflicting statements detection in text, consists of identifying discrepancy, inconsistency and defiance in text, and has several real-world applications in question answering systems, multi-document summarization, dispute detection in news, and detection of contradictions in opinions and sentiments on social media. Automatic contradiction detection is a technically challenging natural language processing problem. Contradiction detection between sources of text, or between the two sentences in a pair, can be framed as a classification problem.
Methods. We propose an approach for detecting three different types of contradiction: negation, antonyms and numeric mismatch. We derive several linguistic features from text and use them in a classification framework for detecting contradictions. The novelty of our approach relative to existing work is the application of artificial neural networks and deep learning. Our approach uses techniques such as Long Short-Term Memory (LSTM) and Global Vectors for Word Representation (GloVe). We conduct a series of experiments on three publicly available datasets on contradiction detection: the Stanford dataset, the SemEval dataset and the PHEME dataset. In addition to the existing datasets, we also create additional datasets and make them publicly available. We measure the performance of our proposed approach using confusion and error matrices and accuracy.
Results. We evaluate three feature combinations on our datasets: manual features, LSTM-based features, and the combination of manual and LSTM features. The accuracy of our classifier based on both LSTM and manual features is 91.2% for the SemEval dataset, where the classifier correctly classified 3204 out of 3513 instances, and 71.9% for the Stanford dataset, where it correctly classified 855 out of 1189 instances. The accuracy for the PHEME dataset is the highest across all datasets; the accuracy for the contradiction class is 96.85%.
Discussion. Experimental analysis demonstrates encouraging results, supporting our hypothesis that deep learning along with LSTM-based features can be used for identifying contradictions in text. Our results show an accuracy improvement over manual features after applying LSTM-based features. The accuracy results vary across datasets, and we observe different accuracy across the multiple types of contradictions. Feature analysis shows that the discriminatory power of the five features varies.
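One of the three contradiction types named in this abstract, numeric mismatch, lends itself to a simple rule-based sketch (an illustration only, not the authors' LSTM pipeline; the function name and examples are invented):

```python
import re

def numeric_mismatch(sent_a, sent_b):
    """Flag a potential numeric-mismatch contradiction: both
    sentences mention numbers, but the sets of numbers differ."""
    nums_a = set(re.findall(r"\d+(?:\.\d+)?", sent_a))
    nums_b = set(re.findall(r"\d+(?:\.\d+)?", sent_b))
    return bool(nums_a) and bool(nums_b) and nums_a != nums_b
```

For example, "The crash killed 12 people." versus "The crash killed 20 people." is flagged, while a pair reporting the same figure is not. A signal like this can serve as one manual feature alongside the learned LSTM features the paper combines.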

An empirical analysis of machine learning models for automated essay grading
https://peerj.com/preprints/3518 | 2018-01-09 | Deva Surya Vivek Madala, Ayushree Gangal, Shreyash Krishna, Anjali Goyal, Ashish Sureka
Background. Automated Essay Scoring (AES) is an area which falls at the intersection of computing and linguistics. AES systems conduct a linguistic analysis of a given essay or prose and then estimate the writing skill or the essay quality in the form of a numeric score or a letter grade. AES systems are useful for schools, universities and testing companies for efficiently and effectively scaling the task of grading a large number of essays.
Methods. We propose an approach for automatically grading a given essay based on 9 surface-level and deep linguistic features, 2 feature selection and ranking techniques and 4 text classification algorithms. We conduct a series of experiments on publicly available manually graded and annotated essay data and demonstrate the effectiveness of our approach. We investigate the performance of two different feature selection techniques, (1) RELIEF and (2) Correlation-based Feature Subset Selection (CFS), with three different machine learning classifiers (kNN, SVM and Linear Regression). We also apply feature normalization and scaling.
Results. Our results indicate that features like word count with respect to the word limit, appropriate use of vocabulary, relevance of the terms in the essay to the given topic, and coherence between sentences and paragraphs are good predictors of essay score. Our analysis reveals that not all features are equally important: a few features are more relevant and better correlated with the target class. We conduct experiments with k-nearest neighbour, logistic regression and support vector machine based classifiers. Our results on 4075 essays across multiple topics and grade score ranges are encouraging, with an accuracy of 73% to 93%.
Discussion. Our experiments and approach are based on Grade 7 to Grade 10 essays, which can be generalized to essays from other grades and levels after context-specific customization. A few features are more relevant and important than others, and it is the interplay, or combination, of multiple feature values that determines the final score. We observe that different classifiers result in different accuracy.
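Two of the surface-level predictors this abstract highlights, word count relative to the word limit and topic relevance, can be sketched as simple features (a toy illustration under invented names and thresholds, not the authors' 9-feature pipeline):

```python
def surface_features(essay, word_limit=300, topic_terms=()):
    """Toy surface-level features: word count relative to the limit
    and overlap of the essay vocabulary with the given topic terms."""
    words = essay.lower().split()
    length_ratio = len(words) / word_limit
    overlap = len(set(words) & set(topic_terms)) / max(len(topic_terms), 1)
    return {"length_ratio": round(length_ratio, 3),
            "topic_overlap": round(overlap, 3)}
```

Feature vectors like this (extended with the deeper linguistic features) are what RELIEF or CFS would then rank before training the kNN, SVM or regression classifier.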

Multilayered model of speech
https://peerj.com/preprints/1576 | 2018-01-04 | Andrey Chistyakov
Human speech is the most important part of General Artificial Intelligence and the subject of much research. The hypothesis proposed in this article provides an explanation of the difficulties that modern science tackles in the field of human brain simulation. The hypothesis is based on the author's conviction that the brain of any given person has a different ability to process and store information. Therefore, the approaches that are currently used to create General Artificial Intelligence have to be altered.

Digitising a machine-tractable version of Kamus Dewan with TEI-P5
https://peerj.com/preprints/2205 | 2016-07-01 | Lian Tze Lim, Ruoh Tau Chiew, Enya Kong Tang, Rusli Abdul Ghani, Naimah Yusof
Kamus Dewan is the authoritative dictionary for Bahasa Malaysia, containing a wealth of linguistic and cultural information about Bahasa Malaysia. It is currently available in print, as well as a searchable online dictionary. However, the online dictionary lacks advanced search capabilities that target specific fields within each headword and lemma entry. For this information to be targeted and extracted efficiently by computers, the macro- and micro-structures of Kamus Dewan entries first need to be annotated or marked up explicitly. We describe how the TEI-P5 guidelines have been applied in this endeavour to make the Kamus Dewan more machine-tractable. We also give some examples of how the machine-tractable data from Kamus Dewan can be used for linguistic research and analysis, as well as for producing other language resources.
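The machine-tractability this abstract describes means that a program can address specific fields of an entry directly. A sketch of what this enables, using a hypothetical minimal TEI-P5-style entry (the element names <entry>, <form>, <orth>, <sense>, <def> follow TEI conventions, but the sample content and simplified structure are invented, not taken from Kamus Dewan):

```python
import xml.etree.ElementTree as ET

# A hypothetical, heavily simplified TEI-P5-style dictionary entry.
tei = """
<entry xml:id="rumah">
  <form><orth>rumah</orth></form>
  <sense n="1"><def>house; dwelling</def></sense>
</entry>
"""

root = ET.fromstring(tei)
headword = root.findtext("./form/orth")            # targeted field access
definitions = [d.text for d in root.findall("./sense/def")]
```

Because the headword and each sense live in their own elements, a search tool can query "orthography only" or "definitions only", which is exactly what the plain-text online dictionary cannot do.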

Outline for an information theoretic search engine
https://peerj.com/preprints/1400 | 2015-10-07 | Koos Vanderwilt
It is proposed that an information theoretic search engine is like RADAR. The query words are the emitted signals, and the document database is the object to be detected. Various echoes come off the database and, analogously to echo cancellation, the signal with the lowest entropy is selected. Commensurate with Shannon's theory, low-entropy documents are signal and higher-entropy documents are noise. Thus, my proposal separates signal from noise. As many relevant documents as desired can be tuned to be signal.
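The core ranking rule of this proposal, preferring low-entropy documents, can be sketched directly from Shannon's definition (an illustrative reading of the outline, not the author's implementation; the word-level entropy estimate is an assumption):

```python
import math
from collections import Counter

def entropy(text):
    """Shannon entropy (bits per word) of a document's word distribution."""
    words = text.lower().split()
    counts = Counter(words)
    n = len(words)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def rank_by_entropy(documents):
    """Rank documents lowest-entropy first ('signal' before 'noise')."""
    return sorted(documents, key=entropy)
```

A document that repeats a few terms has low entropy and ranks first as "signal"; a document spread over many distinct terms has higher entropy and sinks toward "noise".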

A Socratic epistemology for verbal emotional intelligence
https://peerj.com/preprints/1292 | 2015-08-10 | Abe Kazemzadeh, James Gibson, Panayiotis Georgiou, Sungbok Lee, Shrikanth Narayanan
We describe and experimentally validate a question-asking framework for machine-learned linguistic knowledge about human emotions. Using the Socratic method as a theoretical inspiration, we develop an experimental method and computational model for computers to learn subjective information about emotions by playing emotion twenty questions (EMO20Q), a game of twenty questions limited to words denoting emotions. Using human-human EMO20Q data, we bootstrap a sequential Bayesian model that drives a generalized pushdown automaton-based dialog agent that further learns from 300 human-computer dialogs collected on Amazon Mechanical Turk. The human-human EMO20Q dialogs show the capability of humans to use a large, rich, subjective vocabulary of emotion words. Training on successive batches of human-computer EMO20Q dialogs shows that the automated agent is able to learn from subsequent human-computer interactions. Our results show that the training procedure enables the agent to learn a large set of emotion words. The fully trained agent successfully completes EMO20Q at 67% of human performance and 30% better than the bootstrapped agent. Even when the agent fails to guess the human opponent's emotion word in the EMO20Q game, the agent's behavior of searching for knowledge makes it appear human-like, which enables the agent to maintain user engagement and learn new, out-of-vocabulary words. These results lead us to conclude that the question-asking methodology and its implementation as a sequential Bayes pushdown automaton are a successful model for the cognitive abilities involved in learning, retrieving, and using emotion words by an automated agent in a dialog setting.
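The sequential Bayesian model at the heart of this agent can be sketched as a belief update over candidate emotion words after each yes/no answer (a minimal illustration; the candidate words and likelihood values are invented, and the real agent's likelihoods are learned from EMO20Q dialogs):

```python
def bayes_update(prior, likelihoods, answer):
    """One sequential Bayesian step: rescale the belief over candidate
    emotion words by the likelihood of the observed yes/no answer."""
    posterior = {}
    for word, p in prior.items():
        p_yes = likelihoods[word]
        posterior[word] = p * (p_yes if answer == "yes" else 1 - p_yes)
    total = sum(posterior.values())
    return {w: p / total for w, p in posterior.items()}

# Uniform belief over three candidates; likelihoods encode (hypothetically)
# P(answer "yes" to "is it a positive emotion?" | word).
prior = {"happy": 1/3, "sad": 1/3, "angry": 1/3}
likelihoods = {"happy": 0.9, "sad": 0.1, "angry": 0.1}
posterior = bayes_update(prior, likelihoods, "yes")
```

After a "yes" to "is it a positive emotion?", the belief mass shifts sharply toward "happy"; asking up to twenty such questions and renormalizing after each answer is the sequential part, while the pushdown automaton manages the dialog turns around it.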