PeerJ Computer Science Preprints: Natural Language and Speech

PeerJ Computer Science Preprints: Natural Language and Speech https://peerj.com/preprints/index.atom?journal=cs&subject=10600 Natural Language and Speech articles published in PeerJ Computer Science Preprints ProPheno 1.0: An online dataset for accelerating the complete characterization of the human protein-phenotype landscape in biomedical literature https://peerj.com/preprints/27479 2019-05-02 2019-05-02 Morteza Pourreza Shahri Indika Kahanda

Identifying protein-phenotype relations is of paramount importance for applications such as uncovering rare and complex diseases. One of the best resources that captures the protein-phenotype relationships is the biomedical literature. In this work, we introduce ProPheno, a comprehensive online dataset composed of human protein/phenotype mentions extracted from the complete corpora of Medline and PubMed Central Open Access. Moreover, it includes co-occurrences of protein-phenotype pairs within different spans of text such as sentences and paragraphs. We use ProPheno for completely characterizing the human protein-phenotype landscape in biomedical literature. ProPheno, the reported findings and the gained insight has implications for (1) biocurators for expediting their curation efforts, (2) researches for quickly finding relevant articles, and (3) text mining tool developers for training their predictive models. The RESTful API of ProPheno is freely available at http://propheno.cs.montana.edu.

Current health care systems require clinicians to spend a substantial amount of time to digitally document their interactions with their patients through the use of electronic health records (EHRs), limiting the time spent on face-to-face patient care. Moreover, the use of EHRs is known to be highly inefficient due to additional time it takes for completion, which also leads to clinician burnout. In this project, we explore the feasibility of developing an automated case notes system for psychiatrists using text mining techniques that will listen to doctor-patient conversations, generate digital transcripts using speech-to-text conversion, classify information from the transcripts into relevant categories, and automatically generate structured case notes. In our preliminary work, we develop a human-powered doctor-patient conversation transcript annotator and obtain a gold standard dataset through the National Alliance of Mental Illness (NAMI) Montana. We model the task of classifying parts of conversations into six broad categories such as medical and family history as a supervised classification problem and apply several popular machine learning algorithms. According to our preliminary experimental results obtained through 5-fold cross-validation, Support Vector Machines are able to classify an unseen transcript with an average AUROC (area under the receiver operating characteristic curve) score of 89%. Finally, we use part-of-speech (POS) tagging, grammatical rules of English language and verb conjugation, we generate written versions of the pieces of text belonging to different categories. These formal text are aggregated in to filling different sections of the EHR forms.

Cognitive neuroscience is the study of how the human brain functions on tasks like decision making, language, perception and reasoning. Deep learning is a class of machine learning algorithms that use neural networks. They are designed to model the responses of neurons in the human brain. Learning can be supervised or unsupervised. Ngram token models are used extensively in language prediction. Ngrams are probabilistic models that are used in predicting the next word or token. They are a statistical model of word sequences or tokens and are called Language Models or Lms. Ngrams are essential in creating language prediction models. We are exploring a broader sandbox ecosystems enabling for AI. Specifically, around Deep learning applications on unstructured content form on the web.

Manual curation of scientific literature for ontology-based knowledge representation has proven infeasible and unscalable to the large and growing volume of scientific literature. Automated annotation solutions that leverage text mining and Natural Language Processing (NLP) have been developed to ameliorate the problem of literature curation. These NLP approaches use parsing, syntactical, and lexical analysis of text to recognize and annotate pieces of text with ontology concepts. Here, we conduct a comparison of four state of the art NLP tools at the task of recognizing Gene Ontology concepts from biomedical literature using the Colorado Richly Annotated Full-Text (CRAFT) corpus as a gold standard reference. We demonstrate the use of semantic similarity metrics to compare NLP tool annotations to the gold standard.

We propose a simple neural network model which can learn relation between sentences by passing their representations obtained from Long Short Term Memory(LSTM) through a Relation Network. The Relation Network module tries to extract similarity between multiple contextual representations obtained from LSTM. Our model is simple to implement, light in terms of parameters and works across multiple supervised sentence comparison tasks. We show good results for the model on two sentence comparison datasets. Unbalanced sentiment classification: an assessment of ANN in the context of sampling the majority class https://peerj.com/preprints/26618 2018-03-05 2018-03-05 Rodrigo Moraes João Francisco Valiati Wilson Pires Gavião Neto

Many people make their opinions available on the Internet nowadays, and researchers have been proposing methods to automate the task of classifying textual reviews as positive or negative. Usual supervised learning techniques have been adopted to accomplish such a task. In practice, positive reviews are abundant in comparison to negative's. This context poses challenges to learning-based methods and data undersampling/oversampling are popular preprocessing techniques to overcome the problem. A combination of sampling techniques and learning methods, like Artificial Neural Networks (ANN) or Support Vector Machines (SVM), has been successfully adopted as a classification approach in many areas, while the sentiment classification literature has not explored ANN in studies that involve sampling methods to balance data. Even the performance of SVM, which is widely used as a sentiment learner, has been rarely addressed under the context of a preceding sampling method. This paper addresses document-level sentiment analysis with unbalanced data and focus on empirically assessing the performance of ANN in the context of undersampling the (majority) set of positive reviews. We adopted the performance of SVM as a baseline, since some studies have indicated SVM as being less subject to the class imbalance problem. Results are produced in terms of a traditional bag-of-words model with popular feature selection and weighting methods. Our experiments indicated that SVM are more stable than ANN in highly unbalanced (80%) data scenarios. However, under the discarding of information generated by random undersampling, ANN outperform SVM or produce comparable results.

Background. Office of Academic Affairs (OAA), Office of Student Life (OSL) and Information Technology Helpdesk (ITD) are support functions within a university which receives hundreds of email messages on the daily basis. A large percentage of emails received by these departments are frequent and commonly used queries or request for information. Responding to every query by manually typing is a tedious and time consuming task and an automated approach for email response suggestion can save lot of time. Methods. We propose an application and solution approach for automatically generating and suggesting short email responses to support queries in a university environment. Our proposed solution can be used as one tap or one click solution for responding to various types of queries raised by faculty members and students in a university. We create a dataset for the application domain and make it publicly available. We apply a machine learning framework for classifying emails into categories such as office of academic affairs or information technology department. We apply a machine learning based classification approach for sub-category level classification also. We apply text pre-processing techniques, feature selection, support vector machine and naïve naive classifiers. We present an approach to overcome various natural language processing based challenges in the text. Results. We conduct a series of experiments and evaluate the approach using confusion matrix and accuracy based metrics. We study the discriminatory power of features and compare their relevance for the classification task. Our experimental results reveal that the proposed approach is effective. We conclude from our experiments that discriminatory features can be extracted from the text within our specific domain and automatic email response suggestion can be accurately created using machine learning algorithms and framework. We experiment with two different learning algorithms and observe that SVM outperforms Naïve Bayes. We achieve a classification accuracy of above $85\%$ for all the classes and sub-classes. Discussion. Our experiments on email response suggestion are conducted on a corpus consists of short and frequent emails by a university function but the proposed approach and techniques can be generalized to other domains also. We observe that different classifiers give different results and there is a significant difference in the predictive power of features.

Human speech is the most important part of General Artificial Intelligence and subject of much research. The hypothesis proposed in this article provides explanation of difficulties that modern science tackles in the field of human brain simulation. The hypothesis is based on the author’s conviction that the brain of any given person has different ability to process and store information. Therefore, the approaches that are currently used to create General Artificial Intelligence have to be altered. Weather events identification in social media streams: tools to detect their evidence in Twitter https://peerj.com/preprints/2241 2017-09-21 2017-09-21 Valentina Grasso Imad Zaza Federica Zabini Gianni Pantaleo Paolo Nesi Alfonso Crisci

Severe weather impact identification and monitoring through social media data is a good challenge for data science. In last years we assisted to an increase of natural disasters, also due to climate change. Many works showed that during such events people tend to share specific messages by of mean of social media platforms, especially Twitter. Not only they contribute to"situational" awareness also improving the dissemination of information during emergency but can be used to assess social impact of crisis events. We present in this work preliminary findings concerning how temporal distribution of weather related messages may help the identification of severe events that impacted a community. Severe weather events are recognizable by observing the synchronization of twitter streams volumes concerning extractions by using different but semantically graduate terms and hash-tags including the specific containing geo-content names. Impacting events seems immediately recognizable by graphical representation of weather streams and when the time-line show a specific parallel-wise pattern that we named "Half Onion Shape". Different but weather semantically linked twitter streams could exhibits different magnitude, in order to their term popularity, but they show, when a weather event occurs, the same temporal relative maximum. In reason of to these interesting indications, that needs to be confirmed through more deeper analysis, and of the great use of social media, as Twitter, during crisis events it's becoming fundamental to have a suite of suitable tools to monitor social media data. For Twitter data a comprehensive suite of tools is presented: the DISIT-Twitter Vigilance Platform for twitter data retrieve,management and visualization.

Scaling up the analysis of sensitive or confidential documents frequently stumbles on the limited number of individuals with the necessary clearance to access the documents. The availability of cryptographic protocols compatible with text processing methods can greatly improve this situation allowing for the automated processing of large corpora of confidential documents by ``untrusted'' third-parties. In this paper we propose a protocol which allows for secure outsourcing of text analytics tasks without compromising the confidentiality of documents. The method scales to large corpora, and presents linear time complexity on the size of the corpus.