A longitudinal study of topic classification on Twitter

PeerJ Computer Science
This is an extended and revised version of a preliminary conference report that was presented in Iman et al. (2017).
See Manning, Raghavan & Schütze (2008) for a discussion and definition of this commonly used ranking metric.
We use Logistic Regression here because it allows us to better understand failure cases for topical classifiers; Random Forest is likely to have gotten all of the top 5 right, leaving few failure cases to examine.
We could not run these longitudinal experiments with Random Forest due to the significant computational expense of the analysis in this section and the hyperparameter tuning that is required, thus we opted to perform this analysis with the much faster and still strongly competitive Logistic Regression classifier.
The ranking for Random Forest only differs slightly.
The United Nations Children’s Fund (UNICEF) is an organization that aims to provide emergency food and healthcare to children and mothers in developing countries everywhere.
We remark that the original Black Lives Matter protests originated in Ferguson, Missouri (in the greater St. Louis area) in the aftermath of the police shooting of Michael Brown on August 9, 2014.
It should also be remarked that Mutual Information (MI) is very sensitive to frequency, so a high MI feature must be both informative and frequent to rank highly. This explains why the high MI features are so generic, i.e., they are frequent and hence cover many more tweets than low MI features.


Introduction

  • We empirically show that the random forest classifier generalizes well to unseen future topical content (including content with no hashtags) in terms of its average precision (AP) and Precision@n (for a range of n) evaluated over long time-spans of typically 1 year or more after training.

  • We demonstrate that the performance of classifiers tends to drop over time: roughly a 35% drop in Mean Average Precision 350 days after training ends, an expected but nonetheless significant decrease. We attribute this to the fact that over long periods of time, features that are predictive during the training period may prove ephemeral and fail to generalize to prediction at future times.

  • To address the problem above, we show that one can remove tweets containing training hashtags from the validation set to allow better parameter tuning, leading to less overfitting and improved long-term generalization. Although our approach here is simple, it yields a roughly 11% improvement in Mean Average Precision.

  • Finally, we provide a detailed analysis of features and feature classes and how they contribute to classifier performance. Among numerous insights, we show that the hashtag and simple term feature classes contain some of the most informative feature instances. We also show that the volume of a user's tweets correlates more strongly with that user's informativeness than their follower or friend counts do.

Notation and problem definition

Data description

  • Hashtag: a topical keyword specified using the # sign.

  • Mention: a Twitter username reference using the @ sign.

  • Term: any non-hashtag, non-mention unigram.
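
To make the three token types above concrete, the following is a minimal sketch (not the authors' preprocessing code) of splitting a tweet's text into hashtags, mentions, and terms; the example tweet and the simple regular expression are illustrative assumptions only.

```python
# Minimal illustrative tokenizer for the three token types defined above.
# The regular expression and example tweet are assumptions, not the paper's code.
import re

def split_tweet(text):
    tokens = re.findall(r"[#@]?\w+", text.lower())
    hashtags = [t for t in tokens if t.startswith("#")]
    mentions = [t for t in tokens if t.startswith("@")]
    terms = [t for t in tokens if not t.startswith(("#", "@"))]
    return hashtags, mentions, terms

print(split_tweet("Earthquake relief update from @UNICEF #NepalEarthquake"))
# -> (['#nepalearthquake'], ['@unicef'], ['earthquake', 'relief', 'update', 'from'])
```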

Methodology

Dataset labelling

Dataset splitting

Topic classification features

Supervised learning algorithms

  1. Logistic Regression (LR) (Fan et al., 2008): LR uses a logistic function to predict the probability that a tweet is topical. We used L2 regularization with the hyperparameter C (the inverse of regularization strength) selected from a search over the values C ∈ {10⁻¹², 10⁻¹¹, …, 10¹¹, 10¹²}.

  2. Naïve Bayes (NB) (McCallum & Nigam, 1998): NB makes the naïve assumption that all features are independent conditioned on the class label. Despite the general incorrectness of this independence assumption, McCallum & Nigam (1998) remark that it is known to make an effective topic classifier. Like LR, NB predicts the probability that a tweet is topical. For parameter estimation, we used Bayesian smoothing with Dirichlet priors, with the hyperparameter α selected from a search over the values α ∈ {10⁻²⁰, 10⁻¹⁵, 10⁻⁸, 10⁻³, 10⁻¹, 1}.

  3. RankSVM (Lee & Lin, 2014): RankSVM is a variant of the support vector machine algorithm used to learn from pairwise comparison data (in our case, each pair consists of a positively labeled datum that should be ranked above a negatively labeled datum) and naturally produces a ranking. We used a linear kernel with the regularization hyperparameter C (the trade-off between training error and margin) selected in the range C ∈ {10⁻¹², 10⁻¹¹, …, 10¹¹, 10¹²}.

  4. Random Forest (RF) (Breiman, 2001): RF is an ensemble learning method for classification that operates by constructing a multitude of decision trees at training time and predicting the class that is the mode of the individual trees' class predictions (the number of trees voting for the predicted class serves as the ranking score). RF is known to generalize well due to its robustness to overfitting. For RF, we tuned the number of trees in the forest, selected from a search over the values {10, 20, 50, 100, 200}.

  5. k-Nearest Neighbors (k-NN) (Aha, Kibler & Albert, 1991): k-NN is a non-parametric method used for classification. An instance is assigned to the class most common among its k nearest neighbors by plurality vote (the number of nearest neighbors belonging to that class serves as the ranking score). The value of k is the primary hyperparameter for k-NN and was selected from a search over the values {1, 2, 3, …, 10}. A minimal sketch of the validation-based hyperparameter search described in this list is given below.
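
The sketch below illustrates the kind of validation-based hyperparameter search described above for two of the classifiers, using scikit-learn; the placeholder data, the library choice, and the use of validation-set average precision as the selection criterion are our assumptions rather than a reproduction of the authors' pipeline.

```python
# Hedged sketch of grid-searching LR's C and RF's number of trees on a
# held-out validation set, selecting by average precision (AP).
# The data below are random placeholders, not the Twitter dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
X_train, y_train = rng.random((200, 50)), rng.integers(0, 2, 200)
X_val, y_val = rng.random((100, 50)), rng.integers(0, 2, 100)

# Logistic Regression: search C (inverse regularization strength) on a log grid.
best_lr, best_lr_ap = None, -1.0
for c in (10.0 ** e for e in range(-12, 13)):
    lr = LogisticRegression(C=c, penalty="l2", max_iter=1000).fit(X_train, y_train)
    ap = average_precision_score(y_val, lr.predict_proba(X_val)[:, 1])
    if ap > best_lr_ap:
        best_lr, best_lr_ap = lr, ap

# Random Forest: search the number of trees in the forest.
best_rf, best_rf_ap = None, -1.0
for n in (10, 20, 50, 100, 200):
    rf = RandomForestClassifier(n_estimators=n, random_state=0).fit(X_train, y_train)
    ap = average_precision_score(y_val, rf.predict_proba(X_val)[:, 1])
    if ap > best_rf_ap:
        best_rf, best_rf_ap = rf, ap

print(f"LR: best C = {best_lr.C:g}, validation AP = {best_lr_ap:.3f}")
print(f"RF: best #trees = {best_rf.n_estimators}, validation AP = {best_rf_ap:.3f}")
```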

Results and Discussion

Classification performance analysis

Overall classification performance

  • AP: Average Precision over the ranked list (Manning, Raghavan & Schütze, 2008); the mean over all topics provides the Mean Average Precision (MAP). A small worked example of these metrics is given after this list.

  • P@k: Precision at k for k ∈ {10, 100, 1000}.

  • ✓: the tweet was labeled topical by our test hashtag set.

  • ★: the tweet was determined to be topical through manual evaluation even though it did not contain a hashtag in our curated hashtag set (this corresponds to a mislabeled example due to the non-exhaustive strategy used to label the data).

  • ✗: the tweet was not topical.
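
As a small worked example of the metrics above, the sketch below computes AP and P@k for a short ranked list; the scores and labels are hypothetical, chosen only to illustrate the computation.

```python
# Worked toy example of Average Precision and Precision@k for a ranked list.
# Scores and labels are hypothetical.
import numpy as np
from sklearn.metrics import average_precision_score

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])  # classifier scores (descending rank)
labels = np.array([1, 0, 1, 1, 0, 0])              # 1 = topical tweet, 0 = non-topical

def precision_at_k(labels, scores, k):
    order = np.argsort(-scores)        # rank tweets by descending score
    return labels[order][:k].mean()    # fraction of topical tweets in the top k

print("AP  =", round(average_precision_score(labels, scores), 3))  # (1 + 2/3 + 3/4) / 3 ≈ 0.806
print("P@3 =", round(precision_at_k(labels, scores, 3), 3))        # 2 of the top 3 are topical
```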

Longitudinal classification performance

  • Regarding question (1), it is clear that classification performance drops over time: a roughly 35% drop in MAP from the 50th to the 350th day after training. Clearly, there will be topical drift over time for most topics (e.g., Natural Disasters, Social Issues, Epidemics) as different events occur and shift the focus of topical conversation. While there are more sophisticated training methods for mitigating some of this temporal drift (e.g., Wang et al., 2019), overall, the most practical and effective method for long-term generalization would seem to involve a periodic update of training hashtags and data labels.

  • Regarding question (2), Fig. 2E clearly shows an overall performance improvement from discarding training hashtags (and their tweets) from the validation set; for MAP alone, we see roughly an 11% improvement. Hence, these experiments suggest there may be a long-term generalization advantage to excluding training hashtags from the validation hashtags and data, which we conjecture discourages hyperparameter settings that merely memorize the training hashtags. A minimal sketch of this filtering step follows this list.
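
The following is a minimal sketch, under our own assumptions about how tweets are represented, of the filtering step discussed in the point above: tweets containing any training hashtag are removed from the validation set before hyperparameter tuning.

```python
# Hedged sketch of dropping validation tweets that contain training hashtags.
# The hashtags and tweet records are hypothetical placeholders.
train_hashtags = {"#nepalearthquake", "#prayfornepal"}

validation_tweets = [
    {"text": "Rescue teams arrive #NepalEarthquake", "label": 1},
    {"text": "Aftershocks reported near Kathmandu", "label": 1},
    {"text": "Match day! #soccer", "label": 0},
]

def contains_training_hashtag(tweet, hashtags):
    # Naive whitespace tokenization; the real pipeline may tokenize differently.
    return any(tok in hashtags for tok in tweet["text"].lower().split())

filtered_validation = [t for t in validation_tweets
                       if not contains_training_hashtag(t, train_hashtags)]
print(f"kept {len(filtered_validation)} of {len(validation_tweets)} validation tweets")
# Hyperparameters are then tuned on the filtered set, which discourages settings
# that merely memorize the training hashtags.
```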

Feature analysis

  • What are the best features for learning classifiers and do they differ by topic?

  • For each feature type, do any attributes correlate with importance?

The MI between a topic t and a feature j is computed from the marginal probabilities of topic occurrence p(t) and feature occurrence p(j) and the joint probability p(t, j), estimated empirically over the sample space of all tweets; higher values of this metric indicate more informative features j for the topic t.
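
Since the MI expression itself is not reproduced here, the sketch below computes a standard (expected) mutual information between binary topic-occurrence and feature-occurrence indicators from raw tweet counts; the exact formulation used in the paper may differ slightly, and the counts in the example are hypothetical.

```python
# Hedged sketch: mutual information between topic t and feature j estimated
# from tweet counts over binary occurrence indicators. The counts are hypothetical.
import math

def mutual_information(n_tj, n_t, n_j, n_total):
    """n_tj: tweets containing both topic and feature; n_t: topical tweets;
    n_j: tweets containing the feature; n_total: all tweets."""
    mi = 0.0
    for t_val in (0, 1):
        for j_val in (0, 1):
            # joint count for this (topic present?, feature present?) cell
            if t_val and j_val:
                n = n_tj
            elif t_val:
                n = n_t - n_tj
            elif j_val:
                n = n_j - n_tj
            else:
                n = n_total - n_t - n_j + n_tj
            p_tj = n / n_total
            p_t = (n_t if t_val else n_total - n_t) / n_total
            p_j = (n_j if j_val else n_total - n_j) / n_total
            if p_tj > 0:
                mi += p_tj * math.log2(p_tj / (p_t * p_j))
    return mi

# Hypothetical counts: a hashtag appearing in 900 of 1,000 topical tweets.
print(mutual_information(n_tj=900, n_t=1_000, n_j=1_200, n_total=1_000_000))
```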

  • Looking at the average MI values, the order of informativeness of feature types is the following: Hashtag, Term, Mention, User, Location. The overall informativeness of Hashtags is not surprising given that hashtags are used on Twitter to tag topics of interest. While the Term feature is not strictly topical, it contains a rich vocabulary for describing topics that Mention, User, and Location lack.

  • The Location feature provides high MI for the topics of Human Disaster, LGBT, and Soccer, indicating that much of the content in these topics is geographically localized.

  • Revisiting Table 4, we note the following ranking of topics from highest to lowest AP for Logistic Regression: Iran, Tennis, Natural Disaster, Celebrity Death, Human Disaster, Space, Social Issue, Soccer, Epidemics, LGBT. It turns out that this ranking is anti-correlated with the ranking of topics according to the average MI of their features in Fig. 3. To establish this relationship more clearly, in Fig. 4 we show a scatterplot of topics according to MI rank vs. AP rank. We observe a negative correlation between the topic rankings based on AP and MI; in fact, the Kendall τ rank correlation coefficient is −0.68, indicating a fairly strong inverse ranking relationship (a toy illustration of this rank-correlation computation is given after this list). To explain this, we conjecture that a lower average MI indicates that a topic has fewer good features; because those few good features make the tweets containing them easy to identify and rank highly, classifiers for such topics can achieve high ranking precision and hence high AP. The inverse argument should also hold.

  • The topic has little impact on which feature is most important, indicating stability of feature type informativeness over topics.

  • While Hashtag had a higher mean MI score than Term in the previous analysis, we see that Term has the highest median MI score across all topics, indicating that the high mean MI of Hashtag is mainly due to its outliers. In short, the few good Hashtag outliers are the overall best individual features, while Term has a greater variety of strong (but not absolute best) features.

  • Across all topics, User is often the least informative. However, the distributions of Location and Mention typically perform competitively with Hashtag, although their outliers do not approach the best Hashtag features, explaining why Hashtag has a higher overall average in Fig. 3.

  • Hashtags dominate the top 0.001 percentile of features, indicating that they account for the most informative features overall.

  • However, from percentiles 0.01 to 10, we largely see an increasing proportion of Term features among each percentile. This indicates that while the most informative features are Hashtags, there are relatively few of them compared to the number of high MI terms.

  • We note that Mentions, though not to the same extent as Terms, also become notably more present as the percentile range increases, while Locations and Users appear least informative overall within the 10th percentile and smaller.
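
Returning to the rank comparison referenced above (Fig. 4), the Kendall τ coefficient can be computed directly from the two topic rankings; the sketch below uses hypothetical rank vectors purely to illustrate the computation.

```python
# Toy illustration of the Kendall tau rank correlation between a topic ranking
# by AP and a topic ranking by average feature MI. The ranks are hypothetical.
from scipy.stats import kendalltau

ap_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # topics ordered by AP (hypothetical)
mi_rank = [9, 8, 10, 6, 7, 4, 5, 2, 3, 1]   # the same topics' ranks by average MI (hypothetical)

tau, p_value = kendalltau(ap_rank, mi_rank)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")  # negative tau: inverse ranking relationship
```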

  • User vs.

      • Favorite count: # of tweets the user has favorited.

      • Followers count: # of users who follow the user.

      • Friends count: # of users followed by the user.

      • Hashtag count: # of hashtags used by the user.

      • Tweet count: # of tweets from the user.

  • Hashtag vs.

      • Tweet count: # of tweets using the hashtag.

      • User count: # of users using the hashtag.

  • Location vs.

      • User count: # of users using the location.

  • Mention vs.

      • Tweet count: # of tweets using the mention.

  • Term vs.

      • Tweet count: # of tweets using the term.
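
To illustrate how the relationship between an attribute from the list above and feature informativeness can be assessed, the sketch below computes a Spearman rank correlation between hypothetical per-user tweet counts and per-user MI scores; both arrays are placeholders rather than the study's data.

```python
# Toy illustration of correlating a feature attribute (a user's tweet count)
# with that user's informativeness (MI). All values are hypothetical.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
tweet_counts = rng.integers(1, 5000, size=200)                 # hypothetical per-user tweet volume
mi_scores = 1e-6 * tweet_counts + rng.normal(0.0, 1e-3, 200)   # hypothetical per-user MI values

rho, p_value = spearmanr(tweet_counts, mi_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3g})")
```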

Conclusions

Additional Information and Declarations

Competing Interests

Lexing Xie is an Academic Editor for PeerJ.

Author Contributions

Mohamed Reda Bouadjenek conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Scott Sanner conceived and designed the experiments, analyzed the data, authored or reviewed drafts of the article, and approved the final draft.

Zahra Iman conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Lexing Xie conceived and designed the experiments, analyzed the data, authored or reviewed drafts of the article, and approved the final draft.

Daniel Xiaoliang Shi performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

The code is available at GitHub: https://github.com/SocialSensorProject/socialsensor.

Funding

The authors received no funding for this work.
