Topic models with elements of neural networks: investigation of stability, coherence, and determining the optimal number of topics

PeerJ Computer Science


 

Introduction

Measures in the field of topic modeling

Coherence

Stability

Renyi entropy

Granulated topic model with word embeddings

Granulated latent Dirichlet allocation model

  • Matrices Θ and Φ are initialized.

  • Loop over the number of iterations:

    • For each document d ∈ D repeat |d| times:

      • sample an anchor word w_j ∈ d uniformly at random

      • sample its topic t as in Gibbs sampling (Griffiths & Steyvers, 2004)

      • set t_i = t for all i such that |i − j| ≤ l, where l is a predefined window size (a code sketch follows the list).
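To make this procedure concrete, here is a minimal, hedged sketch in Python (a toy reimplementation under our own simplifying assumptions, not the code used in the article): documents are lists of token ids, `n_wt` and `n_dt` are count matrices, and the Gibbs-like conditional is computed without excluding the current token, for brevity.

```python
import numpy as np

def granulated_lda(docs, V, T, iters=50, window=2, alpha=0.1, beta=0.01, seed=0):
    """Granulated LDA sketch: an anchor word's topic is copied to its neighbours."""
    rng = np.random.default_rng(seed)
    z = [rng.integers(T, size=len(d)) for d in docs]           # topic labels per token
    n_wt = np.zeros((V, T)); n_dt = np.zeros((len(docs), T))   # word-topic and doc-topic counts
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            n_wt[w, z[d][i]] += 1; n_dt[d, z[d][i]] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for _ in range(len(doc)):
                j = rng.integers(len(doc))                     # anchor word, uniform at random
                w = doc[j]
                # Gibbs-like conditional p(t | w, d), as in Griffiths & Steyvers (2004)
                # (current assignments are not excluded from the counts, for brevity)
                p = (n_wt[w] + beta) / (n_wt.sum(axis=0) + V * beta) * (n_dt[d] + alpha)
                t = rng.choice(T, p=p / p.sum())
                lo, hi = max(0, j - window), min(len(doc), j + window + 1)
                for i in range(lo, hi):                        # granule: copy the topic to the window
                    old = z[d][i]
                    n_wt[doc[i], old] -= 1; n_dt[d, old] -= 1
                    z[d][i] = t
                    n_wt[doc[i], t] += 1; n_dt[d, t] += 1
    phi = (n_wt + beta) / (n_wt.sum(axis=0) + V * beta)        # p(w | t)
    theta = (n_dt + alpha) / (n_dt.sum(axis=1, keepdims=True) + T * alpha)
    return phi, theta
```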

Granulated latent Dirichlet allocation with word embeddings

Computational experiments

  • The ‘Lenta’ dataset is a set of 8,630 news documents in Russian with 23,297 unique words. The documents are manually marked up into 10 classes. However, some of the topics are close to each other and, therefore, this dataset can be described by 7–10 topics.

  • The ‘20 Newsgroups’ dataset is a collection of 15,425 news articles in English with 50,965 unique words. The documents are marked up into 20 topic groups. According to Basu, Davidson & Wagstaff (2008), 14–20 topics can describe the documents of this dataset, since some of the topics can be merged.

  • The ‘WoS’ dataset is a class-balanced dataset, which contains 11,967 abstracts of published papers available from the Web of Science. The vocabulary of the dataset contains 36,488 unique words. This dataset has a hierarchical markup, where the first level contains seven categories, and the second level consists of 33 areas.

Results

Results on coherence

Results on stability

Results on Renyi entropy

Computational speed

Discussion

Comparison of models in terms of coherence

Stability of topic models

Determining the optimal number of topics

Conclusions

Additional Information and Declarations

Competing Interests

The authors declare there are no competing interests.

Author Contributions

Sergei Koltcov conceived and designed the experiments, analyzed the data, authored or reviewed drafts of the article, and approved the final draft.

Anton Surkov conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Vladimir Filippov conceived and designed the experiments, performed the experiments, performed the computation work, prepared figures and/or tables, and approved the final draft.

Vera Ignatenko analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

The datasets are available in Zenodo: Ignatenko Vera. (2023). Topic modeling datasets [Data set]. Zenodo. https://doi.org/10.5281/zenodo.8407610.

The source codes are available in Zenodo: skoltsov. (2023). veraignatenko/TM-with-neural-networks: For Zenodo (v1.0). Zenodo. https://doi.org/10.5281/zenodo.8410811.

Funding

The results of the project “Modeling the structure and socio-psychological factors of news perception”, carried out within the framework of the Basic Research Program at the National Research University Higher School of Economics (HSE University) in 2022, are presented in this work. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Acknowledgements

 

This research was supported in part through computational resources of HPC facilities at NRU HSE.

Appendix A

 

Many topic models with elements of neural networks, ranging from models that directly include a network architecture to models that apply word embeddings, have been proposed over the past seven years. In this section, we briefly describe some of these models.

In the Gaussian LDA model (Das, Zaheer & Dyer, 2015), documents are represented as sequences of word embeddings obtained with word2vec, while topics are described as multivariate normal distributions in the embedding space (in contrast to the classical topic model representation, where topics are discrete distributions over the dictionary). Correspondingly, every such distribution is characterized by its mean and variance. The distribution of topics by documents is set as a Dirichlet distribution, analogous to the LDA model. The distinctive feature of this approach is that it models word embeddings rather than the words themselves. Different types of word embeddings can be used with this model. Model inference is based on Gibbs sampling. It should be noted that there are many hyperparameters, including the number of topics, which require careful tuning. Also, only the Euclidean distance is used as a proximity measure of embeddings, while the standard approach is to use the cosine measure. This weakness is discussed in Li et al. (2016c). An implementation of the Gaussian LDA model can be found at https://github.com/rajarshd/Gaussian_LDA.
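As an illustration of the central idea only (not the authors' inference code), the following sketch scores a word embedding against topics modeled as multivariate normal distributions; the toy parameters and variable names are ours.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy setup: 3 topics as Gaussians in a 50-dimensional embedding space.
rng = np.random.default_rng(0)
dim, n_topics = 50, 3
topic_means = rng.normal(size=(n_topics, dim))
topic_covs = [np.eye(dim) * s for s in (0.5, 1.0, 2.0)]

def topic_posterior(word_vec, theta_d):
    """p(topic | word embedding, document), up to normalization:
    document-topic proportions times the Gaussian likelihood of the embedding."""
    log_lik = np.array([
        multivariate_normal.logpdf(word_vec, mean=topic_means[k], cov=topic_covs[k])
        for k in range(n_topics)
    ])
    log_post = np.log(theta_d) + log_lik
    log_post -= log_post.max()                 # numerical stability
    post = np.exp(log_post)
    return post / post.sum()

word_vec = rng.normal(size=dim)                # stand-in for a word2vec embedding
print(topic_posterior(word_vec, theta_d=np.array([0.5, 0.3, 0.2])))
```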

Nguyen et al. (2015) incorporate word embeddings into two topic models, LDA and the one-topic-per-document DMM model (an analogue of LDA in which every document has only one topic), obtaining two new models, LF-LDA and LF-DMM, correspondingly. The main purpose of their work is to improve the quality of topic modeling for short texts and small corpora. The authors use two variants of pre-trained word embeddings: GloVe and word2vec. Inference is based on Gibbs sampling. In the framework of this model, each topic has its representation in the embedding space. Words are generated as follows. A so-called “unfair coin” (with the lambda parameter responsible for the probability of success) is tossed for every word. In case of failure, the word is generated from a standard LDA topic; otherwise, a word embedding generated from the latent feature component of the topic vector is taken instead of the word itself. Thus, it is a mixed model, where the word is generated either from the component of the topic corresponding to the Dirichlet LDA topic or from the latent feature component represented as a vector. The representations of topic vectors as well as the matrices Φ and Θ (similar to LDA) are estimated during model inference. It is shown that the best PMI score is given by models with lambda equal to 1, meaning that topics are represented only by their embeddings. The peculiarity of this model is that every topic is a combination of the Dirichlet topic (inferred with LDA) and a vector representation, which can have different top words; however, the authors do not discuss how to handle this. The model implementation can be found at https://github.com/datquocnguyen/LFTM. However, the algorithm is rather slow and cannot be used for large corpora (Li et al., 2016a).
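The “unfair coin” mixture can be illustrated with a short, hedged sketch of how a single word might be generated in such a scheme (a toy illustration, not the LFTM implementation; `lam`, `phi`, `topic_vecs`, and `word_vecs` are our names):

```python
import numpy as np

rng = np.random.default_rng(1)
V, K, dim = 1000, 10, 100
phi = rng.dirichlet(np.ones(V), size=K)            # Dirichlet (LDA) topics, K x V
topic_vecs = rng.normal(size=(K, dim))             # latent-feature topic vectors
word_vecs = rng.normal(size=(V, dim))              # pre-trained word embeddings

def latent_feature_word_dist(k):
    """Softmax over the vocabulary of <topic vector, word embedding> dot products."""
    scores = word_vecs @ topic_vecs[k]
    scores -= scores.max()
    p = np.exp(scores)
    return p / p.sum()

def generate_word(k, lam=0.6):
    """Toss the 'unfair coin': with probability lam use the latent-feature component,
    otherwise use the ordinary Dirichlet topic phi[k]."""
    if rng.random() < lam:
        return rng.choice(V, p=latent_feature_word_dist(k))
    return rng.choice(V, p=phi[k])

print(generate_word(k=3))
```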

In 2016, a spherical topic model (sHDP) was proposed (Batmanghelich et al., 2016), which is an extension of the HDP (Hierarchical Dirichlet Process) model (Blei & Jordan, 2006). Since the model is nonparametric, the number of topics is determined automatically. In spherical HDP, every topic is a von Mises-Fisher distribution in the space of normalized embeddings (the unit sphere). The distribution of topics in documents (proportions) is inferred with HDP. Word embeddings are learned with word2vec. The authors state that the model produces more coherent topics than Gaussian LDA and HDP. The implementation of this model can be found at https://github.com/Ardavans/sHDP.

Xun et al. (2016) use the additional information from word embeddings to obtain better topic models for short texts. They use word2vec embeddings trained on the Wiki dataset. Similar to the Gaussian LDA model, it is assumed that every topic has a normal distribution in the word embedding space. In addition, it is supposed that every short text has only one topic. The authors also assume that every word either belongs to a Gaussian topic or is one of the background words generated from LDA topics. An unfair coin is tossed for every word; in case of failure, the word is generated from the background topic (as in the LDA model). Otherwise, the word embedding is considered instead of the word and is assumed to be generated from the Gaussian topic. The inference algorithm is based on Gibbs sampling.

In Li et al. (2016c), a new model called mix-vMF is proposed. The main idea is similar to Gaussian LDA, but von Mises-Fisher distributions are used instead of normal distributions, meaning that every topic is a mixture of von Mises-Fisher distributions in the space of normalized vectors on the unit sphere. The distributions of topics by documents are Dirichlet distributions (similar to LDA), while distributions of word embeddings by topics are used instead of distributions of words by topics. The authors claim that the von Mises-Fisher distribution reflects vector similarity in terms of the cosine distance more efficiently, while a mixture of such distributions is able to attribute heterogeneous word embeddings to the same topic. Thus, a collection of documents is represented as a collection of word embeddings. Inference is based on Gibbs sampling. The authors use word embeddings from the GloVe model. The implementation of this model is not publicly available.

Li et al. (2016b) propose the TopicVec model. The model creates topic embeddings, which implies that a topic is a point in the space of word embeddings. When assigning a probability to a word, the model takes into account a certain number of context words (parameter l) preceding this word, as well as a topic. The distribution of topics by documents is a Dirichlet distribution (as in LDA). Thus, a generative model for documents is obtained. This model uses pre-trained word embeddings from PSDVec (Li, Zhu & Miao, 2017). The top words for every topic are found as the nearest words according to a weighted cosine measure between the topic embedding and word embeddings. There is a matrix of the distribution of topics by documents; however, there is no matrix of the distribution of words by topics. Inference is based on a variational EM algorithm. The main advantage of the model is the rejection of the bag-of-words hypothesis and the consideration of word order. The implementation of this model can be found at https://github.com/askerlee/topicvec.

In 2017, a correlated Gaussian topic model was proposed by Xun et al. (2017b). The approach is based on the correlated topic model, where words are represented by their embeddings, while topics are multivariate normal distributions in the space of word embeddings. The main goal is to model topics and the correlations between topics in the word embedding space. Word embeddings are trained separately on a large corpus with word2vec prior to topic modeling. This approach is very similar to the Gaussian LDA model, with the difference that the distribution of topics for documents is logistic-normal rather than Dirichlet, which allows finding the covariance between topics. The model inference is based on Gibbs sampling. The implementation of the model is not publicly available.

Zhao, Du & Buntine (2017) propose a new model called WEI-FTM, aimed at improving topic modeling quality for short texts. It is a “focused topic model”, meaning that each topic is focused on some set of words, i.e., the topic is a distribution not over the whole dictionary but over its subset. Similar to LDA, each document is generated by K topics; the distribution of words by topic is a Dirichlet distribution over a certain subset of words, while ϕ_wt equals zero for the rest of the words. The procedure of focus-word selection is carried out as follows. A coin with success probability equal to a function of the dot product between the word and topic vectors is tossed. The word is included in the subset of words of the current topic in case of success and excluded in case of failure. The model is thus rather similar to LDA: the distribution of topics by documents is a Dirichlet distribution, and the distribution of words by topics is a Dirichlet distribution, but only over a certain subset of words for every topic. Word embeddings are used only to extract focus words for every topic and are trained separately with GloVe. Model inference is based on Gibbs sampling. There are two ways to choose top words in this model: one can either use the traditional matrix Φ or consider the dot product of the word and topic vector representations. The authors use both approaches and calculate topic coherence separately for each. No implementation of this model is publicly available.

The authors of the Collaborative Language Model (CLM) (Xun et al., 2017a) model topics using the global context and simultaneously train word embeddings using the local context. In the framework of the model, topic embeddings are trained, and the dot product of the topic and word embeddings is used to evaluate the word’s contribution to a topic. The authors suggest not using pre-trained embeddings, because open data are scarce in some fields; thus, word embeddings use information only from the given dataset. This model is neither generative nor stochastic; instead, it solves a non-negative matrix factorization optimization task, where the target function combines a decomposition of the word co-occurrence matrix (to obtain word embeddings), a decomposition of the topic matrix (to obtain vector representations of topics), and norm restrictions (regularization) for every matrix. Inference is based on setting the derivatives of the target function to zero and iteratively computing the respective matrix decompositions. The authors demonstrate that their model produces coherent topics and good vector representations of words (pairs of words are sorted according to a cosine measure, and then the correlation between the sorted pairs and human-made word-similarity ratings is calculated). The implementation of the model can be found at: https://github.com/XunGuangxu/2in1.

In Zhao et al. (2017), the MetaLDA model is proposed. This model allows using meta-information about documents (e.g., author or label) and words (e.g., word embeddings). It is supposed that meta-information improves the quality of topic modeling for short texts. Meta-information for both words and documents is encoded in binary vectors. It is assumed that the distribution of words by topics is a Dirichlet distribution with V hyperparameters (having different values for different topics), where V is the size of the dictionary, and the distribution of topics by documents is a Dirichlet distribution with T hyperparameters (different for every document), where T is the number of topics. The binary vectors of meta-information are used to determine the hyperparameters of the Dirichlet distributions. If two words (w_1 and w_2) have similar meta-information, then the hyperparameter values for the distribution of topics over these words will be similar, as will the expected values of Φ_{w1,k} and Φ_{w2,k}. The implementation of the model can be found at https://github.com/ethanhezhao/MetaLDA/.

Miao, Grefenstette & Blunsom (2017) suggest several topic models based on neural variational inference (Python code is available at https://github.com/zll17/Neural_Topic_Models). Neural networks are capable of approximating various functions and learning complicated non-linear distributions for unsupervised models. That is why the authors suggest an alternative neural approach to topic modelling based on parameterized distributions over topics, which can be trained with backpropagation in the framework of neural variational inference. Neural variational inference approximates the posterior distribution of a generative model by means of a variational distribution parameterized by a neural network. The authors propose three different models. The generative process of the models is as follows:

  • Proportions of topics θ_d for each document d ∈ D are distributed as G(μ_0, σ_0^2), where G(μ_0, σ_0^2) is composed of a neural network θ = g(x) with x sampled from the normal distribution x ∼ N(μ_0, σ_0^2).

  • Topic z_n has a multinomial distribution z_n ∼ Multi(θ_d) for each observed word w_n, n = 1, …, N_d.

  • Each word w_n ∼ Multi(β_{z_n}), n = 1, …, N_d, where β_{z_n} is the distribution of words in topic z_n.

Let t ∈ ℝ^{K×L} be the vector representations of topics and v ∈ ℝ^{V×L} be the vector representations of words; then the distribution of words in topic k is β_k = softmax(v t_k^T). The first proposed model is GSM (Gaussian Softmax distribution). It has a finite number of topics K. The proportions of topics in documents are set with a Gaussian softmax (Miao, Grefenstette & Blunsom, 2017):

  • x ∼ N(μ_0, σ_0^2),

  • θ = softmax(W_1^T x), where W_1 is a linear transformation (a toy sketch of this generative process is given below).
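A minimal numpy sketch of the GSM generative process described above, with randomly initialized toy parameters standing in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(2)
K, V, L, H = 15, 2000, 50, 64                     # topics, vocabulary, embedding size, latent size
t = rng.normal(size=(K, L))                       # topic embeddings
v = rng.normal(size=(V, L))                       # word embeddings

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

beta = softmax(v @ t.T, axis=0)                   # beta[:, k] = distribution of words in topic k
W1 = rng.normal(size=(H, K))                      # linear layer of the Gaussian softmax

def generate_document(n_words, mu0=0.0, sigma0=1.0):
    x = rng.normal(mu0, sigma0, size=H)           # x ~ N(mu0, sigma0^2)
    theta = softmax(W1.T @ x)                     # Gaussian softmax topic proportions
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)                # z_n ~ Multi(theta_d)
        words.append(rng.choice(V, p=beta[:, z])) # w_n ~ Multi(beta_{z_n})
    return words

print(generate_document(20)[:5])
```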

The second proposed model is GSB (Gaussian Stick Breaking distribution). It also has a finite number of topics K. The proportions of topics in documents are set with a Gaussian stick-breaking process:

  • x ∼ N(μ_0, σ_0^2),

  • η = sigmoid(W_2^T x) gives the stick-breaking proportions,

  • θ_d = f_SB(η), where f_SB(η) is the stick-breaking construction. For example, for K = 3, f_SB(η_1, η_2) = (η_1, η_2⋅(1 − η_1), (1 − η_2)⋅(1 − η_1)). For K = 4, f_SB(η_1, η_2, η_3) = (η_1, η_2⋅(1 − η_1), η_3⋅(1 − η_2)⋅(1 − η_1), (1 − η_3)⋅(1 − η_2)⋅(1 − η_1)). Thus, for any K, ∑_k θ_{kd} = 1 (see the short helper function below).
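The stick-breaking construction f_SB can be written as a small helper function; the sketch below reproduces the K = 3 and K = 4 examples above, with the leftover stick mass assigned to the last topic:

```python
import numpy as np

def stick_breaking(eta):
    """Map K-1 breaking proportions eta in (0, 1) to K topic proportions theta:
    theta_k = eta_k * prod_{j<k} (1 - eta_j); the last entry takes the remainder."""
    theta, remaining = [], 1.0
    for e in eta:
        theta.append(e * remaining)
        remaining *= (1.0 - e)
    theta.append(remaining)
    return np.array(theta)

eta = np.array([0.4, 0.7, 0.2])          # K - 1 = 3 proportions -> K = 4 topics
theta = stick_breaking(eta)
print(theta, theta.sum())                 # the components sum to 1 by construction
```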

In the third model, the Recurrent Stick Breaking process (RSB), the number of topics is unbounded, and the distribution of topics in documents is set with the Recurrent Stick Breaking process (Miao, Grefenstette & Blunsom, 2017):

  • x ∼ N(μ_0, σ_0^2),

  • η = f_RNN(x), where f_RNN(x) is decomposed as h_k = RNN_SB(h_{k−1}), η_k = sigmoid(h_{k−1}^T x). RNN denotes a recurrent neural network. Thus, the proportions for the recurrent stick breaking are generated sequentially by the RNN.

  • θ_d = f_SB(η), where f_SB(η) is the same function as in the previous model.

A variational lower bound on the likelihood is used for model inference. The variational parameters μ(d), σ(d) (for document d) are generated with an inference network based on a multilayer perceptron. The generative parameters (such as t, v and the parameters of g(x)) as well as the variational parameters (such as μ(d), σ(d)) are updated with a stochastic gradient backpropagation algorithm. The authors demonstrate that the first two models perform better than the standard LDA model in terms of perplexity, while the third one performs better than the HDP model.

Bunk & Krestel (2018) suggest a new model, WELDA (Word Embedding Latent Dirichlet Allocation), which combines LDA with word embeddings. Pre-trained skip-gram word2vec embeddings are used. This model merges the classical Dirichlet distributions and multivariate normal distributions in the word embedding space. The main idea is to obtain vector representations of words, train standard LDA until convergence, and estimate the parameters of a normal distribution for each topic from its top words (more precisely, from their word embeddings). After that, additional Gibbs sampling iterations are run, in which an unfair coin (with success probability lambda, i.e., a Bernoulli distribution) is tossed for every word in the following way. Let word w be assigned to topic t; in case of success, a vector is sampled from the embedding space of this topic (i.e., from the multivariate normal distribution of the topic in the embedding space), the word embedding nearest to this sampled vector is found, the corresponding word is labeled with topic t, and all counters are recalculated afterwards. A model implementation (not the original one) can be found at https://github.com/AmFamMLTeam/hltm_welda.
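A hedged sketch of a WELDA-style resampling step (our simplified reconstruction, not the original implementation): a diagonal Gaussian is fitted per topic from the embeddings of its top words, and a word is occasionally replaced by the vocabulary item nearest to a vector drawn from that Gaussian.

```python
import numpy as np

rng = np.random.default_rng(3)
V, K, dim = 500, 5, 50
word_vecs = rng.normal(size=(V, dim))                 # pre-trained word2vec embeddings
phi = rng.dirichlet(np.ones(V), size=K)               # LDA topics after convergence

# Fit a diagonal Gaussian per topic from the embeddings of its top-20 words.
topic_mu, topic_sigma = [], []
for k in range(K):
    top = np.argsort(phi[k])[-20:]
    topic_mu.append(word_vecs[top].mean(axis=0))
    topic_sigma.append(word_vecs[top].std(axis=0) + 1e-6)

def welda_step(word_id, topic, lam=0.3):
    """With probability lam, sample a vector from the topic's Gaussian and
    return the nearest vocabulary word; otherwise keep the original word."""
    if rng.random() >= lam:
        return word_id
    sample = rng.normal(topic_mu[topic], topic_sigma[topic])
    dists = np.linalg.norm(word_vecs - sample, axis=1)
    return int(np.argmin(dists))

print(welda_step(word_id=42, topic=1))
```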

Dieng, Ruiz & Blei (2020) suggested a new topic model called the embedded topic model (ETM). This model is generative and probabilistic: every document is a mixture of topics, and every word is attributed to a specific topic. At the same time, every word has a vector representation (word embedding), and topics are vectors in the same space. One of the main goals of this model is to enrich topic models by using the similarity of words according to their vector representations. Let us consider this model in more detail, since it is used in our experiments. Let ρ denote the matrix of word vector representations for the dictionary of a given collection. This matrix has size L × V, where L is the dimensionality of the word vectors and V is the number of unique words. Each column of ρ is the vector representation of a word. Also, let α_k ∈ ℝ^L denote the vector representation of topic k. Then, the generative process for document d can be described as follows:

  1. Proportions of topics for a given document, θ_d, are sampled from the logit-normal distribution LN(0, I).

  2. For each position n in the document, a topic z_dn is sampled from a categorical distribution Cat(θ_d), and the word is sampled according to w_dn ∼ softmax(ρ^T α_{z_dn}).

Thus, words are generated from a categorical distribution whose parameter is determined by the dot products of the vector representations of the word and the topic. The matrix Φ of the distribution of words by topics can be calculated as ϕ_vk = softmax(ρ^T α_k)|_v. Model inference is based on maximizing the log-likelihood of the document collection. However, direct computation is intractable, which is why variational inference is used. A family of auxiliary multivariate normal distributions is used in inference; their parameters (mean vector and covariance matrix) are estimated with a neural network that takes documents as input and outputs the parameters (μ_d, Σ_d) for every document. Thus, the evaluation of the parameters of the auxiliary distributions in variational inference is carried out by a dedicated neural network. The inference algorithm can be described as follows (a toy numerical illustration is given after the list):

  1. Initialize the model and variational parameters ν_μ, ν_Σ

  2. Iterative steps:

    1. Compute ϕ_k = softmax(ρ^T α_k) for every topic k

    2. Choose a minibatch of documents (B)

    3. For every document d ∈ B:

      • Construct the normalized bag-of-words representation of the document (x_d)

      • Compute μ_d = NN(x_d; ν_μ)

      • Compute Σ_d = NN(x_d; ν_Σ)

      • Sample θ_d ∼ LN(μ_d, Σ_d)

      • For each word w_dn in the document d: compute p(w_dn|θ_d) = θ_d^T ϕ_{w_dn}

    4. Compute variational lower bound (ELBO) and its gradient

    5. Update the values of α_{1:K}

    6. Update the values of the variational parameters ν_μ, ν_Σ
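A compact numpy illustration of steps 1 and 3 of this loop, i.e., how the topic matrix and the per-word probabilities are obtained from the embeddings; random toy parameters stand in for the trained ρ and α and for the outputs of the inference network.

```python
import numpy as np

rng = np.random.default_rng(4)
L, V, K = 300, 5000, 20
rho = rng.normal(size=(L, V))                    # word embedding matrix (L x V)
alpha = rng.normal(size=(K, L))                  # topic embeddings alpha_k in R^L

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

# Step 1: phi_k = softmax(rho^T alpha_k), the distribution of words in topic k.
phi = softmax(alpha @ rho, axis=1)               # shape (K, V), rows sum to 1

# Step 3 (per document): theta_d ~ LN(mu_d, Sigma_d); here a diagonal Gaussian is used.
mu_d, sigma_d = rng.normal(size=K), np.abs(rng.normal(size=K))   # inference-net outputs
theta_d = softmax(rng.normal(mu_d, sigma_d))     # logistic-normal sample

# Per-word likelihood p(w | theta_d) = theta_d^T phi[:, w].
word_ids = rng.integers(V, size=10)              # a toy document
p_words = theta_d @ phi[:, word_ids]
print(p_words)
```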

The authors propose two options for the model: (1) with pre-trained embeddings; (2) learning the embeddings as part of the fitting procedure. The numerical experiments of the authors demonstrate that the model with pre-trained embeddings gives slightly better quality on average in terms of topic interpretability and predictive ability compared to the alternative. Also, ETM significantly improves quality measures such as semantic coherence and predictive ability in comparison to LDA. It should be noted that skip-gram pre-trained embeddings are used in this work; however, the authors admit the possibility of using other types of word embeddings. Also, the number of topics is set manually in this model, leaving the problem of selecting the number of topics open.

Further development of the ETM model was proposed in Harandizadeh, Priniski & Morstatter (2022). The newly proposed model (keyword-assisted ETM) incorporates user knowledge in the form of informative topic-level priors over the vocabulary. Namely, the user specifies a set of seed word lists associated with topics of interest, which, in turn, guides statistical inference.

At the end of 2019, the W-LDA model was proposed (Nan et al., 2019). This is a neural topic model based on a Wasserstein autoencoder with a Dirichlet prior on the latent document-topic vectors. The encoder is a multi-layer perceptron mapping the bag-of-words representation of a document to an output layer of K units, to which softmax is applied to obtain the document-topic vector θ. Given θ, the decoder is a single-layer neural network mapping θ to an output layer of V units, to which softmax is applied to obtain a probability distribution over the words in the vocabulary (ŵ). Thus, ŵ_i = exp(h_i)/∑_{j=1}^V exp(h_j), where h = βθ + b, β is the matrix of topic-word vectors as in LDA, and b is an offset vector. The top words of each topic can be extracted from the decoder weights (i.e., the top entries of β_k sorted in descending order). The authors demonstrate that their model produces significantly better topics compared to LDA, ProdLDA (Tolstikhin et al., 2018), and NTM-R (Ding, Nallapati & Xiang, 2018).
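The decoder step and the extraction of top words can be illustrated in a few lines (toy parameters of our own, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(5)
K, V = 10, 3000
beta = rng.normal(size=(V, K))        # topic-word weight matrix of the decoder
b = rng.normal(size=V)                # offset vector
theta = np.abs(rng.normal(size=K)); theta /= theta.sum()    # document-topic vector

h = beta @ theta + b                  # single-layer decoder
w_hat = np.exp(h - h.max()); w_hat /= w_hat.sum()           # softmax over the vocabulary

# Top words of topic k: largest entries of the k-th column of beta.
vocab = [f"word_{i}" for i in range(V)]
k = 0
top_ids = np.argsort(beta[:, k])[::-1][:10]
print([vocab[i] for i in top_ids])
```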

WTM-GMM is an improved version of the original W-LDA (https://zll17.github.io/2020/11/17/Introduction-to-Neural-Topic-Models/#WTM-MMD). A Gaussian mixture distribution is used as the prior distribution. The authors propose two types of evolution strategy: gmm-std and gmm-ctm. The gmm-std variant adopts a Gaussian mixture distribution whose components have fixed means and variances. In contrast, the components of the Gaussian mixture distribution in gmm-ctm are adjusted to fit the latent vectors throughout the training process. The number of components is usually set equal to the number of topics. Empirically, the WTM-GMM model usually achieves better performance than W-LDA in terms of topic coherence.

In the spring of 2020, the Bidirectional Adversarial Topic model (BAT) was proposed (Wang et al., 2020). This model uses a Dirichlet distribution as a prior distribution of topics (analogous to LDA). Moreover, an extension of this model, Gaussian-BAT, which is able to account for word similarity based on vector representations of words, was also proposed. In this work, bidirectional adversarial training is used for the first time in topic modelling. Let V be the size of the dictionary and K be the number of topics. The BAT model consists of three components: (1) the encoder, which takes a V-dimensional document representation (d_r) as input and transforms it into a K-dimensional distribution of topics in the document (θ_r); (2) the generator, which takes a random distribution of topics in the document (θ_f), sampled from the Dirichlet distribution, as input and generates an artificial V-dimensional distribution of words (d_f); (3) the discriminator, which takes a real pair of distributions p_r = [θ_r; d_r] and an artificial pair of distributions p_f = [θ_f; d_f] as input and has to distinguish the real distributions from the artificial ones. The output of the discriminator is used to train the encoder, generator, and discriminator in adversarial training.

The encoder contains a V-dimensional layer for the distribution of words in a document, an S-dimensional representation layer, and a K-dimensional layer for the distribution of topics in a document. Each document d has its own representation (d_r) weighted with TF-IDF. First, the encoder projects d_r into the S-dimensional semantic space by means of the representation layer: h_s^e = BN(W_s^e d_r + b_s^e), o_s^e = max(h_s^e, leak⋅h_s^e), where W_s^e ∈ ℝ^{S×V} is the weight matrix of the representation layer, b_s^e is the bias, h_s^e is the state vector normalized with batch normalization, leak denotes the parameter of the Leaky ReLU activation function, and o_s^e is the output of the representation layer. Then, the encoder maps o_s^e onto the K-dimensional topic space: θ_r = softmax(W_t^e o_s^e + b_t^e), where W_t^e ∈ ℝ^{K×S} is the weight matrix of the topic distribution layer, b_t^e is the bias of this layer, θ_r denotes the topic distribution of document d_r, and θ_r^k is the proportion of topic k in this document. The generator, in contrast to the encoder, projects the distribution of topics in a document onto the distribution of words in the document. Thus, the generator consists of a K-dimensional layer for the distribution of topics in documents, an S-dimensional representation layer, and a V-dimensional layer for the distribution of words in a document. The distribution of topics in documents θ_f (θ_f^k denotes the proportion of topic k in the document) is assumed to be a Dirichlet distribution with a K-dimensional parameter α. First, the generator projects the distribution of topics in documents onto the S-dimensional representation space: h_s^g = BN(W_s^g θ_f + b_s^g), o_s^g = max(h_s^g, leak⋅h_s^g), where W_s^g ∈ ℝ^{S×K} is the weight matrix of the representation layer, b_s^g is the bias, h_s^g is the state vector normalized with batch normalization, leak denotes the parameter of the Leaky ReLU activation function, and o_s^g is the output of the representation layer. Then o_s^g is transformed into the distribution of words in a document by means of a linear layer and softmax: d_f = softmax(W_w^g o_s^g + b_w^g), where W_w^g ∈ ℝ^{V×S} is the weight matrix of the word distribution layer, b_w^g is the bias of the layer, and d_f denotes the distribution of words corresponding to θ_f. For each word v, d_f^v is the probability of this word in the artificial document d_f.

The discriminator consists of three layers: a (V + K)-dimensional layer for the joint distribution, an S-dimensional representation layer, and an output layer. The main task of the discriminator is to distinguish real input data p_r = [θ_r; d_r] from artificial data p_f = [θ_f; d_f]. The output of the discriminator is D_out; a large D_out value indicates that the discriminator considers the input to be real. The authors also propose Gaussian-BAT with a modified generator that takes into account information about the relationship of words based on their vector representations. The multivariate normal distribution N(μ_k, Σ_k) is used to model topic k, where μ_k and Σ_k are parameters learned during training. The probability of every word v is calculated as follows: p(e_v | topic = k) = N(e_v; μ_k, Σ_k), ϕ_vk = p(e_v | topic = k)/∑_{v=1}^V p(e_v | topic = k), where e_v denotes the vector representation of the word v and ϕ_k is the normalized distribution of words in the k-th topic. The artificial distribution d_f corresponding to a certain θ_f can be computed as d_f = ∑_{k=1}^K ϕ_k θ_f^k. The encoder and discriminator are the same as in BAT. Pairs of real distributions p_r = [θ_r; d_r] and pairs of artificial distributions p_f = [θ_f; d_f] are considered random samples from two (K + V)-dimensional joint distributions P_r and P_f. The main task is to make P_f as close to P_r as possible; the Wasserstein distance is used as the proximity measure between P_r and P_f. A detailed training algorithm is given in Wang et al. (2020). The authors demonstrate that the proposed models perform better than the standard LDA and GSM models in terms of the ‘topic coherence’ measure.
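A minimal sketch of the Gaussian-BAT topic-word computation described above (ϕ_vk from the topic Gaussians, followed by the artificial word distribution d_f), with toy parameters in place of learned ones:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(6)
V, K, dim = 800, 6, 50
word_vecs = rng.normal(size=(V, dim))                       # embeddings e_v
mus = rng.normal(size=(K, dim))                             # topic means (learned in the real model)
covs = [np.eye(dim) for _ in range(K)]                      # topic covariances (learned in the real model)

# phi[v, k] = N(e_v; mu_k, Sigma_k), normalized over the vocabulary for each topic.
log_dens = np.stack([multivariate_normal.logpdf(word_vecs, mean=mus[k], cov=covs[k])
                     for k in range(K)], axis=1)            # shape (V, K)
log_dens -= log_dens.max(axis=0, keepdims=True)
phi = np.exp(log_dens)
phi /= phi.sum(axis=0, keepdims=True)

theta_f = rng.dirichlet(np.ones(K))                         # fake topic proportions theta_f
d_f = phi @ theta_f                                         # artificial word distribution d_f
print(d_f.sum())                                            # sums to 1
```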

In Xu et al. (2022), a neural topic model with deep mutual information estimation (NTM-DMIE) is proposed. This method maximizes the mutual information between the input documents and their latent topic representations. The framework of NTM-DMIE consists of two main components, namely, a Document-Topic Encoder and a Topic-Word Decoder. The Document-Topic Encoder models the document-topic distribution as in LDA and learns topic representations of documents; moreover, in the encoder, the mutual information between the documents and their topic representations is estimated. The Topic-Word Decoder learns the topic-word distribution as in LDA. The authors demonstrate that the proposed model outperforms state-of-the-art neural topic models in terms of the ‘topic coherence’ and ‘topic uniqueness’ metrics.

In Shao et al. (2022), the role of embeddings and their changes in embedding-based neural topic models is studied. Moreover, the authors propose an embedding-regularized neural topic model (ERNTM), which applies specially designed training constraints on word embeddings and topic embeddings to reduce the optimization space of parameters. The authors compare the proposed model with baseline models such as ETM, GSM, and NTM (Ding, Nallapati & Xiang, 2018) and demonstrate its competitiveness.

To increase the quality of topic modeling on short texts, a neural topic model integrating SBERT and data augmentation was proposed in Cheng et al. (2023). The authors introduce a data augmentation technique that uses random replacements, insertions, deletions, and other operations to increase the robustness of the text data, and combine it with keyword information obtained through the TextRank algorithm. The augmented text data is then vectorized and used as input for a BiLSTM-Att module to capture long-distance dependency information and overcome the influence of noisy words. The authors also employ the SBERT model, which, in contrast to BERT, takes the entire sentence as a processing unit. The information that was enhanced through data augmentation and processed through the attention mechanism is merged with the semantic feature information, and the resulting feature information is fed into a neural topic model based on the ProdLDA model.

In general, topic modeling with the application of neural networks is actively developing. One of the best reviews in this area is Zhao et al. (2021), although this work is from 2021. The active development of transformer models has given rise to several new works using topic modeling as an auxiliary tool. For example, in Giang, Song & Jo (2022), topic modeling was used to improve high-level image feature matching (the TopicFM model). This approach represents an image as a set of topics marked with different colors, i.e., it encodes high-level contextual information of images based on a topic modeling strategy. Each topic is an embedding fed to the input of a cross-attention layer characterized by three matrices (queries, keys, and values). Thus, a standard transformer scheme is used. At the transformer’s output, probabilities characterizing the distance between a feature F_i and individual topics T are obtained. Further, one can obtain a similar matrix for different images and then compare the images with each other. Thus, TopicFM provides reliable and accurate feature-matching results even in complex scenes with large changes in scale and viewpoint.

In Wang et al. (2023), large language models (LLMs) are considered implicit topic models. The authors of this paper, relying on the fact that LLMs are Monte Carlo generative models, propose to treat the generation process as dependent on the topic and the topic-token matrix. From this point of view, topic modeling is reduced to a classification procedure, that is, to obtaining a label for each document in the form of the probability that the document belongs to a particular topic. The paper analyzed the following LLMs: GPT2, GPT3, GPT3-instruct, GPT-J, OPT, and LLaMA. In their tests, the authors proposed a two-stage algorithm that first extracts latent conceptual lexemes from a large language model and then selects demonstrations from the clues that are most likely to predict the corresponding conceptual lexemes.

The Zero-Shot Classification technology (Brown et al., 2020) should also be noted. In the framework of this technology, large language models are used to classify text data using transfer learning. That is, an LLM-based classifier can output the probabilities that a text belongs to different topics, where the topics are specified by the user. Thus, it is possible to construct a simple text clustering algorithm for a given set of topics. Further development of this direction is demonstrated in Ding et al. (2022). In this paper, the authors propose a topic classification system originally trained on Wikipedia; the trained classifier can thus classify an external document on various topics with high accuracy. The proposed framework is also based on zero-shot classification technology.

We would also like to mention some other applications in the field of NLP where topic modeling is used as an auxiliary tool. Joshi et al. (2023) propose the ‘DeepSumm’ method for text summarization. In the framework of this approach, each sentence is encoded with two different recurrent neural networks based on the probability distributions of topics and on embeddings. Then a sequence-to-sequence network is applied to the encoding of each sentence. The outputs of the encoder and decoder in the sequence-to-sequence networks are combined after weighting by an attention mechanism and converted into a score by a multilayer perceptron network. Accordingly, several scores are obtained for each sentence, namely, the Sentence Topic Score (STS), obtained using the topic model, and the Sentence Content Score (SCS), obtained using the embeddings. In addition, the authors propose the Sentence Novelty Score (SNS) and the Sentence Position Score (SPS). Based on these four scores, a Final Sentence Score (FSS) is calculated. Accordingly, all sentences are ranked according to the final score, and a brief summary of the text is formed from the sentences that received the maximum values.

In conclusion, we would like to note that over the last two years the focus has shifted from the development of new topic models to the use of existing models in conjunction with large language models, or even to the replacement of topic sampling procedures with classification via transfer, that is, the widespread use of zero-shot technology. At the same time, embeddings have become an integral part of transformers when working with various NLP tasks.

Appendix B

 

In this section, we briefly describe the main models of word embeddings.

Word2vec model

The technology of word embeddings is based on the hypothesis of local co-occurrence of words. In the framework of this hypothesis, it is assumed that words that often occur with similar surrounding words have the same or similar semantic meaning. This hypothesis was proposed by Mikolov et al. (2013). The probability of occurrence of the word w_0 given its surrounding words is expressed as follows: p(w_0|w_c) = exp(s(w_0, w_c)) / ∑_{w_i ∈ V} exp(s(w_i, w_c)), where w_0 is the vector representation of the target word, w_c is the context vector, and s(w_0, w_c) is a function that maps two vectors to a number, for example, a similarity measure such as the cosine measure.

CBOW (continuous bag of words) is a model in which the network is trained on a continuous sequence of words; within this model, the order of words is not important. A sequence of 2k + 1 words is used, where the central word is the word under study, and the context vector is built from the corresponding surrounding words. Thus, each vector is associated with a set of words that often occur together. In the CBOW model, training is based on minimization of the Kullback-Leibler divergence KL(p||q) = ∑_{x ∈ V} p(x) ln(p(x)/q(x)), where p(x) is the probability distribution of words in the dataset and q(x) is the word distribution generated by the model. The skip-gram model is a model of phrases with a gap and is similar to the previous one: the principle of the CBOW model is to predict a word given its context, while the principle of the skip-gram model is to predict the context given a word. In general, the word2vec model is optimized with the negative sampling procedure (Mikolov et al., 2013). Negative sampling is a way to create negative examples for model training (i.e., to show pairs of words that are not neighbors in the context).
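The probability above and the negative-sampling idea can be illustrated with a short sketch; the vectors are toy random values, and the score s is taken to be a plain dot product, which is one common choice.

```python
import numpy as np

rng = np.random.default_rng(7)
V, dim = 10000, 100
in_vecs = rng.normal(scale=0.1, size=(V, dim))    # target-word vectors
out_vecs = rng.normal(scale=0.1, size=(V, dim))   # context vectors

def p_word_given_context(w0, wc):
    """Full softmax: p(w0 | wc) = exp(s(w0, wc)) / sum_i exp(s(wi, wc))."""
    scores = out_vecs @ in_vecs[wc]               # s(wi, wc) for all words wi
    scores -= scores.max()
    e = np.exp(scores)
    return e[w0] / e.sum()

def negative_sampling_loss(w0, wc, n_neg=5):
    """Cheaper objective: separate the true pair from a few random 'negative' words."""
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    pos = np.log(sig(out_vecs[w0] @ in_vecs[wc]))
    neg_ids = rng.integers(V, size=n_neg)
    neg = np.log(sig(-(out_vecs[neg_ids] @ in_vecs[wc]))).sum()
    return -(pos + neg)

print(p_word_given_context(w0=12, wc=7), negative_sampling_loss(w0=12, wc=7))
```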

GloVe model

The main idea of the GloVe model is to extract semantic relations between words using the matrix of word co-occurrences. This model minimizes the difference between the product of word vectors and the logarithm of the probability of their co-occurrence using stochastic gradient descent (Pennington, Socher & Manning, 2014). In this case, it is possible to link the satellites of a planet or a city’s postal code with its name, which could not be done using the word2vec model.

FastText model

The FastText model is an extension of the word2vec model. It uses skip-gram, negative sampling, and a model of character n-grams (Joulin et al., 2017). Each word is represented as a composition of several character sequences (n-grams) of certain lengths, and the word embedding is the sum of the vectors of these n-grams. Parts of words are also likely to occur in other words, which makes it possible to produce vector representations for rare words.
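The composition of a word vector from character n-grams can be sketched as follows (a simplified illustration with a small hashed n-gram table; real FastText uses its own hashing function and much larger bucket tables):

```python
import numpy as np

rng = np.random.default_rng(8)
n_buckets, dim = 50_000, 100                                 # small table, for illustration only
ngram_table = rng.normal(scale=0.1, size=(n_buckets, dim))   # vectors for hashed n-grams

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of the word wrapped in the boundary markers '<' and '>'."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def word_vector(word):
    """FastText-style word embedding: the sum of the vectors of its character n-grams."""
    grams = char_ngrams(word)
    ids = [hash(g) % n_buckets for g in grams]               # Python's built-in hash, for brevity
    return ngram_table[ids].sum(axis=0)

# Rare or unseen words still get a vector because their n-grams are shared with other words.
print(np.dot(word_vector("topic"), word_vector("topics")))
```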

Doc2vec model

The Doc2vec model allows one to map an entire document into a numeric vector (Le & Mikolov, 2014). The developers of the concept proposed the following algorithm. Each paragraph is represented as a vector, and each word in it is represented as a numeric vector characterizing the proximity of words within the paragraph. Paragraph vectors and word vectors are averaged to improve word prediction. In general, this approach is analogous to the word2vec model, with the only difference being that the window slides over the document’s paragraphs. Moreover, in the framework of this model, the stochastic gradient algorithm is used to optimize the softmax function.

ElMo model

In Peters et al. (2018), the authors proposed an approach in which word vectors are trainable functions of the inner states of a deep bidirectional language model (biLM) that was previously trained on a large corpus of texts. The authors used a deep bidirectional LSTM network with several normalization layers. This approach uses a character-by-character data representation strategy, so ElMo (Embeddings from Language Models) provides three layers of representations for each input token, including tokens outside the training set, thanks to the purely character-based input. In contrast, traditional word embedding methods provide only one level of representation for lexemes from a fixed vocabulary.

Further development of algorithms for building embeddings went through using large language models; namely, embeddings began to be used as an input to transformers that deal with various NLP tasks, such as translation, text summarization, sentiment analysis, and others. In addition, the architecture of transformers began to be used to build embeddings.

Tuning of embeddings for ‘transfer learning’ tasks

In Cer et al. (2018), the authors propose two models for encoding sentences into embeddings, which are specifically aimed at transfer learning. The proposed encoding models make it possible to find a compromise between accuracy and computational cost, since training large language models is an extremely costly procedure in terms of both time and money. The first model encodes sentences into embeddings based on a sub-graph of the transformer architecture (Vaswani et al., 2017). This sub-graph uses attention to calculate context-sensitive representations of the words in a sentence, taking into account the order and identity of all other words. The context-aware word representations are converted into a fixed-length sentence encoding vector by computing the element-wise sum of the representations at each word position. The encoder takes as input a lowercased string of PTB (Penn Treebank) tokens and outputs a 512-dimensional vector as the sentence embedding.

The second encoding model is based on the Deep Averaging Network (DAN) (Iyyer et al., 2015), in which the embeddings of words and bigrams are first averaged together and then passed through a feedforward deep neural network (DNN) to obtain the final sentence embeddings. Like the transformer encoder, the DAN encoder takes a lowercased PTB token string as input and produces a 512-dimensional sentence embedding. The authors have shown that transfer learning based on the transformer embedding encoder performs as well as or better than transfer learning based on the DAN encoder. Models for transfer-learning tasks that use sentence-level embeddings tend to perform better than models that use only word-level transfer.

Electra model

In Clark et al. (2020), a new method for pretraining text encoders based on discriminators is proposed. The proposed model (Electra: Efficiently Learning an Encoder that Classifies Token Replacements Accurately) consists of two transformer-based networks: a generator and a discriminator. The idea is as follows: the training regime based on masking words is replaced by one in which masked lexemes are substituted with lexemes taken from the generator network. Then, instead of training a model that predicts the original identities of the masked lexemes, a discriminative model is trained to predict whether each token has been replaced by a generator sample or not. As the authors have shown in their experiments, this architecture is superior to models such as BERT and GPT-2 and is comparable to the RoBERTa and XLNet models with a smaller amount of training data. In addition, this paper shows the procedure for generating embeddings based on token masking.

Acoustic word embeddings

Recently, works that construct embeddings not from text data but from images and audio have appeared. For example, in Jacobs & Kamper (2023), an algorithm for constructing acoustic word embeddings (AWEs) is proposed; these are embeddings of speech segments that encode phonetic content so that different realizations of the same phonetic content have similar embeddings. This model is based on a recurrent network that maps word segments to embeddings.

Recent applications of word embeddings

In general, the works from 2022 to mid-2023 are not focused on developing new embedding models but on creating new architectures for large language models, as well as on forming various ways to use embeddings. For example, in Muennighoff (2022), the author considers the SGPT (decoder-only transformer) model for semantic search and the extraction of meaningful embeddings based on prompt engineering and fine-tuning. In addition, websites are being actively developed where users can either find ready-made embeddings, as was done in the framework of this work, or ready-to-use pre-trained neural networks tailored for various NLP tasks, including building embeddings. One such popular resource is the ‘Hugging Face’ repository (https://huggingface.co/blog/getting-started-with-embeddings). The most recent review (April 2023) of large language models is given in Yang et al. (2023). This paper discusses the areas of application of transformers with decoder-only, encoder-only, and encoder–decoder architectures in the context of various NLP tasks. It should be noted that, by definition, transformer architectures use an embedding layer as the input layer. Hence, the main stream of scientific work in 2022–2023 is related to developing different architectures that use the embedding construction schemes described above.

In addition, we would like to mention Wang et al. (2019), in which the authors consider the most popular models for building embeddings, such as Continuous Bag-of-Words (CBOW), Skip-Gram, a model based on the co-occurrence matrix, FastText, the N-gram model, the deep contextualized model, as well as older dictionary-based models. This work is notable for the fact that the authors try to build quality metrics for embedding construction models regardless of the context of the NLP task. The authors formulated several requirements that embedding evaluation should comply with. For example, the test data on which embedding models are evaluated should be varied, with a good spread in the word space; both common and rare words should be included in the evaluation; and the performance of embedding models should have good statistical significance so that such models can be ranked. The authors conducted many experiments that are interesting and useful for end users, in which they showed how the nature of embeddings changes when they are built for such NLP tasks as (1) part-of-speech (POS) tagging, (2) named entity recognition, (3) sentiment analysis, and (4) neural machine translation (NMT). The result of this work is a guide to the selection of appropriate evaluation methods for various applications. The authors showed that there are many factors that affect the quality of embeddings. In addition, they pointed out that, so far, there are no ideal methods for testing a word subspace for the presence of linguistic relationships, since it is difficult to understand exactly how embeddings encode linguistic relationships.

References

 