Word embedding for social sciences: an interdisciplinary survey

PeerJ Computer Science
Portions of this text were previously published as part of a preprint (Matsui & Ferrara, 2022) and the author’s dissertation (Matsui, 2022).
Kozlowski, Taddy & Evans (2019) summarize the history of word embedding models for social scientists, and Chaubard et al. (2019) provide an excellent description of word2vec.
In this survey, we review articles (Ash, Chen & Ornaghi, 2020; Lee & Kim, 2021; Caliskan, Bryson & Narayanan, 2017) that use the GloVe model (Pennington, Socher & Manning, 2014) and an article (Kawaguchi, Kuroda & Sato, 2021) that uses the fastText model (Bojanowski et al., 2017).
While “target” is also commonly used instead of “word,” we use “word” throughout this survey to avoid potential confusion.
The literature often describes this log-likelihood maximization with a noise distribution as Noise Contrastive Estimation (NCE) (Gutmann & Hyvärinen, 2010) and treats the SGNS model as one variation of NCE.
Gutmann & Hyvärinen (2010) describe the details, and this article follows the notation and equations of that work.
We should also note that cosine similarity is not a distance metric.
As a robustness check, Ash, Chen & Ornaghi (2020) also studied the correlations between their stereotype measurements computed from 100- and 300-dimensional embedding vectors. In addition, they tested three window sizes.
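The note above that cosine similarity is not a distance metric can be checked with a small numerical sketch. It assumes the commonly used conversion "cosine distance = 1 - cosine similarity", which the notes themselves do not define, and it uses hypothetical two-dimensional toy vectors rather than real embeddings. Under these assumptions, the quantity violates the triangle inequality and assigns zero distance to two different vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Assumed conversion: 1 - cosine similarity (not defined in the source)."""
    return 1.0 - cosine_similarity(u, v)


# Hypothetical toy vectors, chosen only for illustration.
a, b, c = (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)

d_ac = cosine_distance(a, c)  # a and c are orthogonal
d_ab = cosine_distance(a, b)  # b lies at 45 degrees from a
d_bc = cosine_distance(b, c)  # b lies at 45 degrees from c

# A metric requires d(a, c) <= d(a, b) + d(b, c); this check fails here.
triangle_violated = d_ac > d_ab + d_bc

# A metric requires d(u, v) = 0 only when u == v; rescaling breaks this.
scaled_a = (2.0, 0.0)
d_scaled = cosine_distance(a, scaled_a)

print(f"d(a,c)={d_ac:.4f}, d(a,b)+d(b,c)={d_ab + d_bc:.4f}")
print(f"triangle inequality violated: {triangle_violated}")
print(f"d(a, 2a)={d_scaled:.4f}")
```

Because the measure ignores vector length, it compares directions only, which is one reason to treat it as a similarity score rather than a distance.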


Introduction

Survey methodology

Word embedding models

word2vec: a popular word embedding model

CBOW, skip-gram model and SGNS model

Taxonomy of applied methods and labeling literature

Labeling literature

Taxonomy

Pre-trained models

Overfitting models

Working variable

Reference words

Comparing the same words

Non-text

Cosine similarity or Euclidean distance?

Pitfalls with cosine similarity for similarity measurement with word embedding vectors

Conclusion

Additional Information and Declarations

Competing Interests

The authors declare that they have no competing interests.

Author Contributions

Akira Matsui conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Emilio Ferrara conceived and designed the experiments, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

This is a literature review.

Funding

This work was supported by JSPS KAKENHI Grant-in-Aid for Scientific Research (No. JP22K20159); Research Institute of Science and Technology for Society, Japan, Grant Number JPMJRS23L4. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
