Semantic approaches for query expansion: taxonomy, challenges, and future research directions

PeerJ Computer Science


Introduction

Rationale

Organization of the study

Survey methodology

  • RQ1: Which data types and sources are most commonly utilized in query expansion methods?

  • RQ2: What techniques and methodologies are employed to enhance query expansion, particularly those that incorporate semantic approaches?

  • RQ3: What are the major challenges and future research directions in the field of semantic query expansion?

Background

Preprocessing

Tokenization

Data cleaning

Stemming

Feature extraction

Terms selection

Terms ranking

Query reformulation

Evaluation

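$$\text{Precision} = \frac{\text{Total number of documents retrieved that are relevant}}{\text{Total number of documents that are retrieved}}$$

$$\text{Average precision} = \frac{\sum_{n} \text{precision of the top } n \text{ retrieved documents}}{\text{Total number of relevant documents}}.$$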
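The two measures above can be sketched in a few lines of Python. This is a minimal illustration, not the survey's own implementation: the function names and the sample document identifiers are hypothetical, and the average precision sum is assumed to run over the ranks n at which a relevant document is retrieved, which is the usual convention but is not spelled out in the formula.

```python
from typing import Sequence, Set


def precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(retrieved)


def average_precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Sum of precision at each rank holding a relevant document,
    divided by the total number of relevant documents.

    Assumption: only ranks where a relevant document appears contribute.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for n, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            total += hits / n  # precision of the top n retrieved documents
    return total / len(relevant)


# Hypothetical ranked result list and relevance judgments.
ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant_docs = {"d1", "d3", "d6"}

p = precision(ranked, relevant_docs)
ap = average_precision(ranked, relevant_docs)
```

In this toy run, two of the five retrieved documents are relevant, and the relevant document "d6" is never retrieved, so it lowers the average precision without affecting precision.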

Impact and applications of semantic query expansion

  • Improved IR systems: IR systems are central to SQE, which contributes to the relevance and precision of the results they return. Semantically expanded queries allow search engines and digital libraries to understand the user's needs even when the query is poorly defined or unclear. For instance, query refinement by SQE in web search engines yields results that are more contextually relevant and hence a better user experience. This has implications not only for general search engines but also for specialized search engines used in domains such as law, academia, and e-commerce.

  • Health care and biomedical research: SQE can be of great help in an information-intensive sector such as health care, where timely and accurate information is highly valued. In medical literature retrieval, it can expand a search query about symptoms, diagnoses, treatments, or medications by adding synonyms, related terms, or contextually relevant concepts. This allows researchers and practitioners to locate articles or studies that they might otherwise miss. For example, results from databases such as PubMed can be enhanced using SQE to retrieve a more comprehensive set of relevant studies for healthcare professionals. It can also improve the precision of medical queries in clinical decision support systems, contributing to better decision-making.

  • Multilingual search and cross-lingual applications: The ability of SQE to work across language barriers is one of its most striking applications. As the internet becomes more global and its sources increasingly diverse, SQE can be of great value in supporting multilingual search. Using cross-lingual semantic expansion, for instance, SQE may enable users to retrieve relevant information across multiple languages without having to reformulate queries in each target language. This is relevant to international business, diplomacy, education, and customer support alike. It can be applied to improve the retrieval of multilingual content, such as in Wikipedia, e-commerce websites, or social networking sites, so that more people are able to access information available on a global scale.

  • E-commerce and customer support: Applied in an e-commerce context, SQE can refine product results by inferring the intent of the customer query and extending it to related products, categories, or features. For example, a search for “laptop” can be extended to “laptops with extended battery life” or even “best gaming laptop,” so that the results are closer to what the customer is looking for. SQE can also improve the accuracy of customer care chatbots and virtual assistants by extending queries with synonyms and related terms, making the interaction more intuitive and faster.

  • Social media and trend analysis: With social media platforms becoming one of the most significant sources of information and opinions, SQE may play an important role in the analysis of trends and sentiment. Expanding the main query with synonyms and contextually related terms improves the chances of catching emerging trends, monitoring public sentiment, or tracking important discussions related to a particular topic. In Twitter data analytics, for example, SQE is expected to surface more relevant tweets by expanding the query to include other forms of expression that capture essentially the same idea, thereby giving insight into public opinion that can help businesses and governments make informed decisions.

  • Personalized user experience: With more and more personal digital assistants emerging, such as Siri, Alexa, and Google Assistant, SQE can further improve the ability of these systems to understand user queries. Semantic expansion of the queries lets virtual assistants provide more precise answers that take into account the user’s preferences, location, and context. This would lead to more seamless and far more personalized user experiences across applications in e-commerce, media, and daily life.

  • Education and knowledge management: In the educational sector, SQE can enhance knowledge retrieval from digital textbooks, online courses, and other educational databases. Students and researchers would obtain search results that go beyond mere keyword matching to cover concepts, terms, and ideas related to the query at hand. SQE can also semantically extend user queries in knowledge management systems so as to surface relevant documents or articles that users might be interested in.
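Several of the applications above rest on the same basic step: adding synonyms or related terms to the user's query. The following is a minimal sketch of that step, assuming a small hand-written thesaurus. The `THESAURUS` entries and the `expand_with_thesaurus` function are hypothetical and stand in for a real lexicon or ontology.

```python
from typing import Dict, List

# Hypothetical thesaurus for illustration only; a real system would draw
# these related terms from a lexicon or a domain ontology.
THESAURUS: Dict[str, List[str]] = {
    "laptop": ["notebook", "ultrabook"],
    "headache": ["migraine", "cephalalgia"],
}


def expand_with_thesaurus(
    query: str, thesaurus: Dict[str, List[str]], max_per_term: int = 2
) -> List[str]:
    """Return the original query terms followed by related terms.

    At most `max_per_term` related terms are added per query term, and
    terms already present are not repeated.
    """
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for candidate in thesaurus.get(term, [])[:max_per_term]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


expanded_query = expand_with_thesaurus("gaming laptop", THESAURUS)
```

The original terms are kept first so that a downstream ranker could still weight them more heavily than the added ones; capping the number of additions per term is one simple way to limit how far the query moves from the user's wording.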

Query expansion categories

Local-source category

Global-source category

Knowledge-based category (Semantic expansion)

Linguistic structure approach

Ontology approach

Hybrid approach

Linguistic-based approaches

Linguistic models

Linguistic-embeddings models

Word2vec

BERT

Ontology-based approach

General domain

Domain-specific

Hybrid-based approach

General domain

Domain-specific

Discussion

Challenges

Source availability

User intervention

User context

Time complexity

Threshold selection

Scalability

Query drift

Security

Open issues and future research directions

  • Constructing an ontology can be a time-consuming and demanding task. Therefore, it is important to develop an algorithm that can automate this process while integrating the expertise of ontology developers and experts. For instance, in Deepak & Priyadarshini (2018), an approach is proposed that accomplishes this process in a more controlled and careful manner.

  • The availability of ontologies and lexicons is limited in languages other than English, such as Arabic, which can hinder the expansion of candidate queries. Building a multilingual ontology or lexicon can help QE overcome this limitation. In Al-Smadi et al. (2019), an automated question answering system is introduced that bridges the gap between Arabic questions and linked data by translating natural language questions into SPARQL queries that retrieve answers from linked data sources such as DBpedia. To achieve this, the named entity of the original query is identified with the help of Wikipedia labels available in Arabic. In contrast, AlAgha (2015) utilizes the statistical parser of the Arabic Toolkit Service to identify the subject, object, and predicate of the Arabic text and construct the corresponding RDF triple. To improve the reliability of linguistic resources, slang phrases must also be considered. N-grams and word/sentence embeddings can play a significant role in query processing, helping to catch such phrases with the help of reliable sources.

  • Applying AI to query expansion is a challenge, but it could add intelligence to the approaches and generate accurate and useful expanded queries. Deep learning, for instance, could help in recognizing hidden patterns among texts or images, which can provide richer semantics. Furthermore, it can provide contextual analysis of the text and match it with an accurate topic, so that the framework can proceed to fetch the closest topic. For instance, Mohamed & Shokry (2020) present a framework that trains a Word2vec model using CBOW on an Arabic corpus and uses it for searching the Quran. It achieved high precision by increasing the intelligence of the vector structure. In addition, Fang, Zhang & Yin (2018) used a trained Word2vec model to capture both sequential and semantic information of biomedical texts. Machine learning will also be useful in applications that focus on searching domain-specific areas, since the training dataset is precise and well defined. For instance, in Zhong et al. (2020) the authors used a deep learning methodology to construct a question answering system for building regulations to help engineers retrieve the information they need. In addition, Cakir & Gurkan (2023) and Khader & Ensan (2023) showed great potential in using generative models that can expand the query in an intelligent way.

  • In the realm of semantic QE, incorporating real-time data is an area that is yet to be fully explored. However, a real-time semantic search has the potential to provide valuable insights on trending topics, especially during emergencies and major events, when people tend to scour social media platforms for information. By leveraging the user’s semantic attributes, a framework described in Zhu et al. (2017) was able to effectively search Twitter stream data, yielding more accurate and interesting results.

  • The semantic search for multimedia content, such as images and videos, is still in its nascent stages. Existing frameworks rely on user intervention to determine semanticity, as seen in Dao Thi Thuy et al. (2017), which can be misleading and fails to provide an actual semantic expansion approach. Incorporating deep semanticity with minimal user judgment can lead to better semantic frameworks. One way to achieve this is by constructing a vector space of the images based on multiple features to create a digital description of the multimedia content. This approach was used in Tautkute & Trzcinski (2021), where the authors aim to extract semantic information from different data forms for image searching. They use a GAN framework to generate a synthetic-related image based on a multimodal query containing both an input image and textual description, which is then used to search for more related images.
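The embedding-based direction mentioned in the list above can be illustrated with a toy nearest-neighbour expansion. This is a sketch under stated assumptions, not the method of any cited work: the three-dimensional vectors in `EMBEDDINGS` are hand-written and hypothetical (a trained model such as Word2vec would supply real ones), and `expand_query`, `k`, and `threshold` are illustrative names.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical 3-dimensional embeddings, written by hand for illustration.
EMBEDDINGS: Dict[str, List[float]] = {
    "doctor": [0.85, 0.15, 0.05],
    "physician": [0.90, 0.10, 0.00],
    "nurse": [0.70, 0.30, 0.10],
    "laptop": [0.00, 0.20, 0.95],
    "notebook": [0.05, 0.25, 0.90],
}


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_terms(term: str, k: int, threshold: float) -> List[Tuple[str, float]]:
    """Up to k vocabulary terms whose similarity to `term` meets the threshold."""
    if term not in EMBEDDINGS:
        return []
    scored = [
        (other, cosine(EMBEDDINGS[term], vec))
        for other, vec in EMBEDDINGS.items()
        if other != term
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(word, score) for word, score in scored[:k] if score >= threshold]


def expand_query(query: str, k: int = 2, threshold: float = 0.8) -> List[str]:
    """Append the nearest neighbours of each query term to the query."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for word, _ in nearest_terms(term, k, threshold):
            if word not in expanded:
                expanded.append(word)
    return expanded


doctor_query = expand_query("doctor")
laptop_query = expand_query("laptop")
```

The similarity threshold is the knob that keeps unrelated terms out: set too low, the expansion admits terms from other topics, and set too high, nothing is added at all.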

Conclusion

Additional Information and Declarations

Competing Interests

Author Contributions

Data Availability

Funding

The authors received no funding for this work.
