Unsupervised query reduction for efficient yet effective news background linking

PeerJ Computer Science
We experimented with other retrieval models implemented within Lucene; however, the default one achieved the best results for the baseline method, so we adopted it for all other methods.
Only one very long query article (141 paragraphs) was omitted from the histogram for clarity.
Two outlier queries took longer to process than the others. However, removing both did not noticeably affect the average processing time.

Introduction

  1. RQ1: Can we effectively retrieve the background links by just using the lead paragraphs of the input article to construct the search query?

  2. RQ2: How effective and efficient are typical keyword extraction techniques for this task?

  3. RQ3: Which query reduction technique is more effective if we allow further reranking of the candidate links?

  • While many researchers have addressed the background linking problem, mainly within the scope of the TREC News Track, our study is the first to highlight the efficiency aspect of this problem, aiming to build an efficient background linking system that maintains the best effectiveness obtained so far. It is also the first to extensively review the literature on this problem, including resources for background linking other than news articles.

  • We present a new comparative study of several state-of-the-art unsupervised keyword extraction techniques on a new downstream task (news background linking), on which they have never been evaluated, in terms of both effectiveness and efficiency.

  • We show that simple unsupervised statistical keyword extraction techniques can considerably reduce the query response time needed to retrieve the background links, while maintaining the retrieval effectiveness of the full-article approach.

  • We make our source code for running the different methods and experiments publicly available.

Background linking problem

Definition

Relation to other problems

Methodology

Query formulation and document scoring

where w_{t,d} is the weight of term t in document d according to the retrieval model f (e.g., the term frequency of t in d). Finally, the documents with the highest scores can be retrieved as background links for the article A.
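The scoring step described above can be illustrated with a minimal sketch. It assumes the simplest case mentioned in the text, where the weight of a term in a document is its raw term frequency, and it assumes that a document's score is the sum of those weights over the query terms. The function name `score_documents`, its parameters, and the toy collection are hypothetical and are not taken from the authors' code; a real system would rely on the retrieval model of the search engine instead.

```python
from collections import Counter


def score_documents(query_terms, documents, top_k=3):
    """Rank documents by the summed weights of the query terms they contain.

    Assumption: the weight of term t in document d is the raw term
    frequency of t in d (the example weight mentioned in the text).
    """
    query = {t.lower() for t in query_terms}
    scores = {}
    for doc_id, text in documents.items():
        tf = Counter(text.lower().split())
        # Sum the per-term weights over all distinct query terms.
        scores[doc_id] = sum(tf[t] for t in query)
    # Highest score first; ties are broken by document id for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


# Toy collection (hypothetical data, for illustration only).
docs = {
    "d1": "election results election night",
    "d2": "sports results",
    "d3": "weather",
}
top = score_documents(["election", "results"], docs, top_k=2)
```

Here `top` holds the two highest-scoring documents, which would then be returned as candidate background links for the query article.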

Unsupervised keyword extraction techniques

  • Unsupervised methods: Since the problem we address is very recent and there is no labeled data for supervised learning (i.e., there is no gold set of keywords extracted from the query articles that can be used to retrieve the best set of background links), we focus only on keyword extraction techniques that are primarily unsupervised.

  • Effective number of keywords: Our goal in extracting the keywords is to use them to form search queries that retrieve candidate background documents from a large collection of news articles. Hence, we prefer keyword extraction techniques that can provide a large number of representative keywords. In our preliminary experiments, using only a few search keywords in a retrieval query, even ones that represented the query topic well, considerably lowered the retrieval effectiveness. Accordingly, techniques that failed to provide a large number of good keywords (30 in our preliminary experiments), such as Teket (Rabby et al., 2020), were excluded.

  • Effectiveness on news articles: When reporting their effectiveness, many keyword extraction studies conduct experiments on datasets of scientific articles or even books; however, news articles have special features. They are shorter, less cohesive, and may discuss multiple subtopics. Hence, we selected the recent techniques that worked best when tested on English news article datasets.
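To make the query reduction idea concrete, the following minimal sketch turns an article into a short weighted keyword query. It is only a stand-in: it ranks terms by raw frequency after stopword removal, which is not one of the techniques compared in this study. The function name `reduce_query`, the tiny `STOPWORDS` set, and the sample text are hypothetical; the default of 30 keywords follows the number mentioned in the criteria above.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use a fuller one).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "for", "is"}


def reduce_query(article_text, num_keywords=30):
    """Return the top keywords of an article as (term, weight) pairs.

    Assumption: term frequency serves as a placeholder keyword score;
    any unsupervised extractor could be swapped in at this point.
    """
    tokens = [
        w for w in re.findall(r"[a-z]+", article_text.lower())
        if w not in STOPWORDS
    ]
    counts = Counter(tokens)
    # Most frequent first; alphabetical order breaks ties deterministically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:num_keywords]


# Hypothetical sample text, for illustration only.
sample = "The council approved the budget. The budget funds schools and the council."
keywords = reduce_query(sample, num_keywords=2)
```

The resulting pairs can be issued to the search engine as a weighted query in place of the full article text, which is what shortens the query response time.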

k-Core

k-Truss

PositionRank

TopicRank

Multipartite Rank

sCAKE

Yake

Time complexity of keyword extraction techniques

Experimental evaluation

Experimental setup

Dataset

Preprocessing and indexing

Baseline

Implementation issues

Retrieval

Evaluation measures

Experimental results

Leading paragraphs as search queries (RQ1)

Keyword extraction for search query reduction (RQ2)

How far can we reach with reranking? (RQ3)

Conclusion and future work

Additional Information and Declarations

Competing Interests

The authors declare that they have no competing interests.

Author Contributions

Marwa Essam conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Tamer Elsayed conceived and designed the experiments, analyzed the data, authored or reviewed drafts of the article, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

The raw data is available at Zenodo: Marwa-Essam81. (2022). Marwa-Essam81/EfficientNewsBL: @newsbackroundLinking (@newsbackgroundlinking). Zenodo. https://doi.org/10.5281/zenodo.7329399.

Funding

This work was made possible by the NPRP grant NPRP 11S-1204-170060 from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the authors. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
