Hybrid TF–IDF and user-based collaborative filtering for product recommendation in an offline Indonesian grocery store
Abstract
Most Indonesian micro-retail grocery stores lack the digital infrastructure to capture and analyse customer purchase behaviour. As a result, customers rarely receive personalized product suggestions and many items remain under-exposed on the shelves. This problem is particularly evident in mom-and-pop stores such as Toko Solo Latri, where all interactions are recorded only as semi-digital transaction logs without explicit ratings or reviews, leading to sparse and implicit feedback that challenges traditional recommender algorithms. This study proposes a hybrid recommendation model tailored for small offline grocery retailers, combining Term Frequency–Inverse Document Frequency (TF–IDF)–based Content-Based Filtering (CBF) with User-Based Collaborative Filtering (UBCF) within the CRISP–DM framework. Product descriptions are constructed from name, brand, category, packaging, and price information and transformed into TF–IDF vectors to compute content similarity via cosine similarity. Customer purchase histories are converted into user–item frequency matrices to estimate behavioural similarity between customers. To mitigate sparsity and improve stability, K-Means clustering is applied for customer segmentation. The outputs of CBF and UBCF are then integrated into a weighted hybrid scoring function. The model is evaluated on real transaction data from Toko Solo Latri comprising 102,735 transaction records, 320 products, and 200 customers. Performance is assessed using Precision@k, Recall@k, F1-Score@k, and NDCG@k under both global (80:20 train–test split) and per-user evaluation schemes. Despite the highly sparse and implicit nature of the data, the hybrid model exhibits stable ranking performance. In the global 80:20 evaluation, the system achieves Precision@5 = 0.1574, Recall@5 = 0.0103, F1-Score@5 = 0.0193, and NDCG@5 = 0.1835, with comparable trends in the per-user setting.
While the absolute scores are modest, they are consistent with prior findings on low-density transactional datasets, and the hybrid approach outperforms pure CBF and pure UBCF in terms of ranking quality. These results demonstrate that combining content similarity with behavioural similarity offers a practical and deployable solution for micro-retail grocery recommendation under severe data sparsity and implicit feedback. For Indonesian micro, small, and medium enterprises (UMKM) undergoing digital transformation, the proposed TF–IDF-based hybrid model, implemented with lightweight tooling and a Flask-based web interface, provides a feasible path towards data-driven product personalization. Future work may explore deep learning, matrix factorization, or graph-based methods to further improve recommendation accuracy in similar low-resource retail settings.
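To make the weighted hybrid scoring concrete, the following is a minimal Python sketch of how TF–IDF content similarity and user-based collaborative filtering scores might be blended for a single customer. It is illustrative only and is not the implementation evaluated in this study. The function names, the smoothed IDF variant, the min-max normalisation, the weight `alpha = 0.5`, and the toy product descriptions and purchase counts are all assumptions introduced for the example. The K-Means segmentation step is omitted.

```python
import math
from collections import Counter

import numpy as np


def l2_normalise(mat):
    """Scale each row to unit length, leaving all-zero rows unchanged."""
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return mat / norms


def tfidf_matrix(docs):
    """Build a row-normalised TF-IDF matrix from whitespace-tokenised text."""
    tokenised = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenised for t in doc})
    index = {t: j for j, t in enumerate(vocab)}
    df = Counter(t for doc in tokenised for t in set(doc))
    n = len(docs)
    mat = np.zeros((n, len(vocab)))
    for i, doc in enumerate(tokenised):
        for term, count in Counter(doc).items():
            # Smoothed IDF: an assumed variant, not taken from the paper.
            idf = math.log((1 + n) / (1 + df[term])) + 1.0
            mat[i, index[term]] = count * idf
    return l2_normalise(mat)


def cosine_sim(mat):
    """Pairwise cosine similarity between the rows of a matrix."""
    unit = l2_normalise(np.asarray(mat, dtype=float))
    return unit @ unit.T


def min_max(x):
    """Rescale a score vector to [0, 1]; a constant vector maps to zeros."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)


def hybrid_scores(user, purchases, item_sim, alpha=0.5):
    """Blend CBF and UBCF scores for one user (row index into purchases)."""
    user_sim = cosine_sim(purchases)
    np.fill_diagonal(user_sim, 0.0)  # a user is not their own neighbour
    # CBF: items similar to those already bought, weighted by frequency.
    cbf = purchases[user] @ item_sim
    # UBCF: other users' purchase counts, weighted by user similarity.
    ubcf = user_sim[user] @ purchases
    # Already-purchased items are kept, since groceries are often rebought.
    return alpha * min_max(cbf) + (1 - alpha) * min_max(ubcf)


# Hypothetical toy data: four product descriptions, three customers.
products = [
    "indomie goreng instant noodle pack",
    "sedaap goreng instant noodle pack",
    "aqua mineral water bottle",
    "le minerale mineral water bottle",
]
purchases = np.array(
    [[3, 0, 1, 0],
     [2, 1, 0, 0],
     [0, 0, 2, 2]],
    dtype=float,
)

item_sim = cosine_sim(tfidf_matrix(products))
scores = hybrid_scores(0, purchases, item_sim, alpha=0.5)
ranking = np.argsort(-scores)  # item indices, best first
```

In this sketch both component scores are rescaled to [0, 1] before blending, so that `alpha` controls the relative contribution of content and behavioural similarity regardless of their raw magnitudes. In practice the weight would be tuned on held-out transactions.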