A hybrid integration framework based on LOOCV and SARIMA: relationship exploring and predictive analysis between discipline attention and literature research

Yulin Zhao; Junke Li; Kai Liu; Chaowang Shang

doi:10.7717/peerj-cs.2754

A hybrid integration framework based on LOOCV and SARIMA: relationship exploring and predictive analysis between discipline attention and literature research

Yulin Zhao¹, Junke Li ², Kai Liu^2,3, Chaowang Shang¹

1Faculty of Artificial Intelligence in Education, Central China Normal University, Wuhan, China

2School of Information Engineering, Suqian University, Jiangsu, China

3School of Computer and Information, Qiannan Normal University for Nationalities, Duyun, China

DOI: 10.7717/peerj-cs.2754

Published: 2025-04-01
Accepted: 2025-02-19
Received: 2024-10-29

Academic Editor: Xiangjie Kong

Subject Areas: Computer Education, Data Science, Multimedia, Network Science and Online Social Networks, Visual Analytics
Keywords: Discipline attention, Literature research, Relationship exploration, SARIMA, Predictive analysis

Copyright: © 2025 Zhao et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.

Cite this article: Zhao Y, Li J, Liu K, Shang C. 2025. A hybrid integration framework based on LOOCV and SARIMA: relationship exploring and predictive analysis between discipline attention and literature research. PeerJ Computer Science 11:e2754 https://doi.org/10.7717/peerj-cs.2754

The authors have chosen to make the review history of this article public.

Abstract

Analyzing the relationship between the discipline of network attention and literature research can provide new insights for the innovative development of future disciplines. Many current studies focus on network attention, but its innovative application in the field of subject teaching has not been fully verified. Based on this, this paper proposed a relationship analysis and predictive analysis (RAPA) framework based on leave-one-out cross-validation (LOOCV) and Seasonal Auto-Regressive Integrated Moving Average (SARIMA) to explore the relationship between subject attention and literature research from the perspective of junior high school information technology. Based on the RAPA framework, five key keywords of this subject were extracted by combining the Baidu Index and China National Knowledge Infrastructure (CNKI) in first. Secondly, LOOCV was used to explore the relationship between subject attention represented by keywords and literature researches. Then, SARIMA was used to predict the future trends of subject attention and its literature researches. Finally, the prediction errors of different methods were compared. Based on the RAPA framework, the correlation analysis found that the r-values of subject attention and literature researches were all greater than 0.75, indicating a positive correlation between them. The predictive analysis found that the subject attention of junior high school information technology will be flat or decline in the next 2 years. Meanwhile, the amount of literature in this discipline has decreased compared to previous years, with an average of approximately 136. The prediction comparison showed that the prediction method in this study has a smaller mean absolute error (MAE) than other methods, and the MAE difference is 3.51. This indicated that subject attention, as an auxiliary variable of scientific research literature, is conducive to the quantitative analysis of literature research. At the same time, this study revealed the influence and role of big data represented by Internet attention in educational research.

Introduction

As a subject in the basic education stage, junior high school information technology is based on the background of cultivating students’ learning psychology, discipline behavior, and learning methods of computers. It is a discipline that emphasizes both theory and practice. Subject network concern refers to the current scholars’ concern about the hot words and information of the subject. With the popularity of the Internet and intelligent devices, people are no longer limited to academic journals, and more and more people use computers and mobile terminals to search for relevant information. Compared with traditional sci-tech periodicals, network media has a short time lag, fast transmission speed, and more easily meets the needs of scholars. However, whether there is a complex dynamic relationship between academic attention based on network media and the amount of traditional journal literature is rarely proposed in the existing literature. As the media and carrier of knowledge dissemination, the research on the relationship between sci-tech periodical literature and network media is of practical significance.

In recent years, many scholars have been devoted to the study of online media attention and scientific and technological literature. Zhang et al. (2021) empirically analyzed the Google search index and found that Internet concern has a strong Granger causality with Bitcoin transaction volume. Based on the Baidu index, Xia & Hu (2021) revealed that investors’ concern about Corona Virus Disease 2019 (COVID-19) had a positive effect on the pharmaceutical stock market. Cascajares et al. (2021) found that environmental sciences and medicine are the most relevant fields in bibliometrics, second only to social sciences and computer science through bibliometric evaluation by Scopus and Web of Science (WoS). Some scholars have made a comparative analysis by using the literature research of scientific and technological journals and the hot spots of network concern. Huang (2019) proposed that tourism journals should be good at capturing the cutting-edge issues and publish them in time to improve their competitiveness by comparing the literature of tourism journals and the hot spots of Internet concern. Zhang et al. (2018) proposed a hot spot identification method for scientific research literature based on social network attention, and verified the timeliness of this method through multiple comparative experiments. The current research content is rich and diverse, providing reference resources for this study. However, most research has focused on single-dimensional analysis of online attention or scientific journals. For example, the impact of Internet attention on the market (Zhang et al., 2021; Xia & Hu, 2021), identifying research hotspots using online attention (Zhang et al., 2018), and bibliometric analysis of scientific journals (Cascajares et al., 2021). In addition, some studies have conducted comparative analysis between online attention and scientific journals, such as the comparative study on the hotspot timeliness of tourism journal literature and tourism attention (Huang, 2019). However, few studies have focused on analyzing the relationship between online attention and scientific journals, especially in the field of information technology that emphasizes the balance between theory and practice. In view of this, this article aims to explore the relationship between disciplinary attention and literature research from the perspective of information technology discipline. Under the guidance of this relationship, the number of articles in information technology discipline in the next 2 years is predicted.

To achieve this goal, a hybrid integration framework based on leave-one-out cross-validation (LOOCV) and Seasonal AutoRegressive Integrated Moving Average Model (SARIMA) is constructed. By analyzing the relationship between disciplinary attention and literature research, this framework further validates its effectiveness and strengthens the auxiliary role of disciplinary attention to literature research. The highlight of this study lies in the proposed hybrid integration framework RAPA. The RAPA framework is based on two dimensions: disciplinary attention and literature research, and integrates multiple methods such as Pearson Correlation, LOOCV, Multiple Regression, SARIMA, etc. In the RAPA framework, this article discusses several aspects of disciplinary attention and literature research in detail, including relationship analysis, prediction analysis, and prediction difference comparison. Therefore, this article’s research questions mainly focus on the following aspects:

RQ1. What is the relationship between subject network attention and its literature researches based on a new perspective of junior high school information technology?

RQ2. Based on the above relationship, what is the future trend of subject network attention and literature researches?

RQ3. Based on the prediction results, what is the discrepancy between our prediction method and other prediction methods?

In addition, the main contributions of this study are as follows:

(1) On the basis of previous studies on Internet attention (medicine, economy, agriculture, etc.), this article further extended the applied research on Internet attention in disciplines, so as to expand the research field of network attention.

(2) From the perspective of the information technology subject in junior high school, this article broke the previous review research on this subject and increased the quantitative research on the subject.

(3) This article proposed a hybrid integration framework relationship analysis and predictive analysis (RAPA) based on LOOCV and SARIMA, and confirmed the quantitative relationship between disciplinary attention and the amount of scientific literature from a disciplinary perspective. This provided a reference case for the quantitative analysis of scientific journal literature research.

Materials and Methods

Materials

The research object is from China National Knowledge Infrastructure (CNKI). Specifically, all the literature-themed “Junior High School Information Technology” from 2011 to 2023 was searched in Chinese Journal Full-text Database (CJFD), and keywords were extracted using keyword frequency analysis. This article first calculated the keyword frequency for each year using the complete counting method of word frequency analysis, then accumulated the word frequencies of the same keywords for all years and output them in reverse order. Finally, all keywords in the information technology subject of junior high school in the past decade were counted. These keywords were used as preliminary pre-selected objects of junior high school information technology subjects and further filtered as required.

The quantity of academic articles is indexed by the quantity of CNKI literature of the keywords collected. The subject network attention is measured by Baidu Index. In this article, network attention refers to the degree to which netizens pay attention to a certain social hot topic or thing in the era of the internet. When this kind of attention is specifically applied to a discipline, it forms the discipline network attention. The discipline network attention reflects the degree of people’s interest in research hotspots, article topics, or content related to the discipline. Usually, it is measured by the number of searches for relevant information about the subject, such as keywords, academic articles, etc. This indicator reflects the popularity and influence of the discipline on the Internet, and the Baidu Index is an important tool to measure this indicator. The Baidu Index is based on the record of massive search data of netizens on Baidu’s search platform, which can reflect social hot spots, and netizens’ interests and needs over a period of time. Specifically, Baidu Index is derived by weighting the search scale and frequency of keywords based on Baidu’s massive search behavior data of netizens. It provides a quantitative indicator of netizens’ attention and has been widely used by many scholars in various fields in recent years (Chen et al., 2023; Liu et al., 2024), with high authority and universality. Therefore, this article chooses Baidu index as the quantitative index of subject network attention.

This article collects the relevant data from the CNKI and Baidu Index, and analyzes the relationship between discipline literature researches and discipline network concern.

RAPA framework

According to the two dimensions of disciplinary attention and literature research, this article constructs a hybrid integration framework for relationship analysis and predictive analysis (RAPA) by using multiple methods comprehensively (Fig. 1). Meanwhile, this article verifies the effectiveness of the RAPA framework from the perspective of information technology discipline.

Figure 1: RAPA research framework.

Download full-size image

DOI: 10.7717/peerj-cs.2754/fig-1

Firstly, this article takes the discipline of information technology as an example and extracts the main keywords of the discipline by word frequency analysis. Then, based on the main keywords, it divides two dimensions: disciplinary attention and disciplinary literature research.

Secondly, according to the two dimensions divided, this study collects their respective datasets as the basic data source for the RAPA framework.

Finally, this article explores the specific application process of the RAPA framework using several methods such as Pearson Correlation, LOOCV, SARIMA, and Predictive regression, further verifying the effectiveness of the RAPA framework implementation.

The specific implementation process of the RAPA framework includes three steps, which correspond to three research questions RQ1, RQ2, and RQ3, respectively. Each step used a unique method for experimental data analysis, which is described in detail below.

Step 1: For RQ1, RAPA analyzed the relationship between subject network attention and subject literature researches.

Selecting representative data of subject network attention (Baidu Index) and CNKI subject literature researches, RAPA used the Pearson correlation coefficient to measure the relationship between subject attention and the number of its literature researches. The Pearson correlation coefficient can reflect the linear correlation between two variables and has been used in multiple studies (Alhawarat, Abdeljaber & Hilal, 2021). Therefore, this article referred to the Rahadian et al.’s (2023) practice and used quantitative data to analyze the correlation between them.

Step 2: For RQ2, RAPA predicted the future trend of subject network attention and subject literature researches.

Based on the above relationship, RAPA first established a regression model between subject attention and subject literature quantity by cross-validation method; secondly, using time series to predict subject attention; finally, the subject attention and the regression model were used to predict the subject literature quantity. The time series prediction method based on network attention has been applied to many fields (Wang et al., 2023; Yadav & Thakkar, 2024). This study drew on Luo et al.’s (2023) approach and predicted the future development trend of information technology in junior high school.

Step 3: For RQ3, RAPA explored the differences between different predictive methods.

Based on the prediction results of the subject literature researches, RAPA compared the prediction accuracy and differences between the method in this article and gray prediction (Aykaç Özen & Öbekcan, 2023) using two evaluation indicators, Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE). The comparison was used to verify the validity of the prediction method and the reliability of the experimental result in this article. The following was a detailed introduction to specific research methods.

Relationship analysis

Correlation calculation

In statistics, the linear Correlation between two variables A and B is generally expressed by the Pearson Correlation coefficient (r), whose value is between −1 and 1. The correlation coefficient can be defined only when the standard deviation of both variables is not zero. See Formula (1) for the expression. The greater the absolute value of r, the larger the relation between variables. The absolute value of r goes from 0 to 1, and the correlation goes from weak to strong.

(1) $r = \frac{\sum (x_{i} - \bar{x}) (y_{i} - \bar{y})}{\sqrt{\sum {(x_{i} - \bar{x})}^{2}} \sqrt{\sum {(y_{i} - \bar{y})}^{2}}}$

LOOCV

As a statistical method for estimating the performance of machine learning models, Cross-validation (Fleming & Garen, 2022) generates multiple small training and test sets and uses these data sets to adjust the model parameters. The standard k-fold cross-validation is to cross-train the model with k-1 subsets and test the model with the other subsets. This method’s advantage is that all the sample data are used as training and verification models. LOOCV is a special cross-validation method (Zhao et al., 2023). In the training process of each iteration, one sample is used as the validation set, and the other n-1 samples are used as the training set. Repeat this process until each sample of the dataset is used as a verification set. Cross-validation helps optimize the model’s hyperparameters and improves its generalization performance through iterative training and validation on multiple subsets of data. On this basis, the article combines cross-validation to build a multiple regression equation, which is used to establish the relationship between discipline concern and the number of academic articles.

(2) $l n y = b + a_{1} l n x_{1} + a_{2} l n x_{2} + a_{3} l n x_{3} + . . . . . . + a_{n} l n x_{n} + ϵ_{n} .$ In Formula (2), y represents the number of literature researches on the discipline. x₁, x₂,…, x_n is the explanatory variable, namely the Baidu Index of multiple keywords reflecting the subject. b is the regression constant; a₁, a₂,…, a_n is the regression coefficient to be estimated, namely the mean value of model parameters trained by cross-validation. ε is the random error. In this article, both independent variables and dependent variables are logarithmically processed to eliminate heteroscedasticity.

Predictive analysis

SARIMA model

The SARIMA model is one of the predictive analysis methods of time series (Theerthagiri & Ruby, 2023). The general expression for the SARIMA model is:

(3) $Φ_{p} (L) A_{P} (L^{s}) Δ_{d} Δ_{s}^{D} y_{t} = Θ_{q} (L) B_{Q} (L^{s}) ϵ_{t} .$

In Formula (3), P, Q, p, and q represent the maximum lag order of seasonal and non-seasonal autoregressions, and moving average operators respectively. D and d represent the differential order of seasonal and non-seasonal, respectively; s is the change period of seasonal series. The above formula can be denoted as ARIMA(p, d, q) × (P, D, Q)s-order seasonal time series model.

The main steps of SARIMA model construction and prediction are as follows:

(1) Test the stationarity of the original sequence. If the sequence is not stationary, transform the original sequence into a stable sequence by difference and seasonal difference, so that $x_{t} = Δ_{d} Δ_{s}^{D} y_{t}$ ;

(2) Build the model with x_t, specifically:

a. Test the stationary sequence after difference, and use Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) to determine the optimal order, and then determine the model parameters;

b. Establish a model with the estimated parameters, and perform model fitting to evaluate the model effectiveness.

c. Use the established model for forecasting.

Augmented Dickey-Fuller Test

The SARIMA model requires that the time series be stationary, meaning that the statistical patterns of the data do not change over time. Augmented Dickey-Fuller (ADF) test is a commonly used rigorous statistical test method (Zou & Politis, 2021). The calculation formula for ADF statistics is:

(4) $A D F = (Y_{t} - Y_{t - 1}) - δ Y_{t - 1} .$

In Formula (4), Y_t represents the current value of the time series, and Y_t−1 represents the previous value of the time series. $δ$ represents the trend of a time series, which can be a constant or a linear trend. If the value of the ADF statistic is less than a certain critical value, the null hypothesis can be rejected, indicating that the time series has a unit root. Otherwise, the null hypothesis cannot be rejected.

Prediction differences

MAE

MAE refers to the average absolute error between predicted values and actual values. The numerical unit of MAE is the same as the original data, which can intuitively reflect the size of prediction error. Its calculation formula is:

(5) $M A E = \frac{1}{n} \sum_{i = 1}^{n} | y_{i} - {\hat{y}}_{i} |$ where n represents the number of samples, y_i represents the actual value, and ${\hat{y}}_{i}$ represents the predicted value.

MAPE

MAPE not only considers the deviation between predicted values and actual values, but also the ratio between deviation and actual values. This indicator does not change due to the global scaling of the target variable (Barboza, Nunes Silva & Augusto Fiorucci, 2023). It is suitable for problems with large dimensional differences in the target variable. Its calculation formula is:

(6) $M A P E = \frac{100 %}{n} \sum_{i = 1}^{n} | \frac{y_{i} - {\hat{y}}_{i}}{y_{i}} | .$

The value of MAPE is a percentage, which can intuitively reflect the size of the prediction error relative to the actual value. In Formula (6), n represents the number of samples, y_i represents the actual value, and ${\hat{y}}_{i}$ represents the predicted value.

Results

Relationship analysis and modeling

Based on RAPA framework, this article used correlation analysis, cross-validation and other methods to explore the relationship between subject network attention and subject literature researches.

Keyword screening

The screening of keywords follows the following steps:

(1) Extract Trending Keywords and their word frequency. Retrieve literature on the topic of “Junior High School Information Technology” from 2011 to 2023 on CNKI and export them. Then use VOSviewer to obtain Trending Keywords and their frequency.

(2) Construct a keyword frequency matrix. Count keywords and word frequencies by year, and put them in the list. In the list, each column represents each keyword, and each row represents the keyword frequency for each year, forming a word frequency matrix.

(3) Calculate the total frequency of each keyword. Count the total word frequency for all years based on the keywords in each column, and then sort them in reverse order according to the total word frequency. Table 1 shows the top 30 keywords in literature research on junior high school information technology.

Table 1:

Trending Keywords about “middle school information technology” subject.

Num	Keywords	Frequency	Num	Keywords	Frequency
1	Information technology	1,170	16	Information-based teaching	61
2	Junior high school information technology	272	17	Teaching resources	57
3	Junior high school mathematics	227	18	Educational informatization	55
4	Teaching	221	19	Learning interest	48
5	Classroom teaching	121	20	Teaching strategy	47
6	Micro-lesson	119	21	Informatization environment	45
7	Junior high school physics	117	22	Junior middle school mathematics teaching	40
8	Modern information technology	113	23	Information literacy	39
9	Junior high school information technology teaching	90	24	Multimedia technology	38
10	Application	88	25	Instructional design	31
11	Multimedia teaching	83	26	Junior high school biology	27
12	Teaching mode	77	27	Micro video	27
13	Flipped classroom	76	28	Autonomous learning	25
14	Core literacy	69	29	Curriculum integration	22
15	Efficient classroom	65	30	Computational thinking	21

DOI: 10.7717/peerj-cs.2754/table-1

Relationship analysis

The preliminary extracted keywords have a large amount of data and complex information, so it is necessary to further streamline the sample to eliminate secondary words. Based on the experimental purpose, secondary filtration was conducted:

1. Remove words that are not relevant to the topic; such as: junior high school physics, junior high school biology;

2. Remove words that are not included in Baidu Index; such as: core literacy, computational thinking;

3. Remove words that cover a large range: such as: teaching, application.

Secondly, this article used Pearson correlation analysis to explore the relationship between subject attention and literature research represented by keywords. On this basis, keywords that are highly correlated with junior high school information technology subjects (r > 0.5) and have statistical significance are selected as research objects.

This article conducts correlation analysis on the filtered words through t-test. Meanwhile, based on the Pearson test results, select words that can characterize this subject and have statistical significance (p < 0.05): Information-based teaching (r = 0.887**), Micro-lesson (r = 0.841**), Flipped classroom (r = 0.798**), Multimedia teaching (r = 0.867**), Educational informatization (r = 0.753**). These keywords collectively reflect the key content and characteristics of junior high school information technology subject from the aspects of teaching mode, teaching resources, teaching methods, and teaching technology (Huang & Wang, 2020). Among them, information-based teaching emphasizes the use of information technology (such as computer, network, multimedia, etc.) to optimize the teaching process and improve the teaching effect. It reflects the modernization trend of teaching methods and technology in the field of junior high school information technology. Micro-lessons promote the fragmentation and personalization of teaching content, making teaching more flexible and efficient. It represents the innovation in the presentation of teaching content and teaching resources. Flipped classroom aims to fully utilize classroom time to enhance students’ self-learning and cooperation abilities (Qin & Lu, 2018). Its implementation process requires the support of information technology, which reflects the reform of the teaching mode of information technology. Multimedia teaching provides students with rich and diverse learning resources (such as images, sounds, videos, etc.), making the teaching content more vivid and intuitive. It reflects the diversity of teaching methods. Educational informatization is an important force in promoting the development of information technology disciplines, as well as a process of promoting educational modernization and reform. It emphasizes the comprehensive and profound impact of information technology on education. The selected keywords reflect the subject characteristics of middle school information technology from multiple dimensions, which together constitute the rich connotation of middle school information technology subject. In addition, it can be found that the discipline network attention represented by these five keywords is positively correlated with academic study (0.6 < r < 0.8 represents strongly related).

Relationship modeling

The simple correlation only represents the correlation between single factors, which cannot completely and objectively reflect the actual relationship between independent variable and the dependent variable. Therefore, multiple regression analysis is further applied based on simple correlation analysis (Duan, 2022). This article used the strong correlation reflected by keywords to build the relationship between the subject’s network attention with the number of academic articles. The network attention of the keywords screened in the previous step was selected as the independent variable, that is, the annual Baidu Index of information-based teaching, micro-lesson, flipped classroom, multimedia teaching, and education informatization. The number of academic articles on this subject is selected as the dependent variable, that is, the annual academic articles of CNKI-themed “Junior High School Information Technology”. The variable definitions are shown in Table 2.

Table 2:

Model variables and their definitions.

Variable	Dimension	Indicator	Symbol	Data source
Independent variable	Subject attention	Information-based teaching	x₁	Annual average of Baidu Index
		Micro-lesson	x₂
		Flipped classroom	x₃
		Multimedia teaching	x₄
		Educational informatization	x₅
Dependent variable	Literature research on Junior high school information technology	The amount of literature on information technology topics in junior high school	y	Amount of CNKI annual publication

DOI: 10.7717/peerj-cs.2754/table-2

The article selects the sample data set from 2011 to 2023 for model parameter training, and finally calculates the average of each regression coefficient. The regression equation obtained is:

(7) $l n y = 0.366 l n x_{1} + 0.345 l n x_{2} - 0.149 l n x_{3} + 0.655 l n x_{4} - 0.143 l n x_{5} - 0.483.$

Formula (7) reflected the relationship between subject network attention and subject literature researches represented by keywords, and was also an answer to RQ1. Subject network attention refers to the attention paid by network users to subject keywords, which represents the development status of the subject. The subject literature researches are quantitative reflections of scholars’ studies on the subject. There was a positive correlation between the subject attention and the amount of literature represented by the keywords. Its relationship also reflected the dynamic correlation between the network attention effect and traditional scientific journals. This helped to predict the future publishing trend of this subject.

Predictive analysis

Based on RAPA framework, this article made a prediction analysis on the basis of the relationship between subject network attention and the number of academic papers. First, RAPA used the SARIMA model to predict the network attention of the keywords in the next 2 years based on historical observations. Then, according to the regression relationship model between the subject’s attention and its literature research, the publication number of the literature researches in this subject is further predicted.

Construction of predictive models

Based on the information technology subject of junior high school, this article predicts the subject attention according to the selected keywords. Considering the limited space, this article takes the keyword “Flipped classroom” as an example to analyze the construction process of the prediction model in detail. Other keywords are similar.

Sequence analysis

Figure 2 showed the original sequence of the ‘flipped classroom’, which exhibited a trend of regular fluctuations. By dividing the original time series into three parts: Trend, Seasonal, and Residual term, to observe the seasonal characteristics of the data. Figure 2 showed that the time series has a relatively obvious seasonal trend, with annual data showing a trend of first increasing and then decreasing. Meanwhile, the ADF test results of the original sequence were obtained (Table 3). Table 3 showed that p = 0.944730, much larger than 0.05, with unit roots. Therefore, the original sequence was non-stationary and required differential processing to meet the stationarity characteristics.

Figure 2: Time series decomposition of the model.

Download full-size image

DOI: 10.7717/peerj-cs.2754/fig-2

Table 3:

ADF test results of the model.

Test statistic	p-value	Critical value (1%)	Critical value (5%)	Critical value (10%)
−0.145348	0.944730	−3.510712	−2.896616	−2.585482

DOI: 10.7717/peerj-cs.2754/table-3

Sequence stabilization and testing

From a statistical point of view, only stationary time series can avoid the existence of “pseudo-regression”. This article first performed first-order difference on the original sequence, followed by seasonal difference, and drew a test chart for the differential sequence (Fig. 3). Combining Fig. 3 and the ADF test, it was found that the mean and variance of the sequence after differential processing have stabilized. The p = 7.852714e−11, less than 0.05, indicating that the sequence is stable after one ordinary difference and one seasonal difference.

Figure 3: Sequence test after first-order difference of the model.

Download full-size image

DOI: 10.7717/peerj-cs.2754/fig-3

Model parameter determination

Model parameters can be determined by observing the trailing and truncating phenomenon of Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) graphs. However, this method may not be entirely accurate and is subject to a certain degree of subjectivity. Therefore, it is necessary to combine multiple fitting models and select the appropriate model based on the AIC, BIC, and residual analysis results. Based on the maximum likelihood estimation method, this article used the auto_arima function in the pmdarima library for parameter estimation. Finally, the optimal model was selected according to AIC and BIC. The smaller the AIC and BIC, the better the fitting and prediction performance of the model.

In this case, ARIMA(1,1,1)(2,1,0)12 was selected as the optimal model according to the results of model operation and considering the significance of estimated parameters. By substituting the trained optimal model into Formula (3), the basic expression of the flipped classroom model was obtained:

(8) $(1 - φ_{1} L) (1 - α_{1} L^{12} - α_{2} L^{24}) Δ Δ_{12} y_{t} = (1 + θ_{1} L) ϵ_{t} .$

By inputting the ARIMA(1,1,1)(2,1,0)12 model result parameters into Formula (8), the detailed expression of the model can be obtained as follows:

(9) $(1 - 0.9237 L) (1 + 0.702 L^{12} + 0.5288 L^{24}) Δ Δ_{12} y_{t} = (1 - 0.5844 L) ϵ_{t}$

Model evaluation

Based on the estimated parameters, this article selected December 2019 to December 2023 for model evaluation and drew a model evaluation diagram (Fig. 4). According to Fig. 4, the trend of the fitted value and the actual value was basically consistent. The model has a good prediction effect. Therefore, this model can be used for subsequent prediction.

Figure 4: Comparison chart of the actual and fitted values of the model.

Download full-size image

DOI: 10.7717/peerj-cs.2754/fig-4

Similarly, build the model of Information-based teaching, Micro-lesson, Multimedia teaching and Educational informatization, and predict the future attention trend according to the modeling results.

Prediction of subject attention

Referring to the modeling and prediction process of the keyword “flipped classroom”, this article predicts other keywords to reflect the attention trend of junior high school information technology subjects (Fig. 5).

Figure 5: The prediction results of the model.
(A) Information-based teaching. (B) Micro-lesson. (C) Flipped classroom. (D) Multimedia teaching. (E) Educational informatization.

Download full-size image

DOI: 10.7717/peerj-cs.2754/fig-5

The prediction results of subject attention represented by keywords show that the hotspot of information technology subject will show a flat or declining trend in the next 2 years. Among them, information-based teaching, flipped classroom, and multimedia teaching show a stable trend, which is the same as the level of concern in recent years. However, Micro-lesson and educational informatization are showing a downward trend. Especially in the field of educational informatization, by 2025, its attention index will drop to around 175. The hotspot of information technology subject presents an unstable trend, and there has been a state differentiation. As a new type of teaching resources and methods, Multimedia teaching and flipped classrooms play an important role in integrating information technology and classroom teaching. It is also a hot topic that various disciplines will continue to pay attention to in the future (Qin & Lu, 2018). Since the rise of educational informatization, it has comprehensively promoted the use of modern information technology in the process of education and teaching. However, comparative analysis shows that China should continue to promote the construction of digital education infrastructure and improve the efficiency of digital education resources (Wang & Chen, 2022).

Prediction of the number of subject literature researches

The number of subject literature researches is obtained by substituting the predicted values of disciplinary attention into the regression equation. First, according to the relationship model between subject attention and literature research (Formula 7), the independent variable in 2024 was substituted into Formula (7) to obtain the dependent variable value. By further performing anti-logarithm treatment on the dependent variable, the number of literature researches on junior high school information technology in 2024 could be obtained, which was 145.

Secondly, to reduce prediction error, data from 2024 was added to the sample set and a regression prediction model with LOOCV was constructed. The resulting model is as follows:

(10) $l n y = - 0.753 + 0.361 x_{1} + 0.286 x_{2} - 0.108 x_{3} + 0.701 x_{4} - 0.109 x_{5} .$

Then, the independent variable of subject attention in 2025 was substituted into Formula (10). Further anti-logarithm treatment could obtain the number of literature researches in 2025, which was 123. Figure 6 shows the publication number of literature research in this subject over the next 2 years. It can be seen that the overall number of publications in junior high school information technology literature has shown a fluctuating trend over the years. After 2023, the publication number of academic literature tends to be flat. The number of publications has decreased compared with previous years, with an average of 136 articles.

Figure 6: Prediction results of literature researches.
The horizontal axis represents time, and the vertical axis represents the number of literature researches in this discipline.

Download full-size image

DOI: 10.7717/peerj-cs.2754/fig-6

An overall analysis of RQ2 found that the discipline attention represented by the five keywords presented a steady or declining trend. In recent years, the public’s attention to information technology subject in junior high school is not high, showing a slowing trend. Meanwhile, its academic literature researches would not increase in the next 2 years. Information technology subject is a compulsory course in primary and secondary schools, which is of significance to the reform of classroom teaching methods. Properly increasing literature researches of this discipline can effectively tap the key points within the discipline and promote the in-depth integration of multiple disciplines.

Prediction comparison

Based on the relationship between subject attention and literature research established in RAPA framework, this article predicted the future publication number of information technology subject themes in junior high school. To evaluate the prediction effect and ensure the necessity of establishing the above relational model, RAPA also used GM(1,1) (Althobaiti & Shabri, 2022; Yeh & Chang, 2021) to forecast the number of articles in this discipline. The two prediction methods were compared and analyzed to solve RQ3. In addition, RAPA selected the Standard Deviation (SD), MAE and MAPE of the prediction results as evaluation indicators for prediction accuracy. Table 4 listed the index values of GM(1,1) and attention-based regression prediction methods.

Table 4:

Comparison of model accuracy.

Evaluating indicator	SD	MAE	MAPE
GM(1,1)	35.12	37.78	14.35%
Multiple regression prediction	30.92	34.27	13.21%

DOI: 10.7717/peerj-cs.2754/table-4

According to the overall prediction accuracy in Table 4, the multiple regression prediction error combined with disciplinary attention was relatively small. It indicated that the method of using subject attention to predict literature quantity was of high accuracy. It also showed that network concern can assist discipline literature research and improve the accuracy of prediction. Specifically, the standard deviation of multiple regression prediction results was small, indicating that the forecast value was nearly the actual value, and the prediction result was relatively well. According to the calculated prediction results, there was little difference between the data results of the two methods, which also confirmed the correlation between the subject network attention reflected by the Baidu index and the amount of CNKI academic articles. Meanwhile, this also showed that it was reasonable to use the attention to predict the subject literature researches.

Discussion

The implications of the findings

This study aimed to explore the relationship between subject network attention and literature research from the perspective of information technology in junior high school. The main research results were summarized in three aspects, namely the correlation between disciplinary attention and literature research, the prediction results of disciplinary attention and its literature research, and the comparison of prediction accuracy. The results indicated a certain dynamic correlation between subject attention represented by the network index and the amount of literature research in scientific journals. The public’s behavioral willingness to pay attention to disciplines is recorded through big data, while their contribution to scientific research is reflected in the output of scientific journal literature. In recent years, the amount of information technology subject literature in junior high school has not increased significantly, which may be related to the public’s attention for the future development of this discipline. The results of each problem were discussed in detail in this article.

Keyword extraction and relevance criterion analysis based on the RAPA framework found that the Trending Keywords of information technology discipline in junior high school were Information-based teaching (r = 0.887**), Micro-lesson (r = 0.841**), Flipped classroom (r = 0.798**), Multimedia teaching (r = 0.867**), and Educational informatization (r = 0.753**). It represented the research focus of information technology curriculum in Chinese middle schools, and also reflected the diversification of teaching methods. This was similar to the views of Huang & Wang (2020), who stated that diversified teaching methods of information technology in primary and secondary schools infiltrated and crossed each other, gradually moving towards integration. This was a realistic choice to improve and innovate teaching methods. In addition, the correlation calculation results showed that the r-value between the subject network attention represented by multiple keywords and the number of academic articles was greater than 0.75, showing a positive correlation. This indicated that online media and the scientific journals had a correlation in disseminating scientific knowledge, but the timeliness of their communication content was unknown. While scholars have conducted comparative studies on this timeliness and found that scientific journals can detect hot issues in the professional field earlier than online media, and guide the public to pay attention to these issues (Li, Feng & Tian, 2017). Currently, online media and scientific journals are in parallel, which not only satisfy the public’s pursuit for distinctive content, but also drive the development of the subject.

Time series prediction based on the RAPA framework found that the attention of information technology subject in junior high school will show a flat or declining trend in the next 2 years. Moreover, the academic hot topics represented by keywords present a trend of differentiation. Information-based teaching, Multimedia teaching, and Flipped classroom are strategic ways to integrate classroom technology, improve interactivity, and enhance learning experience (Gao et al., 2020). The trend of its future attention remains the same as in recent years, indicating that the public still holds expectations for the application mode of integrating technology in future middle school information technology classroom. The attention to micro-lessons and educational informatization is showing a downward trend, especially educational informatization, whose attention index will drop to 175. Upon investigation, it was found that the future implementation of educational informatization still faces some challenges, such as issues with equipment, training, and resources. These challenges indirectly affect the attention index. In addition, regression prediction found that the number of literature in this discipline will decrease in the next few years after 2023, with an average of about 136 articles. This is consistent with Zhao et al.’s (2023) prediction trend on the literature researches of high school information technology subjects. She suggested that one of the reasons for the low number of scientific journal’s articles may be the homogenization of many literature contents. Therefore, the hot topic attention and literature researches of this discipline need to be improved.

A comparative analysis of predictions based on the RAPA framework found that the error in predicting literature researches using disciplinary network attention was relatively small. Specifically, compared with other methods such as GM(1,1), MAE in this study was reduced by 3.51 and MAPE by 1.14%. This indicates that disciplinary network attention, as an auxiliary variable in literature research, contributes to the rationality of quantitative analysis in disciplinary literature research. This was similar to the research idea of Duan & Pan (2017) and Pai, Li & Liu (2019) using real-time data of network media as intermediary variables to identify emerging disciplines. In the big data environment, network media data has strong dynamism, and the data characteristics of different disciplines are different. Therefore, the attention reflected by online media data helps researchers in different disciplines to grasp the frontier and hot spots of the discipline, and promote the progress of the discipline and science.

Impact and recommendations of the findings

The result of this study is very enlightening to the information technology subject development of in secondary school. The information technology field has always been in rapid development and transformation, and secondary school information technology subject needs to keep up with the latest technological trends in a timely manner. The results suggested that the public’s attention to the future of this discipline was not high, and the number of its literature researches was at a medium level. Information technology subject in secondary schools is a subject to train students’ practical operation and problem-solving abilities, and also an important way to cultivate their quality. Therefore, it is necessary to strengthen the development path of this discipline, such as focusing on innovation and creativity, integrating interdisciplinary knowledge, emphasizing information literacy, etc., to enhance the public’s attention to this discipline. At the same time, attention should be paid to the phenomenon of de-homogenization in literature research in this discipline, expanding the multi-dimensional research content. This is conducive to the long-term development of information technology subject in secondary schools.

In addition, the results also showed the correlation between the subject network attention and its literature researches. This reflected from the side the dynamic correlation between online platforms and scientific journals. The era of online big data has brought unprecedented challenges to traditional scientific and technological journals, changing their dissemination methods and achieving digital transformation of journals. However, their timeliness is different. The emergence of journal research hotspots often occurs earlier than the peak of online attention (Li, Feng & Tian, 2017), which helps the public better grasp the direction of online public opinion. At the same time, network attention can also help scientific journals to capture social hot issues, promoting the transformation of traditional scientific journals in the information wave. Therefore, in the process of teaching research, it is necessary to comprehensively utilize network big data and traditional data to fully reflect the more comprehensive data value, thereby improving the quality of literature research.

Research limitations and future work

This study also has some limitations. First, it only explored the correlation between discipline concern and the quantity of the academic article in the view of middle school information technology subjects. The extrapolation of research conclusions is limited. The applicability of other disciplines needs further investigation and discussion. Second, the attention data selected in this study comes from the Baidu index. Although its search engine has a great influence, there may still be statistical bias. Attention data reflects the behavior dynamics of a large number of Internet users, which may be reflected on different platforms. This article did not collect and verify data of other platforms.

Given these limitations, more comprehensive and extensive sample-size studies will be necessary in the future to validate and extend the current findings. First, the generalizability of this study’s results to other disciplines is unknown. Therefore, future studies can be conducted from disciplinary perspectives, such as English and mathematics, to explore the relationship between the attention of different disciplines and their literature researches. While expanding research findings, summarize the characteristics of different disciplines. Secondly, future research can further expand the sample types based on the data sources. For example, Altmetrics Attention Score (AAS) is a weighted metric containing online real-time data of 17 different types of network media on Altmetric.com’s comprehensive platform (Altmetric, 2024). It can measure the social and academic influence of literature research in a diversified and timely manner, making the evaluation more comprehensive, specific, and objective. In addition, 360 Trend is also a commonly used search engine (Trend, 2024). Therefore, future research can combine different indices to measure attention to compare the similarities and differences between different Index platforms.

Conclusions

Based on the constructed the RAPA framework, this article analyzed the relationship between discipline attention and its literature research from the perspective of information technology subject, and predicted the future trend of this subject. Based on the RAPA framework, this article extracted five main keywords of junior high school information technology subject: Information-based teaching, Micro-lesson, Flipped classroom, Multimedia teaching, and Educational informatization. Correlation analysis found that there was a positive correlation between this subject network attention and literature researches. Its correlation coefficient r was greater than 0.75. Predictive analysis found that public’s attention to information technology subject in junior high school is not high in the next 2 years, and its network attention shows a flat or declining trend. Meanwhile, the amount of literature has decreased compared with the previous years, with an average of about 136 articles. Prediction comparison found that the method using subject attention to predict literature researches has higher accuracy. Its MAE was 3.51 less than other methods, with a relatively small error.

The findings highlighted the complementary relationship between disciplinary attention and literature researches. Both of these are equally important to academic researchers and help them make judgments and decisions on a hot issue. At the same time, the results also indicated that the discipline needs to strengthen innovation to explore future development paths, and increase public attention and literature researches. In addition, this study also revealed the inestimable value and impact of big data represented by online attention. Therefore, while using traditional data for research, the interactivity and diversity of online media data should be considered.

Supplemental Information

Python code.

DOI: 10.7717/peerj-cs.2754/supp-1

Download

Data.

DOI: 10.7717/peerj-cs.2754/supp-2

Download

[1] Alhawarat MO, Abdeljaber H, Hilal A. 2021. Effect of stemming on text similarity for Arabic language at sentence level. PeerJ Computer Science 7(3):e530

[2] Althobaiti ZF, Shabri A. 2022. Prediction of CO2 emissions in Saudi Arabia using nonlinear grey Bernoulli model NGBM (1,1) compared with GM (1,1) model. Journal of Physics: Conference Series 2259(1):12011

[3] Altmetric. 2024. Altmetric: a digital science solution.

[4] Aykaç Özen H, Öbekcan H. 2023. Short-term estimations of PM10 concentration in the Middle Black Sea region based on grey prediction models. CLEAN-Soil, Air, Water 51(10):2200400

[5] Barboza F, Nunes Silva G, Augusto Fiorucci J. 2023. A review of artificial intelligence quality in forecasting asset prices. Journal of Forecasting 42(7):1708-1728

[6] Cascajares M, Alcayde A, Salmerón-Manzano E, Manzano-Agugliaro F. 2021. The bibliometric literature on Scopus and WoS: the medicine and environmental sciences categories as case of study. International Journal of Environmental Research and Public Health 18(11):5851

[7] Chen H, Fang X, Xiang E, Ji X, An M. 2023. Do online media and investor attention affect corporate environmental information disclosure? Evidence from Chinese listed companies. International Review of Economics & Finance 86(6):1022-1040

[8] Duan L. 2022. Correlation analysis between atmospheric environment and public sentiment based on multiple regression model. Wireless Communications and Mobile Computing 2022:3393079

[9] Duan QF, Pan XQ. 2017. Identification of emerging topics in science using social media. Journal of the China Society for Scientific and Technical Information 36(12):1216-1223

[10] Fleming SW, Garen DC. 2022. Simplified cross-validation in principal component regression (PCR) and PCR-like machine learning for water supply forecasting. JAWRA Journal of the American Water Resources Association 58(4):517-524

[11] Gao S, Li W, Liu J, Cui L. 2020. Development and research of flipped classroom based on modern information technology and MOOC. Frontiers in Educational Research 3(3):26-29

[12] Huang WS. 2019. Comparative analysis of time between research hotspot and network Focusin travel periodicals. Journal of Hechi University 39(03):56-61

[13] Huang B, Wang D. 2020. The research hotspots, themes and trends of the teaching methods of information technology in primary and secondary schools—based on the co-word analysis of 749 master theses. Digital Education 6(2):16-21

[14] Li TT, Feng QL, Tian J. 2017. Comparative analysis of research hotspots in scientific journals and hotspots on the internet. Chinese Journal of Scientific and Technical Periodicals 28(11):992-996

[15] Liu J, Chu N, Wang P, Zhou L, Chen H. 2024. A novel hybrid model for freight volume prediction based on the Baidu search index and emergency. Neural Computing and Applications 36(3):1313-1328

[16] Luo T, Zhou J, Yang J, Xie Y, Wei Y, Mai H, Lu D, Yang Y, Cui P, Ye L, Liang H, Huang J. 2023. Early warning and prediction of scarlet fever in China using the Baidu search index and autoregressive integrated moving average with explanatory variable (ARIMAX) model: time series analysis. Journal of Medical Internet Research 25:e49400

[17] Pai YX, Li CL, Liu YM. 2019. AAS high concern subject research topic identification based on z-index. Information and Documentation Services 40(6):30-37

[18] Qin ZY, Lu WQ. 2018. Reflections on the in-depth integration of information technology and flipped classroom in the era of microlecture. Theory and Practice of Education 38(23):17-19

[19] Rahadian H, Bandong S, Widyotriatmo A, Joelianto E. 2023. Image encoding selection based on Pearson correlation coefficient for time series anomaly detection. Alexandria Engineering Journal 82(2):304-322

[20] Theerthagiri P, Ruby AU. 2023. Seasonal learning based ARIMA algorithm for prediction of Brent oil Price trends. Multimedia Tools and Applications 82(16):24485-24504

[21] Trend. 2024. Trend: a big data statistical analysis platform.

[22] Wang Q, Chen H. 2022. Crossing the digital divide: a comparative analysis of educational informatization policies in the United States, Japan, and the United Kingdom. Journal of Comparative Education 2022(4):42-57

[23] Wang Y, Zhou H, Zheng L, Li M, Hu B. 2023. Using the Baidu index to predict trends in the incidence of tuberculosis in Jiangsu Province, China. Frontiers in Public Health 11:1203628

[24] Xia Y, Hu W. 2021. Impact of COVID-19 attention on pharmaceutical stock prices based on internet search data.

[25] Yadav H, Thakkar A. 2024. NOA-LSTM: an efficient LSTM cell architecture for time series forecasting. Expert Systems with Applications 238(3):122333

[26] Yeh MF, Chang MH. 2021. GM(1,1;λ) with constrained linear least squares. Axioms 10(4):278