Achieving optimal trade-off for student dropout prediction with multi-objective reinforcement learning

PeerJ Computer Science


Introduction

  • We design multi-perspective sources of vector reward to train the agent, preventing the information loss caused by collapsing the multiple rewards into a single scalar reward.

  • We leverage vectorized value functions and perform envelope value updates to train a unified policy network optimized across the full spectrum of preferences within a domain. Consequently, this trained network is capable of generating the optimal policy for any preference specified by the user.

  • We evaluate our proposed model on two real-world MOOC datasets. To the best of our knowledge, this is the first work to apply a MORL method to the accuracy-earliness trade-off in SDP. Results show that our method significantly outperforms state-of-the-art approaches in achieving an optimal trade-off. This innovative AI application not only advances the field of SDP but also signifies a major leap in the application of artificial intelligence to multi-objective optimization.

Methodology

System model

Preliminary

Markov decision process

Scalar reward based RL

Problem reformulation

  1. State space. In the accuracy-earliness trade-off task for SDP, the objective is to predict labels l ∈ L as early as possible, without access to the complete sequence information. Consequently, at each timestep t, the state s_t is represented by the partial time series X_{:t}, i.e., the slice across all variables up to timestep t. This partial sequence is a key aspect of the problem, reflecting the real-world challenge of making early predictions from limited information.

  2. Action space. The action space is A = A_c ∪ {a_d}. If the agent selects a_d, it chooses the ‘WAIT’ action, advancing the system by one timestep; the action selection process then restarts with the new state s_{t+1} = X_{:t+1}. If instead an action from A_c = L is chosen, the agent opts for ‘HALT’, concluding the processing of the current time series and triggering a classification label prediction. The timestep at which ‘HALT’ is selected, or at which t reaches T (the preset maximum length), is identified as the halting point τ. The action space can thus be written as
$$A = \begin{cases} a_d, & \text{WAIT} \\ A_c, & \text{HALT.} \end{cases}$$

  3. Reward vector. Unlike previous scalarization methods that use a single scalarized reward (Martinez et al., 2018; Martinez et al., 2020; Hartvigsen et al., 2019), we introduce a vectorized reward; the difference between the two is illustrated in Fig. 2. The reward vector is written as r(s_t, a_t) = [r_1(s_t, a_t), r_2(s_t, a_t)], so that crucial information about each objective is retained. The first element, r_1(s_t, a_t), pertains to the accuracy of the predicted label; the second element, r_2(s_t, a_t), is associated with the earliness of the prediction. Specifically, r_1 and r_2 are given as
$$r(s_t, a_t) = \begin{cases} [\,r_1 = 0,\; r_2 = -\lambda t^{p}\,], & \text{if } a_t = a_d \\ [\,r_1 = 1,\; r_2 = 1\,], & \text{if } a_t = l \\ [\,r_1 = -1,\; r_2 = 1\,], & \text{if } a_t \neq l. \end{cases}$$
When the agent opts to wait (action a_d), it receives no reward for accuracy (r_1 = 0) but incurs an increasingly negative reward over time for delay (r_2 = −λt^p), with parameters λ and p determining the intensity and rate of this penalty. In contrast, if the agent makes a correct prediction (action l), it is rewarded on both accuracy and timeliness (r_1 = 1, r_2 = 1), whereas an incorrect prediction penalizes accuracy (r_1 = −1) while still rewarding timeliness (r_2 = 1). This reward structure incentivizes the agent to make timely and accurate decisions, addressing the trade-off between accuracy and earliness (a code sketch of this structure follows the list).

  4. Preference space. To manage the complexity of the reward vector, we introduce the preference space Ω, typically represented as a vector ω = (ω_0, ω_1) ∈ Ω. It can be regarded as a set of rays extending radially through the first orthant (Fig. 1), used to weigh the relative importance of the different objectives encapsulated in the reward vector.
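
To make this reformulation concrete, the following minimal sketch in Python illustrates how the partial-sequence state, the WAIT/HALT actions, the two-element reward vector, and a sampled preference fit together. All names (VectorRewardEnv, lambda_, p) are hypothetical; this is an illustrative toy built from the definitions above, not the authors' implementation.

import numpy as np

WAIT = -1  # a_d: advance one timestep; any other action is a class label in A_c (HALT)

class VectorRewardEnv:
    """Toy environment mirroring the MOMDP reformulation above (hypothetical names)."""

    def __init__(self, X, label, lambda_=0.01, p=2.0):
        self.X, self.label = X, label      # X: (T, num_features) time series, label in L
        self.T = X.shape[0]
        self.lambda_, self.p = lambda_, p  # delay-penalty intensity and rate
        self.t = 1

    def state(self):
        return self.X[: self.t]            # partial series X_{:t}

    def step(self, action):
        """Return (next_state, reward_vector, done); reward = [r_1 accuracy, r_2 earliness]."""
        if action == WAIT:
            self.t = min(self.t + 1, self.T)
            done = self.t == self.T        # forced halt once the series is exhausted
            return self.state(), np.array([0.0, -self.lambda_ * self.t ** self.p]), done
        r1 = 1.0 if action == self.label else -1.0   # correct vs. incorrect prediction
        return self.state(), np.array([r1, 1.0]), True

# A preference omega = (omega_0, omega_1) in Omega weighs the two objectives; sampling
# from the 2-D simplex covers the whole preference space.
omega = np.random.dirichlet(np.ones(2))
env = VectorRewardEnv(X=np.random.randn(30, 7), label=1)
state, reward, done = env.step(WAIT)       # reward is a 2-D vector, not a scalar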

Proposed framework

Learning the global optimal strategy

Parameter training procedure

 
_______________________________________________________________________________
Algorithm 1 Training Algorithm
_______________________________________________________________________________
 1:  Initialize replay buffer D_τ.
 2:  Initialize action-value function Q(s, ω, a; θ).
 3:  Initialize target action-value function Q′(s, ω, a; θ′) by copying: θ′ ← θ.
 4:  for episode = 1, ..., M do
 5:      Sample a training pair (X_i, y_i) from the dataset {(X_i, y_i)}_{i=1...n}.
 6:      while not terminal and t ≤ T do
 7:          Obtain partially observable state s_t = X_{i,:t}.
 8:          Sample a preference ω ∼ D_ω and concatenate it with state s_t.
 9:          Agent receives the input [s_t, ω] and picks an action a_t based on Eq. (??).
10:          Environment steps forward according to a_t and returns the multi-objective reward vector r_t, the next state s_{t+1}, and the terminal flag.
11:          Store transition (s_t, a_t, r_t, s_{t+1}, terminal) in D_τ.
12:          if update then
13:              Sample N_τ transitions (s_j, a_j, r_j, s_{j+1}) ∼ D_τ according to Eq. (??).
14:              Sample N_ω preferences W = {ω_i ∼ D_ω}.
15:              Compute y according to Eq. (??).
16:              Compute the loss function based on Eq. (??) and Eq. (??).
17:              Update the Q-network by minimizing the loss function according to Eq. (??).
18:          end if
19:          if a_t = WAIT then
20:              Increment time t = t + 1.
21:          else
22:              Predict and set terminal = True.
23:          end if
24:      end while
25:  end for
_______________________________________________________________________________
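
As a companion to Algorithm 1, the sketch below shows, in simplified form, the two steps the loop depends on: picking an action by scalarizing the vectorized Q-values with the sampled preference ω (line 9), and forming an envelope-style target that maximizes ω^T Q over actions and sampled preferences (lines 13-17). It uses PyTorch with hypothetical module and function names (MOQNet, select_action, envelope_target); it is one reading of the envelope update under these assumptions, not the authors' exact code.

import torch

class MOQNet(torch.nn.Module):
    """Minimal Q(s, omega, a) head returning a (num_actions, num_objectives) tensor."""
    def __init__(self, state_dim, num_actions=2, num_objectives=2):
        super().__init__()
        self.num_actions = num_actions
        self.num_objectives = num_objectives
        self.fc = torch.nn.Linear(state_dim + num_objectives, num_actions * num_objectives)

    def forward(self, state, omega):
        x = torch.cat([state, omega])                 # concatenate [s_t, omega] (lines 8-9)
        return self.fc(x).view(self.num_actions, self.num_objectives)

def select_action(q_net, state, omega, epsilon=0.1):
    """Epsilon-greedy action w.r.t. the preference-scalarized vector Q-values (line 9)."""
    if torch.rand(1).item() < epsilon:
        return torch.randint(q_net.num_actions, (1,)).item()
    q = q_net(state, omega)                           # (num_actions, num_objectives)
    return (q @ omega).argmax().item()                # argmax_a of omega^T Q(s, omega, a)

def envelope_target(target_net, r, s_next, omega, sampled_omegas, gamma=0.99, done=False):
    """Envelope-style target y (lines 13-15): keep the Q-vector whose scalarized value
    omega^T Q is maximal over actions and the sampled preference set W."""
    if done:
        return r
    best_q, best_score = None, None
    for w in sampled_omegas:                          # W = {omega_i ~ D_omega} (line 14)
        q = target_net(s_next, w)                     # (num_actions, num_objectives)
        scores = q @ omega                            # evaluate under the current preference
        idx = scores.argmax()
        if best_score is None or scores[idx] > best_score:
            best_score, best_q = scores[idx], q[idx]
    return r + gamma * best_q

# Shape check: a 4-dimensional toy state and a fixed preference.
q_net = MOQNet(state_dim=4)
a = select_action(q_net, torch.zeros(4), torch.tensor([0.7, 0.3]))

In training, the loss then compares q_net(s, ω) at the taken action with envelope_target(...), which is minimized as in lines 16-17 of Algorithm 1.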

Experiment

Dataset description

Implementation details

Baselines

  1. Non-RL methods

    • LSTM-Fix (Ma, Sigal & Sclaroff, 2016): LSTM-Fix trains a classifier on the entire time series but makes predictions using only the initial part of the series available up to a preset halting point. Its reliance on a predefined halting point for classification provides a contrast to MORL’s dynamic early prediction capability.

    • NonMyopic (Dachraoui, Bondu & Cornuéjols, 2015): NonMyopic represents an approach that calculates the optimal prediction timing for SDP early warning. Contrasting it with MORL’s learning from vectorized rewards to dynamically balance accuracy and earliness highlights how quickly MORL reaches its predictions.

  2. RL methods

    • ECTS (Martinez et al., 2018): ECTS leverages a reinforcement learning framework to facilitate early classification of time series. It is conceptualized as an MDP, characterized by a scalar reward function and optimized by a DQN. The approach uses a user-preset parameter λ to balance timely and accurate classification; by adjusting λ, users can finely tune the trade-off between the two objectives. Its comparison with MORL underlines MORL’s superior handling of multi-objective optimization through a MOMDP formulation, a vector reward function, and MODDQN.

    • DDQN + PER (Martinez et al., 2020): It implements a DDQN with PER to demonstrate the effectiveness of advanced reinforcement learning techniques in addressing the unbalanced memory issue; these techniques are also adopted in our MORL model. The main difference between DDQN+PER and MORL lies in their reward design and optimization mechanisms: DDQN+PER relies on a scalar reward design and a preset hyper-parameter to combine the objectives, whereas MORL employs a vector reward and optimizes a single policy network across a spectrum of preferences without presetting them.

    • EARLIEST (Hartvigsen et al., 2019): EARLIEST is a deep learning based method composed of an RNN-based Discriminator and an RL-based Controller. A novel aspect of EARLIEST is the integration of minimizing Discriminator errors and maximizing Controller rewards into a unified loss function. The model also incorporates an additional loss term, regulated by the hyper-parameter λ, specifically designed to promote early halting. This approach merges the strategic decision-making of reinforcement learning with deep learning’s predictive strengths. Its comparison with MORL highlights the latter’s efficiency in navigating multi-objective trade-offs with a single policy network, in contrast to EARLIEST’s more complex handling of similar tasks.

Evaluation metrics

  1. Average accuracy. Avg. Acc measures the model’s average prediction accuracy on a testing set $D = \{(X_j, l_j)\}_{j=1..n}$:
$$\text{Avg. Acc} = \frac{1}{n}\sum_{j=1}^{n} \mathbb{1}\left(f_{\text{classifier}}(X_j) = l_j\right).$$

  2. Average proportion used. A halting point $t^{\text{pred}}$ is the earliest timestep at which the agent decides to halt and predict a class label:
$$t_j^{\text{pred}} = \min_{t \in [1,T]} \left\{ t : \arg\max_{a \in A} Q(X_{j,:t}, a) \in A_c \right\}.$$
Accordingly, the Average Proportion Used is computed as the mean of the halting points over all sequences in the testing set:
$$\text{APU} = \frac{1}{n}\sum_{j=1}^{n} t_j^{\text{pred}}.$$

  3. Average harmonic mean. Avg. HM expresses the model’s ability to provide accurate predictions as early as possible. It is computed as:
$$\text{Avg. HM} = \frac{2\,(1-\text{APU})\,(\text{Avg. Acc})}{(1-\text{APU}) + (\text{Avg. Acc})}.$$
A short sketch of these three metrics follows this list.
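
The sketch below computes the three metrics with plain NumPy. It assumes the halting points are divided by the maximum length T so that the proportion used lies in [0, 1]; the function name, array names, and example numbers are hypothetical.

import numpy as np

def evaluate(preds, labels, halt_steps, T):
    """Average accuracy, average proportion used, and their harmonic mean.

    preds, labels : predicted / true class labels, one entry per test sequence
    halt_steps    : halting points t_j^pred in timesteps (1..T)
    T             : maximum sequence length, used to turn halting points into proportions
    """
    avg_acc = np.mean(preds == labels)
    apu = np.mean(halt_steps / T)            # average proportion of each series consumed
    avg_hm = 2 * (1 - apu) * avg_acc / ((1 - apu) + avg_acc)
    return avg_acc, apu, avg_hm

# Example with four test sequences of length T = 30 (illustrative numbers only).
acc, apu, hm = evaluate(np.array([1, 0, 1, 1]), np.array([1, 0, 0, 1]),
                        np.array([6, 9, 12, 15]), T=30)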

Results

Experimental comparison between RL methods

Experimental comparison between Non-RL methods and RL methods

Policy adaptation

Conclusions

Supplemental Information

The open datasets KDDCup2015 and XuetangX

The KDDCup2015 dataset encompasses data from 39 courses and 72,395 students, with a 30-day historical window and 7 distinct types of student learning activities. The XuetangX dataset contains 19 courses, 23,839 students, and 22 event types, with a 35-day history period.
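
For orientation, the sketch below shows one common way such activity logs are arranged for a sequential model: each student becomes a (history_days × event_types) matrix of daily activity counts. The shapes are inferred from the statistics above, and the function and variable names are hypothetical; this is not the released preprocessing code.

import numpy as np

# Shapes inferred from the dataset statistics above (hypothetical preprocessing).
KDD_SHAPE = (30, 7)        # KDDCup2015: 30-day history window, 7 activity types
XUETANGX_SHAPE = (35, 22)  # XuetangX: 35-day history window, 22 event types

def to_matrix(log, shape):
    """Aggregate one student's raw log [(day, event_type), ...] into daily activity counts."""
    mat = np.zeros(shape)
    for day, event in log:
        mat[day, event] += 1
    return mat

X = to_matrix([(0, 2), (0, 2), (5, 1)], KDD_SHAPE)   # one student's 30 x 7 input matrix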

DOI: 10.7717/peerj-cs.2034/supp-1

Additional Information and Declarations

Competing Interests

The authors declare there are no competing interests.

Author Contributions

Feng Pan conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Hanfei Zhang conceived and designed the experiments, performed the experiments, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Xuebao Li analyzed the data, authored or reviewed drafts of the article, and approved the final draft.

Moyu Zhang analyzed the data, authored or reviewed drafts of the article, and approved the final draft.

Yang Ji conceived and designed the experiments, authored or reviewed drafts of the article, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

The MORL-MOOC dataset is available at GitHub: https://github.com/leondepf/MORL-MOOC/tree/master.

Funding

The authors received no funding for this work.
