A Bayesian brain model of adaptive behavior: an application to the Wisconsin Card Sorting Task

Adaptive behavior emerges through a dynamic interaction between cognitive agents and changing environmental demands. The investigation of information processing underlying adaptive behavior relies on controlled experimental settings in which individuals are asked to accomplish demanding tasks whereby a hidden regularity or an abstract rule has to be learned dynamically. Although performance in such tasks is considered as a proxy for measuring high-level cognitive processes, the standard approach consists in summarizing observed response patterns by simple heuristic scoring measures. With this work, we propose and validate a new computational Bayesian model accounting for individual performance in the Wisconsin Card Sorting Test (WCST), a renowned clinical tool to measure set-shifting and deficient inhibitory processes on the basis of environmental feedback. We formalize the interaction between the task’s structure, the received feedback, and the agent’s behavior by building a model of the information processing mechanisms used to infer the hidden rules of the task environment. Furthermore, we embed the new model within the mathematical framework of the Bayesian Brain Theory (BBT), according to which beliefs about hidden environmental states are dynamically updated following the logic of Bayesian inference. Our computational model maps distinct cognitive processes into separable, neurobiologically plausible, information-theoretic constructs underlying observed response patterns. We assess model identification and expressiveness in accounting for meaningful human performance through extensive simulation studies. We then validate the model on real behavioral data in order to highlight the utility of the proposed model in recovering cognitive dynamics at an individual level. 
We highlight the potential of our model for decomposing adaptive behavior in the WCST into several information-theoretic metrics that reveal the trial-by-trial unfolding of information processing, focusing on two exemplary individuals whose behavior is examined in depth. Finally, we focus on the theoretical implications of our computational model by discussing the mapping between BBT constructs and functional neuroanatomical correlates of task performance. We further discuss the empirical benefit of recovering the assumed dynamics of information processing for both clinical and research practices, such as neurological assessment and model-based neuroscience.


Introduction
Computational models of cognition provide a way to formally describe and empirically account for mechanistic, process-based theories of adaptive cognitive functioning [56,15,36]. A foundational theoretical framework for describing functional characteristics of neurocognitive systems has recently emerged under the umbrella of Bayesian brain theories [32,25]. Bayesian brain theories owe their name to their core assumption that neural computations resemble inference processes following the logic of Bayesian probability theory.
From a Bayesian perspective, cognitive agents exist in an uncertain environment and adaptive behavior emerges through a dynamic interaction between cognitive agents and environmental demands. In order to behave adaptively, cognitive agents must be sensitive to changes in their environment. More formally, they must generate and maintain internal probabilistic models of environmental states as external sensory information is gathered [24]. From these internal models, they derive beliefs about the causal structure of the environment and make predictions about future environmental states. Moreover, internal models form a basis for choosing future actions which can change the state of the environment and are, in turn, modified and refined by changes in the environment. As a result, internal beliefs and predictions are also updated to match the new model, according to principles of Bayesian inference [27,25,13].
arXiv:2003.07394v1 [q-bio.NC] 16 Mar 2020
The empirical assessment of adaptive functioning often relies on dynamic reinforcement learning (RL) tasks which require participants to adapt their behavior during the unfolding of the task. A typical RL task unfolds through multiple trials as participants observe certain environmental contingencies, take actions, and receive feedback based on their actions. Optimal performance in an RL experimental paradigm requires that agents infer the probabilistic model underlying the hidden environmental states. Since these models usually change as the task progresses, agents, in turn, need to adapt their inferred model in order to take optimal actions.
In the present work we propose and validate a computational Bayesian model which accounts for the adaptive behavior of cognitive agents in reinforcement learning tasks. More precisely, we focus on the widely adopted Wisconsin Card Sorting Test (WCST; [8,29]) as a particular instance of such a task. The WCST is perhaps the most popular neuropsychological setting employed to measure set-shifting, cognitive flexibility and impulsive response modulation [10,1], and we consider it a fundamental paradigm for investigating adaptive behavior from a Bayesian perspective.
The environment of the WCST consists of a target and a set of stimulus cards with geometric figures which vary according to three perceptual features. The WCST requires participants to infer the correct classification principle by trial and error using the examiner's feedback. The feedback is thought to carry positive or negative information signaling to the agent whether the immediate action was appropriate or not. Modeling adaptive behavior in the WCST from a Bayesian perspective is straightforward, since observable actions emerge from the interaction between the internal probabilistic model of the agent and a set of discrete environmental states.
Performance in the WCST and similar RL tasks [6,22] is usually measured via a rough summary metric such as the number of correct/incorrect responses or pre-defined psychological scoring criteria (see for instance [29]). These metrics are then used to infer the underlying cognitive processes involved in the task. A major shortcoming of this approach is that it simply assumes the cognitive processes to be inferred without specifying an explicit process model. Moreover, summary measures do not utilize the full information present in the data, such as trial-by-trial fluctuations or various interesting agent-environment interactions. For this reason, crude scoring measures are often insufficient to disentangle the dynamics of the relevant cognitive (sub)processes involved in an RL task. Consequently, an entanglement between processes at the metric level can prevent us from answering interesting research questions about aspects of adaptive behavior.
In our view, a sound computational account of adaptive behavior in RL tasks needs to provide at least a quantitative measure of effective belief updating about the environmental states at each trial. This measure should be complemented by a measure of how feedback-related information influences behavior. The first measure should account for the integration of meaningful information. In other words, it should describe how prior beliefs about the current environmental state change after an observation has been made. The second measure should account for signaling the (im)probability of observing a certain environmental configuration (e.g., an (un)expected feedback given a response) [50].
Indeed, recent studies suggest that the meaningful information content and the pure unexpectedness of an observation are processed differently at the neural level. Moreover, such disentanglement appears to be of crucial importance to the understanding of how new information influences adaptive behavior [42,50,45]. Inspired by these results and previous computational proposals [33], we integrate these different information processing aspects into the current model from an information-theoretic perspective.
Our computational cognitive model draws heavily on the mathematical frameworks of Bayesian probability theory and information theory [49]. First, it provides a parsimonious description of observed data in the WCST via two neurocognitively meaningful parameters, which we dub flexibility and information loss (to be explained in the Model section). Moreover, it captures the main response patterns obtainable in the WCST via different parameter configurations. Second, we formulate a functional connection between cognitive parameters and underlying information processing mechanisms related to belief updating and prediction formation. We formalize and distinguish between Bayesian surprise and Shannon surprise as the main mechanisms for adaptive belief updating. Moreover, we introduce a third quantity, which we dub predictive entropy and which quantifies an agent's subjective uncertainty about the current internal model. Finally, we propose to measure these quantities on a trial-by-trial basis and use them as a proxy for formally representing the dynamic interplay between agents and environments.
The rest of the paper is organized as follows. First, the WCST is described in more detail and a mathematical representation of the new Bayesian computational model is provided. Afterwards, we explore its characteristics through simulations. We also present an application in which we use a novel Bayesian deep neural network method [46] for model evaluation and parameter estimation. We apply the model to real behavioral data from an already published dataset. Finally, we discuss the results as well as the main strengths and limitations of the proposed model.

The Wisconsin Card Sorting Test
In a typical WCST, participants learn to pay attention and respond to relevant stimulus features, while ignoring irrelevant ones, as a function of experimental feedback. Individuals are asked to match a target card with one of four stimulus cards according to the sorting rule that is correct on any given trial (see Figure 1). Each card depicts geometric figures that vary in terms of three features, namely, color (red, green, blue, yellow), shape (triangle, star, cross, circle) and number of objects (1, 2, 3 and 4).
Figure 1: Suppose that the current sorting rule is the feature shape. The target card in the first trial (left box) contains two blue triangles. A correct response requires that the agent matches the target card with the stimulus card containing the single triangle (arrow represents the correct choice), regardless of the features color and number. The same applies to the second trial (right box), in which matching the target card with the stimulus card containing three yellow crosses is the correct response.
Each response in the WCST is followed by feedback informing the participant whether the response is correct or incorrect. After some fixed number of consecutive correct responses, the sorting rule is changed by the experimenter without warning, and participants are required to infer the new sorting rule. Clearly, the most adaptive response would be to explore the remaining possible rules. However, participants sometimes persist in responding according to the old rule and produce what is called a perseverative response.

The Model
The core idea behind our computational framework is to encode the concept of belief into a generative probabilistic model of the environment. Belief updating then corresponds to recursive Bayesian updating of the internal model based on current and past interactions between the agent and its environment. Optimal or sub-optimal actions are selected according to a well-specified or a misspecified internal model and, in turn, cause perceptible changes in the environment.
We assume that the cognitive agent aims to infer the true hidden state of the environment by processing and integrating sensory information from the environment. Within the context of the WCST, the hidden environmental states might change at a non-constant rate, so the agent needs to rely on environmental feedback and its own actions to infer the current state. We assume that the agent maintains an internal probability distribution over the states at each individual trial of the WCST. The agent then updates this distribution upon making new observations. In particular, the hidden environmental states to be inferred are the three features, $s_t \in \{1, 2, 3\}$. The posterior probability of the states depends on an observation vector $x_t = (a_t, f_t)$, which consists of the agent's response (action) $a_t \in \{1, 2, 3, 4\}$ and the received feedback $f_t \in \{0, 1\}$ in a given trial $t = 0, \dots, T$. The discrete response $a_t$ indicates which stimulus card is matched with the target card at trial $t$. We denote a sequence of observations as $x_{0:t} = (x_0, x_1, \dots, x_t) = ((a_0, f_0), (a_1, f_1), (a_2, f_2), \dots, (a_t, f_t))$ and set $x_0 = \emptyset$ in order to indicate that there are no observations at the onset of the task. Thus, trial-by-trial belief updating is recursively computed according to Bayes' rule:
$$p(s_t \mid x_{0:t}) = \frac{p(x_t \mid s_t, x_{0:t-1})\, p(s_t \mid x_{0:t-1})}{p(x_t \mid x_{0:t-1})}$$
Accordingly, the agent's posterior belief about the task-relevant features $s_t$ after observing a sequence of response-feedback pairs $x_{0:t}$ is proportional to the product of the likelihood of observing a particular response-feedback pair and the agent's prior belief about the task-relevant feature in the current trial. The likelihood of an observation is computed from $p(a_t \mid s_t = i)$, which indicates the probability of a match between the target and the stimulus card given that the current feature is $i$. Here, we assume the likelihood of a current observation to be independent of previous observations without loss of generality, that is:
$$p(x_t \mid s_t, x_{0:t-1}) = p(x_t \mid s_t)$$
The prior belief for a given trial $t$ is computed based on the posterior belief generated in the previous trial, $p(s_{t-1} \mid x_{0:t-1})$, and the agent's belief about the probability of transitions between the hidden states, $p(s_t \mid s_{t-1})$. The prior belief can also be considered as a predictive probability over the hidden states. The predictive distribution for an upcoming trial $t$ is computed according to the Chapman-Kolmogorov equation:
$$p(s_t \mid x_{0:t-1}) = \sum_{s_{t-1}} p(s_t \mid s_{t-1}, \Gamma(t))\, p(s_{t-1} \mid x_{0:t-1})$$
where $\Gamma(t)$ represents a stability matrix describing transitions between the states (to be explained shortly). Thus, the agent combines information from the updated belief (posterior distribution) and the belief about the transition properties of the environmental states to predict the most probable future state. The predictive distribution represents the internal model of the cognitive agent according to which actions are generated.
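As a minimal sketch of the recursion described above, the following Python snippet implements one Bayes update followed by one Chapman-Kolmogorov prediction step for three hidden states. The function names, the likelihood values, and the transition matrix are hypothetical and chosen only for illustration; in the model itself, the likelihood depends on the observed response-feedback pair and the transition matrix is the trial-dependent stability matrix.

```python
def bayes_update(prior, likelihood):
    """Posterior over hidden states: proportional to likelihood * prior."""
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(unnormalized)  # marginal probability of the observation
    return [u / evidence for u in unnormalized]


def predict(posterior, gamma):
    """Chapman-Kolmogorov step: prior[j] = sum_i posterior[i] * gamma[i][j]."""
    n = len(posterior)
    return [sum(posterior[i] * gamma[i][j] for i in range(n)) for j in range(n)]


# Illustrative numbers only (not taken from the paper).
prior = [1 / 3, 1 / 3, 1 / 3]        # uniform belief over the three features
likelihood = [0.9, 0.05, 0.05]       # assumed p(x_t | s_t = i)
gamma = [[0.8, 0.1, 0.1],            # assumed row-stochastic transition beliefs
         [0.1, 0.8, 0.1],
         [0.1, 0.1, 0.8]]

posterior = bayes_update(prior, likelihood)
next_prior = predict(posterior, gamma)
```

With a uniform prior the posterior simply mirrors the normalized likelihood, and the prediction step then pulls the belief back toward the other states in proportion to the assumed transition probabilities.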
The stability matrix $\Gamma(t)$ encodes the agent's belief about the probability of states being stable or likely to change in the next trial. In other words, the stability matrix reflects the cognitive agent's internal representation of the dynamic probabilistic model of the task environment. It is computed on each trial based on the response-feedback pair, $x_t$, and a matching signal, $m_t$, which are observed.
The matching signal $m_t$ is a vector informing the cognitive agent which features are currently relevant (meaningful), such that $m_t^{(i)} = 1$ when a positive feedback is associated with a response implying feature $s_t = i$, and $m_t^{(i)} = 0$ otherwise. Note that the matching signal is not a free parameter of the model, but is completely determined by the task contingencies. The matching signal vector allows the agent to compute the state activation level $\omega_t^{(i)} \in [0, 1]$ for the hidden state $s_t = i$, which provides an internal measure of the (accumulated) evidence for each hidden state at trial $t$. Thus, the activation levels of the hidden states are represented by a vector $\omega_t$. The stability matrix is a square, asymmetric matrix related to the hidden state activation levels: the entries $\Gamma_{ii}(t)$ on the main diagonal are the elements of the activation vector $\omega_t$, and the off-diagonal elements are computed so as to ensure that each row sums to 1. The state activation vector is recomputed in each trial. The update reflects the idea that state activations are simultaneously affected by the observed feedback, $f_t$, and the matching signal vector, $m_t$. However, the matching signal vector conveys different information based on the current feedback. Matching a target card with a stimulus card makes a feature (or a subset of features) informative for a specific state. The vector $m_t$ contributes to increasing (resp. decreasing) the activation level of a state if the feature is informative for that state when a positive (resp. negative) feedback is received.
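The construction of the stability matrix from the activation vector can be sketched as follows. The text only states that the diagonal holds the activation levels and that rows sum to 1; the equal split of the remaining probability mass among the other states, as well as the function name, are assumptions made for this illustration.

```python
def stability_matrix(omega):
    """Row-stochastic matrix with the activation levels on the main diagonal.

    Assumption: the remaining mass 1 - omega[i] of row i is split equally
    among the other states. The source only requires rows to sum to 1.
    """
    n = len(omega)
    return [
        [omega[i] if i == j else (1.0 - omega[i]) / (n - 1) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical activation levels for the three features.
gamma = stability_matrix([0.9, 0.5, 0.2])
for row in gamma:
    print(row)
```

Because each row is scaled by its own activation level, the resulting matrix is generally asymmetric: a strongly activated state is believed to be stable, while a weakly activated state is believed likely to give way to the others.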
The parameter λ ∈ [0, 1] modulates how efficiently attention is disengaged from a given state-activation configuration when a negative feedback is processed. We therefore term this parameter flexibility. We also assume that information from the matching signal vector can degrade by slowing down the rate of evidence accumulation for the hidden states. This means that the matching signal vector can be re-scaled based on the current state activation level. The parameter δ ∈ [0, 1] is introduced to achieve this re-scaling. When δ = 0, there is no re-scaling and updating of the state activation levels relies on the entire information conveyed by $m_t$. At the other extreme, when δ = 1, several trials have to be completed before converging to a given configuration of the state activation levels. Equivalently, higher values of δ affect the entropy of the distribution over hidden states by decreasing the probability of sampling the correct feature. We therefore refer to δ as information loss.
The free parameters λ and δ are central to our computational model, since they regulate the rate at which the internal model converges to the true task environmental model. Eq. (5) can also be expressed in compact notation. Note that the information loss parameter δ affects the amount of information that a cognitive agent acquires from environmental contingencies, irrespective of the type of feedback received. Global information loss thus affects the rate at which the divergence between the agent's internal model and the true model is minimized. Figure 2 illustrates these ideas.
The probabilistic representation of adaptive behavior provided by our Bayesian agent model allows us to quantify (latent) cognitive dynamics by means of meaningful information-theoretic measures. Information theory has, indeed, proven to be an effective and natural mathematical language to account for functional integration of structured cognitive processes and to relate them to brain activity [33,26,14,55,23]. In particular, we are interested in three key measures, namely, Bayesian surprise, $B_t$, Shannon surprise, $I_t$, and entropy, $H_t$. The subscript $t$ indicates that we can compute each quantity on a trial-by-trial basis. Each quantity has a specific interpretation in terms of separate neurocognitive processes. Bayesian surprise $B_t$ quantifies the magnitude of the update from prior belief to posterior belief. Shannon surprise $I_t$ quantifies the improbability of an observation given an agent's prior expectation. Finally, entropy $H_t$ measures the degree of epistemic uncertainty regarding the true environmental states. Such measures are thought to account for the ability of the agent to manage the uncertainty that emerges as a function of competing behavioral affordances [30]. We expect an efficient adaptive functioning system to attenuate uncertainty over environmental states (current features) by reducing the entropy of its internal probabilistic model.
Bayesian surprise can be computed as the Kullback-Leibler (KL) divergence between prior and posterior beliefs about the environmental states. In our model representation, actions are sampled from predictive distributions which integrate information from the posterior belief about the hidden states and the belief about their dynamics. The Bayesian surprise is then thought to account for the divergence between the predictive model for the current trial and the updated predictive model for the upcoming trial. It is computed as follows:
$$B_t = \mathrm{KL}\left[\, p(s_{t+1} \mid x_{0:t}) \,\|\, p(s_t \mid x_{0:t-1}) \,\right] = \sum_{i} p(s_{t+1} = i \mid x_{0:t}) \log \frac{p(s_{t+1} = i \mid x_{0:t})}{p(s_t = i \mid x_{0:t-1})}$$
The Shannon surprise of a current observation given the previous ones is computed as follows:
$$I_t = -\log p(x_t \mid x_{0:t-1})$$
Figure 2: [...] For each scenario, trial-by-trial information-theoretic measures are shown.
Finally, the entropy is computed over the predictive distribution in order to account for the uncertainty in the internal model of the agent in trial $t$ as follows:
$$H_t = -\sum_{i} p(s_{t+1} = i \mid x_{0:t}) \log p(s_{t+1} = i \mid x_{0:t})$$
Once the flexibility (λ) and information loss (δ) parameters are recovered from data, the information-theoretic quantities can be easily computed and visualized for each trial of the WCST (see Figure 2). This allows us to rephrase standard neurocognitive constructs in terms of measurable information-theoretic quantities. Moreover, the dynamics of these quantities, as well as their interactions, can be used for formulating and testing hypotheses about the neurocognitive underpinnings of adaptive behavior in a principled way, as discussed later in the paper.
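The three measures can be sketched for discrete belief vectors as below. This is an illustrative implementation using the standard definitions of KL divergence, surprisal, and Shannon entropy with natural logarithms; the function names and the example numbers are assumptions, and the base of the logarithm is not stated in the text.

```python
import math


def bayesian_surprise(updated, previous):
    """KL divergence KL[updated || previous]; requires previous[i] > 0
    wherever updated[i] > 0."""
    return sum(u * math.log(u / p) for u, p in zip(updated, previous) if u > 0)


def shannon_surprise(prior, likelihood):
    """Negative log marginal probability of the current observation."""
    return -math.log(sum(l * p for l, p in zip(likelihood, prior)))


def entropy(belief):
    """Shannon entropy of a discrete belief distribution (in nats)."""
    return -sum(b * math.log(b) for b in belief if b > 0)


# Hypothetical beliefs before and after one informative observation.
previous = [1 / 3, 1 / 3, 1 / 3]
updated = [0.9, 0.05, 0.05]

B = bayesian_surprise(updated, previous)
I = shannon_surprise(previous, [0.9, 0.05, 0.05])
H = entropy(updated)
```

In this toy case the observation is not improbable under a uniform prior, yet it shifts the belief substantially, so the Bayesian surprise is positive while the entropy of the updated belief drops below that of the uniform prior.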

Simulations
In this section we evaluate the expressiveness of the model by assessing its ability to reproduce meaningful behavioral patterns as a function of its two free parameters. We study how the generative model behaves when performing the WCST in a two-factorial simulated Monte Carlo design where flexibility (λ) and information loss (δ) are systematically varied.
In this simulation, the Heaton version of the task [29] is administered to the Bayesian cognitive agent. In this particular version, the sorting rule (true environmental state) changes after a fixed number of consecutive correct responses. In particular, when the agent correctly matches the target card in 10 consecutive trials, the sorting rule is automatically changed. The task ends after completing a maximum of 128 trials.
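The task contingencies just described (rule change after 10 consecutive correct responses, at most 128 trials) can be sketched as a small simulation loop. This is a simplified stand-in rather than the Heaton protocol itself: the card encoding, the layout of the stimulus cards, the cyclic order of rules, and the `run_task` and `choose` names are all assumptions introduced for illustration.

```python
import random

N_TRIALS = 128    # maximum number of trials (from the text)
RUN_LENGTH = 10   # consecutive correct responses that trigger a rule change


def run_task(choose, seed=0):
    """Run one session; `choose(target)` returns a stimulus-card index 0..3."""
    rng = random.Random(seed)
    rule, streak, log = 0, 0, []
    for _ in range(N_TRIALS):
        # Target card: one value in 0..3 for each of the three features.
        target = tuple(rng.randrange(4) for _ in range(3))
        action = choose(target)
        # Assumption: stimulus card k carries value k on every feature, so
        # card k is correct when the target has value k on the rule feature.
        feedback = int(target[rule] == action)
        log.append((target, action, feedback, rule))
        streak = streak + 1 if feedback else 0
        if streak == RUN_LENGTH:
            rule, streak = (rule + 1) % 3, 0  # assumed cyclic rule order
    return log


# A rigid agent that always sorts by the first feature.
log = run_task(lambda target: target[0])
n_correct = sum(feedback for _, _, feedback, _ in log)
```

The rigid agent completes the first category in exactly 10 trials and afterwards is correct only when the target happens to share its value across features, which mimics an extreme perseverative pattern.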

Generative Model
The cognitive agent's responses are generated at each time step (trial) by processing the experimental feedback. Its performance depends on the parameters governing the computation of the relevant quantities. The generative algorithm is outlined in Algorithm 1.

Simulation 1: Clinical Assessment of the Bayesian Agent
Ideally, the qualitative performance of the Bayesian cognitive agent will resemble human performance. To this end, we adopt a metric which is usually employed in the clinical assessment of test results in neurological and psychiatric patients [11,61,7,35]. Thus, agent performance is codified according to a neuropsychological criterion [29,20] which allows responses to be classified into several response types. These response types provide the scoring measures for the test.
Here, we are interested in: 1) non-perseverative errors (E); 2) perseverative errors (PE); 3) number of trials to complete the first category (TFC); and 4) number of failures to maintain set (FMS). Perseverative errors occur when the agent applies a sorting rule which was valid before the rule was changed. Usually, detecting a perseverative error is far from trivial, since several response configurations could be observed when individuals are required to shift a sorting rule after completing a category (see [20] for details). On the other hand, non-perseverative errors refer to all errors which do not fit the above description, or in other words, do not occur as a function of changing the sorting rule, such as casual errors.
The number of trials to complete the first category tells us how many trials the agent needs in order to achieve the first sorting principle, and can be seen as an index of conceptual ability [3,51]. Finally, a failure to maintain a set occurs when the agent fails to match cards according to the sorting rule after it can be determined that the agent has acquired the rule. A given sorting rule is assumed to be acquired when the individual correctly sorts at least five cards in a row [29,18]. Thus, a failure to maintain a set arises whenever a participant suddenly changes the sorting strategy in the absence of negative feedback. Failures to maintain a set are mostly attributed to distractibility. We compute this measure by counting the occurrences of first errors after the acquisition of a rule.
We run the generative model by varying flexibility across four levels, λ ∈ {0.3, 0.5, 0.7, 0.9}, and information loss across three levels, δ ∈ {0.4, 0.7, 0.9}. We generate data from 150 synthetic cognitive agents per parameter combination and compute standard scoring measures for each agent's simulated responses. Results from the simulation runs are depicted in Figure 3.
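Two of these scoring measures can be computed directly from a sequence of feedback values. The sketch below is a simplified reading of the definitions given above, not the full clinical scoring criteria of [29,20]: it assumes that a category is completed after 10 consecutive correct responses and that a rule counts as acquired after five. The function names and the example sequence are hypothetical.

```python
def trials_to_first_category(feedback, run_length=10):
    """Number of trials needed to reach the first run of `run_length`
    consecutive correct responses, or None if it never happens."""
    streak = 0
    for trial, correct in enumerate(feedback, start=1):
        streak = streak + 1 if correct else 0
        if streak == run_length:
            return trial
    return None


def failures_to_maintain_set(feedback, acquired=5, run_length=10):
    """Count first errors that follow at least `acquired` consecutive
    correct responses within a category."""
    count, streak = 0, 0
    for correct in feedback:
        if correct:
            streak += 1
            if streak == run_length:  # category completed, rule changes
                streak = 0
        else:
            if streak >= acquired:
                count += 1
            streak = 0
    return count


# Six correct, one slip, a full category, then an error after the rule change.
feedback = [1] * 6 + [0] + [1] * 10 + [0] + [1] * 3
tfc = trials_to_first_category(feedback)
fms = failures_to_maintain_set(feedback)
```

In the example, the slip after six correct responses counts as a failure to maintain set, whereas the error right after the completed category does not, because the streak is reset when the rule changes.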
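The factorial design can be enumerated as a list of simulation runs, as in the sketch below. The parameter levels and the number of agents come from the text; the variable names and the per-run tuple layout are assumptions, and the agent simulation itself is not included.

```python
from itertools import product

LAMBDAS = (0.3, 0.5, 0.7, 0.9)   # flexibility levels
DELTAS = (0.4, 0.7, 0.9)         # information loss levels
N_AGENTS = 150                   # synthetic agents per parameter combination

# One entry per simulated agent: (lambda, delta, agent index within the cell).
design = [
    (lam, delta, agent)
    for lam, delta in product(LAMBDAS, DELTAS)
    for agent in range(N_AGENTS)
]

n_cells = len(LAMBDAS) * len(DELTAS)
print(n_cells, len(design))
```

Each tuple would be passed to the generative model once, so the scoring measures in each of the 12 cells are averaged over 150 independent runs.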
The simulated performance of our Bayesian cognitive agent demonstrates that different parameter combinations capture different meaningful behavioral patterns. In other words, flexibility and information loss seem to interact in a theoretically meaningful way.
First, overall errors increase when flexibility decreases, which is reflected by the inverse relation between the number of casual, as well as perseverative, errors and the values of parameter λ. Moreover, this pattern is consistent across all the levels of parameter δ. More precisely, information loss seems to contribute to the characterization of the casual and the perseverative components of the error in a different way. Perseverative errors are likely to occur after a sorting rule change and reflect the inability of the agent to use feedback to disengage attention from the currently attended feature. They therefore result from local cognitive dynamics conditioned on a particular stage of the task (e.g., after completing a series of correct responses).
Second, information loss does not interact with flexibility when perseverative errors are considered. This is due to the fact that high (resp. low) information loss affects general performance by yielding a dysfunctional response strategy which increases (resp. decreases) the probability of making an error at any stage of the task. The lack of such interaction provides evidence that our computational model can disentangle error patterns due to perseveration from those due to general distractibility, according to neuropsychological scoring criteria.
However, in our framework, flexibility is allowed to yield more general and non-local cognitive dynamics as well. Indeed, λ plays a role whenever belief updating is demanded as a function of negative feedback. An error classified as non-perseverative (e.g., casual error) by the scoring criteria might still be processed as feedback-related evidence for belief updating. Consistently, the interaction between λ and δ in accounting for casual errors shows that performance worsens when both flexibility and information loss become less optimal, and that this pattern becomes more pronounced for lower values of δ.
On the other hand, a specific effect of information loss can be observed for the scoring measures related to slow information processing and distractibility. The number of trials to achieve the first category reflects the efficiency of the agent in arriving at the first true environmental model. Flexibility does not contribute meaningfully to the accumulation of errors before completing the first category for some levels of information loss. This is reflected by the fact that the mean number of trials increases as a function of δ, and does not change across levels of λ for low and mid values of δ. A similar pattern applies to failures to maintain a set. Both scoring measures index a deceleration of the process of evidence accumulation for a specific environmental configuration, although the latter is a more exhaustive measure of dysfunctional adaptation.
Therefore, an interaction between parameters can be observed when information loss is high. A slow internal model convergence process increases the number of errors due to improper rule sampling from the internal environmental model. However, internal model convergence also plays a role when a new category has to be accomplished after completing an older one. On the one hand, compromised flexibility increases the number of errors due to inefficient feedback processing. This leads to longer trial windows needed to achieve the first category. On the other hand, when information loss is high, belief updating upon negative feedback is compromised due to high internal model uncertainty. At this point, the probability of erring due to distractibility increases, as accounted for by the failures-to-maintain-set measure. Finally, the joint effect of δ and λ for high levels of information loss suggests that the roles played by the two cognitive parameters in accounting for adaptive functioning can be entangled when neuropsychological scoring criteria are considered.

Simulation 2: Information-theoretic Analysis of the Bayesian Agent
In the following, we explore a different simulation scenario in which information-theoretic measures are derived to assess performance of the Bayesian cognitive agent.In particular, we explore the functional relationship between cognitive parameters and the dynamics of the recovered information-theoretic measures by simulating observed responses by varying flexibility across three levels, λ ∈ {0.1, 0.5, 0.9}, and information loss across three levels, δ ∈ {0.1, 0.5, 0.9}.
For this simulation scenario, we make no prior assumptions about sub-types of error classification.Instead, we investigate the dynamic interplay between Bayesian surprise, B t , Shannon surprise, I t , and entropy, H t over the entire course of 128 trials in the WCST.Again, simulated performance of the Bayesian cognitive agent shows that different parameter combinations yield different patterns of cognitive dynamics.Observed spikes and their related magnitudes signal informative task events (e.g., unexpected negative feedback), as accounted by Shannon surprise, or belief updating, as accounted by Bayesian surprise.Finally, entropy encodes the epistemic uncertainty about the environmental model on a trial-by-trial basis.
In general, low information loss ensures optimal behavior by speeding up internal model convergence by decreasing the number of trials needed to minimize uncertainty about the environmental states.Low uncertainty reflects two main aspects of adaptive behavior.On the one hand, the probability that a response occurs due to sampling of improper rules decreases, allowing the agent to prevent random responses due to distractibility.On the other hand, model convergence entails a peaked Shannon surprise when a negative feedback occurs, due to the divergence between predicted and actual observations.Flexibility plays a role in integrating feedback information in order to enable belief updating.The first row depicted in Figure 4 shows cognitive dynamics related to low information loss, across the levels of flexibility.As can be noticed, there is a positive relation between the magnitude of the Bayesian surprise and the level of flexibility, although unexpectedness yields approximately the same amount of signaling, as accounted by peaked Shannon surprise.From this perspective, surprise and belief updating can be considered functionally separable, where the first depends on the particular internal model probability configuration related to δ, whilst the second depends on flexibility λ.
However, more interesting patterns can be observed when information loss increases.In particular, model convergence slows down and several trials are needed to minimize predictive model entropy.Casual errors might occur within trial windows characterized by high uncertainty, and interactions between entropy and Shannon surprise can be observes in such cases.In particular, Shannon surprise magnitude increases (resp.decreases) when model's entropy decreases (resp.increases), that is, during the task phases in which the internal model has converged (resp.not converged).As a consequence, negative feedback could be classified as informative or uninformative, based on the uncertainty in the current internal model.This is reflected by the negative relation between entropy and Shannon surprise, as can be noticed by inspecting the graphs depicted in the third row of Figure 4. Therefore, the magnitude of belief updating depends on the interplay between entropy and Shannon surprise, and can differ based on the values of the two measures in a particular task phase.
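The negative relation between entropy and Shannon surprise can be illustrated with a small numerical example. The sketch below compares the same negative feedback under a converged and an unconverged belief; the belief vectors and the deterministic likelihood of negative feedback are assumptions chosen for illustration and are not taken from the simulations.

```python
import math


def entropy(belief):
    """Shannon entropy of a discrete belief distribution (in nats)."""
    return -sum(b * math.log(b) for b in belief if b > 0)


def shannon_surprise(prior, likelihood):
    """Negative log marginal probability of the current observation."""
    return -math.log(sum(l * p for l, p in zip(likelihood, prior)))


# Assumed likelihood of negative feedback after a response that matches
# feature 0 only: impossible if feature 0 is the rule, certain otherwise.
negative_feedback = [0.0, 1.0, 1.0]

converged = [0.90, 0.05, 0.05]       # low-entropy internal model
unconverged = [1 / 3, 1 / 3, 1 / 3]  # high-entropy internal model

for name, belief in (("converged", converged), ("unconverged", unconverged)):
    h = entropy(belief)
    s = shannon_surprise(belief, negative_feedback)
    print(f"{name}: entropy={h:.3f}, surprise={s:.3f}")
```

Under the converged belief the negative feedback is highly improbable and therefore produces a large Shannon surprise, whereas under the flat belief the same feedback is largely expected and carries little surprise.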
In sum, both simulation scenarios suggest that the simulated behavior of our generative model is in accord with theoretical expectations. Moreover, the flexibility and information loss parameters can account for a wide range of observed response patterns and inferred dynamics of information processing.

Model Identification
In this section, we discuss the computational framework for recovering the parameters of our model from observed behavioral data. Parameter recovery is essential to inferring the cognitive dynamics underlying observed behavior in real-world applications of the model. This section is slightly more technical and can be skipped without significantly affecting the flow of the text.
Making our cognitive model suitable for application in real-world contexts entails estimating parameters from available data and accounting for uncertainty about parameter estimates. Indeed, uncertainty quantification turns out to be a fundamental and challenging goal when first-level quantities, that is, cognitive parameter estimates, are used to recover (second-level) information-theoretic measures of cognitive dynamics. The main difficulties arise when model complexity makes estimation and uncertainty quantification intractable at both analytical and numerical levels. For instance, in our case, probability distributions for the hidden model are generated at each trial, and the mapping between hidden states and responses changes depending on the structure of the task environment.
Identifying such a dynamic mapping is relatively easy from a generative perspective, but it becomes challenging, if not practically impossible, when the mapping has to be inverted. Generally, this problem arises when no likelihood function relating model parameters to the data is available, or when the likelihood function is too complex to be evaluated [52]. To overcome these limitations, we apply the recently developed BayesFlow method [46]. BayesFlow is a powerful computational tool that allows one to estimate parameters and quantify uncertainty in a unified probabilistic framework when inverting the generative model is intractable. The method is based on recent advances in deep probabilistic modeling and makes no assumptions about the shape of the true parameter posteriors. Thus, our ultimate goal becomes to approximate and analyze the joint posterior distribution over the model parameters. The posterior is given via an application of Bayes' rule:

$$p(\theta \mid x_{0:T}, m_{0:T}) = \frac{p(x_{0:T}, m_{0:T} \mid \theta)\, p(\theta)}{\int p(x_{0:T}, m_{0:T} \mid \theta)\, p(\theta)\, d\theta} \qquad (10)$$

where we set $\theta = (\lambda, \delta)$ and stack all observations and matching signals into the vectors $x_{0:T} = (x_0, x_1, \ldots, x_T)$ and $m_{0:T} = (m_0, m_1, \ldots, m_T)$, respectively.
The BayesFlow method uses simulations from the generative model to learn and calibrate a probabilistic mapping between data and parameters. First, it utilizes the fact that the data likelihood at each trial t can be reparameterized as a deterministic function g of the parameters θ and a noise term ξ, with g being the generative Bayesian cognitive model (Algorithm 1) and ξ independent noise representing the nondeterministic relationship between data-generating parameters and generated data. Second, BayesFlow utilizes the fact that data can easily be simulated by repeatedly running g with different θ, and thereby iteratively minimizes the divergence between the true posterior and an approximate posterior via an invertible neural network. This approach yields samples from the approximate joint posterior distribution of the cognitive parameters of interest, which can be further processed in order to extract meaningful statistics (e.g., posterior mean, maximum a posteriori).
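The idea of learning about parameters purely from simulations can be illustrated with a much simpler likelihood-free technique. The sketch below uses rejection sampling (approximate Bayesian computation), which is not the BayesFlow method and involves no neural network; it only shows how a simulator written as a deterministic function of parameters and noise can stand in for an explicit likelihood. The one-parameter Bernoulli simulator `g`, the summary statistic, the tolerance `eps`, and all other names are hypothetical and unrelated to the WCST model.

```python
import random

def g(theta, xi):
    """Toy generative model: deterministic in theta once the noise xi is fixed."""
    return [1 if u < theta else 0 for u in xi]

def simulate(theta, n_trials, rng):
    """Draw independent noise and push it through g to obtain a data set."""
    xi = [rng.random() for _ in range(n_trials)]
    return g(theta, xi)

def summary(x):
    """Summary statistic: proportion of ones in the simulated responses."""
    return sum(x) / len(x)

def rejection_abc(observed, n_draws, eps, rng):
    """Keep prior draws whose simulated summary lies within eps of the observed one."""
    s_obs = summary(observed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.random()  # draw from a uniform prior on (0, 1)
        x_sim = simulate(theta, len(observed), rng)
        if abs(summary(x_sim) - s_obs) <= eps:
            accepted.append(theta)
    return accepted

rng = random.Random(0)
observed = simulate(0.7, 128, rng)  # pretend this is the observed data set
posterior_draws = rejection_abc(observed, n_draws=5000, eps=0.02, rng=rng)
posterior_mean = sum(posterior_draws) / len(posterior_draws)
```

The accepted draws play the role of approximate posterior samples, from which summaries such as the posterior mean can be computed. Rejection sampling of this kind has to be rerun for every new data set, whereas an amortized method is trained once and then applied to new data.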
At this point, we must ensure that our computational model can be reliably fit to data. To this end, the main requirement is that the parameters can be recovered accurately and that uncertainty in the estimates is well-calibrated.
To address this requirement, we train the invertible network for 50 epochs, which amounts to 50,000 backpropagation updates. We then validate performance on a separate validation set of 1000 simulated data sets with known, distinct ground-truth parameter values. Training the networks took less than a day on a single machine with an NVIDIA® GTX 1060 graphics card. In contrast, obtaining full parameter posteriors for the entire validation set took approximately 1.78 seconds. In what follows, we describe and report all performance validation metrics.
To assess the accuracy of point estimates, we compute the root mean squared error (RMSE) and the coefficient of determination (R²) between estimated and true parameter values. To assess the quality of the approximate posteriors, we compute a calibration error [46] of the empirical coverage of each marginal posterior. Finally, we implement simulation-based calibration (SBC, [57]) for visually detecting systematic biases in the approximate posteriors.
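As a minimal sketch of the two point-estimate metrics, the functions below compute RMSE and R² between true and estimated parameter values. The R² used here is the usual one minus the ratio of residual to total sum of squares, which is an assumption about the exact variant used; the example values are invented.

```python
import math

def rmse(estimated, true):
    """Root mean squared error between estimated and true parameter values."""
    n = len(true)
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimated, true)) / n)

def r_squared(estimated, true):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_true = sum(true) / len(true)
    ss_res = sum((t - e) ** 2 for e, t in zip(estimated, true))
    ss_tot = sum((t - mean_true) ** 2 for t in true)
    return 1.0 - ss_res / ss_tot

# Invented ground-truth values and posterior means for illustration only.
true_values = [0.1, 0.4, 0.7, 0.9]
posterior_means = [0.2, 0.4, 0.6, 0.9]

example_rmse = rmse(posterior_means, true_values)
example_r2 = r_squared(posterior_means, true_values)
```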
Point Estimates. Point estimates obtained as posterior means, as well as the corresponding RMSE and R² metrics, are depicted in Figure 5a. Note that point estimates do not have any special status in Bayesian inference, as they could be misleading depending on the shape of the posteriors. However, they are simple to interpret and useful for ease of comparison. We observe that pointwise recovery of λ is better than that of δ. This is mainly due to suboptimal pointwise recovery in the lower (0, 0.1) range of δ. This pattern is evident in Figure 5a and arises because δ values in this range produce almost indistinguishable data patterns. Bootstrap estimates yielded an average RMSE of 0.155 (SD = 0.004) and an average R² of 0.708 (SD = 0.015) for the δ parameter. An average RMSE of 0.094 (SD = 0.002) and an average R² of 0.895 (SD = 0.007) were obtained for the λ parameter. These results suggest good global pointwise recovery but also warrant the inspection of full posteriors, especially in the low ranges of δ.
Full Posteriors. Average bootstrap calibration error was 0.011 (SD = 0.005) for the marginal posterior of δ and 0.014 (SD = 0.007) for the marginal posterior of λ. Calibration error is perhaps the most important metric here, as it measures potential under- or overconfidence across all confidence intervals of the approximate posterior (i.e., an α-confidence interval should contain the true parameter value with a probability of α, for all α ∈ (0, 1)). Thus, low calibration error indicates a faithful uncertainty representation of the approximate posteriors. Additionally, SBC histograms are depicted in Figure 5b. As shown by [57], deviations from the uniformity of the rank statistic (also known as a PIT histogram) indicate systematic biases in the posterior estimates. A visual inspection of the histograms reveals that the posterior means slightly overestimate the true values of δ. This corroborates the pattern seen in Figure 5a for the lower range of δ.
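The logic of the rank statistic and of empirical coverage can be sketched on a toy problem where the exact posterior is known. The example assumes a conjugate normal model (prior θ ~ N(0, 1), one observation x ~ N(θ, 1), hence posterior N(x/2, 1/2)), which has nothing to do with the WCST model. Summarizing calibration error as the median absolute gap between empirical and nominal coverage is also an assumption about the convention; all names are hypothetical.

```python
import random
import statistics

def sbc_ranks(n_sets=1000, n_draws=99, seed=1):
    """Rank of each true parameter among draws from its (exact) posterior."""
    rng = random.Random(seed)
    ranks = []
    for _ in range(n_sets):
        theta = rng.gauss(0.0, 1.0)           # ground truth drawn from the prior
        x = rng.gauss(theta, 1.0)             # one simulated observation
        post_mean, post_sd = x / 2.0, 0.5 ** 0.5
        draws = [rng.gauss(post_mean, post_sd) for _ in range(n_draws)]
        ranks.append(sum(d < theta for d in draws))  # integer in 0..n_draws
    return ranks

def central_coverage(ranks, n_draws, alpha):
    """Fraction of data sets whose true value lies in the central alpha interval.

    Assumes (1 - alpha) / 2 * (n_draws + 1) is (numerically) an integer.
    """
    n_bins = n_draws + 1
    tail = round((1 - alpha) / 2 * n_bins)
    inside = sum(tail <= r < n_bins - tail for r in ranks)
    return inside / len(ranks)

ranks = sbc_ranks()
alphas = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
gaps = [abs(central_coverage(ranks, 99, a) - a) for a in alphas]
calibration_error = statistics.median(gaps)   # assumed summary convention
mean_rank = sum(ranks) / len(ranks)
```

Because the posterior is exact in this toy case, the ranks should be close to uniform on 0..99 and the coverage gaps close to zero; a histogram of `ranks` that is skewed or U-shaped would signal biased or miscalibrated posteriors.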
Finally, Figure 5c depicts the full marginal posteriors on two validation sets. Even on these two data sets, we observe strikingly different posterior shapes. The marginal posterior of δ obtained from the first data set is slightly left-skewed and has its density concentrated over the (0.8, 1.0) range. On the other hand, the marginal posterior of δ from the second data set is noticeably right-skewed and peaked across the lower range of the parameter. The marginal posteriors of λ appear more symmetric and warrant the use of the posterior mean as a useful summary of the distribution.
These two examples underline the importance of investigating full posterior distributions as a means to encode all relevant information about the parameters. Moreover, they demonstrate the advantage of imposing no distributional assumptions on the resulting posteriors, as their form and sharpness can vary widely depending on the concrete data set.

Application
In this section, we fit the Bayesian cognitive model to real clinical data. The aim of this application is to evaluate the ability of our computational framework to account for dysfunctional cognitive dynamics of information processing in psychiatric patients. To this end, we estimate parameters at the individual level for a group of participants from an already published dataset [7].
Here, we focus on the estimation of the two relevant parameters λ and δ from a participant's observed response and feedback data. Our goal is to utilize the full information contained in the data and, further, to quantify the uncertainty in parameter estimates.

The Data
The dataset used in this application consists of responses collected by administering the Heaton version of the WCST to healthy and substance dependent individuals (SDIs). Participants in the study were adults (> 18 years old) and gave their informed consent for inclusion, which was approved by the appropriate human subject committee at the University of Iowa. SDIs were diagnosed as substance dependent based on the Structured Clinical Interview for DSM-IV criteria [19].
For this application, we focus on SDI participants who completed all 128 trials in the task. This is the only selection criterion employed, and it is motivated by the aim to utilize a maximum amount of data for model identification. However, this decision is not necessitated by the estimation method, since sequences with different numbers of trials can be used for parameter recovery. The resulting dataset consists of 10 SDIs.

Results
We fit the Bayesian cognitive agent to data from each participant and obtain individual posterior distributions over the parameters (see Figure 6). The advantage of modeling cognitive dynamics of individuals from a clinical population is that model predictions can be examined in light of available evidence about individual performances. SDIs are known to demonstrate inefficient conceptualization of the task and dysfunctional, error-prone response strategies. This has been attributed to defective error monitoring and behavior modulation systems, which depend on the functionality of cingulate and frontal brain regions [34,60]. Therefore, we expect our model to consistently capture such characteristics.
The recovered joint posteriors reveal a rather homogeneous pattern across SDI participants. Flexibility appears seriously impaired, as reflected by the low values of λ. The ability to efficiently achieve a suitable representation of the (task) environment also appears compromised due to abnormal information loss, as reflected by the high values of δ. However, slight individual differences in the parameters can be observed.
Parameter estimates suggest that error patterns produced by these individuals might be induced by a non-trivial interaction between cognitive sub-components. Lower values of λ imply that errors are likely to be produced by generating responses from an internal environmental model which is no longer valid. In other words, the agent is unable to rely on local feedback-related information in order to update beliefs about hidden states. On the other hand, higher values of δ reflect a general inefficiency of belief updating processes due to slow convergence to the optimal probabilistic environmental model.
From this perspective, Bayesian surprise $B_t$ and Shannon surprise $I_t$ might play different roles in regulating behavior based on different internal model probability configurations. These configurations are governed by the interplay between cognitive parameters.
For instance, it is often the case that psychiatric patients produce a noticeable number of errors distributed sparsely across windows of trials. However, errors might be processed differently based on the status of the internal representation of environmental states, as reflected by the entropy of the predictive model, $H_t$. Thus, information-theoretic measures allow us to describe cognitive dynamics on a trial-by-trial basis and, further, to disentangle the effect that different feedback-related information processing dynamics exert on adaptive behavior.
To further clarify these concepts, we investigate the reconstructed time series of information-theoretic quantities of an exemplary individual response pattern (Patient 7; Figure 7b).
Figure 7 depicts the unfolding of cognitive dynamics across a subset of trials in the task. Information-theoretic measures are recovered using the posterior means of the parameters.
The processing of unexpected observations is accounted for by quantifying the surprise at observing a response-feedback pair which is inconsistent with the current internal model of the task environment. Negative feedback is maximally informative when errors occur after the internal model has converged to the true task model (grey area), that is, when the entropy approaches zero (grey line). The Shannon surprise (orange line) is maximal when errors occur within trial windows in which the agent's uncertainty about environmental states is minimal (orange areas).
However, internal model updates following informative feedback are not optimally performed, which is reflected by a very small Bayesian surprise (blue line). This is due to impaired flexibility, and reflects the fact that after internal model convergence, informative feedback is not processed adequately and the internal model becomes impervious to change.
Conversely, errors occurring when the agent is uncertain about the true environmental state carry no useful information for belief updating, since the system fails to conceive such errors as unexpected and informative. The information loss parameter plays a crucial role in characterizing this cognitive behavior. The slow convergence to the true environmental model, accompanied by the slow reduction of entropy in the predictive model, means that a large number of trials is required to achieve a good representation of the current task environment (white areas). Errors occurring within trial windows with large predictive model entropy (green area) do not affect subsequent behavior, and feedback is maximally uninformative. The role that predictive (internal) model uncertainty plays in characterizing the way the agent processes feedback makes it possible to disentangle sub-types of errors based on the information they convey for subsequent belief updating. From this perspective, error classification is entirely dependent on the status of the internal environmental model across task phases. Identifying such a dynamic latent process is therefore fundamental, since the error codification criterion evolves with the internal information processing dynamics. Otherwise, the problem of inferring which errors are due to perseverance in maintaining an older (converged) internal model and which are due to uncertainty about the true environmental state becomes intractable, or even impossible.
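The entropy-dependent distinction between error sub-types can be sketched as a simple labeling rule. The hard threshold used below is a hypothetical device for illustration only: the text describes a graded dependence on predictive model entropy and specifies no cutoff, and the example beliefs and feedback sequence are invented.

```python
import math

def entropy_bits(p):
    """Shannon entropy (bits) of a discrete predictive distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def classify_errors(correct, predictive, threshold_bits):
    """Label each trial by feedback and by the entropy of the predictive model.

    correct        : list of booleans (True = positive feedback)
    predictive     : list of predictive distributions, one per trial
    threshold_bits : hypothetical entropy cutoff separating the two error types
    """
    labels = []
    for is_correct, p in zip(correct, predictive):
        if is_correct:
            labels.append("correct")
        elif entropy_bits(p) < threshold_bits:
            # Error under a converged model: unexpected, hence informative.
            labels.append("informative error")
        else:
            # Error under an uncertain model: expected, hence uninformative.
            labels.append("uninformative error")
    return labels

# Invented trial sequence over three candidate rules.
correct = [True, False, False]
predictive = [
    [0.96, 0.02, 0.02],
    [0.96, 0.02, 0.02],   # error after convergence
    [0.40, 0.30, 0.30],   # error while still uncertain
]
labels = classify_errors(correct, predictive, threshold_bits=0.5)
```

With three candidate rules the entropy ranges from 0 to log2(3) ≈ 1.58 bits, so any cutoff has to be chosen relative to that scale.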

Discussion
Investigating information processing related to changing environmental contingencies is fundamental to understanding adaptive behavior. For this purpose, cognitive scientists usually rely on controlled settings in which individuals are asked to accomplish (possibly) highly demanding tasks whose demands are assumed to resemble those of natural environments. Even in the most trivial cases, such as the WCST, optimal performance requires integrated and distributed neurocognitive processes. Moreover, these processes are unlikely to be isolated by simple scoring or aggregate performance measures.
In the current work, we developed and validated a new computational Bayesian model which maps distinct cognitive processes into separable information-theoretic constructs underlying observed adaptive behavior. We argue that these constructs could help describe and investigate the neurocognitive processes underlying adaptive behavior in a principled way.
In contrast to similar modeling approaches involving information-theoretic constructs [45,42,50], we adopt a powerful computational method for model identification. The method allows us to recover parameter estimates and quantify their uncertainty, which is important for assessing the reliability of information-theoretic constructs in accounting for cognitive properties. In our case, uncertainty about, or identifiability of, cognitive parameters is captured via a full joint posterior, and representative statistics of the parameter posteriors (e.g., maximum a posteriori, posterior mean) can then be used to derive the unfolding of information-theoretic quantities on a trial-by-trial basis.
Several computational models have been proposed to analyze performance in the WCST (and similar RL tasks), ranging from behavioral [10,53] to neural network models [17,2,37,41]. These models aim to provide psychologically interpretable parameters or biologically inspired network structures, respectively, accounting for specific qualitative patterns of observed data. The main advantage of our Bayesian cognitive agent representation is that it provides both a cognitive and a measurement model which coexist within a substantiated theoretical framework.
Therefore, although our computational model is not a neural model, it might provide a suitable description of cognitive dynamics at a representational and computational level [39]. This description can then be related to the neural functioning underlying adaptive behavior. Indeed, there is some evidence to suggest that neural processes related to belief maintenance/updating and unexpectedness are crucial for performance in the WCST. In particular, brain circuits associated with cognitive control and belief formation, such as the parietal cortex and prefrontal regions, seem to share a functional basis with neural substrates involved in adaptive tasks [42]. Prefrontal regions appear to mediate the relation between feedback and belief updating [38], and efficient functioning in such brain structures seems to be heavily dependent on dopaminergic neuromodulation [44]. Moreover, the dopaminergic system plays a role in the processing of salient and unexpected environmental stimuli, in learning based on error-related information, and in evaluating candidate actions [42,16,28]. Accordingly, dopaminergic system functioning has been related to performance in the WCST [31,48] and shown to be critical for the main executive components involved in the task, that is, cognitive flexibility and set-shifting [9,54]. Further, neural activity in the anterior cingulate cortex (ACC) is increased when a negative feedback occurs in the context of the WCST [38]. This finding corroborates the view that the ACC is part of an error-detection network which allocates attentional resources to prevent future errors. The ACC might play a crucial role in adaptive functioning by encoding error-related or, more generally, feedback-related information. Thus, it could facilitate the updating of internal environmental models [47].
Such neurobiological evidence suggests that brain networks involved in the WCST might support adaptive behavior by maintaining and updating an internal model of the environment and by efficiently processing unexpected information. It is noteworthy that these processing aspects are incorporated into our computational framework. At this point, the empirical and theoretical potential of the proposed computational framework for investigating adaptive functioning can be outlined.
Model-Based Neuroscience. Recent studies have pointed out the advantage of simultaneously modeling and analyzing neural and behavioral data within a joint modeling framework. In this way, the latter can be used to provide information for the former, as well as the other way around [58,59,21]. This involves the development of joint models which encode assumptions about the probabilistic relationships between neural and cognitive parameters.
Within our framework, the reconstruction of information-theoretic discrete time series yields a quantitative account of the agent's internal processing of environmental information. Event-related cognitive measures of belief updating, epistemic uncertainty and surprise can be related to neural measurements by explicitly providing a formal account of the statistical dependencies between neural and cognitive (information-theoretic) quantities. In this way, latent cognitive dynamics can be directly related to neural event-related measures (e.g., fMRI, EEG). Applications in which information-theoretic measures are treated as dependent variables in standard statistical analysis are also possible.
Neurological Assessment. Although neuroscientists have considered performance in the WCST as a proxy for measuring high-level cognitive processes, the usual approach to the analysis of human adaptive behavior consists in summarizing response patterns by simple heuristic scoring measures (e.g., occurrences of correct responses and sub-types of errors produced) and classification rules [20]. However, the theoretical utility of such a summary approach remains questionable. Indeed, adaptive behavior appears to depend on a complex and intricate interplay between multiple network structures [4,40,38,5,12]. This poses a great challenge for disentangling high-level cognitive constructs at a model level and further investigating their relationship with neurobiological substrates. It appears that standard scoring measures might not be able to fulfil these tasks. Moreover, there is a pronounced lack of anatomical specificity in previous research concerning the neural and functional substrates of the WCST [43].
Thus, there is a need for more sophisticated modeling approaches. For instance, disentangling errors due to perseverative processing of previously relevant environmental models from those due to uncertainty about task environmental states is important and nontrivial. Sparse and distributed error patterns might depend on several internal model probability configurations. Such internal models are latent, and can only be uncovered through cognitive modeling. Therefore, information-based criteria for response (error) classification can enrich clinical evaluation beyond heuristically motivated criteria.
Generalizability. Another important advantage of the proposed computational framework is that it is not solely confined to the WCST. In fact, one can argue that the seventy-year-old WCST does not provide the only or even the most suitable setting for extracting information about cognitive dynamics from general populations or maladaptive behavior in clinical populations. One can envision tasks which embody probabilistic (uncertain) or even chaotic environments (for instance, with partially observable or unreliable feedback or partially observable states) and demand integrating information from different modalities [45,42]. These settings might prove more suitable for investigating changes in uncertainty-related processing or cross-modal integration than deterministic and fully observable WCST-like settings.
Note that, as it currently stands, our framework is directly extendable to these richer settings.
Despite these advantages, our proposed computational framework has some limitations. A first limitation concerns the fact that the new Bayesian cognitive model accounts for the main dynamics in adaptive tasks by relying on only two parameters. Although such a parsimonious proposal suffices to disentangle latent data-generating processes, a more exhaustive formal description of cognitive sub-components might be envisioned. However, model identification can become challenging in such a scenario, especially when sparse one-dimensional response data is used as a basis for parameter recovery.
Second, as it currently stands, model identification is optimal only when the entire sequence of 128 trials in the WCST is used. However, in the Heaton version, the task can end after only several sorting rule changes. Using incomplete data appears suboptimal for parameter recovery and results in large uncertainty estimates and multimodal posteriors. Future research should focus on designing and employing more data-rich RL tasks which can provide a better starting point for recovering complex latent cognitive dynamics.
In conclusion, the proposed model can be considered as the basis for a (bio)psychometric tool for measuring the dynamics of cognitive processes under changing environmental demands. Furthermore, it can be seen as a step towards a theory-based framework for investigating the relation between such cognitive measures and their neural underpinnings. Further investigations are needed to refine the proposed computational model and systematically explore the advantages of the Bayesian brain theoretical framework for empirical research on high-level cognition.

Figure 2: Suppose the correct sorting rule is the feature shape. The figure shows the rate of convergence of the predictive distributions to the true task environmental model. The predictive distribution at trial t + 1 depends on the sorting action a_t (first row) and the received feedback f_t (second row). Two examples of updating a predictive distribution are shown: one in which information loss is high (δ = 0.7, third row), and one in which information loss is low (δ = 0.3, fifth row). High information loss slows down the convergence of the internal model to the true environmental model. The gray bar plots represent the predictive probability distribution over the rules from which an action is sampled at each trial. Dotted bars represent the updated predictive distribution after the feedback observation. For each scenario, trial-by-trial information-theoretic measures are shown.

Figure 3: Clinical scoring measures as functions of flexibility and information loss in simulated scenarios. Cells show the density of scoring measures for the levels of λ across different levels of δ. In particular, they show the distribution of non-perseverative errors (E), perseverative errors (PE), number of trials to complete the first category (TFC), and number of failures to maintain set (FMS), obtained from 150 simulated response sequences of synthetic agents for each cell of the factorial design.

Figure 4: Information-theoretic measures varying as a function of flexibility λ and information loss δ across 128 trials of the WCST. Optimal belief updating and uncertainty reduction are achieved with low information loss and high flexibility (first row, third column).

Figure 4 depicts results from the nine simulation scenarios. Although an exhaustive discussion of cognitive dynamics should couple information-theoretic measures with patterns of correct and error responses, we focus solely on the information-theoretic time series for illustrative purposes. We refer to the Application section for a more detailed description of the relation between observed responses and estimated information-theoretic measures in the context of data from a real experiment.

Figure 5: Parameter recovery results on validation data; (a) Posterior means vs. true parameter values; (b) Histograms of the rank statistic used for simulation-based calibration; (c) Example full posteriors for two validation data sets; (d) Example information-theoretic dynamics recovered from the parameter posteriors.

Figure 6: Joint posteriors of the flexibility (λ) and information loss (δ) parameters obtained from the sample of 10 patients. We observe low flexibility and high information loss across all patients. Darker colors represent regions of low posterior density; lighter colors represent regions of high posterior density.

Figure 7: Recovered cognitive dynamics of Patient 7. (a) Joint posterior of the flexibility and information loss parameters. The marginal posteriors indicate very low flexibility and very high information loss; (b) Time series of information-theoretic measures depicting belief updating and the agent's internal model uncertainty during the unfolding of the task. Labels C and E indicate correct and error responses.