In this post, I will briefly discuss some of my current thoughts on decision behavior modeling based on my research experience over the past year.
From an abstract point of view, decision making means taking an action $a$ according to a certain strategy $\pi$, given some decision information $s$ (hereinafter referred to as the state):
$$a \sim \pi(s).$$
Decision behavior modeling, in turn, attempts to infer information about the strategy $\pi$ from observed data $\mathcal{D} = \{(s,a)\}_n$.
Because the dataset $\mathcal{D}$ contains only finitely many data points, infinitely many strategies $\pi$ can fit it well. Researchers therefore often need to make additional assumptions about the form of $\pi$, and one of the most fundamental is that $\pi$ follows the utility-maximization principle.
The Utility-maximization Principle
The utility-maximization (UM) principle states that the behavior of a decision maker (DM) should be consistent with the behavior that maximizes her utility. Through this principle, decision behavior modeling problems can be transformed into utility function estimation problems. Compared with directly fitting the strategy $\pi$ with descriptive methods, such as statistical or machine-learning models, this normative approach takes into account our prior understanding of behavior (i.e., rationality) and is therefore better suited to providing explanations and counterfactual judgments. At the same time, because the UM principle acts as a regularizer, the resulting behavior model has strong generalization and transfer ability (i.e., consistency across different scenarios).
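To make the step from behavior modeling to utility estimation concrete, here is a minimal sketch. It assumes, purely for illustration, a binary choice between two alternatives, a linear utility $u(x) = w x$ in a single feature, and logistic choice noise (a random-utility relaxation of exact maximization). None of these choices come from the discussion above, and the names `choice_prob`, `simulate_choices`, and `estimate_weight` are hypothetical.

```python
import math
import random


def choice_prob(w, x0, x1):
    """Probability of choosing alternative 0 under logistic noise (assumed)."""
    return 1.0 / (1.0 + math.exp(-w * (x0 - x1)))


def simulate_choices(w_true, n, rng):
    """Generate (state, action) pairs: state = two features, action = chosen index."""
    data = []
    for _ in range(n):
        x0, x1 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        a = 0 if rng.random() < choice_prob(w_true, x0, x1) else 1
        data.append(((x0, x1), a))
    return data


def estimate_weight(data, lr=2.0, steps=200):
    """Maximum-likelihood estimate of w by gradient ascent on the mean log-likelihood."""
    w = 0.0
    for _ in range(steps):
        grad = 0.0
        for (x0, x1), a in data:
            p0 = choice_prob(w, x0, x1)
            chose0 = 1.0 if a == 0 else 0.0
            grad += (chose0 - p0) * (x0 - x1)
        w += lr * grad / len(data)
    return w


rng = random.Random(0)
data = simulate_choices(w_true=2.0, n=2000, rng=rng)
w_hat = estimate_weight(data)
print(f"estimated utility weight: {w_hat:.2f} (true value 2.0)")
```

The only thing estimated here is the utility parameter `w`; the strategy itself is never fitted directly, which is what lets the estimated model be reused on choice sets that were not in the data.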
However, the UM principle implies the following assumption: the DM has an “excellent” optimization capability and can find the optimal solution accurately, regardless of the complexity of the underlying decision problem. Because this assumption is not intuitively solid, many doubts have been cast on the UM principle in the past few decades. A large number of scholars, by conducting laboratory experiments or analyzing real-world economic data, have pointed out that people usually make suboptimal decisions. In addition, many scholars have proposed behavioral models within the bounded rationality framework and illustrated with real-world data that these models have better explanatory power than models derived from the UM principle.
Unfortunately, these new theories cannot completely replace the UM principle: they fix some of its shortcomings from specific perspectives, but they remain within the “decision optimization” framework. In fact, on the one hand, apart from optimality, we do not have many intuitive mappings from decision problems to strategies to choose from. On the other hand, for any decision behavior, we can always make it optimal in some decision problem by supplementing missing information (of course, this may cost some consistency across behaviors; discussions of this aspect belong to axiomatic decision theory, and interested readers can start with Kreps’ Notes on the Theory of Choice [1]). Therefore, in my opinion, instead of “creating” new bounded rationality theories to “correct” the UM principle, it is better to expand the applicability of the UM principle by reformulating the underlying decision problems. More specifically, we should ask: under what circumstances, based on what information, and with respect to what utility function did the DM take the observed action? In other words, for the action
$$a \sim \pi(s),$$
does the researcher consider the correct decision information $s$?
Bounded Rationality and Incomplete Information
As mentioned above, much of the work on “correcting” the UM principle concerns the bounded rationality of the DM. Although the term “bounded rationality” is often used to support the argument that “the DM lacks optimization capability”, in my opinion it may actually reflect rational behavior of the DM in solving decision problems. Just as practitioners in optimization often need to balance solution accuracy against computational overhead, for complex decision problems the DM pays a high cost in time (or in lost future utility) if she searches for the exact utility-maximizing behavior. In this case, using heuristics to optimize behavior usually leads to better overall utility (of course, the set of feasible heuristics and the corresponding time costs vary from problem to problem).
From another perspective, the complexity of solving a decision problem actually reflects the amount of information that the DM has about the optimal behavior: the more difficult the problem is to solve, the less information the DM has about the optimal behavior and the higher the cost of searching for it. This view can be extended to general scenarios with incomplete information: in many cases, a decision problem involves a lot of relevant information, but the DM holds only a small fraction of it at the beginning. The DM then often needs to decide whether and how to obtain more information before making the final decision. For example, consider the classic problem of buying a car. A customer is unlikely to be familiar with all the models available on the market, but she can obtain more information by searching online, attending auto shows, and taking test drives. These methods have their own advantages and disadvantages: a test drive gives the DM a very deep understanding of one model but costs a lot of time, whereas reading car magazines and watching ads gives the DM basic (and only basic!) information on a wide range of models in a short period of time. The car purchase problem therefore also contains an information acquisition subproblem. In the past decade, many theoretical and empirical studies have discussed this kind of information processing mechanism in depth. Interested readers can refer to Sims [2], Gabaix [3], and Masatlioglu and Nakajima [4].
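As a rough illustration of the information acquisition subproblem, the sketch below uses a textbook threshold (reservation value) rule from search theory. The setup is entirely my own assumption and is not taken from the cited papers: each inspected car reveals a utility drawn from Uniform(0, 1), every inspection costs a fixed amount `cost`, and the DM can always go back to the best car seen so far. The function names are hypothetical.

```python
import math
import random


def reservation_value(cost):
    """Solve cost = E[max(U - r, 0)] for U ~ Uniform(0, 1), i.e. cost = (1 - r)^2 / 2."""
    return 1.0 - math.sqrt(2.0 * cost)


def search(cost, rng, max_draws=1000):
    """Inspect cars one by one; stop once the best utility seen reaches the threshold."""
    r = reservation_value(cost)
    best, draws = 0.0, 0
    while best < r and draws < max_draws:
        best = max(best, rng.random())  # utility revealed by one costly inspection
        draws += 1
    return best, draws


rng = random.Random(0)
for cost in (0.005, 0.02, 0.08):
    results = [search(cost, rng) for _ in range(5000)]
    avg_draws = sum(d for _, d in results) / len(results)
    avg_net = sum(b - cost * d for b, d in results) / len(results)
    print(f"cost={cost}: threshold={reservation_value(cost):.2f}, "
          f"mean inspections={avg_draws:.1f}, mean net utility={avg_net:.2f}")
```

In this toy setting, a higher inspection cost lowers the threshold, so the DM settles earlier for a car that an outside observer who ignores the search cost would call suboptimal.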
In the above discussion, I have only considered the “immediate” information of the decision problem. However, decision-making problems are usually not isolated/independent in the time dimension, and the behavior of the DM is likely to depend on the DM’s summaries of past information and expectations for the future. In the next two sections, I will discuss these two aspects in more detail.
Decision Sequences and Learning Dynamics
In our daily life, many problems require multi-period decisions. For example, in a classic multi-period consumption problem, given the relationship between the production output in a period and the investments from previous periods, the DM needs to decide how to split each period's output between consumption and investment so as to maximize her total utility from consumption. Multi-period decision behavior is more difficult to analyze than single-period decision behavior, because even when the (single-period) utility function is explicitly given, solving for the optimal strategy can still be very complicated (note: in this case, finding a **time-consistent** bounded rationality/suboptimality explanation can also be very difficult!). Nevertheless, in the past few decades, econometricians and computer scientists have conducted in-depth research on such problems (the two groups call them “dynamic discrete choice” and “inverse reinforcement learning”, respectively) and have proposed a series of utility estimation and behavioral modeling methods, including the nested fixed-point algorithm (Rust [5]). For work on dynamic discrete choice, interested readers can refer to Aguirregabiria and Mira [6] and Heckman and Navarro [7]. For recent progress in inverse reinforcement learning, interested readers can refer to my recent post.
However, these works all assume that “the DM has complete information about the underlying multi-period decision problem”. As discussed in the previous section, the DM is likely to lack information about both the decision problem and the optimal strategy. This means that we can extend the existing work in two directions: the DM’s uncertainty about the state transition mechanism/environmental feedback, and the DM’s imperfect (long-term) planning capability. As I will illustrate below, both can be described by learning dynamics.
On the one hand, solving for the optimal strategy has a large computational cost, so the DM usually cannot find the optimal action in a reasonable time. According to the above discussion, we can therefore view suboptimal behavior as the result of a tradeoff between computational cost and accuracy. In a sequential decision process, however, the decision problems in different periods have a similar structure, so when the DM solves a later decision problem she can reuse her previous calculations rather than start from scratch. Note also that algorithms for the optimal long-term strategy usually involve iterative processes (for example, classical infinite-horizon dynamic programming problems can be solved by value iteration or policy iteration). We can thus think of the DM as solving the current decision problem mainly with the information about the optimal solution that she obtained in previous iterations. This means that suboptimal behavior in sequential decisions can be seen as a by-product of the iterative process.
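The idea that suboptimal actions can be a by-product of an unfinished iteration can be sketched with value iteration on a toy problem. The five-state chain, its rewards, and the discount factor below are illustrative assumptions of mine: landing on state 0 pays a small reward, landing on state 4 pays a large one, and the DM acts greedily with respect to whatever value estimate she has after a given number of sweeps.

```python
N_STATES = 5          # chain of states 0..4 (illustrative)
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
GAMMA = 0.9


def step(s, a):
    """Deterministic transition along the chain, clipped at both ends."""
    return max(s - 1, 0) if a == 0 else min(s + 1, N_STATES - 1)


def reward(s_next):
    """Small reward for landing on state 0, large reward for landing on state 4."""
    if s_next == 0:
        return 1.0
    if s_next == N_STATES - 1:
        return 10.0
    return 0.0


def q_values(V, s):
    """One-step lookahead values of both actions in state s under estimate V."""
    return [reward(step(s, a)) + GAMMA * V[step(s, a)] for a in ACTIONS]


def value_iteration(n_sweeps):
    """Run a fixed number of Bellman backups, starting from V = 0."""
    V = [0.0] * N_STATES
    for _ in range(n_sweeps):
        V = [max(q_values(V, s)) for s in range(N_STATES)]
    return V


def greedy_policy(V):
    """Action the DM would take if she trusted her current value estimate."""
    policy = []
    for s in range(N_STATES):
        q = q_values(V, s)
        policy.append(max(ACTIONS, key=lambda a: q[a]))
    return policy


for k in (1, 2, 4, 50):
    print(k, greedy_policy(value_iteration(k)))
```

After a single sweep the DM in state 0 still prefers the nearby small reward, while after many sweeps she heads for the distant large one. An observer who assumes a fully optimized strategy would have to label the early behavior as irrational, although it is exactly what the iteration prescribes at that point.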
On the other hand, regarding the uncertainty of the state transition, we should notice that in reinforcement learning problems we generally do not assume that the DM has complete information about the state transition function; we only assume that the state transition remains unchanged during the learning process. During learning, the DM updates her perception of the state transition mechanism and her strategy through continual trial and error. This means that, given the expectation of an invariant state transition function in the future, the DM’s knowledge of the state transition in each period can be described by her experience in previous periods. (Note: Regarding the “invariant state transition” expectation, if we think of the process rather than the result, we can also use some invariance assumptions and the DM’s previous information about this expectation to describe her current knowledge of it. This recursion can go on until we reach a final invariant assumption/expectation. I think that this can be described in the language of set/measure theory, or modeled by a multi-layer recurrent neural network (RNN).)
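One simple way to express “knowledge of the state transition described by past experience” is a count-based belief. The sketch below is an assumption-laden illustration: the class name `TransitionBelief` is hypothetical, states are integer indices, and a symmetric pseudo-count `prior` stands in for whatever the DM believes before seeing any data.

```python
from collections import defaultdict


class TransitionBelief:
    """The DM's estimate of P(s' | s, a), built only from observed transitions."""

    def __init__(self, n_states, prior=1.0):
        self.n_states = n_states
        self.prior = prior  # pseudo-count per next state (assumed symmetric)
        self.counts = defaultdict(lambda: [0] * n_states)

    def update(self, s, a, s_next):
        """Record one experienced transition (s, a) -> s_next."""
        self.counts[(s, a)][s_next] += 1

    def prob(self, s, a):
        """Posterior-mean estimate of the next-state distribution."""
        c = self.counts[(s, a)]
        total = sum(c) + self.prior * self.n_states
        return [(ci + self.prior) / total for ci in c]


belief = TransitionBelief(n_states=2)
print(belief.prob(0, 0))          # before any experience: uniform
for _ in range(3):
    belief.update(0, 0, 1)        # the DM observes 0 -> 1 three times
print(belief.prob(0, 0))          # belief shifts toward state 1
```

The belief at any period is a function of the experience gathered so far, and it keeps moving as long as new transitions arrive, which is why behavior derived from it need not look like the output of a fixed optimal strategy.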
Combining these two points, I think that the DM’s behavior in a sequential decision process is more likely to come from a dynamic learning (optimization) process than from an already optimized result. From this perspective, the behavior of the DM depends not only on her state in the actual environment but also on her history. In the next section, I will discuss the DM’s mechanisms for processing past information in more detail.
Processing of Past Information and Policy Iterations
In the above discussion, I mentioned that when making a decision the DM may consider historical information (hereinafter $s^h$). A simple way to introduce $s^h$ into the formulation of the decision-making problem is to include it as an input to the strategy:
$$a \sim \pi(s,s^h).$$
However, the above formulation implicitly assumes that “the strategy $\pi$ is independent of the historical information $s^h$”, and this assumption is invalid in many real-world scenarios, including the aforementioned dynamic learning process: as historical information accumulates, the strategy itself should also be updated. To resolve this problem, we can consider an iterative mechanism $F$ that includes the strategy $\pi$ in the iterative process:
$$ a_t,(\pi,s^h)_t \sim F(s_t,(\pi,s^h)_{t-1}), $$
where $t\in \mathbb{N}^+$ denotes the time period. (Note that this formulation has a structure similar to a recurrent neural network: $s_t$ can be seen as the new input to the system at time $t$, $a_t$ as the output, and $(\pi,s^h)$ as the continually updated hidden state.)
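The recurrent structure of $F$ can be sketched in a few lines. The particular `F` below is a deliberately trivial assumption of mine: the strategy is a threshold rule with parameter `theta`, the history $s^h$ is the list of past states, and the update sets `theta` to the running mean of that history. I also assume that the DM acts with the strategy from period $t-1$ before updating it; the formulation above does not fix this order.

```python
def F(s_t, hidden):
    """One step of a toy iterative mechanism: act with the old strategy, then update.

    hidden = (theta, history): theta parameterizes the strategy, history is s^h.
    """
    theta, history = hidden
    a_t = 1 if s_t > theta else 0          # a_t from pi(s_t; theta_{t-1})
    history = history + [s_t]              # s^h grows with each observation
    theta = sum(history) / len(history)    # strategy update driven by s^h
    return a_t, (theta, history)


def rollout(F, states, hidden):
    """Unroll F over a sequence of inputs, in the same way as a recurrent cell."""
    actions = []
    for s_t in states:
        a_t, hidden = F(s_t, hidden)
        actions.append(a_t)
    return actions, hidden


actions, (theta, history) = rollout(F, [4.0, 1.0, 3.0, 0.5], (0.0, []))
print(actions, theta)
```

The same input can produce different actions at different times because the hidden pair `(theta, history)` has changed in between, which is the property the fixed-strategy formulation $a \sim \pi(s, s^h)$ cannot capture.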
The new question is then how to determine the form of $F$. From a normative point of view, $F$ should represent an optimization mechanism, but finding a suitable form is nontrivial; and if we use high-capacity models (such as an LSTM) for $F$, the estimated model may lack explanatory power. Next, I will briefly review some existing online learning and reinforcement learning algorithms and try to distill some basic elements for modeling.
We first look at the form of the strategy $\pi_t$ (assuming that the decision behavior still follows $a_t \sim \pi_t(s_t)$). In existing reinforcement learning algorithms, if the learning process is policy-based, the strategy $\pi_t$ itself is parameterized; if the learning process is value-based, the strategy can be expressed as the composition of a maximization operator and a parameterized value (utility) function $Q_t$: $\pi_t = \arg\max_{a\in A}Q_t(\cdot,a)$. In both cases, the strategy can be written in a parameterized form $\pi_t = \pi(\cdot,\theta_t)$.
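The two parameterizations can be put side by side in a small tabular sketch. The softmax form for the policy-based case, the dictionary layout `table[s][a]`, and the function names are my own illustrative choices rather than anything prescribed by a specific algorithm.

```python
import math
import random


def softmax_policy(theta, s):
    """Policy-based case: theta[s][a] are action preferences, pi is their softmax."""
    prefs = theta[s]
    m = max(prefs)  # subtract the maximum for numerical stability
    weights = [math.exp(p - m) for p in prefs]
    total = sum(weights)
    return [w / total for w in weights]


def greedy_policy(q_table, s):
    """Value-based case: pi(s) = argmax_a Q(s, a), with Q stored as q_table[s][a]."""
    q = q_table[s]
    return max(range(len(q)), key=lambda a: q[a])


def sample_action(probs, rng):
    """Draw an action index from a list of action probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


theta = {"low": [0.0, 1.0], "high": [2.0, 0.0]}
probs = softmax_policy(theta, "low")
print(probs, sample_action(probs, random.Random(0)))
print(greedy_policy(theta, "high"))
```

In both functions the behavior is fully determined by a table of numbers, so updating the strategy over time reduces to updating $\theta_t$.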
Next we consider the elements in the hidden state $s^h$ and the corresponding iterative mechanism:
For the classical (online) stochastic gradient descent algorithm, the state $s_t$ can be expressed as a feature-label pair $(x_t, y_t)$, and the action $a_t$ is the model parameter $\theta_t$ (the strategy $\pi_t$ is a trivial identity function). $\theta_t$ can be expressed as the sum of the previous parameter $\theta_{t-1}$ and the current gradient term $g_t=g(\theta_{t-1}, s_t|f)$ weighted by a step size $\alpha$:
$$\theta_t = \theta_{t-1} + \alpha g(\theta_{t-1},s_t|f),$$
where $f$ is the target model. Note that in this case we don’t need to consider $s^h$.
For model-free reinforcement learning algorithms, a commonly used method to stabilize training is experience replay: the gradient $g_t$ is calculated from historical samples rather than only from the current one. In this case, the historical state $s^h$ is the experience pool, and in formulating the iterative mechanism $F$ we also need to consider the sampling method $f_{sample}$ (e.g., uniform sampling or prioritized sampling).
For model-based reinforcement learning algorithms, the historical state $s^h$ is the model $M_t$; the update of the strategy parameter $\theta_t$ follows the gradients computed by simulating the model $M_t$.
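The first two cases above can be contrasted in a short sketch. It assumes a one-parameter model $y = \theta x$ with squared loss, takes $g$ to be the negative gradient so that the update matches $\theta_t = \theta_{t-1} + \alpha g$, and uses uniform sampling for $f_{sample}$. The replay variant is a supervised stand-in for the reinforcement learning setting, and all function names are hypothetical.

```python
import random


def grad_step(theta, sample):
    """Update direction g for the model y = theta * x under squared loss."""
    x, y = sample
    return (y - theta * x) * x  # negative gradient of 0.5 * (y - theta * x)^2


def sgd_update(theta, s_t, alpha):
    """Online SGD: no history needed, only the current sample s_t = (x_t, y_t)."""
    return theta + alpha * grad_step(theta, s_t)


def replay_update(theta, buffer, s_t, alpha, rng):
    """Experience replay: store s_t in s^h, then update from a uniformly drawn past sample."""
    buffer.append(s_t)
    replayed = rng.choice(buffer)  # f_sample: uniform sampling (assumed)
    return theta + alpha * grad_step(theta, replayed)


rng = random.Random(0)
theta_sgd, theta_replay, buffer = 0.0, 0.0, []
for _ in range(200):
    x = rng.uniform(0.5, 1.5)
    s_t = (x, 3.0 * x)  # noise-free data from y = 3x (illustrative)
    theta_sgd = sgd_update(theta_sgd, s_t, alpha=0.5)
    theta_replay = replay_update(theta_replay, buffer, s_t, alpha=0.5, rng=rng)
print(theta_sgd, theta_replay)
```

The only structural difference between the two updates is the extra argument `buffer`: it plays the role of $s^h$, and the choice of `rng.choice` is where a different $f_{sample}$ would plug in.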
Based on the above discussion, we can tentatively assume that the hidden state $s^h$ mainly contains an experience pool of past decision behaviors and the models trained from them. These elements both describe the DM’s understanding of the underlying state transition mechanism/environmental feedback and largely determine how the strategy is updated. Since $s^h$ also captures the DM’s uncertainty about the state transition, we can use this framework to describe methods that address the exploration-exploitation tradeoff (such as bootstrapped DQN [8]), because the essence of such methods is the systematic exploitation of uncertainty estimates.
Through the above discussion, we can roughly determine the elements that need to be considered in the iterative mechanism $F$, but it is not yet clear how these elements should interact with each other. A lot of further work is therefore needed. Here are some directions that I think are worth trying: from the normative perspective, we can attempt to formulate optimization problems for the information processing mechanism, in the form of regret minimization (for single-task scenarios) or meta-learning (for multi-task scenarios); from the descriptive perspective, we can explore and exploit the similarities and differences between biological memory mechanisms and existing iterative learning (optimization) algorithms.
Summary
In this post, I have tried to analyze decision-making behavior from a more general perspective. I think that in formulating the DM’s decision problem, we can further consider the DM’s summaries of past information and her expectations for the future. Because the latter are essentially determined by the former, the key to a good formulation is to find an appropriate representation of the mechanism that processes historical information. However, the discussion here only presents a perspective for analysis and does not provide any practical solution. (Note: In fact, I think it may be very difficult to propose a unified modeling framework for this problem.)
As for applications, I think we should follow the principle of Occam’s razor and start with the basic utility estimation methods; in my experience, complex modeling elements are unnecessary in many cases. But I believe that as researchers’ ability to acquire and process information increases and the decision problems faced by the DM become more complex, more work in this area should emerge in the future.
