Partially observable Markov decision process

A partially observable Markov decision process (POMDP) models an agent decision process in which it is assumed that the system dynamics are determined by a Markov decision process (MDP), but the agent cannot directly observe the underlying state.

Unlike the policy function of an MDP, which maps the underlying states to actions, a POMDP's policy is a mapping from the history of observations (or from belief states) to actions.
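For illustration, a belief state can be maintained by Bayesian filtering: the previous belief is pushed through the transition model and then reweighted by the likelihood of the received observation. The following Python sketch assumes discrete state, action, and observation sets and hypothetical `T` and `O` arrays for the transition and observation models; it is a minimal example, not a prescribed implementation.

```python
import numpy as np

def belief_update(belief, action, observation, T, O):
    """Bayesian belief update for a discrete POMDP.

    belief: array of shape (S,), current probability distribution over states.
    T: assumed transition model with T[a, s, s2] = P(s2 | s, a).
    O: assumed observation model with O[a, s2, o] = P(o | s2, a).
    Returns the posterior belief after taking `action` and observing `observation`.
    """
    # Prediction step: propagate the belief through the transition model.
    predicted = belief @ T[action]                      # shape (S,)
    # Correction step: reweight by the observation likelihood.
    unnormalized = predicted * O[action][:, observation]
    total = unnormalized.sum()
    if total == 0.0:
        raise ValueError("Observation has zero probability under this belief.")
    return unnormalized / total
```

A policy over beliefs then selects the next action as a function of the belief produced by this update, for example by table lookup or by maximizing an approximate value function.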

The POMDP framework is general enough to model a variety of real-world sequential decision processes.

Applications include robot navigation problems, machine maintenance, and planning under uncertainty in general.

The general framework of Markov decision processes with imperfect information was described by Karl Johan Åström in 1965[1] in the case of a discrete state space, and it was further studied in the operations research community where the acronym POMDP was coined.

It was later adapted for problems in artificial intelligence and automated planning by Leslie P. Kaelbling and Michael L. Littman.[2]

An exact solution to a POMDP yields the optimal action for each possible belief over the world states.

The optimal action maximizes the expected reward (or minimizes the cost) of the agent over a possibly infinite horizon.

The goal is for the agent to choose actions at each time step that maximize its expected future discounted reward, $E\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right]$, where $r_{t}$ is the reward earned at time $t$ and $\gamma \in [0, 1]$ is the discount factor. When $\gamma = 0$, the agent only cares about which action will yield the largest expected immediate reward; as $\gamma$ approaches 1, the agent cares about maximizing the expected sum of future rewards.

A consequence of this property is that the optimal behavior may often include information-gathering actions that are taken purely because they improve the agent's estimate of the current state, thereby allowing it to make better decisions in the future.

An MDP does not include the observation set, because the agent always knows with certainty the environment's current state.

Here it is assumed the objective is to maximize the expected total discounted reward over an infinite horizon.

The optimal policy, denoted $\pi^{*}$, yields the highest expected reward value for each belief state, compactly represented by the optimal value function $V^{*}$.

Value iteration applies a dynamic programming update to gradually improve on the value until convergence to an $\varepsilon$-optimal value function.[5][6]
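Concretely, treating beliefs as the states of a (continuous-state) MDP, the dynamic programming backup used by value iteration can be written as follows, where $b$ is a belief, $r(b, a)$ is the expected immediate reward of action $a$ in belief $b$, $\Pr(o \mid b, a)$ is the probability of next observing $o$, and $\tau(b, a, o)$ denotes the updated belief (the symbol $\tau$ is chosen here only for illustration):

$$V^{*}(b) = \max_{a} \Big[ r(b, a) + \gamma \sum_{o} \Pr(o \mid b, a)\, V^{*}\big(\tau(b, a, o)\big) \Big]$$

The maximization ranges over the actions and the sum over the possible observations.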

In practice, POMDPs are often computationally intractable to solve exactly.

To address this intractability, computer scientists have developed various approximate POMDP solutions.[7]

These solutions typically attempt to approximate the problem or solution with a limited number of parameters, plan only over a small part of the belief space online, or summarize the action-observation history compactly.

One family of approximate techniques is grid-based: the value function is computed for a set of points in the belief space, and interpolation is used to determine the optimal action for other belief states that are encountered but are not in the set of grid points.
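A minimal sketch of this idea, using nearest-neighbor interpolation over a hypothetical set of grid beliefs with precomputed values (real grid-based solvers use more careful interpolation and backup schemes):

```python
import numpy as np

def nearest_grid_value(belief, grid_beliefs, grid_values):
    """Approximate V(belief) by the value of the closest grid point.

    grid_beliefs: array (N, S) of belief points where the value was computed.
    grid_values:  array (N,) of the corresponding value estimates.
    """
    distances = np.linalg.norm(grid_beliefs - belief, axis=1)
    return grid_values[np.argmin(distances)]

def greedy_action(belief, actions, q_backup, grid_beliefs, grid_values):
    """Choose the action with the largest one-step lookahead value.

    q_backup(belief, action, value_fn) is an assumed helper that returns the
    one-step backup value of `action` in `belief` under the given value function.
    """
    value_fn = lambda b: nearest_grid_value(b, grid_beliefs, grid_values)
    return max(actions, key=lambda a: q_backup(belief, a, value_fn))
```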

More recent work makes use of sampling techniques, generalization techniques and exploitation of problem structure, and has extended POMDP solving into large domains with millions of states.[9][10]

For example, adaptive grids and point-based methods sample random reachable belief points to constrain the planning to relevant areas in the belief space.[13]
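A rough sketch of gathering reachable belief points by simulating the process forward from an initial belief, with `belief_update` and `sample_observation` as assumed helper functions:

```python
import random

def sample_reachable_beliefs(b0, actions, belief_update, sample_observation,
                             num_points=100, horizon=20):
    """Collect belief points reachable from the initial belief b0.

    belief_update(b, a, o) performs the Bayesian belief update and
    sample_observation(b, a) draws an observation that can occur after
    taking action a in belief b (both are assumed helpers).
    """
    points = [b0]
    while len(points) < num_points:
        b = b0
        for _ in range(horizon):
            a = random.choice(actions)
            o = sample_observation(b, a)
            b = belief_update(b, a, o)
            points.append(b)
            if len(points) >= num_points:
                break
    return points
```

Point-based solvers then perform value backups only at the collected points instead of over the whole belief simplex.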

Online planning algorithms approach large POMDPs by constructing a new policy for the current belief each time a new observation is received.[15]

Similar to MDPs, it is possible to construct online algorithms that find arbitrarily near-optimal policies and have no direct computational complexity dependence on the size of the state and observation spaces.[16]
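As a simplified illustration of online planning, the sketch below replans from the current belief with a depth-limited lookahead over beliefs; practical online solvers rely on sampling-based tree search rather than full enumeration, and `expected_reward`, `obs_prob` and `belief_update` are assumed model helpers:

```python
def plan_online(belief, depth, actions, observations,
                expected_reward, obs_prob, belief_update, gamma=0.95):
    """Depth-limited lookahead from the current belief.

    expected_reward(b, a), obs_prob(o, b, a) and belief_update(b, a, o) are
    assumed model helpers. Returns (best_value, best_action).
    """
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in actions:
        value = expected_reward(belief, a)
        for o in observations:
            p = obs_prob(o, belief, a)
            if p > 0.0:
                next_belief = belief_update(belief, a, o)
                future, _ = plan_online(next_belief, depth - 1, actions, observations,
                                        expected_reward, obs_prob, belief_update, gamma)
                value += gamma * p * future
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action
```

After executing the chosen action and receiving the next observation, the agent updates its belief and replans.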

Another line of approximate solution techniques for solving POMDPs relies on using (a subset of) the history of previous observations, actions and rewards up to the current time step as a pseudo-state.

Usual techniques for solving MDPs based on these pseudo-states can then be used (e.g. Q-learning).
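A rough sketch of this idea, using a fixed-length window of recent action-observation pairs as the pseudo-state together with tabular Q-learning; the environment interface (`reset`, `step`) and the window length are illustrative assumptions:

```python
import random
from collections import defaultdict, deque

def q_learning_on_histories(env, actions, episodes=1000, window=4,
                            alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning where the 'state' is a window of recent history.

    env is assumed to expose reset() -> observation and
    step(action) -> (observation, reward, done).
    """
    Q = defaultdict(float)  # keyed by (history tuple, action)
    for _ in range(episodes):
        obs = env.reset()
        history = deque([(None, obs)], maxlen=window)
        done = False
        while not done:
            h = tuple(history)
            # Epsilon-greedy action selection over the pseudo-state h.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(h, x)])
            obs, reward, done = env.step(a)
            history.append((a, obs))
            h_next = tuple(history)
            target = reward + (0.0 if done else
                               gamma * max(Q[(h_next, x)] for x in actions))
            Q[(h, a)] += alpha * (target - Q[(h, a)])
    return Q
```

Because the window is finite, the pseudo-state is only an approximation of the full belief, which is the usual trade-off of history-based methods.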

Reachability is an example of a Büchi condition (for instance, reaching a good state in which all robots are home).

coBüchi objectives correspond to traces that do not satisfy a given Büchi condition (for instance, not reaching a bad state in which some robot died).

The objective can be satisfied under different guarantees, for example almost surely (with probability 1) or positively (with probability strictly greater than zero). One also distinguishes the finite-memory case, in which the agent is a finite-state machine, from the general case, in which the agent has infinite memory.
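For context, with $F$ a set of good states and $\mathrm{Inf}(\rho)$ the set of states visited infinitely often along an infinite run $\rho$, the two acceptance conditions above are commonly written as

$$\text{Büchi: } \mathrm{Inf}(\rho) \cap F \neq \emptyset, \qquad \text{coBüchi: } \mathrm{Inf}(\rho) \cap F = \emptyset,$$

i.e. a coBüchi objective (over the same set $F$) is satisfied exactly by the runs that violate the corresponding Büchi condition.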

Notable applications include the use of a POMDP in the management of patients with ischemic heart disease,[19] assistive technology for persons with dementia,[9][10] the conservation of the critically endangered and difficult-to-detect Sumatran tigers,[20] and aircraft collision avoidance.