
Markov Decision Process (MDP)


MRP (Bellman equation):

    H(S) = u(S) + \gamma \sum_{S' \in \mathcal{S}} P(S, S') H(S')

MDP (Bellman equations):

    H(S, A) = u(S, A) + \gamma \sum_{S' \in \mathcal{S}} P(S, A, S') U(S')

    U(S) = \max_{A \in \mathcal{A}} H(S, A)

    \pi(S) = \arg\max_{A \in \mathcal{A}} H(S, A)
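The MDP Bellman backup above maps directly to code. A minimal sketch, assuming the transition probabilities, rewards, and discount factor are given as NumPy arrays (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def bellman_backup(P, u, gamma, U):
    """One MDP Bellman backup.

    P:     transition probabilities P(S, A, S'), shape (n_states, n_actions, n_states)
    u:     rewards u(S, A), shape (n_states, n_actions)
    gamma: discount factor
    U:     current state values U(S), shape (n_states,)
    """
    H = u + gamma * P @ U     # H(S, A) = u(S, A) + gamma * sum_{S'} P(S, A, S') U(S')
    U_new = H.max(axis=1)     # U(S) = max_A H(S, A)
    pi = H.argmax(axis=1)     # pi(S) = argmax_A H(S, A)
    return H, U_new, pi
```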
Example:
- States: S = {0, 1, 2, 3}
- Actions: A = {Left, Right}
- Reward: -1 for every step moved
- Discount factor: 0.5

[Figure: the four states 0-3 arranged in a line; action Left moves one state to the left, action Right one state to the right.]
Markov decision process (MDP)

MDP = Markov process + actions + reward functions + discount factor

[Figure: current state 1 with possible future states 2 and 3; taking action A1 yields reward 20 and leads to the possible future states with probabilities 0.1 and 0.9.]



MRP: example

[Figure: an example MRP over states 0-5 with transition probabilities 0.1/0.9, 1.0, 0.2/0.8 and per-state rewards; in particular, state 3 (reward 6) moves to state 4 (reward 2) with probability 0.2 and to state 5 (reward 9) with probability 0.8.]

    H(S = 3) = u(S = 3) + \gamma \bigl[ 0.2\, H(S = 4) + 0.8\, H(S = 5) \bigr]
             = 6 + \gamma \bigl[ 0.2 \cdot 2 + 0.8 \cdot 9 \bigr]

and similarly for H(S = 2), H(S = 1), ...
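As a concrete number (the slide does not give the value of the discount factor here, so take \gamma = 1 purely as an illustration):

    H(S = 3) = 6 + 1 \cdot (0.2 \cdot 2 + 0.8 \cdot 9) = 6 + 7.6 = 13.6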
State transition probability matrices for the two actions (rows: current state 0-3; columns: next state 0-3):

    P(A = \text{Left}) =
    \begin{bmatrix}
    1 & 0 & 0 & 0 \\
    1 & 0 & 0 & 0 \\
    0 & 1 & 0 & 0 \\
    0 & 0 & 1 & 0
    \end{bmatrix}
    \qquad
    P(A = \text{Right}) =
    \begin{bmatrix}
    1 & 0 & 0 & 0 \\
    0 & 0 & 1 & 0 \\
    0 & 0 & 0 & 1 \\
    0 & 0 & 0 & 1
    \end{bmatrix}
Value iteration on this example:

Period 1: Value H = (0.0, 0.0, 0.0, 0.0) for states 0-3; Action: none yet.




Application example: wireless energy transfer (image copyright: Forbes)

- Powercaster Tx and Rx; charging station: RF energy Tx/Rx
- Friis' formula; beamforming
- Electricity chargers: at different fixed locations, e.g., power outlets, base stations
- End users of energy: those who need energy, but are not covered by chargers
- Mobile energy gateway: transfers energy (wirelessly)
- MDP formulation of the scenario: action, state transition prob., reward function, discount factor
MDP

In an MDP, the decision maker chooses an "action/decision" in each state.

An MDP is a tuple \langle S, A, P, U, \gamma \rangle:
- S: the state space
- A: the action space
- P: the state transition probability function P(S, A, S')
- U: the reward/utility function (written u(S, A) in the equations)
- \gamma: the discount factor
Period 2: Value H = (0.0, -1.0, -1.0, -1.0) for states 0-3; Action: ←/→ (the two actions tie).
Period 3: Value H = (0.0, -1.0, -1.5, -1.5) for states 0-3; Action: ←/→ where the two actions still tie, ...
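These iterates can be reproduced with a short value-iteration loop. A minimal sketch, assuming state 0 is the goal (absorbing, no further reward) and every action taken in states 1-3 costs -1, using the transition matrices above and \gamma = 0.5 (this reading of the boundary-state rewards is an assumption, not stated explicitly on the slides):

```python
import numpy as np

# Transition matrices from the slides: action 0 = Left, action 1 = Right.
P_left = np.array([[1, 0, 0, 0],
                   [1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [0, 0, 1, 0]], dtype=float)
P_right = np.array([[1, 0, 0, 0],
                    [0, 0, 1, 0],
                    [0, 0, 0, 1],
                    [0, 0, 0, 1]], dtype=float)
P = np.stack([P_left, P_right], axis=1)   # shape (state, action, next state)

# Assumed reward model: -1 per action in states 1-3, 0 in the goal state 0.
u = np.array([[0, 0], [-1, -1], [-1, -1], [-1, -1]], dtype=float)
gamma = 0.5

U = np.zeros(4)                           # Period 1: H = (0, 0, 0, 0)
for period in range(2, 6):
    H = u + gamma * P @ U                 # H(S, A) = u(S, A) + gamma * sum_S' P(S, A, S') U(S')
    U = H.max(axis=1)
    print(f"Period {period}: U = {U}")
# Prints (0, -1, -1, -1), then (0, -1, -1.5, -1.5), converging to (0, -1, -1.5, -1.75).
```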



Markov decision process (MDP)

Email: yangzhang@

Outline:
- Markov Process
- Markov Reward Process (MRP)
- Markov Decision Process (MDP)
- MDP ...
- MDP ...
Backward induction

[Figure: the example MRP again; states 4 (reward 2) and 5 (reward 9) are terminal.]

    H(S = 4) = u(S = 4) = 2
    H(S = 5) = u(S = 5) = 9
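Backward induction evaluates states in reverse order, so each H(S) is computed only after the values of all of its successor states are known. A minimal generic sketch (the dictionary-based representation and the value of \gamma are illustrative assumptions, not from the slides):

```python
def backward_induction(states, u, P, gamma):
    """Evaluate an acyclic MRP by visiting successors before predecessors.

    states: states listed so that every successor appears before its predecessors
    u:      dict state -> reward u(S)
    P:      dict state -> list of (next_state, probability); empty list for terminal states
    gamma:  discount factor
    """
    H = {}
    for s in states:
        # H(S) = u(S) + gamma * sum_{S'} P(S, S') H(S')
        H[s] = u[s] + gamma * sum(p * H[s_next] for s_next, p in P[s])
    return H

# Values from the slide: terminal states 4 and 5, then state 3.
u = {4: 2, 5: 9, 3: 6}
P = {4: [], 5: [], 3: [(4, 0.2), (5, 0.8)]}
print(backward_induction([4, 5, 3], u, P, gamma=1.0)[3])   # 6 + 0.2*2 + 0.8*9 = 13.6 (gamma assumed 1)
```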
Value iteration (MDP)

[Figure: current state 1 with possible future states 2 and 3; action A1 (reward 20) and action A2 (reward 5), with transition probabilities 0.1/0.9 and 0.9/0.1.]

For n = 0, 1, 2, ... compute

    H_{n+1}(S, A) = u(S, A) + \gamma \sum_{S' \in \mathcal{S}} P(S, A, S') U_n(S')

Compute and store

    U_{n+1}(S) = \max_{A \in \mathcal{A}} H_{n+1}(S, A)

    \pi^*_{n+1}(S) = \arg\max_{A \in \mathcal{A}} H_{n+1}(S, A)

Return \pi(S), U(S), \forall S \in \mathcal{S}.
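Putting the three steps together, with a stopping rule added (the tolerance and iteration cap are my own additions, not part of the slide):

```python
import numpy as np

def value_iteration(P, u, gamma, tol=1e-8, max_iter=10_000):
    """Value iteration: repeat the Bellman backup until U stops changing.

    P: shape (n_states, n_actions, n_states), u: shape (n_states, n_actions).
    Returns the value function U(S) and a greedy policy pi(S) for all S.
    """
    U = np.zeros(u.shape[0])                   # U_0
    for _ in range(max_iter):
        H = u + gamma * P @ U                  # H_{n+1}(S, A)
        U_next = H.max(axis=1)                 # U_{n+1}(S) = max_A H_{n+1}(S, A)
        if np.max(np.abs(U_next - U)) < tol:   # converged
            U = U_next
            break
        U = U_next
    pi = (u + gamma * P @ U).argmax(axis=1)    # pi*(S) = argmax_A H(S, A)
    return U, pi                               # return pi(S), U(S) for all S in S
```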
Bellman equation

MRP = Markov process + reward/utility functions + discount factor
[Figure: the example MRP with rewards written symbolically, e.g., u(S = 3) and u(S = 4), alongside the transition probabilities 0.1/0.9, 1.0, 0.2/0.8.]
MRP

[Figure: current state 1 with possible future states 2 and 3; transition probability 0.9 (action A2, reward 5).]
Markov decision process (MDP)

MDP = Markov process + actions + reward functions + discount factor

[Figure: current state 1 with possible future states 2 and 3; action A1 has reward 20 and action A2 has reward 5, with transition probabilities 0.1 and 0.9.]
MDP

Variants: constrained MDP (CMDP) and partially observable MDP (POMDP).

An MDP is a tuple \langle S, A, P, U, \gamma \rangle: state space S, action space A, state transition probabilities P, reward/utility function U, and discount factor \gamma.
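As a data structure, the tuple can be written down directly; a minimal sketch (the field names and array shapes are my own choices, not from the slides):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MDP:
    """The tuple <S, A, P, U, gamma>."""
    n_states: int        # |S|
    n_actions: int       # |A|
    P: np.ndarray        # transition probabilities P(S, A, S'), shape (|S|, |A|, |S|)
    u: np.ndarray        # rewards u(S, A), shape (|S|, |A|)
    gamma: float         # discount factor

    def __post_init__(self):
        # Each P(S, A, .) must be a probability distribution over next states.
        assert self.P.shape == (self.n_states, self.n_actions, self.n_states)
        assert np.allclose(self.P.sum(axis=-1), 1.0)
```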
Markov property: the next state depends only on the current state,

    P(X_{t+1} \mid X_t, X_{t-1}, X_{t-2}, \cdots) = P(X_{t+1} \mid X_t)

[Figure: random walk over states 0-5; e.g., the transition probability P(4 \mid 3).]

If the process instead depends on the last two states,

    P(X_{t+1} \mid X_t, X_{t-1}, X_{t-2}, \cdots) = P(X_{t+1} \mid X_t, X_{t-1}),

it can still be made Markovian by augmenting the state: S_t = (X_t, X_{t-1}), with S_t \in \{(s, s), (s, r), (r, s), (r, r)\}.
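A tiny sketch of this state-augmentation trick (the two outcomes 's' and 'r' and the sample transition table are illustrative assumptions):

```python
from itertools import product

outcomes = ['s', 'r']   # assumed labels for the two outcomes

# Hypothetical second-order transition table P(X_{t+1} | X_t, X_{t-1}).
P2 = {
    ('s', 's'): {'s': 0.9, 'r': 0.1},
    ('s', 'r'): {'s': 0.7, 'r': 0.3},
    ('r', 's'): {'s': 0.4, 'r': 0.6},
    ('r', 'r'): {'s': 0.2, 'r': 0.8},
}

# Augmented state S_t = (X_t, X_{t-1}): the chain over S_t is first-order Markov.
P1 = {}
for (x_t, x_prev) in product(outcomes, outcomes):
    for x_next, p in P2[(x_t, x_prev)].items():
        P1[((x_t, x_prev), (x_next, x_t))] = p   # S_{t+1} = (X_{t+1}, X_t)

print(len(P1), "transitions over", len(list(product(outcomes, outcomes))), "augmented states")
```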

MRP

[Figure: the example MRP again, with its rewards (2, 9, 0, ...) and transition probabilities.]

The MRP value function satisfies

    H(S_t) = \mathbb{E}\bigl[ u(S_t) + \gamma H(S_{t+1}) \bigr]

that is,

    H(S) = u(S) + \gamma \sum_{S' \in \mathcal{S}} P(S, S') H(S')

Iterative evaluation: initialize H(S) \leftarrow 0, \forall S \in \mathcal{S}, then repeatedly update

    H(S) = u(S) + \gamma \sum_{S' \in \mathcal{S}} P(S, S') H(S')
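A minimal sketch of this iterative evaluation (the 3-state chain below is an illustrative placeholder, not the exact chain from the figure):

```python
import numpy as np

def evaluate_mrp(P, u, gamma, n_sweeps=100):
    """Iterative MRP evaluation: start from H(S) = 0 and repeatedly apply the Bellman update."""
    H = np.zeros(len(u))        # H(S) <- 0 for all S
    for _ in range(n_sweeps):
        H = u + gamma * P @ H   # H(S) = u(S) + gamma * sum_S' P(S, S') H(S')
    return H

# Illustrative 3-state MRP (numbers chosen only for the example).
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.2, 0.8],
              [0.0, 0.0, 1.0]])
u = np.array([1.0, 2.0, 0.0])
print(evaluate_mrp(P, u, gamma=0.9))
```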