In policy iteration, which of the following is/are true of the Policy Evaluation (PE) and Policy Improvement (PI) steps?

The values of states that are returned by PE may fluctuate between high and low values as the algorithm runs.

PE returns the fixed point of the Bellman operator Lπn.

PI can randomly select any greedy policy for a given value function vn.

Policy iteration always converges for a finite MDP.
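
As an illustration of how these two steps interact, here is a minimal policy-iteration sketch on a hypothetical two-state MDP; all transitions, rewards, and the discount factor are assumed for illustration only:

```python
# Hypothetical 2-state, 2-action MDP: P[s][a] = list of (next_state, prob), R[s][a] = reward.
GAMMA = 0.9
P = {0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
     1: {0: [(0, 1.0)], 1: [(1, 1.0)]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.0, 1: 2.0}}
STATES, ACTIONS = [0, 1], [0, 1]

def q_value(v, s, a):
    return R[s][a] + GAMMA * sum(p * v[s2] for s2, p in P[s][a])

def policy_evaluation(pi, tol=1e-10):
    # Iterate the Bellman operator for pi until (numerically) reaching its fixed point v_pi.
    v = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            new = q_value(v, s, pi[s])
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < tol:
            return v

def policy_iteration():
    pi = {s: 0 for s in STATES}
    while True:
        v = policy_evaluation(pi)
        # Greedy improvement; ties between greedy actions may be broken arbitrarily.
        new_pi = {s: max(ACTIONS, key=lambda a: q_value(v, s, a)) for s in STATES}
        if new_pi == pi:          # policy stable: converged (always happens for a finite MDP)
            return pi, v
        pi = new_pi

pi_star, v_star = policy_iteration()
```

With only finitely many deterministic policies and a monotone improvement step, the loop must terminate, which is why policy iteration always converges on a finite MDP.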

1 point

Consider the Monte-Carlo approach for policy evaluation. Suppose the states are S1, S2, S3, S4, S5, S6 and a terminal state. You sample one trajectory as follows:

S1 → S5 → S4 → S6 → terminal state.

Which among the following states can be updated from this sample?

S1

S2

S6

S4

Ans – A, C, D
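
For reference, the first-visit Monte-Carlo returns for this trajectory can be computed with a short sketch. The per-step rewards below are assumed (the question does not give them), but which states can be updated depends only on which states appear in the trajectory:

```python
# Trajectory S1 -> S5 -> S4 -> S6 -> terminal; rewards are hypothetical (1 per step).
GAMMA = 1.0
trajectory = ["S1", "S5", "S4", "S6"]   # states visited before termination
rewards    = [1.0, 1.0, 1.0, 1.0]       # assumed reward received after each step

returns = {}
G = 0.0
# Walk backwards, accumulating the discounted return from each state onward.
for s, r in zip(reversed(trajectory), reversed(rewards)):
    G = r + GAMMA * G
    returns[s] = G   # first-visit: the earliest occurrence overwrites later ones

# Only states appearing in the trajectory receive an update; S2 never does.
```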

1 point

Which of the following statements are true regarding Monte Carlo value approximation methods?

To evaluate a policy using these methods, a subset of trajectories in which all states are encountered at least once is enough to update all state-values.

Monte-Carlo value function approximation methods need knowledge of the full model.

Monte-Carlo methods update state-value estimates only at the end of an episode.

All of the above.

Ans – A, C

1 point

In every-visit Monte Carlo methods, multiple samples for one state are obtained from a single trajectory. Which of the following is true?

There is an increase in bias of the estimates.

There is an increase in variance of the estimates.

It does not affect the bias or variance of estimates.

Both bias and variance of the estimates increase.

Ans – A
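
The difference can be seen on a toy trajectory in which one state recurs; the states and rewards below are assumed for illustration:

```python
# Trajectory A -> B -> A -> terminal, with assumed rewards after leaving each state.
GAMMA = 1.0
states  = ["A", "B", "A"]
rewards = [1.0, 2.0, 3.0]

# Returns from each time step, computed backwards.
G, step_returns = 0.0, []
for r in reversed(rewards):
    G = r + GAMMA * G
    step_returns.append(G)
step_returns.reverse()        # return from each visit onward

first_visit = {}
every_visit = {}
for s, g in zip(states, step_returns):
    first_visit.setdefault(s, g)             # keep only the first occurrence
    every_visit.setdefault(s, []).append(g)  # keep every occurrence

# Every-visit averages correlated samples from the same episode (both visits to A),
# which is the source of the bias discussed above.
every_visit_estimate_A = sum(every_visit["A"]) / len(every_visit["A"])
```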

1 point

Which of the following statements are FALSE about solving MDPs using dynamic programming?

If the state space is large or computation power is limited, it is preferred to update only some states through random sampling or selecting states seen in trajectories.

Knowledge of transition probabilities is not necessary for solving MDPs using dynamic programming.

Methods that update only a subset of states at a time guarantee performance equal to or better than classic DP.

None of the above.

Ans – B, C
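
The idea in the first statement is asynchronous DP: update one chosen state at a time instead of sweeping all states. A minimal sketch on an assumed tiny MDP (note the transition model is still required; only the update order changes):

```python
import random

# Hypothetical deterministic MDP: NXT[s][a] = next state, R[s][a] = reward.
GAMMA = 0.9
NXT = {0: {0: 1, 1: 0}, 1: {0: 1, 1: 1}}
R   = {0: {0: 1.0, 1: 0.0}, 1: {0: 2.0, 1: 2.0}}

v = {0: 0.0, 1: 0.0}
rng = random.Random(0)
for _ in range(5000):
    s = rng.choice([0, 1])     # pick one state instead of sweeping all of them
    v[s] = max(R[s][a] + GAMMA * v[NXT[s][a]] for a in (0, 1))

# As long as every state keeps being selected, v still converges to v*.
```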

1 point

Select the correct statements about Generalized Policy Iteration (GPI).

GPI lets policy evaluation and policy improvement interact with each other regardless of the details of the two processes.

Before convergence, the policy evaluation step will usually cause the policy to no longer be greedy with respect to the updated value function.

GPI converges only when a policy has been found which is greedy with respect to its own value function.

The policy and value function found by GPI at convergence will both be optimal.

Ans – A, B, C, D

1 point

What is meant by "off-policy" Monte Carlo value function evaluation?

The policy being evaluated is the same as the policy used to generate samples.

The policy being evaluated is different from the policy used to generate samples.

The policy being learnt is different from the policy used to generate samples.

Ans – B
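
A common way to realize off-policy evaluation is importance sampling: returns generated under a behaviour policy b are reweighted by the likelihood ratio of the target policy π. The policies, actions, and returns below are assumed for illustration:

```python
# One-step episodes from a single state; action probabilities under each policy.
pi = {"left": 0.9, "right": 0.1}   # target policy being evaluated
b  = {"left": 0.5, "right": 0.5}   # behaviour policy that generated the data
episodes = [("left", 1.0), ("right", 0.0), ("left", 1.0)]  # (action, return) pairs

# Ordinary importance sampling: weight each sampled return by pi(a)/b(a).
weighted = [(pi[a] / b[a]) * g for a, g in episodes]
estimate = sum(weighted) / len(weighted)
```

Weighting corrects for the mismatch between the two action distributions, so the estimate is unbiased for vπ even though the samples came from b.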

1 point

Both value iteration and policy iteration produce a sequence of value vectors after some iterations, say v1, v2, …, vn for value iteration and v′1, v′2, …, v′n for policy iteration. Which of the following statements are true?

For all vi ∈ {v1, v2, …, vn}, there exists a policy for which vi is a fixed point.

For all v′i ∈ {v′1, v′2, …, v′n}, there exists a policy for which v′i is a fixed point.

For all vi ∈ {v1, v2, …, vn}, there may not exist a policy for which vi is a fixed point.

For all v′i ∈ {v′1, v′2, …, v′n}, there may not exist a policy for which v′i is a fixed point.

Ans – B, C
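
The distinction can be checked numerically on a tiny assumed MDP: each policy-evaluation result in policy iteration is the fixed point of some Lπ by construction, while a value-iteration iterate generally matches no policy's value function:

```python
# Hypothetical deterministic MDP: NXT[s][a] = next state, R[s][a] = reward.
GAMMA = 0.5
NXT = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 1}}
R   = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}

def v_of_policy(pi, tol=1e-12):
    # Fixed point of L_pi, found by iterating the policy's Bellman operator.
    v = {0: 0.0, 1: 0.0}
    while True:
        new = {s: R[s][pi[s]] + GAMMA * v[NXT[s][pi[s]]] for s in (0, 1)}
        if max(abs(new[s] - v[s]) for s in (0, 1)) < tol:
            return new
        v = new

# Value function of every deterministic policy (there are only four).
all_policy_values = [v_of_policy({0: a0, 1: a1}) for a0 in (0, 1) for a1 in (0, 1)]

# One value-iteration sweep starting from v = 0.
v0 = {0: 0.0, 1: 0.0}
v1 = {s: max(R[s][a] + GAMMA * v0[NXT[s][a]] for a in (0, 1)) for s in (0, 1)}

def close(u, w, eps=1e-6):
    return all(abs(u[s] - w[s]) < eps for s in (0, 1))

# The value-iteration iterate is not v_pi for any policy on this MDP.
matches_some_policy = any(close(v1, vp) for vp in all_policy_values)
```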

1 point

Given that L is a contraction mapping in a Banach space, with contraction factor γ < 1, which of the following is true?

L must be a linear transformation.

L has a unique fixed point.

∃s,|Lv(s)−Lu(s)|≤γ||v−u||

∀s,|Lv(s)−Lu(s)|≤γ||v−u||

Ans – B, D
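
The contraction and unique-fixed-point properties can be illustrated numerically. The operator below is a sketch: an affine Bellman-style backup with an assumed reward vector and γ = 0.8, not the general case:

```python
GAMMA = 0.8
R = [1.0, 2.0]

def L(v):
    # Lv = R + gamma * P v, where P sends every state to state 0 (a valid stochastic matrix).
    return [R[s] + GAMMA * v[0] for s in (0, 1)]

def max_norm(u, w):
    return max(abs(a - b) for a, b in zip(u, w))

u, w = [0.0, 0.0], [100.0, -50.0]
d_before = max_norm(u, w)
d_after = max_norm(L(u), L(w))     # contracted by at least gamma after one application

for _ in range(200):
    u, w = L(u), L(w)
# u and w now agree on the unique fixed point, v(0) = R[0]/(1-gamma), v(1) = R[1]+gamma*v(0).
```

Note the contraction bound must hold for every state simultaneously (the ∀s form); that is exactly what the max norm measures.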

1 point

Which of the following are true?

The Bellman optimality equation defines a contraction in a Banach space.

The Bellman optimality equation can be re-written as a linear transformation on the value function vector v, where each element of v corresponds to the value of a state of the MDP.

The final value estimates obtained at the stopping condition of value iteration will be the optimal values, v∗.

The final policy obtained by greedily selecting actions according to the returned value function v at the stopping condition of value iteration will be an optimal policy.
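
A value-iteration sketch on a small assumed MDP illustrates the last two statements: the values at the stopping condition only approximate v∗, yet the greedy policy extracted from them can already be optimal:

```python
# Hypothetical deterministic MDP: NXT[s][a] = next state, R[s][a] = reward.
GAMMA = 0.9
NXT = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 1}}
R   = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}
EPS = 1e-3

def sweep(v):
    # One application of the Bellman optimality operator (a max, hence nonlinear).
    return {s: max(R[s][a] + GAMMA * v[NXT[s][a]] for a in (0, 1)) for s in (0, 1)}

v = {0: 0.0, 1: 0.0}
while True:
    new = sweep(v)
    done = max(abs(new[s] - v[s]) for s in (0, 1)) < EPS
    v = new
    if done:
        break

greedy = {s: max((0, 1), key=lambda a: R[s][a] + GAMMA * v[NXT[s][a]]) for s in (0, 1)}
# Here v*(1) = 2/(1-0.9) = 20 and v*(0) = 1 + 0.9*20 = 19; v stops slightly below both,
# but the greedy policy (always take action 1) is already optimal.
```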