
A Comparison of Temporal-Difference(0) and Constant-α Monte Carlo Methods on the Random Walk Task | by Tingsong Ou | Aug, 2023


Image generated by Midjourney with a paid subscription, which complies with general commercial terms [1].

The Monte Carlo (MC) and Temporal-Difference (TD) methods are both fundamental techniques in the field of reinforcement learning; they solve the prediction problem based on experience gained from interacting with the environment rather than on a model of the environment. However, the TD method combines ideas from MC methods and Dynamic Programming (DP), which makes it differ from the MC method in its update rule, its use of bootstrapping, and its bias/variance profile. TD methods have also been shown to perform better and converge faster than MC methods in general.

In this post, we will compare TD and MC, or more specifically, the TD(0) and constant-α MC methods, on a simple grid environment and on the more comprehensive Random Walk [2] environment. I hope this post helps readers interested in reinforcement learning better understand how each method updates the state-value function and how their performance differs in the same testing environment.

We will implement the algorithms and comparisons in Python; the libraries used in this post are as follows:

python==3.9.16
numpy==1.24.3
matplotlib==3.7.1

Introduction to TD(0) and constant-α MC

The constant-α MC method is a regular MC method with a constant step-size parameter α; this constant parameter makes the value estimate more responsive to recent experience. In practice, the choice of α is a trade-off between stability and adaptability. The following is the MC method's equation for updating the state-value function at time t:
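In the notation of Sutton and Barto [2], where G_t is the return that follows time t, the update reads:

V(S_t) \leftarrow V(S_t) + \alpha \big[ G_t - V(S_t) \big]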

TD(0) is the special case of TD(λ) that looks only one step ahead and is the simplest form of TD learning. This method updates the state-value function with the TD error, the difference between the estimated value of the current state and the reward plus the estimated value of the next state. The constant step-size parameter α works the same way as in the MC method above. The following is the TD(0) equation for updating the state-value function at time t:
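With R_{t+1} denoting the immediate reward and γ the discount factor, the standard TD(0) update is:

V(S_t) \leftarrow V(S_t) + \alpha \big[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \big]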

Generally speaking, MC and TD methods differ in three respects:

  1. Update rule: MC methods update values only after the episode ends; this can be problematic if the episode is very long, which slows the program down, or in continuing tasks that have no episodes at all. In contrast, the TD method updates value estimates at every time step; this is online learning and can be particularly useful in continuing tasks.
  2. Bootstrapping: The term "bootstrapping" in reinforcement learning refers to updating value estimates based on other value estimates. The TD(0) method bases its update on the value of the following state, so it is a bootstrapping method; in contrast, MC does not bootstrap, since it updates values directly from the observed returns (G).
  3. Bias/Variance: MC methods are unbiased because they estimate values from the actual returns observed, without making estimates during the episode; however, MC methods have high variance, especially when the number of samples is low. In contrast, TD methods are biased because they bootstrap, and the bias can vary with the specific implementation; TD methods have low variance because they use the immediate reward plus the estimate of the next state, which smooths out the fluctuations that arise from randomness in rewards and actions.

Comparing TD(0) and constant-α MC on a simple gridworld setup

To make their difference more obvious, we can set up a simple gridworld test environment with two fixed trajectories, run both algorithms on this setup until they converge, and compare how differently they update the values.

First, we can set up the testing environment with the following code:
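Below is a minimal sketch of such a setup; the grid size, the terminal-cell coordinates, and the exact shape of the two preset paths are placeholder assumptions for illustration:

import numpy as np

# A minimal sketch: a 5x5 gridworld with hand-picked terminal cells and
# two fixed trajectories that overlap before diverging. The grid size,
# cell coordinates, and paths are assumptions, not the author's values.
GRID_SHAPE = (5, 5)

# Terminal cells and their rewards (red cell: +1, blue cells: -1).
TERMINAL_REWARDS = {
    (0, 4): +1.0,   # "red" cell
    (2, 4): -1.0,   # "blue" cell
    (4, 2): -1.0,   # another "blue" cell
}

# Two preset trajectories, given as the sequences of visited cells.
PATH_TO_RED = [(4, 0), (3, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 3), (0, 4)]
PATH_TO_BLUE = [(4, 0), (3, 0), (2, 0), (2, 1), (2, 2), (2, 3), (2, 4)]

def rollout(path):
    """Turn a fixed path into (state, reward, next_state) transitions."""
    transitions = []
    for s, s_next in zip(path[:-1], path[1:]):
        reward = TERMINAL_REWARDS.get(s_next, 0.0)
        transitions.append((s, reward, s_next))
    return transitions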

Figure 1 Left: environment setup. Right: preset paths. Source: figure by the author

The left panel above shows a simple gridworld environment. All the colored cells represent terminal states; the agent gets a +1 reward when it steps into the red cell but a -1 reward when it steps into a blue cell. All other steps on the grid return a reward of zero. The right panel marks two preset paths: one ends at a blue cell while the other stops at the red cell; the intersection of the two paths helps maximize the value difference between the two methods.

Then we can use the equations from the previous section to evaluate the environment. We do not discount the return or the estimate, and we set α to a small value of 1e-3. When the absolute sum of the value increments falls below a threshold of 1e-3, we consider the values converged.
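A sketch of the evaluation loop under those settings, continuing the setup sketched above (it reuses the rollout helper and the two preset paths; the stopping rule, summing the per-state change over one full sweep of both paths, is my own reading of the convergence check):

def evaluate(update_fn, alpha=1e-3, tol=1e-3):
    """Sweep both preset trajectories until the summed per-state change
    over one full sweep falls below tol."""
    V = {}  # state -> value; unseen states (including terminals) count as 0
    while True:
        before = dict(V)
        for path in (PATH_TO_RED, PATH_TO_BLUE):
            update_fn(V, rollout(path), alpha)
        delta = sum(abs(v - before.get(s, 0.0)) for s, v in V.items())
        if delta < tol:
            return V

def td0_update(V, transitions, alpha):
    # TD(0): move V(s) toward r + V(s_next); no discounting.
    for s, r, s_next in transitions:
        target = r + V.get(s_next, 0.0)
        V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))

def mc_update(V, transitions, alpha):
    # Constant-alpha MC: move V(s) toward the undiscounted return G after s.
    G = 0.0
    for s, r, _ in reversed(transitions):
        G += r
        V[s] = V.get(s, 0.0) + alpha * (G - V.get(s, 0.0))

V_td = evaluate(td0_update)
V_mc = evaluate(mc_update)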

The result of the evaluation is as follows:

Figure 2 The result of TD(0) and constant-α MC evaluations. Source: figure by the author

The two algorithms' different ways of estimating values become quite apparent in the image above. The MC method is loyal to the returns of each path, so the values along each path directly reflect how that path ends. In contrast, the TD method provides a better prediction, especially on the blue path: the values on the blue path before the intersection also indicate the possibility of reaching the red cell.

With this minimal case in mind, we are ready to move on to a much more complicated example and find out how the two methods differ in performance.

The Random Walk task is a simple Markov reward process proposed by Sutton et al. for TD and MC prediction purposes [2], as the figure below shows. In this task, the agent starts from the center node C. At each node, the agent takes a step to the right or to the left with equal probability. There are two terminal states at the ends of the chain. The reward for reaching the left end is 0, and the reward for reaching the right end is +1. All steps before termination generate a reward of 0.

Figure 3 Random Walk. Source: figure by the author

And we can use the following code to create the Random Walk environment:
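A minimal sketch of such an environment, with placeholder class and method names of my own choosing:

import numpy as np

class RandomWalk:
    """Five non-terminal nodes A-E; stepping left from A terminates with reward 0,
    stepping right from E terminates with reward +1, every other step gives 0."""

    NODES = ["A", "B", "C", "D", "E"]

    def __init__(self, rng=None):
        self.rng = rng or np.random.default_rng()

    def episode(self):
        """Run one episode under the random policy; return the list of
        (state, reward, next_state) transitions, with None as the terminal state."""
        idx = 2                                  # start at the center node C
        transitions = []
        while True:
            state = self.NODES[idx]
            idx += 1 if self.rng.random() < 0.5 else -1
            if idx < 0:                          # fell off the left end
                transitions.append((state, 0.0, None))
                return transitions
            if idx >= len(self.NODES):           # reached the right end
                transitions.append((state, 1.0, None))
                return transitions
            transitions.append((state, 0.0, self.NODES[idx]))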

=====Test: checking environment setup=====

Links: None ← Node A → Node B
Reward: 0 ← Node A → 0

Links: Node A ← Node B → Node C
Reward: 0 ← Node B → 0

Links: Node B ← Node C → Node D
Reward: 0 ← Node C → 0

Links: Node C ← Node D → Node E
Reward: 0 ← Node D → 0

Links: Node D ← Node E → None
Reward: 0 ← Node E → 1

The true value of each node of this environment under the random policy is [1/6, 2/6, 3/6, 4/6, 5/6]. These values can be obtained by policy evaluation with the Bellman equation:
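For reference, the Bellman expectation equation under a policy π is:

v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_\pi(s') \big]

With γ = 1 and the 50/50 random policy, this reduces to v_\pi(s) = \tfrac{1}{2} v_\pi(\text{left neighbor}) + \tfrac{1}{2} v_\pi(\text{right neighbor}), with the terminal values fixed at 0 on the left and 1 on the right; solving the resulting linear system gives the values above.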

Our task here is to find out how close the values estimated by each algorithm come to the true values; we take the algorithm whose estimated value function is closer to the true value function, measured by the averaged root mean square (RMS) error, to be the better-performing one.
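Concretely, the error reported for each episode can be written as the RMS error over the five non-terminal states, then averaged over independent runs:

\text{RMS} = \sqrt{ \frac{1}{5} \sum_{s \in \{A, \dots, E\}} \big( V(s) - v_\pi(s) \big)^2 }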

Algorithms

With the environment ready, we can start running both methods on the Random Walk environment and compare their performance. First, let's take a look at both algorithms:

Source: algorithm written in LaTeX by the author
Source: algorithm written in LaTeX by the author

As mentioned earlier, the MC method has to wait until the episode ends and then update the values backward from the tail of the trajectory, whereas the TD method updates the values incrementally. This difference implies a small trick when initializing the state-value function: in MC, the state-value function does not need to include the terminal states, whereas in TD(0), the function should include the terminal states with a value of 0, because the TD(0) method always looks one step ahead, including the step into a terminal state.
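A minimal sketch of both prediction algorithms on this task, reusing the RandomWalk sketch above; the function names, the default initial value of 0.5, and the rms_error helper are my own additions:

import numpy as np

TRUE_VALUES = np.array([1, 2, 3, 4, 5]) / 6.0

def rms_error(V, env):
    est = np.array([V[s] for s in env.NODES])
    return np.sqrt(np.mean((est - TRUE_VALUES) ** 2))

def td0_prediction(env, alpha, n_episodes, v_init=0.5):
    # TD(0): the value table includes the terminal state (None) with value 0.
    V = {s: v_init for s in env.NODES}
    V[None] = 0.0
    history = []
    for _ in range(n_episodes):
        for s, r, s_next in env.episode():
            V[s] += alpha * (r + V[s_next] - V[s])   # undiscounted TD(0) update
        history.append(rms_error(V, env))
    return np.array(history)

def mc_prediction(env, alpha, n_episodes, v_init=0.5):
    # Constant-alpha MC: only non-terminal states are stored; values are
    # updated after the episode ends, from the tail of the trajectory.
    V = {s: v_init for s in env.NODES}
    history = []
    for _ in range(n_episodes):
        transitions = env.episode()
        G = 0.0
        for s, r, _ in reversed(transitions):
            G += r                                   # undiscounted return
            V[s] += alpha * (G - V[s])
        history.append(rms_error(V, env))
    return np.array(history)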

Implementation

The choice of α in this implementation follows the values proposed in the book [2]: the parameters for the MC method are [0.01, 0.02, 0.03, 0.04], while those for the TD method are [0.05, 0.10, 0.15]. I wondered why the author did not use the same parameter set for both algorithms until I ran the MC method with the TD parameters: the TD parameters are too high for the MC method and thus cannot reveal MC's best performance. Therefore, we will stick to the book's setup in the parameter sweep. Now, let's run both algorithms to find out how they perform on the Random Walk setup.
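A sketch of the parameter sweep, assuming the RandomWalk, td0_prediction, and mc_prediction sketches above; the run and episode counts follow the numbers quoted below (100 runs, 150 episodes):

import numpy as np
import matplotlib.pyplot as plt

def averaged_curve(method, alphas, n_runs=100, n_episodes=150):
    """Average the per-episode RMS error over n_runs independent runs."""
    curves = {}
    for alpha in alphas:
        runs = np.zeros((n_runs, n_episodes))
        for i in range(n_runs):
            runs[i] = method(RandomWalk(), alpha, n_episodes)
        curves[alpha] = runs.mean(axis=0)
    return curves

for alpha, curve in averaged_curve(td0_prediction, [0.05, 0.10, 0.15]).items():
    plt.plot(curve, label=f"TD(0), alpha={alpha}")
for alpha, curve in averaged_curve(mc_prediction, [0.01, 0.02, 0.03, 0.04]).items():
    plt.plot(curve, label=f"MC, alpha={alpha}", linestyle="--")
plt.xlabel("Episodes")
plt.ylabel("Averaged RMS error")
plt.legend()
plt.show()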

Result

Figure 4 Algorithm comparison result. Source: figure by the author

The result after 100 runs of the comparison is shown in the image above. The TD method generally yields better value estimates than the MC method, and TD with α = 0.05 gets really close to the true values. The graph also shows that the MC method has higher variance than the TD method, as the orchid lines fluctuate more than the steel blue lines.

It is worth noticing that for both algorithms, when α is (relatively) high, the RMS loss first goes down and then up again. This phenomenon is the combined effect of the value initialization and the α value. We initialized the values to a relatively high 0.5, which is higher than the true values of nodes A and B. Since the random policy has a 50% chance of choosing a "wrong" step, one that takes the agent away from the right terminal state, a higher α value also amplifies those wrong steps and pushes the result away from the true values.

Now let's try reducing the initial value to 0.1 and running the comparison again to see if the problem is alleviated:

Figure 5 Algorithm comparison result with initial value 0.1. Source: figure by the author

The lower initial value apparently helps relieve the problem; there is no noticeable "go down, then go up" effect. However, the side effect of a lower initial value is that learning is less efficient, as the RMS loss never goes below 0.05 even after 150 episodes. Therefore, there is a trade-off between the initial value, the step-size parameter, and the algorithm's performance.

The last thing I want to bring up in this post is a comparison of batch training for the two algorithms.

Imagine we are facing the following scenario: we have only collected a limited amount of experience on the Random Walk task, or we can only run a certain number of episodes because of limits on time and computation. The idea of batch updating [2] was proposed to deal with such a scenario by fully utilizing the existing trajectories.

The idea of batch training is to update the values on a batch of trajectories repeatedly until the values converge to an answer, and the values are only updated once all the experience in the batch has been fully processed. Let's implement batch training for both algorithms on the Random Walk environment, as sketched below, to see if the TD method still performs better than the MC method.
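A sketch of batch updating for both methods, again reusing the RandomWalk sketch above; the step size, tolerance, initial value, and fixed batch size are placeholder choices, and unlike the book's experiment (which grows the batch one episode at a time and reconverges after each addition) this sketch only shows the core batch-update rule:

def batch_value(episodes, use_td, alpha=1e-3, tol=1e-4, v_init=0.0):
    """Batch updating: accumulate the increments over every stored transition,
    apply them all at once, and repeat until the value function stops moving."""
    V = {s: v_init for s in RandomWalk.NODES}
    V[None] = 0.0
    while True:
        increments = {s: 0.0 for s in RandomWalk.NODES}
        for transitions in episodes:
            G = 0.0
            for s, r, s_next in reversed(transitions):
                G += r
                if use_td:
                    increments[s] += alpha * (r + V[s_next] - V[s])   # TD(0) error
                else:
                    increments[s] += alpha * (G - V[s])               # MC error
        for s, inc in increments.items():
            V[s] += inc
        if sum(abs(inc) for inc in increments.values()) < tol:
            return V

env = RandomWalk()
episodes = [env.episode() for _ in range(10)]   # a small, fixed batch of experience
V_batch_td = batch_value(episodes, use_td=True)
V_batch_mc = batch_value(episodes, use_td=False)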

Result

Figure 6 The batch training result. Source: figure by the author

The result of batch training shows that the TD method still does better than the MC method with limited experience, and the gap between the performance of the two algorithms is quite apparent.

In this post, we discussed the differences between the constant-α MC method and the TD(0) method and compared their performance on the Random Walk task. The TD method outperforms the MC method in all the tests in this post, so TD is a preferable default choice for reinforcement learning tasks. However, this does not mean that TD is always better than MC, since the latter has one clear advantage: it has no bias. If we are facing a task that has no tolerance for bias, then MC could be the better choice; otherwise, TD handles general cases better.

References

[1] Midjourney Terms of Service: https://docs.midjourney.com/docs/terms-of-service

[2] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

My GitHub repo for this post: [Link].

