Write this code in Python.
In Part 2, you will compare the performance of Monte Carlo control and Q-learning by running both algorithms on a small environment and then analyzing their progress toward finding the optimal policy and optimal state-value function. For the sake of comparison, we will use value iteration to find the optimal policy.

A. Create a 3×3 instance of the FrozenPlatform environment with sp_range=[0.2, 0.6], a start position of 1 (which is the default), no holes, and with random_state=1. Create an instance of the DPAgent class with gamma=1 and random_state=1 and use it to run value iteration to find the optimal policy for the environment. Display the environment, showing the optimal policy and with the cells shaded to indicate their value under the optimal policy. We will use the optimal policy for the purpose of comparison in the remaining steps.
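A sketch of one way Step A could look is given below. FrozenPlatform, DPAgent, and the parameter names sp_range, gamma, and random_state come from the assignment itself; the import path (rl_tools), the size/start/holes constructor arguments, and the value_iteration(), policy, V, and show() names are assumptions about the course-provided API rather than documented calls.

```python
import numpy as np

# Hypothetical import path -- substitute the module provided with the course materials.
from rl_tools import FrozenPlatform, DPAgent

# 3x3 environment with sp_range=[0.2, 0.6], default start position 1, and no holes.
env = FrozenPlatform(rows=3, cols=3, sp_range=[0.2, 0.6],
                     start=1, holes=None, random_state=1)

# Dynamic-programming agent; value iteration yields the optimal policy and state values.
dp_agent = DPAgent(env, gamma=1, random_state=1)
dp_agent.value_iteration()                      # assumed method name

# Show the grid with the optimal policy overlaid and cells shaded by their optimal value.
env.show(policy=dp_agent.policy, V=dp_agent.V)  # assumed display method
```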

B. Create an instance of the MCAgent class for the environment created in Step A, with gamma=1 and random_state=1. Do NOT set a policy for the agent; instead, allow it to randomly generate one. Run Monte Carlo control with 20,000 episodes, an epsilon of 0.1, and an alpha of 0.001. Then calculate the mean absolute difference between the optimal state-value function found by value iteration and the current Monte Carlo estimate. Finally, print the following message with the blank filled in with the appropriate value, rounded to 2 decimal places:
The mean absolute difference in V is ______.
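A corresponding sketch for Step B, reusing env and dp_agent from Step A. MCAgent and the hyperparameters (20,000 episodes, epsilon of 0.1, alpha of 0.001) come from the prose; the import path, the mc_control() method name, and the V attribute holding the current state-value estimate are assumptions.

```python
from rl_tools import MCAgent    # hypothetical import path

# No policy is supplied, so the agent generates one at random.
mc_agent = MCAgent(env, gamma=1, random_state=1)

# Monte Carlo control (assumed method name and keyword arguments).
mc_agent.mc_control(episodes=20000, epsilon=0.1, alpha=0.001)

# Mean absolute difference between the value-iteration values and the MC estimate.
mad_mc = np.mean(np.abs(dp_agent.V - mc_agent.V))
print(f"The mean absolute difference in V is {mad_mc:.2f}.")
```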

C. Display the environment, showing the policy found by Monte Carlo control and with the cells shaded to indicate their value according to the Monte Carlo estimate. Mentally note whether or not the optimal policy was found by Monte Carlo control.
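Step C is a single display call; the show() signature below is the same assumed one used in Step A, now passed the Monte Carlo policy and value estimates.

```python
# Grid shaded by the Monte Carlo value estimates, with the learned policy overlaid.
env.show(policy=mc_agent.policy, V=mc_agent.V)  # assumed display method
```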

D. Call the plot_v_history() method of your MCAgent to visualize the evolution of the state-value estimates during the Monte Carlo control process. Set the target parameter of the method to be equal to the true state-value function of the policy (as found by value iteration).
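The plot_v_history() method and its target parameter are named in the assignment itself; only dp_agent.V, as the attribute holding the value-iteration solution, is an assumed name.

```python
# Evolution of the MC state-value estimates, plotted against the true values as the target.
mc_agent.plot_v_history(target=dp_agent.V)
```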

E. Create an instance of the TDAgent class for the environment created in Step A, with gamma=1 and random_state=1. Do NOT set a policy for the agent; instead, allow it to randomly generate one. Run Q-learning with 20,000 episodes, an epsilon of 0.1, and an alpha of 0.001. Calculate the mean absolute difference between the optimal state-value function found by value iteration and the current Q-learning estimate. Then print the same message as in Step B, with the blank filled in with the appropriate value, rounded to 2 decimal places.
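Step E mirrors Step B with a temporal-difference agent. TDAgent and the hyperparameters come from the prose; the import path, the q_learning() method name, and the V attribute are assumptions.

```python
from rl_tools import TDAgent    # hypothetical import path

# No policy is supplied, so the agent generates one at random.
td_agent = TDAgent(env, gamma=1, random_state=1)

# Q-learning (assumed method name and keyword arguments).
td_agent.q_learning(episodes=20000, epsilon=0.1, alpha=0.001)

# Mean absolute difference between the value-iteration values and the Q-learning estimate.
mad_td = np.mean(np.abs(dp_agent.V - td_agent.V))
print(f"The mean absolute difference in V is {mad_td:.2f}.")
```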

F. Display the environment, showing the policy found by Q-learning and with the cells shaded to indicate their value according to the Q-learning estimate. Mentally note whether or not the optimal policy was found by Q-learning.
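As in Step C, a single display call with the assumed show() signature, this time passing the Q-learning results.

```python
# Grid shaded by the Q-learning value estimates, with the learned policy overlaid.
env.show(policy=td_agent.policy, V=td_agent.V)  # assumed display method
```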

G. Call the plot_v_history() method of your TDAgent to visualize the evolution of the state-value estimates during the Q-learning process. Set the target parameter of the method to be equal to the true state-value function of the policy (as found by value iteration).
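Same call as Step D, but on the TDAgent; again only dp_agent.V is an assumed attribute name.

```python
# Evolution of the Q-learning state-value estimates against the value-iteration target.
td_agent.plot_v_history(target=dp_agent.V)
```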

H. Use the markdown cell provided to share your observations about the performance of Monte Carlo control vs. Q-learning for this environment. Did either method find the optimal policy? Which one had a larger average error (as measured by mean absolute difference) after 20,000 episodes? Does one seem noisier (i.e., having more variance) than the other?


