COMP9414 Artificial Intelligence
Assignment 2: Reinforcement Learning
Term 2, 2025
Due: Week 9, Friday, 1 August 2025, 5:00 PM AEST
Worth: 25 marks (25% of final grade)
Submission: Electronic submission via Moodle
1 Problem Overview
In this assignment, you will implement and compare two reinforcement learning algorithms, Q-learning and SARSA, within a static grid world environment.
The grid world consists of an 11 × 11 grid in which the agent must navigate from a random starting position to a designated goal while avoiding fixed obstacles arranged in two distinct patterns.
You will first develop the Q-learning and SARSA algorithms, implementing epsilon-greedy action selection to balance exploration and exploitation. To ensure a fair comparison between the algorithms, you must use identical hyperparameters across all experiments.
After training the baseline agents, you will simulate interactive reinforcement learning (IntRL) by introducing a teacher-student framework. In this setup, a pre-trained agent (the teacher) provides advice to a new agent (the student) during its training. Each algorithm will teach its own type: Q-learning teachers will guide Q-learning students, and SARSA teachers will guide SARSA students.
The teacher's advice will be configurable in terms of its availability (the probability of offering advice) and accuracy (the probability that the advice is correct).
You will evaluate the impact of teacher feedback on the student’s learning performance by running experiments with varying levels of availability and accuracy. By maintaining consistent hyperparameters across all four tasks, you can meaningfully compare how each algorithm performs with and without teacher guidance.
The goal is to understand how teacher interaction influences the learning efficiency of each algorithm and to determine which algorithm benefits more from teacher guidance under identical conditions.
2 Environment
The environment is provided in the env.py file. This section provides a brief overview of the grid world environment you will be working with.
For detailed information about the environment, including setup instructions, key functions, agent movement and actions, movement constraints, and the reward structure, please refer to the Environment User Guide provided separately as a PDF file. Ensure you familiarise yourself with the environment before proceeding with the assignment.
2.1 Environment Specifications
The environment has the following specifications:
– Grid Size: 11 × 11
– Obstacles: 10 cells arranged in two patterns:
◦ L-shaped pattern: (2,2), (2,3), (2,4), (3,2), (4,2)
◦ Cross pattern: (5,4), (5,5), (5,6), (4,5), (6,5)
– Goal Position: (10, 10)
– Rewards:
◦ Reaching the goal: +25
◦ Hitting an obstacle: -10
◦ Each step: -1
Figure 1: The 11 × 11 grid world environment with visual elements. The agent must navigate from its starting position to the goal whilst avoiding the L-shaped and cross-shaped obstacle patterns.
Environment Elements:
– Agent: a grey robot that navigates the grid world
– Goal: a red flag at position (10, 10); reaching it rewards +25 points
– Obstacles: construction barriers; hitting one incurs a penalty of -10 points
Status Bar Information:
– Episode number
– Teacher availability (%)
– Teacher accuracy (%)
3 Hyperparameter Guidelines
For this assignment, you should use the following baseline parameters:
– Learning rate (α): 0.1 to 0.5
– Discount factor (γ): 0.9 to 0.99
– Epsilon (ε):
◦ For exploration strategies with decay: initial value 0.8 to 1.0, final value 0.01 to 0.1
◦ Decay strategy: linear decay, exponential decay, or other decay strategies may be used
◦ For fixed epsilon (no decay): 0.1 to 0.3
– Number of episodes: 300 to 1000
– Maximum steps per episode: 50 to 100
You may experiment within these ranges, but must:
– Use identical parameters across all four tasks (Tasks 1–4) to ensure fair comparison between Q-learning and SARSA, both with and without teacher guidance
– Document your final chosen values and provide a brief justification (a sample configuration layout is sketched at the end of this section)
– For Tasks 3 and 4 (teacher experiments), you may use fewer episodes to reduce computational time while maintaining meaningful results.
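To keep the four tasks directly comparable, it can help to collect the chosen values in one place and pass the same object to every training run. The snippet below is a minimal sketch only; the dictionary name and the specific numbers are illustrative placeholders within the permitted ranges, not prescribed settings.

```python
# Illustrative hyperparameter configuration. HYPERPARAMS is a placeholder name,
# and every value below is just an example drawn from the permitted ranges.
HYPERPARAMS = {
    "alpha": 0.1,            # learning rate (0.1 to 0.5)
    "gamma": 0.95,           # discount factor (0.9 to 0.99)
    "epsilon_start": 1.0,    # initial epsilon if a decay schedule is used
    "epsilon_end": 0.05,     # final epsilon if a decay schedule is used
    "epsilon_decay": 0.995,  # per-episode multiplicative decay (one possible strategy)
    "episodes": 500,         # number of episodes (300 to 1000)
    "max_steps": 100,        # maximum steps per episode (50 to 100)
}
```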
4 Task 1: Implement Q-learning
In this task, you will implement the Q-learning algorithm and train an agent in the provided environment.
Implementation Requirements
Your Q-learning implementation should:
– Train the agent for the specified number of episodes using the hyperparameters from Section 3
– Use epsilon-greedy action selection for exploration
– Update Q-values according to the Q-learning update rule (a minimal training-loop sketch follows this list)
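The following sketch shows one way the training loop, epsilon-greedy selection, and the Q-learning update can fit together. It assumes a tabular Q-table indexed by grid position and action, and an environment object with `reset()` and `step(action)` methods returning the next state, reward, and a done flag; check the Environment User Guide for the actual interface of env.py, since these names are assumptions rather than the provided API.

```python
import numpy as np

def epsilon_greedy(q_table, state, epsilon, n_actions, rng):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(q_table[state]))

def train_q_learning(env, params, n_actions=4, seed=0):
    rng = np.random.default_rng(seed)
    # Tabular Q-values indexed by (row, col, action); adjust to the state
    # encoding actually used by env.py.
    q_table = np.zeros((11, 11, n_actions))
    epsilon = params["epsilon_start"]
    for episode in range(params["episodes"]):
        state = env.reset()                              # assumed to return (row, col)
        for _ in range(params["max_steps"]):
            action = epsilon_greedy(q_table, state, epsilon, n_actions, rng)
            next_state, reward, done = env.step(action)  # assumed interface
            # Q-learning (off-policy) update: bootstrap from the best action
            # in the next state, regardless of the action actually taken next.
            td_target = reward + params["gamma"] * np.max(q_table[next_state])
            q_table[state][action] += params["alpha"] * (td_target - q_table[state][action])
            state = next_state
            if done:
                break
        # One possible decay strategy: multiplicative decay with a floor.
        epsilon = max(params["epsilon_end"], epsilon * params["epsilon_decay"])
    return q_table
```

The per-episode reward, step count, and success flag should also be recorded inside this loop so that the metrics and plots described below can be produced.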
Metrics to Track
During training, track these metrics for each episode:
– Total Rewards per Episode: The cumulative reward accumulated during the episode
– Steps per Episode: The number of steps taken to complete the episode
– Successful Episodes: Whether the agent reached the goal
Required Outputs
After training is complete, you must produce the following:
– Generate a plot that displays the episode rewards over time. The plot should include:
◦ Raw episode rewards (with transparency to show variance)
◦ A moving average line (e.g., 50-episode window) for smoothing
◦ A horizontal line at y=0 to indicate the transition between positive and negative rewards
◦ Appropriate labels, title, and legend
Figure 2 shows a sample of what your Q-learning performance plot should look like (generated with simulated data). A minimal plotting sketch is given below.
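The sketch assumes the per-episode rewards were stored in a list during training; the function and argument names are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_rewards(episode_rewards, window=50, title="Q-learning: episode rewards"):
    rewards = np.asarray(episode_rewards, dtype=float)
    episodes = np.arange(len(rewards))
    plt.figure(figsize=(10, 5))
    # Raw rewards drawn with transparency to show the episode-to-episode variance.
    plt.plot(episodes, rewards, alpha=0.3, label="Episode reward")
    # Moving average over a fixed window to smooth the curve.
    if len(rewards) >= window:
        smoothed = np.convolve(rewards, np.ones(window) / window, mode="valid")
        plt.plot(episodes[window - 1:], smoothed, label=f"{window}-episode moving average")
    # Reference line separating positive and negative episode totals.
    plt.axhline(0, color="grey", linestyle="--", linewidth=1, label="y = 0")
    plt.xlabel("Episode")
    plt.ylabel("Total reward")
    plt.title(title)
    plt.legend()
    plt.tight_layout()
    plt.show()
```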
– Calculate and report the Success Rate, Average Reward per Episode, and Average Learning Speed, using the following formulas (a short computation sketch is given after this list):
– Success Rate:
\[ \text{Success Rate} = \frac{\text{Number of successful episodes}}{N} \times 100\% \tag{1} \]
where N is the total number of episodes.
– Average Reward per Episode:
\[ \bar{R} = \frac{1}{N} \sum_{i=1}^{N} R_i \tag{2} \]
where R_i is the total reward in the i-th episode.
– Average Learning Speed:
\[ \bar{S} = \frac{1}{N} \sum_{i=1}^{N} S_i \tag{3} \]
where S_i is the number of steps taken in the i-th episode (lower values indicate faster learning).
– Keep track of the following outputs:
◦ The three calculated metrics: average reward, success rate, and average learning speed
◦ The trained Q-table, as it will be used as the teacher in Task 3
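A short computation sketch for the three metrics, assuming per-episode rewards, step counts, and success flags were recorded during training (the function name and record format are assumptions), is shown below. Average learning speed is computed here as the mean number of steps per episode, following the definition of S_i above.

```python
import numpy as np

def summarise_run(episode_rewards, episode_steps, episode_success):
    """Summary metrics from per-episode records (Equations 1 to 3)."""
    n = len(episode_rewards)
    success_rate = 100.0 * np.sum(episode_success) / n     # Equation 1, in percent
    avg_reward = float(np.mean(episode_rewards))           # Equation 2
    avg_learning_speed = float(np.mean(episode_steps))     # Equation 3: mean steps per episode
    return {
        "success_rate": success_rate,
        "avg_reward": avg_reward,
        "avg_learning_speed": avg_learning_speed,
    }
```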
Figure 2: Sample Q-learning performance plot showing episode rewards and 50-episode moving average. This is generated with simulated data for demonstration purposes only.
5 Task 2: Implement SARSA
Important: You must use the same hyperparameters (learning rate, discount factor, epsilon, episodes, and maximum steps) that you chose in Task 1 to ensure fair comparison between Q-learning and SARSA.
In this task, you will implement the SARSA algorithm and train an agent in the provided environment.
Implementation Requirements
Your SARSA implementation should:
– Train the agent for the specified number of episodes using the same hyperparameters as Task 1
– Update Q-values according to the SARSA update rule (a sketch of the on-policy update follows this list)
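Only the update differs from the Q-learning sketch in Task 1: SARSA is on-policy, so the next action is chosen with the same epsilon-greedy policy before the update and is then actually executed. The sketch below reuses the `epsilon_greedy` helper and the assumed `env.reset()`/`env.step()` interface from the Task 1 sketch; those names are assumptions, not the provided API.

```python
import numpy as np  # the epsilon_greedy helper from the Task 1 sketch is reused here

def train_sarsa(env, params, n_actions=4, seed=0):
    rng = np.random.default_rng(seed)
    q_table = np.zeros((11, 11, n_actions))   # same state-encoding assumption as Task 1
    epsilon = params["epsilon_start"]
    for episode in range(params["episodes"]):
        state = env.reset()                                  # assumed interface
        action = epsilon_greedy(q_table, state, epsilon, n_actions, rng)
        for _ in range(params["max_steps"]):
            next_state, reward, done = env.step(action)      # assumed interface
            # SARSA (on-policy) update: bootstrap from the action that will
            # actually be taken in the next state.
            next_action = epsilon_greedy(q_table, next_state, epsilon, n_actions, rng)
            td_target = reward + params["gamma"] * q_table[next_state][next_action]
            q_table[state][action] += params["alpha"] * (td_target - q_table[state][action])
            state, action = next_state, next_action
            if done:
                break
        epsilon = max(params["epsilon_end"], epsilon * params["epsilon_decay"])
    return q_table
```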
Metrics to Track
During training, track these metrics for each episode:
– Total Rewards per Episode: The cumulative reward accumulated during the episode
– Steps per Episode: The number of steps taken to complete the episode
– Successful Episodes: Whether the agent reached the goal
Required Outputs
After training is complete, you must produce the following:
– Generate a plot that displays the episode rewards over time. The plot should include:
◦ Raw episode rewards (with transparency to show variance)
◦ A moving average line (e.g., 50-episode window) for smoothing
◦ A horizontal line at y=0 to indicate the transition between positive and negative rewards
◦ Appropriate labels, title, and legend
Figure 3 shows a sample of what your SARSA performance plot should look like
(generated with simulated data).
– Calculate and report the Success Rate, Average Reward per Episode, and Average Learning Speed using the same formulas as in Task 1 (Equations 1, 2, and 3).
– Keep track of the following outputs:
◦ The three calculated metrics: average reward, success rate, and average learning speed
◦ The trained Q-table, as it will be used as the teacher in Task 4
Figure 3: Sample SARSA performance plot showing episode rewards and 50-episode moving average. This is generated with simulated data for demonstration purposes only.
6 Baseline Comparison
After completing Tasks 1 and 2, you should compare the baseline performance of Q-learning and SARSA. This comparison will help you understand the fundamental differences between the two algorithms before introducing teacher guidance.
Creating the Comparison
Generate comparison visualisations that include:
– Learning Progress Comparison
◦ Episode rewards for both Q-learning and SARSA (with transparency)
◦ 50-episode moving averages for both algorithms
◦ The y=0 reference line
◦ Average reward values for each algorithm
– Success Rate Comparison
◦ Rolling success rates (50-episode window) for both algorithms
◦ Overall success rates for each algorithm
◦ Success rate shown on a scale from 0 to 100%
Figure 4 shows a sample baseline comparison plot (generated with simulated data).
Figure 4: Sample baseline comparison showing Q-learning vs SARSA performance. Left: Episode rewards with moving averages. Right: Success rate over time. This is generated with simulated data for demonstration purposes only.
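One way to build the two-panel comparison figure, assuming per-episode rewards and success flags were kept for both algorithms in Tasks 1 and 2, is sketched below; the function names are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt

def moving_average(values, window=50):
    values = np.asarray(values, dtype=float)
    return np.convolve(values, np.ones(window) / window, mode="valid")

def plot_baseline_comparison(q_rewards, q_success, sarsa_rewards, sarsa_success, window=50):
    fig, (ax_rewards, ax_success) = plt.subplots(1, 2, figsize=(14, 5))
    for name, rewards, success in [("Q-learning", q_rewards, q_success),
                                   ("SARSA", sarsa_rewards, sarsa_success)]:
        episodes = np.arange(len(rewards))
        # Left panel: raw rewards (transparent) plus the moving average.
        ax_rewards.plot(episodes, rewards, alpha=0.2)
        ax_rewards.plot(episodes[window - 1:], moving_average(rewards, window),
                        label=f"{name} (avg reward {np.mean(rewards):.1f})")
        # Right panel: rolling success rate over the same window, in percent.
        ax_success.plot(episodes[window - 1:], 100 * moving_average(success, window),
                        label=f"{name} (overall {100 * np.mean(success):.1f}%)")
    ax_rewards.axhline(0, color="grey", linestyle="--", linewidth=1)
    ax_rewards.set(xlabel="Episode", ylabel="Total reward", title="Learning progress")
    ax_success.set(xlabel="Episode", ylabel="Success rate (%)", title="Success rate", ylim=(0, 100))
    ax_rewards.legend()
    ax_success.legend()
    fig.tight_layout()
    plt.show()
```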
7 Teacher Feedback Mechanism
A teacher feedback system is a valuable addition to the training process of agents using Q-learning or SARSA. In this system, a pre-trained agent (the teacher) assists a new agent by offering advice during training. The advice provided by the teacher is based on two key probabilities:
– The availability factor determines whether advice is offered by the teacher at any step.
– The accuracy factor dictates whether the advice given is correct or incorrect.
7.1 How It Works
At each step, the system first determines whether the teacher provides advice (based on availability). If advice is given, it then determines whether the advice is correct (based on accuracy). These two checks ensure that advice is provided probabilistically and may not always be accurate.
The agent responds to the teacher’s advice as follows:
• If the generated advice is correct (given the accuracy parameter), the agent follows the teacher's recommended action (the action with the highest Q-value in the teacher's Q-table).
• If the generated advice is incorrect, the agent takes a random action, excluding the teacher's best action.
• If no advice is given, the agent continues its independent learning using its exploration strategy (epsilon-greedy).
Figure 5 illustrates the complete decision process for the teacher feedback mechanism.
Figure 5: Flowchart showing the teacher feedback mechanism. The student agent’s action selection depends on two probability checks: availability (whether the teacher provides advice) and accuracy (whether the advice is correct).
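A minimal sketch of this action-selection logic for the student agent, assuming tabular Q-tables for both student and teacher and the same grid-cell state encoding as in the earlier sketches (all function and parameter names are illustrative), is shown below.

```python
import numpy as np

def select_student_action(student_q, teacher_q, state, epsilon,
                          availability, accuracy, n_actions, rng):
    """One step of action selection under the teacher feedback mechanism."""
    if rng.random() < availability:
        # The teacher offers advice; its recommendation is its greedy action.
        best_action = int(np.argmax(teacher_q[state]))
        if rng.random() < accuracy:
            return best_action                               # correct advice: follow it
        # Incorrect advice: a random action excluding the teacher's best action.
        other_actions = [a for a in range(n_actions) if a != best_action]
        return int(rng.choice(other_actions))
    # No advice: the student falls back to its own epsilon-greedy exploration.
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(student_q[state]))
```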
8 Task 3: Teacher Advice Using Q-learning Agent
Important: You must use the same hyperparameters (learning rate, discount factor, epsilon, episodes, and maximum steps) that you chose in Task 1 to ensure fair comparison across all tasks.
In this task, you will implement the teacher-student framework where a pre-trained Q-learning agent (from Task 1) acts as a teacher to guide a new Q-learning student agent.
Implementation Requirements
Your implementation should:
– Load the trained Q-table from Task 1 to use as the teacher
– Train a new Q-learning student agent with teacher guidance
– Implement the teacher feedback mechanism as described in Section 7
– Test all combinations of teacher availability and accuracy parameters
Parameter Combinations
You must evaluate the following parameter combinations using nested loops:
– Availability: [0.1, 0.3, 0.5, 0.7, 1.0]
– Accuracy: [0.1, 0.3, 0.5, 0.7, 1.0]
This results in 25 different teacher configurations to test.
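A possible driver for the nested loops is sketched below. `train_student_with_teacher` is a placeholder for your Task 3 training function (assumed to return per-episode rewards, steps, and success flags), and `teacher_q_table`, `HYPERPARAMS`, and `summarise_run` refer back to the earlier sketches; all of these names are assumptions rather than provided code.

```python
# Evaluate all 25 teacher configurations with the same student hyperparameters.
availabilities = [0.1, 0.3, 0.5, 0.7, 1.0]
accuracies = [0.1, 0.3, 0.5, 0.7, 1.0]

results = []
for availability in availabilities:
    for accuracy in accuracies:
        # Placeholder training call; substitute your own Task 3 function here.
        rewards, steps, success = train_student_with_teacher(
            env, teacher_q_table, HYPERPARAMS,
            availability=availability, accuracy=accuracy)
        metrics = summarise_run(rewards, steps, success)     # Equations 1 to 3
        results.append({
            "Availability": availability,
            "Accuracy": accuracy,
            "Avg Reward": metrics["avg_reward"],
            "Success Rate (%)": metrics["success_rate"],
            "Avg Learning Speed": metrics["avg_learning_speed"],
        })
```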
Metrics to Track
For each teacher configuration, track these metrics during training:
– Total Rewards per Episode: The cumulative reward accumulated during each episode
– Steps per Episode: The number of steps taken to complete each episode
– Successful Episodes: Whether the agent reached the goal
Required Outputs
After training with all parameter combinations, you must:
– Calculate performance metrics for each configuration:
◦ Success Rate using Equation 1
◦ Average Reward per Episode using Equation 2
◦ Average Learning Speed using Equation 3
– Store all results in a structured format with the following data:
◦ Availability
◦ Accuracy
◦ Avg Reward
◦ Success Rate (%)
◦ Avg Learning Speed
– Generate a heatmap visualisation showing average rewards for all teacher configurations:
◦ X-axis: Availability values
◦ Y-axis: Accuracy values
◦ Colour intensity: Average reward achieved
◦ Include appropriate colour bar and labels
Figure 6 shows a sample of what your teacher performance heatmap should look like
(generated with simulated data).
Figure 6: Sample heatmap showing Q-learning performance with different teacher configurations. Note that accuracy increases from bottom to top. This is generated with simulated data for demonstration purposes only.
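One way to turn the stored results into the required heatmap, assuming the `results` list of dictionaries from the driver sketch above (names are illustrative), is the following.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_teacher_heatmap(results, availabilities, accuracies, title):
    """Heatmap of average reward with availability on the x-axis and accuracy on the y-axis."""
    grid = np.zeros((len(accuracies), len(availabilities)))
    for row in results:
        i = accuracies.index(row["Accuracy"])
        j = availabilities.index(row["Availability"])
        grid[i, j] = row["Avg Reward"]
    plt.figure(figsize=(7, 6))
    # origin="lower" so that accuracy increases from bottom to top, as in Figure 6.
    image = plt.imshow(grid, origin="lower", aspect="auto", cmap="viridis")
    plt.colorbar(image, label="Average reward")
    plt.xticks(range(len(availabilities)), availabilities)
    plt.yticks(range(len(accuracies)), accuracies)
    plt.xlabel("Availability")
    plt.ylabel("Accuracy")
    plt.title(title)
    plt.tight_layout()
    plt.show()
```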
9 Task 4: Teacher Advice Using SARSA Agent
Important: You must use the same hyperparameters (learning rate, discount factor, epsilon, episodes, and maximum steps) that you chose in Task 1 to ensure fair comparison across all tasks.
In this task, you will implement the teacher-student framework where a pre-trained SARSA agent (from Task 2) acts as a teacher to guide a new SARSA student agent.
Implementation Requirements
Your implementation should:
– Load the trained Q-table from Task 2 to use as the teacher
– Train a new SARSA student agent with teacher guidance
– Implement the teacher feedback mechanism as described in Section 7
– Test all combinations of teacher availability and accuracy parameters
– Use the same implementation structure as Task 3 for consistency
Parameter Combinations
You must evaluate the following parameter combinations using nested loops:
– Availability: [0.1, 0.3, 0.5, 0.7, 1.0]
– Accuracy: [0.1, 0.3, 0.5, 0.7, 1.0]
This results in 25 different teacher configurations to test.
Metrics to Track
For each teacher configuration, track these metrics during training:
– Total Rewards per Episode: The cumulative reward accumulated during each episode
– Steps per Episode: The number of steps taken to complete each episode
– Successful Episodes: Whether the agent reached the goal
Required Outputs
After training with all parameter combinations, you must:
– Calculate performance metrics for each configuration:
◦ Success Rate using Equation 1
◦ Average Reward per Episode using Equation 2
◦ Average Learning Speed using Equation 3
– Store all results in a structured format with the following data:
◦ Availability
◦ Accuracy
◦ Avg Reward
◦ Success Rate (%)
◦ Avg Learning Speed
– Generate a heatmap visualisation showing average rewards for all teacher configurations:
◦ X-axis: Availability values
◦ Y-axis: Accuracy values
◦ Colour intensity: Average reward achieved
◦ Include appropriate colour bar and labels
This will allow direct comparison with the Q-learning teacher results from Task 3 to determine which algorithm provides better teaching capabilities.
10 Testing and Discussing Your Code
After completing all four tasks, you should analyse and compare your results to understand the impact of teacher guidance on reinforcement learning performance.
10.1 Required Analysis
You must perform the following analysis to demonstrate your understanding:
1. Teacher Impact on Learning Curves
For selected teacher availability levels (e.g., 0.1, 0.5, 1.0), create plots showing how different teacher accuracies affect learning progress compared to the baseline. Each plot should show:
◦ Episode rewards (50-episode moving average) on the y-axis
◦ Episodes on the x-axis
◦ Multiple lines for different accuracy levels
◦ Baseline performance for reference
Figure 7 shows a sample comparison for Q-learning with 50% teacher availability.
Figure 7: Sample comparison of teacher accuracy impact on Q-learning performance with 50% availability. Generated with simulated data.
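A plotting sketch for this comparison, assuming the per-episode reward histories for each teacher configuration are stored in a dictionary keyed by (availability, accuracy) and that the Task 1 rewards serve as the baseline (both assumptions, not provided code), might look like this.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_accuracy_impact(reward_histories, baseline_rewards, availability,
                         accuracies=(0.1, 0.3, 0.5, 0.7, 1.0), window=50):
    """Compare teacher accuracies against the no-teacher baseline at one availability level."""
    def smooth(values):
        values = np.asarray(values, dtype=float)
        return np.convolve(values, np.ones(window) / window, mode="valid")

    plt.figure(figsize=(10, 5))
    # Baseline (no teacher) as a dashed reference line.
    plt.plot(smooth(baseline_rewards), color="black", linestyle="--",
             label="Baseline (no teacher)")
    # One curve per accuracy level at the chosen availability.
    for accuracy in accuracies:
        plt.plot(smooth(reward_histories[(availability, accuracy)]),
                 label=f"Accuracy {accuracy:.0%}")
    plt.axhline(0, color="grey", linewidth=1)
    plt.xlabel("Episode")
    plt.ylabel(f"Reward ({window}-episode moving average)")
    plt.title(f"Teacher accuracy impact at {availability:.0%} availability")
    plt.legend()
    plt.tight_layout()
    plt.show()
```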
2. Teacher Effectiveness Summary
Create a comprehensive analysis comparing how teacher guidance affects both algorithms:
◦ Generate learning curves showing Q-learning and SARSA performance with selected combinations of teacher availability and accuracy values
◦ Include baseline performance (no teacher) as a reference line
◦ Show multiple combinations of teacher availability (e.g., 0.1, 0.5, 1.0) and accuracy (e.g., 0.3, 0.7, 1.0) to demonstrate the range of teacher impact
◦ Use moving averages to smooth the learning curves for clarity
Figure 8 shows a sample teacher effectiveness summary analysis.
Figure 8: Sample teacher effectiveness summary showing the impact of different teacher configurations on both Q-learning and SARSA algorithms. This analysis helps identify optimal teacher parameters and compare algorithm responsiveness to guidance.