
COMP3702 Artificial Intelligence (Semester 2, 2021)
Assignment 3: DragonGame Reinforcement Learning
Key information:
• Due: 4pm, Monday 1 November
• This assignment will assess your skills in developing algorithms for solving Reinforcement Learning
Problems.
• Assignment 3 contributes 20% to your final grade.
• This assignment consists of two parts: (1) programming and (2) a report.
• This is an individual assignment.
• Both code and report are to be submitted via Gradescope (https://www.gradescope.com/).
• Your program (Part 1, 60/100) will be graded using the Gradescope code autograder, using testcases
similar to those in the support code provided at https://gitlab.com/3702-2021/a3-support.
• Your report (Part 2, 40/100) should fit the template provided, be in .pdf format and named according
to the format a3-COMP3702-[SID].pdf, where SID is your student ID. Reports will be graded by the
teaching team.
The DragonGame AI Environment
“Untitled Dragon Game”1 or simply DragonGame, is a 2.5D Platformer game in which the player must
collect all of the gems in each level and reach the exit portal, making use of a jump-and-glide movement
mechanic, and avoiding landing on lava tiles. DragonGame is inspired by the “Spyro the Dragon” game
series from the original PlayStation. In Assignment 3, actions may again have non-deterministic outcomes,
but in addition, the transition probabilities and reward values are unknown.
To solve a level, your AI agent must explore the environment and determine a policy (mapping from states
to actions) which collects all gems and reaches the exit while incurring the minimum expected cost, which is
equivalent to maximising the expected reward.
DragonGame as a Reinforcement Learning problem
In this assignment, you will write the components of a program to play DragonGame, with the objective of
finding a high-quality solution to the problem using various reinforcement learning algorithms. This assignment
will test your skills in defining reinforcement learning algorithms for a practical problem and understanding of
key algorithm features and parameters.
What is provided to you
We will provide supporting code in Python only, in the form of:
1. A class representing a DragonGame game map and a number of helper functions
2. A parser method to take an input file (testcase) and convert it into a DragonGame map
3. A policy visualiser
1The full game title was inspired by Untitled Goose Game, an Indie game developed by some Australians in 2019
4. A simulator script to evaluate the performance of your solution
5. Testcases to test and evaluate your solution
6. A solution file template
The support code can be found at: https://gitlab.com/3702-2021/a3-support. See the README.md for
more details. Autograding of code will be done through Gradescope, so that you can test your submission and
continue to improve it based on this feedback — you are strongly encouraged to make use of this feedback.
Your assignment task
Your task is to develop two reinforcement learning algorithms for computing paths (series of actions) for the
agent (i.e. the Dragon), and to write a report on your algorithms’ performance. You will be graded on both
your submitted program (Part 1, 60%) and the report (Part 2, 40%). These percentages will be scaled
to the 20% course weighting for this assessment item.
The provided support code provides a generative DragonGame environment, and your task is to submit
code implementing both of the following Reinforcement Learning algorithms:
1. Q-learning
2. SARSA
There isn’t an explicit requirement to use a particular learning type for each testcase, but the testcases are
designed to make a specific type advantageous for that testcase. To achieve separation between Q-learning
and SARSA results, the total reward received during training is tracked in addition to the reward
received during evaluation, with separate reward targets specified for each in the testcases.
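The two update rules differ only in the value used to bootstrap from the next state. As a minimal sketch (the state/action encodings, hyperparameter values and dictionary representation here are illustrative assumptions, not part of the support code):

```python
ALPHA = 0.1   # learning rate
GAMMA = 0.9   # discount factor

def q_learning_update(q, s, a, reward, s_next, actions):
    # Off-policy: bootstrap from the greedy (max-value) action in s_next,
    # regardless of which action the behaviour policy actually takes next.
    best_next = max(q.get((s_next, a2), 0.0) for a2 in actions)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + ALPHA * (reward + GAMMA * best_next - old)

def sarsa_update(q, s, a, reward, s_next, a_next):
    # On-policy: bootstrap from the action the agent actually chose in s_next.
    old = q.get((s, a), 0.0)
    nxt = q.get((s_next, a_next), 0.0)
    q[(s, a)] = old + ALPHA * (reward + GAMMA * nxt - old)
```

When the behaviour policy happens to pick the greedy action, the two updates coincide; they diverge whenever an exploratory action is taken in s_next.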
Once you have implemented and tested the algorithms above, you are to complete the questions listed in the
section “Part 2 - The Report” and submit the report to Gradescope.
More detail on what is required for the programming and report parts is given below.
Part 1 — The programming task
Your program will be graded using the Gradescope autograder, using testcases similar to those in the support
code provided at https://gitlab.com/3702-2021/a3-support.
Interaction with the testcases and autograder
We now provide you with some details explaining how your code will interact with the testcases and the
autograder (with special thanks to Nick Collins for his efforts making this work seamlessly, yet again!).
First, note that the Assignment 3 version of the class GameEnv (in game_env.py) differs from previous
assignments in that the transition and reward functions are now randomised and unknown to the agent.
The action outcome probabilities (for glide, supercharge, superjump actions and the ladder fall probability)
and costs/penalties (action_cost, collision_penalty, game_over_penalty) are randomised within some
fixed range based on the seed of the filename, and are all stored in private variables. Your agent does not
know these values, and therefore must interact with the environment to determine the optimal policy.
Implement your solution using the supplied solution.py template file. You are required to fill in the following
method stubs:
• __init__()
• run_training()
• select_action()
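The stubs might be organised along the following lines. This is an illustrative skeleton only: the class name, the environment attribute `ACTIONS`, and the hyperparameter values are assumptions, not the support code's actual interface — refer to the real solution.py template and game_env.py for the correct signatures.

```python
import random

class Solver:
    """Illustrative skeleton only; see the supplied solution.py template."""

    def __init__(self, game_env):
        self.game_env = game_env
        self.q_values = {}          # maps (state, action) -> estimated Q-value
        self.learning_rate = 0.1
        self.discount = 0.9
        self.epsilon = 0.2          # exploration probability

    def run_training(self):
        # Repeatedly simulate episodes, updating Q-values from sampled
        # (s, a, r, s') transitions, stopping before the time limit.
        pass

    def select_action(self, state):
        # Epsilon-greedy: explore with probability epsilon, otherwise
        # pick the action with the highest current Q-value estimate.
        actions = self.game_env.ACTIONS   # assumed list of legal actions
        if random.random() < self.epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: self.q_values.get((state, a), 0.0))
```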
You may add to the init method if required, and can add additional helper methods and classes (either in
solution.py or in files you create) if you wish. To ensure your code is handled correctly by the autograder,
you should avoid using any try-except blocks in your implementation of the above methods (as this can
interfere with our time-out handling). Also, unlike in the previous assignments, the autograder now
does not allow you to upload your own copy of game_env.py.
Refer to the documentation in solution.py for more details.
Grading rubric for the programming component (total marks: 60/100)
For marking, we will use five different testcases of ascending difficulty to evaluate your solution.
There will be a total of 60 code marks, consisting of:
• 20 Threshold Marks
– Program runs without errors (+5 marks)
– Program approximately solves at least 1 testcase within 2x time limit (+7.5 marks)
– Program approximately solves at least 2 testcases within 2x time limit (+7.5 marks)
• 40 Testcase Marks
– 5 testcases worth 8 marks each
– A maximum of 8 marks for each testcase, with deductions, proportional to the amount exceeded,
for taking more than the time limit or for a solution exceeding the target costs (training and
evaluation reward targets)
– The code used to compute your score is in simulator.py
– Program will be terminated after 2× time limit has elapsed
Part 2 — The report
The report tests your understanding of Reinforcement Learning and the methods you have used in your code,
and contributes 40/100 of your assignment mark.
Question 1. Q-learning is closely related to the Value Iteration algorithm for Markov decision processes.
a) (5 marks) Describe two key similarities between Q-learning and Value Iteration.
b) (5 marks) Give one key difference between Q-learning and Value Iteration.
For Questions 2, 3 and 4, consider testcase a3-t5.txt, and compare Q-learning and SARSA.
Question 2.
a) (5 marks) With reference to Q-learning and SARSA, explain the difference between off-policy and
on-policy reinforcement learning algorithms.
b) (5 marks) How does the difference between off-policy and on-policy algorithms affect the way in which
Q-learning and SARSA solve testcase a3-t5.txt? Give an example of an expected difference between
the way the two algorithms learn a policy.
For Questions 3 and 4, you are asked to plot the solution quality at each episode, as given by the 50-step
moving average reward received by your learning agent. At episode t, the 50-step moving average reward is
the average reward earned by your learning agent over episodes [t − 50, t], including episode restarts. If the
Q-values imply a poor-quality policy, this value will be low; if the Q-values correspond to a high-value policy,
the 50-step moving average reward will be higher. We use a moving average here because the reward is
received only occasionally and there are sources of randomness in both the transitions and the exploration strategy.
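One way to compute this quantity from a list of per-episode rewards is sketched below (the function and variable names are illustrative; a shorter window is used before a full 50 episodes have elapsed):

```python
def moving_average(episode_rewards, window=50):
    # At each episode index, average the rewards of the last `window`
    # episodes (or all episodes so far, before a full window exists).
    averages = []
    for t in range(len(episode_rewards)):
        lo = max(0, t - window + 1)
        chunk = episode_rewards[lo:t + 1]
        averages.append(sum(chunk) / len(chunk))
    return averages
```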
Question 3.
a) (5 marks) Plot the quality of the policy learned by Q-learning in testcase a3-t5.txt against episode
number for three different fixed values of the learning_rate (which is called α in the lecture notes and
in many texts and online tutorials), as given by the 50-step moving average reward (i.e. for this question,
do not adjust α over time, rather keep it the same value throughout the learning process). Your plot
should display the solution quality up to an episode count where the performance stabilises, with a
minimum of 2000 episodes (note the policy quality may still be noisy, but the algorithm’s performance
will stop increasing and its average quality will level out).
b) (5 marks) With reference to your plot, comment on the effect of varying the learning_rate.
Question 4.
a) (5 marks) Plot the quality of the learned policy against episode number under Q-learning and SARSA
in testcase a3-t5.txt, as given by the 50-step moving average reward. Your plot should display the
solution quality up to an episode count where the performance of both algorithms stabilises, with a
minimum of 2000 episodes.
b) (5 marks) With reference to your plot, compare the learning trajectories of the two algorithms and their
final solution quality. Discuss how the solution quality of Q-learning and SARSA changes over the course
of training, both while the algorithms are still learning and once they have stabilised.
Academic Misconduct
The University defines Academic Misconduct as involving “a range of unethical behaviours that are designed
to give a student an unfair and unearned advantage over their peers.” UQ takes Academic Misconduct very
seriously and any suspected cases will be investigated through the University’s standard policy (https://
ppl.app.uq.edu.au/content/3.60.04-student-integrity-and-misconduct). If you are found guilty,
you may be expelled from the University with no award.
It is the responsibility of the student to ensure that you understand what constitutes Academic Misconduct
and to ensure that you do not break the rules. If you are unclear about what is required, please ask.
It is also the responsibility of the student to take reasonable precautions to guard against unauthorised access
by others to their work, however stored and in whatever format, both before and after assessment.
In the coding part of COMP3702 assignments, you are allowed to draw on publicly-accessible resources and
provided tutorial solutions, but you must reference or attribute their sources, by doing the following:
• All blocks of code that you take from public sources must be referenced in adjacent comments in your
code.
• Please also include a list of references indicating code you have drawn on in your solution.py docstring.
However, you must not show your code to, or share your code with, any other student under any
circumstances. You must not post your code to public discussion forums (including Ed Discussion)
or save your code in publicly accessible repositories (check your security settings). You must not
look at or copy code from any other student.
All submitted files (code and report) will be subject to electronic plagiarism detection and misconduct proceedings
will be instituted against students where plagiarism or collusion is suspected. The electronic plagiarism
detection can detect similarities in code structure even if comments, variable names, formatting etc. are
modified. If you collude to develop your code or answer your report questions, you will be caught.
For more information, please consult the following University web pages:
• Information regarding Academic Integrity and Misconduct:
– https://my.uq.edu.au/information-and-services/manage-my-program/student-integrity-andconduct/academic-integrity-and-student-conduct
– http://ppl.app.uq.edu.au/content/3.60.04-student-integrity-and-misconduct
• Information on Student Services:
– https://www.uq.edu.au/student-services/
Late submission
Students should not leave assignment preparation until the last minute and must plan their workloads to meet
advertised or notified deadlines. It is your responsibility to manage your time effectively.
Late submission of the assignment will not be accepted. Unless advised, assessment items received after the
due date will receive a zero mark unless you have been approved to submit the assessment item after the due
date.
In the event of exceptional circumstances, you may submit a request for an extension. You can find guidelines
on acceptable reasons for an extension here: https://my.uq.edu.au/information-and-services/manage-myprogram/exams-and-assessment/applying-extension. All requests for extension must be submitted on the UQ
Application for Extension of Progressive Assessment form at least 48 hours prior to the submission deadline.