Q learning

Date: 2023-07-18

Today I kept working on exercise2 of the RL course and found that my Q-learning agent just would not converge. It is a very simple algorithm, yet after a long time of tinkering I still could not find the fault. At first I assumed the hyperparameters were badly tuned, but no amount of tuning helped. Eventually I discovered the mistake I had made:

target = reward + int(done) * self.gamma * max_action_q
new_estimate = old_estimate + self.alpha * (target - old_estimate)

Note the int(done) factor: it should be int(not done). This is an easy-to-miss logic error (and really a sign that my understanding of the algorithm was not deep enough). The Q-learning target is reward + gamma * max_a Q(s', a) for a non-terminal transition, but just reward when the episode terminates, since there is no next state to bootstrap from. int(done) does exactly the opposite: it zeroes the bootstrap term on every non-terminal step and keeps it on the terminal one.
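
For reference, here is a minimal sketch of the corrected update step, assuming a NumPy Q-table and the variable names from the snippet above. The class name, constructor arguments, and update signature are my own hypothetical choices for illustration, not the exercise's actual code:

import numpy as np

class QLearningAgent:
    # Minimal tabular Q-learning sketch; only the pieces relevant to the bug.
    def __init__(self, n_states, n_actions, alpha=0.5, gamma=0.99):
        self.q = np.zeros((n_states, n_actions))
        self.alpha = alpha  # learning rate
        self.gamma = gamma  # discount factor

    def update(self, state, action, reward, next_state, done):
        old_estimate = self.q[state, action]
        max_action_q = self.q[next_state].max()
        # int(not done): drop the bootstrap term exactly when the episode
        # ends, so terminal transitions are trained toward the bare reward.
        target = reward + int(not done) * self.gamma * max_action_q
        self.q[state, action] = old_estimate + self.alpha * (target - old_estimate)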

After the fix, training converged. Here is the training log:

EVALUATION: EP 1000 - MEAN RETURN -83.042 (424/500 failed episodes)
EVALUATION: EP 2000 - MEAN RETURN -13.322 (101/500 failed episodes)
EVALUATION: EP 3000 - MEAN RETURN 5.032 (13/500 failed episodes)
EVALUATION: EP 4000 - MEAN RETURN 7.898 (0/500 failed episodes)
EVALUATION: EP 5000 - MEAN RETURN 8.166 (0/500 failed episodes)
EVALUATION: EP 6000 - MEAN RETURN 7.848 (0/500 failed episodes)
EVALUATION: EP 7000 - MEAN RETURN 7.8 (0/500 failed episodes)
EVALUATION: EP 8000 - MEAN RETURN 7.88 (0/500 failed episodes)
EVALUATION: EP 9000 - MEAN RETURN 7.826 (0/500 failed episodes)
EVALUATION: EP 10000 - MEAN RETURN 7.888 (0/500 failed episodes)