Q learning
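Today I continued working on exercise 2 for RL and found that my Q-learning implementation would not converge. It is a very simple algorithm, yet I spent a long time without finding the fault. At first I assumed the hyperparameters were badly tuned, but a lot of tuning did not help. Eventually I found the mistake I had made: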
target = reward + int(done) * self.gamma * max_action_q
new_estimate = old_estimate + self.alpha * (target - old_estimate)
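Note that the factor here should be int(not done), not int(done). With int(done), the discounted value of the next state is dropped on every non-terminal step and added only on terminal steps, which is exactly backwards. It is an easy logic error to overlook (really, my understanding of the algorithm was just not deep enough).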
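For reference, here is a minimal sketch of the corrected update in a tabular setting. Only `self.alpha`, `self.gamma`, `target`, `old_estimate`, `new_estimate`, `max_action_q`, `reward` and `done` come from the snippet in this post; the class name `QLearningAgent`, the `q_table` dictionary, the `update` signature and the default hyperparameters are assumptions made for illustration, not the exercise's actual code.

```python
from collections import defaultdict


class QLearningAgent:
    """Illustrative tabular Q-learning agent (class and attribute names are assumed)."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.99):
        self.n_actions = n_actions
        self.alpha = alpha  # learning rate
        self.gamma = gamma  # discount factor
        # Hypothetical storage: maps (state, action) to a Q-value, defaulting to 0.0.
        self.q_table = defaultdict(float)

    def update(self, state, action, reward, next_state, done):
        old_estimate = self.q_table[(state, action)]
        # Greedy value of the next state under the current estimates.
        max_action_q = max(
            self.q_table[(next_state, a)] for a in range(self.n_actions)
        )
        # Bootstrap only when the episode continues: int(not done) is 1 on
        # non-terminal steps and 0 on terminal steps.
        target = reward + int(not done) * self.gamma * max_action_q
        new_estimate = old_estimate + self.alpha * (target - old_estimate)
        self.q_table[(state, action)] = new_estimate
        return new_estimate
```

On a terminal transition the target reduces to the reward alone, so whatever values are stored for the next state have no effect on the update.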
After fixing it, training converged. Here is the training log:
EVALUATION: EP 1000 - MEAN RETURN -83.042 (424/500 failed episodes)
EVALUATION: EP 2000 - MEAN RETURN -13.322 (101/500 failed episodes)
EVALUATION: EP 3000 - MEAN RETURN 5.032 (13/500 failed episodes)
EVALUATION: EP 4000 - MEAN RETURN 7.898 (0/500 failed episodes)
EVALUATION: EP 5000 - MEAN RETURN 8.166 (0/500 failed episodes)
EVALUATION: EP 6000 - MEAN RETURN 7.848 (0/500 failed episodes)
EVALUATION: EP 7000 - MEAN RETURN 7.8 (0/500 failed episodes)
EVALUATION: EP 8000 - MEAN RETURN 7.88 (0/500 failed episodes)
EVALUATION: EP 9000 - MEAN RETURN 7.826 (0/500 failed episodes)
EVALUATION: EP 10000 - MEAN RETURN 7.888 (0/500 failed episodes)