[PARL Reinforcement Learning] Implementing Sarsa and Q-learning

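Sarsa and Q-learning are both tabular reinforcement learning methods built on the MDP four-tuple <S, A, P, R>, where S is the state, A the action, R the reward, and P the state-transition probability.

Both methods learn by interacting with the environment, which the P function (transition probabilities) and the R function (rewards) describe.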
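
To illustrate what the P and R functions describe, here is a minimal sketch of a made-up two-state MDP. The state names, action names, probabilities, and rewards are all hypothetical and chosen only for illustration; they do not come from any gym environment.

```python
import random

# Hypothetical toy MDP: P[(state, action)] lists (probability, next_state) pairs,
# R[(state, action, next_state)] gives the reward for that transition.
P = {
    ("s0", "go"): [(0.8, "s1"), (0.2, "s0")],
    ("s0", "stay"): [(1.0, "s0")],
    ("s1", "go"): [(1.0, "s1")],
    ("s1", "stay"): [(1.0, "s1")],
}
R = {("s0", "go", "s1"): 1.0}  # every other transition yields 0 reward


def step(state, action, rng=random):
    """Sample a next state from P and look up the reward in R."""
    probs, next_states = zip(*P[(state, action)])
    next_state = rng.choices(next_states, weights=probs, k=1)[0]
    reward = R.get((state, action, next_state), 0.0)
    return next_state, reward


def expected_reward(state, action):
    """Expected one-step reward of (state, action) under P and R."""
    return sum(p * R.get((state, action, s2), 0.0) for p, s2 in P[(state, action)])


print(expected_reward("s0", "go"))  # 0.8 * 1.0 + 0.2 * 0.0
```

In Sarsa and Q-learning the agent never reads P or R directly; it only sees the samples that a function like step() returns.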

The Q-table records the expected future reward of taking each action in each state.

Once trained, the Q-table is used to guide the agent's actions.
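
As a rough sketch of how a trained Q-table guides the agent, the example below reads a greedy policy out of a small table. The table values are invented for illustration, and plain Python lists stand in for the NumPy array used in the implementation later in this article.

```python
# Hypothetical Q-table for 3 states and 2 actions (rows: states, columns: actions).
Q = [
    [0.1, 0.5],
    [0.7, 0.2],
    [0.0, 0.0],
]


def greedy_action(q_table, state):
    """Return the index of the action with the largest Q value in this state."""
    row = q_table[state]
    return max(range(len(row)), key=row.__getitem__)


policy = [greedy_action(Q, s) for s in range(len(Q))]
print(policy)  # [1, 0, 0]
```

Here ties resolve to the lowest action index; the predict() method shown later breaks ties at random instead.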

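1. Introduction to Sarsa

Sarsa stands for state-action-reward-state'-action'. Its goal is to learn the value Q of a specific action in a specific state, building and refining a Q-table with states as rows and actions as columns, updated from the rewards obtained by interacting with the environment.

The five letters of "SARSA" combine the current S (state), A (action), and R (reward) with the next S' (state) and A' (action). In other words, we need to know not only the current S, A, and R but also the next S' and A'.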

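In the Sarsa algorithm, the agent's target is

R(S1) + γ*Q(S1,A)

Which A appears here depends entirely on the action the agent actually selects. With 90% probability the agent picks the action with the largest Q value (A2), and with 10% probability it picks an action at random.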

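The Sarsa algorithm, that is, the Q-table update formula, is therefore:

Q(S, A) ← Q(S, A) + α * [R + γ*Q(S', A') - Q(S, A)]

where α is the learning rate and γ is the reward discount factor.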
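
To make the update concrete, here is a minimal sketch of a single Sarsa update on a tiny table. It mirrors the target used by the learn() method later in this article, but uses plain Python lists and invented values; the function name sarsa_update is hypothetical.

```python
def sarsa_update(q, s, a, r, s_next, a_next, done, lr=0.1, gamma=0.9):
    """Apply one Sarsa update in place and return the new Q[s][a]."""
    # The target bootstraps from the action actually chosen in the next state.
    target = r if done else r + gamma * q[s_next][a_next]
    q[s][a] += lr * (target - q[s][a])
    return q[s][a]


Q = [[0.0, 0.0], [0.5, 1.0]]
new_value = sarsa_update(Q, s=0, a=1, r=0.0, s_next=1, a_next=0, done=False)
print(new_value)  # 0.1 * (0.0 + 0.9 * 0.5 - 0.0), roughly 0.045
```

Only Q[0][1] changes; the entry for the next state is read but not modified.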

To explore the environment better during training, Sarsa uses ε-greedy action selection, which outputs a random action with some probability.
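
The following is a small sketch of ε-greedy selection using only the standard library. The Q values, the seed, and the function name epsilon_greedy are assumptions made for this example. It also shows a detail that is easy to miss: because the random branch can land on the greedy action too, the greedy action is chosen with probability 1 - ε + ε/n for n actions (0.925 for ε = 0.1 and 4 actions), slightly more than 1 - ε.

```python
import random


def epsilon_greedy(q_row, epsilon, rng):
    """Pick the argmax of q_row with probability 1 - epsilon, else a uniform random action."""
    if rng.random() < 1.0 - epsilon:
        return max(range(len(q_row)), key=q_row.__getitem__)
    return rng.randrange(len(q_row))


rng = random.Random(0)          # fixed seed so the run is repeatable
q_row = [0.1, 0.9, 0.3, 0.2]    # hypothetical Q values for one state
picks = [epsilon_greedy(q_row, 0.1, rng) for _ in range(10_000)]
greedy_share = picks.count(1) / len(picks)
print(round(greedy_share, 3))   # empirical share of the greedy action
```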


2. Implementing Sarsa

Import libraries

import gym
import numpy as np
import time

Sarsa implementation

class SarsaAgent(object):
    def __init__(self, obs_n, act_n, learning_rate=0.01, gamma=0.9, e_greed=0.1):
        self.act_n = act_n      # number of available actions
        self.lr = learning_rate # learning rate
        self.gamma = gamma      # reward discount factor
        self.epsilon = e_greed  # probability of choosing a random action
        self.Q = np.zeros((obs_n, act_n))

    # Sample an action for the given observation, with exploration
    def sample(self, obs):

        if np.random.uniform(0, 1) < (1.0 - self.epsilon): # choose the action from the Q-table
            action = self.predict(obs)
        else:
            action = np.random.choice(self.act_n) # explore: pick a random action

        return action

    # Predict the greedy action for the given observation
    def predict(self, obs):

        Q_list = self.Q[obs, :]
        maxQ = np.max(Q_list)
        action_list = np.where(Q_list == maxQ)[0]  # several actions may share maxQ
        action = np.random.choice(action_list)

        return action

    # Learning method, i.e. how the Q-table is updated
    def learn(self, obs, action, reward, next_obs, next_action, done):
        """ on-policy
            obs: observation before the interaction, s_t
            action: action chosen for this interaction, a_t
            reward: reward received for this action, r
            next_obs: observation after the interaction, s_t+1
            next_action: action that will be taken for next_obs according to the current Q-table, a_t+1
            done: whether the episode has ended
        """
        predict_Q = self.Q[obs, action]
        if done:
            target_Q = reward # there is no next state
        else:
            target_Q = reward + self.gamma * self.Q[next_obs, next_action] # Sarsa
        self.Q[obs, action] += self.lr * (target_Q - predict_Q) # correct Q

    # Save the Q-table to a file
    def save(self):
        npy_file = './q_table.npy'
        np.save(npy_file, self.Q)
        print(npy_file + ' saved.')

    # Load the Q-table from a file
    def restore(self, npy_file='./q_table.npy'):
        self.Q = np.load(npy_file)
        print(npy_file + ' loaded.')

Training

def run_episode(env, agent, render=False):
    total_steps = 0 # number of steps taken in this episode
    total_reward = 0

    obs = env.reset() # reset the environment and start a new episode
    action = agent.sample(obs) # choose an action according to the algorithm

    while True:
        next_obs, reward, done, _ = env.step(action) # interact with the environment once
        next_action = agent.sample(next_obs) # choose the next action according to the algorithm
        # train the Sarsa algorithm
        agent.learn(obs, action, reward, next_obs, next_action, done)

        action = next_action
        obs = next_obs  # carry the observation forward to the next step
        total_reward += reward
        total_steps += 1 # count steps
        if render:
            env.render() # render a new frame
        if done:
            break
    return total_reward, total_steps

Test routine

def test_episode(env, agent):
    total_reward = 0
    obs = env.reset()
    while True:
        action = agent.predict(obs) # greedy
        next_obs, reward, done, _ = env.step(action)
        total_reward += reward
        obs = next_obs
        time.sleep(0.5)
        env.render()
        if done:
            break
    return total_reward

Main program

# Create the maze environment with gym; is_slippery=False makes the environment easier
env = gym.make("FrozenLake-v0", is_slippery=False)  # 0 left, 1 down, 2 right, 3 up

# Create an agent instance and pass in the hyperparameters
agent = SarsaAgent(
        obs_n=env.observation_space.n,
        act_n=env.action_space.n,
        learning_rate=0.1,
        gamma=0.9,
        e_greed=0.1)


# Train for 500 episodes and print each episode's score
for episode in range(500):
    ep_reward, ep_steps = run_episode(env, agent, False)
    print('Episode %s: steps = %s , reward = %.1f' % (episode, ep_steps, ep_reward))

# Training finished; check how well the algorithm performs
test_reward = test_episode(env, agent)
print('test reward = %.1f' % (test_reward))
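
The listing above uses the older Gym API, in which env.reset() returns only the observation and env.step() returns a 4-tuple. Newer Gym releases (0.26 and later) and Gymnasium instead return (obs, info) from reset() and (obs, reward, terminated, truncated, info) from step(), and environment ids may differ as well (for example FrozenLake-v1). The helpers below are a hedged sketch for running the same loops under either API; unpack_reset and unpack_step are hypothetical names, not part of Gym.

```python
def unpack_reset(result):
    """Return the observation from either style of reset() result."""
    # Newer API: an (obs, info) tuple; older API: the observation itself.
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    return result


def unpack_step(result):
    """Return (obs, reward, done, info) from a 4- or 5-element step() result."""
    if len(result) == 5:
        obs, reward, terminated, truncated, info = result
        # Assumption: treat a time-limit truncation like an episode end.
        return obs, reward, terminated or truncated, info
    obs, reward, done, info = result
    return obs, reward, done, info
```

With these, the loops would call obs = unpack_reset(env.reset()) and next_obs, reward, done, _ = unpack_step(env.step(action)). Folding truncation into done is a simplification: it also stops bootstrapping at a time limit, which may not be what you want in every environment.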

Output

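3. Introduction to Q-learning

  • Q-learning also stores Q values (state-action values) in a Q-table. Its decision step is the same as Sarsa's, using ε-greedy to add exploration.
  • Q-learning differs from Sarsa in how the Q-table is updated.
    • Sarsa is an on-policy update: it chooses the next action first and then updates with it.
    • Q-learning is an off-policy update: learn() does not need the action actually taken next (next_action), and instead assumes the next action is the one with the largest Q value.
  • The Q-learning update formula is:

Q(S, A) ← Q(S, A) + α * [R + γ*max_a Q(S', a) - Q(S, A)]

So during learning only the update formula differs slightly; everything else is the same.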
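
The difference between the two targets can be sketched side by side. The Q values below are invented, and sarsa_target and q_learning_target are hypothetical helper names; the point is only that Sarsa uses the action actually sampled for the next state, while Q-learning uses the maximum over actions.

```python
def sarsa_target(q, r, s_next, a_next, gamma=0.9):
    """On-policy target: bootstrap from the action actually chosen next."""
    return r + gamma * q[s_next][a_next]


def q_learning_target(q, r, s_next, gamma=0.9):
    """Off-policy target: bootstrap from the best action in the next state."""
    return r + gamma * max(q[s_next])


Q = [[0.0, 0.0], [-1.0, 2.0]]
# Suppose epsilon-greedy happened to sample the exploratory action 0 in state 1.
print(sarsa_target(Q, r=0.0, s_next=1, a_next=0))  # 0.0 + 0.9 * (-1.0)
print(q_learning_target(Q, r=0.0, s_next=1))       # 0.0 + 0.9 * 2.0
```

When the sampled next action is also the greedy one, the two targets coincide.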

4. Implementing Q-learning

Import libraries

import gym
import numpy as np
import time

Q-learning implementation

class QLearningAgent(object):
    def __init__(self, obs_n, act_n, learning_rate=0.01, gamma=0.9, e_greed=0.1):
        self.act_n = act_n      # number of available actions
        self.lr = learning_rate # learning rate
        self.gamma = gamma      # reward discount factor
        self.epsilon = e_greed  # probability of choosing a random action
        self.Q = np.zeros((obs_n, act_n))

    # Sample an action for the given observation, with exploration
    def sample(self, obs):

        if np.random.uniform(0, 1) < (1.0 - self.epsilon): # choose the action from the Q-table
            action = self.predict(obs)
        else:
            action = np.random.choice(self.act_n) # explore: pick a random action
        return action

    # Predict the greedy action for the given observation
    def predict(self, obs):

        Q_list = self.Q[obs, :]
        maxQ = np.max(Q_list)
        action_list = np.where(Q_list == maxQ)[0]  # several actions may share maxQ
        action = np.random.choice(action_list)
        return action

    # Learning method, i.e. how the Q-table is updated
    def learn(self, obs, action, reward, next_obs, done):
        """ off-policy
            obs: observation before the interaction, s_t
            action: action chosen for this interaction, a_t
            reward: reward received for this action, r
            next_obs: observation after the interaction, s_t+1
            done: whether the episode has ended
        """

        predict_Q = self.Q[obs, action]
        if done:
            target_Q = reward # there is no next state
        else:
            target_Q = reward + self.gamma * np.max(self.Q[next_obs, :]) # Q-learning
        self.Q[obs, action] += self.lr * (target_Q - predict_Q) # correct Q

    # Save the Q-table to a file
    def save(self):
        npy_file = './q_table.npy'
        np.save(npy_file, self.Q)
        print(npy_file + ' saved.')

    # Load the Q-table from a file
    def restore(self, npy_file='./q_table.npy'):
        self.Q = np.load(npy_file)
        print(npy_file + ' loaded.')

Training

def run_episode(env, agent, render=False):
    total_steps = 0 # number of steps taken in this episode
    total_reward = 0

    obs = env.reset() # reset the environment and start a new episode

    while True:
        action = agent.sample(obs) # choose an action according to the algorithm
        next_obs, reward, done, _ = env.step(action) # interact with the environment once
        # train the Q-learning algorithm
        agent.learn(obs, action, reward, next_obs, done)

        obs = next_obs  # carry the observation forward to the next step
        total_reward += reward
        total_steps += 1 # count steps
        if render:
            env.render() # render a new frame
        if done:
            break
    return total_reward, total_steps

Test routine

def test_episode(env, agent):
    total_reward = 0
    obs = env.reset()
    while True:
        action = agent.predict(obs) # greedy
        next_obs, reward, done, _ = env.step(action)
        total_reward += reward
        obs = next_obs
        # time.sleep(0.5)
        # env.render()
        if done:
            break
    return total_reward

Main program

# Create the cliff-walking environment with gym
env = gym.make("CliffWalking-v0")  # 0 up, 1 right, 2 down, 3 left

# Create an agent instance and pass in the hyperparameters
agent = QLearningAgent(
    obs_n=env.observation_space.n,
    act_n=env.action_space.n,
    learning_rate=0.1,
    gamma=0.9,
    e_greed=0.1)


# Train for 500 episodes and print each episode's score
for episode in range(500):
    ep_reward, ep_steps = run_episode(env, agent, False)
    print('Episode %s: steps = %s , reward = %.1f' % (episode, ep_steps, ep_reward))

# Training finished; check how well the algorithm performs
test_reward = test_episode(env, agent)
print('test reward = %.1f' % (test_reward))

Output

Episode 0: steps = 14 , reward = 0.0
Episode 1: steps = 8 , reward = 0.0
Episode 2: steps = 27 , reward = 0.0
Episode 3: steps = 7 , reward = 0.0
Episode 4: steps = 9 , reward = 0.0
Episode 5: steps = 11 , reward = 0.0
Episode 6: steps = 7 , reward = 0.0
Episode 7: steps = 6 , reward = 0.0
Episode 8: steps = 7 , reward = 0.0
Episode 9: steps = 6 , reward = 0.0
Episode 10: steps = 11 , reward = 0.0
Episode 11: steps = 2 , reward = 0.0
Episode 12: steps = 16 , reward = 0.0
Episode 13: steps = 7 , reward = 1.0
Episode 14: steps = 13 , reward = 0.0
Episode 15: steps = 6 , reward = 0.0
Episode 16: steps = 12 , reward = 0.0
Episode 17: steps = 4 , reward = 0.0
Episode 18: steps = 21 , reward = 0.0
Episode 19: steps = 15 , reward = 0.0
Episode 20: steps = 2 , reward = 0.0
Episode 21: steps = 16 , reward = 0.0
Episode 22: steps = 4 , reward = 0.0
Episode 23: steps = 10 , reward = 0.0
Episode 24: steps = 11 , reward = 1.0
Episode 25: steps = 10 , reward = 0.0
Episode 26: steps = 6 , reward = 1.0
Episode 27: steps = 17 , reward = 0.0
Episode 28: steps = 5 , reward = 0.0
Episode 29: steps = 6 , reward = 0.0
Episode 30: steps = 31 , reward = 0.0
Episode 31: steps = 8 , reward = 0.0
Episode 32: steps = 9 , reward = 0.0
Episode 33: steps = 4 , reward = 0.0
Episode 34: steps = 16 , reward = 1.0
Episode 35: steps = 6 , reward = 0.0
Episode 36: steps = 11 , reward = 0.0
Episode 37: steps = 8 , reward = 0.0
Episode 38: steps = 12 , reward = 0.0
Episode 39: steps = 6 , reward = 1.0
Episode 40: steps = 6 , reward = 0.0
Episode 41: steps = 10 , reward = 0.0
Episode 42: steps = 6 , reward = 0.0
Episode 43: steps = 3 , reward = 0.0
Episode 44: steps = 9 , reward = 0.0
Episode 45: steps = 11 , reward = 1.0
Episode 46: steps = 7 , reward = 1.0
Episode 47: steps = 8 , reward = 1.0
Episode 48: steps = 8 , reward = 1.0
Episode 49: steps = 7 , reward = 1.0
Episode 50: steps = 6 , reward = 1.0
Episode 51: steps = 6 , reward = 1.0
Episode 52: steps = 6 , reward = 1.0
Episode 53: steps = 6 , reward = 1.0
Episode 54: steps = 6 , reward = 1.0
Episode 55: steps = 6 , reward = 1.0
Episode 56: steps = 4 , reward = 0.0
Episode 57: steps = 5 , reward = 0.0
Episode 58: steps = 6 , reward = 1.0
Episode 59: steps = 5 , reward = 0.0
Episode 60: steps = 6 , reward = 1.0
Episode 61: steps = 6 , reward = 1.0
Episode 62: steps = 6 , reward = 1.0
Episode 63: steps = 6 , reward = 1.0
Episode 64: steps = 9 , reward = 1.0
Episode 65: steps = 6 , reward = 1.0
Episode 66: steps = 7 , reward = 1.0
Episode 67: steps = 6 , reward = 1.0
Episode 68: steps = 10 , reward = 1.0
Episode 69: steps = 7 , reward = 1.0
Episode 70: steps = 8 , reward = 1.0
Episode 71: steps = 5 , reward = 0.0
Episode 72: steps = 6 , reward = 1.0
Episode 73: steps = 6 , reward = 1.0
Episode 74: steps = 6 , reward = 1.0
Episode 75: steps = 6 , reward = 1.0
Episode 76: steps = 2 , reward = 0.0
Episode 77: steps = 6 , reward = 1.0
Episode 78: steps = 6 , reward = 1.0
Episode 79: steps = 6 , reward = 1.0
Episode 80: steps = 6 , reward = 1.0
Episode 81: steps = 6 , reward = 1.0
Episode 82: steps = 8 , reward = 1.0
Episode 83: steps = 8 , reward = 1.0
Episode 84: steps = 6 , reward = 1.0
Episode 85: steps = 6 , reward = 1.0
Episode 86: steps = 6 , reward = 1.0
Episode 87: steps = 5 , reward = 0.0
Episode 88: steps = 7 , reward = 1.0
Episode 89: steps = 6 , reward = 1.0
Episode 90: steps = 6 , reward = 1.0
Episode 91: steps = 7 , reward = 1.0
Episode 92: steps = 6 , reward = 1.0
Episode 93: steps = 6 , reward = 1.0
Episode 94: steps = 6 , reward = 1.0
Episode 95: steps = 6 , reward = 1.0
Episode 96: steps = 6 , reward = 1.0
Episode 97: steps = 7 , reward = 1.0
Episode 98: steps = 3 , reward = 0.0
Episode 99: steps = 6 , reward = 1.0
Episode 100: steps = 6 , reward = 1.0
Episode 101: steps = 6 , reward = 1.0
Episode 102: steps = 6 , reward = 1.0
Episode 103: steps = 7 , reward = 1.0
Episode 104: steps = 6 , reward = 1.0
Episode 105: steps = 8 , reward = 1.0
Episode 106: steps = 6 , reward = 1.0
Episode 107: steps = 6 , reward = 1.0
Episode 108: steps = 6 , reward = 1.0
Episode 109: steps = 6 , reward = 1.0
Episode 110: steps = 6 , reward = 1.0
Episode 111: steps = 6 , reward = 1.0
Episode 112: steps = 8 , reward = 0.0
Episode 113: steps = 8 , reward = 1.0
Episode 114: steps = 4 , reward = 0.0
Episode 115: steps = 6 , reward = 1.0
Episode 116: steps = 5 , reward = 0.0
Episode 117: steps = 6 , reward = 1.0
Episode 118: steps = 6 , reward = 1.0
Episode 119: steps = 6 , reward = 1.0
Episode 120: steps = 7 , reward = 1.0
Episode 121: steps = 6 , reward = 1.0
Episode 122: steps = 4 , reward = 0.0
Episode 123: steps = 3 , reward = 0.0
Episode 124: steps = 6 , reward = 1.0
Episode 125: steps = 6 , reward = 1.0
Episode 126: steps = 6 , reward = 1.0
Episode 127: steps = 6 , reward = 1.0
Episode 128: steps = 7 , reward = 1.0
Episode 129: steps = 8 , reward = 1.0
Episode 130: steps = 6 , reward = 1.0
Episode 131: steps = 6 , reward = 1.0
Episode 132: steps = 6 , reward = 1.0
Episode 133: steps = 6 , reward = 1.0
Episode 134: steps = 6 , reward = 1.0
Episode 135: steps = 3 , reward = 0.0
Episode 136: steps = 6 , reward = 1.0
Episode 137: steps = 6 , reward = 0.0
Episode 138: steps = 13 , reward = 1.0
Episode 139: steps = 6 , reward = 1.0
Episode 140: steps = 12 , reward = 1.0
Episode 141: steps = 6 , reward = 1.0
Episode 142: steps = 6 , reward = 1.0
Episode 143: steps = 6 , reward = 1.0
Episode 144: steps = 6 , reward = 1.0
Episode 145: steps = 6 , reward = 1.0
Episode 146: steps = 5 , reward = 0.0
Episode 147: steps = 9 , reward = 1.0
Episode 148: steps = 6 , reward = 1.0
Episode 149: steps = 8 , reward = 1.0
Episode 150: steps = 6 , reward = 1.0
Episode 151: steps = 10 , reward = 1.0
Episode 152: steps = 6 , reward = 1.0
Episode 153: steps = 6 , reward = 1.0
Episode 154: steps = 6 , reward = 1.0
Episode 155: steps = 4 , reward = 0.0
Episode 156: steps = 6 , reward = 1.0
Episode 157: steps = 5 , reward = 0.0
Episode 158: steps = 6 , reward = 1.0
Episode 159: steps = 6 , reward = 1.0
Episode 160: steps = 8 , reward = 1.0
Episode 161: steps = 6 , reward = 1.0
Episode 162: steps = 6 , reward = 1.0
Episode 163: steps = 6 , reward = 1.0
Episode 164: steps = 6 , reward = 1.0
Episode 165: steps = 4 , reward = 0.0
Episode 166: steps = 5 , reward = 0.0
Episode 167: steps = 6 , reward = 1.0
Episode 168: steps = 3 , reward = 0.0
Episode 169: steps = 6 , reward = 1.0
Episode 170: steps = 3 , reward = 0.0
Episode 171: steps = 6 , reward = 1.0
Episode 172: steps = 5 , reward = 0.0
Episode 173: steps = 6 , reward = 1.0
Episode 174: steps = 7 , reward = 1.0
Episode 175: steps = 6 , reward = 1.0
Episode 176: steps = 6 , reward = 1.0
Episode 177: steps = 8 , reward = 1.0
Episode 178: steps = 6 , reward = 1.0
Episode 179: steps = 7 , reward = 1.0
Episode 180: steps = 6 , reward = 1.0
Episode 181: steps = 7 , reward = 1.0
Episode 182: steps = 6 , reward = 1.0
Episode 183: steps = 6 , reward = 1.0
Episode 184: steps = 6 , reward = 1.0
Episode 185: steps = 6 , reward = 1.0
Episode 186: steps = 8 , reward = 1.0
Episode 187: steps = 7 , reward = 1.0
Episode 188: steps = 6 , reward = 1.0
Episode 189: steps = 7 , reward = 1.0
Episode 190: steps = 6 , reward = 1.0
Episode 191: steps = 8 , reward = 1.0
Episode 192: steps = 6 , reward = 1.0
Episode 193: steps = 6 , reward = 1.0
Episode 194: steps = 6 , reward = 1.0
Episode 195: steps = 6 , reward = 1.0
Episode 196: steps = 6 , reward = 1.0
Episode 197: steps = 6 , reward = 1.0
Episode 198: steps = 6 , reward = 1.0
Episode 199: steps = 6 , reward = 1.0
Episode 200: steps = 6 , reward = 1.0
Episode 201: steps = 6 , reward = 1.0
Episode 202: steps = 8 , reward = 1.0
Episode 203: steps = 8 , reward = 1.0
Episode 204: steps = 6 , reward = 1.0
Episode 205: steps = 7 , reward = 1.0
Episode 206: steps = 6 , reward = 1.0
Episode 207: steps = 4 , reward = 0.0
Episode 208: steps = 6 , reward = 1.0
Episode 209: steps = 2 , reward = 0.0
Episode 210: steps = 6 , reward = 1.0
Episode 211: steps = 6 , reward = 1.0
Episode 212: steps = 8 , reward = 1.0
Episode 213: steps = 6 , reward = 1.0
Episode 214: steps = 6 , reward = 1.0
Episode 215: steps = 7 , reward = 1.0
Episode 216: steps = 7 , reward = 1.0
Episode 217: steps = 6 , reward = 1.0
Episode 218: steps = 2 , reward = 0.0
Episode 219: steps = 3 , reward = 0.0
Episode 220: steps = 6 , reward = 1.0
Episode 221: steps = 6 , reward = 1.0
Episode 222: steps = 6 , reward = 1.0
Episode 223: steps = 3 , reward = 0.0
Episode 224: steps = 7 , reward = 0.0
Episode 225: steps = 4 , reward = 0.0
Episode 226: steps = 5 , reward = 0.0
Episode 227: steps = 6 , reward = 1.0
Episode 228: steps = 6 , reward = 1.0
Episode 229: steps = 6 , reward = 1.0
Episode 230: steps = 7 , reward = 1.0
Episode 231: steps = 8 , reward = 1.0
Episode 232: steps = 9 , reward = 1.0
Episode 233: steps = 10 , reward = 1.0
Episode 234: steps = 9 , reward = 1.0
Episode 235: steps = 7 , reward = 1.0
Episode 236: steps = 8 , reward = 1.0
Episode 237: steps = 8 , reward = 1.0
Episode 238: steps = 8 , reward = 1.0
Episode 239: steps = 6 , reward = 1.0
Episode 240: steps = 6 , reward = 1.0
Episode 241: steps = 9 , reward = 1.0
Episode 242: steps = 6 , reward = 1.0
Episode 243: steps = 6 , reward = 1.0
Episode 244: steps = 6 , reward = 1.0
Episode 245: steps = 7 , reward = 1.0
Episode 246: steps = 8 , reward = 1.0
Episode 247: steps = 7 , reward = 1.0
Episode 248: steps = 12 , reward = 1.0
Episode 249: steps = 6 , reward = 1.0
Episode 250: steps = 6 , reward = 1.0
Episode 251: steps = 6 , reward = 1.0
Episode 252: steps = 6 , reward = 1.0
Episode 253: steps = 6 , reward = 1.0
Episode 254: steps = 6 , reward = 1.0
Episode 255: steps = 7 , reward = 1.0
Episode 256: steps = 8 , reward = 1.0
Episode 257: steps = 12 , reward = 1.0
Episode 258: steps = 6 , reward = 1.0
Episode 259: steps = 8 , reward = 1.0
Episode 260: steps = 6 , reward = 1.0
Episode 261: steps = 6 , reward = 1.0
Episode 262: steps = 6 , reward = 1.0
Episode 263: steps = 6 , reward = 1.0
Episode 264: steps = 4 , reward = 0.0
Episode 265: steps = 4 , reward = 0.0
Episode 266: steps = 4 , reward = 0.0
Episode 267: steps = 7 , reward = 1.0
Episode 268: steps = 6 , reward = 1.0
Episode 269: steps = 6 , reward = 1.0
Episode 270: steps = 6 , reward = 1.0
Episode 271: steps = 7 , reward = 1.0
Episode 272: steps = 6 , reward = 1.0
Episode 273: steps = 6 , reward = 1.0
Episode 274: steps = 7 , reward = 1.0
Episode 275: steps = 6 , reward = 1.0
Episode 276: steps = 6 , reward = 1.0
Episode 277: steps = 6 , reward = 1.0
Episode 278: steps = 2 , reward = 0.0
Episode 279: steps = 6 , reward = 1.0
Episode 280: steps = 6 , reward = 1.0
Episode 281: steps = 6 , reward = 1.0
Episode 282: steps = 6 , reward = 1.0
Episode 283: steps = 4 , reward = 0.0
Episode 284: steps = 8 , reward = 1.0
Episode 285: steps = 6 , reward = 1.0
Episode 286: steps = 6 , reward = 1.0
Episode 287: steps = 7 , reward = 1.0
Episode 288: steps = 6 , reward = 1.0
Episode 289: steps = 6 , reward = 1.0
Episode 290: steps = 6 , reward = 1.0
Episode 291: steps = 8 , reward = 1.0
Episode 292: steps = 6 , reward = 1.0
Episode 293: steps = 4 , reward = 0.0
Episode 294: steps = 6 , reward = 1.0
Episode 295: steps = 6 , reward = 1.0
Episode 296: steps = 6 , reward = 1.0
Episode 297: steps = 6 , reward = 1.0
Episode 298: steps = 8 , reward = 1.0
Episode 299: steps = 6 , reward = 1.0
Episode 300: steps = 6 , reward = 1.0
Episode 301: steps = 6 , reward = 1.0
Episode 302: steps = 6 , reward = 1.0
Episode 303: steps = 6 , reward = 1.0
Episode 304: steps = 6 , reward = 1.0
Episode 305: steps = 6 , reward = 1.0
Episode 306: steps = 6 , reward = 1.0
Episode 307: steps = 6 , reward = 1.0
Episode 308: steps = 9 , reward = 1.0
Episode 309: steps = 6 , reward = 1.0
Episode 310: steps = 6 , reward = 0.0
Episode 311: steps = 6 , reward = 1.0
Episode 312: steps = 5 , reward = 0.0
Episode 313: steps = 6 , reward = 1.0
Episode 314: steps = 6 , reward = 1.0
Episode 315: steps = 6 , reward = 1.0
Episode 316: steps = 7 , reward = 1.0
Episode 317: steps = 6 , reward = 1.0
Episode 318: steps = 6 , reward = 1.0
Episode 319: steps = 6 , reward = 1.0
Episode 320: steps = 6 , reward = 1.0
Episode 321: steps = 6 , reward = 1.0
Episode 322: steps = 10 , reward = 1.0
Episode 323: steps = 6 , reward = 1.0
Episode 324: steps = 8 , reward = 1.0
Episode 325: steps = 3 , reward = 0.0
Episode 326: steps = 6 , reward = 1.0
Episode 327: steps = 6 , reward = 1.0
Episode 328: steps = 6 , reward = 1.0
Episode 329: steps = 6 , reward = 1.0
Episode 330: steps = 6 , reward = 1.0
Episode 331: steps = 6 , reward = 1.0
Episode 332: steps = 8 , reward = 1.0
Episode 333: steps = 7 , reward = 1.0
Episode 334: steps = 7 , reward = 1.0
Episode 335: steps = 6 , reward = 1.0
Episode 336: steps = 6 , reward = 1.0
Episode 337: steps = 6 , reward = 1.0
Episode 338: steps = 2 , reward = 0.0
Episode 339: steps = 6 , reward = 1.0
Episode 340: steps = 6 , reward = 1.0
Episode 341: steps = 6 , reward = 1.0
Episode 342: steps = 6 , reward = 1.0
Episode 343: steps = 8 , reward = 1.0
Episode 344: steps = 6 , reward = 1.0
Episode 345: steps = 6 , reward = 1.0
Episode 346: steps = 6 , reward = 1.0
Episode 347: steps = 6 , reward = 1.0
Episode 348: steps = 6 , reward = 1.0
Episode 349: steps = 7 , reward = 1.0
Episode 350: steps = 6 , reward = 1.0
Episode 351: steps = 6 , reward = 1.0
Episode 352: steps = 6 , reward = 1.0
Episode 353: steps = 6 , reward = 1.0
Episode 354: steps = 6 , reward = 1.0
Episode 355: steps = 8 , reward = 1.0
Episode 356: steps = 7 , reward = 1.0
Episode 357: steps = 6 , reward = 1.0
Episode 358: steps = 6 , reward = 0.0
Episode 359: steps = 6 , reward = 1.0
Episode 360: steps = 6 , reward = 1.0
Episode 361: steps = 6 , reward = 1.0
Episode 362: steps = 6 , reward = 1.0
Episode 363: steps = 6 , reward = 1.0
Episode 364: steps = 6 , reward = 1.0
Episode 365: steps = 6 , reward = 1.0
Episode 366: steps = 7 , reward = 1.0
Episode 367: steps = 6 , reward = 1.0
Episode 368: steps = 6 , reward = 1.0
Episode 369: steps = 6 , reward = 1.0
Episode 370: steps = 7 , reward = 1.0
Episode 371: steps = 8 , reward = 1.0
Episode 372: steps = 6 , reward = 1.0
Episode 373: steps = 7 , reward = 1.0
Episode 374: steps = 6 , reward = 1.0
Episode 375: steps = 8 , reward = 1.0
Episode 376: steps = 6 , reward = 1.0
Episode 377: steps = 6 , reward = 1.0
Episode 378: steps = 6 , reward = 1.0
Episode 379: steps = 6 , reward = 1.0
Episode 380: steps = 6 , reward = 1.0
Episode 381: steps = 6 , reward = 1.0
Episode 382: steps = 6 , reward = 1.0
Episode 383: steps = 6 , reward = 1.0
Episode 384: steps = 6 , reward = 1.0
Episode 385: steps = 6 , reward = 1.0
Episode 386: steps = 6 , reward = 1.0
Episode 387: steps = 8 , reward = 1.0
Episode 388: steps = 6 , reward = 1.0
Episode 389: steps = 6 , reward = 1.0
Episode 390: steps = 6 , reward = 1.0
Episode 391: steps = 9 , reward = 1.0
Episode 392: steps = 8 , reward = 1.0
Episode 393: steps = 6 , reward = 1.0
Episode 394: steps = 4 , reward = 0.0
Episode 395: steps = 6 , reward = 1.0
Episode 396: steps = 7 , reward = 1.0
Episode 397: steps = 6 , reward = 1.0
Episode 398: steps = 6 , reward = 1.0
Episode 399: steps = 9 , reward = 1.0
Episode 400: steps = 6 , reward = 1.0
Episode 401: steps = 6 , reward = 1.0
Episode 402: steps = 3 , reward = 0.0
Episode 403: steps = 6 , reward = 1.0
Episode 404: steps = 9 , reward = 1.0
Episode 405: steps = 7 , reward = 1.0
Episode 406: steps = 6 , reward = 1.0
Episode 407: steps = 6 , reward = 1.0
Episode 408: steps = 6 , reward = 1.0
Episode 409: steps = 9 , reward = 1.0
Episode 410: steps = 6 , reward = 1.0
Episode 411: steps = 6 , reward = 1.0
Episode 412: steps = 6 , reward = 1.0
Episode 413: steps = 6 , reward = 1.0
Episode 414: steps = 6 , reward = 1.0
Episode 415: steps = 6 , reward = 1.0
Episode 416: steps = 6 , reward = 1.0
Episode 417: steps = 8 , reward = 1.0
Episode 418: steps = 4 , reward = 0.0
Episode 419: steps = 8 , reward = 1.0
Episode 420: steps = 6 , reward = 1.0
Episode 421: steps = 6 , reward = 1.0
Episode 422: steps = 8 , reward = 1.0
Episode 423: steps = 6 , reward = 1.0
Episode 424: steps = 6 , reward = 1.0
Episode 425: steps = 8 , reward = 1.0
Episode 426: steps = 4 , reward = 0.0
Episode 427: steps = 6 , reward = 1.0
Episode 428: steps = 6 , reward = 1.0
Episode 429: steps = 6 , reward = 1.0
Episode 430: steps = 10 , reward = 1.0
Episode 431: steps = 6 , reward = 1.0
Episode 432: steps = 6 , reward = 1.0
Episode 433: steps = 7 , reward = 1.0
Episode 434: steps = 6 , reward = 1.0
Episode 435: steps = 5 , reward = 0.0
Episode 436: steps = 6 , reward = 1.0
Episode 437: steps = 6 , reward = 1.0
Episode 438: steps = 6 , reward = 1.0
Episode 439: steps = 6 , reward = 1.0
Episode 440: steps = 6 , reward = 1.0
Episode 441: steps = 6 , reward = 1.0
Episode 442: steps = 8 , reward = 1.0
Episode 443: steps = 6 , reward = 1.0
Episode 444: steps = 8 , reward = 1.0
Episode 445: steps = 6 , reward = 1.0
Episode 446: steps = 6 , reward = 1.0
Episode 447: steps = 6 , reward = 1.0
Episode 448: steps = 6 , reward = 1.0
Episode 449: steps = 6 , reward = 1.0
Episode 450: steps = 2 , reward = 0.0
Episode 451: steps = 6 , reward = 1.0
Episode 452: steps = 8 , reward = 1.0
Episode 453: steps = 6 , reward = 1.0
Episode 454: steps = 6 , reward = 1.0
Episode 455: steps = 3 , reward = 0.0
Episode 456: steps = 4 , reward = 0.0
Episode 457: steps = 6 , reward = 1.0
Episode 458: steps = 6 , reward = 1.0
Episode 459: steps = 6 , reward = 1.0
Episode 460: steps = 5 , reward = 0.0
Episode 461: steps = 4 , reward = 0.0
Episode 462: steps = 8 , reward = 1.0
Episode 463: steps = 8 , reward = 1.0
Episode 464: steps = 3 , reward = 0.0
Episode 465: steps = 6 , reward = 1.0
Episode 466: steps = 7 , reward = 1.0
Episode 467: steps = 6 , reward = 1.0
Episode 468: steps = 6 , reward = 1.0
Episode 469: steps = 6 , reward = 1.0
Episode 470: steps = 6 , reward = 1.0
Episode 471: steps = 6 , reward = 1.0
Episode 472: steps = 6 , reward = 0.0
Episode 473: steps = 6 , reward = 1.0
Episode 474: steps = 6 , reward = 1.0
Episode 475: steps = 6 , reward = 1.0
Episode 476: steps = 6 , reward = 1.0
Episode 477: steps = 6 , reward = 1.0
Episode 478: steps = 6 , reward = 1.0
Episode 479: steps = 6 , reward = 1.0
Episode 480: steps = 6 , reward = 1.0
Episode 481: steps = 10 , reward = 1.0
Episode 482: steps = 6 , reward = 1.0
Episode 483: steps = 6 , reward = 1.0
Episode 484: steps = 8 , reward = 1.0
Episode 485: steps = 5 , reward = 0.0
Episode 486: steps = 7 , reward = 1.0
Episode 487: steps = 6 , reward = 1.0
Episode 488: steps = 6 , reward = 1.0
Episode 489: steps = 9 , reward = 1.0
Episode 490: steps = 6 , reward = 1.0
Episode 491: steps = 6 , reward = 1.0
Episode 492: steps = 7 , reward = 1.0
Episode 493: steps = 8 , reward = 1.0
Episode 494: steps = 6 , reward = 1.0
Episode 495: steps = 6 , reward = 1.0
Episode 496: steps = 2 , reward = 0.0
Episode 497: steps = 6 , reward = 1.0
Episode 498: steps = 8 , reward = 1.0
Episode 499: steps = 6 , reward = 1.0
test reward = 1.0

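5. Summary

Sarsa follows a conservative strategy: when it updates a Q value it has already committed to its next action, so it is sensitive to mistakes and to death. Q-learning, by contrast, always updates toward the maximum Q value and only chooses its action anew once it reaches the next state. It is a reckless, bold, greedy algorithm that does not care about death or mistakes.

In short, Sarsa is more conservative and Q-learning is more aggressive.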

import gym
import numpy as np
import time

class SarsaAgent(object):
    def __init__(self, obs_n, act_n, learning_rate=0.01, gamma=0.9, e_greed=0.1):
        self.act_n = act_n      # 动作维度,有几个动作可选
        self.lr = learning_rate # 学习率
        self.gamma = gamma      # reward的衰减率
        self.epsilon = e_greed  # 按一定概率随机选动作
        self.Q = np.zeros((obs_n, act_n))

    # 根据输入观察值,采样输出的动作值,带探索
    def sample(self, obs):
       
        if np.random.uniform(0, 1) < (1.0 - self.epsilon): #根据table的Q值选动作
            action = self.predict(obs)
        else:
            action = np.random.choice(self.act_n) #有一定概率随机探索选取一个动作
    
        return action

    # 根据输入观察值,预测输出的动作值
    def predict(self, obs):
       
        Q_list = self.Q[obs, :]
        maxQ = np.max(Q_list)
        action_list = np.where(Q_list == maxQ)[0]  # maxQ可能对应多个action
        action = np.random.choice(action_list)

        return action

    # 学习方法,也就是更新Q-table的方法
    def learn(self, obs, action, reward, next_obs, next_action, done):
        """ on-policy
            obs: 交互前的obs, s_t
            action: 本次交互选择的action, a_t
            reward: 本次动作获得的奖励r
            next_obs: 本次交互后的obs, s_t+1
            next_action: 根据当前Q表格, 针对next_obs会选择的动作, a_t+1
            done: episode是否结束
        """
        predict_Q = self.Q[obs, action]
        if done:
            target_Q = reward # 没有下一个状态了
        else:
            target_Q = reward + self.gamma * self.Q[next_obs, next_action] # Sarsa
        self.Q[obs, action] += self.lr * (target_Q - predict_Q) # 修正q

    # 保存Q表格数据到文件
    def save(self):
        npy_file = './q_table.npy'
        np.save(npy_file, self.Q)
        print(npy_file + ' saved.')
    
    # 从文件中读取数据到Q表格中
    def restore(self, npy_file='./q_table.npy'):
        self.Q = np.load(npy_file)
        print(npy_file + ' loaded.')


def run_episode(env, agent, render=False):
    total_steps = 0 # 记录每个episode走了多少step
    total_reward = 0

    obs = env.reset() # 重置环境, 重新开一局(即开始新的一个episode)
    action = agent.sample(obs) # 根据算法选择一个动作

    while True:
        next_obs, reward, done, _ = env.step(action) # 与环境进行一个交互
        next_action = agent.sample(next_obs) # 根据算法选择一个动作
        # 训练 Sarsa 算法
        agent.learn(obs, action, reward, next_obs, next_action, done)

        action = next_action
        obs = next_obs  # 存储上一个观察值
        total_reward += reward
        total_steps += 1 # 计算step数
        if render:
            env.render() #渲染新的一帧图形
        if done:
            break
    return total_reward, total_steps


def test_episode(env, agent):
    total_reward = 0
    obs = env.reset()
    while True:
        action = agent.predict(obs) # greedy
        next_obs, reward, done, _ = env.step(action)
        total_reward += reward
        obs = next_obs
        time.sleep(0.5)
        env.render()
        if done:
            break
    return total_reward


# 使用gym创建迷宫环境,设置is_slippery为False降低环境难度
env = gym.make("FrozenLake-v0", is_slippery=False)  # 0 left, 1 down, 2 right, 3 up

# 创建一个agent实例,输入超参数
agent = SarsaAgent(
        obs_n=env.observation_space.n,
        act_n=env.action_space.n,
        learning_rate=0.1,
        gamma=0.9,
        e_greed=0.1)


# 训练500个episode,打印每个episode的分数
for episode in range(500):
    ep_reward, ep_steps = run_episode(env, agent, False)
    print('Episode %s: steps = %s , reward = %.1f' % (episode, ep_steps, ep_reward))

# Training is finished; check how the learned policy performs
test_reward = test_episode(env, agent)
print('test reward = %.1f' % (test_reward))
Episode 0: steps = 11 , reward = 0.0
Episode 1: steps = 10 , reward = 0.0
Episode 2: steps = 5 , reward = 0.0
Episode 3: steps = 13 , reward = 0.0
Episode 4: steps = 6 , reward = 0.0
Episode 5: steps = 7 , reward = 0.0
Episode 6: steps = 5 , reward = 0.0
Episode 7: steps = 4 , reward = 0.0
Episode 8: steps = 2 , reward = 0.0
Episode 9: steps = 20 , reward = 0.0
Episode 10: steps = 8 , reward = 0.0
Episode 11: steps = 15 , reward = 0.0
Episode 12: steps = 2 , reward = 0.0
Episode 13: steps = 8 , reward = 0.0
Episode 14: steps = 10 , reward = 0.0
Episode 15: steps = 10 , reward = 0.0
Episode 16: steps = 2 , reward = 0.0
Episode 17: steps = 4 , reward = 0.0
Episode 18: steps = 2 , reward = 0.0
Episode 19: steps = 5 , reward = 0.0
Episode 20: steps = 2 , reward = 0.0
Episode 21: steps = 5 , reward = 0.0
Episode 22: steps = 11 , reward = 0.0
Episode 23: steps = 9 , reward = 0.0
Episode 24: steps = 7 , reward = 0.0
Episode 25: steps = 8 , reward = 0.0
Episode 26: steps = 13 , reward = 0.0
Episode 27: steps = 5 , reward = 0.0
Episode 28: steps = 5 , reward = 0.0
Episode 29: steps = 2 , reward = 0.0
Episode 30: steps = 6 , reward = 0.0
Episode 31: steps = 4 , reward = 0.0
Episode 32: steps = 11 , reward = 0.0
Episode 33: steps = 6 , reward = 0.0
Episode 34: steps = 3 , reward = 0.0
Episode 35: steps = 3 , reward = 0.0
Episode 36: steps = 11 , reward = 0.0
Episode 37: steps = 31 , reward = 0.0
Episode 38: steps = 9 , reward = 0.0
Episode 39: steps = 3 , reward = 0.0
Episode 40: steps = 17 , reward = 0.0
Episode 41: steps = 6 , reward = 0.0
Episode 42: steps = 5 , reward = 0.0
Episode 43: steps = 4 , reward = 0.0
Episode 44: steps = 3 , reward = 0.0
Episode 45: steps = 7 , reward = 0.0
Episode 46: steps = 4 , reward = 0.0
Episode 47: steps = 5 , reward = 0.0
Episode 48: steps = 3 , reward = 0.0
Episode 49: steps = 7 , reward = 0.0
Episode 50: steps = 7 , reward = 0.0
Episode 51: steps = 4 , reward = 0.0
Episode 52: steps = 4 , reward = 0.0
Episode 53: steps = 6 , reward = 0.0
Episode 54: steps = 3 , reward = 0.0
Episode 55: steps = 2 , reward = 0.0
Episode 56: steps = 9 , reward = 0.0
Episode 57: steps = 3 , reward = 0.0
Episode 58: steps = 6 , reward = 0.0
Episode 59: steps = 24 , reward = 0.0
Episode 60: steps = 12 , reward = 0.0
Episode 61: steps = 8 , reward = 0.0
Episode 62: steps = 10 , reward = 0.0
Episode 63: steps = 15 , reward = 0.0
Episode 64: steps = 10 , reward = 0.0
Episode 65: steps = 5 , reward = 0.0
Episode 66: steps = 12 , reward = 0.0
Episode 67: steps = 8 , reward = 0.0
Episode 68: steps = 5 , reward = 0.0
Episode 69: steps = 7 , reward = 0.0
Episode 70: steps = 2 , reward = 0.0
Episode 71: steps = 11 , reward = 0.0
Episode 72: steps = 8 , reward = 0.0
Episode 73: steps = 3 , reward = 0.0
Episode 74: steps = 6 , reward = 0.0
Episode 75: steps = 16 , reward = 0.0
Episode 76: steps = 4 , reward = 0.0
Episode 77: steps = 2 , reward = 0.0
Episode 78: steps = 9 , reward = 0.0
Episode 79: steps = 7 , reward = 0.0
Episode 80: steps = 4 , reward = 0.0
Episode 81: steps = 6 , reward = 0.0
Episode 82: steps = 21 , reward = 0.0
Episode 83: steps = 4 , reward = 0.0
Episode 84: steps = 2 , reward = 0.0
Episode 85: steps = 15 , reward = 0.0
Episode 86: steps = 13 , reward = 0.0
Episode 87: steps = 3 , reward = 0.0
Episode 88: steps = 39 , reward = 0.0
Episode 89: steps = 14 , reward = 0.0
Episode 90: steps = 4 , reward = 0.0
Episode 91: steps = 6 , reward = 0.0
Episode 92: steps = 2 , reward = 0.0
Episode 93: steps = 2 , reward = 0.0
Episode 94: steps = 2 , reward = 0.0
Episode 95: steps = 15 , reward = 0.0
Episode 96: steps = 2 , reward = 0.0
Episode 97: steps = 12 , reward = 1.0
Episode 98: steps = 2 , reward = 0.0
Episode 99: steps = 4 , reward = 0.0
Episode 100: steps = 9 , reward = 0.0
Episode 101: steps = 12 , reward = 0.0
Episode 102: steps = 26 , reward = 0.0
Episode 103: steps = 4 , reward = 0.0
Episode 104: steps = 23 , reward = 0.0
Episode 105: steps = 5 , reward = 0.0
Episode 106: steps = 2 , reward = 0.0
Episode 107: steps = 5 , reward = 0.0
Episode 108: steps = 2 , reward = 0.0
Episode 109: steps = 3 , reward = 0.0
Episode 110: steps = 7 , reward = 0.0
Episode 111: steps = 15 , reward = 0.0
Episode 112: steps = 13 , reward = 0.0
Episode 113: steps = 14 , reward = 0.0
Episode 114: steps = 16 , reward = 0.0
Episode 115: steps = 3 , reward = 0.0
Episode 116: steps = 4 , reward = 0.0
Episode 117: steps = 5 , reward = 0.0
Episode 118: steps = 4 , reward = 0.0
Episode 119: steps = 11 , reward = 0.0
Episode 120: steps = 8 , reward = 0.0
Episode 121: steps = 8 , reward = 0.0
Episode 122: steps = 2 , reward = 0.0
Episode 123: steps = 3 , reward = 0.0
Episode 124: steps = 4 , reward = 0.0
Episode 125: steps = 2 , reward = 0.0
Episode 126: steps = 9 , reward = 0.0
Episode 127: steps = 10 , reward = 0.0
Episode 128: steps = 8 , reward = 0.0
Episode 129: steps = 3 , reward = 0.0
Episode 130: steps = 19 , reward = 0.0
Episode 131: steps = 7 , reward = 0.0
Episode 132: steps = 4 , reward = 0.0
Episode 133: steps = 19 , reward = 0.0
Episode 134: steps = 10 , reward = 0.0
Episode 135: steps = 11 , reward = 0.0
Episode 136: steps = 8 , reward = 0.0
Episode 137: steps = 6 , reward = 0.0
Episode 138: steps = 5 , reward = 0.0
Episode 139: steps = 11 , reward = 0.0
Episode 140: steps = 13 , reward = 0.0
Episode 141: steps = 10 , reward = 0.0
Episode 142: steps = 3 , reward = 0.0
Episode 143: steps = 5 , reward = 0.0
Episode 144: steps = 3 , reward = 0.0
Episode 145: steps = 4 , reward = 0.0
Episode 146: steps = 7 , reward = 0.0
Episode 147: steps = 21 , reward = 0.0
Episode 148: steps = 19 , reward = 0.0
Episode 149: steps = 11 , reward = 0.0
Episode 150: steps = 9 , reward = 0.0
Episode 151: steps = 7 , reward = 0.0
Episode 152: steps = 5 , reward = 0.0
Episode 153: steps = 7 , reward = 0.0
Episode 154: steps = 2 , reward = 0.0
Episode 155: steps = 2 , reward = 0.0
Episode 156: steps = 7 , reward = 0.0
Episode 157: steps = 10 , reward = 0.0
Episode 158: steps = 3 , reward = 0.0
Episode 159: steps = 3 , reward = 0.0
Episode 160: steps = 5 , reward = 0.0
Episode 161: steps = 11 , reward = 1.0
Episode 162: steps = 8 , reward = 1.0
Episode 163: steps = 10 , reward = 0.0
Episode 164: steps = 17 , reward = 1.0
Episode 165: steps = 4 , reward = 0.0
Episode 166: steps = 4 , reward = 0.0
Episode 167: steps = 7 , reward = 1.0
Episode 168: steps = 10 , reward = 1.0
Episode 169: steps = 8 , reward = 1.0
Episode 170: steps = 6 , reward = 1.0
Episode 171: steps = 7 , reward = 1.0
Episode 172: steps = 6 , reward = 1.0
Episode 173: steps = 4 , reward = 0.0
Episode 174: steps = 6 , reward = 1.0
Episode 175: steps = 6 , reward = 1.0
Episode 176: steps = 6 , reward = 1.0
Episode 177: steps = 8 , reward = 0.0
Episode 178: steps = 6 , reward = 1.0
Episode 179: steps = 6 , reward = 1.0
Episode 180: steps = 6 , reward = 1.0
Episode 181: steps = 8 , reward = 0.0
Episode 182: steps = 8 , reward = 1.0
Episode 183: steps = 6 , reward = 1.0
Episode 184: steps = 6 , reward = 1.0
Episode 185: steps = 6 , reward = 1.0
Episode 186: steps = 6 , reward = 1.0
Episode 187: steps = 6 , reward = 1.0
Episode 188: steps = 9 , reward = 0.0
Episode 189: steps = 8 , reward = 1.0
Episode 190: steps = 6 , reward = 1.0
Episode 191: steps = 6 , reward = 1.0
Episode 192: steps = 6 , reward = 1.0
Episode 193: steps = 8 , reward = 1.0
Episode 194: steps = 6 , reward = 1.0
Episode 195: steps = 6 , reward = 1.0
Episode 196: steps = 6 , reward = 1.0
Episode 197: steps = 8 , reward = 1.0
Episode 198: steps = 6 , reward = 0.0
Episode 199: steps = 6 , reward = 1.0
Episode 200: steps = 5 , reward = 0.0
Episode 201: steps = 5 , reward = 0.0
Episode 202: steps = 6 , reward = 1.0
Episode 203: steps = 8 , reward = 1.0
Episode 204: steps = 8 , reward = 1.0
Episode 205: steps = 8 , reward = 1.0
Episode 206: steps = 2 , reward = 0.0
Episode 207: steps = 6 , reward = 1.0
Episode 208: steps = 6 , reward = 1.0
Episode 209: steps = 5 , reward = 0.0
Episode 210: steps = 9 , reward = 1.0
Episode 211: steps = 7 , reward = 0.0
Episode 212: steps = 6 , reward = 1.0
Episode 213: steps = 6 , reward = 1.0
Episode 214: steps = 9 , reward = 1.0
Episode 215: steps = 6 , reward = 1.0
Episode 216: steps = 8 , reward = 1.0
Episode 217: steps = 6 , reward = 1.0
Episode 218: steps = 8 , reward = 1.0
Episode 219: steps = 6 , reward = 1.0
Episode 220: steps = 4 , reward = 0.0
Episode 221: steps = 6 , reward = 1.0
Episode 222: steps = 6 , reward = 1.0
Episode 223: steps = 2 , reward = 0.0
Episode 224: steps = 6 , reward = 1.0
Episode 225: steps = 7 , reward = 1.0
Episode 226: steps = 6 , reward = 1.0
Episode 227: steps = 6 , reward = 1.0
Episode 228: steps = 6 , reward = 1.0
Episode 229: steps = 6 , reward = 1.0
Episode 230: steps = 6 , reward = 1.0
Episode 231: steps = 6 , reward = 1.0
Episode 232: steps = 6 , reward = 1.0
Episode 233: steps = 6 , reward = 1.0
Episode 234: steps = 6 , reward = 1.0
Episode 235: steps = 6 , reward = 1.0
Episode 236: steps = 7 , reward = 1.0
Episode 237: steps = 7 , reward = 1.0
Episode 238: steps = 6 , reward = 1.0
Episode 239: steps = 4 , reward = 0.0
Episode 240: steps = 4 , reward = 0.0
Episode 241: steps = 4 , reward = 0.0
Episode 242: steps = 2 , reward = 0.0
Episode 243: steps = 7 , reward = 1.0
Episode 244: steps = 6 , reward = 1.0
Episode 245: steps = 6 , reward = 1.0
Episode 246: steps = 4 , reward = 0.0
Episode 247: steps = 6 , reward = 1.0
Episode 248: steps = 6 , reward = 1.0
Episode 249: steps = 6 , reward = 1.0
Episode 250: steps = 4 , reward = 0.0
Episode 251: steps = 6 , reward = 1.0
Episode 252: steps = 6 , reward = 1.0
Episode 253: steps = 6 , reward = 1.0
Episode 254: steps = 8 , reward = 1.0
Episode 255: steps = 6 , reward = 1.0
Episode 256: steps = 9 , reward = 1.0
Episode 257: steps = 6 , reward = 1.0
Episode 258: steps = 8 , reward = 1.0
Episode 259: steps = 7 , reward = 1.0
Episode 260: steps = 6 , reward = 1.0
Episode 261: steps = 6 , reward = 1.0
Episode 262: steps = 6 , reward = 1.0
Episode 263: steps = 8 , reward = 1.0
Episode 264: steps = 6 , reward = 1.0
Episode 265: steps = 8 , reward = 1.0
Episode 266: steps = 6 , reward = 1.0
Episode 267: steps = 6 , reward = 1.0
Episode 268: steps = 4 , reward = 0.0
Episode 269: steps = 8 , reward = 0.0
Episode 270: steps = 6 , reward = 1.0
Episode 271: steps = 6 , reward = 1.0
Episode 272: steps = 8 , reward = 1.0
Episode 273: steps = 6 , reward = 1.0
Episode 274: steps = 7 , reward = 1.0
Episode 275: steps = 6 , reward = 1.0
Episode 276: steps = 4 , reward = 0.0
Episode 277: steps = 6 , reward = 1.0
Episode 278: steps = 6 , reward = 1.0
Episode 279: steps = 6 , reward = 1.0
Episode 280: steps = 4 , reward = 0.0
Episode 281: steps = 6 , reward = 1.0
Episode 282: steps = 6 , reward = 1.0
Episode 283: steps = 4 , reward = 0.0
Episode 284: steps = 6 , reward = 1.0
Episode 285: steps = 7 , reward = 1.0
Episode 286: steps = 9 , reward = 0.0
Episode 287: steps = 6 , reward = 1.0
Episode 288: steps = 6 , reward = 1.0
Episode 289: steps = 8 , reward = 1.0
Episode 290: steps = 7 , reward = 1.0
Episode 291: steps = 7 , reward = 1.0
Episode 292: steps = 6 , reward = 1.0
Episode 293: steps = 6 , reward = 1.0
Episode 294: steps = 8 , reward = 1.0
Episode 295: steps = 6 , reward = 1.0
Episode 296: steps = 6 , reward = 1.0
Episode 297: steps = 6 , reward = 1.0
Episode 298: steps = 6 , reward = 1.0
Episode 299: steps = 10 , reward = 0.0
Episode 300: steps = 8 , reward = 1.0
Episode 301: steps = 4 , reward = 0.0
Episode 302: steps = 8 , reward = 1.0
Episode 303: steps = 7 , reward = 1.0
Episode 304: steps = 6 , reward = 1.0
Episode 305: steps = 6 , reward = 1.0
Episode 306: steps = 6 , reward = 1.0
Episode 307: steps = 4 , reward = 0.0
Episode 308: steps = 6 , reward = 1.0
Episode 309: steps = 6 , reward = 1.0
Episode 310: steps = 5 , reward = 0.0
Episode 311: steps = 6 , reward = 1.0
Episode 312: steps = 8 , reward = 0.0
Episode 313: steps = 6 , reward = 1.0
Episode 314: steps = 6 , reward = 1.0
Episode 315: steps = 6 , reward = 1.0
Episode 316: steps = 6 , reward = 1.0
Episode 317: steps = 6 , reward = 1.0
Episode 318: steps = 7 , reward = 1.0
Episode 319: steps = 6 , reward = 1.0
Episode 320: steps = 6 , reward = 1.0
Episode 321: steps = 6 , reward = 1.0
Episode 322: steps = 6 , reward = 1.0
Episode 323: steps = 6 , reward = 1.0
Episode 324: steps = 10 , reward = 1.0
Episode 325: steps = 6 , reward = 1.0
Episode 326: steps = 6 , reward = 1.0
Episode 327: steps = 6 , reward = 1.0
Episode 328: steps = 6 , reward = 1.0
Episode 329: steps = 6 , reward = 1.0
Episode 330: steps = 6 , reward = 1.0
Episode 331: steps = 8 , reward = 1.0
Episode 332: steps = 6 , reward = 1.0
Episode 333: steps = 5 , reward = 0.0
Episode 334: steps = 5 , reward = 0.0
Episode 335: steps = 8 , reward = 1.0
Episode 336: steps = 6 , reward = 1.0
Episode 337: steps = 6 , reward = 1.0
Episode 338: steps = 6 , reward = 1.0
Episode 339: steps = 6 , reward = 1.0
Episode 340: steps = 6 , reward = 1.0
Episode 341: steps = 6 , reward = 1.0
Episode 342: steps = 6 , reward = 1.0
Episode 343: steps = 6 , reward = 1.0
Episode 344: steps = 6 , reward = 1.0
Episode 345: steps = 6 , reward = 1.0
Episode 346: steps = 6 , reward = 1.0
Episode 347: steps = 6 , reward = 1.0
Episode 348: steps = 7 , reward = 0.0
Episode 349: steps = 6 , reward = 1.0
Episode 350: steps = 6 , reward = 1.0
Episode 351: steps = 8 , reward = 1.0
Episode 352: steps = 7 , reward = 1.0
Episode 353: steps = 8 , reward = 1.0
Episode 354: steps = 6 , reward = 1.0
Episode 355: steps = 6 , reward = 1.0
Episode 356: steps = 8 , reward = 1.0
Episode 357: steps = 8 , reward = 1.0
Episode 358: steps = 6 , reward = 1.0
Episode 359: steps = 6 , reward = 1.0
Episode 360: steps = 6 , reward = 1.0
Episode 361: steps = 9 , reward = 1.0
Episode 362: steps = 6 , reward = 1.0
Episode 363: steps = 6 , reward = 1.0
Episode 364: steps = 7 , reward = 1.0
Episode 365: steps = 6 , reward = 1.0
Episode 366: steps = 8 , reward = 1.0
Episode 367: steps = 6 , reward = 1.0
Episode 368: steps = 2 , reward = 0.0
Episode 369: steps = 6 , reward = 1.0
Episode 370: steps = 6 , reward = 1.0
Episode 371: steps = 6 , reward = 0.0
Episode 372: steps = 6 , reward = 1.0
Episode 373: steps = 8 , reward = 1.0
Episode 374: steps = 6 , reward = 1.0
Episode 375: steps = 6 , reward = 1.0
Episode 376: steps = 6 , reward = 1.0
Episode 377: steps = 6 , reward = 1.0
Episode 378: steps = 6 , reward = 1.0
Episode 379: steps = 6 , reward = 1.0
Episode 380: steps = 6 , reward = 1.0
Episode 381: steps = 6 , reward = 1.0
Episode 382: steps = 6 , reward = 1.0
Episode 383: steps = 6 , reward = 1.0
Episode 384: steps = 6 , reward = 1.0
Episode 385: steps = 6 , reward = 1.0
Episode 386: steps = 6 , reward = 1.0
Episode 387: steps = 6 , reward = 1.0
Episode 388: steps = 6 , reward = 1.0
Episode 389: steps = 6 , reward = 1.0
Episode 390: steps = 8 , reward = 1.0
Episode 391: steps = 6 , reward = 1.0
Episode 392: steps = 2 , reward = 0.0
Episode 393: steps = 6 , reward = 1.0
Episode 394: steps = 6 , reward = 1.0
Episode 395: steps = 6 , reward = 1.0
Episode 396: steps = 7 , reward = 1.0
Episode 397: steps = 6 , reward = 1.0
Episode 398: steps = 8 , reward = 1.0
Episode 399: steps = 8 , reward = 1.0
Episode 400: steps = 6 , reward = 1.0
Episode 401: steps = 6 , reward = 1.0
Episode 402: steps = 8 , reward = 1.0
Episode 403: steps = 9 , reward = 1.0
Episode 404: steps = 8 , reward = 1.0
Episode 405: steps = 6 , reward = 1.0
Episode 406: steps = 6 , reward = 1.0
Episode 407: steps = 8 , reward = 1.0
Episode 408: steps = 6 , reward = 1.0
Episode 409: steps = 6 , reward = 1.0
Episode 410: steps = 4 , reward = 0.0
Episode 411: steps = 6 , reward = 1.0
Episode 412: steps = 6 , reward = 1.0
Episode 413: steps = 2 , reward = 0.0
Episode 414: steps = 6 , reward = 1.0
Episode 415: steps = 6 , reward = 1.0
Episode 416: steps = 6 , reward = 1.0
Episode 417: steps = 10 , reward = 1.0
Episode 418: steps = 6 , reward = 1.0
Episode 419: steps = 6 , reward = 1.0
Episode 420: steps = 6 , reward = 1.0
Episode 421: steps = 5 , reward = 0.0
Episode 422: steps = 6 , reward = 1.0
Episode 423: steps = 8 , reward = 1.0
Episode 424: steps = 6 , reward = 1.0
Episode 425: steps = 6 , reward = 1.0
Episode 426: steps = 6 , reward = 1.0
Episode 427: steps = 6 , reward = 1.0
Episode 428: steps = 6 , reward = 1.0
Episode 429: steps = 10 , reward = 1.0
Episode 430: steps = 4 , reward = 0.0
Episode 431: steps = 2 , reward = 0.0
Episode 432: steps = 6 , reward = 1.0
Episode 433: steps = 8 , reward = 1.0
Episode 434: steps = 6 , reward = 1.0
Episode 435: steps = 4 , reward = 0.0
Episode 436: steps = 6 , reward = 1.0
Episode 437: steps = 6 , reward = 1.0
Episode 438: steps = 4 , reward = 0.0
Episode 439: steps = 6 , reward = 1.0
Episode 440: steps = 6 , reward = 1.0
Episode 441: steps = 6 , reward = 1.0
Episode 442: steps = 6 , reward = 1.0
Episode 443: steps = 6 , reward = 1.0
Episode 444: steps = 6 , reward = 1.0
Episode 445: steps = 6 , reward = 1.0
Episode 446: steps = 8 , reward = 1.0
Episode 447: steps = 5 , reward = 0.0
Episode 448: steps = 6 , reward = 1.0
Episode 449: steps = 6 , reward = 1.0
Episode 450: steps = 8 , reward = 1.0
Episode 451: steps = 5 , reward = 0.0
Episode 452: steps = 6 , reward = 1.0
Episode 453: steps = 6 , reward = 1.0
Episode 454: steps = 10 , reward = 1.0
Episode 455: steps = 6 , reward = 1.0
Episode 456: steps = 6 , reward = 0.0
Episode 457: steps = 6 , reward = 1.0
Episode 458: steps = 6 , reward = 1.0
Episode 459: steps = 6 , reward = 1.0
Episode 460: steps = 8 , reward = 1.0
Episode 461: steps = 8 , reward = 1.0
Episode 462: steps = 6 , reward = 1.0
Episode 463: steps = 8 , reward = 1.0
Episode 464: steps = 6 , reward = 1.0
Episode 465: steps = 2 , reward = 0.0
Episode 466: steps = 6 , reward = 1.0
Episode 467: steps = 6 , reward = 1.0
Episode 468: steps = 6 , reward = 1.0
Episode 469: steps = 8 , reward = 1.0
Episode 470: steps = 4 , reward = 0.0
Episode 471: steps = 6 , reward = 1.0
Episode 472: steps = 6 , reward = 1.0
Episode 473: steps = 8 , reward = 1.0
Episode 474: steps = 6 , reward = 1.0
Episode 475: steps = 4 , reward = 0.0
Episode 476: steps = 6 , reward = 1.0
Episode 477: steps = 6 , reward = 1.0
Episode 478: steps = 4 , reward = 0.0
Episode 479: steps = 4 , reward = 0.0
Episode 480: steps = 6 , reward = 1.0
Episode 481: steps = 6 , reward = 1.0
Episode 482: steps = 8 , reward = 1.0
Episode 483: steps = 8 , reward = 1.0
Episode 484: steps = 7 , reward = 1.0
Episode 485: steps = 8 , reward = 1.0
Episode 486: steps = 6 , reward = 1.0
Episode 487: steps = 6 , reward = 1.0
Episode 488: steps = 6 , reward = 1.0
Episode 489: steps = 6 , reward = 1.0
Episode 490: steps = 6 , reward = 1.0
Episode 491: steps = 6 , reward = 1.0
Episode 492: steps = 10 , reward = 1.0
Episode 493: steps = 6 , reward = 1.0
Episode 494: steps = 7 , reward = 1.0
Episode 495: steps = 6 , reward = 1.0
Episode 496: steps = 6 , reward = 1.0
Episode 497: steps = 6 , reward = 1.0
Episode 498: steps = 5 , reward = 0.0
Episode 499: steps = 4 , reward = 0.0
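The training log above is long, so it can help to summarize it instead of reading it line by line. The following is a minimal sketch, not part of the original script: `LINE_RE`, `parse_log` and `success_rate` are illustrative names, and the three sample lines are copied from the log above only as a smoke test.

```python
import re

# Matches lines such as "Episode 97: steps = 12 , reward = 1.0".
LINE_RE = re.compile(r"Episode (\d+): steps = (\d+) , reward = ([0-9.]+)")


def parse_log(text):
    """Return a list of (episode, steps, reward) tuples found in text."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            rows.append((int(m.group(1)), int(m.group(2)), float(m.group(3))))
    return rows


def success_rate(rows, window=100):
    """Fraction of episodes with reward > 0 in each consecutive window."""
    rates = []
    for start in range(0, len(rows), window):
        chunk = rows[start:start + window]
        wins = sum(1 for _, _, r in chunk if r > 0)
        rates.append(wins / len(chunk))
    return rates


# Three lines copied from the log above (illustrative input only).
sample = """Episode 96: steps = 2 , reward = 0.0
Episode 97: steps = 12 , reward = 1.0
Episode 98: steps = 2 , reward = 0.0"""

rows = parse_log(sample)
print(rows)
print(success_rate(rows, window=3))
```

Feeding the full log to `parse_log` and calling `success_rate(rows, window=100)` would give one success fraction per block of 100 episodes. In the log above the first reward appears at episode 97, and rewards become frequent from roughly episode 161 onward. The frames that follow are the rendered greedy test episode.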
  (Right)
S[F]FF
FHFH
FFFH
HFFG
  (Right)
SF[F]F
FHFH
FFFH
HFFG
  (Down)
SFFF
FH[F]H
FFFH
HFFG
  (Down)
SFFF
FHFH
FF[F]H
HFFG
  (Down)
SFFF
FHFH
FFFH
HF[F]G
  (Right)
SFFF
FHFH
FFFH
HFF[G]
test reward = 1.0
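The Sarsa agent above and the Q-learning agent in the next script differ only in the target used inside `learn`: Sarsa bootstraps from the action it will actually take next, while Q-learning bootstraps from the best action in the next state. The sketch below puts the two targets side by side. The values in `q_next` are hypothetical numbers chosen for illustration, and `sarsa_target` / `q_learning_target` are helper names that do not appear in the original code.

```python
import numpy as np

gamma = 0.9  # same discount factor as in the scripts in this post

# Hypothetical Q-values of the next state for the 4 actions
# (0 left, 1 down, 2 right, 3 up). These numbers are made up.
q_next = np.array([0.0, 0.5, 0.2, 0.0])
reward = 0.0


def sarsa_target(reward, q_next, next_action, done, gamma=gamma):
    """On-policy target: bootstrap from the action actually chosen next."""
    return reward if done else reward + gamma * q_next[next_action]


def q_learning_target(reward, q_next, done, gamma=gamma):
    """Off-policy target: bootstrap from the greedy (max) action."""
    return reward if done else reward + gamma * np.max(q_next)


# Suppose epsilon-greedy exploration picked action 2 instead of the greedy action 1.
print(sarsa_target(reward, q_next, next_action=2, done=False))
print(q_learning_target(reward, q_next, done=False))
```

When the exploratory action is not the greedy one, the Sarsa target is smaller than the Q-learning target, because it reflects what the exploring policy actually does. When `next_action` happens to be the greedy action, the two targets coincide.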
import gym
import numpy as np
import time


class QLearningAgent(object):
    def __init__(self, obs_n, act_n, learning_rate=0.01, gamma=0.9, e_greed=0.1):
        self.act_n = act_n      # number of available actions
        self.lr = learning_rate # learning rate
        self.gamma = gamma      # reward discount factor
        self.epsilon = e_greed  # probability of choosing a random action
        self.Q = np.zeros((obs_n, act_n))

    # Sample an action for the given observation, with exploration
    def sample(self, obs):
        if np.random.uniform(0, 1) < (1.0 - self.epsilon):  # exploit: choose by the Q-values in the table
            action = self.predict(obs)
        else:
            action = np.random.choice(self.act_n)  # explore: choose a random action
        return action

    # Predict the greedy action for the given observation
    def predict(self, obs):
        Q_list = self.Q[obs, :]
        maxQ = np.max(Q_list)
        action_list = np.where(Q_list == maxQ)[0]  # several actions may share maxQ
        action = np.random.choice(action_list)
        return action

    # Learning step: update the Q-table
    def learn(self, obs, action, reward, next_obs, done):
        """ off-policy
            obs: observation before the interaction, s_t
            action: action chosen in this interaction, a_t
            reward: reward r received for this action
            next_obs: observation after the interaction, s_t+1
            done: whether the episode has ended
        """

        predict_Q = self.Q[obs, action]
        if done:
            target_Q = reward  # terminal step: there is no next state
        else:
            target_Q = reward + self.gamma * np.max(self.Q[next_obs, :])  # Q-learning
        self.Q[obs, action] += self.lr * (target_Q - predict_Q)  # update Q

    # Save the Q-table to a file
    def save(self):
        npy_file = './q_table.npy'
        np.save(npy_file, self.Q)
        print(npy_file + ' saved.')

    # Load the Q-table from a file
    def restore(self, npy_file='./q_table.npy'):
        self.Q = np.load(npy_file)
        print(npy_file + ' loaded.')


def run_episode(env, agent, render=False):
    total_steps = 0  # number of steps taken in this episode
    total_reward = 0

    obs = env.reset()  # reset the environment to start a new episode

    while True:
        action = agent.sample(obs)  # choose an action (epsilon-greedy)
        next_obs, reward, done, _ = env.step(action)  # take one step in the environment
        # Q-learning update
        agent.learn(obs, action, reward, next_obs, done)

        obs = next_obs  # the next observation becomes the current one
        total_reward += reward
        total_steps += 1  # count steps
        if render:
            env.render()  # render a new frame
        if done:
            break
    return total_reward, total_steps


def test_episode(env, agent):
    total_reward = 0
    obs = env.reset()
    while True:
        action = agent.predict(obs) # greedy
        next_obs, reward, done, _ = env.step(action)
        total_reward += reward
        obs = next_obs
        # time.sleep(0.5)
        # env.render()
        if done:
            break
    return total_reward



# Create the maze environment with gym; is_slippery=False lowers the difficulty
env = gym.make("FrozenLake-v0", is_slippery=False)  # 0 left, 1 down, 2 right, 3 up

# Create an agent instance and pass in the hyperparameters
agent = QLearningAgent(
        obs_n=env.observation_space.n,
        act_n=env.action_space.n,
        learning_rate=0.1,
        gamma=0.9,
        e_greed=0.1)


# Train for 500 episodes and print each episode's score
for episode in range(500):
    ep_reward, ep_steps = run_episode(env, agent, False)
    print('Episode %s: steps = %s , reward = %.1f' % (episode, ep_steps, ep_reward))

# Training is finished; check how the learned policy performs
test_reward = test_episode(env, agent)
print('test reward = %.1f' % (test_reward))
Episode 0: steps = 2 , reward = 0.0
Episode 1: steps = 20 , reward = 0.0
Episode 2: steps = 8 , reward = 0.0
Episode 3: steps = 7 , reward = 0.0
Episode 4: steps = 6 , reward = 0.0
Episode 5: steps = 2 , reward = 0.0
Episode 6: steps = 15 , reward = 0.0
Episode 7: steps = 2 , reward = 0.0
Episode 8: steps = 2 , reward = 0.0
Episode 9: steps = 12 , reward = 0.0
Episode 10: steps = 14 , reward = 0.0
Episode 11: steps = 3 , reward = 0.0
Episode 12: steps = 5 , reward = 0.0
Episode 13: steps = 9 , reward = 0.0
Episode 14: steps = 6 , reward = 0.0
Episode 15: steps = 6 , reward = 0.0
Episode 16: steps = 15 , reward = 0.0
Episode 17: steps = 11 , reward = 0.0
Episode 18: steps = 10 , reward = 0.0
Episode 19: steps = 12 , reward = 0.0
Episode 20: steps = 7 , reward = 0.0
Episode 21: steps = 6 , reward = 0.0
Episode 22: steps = 5 , reward = 0.0
Episode 23: steps = 17 , reward = 0.0
Episode 24: steps = 2 , reward = 0.0
Episode 25: steps = 4 , reward = 0.0
Episode 26: steps = 23 , reward = 0.0
Episode 27: steps = 11 , reward = 0.0
Episode 28: steps = 7 , reward = 0.0
Episode 29: steps = 5 , reward = 0.0
Episode 30: steps = 4 , reward = 0.0
Episode 31: steps = 9 , reward = 0.0
Episode 32: steps = 5 , reward = 0.0
Episode 33: steps = 2 , reward = 0.0
Episode 34: steps = 15 , reward = 0.0
Episode 35: steps = 9 , reward = 0.0
Episode 36: steps = 14 , reward = 0.0
Episode 37: steps = 5 , reward = 0.0
Episode 38: steps = 8 , reward = 0.0
Episode 39: steps = 22 , reward = 0.0
Episode 40: steps = 4 , reward = 0.0
Episode 41: steps = 11 , reward = 0.0
Episode 42: steps = 16 , reward = 0.0
Episode 43: steps = 6 , reward = 0.0
Episode 44: steps = 5 , reward = 0.0
Episode 45: steps = 10 , reward = 0.0
Episode 46: steps = 11 , reward = 0.0
Episode 47: steps = 4 , reward = 0.0
Episode 48: steps = 6 , reward = 0.0
Episode 49: steps = 4 , reward = 0.0
Episode 50: steps = 3 , reward = 0.0
Episode 51: steps = 3 , reward = 0.0
Episode 52: steps = 17 , reward = 0.0
Episode 53: steps = 2 , reward = 0.0
Episode 54: steps = 3 , reward = 0.0
Episode 55: steps = 15 , reward = 0.0
Episode 56: steps = 2 , reward = 0.0
Episode 57: steps = 6 , reward = 0.0
Episode 58: steps = 4 , reward = 0.0
Episode 59: steps = 10 , reward = 0.0
Episode 60: steps = 3 , reward = 0.0
Episode 61: steps = 6 , reward = 0.0
Episode 62: steps = 9 , reward = 0.0
Episode 63: steps = 6 , reward = 0.0
Episode 64: steps = 15 , reward = 0.0
Episode 65: steps = 7 , reward = 1.0
Episode 66: steps = 8 , reward = 0.0
Episode 67: steps = 9 , reward = 0.0
Episode 68: steps = 2 , reward = 0.0
Episode 69: steps = 19 , reward = 1.0
Episode 70: steps = 18 , reward = 0.0
Episode 71: steps = 2 , reward = 0.0
Episode 72: steps = 10 , reward = 0.0
Episode 73: steps = 10 , reward = 0.0
Episode 74: steps = 10 , reward = 0.0
Episode 75: steps = 5 , reward = 0.0
Episode 76: steps = 10 , reward = 0.0
Episode 77: steps = 17 , reward = 0.0
Episode 78: steps = 4 , reward = 0.0
Episode 79: steps = 5 , reward = 0.0
Episode 80: steps = 3 , reward = 0.0
Episode 81: steps = 9 , reward = 0.0
Episode 82: steps = 12 , reward = 0.0
Episode 83: steps = 2 , reward = 0.0
Episode 84: steps = 10 , reward = 0.0
Episode 85: steps = 5 , reward = 0.0
Episode 86: steps = 5 , reward = 0.0
Episode 87: steps = 5 , reward = 0.0
Episode 88: steps = 6 , reward = 0.0
Episode 89: steps = 7 , reward = 0.0
Episode 90: steps = 7 , reward = 0.0
Episode 91: steps = 3 , reward = 0.0
Episode 92: steps = 8 , reward = 0.0
Episode 93: steps = 5 , reward = 0.0
Episode 94: steps = 10 , reward = 0.0
Episode 95: steps = 8 , reward = 0.0
Episode 96: steps = 2 , reward = 0.0
Episode 97: steps = 4 , reward = 0.0
Episode 98: steps = 9 , reward = 0.0
Episode 99: steps = 5 , reward = 0.0
Episode 100: steps = 18 , reward = 0.0
Episode 101: steps = 11 , reward = 0.0
Episode 102: steps = 3 , reward = 0.0
Episode 103: steps = 8 , reward = 0.0
Episode 104: steps = 6 , reward = 0.0
Episode 105: steps = 21 , reward = 1.0
Episode 106: steps = 8 , reward = 1.0
Episode 107: steps = 2 , reward = 0.0
Episode 108: steps = 3 , reward = 0.0
Episode 109: steps = 3 , reward = 0.0
Episode 110: steps = 4 , reward = 0.0
Episode 111: steps = 8 , reward = 0.0
Episode 112: steps = 2 , reward = 0.0
Episode 113: steps = 8 , reward = 0.0
Episode 114: steps = 9 , reward = 0.0
Episode 115: steps = 6 , reward = 0.0
Episode 116: steps = 7 , reward = 1.0
Episode 117: steps = 6 , reward = 0.0
Episode 118: steps = 6 , reward = 0.0
Episode 119: steps = 12 , reward = 1.0
Episode 120: steps = 8 , reward = 1.0
Episode 121: steps = 9 , reward = 1.0
Episode 122: steps = 12 , reward = 1.0
Episode 123: steps = 6 , reward = 1.0
Episode 124: steps = 3 , reward = 0.0
Episode 125: steps = 6 , reward = 1.0
Episode 126: steps = 6 , reward = 1.0
Episode 127: steps = 6 , reward = 1.0
Episode 128: steps = 8 , reward = 1.0
Episode 129: steps = 6 , reward = 1.0
Episode 130: steps = 6 , reward = 1.0
Episode 131: steps = 7 , reward = 1.0
Episode 132: steps = 8 , reward = 1.0
Episode 133: steps = 6 , reward = 1.0
Episode 134: steps = 7 , reward = 1.0
Episode 135: steps = 6 , reward = 1.0
Episode 136: steps = 6 , reward = 1.0
Episode 137: steps = 7 , reward = 1.0
Episode 138: steps = 6 , reward = 1.0
Episode 139: steps = 8 , reward = 1.0
Episode 140: steps = 6 , reward = 1.0
Episode 141: steps = 6 , reward = 1.0
Episode 142: steps = 6 , reward = 1.0
Episode 143: steps = 6 , reward = 1.0
Episode 144: steps = 2 , reward = 0.0
Episode 145: steps = 6 , reward = 1.0
Episode 146: steps = 6 , reward = 1.0
Episode 147: steps = 6 , reward = 1.0
Episode 148: steps = 8 , reward = 1.0
Episode 149: steps = 10 , reward = 1.0
Episode 150: steps = 6 , reward = 1.0
Episode 151: steps = 6 , reward = 1.0
Episode 152: steps = 3 , reward = 0.0
Episode 153: steps = 2 , reward = 0.0
Episode 154: steps = 8 , reward = 1.0
Episode 155: steps = 6 , reward = 1.0
Episode 156: steps = 6 , reward = 1.0
Episode 157: steps = 6 , reward = 1.0
Episode 158: steps = 6 , reward = 1.0
Episode 159: steps = 6 , reward = 1.0
Episode 160: steps = 6 , reward = 1.0
Episode 161: steps = 7 , reward = 1.0
Episode 162: steps = 8 , reward = 1.0
Episode 163: steps = 5 , reward = 0.0
Episode 164: steps = 6 , reward = 1.0
Episode 165: steps = 6 , reward = 1.0
Episode 166: steps = 6 , reward = 1.0
Episode 167: steps = 6 , reward = 1.0
Episode 168: steps = 6 , reward = 1.0
Episode 169: steps = 6 , reward = 1.0
Episode 170: steps = 3 , reward = 0.0
Episode 171: steps = 6 , reward = 1.0
Episode 172: steps = 6 , reward = 1.0
Episode 173: steps = 6 , reward = 1.0
Episode 174: steps = 6 , reward = 1.0
Episode 175: steps = 8 , reward = 1.0
Episode 176: steps = 9 , reward = 1.0
Episode 177: steps = 6 , reward = 1.0
Episode 178: steps = 4 , reward = 0.0
Episode 179: steps = 6 , reward = 1.0
Episode 180: steps = 6 , reward = 1.0
Episode 181: steps = 8 , reward = 1.0
Episode 182: steps = 6 , reward = 1.0
Episode 183: steps = 6 , reward = 1.0
Episode 184: steps = 6 , reward = 1.0
Episode 185: steps = 6 , reward = 1.0
Episode 186: steps = 6 , reward = 1.0
Episode 187: steps = 8 , reward = 1.0
Episode 188: steps = 7 , reward = 1.0
Episode 189: steps = 6 , reward = 1.0
Episode 190: steps = 8 , reward = 1.0
Episode 191: steps = 6 , reward = 1.0
Episode 192: steps = 4 , reward = 0.0
Episode 193: steps = 6 , reward = 1.0
Episode 194: steps = 6 , reward = 1.0
Episode 195: steps = 9 , reward = 1.0
Episode 196: steps = 6 , reward = 1.0
Episode 197: steps = 6 , reward = 1.0
Episode 198: steps = 7 , reward = 1.0
Episode 199: steps = 6 , reward = 1.0
Episode 200: steps = 7 , reward = 1.0
Episode 201: steps = 6 , reward = 1.0
Episode 202: steps = 6 , reward = 1.0
Episode 203: steps = 7 , reward = 1.0
Episode 204: steps = 6 , reward = 1.0
Episode 205: steps = 8 , reward = 1.0
Episode 206: steps = 3 , reward = 0.0
Episode 207: steps = 8 , reward = 1.0
Episode 208: steps = 7 , reward = 1.0
Episode 209: steps = 6 , reward = 1.0
Episode 210: steps = 6 , reward = 1.0
Episode 211: steps = 6 , reward = 1.0
Episode 212: steps = 6 , reward = 1.0
Episode 213: steps = 6 , reward = 1.0
Episode 214: steps = 6 , reward = 1.0
Episode 215: steps = 7 , reward = 1.0
Episode 216: steps = 4 , reward = 0.0
Episode 217: steps = 6 , reward = 1.0
Episode 218: steps = 6 , reward = 1.0
Episode 219: steps = 6 , reward = 1.0
Episode 220: steps = 6 , reward = 1.0
Episode 221: steps = 6 , reward = 1.0
Episode 222: steps = 12 , reward = 1.0
Episode 223: steps = 8 , reward = 1.0
Episode 224: steps = 6 , reward = 1.0
Episode 225: steps = 8 , reward = 1.0
Episode 226: steps = 6 , reward = 1.0
Episode 227: steps = 6 , reward = 1.0
Episode 228: steps = 6 , reward = 1.0
Episode 229: steps = 6 , reward = 1.0
Episode 230: steps = 2 , reward = 0.0
Episode 231: steps = 6 , reward = 1.0
Episode 232: steps = 8 , reward = 1.0
Episode 233: steps = 6 , reward = 1.0
Episode 234: steps = 6 , reward = 1.0
Episode 235: steps = 6 , reward = 1.0
Episode 236: steps = 6 , reward = 1.0
Episode 237: steps = 6 , reward = 1.0
Episode 238: steps = 6 , reward = 1.0
Episode 239: steps = 6 , reward = 1.0
Episode 240: steps = 7 , reward = 1.0
Episode 241: steps = 6 , reward = 1.0
Episode 242: steps = 2 , reward = 0.0
Episode 243: steps = 6 , reward = 1.0
Episode 244: steps = 6 , reward = 1.0
Episode 245: steps = 7 , reward = 1.0
Episode 246: steps = 7 , reward = 1.0
Episode 247: steps = 8 , reward = 0.0
Episode 248: steps = 6 , reward = 1.0
Episode 249: steps = 5 , reward = 0.0
Episode 250: steps = 7 , reward = 1.0
Episode 251: steps = 6 , reward = 1.0
Episode 252: steps = 8 , reward = 1.0
Episode 253: steps = 6 , reward = 1.0
Episode 254: steps = 4 , reward = 0.0
Episode 255: steps = 4 , reward = 0.0
Episode 256: steps = 7 , reward = 1.0
Episode 257: steps = 6 , reward = 1.0
Episode 258: steps = 8 , reward = 1.0
Episode 259: steps = 6 , reward = 1.0
Episode 260: steps = 6 , reward = 1.0
Episode 261: steps = 6 , reward = 1.0
Episode 262: steps = 8 , reward = 1.0
Episode 263: steps = 7 , reward = 1.0
Episode 264: steps = 6 , reward = 1.0
Episode 265: steps = 6 , reward = 1.0
Episode 266: steps = 6 , reward = 1.0
Episode 267: steps = 6 , reward = 1.0
Episode 268: steps = 8 , reward = 1.0
Episode 269: steps = 6 , reward = 0.0
Episode 270: steps = 6 , reward = 1.0
Episode 271: steps = 7 , reward = 1.0
Episode 272: steps = 4 , reward = 0.0
Episode 273: steps = 6 , reward = 1.0
Episode 274: steps = 2 , reward = 0.0
Episode 275: steps = 8 , reward = 1.0
Episode 276: steps = 6 , reward = 1.0
Episode 277: steps = 6 , reward = 1.0
Episode 278: steps = 5 , reward = 0.0
Episode 279: steps = 6 , reward = 1.0
Episode 280: steps = 6 , reward = 1.0
Episode 281: steps = 6 , reward = 1.0
Episode 282: steps = 7 , reward = 1.0
Episode 283: steps = 6 , reward = 1.0
Episode 284: steps = 6 , reward = 1.0
Episode 285: steps = 6 , reward = 1.0
Episode 286: steps = 6 , reward = 1.0
Episode 287: steps = 7 , reward = 1.0
Episode 288: steps = 6 , reward = 1.0
Episode 289: steps = 6 , reward = 1.0
Episode 290: steps = 6 , reward = 1.0
Episode 291: steps = 6 , reward = 1.0
Episode 292: steps = 6 , reward = 1.0
Episode 293: steps = 6 , reward = 1.0
Episode 294: steps = 6 , reward = 1.0
Episode 295: steps = 7 , reward = 1.0
Episode 296: steps = 6 , reward = 1.0
Episode 297: steps = 8 , reward = 1.0
Episode 298: steps = 6 , reward = 1.0
Episode 299: steps = 6 , reward = 1.0
Episode 300: steps = 2 , reward = 0.0
Episode 301: steps = 6 , reward = 1.0
Episode 302: steps = 6 , reward = 1.0
Episode 303: steps = 6 , reward = 1.0
Episode 304: steps = 3 , reward = 0.0
Episode 305: steps = 7 , reward = 1.0
Episode 306: steps = 6 , reward = 1.0
Episode 307: steps = 6 , reward = 1.0
Episode 308: steps = 10 , reward = 1.0
Episode 309: steps = 7 , reward = 1.0
Episode 310: steps = 6 , reward = 1.0
Episode 311: steps = 10 , reward = 1.0
Episode 312: steps = 6 , reward = 1.0
Episode 313: steps = 6 , reward = 1.0
Episode 314: steps = 6 , reward = 1.0
Episode 315: steps = 8 , reward = 1.0
Episode 316: steps = 6 , reward = 1.0
Episode 317: steps = 6 , reward = 1.0
Episode 318: steps = 6 , reward = 1.0
Episode 319: steps = 7 , reward = 1.0
Episode 320: steps = 6 , reward = 1.0
Episode 321: steps = 6 , reward = 1.0
Episode 322: steps = 6 , reward = 1.0
Episode 323: steps = 8 , reward = 1.0
Episode 324: steps = 7 , reward = 1.0
Episode 325: steps = 7 , reward = 1.0
Episode 326: steps = 6 , reward = 1.0
Episode 327: steps = 6 , reward = 1.0
Episode 328: steps = 6 , reward = 1.0
Episode 329: steps = 6 , reward = 1.0
Episode 330: steps = 8 , reward = 1.0
Episode 331: steps = 6 , reward = 1.0
Episode 332: steps = 8 , reward = 1.0
Episode 333: steps = 6 , reward = 1.0
Episode 334: steps = 8 , reward = 1.0
Episode 335: steps = 6 , reward = 1.0
Episode 336: steps = 6 , reward = 1.0
Episode 337: steps = 6 , reward = 1.0
Episode 338: steps = 11 , reward = 1.0
Episode 339: steps = 2 , reward = 0.0
Episode 340: steps = 6 , reward = 1.0
Episode 341: steps = 10 , reward = 0.0
Episode 342: steps = 6 , reward = 1.0
Episode 343: steps = 6 , reward = 1.0
Episode 344: steps = 6 , reward = 1.0
Episode 345: steps = 8 , reward = 1.0
Episode 346: steps = 9 , reward = 1.0
Episode 347: steps = 6 , reward = 1.0
Episode 348: steps = 6 , reward = 1.0
Episode 349: steps = 6 , reward = 1.0
Episode 350: steps = 6 , reward = 1.0
Episode 351: steps = 6 , reward = 1.0
Episode 352: steps = 6 , reward = 1.0
Episode 353: steps = 8 , reward = 1.0
Episode 354: steps = 6 , reward = 1.0
Episode 355: steps = 8 , reward = 1.0
Episode 356: steps = 6 , reward = 1.0
Episode 357: steps = 6 , reward = 1.0
Episode 358: steps = 6 , reward = 1.0
Episode 359: steps = 6 , reward = 1.0
Episode 360: steps = 6 , reward = 1.0
Episode 361: steps = 6 , reward = 1.0
Episode 362: steps = 6 , reward = 1.0
Episode 363: steps = 6 , reward = 1.0
Episode 364: steps = 6 , reward = 1.0
Episode 365: steps = 6 , reward = 1.0
Episode 366: steps = 6 , reward = 1.0
Episode 367: steps = 6 , reward = 1.0
Episode 368: steps = 6 , reward = 1.0
Episode 369: steps = 6 , reward = 1.0
Episode 370: steps = 6 , reward = 1.0
Episode 371: steps = 6 , reward = 1.0
Episode 372: steps = 6 , reward = 1.0
Episode 373: steps = 8 , reward = 1.0
Episode 374: steps = 6 , reward = 1.0
Episode 375: steps = 7 , reward = 1.0
Episode 376: steps = 10 , reward = 1.0
Episode 377: steps = 6 , reward = 1.0
Episode 378: steps = 6 , reward = 1.0
Episode 379: steps = 6 , reward = 1.0
Episode 380: steps = 6 , reward = 1.0
Episode 381: steps = 8 , reward = 1.0
Episode 382: steps = 8 , reward = 1.0
Episode 383: steps = 8 , reward = 1.0
Episode 384: steps = 6 , reward = 1.0
Episode 385: steps = 7 , reward = 0.0
Episode 386: steps = 4 , reward = 0.0
Episode 387: steps = 5 , reward = 0.0
Episode 388: steps = 5 , reward = 0.0
Episode 389: steps = 10 , reward = 1.0
Episode 390: steps = 6 , reward = 1.0
Episode 391: steps = 6 , reward = 1.0
Episode 392: steps = 6 , reward = 1.0
Episode 393: steps = 6 , reward = 1.0
Episode 394: steps = 7 , reward = 1.0
Episode 395: steps = 6 , reward = 1.0
Episode 396: steps = 6 , reward = 1.0
Episode 397: steps = 6 , reward = 1.0
Episode 398: steps = 6 , reward = 1.0
Episode 399: steps = 6 , reward = 1.0
Episode 400: steps = 6 , reward = 1.0
Episode 401: steps = 8 , reward = 1.0
Episode 402: steps = 6 , reward = 1.0
Episode 403: steps = 8 , reward = 1.0
Episode 404: steps = 2 , reward = 0.0
Episode 405: steps = 6 , reward = 1.0
Episode 406: steps = 6 , reward = 1.0
Episode 407: steps = 6 , reward = 1.0
Episode 408: steps = 6 , reward = 1.0
Episode 409: steps = 6 , reward = 1.0
Episode 410: steps = 6 , reward = 1.0
Episode 411: steps = 5 , reward = 0.0
Episode 412: steps = 3 , reward = 0.0
Episode 413: steps = 8 , reward = 1.0
Episode 414: steps = 6 , reward = 0.0
Episode 415: steps = 6 , reward = 1.0
Episode 416: steps = 6 , reward = 1.0
Episode 417: steps = 7 , reward = 1.0
Episode 418: steps = 6 , reward = 1.0
Episode 419: steps = 6 , reward = 1.0
Episode 420: steps = 6 , reward = 1.0
Episode 421: steps = 6 , reward = 1.0
Episode 422: steps = 8 , reward = 1.0
Episode 423: steps = 6 , reward = 1.0
Episode 424: steps = 7 , reward = 1.0
Episode 425: steps = 6 , reward = 1.0
Episode 426: steps = 6 , reward = 1.0
Episode 427: steps = 6 , reward = 1.0
Episode 428: steps = 10 , reward = 1.0
Episode 429: steps = 6 , reward = 1.0
Episode 430: steps = 8 , reward = 1.0
Episode 431: steps = 6 , reward = 1.0
Episode 432: steps = 6 , reward = 1.0
Episode 433: steps = 6 , reward = 1.0
Episode 434: steps = 6 , reward = 1.0
Episode 435: steps = 4 , reward = 0.0
Episode 436: steps = 6 , reward = 1.0
Episode 437: steps = 6 , reward = 1.0
Episode 438: steps = 6 , reward = 1.0
Episode 439: steps = 7 , reward = 1.0
Episode 440: steps = 5 , reward = 0.0
Episode 441: steps = 5 , reward = 0.0
Episode 442: steps = 6 , reward = 1.0
Episode 443: steps = 6 , reward = 1.0
Episode 444: steps = 6 , reward = 1.0
Episode 445: steps = 8 , reward = 1.0
Episode 446: steps = 8 , reward = 1.0
Episode 447: steps = 6 , reward = 1.0
Episode 448: steps = 6 , reward = 1.0
Episode 449: steps = 3 , reward = 0.0
Episode 450: steps = 6 , reward = 1.0
Episode 451: steps = 8 , reward = 1.0
Episode 452: steps = 10 , reward = 1.0
Episode 453: steps = 8 , reward = 1.0
Episode 454: steps = 6 , reward = 1.0
Episode 455: steps = 6 , reward = 1.0
Episode 456: steps = 6 , reward = 1.0
Episode 457: steps = 6 , reward = 1.0
Episode 458: steps = 6 , reward = 1.0
Episode 459: steps = 8 , reward = 1.0
Episode 460: steps = 6 , reward = 1.0
Episode 461: steps = 6 , reward = 1.0
Episode 462: steps = 6 , reward = 1.0
Episode 463: steps = 6 , reward = 1.0
Episode 464: steps = 6 , reward = 1.0
Episode 465: steps = 9 , reward = 1.0
Episode 466: steps = 9 , reward = 1.0
Episode 467: steps = 6 , reward = 1.0
Episode 468: steps = 6 , reward = 1.0
Episode 469: steps = 6 , reward = 1.0
Episode 470: steps = 6 , reward = 1.0
Episode 471: steps = 6 , reward = 1.0
Episode 472: steps = 9 , reward = 1.0
Episode 473: steps = 7 , reward = 1.0
Episode 474: steps = 6 , reward = 1.0
Episode 475: steps = 7 , reward = 1.0
Episode 476: steps = 7 , reward = 1.0
Episode 477: steps = 6 , reward = 1.0
Episode 478: steps = 6 , reward = 1.0
Episode 479: steps = 6 , reward = 1.0
Episode 480: steps = 6 , reward = 1.0
Episode 481: steps = 6 , reward = 1.0
Episode 482: steps = 5 , reward = 0.0
Episode 483: steps = 6 , reward = 1.0
Episode 484: steps = 11 , reward = 1.0
Episode 485: steps = 6 , reward = 1.0
Episode 486: steps = 2 , reward = 0.0
Episode 487: steps = 6 , reward = 1.0
Episode 488: steps = 6 , reward = 1.0
Episode 489: steps = 6 , reward = 1.0
Episode 490: steps = 6 , reward = 1.0
Episode 491: steps = 6 , reward = 1.0
Episode 492: steps = 6 , reward = 1.0
Episode 493: steps = 7 , reward = 1.0
Episode 494: steps = 6 , reward = 1.0
Episode 495: steps = 6 , reward = 1.0
Episode 496: steps = 6 , reward = 1.0
Episode 497: steps = 8 , reward = 1.0
Episode 498: steps = 6 , reward = 1.0
Episode 499: steps = 6 , reward = 1.0
test reward = 1.0
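
The per-episode log above is easier to judge once it is condensed into a few numbers. The following is a minimal sketch, not part of the original project, that parses lines in the format printed above and reports the episode count, the fraction of episodes with a positive reward, and the mean step count of those successful episodes. The names `LINE_RE` and `summarize_log` are hypothetical helpers introduced only for illustration, and the sample text is copied from a few lines of the log.

```python
import re

# Matches lines such as "Episode 193: steps = 6 , reward = 1.0"
LINE_RE = re.compile(
    r"Episode\s+(\d+):\s*steps\s*=\s*(\d+)\s*,\s*reward\s*=\s*(-?\d+(?:\.\d+)?)"
)

def summarize_log(text):
    """Return (episode count, success rate, mean steps of successful episodes).

    Hypothetical helper: an episode counts as a success when its reward is > 0.
    """
    rows = [(int(e), int(s), float(r)) for e, s, r in LINE_RE.findall(text)]
    if not rows:
        return 0, 0.0, 0.0
    win_steps = [s for _, s, r in rows if r > 0]
    success_rate = len(win_steps) / len(rows)
    mean_steps = sum(win_steps) / len(win_steps) if win_steps else 0.0
    return len(rows), success_rate, mean_steps

# A few lines copied from the training log above.
sample = """Episode 204: steps = 6 , reward = 1.0
Episode 205: steps = 8 , reward = 1.0
Episode 206: steps = 3 , reward = 0.0
Episode 207: steps = 8 , reward = 1.0"""

count, rate, steps = summarize_log(sample)
print(f"episodes = {count}, success rate = {rate:.2f}, mean steps (successes) = {steps:.2f}")
```

Running the same function over the full log text would give the training-time success rate. The occasional reward = 0.0 episodes are likely a side effect of ε-greedy exploration during training, which would be consistent with the final greedy test run reaching reward = 1.0.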

To run the code, click: https://aistudio.baidu.com/aistudio/projectdetail/625951?shared=1

Likes, favorites, and shares are welcome!
