
I am trying to build an RL environment by starting from a Python environment and then wrapping it with tf_py_environment. I noticed that my environment's time step spec is:

TimeStep(
{'discount': BoundedTensorSpec(shape=(), dtype=tf.float32, name='discount', minimum=array(0., dtype=float32), maximum=array(1., dtype=float32)),
 'observation': BoundedTensorSpec(shape=(6,), dtype=tf.int32, name=None, minimum=array(0), maximum=array(1)),
 'reward': TensorSpec(shape=(), dtype=tf.float32, name='reward'),
 'step_type': TensorSpec(shape=(), dtype=tf.int32, name='step_type')})
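
(For reference, this is roughly how I print that spec; tf_env is the wrapped environment whose creation is shown further down.)

print(tf_env.time_step_spec())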

But when I execute the step method, I get a result in the following format:

TimeStep(
{'discount': <tf.Tensor: shape=(1,), dtype=float32, numpy=array([1.], dtype=float32)>,
 'observation': <tf.Tensor: shape=(1, 1, 6), dtype=int32, numpy=array([[[1, 0, 1, 0, 1, 0]]])>,
 'reward': <tf.Tensor: shape=(1,), dtype=float32, numpy=array([-1.], dtype=float32)>,
 'step_type': <tf.Tensor: shape=(1,), dtype=int32, numpy=array([1])>})

Everything gets one extra dimension, except the observation, which gets two.

Here is the code of my environment:

import numpy as np

from tf_agents.environments import py_environment, tf_py_environment
from tf_agents.specs import array_spec
from tf_agents.trajectories import time_step as ts


class CustomEnv(py_environment.PyEnvironment):

    def __init__(self):
        # Scalar action in {0, 1, 2, 3}.
        self._action_spec = array_spec.BoundedArraySpec(
            shape=(), dtype=np.int32, minimum=0, maximum=3)
        # Observation: vector of 6 binary flags.
        self._observation_spec = array_spec.BoundedArraySpec(
            shape=(6,), dtype=np.int32, minimum=0, maximum=1)
        self._state = [0, 0, 0, 0, 0, 0]
        self._counter = 0
        self._episode_ended = False
        # Maps each action to the (color, letter) pairs it covers.
        self.dictionary = {0: [(0, 0), (0, 1)],
                           1: [(0, 2)],
                           2: [(1, 0), (1, 1)],
                           3: [(1, 2), (2, 0), (2, 1), (2, 2)]}

    def action_spec(self):
        return self._action_spec

    def observation_spec(self):
        return self._observation_spec

    def _reset(self):
        self._state = [0, 0, 0, 0, 0, 0]
        self._counter = 0
        self._episode_ended = False
        return ts.restart(np.array([self._state], dtype=np.int32))

    def preferences(self):
        # Random (color, letter) preference, each in {0, 1, 2}.
        return np.random.randint(3, size=2)

    def pickedGift(self, yes):
        reward = -1.0
        if yes:
            reward = 0.0
        return reward

    def _step(self, action):
        if self._episode_ended:
            self._reset()

        if self._counter < 250:
            self._counter += 1

            color, letter = self.preferences()
            condition = (color, letter) in self.dictionary[int(action)]
            reward = self.pickedGift(condition)
            self._state[color] = 1
            self._state[3 + letter] = 1

            if self._counter == 250:
                self._episode_ended = True
                return ts.termination(np.array([self._state],
                                               dtype=np.int32),
                                      reward,
                                      1)
            else:
                return ts.transition(np.array([self._state],
                                              dtype=np.int32),
                                     reward,
                                     discount=1.0)

I created the TF environment like this:

py_env = CustomEnv()
tf_env = tf_py_environment.TFPyEnvironment(py_env)
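
For reference, this is roughly how I produced the TimeStep shown above; the action value 2 is just an example, and I wrap it in a leading batch dimension of size 1, which I believe matches the batch size that TFPyEnvironment reports:

import tensorflow as tf

time_step = tf_env.reset()

# Example action (value 2 chosen arbitrarily), with a leading batch dimension of 1.
action = tf.constant([2], dtype=tf.int32)
time_step = tf_env.step(action)

print(time_step)                    # the batched TimeStep shown above
print(time_step.observation.shape)  # (1, 1, 6)
print(time_step.reward.shape)       # (1,)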

I would like to ask why these extra dimensions are added and how I can eventually get rid of them, especially for the observation, which ends up with two extra dimensions.

Thanks in advance.
