I'm trying to simulate an RL environment, starting from a Python environment and then wrapping it with tf_py_environment. I've noticed that my environment's time step spec is:
TimeStep(
{'discount': BoundedTensorSpec(shape=(), dtype=tf.float32, name='discount', minimum=array(0., dtype=float32), maximum=array(1., dtype=float32)),
'observation': BoundedTensorSpec(shape=(6,), dtype=tf.int32, name=None, minimum=array(0), maximum=array(1)),
'reward': TensorSpec(shape=(), dtype=tf.float32, name='reward'),
'step_type': TensorSpec(shape=(), dtype=tf.int32, name='step_type')})
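(For context, I'm printing this spec with something along these lines, where tf_env is the TFPyEnvironment created at the end of the post:

    print(tf_env.time_step_spec())
)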
But when I call the step method, I get a result in the following format:
TimeStep(
{'discount': <tf.Tensor: shape=(1,), dtype=float32, numpy=array([1.], dtype=float32)>,
'observation': <tf.Tensor: shape=(1, 1, 6), dtype=int32, numpy=array([[[1, 0, 1, 0, 1, 0]]])>,
'reward': <tf.Tensor: shape=(1,), dtype=float32, numpy=array([-1.], dtype=float32)>,
'step_type': <tf.Tensor: shape=(1,), dtype=int32, numpy=array([1])>})
Everything has gained one extra dimension, except the observation, which has gained two: the spec says shape=(6,), but the returned tensor has shape (1, 1, 6).
Here is the code for my environment:
import numpy as np
from tf_agents.environments import py_environment
from tf_agents.specs import array_spec
from tf_agents.trajectories import time_step as ts


class CustomEnv(py_environment.PyEnvironment):

    def __init__(self):
        # Action: one of 4 discrete choices; observation: 6 binary flags.
        self._action_spec = array_spec.BoundedArraySpec(
            shape=(), dtype=np.int32, minimum=0, maximum=3)
        self._observation_spec = array_spec.BoundedArraySpec(
            shape=(6,), dtype=np.int32, minimum=0, maximum=1)
        self._state = [0, 0, 0, 0, 0, 0]
        self._counter = 0
        self._episode_ended = False
        # Maps each action to the (color, letter) pairs it covers.
        self.dictionary = {0: [(0, 0), (0, 1)],
                           1: [(0, 2)],
                           2: [(1, 0), (1, 1)],
                           3: [(1, 2), (2, 0), (2, 1), (2, 2)]}

    def action_spec(self):
        return self._action_spec

    def observation_spec(self):
        return self._observation_spec

    def _reset(self):
        self._state = [0, 0, 0, 0, 0, 0]
        self._counter = 0
        self._episode_ended = False
        return ts.restart(np.array([self._state], dtype=np.int32))

    def preferences(self):
        # Random (color, letter) preference for this step.
        return np.random.randint(3, size=2)

    def pickedGift(self, yes):
        # Reward 0 if the chosen gift matches the preference, -1 otherwise.
        reward = -1.0
        if yes:
            reward = 0.0
        return reward

    def _step(self, action):
        if self._episode_ended:
            self._reset()
        if self._counter < 250:
            self._counter += 1
            color, letter = self.preferences()
            condition = (color, letter) in self.dictionary[int(action)]
            reward = self.pickedGift(condition)
            self._state[color] = 1
            self._state[3 + letter] = 1
            if self._counter == 250:
                self._episode_ended = True
                return ts.termination(np.array([self._state], dtype=np.int32),
                                      reward,
                                      1)
            else:
                return ts.transition(np.array([self._state], dtype=np.int32),
                                     reward,
                                     discount=1.0)
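As a side note, I assume a spec check like the following (using tf_agents.environments.utils.validate_py_environment, if I'm using it correctly) would compare what _step and _reset return against the declared specs:

    from tf_agents.environments import utils

    # Runs a few random episodes and checks the returned time steps against the specs.
    utils.validate_py_environment(CustomEnv(), episodes=1)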
I create the TF environment like this:
from tf_agents.environments import tf_py_environment

py_env = CustomEnv()
tf_env = tf_py_environment.TFPyEnvironment(py_env)
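And this is roughly how I reset and step the wrapped environment to get the output shown above (the action value 1 here is just an arbitrary example, passed with a leading batch dimension):

    import tensorflow as tf

    time_step = tf_env.reset()
    time_step = tf_env.step(tf.constant([1], dtype=tf.int32))
    print(time_step)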
My question is why these extra dimensions are added, and how I can eventually get rid of them, especially for the observation, where there are two extra dimensions.
Thanks in advance.