Creating a Custom gym Environment from Scratch: A Stock Market Example - 白红宇's Personal Blog

Published: 2021-05-07 14:33:00 Category: Original article


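This post is a translation of the article 《》 that Adam King published on April 10. If your English is good, I recommend reading the original. I made this translation only for my own study.

Having finished the translation, even I find some of the sentences awkward. Please bear with me while my English slowly improves.

OpenAI's gym is an excellent package for building custom reinforcement learning agents. It ships with many built-in environments.

These environments are extremely valuable for learning, but eventually you will want to build an agent that solves a problem of your own. To do that, you need to create a custom environment for your specific problem domain. Later in this post we will create a custom stock market environment that simulates stock trading. All of the code for this article can be found at .

First, let's be clear about what an environment is. An environment contains all of the functionality needed to run an agent and let it learn. Every environment must implement the following gym interface: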

import gym
import numpy as np
from gym import spaces

# Placeholder sizes for illustration; replace them with values for your problem
N_DISCRETE_ACTIONS = 3
HEIGHT, WIDTH, N_CHANNELS = 84, 84, 3


class CustomEnv(gym.Env):
    """Custom Environment that follows gym interface"""
    metadata = {'render.modes': ['human']}

    def __init__(self, arg1, arg2):
        super(CustomEnv, self).__init__()
        # Define action and observation space
        # They must be gym.spaces objects
        # Example when using discrete actions:
        self.action_space = spaces.Discrete(N_DISCRETE_ACTIONS)
        # Example for using image as input:
        self.observation_space = spaces.Box(
            low=0, high=255,
            shape=(HEIGHT, WIDTH, N_CHANNELS), dtype=np.uint8)

    def step(self, action):
        # Execute one time step within the environment
        ...

    def reset(self):
        # Reset the state of the environment to an initial state
        ...

    def render(self, mode='human', close=False):
        # Render the environment to the screen
        ...

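In the constructor, we first define the type and shape of the action_space, which contains all of the actions the agent can take in this environment. Similarly, we define the observation_space, which contains all of the data the agent observes in this environment.

The reset method is called periodically to reset the environment to an initial state. It is followed by many steps through the environment; in each step the model provides an action, the action is executed, and the next observation is returned. The reward is also computed here; more on that later.

Finally, the render method may be called periodically to output a rendition of the environment. This can be as simple as a print statement or as complex as rendering a 3D environment with OpenGL. In this example we will stick with print statements.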
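To make the calling order of reset, step, and render concrete, here is a minimal sketch of an episode loop. CounterEnv, run_episode, and the always-move-forward policy are hypothetical names made up for illustration; the toy class only mimics the three method names and does not inherit from gym.Env, so it runs without gym installed.

```python
class CounterEnv:
    """Hypothetical toy environment that mimics reset/step/render."""

    def __init__(self, target=5):
        self.target = target
        self.position = 0

    def reset(self):
        # Return to the initial state and hand back the first observation
        self.position = 0
        return self.position

    def step(self, action):
        # action: 0 = stay, 1 = move forward by one
        self.position += action
        done = self.position >= self.target
        reward = 1.0 if done else 0.0
        return self.position, reward, done, {}

    def render(self, mode='human'):
        print(f'position: {self.position}')


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the total reward collected."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(obs)
        obs, reward, done, info = env.step(action)
        total_reward += reward
        env.render()
        if done:
            break
    return total_reward


total = run_episode(CounterEnv(target=3), policy=lambda obs: 1)
```

The loop has the same shape as the training and evaluation loops that libraries run against a real gym environment: one reset, then repeated step calls until the episode reports done.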

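Stock Trading Environment

To show how this all works, we will create a stock trading environment and then train an agent to become a profitable trader. Let's go!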


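First, consider how a human trader perceives the environment. What observations would they make before deciding on a trade?

A trader would most likely look at charts of the stock's price action, perhaps overlaid with a few technical indicators. Combining this visual information with their own experience, they would make an informed decision about where the price is likely to go.

Let's translate this into how our agent should perceive its environment.

The observation_space contains all of the input variables the agent should consider before making, or not making, a trade. In this example, we want the agent to "see" the stock data points for the last five days (open, high, low, close, and daily volume), plus a few other data points such as its account balance, current stock position, and current profit.

Intuitively, at each time step we want the agent to consider the price action leading up to the current price, together with the state of its own portfolio, so that it can make an informed decision about its next action.

Once a trader has perceived the environment, they need to take an action. For our agent, the action_space consists of three possibilities: buy, sell, or do nothing.

That is not enough, though: we also need to know how many shares to buy or sell each time. Using gym's Box space, we can create an action space that has a discrete number of action types (buy, sell, and hold) as well as a continuous amount to buy or sell (0-100% of the account balance or position size).

You will notice that an amount is supplied even for the hold action, which does not need one. Our agent does not know this at first, but over time it should learn that the amount is irrelevant for this action.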
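As a rough sketch of how such a two-element action can be interpreted, the helper below maps the first value to an action type and treats the second as the fraction to trade. The thresholds (below 1 is buy, below 2 is sell, anything else is hold) follow the _take_action method in this article; decode_action itself is a hypothetical helper and not part of gym.

```python
def decode_action(action):
    """Translate [action_type, amount] into a readable (name, fraction) pair."""
    action_type, amount = action
    if action_type < 1:
        return ('buy', amount)
    if action_type < 2:
        return ('sell', amount)
    # The amount is still supplied for hold, but it has no effect
    return ('hold', 0.0)


print(decode_action([0.4, 0.5]))
print(decode_action([1.7, 0.25]))
print(decode_action([2.3, 0.9]))
```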

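The last thing to consider before implementing the environment is the reward. We want to reward sustained profit. At each step, we set the reward to the account balance multiplied by some fraction of the number of time steps so far.

The purpose is to delay rewarding the agent in the early stages, so that it can explore thoroughly before optimizing a single strategy too deeply. It also rewards agents that maintain a high balance for longer, rather than those that make money quickly with unsustainable strategies.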
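The effect of this delay can be seen with a little arithmetic. The sketch below uses the same formula as the step method in this article (balance times current_step / MAX_STEPS); the value 20000 for MAX_STEPS and the sample balance are assumptions chosen only for illustration.

```python
MAX_STEPS = 20000  # assumed value, for illustration only


def delayed_reward(balance, current_step, max_steps=MAX_STEPS):
    """Scale the balance by the fraction of the step budget already used."""
    delay_modifier = current_step / max_steps
    return balance * delay_modifier


# The same balance earns far less early on than it does later
early = delayed_reward(10000, 100)
late = delayed_reward(10000, 10000)
print(early, late)
```

With an unchanged balance, the reward grows linearly with the step count, so the agent gains little from a high balance early on and much more from holding one late.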

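Implementation

Now that we have defined the observation space, action space, and reward, it is time to implement our environment. First, we need to define the action_space and observation_space in the environment's constructor. The environment expects a pandas DataFrame containing the stock data to learn from. An example is provided here.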

import random

import gym
import numpy as np
from gym import spaces

# MAX_ACCOUNT_BALANCE and the other upper-case constants used in this
# environment are assumed to be defined at module level.


class StockTradingEnv(gym.Env):
    """A stock trading environment for OpenAI gym"""
    metadata = {'render.modes': ['human']}

    def __init__(self, df):
        super(StockTradingEnv, self).__init__()
        self.df = df
        self.reward_range = (0, MAX_ACCOUNT_BALANCE)
        # Actions of the format Buy x%, Sell x%, Hold, etc.
        self.action_space = spaces.Box(
            low=np.array([0, 0]), high=np.array([3, 1]), dtype=np.float16)
        # Prices contains the OHCL values for the last five prices
        self.observation_space = spaces.Box(
            low=0, high=1, shape=(6, 6), dtype=np.float16)
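The environment code refers to several upper-case constants whose values are not shown in this article. The block below is one possible set, given purely as an assumption so the snippets can be run; only the names come from the article's code, and every value should be tuned to your own data.

```python
# Assumed values for illustration; the article does not state them
MAX_ACCOUNT_BALANCE = 2147483647   # upper bound used to scale balances
MAX_NUM_SHARES = 2147483647        # upper bound used to scale share counts
MAX_SHARE_PRICE = 5000             # upper bound used to scale prices
MAX_STEPS = 20000                  # step budget used by the reward delay
INITIAL_ACCOUNT_BALANCE = 10000    # starting cash for each episode

# Scaled values should fall between 0 and 1, matching the observation space
print(INITIAL_ACCOUNT_BALANCE / MAX_ACCOUNT_BALANCE)
```

The bounds only need to be at least as large as any value seen in the data, so that each scaled observation stays within the declared low=0, high=1 range.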

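Next, we write the reset method, which is called whenever a new environment is created or an existing one needs to be reset. This is where we set each agent's starting balance and reset its position to zero shares held.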

def reset(self):
    # Reset the state of the environment to an initial state
    self.balance = INITIAL_ACCOUNT_BALANCE
    self.net_worth = INITIAL_ACCOUNT_BALANCE
    self.max_net_worth = INITIAL_ACCOUNT_BALANCE
    self.shares_held = 0
    self.cost_basis = 0
    self.total_shares_sold = 0
    self.total_sales_value = 0

    # Set the current step to a random point within the data frame
    self.current_step = random.randint(
        0, len(self.df.loc[:, 'Open'].values) - 6)

    return self._next_observation()

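We set the current step to a random point within the data frame, which gives the agent more varied experiences than always starting from the same data point. The _next_observation method compiles the stock data for a short window of time steps (the slice in the code runs from current_step through current_step + 5), appends the agent's account information, and scales all values to between 0 and 1.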

def _next_observation(self):
    # Get the data points for the last 5 days and scale to between 0-1
    frame = np.array([
        self.df.loc[self.current_step: self.current_step +
                    5, 'Open'].values / MAX_SHARE_PRICE,
        self.df.loc[self.current_step: self.current_step +
                    5, 'High'].values / MAX_SHARE_PRICE,
        self.df.loc[self.current_step: self.current_step +
                    5, 'Low'].values / MAX_SHARE_PRICE,
        self.df.loc[self.current_step: self.current_step +
                    5, 'Close'].values / MAX_SHARE_PRICE,
        self.df.loc[self.current_step: self.current_step +
                    5, 'Volume'].values / MAX_NUM_SHARES,
    ])

    # Append additional data and scale each value to between 0-1
    obs = np.append(frame, [[
        self.balance / MAX_ACCOUNT_BALANCE,
        self.max_net_worth / MAX_ACCOUNT_BALANCE,
        self.shares_held / MAX_NUM_SHARES,
        self.cost_basis / MAX_SHARE_PRICE,
        self.total_shares_sold / MAX_NUM_SHARES,
        self.total_sales_value / (MAX_NUM_SHARES * MAX_SHARE_PRICE),
    ]], axis=0)

    return obs
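One detail worth noting: label-based slicing with pandas .loc includes both endpoints, so current_step: current_step + 5 selects six rows per column. Five price and volume rows plus one row of six account values is consistent with the (6, 6) observation shape declared in the constructor. The sketch below reproduces the windowing and scaling with a plain Python list instead of a DataFrame; price_window, the sample prices, and the value of MAX_SHARE_PRICE are assumptions for illustration.

```python
MAX_SHARE_PRICE = 5000  # assumed upper bound, for illustration only


def price_window(prices, current_step, max_price=MAX_SHARE_PRICE):
    """Return six scaled prices starting at current_step."""
    # A list slice excludes its end index, so + 6 here matches the
    # inclusive df.loc[current_step: current_step + 5] in the environment
    window = prices[current_step:current_step + 6]
    return [p / max_price for p in window]


closes = [150.0, 152.5, 151.0, 155.0, 157.5, 160.0, 158.0]
scaled = price_window(closes, 1)
print(len(scaled), scaled)
```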

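Next, our environment needs to be able to take a step. At each step we take the specified action (chosen by the model), calculate the reward, and return the next observation.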

def step(self, action):
    # Execute one time step within the environment
    self._take_action(action)

    self.current_step += 1

    # Wrap around to the start once the data runs out
    if self.current_step > len(self.df.loc[:, 'Open'].values) - 6:
        self.current_step = 0

    delay_modifier = (self.current_step / MAX_STEPS)

    reward = self.balance * delay_modifier
    done = self.net_worth <= 0

    obs = self._next_observation()

    return obs, reward, done, {}

Now the _take_action method needs to carry out the action provided by the model: buy, sell, or hold the stock.

def _take_action(self, action):
    # Set the current price to a random price within the time step
    current_price = random.uniform(
        self.df.loc[self.current_step, "Open"],
        self.df.loc[self.current_step, "Close"])

    action_type = action[0]
    amount = action[1]

    if action_type < 1:
        # Buy amount % of balance in shares
        total_possible = self.balance / current_price
        shares_bought = total_possible * amount
        prev_cost = self.cost_basis * self.shares_held
        additional_cost = shares_bought * current_price

        self.balance -= additional_cost
        self.cost_basis = (
            prev_cost + additional_cost) / (self.shares_held + shares_bought)
        self.shares_held += shares_bought

    elif action_type < 2:
        # Sell amount % of shares held
        shares_sold = self.shares_held * amount
        self.balance += shares_sold * current_price
        self.shares_held -= shares_sold
        self.total_shares_sold += shares_sold
        self.total_sales_value += shares_sold * current_price

    # Any other action type means hold: nothing is bought or sold
    self.net_worth = self.balance + self.shares_held * current_price

    if self.net_worth > self.max_net_worth:
        self.max_net_worth = self.net_worth

    if self.shares_held == 0:
        self.cost_basis = 0
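The buy branch keeps a share-weighted average cost per share, which is easier to follow with numbers. The sketch below restates that arithmetic as a standalone function; buy and the sample figures are hypothetical, and the guard against an empty position is an addition that the method above does not have.

```python
def buy(balance, shares_held, cost_basis, price, amount):
    """Spend `amount` (0-1) of the balance and update the average cost."""
    shares_bought = (balance / price) * amount
    additional_cost = shares_bought * price
    prev_cost = cost_basis * shares_held
    new_shares = shares_held + shares_bought
    # Added guard: avoid dividing by zero when nothing is held or bought
    new_basis = (prev_cost + additional_cost) / new_shares if new_shares else 0
    return balance - additional_cost, new_shares, new_basis


# Spend half of 10000 at a price of 100, then all remaining cash at 200
balance, shares, basis = buy(10000, 0, 0, price=100, amount=0.5)
balance2, shares2, basis2 = buy(balance, shares, basis, price=200, amount=1.0)
print(balance, shares, basis)
print(balance2, shares2, basis2)
```

After the first purchase the position is 50 shares at an average cost of 100. The second purchase adds 25 shares at 200, so the average cost becomes 10000 / 75, roughly 133.33 per share.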

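Only one thing remains: rendering the environment to the screen. For simplicity, we will just print the profit made so far along with a few other interesting metrics.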

def render(self, mode='human', close=False):
    # Render the environment to the screen
    profit = self.net_worth - INITIAL_ACCOUNT_BALANCE

    print(f'Step: {self.current_step}')
    print(f'Balance: {self.balance}')
    print(f'Shares held: {self.shares_held} '
          f'(Total sold: {self.total_shares_sold})')
    print(f'Avg cost for held shares: {self.cost_basis} '
          f'(Total sales value: {self.total_sales_value})')
    print(f'Net worth: {self.net_worth} '
          f'(Max net worth: {self.max_net_worth})')
    print(f'Profit: {profit}')

Our environment is complete. We can now instantiate a StockTradingEnv with a data frame and test it with a model from stable-baselines.

import gym
import json
import datetime as dt

from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import PPO2

from env.StockTradingEnv import StockTradingEnv

import pandas as pd

df = pd.read_csv('./data/AAPL.csv')
df = df.sort_values('Date')

# The algorithms require a vectorized environment to run
env = DummyVecEnv([lambda: StockTradingEnv(df)])

model = PPO2(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=20000)

obs = env.reset()
for i in range(2000):
    action, _states = model.predict(obs)
    obs, rewards, done, info = env.step(action)
    env.render()

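Of course, this was all just for fun, to test our custom gym environment with some semi-complex actions, observations, and reward spaces. A great deal more work is needed if you want to use deep learning to profit from the stock market.

Stay tuned for next week's article: 《》

The article contains many technical terms that I suspect I have translated incorrectly. If any experts read this, I would appreciate your corrections. Thank you.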
