This is an RL agent backtesting replay. Press ▶ Play to initialize a fresh PPO trading agent and train it inside a market sandbox. The agent collects market observations, takes actions, receives rewards, updates its policy, and repeats. Each episode is added to its learning record, and the reward curve shows whether the agent is actually improving.
This is the first step in the Roostoo pipeline: train the agent, evaluate its behavior, then graduate it into live-market competitions.
Every box updates with live numbers from the algorithm. One episode = a full pass through the loop: 200 rollout steps (boxes 1–3 fire 200 times) → 1 update phase running K=5 epochs of gradient descent on the clipped-surrogate loss (box 4). The center counter shows your position in both: the outer episode count and the inner rollout step.
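To make the loop concrete, here is a minimal Python sketch of one episode as described above: 200 rollout steps followed by K=5 update epochs that evaluate the PPO clipped surrogate. It is an illustration under stated assumptions, not the code behind this replay. The clip range `CLIP_EPS = 0.2`, the function names, and the random placeholder rewards and probability ratios are all hypothetical; a real agent would get rewards from the market sandbox and ratios from its policy network.

```python
import math
import random

ROLLOUT_STEPS = 200  # from the text: boxes 1-3 fire 200 times per episode
K_EPOCHS = 5         # from the text: epochs per update phase
CLIP_EPS = 0.2       # assumption: a common PPO default, not stated on this page


def clipped_surrogate(ratios, advantages, eps=CLIP_EPS):
    """PPO clipped surrogate, averaged over the batch.

    PPO maximizes this value; in practice its negative is minimized
    by gradient descent.
    """
    terms = []
    for r, a in zip(ratios, advantages):
        r_clipped = min(max(r, 1.0 - eps), 1.0 + eps)
        # Take the more pessimistic of the unclipped and clipped terms.
        terms.append(min(r * a, r_clipped * a))
    return sum(terms) / len(terms)


def run_episode(seed=0):
    """One episode: a 200-step rollout, then K epochs over the collected batch."""
    rng = random.Random(seed)

    # Rollout phase: placeholder random rewards stand in for the sandbox.
    rewards = [rng.gauss(0.0, 1.0) for _ in range(ROLLOUT_STEPS)]
    mean_reward = sum(rewards) / len(rewards)
    # Crude mean baseline, for illustration only.
    advantages = [r - mean_reward for r in rewards]

    # Update phase: K passes over the same batch.
    objectives = []
    for _ in range(K_EPOCHS):
        # Placeholder ratios pi_new / pi_old near 1.0.
        ratios = [math.exp(rng.gauss(0.0, 0.1)) for _ in range(ROLLOUT_STEPS)]
        objectives.append(clipped_surrogate(ratios, advantages))
    return len(rewards), objectives


steps, objectives = run_episode()
```

The clipping is what keeps each update small: once a ratio moves more than `CLIP_EPS` away from 1 in the direction that would increase the objective, that sample stops contributing any further gain.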