Cart-Pole Example
Ready to see reinforcement learning in action? In this tutorial, we’ll take on the classic balancing challenge where you’ll watch your AI learn to keep a pole upright on a moving cart.
The cart-pole challenge blends simplicity with visual feedback, making it perfect for reinforcement learning. You push a cart left or right, and physics determines whether the attached pole stays balanced or topples over. Every time step, your agent makes a decision, and you get the satisfaction of watching your algorithm gradually master the task.
Setting Up Your Project
We’ll use SciSharp/Gym.NET to provide our simulated physics environment.
You can follow along or grab the complete project if you prefer.
Let’s install the necessary packages:
```bash
dotnet add package RLMatrix
dotnet add package RLMatrix.Toolkit
dotnet add package Gym.NET
dotnet add package Gym.NET.Environments
dotnet add package Gym.NET.Rendering.WinForm
```
Building the Environment
Here’s our cart-pole environment implementation:
```csharp
using System;
using System.Threading.Tasks;
using Gym.Environments.Envs.Classic;
using Gym.Rendering.WinForm;
using RLMatrix.Toolkit;
using NumSharp;

namespace MyEnv
{
    [RLMatrixEnvironment]
    public partial class CartPoleEnvironment
    {
        private CartPoleEnv myEnv;
        private float[] myState;
        private int stepCounter;
        private const int MaxSteps = 100000;
        private bool isDone;

        public CartPoleEnvironment()
        {
            InitialiseAsync();
        }

        private void InitialiseAsync()
        {
            myEnv = new CartPoleEnv(WinFormEnvViewer.Factory);
            ResetEnvironment();
        }

        // The four observations the agent sees each step
        [RLMatrixObservation]
        public float GetCartPosition() => myState[0];

        [RLMatrixObservation]
        public float GetCartVelocity() => myState[1];

        [RLMatrixObservation]
        public float GetPoleAngle() => myState[2];

        [RLMatrixObservation]
        public float GetPoleAngularVelocity() => myState[3];

        // A single discrete action with two choices: push the cart left or right
        [RLMatrixActionDiscrete(2)]
        public void ApplyForce(int action)
        {
            if (isDone)
                ResetEnvironment();

            var (observation, reward, done, _) = myEnv.Step(action);
            myEnv.Render();
            myState = ToFloatArray(observation);
            isDone = done;
            stepCounter++;

            if (stepCounter > MaxSteps)
                isDone = true;
        }

        private float[] ToFloatArray(NDArray npArray)
        {
            double[] doubleArray = npArray.ToArray<double>();
            return Array.ConvertAll(doubleArray, item => (float)item);
        }

        // +1 for every step the pole stays up, 0 once the episode ends
        [RLMatrixReward]
        public float CalculateReward()
        {
            return isDone ? 0 : 1;
        }

        [RLMatrixDone]
        public bool IsEpisodeFinished()
        {
            return isDone;
        }

        [RLMatrixReset]
        public void ResetEnvironment()
        {
            myEnv.Reset();
            myState = new float[4] { 0, 0, 0, 0 };
            isDone = false;
            stepCounter = 0;
        }
    }
}
```
Setting Up Training
Now for the training code that will teach our agent to balance:
```csharp
using RLMatrix.Agents.Common;
using RLMatrix;
using MyEnv;

Console.WriteLine("Starting cart-pole training...\n");

// Configure learning parameters
var learningSetup = new PPOAgentOptions(
    batchSize: 8,
    ppoEpochs: 8,
    memorySize: 1000,
    gamma: 0.99f,
    width: 128,
    entropyCoefficient: 0.01f,
    lr: 1E-02f);

// Create environment and attach to agent
var environment = new CartPoleEnvironment().RLInit(maxStepsSoft: 1200, maxStepsHard: 1200);
var env = new List<IEnvironmentAsync<float[]>> {
    environment,
    //new CartPoleEnvironment().RLInit() //uncomment to train with multiple environments
};

// Initialize agent
var agent = new LocalDiscreteRolloutAgent<float[]>(learningSetup, env);

// Train until convergence
for (int i = 0; i < 100000; i++)
{
    await agent.Step();
}

Console.WriteLine("\nTraining complete!");
Console.ReadLine();
```
The simple reward of +1 per time step is deceptively powerful. Deep reinforcement learning algorithms naturally optimize for the long game, figuring out that subtle, preemptive adjustments lead to longer balancing times and higher cumulative rewards.
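To make the "long game" concrete, consider the discounted return the agent actually maximizes. The snippet below is a small illustrative calculation in plain C#, separate from the tutorial project: with gamma = 0.99, an episode that survives 200 steps earns a substantially larger return than one that falls after 50 steps, so every extra step of balance is directly rewarded.

```csharp
// Illustrative only: discounted return G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
// with a constant reward of +1 for every step the pole stays up.
static float DiscountedReturn(int survivedSteps, float gamma = 0.99f)
{
    float g = 0f, discount = 1f;
    for (int t = 0; t < survivedSteps; t++)
    {
        g += discount * 1f;   // +1 reward per surviving step
        discount *= gamma;
    }
    return g;
}

Console.WriteLine(DiscountedReturn(50));   // ≈ 39.5
Console.WriteLine(DiscountedReturn(200));  // ≈ 86.6
```

Because longer episodes always accumulate more return, the agent never needs an explicit instruction to "balance longer"; that goal falls out of the reward structure on its own.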
PPO in RLMatrix: What’s Different
While DQN (from our earlier tutorials) can be more sample-efficient for simple tasks, PPO generally delivers more stable training without requiring extensive hyperparameter tuning. This makes it particularly well-suited for challenging control problems.
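A big part of that stability comes from the "proximal" piece of PPO: the clipped policy update, which keeps each new policy close to the one that gathered the data. Below is a minimal sketch of the standard clipped surrogate objective for a single sample, written in plain C# for illustration; it is textbook PPO math, not RLMatrix's internal code.

```csharp
using System;

// Standard PPO clipped surrogate for one (state, action) sample.
// ratio     = pi_new(a|s) / pi_old(a|s)
// advantage = advantage estimate for the action that was taken
// epsilon   = clip range (commonly 0.1 to 0.2)
static float ClippedSurrogate(float ratio, float advantage, float epsilon = 0.2f)
{
    float unclipped = ratio * advantage;
    float clipped = Math.Clamp(ratio, 1f - epsilon, 1f + epsilon) * advantage;

    // Taking the minimum removes any benefit from pushing the policy
    // far away from the data-collecting policy in a single update.
    return Math.Min(unclipped, clipped);
}
```

Because each batch can only nudge the policy a limited amount, PPO tends to tolerate less careful hyperparameter tuning than many alternatives.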
The Memory-Saving Trick You Need to Know
Look at this line in our training code:
```csharp
var environment = new CartPoleEnvironment().RLInit(maxStepsSoft: 1200, maxStepsHard: 1200);
```
This innocuous parameter configuration holds the key to training with very long episodes without overwhelming your GPU’s memory. Let me explain:
What happens when we modify these values?
```csharp
var environment = new CartPoleEnvironment().RLInit(maxStepsSoft: 200, maxStepsHard: 1200);
```
Now the magic happens:
- We only accumulate rewards and calculate gradients for the first 200 steps
- The simulation continues running naturally up to 1200 steps or until failure
- Your GPU memory usage drops significantly
When you run this configuration, check your reward graphs – you’ll notice no reward exceeds 200 (our soft limit), even though the cart-pole physics continues beyond that point. Open your task manager and watch the memory savings in real-time.
This technique becomes indispensable for complex environments where episodes can run indefinitely. Instead of crashing with out-of-memory errors, you control precisely how much computational effort to invest while maintaining natural environment dynamics.
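To make the mechanism concrete, here is a conceptual, self-contained sketch of how a soft cap on stored transitions can coexist with a hard cap on episode length. It is illustration only, with a stand-in environment step instead of the real cart-pole, and it does not represent RLMatrix's actual rollout code.

```csharp
// Conceptual illustration only: not RLMatrix internals.
using System;
using System.Collections.Generic;

const int maxStepsSoft = 200;   // transitions past this point are not stored or trained on
const int maxStepsHard = 1200;  // the episode itself ends here at the latest

var rng = new Random();
var trainingBuffer = new List<(int step, float reward)>();

bool episodeOver = false;
int step = 0;

while (!episodeOver && step < maxStepsHard)
{
    // Stand-in for stepping the real environment.
    float reward = 1f;
    bool done = rng.NextDouble() < 0.001; // pretend the pole occasionally falls

    if (step < maxStepsSoft)
    {
        // Only these transitions consume memory and later produce gradients.
        trainingBuffer.Add((step, reward));
    }
    // Past the soft limit the simulation keeps running, but nothing is stored.

    episodeOver = done;
    step++;
}

Console.WriteLine($"Episode length: {step}, stored transitions: {trainingBuffer.Count}");
```

The stored-transition count, and therefore the memory and gradient cost, is bounded by the soft limit, while the episode itself can run all the way to the hard limit.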
Watching Learning in Action
When you run this training, a window will pop up showing the cart-pole environment. At first, the pole will topple quickly – your agent has no idea what it’s doing. But within minutes, you’ll witness a remarkable transformation:
- Initially, the agent makes random movements with no strategy
- Then it starts reacting when the pole is already falling (too late!)
- It gradually learns to make corrective moves earlier and earlier
- Finally, it makes subtle, preemptive adjustments, keeping the pole perfectly balanced
This visible progression is what makes cart-pole so satisfying as a learning example. You’re not just seeing numbers improve in a graph – you’re watching your AI develop a skill before your eyes.
Next Steps
In this tutorial, you’ve:
- Set up a real-time physics simulation for reinforcement learning
- Implemented a complete environment and trained an agent to master a classic control problem
- Learned how to efficiently manage memory with the soft/hard termination trick
- Seen how RLMatrix's PPO compares with DQN and why it suits this kind of control problem
Next, we’ll implement the same environment without using the toolkit, giving you insights into what’s happening behind those neat attributes we used.