
Cart-Pole Example

Ready to see reinforcement learning in action? In this tutorial, we’ll take on the classic balancing challenge: you’ll watch your AI learn to keep a pole upright on a moving cart.

The cart-pole challenge blends simplicity with visual feedback, making it perfect for reinforcement learning. You push a cart left or right, and physics determines whether the attached pole stays balanced or topples over. Every time step, your agent makes a decision, and you get the satisfaction of watching your algorithm gradually master the task.

Setting Up Your Project

We’ll use SciSharp/Gym.NET to provide our simulated physics environment.

You can follow along or grab the complete project if you prefer.

Let’s install the necessary packages:

Installing the necessary packages
dotnet add package RLMatrix
dotnet add package RLMatrix.Toolkit
dotnet add package Gym.NET
dotnet add package Gym.NET.Environments
dotnet add package Gym.NET.Rendering.WinForm

Building the Environment

Here’s our cart-pole environment implementation:

CartPoleEnvironment.cs
using System;
using System.Threading.Tasks;
using Gym.Environments.Envs.Classic;
using Gym.Rendering.WinForm;
using RLMatrix.Toolkit;
using NumSharp;

namespace MyEnv
{
    [RLMatrixEnvironment]
    public partial class CartPoleEnvironment
    {
        private CartPoleEnv myEnv;
        private float[] myState;
        private int stepCounter;
        private const int MaxSteps = 100000;
        private bool isDone;

        public CartPoleEnvironment()
        {
            InitialiseAsync();
        }

        private void InitialiseAsync()
        {
            myEnv = new CartPoleEnv(WinFormEnvViewer.Factory);
            ResetEnvironment();
        }

        // The four observations the agent sees: cart position/velocity,
        // pole angle and pole angular velocity.
        [RLMatrixObservation]
        public float GetCartPosition() => myState[0];

        [RLMatrixObservation]
        public float GetCartVelocity() => myState[1];

        [RLMatrixObservation]
        public float GetPoleAngle() => myState[2];

        [RLMatrixObservation]
        public float GetPoleAngularVelocity() => myState[3];

        // One discrete action with two choices: push the cart left or right.
        [RLMatrixActionDiscrete(2)]
        public void ApplyForce(int action)
        {
            if (isDone)
                ResetEnvironment();

            var (observation, reward, done, _) = myEnv.Step(action);
            myEnv.Render();
            myState = ToFloatArray(observation);
            isDone = done;
            stepCounter++;

            // Safety cap so a single episode can never run forever
            if (stepCounter > MaxSteps)
                isDone = true;
        }

        private float[] ToFloatArray(NDArray npArray)
        {
            double[] doubleArray = npArray.ToArray<double>();
            return Array.ConvertAll(doubleArray, item => (float)item);
        }

        // +1 for every step the pole stays up, 0 once the episode ends
        [RLMatrixReward]
        public float CalculateReward()
        {
            return isDone ? 0 : 1;
        }

        [RLMatrixDone]
        public bool IsEpisodeFinished()
        {
            return isDone;
        }

        [RLMatrixReset]
        public void ResetEnvironment()
        {
            myEnv.Reset();
            myState = new float[4] { 0, 0, 0, 0 };
            isDone = false;
            stepCounter = 0;
        }
    }
}

Setting Up Training

Now for the training code that will teach our agent to balance:

Program.cs
using RLMatrix.Agents.Common;
using RLMatrix;
using MyEnv;

Console.WriteLine("Starting cart-pole training...\n");

// Configure learning parameters
var learningSetup = new PPOAgentOptions(
    batchSize: 8,
    ppoEpochs: 8,
    memorySize: 1000,
    gamma: 0.99f,
    width: 128,
    entropyCoefficient: 0.01f,
    lr: 1E-02f
);

// Create environment and attach to agent
var environment = new CartPoleEnvironment().RLInit(maxStepsSoft: 1200, maxStepsHard: 1200);
var env = new List<IEnvironmentAsync<float[]>> {
    environment,
    //new CartPoleEnvironment().RLInit() //uncomment to train with multiple environments
};

// Initialize agent
var agent = new LocalDiscreteRolloutAgent<float[]>(learningSetup, env);

// Train until convergence
for (int i = 0; i < 100000; i++)
{
    await agent.Step();
}

Console.WriteLine("\nTraining complete!");
Console.ReadLine();

The simple reward of +1 per time step is deceptively powerful. Deep reinforcement learning algorithms naturally optimize for the long game, figuring out that subtle, preemptive adjustments lead to longer balancing times and higher cumulative rewards.
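To make that concrete, here’s a small standalone sketch (a hypothetical helper, not part of the tutorial project) that computes the discounted return for an episode surviving a given number of steps with the +1-per-step reward and gamma = 0.99. Surviving longer always yields a higher return, which is exactly what the agent learns to maximize.

Discounted return illustration
float DiscountedReturn(int steps, float gamma = 0.99f)
{
    // +1 reward for every surviving step, discounted by gamma each step
    float total = 0f, discount = 1f;
    for (int t = 0; t < steps; t++)
    {
        total += discount * 1f;
        discount *= gamma;
    }
    return total;
}

Console.WriteLine(DiscountedReturn(10));   // ~9.6
Console.WriteLine(DiscountedReturn(100));  // ~63.4
Console.WriteLine(DiscountedReturn(500));  // ~99.3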

PPO in RLMatrix: What’s Different

While DQN (from our earlier tutorials) can be more sample-efficient for simple tasks, PPO generally delivers more stable training without requiring extensive hyperparameter tuning. This makes it particularly well-suited for challenging control problems.
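If you want to see the contrast yourself, you can swap the PPO options for the DQN-style options used in the earlier tutorials. The sketch below is an assumption about that configuration – the option names and sensible values may differ between RLMatrix versions, so verify them against your installed package rather than treating this as a drop-in replacement.

Hypothetical DQN configuration sketch
// Assumed option names based on the earlier DQN tutorials – verify
// against your RLMatrix version before use.
var dqnSetup = new DQNAgentOptions(
    batchSize: 64,
    memorySize: 10000,
    gamma: 0.99f,
    epsStart: 1f,     // start fully exploratory
    epsEnd: 0.05f,    // decay towards mostly greedy actions
    epsDecay: 150f,
    lr: 1e-4f
);
// The rest of the training loop keeps the same shape: build the agent
// with these options and call agent.Step() in a loop as before.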

The Memory-Saving Trick You Need to Know

Look at this line in our training code:

var environment = new CartPoleEnvironment().RLInit(maxStepsSoft: 1200, maxStepsHard: 1200);

This innocuous parameter configuration holds the key to training with very long episodes without overwhelming your GPU’s memory. Let me explain:

What happens when we modify these values?

var environment = new CartPoleEnvironment().RLInit(maxStepsSoft: 200, maxStepsHard: 1200);

Now the magic happens:

  1. We only accumulate rewards and calculate gradients for the first 200 steps
  2. The simulation continues running naturally up to 1200 steps or until failure
  3. Your GPU memory usage drops significantly

When you run this configuration, check your reward graphs – you’ll notice no episode reward exceeds 200 (our soft limit), even though the cart-pole physics continues beyond that point. Open your task manager and watch the memory savings in real time.

This technique becomes indispensable for complex environments where episodes can run indefinitely. Instead of crashing with out-of-memory errors, you control precisely how much computational effort to invest while maintaining natural environment dynamics.
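As a concrete (purely illustrative) example, an environment whose episodes could otherwise run for hours can cap the learned portion of each episode while still letting the simulation play out much longer:

Illustrative soft/hard limits
// Illustrative values: collect rewards and gradients from at most
// 2,000 steps per episode, but allow the simulation itself to run
// for up to 100,000 steps.
var longEpisodeEnv = new CartPoleEnvironment().RLInit(
    maxStepsSoft: 2000,
    maxStepsHard: 100000);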

Watching Learning in Action

When you run this training, a window will pop up showing the cart-pole environment. At first, the pole will topple quickly – your agent has no idea what it’s doing. But within minutes, you’ll witness a remarkable transformation:

  1. Initially, the agent makes random movements with no strategy
  2. Then it starts reacting when the pole is already falling (too late!)
  3. It gradually learns to make corrective moves earlier and earlier
  4. Finally, it makes subtle, preemptive adjustments, keeping the pole perfectly balanced

This visible progression is what makes cart-pole so satisfying as a learning example. You’re not just seeing numbers improve in a graph – you’re watching your AI develop a skill before your eyes.

Next Steps

In this tutorial, you’ve:

  • Set up a real-time physics simulation for reinforcement learning
  • Implemented a complete agent to master a classic control problem
  • Learned how to efficiently manage memory with the soft/hard termination trick
  • Understood how RLMatrix’s PPO implementation differs from standard ones

Next, we’ll implement the same environment without using the toolkit, giving you insights into what’s happening behind those neat attributes we used.