Cart-Pole Example
Ready to see reinforcement learning in action? In this tutorial, we’ll take on the classic balancing challenge where you’ll watch your AI learn to keep a pole upright on a moving cart.
The cart-pole challenge blends simplicity with visual feedback, making it perfect for reinforcement learning. You push a cart left or right, and physics determines whether the attached pole stays balanced or topples over. Every time step, your agent makes a decision, and you get the satisfaction of watching your algorithm gradually master the task.
Setting Up Your Project
We’ll use SciSharp/Gym.NET to provide our simulated physics environment.
You can follow along or grab the complete project if you prefer.
Let’s install the necessary packages:
```bash
dotnet add package RLMatrix
dotnet add package RLMatrix.Toolkit
dotnet add package Gym.NET
dotnet add package Gym.NET.Environments
dotnet add package Gym.NET.Rendering.WinForm
```
Building the Environment
Here’s our cart-pole environment implementation:
```csharp
using System;
using System.Threading.Tasks;
using Gym.Environments.Envs.Classic;
using Gym.Rendering.WinForm;
using RLMatrix.Toolkit;
using NumSharp;

namespace MyEnv
{
    [RLMatrixEnvironment]
    public partial class CartPoleEnvironment
    {
        private CartPoleEnv myEnv;
        private float[] myState;
        private int stepCounter;
        private const int MaxSteps = 100000;
        private bool isDone;

        public CartPoleEnvironment()
        {
            InitialiseAsync();
        }

        private void InitialiseAsync()
        {
            myEnv = new CartPoleEnv(WinFormEnvViewer.Factory);
            ResetEnvironment();
        }

        // The four observations the agent sees each step
        [RLMatrixObservation]
        public float GetCartPosition() => myState[0];

        [RLMatrixObservation]
        public float GetCartVelocity() => myState[1];

        [RLMatrixObservation]
        public float GetPoleAngle() => myState[2];

        [RLMatrixObservation]
        public float GetPoleAngularVelocity() => myState[3];

        // A single discrete action with two choices: push the cart left or right
        [RLMatrixActionDiscrete(2)]
        public void ApplyForce(int action)
        {
            if (isDone)
                ResetEnvironment();

            var (observation, reward, done, _) = myEnv.Step(action);
            myEnv.Render();
            myState = ToFloatArray(observation);
            isDone = done;
            stepCounter++;

            if (stepCounter > MaxSteps)
                isDone = true;
        }

        private float[] ToFloatArray(NDArray npArray)
        {
            double[] doubleArray = npArray.ToArray<double>();
            return Array.ConvertAll(doubleArray, item => (float)item);
        }

        // +1 for every step the pole stays up, 0 once the episode ends
        [RLMatrixReward]
        public float CalculateReward()
        {
            return isDone ? 0 : 1;
        }

        [RLMatrixDone]
        public bool IsEpisodeFinished()
        {
            return isDone;
        }

        [RLMatrixReset]
        public void ResetEnvironment()
        {
            myEnv.Reset();
            myState = new float[4] { 0, 0, 0, 0 };
            isDone = false;
            stepCounter = 0;
        }
    }
}
```
Setting Up Training
Now for the training code that will teach our agent to balance:
```csharp
using RLMatrix.Agents.Common;
using RLMatrix;
using MyEnv;

Console.WriteLine("Starting cart-pole training...\n");

// Configure learning parameters
var learningSetup = new PPOAgentOptions(
    batchSize: 8,
    ppoEpochs: 8,
    memorySize: 1000,
    gamma: 0.99f,
    width: 128,
    entropyCoefficient: 0.01f,
    lr: 1E-02f);

// Create environment and attach to agent
var environment = new CartPoleEnvironment().RLInit(maxStepsSoft: 1200, maxStepsHard: 1200);
var env = new List<IEnvironmentAsync<float[]>> {
    environment,
    //new CartPoleEnvironment().RLInit() //uncomment to train with multiple environments
};

// Initialize agent
var agent = new LocalDiscreteRolloutAgent<float[]>(learningSetup, env);

// Train until convergence
for (int i = 0; i < 100000; i++)
{
    await agent.Step();
}

Console.WriteLine("\nTraining complete!");
Console.ReadLine();
```
The simple reward of +1 per time step is deceptively powerful. Deep reinforcement learning algorithms naturally optimize for the long game, figuring out that subtle, preemptive adjustments lead to longer balancing times and higher cumulative rewards.
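To make the "long game" concrete, consider the discounted return the agent actually maximizes. The snippet below is a small illustrative calculation in plain C#, separate from the tutorial project: with gamma = 0.99, an episode that survives 200 steps earns a substantially larger return than one that falls after 50 steps, so every extra step of balance is directly rewarded.

```csharp
// Illustrative only: discounted return G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
// with a constant reward of +1 for every step the pole stays up.
static float DiscountedReturn(int survivedSteps, float gamma = 0.99f)
{
    float g = 0f, discount = 1f;
    for (int t = 0; t < survivedSteps; t++)
    {
        g += discount * 1f;   // +1 reward per surviving step
        discount *= gamma;
    }
    return g;
}

Console.WriteLine(DiscountedReturn(50));   // ≈ 39.5
Console.WriteLine(DiscountedReturn(200));  // ≈ 86.6
```

Because longer episodes always accumulate more return, the agent never needs an explicit instruction to "balance longer"; that goal falls out of the reward structure on its own.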
PPO in RLMatrix: What’s Different
While DQN (from our earlier tutorials) can be more sample-efficient for simple tasks, PPO generally delivers more stable training without requiring extensive hyperparameter tuning. This makes it particularly well-suited for challenging control problems.
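A big part of that stability comes from the "proximal" piece of PPO: the clipped policy update, which keeps each new policy close to the one that gathered the data. Below is a minimal sketch of the standard clipped surrogate objective for a single sample, written in plain C# for illustration; it is textbook PPO math, not RLMatrix's internal code.

```csharp
using System;

// Standard PPO clipped surrogate for one (state, action) sample.
// ratio     = pi_new(a|s) / pi_old(a|s)
// advantage = advantage estimate for the action that was taken
// epsilon   = clip range (commonly 0.1 to 0.2)
static float ClippedSurrogate(float ratio, float advantage, float epsilon = 0.2f)
{
    float unclipped = ratio * advantage;
    float clipped = Math.Clamp(ratio, 1f - epsilon, 1f + epsilon) * advantage;

    // Taking the minimum removes any benefit from pushing the policy
    // far away from the data-collecting policy in a single update.
    return Math.Min(unclipped, clipped);
}
```

Because each batch can only nudge the policy a limited amount, PPO tends to tolerate less careful hyperparameter tuning than many alternatives.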
The Memory-Saving Trick You Need to Know
Look at this line in our training code:
```csharp
var environment = new CartPoleEnvironment().RLInit(maxStepsSoft: 1200, maxStepsHard: 1200);
```
This innocuous parameter configuration holds the key to training with very long episodes without overwhelming your GPU’s memory. Let me explain:
What happens when we modify these values?
```csharp
var environment = new CartPoleEnvironment().RLInit(maxStepsSoft: 200, maxStepsHard: 1200);
```
Now the magic happens:
- We only accumulate rewards and calculate gradients for the first 200 steps
- The simulation continues running naturally up to 1200 steps or until failure
- Your GPU memory usage drops significantly
When you run this configuration, check your reward graphs – you’ll notice no reward exceeds 200 (our soft limit), even though the cart-pole physics continues beyond that point. Open your task manager and watch the memory savings in real-time.
This technique becomes indispensable for complex environments where episodes can run indefinitely. Instead of crashing with out-of-memory errors, you control precisely how much computational effort to invest while maintaining natural environment dynamics.
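To make the mechanism concrete, here is a conceptual, self-contained sketch of how a soft cap on stored transitions can coexist with a hard cap on episode length. It is illustration only, with a stand-in environment step instead of the real cart-pole, and it does not represent RLMatrix's actual rollout code.

```csharp
// Conceptual illustration only: not RLMatrix internals.
using System;
using System.Collections.Generic;

const int maxStepsSoft = 200;   // transitions past this point are not stored or trained on
const int maxStepsHard = 1200;  // the episode itself ends here at the latest

var rng = new Random();
var trainingBuffer = new List<(int step, float reward)>();

bool episodeOver = false;
int step = 0;

while (!episodeOver && step < maxStepsHard)
{
    // Stand-in for stepping the real environment.
    float reward = 1f;
    bool done = rng.NextDouble() < 0.001; // pretend the pole occasionally falls

    if (step < maxStepsSoft)
    {
        // Only these transitions consume memory and later produce gradients.
        trainingBuffer.Add((step, reward));
    }
    // Past the soft limit the simulation keeps running, but nothing is stored.

    episodeOver = done;
    step++;
}

Console.WriteLine($"Episode length: {step}, stored transitions: {trainingBuffer.Count}");
```

The stored-transition count, and therefore the memory and gradient cost, is bounded by the soft limit, while the episode itself can run all the way to the hard limit.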
Watching Learning in Action
When you run this training, a window will pop up showing the cart-pole environment. At first, the pole will topple quickly – your agent has no idea what it’s doing. But within minutes, you’ll witness a remarkable transformation:
- Initially, the agent makes random movements with no strategy
- Then it starts reacting when the pole is already falling (too late!)
- It gradually learns to make corrective moves earlier and earlier
- Finally, it makes subtle, preemptive adjustments, keeping the pole perfectly balanced
This visible progression is what makes cart-pole so satisfying as a learning example. You’re not just seeing numbers improve in a graph – you’re watching your AI develop a skill before your eyes.
Next Steps
In this tutorial, you’ve:
- Set up a real-time physics simulation for reinforcement learning
- Implemented a complete environment and trained an agent to master a classic control problem
- Learned how to efficiently manage memory with the soft/hard termination trick
- Seen how RLMatrix's PPO compares with DQN and why it suits this kind of control problem
Next, we’ll implement the same environment without using the toolkit, giving you insights into what’s happening behind those neat attributes we used.