Getting Started with RLMatrix
Introduction
When we write traditional programs, we tell the computer exactly what to do in every situation. For example, if we wanted to write a program that checks whether a number matches a pattern, we might write:
if (input == pattern)
{
    return "Correct!";
}
else
{
    return "Try again!";
}
But what if we want our program to learn on its own? What if the rules are too complex to write out, or we don’t even know the rules ourselves? This is where reinforcement learning comes in.
Setting Up Your Project
You can follow along or clone this GitHub repository. First, let’s get everything installed:
dotnet add package RLMatrix
dotnet add package RLMatrix.Toolkit
Your First Learning Environment
Let’s create something simple but meaningful - an environment where our AI will learn to match patterns. While this seems basic (and would be trivial to program directly), it introduces all the key concepts we need.
Here’s our complete environment:
using RLMatrix.Toolkit;
namespace PatternMatchingExample;
[RLMatrixEnvironment]
public partial class PatternMatchingEnvironment
{
    private int pattern = 0;
    private int aiChoice = 0;
    private bool roundFinished = false;

    // Simple counters for last 50 steps
    private int correct = 0;
    private int total = 0;

    // Simple accuracy calculation
    public float RecentAccuracy => total > 0 ? (float)correct / total * 100 : 0;

    [RLMatrixObservation]
    public float SeePattern() => pattern;

    [RLMatrixActionDiscrete(2)]
    public void MakeChoice(int choice)
    {
        aiChoice = choice;
        roundFinished = true;

        // Update counters
        total++;
        if (aiChoice == pattern) correct++;
    }

    [RLMatrixReward]
    public float GiveReward() => aiChoice == pattern ? 1.0f : -1.0f;

    [RLMatrixDone]
    public bool IsRoundOver() => roundFinished;

    [RLMatrixReset]
    public void StartNewRound()
    {
        pattern = Random.Shared.Next(2);
        aiChoice = 0;
        roundFinished = false;
    }

    public void ResetStats()
    {
        correct = 0;
        total = 0;
    }
}
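Before wiring this into training, it can help to poke the environment by hand. Here is a minimal sketch that exercises only the methods defined above, with no RL machinery involved:

// Manual smoke test: drive one round of the environment ourselves.
var check = new PatternMatchingEnvironment();
check.StartNewRound();                      // picks a random pattern (0 or 1)
check.MakeChoice(1);                        // guess "1"
Console.WriteLine(check.GiveReward());      // 1 if the pattern was 1, otherwise -1
Console.WriteLine(check.IsRoundOver());     // True: each round ends after one choice

This is exactly the loop the agent will run for us automatically: observe, choose, get rewarded, reset.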
Training Your AI
Now comes the interesting part - teaching our AI to match patterns. We’ll use an algorithm called DQN (Deep Q-Network). Don’t worry too much about the name - it’s just one way of teaching AI to make decisions.
Here’s how we set up the training:
using RLMatrix.Agents.Common;
using RLMatrix;
using PatternMatchingExample;

Console.WriteLine("Starting pattern matching training...\n");

// Set up how our AI will learn
var learningSetup = new DQNAgentOptions(
    batchSize: 32,      // Learn from 32 experiences at once
    memorySize: 1000,   // Remember last 1000 attempts
    gamma: 0.99f,       // Care a lot about future rewards
    epsStart: 1f,       // Start by trying everything
    epsEnd: 0.05f,      // Eventually stick to what works
    epsDecay: 150f      // How fast to transition
);

// Create our environment
var environment = new PatternMatchingEnvironment().RLInit();
var env = new List<IEnvironmentAsync<float[]>> {
    environment,
    //new PatternMatchingEnvironment().RLInit() // you can add more than one to train in parallel
};

// Create our learning agent
var agent = new LocalDiscreteRolloutAgent<float[]>(learningSetup, env);

// Let it learn!
for (int i = 0; i < 1000; i++)
{
    await agent.Step();

    if ((i + 1) % 50 == 0)
    {
        Console.WriteLine($"Step {i + 1}/1000 - Last 50 steps accuracy: {environment.RecentAccuracy:F1}%");
        environment.ResetStats();

        Console.WriteLine("\nPress Enter to continue...");
        Console.ReadLine();
    }
}

Console.WriteLine("\nTraining complete!");
Console.ReadLine();
When you run this code, you’ll see the training progress displayed every 50 steps:
Starting pattern matching training...
Step 50/1000 - Last 50 steps accuracy: 48.0%
Press Enter to continue...

Step 100/1000 - Last 50 steps accuracy: 68.0%
Press Enter to continue...

Step 150/1000 - Last 50 steps accuracy: 86.0%
Press Enter to continue...

Step 200/1000 - Last 50 steps accuracy: 82.0%
Press Enter to continue...
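Accuracy starts near chance (a random guess on a two-value pattern is right about 50% of the time) and climbs as epsilon decays from epsStart toward epsEnd, shifting the agent from exploring to exploiting. As a rough illustration of that schedule — note this assumes the common exponential form, and RLMatrix's internal decay formula may differ:

// Illustration only: assumes the common exponential epsilon-greedy schedule;
// RLMatrix's internal decay formula may differ.
static float Epsilon(int step, float epsStart = 1f, float epsEnd = 0.05f, float epsDecay = 150f)
    => epsEnd + (epsStart - epsEnd) * MathF.Exp(-step / epsDecay);

for (int step = 0; step <= 600; step += 150)
    Console.WriteLine($"step {step}: epsilon ~ {Epsilon(step):F2}");

Under this assumption, with epsDecay at 150 exploration falls off within the first few hundred steps, which lines up with the accuracy jump between steps 50 and 150 above.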
Beyond Simple Matching
While our example is straightforward, the same principles apply to much more complex problems.
Next Steps
Ready to go further? We have two main algorithms available:
- DQN: What we just used; good for simple discrete choices and benefits from a large replay memory.
- PPO: More advanced; handles continuous actions (like controlling speed or direction).
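Since both algorithms run through the same agent setup, switching is mostly a matter of swapping the options object. Here is a rough sketch of what that could look like; the PPOAgentOptions parameters shown (and whether LocalDiscreteRolloutAgent accepts them unchanged) are assumptions to verify against the RLMatrix documentation:

// Hypothetical swap: PPO in place of DQN for the same discrete environment.
// Parameter names and values are illustrative; check the RLMatrix docs.
var ppoSetup = new PPOAgentOptions(
    batchSize: 32,
    memorySize: 1000,
    gamma: 0.99f
);
var ppoAgent = new LocalDiscreteRolloutAgent<float[]>(ppoSetup, env);

The environment class itself stays untouched; only the learning setup changes.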