Getting Started with RLMatrix
Introduction
When we write traditional programs, we tell the computer exactly what to do in every situation. For example, if we wanted to write a program that checks whether a number matches a pattern, we might write:
if (input == pattern)
{
    return "Correct!";
}
else
{
    return "Try again!";
}
But what if we want our program to learn on its own? What if the rules are too complex to write out, or we don’t even know the rules ourselves? This is where reinforcement learning comes in.
Setting Up Your Project
You can follow along or clone this GitHub repository. First, let’s get everything installed:
dotnet add package RLMatrix
dotnet add package RLMatrix.Toolkit
Your First Learning Environment
Let’s create something simple but meaningful - an environment where our AI will learn to match patterns. While this seems basic (and would be trivial to program directly), it introduces all the key concepts we need.
Here’s our complete environment:
using RLMatrix.Toolkit;
namespace PatternMatchingExample;
[RLMatrixEnvironment]
public partial class PatternMatchingEnvironment
{
    private int pattern = 0;
    private int aiChoice = 0;
    private bool roundFinished = false;

    // Simple counters for last 50 steps
    private int correct = 0;
    private int total = 0;

    // Simple accuracy calculation
    public float RecentAccuracy => total > 0 ? (float)correct / total * 100 : 0;

    [RLMatrixObservation]
    public float SeePattern() => pattern;

    [RLMatrixActionDiscrete(2)]
    public void MakeChoice(int choice)
    {
        aiChoice = choice;
        roundFinished = true;

        // Update counters
        total++;
        if (aiChoice == pattern) correct++;
    }

    [RLMatrixReward]
    public float GiveReward() => aiChoice == pattern ? 1.0f : -1.0f;

    [RLMatrixDone]
    public bool IsRoundOver() => roundFinished;

    [RLMatrixReset]
    public void StartNewRound()
    {
        pattern = Random.Shared.Next(2);
        aiChoice = 0;
        roundFinished = false;
    }

    public void ResetStats()
    {
        correct = 0;
        total = 0;
    }
}
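Before wiring this into training, it can help to poke the environment by hand. Here is a minimal sketch that exercises only the methods defined above, with no RL machinery involved:

// Manual smoke test: drive one round of the environment ourselves.
var check = new PatternMatchingEnvironment();
check.StartNewRound();                      // picks a random pattern (0 or 1)
check.MakeChoice(1);                        // guess "1"
Console.WriteLine(check.GiveReward());      // 1 if the pattern was 1, otherwise -1
Console.WriteLine(check.IsRoundOver());     // True: each round ends after one choice

This is exactly the loop the agent will run for us automatically: observe, choose, get rewarded, reset.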
Training Your AI
Now comes the interesting part - teaching our AI to match patterns. We’ll use an algorithm called DQN (Deep Q-Network). Don’t worry too much about the name - it’s just one way of teaching AI to make decisions.
Here’s how we set up the training:
using RLMatrix.Agents.Common;
using RLMatrix;
using PatternMatchingExample;

Console.WriteLine("Starting pattern matching training...\n");

// Set up how our AI will learn
var learningSetup = new DQNAgentOptions(
    batchSize: 32,      // Learn from 32 experiences at once
    memorySize: 1000,   // Remember last 1000 attempts
    gamma: 0.99f,       // Care a lot about future rewards
    epsStart: 1f,       // Start by trying everything
    epsEnd: 0.05f,      // Eventually stick to what works
    epsDecay: 150f      // How fast to transition
);

// Create our environment
var environment = new PatternMatchingEnvironment().RLInit();
var env = new List<IEnvironmentAsync<float[]>> {
    environment,
    //new PatternMatchingEnvironment().RLInit() // you can add more than one to train in parallel
};

// Create our learning agent
var agent = new LocalDiscreteRolloutAgent<float[]>(learningSetup, env);

// Let it learn!
for (int i = 0; i < 1000; i++)
{
    await agent.Step();

    if ((i + 1) % 50 == 0)
    {
        Console.WriteLine($"Step {i + 1}/1000 - Last 50 steps accuracy: {environment.RecentAccuracy:F1}%");
        environment.ResetStats();

        Console.WriteLine("\nPress Enter to continue...");
        Console.ReadLine();
    }
}

Console.WriteLine("\nTraining complete!");
Console.ReadLine();
When you run this code, you’ll see the training progress displayed every 50 steps:
Starting pattern matching training...
Step 50/1000 - Last 50 steps accuracy: 48.0%
Press Enter to continue...

Step 100/1000 - Last 50 steps accuracy: 68.0%
Press Enter to continue...

Step 150/1000 - Last 50 steps accuracy: 86.0%
Press Enter to continue...

Step 200/1000 - Last 50 steps accuracy: 82.0%
Press Enter to continue...
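Accuracy starts near chance (a random guess on a two-value pattern is right about 50% of the time) and climbs as epsilon decays from epsStart toward epsEnd, shifting the agent from exploring to exploiting. As a rough illustration of that schedule — note this assumes the common exponential form, and RLMatrix's internal decay formula may differ:

// Illustration only: assumes the common exponential epsilon-greedy schedule;
// RLMatrix's internal decay formula may differ.
static float Epsilon(int step, float epsStart = 1f, float epsEnd = 0.05f, float epsDecay = 150f)
    => epsEnd + (epsStart - epsEnd) * MathF.Exp(-step / epsDecay);

for (int step = 0; step <= 600; step += 150)
    Console.WriteLine($"step {step}: epsilon ~ {Epsilon(step):F2}");

Under this assumption, with epsDecay at 150 exploration falls off within the first few hundred steps, which lines up with the accuracy jump between steps 50 and 150 above.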
Beyond Simple Matching
While our example is straightforward, the same principles apply to much more complex problems.
Next Steps
Ready to go further? We have two main algorithms available:
- DQN: What we just used; good for simple discrete choices and benefits from a large replay memory.
- PPO: More advanced; handles continuous actions (like controlling speed or direction).
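Since both algorithms run through the same agent setup, switching is mostly a matter of swapping the options object. Here is a rough sketch of what that could look like; the PPOAgentOptions parameters shown (and whether LocalDiscreteRolloutAgent accepts them unchanged) are assumptions to verify against the RLMatrix documentation:

// Hypothetical swap: PPO in place of DQN for the same discrete environment.
// Parameter names and values are illustrative; check the RLMatrix docs.
var ppoSetup = new PPOAgentOptions(
    batchSize: 32,
    memorySize: 1000,
    gamma: 0.99f
);
var ppoAgent = new LocalDiscreteRolloutAgent<float[]>(ppoSetup, env);

The environment class itself stays untouched; only the learning setup changes.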