Intro to Reinforcement Learning

Reinforcement Learning (RL) is a subfield of machine learning that teaches an agent how to choose an action in its environment to maximize rewards over time. Unlike supervised learning, RL doesn't rely on labeled datasets; instead, the agent learns through interactions with its environment, using feedback from rewards to improve its decision-making over time.

Key Concepts

Agent: The program you train with the aim of performing a specified task.
Environment: The real or virtual world where the agent performs actions.
Action: A move made by the agent that causes a change in the environment.
Rewards: The evaluation of an action, which can be positive or negative.
State: A representation of the environment at a specific moment in time.
Policy: A strategy that defines the agent’s behavior by mapping states to actions.
Value Function: Estimates the long-term reward for being in a particular state or taking a particular action.
Q-Value (Action-Value): Estimates the total expected reward for taking a given action from a given state.

Supervise & Unsupervised

Supervised Learning: Uses labeled datasets where each input has a corresponding output to train algorithms to predict outcomes and recognize patterns.
Unsupervised Learning: Applies machine learning on unlabeled datasets that have no predefined labels or outputs, aiming to uncover hidden patterns in the data.

Key Differences

Static vs. Dynamic
Supervised and unsupervised learning focus on finding patterns in static training data.
RL is dynamic, focusing on developing policies to guide the agent's actions at each step.
No Explicit Right Answer
In supervised learning, the "right answer" is provided by training data.
In RL, the right answer isn’t explicit — the agent learns through trial and error, relying on rewards to gauge progress or failure.
Exploration Required
Supervised and unsupervised learning derive answers directly from training data.
In RL, the agent must explore the environment to discover new strategies for earning rewards.

OpenAI Gym

A toolkit for developing and comparing reinforcement learning algorithms.
Provides a game-like environment where agents can take actions and learn from the outcomes.
After the agent takes an action, the environment updates its state, and the agent uses these changes to decide its next move.

Markov Process

Markov Property: A process where the future state depends only on the present state and not on the sequence of events that preceded it.

The state of X at time t+1 depends only on the state of X at time t, making it independent of past states.
When the Markov Property is applied to a random process, it becomes a Markov Chain — a model that describes a sequence of possible events where the probability of each event depends only on the state attained in the previous event.

Important Notes

Exploration vs. Exploitation: The agent must balance between exploring new actions to find better rewards and exploiting known actions to maximize immediate rewards.
Delayed Rewards: Actions can have long-term consequences, making it essential to consider future rewards, not just immediate feedback.
Credit Assignment Problem: Determining which actions contributed to the rewards received can be challenging, especially when rewards are delayed.
Training Process: RL typically involves running episodes where the agent repeatedly interacts with the environment, gathering experience and refining its policy.