Intro to Reinforcement Learning
Reinforcement Learning (RL) is a subfield of machine learning in which an agent learns to choose actions in its environment so as to maximize cumulative reward. Unlike supervised learning, RL doesn't rely on labeled datasets; instead, the agent learns through interaction with its environment, using reward feedback to improve its decision-making over time.
Key Concepts
- Agent: The program you train to perform a specified task.
- Environment: The real or virtual world where the agent performs actions.
- Action: A move made by the agent that causes a change in the environment.
- Reward: Feedback evaluating an action, which can be positive or negative.
- State: A representation of the environment at a specific moment in time.
- Policy: A strategy that defines the agent’s behavior by mapping states to actions.
- Value Function: Estimates the long-term reward for being in a particular state or taking a particular action.
- Q-Value (Action-Value): Estimates the total expected reward for taking a given action from a given state.
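A minimal sketch can tie these terms together. The `LineWorld` environment below is hypothetical (a five-state line with a goal at the right end), not from any library; it just shows where agent, environment, state, action, reward, and policy appear in code.

```python
# Hypothetical environment: five states on a line (0..4), goal at state 4.
class LineWorld:
    def reset(self):
        self.state = 0                     # State: the environment right now
        return self.state

    def step(self, action):
        # Action: -1 (left) or +1 (right) changes the environment
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0  # Reward: feedback on the action
        done = self.state == 4
        return self.state, reward, done

def policy(state):
    # Policy: maps each state to an action (here, always move right)
    return +1

# Agent-environment interaction loop: the agent acts, the environment responds
env = LineWorld()
state, done = env.reset(), False
while not done:
    action = policy(state)
    state, reward, done = env.step(action)
    print(f"state={state}, reward={reward}")
```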
Supervised & Unsupervised Learning
- Supervised Learning: Uses labeled datasets where each input has a corresponding output to train algorithms to predict outcomes and recognize patterns.
- Unsupervised Learning: Applies machine learning on unlabeled datasets that have no predefined labels or outputs, aiming to uncover hidden patterns in the data.
Key Differences
- Static vs. Dynamic
  - Supervised and unsupervised learning focus on finding patterns in static training data.
  - RL is dynamic, focusing on developing policies that guide the agent's actions at each step.
- No Explicit Right Answer
  - In supervised learning, the "right answer" is provided by the training data.
  - In RL, the right answer isn't explicit; the agent learns through trial and error, relying on rewards to gauge progress or failure.
- Exploration Required
  - Supervised and unsupervised learning derive answers directly from training data.
  - In RL, the agent must explore the environment to discover new strategies for earning rewards.
OpenAI Gym
- A toolkit for developing and comparing reinforcement learning algorithms.
- Provides a game-like environment where agents can take actions and learn from the outcomes.
- After the agent takes an action, the environment updates its state, and the agent uses these changes to decide its next move.
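A typical interaction loop looks like the sketch below, written against the classic `gym` API (pre-0.26; the newer Gymnasium fork returns extra values from `reset` and `step`). `CartPole-v1` is one of the toolkit's built-in environments, and the agent here just samples random actions rather than learning.

```python
import gym

# Classic Gym loop (pre-0.26 API; Gymnasium's reset/step return extra values)
env = gym.make("CartPole-v1")
state = env.reset()          # initial state of the environment
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()            # random action (no learning yet)
    state, reward, done, info = env.step(action)  # environment updates its state
    total_reward += reward                        # agent observes the outcome
env.close()
print(f"Episode return: {total_reward}")
```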
Markov Process
Markov Property: The future state of a process depends only on its present state, not on the sequence of events that preceded it.
- Formally, P(X_{t+1} | X_t, X_{t-1}, ..., X_0) = P(X_{t+1} | X_t): the state of X at time t+1 depends only on the state of X at time t, independent of earlier states.
- When the Markov Property is applied to a random process, it becomes a Markov Chain — a model that describes a sequence of possible events where the probability of each event depends only on the state attained in the previous event.
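A short simulation makes this concrete. The two-state weather chain and its transition probabilities below are made up for illustration; the point is that `next_state` looks only at the current state, never at the history.

```python
import random

# Hypothetical two-state weather Markov chain: the distribution of the
# next state depends only on the current state (Markov Property).
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(state):
    states, probs = zip(*transitions[state].items())
    return random.choices(states, weights=probs)[0]

state = "sunny"
chain = [state]
for _ in range(10):
    state = next_state(state)   # the history before `state` is irrelevant
    chain.append(state)
print(" -> ".join(chain))
```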
Important Notes
- Exploration vs. Exploitation: The agent must balance exploring new actions to find better rewards against exploiting known actions to maximize immediate rewards (see the epsilon-greedy sketch after this list).
- Delayed Rewards: Actions can have long-term consequences, making it essential to consider future rewards, not just immediate feedback.
- Credit Assignment Problem: Determining which actions contributed to the rewards received can be challenging, especially when rewards are delayed.
- Training Process: RL typically involves running episodes where the agent repeatedly interacts with the environment, gathering experience and refining its policy.
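Several of these notes come together in tabular Q-learning. The sketch below is a minimal illustration under assumed settings, not a production implementation: it reuses the hypothetical five-state line world from earlier (redefined so the snippet stands alone), epsilon-greedy action selection handles exploration vs. exploitation, the discount factor gamma lets delayed rewards flow back to earlier decisions (addressing credit assignment), and the outer loop over episodes is the training process.

```python
import random

# Epsilon-greedy Q-learning on a hypothetical 5-state line world:
# states 0..4, reward 1.0 for reaching state 4.
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1    # learning rate, discount, exploration rate
ACTIONS = [-1, +1]                       # move left / move right
Q = {(s, a): 0.0 for s in range(5) for a in ACTIONS}

def step(state, action):
    nxt = max(0, min(4, state + action))
    return nxt, (1.0 if nxt == 4 else 0.0), nxt == 4

for episode in range(200):               # training: repeated episodes refine Q
    state, done = 0, False
    while not done:
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)                     # explore
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])  # exploit
        nxt, reward, done = step(state, action)
        # Q-learning update: bootstrap on the best next value so delayed
        # rewards are credited back to earlier state-action pairs
        best_next = max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = nxt

print(max(ACTIONS, key=lambda a: Q[(0, a)]))  # learned action at state 0: +1
```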