At the heart of deep learning lies a seemingly magical process: how neural networks learn from data. This learning is powered by Automatic Differentiation (Autograd), a technique that efficiently computes how a small change to each of a network's internal parameters (weights and biases) affects its performance. To truly grasp this, we'll explore the concept of Autograd and then dive into Micrograd, a concise implementation that strips away complexity to reveal the core mechanics.



1. The Essence of Autograd: The Engine of Learning

Autograd is the core machinery that enables neural networks to optimize their parameters. It automatically computes the gradients (derivatives) of a function. In neural networks, this function is typically the loss function, which quantifies how "wrong" a network's predictions are.

What are Gradients and Why are They Crucial?

Imagine you're trying to find the lowest point in a bumpy landscape while blindfolded. If someone tells you which way is "downhill" (the direction of the steepest descent), you can take a small step in that direction. In machine learning, the loss function is our "landscape," and the gradients tell us the "downhill" direction for each parameter. By repeatedly taking small steps in the direction indicated by the gradients (via optimization algorithms like Gradient Descent), the network learns to minimize its loss and make more accurate predictions.
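
To make that concrete, here is a minimal sketch of gradient descent on a toy one-parameter "landscape". The loss function and learning rate are illustrative choices, not tied to any particular framework:

Python
# Toy landscape: loss(w) = (w - 3)^2 has its lowest point at w = 3.
# Its gradient, d(loss)/dw = 2 * (w - 3), points uphill, so we step the other way.
w = 0.0             # starting parameter value
learning_rate = 0.1

for step in range(25):
    grad = 2 * (w - 3)         # gradient of the loss at the current w
    w -= learning_rate * grad  # small step in the downhill direction

print(w)  # roughly 2.99, very close to the minimum at 3.0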

How Autograd Works: The Computational Graph

Autograd achieves this by building a computational graph during the forward pass of a neural network. This graph is essentially a blueprint of all mathematical operations performed to transform the input data into the final output (and subsequently, the loss).

  • Nodes: Each node in the graph represents a value or an operation (e.g., addition, multiplication, ReLU activation).
  • Edges: Edges show the flow of data between operations.
  • Dependencies: The graph tracks the dependencies between operations, allowing it to understand how changes in one part of the network affect others.

During the backward pass (Backpropagation), Autograd traverses this graph in reverse order, applying the chain rule of calculus to compute the gradients for each node. This process efficiently propagates the error signal back through the network, telling each parameter how much it needs to change to reduce the overall loss.
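
As a tiny worked example (plain arithmetic, independent of any library), take L = (a * b) + c. Working backward from L and applying the chain rule at each node gives:

Python
# Forward pass for L = (a * b) + c with a = 2, b = -3, c = 10.
a, b, c = 2.0, -3.0, 10.0
d = a * b            # intermediate node: d = -6.0
L = d + c            # final output:      L =  4.0

# Backward pass (chain rule), starting from dL/dL = 1:
dL_dL = 1.0
dL_dd = 1.0 * dL_dL  # addition passes the gradient through unchanged
dL_dc = 1.0 * dL_dL  # dL/dc = 1.0
dL_da = b * dL_dd    # multiplication: dL/da = b = -3.0
dL_db = a * dL_dd    # multiplication: dL/db = a =  2.0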

Key Benefits of Autograd:

  • Automatic Differentiation: This is the most significant advantage. It completely eliminates the tedious and error-prone process of manually deriving complex gradient equations, especially for neural networks with millions of parameters.
  • Dynamic Graphs: Many modern Autograd implementations (like PyTorch and Micrograd) support dynamic computational graphs. This means the graph is built on the fly as computations are performed, allowing for flexible and conditional network architectures that can change with each input (a short PyTorch example follows this list).
  • Efficiency: Autograd computes all necessary gradients in a single backward pass, making the training process highly efficient.
  • Reduced Errors: Automation inherently reduces human error in gradient computation.

2. Micrograd: A Simplified Window into Autograd

Micrograd is a brilliant, miniature implementation of the Autograd concept, developed by Andrej Karpathy. Its power lies in its simplicity, offering a clear, uncluttered view of how automatic differentiation and backpropagation function.

Why Micrograd is an Invaluable Educational Tool:

  • Minimalist Implementation: Micrograd boils down the complex concepts of neural network training to their bare essentials, fitting its core autograd engine into roughly a hundred lines of code. This makes it incredibly approachable.
  • Conceptual Clarity: By working with scalar values (single numbers) rather than complex multi-dimensional tensors (which full-fledged libraries handle), Micrograd simplifies the mental model of how gradients flow.
  • Intuition Building: Experimenting with Micrograd allows you to build a deeper intuition for how each mathematical operation contributes to the overall gradient and how parameters are adjusted during optimization. You literally see the grad attribute update as backpropagation occurs.
  • Bridge to Larger Frameworks: Understanding Micrograd provides a solid conceptual foundation that makes it easier to grasp how powerful libraries like PyTorch and TensorFlow work internally.

Key Concepts Illustrated in Micrograd:

  • Value Class: This is the fundamental building block. Each Value object encapsulates:
    • data: The actual numerical value.
    • grad: The calculated gradient for this value (how much a change in data would affect the final output).
    • _children (stored on the object as _prev): The set of Value objects that were inputs to the operation that created the current Value. These form the backward links in the computational graph.
    • _op: A string representing the operation that produced this Value (e.g., '+', '*', 'ReLU').
    • _backward: A tiny function specific to this Value that knows how to compute its local gradient contribution and propagate it to its children (this is the chain rule in action).
  • Dynamic Computational Graph: As Value objects are combined through operations (like addition, multiplication), they implicitly create a graph. Each Value keeps references to the Values it was built from (the _children passed to the constructor, stored as _prev), allowing the backward function to traverse this graph.
  • Forward Pass: When you perform operations on Value objects, their data attributes are computed, moving from inputs to output.
  • Backward Pass (Backpropagation):
    • It starts by initializing the gradient of the final loss Value to 1.0.
    • It then performs a topological sort of the computational graph. This ensures that when traversing backward, every node that depends on a given Value is processed before that Value itself, so a node's gradient is fully accumulated before it is propagated further back.
    • It iterates through the sorted nodes in reverse order. For each node, it calls its stored _backward() function. This function uses the chain rule to update the grad attribute of its immediate parents, effectively passing the gradient backward.

The Essence of Training (Not Fully Implemented in Micrograd's Core):

Micrograd elegantly illustrates the gradient calculation part of training. These gradients are then used by an optimization algorithm (like Stochastic Gradient Descent - SGD, Adam, etc.) to update the network's parameters. This iterative dance of:

  1. Forward Pass: Calculate predictions and loss.
  2. Backward Pass: Compute gradients using Autograd.
  3. Parameter Update: Adjust parameters using an optimizer based on the gradients.

... is the fundamental loop of neural network training.
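
As a minimal sketch of that loop, here is a toy fit of a single weight using the Value class and backward() function from the code in section 3 below. The data point, learning rate, and number of steps are arbitrary illustrative choices:

Python
# Fit y = w * x to the single data point (x = 2, y = 6); the answer should be w = 3.
w = Value(1.0, label='w')   # the only trainable parameter
learning_rate = 0.05

for step in range(50):
    # 1. Forward pass: prediction and squared-error loss.
    pred = w * 2.0
    loss = (pred - 6.0) ** 2

    # 2. Backward pass: reset the old gradient, then backpropagate from the loss.
    w.grad = 0.0
    backward(loss)

    # 3. Parameter update: plain gradient descent on the single weight.
    w.data -= learning_rate * w.grad

print(w.data)  # very close to 3.0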


3. Understanding the Micrograd Python Code

Let's dissect the provided Python code for Micrograd, piece by piece:

Python
class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0 # Stores the gradient of this value with respect to the final output (loss)
        self._backward = lambda: None # Placeholder for the function that computes this node's gradient contributions
        self._prev = set(_children) # Set of input Value objects that created this one (for building the graph)
        self._op = _op # String representing the operation that produced this Value (e.g., '+', '*')
        self.label = label # Optional label for debugging/visualization

    def __repr__(self):
        return f"Value(data={self.data}, label='{self.label}')"

    # --- Forward Operations (Creating New Values and Defining their _backward rules) ---

    def __add__(self, other):
        # Ensure 'other' is a Value object
        other = other if isinstance(other, Value) else Value(other)
        # Create a new Value object for the result of the addition
        out = Value(self.data + other.data, (self, other), '+')

        # Define the local backward rule for addition:
        # The gradient of 'out' (out.grad) flows equally to 'self' and 'other'
        # (chain rule: d(out)/d(self) = 1, d(out)/d(other) = 1)
        def _backward():
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward # Attach the specific backward function to the 'out' Value

        return out

    def __mul__(self, other):
        # Ensure 'other' is a Value object
        other = other if isinstance(other, Value) else Value(other)
        # Create a new Value object for the result of the multiplication
        out = Value(self.data * other.data, (self, other), '*')

        # Define the local backward rule for multiplication:
        # (chain rule: d(out)/d(self) = other.data, d(out)/d(other) = self.data)
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward

        return out

    def __pow__(self, other):
        # Only supports scalar powers (int or float) for simplicity
        assert isinstance(other, (int, float)), "only supporting int/float powers for now"
        # Create a new Value object for the result of exponentiation
        out = Value(self.data**other, (self,), f'**{other}')

        # Define the local backward rule for exponentiation (d(a^n)/da = n * a^(n-1))
        def _backward():
            self.grad += other * (self.data**(other - 1)) * out.grad
        out._backward = _backward

        return out

    # --- Other Convenient Operations (built on top of basic add/mul/pow) ---

    def __neg__(self): # -self
        return self * -1 # Implemented as self * Value(-1)

    def __radd__(self, other): # other + self (handles cases like 5 + Value(x))
        return self + other

    def __sub__(self, other): # self - other
        return self + (-other) # Implemented as self + Value(-1) * other

    def __rsub__(self, other): # other - self
        return other + (-self)

    def __rmul__(self, other): # other * self
        return self * other

    def __truediv__(self, other): # self / other
        return self * other**-1 # Implemented as self * (other raised to the power of -1)

    def __rtruediv__(self, other): # other / self
        return other * self**-1

    def relu(self):
        # ReLU activation function: max(0, x)
        out = Value(0 if self.data < 0 else self.data, (self,), 'ReLU')

        # Define the local backward rule for ReLU:
        # Gradient is 1 if input was positive, 0 otherwise
        def _backward():
            self.grad += (out.data > 0) * out.grad # (out.data > 0) evaluates to 1 or 0
        out._backward = _backward

        return out

# --- Graph Traversal and Backward Propagation Functions ---

def build_topo(v):
    # Builds a topological sort of the computational graph ending at 'v'
    # This ensures nodes are processed in the correct order for backward pass
    topo = []
    visited = set()

    def build_topo_helper(current_v):
        if current_v not in visited:
            visited.add(current_v)
            for child in current_v._prev: # Recursively visit children (inputs to current_v)
                build_topo_helper(child)
            topo.append(current_v) # Add current_v to the list after all its children are added
    build_topo_helper(v) # Start recursive traversal from the final output node 'v'
    return topo

def backward(l):
    # Main backward propagation function
    topo = build_topo(l) # Get the topological order of nodes from the final loss 'l'
    l.grad = 1.0 # Initialize the gradient of the final loss with respect to itself as 1.0
    for node in reversed(topo): # Iterate through nodes in reverse topological order (from output to inputs)
        node._backward() # Call the specific _backward function for each node
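
Putting this code to work on a small example (the numbers are arbitrary):

Python
# Build a tiny expression: L = relu(a * b + c)
a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
d = a * b + c        # d.data = 4.0
L = d.relu()         # L.data = 4.0 (positive, so ReLU passes it through)

backward(L)          # run backpropagation starting from L

print(a.grad)  # dL/da = b.data = -3.0
print(b.grad)  # dL/db = a.data =  2.0
print(c.grad)  # dL/dc = 1.0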

Key Points from the Code Explanation:

  • Value Class as a Node: Each Value object is essentially a node in the computational graph. It holds its data, its computed gradient, and references to its inputs (_prev).
  • Dynamic Graph Building: The __add__, __mul__, __pow__, and relu methods don't just perform computations; they also construct the graph by linking Value objects together (_children / _prev) and crucially, attaching a specific _backward function to the new Value they create.
  • The _backward Closure: The _backward function defined within each operation method is a closure. It "remembers" the self and other Value objects (the inputs to the operation) and uses their data and the out.grad (the gradient coming from higher up the graph) to calculate its own contribution to the gradients of its inputs. This is the implementation of the chain rule. Because a Value can feed into more than one operation, each contribution is accumulated with += rather than assigned (see the small example after this list).
  • Topological Sort: The build_topo function is essential. It ensures that when we traverse the graph backward, we always calculate a node's gradient after all the nodes that depend on it (its "consumers" higher up in the graph) have had their gradients computed. This ensures correct gradient propagation.
  • backward(l): This is the entry point for backpropagation. It sets the gradient of the final loss node l to 1.0 (because the gradient of a value with respect to itself is 1). Then, it systematically calls each _backward function in the correct reverse topological order.
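
Here is that accumulation behaviour in a minimal example (arbitrary number), where a single Value supplies both operands of a multiplication:

Python
a = Value(3.0, label='a')
b = a * a        # 'a' appears as both inputs of the multiplication
backward(b)

# The multiply's _backward adds a contribution for each operand:
# a.grad += a.data * b.grad happens twice, giving d(a*a)/da = 2*a = 6.0.
print(a.grad)  # 6.0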

4. Beyond Micrograd: The Real World of Deep Learning

While Micrograd beautifully illustrates the core, building and training state-of-the-art neural networks involves many additional factors and complexities that production-level frameworks (like PyTorch, TensorFlow, JAX) address:

  • Tensor-Based Operations: Full-fledged frameworks operate on tensors (multi-dimensional arrays) which are crucial for efficient processing of large datasets and for representing data like images, audio, and large text sequences. Micrograd is scalar-valued for simplicity.
  • Complex Architectures:
    • Architecture Design: Choosing the right type of network (e.g., Multi-layer Perceptrons (MLPs), Convolutional Neural Networks (CNNs) for images, Recurrent Neural Networks (RNNs) for sequences, Transformers for language) for the task at hand is crucial.
    • Intricate Structures: Modern networks like Transformers have intricate structures with billions of parameters, requiring specialized layers and optimization techniques.
  • Loss Functions: Selecting an appropriate loss function (e.g., Cross-Entropy Loss for classification, Mean Squared Error (MSE) for regression) is critical as it directly guides the optimization process.
  • Optimization Algorithms: Beyond basic gradient descent, advanced optimizers like Adam, RMSprop, AdamW, and SGD with Momentum can converge faster, achieve better results, and navigate complex loss landscapes more effectively (the sketch after this list shows the momentum idea on a toy loss).
  • Specialized Operations: Libraries provide highly optimized implementations of operations like convolutions, pooling, attention mechanisms, and batch normalization, which are essential for various deep learning tasks.
  • Hardware Acceleration: GPUs (Graphics Processing Units) and TPUs (Tensor Processing Units) dramatically speed up computations, making large-scale training (which can take days or weeks) feasible. These frameworks leverage specialized CUDA kernels (for NVIDIA GPUs) or custom hardware interfaces.
  • Data Handling: Efficient data loading, preprocessing, augmentation pipelines, and memory management are crucial for effective training, especially with large datasets that don't fit into memory. This includes concepts like data loaders, datasets, and iterators.
  • Hyperparameter Tuning: Finding the right values for hyperparameters (learning rate, batch size, number of layers, hidden units, regularization strengths, etc.) significantly impacts model performance and requires systematic search techniques (e.g., Grid Search, Random Search, Bayesian Optimization) or AutoML.
  • Regularization Techniques: Techniques like Dropout, L1/L2 regularization (weight decay), Early Stopping, and Batch Normalization help prevent overfitting and improve the generalization ability of models to unseen data.
  • Model Saving & Loading: Mechanisms to efficiently save and load trained models (weights and architecture) for deployment or further training.
  • Distributed Training: For truly massive models and datasets, frameworks support distributed training across multiple GPUs and machines.
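
To illustrate the optimizer point, here is a hedged sketch of SGD with momentum on the same toy loss used earlier, loss(w) = (w - 3)^2, written with plain Python floats; the hyperparameter values are arbitrary:

Python
# SGD with momentum on loss(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = 0.0
velocity = 0.0
learning_rate = 0.1
momentum = 0.9

for step in range(100):
    grad = 2 * (w - 3)
    velocity = momentum * velocity - learning_rate * grad  # remember the running direction
    w += velocity                                          # step along the smoothed direction

print(w)  # close to 3.0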


Alternative Titles:

  • Micrograd: A Backpropagation Primer for Neural Network Enthusiasts
  • The Heart of Deep Learning: 94 Lines of Code and the Power of Autograd
  • From Scalar to Superintelligence: The Journey of a Neural Network, Starting with Micrograd
  • Micrograd: Peeling Back the Layers of Neural Network Training
  • Simplicity Meets Sophistication: Understanding Autograd with Micrograd

