Category: PyTorch

• PyTorch for Deep Learning & Machine Learning – Study Notes

    PyTorch for Deep Learning FAQ

    1. What are tensors and how are they represented in PyTorch?

    Tensors are the fundamental data structures in PyTorch, used to represent numerical data. They can be thought of as multi-dimensional arrays. In PyTorch, tensors are created using the torch.tensor() function and can be classified as:

    • Scalar: A single number (zero dimensions)
    • Vector: A one-dimensional array (one dimension)
    • Matrix: A two-dimensional array (two dimensions)
    • Tensor: A general term for arrays with three or more dimensions

You can identify the number of dimensions by counting the square brackets on one side of the tensor literal: a tensor that opens with [[ has two dimensions.

    2. How do you determine the shape and dimensions of a tensor?

• Dimensions: Determined by counting the square brackets on one side of the tensor literal (e.g., [[...]] indicates two dimensions). Accessed using tensor.ndim.
    • Shape: Represents the number of elements in each dimension. Accessed using tensor.shape or tensor.size().

    For example, a tensor defined as [[1, 2], [3, 4]] has two dimensions and a shape of (2, 2), indicating two rows and two columns.
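A minimal sketch of checking these attributes (the values are chosen purely for illustration):

import torch

matrix = torch.tensor([[1, 2], [3, 4]])
print(matrix.ndim)   # 2 -> two dimensions
print(matrix.shape)  # torch.Size([2, 2]) -> two rows, two columns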

    3. What are tensor data types and how do you change them?

Tensors have data types that specify the kind of numerical values they hold (e.g., float32, int64). The default floating-point data type in PyTorch is float32 (integer tensors default to int64). You can change the data type of a tensor using the .type() method:

float_32_tensor = torch.tensor([1.0, 2.0, 3.0])  # dtype defaults to torch.float32

float_16_tensor = float_32_tensor.type(torch.float16)  # now torch.float16

    4. What does “requires_grad” mean in PyTorch?

    requires_grad is a parameter used when creating tensors. Setting it to True indicates that you want to track gradients for this tensor during training. This is essential for PyTorch to calculate derivatives and update model weights during backpropagation.
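A minimal sketch of gradient tracking in action (the tensor values are arbitrary):

import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)  # track operations on x
y = (x ** 2).sum()                                # forward computation builds a graph
y.backward()                                      # backpropagation computes dy/dx
print(x.grad)                                     # tensor([4., 6.]) -> derivative of x**2 is 2*x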

    5. What is matrix multiplication in PyTorch and what are the rules?

    Matrix multiplication, a key operation in deep learning, is performed using the @ operator or torch.matmul() function. Two important rules apply:

    • Inner dimensions must match: The number of columns in the first matrix must equal the number of rows in the second matrix.
    • Resulting matrix shape: The resulting matrix will have the number of rows from the first matrix and the number of columns from the second matrix.
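A minimal sketch of both rules, using arbitrary shapes:

import torch

A = torch.rand(2, 3)  # inner dimension is 3...
B = torch.rand(3, 4)  # ...and must match the 3 here
C = A @ B             # equivalent to torch.matmul(A, B)
print(C.shape)        # torch.Size([2, 4]) -> rows of A, columns of B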

    6. What are common tensor operations for aggregation?

    PyTorch provides several functions to aggregate tensor values, such as:

    • torch.min(): Finds the minimum value.
    • torch.max(): Finds the maximum value.
    • torch.mean(): Calculates the average.
    • torch.sum(): Calculates the sum.

    These functions can be applied to the entire tensor or along specific dimensions.
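A minimal sketch, aggregating a small float tensor both as a whole and along each dimension:

import torch

t = torch.arange(1., 7.).reshape(2, 3)  # tensor([[1., 2., 3.], [4., 5., 6.]])
print(t.min(), t.max(), t.sum())        # aggregate over the whole tensor
print(t.mean(dim=0))                    # column means: tensor([2.5000, 3.5000, 4.5000])
print(t.mean(dim=1))                    # row means: tensor([2., 5.])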

    7. What are the differences between reshape, view, and stack?

    • reshape: Changes the shape of a tensor while maintaining the same data. The new shape must be compatible with the original number of elements.
    • view: Creates a new view of the same underlying data as the original tensor, with a different shape. Changes to the view affect the original tensor.
    • stack: Concatenates tensors along a new dimension, creating a higher-dimensional tensor.
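A minimal sketch contrasting the three (the shared-memory behavior of view is the key difference):

import torch

x = torch.arange(6)
reshaped = x.reshape(2, 3)     # same 6 elements, new shape
viewed = x.view(3, 2)          # shares memory with x
viewed[0, 0] = 99              # changing the view...
print(x[0])                    # ...changes x too: tensor(99)
stacked = torch.stack([x, x])  # new leading dimension
print(stacked.shape)           # torch.Size([2, 6])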

    8. What are the steps involved in a typical PyTorch training loop?

    1. Forward Pass: Input data is passed through the model to get predictions.
    2. Calculate Loss: The difference between predictions and actual labels is calculated using a loss function.
    3. Zero Gradients: Gradients from previous iterations are reset to zero.
    4. Backpropagation: Gradients are calculated for all parameters with requires_grad=True.
    5. Optimize Step: The optimizer updates model weights based on calculated gradients.
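A minimal sketch of these five steps, using a toy linear model and random data purely for illustration:

import torch
from torch import nn

model = nn.Linear(3, 1)                                   # toy model
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
X, y = torch.rand(8, 3), torch.rand(8, 1)                 # toy batch

for epoch in range(10):
    y_pred = model(X)              # 1. forward pass
    loss = loss_fn(y_pred, y)      # 2. calculate loss
    optimizer.zero_grad()          # 3. zero gradients
    loss.backward()                # 4. backpropagation
    optimizer.step()               # 5. optimizer step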

    Deep Learning and Machine Learning with PyTorch

    Short-Answer Quiz

    Instructions: Answer the following questions in 2-3 sentences each.

    1. What are the key differences between a scalar, a vector, a matrix, and a tensor in PyTorch?
    2. How can you determine the number of dimensions of a tensor in PyTorch?
    3. Explain the concept of “shape” in relation to PyTorch tensors.
    4. Describe how to create a PyTorch tensor filled with ones and specify its data type.
    5. What is the purpose of the torch.zeros_like() function?
    6. How do you convert a PyTorch tensor from one data type to another?
    7. Explain the importance of ensuring tensors are on the same device and have compatible data types for operations.
    8. What are tensor attributes, and provide two examples?
    9. What is tensor broadcasting, and what are the two key rules for its operation?
    10. Define tensor aggregation and provide two examples of aggregation functions in PyTorch.

    Short-Answer Quiz Answer Key

    1. In PyTorch, a scalar is a single number, a vector is an array of numbers with direction, a matrix is a 2-dimensional array of numbers, and a tensor is a multi-dimensional array that encompasses scalars, vectors, and matrices. All of these are represented as torch.Tensor objects in PyTorch.
    2. The number of dimensions of a tensor can be determined using the tensor.ndim attribute, which returns the number of dimensions or axes present in the tensor.
    3. The shape of a tensor refers to the number of elements along each dimension of the tensor. It is represented as a tuple, where each element in the tuple corresponds to the size of each dimension.
    4. To create a PyTorch tensor filled with ones, use torch.ones(size) where size is a tuple specifying the desired dimensions. To specify the data type, use the dtype parameter, for example, torch.ones(size, dtype=torch.float64).
    5. The torch.zeros_like() function creates a new tensor filled with zeros, having the same shape and data type as the input tensor. It is useful for quickly creating a tensor with the same structure but with zero values.
    6. To convert a PyTorch tensor from one data type to another, use the .type() method, specifying the desired data type as an argument. For example, to convert a tensor to float16: tensor = tensor.type(torch.float16).
    7. PyTorch operations require tensors to be on the same device (CPU or GPU) and have compatible data types for successful computation. Performing operations on tensors with mismatched devices or incompatible data types will result in errors.
    8. Tensor attributes provide information about the tensor’s properties. Two examples are:
    • dtype: Specifies the data type of the tensor elements.
    • shape: Represents the dimensionality of the tensor as a tuple.
9. Tensor broadcasting allows operations between tensors with different shapes, automatically expanding the smaller tensor to match the larger one under certain conditions. The two key rules for broadcasting are:
• Comparing shapes from the trailing (rightmost) dimension, each pair of sizes must either be equal or one of them must be 1.
• The resulting tensor takes the larger size along each compared dimension.
10. Tensor aggregation involves reducing the elements of a tensor to a single value using specific functions. Two examples are:
• torch.min(): Finds the minimum value in a tensor.
• torch.mean(): Calculates the average value of the elements in a tensor.
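To make the broadcasting rules in answer 9 concrete, here is a minimal sketch with arbitrary shapes:

import torch

a = torch.ones(3, 4)   # shape (3, 4)
b = torch.ones(4)      # shape (4,) is treated as (1, 4) and expanded
print((a + b).shape)   # torch.Size([3, 4])

c = torch.ones(3, 1)   # the size-1 dimension stretches to match
print((a * c).shape)   # torch.Size([3, 4])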

    Essay Questions

    1. Discuss the concept of dimensionality in PyTorch tensors. Explain how to create tensors with different dimensions and demonstrate how to access specific elements within a tensor. Provide examples and illustrate the relationship between dimensions, shape, and indexing.
    2. Explain the importance of data types in PyTorch. Describe different data types available for tensors and discuss the implications of choosing specific data types for tensor operations. Provide examples of data type conversion and highlight potential issues arising from data type mismatches.
    3. Compare and contrast the torch.reshape(), torch.view(), and torch.permute() functions. Explain their functionalities, use cases, and any potential limitations or considerations. Provide code examples to illustrate their usage.
    4. Discuss the purpose and functionality of the PyTorch nn.Module class. Explain how to create custom neural network modules by subclassing nn.Module. Provide a code example demonstrating the creation of a simple neural network module with at least two layers.
    5. Describe the typical workflow for training a neural network model in PyTorch. Explain the steps involved, including data loading, model creation, loss function definition, optimizer selection, training loop implementation, and model evaluation. Provide a code example outlining the essential components of the training process.

    Glossary of Key Terms

    Tensor: A multi-dimensional array, the fundamental data structure in PyTorch.

    Dimensionality: The number of axes or dimensions present in a tensor.

    Shape: A tuple representing the size of each dimension in a tensor.

    Data Type: The type of values stored in a tensor (e.g., float32, int64).

    Tensor Broadcasting: Automatically expanding the dimensions of tensors during operations to enable compatibility.

    Tensor Aggregation: Reducing the elements of a tensor to a single value using functions like min, max, or mean.

    nn.Module: The base class for building neural network modules in PyTorch.

    Forward Pass: The process of passing input data through a neural network to obtain predictions.

    Loss Function: A function that measures the difference between predicted and actual values during training.

    Optimizer: An algorithm that adjusts the model’s parameters to minimize the loss function.

    Training Loop: Iteratively performing forward passes, loss calculation, and parameter updates to train a model.

    Device: The hardware used for computation (CPU or GPU).

    Data Loader: An iterable that efficiently loads batches of data for training or evaluation.

    Exploring Deep Learning with PyTorch

    Fundamentals of Tensors

    1. Understanding Tensors

    • Introduction to tensors, the fundamental data structure in PyTorch.
    • Differentiating between scalars, vectors, matrices, and tensors.
    • Exploring tensor attributes: dimensions, shape, and indexing.

    2. Manipulating Tensors

    • Creating tensors with varying data types, devices, and gradient tracking.
    • Performing arithmetic operations on tensors and managing potential data type errors.
    • Reshaping tensors, understanding the concept of views, and employing stacking operations like torch.stack, torch.vstack, and torch.hstack.
    • Utilizing torch.squeeze to remove single dimensions and torch.unsqueeze to add them.
    • Practicing advanced indexing techniques on multi-dimensional tensors.

    3. Tensor Aggregation and Comparison

    • Exploring tensor aggregation with functions like torch.min, torch.max, and torch.mean.
    • Utilizing torch.argmin and torch.argmax to find the indices of minimum and maximum values.
    • Understanding element-wise tensor comparison and its role in machine learning tasks.

    Building Neural Networks

    4. Introduction to torch.nn

    • Introducing the torch.nn module, the cornerstone of neural network construction in PyTorch.
    • Exploring the concept of neural network layers and their role in transforming data.
    • Utilizing matplotlib for data visualization and understanding PyTorch version compatibility.

    5. Linear Regression with PyTorch

    • Implementing a simple linear regression model using PyTorch.
    • Generating synthetic data, splitting it into training and testing sets.
    • Defining a linear model with parameters, understanding gradient tracking with requires_grad.
    • Setting up a training loop, iterating through epochs, performing forward and backward passes, and optimizing model parameters.

    6. Non-Linear Regression with PyTorch

    • Transitioning from linear to non-linear regression.
    • Introducing non-linear activation functions like ReLU and Sigmoid.
    • Visualizing the impact of activation functions on data transformations.
    • Implementing custom ReLU and Sigmoid functions and comparing them with PyTorch’s built-in versions.

    Working with Datasets and Data Loaders

    7. Multi-Class Classification with PyTorch

    • Exploring multi-class classification using the make_blobs dataset from scikit-learn.
    • Setting hyperparameters for data creation, splitting data into training and testing sets.
    • Visualizing multi-class data with matplotlib and understanding the relationship between features and labels.
    • Converting NumPy arrays to PyTorch tensors, managing data type consistency between NumPy and PyTorch.

    8. Building a Multi-Class Classification Model

    • Constructing a multi-class classification model using PyTorch.
    • Defining a model class, utilizing linear layers and activation functions.
    • Implementing the forward pass, calculating logits and probabilities.
    • Setting up a training loop, calculating loss, performing backpropagation, and optimizing model parameters.

    9. Model Evaluation and Prediction

    • Evaluating the trained multi-class classification model.
    • Making predictions using the model and converting probabilities to class labels.
    • Visualizing model predictions and comparing them to true labels.

    10. Introduction to Data Loaders

    • Understanding the importance of data loaders in PyTorch for efficient data handling.
    • Implementing data loaders using torch.utils.data.DataLoader for both training and testing data.
    • Exploring data loader attributes and understanding their role in data batching and shuffling.

    11. Building a Convolutional Neural Network (CNN)

    • Introduction to CNNs, a specialized architecture for image and sequence data.
    • Implementing a CNN using PyTorch’s nn.Conv2d layer, understanding concepts like kernels, strides, and padding.
    • Flattening convolutional outputs using nn.Flatten and connecting them to fully connected layers.
    • Defining a CNN model class, implementing the forward pass, and understanding the flow of data through the network.

    12. Training and Evaluating a CNN

    • Setting up a training loop for the CNN model, utilizing device-agnostic code for CPU and GPU compatibility.
    • Implementing helper functions for training and evaluation, calculating loss, accuracy, and training time.
    • Visualizing training progress, tracking loss and accuracy over epochs.

    13. Transfer Learning with Pre-trained Models

    • Exploring the concept of transfer learning, leveraging pre-trained models for faster training and improved performance.
    • Introducing torchvision, a library for computer vision tasks, and understanding its dataset and model functionalities.
    • Implementing data transformations using torchvision.transforms for data augmentation and pre-processing.

    14. Custom Datasets and Data Augmentation

    • Creating custom datasets using torch.utils.data.Dataset for managing image data.
    • Implementing data transformations for resizing, converting to tensors, and normalizing images.
    • Visualizing data transformations and understanding their impact on image data.
    • Implementing data augmentation techniques to increase data variability and improve model robustness.

    15. Advanced CNN Architectures and Optimization

    • Exploring advanced CNN architectures, understanding concepts like convolutional blocks, residual connections, and pooling layers.
    • Implementing a more complex CNN model using convolutional blocks and exploring its performance.
    • Optimizing the training process, introducing learning rate scheduling and momentum-based optimizers.


    Briefing Doc: Deep Dive into PyTorch for Deep Learning

    This briefing document summarizes key themes and concepts extracted from excerpts of the “748-PyTorch for Deep Learning & Machine Learning – Full Course.pdf” focusing on PyTorch fundamentals, tensor manipulation, model building, and training.

    Core Themes:

    1. Tensors: The Heart of PyTorch:
    • Understanding Tensors:
    • Tensors are multi-dimensional arrays representing numerical data in PyTorch.
    • Understanding dimensions, shapes, and data types of tensors is crucial.
    • Scalar, Vector, Matrix, and Tensor are different names for tensors with varying dimensions.
    • “Dimension is like the number of square brackets… the shape of the vector is two. So we have two by one elements. So that means a total of two elements.”
    • Manipulating Tensors:
    • Reshaping, viewing, stacking, squeezing, and unsqueezing tensors are essential for preparing data.
    • Indexing and slicing allow access to specific elements within a tensor.
    • “Reshape has to be compatible with the original dimensions… view of a tensor shares the same memory as the original input.”
    • Tensor Operations:
    • PyTorch provides various operations for manipulating tensors, including arithmetic, aggregation, and matrix multiplication.
    • Understanding broadcasting rules is vital for performing element-wise operations on tensors of different shapes.
    • “The min of this tensor would be 27. So you’re turning it from nine elements to one element, hence aggregation.”
2. Building Neural Networks with PyTorch:
    • torch.nn Module:
    • This module provides building blocks for constructing neural networks, including layers, activation functions, and loss functions.
    • nn.Module is the base class for defining custom models.
    • “nn is the building block layer for neural networks. And within nn, so nn stands for neural network, is module.”
    • Model Construction:
    • Defining a model involves creating layers and arranging them in a specific order.
    • nn.Sequential allows stacking layers in a sequential manner.
    • Custom models can be built by subclassing nn.Module and defining the forward method.
    • “Can you see what’s going on here? So as you might have guessed, sequential, it implements most of this code for us”
    • Parameters and Gradients:
    • Model parameters are tensors that store the model’s learned weights and biases.
    • Gradients are used during training to update these parameters.
    • requires_grad=True enables gradient tracking for a tensor.
    • “Requires grad optional. If the parameter requires gradient. Hmm. What does requires gradient mean? Well, let’s come back to that in a second.”
3. Training Neural Networks:
    • Training Loop:
    • The training loop iterates over the dataset multiple times (epochs) to optimize the model’s parameters.
    • Each iteration involves a forward pass (making predictions), calculating the loss, performing backpropagation, and updating parameters.
    • “Epochs, an epoch is one loop through the data…So epochs, we’re going to start with one. So one time through all of the data.”
    • Optimizers:
    • Optimizers, like Stochastic Gradient Descent (SGD), are used to update model parameters based on the calculated gradients.
    • “Optimise a zero grad, loss backwards, optimise a step, step, step.”
    • Loss Functions:
    • Loss functions measure the difference between the model’s predictions and the actual targets.
    • The choice of loss function depends on the specific task (e.g., mean squared error for regression, cross-entropy for classification).
4. Data Handling and Visualization:
    • Data Loading:
    • PyTorch provides DataLoader for efficiently iterating over datasets in batches.
    • “DataLoader, this creates a python iterable over a data set.”
    • Data Transformations:
    • The torchvision.transforms module offers various transformations for preprocessing images, such as converting to tensors, resizing, and normalization.
    • Visualization:
    • matplotlib is a commonly used library for visualizing data and model outputs.
    • Visualizing data and model predictions is crucial for understanding the learning process and debugging potential issues.
5. Device Agnostic Code:
    • PyTorch allows running code on different devices (CPU or GPU).
    • Writing device agnostic code ensures flexibility and portability.
    • “Device agnostic code for the model and for the data.”

    Important Facts:

• PyTorch’s default floating-point tensor data type is torch.float32.
    • CUDA (Compute Unified Device Architecture) enables utilizing GPUs for accelerated computations.
    • torch.no_grad() disables gradient tracking, often used during inference or evaluation.
    • torch.argmax finds the index of the maximum value in a tensor.

    Next Steps:

    • Explore different model architectures (CNNs, RNNs, etc.).
    • Implement various optimizers and loss functions.
    • Work with more complex datasets and tasks.
    • Experiment with hyperparameter tuning.
    • Dive deeper into PyTorch’s documentation and tutorials.

    Traditional Programming vs. Machine Learning

    Traditional programming involves providing the computer with data and explicit rules to generate output. Machine learning, on the other hand, involves providing the computer with data and desired outputs, allowing the computer to learn the rules for itself. [1, 2]

    Here’s a breakdown of the differences, illustrated with the example of creating a program for cooking a Sicilian grandmother’s roast chicken dish:

    Traditional Programming

    • Input: Vegetables, chicken
    • Rules: Cut vegetables, season chicken, preheat oven, cook chicken for 30 minutes, add vegetables (rules are explicitly programmed)
    • Output: Roast chicken dish

    Machine Learning

    • Input: A collection of various ingredients (vegetables, chicken)
    • Desired output: A collection of images showcasing the finished roast chicken dish
    • Process: The machine learning algorithm analyzes the inputs and desired outputs to figure out the rules (patterns) for transforming the ingredients into the desired dish. [2-4]

    Key Differences in Approach

    • In traditional programming, humans define the rules, while in machine learning, the computer learns the rules. [2, 3]
    • Traditional programming requires a complete understanding of the problem and the steps to solve it. In contrast, machine learning can be used for complex problems where explicitly defining all the rules would be challenging. [5]

    Why Use Machine Learning?

    Machine learning excels in scenarios where:

    • Complex problems involve long lists of rules: For instance, teaching a computer to drive a car involves countless rules that would be difficult to program manually. [5]
    • Environments are constantly changing: A self-driving car needs to adapt to unpredictable situations on the road. Machine learning enables algorithms to learn and adjust based on new data. [6]
    • Large datasets offer valuable insights: Machine learning algorithms can uncover hidden patterns and relationships within massive datasets. [7]

    The sources also emphasize that machine learning isn’t always the solution. Google’s number one rule of machine learning advises that if a problem can be solved with a simple rule-based system, that approach should be preferred. [8-10]

    Here is how PyTorch leverages tensors and neural networks for deep learning:

    • Tensors: Deep learning relies on numerical data representation. In PyTorch, this is done using tensors. Tensors are multi-dimensional arrays of numbers that can represent various data types, including images, audio, and text [1-3].
    • Neural Networks: Neural networks are a fundamental aspect of deep learning, consisting of interconnected layers that perform mathematical operations on tensors [2, 4-6]. PyTorch provides the building blocks for creating these networks through the torch.nn module [7, 8].
• GPU Acceleration: PyTorch leverages GPUs (Graphics Processing Units) to accelerate the computation of deep learning models [9]. Originally designed for video games, GPUs excel at number crunching and are now crucial for deep learning thanks to their parallel processing capabilities [9, 10]. PyTorch uses CUDA, a parallel computing platform, to interface with NVIDIA GPUs, allowing for faster computations [10, 11].
• Key Modules:
• torch.nn: Contains layers, loss functions, and other components needed for constructing computational graphs (neural networks) [8, 12].
    • torch.nn.Parameter: Defines learnable parameters for the model, often set by PyTorch layers [12].
    • torch.nn.Module: The base class for all neural network modules; models should subclass this and override the forward method [12].
    • torch.optim: Contains optimizers that help adjust model parameters during training through gradient descent [13].
    • torch.utils.data.Dataset: The base class for creating custom datasets [14].
    • torch.utils.data.DataLoader: Creates a Python iterable over a dataset, allowing for batched data loading [14-16].
Workflow:
1. Data Preparation: Involves loading, preprocessing, and transforming data into tensors [17, 18].
    2. Building a Model: Constructing a neural network by combining different layers from torch.nn [7, 19, 20].
    3. Loss Function: Choosing a suitable loss function to measure the difference between model predictions and the actual targets [21-24].
    4. Optimizer: Selecting an optimizer (e.g., SGD, Adam) to adjust the model’s parameters based on the calculated gradients [21, 22, 24-26].
    5. Training Loop: Implementing a training loop that iteratively feeds data through the model, calculates the loss, backpropagates the gradients, and updates the model’s parameters [22, 24, 27, 28].
    6. Evaluation: Evaluating the trained model on unseen data to assess its performance [24, 28].

    Overall, PyTorch uses tensors as the fundamental data structure and provides the necessary tools (modules, classes, and functions) to construct neural networks, optimize their parameters using gradient descent, and efficiently run deep learning models, often with GPU acceleration.

    Training, Evaluating, and Saving a Deep Learning Model Using PyTorch

    To train a deep learning model with PyTorch, you first need to prepare your data and turn it into tensors [1]. Tensors are the fundamental building blocks of deep learning and can represent almost any kind of data, such as images, videos, audio, or even DNA [2, 3]. Once your data is ready, you need to build or pick a pre-trained model to suit your problem [1, 4].

• PyTorch offers a variety of pre-built deep learning models through resources like Torch Hub and torchvision.models [5]. These models can be used as is or adjusted for a specific problem through transfer learning [5].
    • If you are building your model from scratch, PyTorch provides a flexible and powerful framework for building neural networks using various layers and modules [6].
    • The torch.nn module contains all the building blocks for computational graphs, another term for neural networks [7, 8].
    • PyTorch also offers layers for specific tasks, such as convolutional layers for image data, linear layers for simple calculations, and many more [9].
    • The torch.nn.Module serves as the base class for all neural network modules [8, 10]. When building a model from scratch, you should subclass nn.Module and override the forward method to define the computations that your model will perform [8, 11].
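A minimal sketch of this pattern (the layer sizes here are arbitrary, chosen only for illustration):

import torch
from torch import nn

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer_1 = nn.Linear(in_features=2, out_features=8)
        self.layer_2 = nn.Linear(in_features=8, out_features=1)

    def forward(self, x):
        # Defines the computation run on every forward pass
        return self.layer_2(torch.relu(self.layer_1(x)))

model = SimpleModel()
print(model(torch.rand(4, 2)).shape)  # torch.Size([4, 1])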

    After choosing or building a model, you need to select a loss function and an optimizer [1, 4].

    • The loss function measures how wrong your model’s predictions are compared to the ideal outputs [12].
    • The optimizer takes into account the loss of a model and adjusts the model’s parameters, such as weights and biases, to improve the loss function [13].
    • The specific loss function and optimizer you use will depend on the problem you are trying to solve [14].

    With your data, model, loss function, and optimizer in place, you can now build a training loop [1, 13].

    • The training loop iterates through your training data, making predictions, calculating the loss, and updating the model’s parameters to minimize the loss [15].
    • PyTorch implements the mathematical algorithms of back propagation and gradient descent behind the scenes, making the training process relatively straightforward [16, 17].
    • The loss.backward() function calculates the gradients of the loss function with respect to each parameter in the model [18]. The optimizer.step() function then uses those gradients to update the model’s parameters in the direction that minimizes the loss [18].
    • You can monitor the training process by printing out the loss and other metrics [19].

    In addition to a training loop, you also need a testing loop to evaluate your model’s performance on data it has not seen during training [13, 20]. The testing loop is similar to the training loop but does not update the model’s parameters. Instead, it calculates the loss and other metrics to evaluate how well the model generalizes to new data [21, 22].

    To save your trained model, PyTorch provides several methods, including torch.save, torch.load, and torch.nn.Module.load_state_dict [23-25].

    • The recommended way to save and load a PyTorch model is by saving and loading its state dictionary [26].
    • The state dictionary is a Python dictionary object that maps each layer in the model to its parameter tensor [27].
    • You can save the state dictionary using torch.save and load it back in using torch.load and the model’s load_state_dict method [28, 29].

    By following this general workflow, you can train, evaluate, and save deep learning models using PyTorch for a wide range of real-world applications.

    A Comprehensive Discussion of the PyTorch Workflow

    The PyTorch workflow outlines the steps involved in building, training, and deploying deep learning models using the PyTorch framework. The sources offer a detailed walkthrough of this workflow, emphasizing its application in various domains, including computer vision and custom datasets.

    1. Data Preparation and Loading

    The foundation of any machine learning project lies in data. Getting your data ready is the crucial first step in the PyTorch workflow [1-3]. This step involves:

    • Data Acquisition: Gathering the data relevant to your problem. This could involve downloading existing datasets or collecting your own.
    • Data Preprocessing: Cleaning and transforming the raw data into a format suitable for training a machine learning model. This often includes handling missing values, normalizing numerical features, and converting categorical variables into numerical representations.
    • Data Transformation into Tensors: Converting the preprocessed data into PyTorch tensors. Tensors are multi-dimensional arrays that serve as the fundamental data structure in PyTorch [4-6]. This step uses torch.tensor to create tensors from various data types.
• Dataset and DataLoader Creation:
• Organizing the data into PyTorch datasets using torch.utils.data.Dataset. This involves defining how to access individual samples and their corresponding labels [7, 8].
• Creating data loaders using torch.utils.data.DataLoader [7, 9-11]. Data loaders provide a Python iterable over the dataset, allowing you to efficiently iterate through the data in batches during training. They handle shuffling, batching, and other data loading operations (see the sketch below).
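A minimal sketch of this step, using TensorDataset to pair random features and labels (both are placeholders for real data):

import torch
from torch.utils.data import TensorDataset, DataLoader

features = torch.rand(100, 3)                      # placeholder samples
labels = torch.randint(0, 2, (100,))               # placeholder labels
dataset = TensorDataset(features, labels)          # pairs each sample with its label
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for X_batch, y_batch in loader:                    # iterate in shuffled batches
    print(X_batch.shape, y_batch.shape)            # torch.Size([32, 3]) torch.Size([32])
    break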

    2. Building or Picking a Pre-trained Model

    Once your data is ready, the next step is to build or pick a pre-trained model [1, 2]. This is a critical decision that will significantly impact your model’s performance.

• Pre-trained Models: PyTorch offers pre-built models through resources like Torch Hub and torchvision.models [12].
    • Benefits: Leveraging pre-trained models can save significant time and resources. These models have already learned useful features from large datasets, which can be adapted to your specific task through transfer learning [12, 13].
    • Transfer Learning: Involves fine-tuning a pre-trained model on your dataset, adapting its learned features to your problem. This is especially useful when working with limited data [12, 14].
• Building from Scratch:
• When Necessary: You might need to build a model from scratch if your problem is unique or if no suitable pre-trained models exist.
    • PyTorch Flexibility: PyTorch provides the tools to create diverse neural network architectures, including:
    • Multi-layer Perceptrons (MLPs): Composed of interconnected layers of neurons, often using torch.nn.Linear layers [15].
    • Convolutional Neural Networks (CNNs): Specifically designed for image data, utilizing convolutional layers (torch.nn.Conv2d) to extract spatial features [16-18].
    • Recurrent Neural Networks (RNNs): Suitable for sequential data, leveraging recurrent layers to process information over time.

    Key Considerations in Model Building:

    • Subclassing torch.nn.Module: PyTorch models typically subclass nn.Module and override the forward method to define the computational flow [19-23].
    • Understanding Layers: Familiarity with various PyTorch layers (available in torch.nn) is crucial for constructing effective models. Each layer performs specific mathematical operations that transform the data as it flows through the network [24-26].
• Model Inspection:
• print(model): Provides a basic overview of the model’s structure and parameters.
• model.parameters(): Allows you to access and inspect the model’s learnable parameters [27].
• torchinfo: This package offers a more programmatic way to obtain a detailed summary of your model, including the input and output shapes of each layer [28-30].
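For example, a short sketch of the torchinfo summary call, assuming the package is installed (pip install torchinfo) and using a toy model:

import torch
from torch import nn
from torchinfo import summary

model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 1))
summary(model, input_size=(32, 10))  # prints per-layer input/output shapes and parameter counts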

    3. Setting Up a Loss Function and Optimizer

    Training a deep learning model involves optimizing its parameters to minimize a loss function. Therefore, choosing the right loss function and optimizer is essential [31-33].

    • Loss Function: Measures the difference between the model’s predictions and the actual target values. The choice of loss function depends on the type of problem you are solving [34, 35]:
    • Regression: Mean Squared Error (MSE) or Mean Absolute Error (MAE) are common choices [36].
    • Binary Classification: Binary Cross Entropy (BCE) is often used [35-39]. PyTorch offers variations like torch.nn.BCELoss and torch.nn.BCEWithLogitsLoss. The latter combines a sigmoid layer with the BCE loss, often simplifying the code [38, 39].
    • Multi-Class Classification: Cross Entropy Loss is a standard choice [35-37].
    • Optimizer: Responsible for updating the model’s parameters based on the calculated gradients to minimize the loss function [31-33, 40]. Popular optimizers in PyTorch include:
    • Stochastic Gradient Descent (SGD): A foundational optimization algorithm [35, 36, 41, 42].
    • Adam: An adaptive optimization algorithm often offering faster convergence [35, 36, 42].

    PyTorch provides various loss functions in torch.nn and optimizers in torch.optim [7, 40, 43].
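A minimal sketch of this setup step; the model here is a placeholder, and the commented-out lines show the alternatives for other problem types:

import torch
from torch import nn

model = nn.Linear(4, 1)                                  # placeholder model
loss_fn = nn.BCEWithLogitsLoss()                         # binary classification: sigmoid + BCE in one call
# loss_fn = nn.CrossEntropyLoss()                        # multi-class classification
# loss_fn = nn.MSELoss()                                 # regression
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001)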

    4. Building a Training Loop

    The heart of the PyTorch workflow lies in the training loop [32, 44-46]. It’s where the model learns patterns in the data through repeated iterations of:

    • Forward Pass: Passing the input data through the model to generate predictions [47, 48].
    • Loss Calculation: Using the chosen loss function to measure the difference between the predictions and the actual target values [47, 48].
    • Back Propagation: Calculating the gradients of the loss with respect to each parameter in the model using loss.backward() [41, 47-49]. PyTorch handles this complex mathematical operation automatically.
    • Parameter Update: Updating the model’s parameters using the calculated gradients and the chosen optimizer (e.g., optimizer.step()) [41, 47, 49]. This step nudges the parameters in a direction that minimizes the loss.

    Key Aspects of a Training Loop:

    • Epochs: The number of times the training loop iterates through the entire training dataset [50].
    • Batches: Dividing the training data into smaller batches to improve computational efficiency and model generalization [10, 11, 51].
    • Monitoring Training Progress: Printing the loss and other metrics during training allows you to track how well the model is learning [50]. You can use techniques like progress bars (e.g., using the tqdm library) to visualize the training progress [52].

    5. Evaluation and Testing Loop

    After training, you need to evaluate your model’s performance on unseen data using a testing loop [46, 48, 53]. The testing loop is similar to the training loop, but it does not update the model’s parameters [48]. Its purpose is to assess how well the trained model generalizes to new data.

    Steps in a Testing Loop:

    • Setting Evaluation Mode: Switching the model to evaluation mode (model.eval()) deactivates certain layers like dropout, which are only needed during training [53, 54].
    • Inference Mode: Using PyTorch’s inference mode (torch.inference_mode()) disables gradient tracking and other computations unnecessary for inference, making the evaluation process faster [53-56].
    • Forward Pass: Making predictions on the test data by passing it through the model [57].
    • Loss and Metric Calculation: Calculating the loss and other relevant metrics (e.g., accuracy, precision, recall) to assess the model’s performance on the test data [53].
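A minimal sketch of these steps for a classification model; model, loss_fn, and test_loader are assumed to exist from the earlier steps of the workflow:

import torch

model.eval()                          # deactivate training-only layers such as dropout
test_loss, correct = 0.0, 0
with torch.inference_mode():          # disable gradient tracking for faster inference
    for X, y in test_loader:
        logits = model(X)             # forward pass only; no parameter updates
        test_loss += loss_fn(logits, y).item()
        correct += (logits.argmax(dim=1) == y).sum().item()
print(f"loss: {test_loss / len(test_loader):.4f}, accuracy: {correct / len(test_loader.dataset):.4f}")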

    6. Saving and Loading the Model

    Once you have a trained model that performs well, you need to save it for later use or deployment [58]. PyTorch offers different ways to save and load models, including saving the entire model or saving its state dictionary [59].

    • State Dictionary: The recommended way is to save the model’s state dictionary [59, 60], which is a Python dictionary containing the model’s parameters. This approach is more efficient and avoids saving unnecessary information.

    Saving and Loading using State Dictionary:

• Saving: torch.save(model.state_dict(), 'model_filename.pth')
• Loading:
1. Create an instance of the model: loaded_model = MyModel()
2. Load the state dictionary: loaded_model.load_state_dict(torch.load('model_filename.pth'))

    7. Improving the Model (Iterative Process)

    Building a successful deep learning model often involves an iterative process of experimentation and improvement [61-63]. After evaluating your initial model, you might need to adjust various aspects to enhance its performance. This includes:

    • Hyperparameter Tuning: Experimenting with different values for hyperparameters like learning rate, batch size, and model architecture [64].
    • Data Augmentation: Applying transformations to the training data (e.g., random cropping, flipping, rotations) to increase data diversity and improve model generalization [65].
    • Regularization Techniques: Using techniques like dropout or weight decay to prevent overfitting and improve model robustness.
    • Experiment Tracking: Utilizing tools like TensorBoard or Weights & Biases to track your experiments, log metrics, and visualize results [66]. This can help you gain insights into the training process and make informed decisions about model improvements.

    Additional Insights from the Sources:

    • Functionalization: As your models and training loops become more complex, it’s beneficial to functionalize your code to improve readability and maintainability [67]. The sources demonstrate this by creating functions for training and evaluation steps [68, 69].
    • Device Agnostic Code: PyTorch allows you to write code that can run on either a CPU or a GPU [70-73]. By using torch.device to determine the available device, you can make your code more flexible and efficient.
    • Debugging and Troubleshooting: The sources emphasize common debugging tips, such as printing shapes and values to check for errors and using the PyTorch documentation as a reference [9, 74-77].
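A minimal sketch of the device-agnostic pattern, using a toy model and batch:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"  # pick the GPU when available
model = torch.nn.Linear(2, 1).to(device)                 # move the model's parameters
X = torch.rand(8, 2).to(device)                          # data must live on the same device
print(model(X).device)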

    By following the PyTorch workflow and understanding the key steps involved, you can effectively build, train, evaluate, and deploy deep learning models for various applications. The sources provide valuable code examples and explanations to guide you through this process, enabling you to tackle real-world problems with PyTorch.

    A Comprehensive Discussion of Neural Networks

    Neural networks are a cornerstone of deep learning, a subfield of machine learning. They are computational models inspired by the structure and function of the human brain. The sources, while primarily focused on the PyTorch framework, offer valuable insights into the principles and applications of neural networks.

    1. What are Neural Networks?

    Neural networks are composed of interconnected nodes called neurons, organized in layers. These layers typically include:

    • Input Layer: Receives the initial data, representing features or variables.
    • Hidden Layers: Perform computations on the input data, transforming it through a series of mathematical operations. A network can have multiple hidden layers, increasing its capacity to learn complex patterns.
    • Output Layer: Produces the final output, such as predictions or classifications.

    The connections between neurons have associated weights that determine the strength of the signal transmitted between them. During training, the network adjusts these weights to learn the relationships between input and output data.

    2. The Power of Linear and Nonlinear Functions

    Neural networks leverage a combination of linear and nonlinear functions to approximate complex relationships in data.

    • Linear functions represent straight lines. While useful, they are limited in their ability to model nonlinear patterns.
    • Nonlinear functions introduce curves and bends, allowing the network to capture more intricate relationships in the data.

    The sources illustrate this concept by demonstrating how a simple linear model struggles to separate circularly arranged data points. However, introducing nonlinear activation functions like ReLU (Rectified Linear Unit) allows the model to capture the nonlinearity and successfully classify the data.

    3. Key Concepts and Terminology

    • Activation Functions: Nonlinear functions applied to the output of neurons, introducing nonlinearity into the network and enabling it to learn complex patterns. Common activation functions include sigmoid, ReLU, and tanh.
    • Layers: Building blocks of a neural network, each performing specific computations.
    • Linear Layers (torch.nn.Linear): Perform linear transformations on the input data using weights and biases.
    • Convolutional Layers (torch.nn.Conv2d): Specialized for image data, extracting features using convolutional kernels.
    • Pooling Layers: Reduce the spatial dimensions of feature maps, often used in CNNs.
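A minimal sketch showing how a nonlinear activation is interleaved between linear layers (the layer sizes are arbitrary):

import torch
from torch import nn

# A stack of linear layers alone can only model straight lines;
# inserting ReLU between them lets the network bend its decision boundary.
model = nn.Sequential(
    nn.Linear(2, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
)
print(model(torch.rand(4, 2)).shape)  # torch.Size([4, 1])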

    4. Architectures and Applications

    The specific arrangement of layers and their types defines the network’s architecture. Different architectures are suited to various tasks. The sources explore:

    • Multi-layer Perceptrons (MLPs): Basic neural networks with fully connected layers, often used for tabular data.
    • Convolutional Neural Networks (CNNs): Excellent at image recognition tasks, utilizing convolutional layers to extract spatial features.
    • Recurrent Neural Networks (RNNs): Designed for sequential data like text or time series, using recurrent connections to process information over time.

    5. Training Neural Networks

    Training a neural network involves adjusting its weights to minimize a loss function, which measures the difference between predicted and actual values. The sources outline the key steps of a training loop:

    1. Forward Pass: Input data flows through the network, generating predictions.
    2. Loss Calculation: The loss function quantifies the error between predictions and target values.
    3. Backpropagation: The algorithm calculates gradients of the loss with respect to each weight, indicating the direction and magnitude of weight adjustments needed to reduce the loss.
    4. Parameter Update: An optimizer (e.g., SGD or Adam) updates the weights based on the calculated gradients, moving them towards values that minimize the loss.

    6. PyTorch and Neural Network Implementation

    The sources demonstrate how PyTorch provides a flexible and powerful framework for building and training neural networks. Key features include:

    • torch.nn Module: Contains pre-built layers, activation functions, and other components for constructing neural networks.
    • Automatic Differentiation: PyTorch automatically calculates gradients during backpropagation, simplifying the training process.
    • GPU Acceleration: PyTorch allows you to leverage GPUs for faster training, especially beneficial for computationally intensive deep learning models.

    7. Beyond the Basics

    While the sources provide a solid foundation, the world of neural networks is vast and constantly evolving. Further exploration might involve:

    • Advanced Architectures: Researching more complex architectures like ResNet, Transformer networks, and Generative Adversarial Networks (GANs).
    • Transfer Learning: Utilizing pre-trained models to accelerate training and improve performance on tasks with limited data.
    • Deployment and Applications: Learning how to deploy trained models into real-world applications, from image recognition systems to natural language processing tools.

    By understanding the fundamental principles, architectures, and training processes, you can unlock the potential of neural networks to solve a wide range of problems across various domains. The sources offer a practical starting point for your journey into the world of deep learning.

    Training Machine Learning Models: A Deep Dive

    Building upon the foundation of neural networks, the sources provide a detailed exploration of the model training process, focusing on the practical aspects using PyTorch. Here’s an expanded discussion on the key concepts and steps involved:

    1. The Significance of the Training Loop

    The training loop lies at the heart of fitting a model to data, iteratively refining its parameters to learn the underlying patterns. This iterative process involves several key steps, often likened to a song with a specific sequence:

    1. Forward Pass: Input data, transformed into tensors, is passed through the model’s layers, generating predictions.
    2. Loss Calculation: The loss function quantifies the discrepancy between the model’s predictions and the actual target values, providing a measure of how “wrong” the model is.
3. Optimizer Zero Grad: Before backpropagation, the gradients stored on the model’s parameters are reset to zero (optimizer.zero_grad()) so that gradients from previous iterations do not accumulate.
    4. Loss Backwards: Backpropagation calculates the gradients of the loss with respect to each weight in the network, indicating how much each weight contributes to the error.
    5. Optimizer Step: The optimizer, using algorithms like Stochastic Gradient Descent (SGD) or Adam, adjusts the model’s weights based on the calculated gradients. These adjustments aim to nudge the weights in a direction that minimizes the loss.

    2. Choosing a Loss Function and Optimizer

    The sources emphasize the crucial role of selecting an appropriate loss function and optimizer tailored to the specific machine learning task:

    • Loss Function: Different tasks require different loss functions. For example, binary classification tasks often use binary cross-entropy loss, while multi-class classification tasks use cross-entropy loss. The loss function guides the model’s learning by quantifying its errors.
    • Optimizer: Optimizers like SGD and Adam employ various algorithms to update the model’s weights during training. Selecting the right optimizer can significantly impact the model’s convergence speed and performance.

    3. Training and Evaluation Modes

    PyTorch provides distinct training and evaluation modes for models, each with specific settings to optimize performance:

• Training Mode (model.train()): This mode activates training-specific components like dropout and batch normalization layers, which are essential for the learning process.
• Evaluation Mode (model.eval()): This mode deactivates those training-specific components so that the model’s behavior during testing reflects its true performance. Gradient tracking is disabled separately, using torch.no_grad() or torch.inference_mode().

    4. Monitoring Progress with Loss Curves

    The sources introduce the concept of loss curves as visual tools to track the model’s performance during training. Loss curves plot the loss value over epochs (passes through the entire dataset). Observing these curves helps identify potential issues like underfitting or overfitting:

    • Underfitting: Indicated by a high and relatively unchanging loss value for both training and validation data, suggesting the model is not effectively learning the patterns in the data.
    • Overfitting: Characterized by a low training loss but a high validation loss, implying the model has memorized the training data but struggles to generalize to unseen data.

    5. Improving Through Experimentation

    Model training often involves an iterative process of experimentation to improve performance. The sources suggest several strategies for improving a model’s ability to learn and generalize:

    Model-centric approaches:

    • Adding more layers: Increasing the depth of the network can enhance its capacity to learn complex patterns.
    • Adding more hidden units: Expanding the width of layers can provide more representational power.
    • Changing the activation function: Experimenting with different activation functions like ReLU or sigmoid can influence the model’s nonlinearity and learning behavior.

    Data-centric approaches:

    • Training for longer: Increasing the number of epochs allows the model more iterations to adjust its weights and potentially reach a lower loss.
    • Data Augmentation: Artificially expanding the training dataset by applying transformations like rotations, flips, and crops can help the model generalize better to unseen data.

    6. Saving and Loading Models

    PyTorch enables saving and loading trained models, crucial for deploying models or resuming training from a previous state. This process often involves saving the model’s state dictionary, containing the learned weights and biases:

    • Saving a model (torch.save): Preserves the model’s state dictionary for later use.
    • Loading a model (torch.load): Retrieves a saved model’s state dictionary to restore a previously trained model.

    7. Going Beyond the Basics

    The sources provide a comprehensive foundation for understanding and implementing model training using PyTorch. As you progress, further exploration might include:

    • Advanced Optimizers: Investigating optimizers beyond SGD and Adam, such as RMSprop and Adagrad, each with different advantages and characteristics.
    • Hyperparameter Tuning: Exploring techniques like grid search and random search to systematically find optimal hyperparameters for the model, loss function, and optimizer.
    • Monitoring with TensorBoard: Utilizing TensorBoard, a visualization tool, to track various metrics like loss, accuracy, and gradients during training, providing insights into the learning process.

    By grasping the core principles of the training loop, the importance of loss functions and optimizers, and techniques for improving model performance, you gain the tools to effectively train neural networks and other machine learning models using PyTorch. The sources offer a practical guide to navigate the intricacies of model training, setting the stage for tackling more complex deep learning challenges.

    A Deep Dive into Computer Vision with PyTorch

    Building on the foundation of neural networks and model training, the sources provide an extensive exploration of computer vision using the PyTorch framework. They guide you through the process of building, training, and evaluating computer vision models, offering valuable insights into the core concepts and practical techniques involved.

    1. Understanding Computer Vision Problems

    Computer vision, broadly defined, encompasses tasks that enable computers to “see” and interpret visual information, mimicking human visual perception. The sources illustrate the vast scope of computer vision problems, ranging from basic classification to more complex tasks like object detection and image segmentation.

    Examples of Computer Vision Problems:

    • Image Classification: Assigning a label to an image from a predefined set of categories. For instance, classifying an image as containing a cat, dog, or bird.
    • Object Detection: Identifying and localizing specific objects within an image, often by drawing bounding boxes around them. Applications include self-driving cars recognizing pedestrians and traffic signs.
    • Image Segmentation: Dividing an image into meaningful regions, labeling each pixel with its corresponding object or category. This technique is used in medical imaging to identify organs and tissues.

    2. The Power of Convolutional Neural Networks (CNNs)

    The sources highlight CNNs as powerful deep learning models well-suited for computer vision tasks. CNNs excel at extracting spatial features from images using convolutional layers, mimicking the human visual system’s hierarchical processing of visual information.

    Key Components of CNNs:

    • Convolutional Layers: Perform convolutions using learnable filters (kernels) that slide across the input image, extracting features like edges, textures, and patterns.
    • Activation Functions: Introduce nonlinearity, allowing CNNs to model complex relationships between image features and output predictions.
    • Pooling Layers: Downsample feature maps, reducing computational complexity and making the model more robust to variations in object position and scale.
    • Fully Connected Layers: Combine features extracted by convolutional and pooling layers, generating final predictions for classification or other tasks.

    The sources provide practical insights into building CNNs using PyTorch’s torch.nn module, guiding you through the process of defining layers, constructing the network architecture, and implementing the forward pass.
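A minimal sketch of such a CNN for 28x28 grayscale images (the channel counts and kernel size are arbitrary choices for illustration):

import torch
from torch import nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, stride=1, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2)  # halves the spatial dimensions
        self.classifier = nn.Sequential(
            nn.Flatten(),                        # (N, 8, 14, 14) -> (N, 8*14*14)
            nn.Linear(8 * 14 * 14, num_classes),
        )

    def forward(self, x):
        x = torch.relu(self.conv(x))             # convolution + activation
        x = self.pool(x)                         # pooling
        return self.classifier(x)                # fully connected output

model = TinyCNN()
print(model(torch.rand(1, 1, 28, 28)).shape)     # torch.Size([1, 10])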

    3. Working with Torchvision

    PyTorch’s Torchvision library emerges as a crucial tool for computer vision projects, offering a rich ecosystem of pre-built datasets, models, and transformations.

    Key Components of Torchvision:

    • Datasets: Provides access to popular computer vision datasets like MNIST, FashionMNIST, CIFAR, and ImageNet. These datasets simplify the process of obtaining and loading data for model training and evaluation.
    • Models: Offers pre-trained models for various computer vision tasks, allowing you to leverage the power of transfer learning by fine-tuning these models on your own datasets.
    • Transforms: Enables data preprocessing and augmentation. You can use transforms to resize, crop, flip, normalize, and augment images, artificially expanding your dataset and improving model generalization.
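A minimal sketch tying these components together with FashionMNIST (the download path "data" is an arbitrary choice):

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),                       # PIL image -> float tensor scaled to [0, 1]
])
train_data = datasets.FashionMNIST(root="data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_data, batch_size=32, shuffle=True)

images, labels = next(iter(train_loader))
print(images.shape)                              # torch.Size([32, 1, 28, 28])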

    4. The Computer Vision Workflow

    The sources outline a typical workflow for computer vision projects using PyTorch, emphasizing practical steps and considerations:

    1. Data Preparation: Obtaining or creating a suitable dataset, organizing it into appropriate folders (e.g., by class labels), and applying necessary preprocessing or transformations.
    2. Dataset and DataLoader: Utilizing PyTorch’s Dataset and DataLoader classes to efficiently load and batch data for training and evaluation.
    3. Model Construction: Defining the CNN architecture using PyTorch’s torch.nn module, specifying layers, activation functions, and other components based on the problem’s complexity and requirements.
    4. Loss Function and Optimizer: Selecting a suitable loss function that aligns with the task (e.g., cross-entropy loss for classification) and choosing an optimizer like SGD or Adam to update the model’s weights during training.
    5. Training Loop: Implementing the iterative training process, involving forward pass, loss calculation, backpropagation, and weight updates. Monitoring training progress using loss curves to identify potential issues like underfitting or overfitting.
    6. Evaluation: Assessing the model’s performance on a held-out test dataset using metrics like accuracy, precision, recall, and F1-score, depending on the task.
    7. Model Saving and Loading: Preserving trained models for later use or deployment using torch.save and loading them back using torch.load.
    8. Prediction on Custom Data: Demonstrating how to load and preprocess custom images, pass them through the trained model, and obtain predictions.

    5. Going Beyond the Basics

    The sources provide a comprehensive foundation, but computer vision is a rapidly evolving field. Further exploration might lead you to:

    • Advanced Architectures: Exploring more complex CNN architectures like ResNet, Inception, and EfficientNet, each designed to address challenges in image recognition.
    • Object Detection and Segmentation: Investigating specialized models and techniques for object detection (e.g., YOLO, Faster R-CNN) and image segmentation (e.g., U-Net, Mask R-CNN).
    • Transfer Learning in Depth: Experimenting with various pre-trained models and fine-tuning strategies to optimize performance on your specific computer vision tasks.
    • Real-world Applications: Researching how computer vision is applied in diverse domains, such as medical imaging, autonomous driving, robotics, and image editing software.

    By mastering the fundamentals of computer vision, understanding CNNs, and leveraging PyTorch’s powerful tools, you can build and deploy models that empower computers to “see” and understand the visual world. The sources offer a practical guide to navigate this exciting domain, equipping you with the skills to tackle a wide range of computer vision challenges.

    Understanding Data Augmentation in Computer Vision

    Data augmentation is a crucial technique in computer vision that artificially expands the diversity and size of a training dataset by applying various transformations to the existing images [1, 2]. This process enhances the model’s ability to generalize and learn more robust patterns, ultimately improving its performance on unseen data.

    Why Data Augmentation is Important

    1. Increased Dataset Diversity: Data augmentation introduces variations in the training data, exposing the model to different perspectives of the same image [2]. This prevents the model from overfitting, where it learns to memorize the specific details of the training set rather than the underlying patterns of the target classes.
    2. Reduced Overfitting: By making the training data more challenging, data augmentation forces the model to learn more generalizable features that are less sensitive to minor variations in the input images [3, 4].
    3. Improved Model Generalization: A model trained with augmented data is better equipped to handle unseen data, as it has learned to recognize objects and patterns under various transformations, making it more robust and reliable in real-world applications [1, 5].

    Types of Data Augmentations

    The sources highlight several commonly used data augmentation techniques, particularly within the context of PyTorch’s torchvision.transforms module [6-8].

    • Resize: Changing the dimensions of the images [9]. This helps standardize the input size for the model and can also introduce variations in object scale.
    • Random Horizontal Flip: Flipping the images horizontally with a certain probability [8]. This technique is particularly effective for objects that are symmetric or appear in both left-right orientations.
    • Random Rotation: Rotating the images by a random angle [3]. This helps the model learn to recognize objects regardless of their orientation.
    • Random Crop: Cropping random sections of the images [9, 10]. This forces the model to focus on different parts of the image and can also introduce variations in object position.
    • Color Jitter: Adjusting the brightness, contrast, saturation, and hue of the images [11]. This helps the model learn to recognize objects under different lighting conditions.

    Trivial Augment: A State-of-the-Art Approach

    The sources mention Trivial Augment, a data augmentation strategy used by the PyTorch team to achieve state-of-the-art results on their computer vision models [12, 13]. Trivial Augment leverages randomness to select and apply a combination of augmentations from a predefined set with varying intensities, leading to a diverse and challenging training dataset [14].
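
    Torchvision exposes this strategy as transforms.TrivialAugmentWide, which can be dropped into a transform pipeline; the num_magnitude_bins value below is the library default, shown only for illustration:

    from torchvision import transforms

    train_transform = transforms.Compose([
        transforms.Resize((64, 64)),
        # Randomly picks one augmentation and one intensity per image
        transforms.TrivialAugmentWide(num_magnitude_bins=31),
        transforms.ToTensor(),
    ])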

    Practical Implementation in PyTorch

    PyTorch’s torchvision.transforms module provides a comprehensive set of functions for data augmentation [6-8]. You can create a transform pipeline by composing a sequence of transformations using transforms.Compose. For example, a basic transform pipeline might include resizing, random horizontal flipping, and conversion to a tensor:

    from torchvision import transforms

    train_transform = transforms.Compose([
        transforms.Resize((64, 64)),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.ToTensor(),
    ])

    To apply data augmentation during training, you would pass this transform pipeline to the Dataset when loading your images, so the transforms are applied as each image is loaded [7, 15].

    Evaluating the Impact of Data Augmentation

    The sources emphasize the importance of comparing model performance with and without data augmentation to assess its effectiveness [16, 17]. By monitoring training metrics like loss and accuracy, you can observe how data augmentation influences the model’s learning process and its ability to generalize to unseen data [18, 19].

    The Crucial Role of Hyperparameters in Model Training

    Hyperparameters are external configurations that are set by the machine learning engineer or data scientist before training a model. They are distinct from the parameters of a model, which are the internal values (weights and biases) that the model learns from the data during training. Hyperparameters play a critical role in shaping the model’s architecture, behavior, and ultimately, its performance.

    Defining Hyperparameters

    As the sources explain, hyperparameters are values that we, as the model builders, control and adjust. In contrast, parameters are values that the model learns and updates during training. The sources use the analogy of parking a car:

    • Hyperparameters are akin to the external controls of the car, such as the steering wheel, accelerator, and brake, which the driver uses to guide the vehicle.
    • Parameters are like the internal workings of the engine and transmission, which adjust automatically based on the driver’s input.

    Impact of Hyperparameters on Model Training

    Hyperparameters directly influence the learning process of a model. They determine factors such as:

    • Model Complexity: Hyperparameters like the number of layers and hidden units dictate the model’s capacity to learn intricate patterns in the data. More layers and hidden units typically increase the model’s complexity and ability to capture nonlinear relationships. However, excessive complexity can lead to overfitting.
    • Learning Rate: The learning rate governs how much the optimizer adjusts the model’s parameters during each training step. A high learning rate allows for rapid learning but can lead to instability or divergence. A low learning rate ensures stability but may require longer training times.
    • Batch Size: The batch size determines how many training samples are processed together before updating the model’s weights. Smaller batches can lead to faster convergence but might introduce more noise in the gradients. Larger batches provide more stable gradients but can slow down training.
    • Number of Epochs: The number of epochs determines how many times the entire training dataset is passed through the model. More epochs can improve learning, but excessive training can also lead to overfitting.

    Example: Tuning Hyperparameters for a CNN

    Consider the task of building a CNN for image classification, as described in the sources. Several hyperparameters are crucial to the model’s performance:

    • Number of Convolutional Layers: This hyperparameter determines how many layers are used to extract features from the images. More layers allow for the capture of more complex features but increase computational complexity.
    • Kernel Size: The kernel size (filter size) in convolutional layers dictates the receptive field of the filters, influencing the scale of features extracted. Smaller kernels capture fine-grained details, while larger kernels cover wider areas.
    • Stride: The stride defines how the kernel moves across the image during convolution. A larger stride results in downsampling and a smaller feature map.
    • Padding: Padding adds extra pixels around the image borders before convolution, preventing information loss at the edges and ensuring consistent feature map dimensions.
    • Activation Function: Activation functions like ReLU introduce nonlinearity, enabling the model to learn complex relationships between features. The choice of activation function can significantly impact model performance.
    • Optimizer: The optimizer (e.g., SGD, Adam) determines how the model’s parameters are updated based on the calculated gradients. Different optimizers have different convergence properties and might be more suitable for specific datasets or architectures.

    By carefully tuning these hyperparameters, you can optimize the CNN’s performance on the image classification task. Experimentation and iteration are key to finding the best hyperparameter settings for a given dataset and model architecture.
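
    These hyperparameters map directly onto the arguments of nn.Conv2d; the values in this sketch are illustrative assumptions, not recommendations:

    from torch import nn

    conv = nn.Conv2d(
        in_channels=3,    # colour channels of the input image
        out_channels=10,  # number of learnable filters (kernels)
        kernel_size=3,    # 3x3 receptive field
        stride=1,         # move the kernel one pixel at a time
        padding=1,        # pad the borders so spatial size is preserved
    )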

    The Hyperparameter Tuning Process

    The sources highlight the iterative nature of finding the best hyperparameter configurations. There’s no single “best” set of hyperparameters that applies universally. The optimal settings depend on the specific dataset, model architecture, and task. The sources also emphasize:

    • Experimentation: Try different combinations of hyperparameters to observe their impact on model performance.
    • Monitoring Loss Curves: Use loss curves to gain insights into the model’s training behavior, identifying potential issues like underfitting or overfitting and adjusting hyperparameters accordingly.
    • Validation Sets: Employ a validation dataset to evaluate the model’s performance on unseen data during training, helping to prevent overfitting and select the best-performing hyperparameters.
    • Automated Techniques: Explore automated hyperparameter tuning methods like grid search, random search, or Bayesian optimization to efficiently search the hyperparameter space.

    By understanding the role of hyperparameters and mastering techniques for tuning them, you can unlock the full potential of your models and achieve optimal performance on your computer vision tasks.

    The Learning Process of Deep Learning Models

    Deep learning models learn from data by adjusting their internal parameters to capture patterns and relationships within the data. The sources provide a comprehensive overview of this process, particularly within the context of supervised learning using neural networks.

    1. Data Representation: Turning Data into Numbers

    The first step in deep learning is to represent the data in a numerical format that the model can understand. As the sources emphasize, “machine learning is turning things into numbers” [1, 2]. This process involves encoding various forms of data, such as images, text, or audio, into tensors, which are multi-dimensional arrays of numbers.

    2. Model Architecture: Building the Learning Framework

    Once the data is numerically encoded, a model architecture is defined. Neural networks are a common type of deep learning model, consisting of interconnected layers of neurons. Each layer performs mathematical operations on the input data, transforming it into increasingly abstract representations.

    • Input Layer: Receives the numerical representation of the data.
    • Hidden Layers: Perform computations on the input, extracting features and learning representations.
    • Output Layer: Produces the final output of the model, which is tailored to the specific task (e.g., classification, regression).

    3. Parameter Initialization: Setting the Starting Point

    The parameters of a neural network, typically weights and biases, are initially assigned random values. These parameters determine how the model processes the data and ultimately define its behavior.

    4. Forward Pass: Calculating Predictions

    During training, the data is fed forward through the network, layer by layer. Each layer performs its mathematical operations, using the current parameter values to transform the input data. The final output of the network represents the model’s prediction for the given input.

    5. Loss Function: Measuring Prediction Errors

    A loss function is used to quantify the difference between the model’s predictions and the true target values. The loss function measures how “wrong” the model’s predictions are, providing a signal for how to adjust the parameters to improve performance.

    6. Backpropagation: Calculating Gradients

    Backpropagation is the core algorithm that enables deep learning models to learn. It involves calculating the gradients of the loss function with respect to each parameter in the network. These gradients indicate the direction and magnitude of change needed for each parameter to reduce the loss.

    7. Optimizer: Updating Parameters

    An optimizer uses the calculated gradients to update the model’s parameters. The optimizer’s goal is to minimize the loss function by iteratively adjusting the parameters in the direction that reduces the error. Common optimizers include Stochastic Gradient Descent (SGD) and Adam.

    8. Training Loop: Iterative Learning Process

    The training loop encompasses the steps of forward pass, loss calculation, backpropagation, and parameter update. This process is repeated iteratively over the training data, allowing the model to progressively refine its parameters and improve its predictive accuracy.

    • Epochs: Each pass through the entire training dataset is called an epoch.
    • Batch Size: Data is typically processed in batches, where a batch is a subset of the training data.
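
    Put together, one pass of this loop might look like the following sketch, assuming model, loss_fn, optimizer, train_dataloader, and an epochs count are already defined:

    for epoch in range(epochs):
        for X, y in train_dataloader:   # one mini-batch at a time
            y_pred = model(X)           # 1. forward pass
            loss = loss_fn(y_pred, y)   # 2. calculate the loss
            optimizer.zero_grad()       # 3. clear old gradients
            loss.backward()             # 4. backpropagation
            optimizer.step()            # 5. update the parameters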

    9. Evaluation: Assessing Model Performance

    After training, the model is evaluated on a separate dataset (validation or test set) to assess its ability to generalize to unseen data. Metrics like accuracy, precision, and recall are used to measure the model’s performance on the task.

    10. Hyperparameter Tuning: Optimizing the Learning Process

    Hyperparameters are external configurations that influence the model’s learning process. Examples include learning rate, batch size, and the number of layers. Tuning hyperparameters is crucial to achieving optimal model performance. This often involves experimentation and monitoring training metrics to find the best settings.

    Key Concepts and Insights

    • Iterative Learning: Deep learning models learn through an iterative process of making predictions, calculating errors, and adjusting parameters.
    • Gradient Descent: Backpropagation and optimizers work together to implement gradient descent, guiding the parameter updates towards minimizing the loss function.
    • Feature Learning: Hidden layers in neural networks automatically learn representations of the data, extracting meaningful features that contribute to the model’s predictive ability.
    • Nonlinearity: Activation functions introduce nonlinearity, allowing models to capture complex relationships in the data that cannot be represented by simple linear models.

    By understanding these fundamental concepts, you can gain a deeper appreciation for how deep learning models learn from data and achieve remarkable performance on a wide range of tasks.

    Key Situations for Deep Learning Solutions

    The sources provide a detailed explanation of when deep learning is a good solution and when simpler approaches might be more suitable. Here are three key situations where deep learning often excels:

    1. Problems with Long Lists of Rules

    Deep learning models are particularly effective when dealing with problems that involve a vast and intricate set of rules that would be difficult or impossible to program explicitly. The sources use the example of driving a car, which encompasses countless rules regarding navigation, safety, and traffic regulations.

    • Traditional programming struggles with such complexity, requiring engineers to manually define and code every possible scenario. This approach quickly becomes unwieldy and prone to errors.
    • Deep learning offers a more flexible and adaptable solution. Instead of explicitly programming rules, deep learning models learn from data, automatically extracting patterns and relationships that represent the underlying rules.

    2. Continuously Changing Environments

    Deep learning shines in situations where the environment or the data itself is constantly evolving. Unlike traditional rule-based systems, which require manual updates to adapt to changes, deep learning models can continuously learn and update their knowledge as new data becomes available.

    • The sources highlight the adaptability of deep learning, stating that models can “keep learning if it needs to” and “adapt and learn to new scenarios.”
    • This capability is crucial in applications such as self-driving cars, where road conditions, traffic patterns, and even driving regulations can change over time.

    3. Discovering Insights Within Large Collections of Data

    Deep learning excels at uncovering hidden patterns and insights within massive datasets. The ability to process vast amounts of data is a key advantage of deep learning, enabling it to identify subtle relationships and trends that might be missed by traditional methods.

    • The sources emphasize the flourishing of deep learning in handling large datasets, citing examples like the Food 101 dataset, which contains images of 101 different kinds of foods.
    • This capacity for large-scale data analysis is invaluable in fields such as medical image analysis, where deep learning can assist in detecting diseases, identifying anomalies, and predicting patient outcomes.

    In these situations, deep learning offers a powerful and flexible approach, allowing models to learn from data, adapt to changes, and extract insights from vast datasets, providing solutions that were previously challenging or even impossible to achieve with traditional programming techniques.

    The Most Common Errors in Deep Learning

    The sources highlight shape errors as one of the most prevalent challenges encountered by deep learning developers, emphasizing that the issue stems from neural networks’ fundamental reliance on matrix multiplication operations.

    • Neural networks are built upon interconnected layers, and matrix multiplication is the primary mechanism for data transformation between these layers. [1]
    • Shape errors arise when the dimensions of the matrices involved in these multiplications are incompatible. [1, 2]
    • The sources illustrate this concept by explaining that for matrix multiplication to succeed, the inner dimensions of the matrices must match. [2, 3]

    Three Big Errors in PyTorch and Deep Learning

    The sources further elaborate on this concept within the specific context of the PyTorch deep learning framework, identifying three primary categories of errors:

    1. Tensors not having the Right Data Type: The sources point out that using the incorrect data type for tensors can lead to errors, especially during the training of large neural networks. [4]
    2. Tensors not having the Right Shape: This echoes the earlier discussion of shape errors and their importance in matrix multiplication operations. [4]
    3. Device Issues: This category of errors arises when tensors are located on different devices, typically the CPU and GPU. PyTorch requires tensors involved in an operation to reside on the same device. [5]

    The Ubiquity of Shape Errors

    The sources consistently underscore the significance of understanding tensor shapes and dimensions in deep learning.

    • They emphasize that mismatches in input and output shapes between layers are a frequent source of errors. [6]
    • The process of reshaping, stacking, squeezing, and unsqueezing tensors is presented as a crucial technique for addressing shape-related issues. [7, 8]
    • The sources advise developers to become familiar with their data’s shape and consult documentation to understand the expected input shapes for various layers and operations. [9]

    Troubleshooting Tips and Practical Advice

    Beyond identifying shape errors as a common challenge, the sources offer practical tips and insights for troubleshooting such issues.

    • Understanding matrix multiplication rules: Developers are encouraged to grasp the fundamental rules governing matrix multiplication to anticipate and prevent shape errors. [3]
    • Visualizing matrix multiplication: The sources recommend using the website matrixmultiplication.xyz as a tool for visualizing matrix operations and understanding their dimensional requirements. [10]
    • Programmatic shape checking: The sources advocate for incorporating programmatic checks of tensor shapes using functions like tensor.shape to identify and debug shape mismatches. [11, 12]
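
    As a quick illustration of these tips, the sketch below checks shapes before multiplying and shows how a transpose can fix mismatched inner dimensions:

    import torch

    A = torch.randn(2, 3)
    B = torch.randn(3, 4)
    print(A.shape, B.shape)          # torch.Size([2, 3]) torch.Size([3, 4])
    print(torch.matmul(A, B).shape)  # inner dims match -> torch.Size([2, 4])

    B_bad = torch.randn(4, 3)
    # torch.matmul(A, B_bad) would raise a RuntimeError (3 != 4);
    # transposing B_bad makes the inner dimensions line up again:
    print(torch.matmul(A, B_bad.T).shape)  # torch.Size([2, 4])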

    By understanding the importance of tensor shapes and diligently checking for dimensional compatibility, deep learning developers can mitigate the occurrence of shape errors and streamline their development workflow.

    Two Common Deep Learning Errors

    The sources describe three major errors faced by deep learning developers: tensors not having the correct data type, tensors not having the correct shape, and device issues. [1] Two particularly common errors are data type and shape mismatches. [1, 2]

    Data Type Mismatches

    The sources explain that using the wrong data type for a tensor, especially when training large neural networks, can lead to errors. [1] For example, torch.mean() requires a floating-point tensor (such as float32); calling it on a long (integer) tensor raises an error. [3] Data type mismatches can also occur with loss functions: torch.nn.BCELoss expects inputs that have already passed through a sigmoid activation function, whereas torch.nn.BCEWithLogitsLoss accepts raw logits and applies the sigmoid internally. [4-6]
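
    A minimal sketch of the torch.mean() case and its fix:

    import torch

    x = torch.tensor([1, 2, 3])  # dtype is torch.int64 (long) by default
    # torch.mean(x)              # RuntimeError: mean needs a float dtype
    print(torch.mean(x.type(torch.float32)))  # tensor(2.)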

    Shape Mismatches

    Shape errors are extremely common in deep learning. [1, 2, 7-13] The sources explain that shape errors arise when the dimensions of matrices are incompatible during matrix multiplication operations. [7-9] To perform matrix multiplication, the inner dimensions of the matrices must match. [7, 14] Shape errors can also occur if the input or output shapes of tensors are mismatched between layers in a neural network. [11, 15] For example, a convolutional layer might expect a four-dimensional tensor, but if a three-dimensional tensor is used, an error will occur. [13] The sources recommend checking the shape of tensors frequently to catch these errors. [11, 16]
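
    For the convolutional-layer case, the usual fix is to add the missing batch dimension with unsqueeze; the 3-channel 64x64 image below is an illustrative assumption:

    import torch
    from torch import nn

    conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=3)
    image = torch.randn(3, 64, 64)    # 3D: missing the batch dimension
    batched = image.unsqueeze(dim=0)  # 4D: torch.Size([1, 3, 64, 64])
    print(conv(batched).shape)        # torch.Size([1, 10, 62, 62])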

    Let’s go through the topics covered in the “PyTorch for Deep Learning & Machine Learning – Full Course” one by one.

    1. Introduction: Deep Learning vs. Traditional Programming

    The sources start by introducing deep learning as a subset of machine learning, which itself is a subset of artificial intelligence [1]. They explain the key difference between traditional programming and machine learning [2].

    • In traditional programming, we give the computer specific rules and data, and it produces the output.
    • In machine learning, we provide the computer with data and desired outputs, and it learns the rules to map the data to the outputs.

    The sources argue that deep learning is particularly well-suited for complex problems where it’s difficult to hand-craft rules [3, 4]. Examples include self-driving cars and image recognition. However, they also caution against using machine learning when a simpler, rule-based system would suffice [4, 5].

    2. PyTorch Fundamentals: Tensors and Operations

    The sources then introduce PyTorch, a popular deep learning framework written in Python [6, 7]. The core data structure in PyTorch is the tensor, a multi-dimensional array that can be used to represent various types of data [8].

    • The sources explain the different types of tensors: scalars, vectors, matrices, and higher-order tensors [9].
    • They demonstrate how to create tensors using torch.tensor() and showcase various operations like reshaping, indexing, stacking, and permuting [9-11].

    Understanding tensor shapes and dimensions is crucial for avoiding errors in deep learning, as highlighted in our previous conversation about shape mismatches [12].

    3. The PyTorch Workflow: From Data to Model

    The sources then outline a typical PyTorch workflow [13] for developing deep learning models:

    1. Data Preparation and Loading: The sources emphasize the importance of preparing data for machine learning [14] and the process of transforming raw data into a numerical representation suitable for models. They introduce data loaders (torch.utils.data.DataLoader) [15] for efficiently loading data in batches [16].
    2. Building a Machine Learning Model: The sources demonstrate how to build models in PyTorch by subclassing nn.Module [17]. This involves defining the model’s layers and the forward pass, which specifies how data flows through the model.
    3. Fitting the Model to the Data (Training): The sources explain the concept of a training loop [18], where the model iteratively learns from the data. Key steps in the training loop include:
    • Forward Pass: Passing data through the model to get predictions.
    • Calculating the Loss: Measuring how wrong the model’s predictions are using a loss function [19].
    • Backpropagation: Calculating gradients to determine how to adjust the model’s parameters.
    • Optimizer Step: Updating the model’s parameters using an optimizer [20] to minimize the loss.
    4. Evaluating the Model: The sources highlight the importance of evaluating the model’s performance on unseen data to assess its generalization ability. This typically involves calculating metrics such as accuracy, precision, and recall [21].
    5. Saving and Reloading the Model: The sources discuss methods for saving and loading trained models using torch.save() and torch.load() [22, 23].
    6. Improving the Model: The sources provide tips and strategies for enhancing the model’s performance, including techniques like hyperparameter tuning, data augmentation, and using different model architectures [24].

    4. Classification with PyTorch: Binary and Multi-Class

    The sources dive into classification problems, a common type of machine learning task where the goal is to categorize data into predefined classes [25]. They discuss:

    • Binary Classification: Predicting one of two possible classes [26].
    • Multi-Class Classification: Choosing from more than two classes [27].

    The sources demonstrate how to build classification models in PyTorch and showcase various techniques:

    • Choosing appropriate loss functions like binary cross entropy loss (nn.BCELoss) for binary classification and cross entropy loss (nn.CrossEntropyLoss) for multi-class classification [28].
    • Using activation functions like sigmoid for binary classification and softmax for multi-class classification [29].
    • Evaluating classification models using metrics like accuracy, precision, recall, and confusion matrices [30].
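
    The sketch below pairs each loss with the kind of output it expects; the tensor shapes and batch size are illustrative assumptions:

    import torch
    from torch import nn

    # Binary: one logit per sample; BCEWithLogitsLoss applies the sigmoid
    binary_logits = torch.randn(4, 1)
    binary_labels = torch.randint(0, 2, (4, 1)).float()
    print(nn.BCEWithLogitsLoss()(binary_logits, binary_labels))

    # Multi-class: one logit per class; CrossEntropyLoss applies the softmax
    multi_logits = torch.randn(4, 3)
    multi_labels = torch.randint(0, 3, (4,))
    print(nn.CrossEntropyLoss()(multi_logits, multi_labels))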

    5. Computer Vision with PyTorch: Convolutional Neural Networks (CNNs)

    The sources introduce computer vision, the field of enabling computers to “see” and interpret images [31]. They focus on convolutional neural networks (CNNs), a type of neural network architecture specifically designed for processing image data [32].

    • Torchvision: The sources introduce torchvision, a PyTorch library containing datasets, model architectures, and image transformation tools [33].
    • Data Augmentation: The sources showcase data augmentation techniques using torchvision.transforms to create variations of training images and improve model robustness [34].
    • CNN Building Blocks: The sources explain and demonstrate key CNN components like convolutional layers (nn.Conv2d), pooling layers, and activation functions [35].

    They guide you through building CNNs from scratch and visualizing the learned features.

    6. Custom Datasets: Working with Your Own Data

    The sources address the challenge of working with custom datasets not readily available in PyTorch’s built-in libraries [36]. They explain how to:

    • Create custom datasets by subclassing torch.utils.data.Dataset [37] and implementing methods for loading and processing data.
    • Integrate custom datasets with PyTorch’s data loaders for efficient training.

    This section empowers you to apply PyTorch to real-world problems using your own data.

    7. Further Resources and Exercises

    The sources provide numerous extracurricular resources and exercises [38-40] to deepen your understanding and practice your skills. These resources include:

    • Links to relevant PyTorch documentation and tutorials
    • Blog posts and articles on specific deep learning concepts
    • Code templates and solution notebooks for exercises

    By actively engaging with these resources, you can continue your learning journey and solidify your PyTorch proficiency.

    This comprehensive overview of the topics covered in the “PyTorch for Deep Learning & Machine Learning – Full Course” provides a structured understanding of the key concepts and techniques. Remember, the best way to learn is by practicing and experimenting with the code provided in the sources.

    Here are summaries of the source document, taken roughly ten pages at a time:

    Pages 1-10 Summary: Introduction to Deep Learning and PyTorch Fundamentals

    These pages introduce the fundamental concepts of deep learning, positioning it as a powerful subset of machine learning. The sources draw a clear distinction between traditional programming, where explicit rules dictate output, and machine learning, where algorithms learn rules from data. The emphasis is on PyTorch as the chosen deep learning framework, highlighting its core data structure: the tensor.

    The sources provide practical guidance on creating tensors using torch.tensor() and manipulating them with operations like reshaping and indexing. They underscore the crucial role of understanding tensor shapes and dimensions, connecting it to the common challenge of shape errors discussed in our earlier conversation.

    This set of pages lays the groundwork for understanding both the conceptual framework of deep learning and the practical tools provided by PyTorch.

    Pages 11-20 Summary: Exploring Tensors, Neural Networks, and PyTorch Documentation

    These pages build upon the introduction of tensors, expanding on operations like stacking and permuting to manipulate tensor structures further. They transition into a conceptual overview of neural networks, emphasizing their ability to learn complex patterns from data. However, the sources don’t provide detailed definitions of deep learning or neural networks, encouraging you to explore these concepts independently through external resources like Wikipedia and educational channels.

    The sources strongly advocate for actively engaging with PyTorch documentation. They highlight the website as a valuable resource for understanding PyTorch’s features, functions, and examples. They encourage you to spend time reading and exploring the documentation, even if you don’t fully grasp every detail initially.

    Pages 21-30 Summary: The PyTorch Workflow: Data, Models, Loss, and Optimization

    This section of the source delves into the core PyTorch workflow, starting with the importance of data preparation. It emphasizes the transformation of raw data into tensors, making it suitable for deep learning models. Data loaders are presented as essential tools for efficiently handling large datasets by loading data in batches.

    The sources then guide you through the process of building a machine learning model in PyTorch, using the concept of subclassing nn.Module. The forward pass is introduced as a fundamental step that defines how data flows through the model’s layers. The sources explain how models are trained by fitting them to the data, highlighting the iterative process of the training loop:

    1. Forward pass: Input data is fed through the model to generate predictions.
    2. Loss calculation: A loss function quantifies the difference between the model’s predictions and the actual target values.
    3. Backpropagation: The model’s parameters are adjusted by calculating gradients, indicating how each parameter contributes to the loss.
    4. Optimization: An optimizer uses the calculated gradients to update the model’s parameters, aiming to minimize the loss.

    Pages 31-40 Summary: Evaluating Models, Running Tensors, and Important Concepts

    The sources focus on evaluating the model’s performance, emphasizing its significance in determining how well the model generalizes to unseen data. They mention common metrics like accuracy, precision, and recall as tools for evaluating model effectiveness.

    The sources introduce the concept of running tensors on different devices (CPU and GPU) using .to(device), highlighting its importance for computational efficiency. They also discuss the use of random seeds (torch.manual_seed()) to ensure reproducibility in deep learning experiments, enabling consistent results across multiple runs.
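
    A minimal sketch of device-agnostic, reproducible setup along these lines:

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    torch.manual_seed(42)            # make random initialisations repeatable

    x = torch.rand(2, 2).to(device)  # move the tensor to the target device
    print(x.device)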

    The sources stress the importance of documentation reading as a key exercise for understanding PyTorch concepts and functionalities. They also advocate for practical coding exercises to reinforce learning and develop proficiency in applying PyTorch concepts.

    Pages 41-50 Summary: Exercises, Classification Introduction, and Data Visualization

    The sources dedicate these pages to practical application and reinforcement of previously learned concepts. They present exercises designed to challenge your understanding of PyTorch workflows, data manipulation, and model building. They recommend referring to the documentation, practicing independently, and checking provided solutions as a learning approach.

    The focus shifts to classification problems, distinguishing between binary classification, where the task is to predict one of two classes, and multi-class classification, involving more than two classes.

    The sources then begin exploring data visualization, emphasizing the importance of understanding your data before applying machine learning models. They introduce the make_circles dataset as an example and use scatter plots to visualize its structure, highlighting the need for visualization as a crucial step in the data exploration process.

    Pages 51-60 Summary: Data Splitting, Building a Classification Model, and Training

    The sources discuss the critical concept of splitting data into training and test sets. This separation ensures that the model is evaluated on unseen data to assess its generalization capabilities accurately. They utilize the train_test_split function to divide the data and showcase the process of building a simple binary classification model in PyTorch.

    The sources emphasize the familiar training loop process, where the model iteratively learns from the training data:

    1. Forward pass through the model
    2. Calculation of the loss function
    3. Backpropagation of gradients
    4. Optimization of model parameters

    They guide you through implementing these steps and visualizing the model’s training progress using loss curves, highlighting the importance of monitoring these curves for insights into the model’s learning behavior.

    Pages 61-70 Summary: Multi-Class Classification, Data Visualization, and the Softmax Function

    The sources delve into multi-class classification, expanding upon the previously covered binary classification. They illustrate the differences between the two and provide examples of scenarios where each is applicable.

    The focus remains on data visualization, emphasizing the importance of understanding your data before applying machine learning algorithms. The sources introduce techniques for visualizing multi-class data, aiding in pattern recognition and insight generation.

    The softmax function is introduced as a crucial component in multi-class classification models. The sources explain its role in converting the model’s raw outputs (logits) into probabilities, enabling interpretation and decision-making based on these probabilities.
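
    A small example of what softmax does to a vector of logits:

    import torch

    logits = torch.tensor([[2.0, 1.0, 0.1]])  # raw model outputs
    probs = torch.softmax(logits, dim=1)
    print(probs)                # tensor([[0.6590, 0.2424, 0.0986]])
    print(probs.sum())          # tensor(1.) - probabilities sum to one
    print(probs.argmax(dim=1))  # tensor([0]) - the predicted class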

    Pages 71-80 Summary: Evaluation Metrics, Saving/Loading Models, and Computer Vision Introduction

    This section explores various evaluation metrics for assessing the performance of classification models. They introduce metrics like accuracy, precision, recall, F1 score, confusion matrices, and classification reports. The sources explain the significance of each metric and how to interpret them in the context of evaluating model effectiveness.

    The sources then discuss the practical aspects of saving and loading trained models, highlighting the importance of preserving model progress and enabling future use without retraining.

    The focus shifts to computer vision, a field that enables computers to “see” and interpret images. They discuss the use of convolutional neural networks (CNNs) as specialized neural network architectures for image processing tasks.

    Pages 81-90 Summary: Computer Vision Libraries, Data Exploration, and Mini-Batching

    The sources introduce essential computer vision libraries in PyTorch, particularly highlighting torchvision. They explain the key components of torchvision, including datasets, model architectures, and image transformation tools.

    They guide you through exploring a computer vision dataset, emphasizing the importance of understanding data characteristics before model building. Techniques for visualizing images and examining data structure are presented.

    The concept of mini-batching is discussed as a crucial technique for efficiently training deep learning models on large datasets. The sources explain how mini-batching involves dividing the data into smaller batches, reducing memory requirements and improving training speed.
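
    A minimal sketch of mini-batching with DataLoader, assuming train_data is a Dataset object such as the FashionMNIST one loaded earlier:

    from torch.utils.data import DataLoader

    train_dataloader = DataLoader(train_data, batch_size=32, shuffle=True)
    images, labels = next(iter(train_dataloader))
    print(images.shape)  # e.g. torch.Size([32, 1, 28, 28]) for FashionMNIST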

    Pages 91-100 Summary: Building a CNN, Training Steps, and Evaluation

    This section dives into the practical aspects of building a CNN for image classification. They guide you through defining the model’s architecture, including convolutional layers (nn.Conv2d), pooling layers, activation functions, and a final linear layer for classification.

    The familiar training loop process is revisited, outlining the steps involved in training the CNN model:

    1. Forward pass of data through the model
    2. Calculation of the loss function
    3. Backpropagation to compute gradients
    4. Optimization to update model parameters

    The sources emphasize the importance of monitoring the training process by visualizing loss curves and calculating evaluation metrics like accuracy and loss. They provide practical code examples for implementing these steps and evaluating the model’s performance on a test dataset.

    Pages 101-110 Summary: Troubleshooting, Non-Linear Activation Functions, and Model Building

    The sources provide practical advice for troubleshooting common errors in PyTorch code, encouraging the use of the data explorer’s motto: visualize, visualize, visualize. The importance of checking tensor shapes, understanding error messages, and referring to the PyTorch documentation is highlighted. They recommend searching for specific errors online, utilizing resources like Stack Overflow, and if all else fails, asking questions on the course’s GitHub discussions page.

    The concept of non-linear activation functions is introduced as a crucial element in building effective neural networks. These functions, such as ReLU, introduce non-linearity into the model, enabling it to learn complex, non-linear patterns in the data. The sources emphasize the importance of combining linear and non-linear functions within a neural network to achieve powerful learning capabilities.

    Building upon this concept, the sources guide you through the process of constructing a more complex classification model incorporating non-linear activation functions. They demonstrate the step-by-step implementation, highlighting the use of ReLU and its impact on the model’s ability to capture intricate relationships within the data.

    Pages 111-120 Summary: Data Augmentation, Model Evaluation, and Performance Improvement

    The sources introduce data augmentation as a powerful technique for artificially increasing the diversity and size of training data, leading to improved model performance. They demonstrate various data augmentation methods, including random cropping, flipping, and color adjustments, emphasizing the role of torchvision.transforms in implementing these techniques. The TrivialAugment technique is highlighted as a particularly effective and efficient data augmentation strategy.

    The sources reinforce the importance of model evaluation and explore advanced techniques for assessing the performance of classification models. They introduce metrics beyond accuracy, including precision, recall, F1-score, and confusion matrices. The use of torchmetrics and other libraries for calculating these metrics is demonstrated.

    The sources discuss strategies for improving model performance, focusing on optimizing training speed and efficiency. They introduce concepts like mixed precision training and highlight the potential benefits of using TPUs (Tensor Processing Units) for accelerated deep learning tasks.

    Pages 121-130 Summary: CNN Hyperparameters, Custom Datasets, and Image Loading

    The sources provide a deeper exploration of CNN hyperparameters, focusing on kernel size, stride, and padding. They utilize the CNN Explainer website as a valuable resource for visualizing and understanding the impact of these hyperparameters on the convolutional operations within a CNN. They guide you through calculating output shapes based on these hyperparameters, emphasizing the importance of understanding the transformations applied to the input data as it passes through the network’s layers.
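
    The output-shape calculation follows the standard convolution formula; the helper below is a sketch for illustration, not course code:

    def conv_output_size(input_size, kernel_size, stride=1, padding=0):
        # output = floor((input + 2*padding - kernel_size) / stride) + 1
        return (input_size + 2 * padding - kernel_size) // stride + 1

    print(conv_output_size(64, kernel_size=3, stride=1, padding=1))  # 64
    print(conv_output_size(64, kernel_size=3, stride=2, padding=0))  # 31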

    The concept of custom datasets is introduced, moving beyond the use of pre-built datasets like FashionMNIST. The sources outline the process of creating a custom dataset using PyTorch’s Dataset class, enabling you to work with your own data sources. They highlight the importance of structuring your data appropriately for use with PyTorch’s data loading utilities.

    They demonstrate techniques for loading images using PyTorch, leveraging libraries like PIL (Python Imaging Library) and showcasing the steps involved in reading image data, converting it into tensors, and preparing it for use in a deep learning model.

    Pages 131-140 Summary: Building a Custom Dataset, Data Visualization, and Data Augmentation

    The sources guide you step-by-step through the process of building a custom dataset in PyTorch, specifically focusing on creating a food image classification dataset called FoodVision Mini. They cover techniques for organizing image data, creating class labels, and implementing a custom dataset class that inherits from PyTorch’s Dataset class.

    They emphasize the importance of data visualization throughout the process, demonstrating how to visually inspect images, verify labels, and gain insights into the dataset’s characteristics. They provide code examples for plotting random images from the custom dataset, enabling visual confirmation of data loading and preprocessing steps.

    The sources revisit data augmentation in the context of custom datasets, highlighting its role in improving model generalization and robustness. They demonstrate the application of various data augmentation techniques using torchvision.transforms to artificially expand the training dataset and introduce variations in the images.

    Pages 141-150 Summary: Training and Evaluation with a Custom Dataset, Transfer Learning, and Advanced Topics

    The sources guide you through the process of training and evaluating a deep learning model using your custom dataset (FoodVision Mini). They cover the steps involved in setting up data loaders, defining a model architecture, implementing a training loop, and evaluating the model’s performance using appropriate metrics. They emphasize the importance of monitoring training progress through visualization techniques like loss curves and exploring the model’s predictions on test data.

    The sources introduce transfer learning as a powerful technique for leveraging pre-trained models to improve performance on a new task, especially when working with limited data. They explain the concept of using a model trained on a large dataset (like ImageNet) as a starting point and fine-tuning it on your custom dataset to achieve better results.

    The sources provide an overview of advanced topics in PyTorch deep learning, including:

    • Model experiment tracking: Tools and techniques for managing and tracking multiple deep learning experiments, enabling efficient comparison and analysis of model variations.
    • PyTorch paper replicating: Replicating research papers using PyTorch, a valuable approach for understanding cutting-edge deep learning techniques and applying them to your own projects.
    • PyTorch workflow debugging: Strategies for debugging and troubleshooting issues that may arise during the development and training of deep learning models in PyTorch.

    These advanced topics provide a glimpse into the broader landscape of deep learning research and development using PyTorch, encouraging further exploration and experimentation beyond the foundational concepts covered in the previous sections.

    Pages 151-160 Summary: Custom Datasets, Data Exploration, and the FoodVision Mini Dataset

    The sources emphasize the importance of custom datasets when working with data that doesn’t fit into pre-existing structures like FashionMNIST. They highlight the different domain libraries available in PyTorch for handling specific types of data, including:

    • Torchvision: for image data
    • Torchtext: for text data
    • Torchaudio: for audio data
    • Torchrec: for recommendation systems data

    Each of these libraries has a datasets module that provides tools for loading and working with data from that domain. Additionally, the sources mention Torchdata, which is a more general-purpose data loading library that is still under development.

    The sources guide you through the process of creating a custom image dataset called FoodVision Mini, based on the larger Food101 dataset. They provide detailed instructions for:

    1. Obtaining the Food101 data: This involves downloading the dataset from its original source.
    2. Structuring the data: The sources recommend organizing the data in a specific folder structure, where each subfolder represents a class label and contains images belonging to that class.
    3. Exploring the data: The sources emphasize the importance of becoming familiar with the data through visualization and exploration. This can help you identify potential issues with the data and gain insights into its characteristics.

    They introduce the concept of becoming one with the data, spending significant time understanding its structure, format, and nuances before diving into model building. This echoes the data explorer’s motto: visualize, visualize, visualize.

    The sources provide practical advice for exploring the dataset, including walking through directories and visualizing images to confirm the organization and content of the data. They introduce a helper function called walk_through_dir that allows you to systematically traverse the dataset’s folder structure and gather information about the number of directories and images within each class.

    Pages 161-170 Summary: Creating a Custom Dataset Class and Loading Images

    The sources continue the process of building the FoodVision Mini custom dataset, guiding you through creating a custom dataset class using PyTorch’s Dataset class. They outline the essential components and functionalities of such a class:

    1. Initialization (__init__): This method sets up the dataset’s attributes, including the target directory containing the data and any necessary transformations to be applied to the images.
    2. Length (__len__): This method returns the total number of samples in the dataset, providing a way to iterate through the entire dataset.
    3. Item retrieval (__getitem__): This method retrieves a specific sample (image and label) from the dataset based on its index, enabling access to individual data points during training.

    The sources demonstrate how to load images using the PIL (Python Imaging Library) and convert them into tensors, a format suitable for PyTorch deep learning models. They provide a detailed implementation of the load_image function, which takes an image path as input and returns a PIL image object. This function is then utilized within the __getitem__ method to load and preprocess images on demand.

    They highlight the steps involved in creating a class-to-index mapping, associating each class label with a numerical index, a requirement for training classification models in PyTorch. This mapping is generated by scanning the target directory and extracting the class names from the subfolder names.
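
    Condensed into code, such a class might look like the sketch below; the folder layout (one subfolder of .jpg images per class) and the transform argument are assumptions:

    import os
    from pathlib import Path
    from PIL import Image
    from torch.utils.data import Dataset

    class ImageFolderCustom(Dataset):
        def __init__(self, targ_dir, transform=None):
            self.paths = list(Path(targ_dir).glob("*/*.jpg"))
            self.transform = transform
            classes = sorted(entry.name for entry in os.scandir(targ_dir) if entry.is_dir())
            self.class_to_idx = {name: i for i, name in enumerate(classes)}

        def __len__(self):
            return len(self.paths)

        def __getitem__(self, index):
            image_path = self.paths[index]
            image = Image.open(image_path)                    # load with PIL
            label = self.class_to_idx[image_path.parent.name]
            if self.transform:
                image = self.transform(image)                 # e.g. ToTensor()
            return image, label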

    Pages 171-180 Summary: Data Visualization, Data Augmentation Techniques, and Implementing Transformations

    The sources reinforce the importance of data visualization as an integral part of building a custom dataset. They provide code examples for creating a function that displays random images from the dataset along with their corresponding labels. This visual inspection helps ensure that the images are loaded correctly, the labels are accurate, and the data is appropriately preprocessed.

    They further explore data augmentation techniques, highlighting their significance in enhancing model performance and generalization. They demonstrate the implementation of various augmentation methods, including random horizontal flipping, random cropping, and color jittering, using torchvision.transforms. These augmentations introduce variations in the training images, artificially expanding the dataset and helping the model learn more robust features.

    The sources introduce the TrivialAugment technique, a data augmentation strategy that leverages randomness to apply a series of transformations to images, promoting diversity in the training data. They provide code examples for implementing TrivialAugment using torchvision.transforms and showcase its impact on the visual appearance of the images. They suggest experimenting with different augmentation strategies and visualizing their effects to understand their impact on the dataset.

    Pages 181-190 Summary: Building a TinyVGG Model and Evaluating its Performance

    The sources guide you through building a TinyVGG model architecture, a simplified version of the VGG convolutional neural network architecture. They demonstrate the step-by-step implementation of the model’s layers, including convolutional layers, ReLU activation functions, and max-pooling layers, using torch.nn modules. They use the CNN Explainer website as a visual reference for the TinyVGG architecture and encourage exploration of this resource to gain a deeper understanding of the model’s structure and operations.

    The sources introduce the torchinfo package, a helpful tool for summarizing the structure and parameters of a PyTorch model. They demonstrate its usage for the TinyVGG model, providing a clear representation of the input and output shapes of each layer, the number of parameters in each layer, and the overall model size. This information helps in verifying the model’s architecture and understanding its computational complexity.
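
    Usage is a one-liner; the input_size below assumes a batch of 32 RGB 64x64 images and a model variable that is already defined:

    from torchinfo import summary

    summary(model, input_size=(32, 3, 64, 64))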

    They walk through the process of evaluating the TinyVGG model’s performance on the FoodVision Mini dataset, covering the steps involved in setting up data loaders, defining a training loop, and calculating metrics like loss and accuracy. They emphasize the importance of monitoring training progress through visualization techniques like loss curves, plotting the loss value over epochs to observe the model’s learning trajectory and identify potential issues like overfitting.

    Pages 191-200 Summary: Implementing Training and Testing Steps, and Setting Up a Training Loop

    The sources guide you through the implementation of separate functions for the training step and testing step of the model training process. These functions encapsulate the logic for processing a single batch of data during training and testing, respectively.

    The train_step function, as described in the sources, performs the following actions:

    1. Forward pass: Passes the input batch through the model to obtain predictions.
    2. Loss calculation: Computes the loss between the predictions and the ground truth labels.
    3. Backpropagation: Calculates the gradients of the loss with respect to the model’s parameters.
    4. Optimizer step: Updates the model’s parameters based on the calculated gradients to minimize the loss.

    The test_step function is similar to the training step, but it omits the backpropagation and optimizer step since the goal during testing is to evaluate the model’s performance on unseen data without updating its parameters.
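
    A sketch of what such a train_step function might look like, assuming a device variable is already set:

    def train_step(model, dataloader, loss_fn, optimizer, device):
        model.train()
        train_loss = 0
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            y_pred = model(X)             # 1. forward pass
            loss = loss_fn(y_pred, y)     # 2. calculate the loss
            train_loss += loss.item()
            optimizer.zero_grad()
            loss.backward()               # 3. backpropagation
            optimizer.step()              # 4. optimizer step
        return train_loss / len(dataloader)  # average loss per batch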

    The sources then demonstrate how to integrate these functions into a training loop. This loop iterates over the specified number of epochs, processing the training data in batches. For each epoch, the loop performs the following steps:

    1. Training phase: Calls the train_step function for each batch of training data, updating the model’s parameters.
    2. Testing phase: Calls the test_step function for each batch of testing data, evaluating the model’s performance on unseen data.

    The sources emphasize the importance of monitoring training progress by tracking metrics like loss and accuracy during both the training and testing phases. This allows you to observe how well the model is learning and identify potential issues like overfitting.

    Pages 201-210 Summary: Visualizing Model Predictions and Exploring the Concept of Transfer Learning

    The sources emphasize the value of visualizing the model’s predictions to gain insights into its performance and identify potential areas for improvement. They guide you through the process of making predictions on a set of test images and displaying the images along with their predicted and actual labels. This visual assessment helps you understand how well the model is generalizing to unseen data and can reveal patterns in the model’s errors.

    They introduce the concept of transfer learning, a powerful technique in deep learning where you leverage knowledge gained from training a model on a large dataset to improve the performance of a model on a different but related task. The sources suggest exploring the torchvision.models module, which provides a collection of pre-trained models for various computer vision tasks. They highlight that these pre-trained models can be used as a starting point for your own models, either by fine-tuning the entire model or using parts of it as feature extractors.

    They provide an overview of how to load pre-trained models from the torchvision.models module and modify their architecture to suit your specific task. The sources encourage experimentation with different pre-trained models and fine-tuning strategies to achieve optimal performance on your custom dataset.

    Pages 211-310 Summary: Fine-Tuning a Pre-trained ResNet Model, Multi-Class Classification, and Exploring Binary vs. Multi-Class Problems

    The sources shift focus to fine-tuning a pre-trained ResNet model for the FoodVision Mini dataset. They highlight the advantages of using a pre-trained model, such as faster training and potentially better performance due to leveraging knowledge learned from a larger dataset. The sources guide you through:

    1. Loading a pre-trained ResNet model: They show how to use the torchvision.models module to load a pre-trained ResNet model, such as ResNet18 or ResNet34.
    2. Modifying the final fully connected layer: To adapt the model to the FoodVision Mini dataset, the sources demonstrate how to change the output size of the final fully connected layer to match the number of classes in the dataset (3 in this case).
    3. Freezing the initial layers: The sources discuss the strategy of freezing the weights of the initial layers of the pre-trained model to preserve the learned features from the larger dataset. This helps prevent catastrophic forgetting, where the model loses its previously acquired knowledge during fine-tuning.
    4. Training the modified model: They provide instructions for training the fine-tuned model on the FoodVision Mini dataset, emphasizing the importance of monitoring training progress and evaluating the model’s performance.
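
    A hedged sketch of steps 1–3 above (freezing is done before replacing the final layer so that only the new layer remains trainable; the weights API shown assumes torchvision >= 0.13, while older releases use pretrained=True):

    ```python
    from torch import nn
    import torchvision

    # 1. Load a ResNet18 pre-trained on ImageNet.
    model = torchvision.models.resnet18(
        weights=torchvision.models.ResNet18_Weights.DEFAULT
    )

    # 3. Freeze all existing layers to preserve the learned features.
    for param in model.parameters():
        param.requires_grad = False

    # 2. Replace the final fully connected layer to output 3 classes
    # (the number of classes in FoodVision Mini).
    model.fc = nn.Linear(in_features=model.fc.in_features, out_features=3)
    ```

    Only the newly created fc layer has requires_grad=True, so training updates just the classification head while the frozen backbone acts as a feature extractor.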

    The sources transition to discussing multi-class classification, explaining the distinction between binary classification (predicting between two classes) and multi-class classification (predicting among more than two classes). They provide examples of both types of classification problems:

    • Binary Classification: Identifying email as spam or not spam, classifying images as containing a cat or a dog.
    • Multi-class Classification: Categorizing images of different types of food, assigning topics to news articles, predicting the sentiment of a text review.

    They introduce the ImageNet dataset, a large-scale dataset for image classification with 1000 object classes, as an example of a multi-class classification problem. They highlight the use of the softmax activation function for multi-class classification, explaining its role in converting the model’s raw output (logits) into probability scores for each class.

    The sources guide you through building a neural network for a multi-class classification problem using PyTorch. They illustrate:

    1. Creating a multi-class dataset: They use the sklearn.datasets.make_blobs function to generate a synthetic dataset with multiple classes for demonstration purposes.
    2. Visualizing the dataset: The sources emphasize the importance of visualizing the dataset to understand its structure and distribution of classes.
    3. Building a neural network model: They walk through the steps of defining a neural network model with multiple layers and activation functions using torch.nn modules.
    4. Choosing a loss function: For multi-class classification, they introduce the cross-entropy loss function and explain its suitability for this type of problem.
    5. Setting up an optimizer: They discuss the use of optimizers, such as stochastic gradient descent (SGD), for updating the model’s parameters during training.
    6. Training the model: The sources provide instructions for training the multi-class classification model, highlighting the importance of monitoring training progress and evaluating the model’s performance.
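
    A compact sketch of steps 1–6 (the cluster count, layer sizes, and learning rate are illustrative assumptions; the visualization of step 2 is omitted here):

    ```python
    import torch
    from torch import nn
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import train_test_split

    # 1. Create and split a synthetic multi-class dataset (4 clusters).
    X, y = make_blobs(n_samples=1000, n_features=2, centers=4, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    X_train = torch.from_numpy(X_train).type(torch.float32)
    y_train = torch.from_numpy(y_train).type(torch.long)  # class indices

    # 3. A small model: 2 input features -> logits for 4 classes.
    model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 4))

    # 4-5. Cross-entropy loss and an SGD optimizer.
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # 6. Training: forward pass, loss, backpropagation, parameter update.
    for epoch in range(100):
        logits = model(X_train)
        loss = loss_fn(logits, y_train)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    ```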

    Pages 311-410 Summary: Building a Robust Training Loop, Working with Nonlinearities, and Performing Model Sanity Checks

    The sources guide you through building a more robust training loop for the multi-class classification problem, incorporating best practices like using a validation set for monitoring overfitting. They provide a detailed code implementation of the training loop, highlighting the key steps:

    1. Iterating over epochs: The loop iterates over a specified number of epochs, processing the training data in batches.
    2. Forward pass: For each batch, the input data is passed through the model to obtain predictions.
    3. Loss calculation: The loss between the predictions and the target labels is computed using the chosen loss function.
    4. Backward pass: The gradients of the loss with respect to the model’s parameters are calculated through backpropagation.
    5. Optimizer step: The optimizer updates the model’s parameters based on the calculated gradients.
    6. Validation: After each epoch, the model’s performance is evaluated on a separate validation set to monitor overfitting.

    The sources introduce the concept of nonlinearities in neural networks and explain the importance of activation functions in introducing non-linearity to the model. They discuss various activation functions, such as:

    • ReLU (Rectified Linear Unit): A popular activation function that sets negative values to zero and leaves positive values unchanged.
    • Sigmoid: An activation function that squashes the input values between 0 and 1, commonly used for binary classification problems.
    • Softmax: An activation function used for multi-class classification, producing a probability distribution over the different classes.

    They demonstrate how to incorporate these activation functions into the model architecture and explain their impact on the model’s ability to learn complex patterns in the data.
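
    For instance, applying each function directly to a small tensor shows its effect (the values are illustrative):

    ```python
    import torch

    x = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])

    # ReLU: negatives become zero, positives pass through unchanged.
    print(torch.relu(x))            # tensor([0., 0., 0., 1., 3.])

    # Sigmoid: squashes each value into (0, 1); common for binary outputs.
    print(torch.sigmoid(x))

    # Softmax: converts a vector of logits into probabilities summing to 1.
    print(torch.softmax(x, dim=0))
    ```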

    The sources stress the importance of performing model sanity checks to verify that the model is functioning correctly and learning as expected. They suggest techniques like:

    1. Testing on a simpler problem: Before training on the full dataset, the sources recommend testing the model on a simpler problem with known solutions to ensure that the model’s architecture and implementation are sound.
    2. Visualizing model predictions: Comparing the model’s predictions to the ground truth labels can help identify potential issues with the model’s learning process.
    3. Checking the loss function: Monitoring the loss value during training can provide insights into how well the model is optimizing its parameters.

    Pages 411-510 Summary: Exploring Multi-class Classification Metrics and Deep Diving into Convolutional Neural Networks

    The sources explore a range of multi-class classification metrics beyond accuracy, emphasizing that different metrics provide different perspectives on the model’s performance. They introduce:

    • Precision: A measure of the proportion of correctly predicted positive cases out of all positive predictions.
    • Recall: A measure of the proportion of correctly predicted positive cases out of all actual positive cases.
    • F1-score: A harmonic mean of precision and recall, providing a balanced measure of the model’s performance.
    • Confusion matrix: A visualization tool that shows the counts of true positive, true negative, false positive, and false negative predictions, providing a detailed breakdown of the model’s performance across different classes.

    They guide you through implementing these metrics using PyTorch and visualizing the confusion matrix to gain insights into the model’s strengths and weaknesses.
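
    One common way to compute these metrics is the torchmetrics package (an assumption here, since it is a separate install; the class count and tensors below are toy values):

    ```python
    import torch
    from torchmetrics import Accuracy, ConfusionMatrix, F1Score, Precision, Recall

    num_classes = 3  # e.g. FoodVision Mini's three classes

    preds = torch.tensor([0, 2, 1, 1, 0, 2])   # predicted class indices
    target = torch.tensor([0, 1, 1, 1, 0, 2])  # ground-truth labels

    accuracy = Accuracy(task="multiclass", num_classes=num_classes)
    precision = Precision(task="multiclass", num_classes=num_classes, average="macro")
    recall = Recall(task="multiclass", num_classes=num_classes, average="macro")
    f1 = F1Score(task="multiclass", num_classes=num_classes, average="macro")
    confmat = ConfusionMatrix(task="multiclass", num_classes=num_classes)

    print(accuracy(preds, target), f1(preds, target))
    print(precision(preds, target), recall(preds, target))
    print(confmat(preds, target))  # rows: true class, columns: predicted class
    ```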

    The sources transition to discussing convolutional neural networks (CNNs), a specialized type of neural network architecture well-suited for image classification tasks. They provide an in-depth explanation of the key components of a CNN, including:

    1. Convolutional layers: Layers that apply convolution operations to the input image, extracting features at different spatial scales.
    2. Activation functions: Functions like ReLU that introduce non-linearity to the model, enabling it to learn complex patterns.
    3. Pooling layers: Layers that downsample the feature maps, reducing the computational complexity and increasing the model’s robustness to variations in the input.
    4. Fully connected layers: Layers that connect all the features extracted by the convolutional and pooling layers, performing the final classification.

    They provide a visual explanation of the convolution operation, using the CNN Explainer website as a reference to illustrate how filters are applied to the input image to extract features. They discuss important hyperparameters of convolutional layers, such as:

    • Kernel size: The size of the filter used for the convolution operation.
    • Stride: The step size used to move the filter across the input image.
    • Padding: The technique of adding extra pixels around the borders of the input image to control the output size of the convolutional layer.
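
    A short sketch of how these hyperparameters change the output shape (sizes are illustrative):

    ```python
    import torch
    from torch import nn

    # One RGB image batch: (batch, channels, height, width).
    image = torch.randn(1, 3, 64, 64)

    # Output size per spatial dim: (in + 2*padding - kernel_size) // stride + 1
    same_size = nn.Conv2d(3, 10, kernel_size=3, stride=1, padding=1)
    print(same_size(image).shape)   # torch.Size([1, 10, 64, 64])

    downsample = nn.Conv2d(3, 10, kernel_size=3, stride=2, padding=0)
    print(downsample(image).shape)  # torch.Size([1, 10, 31, 31])
    ```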

    Pages 511-610 Summary: Building a CNN Model from Scratch and Understanding Convolutional Layers

    The sources provide a step-by-step guide to building a CNN model from scratch using PyTorch for the FoodVision Mini dataset. They walk through the process of defining the model architecture, including specifying the convolutional layers, activation functions, pooling layers, and fully connected layers. They emphasize the importance of carefully designing the model architecture to suit the specific characteristics of the dataset and the task at hand. They recommend starting with a simpler architecture and gradually increasing the model’s complexity if needed.

    They delve deeper into understanding convolutional layers, explaining how they work and their role in extracting features from images. They illustrate:

    1. Filters: Convolutional layers use filters (also known as kernels) to scan the input image, detecting patterns like edges, corners, and textures.
    2. Feature maps: The output of a convolutional layer is a set of feature maps, each representing the presence of a particular feature in the input image.
    3. Hyperparameters: They revisit the importance of hyperparameters like kernel size, stride, and padding in controlling the output size and feature extraction capabilities of convolutional layers.

    The sources guide you through experimenting with different hyperparameter settings for the convolutional layers, emphasizing the importance of understanding how these choices affect the model’s performance. They recommend using visualization techniques, such as displaying the feature maps generated by different convolutional layers, to gain insights into how the model is learning features from the data.

    The sources emphasize the iterative nature of the model development process, where you experiment with different architectures, hyperparameters, and training strategies to optimize the model’s performance. They recommend keeping track of the different experiments and their results to identify the most effective approaches.

    Pages 611-710 Summary: Understanding CNN Building Blocks, Implementing Max Pooling, and Building a TinyVGG Model

    The sources guide you through a deeper understanding of the fundamental building blocks of a convolutional neural network (CNN) for image classification. They highlight the importance of:

    • Convolutional Layers: These layers extract features from input images using learnable filters. They discuss the interplay of hyperparameters like kernel size, stride, and padding, emphasizing their role in shaping the output feature maps and controlling the network’s receptive field.
    • Activation Functions: Introducing non-linearity into the network is crucial for learning complex patterns. They revisit popular activation functions like ReLU (Rectified Linear Unit), which helps mitigate the vanishing gradient problem and speeds up training.
    • Pooling Layers: Pooling layers downsample feature maps, making the network more robust to variations in the input image while reducing computational complexity. They explain the concept of max pooling, where the maximum value within a pooling window is selected, preserving the most prominent features.

    The sources provide a detailed code implementation for max pooling using PyTorch’s torch.nn.MaxPool2d module, demonstrating how to apply it to the output of convolutional layers. They showcase how to calculate the output dimensions of the pooling layer based on the input size, stride, and pooling kernel size.
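
    A minimal example of that calculation (the 64x64 feature-map size is an assumption):

    ```python
    import torch
    from torch import nn

    feature_maps = torch.randn(1, 10, 64, 64)  # e.g. output of a conv layer

    # Max pooling keeps the largest value in each 2x2 window, halving the
    # spatial dims: out = (in - kernel_size) // stride + 1, where stride
    # defaults to kernel_size.
    max_pool = nn.MaxPool2d(kernel_size=2)
    print(max_pool(feature_maps).shape)  # torch.Size([1, 10, 32, 32])
    ```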

    Building on these foundational concepts, the sources guide you through the construction of a TinyVGG model, a simplified version of the popular VGG architecture known for its effectiveness in image classification tasks. They demonstrate how to define the network architecture using PyTorch, stacking convolutional layers, activation functions, and pooling layers to create a deep and hierarchical representation of the input image. They emphasize the importance of designing the network structure based on principles like increasing the number of filters in deeper layers to capture more complex features.

    The sources highlight the role of flattening the output of the convolutional layers before feeding it into fully connected layers, transforming the multi-dimensional feature maps into a one-dimensional vector. This transformation prepares the extracted features for the final classification task. They emphasize the importance of aligning the output size of the flattening operation with the input size of the subsequent fully connected layer.
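
    A hedged sketch of a TinyVGG-style model illustrating this flattening alignment (layer sizes and the 64x64 input are illustrative, not the book’s exact values):

    ```python
    import torch
    from torch import nn

    class TinyVGG(nn.Module):
        def __init__(self, in_channels: int, hidden_units: int, num_classes: int):
            super().__init__()
            self.block_1 = nn.Sequential(
                nn.Conv2d(in_channels, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2),  # 64x64 -> 32x32
            )
            self.block_2 = nn.Sequential(
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2),  # 32x32 -> 16x16
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                # The flattened size (hidden_units * 16 * 16) must match
                # this Linear layer's in_features.
                nn.Linear(hidden_units * 16 * 16, num_classes),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.classifier(self.block_2(self.block_1(x)))

    model = TinyVGG(in_channels=3, hidden_units=10, num_classes=3)
    print(model(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 3])
    ```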

    Pages 711-810 Summary: Training a TinyVGG Model, Addressing Overfitting, and Evaluating the Model

    The sources guide you through training the TinyVGG model on the FoodVision Mini dataset, emphasizing the importance of structuring the training process for optimal performance. They showcase a training loop that incorporates:

    • Data Loading: Using DataLoader from PyTorch to efficiently load and batch training data, shuffling the samples in each epoch to prevent the model from learning spurious patterns from the data order.
    • Device Agnostic Code: Writing code that can seamlessly switch between CPU and GPU devices for training and inference, making the code more flexible and adaptable to different hardware setups. A sketch of this pattern follows the list.
    • Forward Pass: Passing the input data through the model to obtain predictions, applying the softmax function to the output logits to obtain probabilities for each class.
    • Loss Calculation: Computing the loss between the model’s predictions and the ground truth labels using a suitable loss function, typically cross-entropy loss for multi-class classification tasks.
    • Backward Pass: Calculating gradients of the loss with respect to the model’s parameters using backpropagation, highlighting the importance of understanding this fundamental algorithm that allows neural networks to learn from data.
    • Optimization: Updating the model’s parameters using an optimizer like stochastic gradient descent (SGD) to minimize the loss and improve the model’s ability to make accurate predictions.
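
    The device-agnostic pattern referenced in the list above, as a minimal sketch:

    ```python
    import torch

    # Prefer the GPU when one is available, otherwise fall back to the CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Model and data must live on the same device before the forward pass.
    model = torch.nn.Linear(10, 3).to(device)
    batch = torch.randn(32, 10).to(device)
    print(model(batch).device)
    ```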

    The sources emphasize the importance of monitoring the training process to ensure the model is learning effectively and generalizing well to unseen data. They guide you through tracking metrics like training loss and accuracy across epochs, visualizing them to identify potential issues like overfitting, where the model performs well on the training data but struggles to generalize to new data.

    The sources address the problem of overfitting, suggesting techniques like:

    • Data Augmentation: Artificially increasing the diversity of the training data by applying random transformations to the images, such as rotations, flips, and color adjustments, making the model more robust to variations in the input.
    • Dropout: Randomly deactivating a proportion of neurons during training, forcing the network to learn more robust and generalizable features.

    The sources showcase how to implement these techniques in PyTorch, highlighting the importance of finding the right balance between overfitting and underfitting (the opposite failure mode, in which the model is too simple to capture the patterns in the data).
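
    A hedged sketch of both techniques (the specific transforms and the dropout rate are illustrative choices):

    ```python
    from torch import nn
    from torchvision import transforms

    # Data augmentation: random transforms applied to each training image.
    train_transform = transforms.Compose([
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomRotation(degrees=15),
        transforms.ColorJitter(brightness=0.2),
        transforms.ToTensor(),
    ])

    # Dropout: randomly zeroes 20% of activations while model.train() is
    # set; it is disabled automatically under model.eval().
    classifier = nn.Sequential(
        nn.Linear(256, 128),
        nn.ReLU(),
        nn.Dropout(p=0.2),
        nn.Linear(128, 3),
    )
    ```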

    The sources guide you through evaluating the trained model on the test set, measuring its performance using metrics like accuracy, precision, recall, and the F1-score. They emphasize the importance of using a separate test set, unseen during training, to assess the model’s ability to generalize to new data. They showcase how to generate a confusion matrix to visualize the model’s performance across different classes, identifying which classes the model struggles with the most.

    The sources provide insights into analyzing the confusion matrix to gain a deeper understanding of the model’s strengths and weaknesses, informing further improvements and refinements. They emphasize that evaluating a model is not merely about reporting a single accuracy score, but rather a multifaceted process of understanding its behavior and limitations.

    The main topic of the book, based on the provided excerpts, is deep learning with PyTorch. The book appears to function as a comprehensive course, designed to guide readers from foundational concepts to practical implementation, ultimately empowering them to build their own deep learning models.

    • The book begins by introducing fundamental concepts:
    • Machine Learning (ML) and Deep Learning (DL): The book establishes a clear understanding of these core concepts, explaining that DL is a subset of ML. [1-3] It emphasizes that DL is particularly well-suited for tasks involving complex patterns in large datasets. [1, 2]
    • PyTorch: The book highlights PyTorch as a popular and powerful framework for deep learning. [4, 5] It emphasizes the practical, hands-on nature of the course, encouraging readers to “see things happen” rather than getting bogged down in theoretical definitions. [1, 3, 6]
    • Tensors: The book underscores the role of tensors as the fundamental building blocks of data in deep learning, explaining how they represent data numerically for processing within neural networks. [5, 7, 8]
    • The book then transitions into the PyTorch workflow, outlining the key steps involved in building and training deep learning models:
    • Preparing and Loading Data: The book emphasizes the critical importance of data preparation, [9] highlighting techniques for loading, splitting, and visualizing data. [10-17]
    • Building Models: The book guides readers through the process of constructing neural network models in PyTorch, introducing key modules like torch.nn. [18-22] It covers essential concepts like:
    • Sub-classing nn.Module to define custom models [20]
    • Implementing the forward method to define the flow of data through the network [21, 22]
    • Training Models: The book details the training process, explaining:
    • Loss Functions: These measure how well the model is performing, guiding the optimization process. [23, 24]
    • Optimizers: These update the model’s parameters based on the calculated gradients, aiming to minimize the loss and improve accuracy. [25, 26]
    • Training Loops: These iterate through the data, performing forward and backward passes to update the model’s parameters. [26-29]
    • The Importance of Monitoring: The book stresses the need to track metrics like loss and accuracy during training to ensure the model is learning effectively and to diagnose issues like overfitting. [30-32]
    • Evaluating Models: The book explains techniques for evaluating the performance of trained models on a separate test set, unseen during training. [15, 30, 33] It introduces metrics like accuracy, precision, recall, and the F1-score to assess model performance. [34, 35]
    • Saving and Loading Models: The book provides instructions on how to save trained models and load them for later use, preserving the model’s learned parameters. [36-39]
    • Beyond the foundational workflow, the book explores specific applications of deep learning:
    • Classification: The book dedicates significant attention to classification problems, which involve categorizing data into predefined classes. [40-42] It covers:
    • Binary Classification: Distinguishing between two classes (e.g., spam or not spam) [41, 43]
    • Multi-Class Classification: Categorizing into more than two classes (e.g., different types of images) [41, 43]
    • Computer Vision: The book dives into the world of computer vision, which focuses on enabling computers to “see” and interpret images. [44, 45] It introduces:
    • Convolutional Neural Networks (CNNs): Specialized architectures designed to effectively process image data. [44-46]
    • Torchvision: PyTorch’s library specifically designed for computer vision tasks. [47]
    • Throughout the book, there is a strong emphasis on practical implementation, with:
    • Coding Examples: The book uses plentiful code snippets to illustrate concepts and techniques, encouraging readers to experiment and learn by doing. [1, 3, 6, 48, 49]
    • Exercises: The book provides exercises at the end of each section to reinforce learning and challenge readers to apply their knowledge. [50-64]
    • Troubleshooting Tips: The book offers practical advice on how to debug code, interpret errors, and find solutions to common problems. [65-78]
    • The book encourages continuous learning beyond the course material:
    • Extracurricular Resources: The book consistently points readers to additional resources, such as the official PyTorch documentation, research papers, and blog posts, to deepen their understanding and explore more advanced topics. [7, 8, 19, 35, 37, 39, 55, 57-59, 61, 63, 64, 71, 79-94]
    • The book strongly advocates for a visual and experimental approach to deep learning:
    • Visualization: The book frequently encourages visualizing data and model outputs to gain insights and understand complex concepts. [14, 16, 17, 95-101]
    • Experimentation: The book encourages readers to actively experiment with code, adjust parameters, and observe the effects to develop a deeper understanding. [32, 51, 53, 65, 102, 103]

    In essence, the book provides a guided journey into the world of deep learning with PyTorch. It doesn’t just explain concepts—it empowers readers to build, train, and evaluate their own deep learning models. It does this by combining clear explanations, engaging code examples, practical exercises, and insightful references to supplementary resources, creating a comprehensive and stimulating learning experience.

    Python’s Role in Deep Learning and Machine Learning with PyTorch

    The sources strongly emphasize the use of Python as the primary programming language for working with PyTorch, a popular deep learning framework. While the sources focus on teaching PyTorch itself, they highlight Python’s integral role in making deep learning accessible and efficient. Here’s a discussion of Python’s significance, drawing upon insights from the sources:

    • Python’s User-Friendliness: The sources mention that PyTorch “allows you to write fast deep learning code in Python” [1], implying that Python’s readability and ease of use make it suitable for expressing complex deep learning concepts in code. This is particularly beneficial for beginners with some Python experience (3-6 months is suggested) [2]. Python’s clear syntax and extensive libraries simplify the process of translating mathematical ideas into working deep learning models.
    • PyTorch’s Pythonic Nature: The sources highlight how PyTorch seamlessly integrates with Python. Phrases like “beautifully interactive with pure Python” [3] and “testament to the Pythonic nature of PyTorch” [3] suggest that PyTorch leverages Python’s strengths, allowing developers to work with familiar Python constructs like functions and classes while building deep learning models. This tight integration enhances flexibility and allows for the incorporation of custom Python code directly into PyTorch workflows, as seen in the creation of custom accuracy functions [3].
    • Integration with the Scientific Python Ecosystem: The sources mention the interaction between PyTorch and NumPy, a core library in the scientific Python ecosystem. PyTorch provides methods for converting data between NumPy arrays and PyTorch tensors [4, 5]. This interoperability signifies that PyTorch doesn’t exist in isolation. It leverages the rich ecosystem of scientific computing tools available in Python, allowing developers to utilize familiar libraries for tasks like data manipulation, visualization, and analysis in conjunction with PyTorch’s deep learning capabilities. A short conversion example follows this list.
    • Python as the Foundation: The sources consistently present code examples in Python, using Python syntax and constructs to demonstrate PyTorch concepts. This pervasive use of Python throughout the course material reinforces its position as the foundational language for interacting with PyTorch.
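
    The NumPy conversion mentioned above, in a short example (note that torch.from_numpy shares memory with the source array, so edits to one affect the other):

    ```python
    import numpy as np
    import torch

    array = np.array([1.0, 2.0, 3.0])
    tensor = torch.from_numpy(array)  # NumPy array -> PyTorch tensor
    back = tensor.numpy()             # PyTorch tensor -> NumPy array
    print(tensor, back)
    ```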

    The sources, while focusing specifically on PyTorch, make it evident that Python plays a critical role in enabling deep learning. Python’s user-friendliness, its seamless integration with PyTorch, and its connection to a broader ecosystem of scientific computing tools contribute to PyTorch’s popularity and make deep learning more accessible to a wider audience.

    Note: It is worth noting that while Python is dominant in the deep learning landscape, other languages, such as C++ and R, are also used for deep learning. This information is not directly derived from the sources and may warrant independent verification.

    Here is a detailed summary of the book, broken down into sets of 10 pages, based on the provided excerpts. Keep in mind that these excerpts do not represent the full text of the book, so this summary provides insights into the covered topics and the overall teaching style.

    Pages 1-10: Laying the Foundation: Machine Learning, Deep Learning, and PyTorch

    • Introduction (Pages 1-2): These pages set the stage for the course, highlighting the importance of hands-on learning and experimentation with PyTorch. They emphasize the availability of course materials on GitHub and through the online book version at learnpytorch.io. It is also stated that the book may contain more content than is covered in the video transcript.
    • Understanding Deep Learning (Pages 3-6): The book provides a concise overview of machine learning (ML) and deep learning (DL), emphasizing DL’s ability to handle complex patterns in large datasets. It suggests focusing on practical implementation rather than dwelling on detailed definitions, as these can be easily accessed online. The importance of considering simpler, rule-based solutions before resorting to ML is also stressed.
    • Embracing Self-Learning (Pages 6-7): The book encourages active learning by suggesting readers explore topics like deep learning and neural networks independently, utilizing resources such as Wikipedia and specific YouTube channels like 3Blue1Brown. It stresses the value of forming your own understanding by consulting multiple sources and synthesizing information.
    • Introducing PyTorch (Pages 8-10): PyTorch is introduced as a prominent deep learning framework, particularly popular in research. Its Pythonic nature is highlighted, making it efficient for writing deep learning code. The book directs readers to the official PyTorch documentation as a primary resource for exploring the framework’s capabilities.

    Pages 11-20: PyTorch Fundamentals: Tensors, Operations, and More

    • Getting Specific (Pages 11-12): The book emphasizes a hands-on approach, encouraging readers to explore concepts like tensors through online searches and coding experimentation. It highlights the importance of asking questions and actively engaging with the material rather than passively following along. The inclusion of exercises at the end of each module is mentioned to reinforce understanding.
    • Learning Through Doing (Pages 12-14): The book emphasizes the importance of active learning through:
    • Asking questions of yourself, the code, the community, and online resources.
    • Completing the exercises provided to test knowledge and solidify understanding.
    • Sharing your work to reinforce learning and contribute to the community.
    • Avoiding Overthinking (Page 13): A key piece of advice is to avoid getting overwhelmed by the complexity of the subject. Starting with a clear understanding of the fundamentals and building upon them gradually is encouraged.
    • Course Resources (Pages 14-17): The book reiterates the availability of course materials:
    • GitHub repository: Containing code and other resources.
    • GitHub discussions: A platform for asking questions and engaging with the community.
    • learnpytorch.io: The online book version of the course.
    • Tensors in Action (Pages 17-20): The book dives into PyTorch tensors, explaining their creation using torch.tensor and referencing the official documentation for further exploration. It demonstrates basic tensor operations, emphasizing that writing code and interacting with tensors is the best way to grasp their functionality. The use of the torch.arange function is introduced to create tensors with specific ranges and step sizes.
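
    For example (outputs shown in comments):

    ```python
    import torch

    # torch.arange(start, end, step) builds a 1-D tensor over a range.
    print(torch.arange(0, 10, 2))        # tensor([0, 2, 4, 6, 8])
    print(torch.arange(1.0, 2.0, 0.25))  # tensor([1.0000, 1.2500, 1.5000, 1.7500])
    ```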

    Pages 21-30: Understanding PyTorch’s Data Loading and Workflow

    • Tensor Manipulation and Stacking (Pages 21-22): The book covers tensor manipulation techniques, including permuting dimensions (e.g., rearranging color channels, height, and width in an image tensor). The torch.stack function is introduced to concatenate tensors along a new dimension. The concept of a pseudo-random number generator and the role of a random seed are briefly touched upon, referencing the PyTorch documentation for a deeper understanding. These operations are sketched in code after this list.
    • Running Tensors on Devices (Pages 22-23): The book mentions the concept of running PyTorch tensors on different devices, such as CPUs and GPUs, although the details of this are not provided in the excerpts.
    • Exercises and Extra Curriculum (Pages 23-27): The importance of practicing concepts through exercises is highlighted, and the book encourages readers to refer to the PyTorch documentation for deeper understanding. It provides guidance on how to approach exercises using Google Colab alongside the book material. The book also points out the availability of solution templates and a dedicated folder for exercise solutions.
    • PyTorch Workflow in Action (Pages 28-31): The book begins exploring a complete PyTorch workflow, emphasizing a code-driven approach with explanations interwoven as needed. A six-step workflow is outlined:
    1. Data preparation and loading
    2. Building a machine learning/deep learning model
    3. Fitting the model to data
    4. Making predictions
    5. Evaluating the model
    6. Saving and loading the model
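
    The tensor-manipulation operations referenced in the first bullet above, in a short sketch:

    ```python
    import torch

    # Permute: reorder dimensions, e.g. (height, width, channels) ->
    # (channels, height, width) for an image tensor.
    image = torch.randn(224, 224, 3)
    print(image.permute(2, 0, 1).shape)  # torch.Size([3, 224, 224])

    # Stack: concatenate tensors along a NEW dimension.
    a = torch.tensor([1, 2, 3])
    print(torch.stack([a, a, a], dim=0).shape)  # torch.Size([3, 3])

    # Reproducibility: seed the pseudo-random number generator.
    torch.manual_seed(42)
    ```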

    Pages 31-40: Data Preparation, Linear Regression, and Visualization

    • The Two Parts of Machine Learning (Pages 31-33): The book breaks down machine learning into two fundamental parts:
    • Representing Data Numerically: Converting data into a format suitable for models to process.
    • Building a Model to Learn Patterns: Training a model to identify relationships within the numerical representation.
    • Linear Regression Example (Pages 33-35): The book uses a linear regression example (y = a + bx) to illustrate the relationship between data and model parameters. It encourages a hands-on approach by coding the formula, emphasizing that coding helps solidify understanding compared to simply reading formulas. A sketch of this setup follows the list.
    • Visualizing Data (Pages 35-40): The book underscores the importance of data visualization using Matplotlib, adhering to the “visualize, visualize, visualize” motto. It provides code for plotting data, highlighting the use of scatter plots and the importance of consulting the Matplotlib documentation for detailed information on plotting functions. It guides readers through the process of creating plots, setting figure sizes, plotting training and test data, and customizing plot elements like colors, markers, and labels.
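
    The linear regression setup referenced above might be coded like this (the parameter values and the 80/20 split are illustrative):

    ```python
    import torch

    # Known parameters for y = a + b*x; a model's job is to recover them.
    a, b = 0.3, 0.7

    X = torch.arange(0, 1, 0.02).unsqueeze(dim=1)  # 50 input samples
    y = a + b * X                                  # labels from the formula

    # An 80/20 train/test split for the plots described above.
    split = int(0.8 * len(X))
    X_train, y_train = X[:split], y[:split]
    X_test, y_test = X[split:], y[split:]
    ```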

    Pages 41-50: Model Building Essentials and Inference

    • Color-Coding and PyTorch Modules (Pages 41-42): The book uses color-coding in the online version to enhance visual clarity. It also highlights essential PyTorch modules for data preparation, model building, optimization, evaluation, and experimentation, directing readers to the learnpytorch.io book and the PyTorch documentation.
    • Model Predictions (Pages 42-43): The book emphasizes the process of making predictions using a trained model, noting the expectation that an ideal model would accurately predict output values based on input data. It introduces the concept of “inference mode,” which can enhance code performance during prediction. A Twitter thread and a blog post on PyTorch’s inference mode are referenced for further exploration.
    • Understanding Loss Functions (Pages 44-47): The book dives into loss functions, emphasizing their role in measuring the discrepancy between a model’s predictions and the ideal outputs. It clarifies that loss functions can also be referred to as cost functions or criteria in different contexts. A table in the book outlines various loss functions in PyTorch, providing common values and links to documentation. The concept of Mean Absolute Error (MAE) and the L1 loss function are introduced, with encouragement to explore other loss functions in the documentation. A short MAE example follows this list.
    • Understanding Optimizers and Hyperparameters (Pages 48-50): The book explains optimizers, which adjust model parameters based on the calculated loss, with the goal of minimizing the loss over time. The distinction between parameters (values learned by the model during training) and hyperparameters (values set by the data scientist) is made. The learning rate, a crucial hyperparameter controlling the step size of the optimizer, is introduced. The process of minimizing loss within a training loop is outlined, emphasizing the iterative nature of adjusting weights and biases.
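
    The MAE example referenced above, as a short sketch:

    ```python
    import torch
    from torch import nn

    y_pred = torch.tensor([2.5, 0.0, 2.0])
    y_true = torch.tensor([3.0, -0.5, 2.0])

    # Mean Absolute Error (L1 loss): the average absolute difference
    # between predictions and targets.
    loss_fn = nn.L1Loss()
    print(loss_fn(y_pred, y_true))  # tensor(0.3333)
    ```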

    Pages 51-60: Training Loops, Saving Models, and Recap

    • Putting It All Together: The Training Loop (Pages 51-53): The book assembles the previously discussed concepts into a training loop, demonstrating the iterative process of updating a model’s parameters over multiple epochs. It shows how to track and print loss values during training, illustrating the gradual reduction of loss as the model learns. The convergence of weights and biases towards ideal values is shown as a sign of successful training.
    • Saving and Loading Models (Pages 53-56): The book explains the process of saving trained models, preserving learned parameters for later use. The concept of a “state dict,” a Python dictionary mapping layers to their parameter tensors, is introduced. The use of torch.save and torch.load for saving and loading models is demonstrated. The book also references the PyTorch documentation for more detailed information on saving and loading models. A minimal save/load sketch follows this list.
    • Wrapping Up the Fundamentals (Pages 57-60): The book concludes the section on PyTorch workflow fundamentals, reiterating the key steps:
    • Getting data ready
    • Converting data to tensors
    • Building or selecting a model
    • Choosing a loss function and an optimizer
    • Training the model
    • Evaluating the model
    • Saving and loading the model
    • Exercises and Resources (Pages 57-60): The book provides exercises focused on the concepts covered in the section, encouraging readers to practice implementing a linear regression model from scratch. A variety of extracurricular resources are listed, including links to articles on gradient descent, backpropagation, loading and saving models, a PyTorch cheat sheet, and the unofficial PyTorch optimization loop song. The book directs readers to the extras folder in the GitHub repository for exercise templates and solutions.
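
    The save/load pattern referenced above, as a minimal sketch (the file name is arbitrary):

    ```python
    import torch
    from torch import nn

    model = nn.Linear(2, 1)

    # The state dict maps each layer to its parameter tensors; saving it
    # preserves the learned parameters without pickling the whole object.
    torch.save(model.state_dict(), "model.pth")

    # Loading: instantiate the same architecture, then load the parameters.
    loaded_model = nn.Linear(2, 1)
    loaded_model.load_state_dict(torch.load("model.pth"))
    ```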

    This breakdown of the first 60 pages, based on the excerpts provided, reveals the book’s structured and engaging approach to teaching deep learning with PyTorch. It balances conceptual explanations with hands-on coding examples, exercises, and references to external resources. The book emphasizes experimentation and active learning, encouraging readers to move beyond passive reading and truly grasp the material by interacting with code and exploring concepts independently.

    Note: Please keep in mind that this summary only covers the content found within the provided excerpts, which may not represent the entirety of the book.

    Pages 61-70: Multi-Class Classification and Building a Neural Network

    • Multi-Class Classification (Pages 61-63): The book introduces multi-class classification, where a model predicts one out of multiple possible classes. It shifts from the linear regression example to a new task involving a data set with four distinct classes. It also highlights the use of one-hot encoding to represent categorical data numerically, and emphasizes the importance of understanding the problem domain and using appropriate data representations for a given task.
    • Preparing Data (Pages 63-64): The sources demonstrate the creation of a multi-class data set. The book uses scikit-learn’s make_blobs function (from sklearn.datasets) to generate synthetic data points representing four classes, each with its own color. It emphasizes the importance of visualizing the generated data and confirming that it aligns with the desired structure. The train_test_split function is used to divide the data into training and testing sets.
    • Building a Neural Network (Pages 64-66): The book starts building a neural network model using PyTorch’s nn.Module class, showing how to define layers and connect them in a sequential manner. It provides a step-by-step explanation of the process:
    1. Initialization: Defining the model class with layers and computations.
    2. Input Layer: Specifying the number of features for the input layer based on the data set.
    3. Hidden Layers: Creating hidden layers and determining their input and output sizes.
    4. Output Layer: Defining the output layer with a size corresponding to the number of classes.
    5. Forward Method: Implementing the forward pass, where data flows through the network.
    • Matching Shapes (Pages 67-70): The book emphasizes the crucial concept of shape compatibility between layers. It shows how to calculate output shapes based on input shapes and layer parameters. It explains that input shapes must align with the expected shapes of subsequent layers to ensure smooth data flow. The book also underscores the importance of code experimentation to confirm shape alignment. The sources specifically focus on checking that the output shape of the network matches the shape of the target values (y) for training.
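
    A hedged sketch of the five steps above, including the shape check (sizes are illustrative; note there are no activation functions yet, a limitation the following pages address):

    ```python
    import torch
    from torch import nn

    class BlobModel(nn.Module):
        def __init__(self, input_features: int, output_features: int,
                     hidden_units: int = 8):
            super().__init__()  # 1. initialization
            self.layer_1 = nn.Linear(input_features, hidden_units)   # 2. input
            self.layer_2 = nn.Linear(hidden_units, hidden_units)     # 3. hidden
            self.layer_3 = nn.Linear(hidden_units, output_features)  # 4. output

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # 5. forward pass: data flows through the layers in order.
            return self.layer_3(self.layer_2(self.layer_1(x)))

    # Shape check: 2 input features in, one logit per class (4) out.
    model = BlobModel(input_features=2, output_features=4)
    print(model(torch.randn(32, 2)).shape)  # torch.Size([32, 4])
    ```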

    Pages 71-80: Loss Functions and Activation Functions

    • Revisiting Loss Functions (Pages 71-73): The book revisits loss functions, now in the context of multi-class classification. It highlights that the choice of loss function depends on the specific problem type. The Mean Absolute Error (MAE), used for regression in previous examples, is not suitable for classification. Instead, the book introduces cross-entropy loss (nn.CrossEntropyLoss), emphasizing its suitability for classification tasks with multiple classes. It also mentions BCEWithLogitsLoss, another common loss function, used for binary classification problems.
    • The Role of Activation Functions (Pages 74-76): The book raises the concept of activation functions, hinting at their significance in model performance. The sources state that combining multiple linear layers in a neural network doesn’t increase model capacity because a series of linear transformations is still ultimately linear. This suggests that linear models might be limited in capturing complex, non-linear relationships in data.
    • Visualizing Limitations (Pages 76-78): The sources introduce the “Data Explorer’s Motto”: “Visualize, visualize, visualize!” This highlights the importance of visualization for understanding both data and model behavior. The book provides a visualization demonstrating the limitations of a linear model, showing its inability to accurately classify data with non-linear boundaries.
    • Exploring Nonlinearities (Pages 78-80): The sources pose the question, “What patterns could you draw if you were given an infinite amount of straight and non-straight lines?” This prompts readers to consider the expressive power of combining linear and non-linear components. The book then encourages exploring non-linear activation functions within the PyTorch documentation, specifically referencing torch.nn, and suggests trying to identify an activation function that has already been used in the examples. This interactive approach pushes learners to actively seek out information and connect concepts.

    Pages 81-90: Building and Training with Non-Linearity

    • Introducing ReLU (Pages 81-83): The sources emphasize the crucial role of non-linearity in neural network models, introducing the Rectified Linear Unit (ReLU) as a commonly used non-linear activation function. The book describes ReLU as a “magic piece of the puzzle,” highlighting its ability to add non-linearity to the model and enable the learning of more complex patterns. The sources again emphasize the importance of trying to draw various patterns using a combination of straight and curved lines to gain intuition about the impact of non-linearity.
    • Building with ReLU (Pages 83-87): The book guides readers through modifying the neural network model by adding ReLU activation functions between the existing linear layers. The placement of ReLU functions within the model architecture is shown. The sources suggest experimenting with the TensorFlow Playground, a web-based tool for visualizing neural networks, to recreate the model and observe the effects of ReLU on data separation.
    • Training the Enhanced Model (Pages 87-90): The book outlines the training process for the new model, utilizing familiar steps such as creating a loss function (BCEWithLogitsLoss in this case), setting up an optimizer (torch.optim.Adam), and defining training and evaluation loops. It demonstrates how to pass data through the model, calculate the loss, perform backpropagation, and update model parameters. The sources emphasize that even though the code structure is familiar, learners should strive to understand the underlying mechanisms and how they contribute to model training. It also suggests considering how the training code could be further optimized and modularized into functions for reusability.
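
    A condensed sketch of this setup (the toy data and hyperparameters are assumptions):

    ```python
    import torch
    from torch import nn

    # A binary classifier with ReLU between the linear layers.
    model = nn.Sequential(
        nn.Linear(2, 10), nn.ReLU(),
        nn.Linear(10, 10), nn.ReLU(),
        nn.Linear(10, 1),  # a single logit for binary classification
    )
    loss_fn = nn.BCEWithLogitsLoss()  # applies sigmoid internally
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

    X = torch.randn(100, 2)
    y = (X[:, 0] * X[:, 1] > 0).float().unsqueeze(1)  # toy non-linear labels

    for epoch in range(100):
        logits = model(X)          # forward pass
        loss = loss_fn(logits, y)  # loss on raw logits
        optimizer.zero_grad()
        loss.backward()            # backpropagation
        optimizer.step()           # parameter update
    ```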

    It’s important to remember that this information is based on the provided excerpts, and the book likely covers these topics and concepts in more depth. The book’s interactive approach, focusing on experimentation, code interaction, and visualization, encourages active engagement with the material, urging readers to explore, question, and discover rather than passively follow along.

    Continuing with Non-Linearity and Multi-Class Classification

    • Visualizing Non-Linearity (Pages 91-94): The sources emphasize the importance of visualizing the model’s performance after incorporating the ReLU activation function. They use a custom plotting function, plot_decision_boundary, to visually assess the model’s ability to separate the circular data. The visualization reveals a significant improvement compared to the linear model, demonstrating that ReLU enables the model to learn non-linear decision boundaries and achieve a better separation of the classes.
    • Pushing for Improvement (Pages 94-96): Even though the non-linear model shows improvement, the sources encourage continued experimentation to achieve even better performance. They challenge readers to improve the model’s accuracy on the test data to over 80%. This encourages an iterative approach to model development, where experimentation, analysis, and refinement are key. The sources suggest potential strategies, such as:
    • Adding more layers to the network
    • Increasing the number of hidden units
    • Training for a greater number of epochs
    • Adjusting the learning rate of the optimizer
    • Multi-Class Classification Revisited (Pages 96-99): The sources return to multi-class classification, moving beyond the binary classification example of the circular data. They introduce a new data set called “X BLOB,” which consists of data points belonging to three distinct classes. This shift introduces additional challenges in model building and training, requiring adjustments to the model architecture, loss function, and evaluation metrics.
    • Data Preparation and Model Building (Pages 99-102): The sources guide readers through preparing the X BLOB data set for training, using familiar steps such as splitting the data into training and testing sets and creating data loaders. The book emphasizes the importance of understanding the data set’s characteristics, such as the number of classes, and adjusting the model architecture accordingly. It also encourages experimentation with different model architectures, specifically referencing PyTorch’s torch.nn module, to find an appropriate model for the task. The TensorFlow Playground is again suggested as a tool for visualizing and experimenting with neural network architectures.

    The sources repeatedly emphasize the iterative and experimental nature of machine learning and deep learning, urging learners to actively engage with the code, explore different options, and visualize results to gain a deeper understanding of the concepts. This hands-on approach fosters a mindset of continuous learning and improvement, crucial for success in these fields.

    Building and Training with Non-Linearity: Pages 103-113

    • The Power of Non-Linearity (Pages 103-105): The sources continue emphasizing the crucial role of non-linearity in neural networks, highlighting its ability to capture complex patterns in data. The book states that neural networks combine linear and non-linear functions to find patterns in data. It reiterates that linear functions alone are limited in their expressive power and that non-linear functions, like ReLU, enable models to learn intricate decision boundaries and achieve better separation of classes. The sources encourage readers to experiment with different non-linear activation functions and observe their impact on model performance, reinforcing the idea that experimentation is essential in machine learning.
    • Multi-Class Model with Non-Linearity (Pages 105-108): Building upon the previous exploration, the sources guide readers through constructing a multi-class classification model with a non-linear activation function. The book provides a step-by-step breakdown of the model architecture, including:
    1. Input Layer: Takes in features from the data set, same as before.
    2. Hidden Layers: Incorporate linear transformations using PyTorch’s nn.Linear layers, just like in previous models.
    3. ReLU Activation: Introduces ReLU activation functions between the linear layers, adding non-linearity to the model.
    4. Output Layer: Produces a set of raw output values, also known as logits, corresponding to the number of classes.
    • Prediction Probabilities (Pages 108-110): The sources explain that the raw output logits from the model need to be converted into probabilities to interpret the model’s predictions. They introduce the torch.softmax function, which transforms the logits into a probability distribution over the classes, indicating the likelihood of each class for a given input. The book emphasizes that understanding the relationship between logits, probabilities, and model predictions is crucial for evaluating and interpreting model outputs. A short sketch of this conversion follows the list.
    • Training and Evaluation (Pages 110-111): The sources outline the training process for the multi-class model, utilizing familiar steps such as setting up a loss function (Cross-Entropy Loss is recommended for multi-class classification), defining an optimizer (torch.optim.SGD), creating training and testing loops, and evaluating the model’s performance using loss and accuracy metrics. The sources reiterate the importance of device-agnostic code, ensuring that the model and data reside on the same device (CPU or GPU) for seamless computation. They also encourage readers to experiment with different optimizers and hyperparameters, such as learning rate and batch size, to observe their effects on training dynamics and model performance.
    • Experimentation and Visualization (Pages 111-113): The sources strongly advocate for ongoing experimentation, urging readers to modify the model, adjust hyperparameters, and visualize results to gain insights into model behavior. They demonstrate how removing the ReLU activation function leads to a model with linear decision boundaries, resulting in a significant decrease in accuracy, highlighting the importance of non-linearity in capturing complex patterns. The sources also encourage readers to refer back to previous notebooks, experiment with different model architectures, and explore advanced visualization techniques to enhance their understanding of the concepts and improve model performance.
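
    The logits-to-predictions path referenced in the list above, in a short sketch:

    ```python
    import torch

    logits = torch.tensor([[2.0, 0.5, -1.0],
                           [0.1, 0.2, 3.0]])

    probs = torch.softmax(logits, dim=1)  # each row sums to 1
    preds = probs.argmax(dim=1)           # most likely class per sample
    print(probs)
    print(preds)  # tensor([0, 2])
    ```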

    The consistent theme across these sections is the value of active engagement and experimentation. The sources emphasize that learning in machine learning and deep learning is an iterative process. Readers are encouraged to question assumptions, try different approaches, visualize results, and continuously refine their models based on observations and experimentation. This hands-on approach is crucial for developing a deep understanding of the concepts and fostering the ability to apply these techniques to real-world problems.

    The Impact of Non-Linearity and Multi-Class Classification Challenges: Pages 113-116

    • Non-Linearity’s Impact on Model Performance: The sources examine the critical role non-linearity plays in a model’s ability to accurately classify data. They demonstrate this by training a model without the ReLU activation function, resulting in linear decision boundaries and significantly reduced accuracy. The visualizations provided highlight the stark difference between the model with ReLU and the one without, showcasing how non-linearity enables the model to capture the circular patterns in the data and achieve better separation between classes [1]. This emphasizes the importance of understanding how different activation functions contribute to a model’s capacity to learn complex relationships within data.
    • Understanding the Data and Model Relationship (Pages 115-116): The sources remind us that evaluating a model is as crucial as building one. They highlight the importance of becoming one with the data, both at the beginning and after training a model, to gain a deeper understanding of its behavior and performance. Analyzing the model’s predictions on the data helps identify potential issues, such as overfitting or underfitting, and guides further experimentation and refinement [2].
    • Key Takeaways: The sources reinforce several key concepts and best practices in machine learning and deep learning:
    • Visualize, Visualize, Visualize: Visualizing data and model predictions is crucial for understanding patterns, identifying potential issues, and guiding model development.
    • Experiment, Experiment, Experiment: Trying different approaches, adjusting hyperparameters, and iteratively refining models based on observations is essential for achieving optimal performance.
    • The Data Scientist’s/Machine Learning Practitioner’s Motto: Experimentation is at the heart of successful machine learning, encouraging continuous learning and improvement.
    • Steps in Modeling with PyTorch: The sources repeatedly reinforce a structured workflow for building and training models in PyTorch, emphasizing the importance of following a methodical approach to ensure consistency and reproducibility.

    The sources conclude this section by directing readers to a set of exercises and extra curriculum designed to solidify their understanding of non-linearity, multi-class classification, and the steps involved in building, training, and evaluating models in PyTorch. These resources provide valuable opportunities for hands-on practice and further exploration of the concepts covered. They also serve as a reminder that learning in these fields is an ongoing process that requires continuous engagement, experimentation, and a willingness to iterate and refine models based on observations and analysis [3].

    Continuing the Computer Vision Workflow: Pages 116-129

    • Introducing Computer Vision and CNNs: The sources introduce a new module focusing on computer vision and convolutional neural networks (CNNs). They acknowledge the excitement surrounding this topic and emphasize its importance as a core concept within deep learning. The sources also provide clear instructions on how to access help and resources if learners encounter challenges during the module, encouraging active engagement and a problem-solving mindset. They reiterate the motto of “if in doubt, run the code,” highlighting the value of practical experimentation. They also point to available resources, including the PyTorch Deep Learning repository, specific notebooks, and a dedicated discussions tab for questions and answers.
    • Understanding Custom Datasets: The sources explain the concept of custom datasets, recognizing that while pre-built datasets like FashionMNIST are valuable for learning, real-world applications often involve working with unique data. They acknowledge the potential need for custom data loading solutions when existing libraries don’t provide the necessary functionality. The sources introduce the idea of creating a custom PyTorch dataset class by subclassing torch.utils.data.Dataset and implementing specific methods to handle data loading and preparation tailored to the unique requirements of the custom dataset. A minimal sketch of such a class follows this list.
    • Building a Baseline Model (Pages 118-120): The sources guide readers through building a baseline computer vision model using PyTorch. They emphasize the importance of understanding the input and output shapes to ensure the model is appropriately configured for the task. The sources also introduce the concept of creating a dummy forward pass to check the model’s functionality and verify the alignment of input and output dimensions.
    • Training the Baseline Model (Pages 120-125): The sources step through the process of training the baseline computer vision model. They provide a comprehensive breakdown of the code, including the use of a progress bar for tracking training progress. The steps highlighted include:
    1. Setting up the training loop: Iterating through epochs and batches of data
    2. Performing the forward pass: Passing data through the model to obtain predictions
    3. Calculating the loss: Measuring the difference between predictions and ground truth labels
    4. Backpropagation: Calculating gradients to update model parameters
    5. Updating model parameters: Using the optimizer to adjust weights based on calculated gradients
    • Evaluating Model Performance (Pages 126-128): The sources stress the importance of comprehensive evaluation, going beyond simple loss and accuracy metrics. They introduce techniques like plotting loss curves to visualize training dynamics and gain insights into model behavior. The sources also emphasize the value of experimentation, encouraging readers to explore the impact of different devices (CPU vs. GPU) on training time and performance.
    • Improving Through Experimentation: The sources encourage ongoing experimentation to improve model performance. They introduce the idea of building a better model with non-linearity, suggesting the inclusion of activation functions like ReLU. They challenge readers to try building such a model and experiment with different configurations to observe their impact on results.
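
    The custom dataset pattern referenced in the second bullet above, in a minimal sketch (the class name and toy data are hypothetical):

    ```python
    import torch
    from torch.utils.data import Dataset

    class CustomImageDataset(Dataset):
        """Subclass Dataset and implement __len__ and __getitem__."""

        def __init__(self, samples, transform=None):
            self.samples = samples    # e.g. a list of (image, label) pairs
            self.transform = transform

        def __len__(self) -> int:
            return len(self.samples)

        def __getitem__(self, index):
            image, label = self.samples[index]
            if self.transform:
                image = self.transform(image)
            return image, label

    data = [(torch.randn(3, 64, 64), 0), (torch.randn(3, 64, 64), 1)]
    dataset = CustomImageDataset(data)
    print(len(dataset), dataset[0][1])  # 2 0
    ```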

    The sources maintain their consistent focus on hands-on learning, guiding readers through each step of building, training, and evaluating computer vision models using PyTorch. They emphasize the importance of understanding the underlying concepts while actively engaging with the code, trying different approaches, and visualizing results to gain deeper insights and build practical experience.

    Functionizing Code for Efficiency and Readability: Pages 129-139

    • The Benefits of Functionizing Training and Evaluation Loops: The sources introduce the concept of functionizing code, specifically focusing on training and evaluation (testing) loops in PyTorch. They explain that writing reusable functions for these repetitive tasks brings several advantages:
    • Improved code organization and readability: Breaking down complex processes into smaller, modular functions enhances the overall structure and clarity of the code. This makes it easier to understand, maintain, and modify in the future.
    • Reduced errors: Encapsulating common operations within functions helps prevent inconsistencies and errors that can arise from repeatedly writing similar code blocks.
    • Increased efficiency: Reusable functions streamline the development process by eliminating the need to rewrite the same code for different models or datasets.
    • Creating the train_step Function (Pages 130-132): The sources guide readers through creating a function called train_step that encapsulates the logic of a single training step within a PyTorch training loop. The function takes several arguments:
    • model: The PyTorch model to be trained
    • data_loader: The data loader providing batches of training data
    • loss_function: The loss function used to calculate the training loss
    • optimizer: The optimizer responsible for updating model parameters
    • accuracy_function: A function for calculating the accuracy of the model’s predictions
    • device: The device (CPU or GPU) on which to perform the computations
    • The train_step function performs the following steps for each batch of training data:
    1. Sets the model to training mode using model.train()
    2. Sends the input data and labels to the specified device
    3. Performs the forward pass by passing the data through the model
    4. Calculates the loss using the provided loss function
    5. Performs backpropagation to calculate gradients
    6. Updates model parameters using the optimizer
    7. Calculates and accumulates the training loss and accuracy for the batch
    • Creating the test_step Function (Pages 132-136): The sources proceed to create a function called test_step that performs a single evaluation step on a batch of testing data. This function follows a similar structure to train_step, but with key differences:
    • It sets the model to evaluation mode using model.eval() to disable behaviors specific to training, such as dropout.
    • It utilizes the torch.inference_mode() context manager to potentially optimize computations for inference tasks, aiming for speed improvements.
    • It calculates and accumulates the testing loss and accuracy for the batch without updating the model’s parameters.
    • Combining train_step and test_step into a train Function (Pages 137-139): The sources combine the functionality of train_step and test_step into a single function called train, which orchestrates the entire training and evaluation process over a specified number of epochs. The train function takes arguments similar to train_step and test_step, including the number of epochs to train for. It iterates through the specified epochs, calling train_step for each batch of training data and test_step for each batch of testing data. It tracks and prints the training and testing loss and accuracy for each epoch, providing a clear view of the model’s progress during training.
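
    A condensed sketch of these three functions is shown below. It assumes a multi-class setup in which accuracy_function compares predicted class labels against ground-truth labels; the exact signatures in the course materials may differ:

    ```python
    import torch

    def train_step(model, data_loader, loss_function, optimizer, accuracy_function, device):
        model.train()                                 # 1. training mode
        train_loss, train_acc = 0, 0
        for X, y in data_loader:
            X, y = X.to(device), y.to(device)         # 2. send data to device
            y_pred = model(X)                         # 3. forward pass
            loss = loss_function(y_pred, y)           # 4. calculate loss
            optimizer.zero_grad()
            loss.backward()                           # 5. backpropagation
            optimizer.step()                          # 6. update parameters
            train_loss += loss.item()                 # 7. accumulate metrics
            train_acc += accuracy_function(y_pred.argmax(dim=1), y)
        return train_loss / len(data_loader), train_acc / len(data_loader)

    def test_step(model, data_loader, loss_function, accuracy_function, device):
        model.eval()                                  # disable training-specific behavior
        test_loss, test_acc = 0, 0
        with torch.inference_mode():                  # no gradient tracking during evaluation
            for X, y in data_loader:
                X, y = X.to(device), y.to(device)
                y_pred = model(X)
                test_loss += loss_function(y_pred, y).item()
                test_acc += accuracy_function(y_pred.argmax(dim=1), y)
        return test_loss / len(data_loader), test_acc / len(data_loader)

    def train(model, train_loader, test_loader, loss_function, optimizer,
              accuracy_function, device, epochs):
        for epoch in range(epochs):
            train_loss, train_acc = train_step(model, train_loader, loss_function,
                                               optimizer, accuracy_function, device)
            test_loss, test_acc = test_step(model, test_loader, loss_function,
                                            accuracy_function, device)
            print(f"Epoch {epoch} | train loss: {train_loss:.4f} | train acc: {train_acc:.4f} | "
                  f"test loss: {test_loss:.4f} | test acc: {test_acc:.4f}")
    ```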

    By encapsulating the training and evaluation logic into these functions, the sources demonstrate best practices in PyTorch code development, emphasizing modularity, readability, and efficiency. This approach makes it easier to experiment with different models, datasets, and hyperparameters while maintaining a structured and manageable codebase.

    Leveraging Functions for Model Training and Evaluation: Pages 139-148

    • Training Model 1 Using the train Function: The sources demonstrate how to use the newly created train function to train the model_1 that was built earlier. They highlight that only a few lines of code are needed to initiate the training process, showcasing the efficiency gained from functionization.
    • Examining Training Results and Performance Comparison: The sources emphasize the importance of carefully examining the training results, particularly the training and testing loss curves. They point out that while model_1 achieves good results, the baseline model_0 appears to perform slightly better. This observation prompts a discussion on potential reasons for the difference in performance, including the possibility that the simpler baseline model might be better suited for the dataset or that further experimentation and hyperparameter tuning might be needed for model_1 to surpass model_0. The sources also highlight the impact of using a GPU for computations, showing that training on a GPU generally leads to faster training times compared to using a CPU.
    • Creating a Results Dictionary to Track Experiments: The sources introduce the concept of creating a dictionary to store the results of different experiments. This organized approach allows for easy comparison and analysis of model performance across various configurations and hyperparameter settings. They emphasize the importance of such systematic tracking, especially when exploring multiple models and variations, to gain insights into the factors influencing performance and make informed decisions about model selection and improvement.
    • Visualizing Loss Curves for Model Analysis: The sources encourage visualizing the loss curves using a function called plot_loss_curves. They stress the value of visual representations in understanding the training dynamics and identifying potential issues like overfitting or underfitting. By plotting the training and testing losses over epochs, it becomes easier to assess whether the model is learning effectively and generalizing well to unseen data. The sources present different scenarios for loss curves, including:
    • Underfitting: The training loss remains high, indicating that the model is not capturing the patterns in the data effectively.
    • Overfitting: The training loss decreases significantly, but the testing loss increases, suggesting that the model is memorizing the training data and failing to generalize to new examples.
    • Good Fit: Both the training and testing losses decrease and converge, indicating that the model is learning effectively and generalizing well to unseen data.
    • Addressing Overfitting and Introducing Data Augmentation: The sources acknowledge overfitting as a common challenge in machine learning and introduce data augmentation as one technique to mitigate it. Data augmentation involves creating variations of existing training data by applying transformations like random rotations, flips, or crops. This expands the effective size of the training set, potentially improving the model’s ability to generalize to new data. They acknowledge that while data augmentation may not always lead to significant improvements, it remains a valuable tool in the machine learning practitioner’s toolkit, especially when dealing with limited datasets or complex models prone to overfitting.
    • Building and Training a CNN Model: The sources shift focus towards building a convolutional neural network (CNN) using PyTorch. They guide readers through constructing a CNN architecture, referencing the TinyVGG model from the CNN Explainer website as a starting point (a sketch of such an architecture appears after this list). The process involves stacking convolutional layers, activation functions (ReLU), and pooling layers to create a network capable of learning features from images effectively. They emphasize the importance of choosing appropriate hyperparameters, such as the number of filters, kernel size, and padding, and understanding their influence on the model’s capacity and performance.
    • Creating Functions for Training and Evaluation with Custom Datasets: The sources revisit the concept of functionization, this time adapting the train_step and test_step functions to work with custom datasets. They highlight the importance of writing reusable and adaptable code that can handle various data formats and scenarios.
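
    As a rough illustration of the TinyVGG-style architecture described above, here is a sketch; the hidden-unit count, kernel size, padding, and the 64x64 input assumption baked into the classifier are illustrative choices rather than the course’s exact values:

    ```python
    import torch
    from torch import nn

    class TinyVGG(nn.Module):
        """Sketch of a TinyVGG-style CNN (hyperparameters are illustrative)."""
        def __init__(self, input_channels, hidden_units, output_classes):
            super().__init__()
            self.block_1 = nn.Sequential(
                nn.Conv2d(input_channels, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2),  # halves the spatial dimensions
            )
            self.block_2 = nn.Sequential(
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2),
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                # 16 * 16 assumes 64x64 inputs halved twice; adjust for other image sizes.
                nn.Linear(hidden_units * 16 * 16, output_classes),
            )

        def forward(self, x):
            return self.classifier(self.block_2(self.block_1(x)))
    ```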

    The sources continue to guide learners through a comprehensive workflow for building, training, and evaluating models in PyTorch, introducing advanced concepts and techniques along the way. They maintain their focus on practical application, encouraging hands-on experimentation, visualization, and analysis to deepen understanding and foster mastery of the tools and concepts involved in machine learning and deep learning.

    Training and Evaluating Models with Custom Datasets: Pages 171-187

    • Building the TinyVGG Architecture: The sources guide the creation of a CNN model based on the TinyVGG architecture. The model consists of convolutional layers, ReLU activation functions, and max-pooling layers arranged in a specific pattern to extract features from images effectively. The sources highlight the importance of understanding the role of each layer and how they work together to process image data. They also mention the blog post “Making deep learning go brrr from first principles,” which is worth exploring for deeper insight into the principles behind deep learning models.
    • Adapting Training and Evaluation Functions for Custom Datasets: The sources revisit the train_step and test_step functions, modifying them to accommodate custom datasets. They emphasize the need for flexibility in code, enabling it to handle different data formats and structures. The changes involve ensuring the data is loaded and processed correctly for the specific dataset used.
    • Creating a train Function for Custom Dataset Training: The sources combine the train_step and test_step functions within a new train function specifically designed for custom datasets. This function orchestrates the entire training and evaluation process, looping through epochs, calling the appropriate step functions for each batch of data, and tracking the model’s performance.
    • Training and Evaluating the Model: The sources demonstrate the process of training the TinyVGG model on the custom food image dataset using the newly created train function. They emphasize the importance of setting random seeds for reproducibility, ensuring consistent results across different runs.
    • Analyzing Loss Curves and Accuracy Trends: The sources analyze the training results, focusing on the loss curves and accuracy trends. They point out that the model exhibits good performance, with the loss decreasing and the accuracy increasing over epochs. They also highlight the potential for further improvement by training for a longer duration.
    • Exploring Different Loss Curve Scenarios: The sources discuss different types of loss curves, including:
    • Underfitting: The training loss remains high, indicating the model isn’t effectively capturing the data patterns.
    • Overfitting: The training loss decreases substantially, but the testing loss increases, signifying the model is memorizing the training data and failing to generalize to new examples.
    • Good Fit: Both training and testing losses decrease and converge, demonstrating that the model is learning effectively and generalizing well.
    • Addressing Overfitting with Data Augmentation: The sources introduce data augmentation as a technique to combat overfitting. Data augmentation creates variations of the training data through transformations like rotations, flips, and crops. This approach effectively expands the training dataset, potentially improving the model’s generalization abilities. They acknowledge that while data augmentation might not always yield significant enhancements, it remains a valuable strategy, especially for smaller datasets or complex models prone to overfitting.
    • Building a Model with Data Augmentation: The sources demonstrate how to build a TinyVGG model incorporating data augmentation techniques. They explore the impact of data augmentation on model performance.
    • Visualizing Results and Evaluating Performance: The sources advocate for visualizing results to gain insights into model behavior. They encourage using techniques like plotting loss curves and creating confusion matrices to assess the model’s effectiveness.
    • Saving and Loading the Best Model: The sources highlight the importance of saving the best-performing model to preserve its state for future use. They demonstrate the process of saving and loading a PyTorch model.
    • Exercises and Extra Curriculum: The sources provide guidance on accessing exercises and supplementary materials, encouraging learners to further explore and solidify their understanding of custom datasets, data augmentation, and CNNs in PyTorch.

    The sources provide a comprehensive walkthrough of building, training, and evaluating models with custom datasets in PyTorch, introducing and illustrating various concepts and techniques along the way. They underscore the value of practical application, experimentation, and analysis to enhance understanding and skill development in machine learning and deep learning.

    Continuing the Exploration of Custom Datasets and Data Augmentation

    • Building a Model with Data Augmentation: The sources guide the construction of a TinyVGG model incorporating data augmentation techniques to potentially improve its generalization ability and reduce overfitting. [1] They introduce data augmentation as a way to create variations of existing training data by applying transformations like random rotations, flips, or crops. [1] This increases the effective size of the training dataset and exposes the model to a wider range of input patterns, helping it learn more robust features.
    • Training the Model with Data Augmentation and Analyzing Results: The sources walk through the process of training the model with data augmentation and evaluating its performance. [2] They observe that, in this specific case, data augmentation doesn’t lead to substantial improvements in quantitative metrics. [2] The reasons for this could be that the baseline model might already be underfitting, or the specific augmentations used might not be optimal for the dataset. They emphasize that experimenting with different augmentations and hyperparameters is crucial to determine the most effective strategies for a given problem.
    • Visualizing Loss Curves and Emphasizing the Importance of Evaluation: The sources stress the importance of visualizing results, especially loss curves, to understand the training dynamics and identify potential issues like overfitting or underfitting. [2] They recommend using the plot_loss_curves function to visually compare the training and testing losses across epochs (a sketch of such a helper appears after this list). [2]
    • Providing Access to Exercises and Extra Curriculum: The sources conclude by directing learners to the resources available for practicing the concepts covered, including an exercise template notebook and example solutions. [3] They encourage readers to attempt the exercises independently and use the example solutions as a reference only after making a genuine effort. [3] The exercises focus on building a CNN model for image classification, highlighting the steps involved in data loading, model creation, training, and evaluation. [3]
    • Concluding the Section on Custom Datasets and Looking Ahead: The sources wrap up the section on working with custom datasets and using data augmentation techniques. [4] They point out that learners have now covered a significant portion of the course material and gained valuable experience in building, training, and evaluating PyTorch models for image classification tasks. [4] They briefly touch upon the next steps in the deep learning journey, including deployment, and encourage learners to continue exploring and expanding their knowledge. [4]
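
    A minimal sketch of a plot_loss_curves helper is given below. It assumes the results dictionary described earlier uses the keys train_loss, test_loss, train_acc, and test_acc, each mapping to a per-epoch list; the real function in the course materials may differ:

    ```python
    import matplotlib.pyplot as plt

    def plot_loss_curves(results):
        """Plots training/testing loss and accuracy from an assumed results dict."""
        epochs = range(len(results["train_loss"]))
        plt.figure(figsize=(12, 5))

        plt.subplot(1, 2, 1)
        plt.plot(epochs, results["train_loss"], label="train loss")
        plt.plot(epochs, results["test_loss"], label="test loss")
        plt.title("Loss")
        plt.xlabel("Epochs")
        plt.legend()

        plt.subplot(1, 2, 2)
        plt.plot(epochs, results["train_acc"], label="train accuracy")
        plt.plot(epochs, results["test_acc"], label="test accuracy")
        plt.title("Accuracy")
        plt.xlabel("Epochs")
        plt.legend()
        plt.show()
    ```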

    The sources aim to equip learners with the necessary tools and knowledge to tackle real-world deep learning projects. They advocate for a hands-on, experimental approach, emphasizing the importance of understanding the data, choosing appropriate models and techniques, and rigorously evaluating the results. They also encourage learners to continuously seek out new information and refine their skills through practice and exploration.

    Exploring Techniques for Model Improvement and Evaluation: Pages 188-190

    • Examining the Impact of Data Augmentation: The sources continue to assess the effectiveness of data augmentation in improving model performance. They observe that, despite its potential benefits, data augmentation might not always result in significant enhancements. In the specific example provided, the model trained with data augmentation doesn’t exhibit noticeable improvements compared to the baseline model. This outcome could be attributed to the baseline model potentially underfitting the data, implying that the model’s capacity is insufficient to capture the complexities of the dataset even with augmented data. Alternatively, the specific data augmentations employed might not be well-suited to the dataset, leading to minimal performance gains.
    • Analyzing Loss Curves to Understand Model Behavior: The sources emphasize the importance of visualizing results, particularly loss curves, to gain insights into the model’s training dynamics. They recommend plotting the training and validation loss curves to observe how the model’s performance evolves over epochs. These visualizations help identify potential issues such as:
    • Underfitting: When both training and validation losses remain high, suggesting the model isn’t effectively learning the patterns in the data.
    • Overfitting: When the training loss decreases significantly while the validation loss increases, indicating the model is memorizing the training data rather than learning generalizable features.
    • Good Fit: When both training and validation losses decrease and converge, demonstrating the model is learning effectively and generalizing well to unseen data.
    • Directing Learners to Exercises and Supplementary Materials: The sources encourage learners to engage with the exercises and extra curriculum provided to solidify their understanding of the concepts covered. They point to resources like an exercise template notebook and example solutions designed to reinforce the knowledge acquired in the section. The exercises focus on building a CNN model for image classification, covering aspects like data loading, model creation, training, and evaluation.

    The sources strive to equip learners with the critical thinking skills necessary to analyze model performance, identify potential problems, and explore strategies for improvement. They highlight the value of visualizing results and understanding the implications of different loss curve patterns. Furthermore, they encourage learners to actively participate in the provided exercises and seek out supplementary materials to enhance their practical skills in deep learning.

    Evaluating the Effectiveness of Data Augmentation

    The sources consistently emphasize the importance of evaluating the impact of data augmentation on model performance. While data augmentation is a widely used technique to mitigate overfitting and potentially improve generalization ability, its effectiveness can vary depending on the specific dataset and model architecture.

    In the context of the food image classification task, the sources demonstrate building a TinyVGG model with and without data augmentation. They analyze the results and observe that, in this particular instance, data augmentation doesn’t lead to significant improvements in quantitative metrics like loss or accuracy. This outcome could be attributed to several factors:

    • Underfitting Baseline Model: The baseline model, even without augmentation, might already be underfitting the data. This suggests that the model’s capacity is insufficient to capture the complexities of the dataset effectively. In such scenarios, data augmentation might not provide substantial benefits as the model’s limitations prevent it from leveraging the augmented data fully.
    • Suboptimal Augmentations: The specific data augmentation techniques used might not be well-suited to the characteristics of the food image dataset. The chosen transformations might not introduce sufficient diversity or might inadvertently alter crucial features, leading to limited performance gains.
    • Dataset Size: The size of the original dataset could influence the impact of data augmentation. For larger datasets, data augmentation might have a more pronounced effect, as it helps expand the training data and exposes the model to a wider range of variations. However, for smaller datasets, the benefits of augmentation might be less noticeable.

    The sources stress the importance of experimentation and analysis to determine the effectiveness of data augmentation for a specific task. They recommend exploring different augmentation techniques, adjusting hyperparameters, and carefully evaluating the results to find the optimal strategy. They also point out that even if data augmentation doesn’t result in substantial quantitative improvements, it can still contribute to a more robust and generalized model. [1, 2]

    Exploring Data Augmentation and Addressing Overfitting

    The sources highlight the importance of data augmentation as a technique to combat overfitting in machine learning models, particularly in the realm of computer vision. They emphasize that data augmentation involves creating variations of the existing training data by applying transformations such as rotations, flips, or crops. This effectively expands the training dataset and presents the model with a wider range of input patterns, promoting the learning of more robust and generalizable features.
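
    A minimal sketch of this idea with torchvision.transforms is shown below; the image size and the specific augmentations are illustrative choices, not necessarily the ones used in the course:

    ```python
    from torchvision import transforms

    # Training transform with augmentation: each transform is an illustrative
    # example of the kinds of variations described above.
    train_transform = transforms.Compose([
        transforms.Resize((64, 64)),              # assumed target size
        transforms.RandomHorizontalFlip(p=0.5),   # random flip
        transforms.RandomRotation(degrees=30),    # random rotation
        transforms.ToTensor(),
    ])

    # Test transform: no augmentation, so evaluation sees unaltered images.
    test_transform = transforms.Compose([
        transforms.Resize((64, 64)),
        transforms.ToTensor(),
    ])
    ```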

    However, the sources caution that data augmentation is not a guaranteed solution and its effectiveness can vary depending on several factors, including:

    • The nature of the dataset: The type of data and the inherent variability within the dataset can influence the impact of data augmentation. Certain datasets might benefit significantly from augmentation, while others might exhibit minimal improvement.
    • The model architecture: The complexity and capacity of the model can determine how effectively it can leverage augmented data. A simple model might not fully utilize the augmented data, while a more complex model might be prone to overfitting even with augmentation.
    • The choice of augmentation techniques: The specific transformations applied during augmentation play a crucial role in its success. Selecting augmentations that align with the characteristics of the data and the task at hand is essential. Inappropriate or excessive augmentations can even hinder performance.

    The sources demonstrate the application of data augmentation in the context of a food image classification task using a TinyVGG model. They train the model with and without augmentation and compare the results. Notably, they observe that, in this particular scenario, data augmentation does not lead to substantial improvements in quantitative metrics such as loss or accuracy. This outcome underscores the importance of carefully evaluating the impact of data augmentation and not assuming its universal effectiveness.

    To gain further insights into the model’s behavior and the effects of data augmentation, the sources recommend visualizing the training and validation loss curves. These visualizations can reveal patterns that indicate:

    • Underfitting: If both the training and validation losses remain high, it suggests the model is not adequately learning from the data, even with augmentation.
    • Overfitting: If the training loss decreases while the validation loss increases, it indicates the model is memorizing the training data and failing to generalize to unseen data.
    • Good Fit: If both the training and validation losses decrease and converge, it signifies the model is learning effectively and generalizing well.

    The sources consistently emphasize the importance of experimentation and analysis when applying data augmentation. They encourage trying different augmentation techniques, fine-tuning hyperparameters, and rigorously evaluating the results to determine the optimal strategy for a given problem. They also highlight that, even if data augmentation doesn’t yield significant quantitative gains, it can still contribute to a more robust and generalized model.

    Ultimately, the sources advocate for a nuanced approach to data augmentation, recognizing its potential benefits while acknowledging its limitations. They urge practitioners to adopt a data-driven methodology, carefully considering the characteristics of the dataset, the model architecture, and the task requirements to determine the most effective data augmentation strategy.

    The Purpose and Impact of Inference Mode in PyTorch

    The sources introduce inference mode, a feature in PyTorch designed to optimize the model for making predictions, often referred to as “inference” or “evaluation” in machine learning. Inference mode is activated using the torch.inference_mode context manager, as demonstrated in source [1].

    Key Benefits of Inference Mode

    While the sources don’t go into extensive detail about the internal workings of inference mode, they highlight its primary benefits:

    • Improved Speed: Inference mode disables gradient calculations and certain operations not required during prediction, resulting in faster code execution. Source [2] specifically mentions a Twitter thread where PyTorch developers discuss the speed enhancements achieved using inference mode.
    • Reduced Memory Consumption: By disabling gradient tracking, inference mode reduces the memory footprint of the model during prediction. This can be particularly advantageous when deploying models on resource-constrained devices or handling large datasets.

    Observing the Difference

    Source [3] presents a comparison between predictions made with and without inference mode. The key distinction lies in the presence or absence of a grad_fn attribute in the output. The grad_fn attribute is associated with gradient calculations, essential for training but unnecessary during inference. When inference mode is active, the output lacks the grad_fn, indicating that gradient tracking is disabled, leading to speed and memory optimizations.
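
    The difference can be reproduced in a few lines; the tiny nn.Linear model here is purely illustrative:

    ```python
    import torch
    from torch import nn

    model = nn.Linear(2, 1)
    x = torch.randn(1, 2)

    # Without inference mode: the output carries a grad_fn for backpropagation.
    out = model(x)
    print(out.grad_fn)  # e.g. <AddmmBackward0 object at ...>

    # With inference mode: gradient tracking is disabled.
    with torch.inference_mode():
        out = model(x)
    print(out.grad_fn)  # None: no gradient machinery attached
    ```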

    Making Inference Mode a Habit

    The sources encourage developing the habit of using inference mode whenever making predictions with a PyTorch model. This practice ensures that the model operates in its most efficient mode for inference tasks, maximizing performance and minimizing resource utilization.

    Beyond the Sources

    The sources provide a high-level overview of inference mode and its benefits, recommending further exploration through external resources, particularly the PyTorch documentation and the mentioned Twitter thread. For a deeper understanding of the technical aspects and implementation details of inference mode, consulting the official PyTorch documentation would be beneficial.

    Building a Robust PyTorch Training Loop: Pages 201-210

    The sources transition into providing a detailed walkthrough of constructing a robust training loop in PyTorch for a machine learning task involving straight-line data. This example focuses on regression, where the goal is to predict a continuous numerical value. They emphasize that while this specific task involves a simple linear relationship, the concepts and steps involved are generalizable to more complex scenarios.

    Here’s a breakdown of the key elements covered in the sources:

    • Data Generation and Preparation: The sources guide the reader through generating a synthetic dataset representing a straight line with a predefined weight and bias. This dataset simulates a real-world scenario where the goal is to train a model to learn the underlying relationship between input features and target variables.
    • Model Definition: The sources introduce the nn.Linear module, a fundamental building block in PyTorch for defining linear layers in neural networks. They demonstrate how to instantiate a linear layer, specifying the input and output dimensions based on the dataset. This layer will learn the weight and bias parameters during training to approximate the straight-line relationship.
    • Loss Function and Optimizer: The sources explain the importance of a loss function in training a machine learning model. In this case, they use the Mean Squared Error (MSE) loss, a common choice for regression tasks that measures the average squared difference between the predicted and actual values. They also introduce the concept of an optimizer, specifically Stochastic Gradient Descent (SGD), responsible for updating the model’s parameters to minimize the loss function during training.
    • Training Loop Structure: The sources outline the core components of a training loop (a minimal sketch follows this list):
    • Iterating Through Epochs: The training process typically involves multiple passes over the entire training dataset, each pass referred to as an epoch. The loop iterates through the specified number of epochs, performing the training steps for each epoch.
    • Forward Pass: For each batch of data, the model makes predictions based on the current parameter values. This step involves passing the input data through the linear layer to obtain the output predictions (raw model outputs are often called logits, a term most commonly used in classification settings).
    • Loss Calculation: The loss function (MSE in this example) is used to compute the difference between the model’s predictions (logits) and the actual target values.
    • Backpropagation: This step involves calculating the gradients of the loss with respect to the model’s parameters. These gradients indicate the direction and magnitude of adjustments needed to minimize the loss.
    • Optimizer Step: The optimizer (SGD in this case) utilizes the calculated gradients to update the model’s weight and bias parameters, moving them towards values that reduce the loss.
    • Visualizing the Training Process: The sources emphasize the importance of visualizing the training progress to gain insights into the model’s behavior. They demonstrate plotting the loss values and parameter updates over epochs, helping to understand how the model is learning and whether the loss is decreasing as expected.
    • Illustrating Epochs and Stepping the Optimizer: The sources use a coin analogy to explain the concept of epochs and the role of the optimizer in adjusting model parameters. They compare each epoch to moving closer to a coin at the back of a couch, with the optimizer taking steps to reduce the distance to the target (the coin).
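
    Here is a minimal sketch of the loop just described, with an assumed weight of 0.7 and bias of 0.3 for the synthetic straight line:

    ```python
    import torch
    from torch import nn

    # Synthetic straight-line data: y = weight * x + bias (values are illustrative).
    weight, bias = 0.7, 0.3
    X = torch.arange(0, 1, 0.02).unsqueeze(dim=1)
    y = weight * X + bias

    model = nn.Linear(in_features=1, out_features=1)    # learns a weight and a bias
    loss_function = nn.MSELoss()                        # mean squared error for regression
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(100):
        y_pred = model(X)                 # forward pass
        loss = loss_function(y_pred, y)   # loss calculation
        optimizer.zero_grad()             # reset accumulated gradients
        loss.backward()                   # backpropagation
        optimizer.step()                  # optimizer step: update parameters
    ```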

    The sources provide a comprehensive guide to constructing a fundamental PyTorch training loop for a regression problem, emphasizing the key components and the rationale behind each step. They stress the importance of visualization to understand the training dynamics and the role of the optimizer in guiding the model towards a solution that minimizes the loss function.

    Understanding Non-Linearities and Activation Functions: Pages 211-220

    The sources shift their focus to the concept of non-linearities in neural networks and their crucial role in enabling models to learn complex patterns beyond simple linear relationships. They introduce activation functions as the mechanism for introducing non-linearity into the model’s computations.

    Here’s a breakdown of the key concepts covered in the sources:

    • Limitations of Linear Models: The sources revisit the previous example of training a linear model to fit a straight line. They acknowledge that while linear models are straightforward to understand and implement, they are inherently limited in their capacity to model complex, non-linear relationships often found in real-world data.
    • The Need for Non-Linearities: The sources emphasize that introducing non-linearity into the model’s architecture is essential for capturing intricate patterns and making accurate predictions on data with non-linear characteristics. They highlight that without non-linearities, neural networks would essentially collapse into a series of linear transformations, offering no advantage over simple linear models.
    • Activation Functions: The sources introduce activation functions as the primary means of incorporating non-linearities into neural networks. Activation functions are applied to the output of linear layers, transforming the linear output into a non-linear representation. They act as “decision boundaries,” allowing the network to learn more complex and nuanced relationships between input features and target variables.
    • Sigmoid Activation Function: The sources specifically discuss the sigmoid activation function, a common choice that squashes the input values into a range between 0 and 1. They highlight that while sigmoid was historically popular, it has limitations, particularly in deep networks where it can lead to vanishing gradients, hindering training.
    • ReLU Activation Function: The sources present the ReLU (Rectified Linear Unit) activation function as a more modern and widely used alternative to sigmoid. ReLU is computationally efficient and addresses the vanishing gradient problem associated with sigmoid. It simply sets all negative values to zero and leaves positive values unchanged, introducing non-linearity while preserving the benefits of linear behavior in certain regions.
    • Visualizing the Impact of Non-Linearities: The sources emphasize the importance of visualization to understand the impact of activation functions. They demonstrate how the addition of a ReLU activation function to a simple linear model drastically changes the model’s decision boundary, enabling it to learn non-linear patterns in a toy dataset of circles. They showcase how the ReLU-augmented model achieves near-perfect performance, highlighting the power of non-linearities in enhancing model capabilities.
    • Exploration of Activation Functions in torch.nn: The sources guide the reader to explore the torch.nn module in PyTorch, which contains a comprehensive collection of activation functions. They encourage exploring the documentation and experimenting with different activation functions to understand their properties and impact on model behavior.
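
    A quick way to inspect these two activation functions from torch.nn is shown below; the input range is arbitrary:

    ```python
    import torch
    from torch import nn

    x = torch.linspace(-5, 5, steps=11)

    sigmoid = nn.Sigmoid()   # squashes values into the range (0, 1)
    relu = nn.ReLU()         # zeroes out negatives, keeps positives unchanged

    print(sigmoid(x))
    print(relu(x))
    ```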

    The sources provide a clear and concise introduction to the fundamental concepts of non-linearities and activation functions in neural networks. They emphasize the limitations of linear models and the essential role of activation functions in empowering models to learn complex patterns. The sources encourage a hands-on approach, urging readers to experiment with different activation functions in PyTorch and visualize their effects on model behavior.

    Optimizing Gradient Descent: Pages 221-230

    The sources move on to refining the gradient descent process, a crucial element in training machine-learning models. They highlight several techniques and concepts aimed at enhancing the efficiency and effectiveness of gradient descent.

    • Gradient Accumulation and the optimizer.zero_grad() Method: The sources explain the concept of gradient accumulation, where gradients are calculated and summed over multiple batches before being applied to update model parameters. They emphasize the importance of resetting the accumulated gradients to zero before each batch using the optimizer.zero_grad() method. This prevents gradients from previous batches from interfering with the current batch’s calculations, ensuring accurate gradient updates.
    • The Intertwined Nature of Gradient Descent Steps: The sources point out the interconnectedness of the steps involved in gradient descent:
    • optimizer.zero_grad(): Resets the gradients to zero.
    • loss.backward(): Calculates gradients through backpropagation.
    • optimizer.step(): Updates model parameters based on the calculated gradients.
    • They emphasize that these steps work in tandem to optimize the model parameters, moving them towards values that minimize the loss function.
    • Learning Rate Scheduling and the Coin Analogy: The sources introduce the concept of learning rate scheduling, a technique for dynamically adjusting the learning rate, a hyperparameter controlling the size of parameter updates during training. They use the analogy of reaching for a coin at the back of a couch to explain this concept.
    • Large Steps Initially: When the arm starts far from the coin (analogous to the early stages of training), larger steps are taken to cover more ground quickly.
    • Smaller Steps as the Target Approaches: As the arm gets closer to the coin (similar to approaching the optimal solution), smaller, more precise steps are needed to avoid overshooting the target.
    • The sources suggest exploring resources on learning rate scheduling for further details.
    • Visualizing Model Improvement: The sources demonstrate the positive impact of training for more epochs, showing how predictions align better with the target values as training progresses. They visualize the model’s predictions alongside the actual data points, illustrating how the model learns to fit the data more accurately over time.
    • The torch.no_grad() Context Manager for Evaluation: The sources introduce the torch.no_grad() context manager, used during the evaluation phase to disable gradient calculations. This optimization enhances speed and reduces memory consumption, as gradients are unnecessary for evaluating a trained model (see the sketch after this list).
    • The Jingle for Remembering Training Steps: To help remember the key steps in a training loop, the sources introduce a catchy jingle: “For an epoch in a range, do the forward pass, calculate the loss, optimizer zero grad, loss backward, optimizer step, step, step.” This mnemonic device reinforces the sequence of actions involved in training a model.
    • Customizing Printouts and Monitoring Metrics: The sources emphasize the flexibility of customizing printouts during training to monitor relevant metrics. They provide examples of printing the loss, weights, and bias values at specific intervals (every 10 epochs in this case) to track the training progress. They also hint at introducing accuracy metrics in later stages.
    • Reinitializing the Model and the Importance of Random Seeds: The sources demonstrate reinitializing the model to start training from scratch, showcasing how the model begins with random predictions but progressively improves as training progresses. They emphasize the role of random seeds in ensuring reproducibility, allowing for consistent model initialization and experimentation.
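
    Below is a sketch extending the earlier regression example with an assumed held-out split, a torch.no_grad() evaluation phase, and a printout every 10 epochs; all values are illustrative assumptions:

    ```python
    import torch
    from torch import nn

    # Setup as in the earlier regression sketch (values assumed for illustration).
    X = torch.arange(0, 1, 0.02).unsqueeze(dim=1)
    y = 0.7 * X + 0.3
    train_split = int(0.8 * len(X))
    X_train, y_train = X[:train_split], y[:train_split]
    X_test, y_test = X[train_split:], y[train_split:]

    model = nn.Linear(in_features=1, out_features=1)
    loss_function = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(100):
        model.train()
        y_pred = model(X_train)
        loss = loss_function(y_pred, y_train)
        optimizer.zero_grad()        # reset gradients so they don't accumulate across steps
        loss.backward()              # backpropagation
        optimizer.step()             # parameter update

        model.eval()
        with torch.no_grad():        # gradient tracking not needed for evaluation
            test_loss = loss_function(model(X_test), y_test)

        if epoch % 10 == 0:          # customized printout every 10 epochs
            print(f"Epoch {epoch} | train loss: {loss.item():.4f} | test loss: {test_loss.item():.4f}")
    ```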

    The sources provide a comprehensive exploration of techniques and concepts for optimizing the gradient descent process in PyTorch. They cover gradient accumulation, learning rate scheduling, and the use of context managers for efficient evaluation. They emphasize visualization to monitor progress and the importance of random seeds for reproducible experiments.

    Saving, Loading, and Evaluating Models: Pages 231-240

    The sources guide readers through saving a trained model, reloading it for later use, and exploring additional evaluation metrics beyond just loss.

    • Saving a Trained Model with torch.save(): The sources introduce the torch.save() function in PyTorch to save a trained model to a file. They emphasize the importance of saving models to preserve the learned parameters, allowing for later reuse without retraining. The code examples demonstrate saving the model’s state dictionary, containing the learned parameters, to a file named “01_pytorch_workflow_model_0.pth”. The save/load pattern is sketched after this list.
    • Verifying Model File Creation with ls: The sources suggest using the ls command in a terminal or command prompt to verify that the model file has been successfully created in the designated directory.
    • Loading a Saved Model with torch.load(): The sources then present the torch.load() function for loading a saved model back into the environment. They highlight the ease of loading saved models, allowing for continued training or deployment for making predictions without the need to repeat the entire training process. They challenge readers to attempt loading the saved model before providing the code solution.
    • Examining Loaded Model Parameters: The sources suggest examining the loaded model’s parameters, particularly the weights and biases, to confirm that they match the values from the saved model. This step ensures that the model has been loaded correctly and is ready for further use.
    • Improving Model Performance with More Epochs: The sources revisit the concept of training for more epochs to improve model performance. They demonstrate how increasing the number of epochs can lead to lower loss and better alignment between predictions and target values. They encourage experimentation with different epoch values to observe the impact on model accuracy.
    • Plotting Loss Curves to Visualize Training Progress: The sources showcase plotting loss curves to visualize the training progress over time. They track the loss values for both the training and test sets across epochs and plot these values to observe the trend of decreasing loss as training proceeds. The sources point out that if the training and test loss curves converge closely, it indicates that the model is generalizing well to unseen data, a desirable outcome.
    • Storing Useful Values During Training: The sources recommend creating empty lists to store useful values during training, such as epoch counts, loss values, and test loss values. This organized storage facilitates later analysis and visualization of the training process.
    • Reviewing Code, Slides, and Extra Curriculum: The sources encourage readers to review the code, accompanying slides, and extra curriculum resources for a deeper understanding of the concepts covered. They particularly recommend the book version of the course, which contains comprehensive explanations and additional resources.
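
    A minimal save-and-load sketch is shown below, reusing the file name mentioned above; the models/ directory and the nn.Linear stand-in architecture are illustrative assumptions:

    ```python
    import torch
    from torch import nn
    from pathlib import Path

    # Save the model's state dictionary (its learned parameters).
    MODEL_PATH = Path("models")                       # assumed directory
    MODEL_PATH.mkdir(parents=True, exist_ok=True)
    MODEL_SAVE_PATH = MODEL_PATH / "01_pytorch_workflow_model_0.pth"
    model = nn.Linear(in_features=1, out_features=1)  # stand-in for the trained model
    torch.save(obj=model.state_dict(), f=MODEL_SAVE_PATH)

    # Load the parameters back into a fresh instance of the same architecture.
    loaded_model = nn.Linear(in_features=1, out_features=1)
    loaded_model.load_state_dict(torch.load(f=MODEL_SAVE_PATH))

    # Examine the loaded parameters to confirm they match the saved ones.
    print(loaded_model.state_dict())
    ```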

    This section of the sources focuses on the practical aspects of saving, loading, and evaluating PyTorch models. The sources provide clear code examples and explanations for these essential tasks, enabling readers to efficiently manage their trained models and assess their performance. They continue to emphasize the importance of visualization for understanding training progress and model behavior.

    Building and Understanding Neural Networks: Pages 241-250

    The sources transition from focusing on fundamental PyTorch workflows to constructing and comprehending neural networks for more complex tasks, particularly classification. They guide readers through building a neural network designed to classify data points into distinct categories.

    • Shifting Focus to PyTorch Fundamentals: The sources highlight that the upcoming content will concentrate on the core principles of PyTorch, shifting away from the broader workflow-oriented perspective. They direct readers to specific sections in the accompanying resources, such as the PyTorch Fundamentals notebook and the online book version of the course, for supplementary materials and in-depth explanations.
    • Exercises and Extra Curriculum: The sources emphasize the availability of exercises and extra curriculum materials to enhance learning and practical application. They encourage readers to actively engage with these resources to solidify their understanding of the concepts.
    • Introduction to Neural Network Classification: The sources mark the beginning of a new section focused on neural network classification, a common machine learning task where models learn to categorize data into predefined classes. They distinguish between binary classification (one thing or another) and multi-class classification (more than two classes).
    • Examples of Classification Problems: To illustrate classification tasks, the sources provide real-world examples:
    • Image Classification: Classifying images as containing a cat or a dog.
    • Spam Filtering: Categorizing emails as spam or not spam.
    • Social Media Post Classification: Labeling posts on platforms like Facebook or Twitter based on their content.
    • Fraud Detection: Identifying fraudulent transactions.
    • Multi-Class Classification with Wikipedia Labels: The sources extend the concept of multi-class classification to using labels from the Wikipedia page for “deep learning.” They note that the Wikipedia page itself has multiple categories or labels, such as “deep learning,” “artificial neural networks,” “artificial intelligence,” and “emerging technologies.” This example highlights how a machine learning model could be trained to classify text based on multiple labels.
    • Architecture, Input/Output Shapes, Features, and Labels: The sources outline the key aspects of neural network classification models that they will cover:
    • Architecture: The structure and organization of the neural network, including the layers and their connections.
    • Input/Output Shapes: The dimensions of the data fed into the model and the expected dimensions of the model’s predictions.
    • Features: The input variables or characteristics used by the model to make predictions.
    • Labels: The target variables representing the classes or categories to which the data points belong.
    • Practical Example with the make_circles Dataset: The sources introduce a hands-on example using the make_circles dataset from scikit-learn, a Python library for machine learning. They generate a synthetic dataset consisting of 1000 data points arranged in two concentric circles, each circle representing a different class (a sketch of this step appears after this list).
    • Data Exploration and Visualization: The sources emphasize the importance of exploring and visualizing data before model building. They print the first five samples of both the features (X) and labels (Y) and guide readers through understanding the structure of the data. They acknowledge that discerning patterns from raw numerical data can be challenging and advocate for visualization to gain insights.
    • Creating a Dictionary for Structured Data Representation: The sources structure the data into a dictionary format to organize the features (X1, X2) and labels (Y) for each sample. They explain the rationale behind this approach, highlighting how it improves readability and understanding of the dataset.
    • Transitioning to Visualization: The sources prepare to shift from numerical representations to visual representations of the data, emphasizing the power of visualization for revealing patterns and gaining a deeper understanding of the dataset’s characteristics.
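
    A minimal sketch of creating and inspecting the dataset follows; the noise level and random seed are illustrative choices:

    ```python
    import torch
    from sklearn.datasets import make_circles

    # 1000 samples arranged in two concentric circles.
    X, y = make_circles(n_samples=1000, noise=0.03, random_state=42)

    # Convert the NumPy arrays to PyTorch tensors.
    X = torch.from_numpy(X).type(torch.float)
    y = torch.from_numpy(y).type(torch.float)

    print(X[:5])  # first five samples of the features (X1, X2)
    print(y[:5])  # first five labels (0 or 1)
    ```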

    This section of the sources marks a transition to a more code-centric and hands-on approach to understanding neural networks for classification. They introduce essential concepts, provide real-world examples, and guide readers through a practical example using a synthetic dataset. They continue to advocate for visualization as a crucial tool for data exploration and model understanding.

    Visualizing and Building a Classification Model: Pages 251-260

    The sources demonstrate how to visualize the make_circles dataset and begin constructing a neural network model designed for binary classification.

    • Visualizing the make_circles Dataset: The sources utilize Matplotlib, a Python plotting library, to visualize the make_circles dataset created earlier. They emphasize the data explorer’s motto: “Visualize, visualize, visualize,” underscoring the importance of visually inspecting data to understand patterns and relationships. The visualization reveals two distinct circles, each representing a different class, confirming the expected structure of the dataset.
    • Splitting Data into Training and Test Sets: The sources guide readers through splitting the dataset into training and test sets using array slicing. They explain the rationale for this split:
    • Training Set: Used to train the model and allow it to learn patterns from the data.
    • Test Set: Held back from training and used to evaluate the model’s performance on unseen data, providing an estimate of its ability to generalize to new examples.
    • They calculate and verify the lengths of the training and test sets, ensuring that the split adheres to the desired proportions (in this case, 80% for training and 20% for testing).
    • Building a Simple Neural Network with PyTorch: The sources initiate building a simple neural network model using PyTorch. They introduce essential components of a PyTorch model:
    • torch.nn.Module: The base class for all neural network modules in PyTorch.
    • __init__ Method: The constructor method where model layers are defined.
    • forward Method: Defines the forward pass of data through the model.
    • They guide readers through creating a class named CircleModelV0 that inherits from torch.nn.Module and outline the steps for defining the model’s layers and the forward pass logic (a sketch of such a model appears at the end of this list).
    • Key Concepts in the Neural Network Model:
    • Linear Layers: The model uses linear layers (torch.nn.Linear), which apply a linear transformation to the input data.
    • Non-Linear Activation Function (Sigmoid): The model employs a non-linear activation function, specifically the sigmoid function (torch.sigmoid), to introduce non-linearity into the model. Non-linearity allows the model to learn more complex patterns in the data.
    • Input and Output Dimensions: The sources carefully consider the input and output dimensions of each layer to ensure compatibility between the layers and the data. They emphasize the importance of aligning these dimensions to prevent errors during model execution.
    • Visualizing the Neural Network Architecture: The sources present a visual representation of the neural network architecture, highlighting the flow of data through the layers, the application of the sigmoid activation function, and the final output representing the model’s prediction. They encourage readers to visualize their own neural networks to aid in comprehension.
    • Loss Function and Optimizer: The sources introduce the concept of a loss function and an optimizer, crucial components of the training process:
    • Loss Function: Measures the difference between the model’s predictions and the true labels, providing a signal to guide the model’s learning.
    • Optimizer: Updates the model’s parameters (weights and biases) based on the calculated loss, aiming to minimize the loss and improve the model’s accuracy.
    • They select the binary cross-entropy loss function (torch.nn.BCELoss) and the stochastic gradient descent (SGD) optimizer (torch.optim.SGD) for this classification task. They mention that alternative loss functions and optimizers exist and provide resources for further exploration.
    • Training Loop and Evaluation: The sources establish a training loop, a fundamental process in machine learning where the model iteratively learns from the training data. They outline the key steps involved in each iteration of the loop:
    1. Forward Pass: Pass the training data through the model to obtain predictions.
    2. Calculate Loss: Compute the loss using the chosen loss function.
    3. Zero Gradients: Reset the gradients of the model’s parameters.
    4. Backward Pass (Backpropagation): Calculate the gradients of the loss with respect to the model’s parameters.
    5. Update Parameters: Adjust the model’s parameters using the optimizer based on the calculated gradients.
    • They perform a small number of training epochs (iterations over the entire training dataset) to demonstrate the training process. They evaluate the model’s performance after training by calculating the loss on the test data.
    • Visualizing Model Predictions: The sources visualize the model’s predictions on the test data using Matplotlib. They plot the data points, color-coded by their true labels, and overlay the decision boundary learned by the model, illustrating how the model separates the data into different classes. They note that the model’s predictions, although far from perfect at this early stage of training, show some initial separation between the classes, indicating that the model is starting to learn.
    • Improving a Model: An Overview: The sources provide a high-level overview of techniques for improving the performance of a machine learning model. They suggest various strategies for enhancing model accuracy, including adding more layers, increasing the number of hidden units, training for a longer duration, and incorporating non-linear activation functions. They emphasize that these strategies may not always guarantee improvement and that experimentation is crucial to determine the optimal approach for a particular dataset and problem.
    • Saving and Loading Models with PyTorch: The sources reiterate the importance of saving trained models for later use. They demonstrate the use of torch.save() to save the model’s state dictionary to a file. They also showcase how to load a saved model using torch.load(), allowing for reuse without the need for retraining.
    • Transition to Putting It All Together: The sources prepare to transition to a section where they will consolidate the concepts covered so far by working through a comprehensive example that incorporates the entire machine learning workflow, emphasizing practical application and problem-solving.
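
    A sketch of such a binary-classification model, paired with BCELoss and SGD as described above, is given below; the hidden-layer size and learning rate are illustrative assumptions:

    ```python
    import torch
    from torch import nn

    class CircleModelV0(nn.Module):
        def __init__(self):
            super().__init__()
            # Two input features (X1, X2); a hidden size of 5 is an illustrative choice.
            self.layer_1 = nn.Linear(in_features=2, out_features=5)
            self.layer_2 = nn.Linear(in_features=5, out_features=1)

        def forward(self, x):
            # Sigmoid squashes the output to (0, 1), matching BCELoss's expectation.
            return torch.sigmoid(self.layer_2(self.layer_1(x)))

    model_0 = CircleModelV0()
    loss_function = nn.BCELoss()                              # binary cross-entropy
    optimizer = torch.optim.SGD(model_0.parameters(), lr=0.1)

    # One training step on a hypothetical batch X_train ([N, 2]) and y_train ([N]):
    # y_pred = model_0(X_train).squeeze()
    # loss = loss_function(y_pred, y_train)
    # optimizer.zero_grad(); loss.backward(); optimizer.step()
    ```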

    This section of the sources focuses on the practical aspects of building and training a simple neural network for binary classification. They guide readers through defining the model architecture, choosing a loss function and optimizer, implementing a training loop, and visualizing the model’s predictions. They also introduce strategies for improving model performance and reinforce the importance of saving and loading trained models.

    Putting It All Together: Pages 261-270

    The sources revisit the key steps in the PyTorch workflow, bringing together the concepts covered previously to solidify readers’ understanding of the end-to-end process. They emphasize a code-centric approach, encouraging readers to code along to reinforce their learning.

    • Reiterating the PyTorch Workflow: The sources highlight the importance of practicing the PyTorch workflow to gain proficiency. They guide readers through a step-by-step review of the process, emphasizing a shift toward coding over theoretical explanations.
    • The Importance of Practice: The sources stress that actively writing and running code is crucial for internalizing concepts and developing practical skills. They encourage readers to participate in coding exercises and explore additional resources to enhance their understanding.
    • Data Preparation and Transformation into Tensors: The sources reiterate the initial steps of preparing data and converting it into tensors, a format suitable for PyTorch models. They remind readers of the importance of data exploration and transformation, emphasizing that these steps are fundamental to successful model development.
    • Model Building, Loss Function, and Optimizer Selection: The sources revisit the core components of model construction:
    • Building or Selecting a Model: Choosing an appropriate model architecture or constructing a custom model based on the problem’s requirements.
    • Picking a Loss Function: Selecting a loss function that measures the difference between the model’s predictions and the true labels, guiding the model’s learning process.
    • Building an Optimizer: Choosing an optimizer that updates the model’s parameters based on the calculated loss, aiming to minimize the loss and improve the model’s accuracy.
    • Training Loop and Model Fitting: The sources highlight the central role of the training loop in machine learning. They recap the key steps involved in each iteration:
    1. Forward Pass: Pass the training data through the model to obtain predictions.
    2. Calculate Loss: Compute the loss using the chosen loss function.
    3. Zero Gradients: Reset the gradients of the model’s parameters.
    4. Backward Pass (Backpropagation): Calculate the gradients of the loss with respect to the model’s parameters.
    5. Update Parameters: Adjust the model’s parameters using the optimizer based on the calculated gradients.
    • Making Predictions and Evaluating the Model: The sources remind readers of the steps involved in using the trained model to make predictions on new data and evaluating its performance using appropriate metrics, such as loss and accuracy. They emphasize the importance of evaluating models on unseen data (the test set) to assess their ability to generalize to new examples.
    • Saving and Loading Trained Models: The sources reiterate the value of saving trained models to avoid retraining. They demonstrate the use of torch.save() to save the model’s state dictionary to a file and torch.load() to load a saved model for reuse.
    • Exercises and Extra Curriculum Resources: The sources consistently emphasize the availability of exercises and extra curriculum materials to supplement learning. They direct readers to the accompanying resources, such as the online book and the GitHub repository, where these materials can be found. They encourage readers to actively engage with these resources to solidify their understanding and develop practical skills.
    • Transition to Convolutional Neural Networks: The sources prepare to move into a new section focused on computer vision and convolutional neural networks (CNNs), indicating that readers have gained a solid foundation in the fundamental PyTorch workflow and are ready to explore more advanced deep learning architectures. [1]
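
    As a hedged illustration of the five training-loop steps recapped above, here is a minimal, self-contained sketch. The toy data, model, learning rate, and epoch count are illustrative assumptions, not values from the course:

    import torch
    from torch import nn

    # toy data and model purely for illustration
    X_train = torch.randn(100, 2)
    y_train = torch.randn(100, 1)
    model = nn.Linear(2, 1)
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(100):
        model.train()
        y_pred = model(X_train)          # 1. forward pass
        loss = loss_fn(y_pred, y_train)  # 2. calculate loss
        optimizer.zero_grad()            # 3. zero the gradients
        loss.backward()                  # 4. backward pass (backpropagation)
        optimizer.step()                 # 5. update the parameters
        if epoch % 10 == 0:
            print(f"Epoch: {epoch} | Loss: {loss.item():.4f}")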

    This section of the sources serves as a review and consolidation of the key concepts and steps involved in the PyTorch workflow. It reinforces the importance of practice and hands-on coding and prepares readers to explore more specialized deep learning techniques, such as CNNs for computer vision tasks.

    Navigating Resources and Deep Learning Concepts: Pages 271-280

    The sources transition into discussing resources for further learning and exploring essential deep learning concepts, setting the stage for a deeper understanding of PyTorch and its applications.

    • Emphasizing Continuous Learning: The sources emphasize the importance of ongoing learning in the ever-evolving field of deep learning. They acknowledge that a single course cannot cover every aspect of PyTorch and encourage readers to actively seek out additional resources to expand their knowledge.
    • Recommended Resources for PyTorch Mastery: The sources provide specific recommendations for resources that can aid in further exploration of PyTorch:
    • Google Search: A fundamental tool for finding answers to specific questions, troubleshooting errors, and exploring various concepts related to PyTorch and deep learning. [1, 2]
    • PyTorch Documentation: The official PyTorch documentation serves as an invaluable reference for understanding PyTorch’s functions, modules, and classes. The sources demonstrate how to effectively navigate the documentation to find information about specific functions, such as torch.arange. [3]
    • GitHub Repository: The sources highlight a dedicated GitHub repository that houses the materials covered in the course, including notebooks, code examples, and supplementary resources. They encourage readers to utilize this repository as a learning aid and a source of reference. [4-14]
    • Learn PyTorch Website: The sources introduce an online book version of the course, accessible through a website, offering a readable format for revisiting course content and exploring additional chapters that cover more advanced topics, including transfer learning, model experiment tracking, and paper replication. [1, 4, 5, 7, 11, 15-30]
    • Course Q&A Forum: The sources acknowledge the importance of community support and encourage readers to utilize a dedicated Q&A forum, possibly on GitHub, to seek assistance from instructors and fellow learners. [4, 8, 11, 15]
    • Encouraging Active Exploration of Definitions: The sources recommend that readers proactively research definitions of key deep learning concepts, such as deep learning and neural networks. They suggest using resources like Google Search and Wikipedia to explore various interpretations and develop a personal understanding of these concepts. They prioritize hands-on work over rote memorization of definitions. [1, 2]
    • Structured Approach to the Course: The sources suggest a structured approach to navigating the course materials, presenting them in numerical order for ease of comprehension. They acknowledge that alternative learning paths exist but recommend following the numerical sequence for clarity. [31]
    • Exercises, Extra Curriculum, and Documentation Reading: The sources emphasize the significance of hands-on practice and provide exercises designed to reinforce the concepts covered in the course. They also highlight the availability of extra curriculum materials for those seeking to deepen their understanding. Additionally, they encourage readers to actively engage with the PyTorch documentation to familiarize themselves with its structure and content. [6, 10, 12, 13, 16, 18-21, 23, 24, 28-30, 32-34]

    This section of the sources focuses on directing readers towards valuable learning resources and fostering a mindset of continuous learning in the dynamic field of deep learning. They provide specific recommendations for accessing course materials, leveraging the PyTorch documentation, engaging with the community, and exploring definitions of key concepts. They also encourage active participation in exercises, exploration of extra curriculum content, and familiarization with the PyTorch documentation to enhance practical skills and deepen understanding.

    Introducing the Coding Environment: Pages 281-290

    The sources transition from theoretical discussion and resource navigation to a more hands-on approach, guiding readers through setting up their coding environment and introducing Google Colab as the primary tool for the course.

    • Shifting to Hands-On Coding: The sources signal a shift in focus toward practical coding exercises, encouraging readers to actively participate and write code alongside the instructions. They emphasize the importance of getting involved with hands-on work rather than solely focusing on theoretical definitions.
    • Introducing Google Colab: The sources introduce Google Colab, a cloud-based Jupyter notebook environment, as the primary tool for coding throughout the course. They suggest that using Colab facilitates a consistent learning experience and removes the need for local installations and setup, allowing readers to focus on learning PyTorch. They recommend using Colab as the preferred method for following along with the course materials.
    • Advantages of Google Colab: The sources highlight the benefits of using Google Colab, including its accessibility, ease of use, and collaborative features. Colab provides a pre-configured environment with necessary libraries and dependencies already installed, simplifying the setup process for readers. Its cloud-based nature allows access from various devices and facilitates code sharing and collaboration.
    • Navigating the Colab Interface: The sources guide readers through the basic functionality of Google Colab, demonstrating how to create new notebooks, run code cells, and access various features within the Colab environment. They introduce essential commands, such as torch.__version__ and torchvision.__version__, for checking the versions of installed libraries (a short sketch follows this list).
    • Creating and Running Code Cells: The sources demonstrate how to create new code cells within Colab notebooks and execute Python code within these cells. They illustrate the use of print() statements to display output and introduce the concept of importing necessary libraries, such as torch for PyTorch functionality.
    • Checking Library Versions: The sources emphasize the importance of ensuring compatibility between PyTorch and its associated libraries. They demonstrate how to check the versions of installed libraries, such as torch and torchvision, using commands like torch.__version__ and torchvision.__version__. This step ensures that readers are using compatible versions for the upcoming code examples and exercises.
    • Emphasizing Hands-On Learning: The sources reiterate their preference for hands-on learning and a code-centric approach, stating that they will prioritize coding together rather than spending extensive time on slides or theoretical explanations.
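
    A version check like the one described above takes only a few lines; the printed values below are examples and will vary with the installed versions:

    import torch
    import torchvision

    print(torch.__version__)        # e.g. '2.1.0+cu118' (output will vary)
    print(torchvision.__version__)  # e.g. '0.16.0+cu118'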

    This section of the sources marks a transition from theoretical discussions and resource exploration to a more hands-on coding approach. They introduce Google Colab as the primary coding environment for the course, highlighting its benefits and demonstrating its basic functionality. The sources guide readers through creating code cells, running Python code, and checking library versions to ensure compatibility. By focusing on practical coding examples, the sources encourage readers to actively participate in the learning process and reinforce their understanding of PyTorch concepts.

    Setting the Stage for Classification: Pages 291-300

    The sources shift focus to classification problems, a fundamental task in machine learning, and begin by explaining the core concepts of binary, multi-class, and multi-label classification, providing examples to illustrate each type. They then delve into the specifics of binary and multi-class classification, setting the stage for building classification models in PyTorch.

    • Introducing Classification Problems: The sources introduce classification as a key machine learning task where the goal is to categorize data into predefined classes or categories. They differentiate between various types of classification problems:
    • Binary Classification: Involves classifying data into one of two possible classes. Examples include:
    • Image Classification: Determining whether an image contains a cat or a dog.
    • Spam Detection: Classifying emails as spam or not spam.
    • Fraud Detection: Identifying fraudulent transactions from legitimate ones.
    • Multi-Class Classification: Deals with classifying data into one of multiple (more than two) classes. Examples include:
    • Image Recognition: Categorizing images into different object classes, such as cars, bicycles, and pedestrians.
    • Handwritten Digit Recognition: Classifying handwritten digits into the numbers 0 through 9.
    • Natural Language Processing: Assigning text documents to specific topics or categories.
    • Multi-Label Classification: Involves assigning multiple labels to a single data point. Examples include:
    • Image Tagging: Assigning multiple tags to an image, such as “beach,” “sunset,” and “ocean.”
    • Text Classification: Categorizing documents into multiple relevant topics.
    • Understanding the ImageNet Dataset: The sources reference the ImageNet dataset, a large-scale dataset commonly used in computer vision research, as an example of multi-class classification. They point out that ImageNet contains thousands of object categories, making it a challenging dataset for multi-class classification tasks.
    • Illustrating Multi-Label Classification with Wikipedia: The sources use a Wikipedia article about deep learning as an example of multi-label classification. They point out that the article has multiple categories assigned to it, such as “deep learning,” “artificial neural networks,” and “artificial intelligence,” demonstrating that a single data point (the article) can have multiple labels.
    • Real-World Examples of Classification: The sources provide relatable examples from everyday life to illustrate different classification scenarios:
    • Photo Categorization: Modern smartphone cameras often automatically categorize photos based on their content, such as “people,” “food,” or “landscapes.”
    • Email Filtering: Email services frequently categorize emails into folders like “primary,” “social,” or “promotions,” performing a multi-class classification task.
    • Focusing on Binary and Multi-Class Classification: The sources acknowledge the existence of other types of classification but choose to focus on binary and multi-class classification for the remainder of the section. They indicate that these two types are fundamental and provide a strong foundation for understanding more complex classification scenarios.

    This section of the sources sets the stage for exploring classification problems in PyTorch. They introduce different types of classification, providing examples and real-world applications to illustrate each type. The sources emphasize the importance of understanding binary and multi-class classification as fundamental building blocks for more advanced classification tasks. By providing clear definitions, examples, and a structured approach, the sources prepare readers to build and train classification models using PyTorch.

    Building a Binary Classification Model with PyTorch: Pages 301-310

    The sources begin the practical implementation of a binary classification model using PyTorch. They guide readers through generating a synthetic dataset, exploring its characteristics, and visualizing it to gain insights into the data before proceeding to model building.

    • Generating a Synthetic Dataset with make_circles: The sources introduce the make_circles function from the sklearn.datasets module to create a synthetic dataset for binary classification. This function generates a dataset with two concentric circles, each representing a different class. The sources provide a code example using make_circles to generate 1000 samples, storing the features in the variable X and the corresponding labels in the variable y (see the sketch after this list). They note the common convention of using a capital X for a matrix of features and a lowercase y for the vector of labels.
    • Exploring the Dataset: The sources guide readers through exploring the characteristics of the generated dataset:
    • Examining the First Five Samples: The sources provide code to display the first five samples of both features (X) and labels (Y) using array slicing. They use print() statements to display the output, encouraging readers to visually inspect the data.
    • Formatting for Clarity: The sources emphasize the importance of presenting data in a readable format. They use a dictionary to structure the data, mapping feature names (X1 and X2) to the corresponding values and including the label (Y). This structured format enhances the readability and interpretation of the data.
    • Visualizing the Data: The sources highlight the importance of visualizing data, especially in classification tasks. They emphasize the data explorer’s motto: “visualize, visualize, visualize.” They point out that while patterns might not be evident from numerical data alone, visualization can reveal underlying structures and relationships.
    • Visualizing with Matplotlib: The sources introduce Matplotlib, a popular Python plotting library, for visualizing the generated dataset. They provide a code example using plt.scatter() to create a scatter plot of the data, with different colors representing the two classes. The visualization reveals the circular structure of the data, with one class forming an inner circle and the other class forming an outer circle. This visual representation provides a clear understanding of the dataset’s characteristics and the challenge posed by the binary classification task.
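
    A minimal sketch of the dataset-creation and visualization steps described above, assuming scikit-learn and Matplotlib are installed; the noise level and random_state are illustrative choices, not values taken from the course:

    import torch
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_circles

    # 1000 samples arranged in two concentric circles
    X, y = make_circles(n_samples=1000, noise=0.03, random_state=42)
    print(X[:5], y[:5])  # inspect the first five samples and labels

    # convert to tensors for later use with PyTorch
    X = torch.from_numpy(X).type(torch.float)
    y = torch.from_numpy(y).type(torch.float)

    # visualize, visualize, visualize
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdYlBu)
    plt.show()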

    This section of the sources marks the beginning of hands-on model building with PyTorch. They start by generating a synthetic dataset using make_circles, allowing for controlled experimentation and a clear understanding of the data’s structure. They guide readers through exploring the dataset’s characteristics, both numerically and visually. The use of Matplotlib to visualize the data reinforces the importance of understanding data patterns before proceeding to model development. By emphasizing the data explorer’s motto, the sources encourage readers to actively engage with the data and gain insights that will inform their subsequent modeling choices.

    Exploring Model Architecture and PyTorch Fundamentals: Pages 311-320

    The sources proceed with building a simple neural network model using PyTorch, introducing key components like layers, neurons, activation functions, and matrix operations. They guide readers through understanding the model’s architecture, emphasizing the connection between the code and its visual representation. They also highlight PyTorch’s role in handling computations and the importance of visualizing the network’s structure.

    • Creating a Simple Neural Network Model: The sources guide readers through creating a basic neural network model in PyTorch. They introduce the concept of layers, representing different stages of computation in the network, and neurons, the individual processing units within each layer. They provide code to construct a model with:
    • An Input Layer: Takes in two features, corresponding to the X1 and X2 features from the generated dataset.
    • A Hidden Layer: Consists of five neurons, introducing the idea of hidden layers for learning complex patterns.
    • An Output Layer: Produces a single output, suitable for binary classification.
    • Relating Code to Visual Representation: The sources emphasize the importance of understanding the connection between the code and its visual representation. They encourage readers to visualize the network’s structure, highlighting the flow of data through the input, hidden, and output layers. This visualization clarifies how the network processes information and makes predictions.
    • PyTorch’s Role in Computation: The sources explain that while they write the code to define the model’s architecture, PyTorch handles the underlying computations. PyTorch takes care of matrix operations, activation functions, and other mathematical processes involved in training and using the model.
    • Illustrating Network Structure with torch.nn.Linear: The sources use the torch.nn.Linear module to create the layers in the neural network. They provide code examples demonstrating how to define the input and output dimensions for each layer, emphasizing that the output of one layer becomes the input to the subsequent layer.
    • Understanding Input and Output Shapes: The sources emphasize the significance of input and output shapes in neural networks. They explain that the input shape corresponds to the number of features in the data, while the output shape depends on the type of problem. In this case, the binary classification model has an output shape of one: a single logit that becomes the probability of the positive class once a sigmoid activation is applied.
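
    A minimal sketch of the architecture just described (two input features, a five-neuron hidden layer, one output). Using nn.Sequential here is an assumption for brevity; the dummy batch simply confirms the shapes line up:

    import torch
    from torch import nn

    model = nn.Sequential(
        nn.Linear(in_features=2, out_features=5),  # input layer -> hidden layer
        nn.Linear(in_features=5, out_features=1),  # hidden layer -> output layer
    )

    dummy_batch = torch.randn(8, 2)   # 8 samples, 2 features each
    print(model(dummy_batch).shape)   # torch.Size([8, 1])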

    This section of the sources introduces readers to the fundamental concepts of building neural networks in PyTorch. They guide through creating a simple binary classification model, explaining the key components like layers, neurons, and activation functions. The sources emphasize the importance of visualizing the network’s structure and understanding the connection between the code and its visual representation. They highlight PyTorch’s role in handling computations and guide readers through defining the input and output shapes for each layer, ensuring the model’s structure aligns with the dataset and the classification task. By combining code examples with clear explanations, the sources provide a solid foundation for building and understanding neural networks in PyTorch.

    Setting up for Success: Approaching the PyTorch Deep Learning Course: Pages 321-330

    The sources transition from the specifics of model architecture to a broader discussion about navigating the PyTorch deep learning course effectively. They emphasize the importance of active learning, self-directed exploration, and leveraging available resources to enhance understanding and skill development.

    • Embracing Google and Exploration: The sources advocate for active learning and encourage learners to “Google it.” They suggest that encountering unfamiliar concepts or terms should prompt learners to independently research and explore, using search engines like Google to delve deeper into the subject matter. This approach fosters a self-directed learning style and encourages learners to go beyond the course materials.
    • Prioritizing Hands-On Experience: The sources stress the significance of hands-on experience over theoretical definitions. They acknowledge that while definitions are readily available online, the focus of the course is on practical implementation and building models. They encourage learners to prioritize coding and experimentation to solidify their understanding of PyTorch.
    • Utilizing Wikipedia for Definitions: The sources specifically recommend Wikipedia as a reliable resource for looking up definitions. They recognize Wikipedia’s comprehensive and well-maintained content, suggesting it as a valuable tool for learners seeking clear and accurate explanations of technical terms.
    • Structuring the Course for Effective Learning: The sources outline a structured approach to the course, breaking down the content into manageable modules and emphasizing a sequential learning process. They introduce the concept of “chapters” as distinct units of learning, each covering specific topics and building upon previous knowledge.
    • Encouraging Questions and Discussion: The sources foster an interactive learning environment, encouraging learners to ask questions and engage in discussions. They highlight the importance of seeking clarification and sharing insights with instructors and peers to enhance the learning experience. They recommend utilizing online platforms, such as GitHub discussion pages, for asking questions and engaging in course-related conversations.
    • Providing Course Materials on GitHub: The sources ensure accessibility to course materials by making them readily available on GitHub. They specify the repository where learners can access code, notebooks, and other resources used throughout the course. They also mention “learnpytorch.io” as an alternative location where learners can find an online, readable book version of the course content.

    This section of the sources provides guidance on approaching the PyTorch deep learning course effectively. The sources encourage a self-directed learning style, emphasizing the importance of active exploration, independent research, and hands-on experimentation. They recommend utilizing online resources, including search engines and Wikipedia, for in-depth understanding and advocate for engaging in discussions and seeking clarification. By outlining a structured approach, providing access to comprehensive course materials, and fostering an interactive learning environment, the sources aim to equip learners with the necessary tools and mindset for a successful PyTorch deep learning journey.

    Navigating Course Resources and Documentation: Pages 331-340

    The sources guide learners on how to effectively utilize the course resources and navigate PyTorch documentation to enhance their learning experience. They emphasize the importance of referring to the materials provided on GitHub, engaging in Q&A sessions, and familiarizing oneself with the structure and features of the online book version of the course.

    • Identifying Key Resources: The sources highlight three primary resources for the PyTorch course:
    • Materials on GitHub: The sources specify a GitHub repository (“mrdbourke/pytorch-deep-learning” [1]) as the central location for accessing course materials, including outlines, code, notebooks, and additional resources. This repository serves as a comprehensive hub for learners to find everything they need to follow along with the course. They note that the repository is a work in progress [1] but assure users that its organization will remain largely the same [1].
    • Course Q&A: The sources emphasize the importance of asking questions and seeking clarification throughout the learning process. They encourage learners to utilize the designated Q&A platform, likely a forum or discussion board, to post their queries and engage with instructors and peers. This interactive component of the course fosters a collaborative learning environment and provides a valuable avenue for resolving doubts and gaining insights.
    • Course Online Book (learnpytorch.io): The sources recommend referring to the online book version of the course, accessible at “learnpytorch.io” [2, 3]. This platform offers a structured and readable format for the course content, presenting the material in a more organized and comprehensive manner compared to the video lectures. The online book provides learners with a valuable resource to reinforce their understanding and revisit concepts in a more detailed format.
    • Navigating the Online Book: The sources describe the key features of the online book platform, highlighting its user-friendly design and functionality:
    • Readable Format and Search Functionality: The online book presents the course content in a clear and easily understandable format, making it convenient for learners to review and grasp the material. Additionally, the platform offers search functionality, enabling learners to quickly locate specific topics or concepts within the book. This feature enhances the book’s usability and allows learners to efficiently find the information they need.
    • Structured Headings and Images: The online book utilizes structured headings and includes relevant images to organize and illustrate the content effectively. The use of headings breaks down the material into logical sections, improving readability and comprehension. The inclusion of images provides visual aids to complement the textual explanations, further enhancing understanding and engagement.

    This section of the sources focuses on guiding learners on how to effectively utilize the various resources provided for the PyTorch deep learning course. The sources emphasize the importance of accessing the materials on GitHub, actively engaging in Q&A sessions, and utilizing the online book version of the course to supplement learning. By describing the structure and features of these resources, the sources aim to equip learners with the knowledge and tools to navigate the course effectively, enhance their understanding of PyTorch, and ultimately succeed in their deep learning journey.

    Deep Dive into PyTorch Tensors: Pages 341-350

    The sources shift focus to PyTorch tensors, the fundamental data structure for working with numerical data in PyTorch. They explain how to create tensors using various methods and introduce essential tensor operations like indexing, reshaping, and stacking. The sources emphasize the significance of tensors in deep learning, highlighting their role in representing data and performing computations. They also stress the importance of understanding tensor shapes and dimensions for effective manipulation and model building.

    • Introducing the torch.nn Module: The sources introduce the torch.nn module as the core component for building neural networks in PyTorch. They explain that torch.nn provides a collection of classes and functions for defining and working with various layers, activation functions, and loss functions. They highlight that almost everything in PyTorch relies on torch.tensor as the foundational data structure.
    • Creating PyTorch Tensors: The sources provide a practical introduction to creating PyTorch tensors using the torch.tensor function. They emphasize that this function serves as the primary method for creating tensors, which act as multi-dimensional arrays for storing and manipulating numerical data. They guide readers through basic examples, illustrating how to create tensors from lists of values.
    • Encouraging Exploration of PyTorch Documentation: The sources consistently encourage learners to explore the official PyTorch documentation for in-depth understanding and reference. They specifically recommend spending at least 10 minutes reviewing the documentation for torch.tensor after completing relevant video tutorials. This practice fosters familiarity with PyTorch’s functionalities and encourages a self-directed learning approach.
    • Exploring the torch.arange Function: The sources introduce the torch.arange function for generating tensors containing a sequence of evenly spaced values within a specified range. They provide code examples demonstrating how to use torch.arange to create tensors similar to Python’s built-in range function. They also explain the function’s parameters, including start, end, and step, allowing learners to control the sequence generation. (This and the other tensor operations below are demonstrated in the sketch after this list.)
    • Highlighting Deprecated Functions: The sources point out that certain PyTorch functions, like torch.range, may become deprecated over time as the library evolves. They inform learners about such deprecations and recommend using updated functions like torch.arange as alternatives. This awareness ensures learners are using the most current and recommended practices.
    • Addressing Tensor Shape Compatibility in Reshaping: The sources discuss the concept of shape compatibility when reshaping tensors using the torch.reshape function. They emphasize that the new shape specified for the tensor must be compatible with the original number of elements in the tensor. They provide examples illustrating both compatible and incompatible reshaping scenarios, explaining the potential errors that may arise when incompatibility occurs. They also note that encountering and resolving errors during coding is a valuable learning experience, promoting problem-solving skills.
    • Understanding Tensor Stacking with torch.stack: The sources introduce the torch.stack function for combining multiple tensors along a new dimension. They explain that stacking effectively concatenates tensors, creating a higher-dimensional tensor. They guide readers through code examples, demonstrating how to use torch.stack to combine tensors and control the stacking dimension using the dim parameter. They also reference the torch.stack documentation, encouraging learners to review it for a comprehensive understanding of the function’s usage.
    • Illustrating Tensor Permutation with torch.permute: The sources delve into the torch.permute function for rearranging the dimensions of a tensor. They explain that permuting changes the order of axes in a tensor, effectively reshaping it without altering the underlying data. They provide code examples demonstrating how to use torch.permute to change the order of dimensions, illustrating the transformation of tensor shape. They also connect this concept to real-world applications, particularly in image processing, where permuting can be used to rearrange color channels, height, and width dimensions.
    • Explaining Random Seed for Reproducibility: The sources address the importance of setting a random seed for reproducibility in deep learning experiments. They introduce the concept of pseudo-random number generators and explain how setting a random seed ensures consistent results when working with random processes. They link to PyTorch documentation for further exploration of random number generation and the role of random seeds.
    • Providing Guidance on Exercises and Curriculum: The sources transition to discussing exercises and additional curriculum for learners to solidify their understanding of PyTorch fundamentals. They refer to the “PyTorch fundamentals notebook,” which likely contains a collection of exercises and supplementary materials for learners to practice the concepts covered in the course. They recommend completing these exercises to reinforce learning and gain hands-on experience. They also mention that each chapter in the online book concludes with exercises and extra curriculum, providing learners with ample opportunities for practice and exploration.
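
    The tensor operations discussed above can be tried in a few lines; all shapes and values here are illustrative:

    import torch

    x = torch.arange(start=0, end=10, step=1)  # tensor([0, 1, ..., 9])

    x_reshaped = x.reshape(2, 5)   # 10 elements -> compatible shape (2, 5)
    # x.reshape(3, 4) would raise a RuntimeError: 12 slots for 10 elements

    stacked = torch.stack([x, x, x], dim=0)  # shape (3, 10)

    image = torch.rand(224, 224, 3)    # (height, width, colour channels)
    permuted = image.permute(2, 0, 1)  # -> (channels, height, width)

    torch.manual_seed(42)              # set the random seed
    a = torch.rand(2)
    torch.manual_seed(42)              # reset it to reproduce the same draw
    b = torch.rand(2)
    print(a == b)                      # tensor([True, True])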

    This section focuses on introducing PyTorch tensors, a fundamental concept in deep learning, and providing practical examples of tensor manipulation using functions like torch.arange, torch.reshape, and torch.stack. The sources encourage learners to refer to PyTorch documentation for comprehensive understanding and highlight the significance of tensors in representing data and performing computations. By combining code demonstrations with explanations and real-world connections, the sources equip learners with a solid foundation for working with tensors in PyTorch.

    Working with Loss Functions and Optimizers in PyTorch: Pages 351-360

    The sources transition to a discussion of loss functions and optimizers, crucial components of the training process for neural networks in PyTorch. They explain that loss functions measure the difference between model predictions and actual target values, guiding the optimization process towards minimizing this difference. They introduce different types of loss functions suitable for various machine learning tasks, such as binary classification and multi-class classification, highlighting their specific applications and characteristics. The sources emphasize the significance of selecting an appropriate loss function based on the nature of the problem and the desired model output. They also explain the role of optimizers in adjusting model parameters to reduce the calculated loss, introducing common optimizer choices like Stochastic Gradient Descent (SGD) and Adam, each with its unique approach to parameter updates.

    • Understanding Binary Cross Entropy Loss: The sources introduce binary cross entropy loss as a commonly used loss function for binary classification problems, where the model predicts one of two possible classes. They note that PyTorch provides multiple implementations of binary cross entropy loss, including torch.nn.BCELoss and torch.nn.BCEWithLogitsLoss. They highlight a key distinction: torch.nn.BCELoss requires inputs to have already passed through the sigmoid activation function, while torch.nn.BCEWithLogitsLoss incorporates the sigmoid activation internally, offering enhanced numerical stability. The sources emphasize the importance of understanding these differences and selecting the appropriate implementation based on the model’s structure and activation functions (the two implementations are compared in the sketch after this list).
    • Exploring Loss Functions and Optimizers for Diverse Problems: The sources emphasize that PyTorch offers a wide range of loss functions and optimizers suitable for various machine learning problems beyond binary classification. They recommend referring to the online book version of the course for a comprehensive overview and code examples of different loss functions and optimizers applicable to diverse tasks. This comprehensive resource aims to equip learners with the knowledge to select appropriate components for their specific machine learning applications.
    • Outlining the Training Loop Steps: The sources outline the key steps involved in a typical training loop for a neural network:
    1. Forward Pass: Input data is fed through the model to obtain predictions.
    2. Loss Calculation: The difference between predictions and actual target values is measured using the chosen loss function.
    3. Optimizer Zeroing Gradients: Accumulated gradients from previous iterations are reset to zero.
    4. Backpropagation: Gradients of the loss function with respect to model parameters are calculated, indicating the direction and magnitude of parameter adjustments needed to minimize the loss.
    5. Optimizer Step: Model parameters are updated based on the calculated gradients and the optimizer’s update rule.
    • Applying Sigmoid Activation for Binary Classification: The sources emphasize the importance of applying the sigmoid activation function to the raw output (logits) of a binary classification model before making predictions. They explain that the sigmoid function transforms the logits into a probability value between 0 and 1, representing the model’s confidence in each class.
    • Illustrating Tensor Rounding and Dimension Squeezing: The sources demonstrate the use of torch.round to round tensor values to the nearest integer, often used for converting predicted probabilities into class labels in binary classification. They also explain the use of torch.squeeze to remove singleton dimensions from tensors, ensuring compatibility for operations requiring specific tensor shapes.
    • Structuring Training Output for Clarity: The sources highlight the practice of organizing training output to enhance clarity and monitor progress. They suggest printing relevant metrics like epoch number, loss, and accuracy at regular intervals, allowing users to track the model’s learning progress over time.
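
    The distinction between the two binary cross entropy implementations, along with the rounding and squeezing operations mentioned above, can be checked directly; the logits and targets below are made-up values:

    import torch
    from torch import nn

    logits = torch.tensor([1.5, -0.5, 0.25])  # raw model outputs (illustrative)
    targets = torch.tensor([1.0, 0.0, 1.0])

    # BCEWithLogitsLoss applies the sigmoid internally (more numerically stable)
    loss_a = nn.BCEWithLogitsLoss()(logits, targets)

    # BCELoss expects probabilities, so the sigmoid must be applied first
    probs = torch.sigmoid(logits)
    loss_b = nn.BCELoss()(probs, targets)
    print(torch.isclose(loss_a, loss_b))  # tensor(True)

    labels = torch.round(probs)               # probabilities -> class labels
    print(torch.rand(3, 1).squeeze().shape)   # (3, 1) -> torch.Size([3])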

    This section introduces the concepts of loss functions and optimizers in PyTorch, emphasizing their importance in the training process. It guides learners on choosing suitable loss functions based on the problem type and provides insights into common optimizer choices. By explaining the steps involved in a typical training loop and showcasing practical code examples, the sources aim to equip learners with a solid understanding of how to train neural networks effectively in PyTorch.

    Building and Evaluating a PyTorch Model: Pages 361-370

    The sources transition to the practical application of the previously introduced concepts, guiding readers through the process of building, training, and evaluating a PyTorch model for a specific task. They emphasize the importance of structuring code clearly and organizing output for better understanding and analysis. The sources highlight the iterative nature of model development, involving multiple steps of training, evaluation, and refinement.

    • Defining a Simple Linear Model: The sources provide a code example demonstrating how to define a simple linear model in PyTorch using torch.nn.Linear. They explain that this model takes a specified number of input features and produces a corresponding number of output features, performing a linear transformation on the input data. They stress that while this simple model may not be suitable for complex tasks, it serves as a foundational example for understanding the basics of building neural networks in PyTorch.
    • Emphasizing Visualization in Data Exploration: The sources reiterate the importance of visualization in data exploration, encouraging readers to represent data visually to gain insights and understand patterns. They advocate for the “data explorer’s motto: visualize, visualize, visualize,” suggesting that visualizing data helps users become more familiar with its structure and characteristics, aiding in the model development process.
    • Preparing Data for Model Training: The sources outline the steps involved in preparing data for model training, which often includes splitting data into training and testing sets. They explain that the training set is used to train the model, while the testing set is used to evaluate its performance on unseen data. They introduce a simple method for splitting data based on a predetermined index and mention the popular scikit-learn library’s train_test_split function as a more robust method for random data splitting (see the sketch at the end of this list). They highlight that data splitting ensures that the model’s ability to generalize to new data is assessed accurately.
    • Creating a Training Loop: The sources provide a code example demonstrating the creation of a training loop, a fundamental component of training neural networks. The training loop iterates over the training data for a specified number of epochs, performing the steps outlined previously: forward pass, loss calculation, optimizer zeroing gradients, backpropagation, and optimizer step. They emphasize that one epoch represents a complete pass through the entire training dataset, and they provide guidance on customizing the training loop, such as printing out loss and other metrics at specific intervals to monitor training progress.
    • Visualizing Loss and Parameter Convergence: The sources encourage visualizing the loss function’s value over epochs to observe its convergence, indicating the model’s learning progress. They also suggest tracking changes in model parameters (weights and bias) to understand how they adjust during training to minimize the loss. The sources highlight that these visualizations provide valuable insights into the training process and help users assess the model’s effectiveness.
    • Understanding the Concept of Overfitting: The sources introduce the concept of overfitting, a common challenge in machine learning, where a model performs exceptionally well on the training data but poorly on unseen data. They explain that overfitting occurs when the model learns the training data too well, capturing noise and irrelevant patterns that hinder its ability to generalize. They mention that techniques like early stopping, regularization, and data augmentation can mitigate overfitting, promoting better model generalization.
    • Evaluating Model Performance: The sources guide readers through evaluating a trained model’s performance using the testing set, data that the model has not seen during training. They calculate the loss on the testing set to assess how well the model generalizes to new data. They emphasize the importance of evaluating the model on data separate from the training set to obtain an unbiased estimate of its real-world performance. They also introduce the idea of visualizing model predictions alongside the ground truth data (actual labels) to gain qualitative insights into the model’s behavior.
    • Saving and Loading a Trained Model: The sources highlight the significance of saving a trained PyTorch model to preserve its learned parameters for future use. They provide a code example demonstrating how to save the model’s state dictionary, which contains the trained weights and biases, using torch.save. They also show how to load a saved model using torch.load, enabling users to reuse trained models without retraining.
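
    Two of the steps above, the random train/test split and saving/loading the state dictionary, can be sketched as follows; the 80/20 split, file name, toy data, and model are illustrative assumptions:

    import torch
    from torch import nn
    from sklearn.model_selection import train_test_split

    X = torch.randn(1000, 2)  # toy features and labels
    y = torch.randn(1000, 1)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    model = nn.Linear(2, 1)

    # save only the learned parameters (the state dict), not the whole object
    torch.save(model.state_dict(), "model_0.pth")

    # to reuse: recreate the architecture, then load the saved parameters
    loaded_model = nn.Linear(2, 1)
    loaded_model.load_state_dict(torch.load("model_0.pth"))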

    This section guides readers through the practical steps of building, training, and evaluating a simple linear model in PyTorch. The sources emphasize visualization as a key aspect of data exploration and model understanding. By combining code examples with clear explanations and introducing essential concepts like overfitting and model evaluation, the sources equip learners with a practical foundation for building and working with neural networks in PyTorch.

    Understanding Neural Networks and PyTorch Resources: Pages 371-380

    The sources shift focus to neural networks, providing a conceptual understanding and highlighting resources for further exploration. They encourage active learning by posing challenges to readers, prompting them to apply their knowledge and explore concepts independently. The sources also emphasize the practical aspects of learning PyTorch, advocating for a hands-on approach with code over theoretical definitions.

    • Encouraging Exploration of Neural Network Definitions: The sources acknowledge the abundance of definitions for neural networks available online and encourage readers to formulate their own understanding by exploring various sources. They suggest engaging with external resources like Google searches and Wikipedia to broaden their knowledge and develop a personal definition of neural networks.
    • Recommending a Hands-On Approach to Learning: The sources advocate for a hands-on approach to learning PyTorch, emphasizing the importance of practical experience over theoretical definitions. They prioritize working with code and experimenting with different concepts to gain a deeper understanding of the framework.
    • Presenting Key PyTorch Resources: The sources introduce valuable resources for learning PyTorch, including:
    • GitHub Repository: A repository containing all course materials, including code examples, notebooks, and supplementary resources.
    • Course Q&A: A dedicated platform for asking questions and seeking clarification on course content.
    • Online Book: A comprehensive online book version of the course, providing in-depth explanations and code examples.
    • Highlighting Benefits of the Online Book: The sources highlight the advantages of the online book version of the course, emphasizing its user-friendly features:
    • Searchable Content: Users can easily search for specific topics or keywords within the book.
    • Interactive Elements: The book incorporates interactive elements, allowing users to engage with the content more dynamically.
    • Comprehensive Material: The book covers a wide range of PyTorch concepts and provides in-depth explanations.
    • Demonstrating PyTorch Documentation Usage: The sources demonstrate how to effectively utilize PyTorch documentation, emphasizing its value as a reference guide. They showcase examples of searching for specific functions within the documentation, highlighting the clear explanations and usage examples provided.
    • Addressing Common Errors in Deep Learning: The sources acknowledge that shape errors are common in deep learning, emphasizing the importance of understanding tensor shapes and dimensions for successful model implementation. They provide examples of shape errors encountered during code demonstrations, illustrating how mismatched tensor dimensions can lead to errors. They encourage users to pay close attention to tensor shapes and use debugging techniques to identify and resolve such issues.
    • Introducing the Concept of Tensor Stacking: The sources introduce the concept of tensor stacking using torch.stack, explaining its functionality in concatenating a sequence of tensors along a new dimension. They clarify the dim parameter, which specifies the dimension along which the stacking operation is performed. They provide code examples demonstrating the usage of torch.stack and its impact on tensor shapes, emphasizing its utility in combining tensors effectively.
    • Explaining Tensor Permutation: The sources explain tensor permutation as a method for rearranging the dimensions of a tensor using torch.permute. They emphasize that permuting a tensor changes how the data is viewed without altering the underlying data itself. They illustrate the concept with an example of permuting a tensor representing color channels, height, and width of an image, highlighting how the permutation operation reorders these dimensions while preserving the image data.
    • Introducing Indexing on Tensors: The sources introduce the concept of indexing on tensors, a fundamental operation for accessing specific elements or subsets of data within a tensor. They present a challenge to readers, asking them to practice indexing on a given tensor to extract specific values (a worked sketch follows this list). This exercise aims to reinforce the understanding of tensor indexing and its practical application.
    • Explaining Random Seed and Random Number Generation: The sources explain the concept of a random seed in the context of random number generation, highlighting its role in controlling the reproducibility of random processes. They mention that setting a random seed ensures that the same sequence of random numbers is generated each time the code is executed, enabling consistent results for debugging and experimentation. They provide external resources, such as documentation links, for those interested in delving deeper into random number generation concepts in computing.
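
    A short, illustrative sketch of the indexing challenge and the random-seed behaviour described above; the tensor shape and values are assumptions:

    import torch

    x = torch.arange(1, 10).reshape(1, 3, 3)
    print(x[0])         # the inner 3x3 matrix
    print(x[0][0])      # first row: tensor([1, 2, 3])
    print(x[0][2][2])   # bottom-right element: tensor(9)
    print(x[:, :, 1])   # middle column of every row: tensor([[2, 5, 8]])

    torch.manual_seed(42)   # with a fixed seed...
    print(torch.rand(2))    # ...this prints the same values on every run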

    This section transitions from general concepts of neural networks to practical aspects of using PyTorch, highlighting valuable resources for further exploration and emphasizing a hands-on learning approach. By demonstrating documentation usage, addressing common errors, and introducing tensor manipulation techniques like stacking, permutation, and indexing, the sources equip learners with essential tools for working effectively with PyTorch.

    Building a Model with PyTorch: Pages 381-390

    The sources guide readers through building a more complex model in PyTorch, introducing the concept of subclassing nn.Module to create custom model architectures. They highlight the importance of understanding the PyTorch workflow, which involves preparing data, defining a model, selecting a loss function and optimizer, training the model, making predictions, and evaluating performance. The sources emphasize that while the steps involved remain largely consistent across different tasks, understanding the nuances of each step and how they relate to the specific problem being addressed is crucial for effective model development.

    • Introducing the nn.Module Class: The sources explain that in PyTorch, neural network models are built by subclassing the nn.Module class, which provides a structured framework for defining model components and their interactions. They highlight that this approach offers flexibility and organization, enabling users to create custom architectures tailored to specific tasks.
    • Defining a Custom Model Architecture: The sources provide a code example demonstrating how to define a custom model architecture by subclassing nn.Module. They emphasize the key components of a model definition (sketched after this list):
    • Constructor (__init__): This method initializes the model’s layers and other components.
    • Forward Pass (forward): This method defines how the input data flows through the model’s layers during the forward propagation step.
    • Understanding PyTorch Building Blocks: The sources explain that PyTorch provides a rich set of building blocks for neural networks, contained within the torch.nn module. They highlight that nn contains various layers, activation functions, loss functions, and other components essential for constructing neural networks.
    • Illustrating the Flow of Data Through a Model: The sources visually illustrate the flow of data through the defined model, using diagrams to represent the input features, hidden layers, and output. They explain that the input data is passed through a series of linear transformations (nn.Linear layers) and activation functions, ultimately producing an output that corresponds to the task being addressed.
    • Creating a Training Loop with Multiple Epochs: The sources demonstrate how to create a training loop that iterates over the training data for a specified number of epochs, performing the steps involved in training a neural network: forward pass, loss calculation, optimizer zeroing gradients, backpropagation, and optimizer step. They highlight the importance of training for multiple epochs to allow the model to learn from the data iteratively and adjust its parameters to minimize the loss function.
    • Observing Loss Reduction During Training: The sources show the output of the training loop, emphasizing how the loss value decreases over epochs, indicating that the model is learning from the data and improving its performance. They explain that this decrease in loss signifies that the model’s predictions are becoming more aligned with the actual labels.
    • Emphasizing Visual Inspection of Data: The sources reiterate the importance of visualizing data, advocating for visually inspecting the data before making predictions. They highlight that understanding the data’s characteristics and patterns is crucial for informed model development and interpretation of results.
    • Preparing Data for Visualization: The sources guide readers through preparing data for visualization, including splitting it into training and testing sets and organizing it into appropriate data structures. They mention using libraries like matplotlib to create visual representations of the data, aiding in data exploration and understanding.
    • Introducing the torch.no_grad Context: The sources introduce the concept of the torch.no_grad context, explaining its role in performing computations without tracking gradients. They highlight that this context is particularly useful during model evaluation or inference, where gradient calculations are not required, leading to more efficient computation.
    • Defining a Testing Loop: The sources guide readers through defining a testing loop, similar to the training loop, which iterates over the testing data to evaluate the model’s performance on unseen data. They emphasize the importance of evaluating the model on data separate from the training set to obtain an unbiased assessment of its ability to generalize. They outline the steps involved in the testing loop: performing a forward pass, calculating the loss, and accumulating relevant metrics like loss and accuracy.
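
    A minimal sketch of subclassing nn.Module and of evaluating without gradient tracking; the class name, layer sizes, and dummy batch are illustrative:

    import torch
    from torch import nn

    class CircleModel(nn.Module):
        def __init__(self):
            super().__init__()
            # the constructor initializes the model's layers
            self.layer_1 = nn.Linear(in_features=2, out_features=5)
            self.layer_2 = nn.Linear(in_features=5, out_features=1)

        def forward(self, x):
            # forward() defines how data flows through the layers
            return self.layer_2(self.layer_1(x))

    model = CircleModel()

    # evaluation/inference without gradient tracking
    model.eval()
    with torch.no_grad():
        preds = model(torch.randn(4, 2))
    print(preds.shape)  # torch.Size([4, 1])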

    The sources provide a comprehensive walkthrough of building and training a more sophisticated neural network model in PyTorch. They emphasize the importance of understanding the PyTorch workflow, from data preparation to model evaluation, and highlight the flexibility and organization offered by subclassing nn.Module to create custom model architectures. They continue to stress the value of visual inspection of data and encourage readers to explore concepts like data visualization and model evaluation in detail.

    Building and Evaluating Models in PyTorch: Pages 391-400

    The sources focus on training and evaluating a regression model in PyTorch, emphasizing the iterative nature of model development and improvement. They guide readers through the process of building a simple model, training it, evaluating its performance, and identifying areas for potential enhancements. They introduce the concept of non-linearity in neural networks, explaining how the addition of non-linear activation functions can enhance a model’s ability to learn complex patterns.

    • Building a Regression Model with PyTorch: The sources provide a step-by-step guide to building a simple regression model using PyTorch. They showcase the creation of a model with linear layers (nn.Linear), illustrating how to define the input and output dimensions of each layer. They emphasize that for regression tasks, the output layer typically has a single output unit representing the predicted value.
    • Creating a Training Loop for Regression: The sources demonstrate how to create a training loop specifically for regression tasks. They outline the familiar steps involved: forward pass, loss calculation, optimizer zeroing gradients, backpropagation, and optimizer step. They emphasize that the loss function used for regression differs from classification tasks, typically employing mean squared error (MSE) or similar metrics to measure the difference between predicted and actual values (see the sketch at the end of this list).
    • Observing Loss Reduction During Regression Training: The sources show the output of the training loop for the regression model, highlighting how the loss value decreases over epochs, indicating that the model is learning to predict the target values more accurately. They explain that this decrease in loss signifies that the model’s predictions are converging towards the actual values.
    • Evaluating the Regression Model: The sources guide readers through evaluating the trained regression model. They emphasize the importance of using a separate testing dataset to assess the model’s ability to generalize to unseen data. They outline the steps involved in evaluating the model on the testing set, including performing a forward pass, calculating the loss, and accumulating metrics.
    • Visualizing Regression Model Predictions: The sources advocate for visualizing the predictions of the regression model, explaining that visual inspection can provide valuable insights into the model’s performance and potential areas for improvement. They suggest plotting the predicted values against the actual values, allowing users to assess how well the model captures the underlying relationship in the data.
    • Introducing Non-Linearities in Neural Networks: The sources introduce the concept of non-linearity in neural networks, explaining that real-world data often exhibits complex, non-linear relationships. They highlight that incorporating non-linear activation functions into neural network models can significantly enhance their ability to learn and represent these intricate patterns. They mention activation functions like ReLU (Rectified Linear Unit) as common choices for introducing non-linearity.
    • Encouraging Experimentation with Non-Linearities: The sources encourage readers to experiment with different non-linear activation functions, explaining that the choice of activation function can impact model performance. They suggest trying various activation functions and observing their effects on the model’s ability to learn from the data and make accurate predictions.
    • Highlighting the Role of Hyperparameters: The sources emphasize that various components of a neural network, such as the number of layers, number of units in each layer, learning rate, and activation functions, are hyperparameters that can be adjusted to influence model performance. They encourage experimentation with different hyperparameter settings to find optimal configurations for specific tasks.
    • Demonstrating the Impact of Adding Layers: The sources visually demonstrate the effect of adding more layers to a neural network model, explaining that increasing the model’s depth can enhance its ability to learn complex representations. They show how a deeper model, compared to a shallower one, can better capture the intricacies of the data and make more accurate predictions.
    • Illustrating the Addition of ReLU Activation Functions: The sources provide a visual illustration of incorporating ReLU activation functions into a neural network model. They show how ReLU introduces non-linearity by applying a thresholding operation to the output of linear layers, enabling the model to learn non-linear decision boundaries and better represent complex relationships in the data.
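
    To make these steps concrete, here is a minimal sketch of a regression model and training loop, assuming a toy dataset where y = 0.7x + 0.3; the layer sizes, learning rate, and epoch count are illustrative choices, not values from the source:

    import torch
    from torch import nn

    # Toy regression data: one input feature, one target value per sample
    X = torch.arange(0, 1, 0.02).unsqueeze(dim=1)
    y = 0.7 * X + 0.3

    # Two linear layers; the output layer has a single unit for regression
    model = nn.Sequential(
        nn.Linear(in_features=1, out_features=8),
        nn.Linear(in_features=8, out_features=1),
    )

    loss_fn = nn.MSELoss()  # mean squared error, a common regression loss
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(100):
        y_pred = model(X)          # forward pass
        loss = loss_fn(y_pred, y)  # calculate loss
        optimizer.zero_grad()      # zero accumulated gradients
        loss.backward()            # backpropagation
        optimizer.step()           # update parameters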

    This section guides readers through the process of building, training, and evaluating a regression model in PyTorch, emphasizing the iterative nature of model development. The sources highlight the importance of visualizing predictions and the role of non-linear activation functions in enhancing model capabilities. They encourage experimentation with different architectures and hyperparameters, fostering a deeper understanding of the factors influencing model performance and promoting a data-driven approach to model building.

    Working with Tensors and Data in PyTorch: Pages 401-410

    The sources guide readers through various aspects of working with tensors and data in PyTorch, emphasizing the fundamental role tensors play in deep learning computations. They introduce techniques for creating, manipulating, and understanding tensors, highlighting their importance in representing and processing data for neural networks.

    • Creating Tensors in PyTorch: The sources detail methods for creating tensors in PyTorch, focusing on the torch.arange() function. They explain that torch.arange() generates a tensor containing a sequence of evenly spaced values within a specified range. They provide code examples illustrating the use of torch.arange() with various parameters like start, end, and step to control the generated sequence (see the sketch after this list).
    • Understanding the Deprecation of torch.range(): The sources note that the torch.range() function, previously used for creating tensors with a range of values, has been deprecated in favor of torch.arange(). They encourage users to adopt torch.arange() for creating tensors containing sequences of values.
    • Exploring Tensor Shapes and Reshaping: The sources emphasize the significance of understanding tensor shapes in PyTorch, explaining that the shape of a tensor determines its dimensionality and the arrangement of its elements. They introduce the concept of reshaping tensors, using functions like torch.reshape() to modify a tensor’s shape while preserving its total number of elements. They provide code examples demonstrating how to reshape tensors to match specific requirements for various operations or layers in neural networks.
    • Stacking Tensors Together: The sources introduce the torch.stack() function, explaining its role in concatenating a sequence of tensors along a new dimension. They explain that torch.stack() takes a list of tensors as input and combines them into a higher-dimensional tensor, effectively stacking them together along a specified dimension. They illustrate the use of torch.stack() with code examples, highlighting how it can be used to combine multiple tensors into a single structure.
    • Permuting Tensor Dimensions: The sources explore the concept of permuting tensor dimensions, explaining that it involves rearranging the axes of a tensor. They introduce the torch.permute() function, which reorders the dimensions of a tensor according to specified indices. They demonstrate the use of torch.permute() with code examples, emphasizing its application in tasks like transforming image data from the format (Height, Width, Channels) to (Channels, Height, Width), which is often required by convolutional neural networks.
    • Visualizing Tensors and Their Shapes: The sources advocate for visualizing tensors and their shapes, explaining that visual inspection can aid in understanding the structure and arrangement of tensor data. They suggest using tools like matplotlib to create graphical representations of tensors, allowing users to better comprehend the dimensionality and organization of tensor elements.
    • Indexing and Slicing Tensors: The sources guide readers through techniques for indexing and slicing tensors, explaining how to access specific elements or sub-regions within a tensor. They demonstrate the use of square brackets ([]) for indexing tensors, illustrating how to retrieve elements based on their indices along various dimensions. They further explain how slicing allows users to extract a portion of a tensor by specifying start and end indices along each dimension. They provide code examples showcasing various indexing and slicing operations, emphasizing their role in manipulating and extracting data from tensors.
    • Introducing the Concept of Random Seeds: The sources introduce the concept of random seeds, explaining their significance in controlling the randomness in PyTorch operations that involve random number generation. They explain that setting a random seed ensures that the same sequence of random numbers is generated each time the code is run, promoting reproducibility of results. They provide code examples demonstrating how to set a random seed using torch.manual_seed(), highlighting its importance in maintaining consistency during model training and experimentation.
    • Exploring the torch.rand() Function: The sources explore the torch.rand() function, explaining its role in generating tensors filled with random numbers drawn from a uniform distribution between 0 and 1. They provide code examples demonstrating the use of torch.rand() to create tensors of various shapes filled with random values.
    • Discussing Running Tensors and GPUs: The sources introduce the concept of running tensors on GPUs (Graphics Processing Units), explaining that GPUs offer significant computational advantages for deep learning tasks compared to CPUs. They highlight that PyTorch provides mechanisms for transferring tensors to and from GPUs, enabling users to leverage GPU acceleration for training and inference.
    • Emphasizing Documentation and Extra Resources: The sources consistently encourage readers to refer to the PyTorch documentation for detailed information on functions, modules, and concepts. They also highlight the availability of supplementary resources, including online tutorials, blog posts, and research papers, to enhance understanding and provide deeper insights into various aspects of PyTorch.
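
    The tensor techniques above fit into a few lines; a short sketch with illustrative shapes and values:

    import torch

    torch.manual_seed(42)                  # fix the random seed for reproducibility

    x = torch.arange(0, 12)                # tensor([0, 1, ..., 11])
    x = x.reshape(3, 4)                    # 3 rows x 4 columns, same 12 elements
    stacked = torch.stack([x, x], dim=0)   # new leading dimension: shape (2, 3, 4)

    image = torch.rand(size=(28, 28, 3))   # fake image in (Height, Width, Channels)
    permuted = image.permute(2, 0, 1)      # rearranged to (Channels, Height, Width)

    first_row = x[0]                       # indexing: tensor([0, 1, 2, 3])
    middle_cols = x[:, 1:3]                # slicing: columns 1 and 2 of every row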

    This section guides readers through various techniques for working with tensors and data in PyTorch, highlighting the importance of understanding tensor shapes, reshaping, stacking, permuting, indexing, and slicing operations. They introduce concepts like random seeds and GPU acceleration, emphasizing the importance of leveraging available documentation and resources to enhance understanding and facilitate effective deep learning development using PyTorch.

    Constructing and Training Neural Networks with PyTorch: Pages 411-420

    The sources focus on building and training neural networks in PyTorch, specifically in the context of binary classification tasks. They guide readers through the process of creating a simple neural network architecture, defining a suitable loss function, setting up an optimizer, implementing a training loop, and evaluating the model’s performance on test data. They emphasize the use of activation functions, such as the sigmoid function, to introduce non-linearity into the network and enable it to learn complex decision boundaries.

    • Building a Neural Network for Binary Classification: The sources provide a step-by-step guide to constructing a neural network specifically for binary classification. They show the creation of a model with linear layers (nn.Linear) stacked sequentially, illustrating how to define the input and output dimensions of each layer. They emphasize that the output layer for binary classification tasks typically has a single output unit, representing the probability of the positive class.
    • Using the Sigmoid Activation Function: The sources introduce the sigmoid activation function, explaining its role in transforming the output of linear layers into a probability value between 0 and 1. They highlight that the sigmoid function introduces non-linearity into the network, allowing it to model complex relationships between input features and the target class.
    • Creating a Training Loop for Binary Classification: The sources demonstrate the implementation of a training loop tailored for binary classification tasks. They outline the familiar steps involved: a forward pass to generate predictions, loss calculation, zeroing the optimizer’s gradients, backpropagation to calculate gradients, and an optimizer step to update model parameters (see the sketch after this list).
    • Understanding Binary Cross-Entropy Loss: The sources explain the concept of binary cross-entropy loss, a common loss function used for binary classification tasks. They describe how binary cross-entropy loss measures the difference between the predicted probabilities and the true labels, guiding the model to learn to make accurate predictions.
    • Calculating Accuracy for Binary Classification: The sources demonstrate how to calculate accuracy for binary classification tasks. They show how to convert the model’s predicted probabilities into binary predictions using a threshold (typically 0.5), comparing these predictions to the true labels to determine the percentage of correctly classified instances.
    • Evaluating the Model on Test Data: The sources emphasize the importance of evaluating the trained model on a separate testing dataset to assess its ability to generalize to unseen data. They outline the steps involved in testing the model, including performing a forward pass on the test data, calculating the loss, and computing the accuracy.
    • Plotting Predictions and Decision Boundaries: The sources advocate for visualizing the model’s predictions and decision boundaries, explaining that visual inspection can provide valuable insights into the model’s behavior and performance. They suggest using plotting techniques to display the decision boundary learned by the model, illustrating how the model separates data points belonging to different classes.
    • Using Helper Functions to Simplify Code: The sources introduce the use of helper functions to organize and streamline the code for training and evaluating the model. They demonstrate how to encapsulate repetitive tasks, such as plotting predictions or calculating accuracy, into reusable functions, improving code readability and maintainability.
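
    A condensed sketch of these pieces with made-up features and labels; note the sources discuss a sigmoid output with binary cross-entropy, while this sketch uses PyTorch’s numerically safer nn.BCEWithLogitsLoss, which combines the two:

    import torch
    from torch import nn

    X = torch.rand(100, 2)                       # fake features
    y = (X.sum(dim=1) > 1).float().unsqueeze(1)  # fake binary labels, shape (100, 1)

    model = nn.Sequential(
        nn.Linear(in_features=2, out_features=5),
        nn.Linear(in_features=5, out_features=1),  # single logit for binary classification
    )

    loss_fn = nn.BCEWithLogitsLoss()             # binary cross-entropy on raw logits
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(100):
        logits = model(X)
        loss = loss_fn(logits, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    probs = torch.sigmoid(model(X))          # sigmoid maps logits to probabilities
    preds = (probs > 0.5).float()            # threshold at 0.5
    accuracy = (preds == y).float().mean()   # fraction classified correctly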

    This section guides readers through the construction and training of neural networks for binary classification in PyTorch. The sources emphasize the use of activation functions to introduce non-linearity, the choice of suitable loss functions and optimizers, the implementation of a training loop, and the evaluation of the model on test data. They highlight the importance of visualizing predictions and decision boundaries and introduce techniques for organizing code using helper functions.

    Exploring Non-Linearities and Multi-Class Classification in PyTorch: Pages 421-430

    The sources continue the exploration of neural networks, focusing on incorporating non-linearities using activation functions and expanding into multi-class classification. They guide readers through the process of enhancing model performance by adding non-linear activation functions, transitioning from binary classification to multi-class classification, choosing appropriate loss functions and optimizers, and evaluating model performance with metrics such as accuracy.

    • Incorporating Non-Linearity with Activation Functions: The sources emphasize the crucial role of non-linear activation functions in enabling neural networks to learn complex patterns and relationships within data. They introduce the ReLU (Rectified Linear Unit) activation function, highlighting its effectiveness and widespread use in deep learning. They explain that ReLU introduces non-linearity by setting negative values to zero and passing positive values unchanged. This simple yet powerful activation function allows neural networks to model non-linear decision boundaries and capture intricate data representations.
    • Understanding the Importance of Non-Linearity: The sources provide insights into the rationale behind incorporating non-linearity into neural networks. They explain that without non-linear activation functions, a neural network, regardless of its depth, would essentially behave as a single linear layer, severely limiting its ability to learn complex patterns. Non-linear activation functions, like ReLU, introduce bends and curves into the model’s decision boundaries, allowing it to capture non-linear relationships and make more accurate predictions.
    • Transitioning to Multi-Class Classification: The sources smoothly transition from binary classification to multi-class classification, where the task involves classifying data into more than two categories. They explain the key differences between binary and multi-class classification, highlighting the need for adjustments in the model’s output layer and the choice of loss function and activation function.
    • Using Softmax for Multi-Class Classification: The sources introduce the softmax activation function, commonly used in the output layer of multi-class classification models. They explain that softmax transforms the raw output scores (logits) of the network into a probability distribution over the different classes, ensuring that the predicted probabilities for all classes sum to one (see the sketch after this list).
    • Choosing an Appropriate Loss Function for Multi-Class Classification: The sources guide readers in selecting appropriate loss functions for multi-class classification. They discuss cross-entropy loss, a widely used loss function for multi-class classification tasks, explaining how it measures the difference between the predicted probability distribution and the true label distribution.
    • Implementing a Training Loop for Multi-Class Classification: The sources outline the steps involved in implementing a training loop for multi-class classification models. They demonstrate the familiar process of iterating through the training data in batches, performing a forward pass, calculating the loss, backpropagating to compute gradients, and updating the model’s parameters using an optimizer.
    • Evaluating Multi-Class Classification Models: The sources focus on evaluating the performance of multi-class classification models using metrics like accuracy. They explain that accuracy measures the percentage of correctly classified instances over the entire dataset, providing an overall assessment of the model’s predictive ability.
    • Visualizing Multi-Class Classification Results: The sources suggest visualizing the predictions and decision boundaries of multi-class classification models, emphasizing the importance of visual inspection for gaining insights into the model’s behavior and performance. They demonstrate techniques for plotting the decision boundaries learned by the model, showing how the model divides the feature space to separate data points belonging to different classes.
    • Highlighting the Interplay of Linear and Non-linear Functions: The sources emphasize the combined effect of linear transformations (performed by linear layers) and non-linear transformations (introduced by activation functions) in allowing neural networks to learn complex patterns. They explain that the interplay of linear and non-linear functions enables the model to capture intricate data representations and make accurate predictions across a wide range of tasks.
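
    A small sketch of the logits-to-probabilities step described above, with a made-up logits tensor for three classes:

    import torch
    from torch import nn

    logits = torch.tensor([[2.0, 0.5, -1.0]])  # raw model outputs for one sample
    probs = torch.softmax(logits, dim=1)       # probability distribution summing to 1
    pred_class = probs.argmax(dim=1)           # index of the most likely class

    loss_fn = nn.CrossEntropyLoss()            # expects raw logits, not probabilities
    target = torch.tensor([0])                 # true class index for the sample
    loss = loss_fn(logits, target)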

    This section guides readers through the process of incorporating non-linearity into neural networks using activation functions like ReLU and transitioning from binary to multi-class classification using the softmax activation function. The sources discuss the choice of appropriate loss functions for multi-class classification, demonstrate the implementation of a training loop, and highlight the importance of evaluating model performance using metrics like accuracy and visualizing decision boundaries to gain insights into the model’s behavior. They emphasize the critical role of combining linear and non-linear functions to enable neural networks to effectively learn complex patterns within data.

    Visualizing and Building Neural Networks for Multi-Class Classification: Pages 431-440

    The sources emphasize the importance of visualization in understanding data patterns and building intuition for neural network architectures. They guide readers through the process of visualizing data for multi-class classification, designing a simple neural network for this task, understanding input and output shapes, and selecting appropriate loss functions and optimizers. They introduce tools like PyTorch’s nn.Sequential container to structure models and highlight the flexibility of PyTorch for customizing neural networks.

    • Visualizing Data for Multi-Class Classification: The sources advocate for visualizing data before building models, especially for multi-class classification. They illustrate the use of scatter plots to display data points with different colors representing different classes. This visualization helps identify patterns, clusters, and potential decision boundaries that a neural network could learn.
    • Designing a Neural Network for Multi-Class Classification: The sources demonstrate the construction of a simple neural network for multi-class classification using PyTorch’s nn.Sequential container, which allows for a streamlined definition of the model’s architecture by stacking layers in a sequential order. They show how to define linear layers (nn.Linear) with appropriate input and output dimensions based on the number of features and the number of classes in the dataset (see the sketch after this list).
    • Determining Input and Output Shapes: The sources guide readers in determining the input and output shapes for the different layers of the neural network. They explain that the input shape of the first layer is determined by the number of features in the dataset, while the output shape of the last layer corresponds to the number of classes. The input and output shapes of intermediate layers can be adjusted to control the network’s capacity and complexity. They highlight the importance of ensuring that the input and output dimensions of consecutive layers are compatible for a smooth flow of data through the network.
    • Selecting Loss Functions and Optimizers: The sources discuss the importance of choosing appropriate loss functions and optimizers for multi-class classification. They explain the concept of cross-entropy loss, a commonly used loss function for this type of classification task, and discuss its role in guiding the model to learn to make accurate predictions. They also mention optimizers like Stochastic Gradient Descent (SGD), highlighting their role in updating the model’s parameters to minimize the loss function.
    • Using PyTorch’s nn Module for Neural Network Components: The sources emphasize the use of PyTorch’s nn module, which contains building blocks for constructing neural networks. They specifically demonstrate the use of nn.Linear for creating linear layers and nn.Sequential for structuring the model by combining multiple layers in a sequential manner. They highlight that PyTorch offers a vast array of modules within the nn package for creating diverse and sophisticated neural network architectures.
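
    A sketch of such a model in nn.Sequential form, assuming 2 input features and 4 classes; the hidden size is an arbitrary choice:

    import torch
    from torch import nn

    NUM_FEATURES, NUM_CLASSES, HIDDEN_UNITS = 2, 4, 8  # illustrative sizes

    model = nn.Sequential(
        nn.Linear(in_features=NUM_FEATURES, out_features=HIDDEN_UNITS),
        nn.Linear(in_features=HIDDEN_UNITS, out_features=HIDDEN_UNITS),
        nn.Linear(in_features=HIDDEN_UNITS, out_features=NUM_CLASSES),  # one output per class
    )

    loss_fn = nn.CrossEntropyLoss()                          # multi-class loss
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # stochastic gradient descent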

    This section encourages the use of visualization to gain insights into data patterns for multi-class classification and guides readers in designing simple neural networks for this task. The sources emphasize the importance of understanding and setting appropriate input and output shapes for the different layers of the network and provide guidance on selecting suitable loss functions and optimizers. They showcase PyTorch’s flexibility and its powerful nn module for constructing neural network architectures.

    Building a Multi-Class Classification Model: Pages 441-450

    The sources continue the discussion of multi-class classification, focusing on designing a neural network architecture and creating a custom MultiClassClassification model in PyTorch. They guide readers through the process of defining the input and output shapes of each layer based on the number of features and classes in the dataset, constructing the model using PyTorch’s nn.Linear and nn.Sequential modules, and testing the data flow through the model with a forward pass. They emphasize the importance of understanding how the shape of data changes as it passes through the different layers of the network.

    • Defining the Neural Network Architecture: The sources present a structured approach to designing a neural network architecture for multi-class classification. They outline the key components of the architecture:
    • Input layer shape: Determined by the number of features in the dataset.
    • Hidden layers: Allow the network to learn complex relationships within the data. The number of hidden layers and the number of neurons (hidden units) in each layer can be customized to control the network’s capacity and complexity.
    • Output layer shape: Corresponds to the number of classes in the dataset. Each output neuron represents a different class.
    • Output activation: Typically uses the softmax function for multi-class classification. Softmax transforms the network’s output scores (logits) into a probability distribution over the classes, ensuring that the predicted probabilities sum to one.
    • Creating a Custom MultiClassClassification Model in PyTorch: The sources guide readers in implementing a custom MultiClassClassification model using PyTorch. They demonstrate how to define the model class, inheriting from PyTorch’s nn.Module, and how to structure the model using nn.Sequential to stack layers in a sequential manner (see the sketch after this list).
    • Using nn.Linear for Linear Transformations: The sources explain the use of nn.Linear for creating linear layers in the neural network. nn.Linear applies a linear transformation to the input data, calculating a weighted sum of the input features and adding a bias term. The weights and biases are the learnable parameters of the linear layer that the network adjusts during training to make accurate predictions.
    • Testing Data Flow Through the Model: The sources emphasize the importance of testing the data flow through the model to ensure that the input and output shapes of each layer are compatible. They demonstrate how to perform a forward pass with dummy data to verify that data can successfully pass through the network without encountering shape errors.
    • Troubleshooting Shape Issues: The sources provide tips for troubleshooting shape issues, highlighting the significance of paying attention to the error messages that PyTorch provides. Error messages related to shape mismatches often provide clues about which layers or operations need adjustments to ensure compatibility.
    • Visualizing Shape Changes with Print Statements: The sources suggest using print statements within the model’s forward method to display the shape of the data as it passes through each layer. This visual inspection helps confirm that data transformations are occurring as expected and aids in identifying and resolving shape-related issues.
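
    A sketch of such a custom class with print statements tracing shapes; the class name matches the source, but the layer count and sizes are illustrative:

    import torch
    from torch import nn

    class MultiClassClassification(nn.Module):
        def __init__(self, input_features, output_features, hidden_units=8):
            super().__init__()
            self.linear_layer_stack = nn.Sequential(
                nn.Linear(input_features, hidden_units),
                nn.Linear(hidden_units, output_features),
            )

        def forward(self, x):
            print(f"Input shape: {x.shape}")    # visualize shape changes
            out = self.linear_layer_stack(x)
            print(f"Output shape: {out.shape}")
            return out

    model = MultiClassClassification(input_features=2, output_features=4)
    dummy = torch.rand(5, 2)  # dummy batch: 5 samples, 2 features
    _ = model(dummy)          # forward pass to verify the data flow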

    This section guides readers through the process of designing and implementing a multi-class classification model in PyTorch. The sources emphasize the importance of understanding input and output shapes for each layer, utilizing PyTorch’s nn.Linear for linear transformations, using nn.Sequential for structuring the model, and verifying the data flow with a forward pass. They provide tips for troubleshooting shape issues and encourage the use of print statements to visualize shape changes, facilitating a deeper understanding of the model’s architecture and behavior.

    Training and Evaluating the Multi-Class Classification Model: Pages 451-460

    The sources shift focus to the practical aspects of training and evaluating the multi-class classification model in PyTorch. They guide readers through creating a training loop, setting up an optimizer and loss function, implementing a testing loop to evaluate model performance on unseen data, and calculating accuracy as a performance metric. The sources emphasize the iterative nature of model training, involving forward passes, loss calculation, backpropagation, and parameter updates using an optimizer.

    • Creating a Training Loop in PyTorch: The sources emphasize the importance of a training loop in machine learning, which is the process of iteratively training a model on a dataset. They guide readers in creating a training loop in PyTorch, incorporating the following key steps:
    1. Iterating over epochs: An epoch represents one complete pass through the entire training dataset. The number of epochs determines how many times the model will see the training data during the training process.
    2. Iterating over batches: The training data is typically divided into smaller batches to make the training process more manageable and efficient. Each batch contains a subset of the training data.
    3. Performing a forward pass: Passing the input data (a batch of data) through the model to generate predictions.
    4. Calculating the loss: Comparing the model’s predictions to the true labels to quantify how well the model is performing. This comparison is done using a loss function, such as cross-entropy loss for multi-class classification.
    5. Performing backpropagation: Calculating gradients of the loss function with respect to the model’s parameters. These gradients indicate how much each parameter contributes to the overall error.
    6. Updating model parameters: Adjusting the model’s parameters (weights and biases) using an optimizer, such as Stochastic Gradient Descent (SGD). The optimizer uses the calculated gradients to update the parameters in a direction that minimizes the loss function.
    • Setting up an Optimizer and Loss Function: The sources demonstrate how to set up an optimizer and a loss function in PyTorch. They explain that optimizers play a crucial role in updating the model’s parameters to minimize the loss function during training. They showcase the use of the Adam optimizer (torch.optim.Adam), a popular optimization algorithm for deep learning. For the loss function, they use the cross-entropy loss (nn.CrossEntropyLoss), a common choice for multi-class classification tasks (both appear in the sketch after this list).
    • Evaluating Model Performance with a Testing Loop: The sources guide readers in creating a testing loop in PyTorch to evaluate the trained model’s performance on unseen data (the test dataset). The testing loop follows a similar structure to the training loop but without the backpropagation and parameter update steps. It involves performing a forward pass on the test data, calculating the loss, and often using additional metrics like accuracy to assess the model’s generalization capability.
    • Calculating Accuracy as a Performance Metric: The sources introduce accuracy as a straightforward metric for evaluating classification model performance. Accuracy measures the proportion of correctly classified samples in the test dataset, providing a simple indication of how well the model generalizes to unseen data.
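
    A compact sketch of both loops using synthetic stand-in data; the dataset, layer sizes, and epoch count are all illustrative:

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    X, y = torch.rand(200, 2), torch.randint(0, 4, (200,))  # fake 4-class data
    train_dataloader = DataLoader(TensorDataset(X[:160], y[:160]), batch_size=32, shuffle=True)
    test_dataloader = DataLoader(TensorDataset(X[160:], y[160:]), batch_size=32)

    model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 4))
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    for epoch in range(3):                         # iterate over epochs
        model.train()
        for X_batch, y_batch in train_dataloader:  # iterate over batches
            y_pred = model(X_batch)                # forward pass
            loss = loss_fn(y_pred, y_batch)        # calculate the loss
            optimizer.zero_grad()                  # reset gradients
            loss.backward()                        # backpropagation
            optimizer.step()                       # update parameters

        model.eval()
        with torch.inference_mode():               # no gradient tracking while testing
            correct = total = 0
            for X_batch, y_batch in test_dataloader:
                test_pred = model(X_batch)
                correct += (test_pred.argmax(dim=1) == y_batch).sum().item()
                total += len(y_batch)
        print(f"Epoch {epoch}: test accuracy = {correct / total:.3f}")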

    This section emphasizes the importance of the training loop, which iteratively improves the model’s performance by adjusting its parameters based on the calculated loss. It guides readers through implementing the training loop in PyTorch, setting up an optimizer and loss function, creating a testing loop to evaluate model performance, and calculating accuracy as a basic performance metric for classification tasks.

    Refining and Improving Model Performance: Pages 461-470

    The sources guide readers through various strategies for refining and improving the performance of the multi-class classification model. They cover techniques like adjusting the learning rate, experimenting with different optimizers, exploring the concept of nonlinear activation functions, and understanding the idea of running tensors on a Graphics Processing Unit (GPU) for faster training. They emphasize that model improvement in machine learning often involves experimentation, trial-and-error, and a systematic approach to evaluating and comparing different model configurations.

    • Adjusting the Learning Rate: The sources emphasize the importance of the learning rate in the training process. They explain that the learning rate controls the size of the steps the optimizer takes when updating model parameters during backpropagation. A high learning rate may lead to the model missing the optimal minimum of the loss function, while a very low learning rate can cause slow convergence, making the training process unnecessarily lengthy. The sources suggest experimenting with different learning rates to find an appropriate balance between speed and convergence.
    • Experimenting with Different Optimizers: The sources highlight the importance of choosing an appropriate optimizer for training neural networks. They mention that different optimizers use different strategies for updating model parameters based on the calculated gradients, and some optimizers might be more suitable than others for specific problems or datasets. The sources encourage readers to experiment with various optimizers available in PyTorch, such as Stochastic Gradient Descent (SGD), Adam, and RMSprop, to observe their impact on model performance.
    • Introducing Nonlinear Activation Functions: The sources introduce the concept of nonlinear activation functions and their role in enhancing the capacity of neural networks. They explain that linear layers alone can only model linear relationships within the data, limiting the complexity of patterns the model can learn. Nonlinear activation functions, applied to the outputs of linear layers, introduce nonlinearities into the model, enabling it to learn more complex relationships and capture nonlinear patterns in the data. The sources mention the sigmoid activation function as an example, but PyTorch offers a variety of nonlinear activation functions within the nn module.
    • Utilizing GPUs for Faster Training: The sources touch on the concept of running PyTorch tensors on a GPU (Graphics Processing Unit) to significantly speed up the training process. GPUs are specialized hardware designed for parallel computations, making them particularly well-suited for the matrix operations involved in deep learning. By utilizing a GPU, training times can be significantly reduced, allowing for faster experimentation and model development (a device-agnostic sketch follows this list).
    • Improving a Model: The sources discuss the iterative process of improving a machine learning model, highlighting that model development rarely produces optimal results on the first attempt. They suggest a systematic approach involving the following:
    • Starting simple: Beginning with a simpler model architecture and gradually increasing complexity if needed.
    • Experimenting with hyperparameters: Tuning parameters like learning rate, batch size, and the number of hidden layers to find an optimal configuration.
    • Evaluating and comparing results: Carefully analyzing the model’s performance on the training and test datasets, using metrics like loss and accuracy to assess its effectiveness and generalization capabilities.
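
    The device-agnostic pattern for the GPU point above, plus the one-line optimizer swap, in a brief sketch; the layer sizes are placeholders:

    import torch
    from torch import nn

    device = "cuda" if torch.cuda.is_available() else "cpu"  # use a GPU when present

    model = nn.Linear(2, 4).to(device)  # move the model's parameters to the device
    X = torch.rand(8, 2).to(device)     # data must live on the same device as the model
    output = model(X)

    # Trying a different optimizer is a one-line change:
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    # optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    # optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001)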

    This section guides readers in exploring various strategies for refining and improving the multi-class classification model. The sources emphasize the importance of adjusting the learning rate, experimenting with different optimizers, introducing nonlinear activation functions for enhanced model capacity, and leveraging GPUs for faster training. They underscore the iterative nature of model improvement, encouraging readers to adopt a systematic approach involving experimentation, hyperparameter tuning, and thorough evaluation.

    Please note that specific recommendations about optimal learning rates or best optimizers for a given problem may vary depending on the dataset, model architecture, and other factors. These aspects often require experimentation and a deeper understanding of the specific machine learning problem being addressed.

    Exploring the PyTorch Workflow and Model Evaluation: Pages 471-480

    The sources guide readers through crucial aspects of the PyTorch workflow, focusing on saving and loading trained models, understanding common choices for loss functions and optimizers, and exploring additional classification metrics beyond accuracy. They delve into the concept of a confusion matrix as a valuable tool for evaluating classification models, providing deeper insights into the model’s performance across different classes. The sources advocate for a holistic approach to model evaluation, emphasizing that multiple metrics should be considered to gain a comprehensive understanding of a model’s strengths and weaknesses.

    • Saving and Loading Trained PyTorch Models: The sources emphasize the importance of saving trained models in PyTorch. They demonstrate the process of saving a model’s state dictionary, which contains the learned parameters (weights and biases), using torch.save(). They also showcase loading a saved state dictionary with torch.load() and restoring it into a model with load_state_dict(), enabling users to reuse trained models for inference or further training (see the sketch after this list).
    • Common Choices for Loss Functions and Optimizers: The sources present a table summarizing common choices for loss functions and optimizers in PyTorch, specifically tailored for binary and multi-class classification tasks. They provide brief descriptions of each loss function and optimizer, highlighting key characteristics and situations where they are commonly used. For binary classification, they mention the Binary Cross Entropy Loss (nn.BCELoss) and the Stochastic Gradient Descent (SGD) optimizer as common choices. For multi-class classification, they mention the Cross Entropy Loss (nn.CrossEntropyLoss) and the Adam optimizer.
    • Exploring Additional Classification Metrics: The sources introduce additional classification metrics beyond accuracy, emphasizing the importance of considering multiple metrics for a comprehensive evaluation. They touch on precision, recall, the F1 score, confusion matrices, and classification reports as valuable tools for assessing model performance, particularly when dealing with imbalanced datasets or situations where different types of errors carry different weights.
    • Constructing and Interpreting a Confusion Matrix: The sources introduce the confusion matrix as a powerful tool for visualizing the performance of a classification model. They explain that a confusion matrix displays the counts (or proportions) of correctly and incorrectly classified instances for each class. The rows of the matrix typically represent the true classes, while the columns represent the predicted classes, so each cell counts the instances of a given true class that received a given predicted label: diagonal cells hold correct predictions, and off-diagonal cells reveal specific misclassifications. The sources guide readers through creating a confusion matrix in PyTorch using the torchmetrics library, which provides a dedicated ConfusionMatrix class. They emphasize that confusion matrices offer valuable insights into:
    • True positives (TP): Correctly predicted positive instances.
    • True negatives (TN): Correctly predicted negative instances.
    • False positives (FP): Incorrectly predicted positive instances (Type I errors).
    • False negatives (FN): Incorrectly predicted negative instances (Type II errors).
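
    A sketch of the save/load pattern; the file name is arbitrary, and a plain nn.Sequential stands in for the trained model:

    import torch
    from torch import nn

    model = nn.Sequential(nn.Linear(2, 8), nn.Linear(8, 4))  # stand-in trained model
    MODEL_PATH = "multiclass_model.pth"                      # illustrative file name

    torch.save(model.state_dict(), MODEL_PATH)               # save learned parameters only

    loaded_model = nn.Sequential(nn.Linear(2, 8), nn.Linear(8, 4))  # same architecture
    loaded_model.load_state_dict(torch.load(MODEL_PATH))            # restore the parameters
    loaded_model.eval()                                             # evaluation mode for inference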

    This section highlights the practical steps of saving and loading trained PyTorch models, providing users with the ability to reuse trained models for different purposes. It presents common choices for loss functions and optimizers, aiding users in selecting appropriate configurations for their classification tasks. The sources expand the discussion on classification metrics, introducing additional measures like precision, recall, the F1 score, and the confusion matrix. They advocate for using a combination of metrics to gain a more nuanced understanding of model performance, particularly when addressing real-world problems where different types of errors have varying consequences.

    Visualizing and Evaluating Model Predictions: Pages 481-490

    The sources guide readers through the process of visualizing and evaluating the predictions made by the trained convolutional neural network (CNN) model. They emphasize the importance of going beyond overall accuracy and examining individual predictions to gain a deeper understanding of the model’s behavior and identify potential areas for improvement. The sources introduce techniques for plotting predictions visually, comparing model predictions to ground truth labels, and using a confusion matrix to assess the model’s performance across different classes.

    • Visualizing Model Predictions: The sources introduce techniques for visualizing model predictions on individual images from the test dataset. They suggest randomly sampling a set of images from the test dataset, obtaining the model’s predictions for these images, and then displaying both the images and their corresponding predicted labels. This approach allows for a qualitative assessment of the model’s performance, enabling users to visually inspect how well the model aligns with human perception.
    • Comparing Predictions to Ground Truth: The sources stress the importance of comparing the model’s predictions to the ground truth labels associated with the test images. By visually aligning the predicted labels with the true labels, users can quickly identify instances where the model makes correct predictions and instances where it errs. This comparison helps to pinpoint specific types of images or classes that the model might struggle with, providing valuable insights for further model refinement.
    • Creating a Confusion Matrix for Deeper Insights: The sources reiterate the value of a confusion matrix for evaluating classification models. They guide readers through creating a confusion matrix using libraries like torchmetrics and mlxtend, which offer tools for calculating and visualizing confusion matrices (see the sketch after this list). The confusion matrix provides a comprehensive overview of the model’s performance across all classes, highlighting the counts of true positives, true negatives, false positives, and false negatives. This visualization helps to identify classes that the model might be confusing, revealing patterns of misclassification that can inform further model development or data augmentation strategies.
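
    A sketch of the confusion-matrix workflow with torchmetrics and mlxtend; the predictions and labels are random stand-ins, and the task argument assumes a recent torchmetrics version:

    import torch
    from torchmetrics import ConfusionMatrix
    from mlxtend.plotting import plot_confusion_matrix

    targets = torch.randint(0, 10, (100,))  # stand-in true labels for 10 classes
    y_preds = torch.randint(0, 10, (100,))  # stand-in predicted labels

    confmat = ConfusionMatrix(task="multiclass", num_classes=10)
    confmat_tensor = confmat(preds=y_preds, target=targets)

    fig, ax = plot_confusion_matrix(
        conf_mat=confmat_tensor.numpy(),  # mlxtend expects a NumPy array
        figsize=(10, 7),
    )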

    This section guides readers through practical techniques for visualizing and evaluating the predictions made by the trained CNN model. The sources advocate for a multi-faceted evaluation approach, emphasizing the value of visually inspecting individual predictions, comparing them to ground truth labels, and utilizing a confusion matrix to analyze the model’s performance across all classes. By combining qualitative and quantitative assessment methods, users can gain a more comprehensive understanding of the model’s capabilities, identify its strengths and weaknesses, and glean insights for potential improvements.

    Getting Started with Computer Vision and Convolutional Neural Networks: Pages 491-500

    The sources introduce the field of computer vision and convolutional neural networks (CNNs), providing readers with an overview of key libraries, resources, and the basic concepts involved in building computer vision models with PyTorch. They guide readers through setting up the necessary libraries, understanding the structure of CNNs, and preparing to work with image datasets. The sources emphasize a hands-on approach to learning, encouraging readers to experiment with code and explore the concepts through practical implementation.

    • Essential Computer Vision Libraries in PyTorch: The sources present several essential libraries commonly used for computer vision tasks in PyTorch, highlighting their functionalities and roles in building and training CNNs:
    • Torchvision: This library serves as the core domain library for computer vision in PyTorch. It provides utilities for data loading, image transformations, pre-trained models, and more. Within torchvision, several sub-modules are particularly relevant:
    • datasets: This module offers a collection of popular computer vision datasets, including ImageNet, CIFAR10, CIFAR100, MNIST, and FashionMNIST, readily available for download and use in PyTorch.
    • models: This module contains a variety of pre-trained CNN architectures, such as ResNet, AlexNet, VGG, and Inception, which can be used directly for inference or fine-tuned for specific tasks.
    • transforms: This module provides a range of image transformations, including resizing, cropping, flipping, and normalization, which are crucial for preprocessing image data before feeding it into a CNN.
    • utils: This module offers helpful utilities for tasks like visualizing images, displaying model summaries, and saving and loading checkpoints.
    • Matplotlib: This versatile plotting library is essential for visualizing images, plotting training curves, and exploring data patterns in computer vision tasks.
    • Exploring Convolutional Neural Networks: The sources provide a high-level introduction to CNNs, explaining that they are specialized neural networks designed for processing data with a grid-like structure, such as images. They highlight the key components of a CNN (a minimal sketch follows this list):
    • Convolutional Layers: These layers apply a series of learnable filters (kernels) to the input image, extracting features like edges, textures, and patterns. The filters slide across the input image, performing convolutions to produce feature maps that highlight specific characteristics of the image.
    • Pooling Layers: These layers downsample the feature maps generated by convolutional layers, reducing their spatial dimensions while preserving important features. Pooling layers help to make the model more robust to variations in the position of features within the image.
    • Fully Connected Layers: These layers, often found in the final stages of a CNN, connect all the features extracted by the convolutional and pooling layers, enabling the model to learn complex relationships between these features and perform high-level reasoning about the image content.
    • Obtaining and Preparing Image Datasets: The sources guide readers through the process of obtaining image datasets for training computer vision models, emphasizing the importance of:
    • Choosing the right dataset: Selecting a dataset relevant to the specific computer vision task being addressed.
    • Understanding dataset structure: Familiarizing oneself with the organization of images and labels within the dataset, ensuring compatibility with PyTorch’s data loading mechanisms.
    • Preprocessing images: Applying necessary transformations to the images, such as resizing, cropping, normalization, and data augmentation, to prepare them for input into a CNN.
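
    A minimal sketch of these three layer types in sequence; the channel counts and sizes are illustrative, not a recommended architecture:

    import torch
    from torch import nn

    cnn = nn.Sequential(
        nn.Conv2d(in_channels=1, out_channels=10, kernel_size=3, padding=1),  # convolutional layer
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2),  # pooling layer halves height and width
        nn.Flatten(),                 # flatten feature maps for the linear layer
        nn.Linear(10 * 14 * 14, 10),  # fully connected layer: 10 output classes
    )

    dummy = torch.rand(1, 1, 28, 28)  # one fake grayscale 28x28 image
    print(cnn(dummy).shape)           # torch.Size([1, 10])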

    This section serves as a starting point for readers venturing into the world of computer vision and CNNs using PyTorch. The sources introduce essential libraries, resources, and basic concepts, equipping readers with the foundational knowledge and tools needed to begin building and training computer vision models. They highlight the structure of CNNs, emphasizing the roles of convolutional, pooling, and fully connected layers in processing image data. The sources stress the importance of selecting appropriate image datasets, understanding their structure, and applying necessary preprocessing steps to prepare the data for training.

    Getting Hands-on with the FashionMNIST Dataset: Pages 501-510

    The sources walk readers through the practical steps involved in working with the FashionMNIST dataset for image classification using PyTorch. They cover checking library versions, exploring the torchvision.datasets module, setting up the FashionMNIST dataset for training, understanding data loaders, and visualizing samples from the dataset. The sources emphasize the importance of familiarizing oneself with the dataset’s structure, accessing its elements, and gaining insights into the images and their corresponding labels.

    • Checking Library Versions for Compatibility: The sources recommend checking the versions of the PyTorch and torchvision libraries to ensure compatibility and leverage the latest features. They provide code snippets to display the version numbers of both libraries using torch.__version__ and torchvision.__version__. This step helps to avoid potential issues arising from version mismatches and ensures a smooth workflow.
    • Exploring the torchvision.datasets Module: The sources introduce the torchvision.datasets module as a valuable resource for accessing a variety of popular computer vision datasets. They demonstrate how to explore the available datasets within this module, providing examples like Caltech101, CIFAR100, CIFAR10, MNIST, FashionMNIST, and ImageNet. The sources explain that these datasets can be easily downloaded and loaded into PyTorch using dedicated functions within the torchvision.datasets module.
    • Setting Up the FashionMNIST Dataset: The sources guide readers through the process of setting up the FashionMNIST dataset for training an image classification model. They outline the following steps (see the sketch after this list):
    1. Importing Necessary Modules: Import the required modules from torchvision.datasets and torchvision.transforms.
    2. Downloading the Dataset: Download the FashionMNIST dataset using the FashionMNIST class from torchvision.datasets, specifying the desired root directory for storing the dataset.
    3. Applying Transformations: Apply transformations to the images using the transforms.Compose function. Common transformations include:
    • transforms.ToTensor(): Converts PIL images (common format for image data) to PyTorch tensors.
    • transforms.Normalize(): Normalizes the pixel values of the images, typically to a range of 0 to 1 or -1 to 1, which can help to improve model training.
    • Understanding Data Loaders: The sources introduce data loaders as an essential component for efficiently loading and iterating through datasets in PyTorch. They explain that data loaders provide several benefits:
    • Batching: They allow you to easily create batches of data, which is crucial for training models on large datasets that cannot be loaded into memory all at once.
    • Shuffling: They can shuffle the data between epochs, helping to prevent the model from memorizing the order of the data and improving its ability to generalize.
    • Parallel Loading: They support parallel loading of data, which can significantly speed up the training process.
    • Visualizing Samples from the Dataset: The sources emphasize the importance of visualizing samples from the dataset to gain a better understanding of the data being used for training. They provide code examples for iterating through a data loader, extracting image tensors and their corresponding labels, and displaying the images using matplotlib. This visual inspection helps to ensure that the data has been loaded and preprocessed correctly and can provide insights into the characteristics of the images within the dataset.
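
    A sketch of this setup end to end; the root directory is arbitrary, and only ToTensor() is applied here for brevity:

    import matplotlib.pyplot as plt
    from torchvision import datasets, transforms

    train_data = datasets.FashionMNIST(
        root="data",                      # where to store the dataset
        train=True,
        download=True,
        transform=transforms.ToTensor(),  # PIL image -> float tensor in [0, 1]
    )

    image, label = train_data[0]              # one (image, label) sample
    plt.imshow(image.squeeze(), cmap="gray")  # drop the channel dimension for plotting
    plt.title(train_data.classes[label])
    plt.show()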

    This section offers practical guidance on working with the FashionMNIST dataset for image classification. The sources emphasize the importance of checking library versions, exploring available datasets in torchvision.datasets, setting up the FashionMNIST dataset for training, understanding the role of data loaders, and visually inspecting samples from the dataset. By following these steps, readers can effectively load, preprocess, and visualize image data, laying the groundwork for building and training computer vision models.

    Mini-Batches and Building a Baseline Model with Linear Layers: Pages 511-520

    The sources introduce the concept of mini-batches in machine learning, explaining their significance in training models on large datasets. They guide readers through the process of creating mini-batches from the FashionMNIST dataset using PyTorch’s DataLoader class. The sources then demonstrate how to build a simple baseline model using linear layers for classifying images from the FashionMNIST dataset, highlighting the steps involved in setting up the model’s architecture, defining the input and output shapes, and performing a forward pass to verify data flow.

    • The Importance of Mini-Batches: The sources explain that mini-batches play a crucial role in training machine learning models, especially when dealing with large datasets. They break down the dataset into smaller, manageable chunks called mini-batches, which are processed by the model in each training iteration. Using mini-batches offers several advantages:
    • Efficient Memory Usage: Processing the entire dataset at once can overwhelm the computer’s memory, especially for large datasets. Mini-batches allow the model to work on smaller portions of the data, reducing memory requirements and making training feasible.
    • Faster Training: Updating the model’s parameters after each sample can be computationally expensive. Mini-batches enable the model to calculate gradients and update parameters based on a group of samples, leading to faster convergence and reduced training time.
    • Improved Generalization: Training on mini-batches introduces some randomness into the process, as the samples within each batch are shuffled. This randomness can help the model to learn more robust patterns and improve its ability to generalize to unseen data.
    • Creating Mini-Batches with DataLoader: The sources demonstrate how to create mini-batches from the FashionMNIST dataset using PyTorch’s DataLoader class. The DataLoader class provides a convenient way to iterate through the dataset in batches, handling shuffling, batching, and data loading automatically. It takes the dataset as input, along with the desired batch size and other optional parameters.
    • Building a Baseline Model with Linear Layers: The sources guide readers through the construction of a simple baseline model using linear layers for classifying images from the FashionMNIST dataset. They outline the following steps (a sketch follows this list):
    1. Defining the Model Architecture: The sources start by creating a class called LinearModel that inherits from nn.Module, which is the base class for all neural network modules in PyTorch. Within the class, they define the following layers:
    • A linear layer (nn.Linear) that takes the flattened input image (784 features, representing the 28×28 pixels of a FashionMNIST image) and maps it to a hidden layer with a specified number of units.
    • Another linear layer that maps the hidden layer to the output layer, producing a tensor of scores for each of the 10 classes in FashionMNIST.
    2. Setting Up the Input and Output Shapes: The sources emphasize the importance of aligning the input and output shapes of the linear layers to ensure proper data flow through the model. They specify the input features and output features for each linear layer based on the dataset’s characteristics and the desired number of hidden units.
    3. Performing a Forward Pass: The sources demonstrate how to perform a forward pass through the model using a randomly generated tensor. This step verifies that the data flows correctly through the layers and helps to confirm the expected output shape. They print the output tensor and its shape, providing insights into the model’s behavior.
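
    A sketch of the DataLoader and baseline model together; the batch size and hidden-unit count are illustrative:

    import torch
    from torch import nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    train_data = datasets.FashionMNIST(root="data", train=True, download=True,
                                       transform=transforms.ToTensor())
    train_dataloader = DataLoader(train_data, batch_size=32, shuffle=True)  # mini-batches of 32

    class LinearModel(nn.Module):
        def __init__(self, hidden_units=10):
            super().__init__()
            self.layers = nn.Sequential(
                nn.Flatten(),                      # (batch, 1, 28, 28) -> (batch, 784)
                nn.Linear(28 * 28, hidden_units),
                nn.Linear(hidden_units, 10),       # 10 FashionMNIST classes
            )

        def forward(self, x):
            return self.layers(x)

    model = LinearModel()
    X_batch, y_batch = next(iter(train_dataloader))
    print(model(X_batch).shape)  # torch.Size([32, 10]), one score per class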

    This section introduces the concept of mini-batches and their importance in machine learning, providing practical guidance on creating mini-batches from the FashionMNIST dataset using PyTorch’s DataLoader class. It then demonstrates how to build a simple baseline model using linear layers for classifying images, highlighting the steps involved in defining the model architecture, setting up the input and output shapes, and verifying data flow through a forward pass. This foundation prepares readers for building more complex convolutional neural networks for image classification tasks.

    Training and Evaluating a Linear Model on the FashionMNIST Dataset: Pages 521-530

    The sources guide readers through the process of training and evaluating the previously built linear model on the FashionMNIST dataset, focusing on creating a training loop, setting up a loss function and an optimizer, calculating accuracy, and implementing a testing loop to assess the model’s performance on unseen data.

    • Setting Up the Loss Function and Optimizer: The sources explain that a loss function quantifies how well the model’s predictions match the true labels, with lower loss values indicating better performance. They discuss common choices for loss functions and optimizers, emphasizing the importance of selecting appropriate options based on the problem and dataset.
    • The sources specifically recommend binary cross-entropy loss (BCE) for binary classification problems and cross-entropy loss (CE) for multi-class classification problems.
    • They highlight that PyTorch provides both nn.BCELoss and nn.CrossEntropyLoss implementations for these loss functions.
    • For the optimizer, the sources mention stochastic gradient descent (SGD) as a common choice, with PyTorch offering the torch.optim.SGD class for its implementation.
    • Creating a Training Loop: The sources outline the fundamental steps involved in a training loop, emphasizing the iterative process of adjusting the model’s parameters to minimize the loss and improve its ability to classify images correctly. The typical steps in a training loop include:
    1. Forward Pass: Pass a batch of data through the model to obtain predictions.
    2. Calculate the Loss: Compare the model’s predictions to the true labels using the chosen loss function.
    3. Optimizer Zero Grad: Reset the gradients calculated from the previous batch to avoid accumulating gradients across batches.
    4. Loss Backward: Perform backpropagation to calculate the gradients of the loss with respect to the model’s parameters.
    5. Optimizer Step: Update the model’s parameters based on the calculated gradients and the optimizer’s learning rate.
    • Calculating Accuracy: The sources introduce accuracy as a metric for evaluating the model’s performance, representing the percentage of correctly classified samples. They provide a code snippet to calculate accuracy by comparing the predicted labels to the true labels (see the sketch after this list).
    • Implementing a Testing Loop: The sources explain the importance of evaluating the model’s performance on a separate set of data, the test set, that was not used during training. This helps to assess the model’s ability to generalize to unseen data and prevent overfitting, where the model performs well on the training data but poorly on new data. The testing loop follows similar steps to the training loop, but without updating the model’s parameters:
    1. Forward Pass: Pass a batch of test data through the model to obtain predictions.
    2. Calculate the Loss: Compare the model’s predictions to the true test labels using the loss function.
    3. Calculate Accuracy: Determine the percentage of correctly classified test samples.
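
    A sketch of the accuracy calculation as a small helper function; the example tensors are made up:

    import torch

    def accuracy_fn(y_true, y_pred):
        """Percentage of predictions that match the true labels."""
        correct = torch.eq(y_true, y_pred).sum().item()
        return (correct / len(y_pred)) * 100

    y_true = torch.tensor([0, 1, 2, 2, 1])
    y_pred = torch.tensor([0, 1, 1, 2, 1])
    print(accuracy_fn(y_true, y_pred))  # 80.0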

    The sources provide code examples for implementing the training and testing loops, including detailed explanations of each step. They also emphasize the importance of monitoring the loss and accuracy values during training to track the model’s progress and ensure that it is learning effectively. These steps provide a comprehensive understanding of the training and evaluation process, enabling readers to apply these techniques to their own image classification tasks.

    Building and Training a Multi-Layer Model with Non-Linear Activation Functions: Pages 531-540

    The sources extend the image classification task by introducing non-linear activation functions and building a more complex multi-layer model. They emphasize the importance of non-linearity in enabling neural networks to learn complex patterns and improve classification accuracy. The sources guide readers through implementing the ReLU (Rectified Linear Unit) activation function and constructing a multi-layer model, demonstrating its performance on the FashionMNIST dataset.

    • The Role of Non-Linear Activation Functions: The sources explain that linear models, while straightforward, are limited in their ability to capture intricate relationships in data. Introducing non-linear activation functions between linear layers enhances the model’s capacity to learn complex patterns. Non-linear activation functions allow the model to approximate non-linear decision boundaries, enabling it to classify data points that are not linearly separable.
    • Introducing ReLU Activation: The sources highlight ReLU as a popular non-linear activation function, known for its simplicity and effectiveness. ReLU replaces negative values in the input tensor with zero, while retaining positive values. This simple operation introduces non-linearity into the model, allowing it to learn more complex representations of the data. The sources provide the code for implementing ReLU in PyTorch using nn.ReLU().
    • Constructing a Multi-Layer Model: The sources guide readers through building a more complex model with multiple linear layers and ReLU activations. They introduce a model built from three linear layers interleaved with ReLU activations (sketched in code after this list):
    1. A linear layer that takes the flattened input image (784 features) and maps it to a hidden layer with a specified number of units.
    2. A ReLU activation function applied to the output of the first linear layer.
    3. Another linear layer that maps the activated hidden layer to a second hidden layer with a specified number of units.
    4. A ReLU activation function applied to the output of the second linear layer.
    5. A final linear layer that maps the activated second hidden layer to the output layer (10 units, representing the 10 classes in FashionMNIST).
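
    A sketch of such a model using nn.Sequential (the hidden-unit count of 10 is illustrative):

    import torch
    from torch import nn

    model = nn.Sequential(
        nn.Flatten(),                                  # 28x28 image -> 784-feature vector
        nn.Linear(in_features=784, out_features=10),   # first linear layer
        nn.ReLU(),                                     # non-linearity
        nn.Linear(in_features=10, out_features=10),    # second linear layer
        nn.ReLU(),                                     # non-linearity
        nn.Linear(in_features=10, out_features=10),    # output layer: 10 FashionMNIST classes
    )
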
    • Training and Evaluating the Multi-Layer Model: The sources demonstrate how to train and evaluate this multi-layer model using the same training and testing loops described in the previous section. They emphasize that the inclusion of ReLU activations between the linear layers significantly enhances the model’s performance compared to the previous linear models. This improvement highlights the crucial role of non-linearity in enabling neural networks to learn complex patterns and achieve higher classification accuracy.

    The sources provide code examples for implementing the multi-layer model with ReLU activations, showcasing the steps involved in defining the model’s architecture, setting up the layers and activations, and training the model using the established training and testing loops. These examples offer practical guidance on building and training more complex models with non-linear activation functions, laying the foundation for understanding and implementing even more sophisticated architectures like convolutional neural networks.

    Improving Model Performance and Visualizing Predictions: Pages 541-550

    The sources discuss strategies for improving the performance of machine learning models, focusing on techniques to enhance a model’s ability to learn from data and make accurate predictions. They also guide readers through visualizing the model’s predictions, providing insights into its decision-making process and highlighting areas for potential improvement.

    • Improving a Model’s Performance: The sources acknowledge that achieving satisfactory results with machine learning models often involves an iterative process of experimentation and refinement. They outline several strategies to improve a model’s performance, emphasizing that the effectiveness of these techniques can vary depending on the complexity of the problem and the characteristics of the dataset. Some common approaches include the following (a brief code illustration follows this list):
    1. Adding More Layers: Increasing the depth of the neural network by adding more layers can enhance its capacity to learn complex representations of the data. However, adding too many layers can lead to overfitting, especially if the dataset is small.
    2. Adding More Hidden Units: Increasing the number of hidden units within each layer can also enhance the model’s ability to capture intricate patterns. Similar to adding more layers, adding too many hidden units can contribute to overfitting.
    3. Training for Longer: Allowing the model to train for a greater number of epochs can provide more opportunities to adjust its parameters and minimize the loss. However, excessive training can also lead to overfitting, especially if the model’s capacity is high.
    4. Changing the Learning Rate: The learning rate determines the step size the optimizer takes when updating the model’s parameters. A learning rate that is too high can cause the optimizer to overshoot the optimal values, while a learning rate that is too low can slow down convergence. Experimenting with different learning rates can improve the model’s ability to find the optimal parameter values.
    • Visualizing Model Predictions: The sources stress the importance of visualizing the model’s predictions to gain insights into its decision-making process. Visualizations can reveal patterns in the data that the model is capturing and highlight areas where it is struggling to make accurate predictions. The sources guide readers through creating visualizations using Matplotlib, demonstrating how to plot the model’s predictions for different classes and analyze its performance.
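
    As a loose illustration (all values here are invented, not the source’s settings), these levers are adjusted where the model, optimizer, and training loop are defined:

    import torch
    from torch import nn

    # More layers / more hidden units: change the architecture
    model = nn.Sequential(
        nn.Flatten(),
        nn.Linear(784, 128),   # more hidden units than before
        nn.ReLU(),
        nn.Linear(128, 10),
    )

    # Different learning rate: set it when building the optimizer
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # try 0.1, 0.01, 0.001, ...

    # Train for longer: raise the epoch count used by the training loop
    epochs = 20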

    The sources provide practical advice and code examples for implementing these improvement strategies, encouraging readers to experiment with different techniques to find the optimal configuration for their specific problem. They also emphasize the value of visualizing model predictions to gain a deeper understanding of its strengths and weaknesses, facilitating further model refinement and improvement. This section equips readers with the knowledge and tools to iteratively improve their models and enhance their understanding of the model’s behavior through visualizations.

    Saving, Loading, and Evaluating Models: Pages 551-560

    The sources shift their focus to the practical aspects of saving, loading, and comprehensively evaluating trained models. They emphasize the importance of preserving trained models for future use, enabling them to be applied to new data without retraining. The sources also introduce techniques for assessing model performance beyond simple accuracy, providing a more nuanced understanding of a model’s strengths and weaknesses.

    • Saving and Loading Trained Models: The sources highlight the significance of saving trained models to avoid the time and computational expense of retraining. They outline the process of saving a model’s state dictionary, which contains the learned parameters (weights and biases), using PyTorch’s torch.save() function. The sources provide a code example demonstrating how to save a model’s state dictionary to a file, typically with a .pth extension. They also explain how to load a saved model using torch.load(), emphasizing the need to create an instance of the model with the same architecture before loading the saved state dictionary (a sketch of the full cycle follows this list).
    • Making Predictions With a Loaded Model: The sources guide readers through making predictions using a loaded model, emphasizing the importance of setting the model to evaluation mode (model.eval()) before making predictions. Evaluation mode deactivates certain layers, such as dropout, that are used during training but not during inference. They provide a code snippet illustrating the process of loading a saved model, setting it to evaluation mode, and using it to generate predictions on new data.
    • Evaluating Model Performance Beyond Accuracy: The sources acknowledge that accuracy, while a useful metric, can provide an incomplete picture of a model’s performance, especially when dealing with imbalanced datasets where some classes have significantly more samples than others. They introduce the concept of a confusion matrix as a valuable tool for evaluating classification models. A confusion matrix displays the number of correct and incorrect predictions for each class, providing a detailed breakdown of the model’s performance across different classes. The sources explain how to interpret a confusion matrix, highlighting its ability to reveal patterns in misclassifications and identify classes where the model is performing poorly.
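
    A sketch of the save/load/predict cycle; the filename, ModelClass, and new_data are placeholders, not names from the source:

    import torch

    # Save only the learned parameters (the state dictionary)
    torch.save(obj=model.state_dict(), f="model_0.pth")

    # Load: first create an instance with the same architecture,
    # then load the saved parameters into it
    loaded_model = ModelClass()  # hypothetical: same class and arguments as the original
    loaded_model.load_state_dict(torch.load(f="model_0.pth"))

    # Predict with the loaded model
    loaded_model.eval()                  # deactivate training-only layers such as dropout
    with torch.inference_mode():
        preds = loaded_model(new_data)   # new_data: tensor shaped like the training inputs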

    The sources guide readers through the essential steps of saving, loading, and evaluating trained models, equipping them with the skills to manage trained models effectively and perform comprehensive assessments of model performance beyond simple accuracy. This section focuses on the practical aspects of deploying and understanding the behavior of trained models, providing a valuable foundation for applying machine learning models to real-world tasks.

    Putting it All Together: A PyTorch Workflow and Building a Classification Model: Pages 561-570

    The sources guide readers through a comprehensive PyTorch workflow for building and training a classification model, consolidating the concepts and techniques covered in previous sections. They illustrate this workflow by constructing a binary classification model to classify data points generated using the make_circles dataset in scikit-learn.

    • PyTorch End-to-End Workflow: The sources outline a structured approach to developing PyTorch models, encompassing the following key steps:
    1. Data: Acquire, prepare, and transform data into a suitable format for training. This step involves understanding the dataset, loading the data, performing necessary preprocessing steps, and splitting the data into training and testing sets.
    2. Model: Choose or build a model architecture appropriate for the task, considering the complexity of the problem and the nature of the data. This step involves selecting suitable layers, activation functions, and other components of the model.
    3. Loss Function: Select a loss function that quantifies the difference between the model’s predictions and the actual target values. The choice of loss function depends on the type of problem (e.g., binary classification, multi-class classification, regression).
    4. Optimizer: Choose an optimization algorithm that updates the model’s parameters to minimize the loss function. Popular optimizers include stochastic gradient descent (SGD), Adam, and RMSprop.
    5. Training Loop: Implement a training loop that iteratively feeds the training data to the model, calculates the loss, and updates the model’s parameters using the chosen optimizer.
    6. Evaluation: Evaluate the trained model’s performance on the testing set using appropriate metrics, such as accuracy, precision, recall, and the confusion matrix.
    • Building a Binary Classification Model: The sources demonstrate this workflow by creating a binary classification model to classify data points generated using scikit-learn’s make_circles dataset. They guide readers through the following steps (a condensed code sketch follows the list):
    1. Generating the Dataset: Using make_circles to create a dataset of data points arranged in concentric circles, with each data point belonging to one of two classes.
    2. Visualizing the Data: Employing Matplotlib to visualize the generated data points, providing a visual representation of the classification task.
    3. Building the Model: Constructing a multi-layer neural network with linear layers and ReLU activation functions. The output layer utilizes the sigmoid activation function to produce probabilities for the two classes.
    4. Choosing the Loss Function and Optimizer: Selecting the binary cross-entropy loss function (nn.BCELoss) and the stochastic gradient descent (SGD) optimizer for this binary classification task.
    5. Implementing the Training Loop: Implementing the training loop to train the model, including the steps for calculating the loss, backpropagation, and updating the model’s parameters.
    6. Evaluating the Model: Assessing the model’s performance using accuracy, precision, recall, and visualizing the predictions.
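
    A condensed sketch of this workflow (sample counts, layer sizes, and learning rate are illustrative):

    import torch
    from torch import nn
    from sklearn.datasets import make_circles
    from sklearn.model_selection import train_test_split

    # 1. Data
    X, y = make_circles(n_samples=1000, noise=0.03, random_state=42)
    X = torch.from_numpy(X).type(torch.float)
    y = torch.from_numpy(y).type(torch.float)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # 3. Model: linear layers with ReLU, sigmoid on the output for class probabilities
    model = nn.Sequential(
        nn.Linear(2, 10), nn.ReLU(),
        nn.Linear(10, 10), nn.ReLU(),
        nn.Linear(10, 1), nn.Sigmoid(),
    )

    # 4. Loss function and optimizer
    loss_fn = nn.BCELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # 5. One training step (wrapped in an epoch loop in practice)
    y_pred = model(X_train).squeeze()
    loss = loss_fn(y_pred, y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()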

    The sources provide a clear and structured approach to developing PyTorch models for classification tasks, emphasizing the importance of a systematic workflow that encompasses data preparation, model building, loss function and optimizer selection, training, and evaluation. This section offers a practical guide to applying the concepts and techniques covered in previous sections to build a functioning classification model, preparing readers for more complex tasks and datasets.

    Multi-Class Classification with PyTorch: Pages 571-580

    The sources introduce the concept of multi-class classification, expanding on the binary classification discussed in previous sections. They guide readers through building a multi-class classification model using PyTorch, highlighting the key differences and considerations when dealing with problems involving more than two classes. The sources utilize a synthetic dataset of multi-dimensional blobs created using scikit-learn’s make_blobs function to illustrate this process.

    • Multi-Class Classification: The sources distinguish multi-class classification from binary classification, explaining that multi-class classification involves assigning data points to one of several possible classes. They provide examples of real-world multi-class classification problems, such as classifying images into different categories (e.g., cats, dogs, birds) or identifying different types of objects in an image.
    • Building a Multi-Class Classification Model: The sources outline the steps for building a multi-class classification model in PyTorch, emphasizing the adjustments needed compared to binary classification (a condensed code sketch follows the list):
    1. Generating the Dataset: Using scikit-learn’s make_blobs function to create a synthetic dataset with multiple classes, where each data point has multiple features and belongs to one specific class.
    2. Visualizing the Data: Utilizing Matplotlib to visualize the generated data points and their corresponding class labels, providing a visual understanding of the multi-class classification problem.
    3. Building the Model: Constructing a neural network with linear layers and ReLU activation functions. The key difference in multi-class classification lies in the output layer. Instead of a single output neuron with a sigmoid activation function, the output layer has multiple neurons, one for each class. The softmax activation function is applied to the output layer to produce a probability distribution over the classes.
    4. Choosing the Loss Function and Optimizer: Selecting an appropriate loss function for multi-class classification, such as the cross-entropy loss (nn.CrossEntropyLoss), and choosing an optimizer like stochastic gradient descent (SGD) or Adam.
    5. Implementing the Training Loop: Implementing the training loop to train the model, similar to binary classification but using the chosen loss function and optimizer for multi-class classification.
    6. Evaluating the Model: Evaluating the performance of the trained model using appropriate metrics for multi-class classification, such as accuracy and the confusion matrix. The sources emphasize that accuracy alone may not be sufficient for evaluating models on imbalanced datasets and suggest exploring other metrics like precision and recall.
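
    A condensed sketch (cluster count, layer sizes, and learning rate are illustrative). Note that nn.CrossEntropyLoss applies log-softmax internally, so the model outputs raw logits and softmax is applied separately when probabilities are needed:

    import torch
    from torch import nn
    from sklearn.datasets import make_blobs

    # 1. Data: 4 classes, 2 features per point
    X, y = make_blobs(n_samples=1000, n_features=2, centers=4, random_state=42)
    X = torch.from_numpy(X).type(torch.float)
    y = torch.from_numpy(y).type(torch.long)   # CrossEntropyLoss expects integer class labels

    # 3. Model: one output unit per class, emitting raw logits
    model = nn.Sequential(
        nn.Linear(2, 8), nn.ReLU(),
        nn.Linear(8, 8), nn.ReLU(),
        nn.Linear(8, 4),
    )

    # 4. Loss and optimizer
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # Predictions: logits -> probabilities -> class labels
    logits = model(X)
    pred_classes = torch.softmax(logits, dim=1).argmax(dim=1)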

    The sources provide a comprehensive guide to building and training multi-class classification models in PyTorch, highlighting the adjustments needed in model architecture, loss function, and evaluation metrics compared to binary classification. By working through a concrete example using the make_blobs dataset, the sources equip readers with the fundamental knowledge and practical skills to tackle multi-class classification problems using PyTorch.

    Enhancing a Model and Introducing Nonlinearities: Pages 581-590

    The sources discuss strategies for improving the performance of machine learning models and introduce the concept of nonlinear activation functions, which play a crucial role in enabling neural networks to learn complex patterns in data. They explore ways to enhance a previously built multi-class classification model and introduce the ReLU (Rectified Linear Unit) activation function as a widely used nonlinearity in deep learning.

    • Improving a Model’s Performance: The sources acknowledge that achieving satisfactory results with a machine learning model often involves experimentation and iterative improvement. They present several strategies for enhancing a model’s performance, including:
    1. Adding More Layers: Increasing the depth of the neural network by adding more layers can allow the model to learn more complex representations of the data. The sources suggest that adding layers can be particularly beneficial for tasks with intricate data patterns.
    2. Increasing Hidden Units: Expanding the number of hidden units within each layer can provide the model with more capacity to capture and learn the underlying patterns in the data.
    3. Training for Longer: Extending the number of training epochs can give the model more opportunities to learn from the data and potentially improve its performance. However, training for too long can lead to overfitting, where the model performs well on the training data but poorly on unseen data.
    4. Using a Smaller Learning Rate: Decreasing the learning rate can lead to more stable training and allow the model to converge to a better solution, especially when dealing with complex loss landscapes.
    5. Adding Nonlinearities: Incorporating nonlinear activation functions between layers is essential for enabling neural networks to learn nonlinear relationships in the data. Without nonlinearities, the model would essentially be a series of linear transformations, limiting its ability to capture complex patterns.
    • Introducing the ReLU Activation Function: The sources introduce the ReLU activation function as a widely used nonlinearity in deep learning. They describe ReLU’s simple yet effective operation: it outputs the input directly if the input is positive and outputs zero if the input is negative. Mathematically, ReLU(x) = max(0, x).
    • The sources highlight the benefits of ReLU, including its computational efficiency and its tendency to mitigate the vanishing gradient problem, which can hinder training in deep networks (a one-line demonstration appears after this list).
    • Incorporating ReLU into the Model: The sources guide readers through adding ReLU activation functions to the previously built multi-class classification model. They demonstrate how to insert ReLU layers between the linear layers of the model, enabling the network to learn nonlinear decision boundaries and improve its ability to classify the data.
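
    A one-line demonstration of this behaviour:

    import torch
    from torch import nn

    x = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])
    print(nn.ReLU()(x))   # tensor([0., 0., 0., 1., 3.]); negatives become zero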

    The sources provide a practical guide to improving machine learning model performance and introduce the concept of nonlinearities, emphasizing the importance of ReLU activation functions in enabling neural networks to learn complex data patterns. By incorporating ReLU into the multi-class classification model, the sources showcase the power of nonlinearities in enhancing a model’s ability to capture and represent the underlying structure of the data.

    Building and Evaluating Convolutional Neural Networks: Pages 591-600

    The sources transition from traditional feedforward neural networks to convolutional neural networks (CNNs), a specialized architecture particularly effective for computer vision tasks. They emphasize the power of CNNs in automatically learning and extracting features from images, eliminating the need for manual feature engineering. The sources utilize a simplified version of the VGG architecture, dubbed “TinyVGG,” to illustrate the building blocks of CNNs and their application in image classification.

    • Convolutional Neural Networks (CNNs): The sources introduce CNNs as a powerful type of neural network specifically designed for processing data with a grid-like structure, such as images. They explain that CNNs excel in computer vision tasks because they exploit the spatial relationships between pixels in an image, learning to identify patterns and features that are relevant for classification.
    • Key Components of CNNs: The sources outline the fundamental building blocks of CNNs:
    1. Convolutional Layers: Convolutional layers perform convolutions, a mathematical operation that involves sliding a filter (also called a kernel) over the input image to extract features. The filter acts as a pattern detector, learning to recognize specific shapes, edges, or textures in the image.
    2. Activation Functions: Non-linear activation functions, such as ReLU, are applied to the output of convolutional layers to introduce non-linearity into the network, enabling it to learn complex patterns.
    3. Pooling Layers: Pooling layers downsample the output of convolutional layers, reducing the spatial dimensions of the feature maps while retaining the most important information. Common pooling operations include max pooling and average pooling.
    4. Fully Connected Layers: Fully connected layers, similar to those in traditional feedforward networks, are often used in the final stages of a CNN to perform classification based on the extracted features.
    • Building TinyVGG: The sources guide readers through implementing a simplified version of the VGG architecture, named TinyVGG, to demonstrate how to build and train a CNN for image classification. They detail the architecture of TinyVGG (sketched in code after this list), which consists of:
    1. Convolutional Blocks: Multiple convolutional blocks, each comprising convolutional layers, ReLU activation functions, and a max pooling layer.
    2. Classifier Layer: A final classifier layer consisting of a flattening operation followed by fully connected layers to perform classification.
    • Training and Evaluating TinyVGG: The sources provide code for training TinyVGG using the FashionMNIST dataset, a collection of grayscale images of clothing items. They demonstrate how to define the training loop, calculate the loss, perform backpropagation, and update the model’s parameters using an optimizer. They also guide readers through evaluating the trained model’s performance using accuracy and other relevant metrics.
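
    A sketch of a TinyVGG-style model (channel counts, kernel sizes, and the 28x28 FashionMNIST input are illustrative; the source’s exact configuration may differ):

    import torch
    from torch import nn

    class TinyVGG(nn.Module):
        """A small VGG-style CNN: two convolutional blocks, then a classifier."""
        def __init__(self, input_shape: int, hidden_units: int, output_shape: int):
            super().__init__()
            self.conv_block_1 = nn.Sequential(
                nn.Conv2d(input_shape, hidden_units, kernel_size=3, stride=1, padding=1),
                nn.ReLU(),
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, stride=1, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2),   # halves the spatial dimensions
            )
            self.conv_block_2 = nn.Sequential(
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, stride=1, padding=1),
                nn.ReLU(),
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, stride=1, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2),
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                # 28x28 input -> 14x14 after block 1 -> 7x7 after block 2
                nn.Linear(hidden_units * 7 * 7, output_shape),
            )

        def forward(self, x):
            return self.classifier(self.conv_block_2(self.conv_block_1(x)))

    model = TinyVGG(input_shape=1, hidden_units=10, output_shape=10)  # 1 channel, 10 classes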

    The sources provide a clear and accessible introduction to CNNs and their application in image classification, demonstrating the power of CNNs in automatically learning features from images without manual feature engineering. By implementing and training TinyVGG, the sources equip readers with the practical skills and understanding needed to build and work with CNNs for computer vision tasks.

    Visualizing CNNs and Building a Custom Dataset: Pages 601-610

    The sources emphasize the importance of understanding how convolutional neural networks (CNNs) operate and guide readers through visualizing the effects of convolutional layers, kernels, strides, and padding. They then transition to the concept of custom datasets, explaining the need to go beyond pre-built datasets and create datasets tailored to specific machine learning problems. The sources utilize the Food101 dataset, creating a smaller subset called “Food Vision Mini” to illustrate building a custom dataset for image classification.

    • Visualizing CNNs: The sources recommend using the CNN Explainer website (https://poloclub.github.io/cnn-explainer/) to gain a deeper understanding of how CNNs work.
    • They acknowledge that the mathematical operations involved in convolutions can be challenging to grasp. The CNN Explainer provides an interactive visualization that allows users to experiment with different CNN parameters and observe their effects on the input image.
    • Key Insights from CNN Explainer: The sources highlight the following key concepts illustrated by the CNN Explainer:
    1. Kernels: Kernels, also called filters, are small matrices that slide across the input image, extracting features by performing element-wise multiplications and summations. The values within the kernel represent the weights that the CNN learns during training.
    2. Strides: The stride determines how far the kernel moves across the input image at each step. Larger strides downsample the input more aggressively, reducing the spatial dimensions of the output feature maps.
    3. Padding: Padding involves adding extra pixels around the borders of the input image. Padding helps control the spatial dimensions of the output feature maps and can prevent information loss at the edges of the image.
    • Building a Custom Dataset: The sources recognize that many real-world machine learning problems require creating custom datasets that are not readily available. They guide readers through the process of building a custom dataset for image classification, using the Food101 dataset as an example.
    • Creating Food Vision Mini: The sources construct a smaller subset of the Food101 dataset called Food Vision Mini, which contains only three classes (pizza, steak, and sushi) and a reduced number of images. They advocate for starting with a smaller dataset for experimentation and development, scaling up to the full dataset once the model and workflow are established.
    • Standard Image Classification Format: The sources emphasize the importance of organizing the dataset into a standard image classification format, where images are grouped into separate folders corresponding to their respective classes. This standard format facilitates data loading and preprocessing using PyTorch’s built-in tools.
    • Loading Image Data using ImageFolder: The sources introduce PyTorch’s ImageFolder class, a convenient tool for loading image data that is organized in the standard image classification format. They demonstrate how to use ImageFolder to create dataset objects for the training and testing splits of Food Vision Mini (a usage sketch follows this list).
    • They highlight the benefits of ImageFolder, including its automatic labeling of images based on their folder location and its ability to apply transformations to the images during loading.
    • Visualizing the Custom Dataset: The sources encourage visualizing the custom dataset to ensure that the images and labels are loaded correctly. They provide code for displaying random images and their corresponding labels from the training dataset, enabling a qualitative assessment of the dataset’s content.
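
    A usage sketch (directory paths are illustrative):

    from torchvision import datasets, transforms

    # Transformations applied to every image as it is loaded
    data_transform = transforms.Compose([
        transforms.Resize((64, 64)),
        transforms.ToTensor(),
    ])

    # ImageFolder infers labels from the folder names (pizza/, steak/, sushi/)
    train_data = datasets.ImageFolder(root="data/pizza_steak_sushi/train", transform=data_transform)
    test_data = datasets.ImageFolder(root="data/pizza_steak_sushi/test", transform=data_transform)

    print(train_data.classes)    # ['pizza', 'steak', 'sushi']
    img, label = train_data[0]   # first (image_tensor, label) pair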

    The sources offer a practical guide to understanding and visualizing CNNs and provide a step-by-step approach to building a custom dataset for image classification. By using the Food Vision Mini dataset as a concrete example, the sources equip readers with the knowledge and skills needed to create and work with datasets tailored to their specific machine learning problems.

    Building a Custom Dataset Class and Exploring Data Augmentation: Pages 611-620

    The sources shift from using the convenient ImageFolder class to building a custom Dataset class in PyTorch, providing greater flexibility and control over data loading and preprocessing. They explain the structure and key methods of a custom Dataset class and demonstrate how to implement it for the Food Vision Mini dataset. The sources then explore data augmentation techniques, emphasizing their role in improving model generalization by artificially increasing the diversity of the training data.

    • Building a Custom Dataset Class: The sources guide readers through creating a custom Dataset class in PyTorch, offering a more versatile approach compared to ImageFolder for handling image data. They outline the essential components of a custom Dataset:
    1. Initialization (__init__): The initialization method sets up the necessary attributes of the dataset, such as the image paths, labels, and transformations.
    2. Length (__len__): The length method returns the total number of samples in the dataset, allowing PyTorch’s data loaders to determine the dataset’s size.
    3. Get Item (__getitem__): The get item method retrieves a specific sample from the dataset given its index. It typically involves loading the image, applying transformations, and returning the transformed image and its corresponding label.
    • Implementing the Custom Dataset: The sources provide a step-by-step implementation of a custom Dataset class for the Food Vision Mini dataset (a full sketch appears at the end of this section’s list). They demonstrate how to:
    1. Collect Image Paths and Labels: Iterate through the image directories and store the paths to each image along with their corresponding labels.
    2. Define Transformations: Specify the desired image transformations to be applied during data loading, such as resizing, cropping, and converting to tensors.
    3. Implement __getitem__: Retrieve the image at the given index, apply transformations, and return the transformed image and label as a tuple.
    • Benefits of Custom Dataset Class: The sources highlight the advantages of using a custom Dataset class:
    1. Flexibility: Custom Dataset classes offer greater control over data loading and preprocessing, allowing developers to tailor the data handling process to their specific needs.
    2. Extensibility: Custom Dataset classes can be easily extended to accommodate various data formats and incorporate complex data loading logic.
    3. Code Clarity: Custom Dataset classes promote code organization and readability, making it easier to understand and maintain the data loading pipeline.
    • Data Augmentation: The sources introduce data augmentation as a crucial technique for improving the generalization ability of machine learning models. Data augmentation involves artificially expanding the training dataset by applying various transformations to the original images.
    • Purpose of Data Augmentation: The goal of data augmentation is to expose the model to a wider range of variations in the data, reducing the risk of overfitting and enabling the model to learn more robust and generalizable features.
    • Types of Data Augmentations: The sources showcase several common data augmentation techniques, including:
    1. Random Flipping: Flipping images horizontally or vertically.
    2. Random Cropping: Cropping images to different sizes and positions.
    3. Random Rotation: Rotating images by a random angle.
    4. Color Jitter: Adjusting image brightness, contrast, saturation, and hue.
    • Benefits of Data Augmentation: The sources emphasize the following benefits of data augmentation:
    1. Increased Data Diversity: Data augmentation artificially expands the training dataset, exposing the model to a wider range of image variations.
    2. Improved Generalization: Training on augmented data helps the model learn more robust features that generalize better to unseen data.
    3. Reduced Overfitting: Data augmentation can mitigate overfitting by preventing the model from memorizing specific examples in the training data.
    • Incorporating Data Augmentations: The sources guide readers through applying data augmentations to the Food Vision Mini dataset using PyTorch’s transforms module.
    • They demonstrate how to compose multiple transformations into a pipeline, applying them sequentially to the images during data loading.
    • Visualizing Augmented Images: The sources encourage visualizing the augmented images to ensure that the transformations are being applied as expected. They provide code for displaying random augmented images from the training dataset, allowing a qualitative assessment of the augmentation pipeline’s effects.
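
    A combined sketch of a custom Dataset with an augmentation pipeline, assuming images are stored as JPEGs in class-named folders (paths, augmentation parameters, and the class name ImageFolderCustom are illustrative):

    import pathlib
    from PIL import Image
    from torch.utils.data import Dataset
    from torchvision import transforms

    class ImageFolderCustom(Dataset):
        """Loads (image, label) pairs from class-named folders, like ImageFolder."""
        def __init__(self, targ_dir: str, transform=None):
            # 1. Collect image paths and derive class names from folder names
            self.paths = list(pathlib.Path(targ_dir).glob("*/*.jpg"))
            self.transform = transform
            self.classes = sorted(p.name for p in pathlib.Path(targ_dir).iterdir() if p.is_dir())
            self.class_to_idx = {name: i for i, name in enumerate(self.classes)}

        def __len__(self):
            return len(self.paths)

        def __getitem__(self, index):
            # 3. Load the image, apply transforms, return (image, label)
            img = Image.open(self.paths[index])
            label = self.class_to_idx[self.paths[index].parent.name]
            if self.transform:
                img = self.transform(img)
            return img, label

    # 2. An augmentation pipeline combining the techniques listed above
    train_transform = transforms.Compose([
        transforms.RandomResizedCrop(size=(64, 64)),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomRotation(degrees=30),
        transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
        transforms.ToTensor(),
    ])

    train_data = ImageFolderCustom("data/pizza_steak_sushi/train", transform=train_transform)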

    The sources provide a comprehensive guide to building a custom Dataset class in PyTorch, empowering readers to handle data loading and preprocessing with greater flexibility and control. They then explore the concept and benefits of data augmentation, emphasizing its role in enhancing model generalization by introducing artificial diversity into the training data.

    Constructing and Training a TinyVGG Model: Pages 621-630

    The sources guide readers through constructing a TinyVGG model, a simplified version of the VGG (Visual Geometry Group) architecture commonly used in computer vision. They explain the rationale behind TinyVGG’s design, detail its layers and activation functions, and demonstrate how to implement it in PyTorch. They then focus on training the TinyVGG model using the custom Food Vision Mini dataset. They highlight the importance of setting a random seed for reproducibility and illustrate the training process using a combination of code and explanatory text.

    • Introducing TinyVGG Architecture: The sources introduce the TinyVGG architecture as a simplified version of the VGG architecture, well-known for its performance in image classification tasks.
    • Rationale Behind TinyVGG: They explain that TinyVGG aims to capture the essential elements of the VGG architecture while using fewer layers and parameters, making it more computationally efficient and suitable for smaller datasets like Food Vision Mini.
    • Layers and Activation Functions in TinyVGG: The sources provide a detailed breakdown of the layers and activation functions used in the TinyVGG model:
    1. Convolutional Layers (nn.Conv2d): Multiple convolutional layers are used to extract features from the input images. Each convolutional layer applies a set of learnable filters (kernels) to the input, generating feature maps that highlight different patterns in the image.
    2. ReLU Activation Function (nn.ReLU): The rectified linear unit (ReLU) activation function is applied after each convolutional layer. ReLU introduces non-linearity into the model, allowing it to learn complex relationships between features. It is defined as f(x) = max(0, x), meaning it outputs the input directly if it is positive and outputs zero if the input is negative.
    3. Max Pooling Layers (nn.MaxPool2d): Max pooling layers downsample the feature maps by selecting the maximum value within a small window. This reduces the spatial dimensions of the feature maps while retaining the most salient features.
    4. Flatten Layer (nn.Flatten): The flatten layer converts the multi-dimensional feature maps from the convolutional layers into a one-dimensional feature vector. This vector is then fed into the fully connected layers for classification.
    5. Linear Layer (nn.Linear): The linear layer performs a matrix multiplication on the input feature vector, producing a set of scores for each class.
    • Implementing TinyVGG in PyTorch: The sources guide readers through implementing the TinyVGG architecture using PyTorch’s nn.Module class. They define a class called TinyVGG that inherits from nn.Module and implements the model’s architecture in its __init__ and forward methods.
    • __init__ Method: This method initializes the model’s layers, including convolutional layers, ReLU activation functions, max pooling layers, a flatten layer, and a linear layer for classification.
    • forward Method: This method defines the flow of data through the model, taking an input tensor and passing it through the various layers in the correct sequence.
    • Setting the Random Seed: The sources stress the importance of setting a random seed before training the model using torch.manual_seed(42). This ensures that the model’s initialization and training process are deterministic, making the results reproducible.
    • Training the TinyVGG Model: The sources demonstrate how to train the TinyVGG model on the Food Vision Mini dataset (a setup sketch follows this list). They provide code for:
    1. Creating an Instance of the Model: Instantiating the TinyVGG class creates an object representing the model.
    2. Choosing a Loss Function: Selecting an appropriate loss function to measure the difference between the model’s predictions and the true labels.
    3. Setting up an Optimizer: Choosing an optimization algorithm to update the model’s parameters during training, aiming to minimize the loss function.
    4. Defining a Training Loop: Implementing a loop that iterates through the training data, performs forward and backward passes, updates model parameters, and tracks the training progress.
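
    A minimal setup sketch, reusing the TinyVGG class sketched earlier (the learning rate and hidden-unit count are illustrative):

    import torch
    from torch import nn

    torch.manual_seed(42)   # reproducible initialization and training

    # 1. Instantiate the model: 3 colour channels in, 3 food classes out.
    # Note: the classifier's Linear in_features must match the input size,
    # e.g. hidden_units * 16 * 16 for 64x64 images after two MaxPool2d(2) layers.
    model = TinyVGG(input_shape=3, hidden_units=10, output_shape=3)

    # 2. Loss function for multi-class classification
    loss_fn = nn.CrossEntropyLoss()

    # 3. Optimizer
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)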

    The sources provide a practical walkthrough of constructing and training a TinyVGG model using the Food Vision Mini dataset. They explain the architecture’s design principles, detail its layers and activation functions, and demonstrate how to implement and train the model in PyTorch. They emphasize the importance of setting a random seed for reproducibility, enabling others to replicate the training process and results.

    Visualizing the Model, Evaluating Performance, and Comparing Results: Pages 631-640

    The sources move towards visualizing the TinyVGG model’s layers and their effects on input data, offering insights into how convolutional neural networks process information. They then focus on evaluating the model’s performance using various metrics, emphasizing the need to go beyond simple accuracy and consider measures like precision, recall, and F1 score for a more comprehensive assessment. Finally, the sources introduce techniques for comparing the performance of different models, highlighting the role of dataframes in organizing and presenting the results.

    • Visualizing TinyVGG’s Convolutional Layers: The sources explore how to visualize the convolutional layers of the TinyVGG model.
    • They leverage the CNN Explainer website, which offers an interactive tool for understanding the workings of convolutional neural networks.
    • The sources guide readers through creating dummy data in the same shape as the input data used in the CNN Explainer, allowing them to observe how the model’s convolutional layers transform the input.
    • The sources emphasize the importance of understanding hyperparameters such as kernel size, stride, and padding, and their influence on the convolutional operation (a small shape demonstration follows this list).
    • Understanding Kernel Size, Stride, and Padding: The sources explain the significance of key hyperparameters involved in convolutional layers:
    1. Kernel Size: Refers to the size of the filter that slides across the input image. A larger kernel captures a wider receptive field, allowing the model to learn more complex features. However, a larger kernel also increases the number of parameters and computational complexity.
    2. Stride: Determines the step size at which the kernel moves across the input. A larger stride results in a smaller output feature map, effectively downsampling the input.
    3. Padding: Involves adding extra pixels around the input image to control the output size and prevent information loss at the edges. Different padding strategies, such as “same” padding or “valid” padding, influence how the kernel interacts with the image boundaries.
    • Evaluating Model Performance: The sources shift focus to evaluating the performance of the trained TinyVGG model. They emphasize that relying solely on accuracy may not provide a complete picture, especially when dealing with imbalanced datasets where one class might dominate the others.
    • Metrics Beyond Accuracy: The sources introduce several additional metrics for evaluating classification models:
    1. Precision: Measures the proportion of correctly predicted positive instances out of all instances predicted as positive. A high precision indicates that the model is good at avoiding false positives.
    2. Recall: Measures the proportion of correctly predicted positive instances out of all actual positive instances. A high recall suggests that the model is effective at identifying most of the positive instances.
    3. F1 Score: The harmonic mean of precision and recall, providing a balanced measure that considers both false positives and false negatives. It is particularly useful when dealing with imbalanced datasets where precision and recall might provide conflicting insights.
    • Confusion Matrix: The sources introduce the concept of a confusion matrix, a powerful tool for visualizing the performance of a classification model.
    • Structure of a Confusion Matrix: The confusion matrix is a table that shows the counts of true positives, true negatives, false positives, and false negatives for each class, providing a detailed breakdown of the model’s prediction patterns.
    • Benefits of Confusion Matrix: The confusion matrix helps identify classes that the model struggles with, providing insights into potential areas for improvement.
    • Comparing Model Performance: The sources explore techniques for comparing the performance of different models trained on the Food Vision Mini dataset. They demonstrate how to use Pandas dataframes to organize and present the results clearly and concisely.
    • Creating a Dataframe for Comparison: The sources guide readers through creating a dataframe that includes relevant metrics like training time, training loss, test loss, and test accuracy for each model. This allows for a side-by-side comparison of their performance.
    • Benefits of Dataframes: Dataframes provide a structured and efficient way to handle and analyze tabular data. They enable easy sorting, filtering, and visualization of the results, facilitating the process of model selection and comparison.
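
    A small demonstration with dummy data (the 64x64 image mirrors the CNN Explainer’s input; layer settings are illustrative):

    import torch
    from torch import nn

    torch.manual_seed(42)
    dummy_image = torch.randn(size=(1, 3, 64, 64))   # (batch, colour_channels, height, width)

    # Vary kernel_size, stride, and padding and watch the output shape change
    conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=3, stride=1, padding=0)
    print(conv(dummy_image).shape)           # torch.Size([1, 10, 62, 62])

    conv_strided = nn.Conv2d(3, 10, kernel_size=3, stride=2, padding=1)
    print(conv_strided(dummy_image).shape)   # torch.Size([1, 10, 32, 32])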

    The sources emphasize the importance of going beyond simple accuracy when evaluating classification models. They introduce a range of metrics, including precision, recall, and F1 score, and highlight the usefulness of the confusion matrix in providing a detailed analysis of the model’s prediction patterns. The sources then demonstrate how to use dataframes to compare the performance of multiple models systematically, aiding in model selection and understanding the impact of different design choices or training strategies.

    Building, Training, and Evaluating a Multi-Class Classification Model: Pages 641-650

    The sources transition from binary classification, where models distinguish between two classes, to multi-class classification, which involves predicting one of several possible classes. They introduce the concept of multi-class classification, comparing it to binary classification, and use the Fashion MNIST dataset as an example, where models need to classify images into ten different clothing categories. The sources guide readers through adapting the TinyVGG architecture and training process for this multi-class setting, explaining the modifications needed for handling multiple classes.

    • From Binary to Multi-Class Classification: The sources explain the shift from binary to multi-class classification.
    • Binary Classification: Involves predicting one of two possible classes, like “cat” or “dog” in an image classification task.
    • Multi-Class Classification: Extends the concept to predicting one of multiple classes, as in the Fashion MNIST dataset, where models must classify images into classes like “T-shirt,” “Trouser,” “Pullover,” “Dress,” “Coat,” “Sandal,” “Shirt,” “Sneaker,” “Bag,” and “Ankle Boot.” [1, 2]
    • Adapting TinyVGG for Multi-Class Classification: The sources explain how to modify the TinyVGG architecture for multi-class problems.
    • Output Layer: The key change involves adjusting the output layer of the TinyVGG model. The number of output units in the final linear layer needs to match the number of classes in the dataset. For Fashion MNIST, this means having ten output units, one for each clothing category. [3]
    • Activation Function: They also recommend using the softmax activation function in the output layer for multi-class classification. The softmax function converts the raw output scores (logits) from the linear layer into a probability distribution over the classes, where each probability represents the model’s confidence in assigning the input to that particular class (a short softmax illustration follows this list). [4]
    • Choosing the Right Loss Function and Optimizer: The sources guide readers through selecting appropriate loss functions and optimizers for multi-class classification:
    • Cross-Entropy Loss: They recommend using the cross-entropy loss function, a common choice for multi-class classification tasks. Cross-entropy loss measures the dissimilarity between the predicted probability distribution and the true label distribution. [5]
    • Optimizers: The sources discuss using optimizers like Stochastic Gradient Descent (SGD) or Adam to update the model’s parameters during training, aiming to minimize the cross-entropy loss. [5]
    • Training the Multi-Class Model: The sources demonstrate how to train the adapted TinyVGG model on the Fashion MNIST dataset, following a similar training loop structure used in previous sections:
    • Data Loading: Loading batches of image data and labels from the Fashion MNIST dataset using PyTorch’s DataLoader. [6, 7]
    • Forward Pass: Passing the input data through the model to obtain predictions (logits). [8]
    • Calculating Loss: Computing the cross-entropy loss between the predicted logits and the true labels. [8]
    • Backpropagation: Calculating gradients of the loss with respect to the model’s parameters. [8]
    • Optimizer Step: Updating the model’s parameters using the chosen optimizer, aiming to minimize the loss. [8]
    • Evaluating Performance: The sources reiterate the importance of evaluating model performance using metrics beyond simple accuracy, especially in multi-class settings.
    • Precision, Recall, F1 Score: They encourage considering metrics like precision, recall, and F1 score, which provide a more nuanced understanding of the model’s ability to correctly classify instances across different classes. [9]
    • Confusion Matrix: They highlight the usefulness of the confusion matrix, allowing visualization of the model’s prediction patterns and identification of classes the model struggles with. [10]
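
    A tiny illustration of the softmax step on an invented logits row:

    import torch

    logits = torch.tensor([[2.0, 1.0, 0.1]])      # raw scores for 3 classes (invented values)
    probs = torch.softmax(logits, dim=1)
    print(probs)                # tensor([[0.6590, 0.2424, 0.0986]]), sums to 1
    print(probs.argmax(dim=1))  # tensor([0]): predicted class index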

    The sources smoothly transition readers from binary to multi-class classification. They outline the key differences, provide clear instructions on adapting the TinyVGG architecture for multi-class tasks, and guide readers through the training process. They emphasize the need for comprehensive model evaluation, suggesting the use of metrics beyond accuracy and showcasing the value of the confusion matrix in analyzing the model’s performance.

    Evaluating Model Predictions and Understanding Data Augmentation: Pages 651-660

    The sources guide readers through evaluating model predictions on individual samples from the Fashion MNIST dataset, emphasizing the importance of visual inspection and understanding where the model succeeds or fails. They then introduce the concept of data augmentation as a technique for artificially increasing the diversity of the training data, aiming to improve the model’s generalization ability and robustness.

    • Visually Evaluating Model Predictions: The sources demonstrate how to make predictions on individual samples from the test set and visualize them alongside their true labels.
    • Selecting Random Samples: They guide readers through selecting random samples from the test data, preparing the images for visualization using matplotlib, and making predictions using the trained model.
    • Visualizing Predictions: They showcase a technique for creating a grid of images, displaying each test sample alongside its predicted label and its true label. This visual approach provides insights into the model’s performance on specific instances.
    • Analyzing Results: The sources encourage readers to analyze the visual results, looking for patterns in the model’s predictions and identifying instances where it might be making errors. This process helps understand the strengths and weaknesses of the model’s learned representations.
    • Confusion Matrix for Deeper Insights: The sources revisit the concept of the confusion matrix, introduced earlier, as a powerful tool for evaluating classification model performance.
    • Creating a Confusion Matrix: They guide readers through creating a confusion matrix using libraries like torchmetrics and mlxtend, which offer convenient functions for computing and visualizing confusion matrices (a sketch follows this list).
    • Interpreting the Confusion Matrix: The sources explain how to interpret the confusion matrix, highlighting the patterns in the model’s predictions and identifying classes that might be easily confused.
    • Benefits of Confusion Matrix: They emphasize that the confusion matrix provides a more granular view of the model’s performance compared to simple accuracy, allowing for a deeper understanding of its prediction patterns.
    • Data Augmentation: The sources introduce the concept of data augmentation as a technique to improve model generalization and performance.
    • Definition of Data Augmentation: They define data augmentation as the process of artificially increasing the diversity of the training data by applying various transformations to the original images.
    • Benefits of Data Augmentation: The sources explain that data augmentation helps expose the model to a wider range of variations during training, making it more robust to changes in input data and improving its ability to generalize to unseen examples.
    • Common Data Augmentation Techniques: The sources discuss several commonly used data augmentation techniques:
    1. Random Cropping: Involves randomly selecting a portion of the image to use for training, helping the model learn to recognize objects regardless of their location within the image.
    2. Random Flipping: Horizontally flipping images, teaching the model to recognize objects even when they are mirrored.
    3. Random Rotation: Rotating images by a random angle, improving the model’s ability to handle different object orientations.
    4. Color Jitter: Adjusting the brightness, contrast, saturation, and hue of images, making the model more robust to variations in lighting and color.
    • Applying Data Augmentation in PyTorch: The sources demonstrate how to apply data augmentation using PyTorch’s transforms module, which offers a wide range of built-in transformations for image data. They create a custom transformation pipeline that includes random cropping, random horizontal flipping, and random rotation. They then visualize examples of augmented images, highlighting the diversity introduced by these transformations.
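
    A sketch of the confusion-matrix step using torchmetrics and mlxtend, as the sources mention; y_pred_tensor, test_labels, and class_names are assumed to exist already:

    from torchmetrics import ConfusionMatrix
    from mlxtend.plotting import plot_confusion_matrix

    confmat = ConfusionMatrix(task="multiclass", num_classes=len(class_names))
    confmat_tensor = confmat(preds=y_pred_tensor, target=test_labels)

    fig, ax = plot_confusion_matrix(
        conf_mat=confmat_tensor.numpy(),   # mlxtend expects a NumPy array
        class_names=class_names,
        figsize=(10, 7),
    )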

    The sources guide readers through evaluating individual model predictions, showcasing techniques for visual inspection and analysis using matplotlib. They reiterate the importance of the confusion matrix as a tool for gaining deeper insights into the model’s prediction patterns. They then introduce the concept of data augmentation, explaining its purpose and benefits. The sources provide clear explanations of common data augmentation techniques and demonstrate how to apply them using PyTorch’s transforms module, emphasizing the role of data augmentation in improving model generalization and robustness.

    Building and Training a TinyVGG Model on a Custom Dataset: Pages 661-670

    The sources shift focus to building and training a TinyVGG convolutional neural network model on the custom food dataset (pizza, steak, sushi) prepared in the previous sections. They guide readers through the process of model definition, setting up a loss function and optimizer, and defining training and testing steps for the model. The sources emphasize a step-by-step approach, encouraging experimentation and understanding of the model’s architecture and training dynamics.

    • Defining the TinyVGG Architecture: The sources provide a detailed breakdown of the TinyVGG architecture, outlining the layers and their configurations:
    • Convolutional Blocks: They describe the arrangement of convolutional layers (nn.Conv2d), activation functions (typically ReLU – nn.ReLU), and max-pooling layers (nn.MaxPool2d) within convolutional blocks. They explain how these blocks extract features from the input images at different levels of abstraction.
    • Classifier Layer: They describe the classifier layer, consisting of a flattening operation (nn.Flatten) followed by fully connected linear layers (nn.Linear). This layer takes the extracted features from the convolutional blocks and maps them to the output classes (pizza, steak, sushi).
    • Model Implementation: The sources guide readers through implementing the TinyVGG model in PyTorch, showing how to define the model class by subclassing nn.Module:
    • __init__ Method: They demonstrate the initialization of the model’s layers within the __init__ method, setting up the convolutional blocks and the classifier layer.
    • forward Method: They explain the forward method, which defines the flow of data through the model during the forward pass, outlining how the input data passes through each layer and transformation.
    • Input and Output Shape Verification: The sources stress the importance of verifying the input and output shapes of each layer in the model. They encourage readers to print the shapes at different stages to ensure the data is flowing correctly through the network and that the dimensions are as expected. They also mention techniques for troubleshooting shape mismatches.
    • Introducing torchinfo Package: The sources introduce the torchinfo package as a helpful tool for summarizing the architecture of a PyTorch model, providing information about layer shapes, parameters, and the overall structure of the model. They demonstrate how to use torchinfo to get a concise overview of the defined TinyVGG model.
    • Setting Up the Loss Function and Optimizer: The sources guide readers through selecting a suitable loss function and optimizer for training the TinyVGG model:
    • Cross-Entropy Loss: They recommend using the cross-entropy loss function for the multi-class classification problem of the food dataset. They explain that cross-entropy loss is commonly used for classification tasks and measures the difference between the predicted probability distribution and the true label distribution.
    • Stochastic Gradient Descent (SGD) Optimizer: They suggest using the SGD optimizer for updating the model’s parameters during training. They explain that SGD is a widely used optimization algorithm that iteratively adjusts the model’s parameters to minimize the loss function.
    • Defining Training and Testing Steps: The sources provide code for defining the training and testing steps of the model training process (both functions are sketched after this list):
    • train_step Function: They define a train_step function, which takes a batch of training data as input, performs a forward pass through the model, calculates the loss, performs backpropagation to compute gradients, and updates the model’s parameters using the optimizer. They emphasize accumulating the loss and accuracy over the batches within an epoch.
    • test_step Function: They define a test_step function, which takes a batch of testing data as input, performs a forward pass to get predictions, calculates the loss, and accumulates the loss and accuracy over the batches. They highlight that the test_step does not involve updating the model’s parameters, as it’s used for evaluation purposes.
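
    A sketch of the two functions as described (argument names and device handling are illustrative):

    import torch

    def train_step(model, dataloader, loss_fn, optimizer, device="cpu"):
        """One epoch of training; returns average loss and accuracy over the batches."""
        model.train()
        train_loss, train_acc = 0.0, 0.0
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            y_pred = model(X)
            loss = loss_fn(y_pred, y)
            train_loss += loss.item()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            train_acc += (y_pred.argmax(dim=1) == y).sum().item() / len(y)
        return train_loss / len(dataloader), train_acc / len(dataloader)

    def test_step(model, dataloader, loss_fn, device="cpu"):
        """One pass over the test data; no parameter updates."""
        model.eval()
        test_loss, test_acc = 0.0, 0.0
        with torch.inference_mode():
            for X, y in dataloader:
                X, y = X.to(device), y.to(device)
                y_pred = model(X)
                test_loss += loss_fn(y_pred, y).item()
                test_acc += (y_pred.argmax(dim=1) == y).sum().item() / len(y)
        return test_loss / len(dataloader), test_acc / len(dataloader)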

    The sources guide readers through the process of defining the TinyVGG architecture, verifying layer shapes, setting up the loss function and optimizer, and defining the training and testing steps for the model. They emphasize the importance of understanding the model’s structure and the flow of data through it. They encourage readers to experiment and pay attention to details to ensure the model is correctly implemented and set up for training.

    Training, Evaluating, and Saving the TinyVGG Model: Pages 671-680

    The sources guide readers through the complete training process of the TinyVGG model on the custom food dataset, highlighting techniques for visualizing training progress, evaluating model performance, and saving the trained model for later use. They emphasize practical considerations, such as setting up training loops, tracking loss and accuracy metrics, and making predictions on test data.

    • Implementing the Training Loop: The sources provide code for implementing the training loop, iterating through multiple epochs and performing training and testing steps for each epoch. They break down the training loop into clear steps:
    • Epoch Iteration: They use a for loop to iterate over the specified number of training epochs.
    • Setting Model to Training Mode: Before starting the training step for each epoch, they explicitly set the model to training mode using model.train(). They explain that this is important for activating certain layers, like dropout or batch normalization, which behave differently during training and evaluation.
    • Iterating Through Batches: Within each epoch, they use another for loop to iterate through the batches of data from the training data loader.
    • Calling the train_step Function: For each batch, they call the previously defined train_step function, which performs a forward pass, calculates the loss, performs backpropagation, and updates the model’s parameters.
    • Accumulating Loss and Accuracy: They accumulate the training loss and accuracy values over the batches within an epoch.
    • Setting Model to Evaluation Mode: Before starting the testing step, they set the model to evaluation mode using model.eval(). They explain that this deactivates training-specific behaviors of certain layers.
    • Iterating Through Test Batches: They iterate through the batches of data from the test data loader.
    • Calling the test_step Function: For each batch, they call the test_step function, which calculates the loss and accuracy on the test data.
    • Accumulating Test Loss and Accuracy: They accumulate the test loss and accuracy values over the test batches.
    • Calculating Average Loss and Accuracy: After iterating through all the training and testing batches, they calculate the average training loss, training accuracy, test loss, and test accuracy for the epoch.
    • Printing Epoch Statistics: They print the calculated statistics for each epoch, providing a clear view of the model’s progress during training.
    • Visualizing Training Progress: The sources emphasize the importance of visualizing the training process to gain insights into the model’s learning dynamics:
    • Creating Loss and Accuracy Curves: They guide readers through creating plots of the training loss and accuracy values over the epochs, allowing for visual inspection of how the model is improving.
    • Analyzing Loss Curves: They explain how to analyze the loss curves, looking for trends that indicate convergence or potential issues like overfitting. They suggest that a steadily decreasing loss curve generally indicates good learning progress.
    • Saving and Loading the Best Model: The sources highlight the importance of saving the model with the best performance achieved during training:
    • Tracking the Best Test Loss: They introduce a variable to track the best test loss achieved so far during training.
    • Saving the Model When Test Loss Improves: They include a condition within the training loop to save the model’s state dictionary (model.state_dict()) whenever a new best test loss is achieved.
    • Loading the Saved Model: They demonstrate how to load the saved model’s state dictionary using torch.load() and use it to restore the model’s parameters for later use.
    • Evaluating the Loaded Model: The sources guide readers through evaluating the performance of the loaded model on the test data:
    • Performing a Test Pass: They use the test_step function to calculate the loss and accuracy of the loaded model on the entire test dataset.
    • Comparing Results: They compare the results of the loaded model with the results obtained during training to ensure that the loaded model performs as expected.
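
    Assembled end to end, and reusing the train_step and test_step sketches from earlier, the loop with best-model checkpointing might look like this (the epoch count, file path, and dataloader names are assumptions for illustration):

    from pathlib import Path

    EPOCHS = 10
    best_test_loss = float("inf")                    # best test loss seen so far
    save_path = Path("models/tiny_vgg_best.pth")     # hypothetical save location
    save_path.parent.mkdir(parents=True, exist_ok=True)

    for epoch in range(EPOCHS):
        train_loss, train_acc = 0.0, 0.0             # training phase
        for batch in train_dataloader:
            loss, acc = train_step(model, batch, loss_fn, optimizer, device)
            train_loss, train_acc = train_loss + loss, train_acc + acc
        train_loss /= len(train_dataloader)          # average over batches
        train_acc /= len(train_dataloader)

        test_loss, test_acc = 0.0, 0.0               # testing phase
        for batch in test_dataloader:
            loss, acc = test_step(model, batch, loss_fn, device)
            test_loss, test_acc = test_loss + loss, test_acc + acc
        test_loss /= len(test_dataloader)
        test_acc /= len(test_dataloader)

        print(f"Epoch {epoch}: train_loss={train_loss:.4f} | "
              f"test_loss={test_loss:.4f} | test_acc={test_acc:.4f}")

        if test_loss < best_test_loss:               # save whenever test loss improves
            best_test_loss = test_loss
            torch.save(model.state_dict(), save_path)

    model.load_state_dict(torch.load(save_path))     # restore the best parameters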

    The sources provide a comprehensive walkthrough of the training process for the TinyVGG model, emphasizing the importance of setting up the training loop, tracking loss and accuracy metrics, visualizing training progress, saving the best model, and evaluating its performance. They offer practical tips and best practices for effective model training, encouraging readers to actively engage in the process, analyze the results, and gain a deeper understanding of how the model learns and improves.

    Understanding and Implementing Custom Datasets: Pages 681-690

    The sources shift focus to explaining the concept and implementation of custom datasets in PyTorch, emphasizing the flexibility and customization they offer for handling diverse types of data beyond pre-built datasets. They guide readers through the process of creating a custom dataset class, understanding its key methods, and visualizing samples from the custom dataset.

    • Introducing Custom Datasets: The sources introduce the concept of custom datasets in PyTorch, explaining that they allow for greater control and flexibility in handling data that doesn’t fit the structure of pre-built datasets. They highlight that custom datasets are especially useful when working with:
    • Data in Non-Standard Formats: Data that is not readily available in formats supported by pre-built datasets, requiring specific loading and processing steps.
    • Data with Unique Structures: Data with specific organizational structures or relationships that need to be represented in a particular way.
    • Data Requiring Specialized Transformations: Data that requires specific transformations or augmentations to prepare it for model training.
    • Using torchvision.datasets.ImageFolder: The sources acknowledge that the torchvision.datasets.ImageFolder class can handle many image classification datasets. They explain that ImageFolder works well when the data follows a standard directory structure, where images are organized into subfolders representing different classes. However, they also emphasize the need for custom dataset classes when dealing with data that doesn’t conform to this standard structure.
    • Building FoodVisionMini Custom Dataset: The sources guide readers through creating a custom dataset class called FoodVisionMini, designed to work with the smaller subset of the Food 101 dataset (pizza, steak, sushi) prepared earlier. They outline the key steps and considerations involved:
    • Subclassing torch.utils.data.Dataset: They explain that custom dataset classes should inherit from the torch.utils.data.Dataset class, which provides the basic framework for representing a dataset in PyTorch.
    • Implementing Required Methods: They highlight the essential methods that need to be implemented in a custom dataset class:
    • __init__ Method: The __init__ method initializes the dataset, taking the necessary arguments, such as the data directory, transformations to be applied, and any other relevant information.
    • __len__ Method: The __len__ method returns the total number of samples in the dataset.
    • __getitem__ Method: The __getitem__ method retrieves a data sample at a given index. It typically involves loading the data, applying transformations, and returning the processed data and its corresponding label.
    • __getitem__ Method Implementation: The sources provide a detailed breakdown of implementing the __getitem__ method in the FoodVisionMini dataset:
    • Getting the Image Path: The method first determines the file path of the image to be loaded based on the provided index.
    • Loading the Image: It uses PIL.Image.open() to open the image file.
    • Applying Transformations: It applies the specified transformations (if any) to the loaded image.
    • Converting to Tensor: It converts the transformed image to a PyTorch tensor.
    • Returning Data and Label: It returns the processed image tensor and its corresponding class label.
    • Overriding the __len__ Method: The sources also explain the importance of overriding the __len__ method to return the correct number of samples in the custom dataset. They demonstrate a simple implementation that returns the length of the list of image file paths.
    • Visualizing Samples from the Custom Dataset: The sources emphasize the importance of visually inspecting samples from the custom dataset to ensure that the data is loaded and processed correctly. They guide readers through creating a function to display random images from the dataset, including their labels, to verify the dataset’s integrity and the effectiveness of applied transformations.
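
    A condensed sketch of such a dataset class is shown below (the directory layout, the .jpg glob pattern, and the attribute names are assumptions; the sources’ FoodVisionMini may differ in detail):

    from pathlib import Path
    from PIL import Image
    from torch.utils.data import Dataset

    class FoodVisionMini(Dataset):
        """Minimal custom dataset for images stored as root/class_name/image.jpg."""
        def __init__(self, root, transform=None):
            self.paths = sorted(Path(root).glob("*/*.jpg"))   # all image file paths
            self.transform = transform
            self.classes = sorted({p.parent.name for p in self.paths})
            self.class_to_idx = {name: idx for idx, name in enumerate(self.classes)}

        def __len__(self):
            return len(self.paths)                            # total number of samples

        def __getitem__(self, index):
            path = self.paths[index]
            image = Image.open(path)                          # load the image with PIL
            label = self.class_to_idx[path.parent.name]       # label from the parent folder
            if self.transform:
                image = self.transform(image)                 # e.g. ToTensor() converts to a tensor
            return image, label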

    The sources provide a detailed guide to understanding and implementing custom datasets in PyTorch. They explain the motivations for using custom datasets, the key methods to implement, and practical considerations for loading, processing, and visualizing data. They encourage readers to explore the flexibility of custom datasets and create their own to handle diverse data formats and structures for their specific machine learning tasks.

    Exploring Data Augmentation and Building the TinyVGG Model Architecture: Pages 691-700

    The sources introduce the concept of data augmentation, a powerful technique for enhancing the diversity and robustness of training datasets, and then guide readers through building the TinyVGG model architecture using PyTorch.

    • Visualizing the Effects of Data Augmentation: The sources demonstrate the visual effects of applying data augmentation techniques to images from the custom food dataset. They showcase examples where images have been:
    • Cropped: Portions of the original images have been removed, potentially changing the focus or composition.
    • Darkened/Brightened: The overall brightness or contrast of the images has been adjusted, simulating variations in lighting conditions.
    • Shifted: The content of the images has been moved within the frame, altering the position of objects.
    • Rotated: The images have been rotated by a certain angle, introducing variations in orientation.
    • Color-Modified: The color balance or saturation of the images has been altered, simulating variations in color perception.

    The sources emphasize that applying these augmentations randomly during training can help the model learn more robust and generalizable features, making it less sensitive to variations in image appearance and less prone to overfitting the training data.
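
    One way to express such a pipeline with torchvision.transforms is sketched below (the specific transforms and parameter values are illustrative choices, not the sources’ exact settings):

    from torchvision import transforms

    train_transform = transforms.Compose([
        transforms.RandomResizedCrop(size=(64, 64), scale=(0.7, 1.0)),  # random crop + resize
        transforms.ColorJitter(brightness=0.3, saturation=0.3),         # darken/brighten, modify color
        transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),       # shift content within the frame
        transforms.RandomRotation(degrees=30),                          # random rotation
        transforms.ToTensor(),                                          # PIL image -> [0, 1] tensor
    ])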

    • Creating a Function to Display Random Transformed Images: The sources provide code for creating a function to display random images from the custom dataset after they have been transformed using data augmentation techniques. This function allows for visual inspection of the augmented images, helping readers understand the impact of different transformations on the dataset. They explain how this function can be used to:
    • Verify Transformations: Ensure that the intended augmentations are being applied correctly to the images.
    • Assess Augmentation Strength: Evaluate whether the strength or intensity of the augmentations is appropriate for the dataset and task.
    • Visualize Data Diversity: Observe the increased diversity in the dataset resulting from data augmentation.
    • Implementing the TinyVGG Model Architecture: The sources guide readers through implementing the TinyVGG model architecture, a convolutional neural network architecture known for its simplicity and effectiveness in image classification tasks. They outline the key building blocks of the TinyVGG model:
    • Convolutional Blocks (conv_block): The model uses multiple convolutional blocks, each consisting of:
    • Convolutional Layers (nn.Conv2d): These layers apply learnable filters to the input image, extracting features at different scales and orientations.
    • ReLU Activation Layers (nn.ReLU): These layers introduce non-linearity into the model, allowing it to learn complex patterns in the data.
    • Max Pooling Layers (nn.MaxPool2d): These layers downsample the feature maps, reducing their spatial dimensions while retaining the most important features.
    • Classifier Layer: The convolutional blocks are followed by a classifier layer, which consists of:
    • Flatten Layer (nn.Flatten): This layer converts the multi-dimensional feature maps from the convolutional blocks into a one-dimensional feature vector.
    • Linear Layer (nn.Linear): This layer performs a linear transformation on the feature vector, producing output logits that represent the model’s predictions for each class.

    The sources emphasize the hierarchical structure of the TinyVGG model, where the convolutional blocks progressively extract more abstract and complex features from the input image, and the classifier layer uses these features to make predictions. They explain that the TinyVGG model’s simple yet effective design makes it a suitable choice for various image classification tasks, and its modular structure allows for customization and experimentation with different layer configurations.
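
    A compact PyTorch sketch of this architecture follows (kernel sizes, padding, hidden-unit count, and the 64×64 input assumption are illustrative; the sources’ exact layer settings may differ):

    from torch import nn

    class TinyVGG(nn.Module):
        """Two convolutional blocks followed by a classifier layer."""
        def __init__(self, input_channels=3, hidden_units=10, output_classes=3):
            super().__init__()
            self.conv_block_1 = nn.Sequential(
                nn.Conv2d(input_channels, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2),          # 64x64 -> 32x32
            )
            self.conv_block_2 = nn.Sequential(
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2),          # 32x32 -> 16x16
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),                         # feature maps -> one long vector
                nn.Linear(hidden_units * 16 * 16, output_classes),
            )

        def forward(self, x):
            return self.classifier(self.conv_block_2(self.conv_block_1(x)))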

    • Troubleshooting Shape Mismatches: The sources address the common issue of shape mismatches that can occur when building deep learning models, emphasizing the importance of carefully checking the input and output dimensions of each layer:
    • Using Error Messages as Guides: They explain that error messages related to shape mismatches can provide valuable clues for identifying the source of the issue.
    • Printing Shapes for Verification: They recommend printing the shapes of tensors at various points in the model to verify that the dimensions are as expected and to trace the flow of data through the model.
    • Calculating Shapes Manually: They suggest calculating the expected output shapes of convolutional and pooling layers manually, considering factors like kernel size, stride, and padding, to ensure that the model is structured correctly.
    • Using torchinfo for Model Summary: The sources introduce the torchinfo package, a useful tool for visualizing the structure and parameters of a PyTorch model. They explain that torchinfo can provide a comprehensive summary of the model, including:
    • Layer Information: The type and configuration of each layer in the model.
    • Input and Output Shapes: The expected dimensions of tensors at each stage of the model.
    • Number of Parameters: The total number of trainable parameters in the model.
    • Memory Usage: An estimate of the model’s memory requirements.

    The sources demonstrate how to use torchinfo to summarize the TinyVGG model, highlighting its ability to provide insights into the model’s architecture and complexity, and assist in debugging shape-related issues.
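
    For example (assuming the TinyVGG sketch above and a 64×64 RGB input):

    # pip install torchinfo
    from torchinfo import summary

    model = TinyVGG(input_channels=3, hidden_units=10, output_classes=3)
    summary(model, input_size=(32, 3, 64, 64))   # (batch, channels, height, width)

    # Manual cross-check for any conv/pool layer:
    #   out_size = floor((in_size + 2 * padding - kernel_size) / stride) + 1
    # e.g. a 3x3 conv with padding=1, stride=1 on a 64x64 input stays 64x64.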

    The sources provide a practical guide to understanding and implementing data augmentation techniques, building the TinyVGG model architecture, and troubleshooting common issues. They emphasize the importance of visualizing the effects of augmentations, carefully checking layer shapes, and utilizing tools like torchinfo for model analysis. These steps lay the foundation for training the TinyVGG model on the custom food dataset in subsequent sections.

    Training and Evaluating the TinyVGG Model on a Custom Dataset: Pages 701-710

    The sources guide readers through training and evaluating the TinyVGG model on the custom food dataset, explaining how to implement training and evaluation loops, track model performance, and visualize results.

    • Preparing for Model Training: The sources outline the steps to prepare for training the TinyVGG model:
    • Setting a Random Seed: They emphasize the importance of setting a random seed for reproducibility. This ensures that the random initialization of model weights and any data shuffling during training are consistent across different runs, making it easier to compare and analyze results. [1]
    • Creating a List of Image Paths: They generate a list of paths to all the image files in the custom dataset. This list will be used to access and process images during training. [1]
    • Visualizing Data with PIL: They demonstrate how to use the Python Imaging Library (PIL) to:
    • Open and Display Images: Load and display images from the dataset using PIL.Image.open(). [2]
    • Convert Images to Arrays: Transform images into numerical arrays using np.array(), enabling further processing and analysis. [3]
    • Inspect Color Channels: Examine the red, green, and blue (RGB) color channels of images, understanding how color information is represented numerically. [3]
    • Implementing Image Transformations: They review the concept of image transformations and their role in preparing images for model input, highlighting:
    • Conversion to Tensors: Transforming images into PyTorch tensors, the required data format for inputting data into PyTorch models. [3]
    • Resizing and Cropping: Adjusting image dimensions to ensure consistency and compatibility with the model’s input layer. [3]
    • Normalization: Scaling pixel values to a specific range, typically between 0 and 1, to improve model training stability and efficiency. [3]
    • Data Augmentation: Applying random transformations to images during training to increase data diversity and prevent overfitting. [4]
    • Utilizing ImageFolder for Data Loading: The sources demonstrate the convenience of using the torchvision.datasets.ImageFolder class for loading images from a directory structured according to image classification standards. They explain how ImageFolder:
    • Organizes Data by Class: Automatically infers class labels based on the subfolder structure of the image directory, streamlining data organization. [5]
    • Provides Data Length: Offers a __len__ method to determine the number of samples in the dataset, useful for tracking progress during training. [5]
    • Enables Sample Access: Implements a __getitem__ method to retrieve a specific image and its corresponding label based on its index, facilitating data access during training. [5]
    • Creating DataLoader for Batch Processing: The sources emphasize the importance of using the torch.utils.data.DataLoader class to create data loaders, explaining their role in:
    • Batching Data: Grouping multiple images and labels into batches, allowing the model to process multiple samples simultaneously, which can significantly speed up training. [6]
    • Shuffling Data: Randomizing the order of samples within batches to prevent the model from learning spurious patterns based on the order of data presentation. [6]
    • Loading Data Efficiently: Optimizing data loading and transfer, especially when working with large datasets, to minimize training time and resource usage. [6]
    • Visualizing a Sample and Label: The sources guide readers through visualizing an image and its label from the custom dataset using Matplotlib, allowing for a visual confirmation that the data is being loaded and processed correctly. [7]
    • Understanding Data Shape and Transformations: The sources highlight the importance of understanding how data shapes change as they pass through different stages of the model:
    • Color Channels First (NCHW): PyTorch often expects images in the format “Batch Size (N), Color Channels (C), Height (H), Width (W).” [8]
    • Transformations and Shape: They reiterate the importance of verifying that image transformations result in the expected output shapes, ensuring compatibility with subsequent layers. [8]
    • Replicating ImageFolder Functionality: The sources provide code for replicating the core functionality of ImageFolder manually. They explain that this exercise can deepen understanding of how custom datasets are created and provide a foundation for building more specialized datasets in the future. [9]
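
    The data-loading pipeline described above can be condensed into a few lines (the directory paths and batch size are assumptions for illustration):

    import torch
    from torchvision import datasets, transforms
    from torch.utils.data import DataLoader

    torch.manual_seed(42)                                  # reproducibility

    data_transform = transforms.Compose([
        transforms.Resize((64, 64)),
        transforms.ToTensor(),                             # PIL -> tensor scaled to [0, 1]
    ])

    # Hypothetical layout: data/pizza_steak_sushi/{train,test}/{class}/*.jpg
    train_data = datasets.ImageFolder("data/pizza_steak_sushi/train", transform=data_transform)
    test_data = datasets.ImageFolder("data/pizza_steak_sushi/test", transform=data_transform)

    train_dataloader = DataLoader(train_data, batch_size=32, shuffle=True)
    test_dataloader = DataLoader(test_data, batch_size=32, shuffle=False)

    images, labels = next(iter(train_dataloader))
    print(images.shape)                                    # torch.Size([32, 3, 64, 64]) -- NCHW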

    The sources meticulously guide readers through the essential steps of preparing data, loading it using ImageFolder, and creating data loaders for efficient batch processing. They emphasize the importance of data visualization, shape verification, and understanding the transformations applied to images. These detailed explanations set the stage for training and evaluating the TinyVGG model on the custom food dataset.

    Constructing the Training Loop and Evaluating Model Performance: Pages 711-720

    The sources focus on building the training loop and evaluating the performance of the TinyVGG model on the custom food dataset. They introduce techniques for tracking training progress, calculating loss and accuracy, and visualizing the training process.

    • Creating Training and Testing Step Functions: The sources explain the importance of defining separate functions for the training and testing steps. They guide readers through implementing these functions:
    • train_step Function: This function outlines the steps involved in a single training iteration. It includes:
    1. Setting the Model to Train Mode: The model is set to training mode (model.train()) so that training-specific layers, such as dropout and batch normalization, behave correctly while the model learns.
    2. Performing a Forward Pass: The input data (images) is passed through the model to obtain the output predictions (logits).
    3. Calculating the Loss: The predicted logits are compared to the true labels using a loss function (e.g., cross-entropy loss), providing a measure of how well the model’s predictions match the actual data.
    4. Calculating the Accuracy: The model’s accuracy is calculated by determining the percentage of correct predictions.
    5. Zeroing Gradients: The gradients from the previous iteration are reset to zero (optimizer.zero_grad()) to prevent their accumulation and ensure that each iteration’s gradients are calculated independently.
    6. Performing Backpropagation: The gradients of the loss function with respect to the model’s parameters are calculated (loss.backward()), tracing the path of error back through the network.
    7. Updating Model Parameters: The optimizer updates the model’s parameters (optimizer.step()) based on the calculated gradients, adjusting the model’s weights and biases to minimize the loss function.
    8. Returning Loss and Accuracy: The function returns the calculated loss and accuracy for the current training iteration, allowing for performance monitoring.
    • test_step Function: This function performs a similar process to the train_step function, but without gradient calculations or parameter updates. It is designed to evaluate the model’s performance on a separate test dataset, providing an unbiased assessment of how well the model generalizes to unseen data.
    • Implementing the Training Loop: The sources outline the structure of the training loop, which iteratively trains and evaluates the model over a specified number of epochs:
    • Looping through Epochs: The loop iterates through the desired number of epochs, allowing the model to see and learn from the training data multiple times.
    • Looping through Batches: Within each epoch, the loop iterates through the batches of data provided by the training data loader.
    • Calling train_step and test_step: For each batch, the train_step function is called to train the model, and periodically, the test_step function is called to evaluate the model’s performance on the test dataset.
    • Tracking and Accumulating Loss and Accuracy: The loss and accuracy values from each batch are accumulated to calculate the average loss and accuracy for the entire epoch.
    • Printing Progress: The training progress, including epoch number, loss, and accuracy, is printed to the console, providing a real-time view of the model’s performance.
    • Using tqdm for Progress Bars: The sources recommend using the tqdm library to create progress bars, which visually display the progress of the training loop, making it easier to track how long each epoch takes and estimate the remaining training time.
    • Visualizing Training Progress with Loss Curves: The sources emphasize the importance of visualizing the model’s training progress by plotting loss curves. These curves show how the loss function changes over time (epochs or batches), providing insights into:
    • Model Convergence: Whether the model is successfully learning and reducing the error on the training data, indicated by a decreasing loss curve.
    • Overfitting: If the loss on the training data continues to decrease while the loss on the test data starts to increase, it might indicate that the model is overfitting the training data and not generalizing well to unseen data.
    • Understanding Ideal and Problematic Loss Curves: The sources provide examples of ideal and problematic loss curves, helping readers identify patterns that suggest healthy training progress or potential issues that may require adjustments to the model’s architecture, hyperparameters, or training process.
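
    A tiny runnable sketch of tqdm progress bars plus loss-curve plotting is shown below (the loss values are placeholders standing in for the per-epoch averages returned by train_step and test_step):

    from tqdm.auto import tqdm
    import matplotlib.pyplot as plt

    results = {"train_loss": [], "test_loss": []}

    for epoch in tqdm(range(5)):                 # tqdm renders a live progress bar
        # placeholder numbers; in practice these come from train_step/test_step
        results["train_loss"].append(1.0 / (epoch + 1))
        results["test_loss"].append(1.2 / (epoch + 1))

    plt.plot(results["train_loss"], label="train loss")
    plt.plot(results["test_loss"], label="test loss")
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.legend()
    plt.show()

    A widening gap, where test loss keeps rising while training loss keeps falling, is the classic visual signature of overfitting.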

    The sources provide a detailed guide to constructing the training loop, tracking model performance, and visualizing the training process. They explain how to implement training and testing steps, use tqdm for progress tracking, and interpret loss curves to monitor the model’s learning and identify potential issues. These steps are crucial for successfully training and evaluating the TinyVGG model on the custom food dataset.

    Experiment Tracking and Enhancing Model Performance: Pages 721-730

    The sources guide readers through tracking model experiments and exploring techniques to enhance the TinyVGG model’s performance on the custom food dataset. They explain methods for comparing results, adjusting hyperparameters, and introduce the concept of transfer learning.

    • Comparing Model Results: The sources introduce strategies for comparing the results of different model training experiments. They demonstrate how to:
    • Create a Dictionary to Store Results: Organize the results of each experiment, including loss, accuracy, and training time, into separate dictionaries for easy access and comparison.
    • Use Pandas DataFrames for Analysis: Leverage the power of Pandas DataFrames to:
    • Structure Results: Neatly organize the results from different experiments into a tabular format, facilitating clear comparisons.
    • Sort and Analyze Data: Sort and analyze the data to identify trends, such as which model configuration achieved the lowest loss or highest accuracy, and to observe how changes in hyperparameters affect performance.
    • Exploring Ways to Improve a Model: The sources discuss various techniques for improving the performance of a deep learning model, including:
    • Adjusting Hyperparameters: Modifying hyperparameters, such as the learning rate, batch size, and number of epochs, can significantly impact model performance. They suggest experimenting with these parameters to find optimal settings for a given dataset.
    • Adding More Layers: Increasing the depth of the model by adding more layers can potentially allow the model to learn more complex representations of the data, leading to improved accuracy.
    • Adding More Hidden Units: Increasing the number of hidden units in each layer can also enhance the model’s capacity to learn intricate patterns in the data.
    • Training for Longer: Training the model for more epochs can sometimes lead to further improvements, but it is crucial to monitor the loss curves for signs of overfitting.
    • Using a Different Optimizer: Different optimizers employ distinct strategies for updating model parameters. Experimenting with various optimizers, such as Adam or RMSprop, might yield better performance compared to the default stochastic gradient descent (SGD) optimizer.
    • Leveraging Transfer Learning: The sources introduce the concept of transfer learning, a powerful technique where a model pre-trained on a large dataset is used as a starting point for training on a smaller, related dataset. They explain how transfer learning can:
    • Improve Performance: Benefit from the knowledge gained by the pre-trained model, often resulting in faster convergence and higher accuracy on the target dataset.
    • Reduce Training Time: Leverage the pre-trained model’s existing feature representations, potentially reducing the need for extensive training from scratch.
    • Making Predictions on a Custom Image: The sources demonstrate how to use the trained model to make predictions on a custom image. This involves:
    • Loading and Transforming the Image: Loading the image using PIL, applying the same transformations used during training (resizing, normalization, etc.), and converting the image to a PyTorch tensor.
    • Passing the Image through the Model: Inputting the transformed image tensor into the trained model to obtain the predicted logits.
    • Applying Softmax for Probabilities: Converting the raw logits into probabilities using the softmax function, indicating the model’s confidence in each class prediction.
    • Determining the Predicted Class: Selecting the class with the highest probability as the model’s prediction for the input image.
    • Understanding Model Performance: The sources emphasize the importance of evaluating the model’s performance both quantitatively and qualitatively:
    • Quantitative Evaluation: Using metrics like loss and accuracy to assess the model’s performance numerically, providing objective measures of its ability to learn and generalize.
    • Qualitative Evaluation: Examining predictions on individual images to gain insights into the model’s decision-making process. This can help identify areas where the model struggles and suggest potential improvements to the training data or model architecture.
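
    As a sketch of the comparison workflow, results can be collected into a dictionary and viewed as a DataFrame (the experiment names and numbers below are made up purely to show the mechanics):

    import pandas as pd

    compare_results = {
        "model_0_baseline":  {"test_loss": 1.05, "test_acc": 0.41, "train_time_s": 28.1},
        "model_1_augmented": {"test_loss": 0.98, "test_acc": 0.47, "train_time_s": 31.4},
        "model_2_longer":    {"test_loss": 0.91, "test_acc": 0.52, "train_time_s": 55.9},
    }

    df = pd.DataFrame(compare_results).T         # one row per experiment
    print(df.sort_values("test_loss"))           # best (lowest) test loss first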

    The sources cover important aspects of tracking experiments, improving model performance, and making predictions. They explain methods for comparing results, discuss various hyperparameter tuning techniques and introduce transfer learning. They also guide readers through making predictions on custom images and emphasize the importance of both quantitative and qualitative evaluation to understand the model’s strengths and limitations.

    Building Custom Datasets with PyTorch: Pages 731-740

    The sources shift focus to constructing custom datasets in PyTorch. They explain the motivation behind creating custom datasets, walk through the process of building one for the food classification task, and highlight the importance of understanding the dataset structure and visualizing the data.

    • Understanding the Need for Custom Datasets: The sources explain that while pre-built datasets like FashionMNIST are valuable for learning and experimentation, real-world machine learning projects often require working with custom datasets specific to the problem at hand. Building custom datasets allows for greater flexibility and control over the data used for training models.
    • Creating a Custom ImageDataset Class: The sources guide readers through creating a custom dataset class named ImageDataset, which inherits from the Dataset class provided by PyTorch. They outline the key steps and methods involved:
    1. Initialization (__init__): This method initializes the dataset by:
    • Defining the root directory where the image data is stored.
    • Setting up the transformation pipeline to be applied to each image (e.g., resizing, normalization).
    • Creating a list of image file paths by recursively traversing the directory structure.
    • Generating a list of corresponding labels based on the image’s parent directory (representing the class).
    2. Calculating Dataset Length (__len__): This method returns the total number of samples in the dataset, determined by the length of the image file path list. This allows PyTorch’s data loaders to know how many samples are available.
    3. Getting a Sample (__getitem__): This method fetches a specific sample from the dataset given its index. It involves:
    • Retrieving the image file path and label corresponding to the provided index.
    • Loading the image using PIL.
    • Applying the defined transformations to the image.
    • Converting the image to a PyTorch tensor.
    • Returning the transformed image tensor and its associated label.
    • Mapping Class Names to Integers: The sources demonstrate a helper function that maps class names (e.g., “pizza”, “steak”, “sushi”) to integer labels (e.g., 0, 1, 2). This is necessary for PyTorch models, which typically work with numerical labels.
    • Visualizing Samples and Labels: The sources stress the importance of visually inspecting the data to gain a better understanding of the dataset’s structure and contents. They guide readers through creating a function to display random images from the custom dataset along with their corresponding labels, allowing for a qualitative assessment of the data.
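
    The class-name-to-integer mapping mentioned above could be implemented roughly as follows (the function name find_classes and the error handling are assumptions):

    from pathlib import Path

    def find_classes(directory):
        """Map class folder names to integer labels, e.g. {'pizza': 0, 'steak': 1, 'sushi': 2}."""
        classes = sorted(entry.name for entry in Path(directory).iterdir() if entry.is_dir())
        if not classes:
            raise FileNotFoundError(f"No class folders found in {directory}.")
        class_to_idx = {name: index for index, name in enumerate(classes)}
        return classes, class_to_idx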

    The sources provide a comprehensive overview of building custom datasets in PyTorch, specifically focusing on creating an ImageDataset class for image classification tasks. They outline the essential methods for initialization, calculating length, and retrieving samples, along with the process of mapping class names to integers and visualizing the data.

    Visualizing and Augmenting Custom Datasets: Pages 741-750

    The sources focus on visualizing data from the custom ImageDataset and introduce the concept of data augmentation as a technique to enhance model performance. They guide readers through creating a function to display random images from the dataset and explore various data augmentation techniques, specifically using the torchvision.transforms module.

    • Creating a Function to Display Random Images: The sources outline the steps involved in creating a function to visualize random images from the custom dataset, enabling a qualitative assessment of the data and the transformations applied. They provide detailed guidance on:
    1. Function Definition: Define a function that accepts the dataset, class names, the number of images to display (defaulting to 10), and a boolean flag (display_shape) to optionally show the shape of each image.
    2. Limiting Display for Practicality: To prevent overwhelming the display, the function caps the number of images shown at 10. If the user requests more than 10 images, the function automatically sets the limit to 10 and disables the display_shape option.
    3. Random Sampling: Generate a list of random indices within the range of the dataset’s length using random.sample. The number of indices to sample is determined by the n parameter (number of images to display).
    4. Setting up the Plot: Create a Matplotlib figure with a size adjusted based on the number of images to display.
    5. Iterating through Samples: Loop through the randomly sampled indices, retrieving the corresponding image and label from the dataset using the __getitem__ method.
    6. Creating Subplots: For each image, create a subplot within the Matplotlib figure, arranging them in a single row.
    7. Displaying Images: Use plt.imshow to display the image within its designated subplot.
    8. Setting Titles: Set the title of each subplot to display the class name of the image.
    9. Optional Shape Display: If the display_shape flag is True, print the shape of each image tensor below its subplot.
    • Introducing Data Augmentation: The sources highlight the importance of data augmentation, a technique that artificially increases the diversity of training data by applying various transformations to the original images. Data augmentation helps improve the model’s ability to generalize and reduces the risk of overfitting. They provide a conceptual explanation of data augmentation and its benefits, emphasizing its role in enhancing model robustness and performance.
    • Exploring torchvision.transforms: The sources guide readers through the torchvision.transforms module, a valuable tool in PyTorch that provides a range of image transformations for data augmentation. They discuss specific transformations like:
    • RandomHorizontalFlip: Randomly flips the image horizontally with a given probability.
    • RandomRotation: Rotates the image by a random angle within a specified range.
    • ColorJitter: Randomly adjusts the brightness, contrast, saturation, and hue of the image.
    • RandomResizedCrop: Crops a random portion of the image and resizes it to a given size.
    • ToTensor: Converts the PIL image to a PyTorch tensor.
    • Normalize: Normalizes the image tensor using specified mean and standard deviation values.
    • Visualizing Transformed Images: The sources demonstrate how to visualize images after applying data augmentation transformations. They create a new transformation pipeline incorporating the desired augmentations and then use the previously defined function to display random images from the dataset after they have been transformed.
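
    Following the nine steps above, a minimal version of the display function might read (the function name and figure sizing are assumptions; it expects samples returned as CHW tensors):

    import random
    import matplotlib.pyplot as plt

    def display_random_images(dataset, classes, n=10, display_shape=True):
        """Show n random (image, label) samples from a dataset of CHW tensors."""
        if n > 10:                                  # cap the display at 10 images
            n, display_shape = 10, False
        indices = random.sample(range(len(dataset)), k=n)
        plt.figure(figsize=(16, 8))
        for i, idx in enumerate(indices):
            image, label = dataset[idx]             # calls __getitem__
            plt.subplot(1, n, i + 1)
            plt.imshow(image.permute(1, 2, 0))      # CHW -> HWC for matplotlib
            plt.axis("off")
            title = classes[label]
            if display_shape:
                title += f"\n{tuple(image.shape)}"
            plt.title(title, fontsize=8)
        plt.show()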

    The sources provide valuable insights into visualizing custom datasets and leveraging data augmentation to improve model training. They explain the creation of a function to display random images, introduce data augmentation as a concept, and explore various transformations provided by the torchvision.transforms module. They also demonstrate how to visualize the effects of these transformations, allowing for a better understanding of how they augment the training data.

    Implementing a Convolutional Neural Network for Food Classification: Pages 751-760

    The sources shift focus to building and training a convolutional neural network (CNN) to classify images from the custom food dataset. They walk through the process of implementing a TinyVGG architecture, setting up training and testing functions, and evaluating the model’s performance.

    • Building a TinyVGG Architecture: The sources introduce the TinyVGG architecture as a simplified version of the popular VGG network, known for its effectiveness in image classification tasks. They provide a step-by-step guide to constructing the TinyVGG model using PyTorch:
    1. Defining Input Shape and Hidden Units: Establish the input shape of the images, considering the number of color channels, height, and width. Also, determine the number of hidden units to use in convolutional layers.
    2. Constructing Convolutional Blocks: Create two convolutional blocks, each consisting of:
    • A 2D convolutional layer (nn.Conv2d) to extract features from the input images.
    • A ReLU activation function (nn.ReLU) to introduce non-linearity.
    • Another 2D convolutional layer.
    • Another ReLU activation function.
    • A max-pooling layer (nn.MaxPool2d) to downsample the feature maps, reducing their spatial dimensions.
    3. Creating the Classifier Layer: Define the classifier layer, responsible for producing the final classification output. This layer comprises:
    • A flattening layer (nn.Flatten) to convert the multi-dimensional feature maps from the convolutional blocks into a one-dimensional feature vector.
    • A linear layer (nn.Linear) to perform the final classification, mapping the features to the number of output classes.
    • A ReLU activation function.
    • Another linear layer to produce the final output with the desired number of classes.
    4. Combining Layers in nn.Sequential: Utilize nn.Sequential to organize and connect the convolutional blocks and the classifier layer in a sequential manner, defining the flow of data through the model.
    • Verifying Model Architecture with torchinfo: The sources introduce the torchinfo package as a helpful tool for summarizing and verifying the architecture of a PyTorch model. They demonstrate its usage by passing the created TinyVGG model to torchinfo.summary, providing a concise overview of the model’s layers, input and output shapes, and the number of trainable parameters.
    • Setting up Training and Testing Functions: The sources outline the process of creating functions for training and testing the TinyVGG model. They provide a detailed explanation of the steps involved in each function:
    • Training Function (train_step): This function handles a single training step, accepting the model, data loader, loss function, optimizer, and device as input:
    1. Set the model to training mode (model.train()).
    2. Iterate through batches of data from the data loader.
    3. For each batch, send the input data and labels to the specified device.
    4. Perform a forward pass through the model to obtain predictions (logits).
    5. Calculate the loss using the provided loss function.
    6. Perform backpropagation to compute gradients.
    7. Update model parameters using the optimizer.
    8. Accumulate training loss for the epoch.
    9. Return the average training loss.
    • Testing Function (test_step): This function evaluates the model’s performance on a given dataset, accepting the model, data loader, loss function, and device as input:
    1. Set the model to evaluation mode (model.eval()).
    2. Disable gradient calculation using torch.no_grad().
    3. Iterate through batches of data from the data loader.
    4. For each batch, send the input data and labels to the specified device.
    5. Perform a forward pass through the model to obtain predictions.
    6. Calculate the loss.
    7. Accumulate testing loss.
    8. Return the average testing loss.
    • Training and Evaluating the Model: The sources guide readers through the process of training the TinyVGG model using the defined training function. They outline steps such as:
    1. Instantiating the model and moving it to the desired device (CPU or GPU).
    2. Defining the loss function (e.g., cross-entropy loss) and optimizer (e.g., SGD).
    3. Setting up the training loop for a specified number of epochs.
    4. Calling the train_step function for each epoch to train the model on the training data.
    5. Evaluating the model’s performance on the test data using the test_step function.
    6. Tracking and printing training and testing losses for each epoch.
    • Visualizing the Loss Curve: The sources emphasize the importance of visualizing the loss curve to monitor the model’s training progress and detect potential issues like overfitting or underfitting. They provide guidance on creating a plot showing the training loss over epochs, allowing users to observe how the loss decreases as the model learns.
    • Preparing for Model Improvement: The sources acknowledge that the initial performance of the TinyVGG model may not be optimal. They suggest various techniques to potentially improve the model’s performance in subsequent steps, paving the way for further experimentation and model refinement.

    The sources offer a comprehensive walkthrough of building and training a TinyVGG model for image classification using a custom food dataset. They detail the architecture of the model, explain the training and testing procedures, and highlight the significance of visualizing the loss curve. They also lay the foundation for exploring techniques to enhance the model’s performance in later stages.

    Improving Model Performance and Tracking Experiments: Pages 761-770

    The sources transition from establishing a baseline model to exploring techniques for enhancing its performance and introduce methods for tracking experimental results. They focus on data augmentation strategies using the torchvision.transforms module and creating a system for comparing different model configurations.

    • Evaluating the Custom ImageDataset: The sources revisit the custom ImageDataset created earlier, emphasizing the importance of assessing its functionality. They use the previously defined plot_random_images function to visually inspect a sample of images from the dataset, confirming that the images are loaded correctly and transformed as intended.
    • Data Augmentation for Enhanced Performance: The sources delve deeper into data augmentation as a crucial technique for improving the model’s ability to generalize to unseen data. They highlight how data augmentation artificially increases the diversity and size of the training data, leading to more robust models that are less prone to overfitting.
    • Exploring torchvision.transforms for Augmentation: The sources guide users through different data augmentation techniques available in the torchvision.transforms module. They explain the purpose and effects of various transformations, including:
    • RandomHorizontalFlip: Randomly flips the image horizontally, adding variability to the dataset.
    • RandomRotation: Rotates the image by a random angle within a specified range, exposing the model to different orientations.
    • ColorJitter: Randomly adjusts the brightness, contrast, saturation, and hue of the image, making the model more robust to variations in lighting and color.
    • Visualizing Augmented Images: The sources demonstrate how to visualize the effects of data augmentation by applying transformations to images and then displaying the transformed images. This visual inspection helps understand the impact of the augmentations and ensure they are applied correctly.
    • Introducing TrivialAugment: The sources introduce TrivialAugment, a data augmentation strategy that applies a single randomly chosen augmentation, at a randomly chosen strength, to each image. They explain that TrivialAugment has been shown to be effective in improving model performance, particularly when combined with other techniques. They provide a link to a research paper for further reading on TrivialAugment, encouraging users to explore the strategy in more detail.
    • Applying TrivialAugment to the Custom Dataset: The sources guide users through applying TrivialAugment to the custom food dataset. They create a new transformation pipeline incorporating TrivialAugment and then use the plot_random_images function to display a sample of augmented images, allowing users to visually assess the impact of the augmentations.
    • Creating a System for Comparing Model Results: The sources shift focus to establishing a structured approach for tracking and comparing the performance of different model configurations. They create a dictionary called compare_results to store results from various model experiments. This dictionary is designed to hold information such as training time, training loss, testing loss, and testing accuracy for each model.
    • Setting Up a Pandas DataFrame: The sources introduce Pandas DataFrames as a convenient tool for organizing and analyzing experimental results. They convert the compare_results dictionary into a Pandas DataFrame, providing a structured table-like representation of the results, making it easier to compare the performance of different models.
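
    A TrivialAugment pipeline like the one described above can be built directly from torchvision (available as TrivialAugmentWide in torchvision 0.11+; the resize value is an assumption):

    from torchvision import transforms

    train_transform_trivial = transforms.Compose([
        transforms.Resize((64, 64)),
        transforms.TrivialAugmentWide(num_magnitude_bins=31),  # one random op at a random strength
        transforms.ToTensor(),
    ])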

    The sources provide valuable insights into techniques for improving model performance, specifically focusing on data augmentation strategies. They guide users through various transformations available in the torchvision.transforms module, explain the concept and benefits of TrivialAugment, and demonstrate how to visualize the effects of these augmentations. Moreover, they introduce a structured approach for tracking and comparing experimental results using a dictionary and a Pandas DataFrame, laying the groundwork for systematic model experimentation and analysis.

    Predicting on a Custom Image and Wrapping Up the Custom Datasets Section: Pages 771-780

    The sources shift focus to making predictions on a custom image using the trained TinyVGG model and summarize the key concepts covered in the custom datasets section. They guide users through the process of preparing the image, making predictions, and analyzing the results.

    • Preparing a Custom Image for Prediction: The sources outline the steps for preparing a custom image for prediction:
    1. Obtaining the Image: Acquire an image that aligns with the classes the model was trained on. In this case, the image should be of either pizza, steak, or sushi.
    2. Resizing and Converting to RGB: Ensure the image is resized to the dimensions expected by the model (64×64 in this case) and converted to RGB format. This resizing step is crucial as the model was trained on images with specific dimensions and expects the same input format during prediction.
    3. Converting to a PyTorch Tensor: Transform the image into a PyTorch tensor using torchvision.transforms.ToTensor(). This conversion is necessary to feed the image data into the PyTorch model.
    • Making Predictions with the Trained Model: The sources walk through the process of using the trained TinyVGG model to make predictions on the prepared custom image:
    1. Setting the Model to Evaluation Mode: Switch the model to evaluation mode using model.eval(). This step ensures that the model behaves appropriately for prediction, deactivating functionalities like dropout that are only used during training.
    2. Performing a Forward Pass: Pass the prepared image tensor through the model to obtain the model’s predictions (logits).
    3. Applying Softmax to Obtain Probabilities: Convert the raw logits into prediction probabilities using the softmax function (torch.softmax()). Softmax transforms the logits into a probability distribution, where each value represents the model’s confidence in the image belonging to a particular class.
    4. Determining the Predicted Class: Identify the class with the highest predicted probability, representing the model’s final prediction for the input image.
    • Analyzing the Prediction Results: The sources emphasize the importance of carefully analyzing the prediction results, considering both quantitative and qualitative aspects. They highlight that even if the model’s accuracy may not be perfect, a qualitative assessment of the predictions can provide valuable insights into the model’s behavior and potential areas for improvement.
    • Summarizing the Custom Datasets Section: The sources provide a comprehensive summary of the key concepts covered in the custom datasets section:
    1. Understanding Custom Datasets: They reiterate the importance of working with custom datasets, especially when dealing with domain-specific problems or when pre-trained models may not be readily available. They emphasize the ability of custom datasets to address unique challenges and tailor models to specific needs.
    2. Building a Custom Dataset: They recap the process of building a custom dataset using torchvision.datasets.ImageFolder. They highlight the benefits of ImageFolder for handling image data organized in standard image classification format, where images are stored in separate folders representing different classes.
    3. Creating a Custom ImageDataset Class: They review the steps involved in creating a custom ImageDataset class, demonstrating the flexibility and control this approach offers for handling and processing data. They explain the key methods required for a custom dataset, including __init__, __len__, and __getitem__, and how these methods interact with the data loader.
    4. Data Augmentation Techniques: They emphasize the importance of data augmentation for improving model performance, particularly in scenarios where the training data is limited. They reiterate the techniques explored earlier, including random horizontal flipping, random rotation, color jittering, and TrivialAugment, highlighting how these techniques can enhance the model’s ability to generalize to unseen data.
    5. Training and Evaluating Models: They summarize the process of training and evaluating models on custom datasets, highlighting the steps involved in setting up training loops, evaluating model performance, and visualizing results.
    • Introducing Exercises and Extra Curriculum: The sources conclude the custom datasets section by providing a set of exercises and extra curriculum resources to reinforce the concepts covered. They direct users to the learnpytorch.io website and the pytorch-deep-learning GitHub repository for exercise templates, example solutions, and additional learning materials.
    • Previewing Upcoming Sections: The sources briefly preview the upcoming sections of the course, hinting at topics like transfer learning, model experiment tracking, paper replicating, and more advanced architectures. They encourage users to continue their learning journey, exploring more complex concepts and techniques in deep learning with PyTorch.
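
    The prediction steps above condense to a short script (the image path, class names, and 64×64 size are assumptions; model and device are the trained model and its device from earlier):

    import torch
    from PIL import Image
    from torchvision import transforms

    class_names = ["pizza", "steak", "sushi"]
    image = Image.open("data/my_food_photo.jpg").convert("RGB")   # hypothetical image, forced to RGB

    transform = transforms.Compose([
        transforms.Resize((64, 64)),          # match the training input size
        transforms.ToTensor(),                # convert to a [0, 1] tensor
    ])
    image_tensor = transform(image).unsqueeze(dim=0)              # add batch dim: [1, 3, 64, 64]

    model.eval()                                                  # evaluation mode for prediction
    with torch.inference_mode():
        logits = model(image_tensor.to(device))
    probs = torch.softmax(logits, dim=1)                          # logits -> probabilities
    pred_class = class_names[probs.argmax(dim=1).item()]
    print(f"Predicted: {pred_class} ({probs.max().item():.2%})")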

    The sources provide a practical guide to making predictions on a custom image using a trained TinyVGG model, carefully explaining the preparation steps, prediction process, and analysis of results. Additionally, they offer a concise summary of the key concepts covered in the custom datasets section, reinforcing the understanding of custom datasets, data augmentation techniques, and model training and evaluation. Finally, they introduce exercises and extra curriculum resources to encourage further practice and learning while previewing the exciting topics to come in the remainder of the course.

    Setting Up a TinyVGG Model and Exploring Model Architectures: Pages 781-790

    The sources transition from data preparation and augmentation to building a convolutional neural network (CNN) model using the TinyVGG architecture. They guide users through the process of defining the model’s architecture, understanding its components, and preparing it for training.

    • Introducing the TinyVGG Architecture: The sources introduce TinyVGG, a simplified version of the VGG (Visual Geometry Group) architecture, known for its effectiveness in image classification tasks. They provide a visual representation of the TinyVGG architecture, outlining its key components, including:
    • Convolutional Blocks: The foundation of TinyVGG, composed of convolutional layers (nn.Conv2d) followed by ReLU activation functions (nn.ReLU) and max-pooling layers (nn.MaxPool2d). Convolutional layers extract features from the input images, ReLU introduces non-linearity, and max-pooling downsamples the feature maps, reducing their dimensionality and making the model more robust to variations in the input.
    • Classifier Layer: The final layer of TinyVGG, responsible for classifying the extracted features into different categories. It consists of a flattening layer (nn.Flatten), which converts the multi-dimensional feature maps from the convolutional blocks into a single vector, followed by a linear layer (nn.Linear) that outputs a score for each class.
    • Building a TinyVGG Model in PyTorch: The sources provide a step-by-step guide to building a TinyVGG model in PyTorch using the nn.Module class. They explain the structure of the model definition, outlining the key components:
    1. __init__ Method: Initializes the model’s layers and components, including convolutional blocks and the classifier layer.
    2. forward Method: Defines the forward pass of the model, specifying how the input data flows through the different layers and operations.
    • Understanding Input and Output Shapes: The sources emphasize the importance of understanding and verifying the input and output shapes of each layer in the model. They guide users through calculating the dimensions of the feature maps at different stages of the network, taking into account factors such as the kernel size, stride, and padding of the convolutional layers. This understanding of shape transformations is crucial for ensuring that data flows correctly through the network and for debugging potential shape mismatches.
    • Passing a Random Tensor Through the Model: The sources recommend passing a random tensor with the expected input shape through the model as a preliminary step to verify the model’s architecture and identify potential shape errors. This technique helps ensure that data can successfully flow through the network before proceeding with training.
    • Introducing torchinfo for Model Summary: The sources introduce the torchinfo package as a helpful tool for summarizing PyTorch models. They demonstrate how to use torchinfo.summary to obtain a concise overview of the model’s architecture, including the input and output shapes of each layer and the number of trainable parameters. This package provides a convenient way to visualize and verify the model’s structure, making it easier to understand and debug.
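
    The random-tensor check mentioned above takes only a couple of lines (assuming the TinyVGG sketch from earlier and a 64×64 RGB input):

    import torch

    model = TinyVGG(input_channels=3, hidden_units=10, output_classes=3)
    dummy = torch.randn(1, 3, 64, 64)        # a fake batch with the expected input shape
    print(model(dummy).shape)                # expect torch.Size([1, 3]) -- one logit per class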

    The sources provide a detailed walkthrough of building a TinyVGG model in PyTorch, explaining the architecture’s components, the steps involved in defining the model using nn.Module, and the significance of understanding input and output shapes. They introduce practical techniques like passing a random tensor through the model for verification and leverage the torchinfo package for obtaining a comprehensive model summary. These steps lay a solid foundation for building and understanding CNN models for image classification tasks.

    Training the TinyVGG Model and Evaluating its Performance: Pages 791-800

    The sources shift focus to training the constructed TinyVGG model on the custom food image dataset. They guide users through creating training and testing functions, setting up a training loop, and evaluating the model’s performance using metrics like loss and accuracy.

    • Creating Training and Testing Functions: The sources outline the process of creating separate functions for the training and testing steps, promoting modularity and code reusability.
    • train_step Function: This function performs a single training step, encompassing the forward pass, loss calculation, backpropagation, and parameter updates.
    1. Forward Pass: It takes a batch of data from the training dataloader, passes it through the model, and obtains the model’s predictions.
    2. Loss Calculation: It calculates the loss between the predictions and the ground truth labels using a chosen loss function (e.g., cross-entropy loss for classification).
    3. Backpropagation: It computes the gradients of the loss with respect to the model’s parameters using the loss.backward() method. Backpropagation determines how each parameter contributed to the error, guiding the optimization process.
    4. Parameter Updates: It updates the model’s parameters based on the computed gradients using an optimizer (e.g., stochastic gradient descent). The optimizer adjusts the parameters to minimize the loss, improving the model’s performance over time.
    5. Accuracy Calculation: It calculates the accuracy of the model’s predictions on the current batch of training data. Accuracy measures the proportion of correctly classified samples.
    • test_step Function: This function evaluates the model’s performance on a batch of test data, computing the loss and accuracy without updating the model’s parameters.
    1. Forward Pass: It takes a batch of data from the testing dataloader, passes it through the model, and obtains the model’s predictions. The model is set to evaluation mode (model.eval()) before the forward pass so that training-specific behavior such as dropout is deactivated.
    2. Loss Calculation: It calculates the loss between the predictions and the ground truth labels using the same loss function as in train_step.
    3. Accuracy Calculation: It calculates the accuracy of the model’s predictions on the current batch of testing data.
    • Setting up a Training Loop: The sources demonstrate the implementation of a training loop that iterates through the training data for a specified number of epochs, calling the train_step and test_step functions at each epoch.
    1. Epoch Iteration: The loop iterates for a predefined number of epochs, each epoch representing a complete pass through the entire training dataset.
    2. Training Phase: For each epoch, the loop iterates through the batches of training data provided by the training dataloader, calling the train_step function for each batch. The train_step function performs the forward pass, loss calculation, backpropagation, and parameter updates as described above. The training loss and accuracy values are accumulated across all batches within an epoch.
    3. Testing Phase: After each epoch, the loop iterates through the batches of testing data provided by the testing dataloader, calling the test_step function for each batch. The test_step function computes the loss and accuracy on the testing data without updating the model’s parameters. The testing loss and accuracy values are also accumulated across all batches.
    4. Printing Progress: The loop prints the training and testing loss and accuracy values at regular intervals, typically after each epoch or a set number of epochs. This step provides feedback on the model’s progress and allows for monitoring its performance over time.
    • Visualizing Training Progress: The sources highlight the importance of visualizing the training process, particularly the loss curves, to gain insights into the model’s behavior and identify potential issues like overfitting or underfitting. They suggest plotting the training and testing losses over epochs to observe how the loss values change during training.
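    A condensed sketch of the two functions described above, written as one possible implementation (the signatures and the in-line accuracy calculation are illustrative, not the sources' exact code):

        import torch

        def train_step(model, dataloader, loss_fn, optimizer, device):
            model.train()
            train_loss, train_acc = 0.0, 0.0
            for X, y in dataloader:
                X, y = X.to(device), y.to(device)
                y_pred = model(X)                  # 1. forward pass
                loss = loss_fn(y_pred, y)          # 2. loss calculation
                train_loss += loss.item()
                optimizer.zero_grad()
                loss.backward()                    # 3. backpropagation
                optimizer.step()                   # 4. parameter update
                train_acc += (y_pred.argmax(dim=1) == y).float().mean().item()  # 5. accuracy
            return train_loss / len(dataloader), train_acc / len(dataloader)

        def test_step(model, dataloader, loss_fn, device):
            model.eval()                           # deactivate training-specific behavior
            test_loss, test_acc = 0.0, 0.0
            with torch.inference_mode():           # no gradients needed for evaluation
                for X, y in dataloader:
                    X, y = X.to(device), y.to(device)
                    y_pred = model(X)
                    test_loss += loss_fn(y_pred, y).item()
                    test_acc += (y_pred.argmax(dim=1) == y).float().mean().item()
            return test_loss / len(dataloader), test_acc / len(dataloader)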

    The sources guide users through setting up a robust training pipeline for the TinyVGG model, emphasizing modularity through separate training and testing functions and a structured training loop. They recommend monitoring and visualizing training progress, particularly using loss curves, to gain a deeper understanding of the model’s behavior and performance. These steps provide a practical foundation for training and evaluating CNN models on custom image datasets.

    Training and Experimenting with the TinyVGG Model on a Custom Dataset: Pages 801-810

    The sources guide users through training their TinyVGG model on the custom food image dataset using the training functions and loop set up in the previous steps. They emphasize the importance of tracking and comparing model results, including metrics like loss, accuracy, and training time, to evaluate performance and make informed decisions about model improvements.

    • Tracking Model Results: The sources recommend using a dictionary to store the training and testing results for each epoch, including the training loss, training accuracy, testing loss, and testing accuracy. This approach allows users to track the model’s performance over epochs and to easily compare the results of different models or training configurations. [1]
    • Setting Up the Training Process: The sources provide code for setting up the training process (a minimal sketch follows this list), including:
    1. Initializing a Results Dictionary: Creating a dictionary to store the model’s training and testing results. [1]
    2. Implementing the Training Loop: Utilizing the tqdm library to display a progress bar during training and iterating through the specified number of epochs. [2]
    3. Calling Training and Testing Functions: Invoking the train_step and test_step functions for each epoch, passing in the necessary arguments, including the model, dataloaders, loss function, optimizer, and device. [3]
    4. Updating the Results Dictionary: Storing the training and testing loss and accuracy values for each epoch in the results dictionary. [2]
    5. Printing Epoch Results: Displaying the training and testing results for each epoch. [3]
    6. Calculating and Printing Total Training Time: Measuring the total time taken for training and printing the result. [4]
    • Evaluating and Comparing Model Results: The sources guide users through plotting the training and testing losses and accuracies over epochs to visualize the model’s performance. They explain how to analyze the loss curves for insights into the training process, such as identifying potential overfitting or underfitting. [5, 6] They also recommend comparing the results of different models trained with various configurations to understand the impact of different architectural choices or hyperparameters on performance. [7]
    • Improving Model Performance: Building upon the visualization and comparison of results, the sources discuss strategies for improving the model’s performance, including:
    1. Adding More Layers: Increasing the depth of the model to enable it to learn more complex representations of the data. [8]
    2. Adding More Hidden Units: Expanding the capacity of each layer to enhance its ability to capture intricate patterns in the data. [8]
    3. Training for Longer: Increasing the number of epochs to allow the model more time to learn from the data. [9]
    4. Using a Smaller Learning Rate: Adjusting the learning rate, which determines the step size during parameter updates, to potentially improve convergence and prevent oscillations around the optimal solution. [8]
    5. Trying a Different Optimizer: Exploring alternative optimization algorithms, each with its unique approach to updating parameters, to potentially find one that better suits the specific problem. [8]
    6. Using Learning Rate Decay: Gradually reducing the learning rate over epochs to fine-tune the model and improve convergence towards the optimal solution. [8]
    7. Adding Regularization Techniques: Implementing methods like dropout or weight decay to prevent overfitting, which occurs when the model learns the training data too well and performs poorly on unseen data. [8]
    • Visualizing Loss Curves: The sources emphasize the importance of understanding and interpreting loss curves to gain insights into the training process. They provide visual examples of different loss curve shapes and explain how to identify potential issues like overfitting or underfitting based on the curves’ behavior. They also offer guidance on interpreting ideal loss curves and discuss strategies for addressing problems like overfitting or underfitting, pointing to additional resources for further exploration. [5, 10]
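    A minimal sketch of this tracking pattern, assuming the train_step and test_step functions sketched earlier and an already-defined model, dataloaders, loss function, optimizer, and device (the epoch count and dictionary keys are illustrative):

        from timeit import default_timer as timer
        from tqdm.auto import tqdm

        results = {"train_loss": [], "train_acc": [], "test_loss": [], "test_acc": []}

        epochs = 5
        start_time = timer()
        for epoch in tqdm(range(epochs)):  # progress bar over epochs
            train_loss, train_acc = train_step(model, train_dataloader, loss_fn, optimizer, device)
            test_loss, test_acc = test_step(model, test_dataloader, loss_fn, device)
            results["train_loss"].append(train_loss)
            results["train_acc"].append(train_acc)
            results["test_loss"].append(test_loss)
            results["test_acc"].append(test_acc)
            print(f"Epoch {epoch}: train_loss={train_loss:.4f} | test_loss={test_loss:.4f}")
        end_time = timer()
        print(f"Total training time: {end_time - start_time:.3f} seconds")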

    The sources offer a structured approach to training and evaluating the TinyVGG model on a custom food image dataset, encouraging the use of dictionaries to track results, visualizing performance through loss curves, and comparing different model configurations. They discuss potential areas for model improvement and highlight resources for delving deeper into advanced techniques like learning rate scheduling and regularization. These steps empower users to systematically experiment, analyze, and enhance their models’ performance on image classification tasks using custom datasets.

    Evaluating Model Performance and Introducing Data Augmentation: Pages 811-820

    The sources emphasize the need to comprehensively evaluate model performance beyond just loss and accuracy. They introduce concepts like training time and tools for visualizing comparisons between different trained models. They also explore the concept of data augmentation as a strategy to improve model performance, focusing specifically on the “Trivial Augment” technique.

    • Comparing Model Results: The sources guide users through creating a Pandas DataFrame to organize and compare the results of different trained models. The DataFrame includes columns for metrics like training loss, training accuracy, testing loss, testing accuracy, and training time, allowing for a clear comparison of the models’ performance across various metrics.
    • Data Augmentation: The sources explain data augmentation as a technique for artificially increasing the diversity and size of the training dataset by applying various transformations to the original images. Data augmentation aims to improve the model’s generalization ability and reduce overfitting by exposing the model to a wider range of variations within the training data.
    • Trivial Augment: The sources focus on Trivial Augment [1], a data augmentation technique known for its simplicity and effectiveness: for each training image it applies a single randomly chosen transformation (such as a crop, flip, or color jitter) at a random magnitude. They guide users through implementing Trivial Augment with PyTorch’s torchvision.transforms module, providing code examples that define a transformation pipeline with torchvision.transforms.Compose (a short sketch follows this list).
    • Visualizing Augmented Images: The sources recommend visualizing the augmented images to ensure that the applied transformations are appropriate and effective. They provide code using Matplotlib to display a grid of augmented images, allowing users to visually inspect the impact of the transformations on the training data.
    • Understanding the Benefits of Data Augmentation: The sources explain the potential benefits of data augmentation, including:
    • Improved Generalization: Exposing the model to a wider range of variations within the training data can help it learn more robust and generalizable features, leading to better performance on unseen data.
    • Reduced Overfitting: Increasing the diversity of the training data can mitigate overfitting, which occurs when the model learns the training data too well and performs poorly on new, unseen data.
    • Increased Effective Dataset Size: Artificially expanding the training dataset through augmentations can be beneficial when the original dataset is relatively small.
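    A short sketch of a Trivial Augment pipeline using torchvision (the 64x64 image size and the magnitude-bin count are illustrative choices):

        from torchvision import transforms

        # Training transform: resize, apply one random augmentation per image, convert to tensor
        train_transform = transforms.Compose([
            transforms.Resize((64, 64)),
            transforms.TrivialAugmentWide(num_magnitude_bins=31),  # one random op at a random strength
            transforms.ToTensor(),
        ])

        # Testing transform: no augmentation, only resizing and tensor conversion
        test_transform = transforms.Compose([
            transforms.Resize((64, 64)),
            transforms.ToTensor(),
        ])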

    The sources present a structured approach to evaluating and comparing model performance using Pandas DataFrames. They introduce data augmentation, particularly Trivial Augment, as a valuable technique for enhancing model generalization and performance. They guide users through implementing data augmentation pipelines using PyTorch’s torchvision.transforms module and recommend visualizing augmented images to ensure their effectiveness. These steps empower users to perform thorough model evaluation, understand the importance of data augmentation, and implement it effectively using PyTorch to potentially boost model performance on image classification tasks.

    Exploring Convolutional Neural Networks and Building a Custom Model: Pages 821-830

    The sources shift focus to the fundamentals of Convolutional Neural Networks (CNNs), introducing their key components and operations. They walk users through building a custom CNN model, incorporating concepts like convolutional layers, ReLU activation functions, max pooling layers, and flattening layers to create a model capable of learning from image data.

    • Introduction to CNNs: The sources provide an overview of CNNs, explaining their effectiveness in image classification tasks due to their ability to learn spatial hierarchies of features. They introduce the essential components of a CNN, including:
    1. Convolutional Layers: Convolutional layers apply filters to the input image to extract features like edges, textures, and patterns. These filters slide across the image, performing convolutions to create feature maps that capture different aspects of the input.
    2. ReLU Activation Function: ReLU (Rectified Linear Unit) is a non-linear activation function applied to the output of convolutional layers. It introduces non-linearity into the model, allowing it to learn complex relationships between features.
    3. Max Pooling Layers: Max pooling layers downsample the feature maps produced by convolutional layers, reducing their dimensionality while retaining important information. They help make the model more robust to variations in the input image.
    4. Flattening Layer: A flattening layer converts the multi-dimensional output of the convolutional and pooling layers into a one-dimensional vector, preparing it as input for the fully connected layers of the network.
    • Building a Custom CNN Model: The sources guide users through constructing a custom CNN model using PyTorch’s nn.Module class. They outline a step-by-step process, explaining how to define the model’s architecture (a compact sketch follows this list):
    1. Defining the Model Class: Creating a Python class that inherits from nn.Module, setting up the model’s structure and layers.
    2. Initializing the Layers: Instantiating the convolutional layers (nn.Conv2d), ReLU activation function (nn.ReLU), max-pooling layers (nn.MaxPool2d), and flattening layer (nn.Flatten) within the model’s constructor (__init__).
    3. Implementing the Forward Pass: Defining the forward method, outlining the flow of data through the model’s layers during the forward pass, including the application of convolutional operations, activation functions, and pooling.
    4. Setting Model Input Shape: Determining the expected input shape for the model based on the dimensions of the input images, considering the number of color channels, height, and width.
    5. Verifying Input and Output Shapes: Ensuring that the input and output shapes of each layer are compatible, using techniques like printing intermediate shapes or utilizing tools like torchinfo to summarize the model’s architecture.
    • Understanding Input and Output Shapes: The sources highlight the importance of comprehending the input and output shapes of each layer in the CNN. They explain how to calculate the output shape of convolutional layers based on factors like kernel size, stride, and padding, providing resources for a deeper understanding of these concepts.
    • Using torchinfo for Model Summary: The sources introduce the torchinfo package as a helpful tool for summarizing PyTorch models, visualizing their architecture, and verifying input and output shapes. They demonstrate how to use torchinfo to print a concise summary of the model’s layers, parameters, and input/output sizes, aiding in understanding the model’s structure and ensuring its correctness.
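    A compact sketch of a TinyVGG-style architecture following these steps (the two-block layout, layer sizes, and the assumed 64x64 RGB input are representative choices, not the sources' exact hyperparameters):

        import torch
        from torch import nn

        class TinyVGG(nn.Module):
            def __init__(self, input_shape: int, hidden_units: int, output_shape: int):
                super().__init__()
                self.block_1 = nn.Sequential(
                    nn.Conv2d(input_shape, hidden_units, kernel_size=3, padding=1),
                    nn.ReLU(),
                    nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                    nn.ReLU(),
                    nn.MaxPool2d(kernel_size=2),   # 64x64 -> 32x32
                )
                self.block_2 = nn.Sequential(
                    nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                    nn.ReLU(),
                    nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                    nn.ReLU(),
                    nn.MaxPool2d(kernel_size=2),   # 32x32 -> 16x16
                )
                self.classifier = nn.Sequential(
                    nn.Flatten(),                  # (B, hidden_units, 16, 16) -> (B, hidden_units*16*16)
                    nn.Linear(hidden_units * 16 * 16, output_shape),
                )

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                return self.classifier(self.block_2(self.block_1(x)))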

    The sources provide a clear and structured introduction to CNNs and guide users through building a custom CNN model using PyTorch. They explain the key components of CNNs, including convolutional layers, activation functions, pooling layers, and flattening layers. They walk users through defining the model’s architecture, understanding input/output shapes, and using tools like torchinfo to visualize and verify the model’s structure. These steps equip users with the knowledge and skills to create and work with CNNs for image classification tasks using custom datasets.

    Training and Evaluating the TinyVGG Model: Pages 831-840

    The sources walk users through the process of training and evaluating the TinyVGG model using the custom dataset created in the previous steps. They guide users through setting up training and testing functions, training the model for multiple epochs, visualizing the training progress using loss curves, and comparing the performance of the custom TinyVGG model to a baseline model.

    • Setting up Training and Testing Functions: The sources present Python functions for training and testing the model, highlighting the key steps involved in each phase:
    • train_step Function: This function performs a single training step, iterating through batches of training data and performing the following actions:
    1. Forward Pass: Passing the input data through the model to get predictions.
    2. Loss Calculation: Computing the loss between the predictions and the target labels using a chosen loss function.
    3. Backpropagation: Calculating gradients of the loss with respect to the model’s parameters.
    4. Optimizer Update: Updating the model’s parameters using an optimization algorithm to minimize the loss.
    5. Accuracy Calculation: Calculating the accuracy of the model’s predictions on the training batch.
    • test_step Function: Similar to the train_step function, this function evaluates the model’s performance on the test data, iterating through batches of test data and performing the forward pass, loss calculation, and accuracy calculation.
    • Training the Model: The sources guide users through training the TinyVGG model for a specified number of epochs, calling the train_step and test_step functions in each epoch. They showcase how to track and store the training and testing loss and accuracy values across epochs for later analysis and visualization.
    • Visualizing Training Progress with Loss Curves: The sources emphasize the importance of visualizing the training progress by plotting loss curves (a minimal plotting sketch follows this list). They explain that loss curves depict the trend of the loss value over epochs, providing insights into the model’s learning process.
    • Interpreting Loss Curves: They guide users through interpreting loss curves, highlighting that a decreasing loss generally indicates that the model is learning effectively. They explain that if the training loss continues to decrease but the testing loss starts to increase or plateau, it might indicate overfitting, where the model performs well on the training data but poorly on unseen data.
    • Comparing Models and Exploring Hyperparameter Tuning: The sources compare the performance of the custom TinyVGG model to a baseline model, providing insights into the effectiveness of the chosen architecture. They suggest exploring techniques like hyperparameter tuning to potentially improve the model’s performance.
    • Hyperparameter Tuning: They briefly introduce hyperparameter tuning as the process of finding the optimal values for the model’s hyperparameters, such as learning rate, batch size, and the number of hidden units.
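    A minimal plotting sketch for these loss and accuracy curves, assuming a results dictionary with the keys used in the earlier tracking sketch:

        import matplotlib.pyplot as plt

        epochs = range(len(results["train_loss"]))

        plt.figure(figsize=(10, 4))
        plt.subplot(1, 2, 1)
        plt.plot(epochs, results["train_loss"], label="train loss")
        plt.plot(epochs, results["test_loss"], label="test loss")
        plt.title("Loss")
        plt.xlabel("Epoch")
        plt.legend()

        plt.subplot(1, 2, 2)
        plt.plot(epochs, results["train_acc"], label="train accuracy")
        plt.plot(epochs, results["test_acc"], label="test accuracy")
        plt.title("Accuracy")
        plt.xlabel("Epoch")
        plt.legend()
        plt.show()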

    The sources provide a comprehensive guide to training and evaluating the TinyVGG model using the custom dataset. They outline the steps involved in creating training and testing functions, performing the training process, visualizing training progress using loss curves, and comparing the model’s performance to a baseline model. These steps equip users with a structured approach to training, evaluating, and iteratively improving CNN models for image classification tasks.

    Saving, Loading, and Reflecting on the PyTorch Workflow: Pages 841-850

    The sources guide users through saving and loading the trained TinyVGG model, emphasizing the importance of preserving trained models for future use. They also provide a comprehensive reflection on the key steps involved in the PyTorch workflow for computer vision tasks, summarizing the concepts and techniques covered throughout the previous sections and offering insights into the overall process.

    • Saving and Loading the Trained Model: The sources highlight the significance of saving trained models to avoid retraining from scratch. They explain that saving the model’s state dictionary, which contains the learned parameters, allows for easy reloading and reuse.
    • Using torch.save: They demonstrate how to use PyTorch’s torch.save function to save the model’s state dictionary to a file, specifying the file path and the state dictionary as arguments. This step ensures that the trained model’s parameters are stored persistently (a short sketch of the save/load pattern appears after this list).
    • Using torch.load: They showcase how to use PyTorch’s torch.load function to load the saved state dictionary back into a new model instance. They explain the importance of creating a new model instance with the same architecture as the saved model before loading the state dictionary. This step allows for seamless restoration of the trained model’s parameters.
    • Verifying Loaded Model: They suggest making predictions using the loaded model to ensure that it performs as expected and the loading process was successful.
    • Reflecting on the PyTorch Workflow: The sources provide a comprehensive recap of the essential steps involved in the PyTorch workflow for computer vision tasks, summarizing the concepts and techniques covered in the previous sections. They present a structured overview of the workflow, highlighting the following key stages:
    1. Data Preparation: Preparing the data, including loading, splitting into training and testing sets, and applying necessary transformations.
    2. Model Building: Constructing the neural network model, defining its architecture, layers, and activation functions.
    3. Loss Function and Optimizer Selection: Choosing an appropriate loss function to measure the model’s performance and an optimizer to update the model’s parameters during training.
    4. Training Loop: Implementing a training loop to iteratively train the model on the training data, performing forward passes, loss calculations, backpropagation, and optimizer updates.
    5. Model Evaluation: Evaluating the model’s performance on the test data, using metrics like loss and accuracy.
    6. Hyperparameter Tuning and Experimentation: Exploring different model architectures, hyperparameters, and data augmentation techniques to potentially improve the model’s performance.
    7. Saving and Loading the Model: Preserving the trained model by saving its state dictionary to a file for future use.
    • Encouraging Further Exploration and Practice: The sources emphasize that mastering the PyTorch workflow requires practice and encourage users to explore different datasets, models, and techniques to deepen their understanding. They recommend referring to the PyTorch documentation and online resources for additional learning and problem-solving.
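    Returning to the save/load bullets above, a short sketch of the pattern (the file name is illustrative, and TinyVGG stands in for whatever model class was trained):

        import torch

        # Save only the learned parameters (the state dictionary)
        torch.save(obj=model.state_dict(), f="tiny_vgg_state_dict.pth")

        # Later: create a fresh instance with the same architecture, then load the parameters
        loaded_model = TinyVGG(input_shape=3, hidden_units=10, output_shape=3)
        loaded_model.load_state_dict(torch.load(f="tiny_vgg_state_dict.pth"))
        loaded_model.eval()  # set to evaluation mode before making predictions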

    The sources provide clear guidance on saving and loading trained models, emphasizing the importance of preserving trained models for reuse. They offer a thorough recap of the PyTorch workflow for computer vision tasks, summarizing the key steps and techniques covered in the previous sections. They guide users through the process of saving the model’s state dictionary and loading it back into a new model instance. By emphasizing the overall workflow and providing practical examples, the sources equip users with a solid foundation for tackling computer vision projects using PyTorch. They encourage further exploration and experimentation to solidify understanding and enhance practical skills in building, training, and deploying computer vision models.

    Expanding the Horizons of PyTorch: Pages 851-860

    The sources shift focus from the specific TinyVGG model and custom dataset to a broader exploration of PyTorch’s capabilities. They introduce additional concepts, resources, and areas of study within the realm of deep learning and PyTorch, encouraging users to expand their knowledge and pursue further learning beyond the scope of the initial tutorial.

    • Advanced Topics and Resources for Further Learning: The sources recognize that the covered material represents a foundational introduction to PyTorch and deep learning, and they acknowledge that there are many more advanced topics and areas of specialization within this field.
    • Transfer Learning: The sources highlight transfer learning as a powerful technique that involves leveraging pre-trained models on large datasets to improve the performance on new, potentially smaller datasets.
    • Model Experiment Tracking: They introduce the concept of model experiment tracking, emphasizing the importance of keeping track of different model architectures, hyperparameters, and results for organized experimentation and analysis.
    • PyTorch Paper Replication: The sources mention the practice of replicating research papers that introduce new deep learning architectures or techniques using PyTorch. They suggest that this is a valuable way to gain deeper understanding and practical experience with cutting-edge advancements in the field.
    • Additional Chapters and Resources: The sources point to additional chapters and resources available on the learnpytorch.io website, indicating that the learning journey continues beyond the current section. They encourage users to explore these resources to deepen their understanding of various aspects of deep learning and PyTorch.
    • Encouraging Continued Learning and Exploration: The sources strongly emphasize the importance of continuous learning and exploration within the field of deep learning. They recognize that deep learning is a rapidly evolving field with new architectures, techniques, and applications emerging frequently.
    • Staying Updated with Advancements: They advise users to stay updated with the latest research papers, blog posts, and online courses to keep their knowledge and skills current.
    • Building Projects and Experimenting: The sources encourage users to actively engage in building projects, experimenting with different datasets and models, and participating in the deep learning community.

    The sources gracefully transition from the specific tutorial on TinyVGG and custom datasets to a broader perspective on the vast landscape of deep learning and PyTorch. They introduce additional topics, resources, and areas of study, encouraging users to continue their learning journey and explore more advanced concepts. By highlighting these areas and providing guidance on where to find further information, the sources empower users to expand their knowledge, skills, and horizons within the exciting and ever-evolving world of deep learning and PyTorch.

    Diving into Multi-Class Classification with PyTorch: Pages 861-870

    The sources introduce the concept of multi-class classification, a common task in machine learning where the goal is to categorize data into one of several possible classes. They contrast this with binary classification, which involves only two classes. The sources then present the FashionMNIST dataset, a collection of grayscale images of clothing items, as an example for demonstrating multi-class classification using PyTorch.

    • Multi-Class Classification: The sources distinguish multi-class classification from binary classification, explaining that multi-class classification involves assigning data points to one of multiple possible categories, while binary classification deals with only two categories. They emphasize that many real-world problems fall under the umbrella of multi-class classification. [1]
    • FashionMNIST Dataset: The sources introduce the FashionMNIST dataset, a widely used dataset for image classification tasks. This dataset comprises 70,000 grayscale images of 10 different clothing categories, including T-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot. The sources highlight that this dataset provides a suitable playground for experimenting with multi-class classification techniques using PyTorch. [1, 2]
    • Preparing the Data: The sources outline the steps involved in preparing the FashionMNIST dataset for use in PyTorch, emphasizing the importance of loading the data, splitting it into training and testing sets, and applying necessary transformations. They mention using PyTorch’s DataLoader class to efficiently handle data loading and batching during training and testing (a condensed sketch of this setup follows this list). [2]
    • Building a Multi-Class Classification Model: The sources guide users through building a simple neural network model for multi-class classification using PyTorch. They discuss the choice of layers, activation functions, and the output layer’s activation function. They mention using a softmax activation function in the output layer to produce a probability distribution over the possible classes. [2]
    • Training the Model: The sources outline the process of training the multi-class classification model, highlighting the use of a suitable loss function (such as cross-entropy loss) and an optimization algorithm (such as stochastic gradient descent) to minimize the loss and improve the model’s accuracy during training. [2]
    • Evaluating the Model: The sources emphasize the need to evaluate the trained model’s performance on the test dataset, using metrics such as accuracy, precision, recall, and the F1-score to assess its effectiveness in classifying images into the correct categories. [2]
    • Visualization for Understanding: The sources advocate for visualizing the data and the model’s predictions to gain insights into the classification process. They suggest techniques like plotting the images and their corresponding predicted labels to qualitatively assess the model’s performance. [2]
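    A condensed sketch of this setup using torchvision to download the dataset (the batch size and the single-layer model are deliberately minimal):

        import torch
        from torch import nn
        from torch.utils.data import DataLoader
        from torchvision import datasets, transforms

        train_data = datasets.FashionMNIST(root="data", train=True, download=True,
                                           transform=transforms.ToTensor())
        test_data = datasets.FashionMNIST(root="data", train=False, download=True,
                                          transform=transforms.ToTensor())

        train_dataloader = DataLoader(train_data, batch_size=32, shuffle=True)
        test_dataloader = DataLoader(test_data, batch_size=32, shuffle=False)

        # A simple multi-class classifier: 28x28 images flattened to 784 features, 10 classes
        model = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 10),
        )

        # nn.CrossEntropyLoss applies softmax internally, so the model outputs raw logits
        loss_fn = nn.CrossEntropyLoss()
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)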

    The sources effectively introduce the concept of multi-class classification and its relevance in various machine learning applications. They guide users through the process of preparing the FashionMNIST dataset, building a neural network model, training the model, and evaluating its performance. By emphasizing visualization and providing code examples, the sources equip users with the tools and knowledge to tackle multi-class classification problems using PyTorch.

    Beyond Accuracy: Exploring Additional Classification Metrics: Pages 871-880

    The sources introduce several additional metrics for evaluating the performance of classification models, going beyond the commonly used accuracy metric. They highlight the importance of considering multiple metrics to gain a more comprehensive understanding of a model’s strengths and weaknesses. The sources also emphasize that the choice of appropriate metrics depends on the specific problem and the desired balance between different types of errors.

    • Limitations of Accuracy: The sources acknowledge that accuracy, while a useful metric, can be misleading in situations where the classes are imbalanced. In such cases, a model might achieve high accuracy simply by correctly classifying the majority class, even if it performs poorly on the minority class.
    • Precision and Recall: The sources introduce precision and recall as two important metrics that provide a more nuanced view of a classification model’s performance, particularly when dealing with imbalanced datasets.
    • Precision: Precision measures the proportion of correctly classified positive instances out of all instances predicted as positive. A high precision indicates that the model is good at avoiding false positives.
    • Recall: Recall, also known as sensitivity or the true positive rate, measures the proportion of correctly classified positive instances out of all actual positive instances. A high recall suggests that the model is effective at identifying all positive instances.
    • F1-Score: The sources present the F1-score as a harmonic mean of precision and recall, providing a single metric that balances both precision and recall. A high F1-score indicates a good balance between minimizing false positives and false negatives.
    • Confusion Matrix: The sources introduce the confusion matrix as a valuable tool for visualizing the performance of a classification model. A confusion matrix displays the counts of true positives, true negatives, false positives, and false negatives, providing a detailed breakdown of the model’s predictions across different classes.
    • Classification Report: The sources mention the classification report as a comprehensive summary of key classification metrics, including precision, recall, F1-score, and support (the number of instances of each class) for each class in the dataset.
    • TorchMetrics Module: The sources recommend exploring the torchmetrics module in PyTorch, which provides a wide range of pre-implemented classification metrics. Using this module simplifies the calculation and tracking of various metrics during model training and evaluation.
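    A brief sketch of these metrics with torchmetrics (recent versions use the task argument shown here; the three-class setup and tensors are toy values):

        import torch
        from torchmetrics import Accuracy, ConfusionMatrix, F1Score, Precision, Recall

        preds = torch.tensor([0, 2, 1, 1, 0])   # toy predicted class indices
        target = torch.tensor([0, 1, 1, 1, 0])  # toy ground-truth labels

        accuracy = Accuracy(task="multiclass", num_classes=3)
        precision = Precision(task="multiclass", num_classes=3, average="macro")
        recall = Recall(task="multiclass", num_classes=3, average="macro")
        f1 = F1Score(task="multiclass", num_classes=3, average="macro")
        confmat = ConfusionMatrix(task="multiclass", num_classes=3)

        print(accuracy(preds, target))   # proportion of correct predictions
        print(precision(preds, target))  # macro-averaged precision
        print(recall(preds, target))     # macro-averaged recall
        print(f1(preds, target))         # balance of precision and recall
        print(confmat(preds, target))    # 3x3 matrix of prediction counts per class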

    The sources effectively expand the discussion of classification model evaluation by introducing additional metrics that go beyond accuracy. They explain precision, recall, the F1-score, the confusion matrix, and the classification report, highlighting their importance in understanding a model’s performance, especially in cases of imbalanced datasets. By encouraging the use of the torchmetrics module, the sources provide users with practical tools to easily calculate and track these metrics during their machine learning workflows. They emphasize that choosing the right metrics depends on the specific problem and the relative importance of different types of errors.

    Exploring Convolutional Neural Networks and Computer Vision: Pages 881-890

    The sources mark a transition into the realm of computer vision, specifically focusing on Convolutional Neural Networks (CNNs), a type of neural network architecture highly effective for image-related tasks. They introduce core concepts of CNNs and showcase their application in image classification using the FashionMNIST dataset.

    • Introduction to Computer Vision: The sources acknowledge computer vision as a rapidly expanding field within deep learning, encompassing tasks like image classification, object detection, and image segmentation. They emphasize the significance of CNNs as a powerful tool for extracting meaningful features from image data, enabling machines to “see” and interpret visual information.
    • Convolutional Neural Networks (CNNs): The sources provide a foundational understanding of CNNs, highlighting their key components and how they differ from traditional neural networks.
    • Convolutional Layers: They explain how convolutional layers apply filters (also known as kernels) to the input image to extract features such as edges, textures, and patterns. These filters slide across the image, performing convolutions to produce feature maps.
    • Activation Functions: The sources discuss the use of activation functions like ReLU (Rectified Linear Unit) within CNNs to introduce non-linearity, allowing the network to learn complex relationships in the image data.
    • Pooling Layers: They explain how pooling layers, such as max pooling, downsample the feature maps, reducing their dimensionality while retaining essential information, making the network more computationally efficient and robust to variations in the input image (a small sketch of how conv and pooling layers change tensor shapes follows this list).
    • Fully Connected Layers: The sources mention that after several convolutional and pooling layers, the extracted features are flattened and passed through fully connected layers, similar to those found in traditional neural networks, to perform the final classification.
    • Applying CNNs to FashionMNIST: The sources guide users through building a simple CNN model for image classification using the FashionMNIST dataset. They walk through the process of defining the model architecture, choosing appropriate layers and hyperparameters, and training the model using the training dataset.
    • Evaluation and Visualization: The sources emphasize evaluating the trained CNN model on the test dataset, using metrics like accuracy to assess its performance. They also encourage visualizing the model’s predictions and the learned feature maps to gain a deeper understanding of how the CNN is “seeing” and interpreting the images.
    • Importance of Experimentation: The sources highlight that designing and training effective CNNs often involves experimentation with different architectures, hyperparameters, and training techniques. They encourage users to explore different approaches and carefully analyze the results to optimize their models for specific computer vision tasks.
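    A small sketch of how shapes change through conv and pooling layers for a FashionMNIST-sized input (the hyperparameters are illustrative). For a convolution with dilation 1, the output size follows out = floor((in + 2*padding - kernel_size) / stride) + 1:

        import torch
        from torch import nn

        x = torch.randn(1, 1, 28, 28)  # (batch, channels, height, width)

        conv = nn.Conv2d(in_channels=1, out_channels=10, kernel_size=3, stride=1, padding=1)
        pool = nn.MaxPool2d(kernel_size=2)

        print(conv(x).shape)        # torch.Size([1, 10, 28, 28]): padding=1 keeps 28x28
        print(pool(conv(x)).shape)  # torch.Size([1, 10, 14, 14]): pooling halves H and W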

    Working with Tensors and Building Models in PyTorch: Pages 891-900

    The sources shift focus to the practical aspects of working with tensors in PyTorch and building neural network models for both regression and classification tasks. They emphasize the importance of understanding tensor operations, data manipulation, and building blocks of neural networks within the PyTorch framework.

    • Understanding Tensors: The sources reiterate the importance of tensors as the fundamental data structure in PyTorch, highlighting their role in representing data and model parameters. They discuss tensor creation, indexing, and various operations like stacking, permuting, and reshaping tensors to prepare data for use in neural networks.
    • Building a Regression Model: The sources walk through the steps of building a simple linear regression model in PyTorch to predict a continuous target variable from a set of input features (a minimal sketch follows this list). They explain:
    • Model Architecture: Defining a model class that inherits from PyTorch’s nn.Module, specifying the linear layers and activation functions that make up the model.
    • Loss Function: Choosing an appropriate loss function, such as Mean Squared Error (MSE), to measure the difference between the model’s predictions and the actual target values.
    • Optimizer: Selecting an optimizer, such as Stochastic Gradient Descent (SGD), to update the model’s parameters during training, minimizing the loss function.
    • Training Loop: Implementing a training loop that iterates through the training data, performs forward and backward passes, calculates the loss, and updates the model’s parameters using the optimizer.
    • Addressing Shape Errors: The sources address common shape errors that arise when working with tensors in PyTorch, emphasizing the importance of ensuring that tensor dimensions are compatible for operations like matrix multiplication. They provide examples of troubleshooting shape mismatches and adjusting tensor dimensions using techniques like reshaping or transposing.
    • Visualizing Data and Predictions: The sources advocate for visualizing the data and the model’s predictions to gain insights into the regression process. They suggest plotting the input features against the target variable, along with the model’s predicted line, to visually assess the model’s fit and performance.
    • Introducing Non-linearities: The sources acknowledge the limitations of linear models in capturing complex relationships in data. They introduce the concept of non-linear activation functions, such as ReLU (Rectified Linear Unit), as a way to introduce non-linearity into the model, enabling it to learn more complex patterns. They explain how incorporating ReLU layers can enhance a model’s ability to fit non-linear data.
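    A minimal sketch of such a regression model and its training loop (the toy single-feature data and hyperparameters are illustrative):

        import torch
        from torch import nn

        class LinearRegressionModel(nn.Module):
            def __init__(self):
                super().__init__()
                self.linear = nn.Linear(in_features=1, out_features=1)

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                return self.linear(x)

        model = LinearRegressionModel()
        loss_fn = nn.MSELoss()                                   # mean squared error
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

        X = torch.arange(0, 1, 0.02).unsqueeze(dim=1)            # toy inputs, shape (50, 1)
        y = 0.7 * X + 0.3                                        # toy linear targets

        for epoch in range(100):
            y_pred = model(X)              # forward pass
            loss = loss_fn(y_pred, y)      # loss calculation
            optimizer.zero_grad()
            loss.backward()                # backpropagation
            optimizer.step()               # parameter update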

    The sources effectively transition from theoretical concepts to practical implementation by demonstrating how to work with tensors in PyTorch and build basic neural network models for both regression and classification tasks. They guide users through the essential steps of model definition, loss function selection, optimizer choice, and training loop implementation. By highlighting common pitfalls like shape errors and emphasizing visualization, the sources provide a hands-on approach to learning PyTorch and its application in building machine learning models. They also introduce the crucial concept of non-linear activation functions, laying the foundation for exploring more complex neural network architectures in subsequent sections.

    Here are two ways to improve a model’s performance, based on the provided sources:

    • Add More Layers to the Model: Adding more layers gives the model more opportunities to learn about patterns in the data. If a model currently has two layers with roughly 20 parameters, adding more layers increases the number of parameters the model can use to try to learn the patterns in the data [1].
    • Fit the Model for Longer: Every epoch is one pass through the data, so fitting the model for longer gives it more chances to learn. If the model has had only 100 opportunities to look at the dataset, that may not be enough; increasing this to 1,000 may improve the results [2].

    How Loss Functions Measure Model Performance

    The sources explain that a loss function is crucial for training machine learning models. A loss function quantifies how “wrong” a model’s predictions are compared to the desired output. [1-6] The output of a loss function is a numerical value representing the error. Lower loss values indicate better performance.

    Here’s how the loss function works in practice:

    • Forward Pass: The model makes predictions on the input data. [7, 8] These predictions are often referred to as “logits” before further processing. [9-14]
    • Comparing Predictions to True Values: The loss function takes the model’s predictions and compares them to the true labels from the dataset. [4, 8, 15-19]
    • Calculating the Error: The loss function calculates a numerical value representing the difference between the predictions and the true labels. [1, 4-6, 8, 20-29] This value is the “loss,” and the specific calculation depends on the type of loss function used.
    • Guiding Model Improvement: The loss value is used by the optimizer to adjust the model’s parameters (weights and biases) to reduce the error in subsequent predictions. [3, 20, 24, 27, 30-38] This iterative process of making predictions, calculating the loss, and updating the parameters is what drives the model’s learning during training.

    The goal of training is to minimize the loss function, effectively bringing the model’s predictions closer to the true values. [4, 21, 27, 32, 37, 39-41]

    The sources explain that different loss functions are appropriate for different types of problems; a brief code sketch follows the examples below. [42-48] For example:

    • Regression problems (predicting a continuous numerical value) often use loss functions like Mean Absolute Error (MAE, also called L1 loss in PyTorch) or Mean Squared Error (MSE). [42, 44-46, 49, 50]
    • Classification problems (predicting a category or class label) might use loss functions like Binary Cross Entropy (BCE) for binary classification or Cross Entropy for multi-class classification. [42, 43, 45, 46, 48, 50, 51]
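    A brief sketch of these loss functions in PyTorch (all tensors are toy values):

        import torch
        from torch import nn

        # Regression: MAE (L1) and MSE
        preds = torch.tensor([2.5, 0.0, 2.0])
        target = torch.tensor([3.0, -0.5, 2.0])
        print(nn.L1Loss()(preds, target))   # mean absolute error
        print(nn.MSELoss()(preds, target))  # mean squared error

        # Binary classification: BCEWithLogitsLoss takes raw logits
        logit = torch.tensor([0.8])
        label = torch.tensor([1.0])
        print(nn.BCEWithLogitsLoss()(logit, label))

        # Multi-class classification: CrossEntropyLoss takes raw logits and class indices
        logits = torch.tensor([[1.2, 0.1, -0.5]])  # one sample, three classes
        class_idx = torch.tensor([0])
        print(nn.CrossEntropyLoss()(logits, class_idx))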

    The sources also highlight the importance of using the appropriate loss function for the chosen model and task. [44, 52, 53]

    Key takeaway: Loss functions serve as a feedback mechanism, providing a quantitative measure of how well a model is performing. By minimizing the loss, the model learns to make more accurate predictions and improve its overall performance.

    Main Steps in a PyTorch Training Loop

    The sources provide a detailed explanation of the PyTorch training loop, highlighting its importance in the machine learning workflow. The training loop is the process where the model iteratively learns from the data and adjusts its parameters to improve its predictions. The sources provide code examples and explanations for both regression and classification problems.

    Here is a breakdown of the main steps involved in a PyTorch training loop:

    1. Setting Up

    • Epochs: Define the number of epochs, i.e., the number of times the model will iterate through the entire training dataset. [1]
    • Training Mode: Set the model to training mode using model.train(). This activates training-specific behavior in layers such as dropout and batch normalization, which is crucial during training. [1, 2]
    • Data Loading: Prepare the data loader to feed batches of training data to the model. [3]

    2. Iterating Through Data Batches

    • Loop: Initiate a loop to iterate through each batch of data provided by the data loader. [1]

    3. The Optimization Loop (for each batch)

    • Forward Pass: Pass the input data through the model to obtain predictions (often referred to as “logits” before further processing). [4, 5]
    • Loss Calculation: Calculate the loss, which measures the difference between the model’s predictions and the true labels. Choose a loss function appropriate for the problem type (e.g., MSE for regression, Cross Entropy for classification). [5, 6]
    • Zero Gradients: Reset the gradients of the model’s parameters to zero. This step is crucial to ensure that gradients from previous batches do not accumulate and affect the current batch’s calculations. [5, 7]
    • Backpropagation: Calculate the gradients of the loss function with respect to the model’s parameters. This step involves going backward through the network, computing how much each parameter contributed to the loss. PyTorch handles this automatically using loss.backward(). [5, 7, 8]
    • Gradient Descent: Update the model’s parameters to minimize the loss function. This step uses an optimizer (e.g., SGD, Adam) to adjust the weights and biases in the direction that reduces the loss. PyTorch’s optimizer.step() performs this parameter update. [5, 7, 8]

    4. Testing (Evaluation) Loop (typically performed after each epoch)

    • Evaluation Mode: Set the model to evaluation mode using model.eval(). This deactivates training-specific settings (like dropout) and prepares the model for inference. [2, 9]
    • Inference Mode: Use the torch.inference_mode() context manager to perform inference. This disables gradient calculations and other operations not required for testing, potentially improving speed and memory efficiency. [9, 10]
    • Forward Pass (on Test Data): Pass the test data through the model to obtain predictions. [4, 9]
    • Loss Calculation (on Test Data): Calculate the loss on the test data to assess the model’s performance on unseen data. [9, 11]
    • Performance Metrics: Calculate additional performance metrics relevant to the problem, such as accuracy for classification or other evaluation metrics suitable for the specific task. [9]

    5. Repetition

    The training loop repeats for the specified number of epochs, iterating through the data, performing the optimization steps, and periodically evaluating the model on the test data.
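    Putting the steps above together, a minimal sketch of the whole pattern (torch is assumed imported, and model, train_dataloader, test_dataloader, loss_fn, and optimizer are assumed to be defined as in the earlier sketches):

        epochs = 3
        for epoch in range(epochs):
            # --- Training ---
            model.train()
            for X, y in train_dataloader:
                y_pred = model(X)            # 1. forward pass
                loss = loss_fn(y_pred, y)    # 2. loss calculation
                optimizer.zero_grad()        # 3. zero gradients
                loss.backward()              # 4. backpropagation
                optimizer.step()             # 5. gradient descent step

            # --- Testing ---
            model.eval()
            with torch.inference_mode():
                test_loss = sum(loss_fn(model(X), y).item()
                                for X, y in test_dataloader) / len(test_dataloader)
            print(f"Epoch {epoch}: test_loss={test_loss:.4f}")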

    Key Takeaways

    • The training loop is the heart of the model’s learning process in PyTorch.
    • Each step in the loop plays a vital role in enabling the model to learn from data and improve its predictions.
    • PyTorch simplifies many aspects of the training loop, handling backpropagation and gradient updates automatically, allowing you to focus on defining the model architecture, choosing appropriate loss functions and optimizers, and preparing the data.

    The sources emphasize the importance of understanding these steps and how they work together to train effective machine learning models in PyTorch. They provide code examples and detailed explanations for each step, illustrating the concepts with practical implementations.

    Reasons for PyTorch’s Popularity

    The sources offer several insights into why PyTorch has become so popular for deep learning and machine learning:

    • Research Favorite: The sources emphasize that PyTorch is the most popular deep learning research framework, as evidenced by its widespread use in published research papers and code repositories. [1, 2] For example, Papers with Code, a website tracking machine learning papers and their associated code, shows that 58% of the 65,000 papers with code they’ve tracked are implemented with PyTorch. [2] This popularity stems from PyTorch’s flexibility, ease of use, and ability to support cutting-edge research in various domains.
    • Pythonic and User-Friendly: PyTorch is written in Python, making it highly accessible to the vast community of Python developers and researchers. [3] Its intuitive and Pythonic API simplifies the process of defining, training, and evaluating models. The dynamic computational graph enables flexible experimentation and debugging, contributing to its popularity among researchers and practitioners.
    • Strong Industry Adoption: Beyond research, PyTorch has seen significant adoption by industry leaders like Tesla, Microsoft, OpenAI, and Facebook (Meta). [4-9] Tesla utilizes PyTorch for the computer vision models powering its Autopilot system. [5] OpenAI has standardized on PyTorch for its research and development. [6, 7] Facebook leverages PyTorch for its diverse machine learning applications. [9] This widespread industry adoption further validates PyTorch’s capabilities and reinforces its position as a leading deep learning framework.
    • GPU Acceleration: PyTorch allows developers to leverage the power of GPUs (Graphics Processing Units) for faster computations. [3, 10] This is crucial for deep learning tasks, which often involve large datasets and computationally intensive operations. PyTorch utilizes CUDA, NVIDIA’s parallel computing platform and API, to enable efficient execution on GPUs, significantly accelerating training and inference processes. [11, 12]
    • Rich Ecosystem and Resources: PyTorch benefits from a robust ecosystem of tools, libraries, and resources. Torch Hub provides access to pre-trained models, facilitating transfer learning. [3] Torchvision offers datasets, transformations, and model architectures specifically designed for computer vision tasks. [3, 13] PyTorch’s extensive documentation, active community forums, and numerous tutorials provide ample support for learners and practitioners at all levels.
    • Accessibility and Ease of Setup: Tools like Google Colab further enhance PyTorch’s accessibility by providing a cloud-based environment with free GPU access and pre-installed PyTorch. [14-21] This ease of setup lowers the barrier to entry for beginners and enables rapid experimentation and prototyping without requiring extensive local hardware setup.

    The sources, along with the conversation history, consistently portray PyTorch as a powerful, flexible, and user-friendly framework well-suited for both research and production-level deep learning applications. Its popularity is driven by a confluence of factors, including its research focus, Pythonic nature, strong industry support, GPU acceleration capabilities, a rich ecosystem, and ease of accessibility.

    One of the clearest signs of PyTorch’s popularity in research is its prominence as the most popular deep learning research framework [1, 2]. The sources emphasize that PyTorch is widely used in published research papers and code repositories.

    For example, Papers with Code, a website tracking machine learning papers and their associated code, shows that 58% of the 65,000 papers with code they’ve tracked are implemented with PyTorch [3]. This demonstrates how often researchers choose PyTorch when implementing their models. This popularity likely stems from PyTorch’s flexibility, ease of use, and ability to support cutting-edge research in various domains.

    The Impact of Random Seeds on Neural Network Training

    The sources explain that randomness is a fundamental aspect of training neural networks in PyTorch. They highlight the use of random seeds as a mechanism for controlling this randomness and enabling reproducibility in experiments. Here’s how random seeds influence the training process:

    1. Initialization with Random Numbers

    • Random Parameters: Neural networks typically start with their parameters (weights and biases) initialized with random values. Source [1] emphasizes this, stating that “Random tensors are important because the way many neural networks learn is that they start with tensors full of random numbers and then adjust those random numbers to better represent the data.” The sources demonstrate this with code examples, using functions like torch.rand() to generate random tensors for weight and bias initialization.
    • Stochastic Gradient Descent: The sources mention “stochastic” gradient descent [2], highlighting that the optimization process itself involves randomness. The optimizer takes random steps to update the model’s parameters, gradually reducing the loss.

    2. The Role of the Random Seed

    • Flavoring the Randomness: Source [3] explains the concept of a random seed as a way to “flavor the randomness.” Because computers generate pseudo-random numbers, the random seed acts as a starting point for this pseudo-random number generation process. Setting a specific seed ensures that the sequence of random numbers generated will be the same every time the code is run.
    • Enabling Reproducibility: Using a random seed makes experiments reproducible. This means that if you share your code and the seed value with someone else, they should be able to obtain the same results as you, assuming they are using the same software and hardware environment. Source [4] illustrates this with the example of sharing a notebook with a friend and wanting to reduce the randomness in the results.
    • PyTorch Implementation: In PyTorch, you can set the random seed using torch.manual_seed(seed_value). Source [5] provides examples of setting the random seed and creating reproducible random tensors. Source [6] further notes that there is a separate seed for CUDA operations, torch.cuda.manual_seed(seed_value), to control randomness when using GPUs.
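    A brief sketch of the seeding pattern (the seed value 42 is arbitrary):

        import torch

        torch.manual_seed(42)
        tensor_a = torch.rand(3, 4)

        torch.manual_seed(42)   # reset the seed before the second call
        tensor_b = torch.rand(3, 4)

        print(torch.equal(tensor_a, tensor_b))  # True: same seed, same "random" numbers

        # For GPU operations, set the CUDA seed as well
        torch.cuda.manual_seed(42)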

    3. Effects on Training

    • Deterministic Initialization: Setting a random seed ensures that the model starts with the same initial parameters every time you run the training code. This removes variability in the starting point of the optimization process.
    • Consistent Optimization Path: With a fixed seed, the optimizer will take the same steps during training, leading to a more consistent learning trajectory. This can be helpful for debugging and understanding the model’s behavior.
    • Comparable Experiments: Using random seeds makes it easier to compare the results of different experiments. For example, if you are trying different model architectures or hyperparameters, using the same seed across experiments ensures that the observed differences are due to the changes you made rather than random variations in initialization or optimization.

    Key Takeaways

    • Randomness is inherent in neural network training, impacting both parameter initialization and the optimization process.
    • Random seeds provide a mechanism for controlling this randomness, making experiments reproducible and facilitating comparisons.
    • Setting a random seed doesn’t eliminate randomness entirely but ensures that the same sequence of “random” numbers is used, leading to more consistent and comparable results.

    The sources encourage developers to be mindful of randomness and leverage random seeds to enhance the reproducibility and reliability of their machine learning experiments. While complete determinism is often difficult to achieve due to factors beyond the random seed, using seeds is a valuable practice for improving the scientific rigor of deep learning research and development.

    Training a Neural Network in PyTorch: A Step-by-Step Guide

    The sources outline the primary steps involved in training a neural network using the PyTorch deep learning framework. These steps, often referred to as the PyTorch workflow, provide a structured approach to building, training, and evaluating models.

    1. Data Preparation and Loading

    • Data Acquisition: This initial step involves obtaining the data required for your machine learning task. As the sources note, data can take various forms, including structured data (e.g., spreadsheets), images, videos, audio, and even DNA sequences.
    • Data Exploration: Becoming familiar with your data is crucial. This might involve visualizing the data (e.g., plotting images, creating histograms) and understanding its distribution, patterns, and potential biases.
    • Data Preprocessing: Preparing the data for use with a PyTorch model often requires transformation and formatting. This could involve:
    • Numerical Encoding: Converting categorical data into numerical representations, as many machine learning models operate on numerical inputs.
    • Normalization: Scaling numerical features to a standard range (e.g., between 0 and 1) to prevent features with larger scales from dominating the learning process.
    • Reshaping: Restructuring data into the appropriate dimensions expected by the neural network.
    • Tensor Conversion: The sources emphasize that tensors are the fundamental building blocks of data in PyTorch. You’ll need to convert your data into PyTorch tensors using functions like torch.tensor().
    • Dataset and DataLoader: Source recommends using PyTorch’s Dataset and DataLoader classes to efficiently manage and load data during training. A Dataset object represents your dataset, while a DataLoader provides an iterable over the dataset, enabling batching, shuffling, and other data handling operations.
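    As a minimal sketch of the Dataset/DataLoader pattern (using TensorDataset, a convenience wrapper for tensors already in memory; the shapes and batch size are illustrative):

    import torch
    from torch.utils.data import TensorDataset, DataLoader

    features = torch.rand(100, 3)                 # 100 samples, 3 features each
    labels = torch.rand(100, 1)
    dataset = TensorDataset(features, labels)     # pairs each sample with its label
    loader = DataLoader(dataset, batch_size=16, shuffle=True)

    for X_batch, y_batch in loader:               # iterate in shuffled batches
        print(X_batch.shape)                      # torch.Size([16, 3])
        break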

    2. Model Building or Selection

    • Model Architecture: This step involves defining the structure of your neural network. You’ll need to decide on:
    • Layer Types: PyTorch provides a wide range of layers in the torch.nn module, including linear layers (nn.Linear), convolutional layers (nn.Conv2d), recurrent layers (nn.LSTM), and more.
    • Number of Layers: The depth of your network, often determined through experimentation and the complexity of the task.
    • Number of Hidden Units: The dimensionality of the hidden representations within the network.
    • Activation Functions: Non-linear functions applied to the output of layers to introduce non-linearity into the model.
    • Model Implementation: You can build models from scratch, stacking layers together manually, or leverage pre-trained models from repositories like Torch Hub, particularly for tasks like image classification. Source showcases both approaches:
    • Subclassing nn.Module: This common pattern involves creating a Python class that inherits from nn.Module. You’ll define layers as attributes of the class and implement the forward() method to specify how data flows through the network.
    • Using nn.Sequential: Source demonstrates this simpler method for creating sequential models where data flows linearly through a sequence of layers.
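    To illustrate both approaches, here is a hedged sketch (the layer sizes and the name TinyModel are made up for illustration):

    import torch
    from torch import nn

    class TinyModel(nn.Module):                   # subclassing nn.Module
        def __init__(self):
            super().__init__()
            self.layer_1 = nn.Linear(in_features=2, out_features=8)
            self.layer_2 = nn.Linear(in_features=8, out_features=1)
            self.relu = nn.ReLU()

        def forward(self, x):                     # defines how data flows through the network
            return self.layer_2(self.relu(self.layer_1(x)))

    # The same architecture expressed with nn.Sequential
    model = nn.Sequential(
        nn.Linear(2, 8),
        nn.ReLU(),
        nn.Linear(8, 1),
    )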

    3. Loss Function and Optimizer Selection

    • Loss Function: The loss function measures how well the model is performing during training. It quantifies the difference between the model’s predictions and the actual target values. The choice of loss function depends on the nature of the problem:
    • Regression: Common loss functions include Mean Squared Error (MSE) and Mean Absolute Error (MAE).
    • Classification: Common loss functions include Cross-Entropy Loss and Binary Cross-Entropy Loss.
    • Optimizer: The optimizer is responsible for updating the model’s parameters (weights and biases) during training, aiming to minimize the loss function. Popular optimizers in PyTorch include Stochastic Gradient Descent (SGD) and Adam.
    • Hyperparameters: Both the loss function and optimizer often have hyperparameters that you’ll need to tune. For example, the learning rate for an optimizer controls the step size taken during parameter updates.
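    A minimal sketch of this step (the specific loss, optimizer, and learning rate are illustrative choices, not prescriptions):

    import torch
    from torch import nn

    model = nn.Linear(2, 1)                       # stand-in model
    loss_fn = nn.BCEWithLogitsLoss()              # binary classification loss
    # loss_fn = nn.MSELoss()                      # a common regression alternative
    optimizer = torch.optim.SGD(params=model.parameters(), lr=0.01)
    # optimizer = torch.optim.Adam(model.parameters(), lr=0.001)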

    4. Training Loop Implementation

    • Epochs: The training process is typically organized into epochs. An epoch involves iterating over the entire training dataset once. You’ll specify the number of epochs to train for.
    • Batches: To improve efficiency, data is often processed in batches rather than individually. You’ll set the batch size, determining the number of data samples processed in each iteration of the training loop.
    • Training Steps: The core of the training loop involves the following steps, repeated for each batch of data:
    • Forward Pass: Passing the input data through the model to obtain predictions.
    • Loss Calculation: Computing the loss by comparing predictions to the target values.
    • Backpropagation: Calculating gradients of the loss with respect to the model’s parameters. This identifies how each parameter contributed to the error.
    • Parameter Update: Using the optimizer to update the model’s parameters based on the calculated gradients. The goal is to adjust parameters in a direction that reduces the loss.
    • Evaluation: Periodically, you’ll evaluate the model’s performance on a separate validation set to monitor its progress and prevent overfitting (where the model learns the training data too well and performs poorly on unseen data).
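    Putting these steps together, a bare-bones training loop might look like the following sketch (it assumes model, loss_fn, optimizer, and train_loader are already defined, and omits the evaluation step):

    epochs = 10
    for epoch in range(epochs):
        model.train()                             # put the model in training mode
        for X_batch, y_batch in train_loader:
            y_pred = model(X_batch)               # 1. forward pass
            loss = loss_fn(y_pred, y_batch)       # 2. calculate the loss
            optimizer.zero_grad()                 # 3. clear gradients from the previous step
            loss.backward()                       # 4. backpropagation
            optimizer.step()                      # 5. update the parameters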

    5. Model Saving and Loading

    • Saving: Once the model is trained to a satisfactory level, you’ll want to save it for later use. The sources describe methods for saving PyTorch models, including:
    • Saving the State Dictionary: This approach saves the model’s learned parameters in a dictionary-like object. It’s generally the recommended method as it’s more efficient and flexible.
    • Saving the Entire Model: This saves the entire model architecture and parameters. However, it can lead to larger file sizes and potential compatibility issues if the PyTorch version changes.
    • Loading: You can later load a saved model to reuse it for inference (making predictions on new data) or to continue training.
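    A short sketch of the recommended state-dictionary approach (the file name and the TinyModel class from the earlier sketch are illustrative):

    import torch

    torch.save(model.state_dict(), "model.pth")              # save learned parameters only

    loaded_model = TinyModel()                                # recreate the architecture first
    loaded_model.load_state_dict(torch.load("model.pth"))    # load the saved parameters
    loaded_model.eval()                                       # set to inference mode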

    6. Additional Considerations

    • GPU Acceleration: The sources highlight the importance of utilizing GPUs for faster training, especially for large models and datasets. PyTorch provides mechanisms for transferring models and data to GPUs.
    • Experiment Tracking: As you train and experiment with different models and hyperparameters, it’s essential to keep track of your experiments. Tools like Weights & Biases, MLflow, and TensorBoard (mentioned in Source) can help you log metrics, visualize training progress, and compare different runs.
    • Deployment: Once you have a well-trained model, you might want to deploy it to make predictions in real-world applications. This often involves considerations related to model serving, optimization, and integration with other systems.

    Key Points from Sources

    • Iterative Workflow: The PyTorch workflow is iterative, meaning you’ll often go back and forth between different steps, experimenting with various model architectures, hyperparameters, and data preprocessing techniques to improve performance.
    • Experimentation: The sources encourage experimentation as a core part of the machine-learning process. Trying different things and observing their impact is crucial for finding the best model for your task.
    • Focus on Code: While a theoretical understanding of deep learning concepts is helpful, the sources prioritize a hands-on, code-centric approach, emphasizing that writing code and running experiments is the most effective way to learn PyTorch.
    • Abundant Resources: The sources emphasize the availability of extensive PyTorch documentation, tutorials, and community forums. These resources are invaluable for learning the framework and troubleshooting issues.

    Beyond the Basics

    The training steps described here provide a foundational understanding of how to train neural networks in PyTorch. As you progress in your deep learning journey, you’ll encounter more advanced topics and techniques. This might include:

    • Transfer Learning: Leveraging pre-trained models to jump-start your model’s performance, particularly when you have limited data.
    • Fine-tuning: Adapting pre-trained models to your specific task by modifying and training certain layers.
    • Custom Layers and Architectures: Designing and implementing your own specialized neural network layers and architectures to address unique challenges.
    • Distributed Training: Training models across multiple GPUs or machines to accelerate the process and handle extremely large datasets.

    The sources provide a strong foundation for getting started with PyTorch. They encourage an experimental mindset, highlighting the importance of writing code, running experiments, and iteratively refining your models to achieve optimal performance.

    Key Advantages of Utilizing PyTorch

    The sources highlight several key benefits of using PyTorch as a deep learning framework, particularly for research and building machine learning models. Here’s a summary of the main advantages:

    1. Research-Oriented and Widely Adopted:

    • PyTorch is positioned as the “most popular research deep learning framework” [1], with statistics from December 2021 indicating that 58% of papers with code on Papers With Code were implemented using PyTorch [2]. This popularity within the research community stems from its flexibility and ease of use, making it a preferred choice for developing state-of-the-art machine learning algorithms.
    • The widespread adoption of PyTorch is further evidenced by its use in prominent organizations like Tesla (for Autopilot computer vision models), OpenAI, Facebook (for in-house machine learning applications), and Microsoft [3-5].

    2. Pythonic and User-Friendly:

    • PyTorch is deeply integrated with Python, making it highly accessible for Python developers [1]. Its syntax and structure align closely with Pythonic conventions, reducing the learning curve for those already familiar with the language.
    • This user-friendliness is emphasized throughout the sources, advocating for a hands-on, code-centric approach to learning PyTorch and stressing that “if you know Python, it’s a very user-friendly programming language” [6].

    3. Dynamic Computational Graph and Debugging Ease:

    • PyTorch’s dynamic computational graph is a significant advantage. Unlike static graph frameworks such as TensorFlow (at least in its earlier versions), PyTorch builds the graph as you execute the code [this detail is not covered in the sources; verify independently]. This dynamic nature allows for greater flexibility during development, as you can modify the graph on the fly. It also simplifies debugging, as you can use standard Python debugging tools to inspect variables and step through the code.

    4. GPU Acceleration and Fast Computations:

    • PyTorch enables you to leverage the power of GPUs to accelerate computations [1, 7]. This is particularly crucial for deep learning, where training often involves vast amounts of data and computationally intensive operations.
    • PyTorch accomplishes GPU acceleration through CUDA, NVIDIA’s parallel computing platform and API [8, 9]. It also supports TPUs (Tensor Processing Units), although GPUs are more commonly used with PyTorch [9].
    • The sources highlight this benefit, noting that PyTorch “allows you to write fast deep learning code in Python” [1] and “enables you to run your code, your machine learning code accelerated on a GPU” [7].

    5. Extensive Ecosystem and Community Support:

    • PyTorch benefits from a rich ecosystem of tools, libraries, and resources. Source mentions Torch Hub, a repository for pre-trained models, and Torch Vision, a library specifically designed for computer vision tasks [6].
    • There’s also a vibrant PyTorch community, providing support through forums, tutorials, and online resources [this detail is not covered in the sources; verify independently].

    6. Flexibility and Customization:

    • PyTorch’s flexibility allows you to build a wide range of models, from simple linear regressions to complex convolutional neural networks, and experiment with various architectures and techniques.
    • The sources demonstrate this flexibility through examples, showing how to create models by subclassing nn.Module, using nn.Sequential, and implementing custom layers.

    7. Focus on Tensors as Fundamental Building Blocks:

    • The sources repeatedly emphasize that tensors are the core data structures in PyTorch [10, 11]. This focus on tensors simplifies data handling and manipulation, as tensors provide a unified representation for various data types, including images, text, and numerical data.

    In summary, PyTorch offers a compelling combination of research-oriented features, user-friendliness, performance, and a robust ecosystem, making it a powerful and popular choice for deep learning practitioners.

    PyTorch and GPU Acceleration: A CUDA-Powered Partnership

    The sources provide a clear explanation of how PyTorch utilizes GPUs to achieve faster computations in deep learning. Here’s a breakdown:

    • GPUs are inherently designed for numerical computation. They were initially developed for video games, which require intensive graphical processing that involves a significant number of mathematical calculations. This inherent capability makes them well-suited for the complex operations involved in deep learning. [1]
    • PyTorch provides an interface to harness this computational power through CUDA. CUDA, which stands for Compute Unified Device Architecture, is NVIDIA’s parallel computing platform and API. It acts as a bridge between software, like PyTorch, and NVIDIA GPUs, enabling the software to use the GPU for general-purpose computing tasks. [2]
    • By leveraging CUDA, PyTorch can execute deep learning code on NVIDIA GPUs. This allows for significant speedups, particularly when dealing with large datasets and complex models. [3]
    • Device Agnostic Code: PyTorch promotes the concept of “device agnostic code”, which means the code can run on either a GPU (if available) or a CPU. This is achieved by setting a device variable that defaults to CUDA if a GPU is present and falls back to the CPU if not. [4, 5]
    • The to method plays a crucial role. This method is used to move tensors and models to the desired device (either CUDA for GPU or CPU). By moving the data and the model to the GPU, PyTorch ensures that all computations are performed on the faster hardware. [6, 7]
    • The sources strongly emphasize the performance benefits of using a GPU. They mention that it results in “faster computations”, allowing for faster pattern discovery in data, more experimentation, and ultimately, finding the best model. [8, 9]
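    A minimal sketch of this device agnostic pattern (assuming model and X are already defined):

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"   # fall back to CPU if no GPU
    model = model.to(device)      # move the model's parameters to the target device
    X = X.to(device)              # move the data to the same device
    y_pred = model(X)             # the computation now runs on `device`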

    Key Takeaways:

    • PyTorch’s ability to utilize GPUs stems from its integration with CUDA.
    • The to method allows for seamless movement of data and models to the GPU.
    • Utilizing a GPU can lead to substantial speed improvements in deep learning tasks.

    It’s important to note that while GPUs generally offer significant performance gains, there are situations where the overhead of transferring data to and from the GPU might outweigh the computational benefits, particularly with smaller datasets and less complex models. [10]

    Top Three Errors in PyTorch

    The sources identify three major error types that you’re likely to encounter when working with PyTorch and deep learning:

    1. Tensor Data Type Mismatches

    • The Root of the Problem: PyTorch relies heavily on tensors for representing and manipulating data. Tensors have an associated data type, such as float32, int64, or bool. Many PyTorch functions and operations require tensors to have specific data types to work correctly. If the data types of tensors involved in a calculation are incompatible, PyTorch will raise an error.
    • Common Manifestations: You might encounter this error when:
    • Performing mathematical operations between tensors with mismatched data types (e.g., multiplying a float32 tensor by an int64 tensor) [1, 2].
    • Using a function that expects a particular data type but receiving a tensor of a different type (e.g., torch.mean requires a float32 tensor) [3-5].
    • Real-World Example: The sources illustrate this error with torch.mean. If you attempt to calculate the mean of a tensor that isn’t a floating-point type, PyTorch will throw an error. To resolve this, you need to convert the tensor to float32 using tensor.type(torch.float32) [4].
    • Debugging Strategies:
    • Carefully inspect the data types of the tensors involved in the operation or function call where the error occurs.
    • Use tensor.dtype to check a tensor’s data type.
    • Convert tensors to the required data type using tensor.type().
    • Key Insight: Pay close attention to data types. When in doubt, default to float32 as it’s PyTorch’s preferred data type [6].
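    A short sketch of the torch.mean case described above:

    import torch

    int_tensor = torch.tensor([1, 2, 3])          # dtype defaults to int64
    # torch.mean(int_tensor)                      # RuntimeError: mean() expects a floating point tensor
    float_tensor = int_tensor.type(torch.float32)
    print(torch.mean(float_tensor))               # tensor(2.)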

    2. Tensor Shape Mismatches

    • The Core Issue: Tensors also have a shape, which defines their dimensionality. For example, a vector is a 1-dimensional tensor, a matrix is a 2-dimensional tensor, and an image with three color channels is often represented as a 3-dimensional tensor. Many PyTorch operations, especially matrix multiplications and neural network layers, have strict requirements regarding the shapes of input tensors.
    • Where It Goes Wrong:
    • Matrix Multiplication: The inner dimensions of matrices being multiplied must match [7, 8].
    • Neural Networks: The output shape of one layer needs to be compatible with the input shape of the next layer.
    • Reshaping Errors: Attempting to reshape a tensor into an incompatible shape (e.g., squeezing 9 elements into a shape of 1×7) [9].
    • Example in Action: The sources provide an example of a shape error during matrix multiplication using torch.matmul. If the inner dimensions don’t match, PyTorch will raise an error [8].
    • Troubleshooting Tips:
    • Shape Inspection: Thoroughly understand the shapes of your tensors using tensor.shape.
    • Visualization: When possible, visualize tensors (especially high-dimensional ones) to get a better grasp of their structure.
    • Reshape Carefully: Ensure that reshaping operations (tensor.reshape, tensor.view) result in compatible shapes.
    • Crucial Takeaway: Always verify shape compatibility before performing operations. Shape errors are prevalent in deep learning, so be vigilant.
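    A small sketch of the matmul case (the shapes are illustrative; transposing is one common fix when it matches your intent):

    import torch

    A = torch.rand(3, 2)
    B = torch.rand(3, 2)
    # torch.matmul(A, B)          # RuntimeError: inner dimensions (2 and 3) don't match
    C = torch.matmul(A, B.T)      # B.T has shape (2, 3), so the inner dimensions match
    print(C.shape)                # torch.Size([3, 3]), the outer dimensions of A and B.T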

    3. Device Mismatches (CPU vs. GPU)

    • The Device Divide: PyTorch supports both CPUs and GPUs for computation. GPUs offer significant performance advantages, but require data and models to reside in GPU memory. If you attempt to perform an operation between tensors or models located on different devices, PyTorch will raise an error.
    • Typical Scenarios:
    • Moving Data to GPU: You might forget to move your input data to the GPU using tensor.to(device), leading to an error when performing calculations with a model that’s on the GPU [10].
    • NumPy and GPU Tensors: NumPy operates on CPU memory, so you can’t directly use NumPy functions on GPU tensors [11]. You need to first move the tensor back to the CPU using tensor.cpu() [12].
    • Source Illustration: The sources demonstrate this issue when trying to use numpy.array() on a tensor that’s on the GPU. The solution is to bring the tensor back to the CPU using tensor.cpu() [12].
    • Best Practices:
    • Device Agnostic Code: Use the device variable and the to() method to ensure that data and models are on the correct device [11, 13].
    • CPU-to-GPU Transfers: Minimize the number of data transfers between the CPU and GPU, as these transfers can introduce overhead.
    • Essential Reminder: Be device-aware. Always ensure that all tensors involved in an operation are on the same device (either CPU or GPU) to avoid errors.
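    A minimal sketch of the NumPy/GPU case (it only triggers the error when a CUDA device is actually available):

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    gpu_tensor = torch.rand(3).to(device)
    # gpu_tensor.numpy()                   # TypeError if the tensor lives on the GPU
    cpu_array = gpu_tensor.cpu().numpy()   # move back to the CPU first, then convert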

    The Big Three Errors in PyTorch and Deep Learning

    The sources dedicate significant attention to highlighting the three most common errors encountered when working with PyTorch for deep learning, emphasizing that mastering these will equip you to handle a significant portion of the challenges you’ll face in your deep learning journey.

    1. Tensor Not the Right Data Type

    • The Core of the Issue: Tensors, the fundamental building blocks of data in PyTorch, come with associated data types (dtype), such as float32, float16, int32, and int64 [1, 2]. These data types specify how much precision is used to store a single number in memory [3]. Different PyTorch functions and operations may require specific data types to work correctly [3, 4].
    • Why it’s Tricky: Sometimes operations may unexpectedly work even if tensors have different data types [4, 5]. However, other operations, especially those involved in training large neural networks, can be quite sensitive to data type mismatches and will throw errors [4].
    • Debugging and Prevention:
    • Awareness is Key: Be mindful of the data types of your tensors and the requirements of the operations you’re performing.
    • Check Data Types: Utilize tensor.dtype to inspect the data type of a tensor [6].
    • Conversion: If needed, convert tensors to the desired data type using tensor.type(desired_dtype) [7].
    • Real-World Example: The sources provide examples of using torch.mean, a function that requires a float32 tensor [8, 9]. If you attempt to use it with an integer tensor, PyTorch will throw an error. You’ll need to convert the tensor to float32 before calculating the mean.

    2. Tensor Not the Right Shape

    • The Heart of the Problem: Neural networks are essentially intricate structures built upon layers of matrix multiplications. For these operations to work seamlessly, the shapes (dimensions) of tensors must be compatible [10-12].
    • Shape Mismatch Scenarios: This error arises when:
    • The inner dimensions of matrices being multiplied don’t match, violating the fundamental rule of matrix multiplication [10, 13].
    • Neural network layers receive input tensors with incompatible shapes, preventing the data from flowing through the network as expected [11].
    • You attempt to reshape a tensor into a shape that doesn’t accommodate all its elements [14].
    • Troubleshooting and Best Practices:
    • Inspect Shapes: Make it a habit to meticulously examine the shapes of your tensors using tensor.shape [6].
    • Visualize: Whenever possible, try to visualize your tensors to gain a clearer understanding of their structure, especially for higher-dimensional tensors. This can help you identify potential shape inconsistencies.
    • Careful Reshaping: Exercise caution when using operations like tensor.reshape or tensor.view to modify the shape of a tensor. Always ensure that the resulting shape is compatible with the intended operation or layer.
    • Source Illustration: The sources offer numerous instances where shape errors occur during matrix multiplication and when passing data through neural network layers [13-18].
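    As one hedged illustration of a layer-level shape mismatch (the sizes are made up):

    import torch
    from torch import nn

    layer = nn.Linear(in_features=8, out_features=4)
    x = torch.rand(1, 10)
    # layer(x)                    # RuntimeError: mat1 and mat2 shapes cannot be multiplied
    x = torch.rand(1, 8)          # the last dimension must equal in_features
    print(layer(x).shape)         # torch.Size([1, 4])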

    3. Tensor Not on the Right Device

    • The Device Dilemma: PyTorch allows you to perform computations on either a CPU or a GPU, with GPUs offering substantial speed advantages for deep learning tasks [19, 20]. However, this flexibility introduces the potential for device mismatches, where you attempt to perform operations between tensors located on different devices (CPU or GPU), resulting in errors [19, 21].
    • Common Culprits:
    • Data on CPU, Model on GPU: You might load your data but forget to transfer it to the GPU using tensor.to(device), leading to an error when interacting with a model that’s on the GPU [22].
    • NumPy and GPUs: NumPy, a popular Python library for numerical computing, operates on CPU memory. Attempting to directly use NumPy functions on GPU tensors will result in errors. You’ll need to first move the tensor back to the CPU using tensor.cpu() [19].
    • Debugging and Solutions:
    • Device-Agnostic Code: Adopt the practice of writing device-agnostic code, using the device variable and the to() method. This will help ensure that your data and models are consistently on the correct device [17].
    • Minimize Transfers: Be mindful of the overhead associated with data transfers between the CPU and GPU. Try to minimize these transfers to optimize performance.
    • Source Walkthrough: The sources provide examples of device mismatch errors, particularly when attempting to use NumPy functions with tensors on the GPU [19] and when data and models are not explicitly moved to the same device [17, 22-24].

    The sources consistently emphasize that these three types of errors are incredibly prevalent in PyTorch and deep learning in general, often leading to a significant amount of troubleshooting and debugging. By understanding the nature of these errors, their common causes, and how to address them, you’ll be well-prepared to tackle a substantial portion of the challenges you’ll encounter while developing and training deep learning models with PyTorch.

    The Dynamic Duo: Gradient Descent and Backpropagation

    The sources highlight two fundamental algorithms that are at the heart of training neural networks: gradient descent and backpropagation. Let’s explore each of these in detail.

    1. Gradient Descent: The Optimizer

    • What it Does: Gradient descent is an optimization algorithm that aims to find the best set of parameters (weights and biases) for a neural network to minimize the loss function. The loss function quantifies how “wrong” the model’s predictions are compared to the actual target values.
    • The Analogy: Imagine you’re standing on a mountain and want to find the lowest point (the valley). Gradient descent is like taking small steps downhill, following the direction of the steepest descent. The “steepness” is determined by the gradient of the loss function.
    • In PyTorch: PyTorch provides the torch.optim module, which contains various implementations of gradient descent and other optimization algorithms. You specify the model’s parameters and a learning rate (which controls the size of the steps taken downhill). [1-3]
    • Variations: There are different flavors of gradient descent:
    • Stochastic Gradient Descent (SGD): Updates parameters based on the gradient calculated from a single data point or a small batch of data. This introduces some randomness (noise) into the optimization process, which can help escape local minima. [3]
    • Adam: A more sophisticated variant of SGD that uses momentum and adaptive learning rates to improve convergence speed and stability. [4, 5]
    • Key Insight: The choice of optimizer and its hyperparameters (like learning rate) can significantly influence the training process and the final performance of your model. Experimentation is often needed to find the best settings for a given problem.
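    To make the “steps downhill” concrete, here is a toy single-parameter sketch of one gradient descent update performed by hand (real training uses torch.optim instead):

    import torch

    w = torch.tensor(5.0, requires_grad=True)     # a single parameter
    lr = 0.1                                      # learning rate (step size)
    loss = (w - 3.0) ** 2                         # toy loss, minimized at w = 3
    loss.backward()                               # d(loss)/dw = 2 * (w - 3) = 4
    with torch.no_grad():
        w -= lr * w.grad                          # step downhill: 5.0 - 0.1 * 4 = 4.6
    print(w)                                      # tensor(4.6000, requires_grad=True)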

    2. Backpropagation: The Gradient Calculator

    • Purpose: Backpropagation is the algorithm responsible for calculating the gradients of the loss function with respect to the neural network’s parameters. These gradients are then used by gradient descent to update the parameters in the direction that reduces the loss.
    • How it Works: Backpropagation uses the chain rule from calculus to efficiently compute gradients, starting from the output layer and propagating them backward through the network layers to the input.
    • The “Backward Pass”: In PyTorch, you trigger backpropagation by calling the loss.backward() method. This calculates the gradients and stores them in the grad attribute of each parameter tensor. [6-9]
    • PyTorch’s Magic: PyTorch’s autograd feature handles the complexities of backpropagation automatically. You don’t need to manually implement the chain rule or derivative calculations. [10, 11]
    • Essential for Learning: Backpropagation is the key to enabling neural networks to learn from data by adjusting their parameters in a way that minimizes prediction errors.

    The sources emphasize that gradient descent and backpropagation work in tandem: backpropagation computes the gradients, and gradient descent uses these gradients to update the model’s parameters, gradually improving its performance over time. [6, 10]
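    A tiny autograd sketch shows both halves in action:

    import torch

    x = torch.tensor(2.0, requires_grad=True)
    y = x ** 3                    # forward pass: y = x^3
    y.backward()                  # backpropagation via the chain rule
    print(x.grad)                 # dy/dx = 3 * x^2 = tensor(12.)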

    Transfer Learning: Leveraging Existing Knowledge

    Transfer learning is a powerful technique in deep learning where you take a model that has already been trained on a large dataset for a particular task and adapt it to solve a different but related task. This approach offers several advantages, especially when dealing with limited data or when you want to accelerate the training process. The sources provide examples of how transfer learning can be applied and discuss some of the key resources within PyTorch that support this technique.

    The Core Idea: Instead of training a model from scratch, you start with a model that has already learned a rich set of features from a massive dataset (often called a pre-trained model). These pre-trained models are typically trained on datasets like ImageNet, which contains millions of images across thousands of categories.

    How it Works:

    1. Choose a Pre-trained Model: Select a pre-trained model that is relevant to your target task. For image classification, popular choices include ResNet, VGG, and Inception.
    2. Feature Extraction: Use the pre-trained model as a feature extractor. You can either:
    • Freeze the weights of the early layers of the model (which have learned general image features) and only train the later layers (which are more specific to your task).
    • Fine-tune the entire pre-trained model, allowing all layers to adapt to your target dataset.
    3. Transfer to Your Task: Replace the final layer(s) of the pre-trained model with layers that match the output requirements of your task. For example, if you’re classifying images into 10 categories, you’d replace the final layer with a layer that outputs 10 probabilities.
    4. Train on Your Data: Train the modified model on your dataset. Since the pre-trained model already has a good understanding of general image features, the training process can converge faster and achieve better performance, even with limited data.
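    A hedged sketch of the feature-extraction pattern with torchvision (the weights API shown here is from newer torchvision releases; older versions use pretrained=True, and the 10-class output is illustrative):

    from torch import nn
    import torchvision

    weights = torchvision.models.ResNet18_Weights.DEFAULT
    model = torchvision.models.resnet18(weights=weights)      # load a pre-trained model

    for param in model.parameters():                          # freeze the feature extractor
        param.requires_grad = False

    model.fc = nn.Linear(in_features=512, out_features=10)    # replace the final layer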

    PyTorch Resources for Transfer Learning:

    • Torch Hub: A repository of pre-trained models that can be easily loaded and used. The sources mention Torch Hub as a valuable resource for finding models to use in transfer learning.
    • torchvision.models: Contains a collection of popular computer vision architectures (like ResNet and VGG) that come with pre-trained weights. You can easily load these models and modify them for your specific tasks.

    Benefits of Transfer Learning:

    • Faster Training: Since you’re not starting from random weights, the training process typically requires less time.
    • Improved Performance: Pre-trained models often bring a wealth of knowledge that can lead to better accuracy on your target task, especially when you have a small dataset.
    • Less Data Required: Transfer learning can be highly effective even when your dataset is relatively small.

    Examples in the Sources:

    The sources provide a glimpse into how transfer learning can be applied to image classification problems. For instance, you could leverage a model pre-trained on ImageNet to classify different types of food images or to distinguish between different clothing items in fashion images.

    Key Takeaway: Transfer learning is a valuable technique that allows you to build upon the knowledge gained from training large models on extensive datasets. By adapting these pre-trained models, you can often achieve better results faster, particularly in scenarios where labeled data is scarce.

    Choosing Machine Learning over Traditional Programming

    Here are some reasons why you might choose a machine learning algorithm over traditional programming:

    • When you have problems with long lists of rules, it can be helpful to use a machine learning or a deep learning approach. For example, the rules of driving would be very difficult to code into a traditional program, but machine learning and deep learning are currently being used in self-driving cars to manage these complexities [1].
    • Machine learning can be beneficial in continually changing environments because it can adapt to new data. For example, a machine learning model for self-driving cars could learn to adapt to new neighborhoods and driving conditions [2].
    • Machine learning and deep learning excel at discovering insights within large collections of data. For example, the Food 101 data set contains images of 101 different kinds of food, which would be very challenging to classify using traditional programming techniques [3].
    • If a problem can be solved with a simple set of rules, you should use traditional programming. For example, if you could write five steps to make your grandmother’s famous roast chicken, then it is better to do that than to use a machine learning algorithm [4, 5].

    Traditional programming is when you write code to define a set of rules that map inputs to outputs. For example, you could write a program to make your grandmother’s roast chicken by defining a set of steps that map the ingredients to the finished dish [6, 7].

    Machine learning, on the other hand, is when you give a computer a set of inputs and outputs, and it figures out the rules for itself. For example, you could give a machine learning algorithm a bunch of pictures of cats and dogs, and it would learn to distinguish between them [8, 9]. This is often described as supervised learning, because the algorithm is given both the inputs and the desired outputs, also known as features and labels. The algorithm’s job is to figure out the relationship between the features and the labels [8].

    Deep learning is a subset of machine learning that uses neural networks with many layers. This allows deep learning models to learn more complex patterns than traditional machine learning algorithms. Deep learning is typically better for unstructured data, such as images, text, and audio [10].

    Machine learning can be used for a wide variety of tasks, including:

    • Image classification: Identifying the objects in an image. [11]
    • Object detection: Locating objects in an image. [11]
    • Natural language processing: Understanding and processing human language. [12]
    • Speech recognition: Converting speech to text. [13]
    • Machine translation: Translating text from one language to another. [13]

    Overall, machine learning algorithms can be a powerful tool for solving complex problems that would be difficult or impossible to solve with traditional programming. However, it is important to remember that machine learning is not a silver bullet. There are many problems that are still best solved with traditional programming.

    Here are the key advantages of using deep learning for problems with long lists of rules:

    • Deep learning can excel at finding patterns in complex data, making it suitable for problems where it is difficult to explicitly code all of the rules. [1] For example, driving a car involves many rules, such as how to back out of a driveway, how to turn left, how to parallel park, and how to stop at an intersection. It would be extremely difficult to code all of these rules into a traditional program. [2]
    • Deep learning is also well-suited for problems that involve continually changing environments. [3] This is because deep learning models can continue to learn and adapt to new data. [3] For example, a self-driving car might need to adapt to new neighborhoods and driving conditions. [3]
    • Deep learning can be used to discover insights within large collections of data. [4] This is because deep learning models are able to learn complex patterns from large amounts of data. [4] For example, a deep learning model could be trained on a large dataset of food images to learn to classify different types of food. [4]

    However, there are also some potential drawbacks to using deep learning for problems with long lists of rules:

    • Deep learning models can be difficult to interpret. [5] This is because the patterns learned by a deep learning model are often represented as a large number of weights and biases, which can be difficult for humans to understand. [5]
    • Deep learning models can be computationally expensive to train. [5] This is because deep learning models often have a large number of parameters, which require a lot of computational power to train. [5]

    Overall, deep learning can be a powerful tool for solving problems with long lists of rules, but it is important to be aware of the potential drawbacks before using it.

    Deep Learning Models Learn by Adjusting Random Numbers

    Deep learning models learn by starting with tensors full of random numbers and then adjusting those random numbers to represent data better. [1] This process is repeated over and over, with the model gradually improving its representation of the data. [2] This is a fundamental concept in deep learning. [1]

    This process of adjusting random numbers is driven by two algorithms: gradient descent and backpropagation. [3, 4]

    • Gradient descent minimizes the difference between the model’s predictions and the actual outputs by adjusting model parameters (weights and biases). [3, 4] The learning rate is a hyperparameter that determines how large the steps are that the model takes during gradient descent. [5, 6]
    • Backpropagation calculates the gradients of the loss function with respect to the parameters. [4] In other words, backpropagation tells the model how much each parameter needs to be adjusted to reduce the error. [4] PyTorch implements backpropagation behind the scenes, making it easier to build deep learning models without needing to understand the complex math involved. [4, 7]

    Deep learning models have many parameters, often thousands or even millions. [8, 9] These parameters represent the patterns that the model has learned from the data. [8, 10] By adjusting these parameters using gradient descent and backpropagation, the model can improve its performance on a given task. [1, 2]

    This learning process is similar to how humans learn. For example, when a child learns to ride a bike, they start by making random movements. Through trial and error, they gradually learn to coordinate their movements and balance on the bike. Similarly, a deep learning model starts with random parameters and gradually adjusts them to better represent the data it is trying to learn.

    In short, the main concept behind a deep learning model’s ability to learn is its ability to adjust a large number of random parameters to better represent the data, driven by gradient descent and backpropagation.

    Supervised and Unsupervised Learning Paradigms

    Supervised learning is a type of machine learning where you have data and labels. The labels are the desired outputs for each input. The goal of supervised learning is to train a model that can accurately predict the labels for new, unseen data. An example of supervised learning is training a model to discern between cat and dog photos using photos labeled as either “cat” or “dog”. [1, 2]

    Unsupervised and self-supervised learning are types of machine learning where you only have data, and no labels. The goal of unsupervised learning is to find patterns in the data without any guidance from labels. The goal of self-supervised learning is similar, but the algorithm attempts to learn an inherent representation of the data without being told what to look for. [2, 3] For example, a self-supervised learning algorithm could be trained on a dataset of dog and cat photos without being told which photos are of cats and which are of dogs. The algorithm would then learn to identify the underlying patterns in the data that distinguish cats from dogs. This representation of the data could then be used to train a supervised learning model to classify cats and dogs. [3, 4]

    Transfer learning is a type of machine learning where you take the patterns that one model has learned on one dataset and apply them to another dataset. This is a powerful technique that can be used to improve the performance of machine learning models on new tasks. For example, you could use a model that has been trained to classify images of dogs and cats to help train a model to classify images of birds. [4, 5]

    Reinforcement learning is another machine learning paradigm that does not fall into the categories of supervised, unsupervised, or self-supervised learning. [6] In reinforcement learning, an agent learns to interact with an environment by performing actions and receiving rewards or observations in return. [6, 7] An example of reinforcement learning is teaching a dog to urinate outside by rewarding it for urinating outside. [7]

    Underfitting in Machine Learning

    Underfitting occurs when a machine learning model is not complex enough to capture the patterns in the training data. As a result, an underfit model will have high training error and high test error. This means it will make inaccurate predictions on both the data it was trained on and new, unseen data.

    Here are some ways to identify underfitting:

    • The model’s loss on the training and test datasets remains higher than it could be [1].
    • The loss curve does not decrease significantly over time, remaining relatively flat [1].
    • The accuracy of the model is lower than desired on both the training and test sets [2].

    Here’s an analogy to better understand underfitting: Imagine you are trying to learn to play a complex piano piece but are only allowed to use one finger. You can learn to play a simplified version of the song, but it will not sound very good. You are underfitting the data because your one-finger technique is not complex enough to capture the nuances of the original piece.

    Underfitting is often caused by using a model that is too simple for the data. For example, using a linear model to fit data with a non-linear relationship will result in underfitting [3]. It can also be caused by not training the model for long enough. If you stop training too early, the model may not have had enough time to learn the patterns in the data.

    Here are some ways to address underfitting:

    • Add more layers or units to your model: This will increase the complexity of the model and allow it to learn more complex patterns [4].
    • Train for longer: This will give the model more time to learn the patterns in the data [5].
    • Tweak the learning rate: If the learning rate is too high, the model may not be able to converge on a good solution. Reducing the learning rate can help the model learn more effectively [4].
    • Use transfer learning: Transfer learning can help to improve the performance of a model by using knowledge learned from a previous task [6].
    • Use less regularization: Regularization is a technique that can help to prevent overfitting, but if you use too much regularization, it can lead to underfitting. Reducing the amount of regularization can help the model learn more effectively [7].
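    For instance, the first remedy might look like this sketch (the sizes are arbitrary):

    from torch import nn

    small_model = nn.Sequential(nn.Linear(2, 1))   # may underfit non-linear data

    bigger_model = nn.Sequential(                  # more layers, units, and non-linearity
        nn.Linear(2, 16),
        nn.ReLU(),
        nn.Linear(16, 16),
        nn.ReLU(),
        nn.Linear(16, 1),
    )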

    The goal in machine learning is to find the sweet spot between underfitting and overfitting, where the model is complex enough to capture the patterns in the data, but not so complex that it overfits. This is an ongoing challenge, and there is no one-size-fits-all solution. However, by understanding the concepts of underfitting and overfitting, you can take steps to improve the performance of your machine learning models.

    Impact of the Learning Rate on Gradient Descent

    The learning rate, often abbreviated as “LR”, is a hyperparameter that determines the size of the steps taken during the gradient descent algorithm [1-3]. Gradient descent, as previously discussed, is an iterative optimization algorithm that aims to find the optimal set of model parameters (weights and biases) that minimize the loss function [4-6].

    A smaller learning rate means the model parameters are adjusted in smaller increments during each iteration of gradient descent [7-10]. This leads to slower convergence, requiring more epochs to reach the optimal solution. However, a smaller learning rate can also be beneficial as it allows the model to explore the loss landscape more carefully, potentially avoiding getting stuck in local minima [11].

    Conversely, a larger learning rate results in larger steps taken during gradient descent [7-10]. This can lead to faster convergence, potentially reaching the optimal solution in fewer epochs. However, a large learning rate can also be detrimental as it can cause the model to overshoot the optimal solution, leading to oscillations or even divergence, where the loss increases instead of decreasing [7, 10, 12].

    Visualizing the learning rate’s effect can be helpful. Imagine trying to find the lowest point in a valley. A small learning rate is like taking small, careful steps down the slope, ensuring you don’t miss the bottom. A large learning rate is like taking large, confident strides, potentially reaching the bottom faster but risking stepping over it entirely.

    The choice of learning rate is crucial and often involves experimentation to find the optimal balance between convergence speed and stability [12-15]. Setting the learning rate too low can result in unnecessarily long training times, while setting it too high can prevent the model from converging at all [7, 10, 12].

    In the sources, a learning rate of 0.01 is described as “relatively large” [16]. A learning rate of 0.1 is mentioned as a common starting point but potentially too large for some scenarios [10, 17]. Smaller learning rates like 0.001 are also used, particularly with the Adam optimizer, which has a default learning rate of 0.001 [18-20].
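    In code, the learning rate is simply passed to the optimizer (assuming model is defined; 0.1 and the Adam default are the values mentioned above):

    import torch

    sgd = torch.optim.SGD(model.parameters(), lr=0.1)   # relatively large steps
    adam = torch.optim.Adam(model.parameters())         # defaults to lr=0.001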

    The sources demonstrate the impact of the learning rate on the training process. In one instance, lowering the learning rate from 0.1 to 0.01 resulted in a lower loss on the test dataset [21]. Additionally, using a TensorFlow Playground visualization, varying the learning rate illustrated how it affects the speed of convergence and the ability to find the optimal solution [11, 22, 23].

    Therefore, selecting an appropriate learning rate is an important consideration when training machine learning models. Experimentation and monitoring the training process are often necessary to find the learning rate that yields the best results for a specific problem and model architecture.

    Benefits and Drawbacks of Deep Learning Compared to Traditional Machine Learning

    Deep learning, a subfield of machine learning, has gained immense popularity due to its ability to achieve state-of-the-art performance on various complex tasks. However, it’s essential to consider both its advantages and disadvantages compared to traditional machine learning algorithms before deciding which approach is suitable for a particular problem.

    Benefits of Deep Learning

    • Handles Unstructured Data: Deep learning excels in handling unstructured data such as images, text, and audio. Traditional machine learning algorithms often struggle with such data types, requiring extensive feature engineering to convert them into structured formats [1, 2]. Deep learning models, particularly convolutional neural networks (CNNs) for images and recurrent neural networks (RNNs) or transformers for text, automatically learn relevant features from raw data, simplifying the process [1-3].
    • Superior Performance on Complex Tasks: Deep learning models, due to their depth and complexity, can capture intricate patterns and relationships in data that traditional algorithms may miss [3-5]. This enables them to outperform traditional methods on tasks like image recognition, natural language processing, and speech synthesis [6-8].
    • Adaptability and Continuous Learning: Deep learning models can adapt to changing environments and learn from new data continuously [9]. This is crucial for applications like self-driving cars, where the model needs to adjust to new scenarios and learn from ongoing experiences [4, 9, 10].

    Drawbacks of Deep Learning

    • Black Box Nature and Explainability: Deep learning models often lack explainability. Their complex architectures and vast number of parameters make it challenging to interpret how they arrive at their predictions [11, 12]. This can be a concern in applications where understanding the reasoning behind a decision is critical, such as medical diagnosis or financial modeling [11, 13].
    • Data Hungry and Computationally Intensive: Deep learning models typically require massive amounts of data to train effectively [14, 15]. Gathering and preparing such large datasets can be time-consuming and expensive. Additionally, training deep learning models can be computationally demanding, requiring powerful hardware like GPUs or TPUs [16, 17].
    • Potential for Overfitting: Due to their complexity, deep learning models are susceptible to overfitting, where they learn the training data too well but generalize poorly to unseen data [18-20]. This requires careful tuning and techniques like regularization or dropout to mitigate [21, 22].

    Traditional Machine Learning Algorithms: Strengths and Limitations

    Traditional machine learning algorithms, like linear regression, support vector machines (SVMs), and decision trees, have been widely used and remain relevant for many applications. They often excel in scenarios where data is structured, explainability is crucial, or computational resources are limited.

    • Strengths: They generally require less data, are easier to interpret, and are computationally less demanding compared to deep learning models [23, 24].
    • Limitations: They may struggle with complex tasks involving unstructured data, require extensive feature engineering, and may not perform as well as deep learning on specific tasks.

    The sources primarily focus on deep learning, and there is limited information on specific advantages and disadvantages of traditional machine learning algorithms. You may want to independently verify this information.

    The choice between deep learning and traditional machine learning depends on the specific problem, available data, computational resources, and the importance of explainability. Deep learning often outperforms traditional methods on complex tasks with unstructured data. However, traditional algorithms may be a better choice when data is structured, interpretability is crucial, or computational resources are limited.

    Google’s Number One Rule of Machine Learning: Benefits of Simplicity

    Google’s number one rule of machine learning emphasizes choosing simplicity over complexity when possible. It states: “If you can build a simple rule-based system that doesn’t require machine learning, do that.” [1-3] This principle, although seemingly counterintuitive at the start of a machine learning course, highlights the importance of considering the trade-offs before deploying complex models. [4]

    Here’s why it’s crucial to consider Google’s number one rule:

    • Efficiency and Maintainability: Simple rule-based systems are often easier to implement, understand, maintain, and debug compared to machine learning models. They require less code, are more transparent in their decision-making process, and don’t need extensive training data or computational resources. [1-3]
    • Reliability and Predictability: Rule-based systems provide predictable outcomes based on predefined rules. In contrast, machine learning models, especially deep learning models, involve probabilistic predictions, meaning there’s always a chance of error. [5] For tasks requiring absolute certainty or where errors are unacceptable, a rule-based system might be a more suitable choice. [5]
    • Reduced Development Time and Costs: Building and deploying a machine learning model involves several steps, including data collection, preprocessing, model selection, training, and evaluation. This process can be time-consuming and resource-intensive. If a simple rule-based system can achieve the desired outcome, it can significantly reduce development time and costs. [1, 2]
    • Avoiding Unnecessary Complexity: Machine learning models, especially deep learning models, can become highly complex, making them challenging to interpret and debug. Using a machine learning model when a simpler solution exists introduces unnecessary complexity, potentially leading to difficulties in maintenance and troubleshooting. [4]

    The sources provide an analogy to illustrate this principle. If a simple set of five rules can accurately map ingredients to a Sicilian grandmother’s roast chicken recipe, there’s no need to employ a complex machine learning model. The rule-based system, in this case, would be more efficient and reliable. [1, 2]

    However, it’s important to acknowledge that rule-based systems have limitations. They may not be suitable for complex problems with a vast number of rules, constantly changing environments, or situations requiring insights from large datasets. [6, 7]

    Therefore, Google’s number one rule encourages a thoughtful approach to problem-solving, urging consideration of simpler alternatives before resorting to the complexity of machine learning. It emphasizes that machine learning, although powerful, is not a universal solution and should be applied judiciously when the problem demands it. [4, 7]

    Parameters vs. Hyperparameters in Machine Learning

    Parameters: Learned by the Model

    • Parameters are the internal values of a machine learning model that are learned automatically during the training process. [1]
    • They are responsible for capturing patterns and relationships within the data. [1]
    • Examples of parameters include weights and biases in a neural network. [1, 2]
    • Parameters are updated iteratively through optimization algorithms like gradient descent, guided by the loss function. [3, 4]
    • The number of parameters can vary significantly depending on the complexity of the model and the dataset. Models can have from a few parameters to millions or even billions. [2]
    • In the context of PyTorch, accessing model parameters can be done using model.parameters(). [5]

    Hyperparameters: Set by the Machine Learning Engineer

    • Hyperparameters are external configurations that are set by the machine learning engineer or data scientist before training the model. [4]
    • They control the learning process and influence the behavior of the model, such as its complexity, learning speed, and ability to generalize. [6]
    • Examples of hyperparameters:
    • Learning rate (LR) [7]
    • Number of hidden layers [8]
    • Number of hidden units per layer [8]
    • Number of epochs [9]
    • Activation functions [8]
    • Loss function [8]
    • Optimizer [8]
    • Batch size [10]
    • Choosing appropriate hyperparameters is crucial for optimal model performance. [6]
    • Finding the best hyperparameter settings often involves experimentation and techniques like grid search or random search. [Grid search and random search are not covered in the sources; verify independently.]
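    The distinction is easy to see in code (the sizes and settings below are illustrative):

    import torch
    from torch import nn

    model = nn.Linear(in_features=3, out_features=1)
    for name, param in model.named_parameters():        # parameters: learned by the model
        print(name, param.shape)                        # weight: (1, 3), bias: (1,)

    lr = 0.01                                           # hyperparameters: set by you
    epochs = 100
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)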

    Key Differences

    • Learned vs. Set: The key difference is that parameters are learned by the model during training, while hyperparameters are set manually before training.
    • Internal vs. External: Parameters are internal to the model, representing its learned knowledge, whereas hyperparameters are external configurations that guide the learning process.
    • Optimization Target vs. Optimization Control: The model’s optimization algorithms aim to find the optimal parameter values, while hyperparameters control how this optimization process occurs.

    The sources provide a clear distinction between parameters and hyperparameters. Parameters are like the model’s internal settings that it adjusts to capture patterns in the data. Hyperparameters are the external knobs that the machine learning engineer tweaks to guide the model’s learning process. Understanding this distinction is essential for building and training effective machine learning models.

    Back Propagation and Gradient Descent: A Collaborative Learning Process

    Back propagation and gradient descent are two essential algorithms that work together to enable a machine learning model to learn from data and improve its performance. These concepts are particularly relevant to deep learning models, which involve complex architectures with numerous parameters that need to be optimized.

    Back Propagation: Calculating the Gradients

    Back propagation is an algorithm that calculates the gradients of the loss function with respect to each parameter in the model. The gradients represent the direction and magnitude of change needed in each parameter to minimize the loss function.

    • Forward Pass: It begins with a forward pass, where data is fed through the model’s layers, and predictions are generated.
    • Loss Calculation: The difference between these predictions and the actual target values is quantified using a loss function.
    • Backward Pass: The back propagation algorithm then works backward through the network, starting from the output layer and moving towards the input layer.
    • Chain Rule: It uses the chain rule of calculus to calculate the gradients of the loss function with respect to each parameter. This process involves calculating the partial derivatives of the loss function with respect to the outputs of each layer, and then using these derivatives to calculate the gradients for the parameters within that layer.
    • Gradient Accumulation: The gradients are accumulated during this backward pass, providing information about how each parameter contributes to the overall error.

    Gradient Descent: Updating the Parameters

    Gradient descent is an optimization algorithm that uses the gradients calculated by back propagation to update the model’s parameters iteratively. The goal is to find the parameter values that minimize the loss function, leading to improved model performance.

    • Learning Rate: The learning rate is a hyperparameter that determines the step size taken in the direction of the negative gradient. It controls how much the parameters are adjusted during each update.
    • Iterative Updates: Gradient descent starts with an initial set of parameter values (often randomly initialized) and repeatedly updates these values based on the calculated gradients.
    • Minimizing the Loss: The update rule involves moving the parameters in the opposite direction of the gradient, scaled by the learning rate. This process continues iteratively until the loss function reaches a minimum or a satisfactory level of convergence is achieved.

    The Interplay

    Back propagation provides the essential information needed for gradient descent to operate. By calculating the gradients of the loss function with respect to each parameter, back propagation tells gradient descent which direction to move each parameter to reduce the error.

    The sources emphasize that PyTorch handles the intricate mathematics of back propagation and gradient descent behind the scenes. When you define your model, loss function, and optimizer in PyTorch, and execute the training loop, these algorithms are automatically triggered to update the model’s parameters. The loss.backward() function triggers back propagation, and the optimizer.step() function performs the parameter update using gradient descent.
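
    To make this concrete, here is a minimal, self-contained sketch of where those two calls sit inside a PyTorch training loop. The toy data, model, and learning rate are illustrative only:

    import torch

    # Toy regression data: learn y = 2x
    X = torch.arange(0, 10, dtype=torch.float32).unsqueeze(dim=1)
    y = 2 * X

    model = torch.nn.Linear(in_features=1, out_features=1)
    loss_fn = torch.nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(100):
        y_pred = model(X)              # 1. forward pass
        loss = loss_fn(y_pred, y)      # 2. calculate the loss
        optimizer.zero_grad()          # 3. reset gradients from the previous step
        loss.backward()                # 4. back propagation: compute gradients
        optimizer.step()               # 5. gradient descent: update parameters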

    Example: Imagine a neural network trained to classify images of cats and dogs. During training, back propagation calculates how much each weight and bias parameter contributes to misclassifications. Gradient descent then uses this information to adjust these parameters slightly, moving them in a direction that reduces the error. This process continues iteratively, gradually improving the model’s ability to distinguish between cats and dogs.

    In summary, back propagation and gradient descent form a powerful duo in machine learning. Back propagation determines the direction and magnitude of change needed in the model’s parameters, and gradient descent uses this information to iteratively update the parameters, driving the model towards better performance. While the mathematical details can be complex, PyTorch simplifies this process by abstracting away the complexity, allowing you to focus on building and training models without manually implementing these algorithms.

    The Role of Random Seeds in Neural Networks

    The sources provide a detailed explanation of how randomness plays a crucial role in the initialization and training of neural networks. Here’s how random seeds influence this randomness:

    Random Initialization of Parameters:

    • Neural networks start with random values for their parameters, such as weights and biases. [1, 2] This random initialization is essential for breaking symmetry and allowing the model to explore different regions of the parameter space during training.
    • Without random initialization, all neurons in a layer would learn the same features, hindering the network’s ability to learn complex patterns. [This point is not explicitly mentioned in your sources, so you may want to independently verify it.]

    Sources of Randomness in PyTorch:

    • PyTorch uses pseudo-random number generators to create these random values. [3] Pseudo-randomness means that while the generated numbers appear random, they are actually determined by a deterministic algorithm.
    • Random Tensor Creation: When you create a random tensor in PyTorch using functions like torch.rand(), the underlying random number generator determines the values within that tensor. [1, 4] Each time you run the code, you get a different set of random values.

    The Impact of Random Seeds:

    • Reproducibility: The problem with this inherent randomness is that it makes it difficult to reproduce experiments. If you share your code with someone else, they will likely get different results due to the different random initializations.
    • Controlling the Randomness: A random seed allows you to “flavor” the randomness. [5] Setting a seed using torch.manual_seed() ensures that the random number generator starts from a specific point, producing the same sequence of random numbers every time you run the code. [6]
    • Flavors of Randomness: Think of each seed value as a different “flavor” of randomness. [6] While the numbers will still be random, they will be the same random numbers every time you use that specific seed.

    Benefits of Using Random Seeds:

    • Consistent Results: Using a random seed enables you to reproduce experiments and ensure consistency in your results. This is essential for debugging, sharing code, and comparing different model architectures or hyperparameter settings.
    • Controlled Experiments: Random seeds allow you to control the randomness in your experiments, isolating the effects of other factors like model architecture or hyperparameters.

    Limitations of Random Seeds:

    • Not True Randomness: It’s important to remember that even with a random seed, you’re not getting true randomness. The numbers are still generated by a deterministic algorithm.
    • Generalization: Over-reliance on a specific random seed may lead to a model that overfits to the specific random initialization, potentially harming its ability to generalize to unseen data. [This is a point worth noting but is not from your sources.]

    Practical Usage:

    • The sources recommend setting the random seed at the beginning of your code using torch.manual_seed(). [6] This ensures that all subsequent random operations in your code, such as creating random tensors or initializing model parameters, use the same “flavored” randomness.
    • CUDA Random Seed: If you’re using CUDA for GPU computations, you also need to set the CUDA random seed using torch.cuda.manual_seed() to control randomness on the GPU. [7]
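
    A short sketch of this in practice (the seed value 42 is just a common convention):

    import torch

    RANDOM_SEED = 42

    torch.manual_seed(RANDOM_SEED)
    tensor_a = torch.rand(3, 4)

    torch.manual_seed(RANDOM_SEED)   # reset the seed before the second call
    tensor_b = torch.rand(3, 4)

    print(torch.equal(tensor_a, tensor_b))  # True: same "flavored" randomness

    # On a GPU, the CUDA generator needs its own seed as well
    if torch.cuda.is_available():
        torch.cuda.manual_seed(RANDOM_SEED)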

    In summary, random seeds are a valuable tool for controlling the randomness inherent in neural networks, enabling reproducibility and facilitating controlled experiments. By setting a random seed, you ensure that the same “flavored” randomness is used every time you run your code, leading to consistent and reproducible results.

    Here is an in-depth explanation of how batch size influences the efficiency of training a neural network, drawing on the information provided in the sources and our previous conversation.

    Batch Size: A Balancing Act for Efficiency and Performance

    Batch size is a crucial hyperparameter in machine learning that determines how many samples are processed by the model before updating its parameters. Instead of processing the entire training dataset in one go, the data is divided into smaller groups called batches. The model iterates through these batches, updating its parameters after processing each batch.

    Impact of Batch Size on Training:

    • Computational Efficiency: The sources highlight that batch size significantly impacts computational efficiency. Processing a large batch of images requires significant memory and computational power. Using a smaller batch size can make training more manageable, especially when dealing with limited hardware resources or large datasets.
    • Gradient Update Frequency: A smaller batch size leads to more frequent updates to the model’s parameters because the gradients are calculated and applied after each batch. This can lead to faster convergence, especially in the early stages of training.
    • Generalization: Using smaller batch sizes can also improve the model’s ability to generalize to unseen data. This is because the model is exposed to a more diverse set of samples during each epoch, potentially leading to a more robust representation of the data.

    Choosing the Right Batch Size:

    • Hardware Constraints: The sources emphasize that hardware constraints play a significant role in determining the batch size. If you have a powerful GPU with ample memory, you can use larger batch sizes without running into memory issues. However, if you’re working with limited hardware, smaller batch sizes may be necessary.
    • Dataset Size: The size of your dataset also influences the choice of batch size. For smaller datasets, you might be able to use larger batch sizes, but for massive datasets, smaller batch sizes are often preferred.
    • Experimentation: Finding the optimal batch size often involves experimentation. The sources recommend starting with a common batch size like 32 and adjusting it based on the specific problem and hardware limitations.

    Mini-Batch Gradient Descent:

    • Efficiency and Performance Trade-off: The concept of using batches to train a neural network is called mini-batch gradient descent. Mini-batch gradient descent strikes a balance between the computational efficiency of batch gradient descent (processing the entire dataset in one go) and the faster convergence of stochastic gradient descent (processing one sample at a time).
    • Advantages of Mini-Batches: The sources list two primary benefits of using mini-batches:
    1. Computational Efficiency: Mini-batches make it feasible to train models on large datasets that might not fit entirely in memory.
    2. More Frequent Gradient Updates: More frequent updates lead to potentially faster convergence and can help the model escape local minima during training.

    Example from the Sources:

    • In the context of image classification using the Fashion MNIST dataset, the sources demonstrate how a batch size of 32 is used to divide the 60,000 training images into smaller, manageable batches. This allows the model to process and learn from the data more efficiently.
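
    For reference, a minimal sketch of that setup, assuming torchvision is installed and the dataset can be downloaded:

    import torch
    from torch.utils.data import DataLoader
    from torchvision import datasets
    from torchvision.transforms import ToTensor

    train_data = datasets.FashionMNIST(
        root="data",           # where to download the data to
        train=True,
        download=True,
        transform=ToTensor(),  # convert PIL images to tensors
    )

    # Divide the 60,000 training images into batches of 32
    train_dataloader = DataLoader(train_data, batch_size=32, shuffle=True)

    print(len(train_dataloader))  # 1875 batches (60,000 / 32)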

    Key Considerations When Choosing Batch Size:

    • Larger Batch Sizes:
    • Can be more computationally efficient because you’re processing more samples in one go.
    • Can potentially lead to smoother gradient updates, but may also require more memory.
    • Risk of overfitting if the batch size is too large and doesn’t expose the model to sufficiently diverse samples.
    • Smaller Batch Sizes:
    • Lead to more frequent gradient updates, potentially speeding up convergence, especially in the early stages of training.
    • Can help the model generalize better to unseen data due to exposure to more diverse samples during training.
    • May be less computationally efficient since fewer samples are processed at a time.

    In conclusion, batch size is a critical hyperparameter that significantly influences the efficiency of training a neural network. Choosing the right batch size involves considering hardware constraints, dataset size, and experimental findings. Mini-batch gradient descent, by processing the data in batches, offers a balance between computational efficiency and performance, enabling the training of complex models on large datasets.

    Please note that while I have tried my best to provide a comprehensive response, some insights may be based on general machine learning knowledge and best practices that extend beyond the specific details provided in the sources. You may want to independently verify this additional information.

    By Amjad Izhar
    Contact: amjad.izhar@gmail.com
    https://amjadizhar.blog

    Deep Learning and Machine Learning with PyTorch

    Short-Answer Quiz

    Instructions: Answer the following questions in 2-3 sentences each.

    1. What are the key differences between a scalar, a vector, a matrix, and a tensor in PyTorch?
    2. How can you determine the number of dimensions of a tensor in PyTorch?
    3. Explain the concept of “shape” in relation to PyTorch tensors.
    4. Describe how to create a PyTorch tensor filled with ones and specify its data type.
    5. What is the purpose of the torch.zeros_like() function?
    6. How do you convert a PyTorch tensor from one data type to another?
    7. Explain the importance of ensuring tensors are on the same device and have compatible data types for operations.
    8. What are tensor attributes, and provide two examples?
    9. What is tensor broadcasting, and what are the two key rules for its operation?
    10. Define tensor aggregation and provide two examples of aggregation functions in PyTorch.

    Short-Answer Quiz Answer Key

    1. In PyTorch, a scalar is a single number, a vector is an array of numbers with direction, a matrix is a 2-dimensional array of numbers, and a tensor is a multi-dimensional array that encompasses scalars, vectors, and matrices. All of these are represented as torch.Tensor objects in PyTorch.
    2. The number of dimensions of a tensor can be determined using the tensor.ndim attribute, which returns the number of dimensions or axes present in the tensor.
    3. The shape of a tensor refers to the number of elements along each dimension of the tensor. It is represented as a tuple, where each element in the tuple corresponds to the size of each dimension.
    4. To create a PyTorch tensor filled with ones, use torch.ones(size) where size is a tuple specifying the desired dimensions. To specify the data type, use the dtype parameter, for example, torch.ones(size, dtype=torch.float64).
    5. The torch.zeros_like() function creates a new tensor filled with zeros, having the same shape and data type as the input tensor. It is useful for quickly creating a tensor with the same structure but with zero values.
    6. To convert a PyTorch tensor from one data type to another, use the .type() method, specifying the desired data type as an argument. For example, to convert a tensor to float16: tensor = tensor.type(torch.float16).
    7. PyTorch operations require tensors to be on the same device (CPU or GPU) and have compatible data types for successful computation. Performing operations on tensors with mismatched devices or incompatible data types will result in errors.
    8. Tensor attributes provide information about the tensor’s properties. Two examples are:
    • dtype: Specifies the data type of the tensor elements.
    • shape: Represents the dimensionality of the tensor as a tuple.
    9. Tensor broadcasting allows operations between tensors with different shapes, automatically expanding the smaller tensor to match the larger one under certain conditions (see the short sketch after this answer key). The two key rules for broadcasting are:
    • Dimensions are compared from the trailing (rightmost) dimension; each pair of dimensions must be equal, or one of them must be 1.
    • The resulting tensor takes the larger size along each broadcast dimension.
    10. Tensor aggregation involves reducing the elements of a tensor to a single value using specific functions. Two examples are:
    • torch.min(): Finds the minimum value in a tensor.
    • torch.mean(): Calculates the average value of the elements in a tensor.
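
    As a short sketch of the broadcasting rules described in answer 9 (shapes chosen purely for illustration):

    import torch

    matrix = torch.ones(3, 4)                # shape (3, 4)
    row = torch.tensor([1., 2., 3., 4.])     # shape (4,)

    # Trailing dimensions match (4 and 4), so `row` is stretched
    # across all three rows of `matrix`; the result has shape (3, 4).
    result = matrix + row
    print(result.shape)  # torch.Size([3, 4])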

    Essay Questions

    1. Discuss the concept of dimensionality in PyTorch tensors. Explain how to create tensors with different dimensions and demonstrate how to access specific elements within a tensor. Provide examples and illustrate the relationship between dimensions, shape, and indexing.
    2. Explain the importance of data types in PyTorch. Describe different data types available for tensors and discuss the implications of choosing specific data types for tensor operations. Provide examples of data type conversion and highlight potential issues arising from data type mismatches.
    3. Compare and contrast the torch.reshape(), torch.view(), and torch.permute() functions. Explain their functionalities, use cases, and any potential limitations or considerations. Provide code examples to illustrate their usage.
    4. Discuss the purpose and functionality of the PyTorch nn.Module class. Explain how to create custom neural network modules by subclassing nn.Module. Provide a code example demonstrating the creation of a simple neural network module with at least two layers.
    5. Describe the typical workflow for training a neural network model in PyTorch. Explain the steps involved, including data loading, model creation, loss function definition, optimizer selection, training loop implementation, and model evaluation. Provide a code example outlining the essential components of the training process.

    Glossary of Key Terms

    Tensor: A multi-dimensional array, the fundamental data structure in PyTorch.

    Dimensionality: The number of axes or dimensions present in a tensor.

    Shape: A tuple representing the size of each dimension in a tensor.

    Data Type: The type of values stored in a tensor (e.g., float32, int64).

    Tensor Broadcasting: Automatically expanding the dimensions of tensors during operations to enable compatibility.

    Tensor Aggregation: Reducing the elements of a tensor to a single value using functions like min, max, or mean.

    nn.Module: The base class for building neural network modules in PyTorch.

    Forward Pass: The process of passing input data through a neural network to obtain predictions.

    Loss Function: A function that measures the difference between predicted and actual values during training.

    Optimizer: An algorithm that adjusts the model’s parameters to minimize the loss function.

    Training Loop: Iteratively performing forward passes, loss calculation, and parameter updates to train a model.

    Device: The hardware used for computation (CPU or GPU).

    Data Loader: An iterable that efficiently loads batches of data for training or evaluation.

    Exploring Deep Learning with PyTorch

    Fundamentals of Tensors

    1. Understanding Tensors

    • Introduction to tensors, the fundamental data structure in PyTorch.
    • Differentiating between scalars, vectors, matrices, and tensors.
    • Exploring tensor attributes: dimensions, shape, and indexing.

    2. Manipulating Tensors

    • Creating tensors with varying data types, devices, and gradient tracking.
    • Performing arithmetic operations on tensors and managing potential data type errors.
    • Reshaping tensors, understanding the concept of views, and employing stacking operations like torch.stack, torch.vstack, and torch.hstack.
    • Utilizing torch.squeeze to remove single dimensions and torch.unsqueeze to add them.
    • Practicing advanced indexing techniques on multi-dimensional tensors.

    3. Tensor Aggregation and Comparison

    • Exploring tensor aggregation with functions like torch.min, torch.max, and torch.mean.
    • Utilizing torch.argmin and torch.argmax to find the indices of minimum and maximum values.
    • Understanding element-wise tensor comparison and its role in machine learning tasks.

    Building Neural Networks

    4. Introduction to torch.nn

    • Introducing the torch.nn module, the cornerstone of neural network construction in PyTorch.
    • Exploring the concept of neural network layers and their role in transforming data.
    • Utilizing matplotlib for data visualization and understanding PyTorch version compatibility.

    5. Linear Regression with PyTorch

    • Implementing a simple linear regression model using PyTorch.
    • Generating synthetic data, splitting it into training and testing sets.
    • Defining a linear model with parameters, understanding gradient tracking with requires_grad.
    • Setting up a training loop, iterating through epochs, performing forward and backward passes, and optimizing model parameters.

    6. Non-Linear Regression with PyTorch

    • Transitioning from linear to non-linear regression.
    • Introducing non-linear activation functions like ReLU and Sigmoid.
    • Visualizing the impact of activation functions on data transformations.
    • Implementing custom ReLU and Sigmoid functions and comparing them with PyTorch’s built-in versions.

    Working with Datasets and Data Loaders

    7. Multi-Class Classification with PyTorch

    • Exploring multi-class classification using the make_blobs dataset from scikit-learn.
    • Setting hyperparameters for data creation, splitting data into training and testing sets.
    • Visualizing multi-class data with matplotlib and understanding the relationship between features and labels.
    • Converting NumPy arrays to PyTorch tensors, managing data type consistency between NumPy and PyTorch.

    8. Building a Multi-Class Classification Model

    • Constructing a multi-class classification model using PyTorch.
    • Defining a model class, utilizing linear layers and activation functions.
    • Implementing the forward pass, calculating logits and probabilities.
    • Setting up a training loop, calculating loss, performing backpropagation, and optimizing model parameters.

    9. Model Evaluation and Prediction

    • Evaluating the trained multi-class classification model.
    • Making predictions using the model and converting probabilities to class labels.
    • Visualizing model predictions and comparing them to true labels.

    10. Introduction to Data Loaders

    • Understanding the importance of data loaders in PyTorch for efficient data handling.
    • Implementing data loaders using torch.utils.data.DataLoader for both training and testing data.
    • Exploring data loader attributes and understanding their role in data batching and shuffling.

    11. Building a Convolutional Neural Network (CNN)

    • Introduction to CNNs, a specialized architecture for image and sequence data.
    • Implementing a CNN using PyTorch’s nn.Conv2d layer, understanding concepts like kernels, strides, and padding.
    • Flattening convolutional outputs using nn.Flatten and connecting them to fully connected layers.
    • Defining a CNN model class, implementing the forward pass, and understanding the flow of data through the network.

    12. Training and Evaluating a CNN

    • Setting up a training loop for the CNN model, utilizing device-agnostic code for CPU and GPU compatibility.
    • Implementing helper functions for training and evaluation, calculating loss, accuracy, and training time.
    • Visualizing training progress, tracking loss and accuracy over epochs.

    13. Transfer Learning with Pre-trained Models

    • Exploring the concept of transfer learning, leveraging pre-trained models for faster training and improved performance.
    • Introducing torchvision, a library for computer vision tasks, and understanding its dataset and model functionalities.
    • Implementing data transformations using torchvision.transforms for data augmentation and pre-processing.

    14. Custom Datasets and Data Augmentation

    • Creating custom datasets using torch.utils.data.Dataset for managing image data.
    • Implementing data transformations for resizing, converting to tensors, and normalizing images.
    • Visualizing data transformations and understanding their impact on image data.
    • Implementing data augmentation techniques to increase data variability and improve model robustness.

    15. Advanced CNN Architectures and Optimization

    • Exploring advanced CNN architectures, understanding concepts like convolutional blocks, residual connections, and pooling layers.
    • Implementing a more complex CNN model using convolutional blocks and exploring its performance.
    • Optimizing the training process, introducing learning rate scheduling and momentum-based optimizers.


    Briefing Doc: Deep Dive into PyTorch for Deep Learning

    This briefing document summarizes key themes and concepts extracted from excerpts of the “748-PyTorch for Deep Learning & Machine Learning – Full Course.pdf” focusing on PyTorch fundamentals, tensor manipulation, model building, and training.

    Core Themes:

    1. Tensors: The Heart of PyTorch:
    • Understanding Tensors:
    • Tensors are multi-dimensional arrays representing numerical data in PyTorch.
    • Understanding dimensions, shapes, and data types of tensors is crucial.
    • Scalar, Vector, Matrix, and Tensor are different names for tensors with varying dimensions.
    • “Dimension is like the number of square brackets… the shape of the vector is two. So we have two by one elements. So that means a total of two elements.”
    • Manipulating Tensors:
    • Reshaping, viewing, stacking, squeezing, and unsqueezing tensors are essential for preparing data.
    • Indexing and slicing allow access to specific elements within a tensor.
    • “Reshape has to be compatible with the original dimensions… view of a tensor shares the same memory as the original input.”
    • Tensor Operations:
    • PyTorch provides various operations for manipulating tensors, including arithmetic, aggregation, and matrix multiplication.
    • Understanding broadcasting rules is vital for performing element-wise operations on tensors of different shapes.
    • “The min of this tensor would be 27. So you’re turning it from nine elements to one element, hence aggregation.”
    2. Building Neural Networks with PyTorch:
    • torch.nn Module:
    • This module provides building blocks for constructing neural networks, including layers, activation functions, and loss functions.
    • nn.Module is the base class for defining custom models.
    • “nn is the building block layer for neural networks. And within nn, so nn stands for neural network, is module.”
    • Model Construction:
    • Defining a model involves creating layers and arranging them in a specific order.
    • nn.Sequential allows stacking layers in a sequential manner.
    • Custom models can be built by subclassing nn.Module and defining the forward method.
    • “Can you see what’s going on here? So as you might have guessed, sequential, it implements most of this code for us”
    • Parameters and Gradients:
    • Model parameters are tensors that store the model’s learned weights and biases.
    • Gradients are used during training to update these parameters.
    • requires_grad=True enables gradient tracking for a tensor.
    • “Requires grad optional. If the parameter requires gradient. Hmm. What does requires gradient mean? Well, let’s come back to that in a second.”
    3. Training Neural Networks:
    • Training Loop:
    • The training loop iterates over the dataset multiple times (epochs) to optimize the model’s parameters.
    • Each iteration involves a forward pass (making predictions), calculating the loss, performing backpropagation, and updating parameters.
    • “Epochs, an epoch is one loop through the data…So epochs, we’re going to start with one. So one time through all of the data.”
    • Optimizers:
    • Optimizers, like Stochastic Gradient Descent (SGD), are used to update model parameters based on the calculated gradients.
    • “Optimise a zero grad, loss backwards, optimise a step, step, step.”
    • Loss Functions:
    • Loss functions measure the difference between the model’s predictions and the actual targets.
    • The choice of loss function depends on the specific task (e.g., mean squared error for regression, cross-entropy for classification).
    4. Data Handling and Visualization:
    • Data Loading:
    • PyTorch provides DataLoader for efficiently iterating over datasets in batches.
    • “DataLoader, this creates a python iterable over a data set.”
    • Data Transformations:
    • The torchvision.transforms module offers various transformations for preprocessing images, such as converting to tensors, resizing, and normalization.
    • Visualization:
    • matplotlib is a commonly used library for visualizing data and model outputs.
    • Visualizing data and model predictions is crucial for understanding the learning process and debugging potential issues.
    5. Device Agnostic Code:
    • PyTorch allows running code on different devices (CPU or GPU).
    • Writing device agnostic code ensures flexibility and portability.
    • “Device agnostic code for the model and for the data.”

    Important Facts:

    • PyTorch’s default tensor data type is torch.float32.
    • CUDA (Compute Unified Device Architecture) enables utilizing GPUs for accelerated computations.
    • torch.no_grad() disables gradient tracking, often used during inference or evaluation.
    • torch.argmax finds the index of the maximum value in a tensor.

    Next Steps:

    • Explore different model architectures (CNNs, RNNs, etc.).
    • Implement various optimizers and loss functions.
    • Work with more complex datasets and tasks.
    • Experiment with hyperparameter tuning.
    • Dive deeper into PyTorch’s documentation and tutorials.

    Traditional Programming vs. Machine Learning

    Traditional programming involves providing the computer with data and explicit rules to generate output. Machine learning, on the other hand, involves providing the computer with data and desired outputs, allowing the computer to learn the rules for itself. [1, 2]

    Here’s a breakdown of the differences, illustrated with the example of creating a program for cooking a Sicilian grandmother’s roast chicken dish:

    Traditional Programming

    • Input: Vegetables, chicken
    • Rules: Cut vegetables, season chicken, preheat oven, cook chicken for 30 minutes, add vegetables (rules are explicitly programmed)
    • Output: Roast chicken dish

    Machine Learning

    • Input: A collection of various ingredients (vegetables, chicken)
    • Desired output: A collection of images showcasing the finished roast chicken dish
    • Process: The machine learning algorithm analyzes the inputs and desired outputs to figure out the rules (patterns) for transforming the ingredients into the desired dish. [2-4]

    Key Differences in Approach

    • In traditional programming, humans define the rules, while in machine learning, the computer learns the rules. [2, 3]
    • Traditional programming requires a complete understanding of the problem and the steps to solve it. In contrast, machine learning can be used for complex problems where explicitly defining all the rules would be challenging. [5]

    Why Use Machine Learning?

    Machine learning excels in scenarios where:

    • Complex problems involve long lists of rules: For instance, teaching a computer to drive a car involves countless rules that would be difficult to program manually. [5]
    • Environments are constantly changing: A self-driving car needs to adapt to unpredictable situations on the road. Machine learning enables algorithms to learn and adjust based on new data. [6]
    • Large datasets offer valuable insights: Machine learning algorithms can uncover hidden patterns and relationships within massive datasets. [7]

    The sources also emphasize that machine learning isn’t always the solution. Google’s number one rule of machine learning advises that if a problem can be solved with a simple rule-based system, that approach should be preferred. [8-10]

    Here is how PyTorch leverages tensors and neural networks for deep learning:

    • Tensors: Deep learning relies on numerical data representation. In PyTorch, this is done using tensors. Tensors are multi-dimensional arrays of numbers that can represent various data types, including images, audio, and text [1-3].
    • Neural Networks: Neural networks are a fundamental aspect of deep learning, consisting of interconnected layers that perform mathematical operations on tensors [2, 4-6]. PyTorch provides the building blocks for creating these networks through the torch.nn module [7, 8].
    • GPU Acceleration: PyTorch leverages GPUs (Graphics Processing Units) to accelerate the computation of deep learning models [9]. GPUs excel at number crunching, originally designed for video games but now crucial for deep learning tasks due to their parallel processing capabilities [9, 10]. PyTorch uses CUDA, a parallel computing platform, to interface with NVIDIA GPUs, allowing for faster computations [10, 11].
    • Key Modules:
    • torch.nn: Contains layers, loss functions, and other components needed for constructing computational graphs (neural networks) [8, 12].
    • torch.nn.Parameter: Defines learnable parameters for the model, often set by PyTorch layers [12].
    • torch.nn.Module: The base class for all neural network modules; models should subclass this and override the forward method [12].
    • torch.optim: Contains optimizers that help adjust model parameters during training through gradient descent [13].
    • torch.utils.data.Dataset: The base class for creating custom datasets [14].
    • torch.utils.data.DataLoader: Creates a Python iterable over a dataset, allowing for batched data loading [14-16].
    • Workflow:
    1. Data Preparation: Involves loading, preprocessing, and transforming data into tensors [17, 18].
    2. Building a Model: Constructing a neural network by combining different layers from torch.nn [7, 19, 20].
    3. Loss Function: Choosing a suitable loss function to measure the difference between model predictions and the actual targets [21-24].
    4. Optimizer: Selecting an optimizer (e.g., SGD, Adam) to adjust the model’s parameters based on the calculated gradients [21, 22, 24-26].
    5. Training Loop: Implementing a training loop that iteratively feeds data through the model, calculates the loss, backpropagates the gradients, and updates the model’s parameters [22, 24, 27, 28].
    6. Evaluation: Evaluating the trained model on unseen data to assess its performance [24, 28].

    Overall, PyTorch uses tensors as the fundamental data structure and provides the necessary tools (modules, classes, and functions) to construct neural networks, optimize their parameters using gradient descent, and efficiently run deep learning models, often with GPU acceleration.

    Training, Evaluating, and Saving a Deep Learning Model Using PyTorch

    To train a deep learning model with PyTorch, you first need to prepare your data and turn it into tensors [1]. Tensors are the fundamental building blocks of deep learning and can represent almost any kind of data, such as images, videos, audio, or even DNA [2, 3]. Once your data is ready, you need to build or pick a pre-trained model to suit your problem [1, 4].

    • PyTorch offers a variety of pre-built deep learning models through resources like Torch Hub and torchvision.models [5]. These models can be used as is or adjusted for a specific problem through transfer learning [5].
    • If you are building your model from scratch, PyTorch provides a flexible and powerful framework for building neural networks using various layers and modules [6].
    • The torch.nn module contains all the building blocks for computational graphs, another term for neural networks [7, 8].
    • PyTorch also offers layers for specific tasks, such as convolutional layers for image data, linear layers for simple calculations, and many more [9].
    • The torch.nn.Module serves as the base class for all neural network modules [8, 10]. When building a model from scratch, you should subclass nn.Module and override the forward method to define the computations that your model will perform [8, 11].

    After choosing or building a model, you need to select a loss function and an optimizer [1, 4].

    • The loss function measures how wrong your model’s predictions are compared to the ideal outputs [12].
    • The optimizer takes into account the loss of a model and adjusts the model’s parameters, such as weights and biases, to improve the loss function [13].
    • The specific loss function and optimizer you use will depend on the problem you are trying to solve [14].

    With your data, model, loss function, and optimizer in place, you can now build a training loop [1, 13].

    • The training loop iterates through your training data, making predictions, calculating the loss, and updating the model’s parameters to minimize the loss [15].
    • PyTorch implements the mathematical algorithms of back propagation and gradient descent behind the scenes, making the training process relatively straightforward [16, 17].
    • The loss.backward() function calculates the gradients of the loss function with respect to each parameter in the model [18]. The optimizer.step() function then uses those gradients to update the model’s parameters in the direction that minimizes the loss [18].
    • You can monitor the training process by printing out the loss and other metrics [19].

    In addition to a training loop, you also need a testing loop to evaluate your model’s performance on data it has not seen during training [13, 20]. The testing loop is similar to the training loop but does not update the model’s parameters. Instead, it calculates the loss and other metrics to evaluate how well the model generalizes to new data [21, 22].

    To save your trained model, PyTorch provides several methods, including torch.save, torch.load, and torch.nn.Module.load_state_dict [23-25].

    • The recommended way to save and load a PyTorch model is by saving and loading its state dictionary [26].
    • The state dictionary is a Python dictionary object that maps each layer in the model to its parameter tensor [27].
    • You can save the state dictionary using torch.save and load it back in using torch.load and the model’s load_state_dict method [28, 29].
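
    A minimal sketch of this save/load pattern; the MyModel class and file name are illustrative placeholders:

    import torch
    from torch import nn

    class MyModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.layer = nn.Linear(in_features=2, out_features=1)

        def forward(self, x):
            return self.layer(x)

    model = MyModel()

    # Save only the state dictionary (the recommended approach)
    torch.save(model.state_dict(), "my_model.pth")

    # Load: create a fresh instance, then load the saved parameters into it
    loaded_model = MyModel()
    loaded_model.load_state_dict(torch.load("my_model.pth"))
    loaded_model.eval()  # set to evaluation mode before inference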

    By following this general workflow, you can train, evaluate, and save deep learning models using PyTorch for a wide range of real-world applications.

    A Comprehensive Discussion of the PyTorch Workflow

    The PyTorch workflow outlines the steps involved in building, training, and deploying deep learning models using the PyTorch framework. The sources offer a detailed walkthrough of this workflow, emphasizing its application in various domains, including computer vision and custom datasets.

    1. Data Preparation and Loading

    The foundation of any machine learning project lies in data. Getting your data ready is the crucial first step in the PyTorch workflow [1-3]. This step involves:

    • Data Acquisition: Gathering the data relevant to your problem. This could involve downloading existing datasets or collecting your own.
    • Data Preprocessing: Cleaning and transforming the raw data into a format suitable for training a machine learning model. This often includes handling missing values, normalizing numerical features, and converting categorical variables into numerical representations.
    • Data Transformation into Tensors: Converting the preprocessed data into PyTorch tensors. Tensors are multi-dimensional arrays that serve as the fundamental data structure in PyTorch [4-6]. This step uses torch.tensor to create tensors from various data types.
    • Dataset and DataLoader Creation:
    • Organizing the data into PyTorch datasets using torch.utils.data.Dataset. This involves defining how to access individual samples and their corresponding labels [7, 8].
    • Creating data loaders using torch.utils.data.DataLoader [7, 9-11]. Data loaders provide a Python iterable over the dataset, allowing you to efficiently iterate through the data in batches during training. They handle shuffling, batching, and other data loading operations.

    2. Building or Picking a Pre-trained Model

    Once your data is ready, the next step is to build or pick a pre-trained model [1, 2]. This is a critical decision that will significantly impact your model’s performance.

    • Pre-trained Models: PyTorch offers pre-built models through resources like Torch Hub and torchvision.models [12].
    • Benefits: Leveraging pre-trained models can save significant time and resources. These models have already learned useful features from large datasets, which can be adapted to your specific task through transfer learning [12, 13].
    • Transfer Learning: Involves fine-tuning a pre-trained model on your dataset, adapting its learned features to your problem. This is especially useful when working with limited data [12, 14].
    • Building from Scratch:
    • When Necessary: You might need to build a model from scratch if your problem is unique or if no suitable pre-trained models exist.
    • PyTorch Flexibility: PyTorch provides the tools to create diverse neural network architectures, including:
    • Multi-layer Perceptrons (MLPs): Composed of interconnected layers of neurons, often using torch.nn.Linear layers [15].
    • Convolutional Neural Networks (CNNs): Specifically designed for image data, utilizing convolutional layers (torch.nn.Conv2d) to extract spatial features [16-18].
    • Recurrent Neural Networks (RNNs): Suitable for sequential data, leveraging recurrent layers to process information over time.

    Key Considerations in Model Building:

    • Subclassing torch.nn.Module: PyTorch models typically subclass nn.Module and override the forward method to define the computational flow [19-23].
    • Understanding Layers: Familiarity with various PyTorch layers (available in torch.nn) is crucial for constructing effective models. Each layer performs specific mathematical operations that transform the data as it flows through the network [24-26].
    • Model Inspection:
    • print(model): Provides a basic overview of the model’s structure and parameters.
    • model.parameters(): Allows you to access and inspect the model’s learnable parameters [27].
    • Torch Info: This package offers a more programmatic way to obtain a detailed summary of your model, including the input and output shapes of each layer [28-30].
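
    A brief sketch of these inspection tools together, assuming the third-party torchinfo package is installed (pip install torchinfo):

    import torch
    from torch import nn
    from torchinfo import summary

    model = nn.Sequential(
        nn.Flatten(),
        nn.Linear(in_features=28 * 28, out_features=10),
    )

    print(model)                          # basic structural overview
    for name, param in model.named_parameters():
        print(name, param.shape)          # inspect learnable parameters

    summary(model, input_size=(32, 1, 28, 28))  # detailed per-layer summary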

    3. Setting Up a Loss Function and Optimizer

    Training a deep learning model involves optimizing its parameters to minimize a loss function. Therefore, choosing the right loss function and optimizer is essential [31-33].

    • Loss Function: Measures the difference between the model’s predictions and the actual target values. The choice of loss function depends on the type of problem you are solving [34, 35]:
    • Regression: Mean Squared Error (MSE) or Mean Absolute Error (MAE) are common choices [36].
    • Binary Classification: Binary Cross Entropy (BCE) is often used [35-39]. PyTorch offers variations like torch.nn.BCELoss and torch.nn.BCEWithLogitsLoss. The latter combines a sigmoid layer with the BCE loss, often simplifying the code [38, 39].
    • Multi-Class Classification: Cross Entropy Loss is a standard choice [35-37].
    • Optimizer: Responsible for updating the model’s parameters based on the calculated gradients to minimize the loss function [31-33, 40]. Popular optimizers in PyTorch include:
    • Stochastic Gradient Descent (SGD): A foundational optimization algorithm [35, 36, 41, 42].
    • Adam: An adaptive optimization algorithm often offering faster convergence [35, 36, 42].

    PyTorch provides various loss functions in torch.nn and optimizers in torch.optim [7, 40, 43].
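
    A short sketch of these typical pairings (the model here is a placeholder):

    import torch
    from torch import nn

    model = nn.Linear(in_features=10, out_features=1)

    # Regression
    mse_loss = nn.MSELoss()        # Mean Squared Error
    mae_loss = nn.L1Loss()         # Mean Absolute Error

    # Binary classification (sigmoid + BCE combined in one module)
    bce_loss = nn.BCEWithLogitsLoss()

    # Multi-class classification
    ce_loss = nn.CrossEntropyLoss()

    # Optimizers from torch.optim
    sgd_optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    adam_optimizer = torch.optim.Adam(model.parameters(), lr=0.001)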

    4. Building a Training Loop

    The heart of the PyTorch workflow lies in the training loop [32, 44-46]. It’s where the model learns patterns in the data through repeated iterations of:

    • Forward Pass: Passing the input data through the model to generate predictions [47, 48].
    • Loss Calculation: Using the chosen loss function to measure the difference between the predictions and the actual target values [47, 48].
    • Back Propagation: Calculating the gradients of the loss with respect to each parameter in the model using loss.backward() [41, 47-49]. PyTorch handles this complex mathematical operation automatically.
    • Parameter Update: Updating the model’s parameters using the calculated gradients and the chosen optimizer (e.g., optimizer.step()) [41, 47, 49]. This step nudges the parameters in a direction that minimizes the loss.

    Key Aspects of a Training Loop:

    • Epochs: The number of times the training loop iterates through the entire training dataset [50].
    • Batches: Dividing the training data into smaller batches to improve computational efficiency and model generalization [10, 11, 51].
    • Monitoring Training Progress: Printing the loss and other metrics during training allows you to track how well the model is learning [50]. You can use techniques like progress bars (e.g., using the tqdm library) to visualize the training progress [52].

    5. Evaluation and Testing Loop

    After training, you need to evaluate your model’s performance on unseen data using a testing loop [46, 48, 53]. The testing loop is similar to the training loop, but it does not update the model’s parameters [48]. Its purpose is to assess how well the trained model generalizes to new data.

    Steps in a Testing Loop:

    • Setting Evaluation Mode: Switching the model to evaluation mode (model.eval()) deactivates certain layers like dropout, which are only needed during training [53, 54].
    • Inference Mode: Using PyTorch’s inference mode (torch.inference_mode()) disables gradient tracking and other computations unnecessary for inference, making the evaluation process faster [53-56].
    • Forward Pass: Making predictions on the test data by passing it through the model [57].
    • Loss and Metric Calculation: Calculating the loss and other relevant metrics (e.g., accuracy, precision, recall) to assess the model’s performance on the test data [53].
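
    A minimal sketch of these steps together; the untrained model and random test data are placeholders:

    import torch
    from torch import nn

    model = nn.Linear(in_features=2, out_features=1)
    loss_fn = nn.MSELoss()
    X_test = torch.rand(8, 2)
    y_test = torch.rand(8, 1)

    model.eval()                      # switch off training-only behavior (e.g., dropout)
    with torch.inference_mode():      # disable gradient tracking for faster inference
        test_pred = model(X_test)     # forward pass on unseen data
        test_loss = loss_fn(test_pred, y_test)  # no parameter updates here
    print(test_loss.item())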

    6. Saving and Loading the Model

    Once you have a trained model that performs well, you need to save it for later use or deployment [58]. PyTorch offers different ways to save and load models, including saving the entire model or saving its state dictionary [59].

    • State Dictionary: The recommended way is to save the model’s state dictionary [59, 60], which is a Python dictionary containing the model’s parameters. This approach is more efficient and avoids saving unnecessary information.

    Saving and Loading using State Dictionary:

    • Saving: torch.save(model.state_dict(), "model_filename.pth")
    • Loading:
    1. Create an instance of the model: loaded_model = MyModel()
    2. Load the state dictionary: loaded_model.load_state_dict(torch.load("model_filename.pth"))

    7. Improving the Model (Iterative Process)

    Building a successful deep learning model often involves an iterative process of experimentation and improvement [61-63]. After evaluating your initial model, you might need to adjust various aspects to enhance its performance. This includes:

    • Hyperparameter Tuning: Experimenting with different values for hyperparameters like learning rate, batch size, and model architecture [64].
    • Data Augmentation: Applying transformations to the training data (e.g., random cropping, flipping, rotations) to increase data diversity and improve model generalization [65].
    • Regularization Techniques: Using techniques like dropout or weight decay to prevent overfitting and improve model robustness.
    • Experiment Tracking: Utilizing tools like TensorBoard or Weights & Biases to track your experiments, log metrics, and visualize results [66]. This can help you gain insights into the training process and make informed decisions about model improvements.

    Additional Insights from the Sources:

    • Functionalization: As your models and training loops become more complex, it’s beneficial to functionalize your code to improve readability and maintainability [67]. The sources demonstrate this by creating functions for training and evaluation steps [68, 69].
    • Device Agnostic Code: PyTorch allows you to write code that can run on either a CPU or a GPU [70-73]. By using torch.device to determine the available device, you can make your code more flexible and efficient.
    • Debugging and Troubleshooting: The sources emphasize common debugging tips, such as printing shapes and values to check for errors and using the PyTorch documentation as a reference [9, 74-77].
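
    The device agnostic pattern mentioned above typically looks like this minimal sketch:

    import torch

    # Use the GPU if available, otherwise fall back to the CPU
    device = "cuda" if torch.cuda.is_available() else "cpu"

    model = torch.nn.Linear(2, 1).to(device)  # move the model to the device
    X = torch.rand(4, 2).to(device)           # move the data to the same device
    y_pred = model(X)                         # computation runs on `device`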

    By following the PyTorch workflow and understanding the key steps involved, you can effectively build, train, evaluate, and deploy deep learning models for various applications. The sources provide valuable code examples and explanations to guide you through this process, enabling you to tackle real-world problems with PyTorch.

    A Comprehensive Discussion of Neural Networks

    Neural networks are a cornerstone of deep learning, a subfield of machine learning. They are computational models inspired by the structure and function of the human brain. The sources, while primarily focused on the PyTorch framework, offer valuable insights into the principles and applications of neural networks.

    1. What are Neural Networks?

    Neural networks are composed of interconnected nodes called neurons, organized in layers. These layers typically include:

    • Input Layer: Receives the initial data, representing features or variables.
    • Hidden Layers: Perform computations on the input data, transforming it through a series of mathematical operations. A network can have multiple hidden layers, increasing its capacity to learn complex patterns.
    • Output Layer: Produces the final output, such as predictions or classifications.

    The connections between neurons have associated weights that determine the strength of the signal transmitted between them. During training, the network adjusts these weights to learn the relationships between input and output data.

    2. The Power of Linear and Nonlinear Functions

    Neural networks leverage a combination of linear and nonlinear functions to approximate complex relationships in data.

    • Linear functions represent straight lines. While useful, they are limited in their ability to model nonlinear patterns.
    • Nonlinear functions introduce curves and bends, allowing the network to capture more intricate relationships in the data.

    The sources illustrate this concept by demonstrating how a simple linear model struggles to separate circularly arranged data points. However, introducing nonlinear activation functions like ReLU (Rectified Linear Unit) allows the model to capture the nonlinearity and successfully classify the data.
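
    The sources also mention implementing custom ReLU and Sigmoid functions; a minimal sketch of what those might look like, compared against PyTorch's built-ins:

    import torch

    def relu(x: torch.Tensor) -> torch.Tensor:
        # Replace negative values with zero
        return torch.maximum(torch.tensor(0.0), x)

    def sigmoid(x: torch.Tensor) -> torch.Tensor:
        # Squash values into the range (0, 1)
        return 1 / (1 + torch.exp(-x))

    x = torch.linspace(-3, 3, steps=7)
    print(relu(x))      # matches torch.relu(x)
    print(sigmoid(x))   # matches torch.sigmoid(x)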

    3. Key Concepts and Terminology

    • Activation Functions: Nonlinear functions applied to the output of neurons, introducing nonlinearity into the network and enabling it to learn complex patterns. Common activation functions include sigmoid, ReLU, and tanh.
    • Layers: Building blocks of a neural network, each performing specific computations.
    • Linear Layers (torch.nn.Linear): Perform linear transformations on the input data using weights and biases.
    • Convolutional Layers (torch.nn.Conv2d): Specialized for image data, extracting features using convolutional kernels.
    • Pooling Layers: Reduce the spatial dimensions of feature maps, often used in CNNs.

    4. Architectures and Applications

    The specific arrangement of layers and their types defines the network’s architecture. Different architectures are suited to various tasks. The sources explore:

    • Multi-layer Perceptrons (MLPs): Basic neural networks with fully connected layers, often used for tabular data.
    • Convolutional Neural Networks (CNNs): Excellent at image recognition tasks, utilizing convolutional layers to extract spatial features.
    • Recurrent Neural Networks (RNNs): Designed for sequential data like text or time series, using recurrent connections to process information over time.

    5. Training Neural Networks

    Training a neural network involves adjusting its weights to minimize a loss function, which measures the difference between predicted and actual values. The sources outline the key steps of a training loop:

    1. Forward Pass: Input data flows through the network, generating predictions.
    2. Loss Calculation: The loss function quantifies the error between predictions and target values.
    3. Backpropagation: The algorithm calculates gradients of the loss with respect to each weight, indicating the direction and magnitude of weight adjustments needed to reduce the loss.
    4. Parameter Update: An optimizer (e.g., SGD or Adam) updates the weights based on the calculated gradients, moving them towards values that minimize the loss.

    6. PyTorch and Neural Network Implementation

    The sources demonstrate how PyTorch provides a flexible and powerful framework for building and training neural networks. Key features include:

    • torch.nn Module: Contains pre-built layers, activation functions, and other components for constructing neural networks.
    • Automatic Differentiation: PyTorch automatically calculates gradients during backpropagation, simplifying the training process.
    • GPU Acceleration: PyTorch allows you to leverage GPUs for faster training, especially beneficial for computationally intensive deep learning models.

    7. Beyond the Basics

    While the sources provide a solid foundation, the world of neural networks is vast and constantly evolving. Further exploration might involve:

    • Advanced Architectures: Researching more complex architectures like ResNet, Transformer networks, and Generative Adversarial Networks (GANs).
    • Transfer Learning: Utilizing pre-trained models to accelerate training and improve performance on tasks with limited data.
    • Deployment and Applications: Learning how to deploy trained models into real-world applications, from image recognition systems to natural language processing tools.

    By understanding the fundamental principles, architectures, and training processes, you can unlock the potential of neural networks to solve a wide range of problems across various domains. The sources offer a practical starting point for your journey into the world of deep learning.

    Training Machine Learning Models: A Deep Dive

    Building upon the foundation of neural networks, the sources provide a detailed exploration of the model training process, focusing on the practical aspects using PyTorch. Here’s an expanded discussion on the key concepts and steps involved:

    1. The Significance of the Training Loop

    The training loop lies at the heart of fitting a model to data, iteratively refining its parameters to learn the underlying patterns. This iterative process involves several key steps, often likened to a song with a specific sequence:

    1. Forward Pass: Input data, transformed into tensors, is passed through the model’s layers, generating predictions.
    2. Loss Calculation: The loss function quantifies the discrepancy between the model’s predictions and the actual target values, providing a measure of how “wrong” the model is.
    3. Optimizer Zero Grad: Before calculating new gradients, the gradients accumulated on the model’s parameters are reset to zero so they do not carry over from previous iterations.
    4. Loss Backwards: Backpropagation calculates the gradients of the loss with respect to each weight in the network, indicating how much each weight contributes to the error.
    5. Optimizer Step: The optimizer, using algorithms like Stochastic Gradient Descent (SGD) or Adam, adjusts the model’s weights based on the calculated gradients. These adjustments aim to nudge the weights in a direction that minimizes the loss.
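
    A minimal sketch of these five steps in PyTorch (assuming a model, train_dataloader, loss_fn, and optimizer have already been created):

    for epoch in range(epochs):
        model.train()
        for X, y in train_dataloader:
            y_pred = model(X)            # 1. Forward pass
            loss = loss_fn(y_pred, y)    # 2. Calculate the loss
            optimizer.zero_grad()        # 3. Reset accumulated gradients
            loss.backward()              # 4. Backpropagation
            optimizer.step()             # 5. Update the weights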

    2. Choosing a Loss Function and Optimizer

    The sources emphasize the crucial role of selecting an appropriate loss function and optimizer tailored to the specific machine learning task:

    • Loss Function: Different tasks require different loss functions. For example, binary classification tasks often use binary cross-entropy loss, while multi-class classification tasks use cross-entropy loss. The loss function guides the model’s learning by quantifying its errors.
    • Optimizer: Optimizers like SGD and Adam employ various algorithms to update the model’s weights during training. Selecting the right optimizer can significantly impact the model’s convergence speed and performance.

    3. Training and Evaluation Modes

    PyTorch provides distinct training and evaluation modes for models, each with specific settings to optimize performance:

    • Training Mode (model.train()): This mode activates training-specific behavior in components like dropout and batch normalization layers, which is essential for the learning process.
    • Evaluation Mode (model.eval()): This mode switches those same components to their inference behavior (e.g., dropout is turned off). It is typically paired with torch.inference_mode() or torch.no_grad() to disable gradient tracking, ensuring that the model’s behavior during testing reflects its true performance without the influence of training-specific mechanisms.
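
    Sketched in code (assuming a trained model and test inputs X_test):

    model.train()   # enable training behavior (dropout active, batch norm updating)
    # ... run the training loop ...

    model.eval()    # switch layers to evaluation behavior
    with torch.inference_mode():   # turn off gradient tracking for faster predictions
        test_preds = model(X_test)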

    4. Monitoring Progress with Loss Curves

    The sources introduce the concept of loss curves as visual tools to track the model’s performance during training. Loss curves plot the loss value over epochs (passes through the entire dataset). Observing these curves helps identify potential issues like underfitting or overfitting:

    • Underfitting: Indicated by a high and relatively unchanging loss value for both training and validation data, suggesting the model is not effectively learning the patterns in the data.
    • Overfitting: Characterized by a low training loss but a high validation loss, implying the model has memorized the training data but struggles to generalize to unseen data.

    5. Improving Through Experimentation

    Model training often involves an iterative process of experimentation to improve performance. The sources suggest several strategies for improving a model’s ability to learn and generalize:

    Model-centric approaches:

    • Adding more layers: Increasing the depth of the network can enhance its capacity to learn complex patterns.
    • Adding more hidden units: Expanding the width of layers can provide more representational power.
    • Changing the activation function: Experimenting with different activation functions like ReLU or sigmoid can influence the model’s nonlinearity and learning behavior.
    • Training for longer: Increasing the number of epochs allows the model more iterations to adjust its weights and potentially reach a lower loss.

    Data-centric approaches:

    • Data Augmentation: Artificially expanding the training dataset by applying transformations like rotations, flips, and crops can help the model generalize better to unseen data.

    6. Saving and Loading Models

    PyTorch enables saving and loading trained models, crucial for deploying models or resuming training from a previous state. This process often involves saving the model’s state dictionary, containing the learned weights and biases:

    • Saving a model (torch.save): Preserves the model’s state dictionary for later use.
    • Loading a model (torch.load): Retrieves a saved model’s state dictionary to restore a previously trained model.
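
    A common pattern (a sketch; the file name model.pth and the model class are hypothetical) is to save and load the model’s state dictionary rather than the whole model object:

    import torch

    # Save only the learned parameters (the state dict)
    torch.save(model.state_dict(), "model.pth")

    # Recreate the architecture, then load the saved parameters into it
    loaded_model = MyModel()  # hypothetical model class
    loaded_model.load_state_dict(torch.load("model.pth"))
    loaded_model.eval()       # switch to evaluation mode before making predictions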

    7. Going Beyond the Basics

    The sources provide a comprehensive foundation for understanding and implementing model training using PyTorch. As you progress, further exploration might include:

    • Advanced Optimizers: Investigating optimizers beyond SGD and Adam, such as RMSprop and Adagrad, each with different advantages and characteristics.
    • Hyperparameter Tuning: Exploring techniques like grid search and random search to systematically find optimal hyperparameters for the model, loss function, and optimizer.
    • Monitoring with TensorBoard: Utilizing TensorBoard, a visualization tool, to track various metrics like loss, accuracy, and gradients during training, providing insights into the learning process.

    By grasping the core principles of the training loop, the importance of loss functions and optimizers, and techniques for improving model performance, you gain the tools to effectively train neural networks and other machine learning models using PyTorch. The sources offer a practical guide to navigate the intricacies of model training, setting the stage for tackling more complex deep learning challenges.

    A Deep Dive into Computer Vision with PyTorch

    Building on the foundation of neural networks and model training, the sources provide an extensive exploration of computer vision using the PyTorch framework. They guide you through the process of building, training, and evaluating computer vision models, offering valuable insights into the core concepts and practical techniques involved.

    1. Understanding Computer Vision Problems

    Computer vision, broadly defined, encompasses tasks that enable computers to “see” and interpret visual information, mimicking human visual perception. The sources illustrate the vast scope of computer vision problems, ranging from basic classification to more complex tasks like object detection and image segmentation.

    Examples of Computer Vision Problems:

    • Image Classification: Assigning a label to an image from a predefined set of categories. For instance, classifying an image as containing a cat, dog, or bird.
    • Object Detection: Identifying and localizing specific objects within an image, often by drawing bounding boxes around them. Applications include self-driving cars recognizing pedestrians and traffic signs.
    • Image Segmentation: Dividing an image into meaningful regions, labeling each pixel with its corresponding object or category. This technique is used in medical imaging to identify organs and tissues.

    2. The Power of Convolutional Neural Networks (CNNs)

    The sources highlight CNNs as powerful deep learning models well-suited for computer vision tasks. CNNs excel at extracting spatial features from images using convolutional layers, mimicking the human visual system’s hierarchical processing of visual information.

    Key Components of CNNs:

    • Convolutional Layers: Perform convolutions using learnable filters (kernels) that slide across the input image, extracting features like edges, textures, and patterns.
    • Activation Functions: Introduce nonlinearity, allowing CNNs to model complex relationships between image features and output predictions.
    • Pooling Layers: Downsample feature maps, reducing computational complexity and making the model more robust to variations in object position and scale.
    • Fully Connected Layers: Combine features extracted by convolutional and pooling layers, generating final predictions for classification or other tasks.

    The sources provide practical insights into building CNNs using PyTorch’s torch.nn module, guiding you through the process of defining layers, constructing the network architecture, and implementing the forward pass.

    3. Working with Torchvision

    PyTorch’s Torchvision library emerges as a crucial tool for computer vision projects, offering a rich ecosystem of pre-built datasets, models, and transformations.

    Key Components of Torchvision:

    • Datasets: Provides access to popular computer vision datasets like MNIST, FashionMNIST, CIFAR, and ImageNet. These datasets simplify the process of obtaining and loading data for model training and evaluation.
    • Models: Offers pre-trained models for various computer vision tasks, allowing you to leverage the power of transfer learning by fine-tuning these models on your own datasets.
    • Transforms: Enables data preprocessing and augmentation. You can use transforms to resize, crop, flip, normalize, and augment images, artificially expanding your dataset and improving model generalization.
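
    A brief sketch of these components working together, using FashionMNIST as the example dataset:

    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    # Download FashionMNIST and convert each image to a tensor on load
    train_data = datasets.FashionMNIST(
        root="data",
        train=True,
        download=True,
        transform=transforms.ToTensor(),
    )

    # Batch the dataset for training
    train_dataloader = DataLoader(train_data, batch_size=32, shuffle=True)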

    4. The Computer Vision Workflow

    The sources outline a typical workflow for computer vision projects using PyTorch, emphasizing practical steps and considerations:

    1. Data Preparation: Obtaining or creating a suitable dataset, organizing it into appropriate folders (e.g., by class labels), and applying necessary preprocessing or transformations.
    2. Dataset and DataLoader: Utilizing PyTorch’s Dataset and DataLoader classes to efficiently load and batch data for training and evaluation.
    3. Model Construction: Defining the CNN architecture using PyTorch’s torch.nn module, specifying layers, activation functions, and other components based on the problem’s complexity and requirements.
    4. Loss Function and Optimizer: Selecting a suitable loss function that aligns with the task (e.g., cross-entropy loss for classification) and choosing an optimizer like SGD or Adam to update the model’s weights during training.
    5. Training Loop: Implementing the iterative training process, involving forward pass, loss calculation, backpropagation, and weight updates. Monitoring training progress using loss curves to identify potential issues like underfitting or overfitting.
    6. Evaluation: Assessing the model’s performance on a held-out test dataset using metrics like accuracy, precision, recall, and F1-score, depending on the task.
    7. Model Saving and Loading: Preserving trained models for later use or deployment using torch.save and loading them back using torch.load.
    8. Prediction on Custom Data: Demonstrating how to load and preprocess custom images, pass them through the trained model, and obtain predictions.

    5. Going Beyond the Basics

    The sources provide a comprehensive foundation, but computer vision is a rapidly evolving field. Further exploration might lead you to:

    • Advanced Architectures: Exploring more complex CNN architectures like ResNet, Inception, and EfficientNet, each designed to address challenges in image recognition.
    • Object Detection and Segmentation: Investigating specialized models and techniques for object detection (e.g., YOLO, Faster R-CNN) and image segmentation (e.g., U-Net, Mask R-CNN).
    • Transfer Learning in Depth: Experimenting with various pre-trained models and fine-tuning strategies to optimize performance on your specific computer vision tasks.
    • Real-world Applications: Researching how computer vision is applied in diverse domains, such as medical imaging, autonomous driving, robotics, and image editing software.

    By mastering the fundamentals of computer vision, understanding CNNs, and leveraging PyTorch’s powerful tools, you can build and deploy models that empower computers to “see” and understand the visual world. The sources offer a practical guide to navigate this exciting domain, equipping you with the skills to tackle a wide range of computer vision challenges.

    Understanding Data Augmentation in Computer Vision

    Data augmentation is a crucial technique in computer vision that artificially expands the diversity and size of a training dataset by applying various transformations to the existing images [1, 2]. This process enhances the model’s ability to generalize and learn more robust patterns, ultimately improving its performance on unseen data.

    Why Data Augmentation is Important

    1. Increased Dataset Diversity: Data augmentation introduces variations in the training data, exposing the model to different perspectives of the same image [2]. This prevents the model from overfitting, where it learns to memorize the specific details of the training set rather than the underlying patterns of the target classes.
    2. Reduced Overfitting: By making the training data more challenging, data augmentation forces the model to learn more generalizable features that are less sensitive to minor variations in the input images [3, 4].
    3. Improved Model Generalization: A model trained with augmented data is better equipped to handle unseen data, as it has learned to recognize objects and patterns under various transformations, making it more robust and reliable in real-world applications [1, 5].

    Types of Data Augmentations

    The sources highlight several commonly used data augmentation techniques, particularly within the context of PyTorch’s torchvision.transforms module [6-8].

    • Resize: Changing the dimensions of the images [9]. This helps standardize the input size for the model and can also introduce variations in object scale.
    • Random Horizontal Flip: Flipping the images horizontally with a certain probability [8]. This technique is particularly effective for objects that are symmetric or appear in both left-right orientations.
    • Random Rotation: Rotating the images by a random angle [3]. This helps the model learn to recognize objects regardless of their orientation.
    • Random Crop: Cropping random sections of the images [9, 10]. This forces the model to focus on different parts of the image and can also introduce variations in object position.
    • Color Jitter: Adjusting the brightness, contrast, saturation, and hue of the images [11]. This helps the model learn to recognize objects under different lighting conditions.

    Trivial Augment: A State-of-the-Art Approach

    The sources mention Trivial Augment, a data augmentation strategy used by the PyTorch team to achieve state-of-the-art results on their computer vision models [12, 13]. Trivial Augment leverages randomness to select and apply a combination of augmentations from a predefined set with varying intensities, leading to a diverse and challenging training dataset [14].
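
    In torchvision, this strategy is available as transforms.TrivialAugmentWide; a sketch of how it might slot into a transform pipeline:

    from torchvision import transforms

    trivial_augment_transform = transforms.Compose([
        transforms.Resize((64, 64)),
        transforms.TrivialAugmentWide(),  # randomly selects an augmentation and intensity per image
        transforms.ToTensor(),
    ])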

    Practical Implementation in PyTorch

    PyTorch’s torchvision.transforms module provides a comprehensive set of functions for data augmentation [6-8]. You can create a transform pipeline by composing a sequence of transformations using transforms.Compose. For example, a basic transform pipeline might include resizing, random horizontal flipping, and conversion to a tensor:

    from torchvision import transforms

    train_transform = transforms.Compose([
        transforms.Resize((64, 64)),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.ToTensor(),
    ])

    To apply data augmentation during training, you would pass this transform pipeline to the Dataset or DataLoader when loading your images [7, 15].
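
    For example (a sketch assuming the images live in class-named subfolders under a hypothetical data/train directory):

    from torchvision import datasets
    from torch.utils.data import DataLoader

    # The transform pipeline is applied to each image as it is loaded
    train_data = datasets.ImageFolder(root="data/train", transform=train_transform)
    train_dataloader = DataLoader(train_data, batch_size=32, shuffle=True)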

    Evaluating the Impact of Data Augmentation

    The sources emphasize the importance of comparing model performance with and without data augmentation to assess its effectiveness [16, 17]. By monitoring training metrics like loss and accuracy, you can observe how data augmentation influences the model’s learning process and its ability to generalize to unseen data [18, 19].

    The Crucial Role of Hyperparameters in Model Training

    Hyperparameters are external configurations that are set by the machine learning engineer or data scientist before training a model. They are distinct from the parameters of a model, which are the internal values (weights and biases) that the model learns from the data during training. Hyperparameters play a critical role in shaping the model’s architecture, behavior, and ultimately, its performance.

    Defining Hyperparameters

    As the sources explain, hyperparameters are values that we, as the model builders, control and adjust. In contrast, parameters are values that the model learns and updates during training. The sources use the analogy of parking a car:

    • Hyperparameters are akin to the external controls of the car, such as the steering wheel, accelerator, and brake, which the driver uses to guide the vehicle.
    • Parameters are like the internal workings of the engine and transmission, which adjust automatically based on the driver’s input.

    Impact of Hyperparameters on Model Training

    Hyperparameters directly influence the learning process of a model. They determine factors such as:

    • Model Complexity: Hyperparameters like the number of layers and hidden units dictate the model’s capacity to learn intricate patterns in the data. More layers and hidden units typically increase the model’s complexity and ability to capture nonlinear relationships. However, excessive complexity can lead to overfitting.
    • Learning Rate: The learning rate governs how much the optimizer adjusts the model’s parameters during each training step. A high learning rate allows for rapid learning but can lead to instability or divergence. A low learning rate ensures stability but may require longer training times.
    • Batch Size: The batch size determines how many training samples are processed together before updating the model’s weights. Smaller batches can lead to faster convergence but might introduce more noise in the gradients. Larger batches provide more stable gradients but can slow down training.
    • Number of Epochs: The number of epochs determines how many times the entire training dataset is passed through the model. More epochs can improve learning, but excessive training can also lead to overfitting.
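
    These hyperparameters map directly onto code (a sketch with arbitrary values, assuming train_data and model already exist):

    import torch
    from torch.utils.data import DataLoader

    BATCH_SIZE = 32        # samples processed before each weight update
    LEARNING_RATE = 0.01   # step size used by the optimizer
    EPOCHS = 10            # full passes through the training dataset

    train_dataloader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True)
    optimizer = torch.optim.SGD(params=model.parameters(), lr=LEARNING_RATE)

    for epoch in range(EPOCHS):
        ...  # training loop as described earlier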

    Example: Tuning Hyperparameters for a CNN

    Consider the task of building a CNN for image classification, as described in the sources. Several hyperparameters are crucial to the model’s performance:

    • Number of Convolutional Layers: This hyperparameter determines how many layers are used to extract features from the images. More layers allow for the capture of more complex features but increase computational complexity.
    • Kernel Size: The kernel size (filter size) in convolutional layers dictates the receptive field of the filters, influencing the scale of features extracted. Smaller kernels capture fine-grained details, while larger kernels cover wider areas.
    • Stride: The stride defines how the kernel moves across the image during convolution. A larger stride results in downsampling and a smaller feature map.
    • Padding: Padding adds extra pixels around the image borders before convolution, preventing information loss at the edges and ensuring consistent feature map dimensions.
    • Activation Function: Activation functions like ReLU introduce nonlinearity, enabling the model to learn complex relationships between features. The choice of activation function can significantly impact model performance.
    • Optimizer: The optimizer (e.g., SGD, Adam) determines how the model’s parameters are updated based on the calculated gradients. Different optimizers have different convergence properties and might be more suitable for specific datasets or architectures.

    By carefully tuning these hyperparameters, you can optimize the CNN’s performance on the image classification task. Experimentation and iteration are key to finding the best hyperparameter settings for a given dataset and model architecture.
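
    The effect of kernel size, stride, and padding on output shape can be checked directly; the output height/width follows floor((input + 2 × padding − kernel_size) / stride) + 1. A quick sketch:

    import torch
    from torch import nn

    # With kernel_size=3, stride=1, padding=1, spatial size is preserved:
    # floor((64 + 2*1 - 3) / 1) + 1 = 64
    conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=3, stride=1, padding=1)
    images = torch.rand(1, 3, 64, 64)   # (batch, channels, height, width)
    print(conv(images).shape)           # torch.Size([1, 10, 64, 64])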

    The Hyperparameter Tuning Process

    The sources highlight the iterative nature of finding the best hyperparameter configurations. There’s no single “best” set of hyperparameters that applies universally. The optimal settings depend on the specific dataset, model architecture, and task. The sources also emphasize:

    • Experimentation: Try different combinations of hyperparameters to observe their impact on model performance.
    • Monitoring Loss Curves: Use loss curves to gain insights into the model’s training behavior, identifying potential issues like underfitting or overfitting and adjusting hyperparameters accordingly.
    • Validation Sets: Employ a validation dataset to evaluate the model’s performance on unseen data during training, helping to prevent overfitting and select the best-performing hyperparameters.
    • Automated Techniques: Explore automated hyperparameter tuning methods like grid search, random search, or Bayesian optimization to efficiently search the hyperparameter space.

    By understanding the role of hyperparameters and mastering techniques for tuning them, you can unlock the full potential of your models and achieve optimal performance on your computer vision tasks.

    The Learning Process of Deep Learning Models

    Deep learning models learn from data by adjusting their internal parameters to capture patterns and relationships within the data. The sources provide a comprehensive overview of this process, particularly within the context of supervised learning using neural networks.

    1. Data Representation: Turning Data into Numbers

    The first step in deep learning is to represent the data in a numerical format that the model can understand. As the sources emphasize, “machine learning is turning things into numbers” [1, 2]. This process involves encoding various forms of data, such as images, text, or audio, into tensors, which are multi-dimensional arrays of numbers.

    2. Model Architecture: Building the Learning Framework

    Once the data is numerically encoded, a model architecture is defined. Neural networks are a common type of deep learning model, consisting of interconnected layers of neurons. Each layer performs mathematical operations on the input data, transforming it into increasingly abstract representations.

    • Input Layer: Receives the numerical representation of the data.
    • Hidden Layers: Perform computations on the input, extracting features and learning representations.
    • Output Layer: Produces the final output of the model, which is tailored to the specific task (e.g., classification, regression).

    3. Parameter Initialization: Setting the Starting Point

    The parameters of a neural network, typically weights and biases, are initially assigned random values. These parameters determine how the model processes the data and ultimately define its behavior.

    4. Forward Pass: Calculating Predictions

    During training, the data is fed forward through the network, layer by layer. Each layer performs its mathematical operations, using the current parameter values to transform the input data. The final output of the network represents the model’s prediction for the given input.

    5. Loss Function: Measuring Prediction Errors

    A loss function is used to quantify the difference between the model’s predictions and the true target values. The loss function measures how “wrong” the model’s predictions are, providing a signal for how to adjust the parameters to improve performance.

    6. Backpropagation: Calculating Gradients

    Backpropagation is the core algorithm that enables deep learning models to learn. It involves calculating the gradients of the loss function with respect to each parameter in the network. These gradients indicate the direction and magnitude of change needed for each parameter to reduce the loss.

    7. Optimizer: Updating Parameters

    An optimizer uses the calculated gradients to update the model’s parameters. The optimizer’s goal is to minimize the loss function by iteratively adjusting the parameters in the direction that reduces the error. Common optimizers include Stochastic Gradient Descent (SGD) and Adam.

    8. Training Loop: Iterative Learning Process

    The training loop encompasses the steps of forward pass, loss calculation, backpropagation, and parameter update. This process is repeated iteratively over the training data, allowing the model to progressively refine its parameters and improve its predictive accuracy.

    • Epochs: Each pass through the entire training dataset is called an epoch.
    • Batch Size: Data is typically processed in batches, where a batch is a subset of the training data.

    9. Evaluation: Assessing Model Performance

    After training, the model is evaluated on a separate dataset (validation or test set) to assess its ability to generalize to unseen data. Metrics like accuracy, precision, and recall are used to measure the model’s performance on the task.

    10. Hyperparameter Tuning: Optimizing the Learning Process

    Hyperparameters are external configurations that influence the model’s learning process. Examples include learning rate, batch size, and the number of layers. Tuning hyperparameters is crucial to achieving optimal model performance. This often involves experimentation and monitoring training metrics to find the best settings.

    Key Concepts and Insights

    • Iterative Learning: Deep learning models learn through an iterative process of making predictions, calculating errors, and adjusting parameters.
    • Gradient Descent: Backpropagation and optimizers work together to implement gradient descent, guiding the parameter updates towards minimizing the loss function.
    • Feature Learning: Hidden layers in neural networks automatically learn representations of the data, extracting meaningful features that contribute to the model’s predictive ability.
    • Nonlinearity: Activation functions introduce nonlinearity, allowing models to capture complex relationships in the data that cannot be represented by simple linear models.

    By understanding these fundamental concepts, you can gain a deeper appreciation for how deep learning models learn from data and achieve remarkable performance on a wide range of tasks.

    Key Situations for Deep Learning Solutions

    The sources provide a detailed explanation of when deep learning is a good solution and when simpler approaches might be more suitable. Here are three key situations where deep learning often excels:

    1. Problems with Long Lists of Rules

    Deep learning models are particularly effective when dealing with problems that involve a vast and intricate set of rules that would be difficult or impossible to program explicitly. The sources use the example of driving a car, which encompasses countless rules regarding navigation, safety, and traffic regulations.

    • Traditional programming struggles with such complexity, requiring engineers to manually define and code every possible scenario. This approach quickly becomes unwieldy and prone to errors.
    • Deep learning offers a more flexible and adaptable solution. Instead of explicitly programming rules, deep learning models learn from data, automatically extracting patterns and relationships that represent the underlying rules.

    2. Continuously Changing Environments

    Deep learning shines in situations where the environment or the data itself is constantly evolving. Unlike traditional rule-based systems, which require manual updates to adapt to changes, deep learning models can continuously learn and update their knowledge as new data becomes available.

    • The sources highlight the adaptability of deep learning, stating that models can “keep learning if it needs to” and “adapt and learn to new scenarios.”
    • This capability is crucial in applications such as self-driving cars, where road conditions, traffic patterns, and even driving regulations can change over time.

    3. Discovering Insights Within Large Collections of Data

    Deep learning excels at uncovering hidden patterns and insights within massive datasets. The ability to process vast amounts of data is a key advantage of deep learning, enabling it to identify subtle relationships and trends that might be missed by traditional methods.

    • The sources emphasize the flourishing of deep learning in handling large datasets, citing examples like the Food 101 dataset, which contains images of 101 different kinds of foods.
    • This capacity for large-scale data analysis is invaluable in fields such as medical image analysis, where deep learning can assist in detecting diseases, identifying anomalies, and predicting patient outcomes.

    In these situations, deep learning offers a powerful and flexible approach, allowing models to learn from data, adapt to changes, and extract insights from vast datasets, providing solutions that were previously challenging or even impossible to achieve with traditional programming techniques.

    The Most Common Errors in Deep Learning

    The sources highlight shape errors as one of the most prevalent challenges encountered by deep learning developers. The sources emphasize that this issue stems from the fundamental reliance on matrix multiplication operations in neural networks.

    • Neural networks are built upon interconnected layers, and matrix multiplication is the primary mechanism for data transformation between these layers. [1]
    • Shape errors arise when the dimensions of the matrices involved in these multiplications are incompatible. [1, 2]
    • The sources illustrate this concept by explaining that for matrix multiplication to succeed, the inner dimensions of the matrices must match. [2, 3]

    Three Big Errors in PyTorch and Deep Learning

    The sources further elaborate on this concept within the specific context of the PyTorch deep learning framework, identifying three primary categories of errors:

    1. Tensors not having the Right Data Type: The sources point out that using the incorrect data type for tensors can lead to errors, especially during the training of large neural networks. [4]
    2. Tensors not having the Right Shape: This echoes the earlier discussion of shape errors and their importance in matrix multiplication operations. [4]
    3. Device Issues: This category of errors arises when tensors are located on different devices, typically the CPU and GPU. PyTorch requires tensors involved in an operation to reside on the same device. [5]

    The Ubiquity of Shape Errors

    The sources consistently underscore the significance of understanding tensor shapes and dimensions in deep learning.

    • They emphasize that mismatches in input and output shapes between layers are a frequent source of errors. [6]
    • The process of reshaping, stacking, squeezing, and unsqueezing tensors is presented as a crucial technique for addressing shape-related issues. [7, 8]
    • The sources advise developers to become familiar with their data’s shape and consult documentation to understand the expected input shapes for various layers and operations. [9]

    Troubleshooting Tips and Practical Advice

    Beyond identifying shape errors as a common challenge, the sources offer practical tips and insights for troubleshooting such issues.

    • Understanding matrix multiplication rules: Developers are encouraged to grasp the fundamental rules governing matrix multiplication to anticipate and prevent shape errors. [3]
    • Visualizing matrix multiplication: The sources recommend using the website matrixmultiplication.xyz as a tool for visualizing matrix operations and understanding their dimensional requirements. [10]
    • Programmatic shape checking: The sources advocate for incorporating programmatic checks of tensor shapes using functions like tensor.shape to identify and debug shape mismatches. [11, 12]

    By understanding the importance of tensor shapes and diligently checking for dimensional compatibility, deep learning developers can mitigate the occurrence of shape errors and streamline their development workflow.

    Two Common Deep Learning Errors

    The sources describe three major errors faced by deep learning developers: tensors not having the correct data type, tensors not having the correct shape, and device issues. [1] Two particularly common errors are data type and shape mismatches. [1, 2]

    Data Type Mismatches

    The sources explain that using the wrong data type for a tensor, especially when training large neural networks, can lead to errors. [1] For example, the torch.mean() function requires a floating-point tensor (such as float32); if a long (int64) tensor is passed instead, an error occurs. [3] Data type mismatches can also occur with loss functions. For instance, if torch.nn.BCELoss is used instead of torch.nn.BCEWithLogitsLoss, the inputs to the loss function must first have gone through a sigmoid activation function. [4-6]
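
    A quick sketch of this kind of error and its fix:

    import torch

    long_tensor = torch.tensor([1, 2, 3])                # integer values default to int64 (long)
    # torch.mean(long_tensor)                            # RuntimeError: expects a floating-point dtype
    print(torch.mean(long_tensor.type(torch.float32)))   # tensor(2.)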

    Shape Mismatches

    Shape errors are extremely common in deep learning. [1, 2, 7-13] The sources explain that shape errors arise when the dimensions of matrices are incompatible during matrix multiplication operations. [7-9] To perform matrix multiplication, the inner dimensions of the matrices must match. [7, 14] Shape errors can also occur if the input or output shapes of tensors are mismatched between layers in a neural network. [11, 15] For example, a convolutional layer might expect a four-dimensional tensor, but if a three-dimensional tensor is used, an error will occur. [13] The sources recommend checking the shape of tensors frequently to catch these errors. [11, 16]
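
    A minimal sketch of a shape error and the habit of checking shapes first:

    import torch

    A = torch.rand(3, 2)
    B = torch.rand(3, 2)
    print(A.shape, B.shape)   # torch.Size([3, 2]) torch.Size([3, 2])
    # A @ B                   # RuntimeError: inner dimensions (2 and 3) do not match
    C = A @ B.T               # B.T has shape (2, 3), so the result has shape (3, 3)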

    Let’s go through the topics covered in the “PyTorch for Deep Learning & Machine Learning – Full Course” one by one.

    1. Introduction: Deep Learning vs. Traditional Programming

    The sources start by introducing deep learning as a subset of machine learning, which itself is a subset of artificial intelligence [1]. They explain the key difference between traditional programming and machine learning [2].

    • In traditional programming, we give the computer specific rules and data, and it produces the output.
    • In machine learning, we provide the computer with data and desired outputs, and it learns the rules to map the data to the outputs.

    The sources argue that deep learning is particularly well-suited for complex problems where it’s difficult to hand-craft rules [3, 4]. Examples include self-driving cars and image recognition. However, they also caution against using machine learning when a simpler, rule-based system would suffice [4, 5].

    2. PyTorch Fundamentals: Tensors and Operations

    The sources then introduce PyTorch, a popular deep learning framework written in Python [6, 7]. The core data structure in PyTorch is the tensor, a multi-dimensional array that can be used to represent various types of data [8].

    • The sources explain the different types of tensors: scalars, vectors, matrices, and higher-order tensors [9].
    • They demonstrate how to create tensors using torch.tensor() and showcase various operations like reshaping, indexing, stacking, and permuting [9-11].

    Understanding tensor shapes and dimensions is crucial for avoiding errors in deep learning, as highlighted in our previous conversation about shape mismatches [12].

    3. The PyTorch Workflow: From Data to Model

    The sources then outline a typical PyTorch workflow [13] for developing deep learning models:

    1. Data Preparation and Loading: The sources emphasize the importance of preparing data for machine learning [14] and the process of transforming raw data into a numerical representation suitable for models. They introduce data loaders (torch.utils.data.DataLoader) [15] for efficiently loading data in batches [16].
    2. Building a Machine Learning Model: The sources demonstrate how to build models in PyTorch by subclassing nn.Module [17]. This involves defining the model’s layers and the forward pass, which specifies how data flows through the model.
    3. Fitting the Model to the Data (Training): The sources explain the concept of a training loop [18], where the model iteratively learns from the data. Key steps in the training loop include:
    • Forward Pass: Passing data through the model to get predictions.
    • Calculating the Loss: Measuring how wrong the model’s predictions are using a loss function [19].
    • Backpropagation: Calculating gradients to determine how to adjust the model’s parameters.
    • Optimizer Step: Updating the model’s parameters using an optimizer [20] to minimize the loss.
    4. Evaluating the Model: The sources highlight the importance of evaluating the model’s performance on unseen data to assess its generalization ability. This typically involves calculating metrics such as accuracy, precision, and recall [21].
    5. Saving and Reloading the Model: The sources discuss methods for saving and loading trained models using torch.save() and torch.load() [22, 23].
    6. Improving the Model: The sources provide tips and strategies for enhancing the model’s performance, including techniques like hyperparameter tuning, data augmentation, and using different model architectures [24].

    4. Classification with PyTorch: Binary and Multi-Class

    The sources dive into classification problems, a common type of machine learning task where the goal is to categorize data into predefined classes [25]. They discuss:

    • Binary Classification: Predicting one of two possible classes [26].
    • Multi-Class Classification: Choosing from more than two classes [27].

    The sources demonstrate how to build classification models in PyTorch and showcase various techniques:

    • Choosing appropriate loss functions like binary cross entropy loss (nn.BCELoss) for binary classification and cross entropy loss (nn.CrossEntropyLoss) for multi-class classification [28].
    • Using activation functions like sigmoid for binary classification and softmax for multi-class classification [29].
    • Evaluating classification models using metrics like accuracy, precision, recall, and confusion matrices [30].
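
    As a small sketch of the typical pairings (assuming the model outputs raw logits):

    from torch import nn

    # Binary classification: one logit per sample; sigmoid is built into the loss
    binary_loss_fn = nn.BCEWithLogitsLoss()   # more numerically stable than sigmoid + nn.BCELoss

    # Multi-class classification: one logit per class; softmax is applied internally
    multi_class_loss_fn = nn.CrossEntropyLoss()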

    5. Computer Vision with PyTorch: Convolutional Neural Networks (CNNs)

    The sources introduce computer vision, the field of enabling computers to “see” and interpret images [31]. They focus on convolutional neural networks (CNNs), a type of neural network architecture specifically designed for processing image data [32].

    • Torchvision: The sources introduce torchvision, a PyTorch library containing datasets, model architectures, and image transformation tools [33].
    • Data Augmentation: The sources showcase data augmentation techniques using torchvision.transforms to create variations of training images and improve model robustness [34].
    • CNN Building Blocks: The sources explain and demonstrate key CNN components like convolutional layers (nn.Conv2d), pooling layers, and activation functions [35].

    They guide you through building CNNs from scratch and visualizing the learned features.

    6. Custom Datasets: Working with Your Own Data

    The sources address the challenge of working with custom datasets not readily available in PyTorch’s built-in libraries [36]. They explain how to:

    • Create custom datasets by subclassing torch.utils.data.Dataset [37] and implementing methods for loading and processing data.
    • Integrate custom datasets with PyTorch’s data loaders for efficient training.

    This section empowers you to apply PyTorch to real-world problems using your own data.
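
    A minimal skeleton of such a class (a sketch; the class name, paths, and label handling are hypothetical):

    from PIL import Image
    from torch.utils.data import Dataset

    class CustomImageDataset(Dataset):  # hypothetical example, not from the course
        def __init__(self, image_paths, labels, transform=None):
            self.image_paths = image_paths  # list of image file paths
            self.labels = labels            # matching list of integer class labels
            self.transform = transform      # optional torchvision transform pipeline

        def __len__(self):
            return len(self.image_paths)    # how many samples the dataset holds

        def __getitem__(self, idx):
            image = Image.open(self.image_paths[idx])
            if self.transform:
                image = self.transform(image)
            return image, self.labels[idx]  # (data, label) pair for the DataLoader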

    7. Further Resources and Exercises

    The sources provide numerous extracurricular resources and exercises [38-40] to deepen your understanding and practice your skills. These resources include:

    • Links to relevant PyTorch documentation and tutorials
    • Blog posts and articles on specific deep learning concepts
    • Code templates and solution notebooks for exercises

    By actively engaging with these resources, you can continue your learning journey and solidify your PyTorch proficiency.

    This comprehensive overview of the topics covered in the “PyTorch for Deep Learning & Machine Learning – Full Course” provides a structured understanding of the key concepts and techniques. Remember, the best way to learn is by practicing and experimenting with the code provided in the sources.

    Here are summaries of each set of 10 pages from the source document:

    Pages 1-10 Summary: Introduction to Deep Learning and PyTorch Fundamentals

    These pages introduce the fundamental concepts of deep learning, positioning it as a powerful subset of machine learning. The sources draw a clear distinction between traditional programming, where explicit rules dictate output, and machine learning, where algorithms learn rules from data. The emphasis is on PyTorch as the chosen deep learning framework, highlighting its core data structure: the tensor.

    The sources provide practical guidance on creating tensors using torch.tensor() and manipulating them with operations like reshaping and indexing. They underscore the crucial role of understanding tensor shapes and dimensions, connecting it to the common challenge of shape errors discussed in our earlier conversation.

    This set of pages lays the groundwork for understanding both the conceptual framework of deep learning and the practical tools provided by PyTorch.

    Pages 11-20 Summary: Exploring Tensors, Neural Networks, and PyTorch Documentation

    These pages build upon the introduction of tensors, expanding on operations like stacking and permuting to manipulate tensor structures further. They transition into a conceptual overview of neural networks, emphasizing their ability to learn complex patterns from data. However, the sources don’t provide detailed definitions of deep learning or neural networks, encouraging you to explore these concepts independently through external resources like Wikipedia and educational channels.

    The sources strongly advocate for actively engaging with PyTorch documentation. They highlight the website as a valuable resource for understanding PyTorch’s features, functions, and examples. They encourage you to spend time reading and exploring the documentation, even if you don’t fully grasp every detail initially.

    Pages 21-30 Summary: The PyTorch Workflow: Data, Models, Loss, and Optimization

    This section of the source delves into the core PyTorch workflow, starting with the importance of data preparation. It emphasizes the transformation of raw data into tensors, making it suitable for deep learning models. Data loaders are presented as essential tools for efficiently handling large datasets by loading data in batches.

    The sources then guide you through the process of building a machine learning model in PyTorch, using the concept of subclassing nn.Module. The forward pass is introduced as a fundamental step that defines how data flows through the model’s layers. The sources explain how models are trained by fitting them to the data, highlighting the iterative process of the training loop:

    1. Forward pass: Input data is fed through the model to generate predictions.
    2. Loss calculation: A loss function quantifies the difference between the model’s predictions and the actual target values.
    3. Backpropagation: The model’s parameters are adjusted by calculating gradients, indicating how each parameter contributes to the loss.
    4. Optimization: An optimizer uses the calculated gradients to update the model’s parameters, aiming to minimize the loss.

    Pages 31-40 Summary: Evaluating Models, Running Tensors, and Important Concepts

    The sources focus on evaluating the model’s performance, emphasizing its significance in determining how well the model generalizes to unseen data. They mention common metrics like accuracy, precision, and recall as tools for evaluating model effectiveness.

    The sources introduce the concept of running tensors on different devices (CPU and GPU) using .to(device), highlighting its importance for computational efficiency. They also discuss the use of random seeds (torch.manual_seed()) to ensure reproducibility in deep learning experiments, enabling consistent results across multiple runs.
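
    Sketched in code, the two ideas look like:

    import torch

    # Device-agnostic setup: use a GPU if one is available, otherwise fall back to CPU
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tensor = torch.rand(3, 4).to(device)

    # Reproducibility: fixing the seed makes "random" tensors repeatable across runs
    torch.manual_seed(42)
    reproducible_tensor = torch.rand(3, 4)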

    The sources stress the importance of documentation reading as a key exercise for understanding PyTorch concepts and functionalities. They also advocate for practical coding exercises to reinforce learning and develop proficiency in applying PyTorch concepts.

    Pages 41-50 Summary: Exercises, Classification Introduction, and Data Visualization

    The sources dedicate these pages to practical application and reinforcement of previously learned concepts. They present exercises designed to challenge your understanding of PyTorch workflows, data manipulation, and model building. They recommend referring to the documentation, practicing independently, and checking provided solutions as a learning approach.

    The focus shifts to classification problems, distinguishing between binary classification, where the task is to predict one of two classes, and multi-class classification, involving more than two classes.

    The sources then begin exploring data visualization, emphasizing the importance of understanding your data before applying machine learning models. They introduce the make_circles dataset as an example and use scatter plots to visualize its structure, highlighting the need for visualization as a crucial step in the data exploration process.

    Pages 51-60 Summary: Data Splitting, Building a Classification Model, and Training

    The sources discuss the critical concept of splitting data into training and test sets. This separation ensures that the model is evaluated on unseen data to assess its generalization capabilities accurately. They utilize the train_test_split function to divide the data and showcase the process of building a simple binary classification model in PyTorch.

    The sources emphasize the familiar training loop process, where the model iteratively learns from the training data:

    1. Forward pass through the model
    2. Calculation of the loss function
    3. Backpropagation of gradients
    4. Optimization of model parameters

    They guide you through implementing these steps and visualizing the model’s training progress using loss curves, highlighting the importance of monitoring these curves for insights into the model’s learning behavior.

    Pages 61-70 Summary: Multi-Class Classification, Data Visualization, and the Softmax Function

    The sources delve into multi-class classification, expanding upon the previously covered binary classification. They illustrate the differences between the two and provide examples of scenarios where each is applicable.

    The focus remains on data visualization, emphasizing the importance of understanding your data before applying machine learning algorithms. The sources introduce techniques for visualizing multi-class data, aiding in pattern recognition and insight generation.

    The softmax function is introduced as a crucial component in multi-class classification models. The sources explain its role in converting the model’s raw outputs (logits) into probabilities, enabling interpretation and decision-making based on these probabilities.
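
    For example (a sketch with made-up logits for three classes):

    import torch

    logits = torch.tensor([[2.0, 1.0, 0.1]])   # raw model outputs for one sample, three classes
    probs = torch.softmax(logits, dim=1)       # probabilities that sum to 1 across classes
    print(probs)                               # tensor([[0.6590, 0.2424, 0.0986]])
    print(probs.argmax(dim=1))                 # tensor([0]) -> predicted class index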

    Pages 71-80 Summary: Evaluation Metrics, Saving/Loading Models, and Computer Vision Introduction

    This section explores various evaluation metrics for assessing the performance of classification models. They introduce metrics like accuracy, precision, recall, F1 score, confusion matrices, and classification reports. The sources explain the significance of each metric and how to interpret them in the context of evaluating model effectiveness.

    The sources then discuss the practical aspects of saving and loading trained models, highlighting the importance of preserving model progress and enabling future use without retraining.

    The focus shifts to computer vision, a field that enables computers to “see” and interpret images. They discuss the use of convolutional neural networks (CNNs) as specialized neural network architectures for image processing tasks.

    Pages 81-90 Summary: Computer Vision Libraries, Data Exploration, and Mini-Batching

    The sources introduce essential computer vision libraries in PyTorch, particularly highlighting torchvision. They explain the key components of torchvision, including datasets, model architectures, and image transformation tools.

    They guide you through exploring a computer vision dataset, emphasizing the importance of understanding data characteristics before model building. Techniques for visualizing images and examining data structure are presented.

    The concept of mini-batching is discussed as a crucial technique for efficiently training deep learning models on large datasets. The sources explain how mini-batching involves dividing the data into smaller batches, reducing memory requirements and improving training speed.

    Pages 91-100 Summary: Building a CNN, Training Steps, and Evaluation

    This section dives into the practical aspects of building a CNN for image classification. They guide you through defining the model’s architecture, including convolutional layers (nn.Conv2d), pooling layers, activation functions, and a final linear layer for classification.

    The familiar training loop process is revisited, outlining the steps involved in training the CNN model:

    1. Forward pass of data through the model
    2. Calculation of the loss function
    3. Backpropagation to compute gradients
    4. Optimization to update model parameters

    The sources emphasize the importance of monitoring the training process by visualizing loss curves and calculating evaluation metrics like accuracy and loss. They provide practical code examples for implementing these steps and evaluating the model’s performance on a test dataset.

    Pages 101-110 Summary: Troubleshooting, Non-Linear Activation Functions, and Model Building

    The sources provide practical advice for troubleshooting common errors in PyTorch code, encouraging the use of the data explorer’s motto: visualize, visualize, visualize. The importance of checking tensor shapes, understanding error messages, and referring to the PyTorch documentation is highlighted. They recommend searching for specific errors online, utilizing resources like Stack Overflow, and if all else fails, asking questions on the course’s GitHub discussions page.

    The concept of non-linear activation functions is introduced as a crucial element in building effective neural networks. These functions, such as ReLU, introduce non-linearity into the model, enabling it to learn complex, non-linear patterns in the data. The sources emphasize the importance of combining linear and non-linear functions within a neural network to achieve powerful learning capabilities.

    Building upon this concept, the sources guide you through the process of constructing a more complex classification model incorporating non-linear activation functions. They demonstrate the step-by-step implementation, highlighting the use of ReLU and its impact on the model’s ability to capture intricate relationships within the data.

    Pages 111-120 Summary: Data Augmentation, Model Evaluation, and Performance Improvement

    The sources introduce data augmentation as a powerful technique for artificially increasing the diversity and size of training data, leading to improved model performance. They demonstrate various data augmentation methods, including random cropping, flipping, and color adjustments, emphasizing the role of torchvision.transforms in implementing these techniques. The TrivialAugment technique is highlighted as a particularly effective and efficient data augmentation strategy.

    The sources reinforce the importance of model evaluation and explore advanced techniques for assessing the performance of classification models. They introduce metrics beyond accuracy, including precision, recall, F1-score, and confusion matrices. The use of torchmetrics and other libraries for calculating these metrics is demonstrated.

    The sources discuss strategies for improving model performance, focusing on optimizing training speed and efficiency. They introduce concepts like mixed precision training and highlight the potential benefits of using TPUs (Tensor Processing Units) for accelerated deep learning tasks.

    Pages 121-130 Summary: CNN Hyperparameters, Custom Datasets, and Image Loading

    The sources provide a deeper exploration of CNN hyperparameters, focusing on kernel size, stride, and padding. They utilize the CNN Explainer website as a valuable resource for visualizing and understanding the impact of these hyperparameters on the convolutional operations within a CNN. They guide you through calculating output shapes based on these hyperparameters, emphasizing the importance of understanding the transformations applied to the input data as it passes through the network’s layers.

    The concept of custom datasets is introduced, moving beyond the use of pre-built datasets like FashionMNIST. The sources outline the process of creating a custom dataset using PyTorch’s Dataset class, enabling you to work with your own data sources. They highlight the importance of structuring your data appropriately for use with PyTorch’s data loading utilities.

    They demonstrate techniques for loading images using PyTorch, leveraging libraries like PIL (Python Imaging Library) and showcasing the steps involved in reading image data, converting it into tensors, and preparing it for use in a deep learning model.

    Pages 131-140 Summary: Building a Custom Dataset, Data Visualization, and Data Augmentation

    The sources guide you step-by-step through the process of building a custom dataset in PyTorch, specifically focusing on creating a food image classification dataset called FoodVision Mini. They cover techniques for organizing image data, creating class labels, and implementing a custom dataset class that inherits from PyTorch’s Dataset class.

    They emphasize the importance of data visualization throughout the process, demonstrating how to visually inspect images, verify labels, and gain insights into the dataset’s characteristics. They provide code examples for plotting random images from the custom dataset, enabling visual confirmation of data loading and preprocessing steps.

    The sources revisit data augmentation in the context of custom datasets, highlighting its role in improving model generalization and robustness. They demonstrate the application of various data augmentation techniques using torchvision.transforms to artificially expand the training dataset and introduce variations in the images.

    Pages 141-150 Summary: Training and Evaluation with a Custom Dataset, Transfer Learning, and Advanced Topics

    The sources guide you through the process of training and evaluating a deep learning model using your custom dataset (FoodVision Mini). They cover the steps involved in setting up data loaders, defining a model architecture, implementing a training loop, and evaluating the model’s performance using appropriate metrics. They emphasize the importance of monitoring training progress through visualization techniques like loss curves and exploring the model’s predictions on test data.

    The sources introduce transfer learning as a powerful technique for leveraging pre-trained models to improve performance on a new task, especially when working with limited data. They explain the concept of using a model trained on a large dataset (like ImageNet) as a starting point and fine-tuning it on your custom dataset to achieve better results.

    The sources provide an overview of advanced topics in PyTorch deep learning, including:

    • Model experiment tracking: Tools and techniques for managing and tracking multiple deep learning experiments, enabling efficient comparison and analysis of model variations.
    • PyTorch paper replicating: Replicating research papers using PyTorch, a valuable approach for understanding cutting-edge deep learning techniques and applying them to your own projects.
    • PyTorch workflow debugging: Strategies for debugging and troubleshooting issues that may arise during the development and training of deep learning models in PyTorch.

    These advanced topics provide a glimpse into the broader landscape of deep learning research and development using PyTorch, encouraging further exploration and experimentation beyond the foundational concepts covered in the previous sections.

    Pages 151-160 Summary: Custom Datasets, Data Exploration, and the FoodVision Mini Dataset

    The sources emphasize the importance of custom datasets when working with data that doesn’t fit into pre-existing structures like FashionMNIST. They highlight the different domain libraries available in PyTorch for handling specific types of data, including:

    • Torchvision: for image data
    • Torchtext: for text data
    • Torchaudio: for audio data
    • Torchrec: for recommendation systems data

    Each of these libraries has a datasets module that provides tools for loading and working with data from that domain. Additionally, the sources mention Torchdata, which is a more general-purpose data loading library that is still under development.

    The sources guide you through the process of creating a custom image dataset called FoodVision Mini, based on the larger Food101 dataset. They provide detailed instructions for:

    1. Obtaining the Food101 data: This involves downloading the dataset from its original source.
    2. Structuring the data: The sources recommend organizing the data in a specific folder structure, where each subfolder represents a class label and contains images belonging to that class.
    3. Exploring the data: The sources emphasize the importance of becoming familiar with the data through visualization and exploration. This can help you identify potential issues with the data and gain insights into its characteristics.

    They introduce the concept of becoming one with the data, spending significant time understanding its structure, format, and nuances before diving into model building. This echoes the data explorer’s motto: visualize, visualize, visualize.

    The sources provide practical advice for exploring the dataset, including walking through directories and visualizing images to confirm the organization and content of the data. They introduce a helper function called walk_through_dir that allows you to systematically traverse the dataset’s folder structure and gather information about the number of directories and images within each class.
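    The book names the helper walk_through_dir; a plausible implementation built on Python's os.walk might look like the following (the dataset path is hypothetical).

    import os

    def walk_through_dir(dir_path):
        """Prints the number of subdirectories and files at each level of dir_path."""
        for dirpath, dirnames, filenames in os.walk(dir_path):
            print(f"There are {len(dirnames)} directories and "
                  f"{len(filenames)} images in '{dirpath}'.")

    walk_through_dir("data/pizza_steak_sushi")  # hypothetical FoodVision Mini folder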

    Pages 161-170 Summary: Creating a Custom Dataset Class and Loading Images

    The sources continue the process of building the FoodVision Mini custom dataset, guiding you through creating a custom dataset class using PyTorch’s Dataset class. They outline the essential components and functionalities of such a class:

    1. Initialization (__init__): This method sets up the dataset’s attributes, including the target directory containing the data and any necessary transformations to be applied to the images.
    2. Length (__len__): This method returns the total number of samples in the dataset, providing a way to iterate through the entire dataset.
    3. Item retrieval (__getitem__): This method retrieves a specific sample (image and label) from the dataset based on its index, enabling access to individual data points during training.

    The sources demonstrate how to load images using the PIL (Python Imaging Library) and convert them into tensors, a format suitable for PyTorch deep learning models. They provide a detailed implementation of the load_image function, which takes an image path as input and returns a PIL image object. This function is then utilized within the __getitem__ method to load and preprocess images on demand.

    They highlight the steps involved in creating a class-to-index mapping, associating each class label with a numerical index, a requirement for training classification models in PyTorch. This mapping is generated by scanning the target directory and extracting the class names from the subfolder names.
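    Pulling these pieces together, a minimal sketch of such a custom dataset class might look like the following; the class name, the .jpg glob pattern, and the folder layout are assumptions rather than the book's exact code.

    import os
    import pathlib
    from PIL import Image
    from torch.utils.data import Dataset

    def find_classes(directory):
        """Builds a class-to-index mapping from the subfolder names of directory."""
        classes = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir())
        class_to_idx = {name: i for i, name in enumerate(classes)}
        return classes, class_to_idx

    class ImageFolderCustom(Dataset):  # hypothetical class name
        def __init__(self, targ_dir, transform=None):
            self.paths = list(pathlib.Path(targ_dir).glob("*/*.jpg"))  # assumes .jpg images
            self.transform = transform
            self.classes, self.class_to_idx = find_classes(targ_dir)

        def load_image(self, index):
            """Opens the image at the given index's path and returns a PIL image."""
            return Image.open(self.paths[index])

        def __len__(self):
            return len(self.paths)

        def __getitem__(self, index):
            img = self.load_image(index)
            class_name = self.paths[index].parent.name  # the folder name is the label
            class_idx = self.class_to_idx[class_name]
            if self.transform:
                img = self.transform(img)
            return img, class_idx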

    Pages 171-180 Summary: Data Visualization, Data Augmentation Techniques, and Implementing Transformations

    The sources reinforce the importance of data visualization as an integral part of building a custom dataset. They provide code examples for creating a function that displays random images from the dataset along with their corresponding labels. This visual inspection helps ensure that the images are loaded correctly, the labels are accurate, and the data is appropriately preprocessed.
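    A sketch of such a plotting function is shown below; it assumes the dataset returns (C, H, W) image tensors and that a classes list of label names is available.

    import random
    import matplotlib.pyplot as plt

    def display_random_images(dataset, classes, n=4, seed=None):
        """Plots n randomly chosen (image, label) pairs from a dataset."""
        if seed is not None:
            random.seed(seed)
        sample_idxs = random.sample(range(len(dataset)), k=n)
        plt.figure(figsize=(n * 3, 3))
        for i, idx in enumerate(sample_idxs):
            image, label = dataset[idx]
            plt.subplot(1, n, i + 1)
            plt.imshow(image.permute(1, 2, 0))  # (C, H, W) -> (H, W, C) for matplotlib
            plt.title(classes[label])
            plt.axis("off")
        plt.show()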

    They further explore data augmentation techniques, highlighting their significance in enhancing model performance and generalization. They demonstrate the implementation of various augmentation methods, including random horizontal flipping, random cropping, and color jittering, using torchvision.transforms. These augmentations introduce variations in the training images, artificially expanding the dataset and helping the model learn more robust features.

    The sources introduce the TrivialAugment technique, a data augmentation strategy that leverages randomness to apply a series of transformations to images, promoting diversity in the training data. They provide code examples for implementing TrivialAugment using torchvision.transforms and showcase its impact on the visual appearance of the images. They suggest experimenting with different augmentation strategies and visualizing their effects to understand their impact on the dataset.
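    In torchvision, TrivialAugment is available as transforms.TrivialAugmentWide (torchvision 0.12 and newer). A sketch of a training transform pipeline using it, with an illustrative 64x64 resize, might look like this:

    from torchvision import transforms

    # TrivialAugmentWide applies one randomly chosen transform per image at a
    # random strength; num_magnitude_bins controls the intensity range.
    train_transform = transforms.Compose([
        transforms.Resize((64, 64)),
        transforms.TrivialAugmentWide(num_magnitude_bins=31),
        transforms.ToTensor(),
    ])

    # Test data is typically resized and converted but not augmented
    test_transform = transforms.Compose([
        transforms.Resize((64, 64)),
        transforms.ToTensor(),
    ])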

    Pages 181-190 Summary: Building a TinyVGG Model and Evaluating its Performance

    The sources guide you through building a TinyVGG model architecture, a simplified version of the VGG convolutional neural network architecture. They demonstrate the step-by-step implementation of the model’s layers, including convolutional layers, ReLU activation functions, and max-pooling layers, using torch.nn modules. They use the CNN Explainer website as a visual reference for the TinyVGG architecture and encourage exploration of this resource to gain a deeper understanding of the model’s structure and operations.
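    A sketch of a TinyVGG-style architecture along these lines is shown below; the hidden unit count and the 13 * 13 flattened size (which assumes 64x64 RGB inputs) are illustrative choices, not necessarily the book's exact values.

    import torch
    from torch import nn

    class TinyVGG(nn.Module):
        """Two convolutional blocks followed by a linear classifier."""
        def __init__(self, input_shape, hidden_units, output_shape):
            super().__init__()
            self.conv_block_1 = nn.Sequential(
                nn.Conv2d(input_shape, hidden_units, kernel_size=3),
                nn.ReLU(),
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2),
            )
            self.conv_block_2 = nn.Sequential(
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3),
                nn.ReLU(),
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2),
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(hidden_units * 13 * 13, output_shape),  # 13x13 assumes 64x64 inputs
            )

        def forward(self, x):
            return self.classifier(self.conv_block_2(self.conv_block_1(x)))

    model = TinyVGG(input_shape=3, hidden_units=10, output_shape=3)  # e.g. 3 food classes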

    The sources introduce the torchinfo package, a helpful tool for summarizing the structure and parameters of a PyTorch model. They demonstrate its usage for the TinyVGG model, providing a clear representation of the input and output shapes of each layer, the number of parameters in each layer, and the overall model size. This information helps in verifying the model’s architecture and understanding its computational complexity.
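    Using torchinfo is a one-liner once a model exists; the input_size below assumes batches of 32 RGB images at 64x64, matching the TinyVGG sketch above.

    from torchinfo import summary

    # Prints a table of per-layer output shapes, parameter counts, and total model size
    summary(model, input_size=(32, 3, 64, 64))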

    They walk through the process of evaluating the TinyVGG model’s performance on the FoodVision Mini dataset, covering the steps involved in setting up data loaders, defining a training loop, and calculating metrics like loss and accuracy. They emphasize the importance of monitoring training progress through visualization techniques like loss curves, plotting the loss value over epochs to observe the model’s learning trajectory and identify potential issues like overfitting.

    Pages 191-200 Summary: Implementing Training and Testing Steps, and Setting Up a Training Loop

    The sources guide you through the implementation of separate functions for the training step and testing step of the model training process. These functions encapsulate the logic for processing a single batch of data during training and testing, respectively.

    The train_step function, as described in the sources, performs the following actions:

    1. Forward pass: Passes the input batch through the model to obtain predictions.
    2. Loss calculation: Computes the loss between the predictions and the ground truth labels.
    3. Backpropagation: Calculates the gradients of the loss with respect to the model’s parameters.
    4. Optimizer step: Updates the model’s parameters based on the calculated gradients to minimize the loss.

    The test_step function is similar to the training step, but it omits the backpropagation and optimizer step since the goal during testing is to evaluate the model’s performance on unseen data without updating its parameters.
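    Hedged sketches of the two functions are shown below; the exact signatures and the accuracy calculation are plausible reconstructions rather than the book's verbatim code.

    import torch

    def train_step(model, dataloader, loss_fn, optimizer, device):
        """Runs one training epoch and returns the average loss and accuracy."""
        model.train()
        train_loss, train_acc = 0, 0
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            y_pred = model(X)                       # 1. forward pass
            loss = loss_fn(y_pred, y)               # 2. loss calculation
            optimizer.zero_grad()
            loss.backward()                         # 3. backpropagation
            optimizer.step()                        # 4. optimizer step
            train_loss += loss.item()
            train_acc += (y_pred.argmax(dim=1) == y).float().mean().item()
        return train_loss / len(dataloader), train_acc / len(dataloader)

    def test_step(model, dataloader, loss_fn, device):
        """Evaluates the model without gradient tracking or parameter updates."""
        model.eval()
        test_loss, test_acc = 0, 0
        with torch.inference_mode():
            for X, y in dataloader:
                X, y = X.to(device), y.to(device)
                y_pred = model(X)
                test_loss += loss_fn(y_pred, y).item()
                test_acc += (y_pred.argmax(dim=1) == y).float().mean().item()
        return test_loss / len(dataloader), test_acc / len(dataloader)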

    The sources then demonstrate how to integrate these functions into a training loop. This loop iterates over the specified number of epochs, processing the training data in batches. For each epoch, the loop performs the following steps:

    1. Training phase: Calls the train_step function for each batch of training data, updating the model’s parameters.
    2. Testing phase: Calls the test_step function for each batch of testing data, evaluating the model’s performance on unseen data.

    The sources emphasize the importance of monitoring training progress by tracking metrics like loss and accuracy during both the training and testing phases. This allows you to observe how well the model is learning and identify potential issues like overfitting.
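    Assuming the train_step and test_step sketches above, plus an existing model, dataloaders, and device, the surrounding loop might look like this (loss function, optimizer, and epoch count are illustrative):

    import torch
    from torch import nn

    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    epochs = 5
    for epoch in range(epochs):
        train_loss, train_acc = train_step(model, train_dataloader, loss_fn, optimizer, device)
        test_loss, test_acc = test_step(model, test_dataloader, loss_fn, device)
        print(f"Epoch {epoch} | train_loss: {train_loss:.4f}, train_acc: {train_acc:.4f} | "
              f"test_loss: {test_loss:.4f}, test_acc: {test_acc:.4f}")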

    Pages 201-210 Summary: Visualizing Model Predictions and Exploring the Concept of Transfer Learning

    The sources emphasize the value of visualizing the model’s predictions to gain insights into its performance and identify potential areas for improvement. They guide you through the process of making predictions on a set of test images and displaying the images along with their predicted and actual labels. This visual assessment helps you understand how well the model is generalizing to unseen data and can reveal patterns in the model’s errors.

    They introduce the concept of transfer learning, a powerful technique in deep learning where you leverage knowledge gained from training a model on a large dataset to improve the performance of a model on a different but related task. The sources suggest exploring the torchvision.models module, which provides a collection of pre-trained models for various computer vision tasks. They highlight that these pre-trained models can be used as a starting point for your own models, either by fine-tuning the entire model or using parts of it as feature extractors.

    They provide an overview of how to load pre-trained models from the torchvision.models module and modify their architecture to suit your specific task. The sources encourage experimentation with different pre-trained models and fine-tuning strategies to achieve optimal performance on your custom dataset.

    Pages 211-310 Summary: Fine-Tuning a Pre-trained ResNet Model, Multi-Class Classification, and Exploring Binary vs. Multi-Class Problems

    The sources shift focus to fine-tuning a pre-trained ResNet model for the FoodVision Mini dataset. They highlight the advantages of using a pre-trained model, such as faster training and potentially better performance due to leveraging knowledge learned from a larger dataset. The sources guide you through:

    1. Loading a pre-trained ResNet model: They show how to use the torchvision.models module to load a pre-trained ResNet model, such as ResNet18 or ResNet34.
    2. Modifying the final fully connected layer: To adapt the model to the FoodVision Mini dataset, the sources demonstrate how to change the output size of the final fully connected layer to match the number of classes in the dataset (3 in this case).
    3. Freezing the initial layers: The sources discuss the strategy of freezing the weights of the initial layers of the pre-trained model to preserve the learned features from the larger dataset. This helps prevent catastrophic forgetting, where the model loses its previously acquired knowledge during fine-tuning.
    4. Training the modified model: They provide instructions for training the fine-tuned model on the FoodVision Mini dataset, emphasizing the importance of monitoring training progress and evaluating the model’s performance.
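    A minimal sketch of steps 1-3 above might look like the following; the weights API assumes torchvision 0.13 or newer, and the learning rate is illustrative. Freezing before replacing the head means the new layer stays trainable, since freshly created parameters default to requires_grad=True.

    import torch
    from torch import nn
    from torchvision import models

    # 1. Load a pre-trained ResNet18
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

    # 3. Freeze every layer first so the ImageNet-learned features are preserved
    for param in model.parameters():
        param.requires_grad = False

    # 2. Replace the final fully connected layer with a head sized for
    #    FoodVision Mini's 3 classes; only this head will be trained
    model.fc = nn.Linear(in_features=model.fc.in_features, out_features=3)

    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)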

    The sources transition to discussing multi-class classification, explaining the distinction between binary classification (predicting between two classes) and multi-class classification (predicting among more than two classes). They provide examples of both types of classification problems:

    • Binary Classification: Identifying email as spam or not spam, classifying images as containing a cat or a dog.
    • Multi-class Classification: Categorizing images of different types of food, assigning topics to news articles, predicting the sentiment of a text review.

    They introduce the ImageNet dataset, a large-scale dataset for image classification with 1000 object classes, as an example of a multi-class classification problem. They highlight the use of the softmax activation function for multi-class classification, explaining its role in converting the model’s raw output (logits) into probability scores for each class.
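    A quick demonstration of the logits-to-probabilities conversion (the logit values are made up):

    import torch

    logits = torch.tensor([[2.0, 0.5, -1.0]])   # raw outputs for one sample, 3 classes
    probs = torch.softmax(logits, dim=1)        # approx. [[0.786, 0.175, 0.039]], sums to 1
    print(probs.argmax(dim=1))                  # predicted class index: tensor([0])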

    The sources guide you through building a neural network for a multi-class classification problem using PyTorch. They illustrate:

    1. Creating a multi-class dataset: They use the sklearn.datasets.make_blobs function to generate a synthetic dataset with multiple classes for demonstration purposes.
    2. Visualizing the dataset: The sources emphasize the importance of visualizing the dataset to understand its structure and distribution of classes.
    3. Building a neural network model: They walk through the steps of defining a neural network model with multiple layers and activation functions using torch.nn modules.
    4. Choosing a loss function: For multi-class classification, they introduce the cross-entropy loss function and explain its suitability for this type of problem.
    5. Setting up an optimizer: They discuss the use of optimizers, such as stochastic gradient descent (SGD), for updating the model’s parameters during training.
    6. Training the model: The sources provide instructions for training the multi-class classification model, highlighting the importance of monitoring training progress and evaluating the model’s performance.
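    A sketch of step 1 above, including tensor conversion and a train/test split, follows; the sample count, class count, and cluster spread are illustrative.

    import torch
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import train_test_split

    # Generate a synthetic multi-class dataset (4 classes here, purely illustrative)
    X, y = make_blobs(n_samples=1000, n_features=2, centers=4,
                      cluster_std=1.5, random_state=42)

    # Convert the NumPy arrays into PyTorch tensors
    X = torch.from_numpy(X).type(torch.float)
    y = torch.from_numpy(y).type(torch.LongTensor)  # CrossEntropyLoss expects long labels

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                        random_state=42)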

    Pages 311-410 Summary: Building a Robust Training Loop, Working with Nonlinearities, and Performing Model Sanity Checks

    The sources guide you through building a more robust training loop for the multi-class classification problem, incorporating best practices like using a validation set for monitoring overfitting. They provide a detailed code implementation of the training loop, highlighting the key steps:

    1. Iterating over epochs: The loop iterates over a specified number of epochs, processing the training data in batches.
    2. Forward pass: For each batch, the input data is passed through the model to obtain predictions.
    3. Loss calculation: The loss between the predictions and the target labels is computed using the chosen loss function.
    4. Backward pass: The gradients of the loss with respect to the model’s parameters are calculated through backpropagation.
    5. Optimizer step: The optimizer updates the model’s parameters based on the calculated gradients.
    6. Validation: After each epoch, the model’s performance is evaluated on a separate validation set to monitor overfitting.

    The sources introduce the concept of nonlinearities in neural networks and explain the importance of activation functions in introducing non-linearity to the model. They discuss various activation functions, such as:

    • ReLU (Rectified Linear Unit): A popular activation function that sets negative values to zero and leaves positive values unchanged.
    • Sigmoid: An activation function that squashes the input values between 0 and 1, commonly used for binary classification problems.
    • Softmax: An activation function used for multi-class classification, producing a probability distribution over the different classes.

    They demonstrate how to incorporate these activation functions into the model architecture and explain their impact on the model’s ability to learn complex patterns in the data.
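    The behavior of these three functions can be seen directly on a small tensor:

    import torch

    x = torch.arange(-3.0, 4.0)      # tensor([-3., -2., -1., 0., 1., 2., 3.])
    print(torch.relu(x))             # negatives clipped to zero
    print(torch.sigmoid(x))          # every value squashed into (0, 1)
    print(torch.softmax(x, dim=0))   # a probability distribution summing to 1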

    The sources stress the importance of performing model sanity checks to verify that the model is functioning correctly and learning as expected. They suggest techniques like:

    1. Testing on a simpler problem: Before training on the full dataset, the sources recommend testing the model on a simpler problem with known solutions to ensure that the model’s architecture and implementation are sound.
    2. Visualizing model predictions: Comparing the model’s predictions to the ground truth labels can help identify potential issues with the model’s learning process.
    3. Checking the loss function: Monitoring the loss value during training can provide insights into how well the model is optimizing its parameters.

    Pages 411-510 Summary: Exploring Multi-class Classification Metrics and Deep Diving into Convolutional Neural Networks

    The sources explore a range of multi-class classification metrics beyond accuracy, emphasizing that different metrics provide different perspectives on the model’s performance. They introduce:

    • Precision: A measure of the proportion of correctly predicted positive cases out of all positive predictions.
    • Recall: A measure of the proportion of correctly predicted positive cases out of all actual positive cases.
    • F1-score: A harmonic mean of precision and recall, providing a balanced measure of the model’s performance.
    • Confusion matrix: A visualization tool that shows the counts of true positive, true negative, false positive, and false negative predictions, providing a detailed breakdown of the model’s performance across different classes.

    They guide you through implementing these metrics using PyTorch and visualizing the confusion matrix to gain insights into the model’s strengths and weaknesses.

    The sources transition to discussing convolutional neural networks (CNNs), a specialized type of neural network architecture well-suited for image classification tasks. They provide an in-depth explanation of the key components of a CNN, including:

    1. Convolutional layers: Layers that apply convolution operations to the input image, extracting features at different spatial scales.
    2. Activation functions: Functions like ReLU that introduce non-linearity to the model, enabling it to learn complex patterns.
    3. Pooling layers: Layers that downsample the feature maps, reducing the computational complexity and increasing the model’s robustness to variations in the input.
    4. Fully connected layers: Layers that connect all the features extracted by the convolutional and pooling layers, performing the final classification.

    They provide a visual explanation of the convolution operation, using the CNN Explainer website as a reference to illustrate how filters are applied to the input image to extract features. They discuss important hyperparameters of convolutional layers, such as:

    • Kernel size: The size of the filter used for the convolution operation.
    • Stride: The step size used to move the filter across the input image.
    • Padding: The technique of adding extra pixels around the borders of the input image to control the output size of the convolutional layer.
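    The combined effect of these hyperparameters on output shape is easy to check empirically (the layer settings below are illustrative):

    import torch
    from torch import nn

    x = torch.randn(1, 3, 64, 64)  # one fake RGB image

    # padding=1 with a 3x3 kernel preserves spatial size; stride=2 halves it
    print(nn.Conv2d(3, 10, kernel_size=3, stride=1, padding=1)(x).shape)  # [1, 10, 64, 64]
    print(nn.Conv2d(3, 10, kernel_size=3, stride=2, padding=1)(x).shape)  # [1, 10, 32, 32]
    print(nn.Conv2d(3, 10, kernel_size=5, stride=1, padding=0)(x).shape)  # [1, 10, 60, 60]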

    Pages 511-610 Summary: Building a CNN Model from Scratch and Understanding Convolutional Layers

    The sources provide a step-by-step guide to building a CNN model from scratch using PyTorch for the FoodVision Mini dataset. They walk through the process of defining the model architecture, including specifying the convolutional layers, activation functions, pooling layers, and fully connected layers. They emphasize the importance of carefully designing the model architecture to suit the specific characteristics of the dataset and the task at hand. They recommend starting with a simpler architecture and gradually increasing the model’s complexity if needed.

    They delve deeper into understanding convolutional layers, explaining how they work and their role in extracting features from images. They illustrate:

    1. Filters: Convolutional layers use filters (also known as kernels) to scan the input image, detecting patterns like edges, corners, and textures.
    2. Feature maps: The output of a convolutional layer is a set of feature maps, each representing the presence of a particular feature in the input image.
    3. Hyperparameters: They revisit the importance of hyperparameters like kernel size, stride, and padding in controlling the output size and feature extraction capabilities of convolutional layers.

    The sources guide you through experimenting with different hyperparameter settings for the convolutional layers, emphasizing the importance of understanding how these choices affect the model’s performance. They recommend using visualization techniques, such as displaying the feature maps generated by different convolutional layers, to gain insights into how the model is learning features from the data.

    The sources emphasize the iterative nature of the model development process, where you experiment with different architectures, hyperparameters, and training strategies to optimize the model’s performance. They recommend keeping track of the different experiments and their results to identify the most effective approaches.

    Pages 611-710 Summary: Understanding CNN Building Blocks, Implementing Max Pooling, and Building a TinyVGG Model

    The sources guide you through a deeper understanding of the fundamental building blocks of a convolutional neural network (CNN) for image classification. They highlight the importance of:

    • Convolutional Layers: These layers extract features from input images using learnable filters. They discuss the interplay of hyperparameters like kernel size, stride, and padding, emphasizing their role in shaping the output feature maps and controlling the network’s receptive field.
    • Activation Functions: Introducing non-linearity into the network is crucial for learning complex patterns. They revisit popular activation functions like ReLU (Rectified Linear Unit), which helps prevent vanishing gradients and speeds up training.
    • Pooling Layers: Pooling layers downsample feature maps, making the network more robust to variations in the input image while reducing computational complexity. They explain the concept of max pooling, where the maximum value within a pooling window is selected, preserving the most prominent features.

    The sources provide a detailed code implementation for max pooling using PyTorch’s torch.nn.MaxPool2d module, demonstrating how to apply it to the output of convolutional layers. They showcase how to calculate the output dimensions of the pooling layer based on the input size, stride, and pooling kernel size.
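    A short example of that calculation in practice (the feature-map shape is made up):

    import torch
    from torch import nn

    feature_maps = torch.randn(1, 10, 32, 32)  # hypothetical conv-layer output
    max_pool = nn.MaxPool2d(kernel_size=2)     # stride defaults to kernel_size

    # Pooled size = floor((32 - 2) / 2) + 1 = 16
    print(max_pool(feature_maps).shape)        # torch.Size([1, 10, 16, 16])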

    Building on these foundational concepts, the sources guide you through the construction of a TinyVGG model, a simplified version of the popular VGG architecture known for its effectiveness in image classification tasks. They demonstrate how to define the network architecture using PyTorch, stacking convolutional layers, activation functions, and pooling layers to create a deep and hierarchical representation of the input image. They emphasize the importance of designing the network structure based on principles like increasing the number of filters in deeper layers to capture more complex features.

    The sources highlight the role of flattening the output of the convolutional layers before feeding it into fully connected layers, transforming the multi-dimensional feature maps into a one-dimensional vector. This transformation prepares the extracted features for the final classification task. They emphasize the importance of aligning the output size of the flattening operation with the input size of the subsequent fully connected layer.

    Pages 711-810 Summary: Training a TinyVGG Model, Addressing Overfitting, and Evaluating the Model

    The sources guide you through training the TinyVGG model on the FoodVision Mini dataset, emphasizing the importance of structuring the training process for optimal performance. They showcase a training loop that incorporates:

    • Data Loading: Using DataLoader from PyTorch to efficiently load and batch training data, shuffling the samples in each epoch to prevent the model from learning spurious patterns from the data order.
    • Device Agnostic Code: Writing code that can seamlessly switch between CPU and GPU devices for training and inference, making the code more flexible and adaptable to different hardware setups.
    • Forward Pass: Passing the input data through the model to obtain predictions, applying the softmax function to the output logits to obtain probabilities for each class.
    • Loss Calculation: Computing the loss between the model’s predictions and the ground truth labels using a suitable loss function, typically cross-entropy loss for multi-class classification tasks.
    • Backward Pass: Calculating gradients of the loss with respect to the model’s parameters using backpropagation, highlighting the importance of understanding this fundamental algorithm that allows neural networks to learn from data.
    • Optimization: Updating the model’s parameters using an optimizer like stochastic gradient descent (SGD) to minimize the loss and improve the model’s ability to make accurate predictions.

    The sources emphasize the importance of monitoring the training process to ensure the model is learning effectively and generalizing well to unseen data. They guide you through tracking metrics like training loss and accuracy across epochs, visualizing them to identify potential issues like overfitting, where the model performs well on the training data but struggles to generalize to new data.

    The sources address the problem of overfitting, suggesting techniques like:

    • Data Augmentation: Artificially increasing the diversity of the training data by applying random transformations to the images, such as rotations, flips, and color adjustments, making the model more robust to variations in the input.
    • Dropout: Randomly deactivating a proportion of neurons during training, forcing the network to learn more robust and generalizable features.

    The sources showcase how to implement these techniques in PyTorch, highlighting the importance of finding the right balance between overfitting and underfitting, where the model is too simple to capture the patterns in the data.
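    For dropout specifically, here is a sketch of how it can slot into a classifier head (the dropout rate and layer sizes are illustrative):

    from torch import nn

    classifier = nn.Sequential(
        nn.Flatten(),
        nn.Dropout(p=0.2),  # randomly zeroes 20% of activations during training
        nn.Linear(in_features=10 * 13 * 13, out_features=3),
    )

    # Dropout is active in model.train() mode and disabled in model.eval() mode,
    # so switching to eval() before testing is essential.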

    The sources guide you through evaluating the trained model on the test set, measuring its performance using metrics like accuracy, precision, recall, and the F1-score. They emphasize the importance of using a separate test set, unseen during training, to assess the model’s ability to generalize to new data. They showcase how to generate a confusion matrix to visualize the model’s performance across different classes, identifying which classes the model struggles with the most.

    The sources provide insights into analyzing the confusion matrix to gain a deeper understanding of the model’s strengths and weaknesses, informing further improvements and refinements. They emphasize that evaluating a model is not merely about reporting a single accuracy score, but rather a multifaceted process of understanding its behavior and limitations.

    The main topic of the book, based on the provided excerpts, is deep learning with PyTorch. The book functions as a comprehensive course, designed to guide readers from foundational concepts to practical implementation, ultimately empowering them to build their own deep learning models.

    • The book begins by introducing fundamental concepts:
    • Machine Learning (ML) and Deep Learning (DL): The book establishes a clear understanding of these core concepts, explaining that DL is a subset of ML. [1-3] It emphasizes that DL is particularly well-suited for tasks involving complex patterns in large datasets. [1, 2]
    • PyTorch: The book highlights PyTorch as a popular and powerful framework for deep learning. [4, 5] It emphasizes the practical, hands-on nature of the course, encouraging readers to “see things happen” rather than getting bogged down in theoretical definitions. [1, 3, 6]
    • Tensors: The book underscores the role of tensors as the fundamental building blocks of data in deep learning, explaining how they represent data numerically for processing within neural networks. [5, 7, 8]
    • The book then transitions into the PyTorch workflow, outlining the key steps involved in building and training deep learning models:
    • Preparing and Loading Data: The book emphasizes the critical importance of data preparation, [9] highlighting techniques for loading, splitting, and visualizing data. [10-17]
    • Building Models: The book guides readers through the process of constructing neural network models in PyTorch, introducing key modules like torch.nn. [18-22] It covers essential concepts like:
    • Sub-classing nn.Module to define custom models [20]
    • Implementing the forward method to define the flow of data through the network [21, 22]
    • Training Models: The book details the training process, explaining:
    • Loss Functions: These measure how well the model is performing, guiding the optimization process. [23, 24]
    • Optimizers: These update the model’s parameters based on the calculated gradients, aiming to minimize the loss and improve accuracy. [25, 26]
    • Training Loops: These iterate through the data, performing forward and backward passes to update the model’s parameters. [26-29]
    • The Importance of Monitoring: The book stresses the need to track metrics like loss and accuracy during training to ensure the model is learning effectively and to diagnose issues like overfitting. [30-32]
    • Evaluating Models: The book explains techniques for evaluating the performance of trained models on a separate test set, unseen during training. [15, 30, 33] It introduces metrics like accuracy, precision, recall, and the F1-score to assess model performance. [34, 35]
    • Saving and Loading Models: The book provides instructions on how to save trained models and load them for later use, preserving the model’s learned parameters. [36-39]
    • Beyond the foundational workflow, the book explores specific applications of deep learning:
    • Classification: The book dedicates significant attention to classification problems, which involve categorizing data into predefined classes. [40-42] It covers:
    • Binary Classification: Distinguishing between two classes (e.g., spam or not spam) [41, 43]
    • Multi-Class Classification: Categorizing into more than two classes (e.g., different types of images) [41, 43]
    • Computer Vision: The book dives into the world of computer vision, which focuses on enabling computers to “see” and interpret images. [44, 45] It introduces:
    • Convolutional Neural Networks (CNNs): Specialized architectures designed to effectively process image data. [44-46]
    • Torchvision: PyTorch’s library specifically designed for computer vision tasks. [47]
    • Throughout the book, there is a strong emphasis on practical implementation, with:
    • Coding Examples: The book uses plentiful code snippets to illustrate concepts and techniques, encouraging readers to experiment and learn by doing. [1, 3, 6, 48, 49]
    • Exercises: The book provides exercises at the end of each section to reinforce learning and challenge readers to apply their knowledge. [50-64]
    • Troubleshooting Tips: The book offers practical advice on how to debug code, interpret errors, and find solutions to common problems. [65-78]
    • The book encourages continuous learning beyond the course material:
    • Extracurricular Resources: The book consistently points readers to additional resources, such as the official PyTorch documentation, research papers, and blog posts, to deepen their understanding and explore more advanced topics. [7, 8, 19, 35, 37, 39, 55, 57-59, 61, 63, 64, 71, 79-94]
    • The book strongly advocates for a visual and experimental approach to deep learning:
    • Visualization: The book frequently encourages visualizing data and model outputs to gain insights and understand complex concepts. [14, 16, 17, 95-101]
    • Experimentation: The book encourages readers to actively experiment with code, adjust parameters, and observe the effects to develop a deeper understanding. [32, 51, 53, 65, 102, 103]

    In essence, the book provides a guided journey into the world of deep learning with PyTorch. It doesn't just explain concepts; it empowers readers to build, train, and evaluate their own deep learning models. It does this by combining clear explanations, engaging code examples, practical exercises, and insightful references to supplementary resources, creating a comprehensive and stimulating learning experience.

    Python’s Role in Deep Learning and Machine Learning with PyTorch

    The sources strongly emphasize the use of Python as the primary programming language for working with PyTorch, a popular deep learning framework. While the sources focus on teaching PyTorch itself, they highlight Python’s integral role in making deep learning accessible and efficient. Here’s a discussion of Python’s significance, drawing upon insights from the sources:

    • Python’s User-Friendliness: The sources mention that PyTorch “allows you to write fast deep learning code in Python” [1], implying that Python’s readability and ease of use make it suitable for expressing complex deep learning concepts in code. This is particularly beneficial for beginners with some Python experience (3-6 months is suggested) [2]. Python’s clear syntax and extensive libraries simplify the process of translating mathematical ideas into working deep learning models.
    • PyTorch’s Pythonic Nature: The sources highlight how PyTorch seamlessly integrates with Python. Phrases like “beautifully interactive with pure Python” [3] and “testament to the Pythonic nature of PyTorch” [3] suggest that PyTorch leverages Python’s strengths, allowing developers to work with familiar Python constructs like functions and classes while building deep learning models. This tight integration enhances flexibility and allows for the incorporation of custom Python code directly into PyTorch workflows, as seen in the creation of custom accuracy functions [3].
    • Integration with the Scientific Python Ecosystem: The sources mention the interaction between PyTorch and NumPy, a core library in the scientific Python ecosystem. PyTorch provides methods for converting data between NumPy arrays and PyTorch tensors [4, 5]. This interoperability signifies that PyTorch doesn’t exist in isolation. It leverages the rich ecosystem of scientific computing tools available in Python, allowing developers to utilize familiar libraries for tasks like data manipulation, visualization, and analysis in conjunction with PyTorch’s deep learning capabilities.
    • Python as the Foundation: The sources consistently present code examples in Python, using Python syntax and constructs to demonstrate PyTorch concepts. This pervasive use of Python throughout the course material reinforces its position as the foundational language for interacting with PyTorch.

    The sources, while focusing specifically on PyTorch, make it evident that Python plays a critical role in enabling deep learning. Python’s user-friendliness, its seamless integration with PyTorch, and its connection to a broader ecosystem of scientific computing tools contribute to PyTorch’s popularity and make deep learning more accessible to a wider audience.

    Note: It is worth noting that while Python is dominant in the deep learning landscape, other languages are also used for deep learning, such as C++ and R. This information is not directly derived from the sources you provided and you may want to independently verify it.

    Here is a detailed summary of the book, broken down into sets of 10 pages, based on the provided excerpts. Keep in mind that these excerpts do not represent the full text of the book, so this summary provides insights into the covered topics and the overall teaching style.

    Pages 1-10: Laying the Foundation: Machine Learning, Deep Learning, and PyTorch

    • Introduction (Pages 1-2): These pages set the stage for the course, highlighting the importance of hands-on learning and experimentation with PyTorch. They emphasize the availability of course materials on GitHub and through the online book version at learnpytorch.io. It is also stated that the book may contain more content than is covered in the video transcript.
    • Understanding Deep Learning (Pages 3-6): The book provides a concise overview of machine learning (ML) and deep learning (DL), emphasizing DL’s ability to handle complex patterns in large datasets. It suggests focusing on practical implementation rather than dwelling on detailed definitions, as these can be easily accessed online. The importance of considering simpler, rule-based solutions before resorting to ML is also stressed.
    • Embracing Self-Learning (Pages 6-7): The book encourages active learning by suggesting readers explore topics like deep learning and neural networks independently, utilizing resources such as Wikipedia and specific YouTube channels like 3Blue1Brown. It stresses the value of forming your own understanding by consulting multiple sources and synthesizing information.
    • Introducing PyTorch (Pages 8-10): PyTorch is introduced as a prominent deep learning framework, particularly popular in research. Its Pythonic nature is highlighted, making it efficient for writing deep learning code. The book directs readers to the official PyTorch documentation as a primary resource for exploring the framework’s capabilities.

    Pages 11-20: PyTorch Fundamentals: Tensors, Operations, and More

    • Getting Specific (Pages 11-12): The book emphasizes a hands-on approach, encouraging readers to explore concepts like tensors through online searches and coding experimentation. It highlights the importance of asking questions and actively engaging with the material rather than passively following along. The inclusion of exercises at the end of each module is mentioned to reinforce understanding.
    • Learning Through Doing (Pages 12-14): The book emphasizes the importance of active learning through:
    • Asking questions of yourself, the code, the community, and online resources.
    • Completing the exercises provided to test knowledge and solidify understanding.
    • Sharing your work to reinforce learning and contribute to the community.
    • Avoiding Overthinking (Page 13): A key piece of advice is to avoid getting overwhelmed by the complexity of the subject. Starting with a clear understanding of the fundamentals and building upon them gradually is encouraged.
    • Course Resources (Pages 14-17): The book reiterates the availability of course materials:
    • GitHub repository: Containing code and other resources.
    • GitHub discussions: A platform for asking questions and engaging with the community.
    • learnpytorch.io: The online book version of the course.
    • Tensors in Action (Pages 17-20): The book dives into PyTorch tensors, explaining their creation using torch.tensor and referencing the official documentation for further exploration. It demonstrates basic tensor operations, emphasizing that writing code and interacting with tensors is the best way to grasp their functionality. The use of the torch.arange function is introduced to create tensors with specific ranges and step sizes.

    Pages 21-30: Understanding PyTorch’s Data Loading and Workflow

    • Tensor Manipulation and Stacking (Pages 21-22): The book covers tensor manipulation techniques, including permuting dimensions (e.g., rearranging color channels, height, and width in an image tensor). The torch.stack function is introduced to concatenate tensors along a new dimension. The concept of a pseudo-random number generator and the role of a random seed are briefly touched upon, referencing the PyTorch documentation for a deeper understanding.
    • Running Tensors on Devices (Pages 22-23): The book mentions the concept of running PyTorch tensors on different devices, such as CPUs and GPUs, although the details of this are not provided in the excerpts.
    • Exercises and Extra Curriculum (Pages 23-27): The importance of practicing concepts through exercises is highlighted, and the book encourages readers to refer to the PyTorch documentation for deeper understanding. It provides guidance on how to approach exercises using Google Colab alongside the book material. The book also points out the availability of solution templates and a dedicated folder for exercise solutions.
    • PyTorch Workflow in Action (Pages 28-31): The book begins exploring a complete PyTorch workflow, emphasizing a code-driven approach with explanations interwoven as needed. A six-step workflow is outlined:
    1. Data preparation and loading
    2. Building a machine learning/deep learning model
    3. Fitting the model to data
    4. Making predictions
    5. Evaluating the model
    6. Saving and loading the model

    Pages 31-40: Data Preparation, Linear Regression, and Visualization

    • The Two Parts of Machine Learning (Pages 31-33): The book breaks down machine learning into two fundamental parts:
    • Representing Data Numerically: Converting data into a format suitable for models to process.
    • Building a Model to Learn Patterns: Training a model to identify relationships within the numerical representation.
    • Linear Regression Example (Pages 33-35): The book uses a linear regression example (y = a + bx) to illustrate the relationship between data and model parameters. It encourages a hands-on approach by coding the formula, emphasizing that coding helps solidify understanding compared to simply reading formulas.
    • Visualizing Data (Pages 35-40): The book underscores the importance of data visualization using Matplotlib, adhering to the “visualize, visualize, visualize” motto. It provides code for plotting data, highlighting the use of scatter plots and the importance of consulting the Matplotlib documentation for detailed information on plotting functions. It guides readers through the process of creating plots, setting figure sizes, plotting training and test data, and customizing plot elements like colors, markers, and labels.

    Pages 41-50: Model Building Essentials and Inference

    • Color-Coding and PyTorch Modules (Pages 41-42): The book uses color-coding in the online version to enhance visual clarity. It also highlights essential PyTorch modules for data preparation, model building, optimization, evaluation, and experimentation, directing readers to the learnpytorch.io book and the PyTorch documentation.
    • Model Predictions (Pages 42-43): The book emphasizes the process of making predictions using a trained model, noting the expectation that an ideal model would accurately predict output values based on input data. It introduces the concept of “inference mode,” which can enhance code performance during prediction. A Twitter thread and a blog post on PyTorch’s inference mode are referenced for further exploration.
    • Understanding Loss Functions (Pages 44-47): The book dives into loss functions, emphasizing their role in measuring the discrepancy between a model’s predictions and the ideal outputs. It clarifies that loss functions can also be referred to as cost functions or criteria in different contexts. A table in the book outlines various loss functions in PyTorch, providing common values and links to documentation. The concept of Mean Absolute Error (MAE) and the L1 loss function are introduced, with encouragement to explore other loss functions in the documentation.
    • Understanding Optimizers and Hyperparameters (Pages 48-50): The book explains optimizers, which adjust model parameters based on the calculated loss, with the goal of minimizing the loss over time. The distinction between parameters (values the model learns during training) and hyperparameters (values set by the data scientist) is made. The learning rate, a crucial hyperparameter controlling the step size of the optimizer, is introduced. The process of minimizing loss within a training loop is outlined, emphasizing the iterative nature of adjusting weights and biases.

    Pages 51-60: Training Loops, Saving Models, and Recap

    • Putting It All Together: The Training Loop (Pages 51-53): The book assembles the previously discussed concepts into a training loop, demonstrating the iterative process of updating a model’s parameters over multiple epochs. It shows how to track and print loss values during training, illustrating the gradual reduction of loss as the model learns. The convergence of weights and biases towards ideal values is shown as a sign of successful training.
    • Saving and Loading Models (Pages 53-56): The book explains the process of saving trained models, preserving learned parameters for later use. The concept of a “state dict,” a Python dictionary mapping layers to their parameter tensors, is introduced. The use of torch.save and torch.load for saving and loading models is demonstrated (see the sketch after this list). The book also references the PyTorch documentation for more detailed information on saving and loading models.
    • Wrapping Up the Fundamentals (Pages 57-60): The book concludes the section on PyTorch workflow fundamentals, reiterating the key steps:
    • Getting data ready
    • Converting data to tensors
    • Building or selecting a model
    • Choosing a loss function and an optimizer
    • Training the model
    • Evaluating the model
    • Saving and loading the model
    • Exercises and Resources (Pages 57-60): The book provides exercises focused on the concepts covered in the section, encouraging readers to practice implementing a linear regression model from scratch. A variety of extracurricular resources are listed, including links to articles on gradient descent, backpropagation, loading and saving models, a PyTorch cheat sheet, and the unofficial PyTorch optimization loop song. The book directs readers to the extras folder in the GitHub repository for exercise templates and solutions.
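    As referenced above, a minimal save/load sketch along the lines the book describes (the file path and model class are placeholders, not the book's exact names):

    import torch
    from pathlib import Path

    MODEL_PATH = Path("models/01_workflow_model_0.pth")  # hypothetical path
    MODEL_PATH.parent.mkdir(parents=True, exist_ok=True)

    # Save only the state dict (the learned parameters)
    torch.save(obj=model.state_dict(), f=MODEL_PATH)

    # To load, create a fresh instance of the same model class, then restore it
    loaded_model = LinearRegressionModel()  # assumed model class
    loaded_model.load_state_dict(torch.load(f=MODEL_PATH))
    loaded_model.eval()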

    This breakdown of the first 60 pages, based on the excerpts provided, reveals the book’s structured and engaging approach to teaching deep learning with PyTorch. It balances conceptual explanations with hands-on coding examples, exercises, and references to external resources. The book emphasizes experimentation and active learning, encouraging readers to move beyond passive reading and truly grasp the material by interacting with code and exploring concepts independently.

    Note: Please keep in mind that this summary only covers the content found within the provided excerpts, which may not represent the entirety of the book.

    Pages 61-70: Multi-Class Classification and Building a Neural Network

    • Multi-Class Classification (Pages 61-63): The book introduces multi-class classification, where a model predicts one out of multiple possible classes. It shifts from the linear regression example to a new task involving a data set with four distinct classes. It also highlights the use of one-hot encoding to represent categorical data numerically, and emphasizes the importance of understanding the problem domain and using appropriate data representations for a given task.
    • Preparing Data (Pages 63-64): The sources demonstrate the creation of a multi-class data set. The book uses scikit-learn’s make_blobs function to generate synthetic data points representing four classes, each with its own color. It emphasizes the importance of visualizing the generated data and confirming that it aligns with the desired structure. The train_test_split function is used to divide the data into training and testing sets.
    • Building a Neural Network (Pages 64-66): The book starts building a neural network model using PyTorch’s nn.Module class, showing how to define layers and connect them in a sequential manner. It provides a step-by-step explanation of the process:
    1. Initialization: Defining the model class with layers and computations.
    2. Input Layer: Specifying the number of features for the input layer based on the data set.
    3. Hidden Layers: Creating hidden layers and determining their input and output sizes.
    4. Output Layer: Defining the output layer with a size corresponding to the number of classes.
    5. Forward Method: Implementing the forward pass, where data flows through the network.
    • Matching Shapes (Pages 67-70): The book emphasizes the crucial concept of shape compatibility between layers. It shows how to calculate output shapes based on input shapes and layer parameters. It explains that input shapes must align with the expected shapes of subsequent layers to ensure smooth data flow. The book also underscores the importance of code experimentation to confirm shape alignment. The sources specifically focus on checking that the output shape of the network matches the shape of the target values (y) for training.
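    A sketch of the five model-building steps plus the shape check described above (layer sizes are illustrative for a 2-feature, 4-class dataset):

    import torch
    from torch import nn

    class BlobModel(nn.Module):  # hypothetical class name
        def __init__(self, input_features=2, output_features=4, hidden_units=8):
            super().__init__()
            self.linear_layer_stack = nn.Sequential(
                nn.Linear(input_features, hidden_units),   # input layer
                nn.Linear(hidden_units, hidden_units),     # hidden layer
                nn.Linear(hidden_units, output_features),  # output layer: one logit per class
            )

        def forward(self, x):
            return self.linear_layer_stack(x)

    model = BlobModel()
    # Shape check: a batch of 5 two-feature samples should produce (5, 4) logits
    print(model(torch.randn(5, 2)).shape)  # torch.Size([5, 4])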

    Pages 71-80: Loss Functions and Activation Functions

    • Revisiting Loss Functions (Pages 71-73): The book revisits loss functions, now in the context of multi-class classification. It highlights that the choice of loss function depends on the specific problem type. The Mean Absolute Error (MAE), used for regression in previous examples, is not suitable for classification. Instead, the book introduces cross-entropy loss (nn.CrossEntropyLoss), emphasizing its suitability for classification tasks with multiple classes. It also mentions BCEWithLogitsLoss, another common loss function for binary classification problems.
    • The Role of Activation Functions (Pages 74-76): The book raises the concept of activation functions, hinting at their significance in model performance. The sources state that combining multiple linear layers in a neural network doesn’t increase model capacity because a series of linear transformations is still ultimately linear. This suggests that linear models might be limited in capturing complex, non-linear relationships in data.
    • Visualizing Limitations (Pages 76-78): The sources introduce the “Data Explorer’s Motto”: “Visualize, visualize, visualize!” This highlights the importance of visualization for understanding both data and model behavior. The book provides a visualization demonstrating the limitations of a linear model, showing its inability to accurately classify data with non-linear boundaries.
    • Exploring Nonlinearities (Pages 78-80): The sources pose the question, “What patterns could you draw if you were given an infinite amount of straight and non-straight lines?” This prompts readers to consider the expressive power of combining linear and non-linear components. The book then encourages exploring non-linear activation functions within the PyTorch documentation, specifically referencing torch.nn, and suggests trying to identify an activation function that has already been used in the examples. This interactive approach pushes learners to actively seek out information and connect concepts.

    Pages 81-90: Building and Training with Non-Linearity

    • Introducing ReLU (Pages 81-83): The sources emphasize the crucial role of non-linearity in neural network models, introducing the Rectified Linear Unit (ReLU) as a commonly used non-linear activation function. The book describes ReLU as a “magic piece of the puzzle,” highlighting its ability to add non-linearity to the model and enable the learning of more complex patterns. The sources again emphasize the importance of trying to draw various patterns using a combination of straight and curved lines to gain intuition about the impact of non-linearity.
    • Building with ReLU (Pages 83-87): The book guides readers through modifying the neural network model by adding ReLU activation functions between the existing linear layers. The placement of ReLU functions within the model architecture is shown. The sources suggest experimenting with the TensorFlow Playground, a web-based tool for visualizing neural networks, to recreate the model and observe the effects of ReLU on data separation.
    • Training the Enhanced Model (Pages 87-90): The book outlines the training process for the new model, utilizing familiar steps such as creating a loss function (BCEWithLogitsLoss in this case), setting up an optimizer (torch.optim.Adam), and defining training and evaluation loops. It demonstrates how to pass data through the model, calculate the loss, perform backpropagation, and update model parameters. The sources emphasize that even though the code structure is familiar, learners should strive to understand the underlying mechanisms and how they contribute to model training. It also suggests considering how the training code could be further optimized and modularized into functions for reusability.

    It’s important to remember that this information is based on the provided excerpts, and the book likely covers these topics and concepts in more depth. The book’s interactive approach, focusing on experimentation, code interaction, and visualization, encourages active engagement with the material, urging readers to explore, question, and discover rather than passively follow along.

    Continuing with Non-Linearity and Multi-Class Classification

    • Visualizing Non-Linearity (Pages 91-94): The sources emphasize the importance of visualizing the model’s performance after incorporating the ReLU activation function. They use a custom plotting function, plot_decision_boundary, to visually assess the model’s ability to separate the circular data. The visualization reveals a significant improvement compared to the linear model, demonstrating that ReLU enables the model to learn non-linear decision boundaries and achieve a better separation of the classes.
    • Pushing for Improvement (Pages 94-96): Even though the non-linear model shows improvement, the sources encourage continued experimentation to achieve even better performance. They challenge readers to improve the model’s accuracy on the test data to over 80%. This encourages an iterative approach to model development, where experimentation, analysis, and refinement are key. The sources suggest potential strategies, such as:
    • Adding more layers to the network
    • Increasing the number of hidden units
    • Training for a greater number of epochs
    • Adjusting the learning rate of the optimizer
    • Multi-Class Classification Revisited (Pages 96-99): The sources return to multi-class classification, moving beyond the binary classification example of the circular data. They introduce a new data set called “X_blob,” which consists of data points belonging to three distinct classes. This shift introduces additional challenges in model building and training, requiring adjustments to the model architecture, loss function, and evaluation metrics.
    • Data Preparation and Model Building (Pages 99-102): The sources guide readers through preparing the X BLOB data set for training, using familiar steps such as splitting the data into training and testing sets and creating data loaders. The book emphasizes the importance of understanding the data set’s characteristics, such as the number of classes, and adjusting the model architecture accordingly. It also encourages experimentation with different model architectures, specifically referencing PyTorch’s torch.nn module, to find an appropriate model for the task. The TensorFlow Playground is again suggested as a tool for visualizing and experimenting with neural network architectures.

    The sources repeatedly emphasize the iterative and experimental nature of machine learning and deep learning, urging learners to actively engage with the code, explore different options, and visualize results to gain a deeper understanding of the concepts. This hands-on approach fosters a mindset of continuous learning and improvement, crucial for success in these fields.

    Building and Training with Non-Linearity: Pages 103-113

    • The Power of Non-Linearity (Pages 103-105): The sources continue emphasizing the crucial role of non-linearity in neural networks, highlighting its ability to capture complex patterns in data. The book states that neural networks combine linear and non-linear functions to find patterns in data. It reiterates that linear functions alone are limited in their expressive power and that non-linear functions, like ReLU, enable models to learn intricate decision boundaries and achieve better separation of classes. The sources encourage readers to experiment with different non-linear activation functions and observe their impact on model performance, reinforcing the idea that experimentation is essential in machine learning.
    • Multi-Class Model with Non-Linearity (Pages 105-108): Building upon the previous exploration, the sources guide readers through constructing a multi-class classification model with a non-linear activation function. The book provides a step-by-step breakdown of the model architecture, including:
    1. Input Layer: Takes in features from the data set, same as before.
    2. Hidden Layers: Incorporate linear transformations using PyTorch’s nn.Linear layers, just like in previous models.
    3. ReLU Activation: Introduces ReLU activation functions between the linear layers, adding non-linearity to the model.
    4. Output Layer: Produces a set of raw output values, also known as logits, corresponding to the number of classes.
    • Prediction Probabilities (Pages 108-110): The sources explain that the raw output logits from the model need to be converted into probabilities to interpret the model’s predictions. They introduce the torch.softmax function, which transforms the logits into a probability distribution over the classes, indicating the likelihood of each class for a given input. The book emphasizes that understanding the relationship between logits, probabilities, and model predictions is crucial for evaluating and interpreting model outputs (see the sketch after this list).
    • Training and Evaluation (Pages 110-111): The sources outline the training process for the multi-class model, utilizing familiar steps such as setting up a loss function (Cross-Entropy Loss is recommended for multi-class classification), defining an optimizer (torch.optim.SGD), creating training and testing loops, and evaluating the model’s performance using loss and accuracy metrics. The sources reiterate the importance of device-agnostic code, ensuring that the model and data reside on the same device (CPU or GPU) for seamless computation. They also encourage readers to experiment with different optimizers and hyperparameters, such as learning rate and batch size, to observe their effects on training dynamics and model performance.
    • Experimentation and Visualization (Pages 111-113): The sources strongly advocate for ongoing experimentation, urging readers to modify the model, adjust hyperparameters, and visualize results to gain insights into model behavior. They demonstrate how removing the ReLU activation function leads to a model with linear decision boundaries, resulting in a significant decrease in accuracy, highlighting the importance of non-linearity in capturing complex patterns. The sources also encourage readers to refer back to previous notebooks, experiment with different model architectures, and explore advanced visualization techniques to enhance their understanding of the concepts and improve model performance.
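    A minimal sketch of the four-part architecture and the logits-to-probabilities step described above might look as follows; the class name, layer sizes, and hidden-unit count are illustrative assumptions:

    import torch
    from torch import nn

    class BlobModel(nn.Module):  # hypothetical name
        def __init__(self, input_features=2, output_features=3, hidden_units=8):
            super().__init__()
            self.linear_layer_stack = nn.Sequential(
                nn.Linear(input_features, hidden_units),   # input layer
                nn.ReLU(),                                 # non-linearity
                nn.Linear(hidden_units, hidden_units),     # hidden layer
                nn.ReLU(),
                nn.Linear(hidden_units, output_features),  # output layer -> raw logits
            )

        def forward(self, x):
            return self.linear_layer_stack(x)

    model = BlobModel()
    X_sample = torch.randn(5, 2)          # 5 dummy samples with 2 features each
    logits = model(X_sample)              # raw, unnormalized scores, shape [5, 3]
    probs = torch.softmax(logits, dim=1)  # probability distribution over the classes
    preds = probs.argmax(dim=1)           # predicted class index per sample
    print(probs.sum(dim=1))               # each row sums to 1.0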

    The consistent theme across these sections is the value of active engagement and experimentation. The sources emphasize that learning in machine learning and deep learning is an iterative process. Readers are encouraged to question assumptions, try different approaches, visualize results, and continuously refine their models based on observations and experimentation. This hands-on approach is crucial for developing a deep understanding of the concepts and fostering the ability to apply these techniques to real-world problems.

    The Impact of Non-Linearity and Multi-Class Classification Challenges: Pages 113-116

    • Non-Linearity’s Impact on Model Performance: The sources examine the critical role non-linearity plays in a model’s ability to accurately classify data. They demonstrate this by training a model without the ReLU activation function, resulting in linear decision boundaries and significantly reduced accuracy. The visualizations provided highlight the stark difference between the model with ReLU and the one without, showcasing how non-linearity enables the model to capture the circular patterns in the data and achieve better separation between classes [1]. This emphasizes the importance of understanding how different activation functions contribute to a model’s capacity to learn complex relationships within data.
    • Understanding the Data and Model Relationship (Pages 115-116): The sources remind us that evaluating a model is as crucial as building one. They highlight the importance of becoming one with the data, both at the beginning and after training a model, to gain a deeper understanding of its behavior and performance. Analyzing the model’s predictions on the data helps identify potential issues, such as overfitting or underfitting, and guides further experimentation and refinement [2].
    • Key Takeaways: The sources reinforce several key concepts and best practices in machine learning and deep learning:
    • Visualize, Visualize, Visualize: Visualizing data and model predictions is crucial for understanding patterns, identifying potential issues, and guiding model development.
    • Experiment, Experiment, Experiment: Trying different approaches, adjusting hyperparameters, and iteratively refining models based on observations is essential for achieving optimal performance.
    • The Data Scientist’s/Machine Learning Practitioner’s Motto: Experimentation is at the heart of successful machine learning, encouraging continuous learning and improvement.
    • Steps in Modeling with PyTorch: The sources repeatedly reinforce a structured workflow for building and training models in PyTorch, emphasizing the importance of following a methodical approach to ensure consistency and reproducibility.

    The sources conclude this section by directing readers to a set of exercises and extra curriculum designed to solidify their understanding of non-linearity, multi-class classification, and the steps involved in building, training, and evaluating models in PyTorch. These resources provide valuable opportunities for hands-on practice and further exploration of the concepts covered. They also serve as a reminder that learning in these fields is an ongoing process that requires continuous engagement, experimentation, and a willingness to iterate and refine models based on observations and analysis [3].

    Continuing the Computer Vision Workflow: Pages 116-129

    • Introducing Computer Vision and CNNs: The sources introduce a new module focusing on computer vision and convolutional neural networks (CNNs). They acknowledge the excitement surrounding this topic and emphasize its importance as a core concept within deep learning. The sources also provide clear instructions on how to access help and resources if learners encounter challenges during the module, encouraging active engagement and a problem-solving mindset. They reiterate the motto of “if in doubt, run the code,” highlighting the value of practical experimentation. They also point to available resources, including the PyTorch Deep Learning repository, specific notebooks, and a dedicated discussions tab for questions and answers.
    • Understanding Custom Datasets: The sources explain the concept of custom datasets, recognizing that while pre-built datasets like FashionMNIST are valuable for learning, real-world applications often involve working with unique data. They acknowledge the potential need for custom data loading solutions when existing libraries don’t provide the necessary functionality. The sources introduce the idea of creating a custom PyTorch dataset class by subclassing torch.utils.data.Dataset and implementing the __len__ and __getitem__ methods to handle data loading and preparation tailored to the unique requirements of the custom dataset (a skeletal example appears after this list).
    • Building a Baseline Model (Pages 118-120): The sources guide readers through building a baseline computer vision model using PyTorch. They emphasize the importance of understanding the input and output shapes to ensure the model is appropriately configured for the task. The sources also introduce the concept of creating a dummy forward pass to check the model’s functionality and verify the alignment of input and output dimensions.
    • Training the Baseline Model (Pages 120-125): The sources step through the process of training the baseline computer vision model. They provide a comprehensive breakdown of the code, including the use of a progress bar for tracking training progress. The steps highlighted include:
    1. Setting up the training loop: Iterating through epochs and batches of data
    2. Performing the forward pass: Passing data through the model to obtain predictions
    3. Calculating the loss: Measuring the difference between predictions and ground truth labels
    4. Backpropagation: Calculating gradients to update model parameters
    5. Updating model parameters: Using the optimizer to adjust weights based on calculated gradients
    • Evaluating Model Performance (Pages 126-128): The sources stress the importance of comprehensive evaluation, going beyond simple loss and accuracy metrics. They introduce techniques like plotting loss curves to visualize training dynamics and gain insights into model behavior. The sources also emphasize the value of experimentation, encouraging readers to explore the impact of different devices (CPU vs. GPU) on training time and performance.
    • Improving Through Experimentation: The sources encourage ongoing experimentation to improve model performance. They introduce the idea of building a better model with non-linearity, suggesting the inclusion of activation functions like ReLU. They challenge readers to try building such a model and experiment with different configurations to observe their impact on results.
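    Referring back to the custom-dataset idea above, a skeletal subclass of torch.utils.data.Dataset only has to implement __len__ and __getitem__; the folder structure, file extension, and class name below are assumptions for illustration:

    from pathlib import Path
    from PIL import Image
    from torch.utils.data import Dataset

    class ImageFolderCustom(Dataset):  # hypothetical class
        """Loads images from an assumed layout of root/class_name/image.jpg."""
        def __init__(self, root, transform=None):
            self.paths = list(Path(root).glob("*/*.jpg"))  # all jpg files
            self.transform = transform
            self.classes = sorted({p.parent.name for p in self.paths})
            self.class_to_idx = {name: i for i, name in enumerate(self.classes)}

        def __len__(self):
            # Required: the total number of samples
            return len(self.paths)

        def __getitem__(self, index):
            # Required: return one (sample, label) pair
            image = Image.open(self.paths[index])
            label = self.class_to_idx[self.paths[index].parent.name]
            if self.transform:
                image = self.transform(image)
            return image, label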

    The sources maintain their consistent focus on hands-on learning, guiding readers through each step of building, training, and evaluating computer vision models using PyTorch. They emphasize the importance of understanding the underlying concepts while actively engaging with the code, trying different approaches, and visualizing results to gain deeper insights and build practical experience.

    Functionizing Code for Efficiency and Readability: Pages 129-139

    • The Benefits of Functionizing Training and Evaluation Loops: The sources introduce the concept of functionizing code, specifically focusing on training and evaluation (testing) loops in PyTorch. They explain that writing reusable functions for these repetitive tasks brings several advantages:
    • Improved code organization and readability: Breaking down complex processes into smaller, modular functions enhances the overall structure and clarity of the code. This makes it easier to understand, maintain, and modify in the future.
    • Reduced errors: Encapsulating common operations within functions helps prevent inconsistencies and errors that can arise from repeatedly writing similar code blocks.
    • Increased efficiency: Reusable functions streamline the development process by eliminating the need to rewrite the same code for different models or datasets.
    • Creating the train_step Function (Pages 130-132): The sources guide readers through creating a function called train_step that encapsulates the logic of a single training step within a PyTorch training loop (a condensed sketch of train_step and its companion functions appears after this list). The function takes several arguments:
    • model: The PyTorch model to be trained
    • data_loader: The data loader providing batches of training data
    • loss_function: The loss function used to calculate the training loss
    • optimizer: The optimizer responsible for updating model parameters
    • accuracy_function: A function for calculating the accuracy of the model’s predictions
    • device: The device (CPU or GPU) on which to perform the computations
    • The train_step function performs the following steps for each batch of training data:
    1. Sets the model to training mode using model.train()
    2. Sends the input data and labels to the specified device
    3. Performs the forward pass by passing the data through the model
    4. Calculates the loss using the provided loss function
    5. Performs backpropagation to calculate gradients
    6. Updates model parameters using the optimizer
    7. Calculates and accumulates the training loss and accuracy for the batch
    • Creating the test_step Function (Pages 132-136): The sources proceed to create a function called test_step that performs a single evaluation step on a batch of testing data. This function follows a similar structure to train_step, but with key differences:
    • It sets the model to evaluation mode using model.eval() to disable training-specific behaviors such as dropout.
    • It utilizes the torch.inference_mode() context manager to potentially optimize computations for inference tasks, aiming for speed improvements.
    • It calculates and accumulates the testing loss and accuracy for the batch without updating the model’s parameters.
    • Combining train_step and test_step into a train Function (Pages 137-139): The sources combine the functionality of train_step and test_step into a single function called train, which orchestrates the entire training and evaluation process over a specified number of epochs. The train function takes arguments similar to train_step and test_step, including the number of epochs to train for. It iterates through the specified epochs, calling train_step for each batch of training data and test_step for each batch of testing data. It tracks and prints the training and testing loss and accuracy for each epoch, providing a clear view of the model’s progress during training.
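    A condensed sketch of the three functions described above follows; the argument names mirror the description, while details such as the accuracy_fn signature (y_true, y_pred) and the print format are assumptions:

    import torch

    def train_step(model, data_loader, loss_fn, optimizer, accuracy_fn, device):
        """One pass over the training data; returns average loss and accuracy."""
        model.train()                          # 1. training mode
        train_loss, train_acc = 0, 0
        for X, y in data_loader:
            X, y = X.to(device), y.to(device)  # 2. data to target device
            y_pred = model(X)                  # 3. forward pass
            loss = loss_fn(y_pred, y)          # 4. calculate the loss
            optimizer.zero_grad()              #    reset accumulated gradients
            loss.backward()                    # 5. backpropagation
            optimizer.step()                   # 6. update parameters
            train_loss += loss.item()          # 7. accumulate metrics
            train_acc += accuracy_fn(y, y_pred.argmax(dim=1))
        return train_loss / len(data_loader), train_acc / len(data_loader)

    def test_step(model, data_loader, loss_fn, accuracy_fn, device):
        """One pass over the test data; no parameter updates."""
        model.eval()                           # disables dropout and similar layers
        test_loss, test_acc = 0, 0
        with torch.inference_mode():           # no gradient tracking -> faster
            for X, y in data_loader:
                X, y = X.to(device), y.to(device)
                y_pred = model(X)
                test_loss += loss_fn(y_pred, y).item()
                test_acc += accuracy_fn(y, y_pred.argmax(dim=1))
        return test_loss / len(data_loader), test_acc / len(data_loader)

    def train(model, train_loader, test_loader, loss_fn, optimizer, accuracy_fn, device, epochs):
        """Orchestrates training and evaluation over the given number of epochs."""
        for epoch in range(epochs):
            tr_loss, tr_acc = train_step(model, train_loader, loss_fn, optimizer, accuracy_fn, device)
            te_loss, te_acc = test_step(model, test_loader, loss_fn, accuracy_fn, device)
            print(f"Epoch {epoch} | train loss {tr_loss:.4f}, acc {tr_acc:.2f} | "
                  f"test loss {te_loss:.4f}, acc {te_acc:.2f}")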

    By encapsulating the training and evaluation logic into these functions, the sources demonstrate best practices in PyTorch code development, emphasizing modularity, readability, and efficiency. This approach makes it easier to experiment with different models, datasets, and hyperparameters while maintaining a structured and manageable codebase.

    Leveraging Functions for Model Training and Evaluation: Pages 139-148

    • Training Model 1 Using the train Function: The sources demonstrate how to use the newly created train function to train the model_1 that was built earlier. They highlight that only a few lines of code are needed to initiate the training process, showcasing the efficiency gained from functionization.
    • Examining Training Results and Performance Comparison: The sources emphasize the importance of carefully examining the training results, particularly the training and testing loss curves. They point out that while model_1 achieves good results, the baseline model_0 appears to perform slightly better. This observation prompts a discussion on potential reasons for the difference in performance, including the possibility that the simpler baseline model might be better suited for the dataset or that further experimentation and hyperparameter tuning might be needed for model_1 to surpass model_0. The sources also highlight the impact of using a GPU for computations, showing that training on a GPU generally leads to faster training times compared to using a CPU.
    • Creating a Results Dictionary to Track Experiments: The sources introduce the concept of creating a dictionary to store the results of different experiments. This organized approach allows for easy comparison and analysis of model performance across various configurations and hyperparameter settings. They emphasize the importance of such systematic tracking, especially when exploring multiple models and variations, to gain insights into the factors influencing performance and make informed decisions about model selection and improvement (one plausible structure is sketched after this list).
    • Visualizing Loss Curves for Model Analysis: The sources encourage visualizing the loss curves using a function called plot_loss_curves. They stress the value of visual representations in understanding the training dynamics and identifying potential issues like overfitting or underfitting. By plotting the training and testing losses over epochs, it becomes easier to assess whether the model is learning effectively and generalizing well to unseen data. The sources present different scenarios for loss curves, including:
    • Underfitting: The training loss remains high, indicating that the model is not capturing the patterns in the data effectively.
    • Overfitting: The training loss decreases significantly, but the testing loss increases, suggesting that the model is memorizing the training data and failing to generalize to new examples.
    • Good Fit: Both the training and testing losses decrease and converge, indicating that the model is learning effectively and generalizing well to unseen data.
    • Addressing Overfitting and Introducing Data Augmentation: The sources acknowledge overfitting as a common challenge in machine learning and introduce data augmentation as one technique to mitigate it. Data augmentation involves creating variations of existing training data by applying transformations like random rotations, flips, or crops. This expands the effective size of the training set, potentially improving the model’s ability to generalize to new data. They acknowledge that while data augmentation may not always lead to significant improvements, it remains a valuable tool in the machine learning practitioner’s toolkit, especially when dealing with limited datasets or complex models prone to overfitting.
    • Building and Training a CNN Model: The sources shift focus towards building a convolutional neural network (CNN) using PyTorch. They guide readers through constructing a CNN architecture, referencing the TinyVGG model from the CNN Explainer website as a starting point. The process involves stacking convolutional layers, activation functions (ReLU), and pooling layers to create a network capable of learning features from images effectively. They emphasize the importance of choosing appropriate hyperparameters, such as the number of filters, kernel size, and padding, and understanding their influence on the model’s capacity and performance.
    • Creating Functions for Training and Evaluation with Custom Datasets: The sources revisit the concept of functionization, this time adapting the train_step and test_step functions to work with custom datasets. They highlight the importance of writing reusable and adaptable code that can handle various data formats and scenarios.
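    One plausible shape for the experiment-tracking dictionary mentioned above; the keys and the helper function are assumptions, not the book’s exact code:

    import time

    results = {}  # one entry per experiment

    def record_experiment(name, model, test_loss, test_acc, train_time):
        """Store headline metrics so experiments can be compared side by side."""
        results[name] = {
            "model_name": model.__class__.__name__,
            "test_loss": round(test_loss, 4),
            "test_accuracy": round(test_acc, 2),
            "train_time_seconds": round(train_time, 2),
            "device": str(next(model.parameters()).device),
        }

    # Usage sketch:
    # start = time.time()
    # ...train the model with the train() function...
    # record_experiment("model_1_gpu", model_1, test_loss, test_acc, time.time() - start)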

    The sources continue to guide learners through a comprehensive workflow for building, training, and evaluating models in PyTorch, introducing advanced concepts and techniques along the way. They maintain their focus on practical application, encouraging hands-on experimentation, visualization, and analysis to deepen understanding and foster mastery of the tools and concepts involved in machine learning and deep learning.

    Training and Evaluating Models with Custom Datasets: Pages 171-187

    • Building the TinyVGG Architecture: The sources guide the creation of a CNN model based on the TinyVGG architecture (a sketch appears after this list). The model consists of convolutional layers, ReLU activation functions, and max-pooling layers arranged in a specific pattern to extract features from images effectively. The sources highlight the importance of understanding the role of each layer and how they work together to process image data. They also mention the blog post “Making deep learning go brrr from first principles” as optional further reading on the principles behind deep learning models.
    • Adapting Training and Evaluation Functions for Custom Datasets: The sources revisit the train_step and test_step functions, modifying them to accommodate custom datasets. They emphasize the need for flexibility in code, enabling it to handle different data formats and structures. The changes involve ensuring the data is loaded and processed correctly for the specific dataset used.
    • Creating a train Function for Custom Dataset Training: The sources combine the train_step and test_step functions within a new train function specifically designed for custom datasets. This function orchestrates the entire training and evaluation process, looping through epochs, calling the appropriate step functions for each batch of data, and tracking the model’s performance.
    • Training and Evaluating the Model: The sources demonstrate the process of training the TinyVGG model on the custom food image dataset using the newly created train function. They emphasize the importance of setting random seeds for reproducibility, ensuring consistent results across different runs.
    • Analyzing Loss Curves and Accuracy Trends: The sources analyze the training results, focusing on the loss curves and accuracy trends. They point out that the model exhibits good performance, with the loss decreasing and the accuracy increasing over epochs. They also highlight the potential for further improvement by training for a longer duration.
    • Exploring Different Loss Curve Scenarios: The sources discuss different types of loss curves, including:
    • Underfitting: The training loss remains high, indicating the model isn’t effectively capturing the data patterns.
    • Overfitting: The training loss decreases substantially, but the testing loss increases, signifying the model is memorizing the training data and failing to generalize to new examples.
    • Good Fit: Both training and testing losses decrease and converge, demonstrating that the model is learning effectively and generalizing well.
    • Addressing Overfitting with Data Augmentation: The sources introduce data augmentation as a technique to combat overfitting. Data augmentation creates variations of the training data through transformations like rotations, flips, and crops. This approach effectively expands the training dataset, potentially improving the model’s generalization abilities. They acknowledge that while data augmentation might not always yield significant enhancements, it remains a valuable strategy, especially for smaller datasets or complex models prone to overfitting.
    • Building a Model with Data Augmentation: The sources demonstrate how to build a TinyVGG model incorporating data augmentation techniques. They explore the impact of data augmentation on model performance.
    • Visualizing Results and Evaluating Performance: The sources advocate for visualizing results to gain insights into model behavior. They encourage using techniques like plotting loss curves and creating confusion matrices to assess the model’s effectiveness.
    • Saving and Loading the Best Model: The sources highlight the importance of saving the best-performing model to preserve its state for future use. They demonstrate the process of saving and loading a PyTorch model.
    • Exercises and Extra Curriculum: The sources provide guidance on accessing exercises and supplementary materials, encouraging learners to further explore and solidify their understanding of custom datasets, data augmentation, and CNNs in PyTorch.
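    Below is a sketch of a TinyVGG-style architecture in the spirit of the CNN Explainer model referenced above; the hidden-unit count, kernel sizes, and assumed 64x64 input resolution are illustrative, not the book’s exact values:

    import torch
    from torch import nn

    class TinyVGG(nn.Module):
        """Two convolutional blocks followed by a linear classifier."""
        def __init__(self, input_channels=3, hidden_units=10, output_classes=3):
            super().__init__()
            self.conv_block_1 = nn.Sequential(
                nn.Conv2d(input_channels, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2),  # halves spatial dimensions
            )
            self.conv_block_2 = nn.Sequential(
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2),
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                # 64x64 input -> 16x16 after two 2x2 max pools (assumed input size)
                nn.Linear(hidden_units * 16 * 16, output_classes),
            )

        def forward(self, x):
            return self.classifier(self.conv_block_2(self.conv_block_1(x)))

    # Dummy forward pass to verify that input and output shapes line up
    model = TinyVGG()
    dummy = torch.randn(1, 3, 64, 64)  # (batch, channels, height, width)
    print(model(dummy).shape)          # expected: torch.Size([1, 3])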

    The sources provide a comprehensive walkthrough of building, training, and evaluating models with custom datasets in PyTorch, introducing and illustrating various concepts and techniques along the way. They underscore the value of practical application, experimentation, and analysis to enhance understanding and skill development in machine learning and deep learning.

    Continuing the Exploration of Custom Datasets and Data Augmentation

    • Building a Model with Data Augmentation: The sources guide the construction of a TinyVGG model incorporating data augmentation techniques to potentially improve its generalization ability and reduce overfitting. [1] They introduce data augmentation as a way to create variations of existing training data by applying transformations like random rotations, flips, or crops. [1] This increases the effective size of the training dataset and exposes the model to a wider range of input patterns, helping it learn more robust features (an example augmentation pipeline appears after this list).
    • Training the Model with Data Augmentation and Analyzing Results: The sources walk through the process of training the model with data augmentation and evaluating its performance. [2] They observe that, in this specific case, data augmentation doesn’t lead to substantial improvements in quantitative metrics. [2] The reasons for this could be that the baseline model might already be underfitting, or the specific augmentations used might not be optimal for the dataset. They emphasize that experimenting with different augmentations and hyperparameters is crucial to determine the most effective strategies for a given problem.
    • Visualizing Loss Curves and Emphasizing the Importance of Evaluation: The sources stress the importance of visualizing results, especially loss curves, to understand the training dynamics and identify potential issues like overfitting or underfitting. [2] They recommend using the plot_loss_curves function to visually compare the training and testing losses across epochs. [2]
    • Providing Access to Exercises and Extra Curriculum: The sources conclude by directing learners to the resources available for practicing the concepts covered, including an exercise template notebook and example solutions. [3] They encourage readers to attempt the exercises independently and use the example solutions as a reference only after making a genuine effort. [3] The exercises focus on building a CNN model for image classification, highlighting the steps involved in data loading, model creation, training, and evaluation. [3]
    • Concluding the Section on Custom Datasets and Looking Ahead: The sources wrap up the section on working with custom datasets and using data augmentation techniques. [4] They point out that learners have now covered a significant portion of the course material and gained valuable experience in building, training, and evaluating PyTorch models for image classification tasks. [4] They briefly touch upon the next steps in the deep learning journey, including deployment, and encourage learners to continue exploring and expanding their knowledge. [4]
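    As an illustration of the augmentation idea described above, a torchvision transform pipeline along these lines could be used; the specific transforms and image size are assumptions:

    from torchvision import transforms

    # Training transform with augmentation: each epoch sees slightly different images
    train_transform = transforms.Compose([
        transforms.Resize((64, 64)),             # assumed target resolution
        transforms.RandomHorizontalFlip(p=0.5),  # random flip
        transforms.RandomRotation(degrees=15),   # small random rotation
        transforms.ToTensor(),                   # PIL image -> float tensor in [0, 1]
    ])

    # Test transform without augmentation: evaluation data stays deterministic
    test_transform = transforms.Compose([
        transforms.Resize((64, 64)),
        transforms.ToTensor(),
    ])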

    The sources aim to equip learners with the necessary tools and knowledge to tackle real-world deep learning projects. They advocate for a hands-on, experimental approach, emphasizing the importance of understanding the data, choosing appropriate models and techniques, and rigorously evaluating the results. They also encourage learners to continuously seek out new information and refine their skills through practice and exploration.

    Exploring Techniques for Model Improvement and Evaluation: Pages 188-190

    • Examining the Impact of Data Augmentation: The sources continue to assess the effectiveness of data augmentation in improving model performance. They observe that, despite its potential benefits, data augmentation might not always result in significant enhancements. In the specific example provided, the model trained with data augmentation doesn’t exhibit noticeable improvements compared to the baseline model. This outcome could be attributed to the baseline model potentially underfitting the data, implying that the model’s capacity is insufficient to capture the complexities of the dataset even with augmented data. Alternatively, the specific data augmentations employed might not be well-suited to the dataset, leading to minimal performance gains.
    • Analyzing Loss Curves to Understand Model Behavior: The sources emphasize the importance of visualizing results, particularly loss curves, to gain insights into the model’s training dynamics. They recommend plotting the training and validation loss curves to observe how the model’s performance evolves over epochs (a minimal plotting helper is sketched after this list). These visualizations help identify potential issues such as:
    • Underfitting: When both training and validation losses remain high, suggesting the model isn’t effectively learning the patterns in the data.
    • Overfitting: When the training loss decreases significantly while the validation loss increases, indicating the model is memorizing the training data rather than learning generalizable features.
    • Good Fit: When both training and validation losses decrease and converge, demonstrating the model is learning effectively and generalizing well to unseen data.
    • Directing Learners to Exercises and Supplementary Materials: The sources encourage learners to engage with the exercises and extra curriculum provided to solidify their understanding of the concepts covered. They point to resources like an exercise template notebook and example solutions designed to reinforce the knowledge acquired in the section. The exercises focus on building a CNN model for image classification, covering aspects like data loading, model creation, training, and evaluation.
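    A minimal version of the loss-curve plotting described above, assuming the loss values were collected into lists during training; the function and variable names are illustrative:

    import matplotlib.pyplot as plt

    def plot_loss_curves(epochs, train_losses, test_losses):
        """Plot training vs. test loss to spot under- or overfitting at a glance."""
        plt.figure(figsize=(7, 4))
        plt.plot(epochs, train_losses, label="train loss")
        plt.plot(epochs, test_losses, label="test loss")
        plt.xlabel("Epoch")
        plt.ylabel("Loss")
        plt.legend()
        plt.show()

    # Usage sketch: plot_loss_curves(range(10), train_loss_history, test_loss_history)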

    The sources strive to equip learners with the critical thinking skills necessary to analyze model performance, identify potential problems, and explore strategies for improvement. They highlight the value of visualizing results and understanding the implications of different loss curve patterns. Furthermore, they encourage learners to actively participate in the provided exercises and seek out supplementary materials to enhance their practical skills in deep learning.

    Evaluating the Effectiveness of Data Augmentation

    The sources consistently emphasize the importance of evaluating the impact of data augmentation on model performance. While data augmentation is a widely used technique to mitigate overfitting and potentially improve generalization ability, its effectiveness can vary depending on the specific dataset and model architecture.

    In the context of the food image classification task, the sources demonstrate building a TinyVGG model with and without data augmentation. They analyze the results and observe that, in this particular instance, data augmentation doesn’t lead to significant improvements in quantitative metrics like loss or accuracy. This outcome could be attributed to several factors:

    • Underfitting Baseline Model: The baseline model, even without augmentation, might already be underfitting the data. This suggests that the model’s capacity is insufficient to capture the complexities of the dataset effectively. In such scenarios, data augmentation might not provide substantial benefits as the model’s limitations prevent it from leveraging the augmented data fully.
    • Suboptimal Augmentations: The specific data augmentation techniques used might not be well-suited to the characteristics of the food image dataset. The chosen transformations might not introduce sufficient diversity or might inadvertently alter crucial features, leading to limited performance gains.
    • Dataset Size: The size of the original dataset can influence the impact of data augmentation. For smaller datasets, augmentation tends to have a more pronounced effect, since the added variation partially compensates for limited data; for large datasets that already contain substantial diversity, the marginal benefit may be smaller.

    The sources stress the importance of experimentation and analysis to determine the effectiveness of data augmentation for a specific task. They recommend exploring different augmentation techniques, adjusting hyperparameters, and carefully evaluating the results to find the optimal strategy. They also point out that even if data augmentation doesn’t result in substantial quantitative improvements, it can still contribute to a more robust and generalized model. [1, 2]

    Exploring Data Augmentation and Addressing Overfitting

    The sources highlight the importance of data augmentation as a technique to combat overfitting in machine learning models, particularly in the realm of computer vision. They emphasize that data augmentation involves creating variations of the existing training data by applying transformations such as rotations, flips, or crops. This effectively expands the training dataset and presents the model with a wider range of input patterns, promoting the learning of more robust and generalizable features.

    However, the sources caution that data augmentation is not a guaranteed solution and its effectiveness can vary depending on several factors, including:

    • The nature of the dataset: The type of data and the inherent variability within the dataset can influence the impact of data augmentation. Certain datasets might benefit significantly from augmentation, while others might exhibit minimal improvement.
    • The model architecture: The complexity and capacity of the model can determine how effectively it can leverage augmented data. A simple model might not fully utilize the augmented data, while a more complex model might be prone to overfitting even with augmentation.
    • The choice of augmentation techniques: The specific transformations applied during augmentation play a crucial role in its success. Selecting augmentations that align with the characteristics of the data and the task at hand is essential. Inappropriate or excessive augmentations can even hinder performance.

    The sources demonstrate the application of data augmentation in the context of a food image classification task using a TinyVGG model. They train the model with and without augmentation and compare the results. Notably, they observe that, in this particular scenario, data augmentation does not lead to substantial improvements in quantitative metrics such as loss or accuracy. This outcome underscores the importance of carefully evaluating the impact of data augmentation and not assuming its universal effectiveness.

    To gain further insights into the model’s behavior and the effects of data augmentation, the sources recommend visualizing the training and validation loss curves. These visualizations can reveal patterns that indicate:

    • Underfitting: If both the training and validation losses remain high, it suggests the model is not adequately learning from the data, even with augmentation.
    • Overfitting: If the training loss decreases while the validation loss increases, it indicates the model is memorizing the training data and failing to generalize to unseen data.
    • Good Fit: If both the training and validation losses decrease and converge, it signifies the model is learning effectively and generalizing well.

    The sources consistently emphasize the importance of experimentation and analysis when applying data augmentation. They encourage trying different augmentation techniques, fine-tuning hyperparameters, and rigorously evaluating the results to determine the optimal strategy for a given problem. They also highlight that, even if data augmentation doesn’t yield significant quantitative gains, it can still contribute to a more robust and generalized model.

    Ultimately, the sources advocate for a nuanced approach to data augmentation, recognizing its potential benefits while acknowledging its limitations. They urge practitioners to adopt a data-driven methodology, carefully considering the characteristics of the dataset, the model architecture, and the task requirements to determine the most effective data augmentation strategy.

    The Purpose and Impact of Inference Mode in PyTorch

    The sources introduce inference mode, a feature in PyTorch designed to optimize the model for making predictions, often referred to as “inference” or “evaluation” in machine learning. Inference mode is activated using the torch.inference_mode context manager, as demonstrated in source [1].

    Key Benefits of Inference Mode

    While the sources don’t go into extensive detail about the internal workings of inference mode, they highlight its primary benefits:

    • Improved Speed: Inference mode disables gradient calculations and certain operations not required during prediction, resulting in faster code execution. Source [2] specifically mentions a Twitter thread where PyTorch developers discuss the speed enhancements achieved using inference mode.
    • Reduced Memory Consumption: By disabling gradient tracking, inference mode reduces the memory footprint of the model during prediction. This can be particularly advantageous when deploying models on resource-constrained devices or handling large datasets.

    Observing the Difference

    Source [3] presents a comparison between predictions made with and without inference mode. The key distinction lies in the presence or absence of a grad_fn attribute in the output. The grad_fn attribute is associated with gradient calculations, essential for training but unnecessary during inference. When inference mode is active, the output lacks the grad_fn, indicating that gradient tracking is disabled, leading to speed and memory optimizations.
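    The grad_fn difference can be reproduced in a few lines; the toy model here is an assumption for illustration:

    import torch
    from torch import nn

    model = nn.Linear(2, 1)
    x = torch.randn(4, 2)

    # Regular forward pass: the output carries a grad_fn for backpropagation
    y_train = model(x)
    print(y_train.grad_fn)        # e.g. <AddmmBackward0 object at ...>

    # Inside inference mode: gradient tracking is disabled entirely
    with torch.inference_mode():
        y_infer = model(x)
    print(y_infer.grad_fn)        # None -> faster and lighter on memory
    print(y_infer.requires_grad)  # False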

    Making Inference Mode a Habit

    The sources encourage developing the habit of using inference mode whenever making predictions with a PyTorch model. This practice ensures that the model operates in its most efficient mode for inference tasks, maximizing performance and minimizing resource utilization.

    Beyond the Sources

    The sources provide a high-level overview of inference mode and its benefits, recommending further exploration through external resources, particularly the PyTorch documentation and the mentioned Twitter thread. For a deeper understanding of the technical aspects and implementation details of inference mode, consulting the official PyTorch documentation would be beneficial.

    Building a Robust PyTorch Training Loop: Pages 201-210

    The sources transition into providing a detailed walkthrough of constructing a robust training loop in PyTorch for a machine learning task involving straight-line data. This example focuses on regression, where the goal is to predict a continuous numerical value. They emphasize that while this specific task involves a simple linear relationship, the concepts and steps involved generalize to more complex scenarios.

    Here’s a breakdown of the key elements covered in the sources:

    • Data Generation and Preparation: The sources guide the reader through generating a synthetic dataset representing a straight line with a predefined weight and bias. This dataset simulates a real-world scenario where the goal is to train a model to learn the underlying relationship between input features and target variables.
    • Model Definition: The sources introduce the nn.Linear module, a fundamental building block in PyTorch for defining linear layers in neural networks. They demonstrate how to instantiate a linear layer, specifying the input and output dimensions based on the dataset. This layer will learn the weight and bias parameters during training to approximate the straight-line relationship.
    • Loss Function and Optimizer: The sources explain the importance of a loss function in training a machine learning model. In this case, they use the Mean Squared Error (MSE) loss, a common choice for regression tasks that measures the average squared difference between the predicted and actual values. They also introduce the concept of an optimizer, specifically Stochastic Gradient Descent (SGD), responsible for updating the model’s parameters to minimize the loss function during training.
    • Training Loop Structure: The sources outline the core components of a training loop (a compact end-to-end version is sketched after this list):
    • Iterating Through Epochs: The training process typically involves multiple passes over the entire training dataset, each pass referred to as an epoch. The loop iterates through the specified number of epochs, performing the training steps for each epoch.
    • Forward Pass: For each batch of data, the model makes predictions based on the current parameter values. This step involves passing the input data through the linear layer to obtain the model’s output predictions.
    • Loss Calculation: The loss function (MSE in this example) is used to compute the difference between the model’s predictions and the actual target values.
    • Backpropagation: This step involves calculating the gradients of the loss with respect to the model’s parameters. These gradients indicate the direction and magnitude of adjustments needed to minimize the loss.
    • Optimizer Step: The optimizer (SGD in this case) utilizes the calculated gradients to update the model’s weight and bias parameters, moving them towards values that reduce the loss.
    • Visualizing the Training Process: The sources emphasize the importance of visualizing the training progress to gain insights into the model’s behavior. They demonstrate plotting the loss values and parameter updates over epochs, helping to understand how the model is learning and whether the loss is decreasing as expected.
    • Illustrating Epochs and Stepping the Optimizer: The sources use a coin analogy to explain the concept of epochs and the role of the optimizer in adjusting model parameters. They compare each epoch to moving closer to a coin at the back of a couch, with the optimizer taking steps to reduce the distance to the target (the coin).
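    Pulling the listed pieces together, a compact version of the regression loop might look like this; the weight/bias values, learning rate, and epoch count are assumptions:

    import torch
    from torch import nn

    # Synthetic straight-line data: y = weight * x + bias
    weight, bias = 0.7, 0.3
    X = torch.arange(0, 1, 0.02).unsqueeze(dim=1)  # shape [50, 1]
    y = weight * X + bias

    model = nn.Linear(in_features=1, out_features=1)  # learns a weight and a bias
    loss_fn = nn.MSELoss()                            # mean squared error for regression
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(100):
        model.train()
        y_pred = model(X)          # forward pass
        loss = loss_fn(y_pred, y)  # compare predictions to targets
        optimizer.zero_grad()      # reset gradients from the previous step
        loss.backward()            # backpropagation
        optimizer.step()           # update the weight and bias
        if epoch % 10 == 0:
            print(f"Epoch {epoch} | loss: {loss.item():.5f}")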

    The sources provide a comprehensive guide to constructing a fundamental PyTorch training loop for a regression problem, emphasizing the key components and the rationale behind each step. They stress the importance of visualization to understand the training dynamics and the role of the optimizer in guiding the model towards a solution that minimizes the loss function.

    Understanding Non-Linearities and Activation Functions: Pages 211-220

    The sources shift their focus to the concept of non-linearities in neural networks and their crucial role in enabling models to learn complex patterns beyond simple linear relationships. They introduce activation functions as the mechanism for introducing non-linearity into the model’s computations.

    Here’s a breakdown of the key concepts covered in the sources:

    • Limitations of Linear Models: The sources revisit the previous example of training a linear model to fit a straight line. They acknowledge that while linear models are straightforward to understand and implement, they are inherently limited in their capacity to model complex, non-linear relationships often found in real-world data.
    • The Need for Non-Linearities: The sources emphasize that introducing non-linearity into the model’s architecture is essential for capturing intricate patterns and making accurate predictions on data with non-linear characteristics. They highlight that without non-linearities, neural networks would essentially collapse into a series of linear transformations, offering no advantage over simple linear models.
    • Activation Functions: The sources introduce activation functions as the primary means of incorporating non-linearities into neural networks. Activation functions are applied to the output of linear layers, transforming the linear output into a non-linear representation. This is what allows the network to learn complex, nuanced relationships between input features and target variables and to carve out non-linear decision boundaries.
    • Sigmoid Activation Function: The sources specifically discuss the sigmoid activation function, a common choice that squashes the input values into a range between 0 and 1. They highlight that while sigmoid was historically popular, it has limitations, particularly in deep networks where it can lead to vanishing gradients, hindering training.
    • ReLU Activation Function: The sources present the ReLU (Rectified Linear Unit) activation function as a more modern and widely used alternative to sigmoid. ReLU is computationally efficient and addresses the vanishing gradient problem associated with sigmoid. It simply sets all negative values to zero and leaves positive values unchanged, introducing non-linearity while preserving linear behavior for positive inputs (both functions are plotted in the sketch after this list).
    • Visualizing the Impact of Non-Linearities: The sources emphasize the importance of visualization to understand the impact of activation functions. They demonstrate how the addition of a ReLU activation function to a simple linear model drastically changes the model’s decision boundary, enabling it to learn non-linear patterns in a toy dataset of circles. They showcase how the ReLU-augmented model achieves near-perfect performance, highlighting the power of non-linearities in enhancing model capabilities.
    • Exploration of Activation Functions in torch.nn: The sources guide the reader to explore the torch.nn module in PyTorch, which contains a comprehensive collection of activation functions. They encourage exploring the documentation and experimenting with different activation functions to understand their properties and impact on model behavior.
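    The shapes of the two activation functions discussed above can be compared directly with a short plotting snippet (illustrative, not from the book):

    import torch
    import matplotlib.pyplot as plt

    x = torch.linspace(-10, 10, steps=200)

    plt.plot(x, torch.sigmoid(x), label="sigmoid: squashes values into (0, 1)")
    plt.plot(x, torch.relu(x), label="ReLU: max(0, x)")
    plt.legend()
    plt.title("Sigmoid vs. ReLU")
    plt.show()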

    The sources provide a clear and concise introduction to the fundamental concepts of non-linearities and activation functions in neural networks. They emphasize the limitations of linear models and the essential role of activation functions in empowering models to learn complex patterns. The sources encourage a hands-on approach, urging readers to experiment with different activation functions in PyTorch and visualize their effects on model behavior.

    Optimizing Gradient Descent: Pages 221-230

    The sources move on to refining the gradient descent process, a crucial element in training machine learning models. They highlight several techniques and concepts aimed at enhancing the efficiency and effectiveness of gradient descent.

    • Gradient Accumulation and the optimizer.zero_grad() Method: The sources explain that PyTorch accumulates gradients by default: each call to loss.backward() adds to the existing .grad buffers rather than replacing them. This is why the accumulated gradients must be reset to zero before each batch using the optimizer.zero_grad() method; otherwise, gradients from previous batches would interfere with the current batch’s update (a toy demonstration appears after this list).
    • The Intertwined Nature of Gradient Descent Steps: The sources point out the interconnectedness of the steps involved in gradient descent:
    • optimizer.zero_grad(): Resets the gradients to zero.
    • loss.backward(): Calculates gradients through backpropagation.
    • optimizer.step(): Updates model parameters based on the calculated gradients.
    • They emphasize that these steps work in tandem to optimize the model parameters, moving them towards values that minimize the loss function.
    • Learning Rate Scheduling and the Coin Analogy: The sources introduce the concept of learning rate scheduling, a technique for dynamically adjusting the learning rate, a hyperparameter controlling the size of parameter updates during training. They use the analogy of reaching for a coin at the back of a couch to explain this concept.
    • Large Steps Initially: When starting the arm far from the coin (analogous to the initial stages of training), larger steps are taken to cover more ground quickly.
    • Smaller Steps as the Target Approaches: As the arm gets closer to the coin (similar to approaching the optimal solution), smaller, more precise steps are needed to avoid overshooting the target.
    • The sources suggest exploring resources on learning rate scheduling for further details.
    • Visualizing Model Improvement: The sources demonstrate the positive impact of training for more epochs, showing how predictions align better with the target values as training progresses. They visualize the model’s predictions alongside the actual data points, illustrating how the model learns to fit the data more accurately over time.
    • The torch.no_grad() Context Manager for Evaluation: The sources introduce the torch.no_grad() context manager, used during the evaluation phase to disable gradient calculations. This optimization enhances speed and reduces memory consumption, as gradients are unnecessary for evaluating a trained model.
    • The Jingle for Remembering Training Steps: To help remember the key steps in a training loop, the sources introduce a catchy jingle: “For an epoch in a range, do the forward pass, calculate the loss, optimizer zero grad, loss backward, optimizer step, step, step.” This mnemonic device reinforces the sequence of actions involved in training a model.
    • Customizing Printouts and Monitoring Metrics: The sources emphasize the flexibility of customizing printouts during training to monitor relevant metrics. They provide examples of printing the loss, weights, and bias values at specific intervals (every 10 epochs in this case) to track the training progress. They also hint at introducing accuracy metrics in later stages.
    • Reinitializing the Model and the Importance of Random Seeds: The sources demonstrate reinitializing the model to start training from scratch, showcasing how the model begins with random predictions but progressively improves as training progresses. They emphasize the role of random seeds in ensuring reproducibility, allowing for consistent model initialization and experimentation.
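    To see why optimizer.zero_grad() matters, the default accumulation behavior can be demonstrated in isolation with a toy example (an illustration, not code from the book):

    import torch

    w = torch.tensor(2.0, requires_grad=True)

    # First backward pass: d(3w)/dw = 3
    (w * 3).backward()
    print(w.grad)   # tensor(3.)

    # Second backward pass WITHOUT zeroing: gradients are added, not replaced
    (w * 3).backward()
    print(w.grad)   # tensor(6.)

    # Zeroing the buffer (what optimizer.zero_grad() does for every parameter)
    w.grad.zero_()
    (w * 3).backward()
    print(w.grad)   # tensor(3.)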

    The sources provide a comprehensive exploration of techniques and concepts for optimizing the gradient descent process in PyTorch. They cover gradient accumulation, learning rate scheduling, and the use of context managers for efficient evaluation. They emphasize visualization to monitor progress and the importance of random seeds for reproducible experiments.

    Saving, Loading, and Evaluating Models: Pages 231-240

    The sources guide readers through saving a trained model, reloading it for later use, and exploring additional evaluation metrics beyond just loss.

    • Saving a Trained Model with torch.save(): The sources introduce the torch.save() function in PyTorch to save a trained model to a file. They emphasize the importance of saving models to preserve the learned parameters, allowing for later reuse without retraining. The code examples demonstrate saving the model’s state dictionary, containing the learned parameters, to a file named “01_pytorch_workflow_model_0.pth” (a save/load sketch appears after this list).
    • Verifying Model File Creation with ls: The sources suggest using the ls command in a terminal or command prompt to verify that the model file has been successfully created in the designated directory.
    • Loading a Saved Model with torch.load(): The sources then present the torch.load() function for loading the saved state dictionary back into the environment; because only the parameters were saved, a fresh model instance is created and populated with load_state_dict(). They highlight the ease of loading saved models, allowing for continued training or deployment for making predictions without repeating the entire training process. They challenge readers to attempt loading the saved model before providing the code solution.
    • Examining Loaded Model Parameters: The sources suggest examining the loaded model’s parameters, particularly the weights and biases, to confirm that they match the values from the saved model. This step ensures that the model has been loaded correctly and is ready for further use.
    • Improving Model Performance with More Epochs: The sources revisit the concept of training for more epochs to improve model performance. They demonstrate how increasing the number of epochs can lead to lower loss and better alignment between predictions and target values. They encourage experimentation with different epoch values to observe the impact on model accuracy.
    • Plotting Loss Curves to Visualize Training Progress: The sources showcase plotting loss curves to visualize the training progress over time. They track the loss values for both the training and test sets across epochs and plot these values to observe the trend of decreasing loss as training proceeds. The sources point out that if the training and test loss curves converge closely, it indicates that the model is generalizing well to unseen data, a desirable outcome.
    • Storing Useful Values During Training: The sources recommend creating empty lists to store useful values during training, such as epoch counts, loss values, and test loss values. This organized storage facilitates later analysis and visualization of the training process.
    • Reviewing Code, Slides, and Extra Curriculum: The sources encourage readers to review the code, accompanying slides, and extra curriculum resources for a deeper understanding of the concepts covered. They particularly recommend the book version of the course, which contains comprehensive explanations and additional resources.
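    A sketch of the save/load pattern described above; the directory layout is an assumption, the file name follows the notes, and a simple nn.Linear stands in for the trained model:

    from pathlib import Path
    import torch
    from torch import nn

    model_0 = nn.Linear(1, 1)  # stand-in for the trained model

    # 1. Save only the state dict (the learned parameters)
    MODEL_PATH = Path("models")
    MODEL_PATH.mkdir(parents=True, exist_ok=True)
    MODEL_SAVE_PATH = MODEL_PATH / "01_pytorch_workflow_model_0.pth"
    torch.save(obj=model_0.state_dict(), f=MODEL_SAVE_PATH)

    # 2. Load: create a fresh instance, then fill in the saved parameters
    loaded_model_0 = nn.Linear(1, 1)
    loaded_model_0.load_state_dict(torch.load(f=MODEL_SAVE_PATH))

    # 3. Verify that the loaded parameters match the originals
    print(model_0.state_dict())
    print(loaded_model_0.state_dict())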

    This section of the sources focuses on the practical aspects of saving, loading, and evaluating PyTorch models. The sources provide clear code examples and explanations for these essential tasks, enabling readers to efficiently manage their trained models and assess their performance. They continue to emphasize the importance of visualization for understanding training progress and model behavior.

    Building and Understanding Neural Networks: Pages 241-250

    The sources transition from focusing on fundamental PyTorch workflows to constructing and comprehending neural networks for more complex tasks, particularly classification. They guide readers through building a neural network designed to classify data points into distinct categories.

    • Shifting Focus to PyTorch Fundamentals: The sources highlight that the upcoming content will concentrate on the core principles of PyTorch, shifting away from the broader workflow-oriented perspective. They direct readers to specific sections in the accompanying resources, such as the PyTorch Fundamentals notebook and the online book version of the course, for supplementary materials and in-depth explanations.
    • Exercises and Extra Curriculum: The sources emphasize the availability of exercises and extra curriculum materials to enhance learning and practical application. They encourage readers to actively engage with these resources to solidify their understanding of the concepts.
    • Introduction to Neural Network Classification: The sources mark the beginning of a new section focused on neural network classification, a common machine learning task where models learn to categorize data into predefined classes. They distinguish between binary classification (one thing or another) and multi-class classification (more than two classes).
    • Examples of Classification Problems: To illustrate classification tasks, the sources provide real-world examples:
    • Image Classification: Classifying images as containing a cat or a dog.
    • Spam Filtering: Categorizing emails as spam or not spam.
    • Social Media Post Classification: Labeling posts on platforms like Facebook or Twitter based on their content.
    • Fraud Detection: Identifying fraudulent transactions.
    • Multi-Class Classification with Wikipedia Labels: The sources extend the concept of multi-class classification to using labels from the Wikipedia page for “deep learning.” They note that the Wikipedia page itself has multiple categories or labels, such as “deep learning,” “artificial neural networks,” “artificial intelligence,” and “emerging technologies.” This example highlights how a machine learning model could be trained to classify text based on multiple labels.
    • Architecture, Input/Output Shapes, Features, and Labels: The sources outline the key aspects of neural network classification models that they will cover:
    • Architecture: The structure and organization of the neural network, including the layers and their connections.
    • Input/Output Shapes: The dimensions of the data fed into the model and the expected dimensions of the model’s predictions.
    • Features: The input variables or characteristics used by the model to make predictions.
    • Labels: The target variables representing the classes or categories to which the data points belong.
    • Practical Example with the make_circles Dataset: The sources introduce a hands-on example using the make_circles dataset from scikit-learn, a Python library for machine learning. They generate a synthetic dataset consisting of 1000 data points arranged in two concentric circles, each circle representing a different class.
    • Data Exploration and Visualization: The sources emphasize the importance of exploring and visualizing data before model building. They print the first five samples of both the features (X) and labels (Y) and guide readers through understanding the structure of the data. They acknowledge that discerning patterns from raw numerical data can be challenging and advocate for visualization to gain insights.
    • Creating a Dictionary for Structured Data Representation: The sources structure the data into a dictionary format to organize the features (X1, X2) and labels (Y) for each sample. They explain the rationale behind this approach, highlighting how it improves readability and understanding of the dataset. A minimal sketch of this step appears after this list.
    • Transitioning to Visualization: The sources prepare to shift from numerical representations to visual representations of the data, emphasizing the power of visualization for revealing patterns and gaining a deeper understanding of the dataset’s characteristics.
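
    A minimal sketch of these data-creation and inspection steps, assuming scikit-learn and pandas are available (the noise and random_state values are illustrative, not prescribed by the sources):

    from sklearn.datasets import make_circles
    import pandas as pd

    # Generate 1000 samples arranged in two concentric circles
    X, y = make_circles(n_samples=1000, noise=0.03, random_state=42)

    # Inspect the first five samples of features and labels
    print(X[:5], y[:5])

    # Structure the samples as a dictionary for readability
    circles = pd.DataFrame({"X1": X[:, 0], "X2": X[:, 1], "label": y})
    print(circles.head())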

    This section of the sources marks a transition to a more code-centric and hands-on approach to understanding neural networks for classification. They introduce essential concepts, provide real-world examples, and guide readers through a practical example using a synthetic dataset. They continue to advocate for visualization as a crucial tool for data exploration and model understanding.

    Visualizing and Building a Classification Model: Pages 251-260

    The sources demonstrate how to visualize the make_circles dataset and begin constructing a neural network model designed for binary classification.

    • Visualizing the make_circles Dataset: The sources utilize Matplotlib, a Python plotting library, to visualize the make_circles dataset created earlier. They emphasize the data explorer’s motto: “Visualize, visualize, visualize,” underscoring the importance of visually inspecting data to understand patterns and relationships. The visualization reveals two distinct circles, each representing a different class, confirming the expected structure of the dataset.
    • Splitting Data into Training and Test Sets: The sources guide readers through splitting the dataset into training and test sets using array slicing. They explain the rationale for this split:
    • Training Set: Used to train the model and allow it to learn patterns from the data.
    • Test Set: Held back from training and used to evaluate the model’s performance on unseen data, providing an estimate of its ability to generalize to new examples.
    • They calculate and verify the lengths of the training and test sets, ensuring that the split adheres to the desired proportions (in this case, 80% for training and 20% for testing).
    • Building a Simple Neural Network with PyTorch: The sources initiate building a simple neural network model using PyTorch. They introduce essential components of a PyTorch model:
    • torch.nn.Module: The base class for all neural network modules in PyTorch.
    • __init__ Method: The constructor method where model layers are defined.
    • forward Method: Defines the forward pass of data through the model.
    • They guide readers through creating a class named CircleModelV0 that inherits from torch.nn.Module and outline the steps for defining the model’s layers and the forward pass logic. The sketch after this list pulls these pieces together with the loss function, optimizer, and training loop described below.
    • Key Concepts in the Neural Network Model:
    • Linear Layers: The model uses linear layers (torch.nn.Linear), which apply a linear transformation to the input data.
    • Non-Linear Activation Function (Sigmoid): The model employs a non-linear activation function, specifically the sigmoid function (torch.sigmoid), to introduce non-linearity into the model. Non-linearity allows the model to learn more complex patterns in the data.
    • Input and Output Dimensions: The sources carefully consider the input and output dimensions of each layer to ensure compatibility between the layers and the data. They emphasize the importance of aligning these dimensions to prevent errors during model execution.
    • Visualizing the Neural Network Architecture: The sources present a visual representation of the neural network architecture, highlighting the flow of data through the layers, the application of the sigmoid activation function, and the final output representing the model’s prediction. They encourage readers to visualize their own neural networks to aid in comprehension.
    • Loss Function and Optimizer: The sources introduce the concept of a loss function and an optimizer, crucial components of the training process:
    • Loss Function: Measures the difference between the model’s predictions and the true labels, providing a signal to guide the model’s learning.
    • Optimizer: Updates the model’s parameters (weights and biases) based on the calculated loss, aiming to minimize the loss and improve the model’s accuracy.
    • They select the binary cross-entropy loss function (torch.nn.BCELoss) and the stochastic gradient descent (SGD) optimizer (torch.optim.SGD) for this classification task. They mention that alternative loss functions and optimizers exist and provide resources for further exploration.
    • Training Loop and Evaluation: The sources establish a training loop, a fundamental process in machine learning where the model iteratively learns from the training data. They outline the key steps involved in each iteration of the loop:
    1. Forward Pass: Pass the training data through the model to obtain predictions.
    2. Calculate Loss: Compute the loss using the chosen loss function.
    3. Zero Gradients: Reset the gradients of the model’s parameters.
    4. Backward Pass (Backpropagation): Calculate the gradients of the loss with respect to the model’s parameters.
    5. Update Parameters: Adjust the model’s parameters using the optimizer based on the calculated gradients.
    • They perform a small number of training epochs (iterations over the entire training dataset) to demonstrate the training process. They evaluate the model’s performance after training by calculating the loss on the test data.
    • Visualizing Model Predictions: The sources visualize the model’s predictions on the test data using Matplotlib. They plot the data points, color-coded by their true labels, and overlay the decision boundary learned by the model, illustrating how the model separates the data into different classes. They note that the model’s predictions, although far from perfect at this early stage of training, show some initial separation between the classes, indicating that the model is starting to learn.
    • Improving a Model: An Overview: The sources provide a high-level overview of techniques for improving the performance of a machine learning model. They suggest various strategies for enhancing model accuracy, including adding more layers, increasing the number of hidden units, training for a longer duration, and incorporating non-linear activation functions. They emphasize that these strategies may not always guarantee improvement and that experimentation is crucial to determine the optimal approach for a particular dataset and problem.
    • Saving and Loading Models with PyTorch: The sources reiterate the importance of saving trained models for later use. They demonstrate the use of torch.save() to save the model’s state dictionary to a file. They also showcase how to load a saved model using torch.load(), allowing for reuse without the need for retraining.
    • Transition to Putting It All Together: The sources prepare to transition to a section where they will consolidate the concepts covered so far by working through a comprehensive example that incorporates the entire machine learning workflow, emphasizing practical application and problem-solving.
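
    The pieces described above fit together in one condensed sketch. The class name and layer sizes follow the discussion, while the split method, learning rate, and epoch count are illustrative choices rather than prescribed values:

    import torch
    from torch import nn
    from sklearn.datasets import make_circles

    # Create the data and convert it to float tensors
    X, y = make_circles(n_samples=1000, noise=0.03, random_state=42)
    X = torch.from_numpy(X).type(torch.float)
    y = torch.from_numpy(y).type(torch.float)

    # 80/20 train/test split via array slicing
    split = int(0.8 * len(X))
    X_train, y_train = X[:split], y[:split]
    X_test, y_test = X[split:], y[split:]

    class CircleModelV0(nn.Module):
        def __init__(self):
            super().__init__()
            # 2 input features -> 5 hidden units -> 1 output
            self.layer_1 = nn.Linear(in_features=2, out_features=5)
            self.layer_2 = nn.Linear(in_features=5, out_features=1)

        def forward(self, x):
            # The output of layer_1 becomes the input of layer_2
            return self.layer_2(self.layer_1(x))

    model_0 = CircleModelV0()
    loss_fn = nn.BCELoss()  # expects inputs already passed through sigmoid
    optimizer = torch.optim.SGD(params=model_0.parameters(), lr=0.1)

    for epoch in range(100):
        model_0.train()
        y_probs = torch.sigmoid(model_0(X_train)).squeeze()  # 1. forward pass
        loss = loss_fn(y_probs, y_train)                     # 2. calculate loss
        optimizer.zero_grad()                                # 3. zero gradients
        loss.backward()                                      # 4. backpropagation
        optimizer.step()                                     # 5. update parameters

    # Evaluate on the held-out test set
    model_0.eval()
    with torch.inference_mode():
        test_loss = loss_fn(torch.sigmoid(model_0(X_test)).squeeze(), y_test)
    print(f"Test loss: {test_loss:.4f}")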

    This section of the sources focuses on the practical aspects of building and training a simple neural network for binary classification. They guide readers through defining the model architecture, choosing a loss function and optimizer, implementing a training loop, and visualizing the model’s predictions. They also introduce strategies for improving model performance and reinforce the importance of saving and loading trained models.

    Putting It All Together: Pages 261-270

    The sources revisit the key steps in the PyTorch workflow, bringing together the concepts covered previously to solidify readers’ understanding of the end-to-end process. They emphasize a code-centric approach, encouraging readers to code along to reinforce their learning.

    • Reiterating the PyTorch Workflow: The sources highlight the importance of practicing the PyTorch workflow to gain proficiency. They guide readers through a step-by-step review of the process, emphasizing a shift toward coding over theoretical explanations.
    • The Importance of Practice: The sources stress that actively writing and running code is crucial for internalizing concepts and developing practical skills. They encourage readers to participate in coding exercises and explore additional resources to enhance their understanding.
    • Data Preparation and Transformation into Tensors: The sources reiterate the initial steps of preparing data and converting it into tensors, a format suitable for PyTorch models. They remind readers of the importance of data exploration and transformation, emphasizing that these steps are fundamental to successful model development.
    • Model Building, Loss Function, and Optimizer Selection: The sources revisit the core components of model construction:
    • Building or Selecting a Model: Choosing an appropriate model architecture or constructing a custom model based on the problem’s requirements.
    • Picking a Loss Function: Selecting a loss function that measures the difference between the model’s predictions and the true labels, guiding the model’s learning process.
    • Building an Optimizer: Choosing an optimizer that updates the model’s parameters based on the calculated loss, aiming to minimize the loss and improve the model’s accuracy.
    • Training Loop and Model Fitting: The sources highlight the central role of the training loop in machine learning. They recap the key steps involved in each iteration:
    1. Forward Pass: Pass the training data through the model to obtain predictions.
    2. Calculate Loss: Compute the loss using the chosen loss function.
    3. Zero Gradients: Reset the gradients of the model’s parameters.
    4. Backward Pass (Backpropagation): Calculate the gradients of the loss with respect to the model’s parameters.
    5. Update Parameters: Adjust the model’s parameters using the optimizer based on the calculated gradients.
    • Making Predictions and Evaluating the Model: The sources remind readers of the steps involved in using the trained model to make predictions on new data and evaluating its performance using appropriate metrics, such as loss and accuracy. They emphasize the importance of evaluating models on unseen data (the test set) to assess their ability to generalize to new examples.
    • Saving and Loading Trained Models: The sources reiterate the value of saving trained models to avoid retraining. They demonstrate the use of torch.save() to save the model’s state dictionary to a file and torch.load() to load a saved model for reuse.
    • Exercises and Extra Curriculum Resources: The sources consistently emphasize the availability of exercises and extra curriculum materials to supplement learning. They direct readers to the accompanying resources, such as the online book and the GitHub repository, where these materials can be found. They encourage readers to actively engage with these resources to solidify their understanding and develop practical skills.
    • Transition to Convolutional Neural Networks: The sources prepare to move into a new section focused on computer vision and convolutional neural networks (CNNs), indicating that readers have gained a solid foundation in the fundamental PyTorch workflow and are ready to explore more advanced deep learning architectures. [1]

    This section of the sources serves as a review and consolidation of the key concepts and steps involved in the PyTorch workflow. It reinforces the importance of practice and hands-on coding and prepares readers to explore more specialized deep learning techniques, such as CNNs for computer vision tasks.

    Navigating Resources and Deep Learning Concepts: Pages 271-280

    The sources transition into discussing resources for further learning and exploring essential deep learning concepts, setting the stage for a deeper understanding of PyTorch and its applications.

    • Emphasizing Continuous Learning: The sources emphasize the importance of ongoing learning in the ever-evolving field of deep learning. They acknowledge that a single course cannot cover every aspect of PyTorch and encourage readers to actively seek out additional resources to expand their knowledge.
    • Recommended Resources for PyTorch Mastery: The sources provide specific recommendations for resources that can aid in further exploration of PyTorch:
    • Google Search: A fundamental tool for finding answers to specific questions, troubleshooting errors, and exploring various concepts related to PyTorch and deep learning. [1, 2]
    • PyTorch Documentation: The official PyTorch documentation serves as an invaluable reference for understanding PyTorch’s functions, modules, and classes. The sources demonstrate how to effectively navigate the documentation to find information about specific functions, such as torch.arange. [3]
    • GitHub Repository: The sources highlight a dedicated GitHub repository that houses the materials covered in the course, including notebooks, code examples, and supplementary resources. They encourage readers to utilize this repository as a learning aid and a source of reference. [4-14]
    • Learn PyTorch Website: The sources introduce an online book version of the course, accessible through a website, offering a readable format for revisiting course content and exploring additional chapters that cover more advanced topics, including transfer learning, model experiment tracking, and paper replication. [1, 4, 5, 7, 11, 15-30]
    • Course Q&A Forum: The sources acknowledge the importance of community support and encourage readers to utilize a dedicated Q&A forum, possibly on GitHub, to seek assistance from instructors and fellow learners. [4, 8, 11, 15]
    • Encouraging Active Exploration of Definitions: The sources recommend that readers proactively research definitions of key deep learning concepts, such as deep learning and neural networks. They suggest using resources like Google Search and Wikipedia to explore various interpretations and develop a personal understanding of these concepts. They prioritize hands-on work over rote memorization of definitions. [1, 2]
    • Structured Approach to the Course: The sources suggest a structured approach to navigating the course materials, presenting them in numerical order for ease of comprehension. They acknowledge that alternative learning paths exist but recommend following the numerical sequence for clarity. [31]
    • Exercises, Extra Curriculum, and Documentation Reading: The sources emphasize the significance of hands-on practice and provide exercises designed to reinforce the concepts covered in the course. They also highlight the availability of extra curriculum materials for those seeking to deepen their understanding. Additionally, they encourage readers to actively engage with the PyTorch documentation to familiarize themselves with its structure and content. [6, 10, 12, 13, 16, 18-21, 23, 24, 28-30, 32-34]

    This section of the sources focuses on directing readers towards valuable learning resources and fostering a mindset of continuous learning in the dynamic field of deep learning. They provide specific recommendations for accessing course materials, leveraging the PyTorch documentation, engaging with the community, and exploring definitions of key concepts. They also encourage active participation in exercises, exploration of extra curriculum content, and familiarization with the PyTorch documentation to enhance practical skills and deepen understanding.

    Introducing the Coding Environment: Pages 281-290

    The sources transition from theoretical discussion and resource navigation to a more hands-on approach, guiding readers through setting up their coding environment and introducing Google Colab as the primary tool for the course.

    • Shifting to Hands-On Coding: The sources signal a shift in focus toward practical coding exercises, encouraging readers to actively participate and write code alongside the instructions. They emphasize the importance of getting involved with hands-on work rather than solely focusing on theoretical definitions.
    • Introducing Google Colab: The sources introduce Google Colab, a cloud-based Jupyter notebook environment, as the primary tool for coding throughout the course. They suggest that using Colab facilitates a consistent learning experience and removes the need for local installations and setup, allowing readers to focus on learning PyTorch. They recommend using Colab as the preferred method for following along with the course materials.
    • Advantages of Google Colab: The sources highlight the benefits of using Google Colab, including its accessibility, ease of use, and collaborative features. Colab provides a pre-configured environment with necessary libraries and dependencies already installed, simplifying the setup process for readers. Its cloud-based nature allows access from various devices and facilitates code sharing and collaboration.
    • Navigating the Colab Interface: The sources guide readers through the basic functionality of Google Colab, demonstrating how to create new notebooks, run code cells, and access various features within the Colab environment. They introduce essential attributes, such as torch.__version__ and torchvision.__version__, for checking the versions of installed libraries.
    • Creating and Running Code Cells: The sources demonstrate how to create new code cells within Colab notebooks and execute Python code within these cells. They illustrate the use of print() statements to display output and introduce the concept of importing necessary libraries, such as torch for PyTorch functionality.
    • Checking Library Versions: The sources emphasize the importance of ensuring compatibility between PyTorch and its associated libraries. They demonstrate how to check the versions of installed libraries, such as torch and torchvision, using the torch.__version__ and torchvision.__version__ attributes (see the snippet after this list). This step ensures that readers are using compatible versions for the upcoming code examples and exercises.
    • Emphasizing Hands-On Learning: The sources reiterate their preference for hands-on learning and a code-centric approach, stating that they will prioritize coding together rather than spending extensive time on slides or theoretical explanations.
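
    A quick check, runnable in any Colab code cell (the exact version strings printed will vary by environment):

    import torch
    import torchvision

    print(torch.__version__)        # e.g. "2.1.0+cu121" -- output varies
    print(torchvision.__version__)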

    This section of the sources marks a transition from theoretical discussions and resource exploration to a more hands-on coding approach. They introduce Google Colab as the primary coding environment for the course, highlighting its benefits and demonstrating its basic functionality. The sources guide readers through creating code cells, running Python code, and checking library versions to ensure compatibility. By focusing on practical coding examples, the sources encourage readers to actively participate in the learning process and reinforce their understanding of PyTorch concepts.

    Setting the Stage for Classification: Pages 291-300

    The sources shift focus to classification problems, a fundamental task in machine learning, and begin by explaining the core concepts of binary, multi-class, and multi-label classification, providing examples to illustrate each type. They then delve into the specifics of binary and multi-class classification, setting the stage for building classification models in PyTorch.

    • Introducing Classification Problems: The sources introduce classification as a key machine learning task where the goal is to categorize data into predefined classes or categories. They differentiate between various types of classification problems:
    • Binary Classification: Involves classifying data into one of two possible classes. Examples include:
    • Image Classification: Determining whether an image contains a cat or a dog.
    • Spam Detection: Classifying emails as spam or not spam.
    • Fraud Detection: Identifying fraudulent transactions from legitimate ones.
    • Multi-Class Classification: Deals with classifying data into one of multiple (more than two) classes. Examples include:
    • Image Recognition: Categorizing images into different object classes, such as cars, bicycles, and pedestrians.
    • Handwritten Digit Recognition: Classifying handwritten digits into the numbers 0 through 9.
    • Natural Language Processing: Assigning text documents to specific topics or categories.
    • Multi-Label Classification: Involves assigning multiple labels to a single data point. Examples include:
    • Image Tagging: Assigning multiple tags to an image, such as “beach,” “sunset,” and “ocean.”
    • Text Classification: Categorizing documents into multiple relevant topics.
    • Understanding the ImageNet Dataset: The sources reference the ImageNet dataset, a large-scale dataset commonly used in computer vision research, as an example of multi-class classification. They point out that ImageNet contains thousands of object categories, making it a challenging dataset for multi-class classification tasks.
    • Illustrating Multi-Label Classification with Wikipedia: The sources use a Wikipedia article about deep learning as an example of multi-label classification. They point out that the article has multiple categories assigned to it, such as “deep learning,” “artificial neural networks,” and “artificial intelligence,” demonstrating that a single data point (the article) can have multiple labels.
    • Real-World Examples of Classification: The sources provide relatable examples from everyday life to illustrate different classification scenarios:
    • Photo Categorization: Modern smartphone cameras often automatically categorize photos based on their content, such as “people,” “food,” or “landscapes.”
    • Email Filtering: Email services frequently categorize emails into folders like “primary,” “social,” or “promotions,” performing a multi-class classification task.
    • Focusing on Binary and Multi-Class Classification: The sources acknowledge the existence of other types of classification but choose to focus on binary and multi-class classification for the remainder of the section. They indicate that these two types are fundamental and provide a strong foundation for understanding more complex classification scenarios.

    This section of the sources sets the stage for exploring classification problems in PyTorch. They introduce different types of classification, providing examples and real-world applications to illustrate each type. The sources emphasize the importance of understanding binary and multi-class classification as fundamental building blocks for more advanced classification tasks. By providing clear definitions, examples, and a structured approach, the sources prepare readers to build and train classification models using PyTorch.

    Building a Binary Classification Model with PyTorch: Pages 301-310

    The sources begin the practical implementation of a binary classification model using PyTorch. They guide readers through generating a synthetic dataset, exploring its characteristics, and visualizing it to gain insights into the data before proceeding to model building.

    • Generating a Synthetic Dataset with make_circles: The sources introduce the make_circles function from the sklearn.datasets module to create a synthetic dataset for binary classification. This function generates a dataset with two concentric circles, each representing a different class. The sources provide a code example using make_circles to generate 1000 samples, storing the features in the variable X and the corresponding labels in the variable Y. They emphasize the common convention of using capital X to represent a matrix of features and capital Y for labels.
    • Exploring the Dataset: The sources guide readers through exploring the characteristics of the generated dataset:
    • Examining the First Five Samples: The sources provide code to display the first five samples of both features (X) and labels (Y) using array slicing. They use print() statements to display the output, encouraging readers to visually inspect the data.
    • Formatting for Clarity: The sources emphasize the importance of presenting data in a readable format. They use a dictionary to structure the data, mapping feature names (X1 and X2) to the corresponding values and including the label (Y). This structured format enhances the readability and interpretation of the data.
    • Visualizing the Data: The sources highlight the importance of visualizing data, especially in classification tasks. They emphasize the data explorer’s motto: “visualize, visualize, visualize.” They point out that while patterns might not be evident from numerical data alone, visualization can reveal underlying structures and relationships.
    • Visualizing with Matplotlib: The sources introduce Matplotlib, a popular Python plotting library, for visualizing the generated dataset. They provide a code example using plt.scatter() to create a scatter plot of the data, with different colors representing the two classes. The visualization reveals the circular structure of the data, with one class forming an inner circle and the other class forming an outer circle. This visual representation provides a clear understanding of the dataset’s characteristics and the challenge posed by the binary classification task.
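
    A minimal version of this visualization, assuming Matplotlib and scikit-learn are installed (the colormap is an illustrative choice):

    import matplotlib.pyplot as plt
    from sklearn.datasets import make_circles

    X, y = make_circles(n_samples=1000, noise=0.03, random_state=42)

    # Color each point by its class label to reveal the two concentric circles
    plt.scatter(x=X[:, 0], y=X[:, 1], c=y, cmap=plt.cm.RdYlBu)
    plt.xlabel("X1")
    plt.ylabel("X2")
    plt.show()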

    This section of the sources marks the beginning of hands-on model building with PyTorch. They start by generating a synthetic dataset using make_circles, allowing for controlled experimentation and a clear understanding of the data’s structure. They guide readers through exploring the dataset’s characteristics, both numerically and visually. The use of Matplotlib to visualize the data reinforces the importance of understanding data patterns before proceeding to model development. By emphasizing the data explorer’s motto, the sources encourage readers to actively engage with the data and gain insights that will inform their subsequent modeling choices.

    Exploring Model Architecture and PyTorch Fundamentals: Pages 311-320

    The sources proceed with building a simple neural network model using PyTorch, introducing key components like layers, neurons, activation functions, and matrix operations. They guide readers through understanding the model’s architecture, emphasizing the connection between the code and its visual representation. They also highlight PyTorch’s role in handling computations and the importance of visualizing the network’s structure.

    • Creating a Simple Neural Network Model: The sources guide readers through creating a basic neural network model in PyTorch. They introduce the concept of layers, representing different stages of computation in the network, and neurons, the individual processing units within each layer. They provide code to construct a model with:
    • An Input Layer: Takes in two features, corresponding to the X1 and X2 features from the generated dataset.
    • A Hidden Layer: Consists of five neurons, introducing the idea of hidden layers for learning complex patterns.
    • An Output Layer: Produces a single output, suitable for binary classification.
    • Relating Code to Visual Representation: The sources emphasize the importance of understanding the connection between the code and its visual representation. They encourage readers to visualize the network’s structure, highlighting the flow of data through the input, hidden, and output layers. This visualization clarifies how the network processes information and makes predictions.
    • PyTorch’s Role in Computation: The sources explain that while they write the code to define the model’s architecture, PyTorch handles the underlying computations. PyTorch takes care of matrix operations, activation functions, and other mathematical processes involved in training and using the model.
    • Illustrating Network Structure with torch.nn.Linear: The sources use the torch.nn.Linear module to create the layers in the neural network. They provide code examples demonstrating how to define the input and output dimensions for each layer, emphasizing that the output of one layer becomes the input to the subsequent layer.
    • Understanding Input and Output Shapes: The sources emphasize the significance of input and output shapes in neural networks. They explain that the input shape corresponds to the number of features in the data, while the output shape depends on the type of problem. In this case, the binary classification model has an output shape of one, representing a single probability score for the positive class.
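
    A small sketch of how these dimensions line up, passing a dummy batch through the layers (the batch size of 10 is arbitrary):

    import torch
    from torch import nn

    layer_1 = nn.Linear(in_features=2, out_features=5)  # input -> hidden
    layer_2 = nn.Linear(in_features=5, out_features=1)  # hidden -> output

    x = torch.rand(10, 2)                    # batch of 10 samples, 2 features each
    hidden = layer_1(x)                      # shape: (10, 5)
    output = torch.sigmoid(layer_2(hidden))  # shape: (10, 1), values in (0, 1)

    print(hidden.shape, output.shape)  # torch.Size([10, 5]) torch.Size([10, 1])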

    This section of the sources introduces readers to the fundamental concepts of building neural networks in PyTorch. They guide through creating a simple binary classification model, explaining the key components like layers, neurons, and activation functions. The sources emphasize the importance of visualizing the network’s structure and understanding the connection between the code and its visual representation. They highlight PyTorch’s role in handling computations and guide readers through defining the input and output shapes for each layer, ensuring the model’s structure aligns with the dataset and the classification task. By combining code examples with clear explanations, the sources provide a solid foundation for building and understanding neural networks in PyTorch.

    Setting up for Success: Approaching the PyTorch Deep Learning Course: Pages 321-330

    The sources transition from the specifics of model architecture to a broader discussion about navigating the PyTorch deep learning course effectively. They emphasize the importance of active learning, self-directed exploration, and leveraging available resources to enhance understanding and skill development.

    • Embracing Google and Exploration: The sources advocate for active learning and encourage learners to “Google it.” They suggest that encountering unfamiliar concepts or terms should prompt learners to independently research and explore, using search engines like Google to delve deeper into the subject matter. This approach fosters a self-directed learning style and encourages learners to go beyond the course materials.
    • Prioritizing Hands-On Experience: The sources stress the significance of hands-on experience over theoretical definitions. They acknowledge that while definitions are readily available online, the focus of the course is on practical implementation and building models. They encourage learners to prioritize coding and experimentation to solidify their understanding of PyTorch.
    • Utilizing Wikipedia for Definitions: The sources specifically recommend Wikipedia as a reliable resource for looking up definitions. They recognize Wikipedia’s comprehensive and well-maintained content, suggesting it as a valuable tool for learners seeking clear and accurate explanations of technical terms.
    • Structuring the Course for Effective Learning: The sources outline a structured approach to the course, breaking down the content into manageable modules and emphasizing a sequential learning process. They introduce the concept of “chapters” as distinct units of learning, each covering specific topics and building upon previous knowledge.
    • Encouraging Questions and Discussion: The sources foster an interactive learning environment, encouraging learners to ask questions and engage in discussions. They highlight the importance of seeking clarification and sharing insights with instructors and peers to enhance the learning experience. They recommend utilizing online platforms, such as GitHub discussion pages, for asking questions and engaging in course-related conversations.
    • Providing Course Materials on GitHub: The sources ensure accessibility to course materials by making them readily available on GitHub. They specify the repository where learners can access code, notebooks, and other resources used throughout the course. They also mention “learnpytorch.io” as an alternative location where learners can find an online, readable book version of the course content.

    This section of the sources provides guidance on approaching the PyTorch deep learning course effectively. The sources encourage a self-directed learning style, emphasizing the importance of active exploration, independent research, and hands-on experimentation. They recommend utilizing online resources, including search engines and Wikipedia, for in-depth understanding and advocate for engaging in discussions and seeking clarification. By outlining a structured approach, providing access to comprehensive course materials, and fostering an interactive learning environment, the sources aim to equip learners with the necessary tools and mindset for a successful PyTorch deep learning journey.

    Navigating Course Resources and Documentation: Pages 331-340

    The sources guide learners on how to effectively utilize the course resources and navigate PyTorch documentation to enhance their learning experience. They emphasize the importance of referring to the materials provided on GitHub, engaging in Q&A sessions, and familiarizing oneself with the structure and features of the online book version of the course.

    • Identifying Key Resources: The sources highlight three primary resources for the PyTorch course:
    • Materials on GitHub: The sources specify a GitHub repository (“mrdbourke/pytorch-deep-learning” [1]) as the central location for accessing course materials, including outlines, code, notebooks, and additional resources. This repository serves as a comprehensive hub for learners to find everything they need to follow along with the course. They note that the repository is a work in progress [1] but assure users that its organization will remain largely the same [1].
    • Course Q&A: The sources emphasize the importance of asking questions and seeking clarification throughout the learning process. They encourage learners to utilize the designated Q&A platform, likely a forum or discussion board, to post their queries and engage with instructors and peers. This interactive component of the course fosters a collaborative learning environment and provides a valuable avenue for resolving doubts and gaining insights.
    • Course Online Book (learnpytorch.io): The sources recommend referring to the online book version of the course, accessible at “learnpytorch.io” [2, 3]. This platform offers a structured and readable format for the course content, presenting the material in a more organized and comprehensive manner compared to the video lectures. The online book provides learners with a valuable resource to reinforce their understanding and revisit concepts in a more detailed format.
    • Navigating the Online Book: The sources describe the key features of the online book platform, highlighting its user-friendly design and functionality:
    • Readable Format and Search Functionality: The online book presents the course content in a clear and easily understandable format, making it convenient for learners to review and grasp the material. Additionally, the platform offers search functionality, enabling learners to quickly locate specific topics or concepts within the book. This feature enhances the book’s usability and allows learners to efficiently find the information they need.
    • Structured Headings and Images: The online book utilizes structured headings and includes relevant images to organize and illustrate the content effectively. The use of headings breaks down the material into logical sections, improving readability and comprehension. The inclusion of images provides visual aids to complement the textual explanations, further enhancing understanding and engagement.

    This section of the sources focuses on guiding learners on how to effectively utilize the various resources provided for the PyTorch deep learning course. The sources emphasize the importance of accessing the materials on GitHub, actively engaging in Q&A sessions, and utilizing the online book version of the course to supplement learning. By describing the structure and features of these resources, the sources aim to equip learners with the knowledge and tools to navigate the course effectively, enhance their understanding of PyTorch, and ultimately succeed in their deep learning journey.

    Deep Dive into PyTorch Tensors: Pages 341-350

    The sources shift focus to PyTorch tensors, the fundamental data structure for working with numerical data in PyTorch. They explain how to create tensors using various methods and introduce essential tensor operations like indexing, reshaping, and stacking. The sources emphasize the significance of tensors in deep learning, highlighting their role in representing data and performing computations. They also stress the importance of understanding tensor shapes and dimensions for effective manipulation and model building.

    • Introducing the torch.nn Module: The sources introduce the torch.nn module as the core component for building neural networks in PyTorch. They explain that torch.nn provides a collection of classes and functions for defining and working with various layers, activation functions, and loss functions. They highlight that almost everything in PyTorch relies on torch.tensor as the foundational data structure.
    • Creating PyTorch Tensors: The sources provide a practical introduction to creating PyTorch tensors using the torch.tensor function. They emphasize that this function serves as the primary method for creating tensors, which act as multi-dimensional arrays for storing and manipulating numerical data. They guide readers through basic examples, illustrating how to create tensors from lists of values.
    • Encouraging Exploration of PyTorch Documentation: The sources consistently encourage learners to explore the official PyTorch documentation for in-depth understanding and reference. They specifically recommend spending at least 10 minutes reviewing the documentation for torch.tensor after completing relevant video tutorials. This practice fosters familiarity with PyTorch’s functionalities and encourages a self-directed learning approach.
    • Exploring the torch.arange Function: The sources introduce the torch.arange function for generating tensors containing a sequence of evenly spaced values within a specified range. They provide code examples demonstrating how to use torch.arange to create tensors similar to Python’s built-in range function. They also explain the function’s parameters, including start, end, and step, allowing learners to control the sequence generation. The sketch after this list demonstrates torch.arange alongside the other tensor operations covered here.
    • Highlighting Deprecated Functions: The sources point out that certain PyTorch functions, like torch.range, may become deprecated over time as the library evolves. They inform learners about such deprecations and recommend using updated functions like torch.arange as alternatives. This awareness ensures learners are using the most current and recommended practices.
    • Addressing Tensor Shape Compatibility in Reshaping: The sources discuss the concept of shape compatibility when reshaping tensors using the torch.reshape function. They emphasize that the new shape specified for the tensor must be compatible with the original number of elements in the tensor. They provide examples illustrating both compatible and incompatible reshaping scenarios, explaining the potential errors that may arise when incompatibility occurs. They also note that encountering and resolving errors during coding is a valuable learning experience, promoting problem-solving skills.
    • Understanding Tensor Stacking with torch.stack: The sources introduce the torch.stack function for combining multiple tensors along a new dimension. They explain that stacking effectively concatenates tensors, creating a higher-dimensional tensor. They guide readers through code examples, demonstrating how to use torch.stack to combine tensors and control the stacking dimension using the dim parameter. They also reference the torch.stack documentation, encouraging learners to review it for a comprehensive understanding of the function’s usage.
    • Illustrating Tensor Permutation with torch.permute: The sources delve into the torch.permute function for rearranging the dimensions of a tensor. They explain that permuting changes the order of axes in a tensor, effectively reshaping it without altering the underlying data. They provide code examples demonstrating how to use torch.permute to change the order of dimensions, illustrating the transformation of tensor shape. They also connect this concept to real-world applications, particularly in image processing, where permuting can be used to rearrange color channels, height, and width dimensions.
    • Explaining Random Seed for Reproducibility: The sources address the importance of setting a random seed for reproducibility in deep learning experiments. They introduce the concept of pseudo-random number generators and explain how setting a random seed ensures consistent results when working with random processes. They link to PyTorch documentation for further exploration of random number generation and the role of random seeds.
    • Providing Guidance on Exercises and Curriculum: The sources transition to discussing exercises and additional curriculum for learners to solidify their understanding of PyTorch fundamentals. They refer to the “PyTorch fundamentals notebook,” which likely contains a collection of exercises and supplementary materials for learners to practice the concepts covered in the course. They recommend completing these exercises to reinforce learning and gain hands-on experience. They also mention that each chapter in the online book concludes with exercises and extra curriculum, providing learners with ample opportunities for practice and exploration.
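
    The tensor operations discussed above can be tried in a few lines; the shapes in the comments follow directly from each call:

    import torch

    # torch.arange: evenly spaced values (torch.range is deprecated)
    x = torch.arange(start=0, end=10, step=1)   # tensor([0, 1, ..., 9])

    # reshape: the new shape must hold the same number of elements (2 * 5 = 10)
    x_reshaped = x.reshape(2, 5)

    # stack: concatenate tensors along a new dimension
    x_stacked = torch.stack([x, x, x], dim=0)   # shape: (3, 10)

    # permute: reorder dimensions, e.g. (height, width, channels) -> (channels, height, width)
    image = torch.rand(size=(224, 224, 3))
    image_chw = image.permute(2, 0, 1)          # shape: (3, 224, 224)

    # manual_seed: make random operations reproducible
    torch.manual_seed(42)
    print(torch.rand(2))  # identical values on every run with the same seed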

    This section focuses on introducing PyTorch tensors, a fundamental concept in deep learning, and providing practical examples of tensor manipulation using functions like torch.arange, torch.reshape, and torch.stack. The sources encourage learners to refer to PyTorch documentation for comprehensive understanding and highlight the significance of tensors in representing data and performing computations. By combining code demonstrations with explanations and real-world connections, the sources equip learners with a solid foundation for working with tensors in PyTorch.

    Working with Loss Functions and Optimizers in PyTorch: Pages 351-360

    The sources transition to a discussion of loss functions and optimizers, crucial components of the training process for neural networks in PyTorch. They explain that loss functions measure the difference between model predictions and actual target values, guiding the optimization process towards minimizing this difference. They introduce different types of loss functions suitable for various machine learning tasks, such as binary classification and multi-class classification, highlighting their specific applications and characteristics. The sources emphasize the significance of selecting an appropriate loss function based on the nature of the problem and the desired model output. They also explain the role of optimizers in adjusting model parameters to reduce the calculated loss, introducing common optimizer choices like Stochastic Gradient Descent (SGD) and Adam, each with its unique approach to parameter updates.

    • Understanding Binary Cross Entropy Loss: The sources introduce binary cross entropy loss as a commonly used loss function for binary classification problems, where the model predicts one of two possible classes. They note that PyTorch provides multiple implementations of binary cross entropy loss, including torch.nn.BCELoss and torch.nn.BCEWithLogitsLoss. They highlight a key distinction: torch.nn.BCELoss requires inputs to have already passed through the sigmoid activation function, while torch.nn.BCEWithLogitsLoss incorporates the sigmoid activation internally, offering enhanced numerical stability. The sources emphasize the importance of understanding these differences and selecting the appropriate implementation based on the model’s structure and activation functions. The code sketch after this list contrasts the two implementations.
    • Exploring Loss Functions and Optimizers for Diverse Problems: The sources emphasize that PyTorch offers a wide range of loss functions and optimizers suitable for various machine learning problems beyond binary classification. They recommend referring to the online book version of the course for a comprehensive overview and code examples of different loss functions and optimizers applicable to diverse tasks. This comprehensive resource aims to equip learners with the knowledge to select appropriate components for their specific machine learning applications.
    • Outlining the Training Loop Steps: The sources outline the key steps involved in a typical training loop for a neural network:
    1. Forward Pass: Input data is fed through the model to obtain predictions.
    2. Loss Calculation: The difference between predictions and actual target values is measured using the chosen loss function.
    3. Optimizer Zeroing Gradients: Accumulated gradients from previous iterations are reset to zero.
    4. Backpropagation: Gradients of the loss function with respect to model parameters are calculated, indicating the direction and magnitude of parameter adjustments needed to minimize the loss.
    5. Optimizer Step: Model parameters are updated based on the calculated gradients and the optimizer’s update rule.
    • Applying Sigmoid Activation for Binary Classification: The sources emphasize the importance of applying the sigmoid activation function to the raw output (logits) of a binary classification model before making predictions. They explain that the sigmoid function transforms the logits into a probability value between 0 and 1, representing the model’s confidence in each class.
    • Illustrating Tensor Rounding and Dimension Squeezing: The sources demonstrate the use of torch.round to round tensor values to the nearest integer, often used for converting predicted probabilities into class labels in binary classification. They also explain the use of torch.squeeze to remove singleton dimensions from tensors, ensuring compatibility for operations requiring specific tensor shapes.
    • Structuring Training Output for Clarity: The sources highlight the practice of organizing training output to enhance clarity and monitor progress. They suggest printing relevant metrics like epoch number, loss, and accuracy at regular intervals, allowing users to track the model’s learning progress over time.
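
    A brief sketch contrasting the two loss implementations and showing torch.round and torch.squeeze in use (the random tensors stand in for real model outputs):

    import torch
    from torch import nn

    torch.manual_seed(42)
    logits = torch.randn(5, 1)                     # raw model outputs (logits)
    targets = torch.randint(0, 2, (5, 1)).float()  # binary labels

    # BCELoss expects probabilities, so apply the sigmoid first
    loss_a = nn.BCELoss()(torch.sigmoid(logits), targets)

    # BCEWithLogitsLoss applies the sigmoid internally (more numerically stable)
    loss_b = nn.BCEWithLogitsLoss()(logits, targets)
    print(loss_a, loss_b)  # equal up to floating-point precision

    # Round probabilities to class labels, then drop the singleton dimension
    preds = torch.round(torch.sigmoid(logits)).squeeze()  # shape: (5,)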

    This section introduces the concepts of loss functions and optimizers in PyTorch, emphasizing their importance in the training process. It guides learners on choosing suitable loss functions based on the problem type and provides insights into common optimizer choices. By explaining the steps involved in a typical training loop and showcasing practical code examples, the sources aim to equip learners with a solid understanding of how to train neural networks effectively in PyTorch.

    Building and Evaluating a PyTorch Model: Pages 361-370

    The sources transition to the practical application of the previously introduced concepts, guiding readers through the process of building, training, and evaluating a PyTorch model for a specific task. They emphasize the importance of structuring code clearly and organizing output for better understanding and analysis. The sources highlight the iterative nature of model development, involving multiple steps of training, evaluation, and refinement.

    • Defining a Simple Linear Model: The sources provide a code example demonstrating how to define a simple linear model in PyTorch using torch.nn.Linear. They explain that this model takes a specified number of input features and produces a corresponding number of output features, performing a linear transformation on the input data. They stress that while this simple model may not be suitable for complex tasks, it serves as a foundational example for understanding the basics of building neural networks in PyTorch.
    • Emphasizing Visualization in Data Exploration: The sources reiterate the importance of visualization in data exploration, encouraging readers to represent data visually to gain insights and understand patterns. They advocate for the “data explorer’s motto: visualize, visualize, visualize,” suggesting that visualizing data helps users become more familiar with its structure and characteristics, aiding in the model development process.
    • Preparing Data for Model Training: The sources outline the steps involved in preparing data for model training, which often includes splitting data into training and testing sets. They explain that the training set is used to train the model, while the testing set is used to evaluate its performance on unseen data. They introduce a simple method for splitting data based on a predetermined index and mention the popular scikit-learn library’s train_test_split function as a more robust method for random data splitting. They highlight that data splitting ensures that the model’s ability to generalize to new data is assessed accurately.
    • Creating a Training Loop: The sources provide a code example demonstrating the creation of a training loop, a fundamental component of training neural networks. The training loop iterates over the training data for a specified number of epochs, performing the steps outlined previously: forward pass, loss calculation, optimizer zeroing gradients, backpropagation, and optimizer step. They emphasize that one epoch represents a complete pass through the entire training dataset. They also explain the concept of a “training loop” as the iterative process of updating model parameters over multiple epochs to minimize the loss function. They provide guidance on customizing the training loop, such as printing out loss and other metrics at specific intervals to monitor training progress.
    • Visualizing Loss and Parameter Convergence: The sources encourage visualizing the loss function’s value over epochs to observe its convergence, indicating the model’s learning progress. They also suggest tracking changes in model parameters (weights and bias) to understand how they adjust during training to minimize the loss. The sources highlight that these visualizations provide valuable insights into the training process and help users assess the model’s effectiveness.
    • Understanding the Concept of Overfitting: The sources introduce the concept of overfitting, a common challenge in machine learning, where a model performs exceptionally well on the training data but poorly on unseen data. They explain that overfitting occurs when the model learns the training data too well, capturing noise and irrelevant patterns that hinder its ability to generalize. They mention that techniques like early stopping, regularization, and data augmentation can mitigate overfitting, promoting better model generalization.
    • Evaluating Model Performance: The sources guide readers through evaluating a trained model’s performance using the testing set, data that the model has not seen during training. They calculate the loss on the testing set to assess how well the model generalizes to new data. They emphasize the importance of evaluating the model on data separate from the training set to obtain an unbiased estimate of its real-world performance. They also introduce the idea of visualizing model predictions alongside the ground truth data (actual labels) to gain qualitative insights into the model’s behavior.
    • Saving and Loading a Trained Model: The sources highlight the significance of saving a trained PyTorch model to preserve its learned parameters for future use. They provide a code example demonstrating how to save the model’s state dictionary, which contains the trained weights and biases, using torch.save. They also show how to load a saved model using torch.load, enabling users to reuse trained models without retraining.
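
    A sketch of the save/load round trip; the file path and the one-layer model are placeholders for a real trained model:

    import torch
    from torch import nn
    from pathlib import Path

    model = nn.Linear(in_features=1, out_features=1)  # stand-in for a trained model

    # Save only the state dict (the learned weights and biases)
    MODEL_PATH = Path("models/01_model.pth")          # illustrative path
    MODEL_PATH.parent.mkdir(parents=True, exist_ok=True)
    torch.save(obj=model.state_dict(), f=MODEL_PATH)

    # To load, create a fresh instance of the same architecture, then load the weights
    loaded_model = nn.Linear(in_features=1, out_features=1)
    loaded_model.load_state_dict(torch.load(f=MODEL_PATH))
    loaded_model.eval()  # switch to evaluation mode before making predictions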

    This section guides readers through the practical steps of building, training, and evaluating a simple linear model in PyTorch. The sources emphasize visualization as a key aspect of data exploration and model understanding. By combining code examples with clear explanations and introducing essential concepts like overfitting and model evaluation, the sources equip learners with a practical foundation for building and working with neural networks in PyTorch.

    Understanding Neural Networks and PyTorch Resources: Pages 371-380

    The sources shift focus to neural networks, providing a conceptual understanding and highlighting resources for further exploration. They encourage active learning by posing challenges to readers, prompting them to apply their knowledge and explore concepts independently. The sources also emphasize the practical aspects of learning PyTorch, advocating for a hands-on approach with code over theoretical definitions.

    • Encouraging Exploration of Neural Network Definitions: The sources acknowledge the abundance of definitions for neural networks available online and encourage readers to formulate their own understanding by exploring various sources. They suggest engaging with external resources like Google searches and Wikipedia to broaden their knowledge and develop a personal definition of neural networks.
    • Recommending a Hands-On Approach to Learning: The sources advocate for a hands-on approach to learning PyTorch, emphasizing the importance of practical experience over theoretical definitions. They prioritize working with code and experimenting with different concepts to gain a deeper understanding of the framework.
    • Presenting Key PyTorch Resources: The sources introduce valuable resources for learning PyTorch, including:
    • GitHub Repository: A repository containing all course materials, including code examples, notebooks, and supplementary resources.
    • Course Q&A: A dedicated platform for asking questions and seeking clarification on course content.
    • Online Book: A comprehensive online book version of the course, providing in-depth explanations and code examples.
    • Highlighting Benefits of the Online Book: The sources highlight the advantages of the online book version of the course, emphasizing its user-friendly features:
    • Searchable Content: Users can easily search for specific topics or keywords within the book.
    • Interactive Elements: The book incorporates interactive elements, allowing users to engage with the content more dynamically.
    • Comprehensive Material: The book covers a wide range of PyTorch concepts and provides in-depth explanations.
    • Demonstrating PyTorch Documentation Usage: The sources demonstrate how to effectively utilize PyTorch documentation, emphasizing its value as a reference guide. They showcase examples of searching for specific functions within the documentation, highlighting the clear explanations and usage examples provided.
    • Addressing Common Errors in Deep Learning: The sources acknowledge that shape errors are common in deep learning, emphasizing the importance of understanding tensor shapes and dimensions for successful model implementation. They provide examples of shape errors encountered during code demonstrations, illustrating how mismatched tensor dimensions can lead to errors. They encourage users to pay close attention to tensor shapes and use debugging techniques to identify and resolve such issues.
    • Introducing the Concept of Tensor Stacking: The sources introduce the concept of tensor stacking using torch.stack, explaining its functionality in concatenating a sequence of tensors along a new dimension. They clarify the dim parameter, which specifies the dimension along which the stacking operation is performed. They provide code examples demonstrating the usage of torch.stack and its impact on tensor shapes, emphasizing its utility in combining tensors effectively.
    • Explaining Tensor Permutation: The sources explain tensor permutation as a method for rearranging the dimensions of a tensor using torch.permute. They emphasize that permuting a tensor changes how the data is viewed without altering the underlying data itself. They illustrate the concept with an example of permuting a tensor representing color channels, height, and width of an image, highlighting how the permutation operation reorders these dimensions while preserving the image data.
    • Introducing Indexing on Tensors: The sources introduce the concept of indexing on tensors, a fundamental operation for accessing specific elements or subsets of data within a tensor. They present a challenge to readers, asking them to practice indexing on a given tensor to extract specific values. This exercise aims to reinforce the understanding of tensor indexing and its practical application.
    • Explaining Random Seed and Random Number Generation: The sources explain the concept of a random seed in the context of random number generation, highlighting its role in controlling the reproducibility of random processes. They mention that setting a random seed ensures that the same sequence of random numbers is generated each time the code is executed, enabling consistent results for debugging and experimentation. They provide external resources, such as documentation links, for those interested in delving deeper into random number generation concepts in computing.
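
    To make these bullets concrete, here is a minimal sketch combining stacking, permutation, indexing, and seeding; the tensors and shapes are made-up examples, not the course's exact code:

    import torch

    torch.manual_seed(42)                     # fix the random seed for reproducibility

    x = torch.rand(3, 4)                      # 2-D tensor of random values in [0, 1)
    stacked = torch.stack([x, x, x], dim=0)   # stack three copies along a new dimension
    print(stacked.shape)                      # torch.Size([3, 3, 4])

    image = torch.rand(224, 224, 3)           # fake image: (height, width, channels)
    permuted = image.permute(2, 0, 1)         # rearrange to (channels, height, width)
    print(permuted.shape)                     # torch.Size([3, 224, 224])

    print(x[0])                               # first row of x
    print(x[0, 1])                            # element at row 0, column 1
    print(x[:, 0])                            # first column across all rows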

    This section transitions from general concepts of neural networks to practical aspects of using PyTorch, highlighting valuable resources for further exploration and emphasizing a hands-on learning approach. By demonstrating documentation usage, addressing common errors, and introducing tensor manipulation techniques like stacking, permutation, and indexing, the sources equip learners with essential tools for working effectively with PyTorch.

    Building a Model with PyTorch: Pages 381-390

    The sources guide readers through building a more complex model in PyTorch, introducing the concept of subclassing nn.Module to create custom model architectures. They highlight the importance of understanding the PyTorch workflow, which involves preparing data, defining a model, selecting a loss function and optimizer, training the model, making predictions, and evaluating performance. The sources emphasize that while the steps involved remain largely consistent across different tasks, understanding the nuances of each step and how they relate to the specific problem being addressed is crucial for effective model development.

    • Introducing the nn.Module Class: The sources explain that in PyTorch, neural network models are built by subclassing the nn.Module class, which provides a structured framework for defining model components and their interactions. They highlight that this approach offers flexibility and organization, enabling users to create custom architectures tailored to specific tasks.
    • Defining a Custom Model Architecture: The sources provide a code example demonstrating how to define a custom model architecture by subclassing nn.Module. They emphasize the key components of a model definition (a minimal end-to-end sketch follows this list):
    • Constructor (__init__): This method initializes the model’s layers and other components.
    • Forward Pass (forward): This method defines how the input data flows through the model’s layers during the forward propagation step.
    • Understanding PyTorch Building Blocks: The sources explain that PyTorch provides a rich set of building blocks for neural networks, contained within the torch.nn module. They highlight that nn contains various layers, activation functions, loss functions, and other components essential for constructing neural networks.
    • Illustrating the Flow of Data Through a Model: The sources visually illustrate the flow of data through the defined model, using diagrams to represent the input features, hidden layers, and output. They explain that the input data is passed through a series of linear transformations (nn.Linear layers) and activation functions, ultimately producing an output that corresponds to the task being addressed.
    • Creating a Training Loop with Multiple Epochs: The sources demonstrate how to create a training loop that iterates over the training data for a specified number of epochs, performing the steps involved in training a neural network: forward pass, loss calculation, optimizer zeroing gradients, backpropagation, and optimizer step. They highlight the importance of training for multiple epochs to allow the model to learn from the data iteratively and adjust its parameters to minimize the loss function.
    • Observing Loss Reduction During Training: The sources show the output of the training loop, emphasizing how the loss value decreases over epochs, indicating that the model is learning from the data and improving its performance. They explain that this decrease in loss signifies that the model’s predictions are becoming more aligned with the actual labels.
    • Emphasizing Visual Inspection of Data: The sources reiterate the importance of visualizing data, advocating for visually inspecting the data before making predictions. They highlight that understanding the data’s characteristics and patterns is crucial for informed model development and interpretation of results.
    • Preparing Data for Visualization: The sources guide readers through preparing data for visualization, including splitting it into training and testing sets and organizing it into appropriate data structures. They mention using libraries like matplotlib to create visual representations of the data, aiding in data exploration and understanding.
    • Introducing the torch.no_grad Context: The sources introduce the concept of the torch.no_grad context, explaining its role in performing computations without tracking gradients. They highlight that this context is particularly useful during model evaluation or inference, where gradient calculations are not required, leading to more efficient computation.
    • Defining a Testing Loop: The sources guide readers through defining a testing loop, similar to the training loop, which iterates over the testing data to evaluate the model’s performance on unseen data. They emphasize the importance of evaluating the model on data separate from the training set to obtain an unbiased assessment of its ability to generalize. They outline the steps involved in the testing loop: performing a forward pass, calculating the loss, and accumulating relevant metrics like loss and accuracy.
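
    The workflow described above can be condensed into a minimal end-to-end sketch; the model name, layer sizes, epoch count, and dummy data below are illustrative assumptions rather than the course's exact code:

    import torch
    from torch import nn

    class LinearModel(nn.Module):                  # hypothetical model name
        def __init__(self):
            super().__init__()
            self.layer_1 = nn.Linear(in_features=2, out_features=8)
            self.layer_2 = nn.Linear(in_features=8, out_features=1)

        def forward(self, x):                      # how data flows through the model
            return self.layer_2(self.layer_1(x))

    model = LinearModel()
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    X_train, y_train = torch.rand(100, 2), torch.rand(100, 1)   # dummy data
    X_test, y_test = torch.rand(20, 2), torch.rand(20, 1)

    for epoch in range(100):
        model.train()
        y_pred = model(X_train)                # 1. forward pass
        loss = loss_fn(y_pred, y_train)        # 2. calculate the loss
        optimizer.zero_grad()                  # 3. zero accumulated gradients
        loss.backward()                        # 4. backpropagation
        optimizer.step()                       # 5. update parameters

        model.eval()
        with torch.no_grad():                  # no gradient tracking during testing
            test_loss = loss_fn(model(X_test), y_test)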

    The sources provide a comprehensive walkthrough of building and training a more sophisticated neural network model in PyTorch. They emphasize the importance of understanding the PyTorch workflow, from data preparation to model evaluation, and highlight the flexibility and organization offered by subclassing nn.Module to create custom model architectures. They continue to stress the value of visual inspection of data and encourage readers to explore concepts like data visualization and model evaluation in detail.

    Building and Evaluating Models in PyTorch: Pages 391-400

    The sources focus on training and evaluating a regression model in PyTorch, emphasizing the iterative nature of model development and improvement. They guide readers through the process of building a simple model, training it, evaluating its performance, and identifying areas for potential enhancements. They introduce the concept of non-linearity in neural networks, explaining how the addition of non-linear activation functions can enhance a model’s ability to learn complex patterns.

    • Building a Regression Model with PyTorch: The sources provide a step-by-step guide to building a simple regression model using PyTorch. They showcase the creation of a model with linear layers (nn.Linear), illustrating how to define the input and output dimensions of each layer. They emphasize that for regression tasks, the output layer typically has a single output unit representing the predicted value.
    • Creating a Training Loop for Regression: The sources demonstrate how to create a training loop specifically for regression tasks. They outline the familiar steps involved: forward pass, loss calculation, optimizer zeroing gradients, backpropagation, and optimizer step. They emphasize that the loss function used for regression differs from classification tasks, typically employing mean squared error (MSE) or similar metrics to measure the difference between predicted and actual values.
    • Observing Loss Reduction During Regression Training: The sources show the output of the training loop for the regression model, highlighting how the loss value decreases over epochs, indicating that the model is learning to predict the target values more accurately. They explain that this decrease in loss signifies that the model’s predictions are converging towards the actual values.
    • Evaluating the Regression Model: The sources guide readers through evaluating the trained regression model. They emphasize the importance of using a separate testing dataset to assess the model’s ability to generalize to unseen data. They outline the steps involved in evaluating the model on the testing set, including performing a forward pass, calculating the loss, and accumulating metrics.
    • Visualizing Regression Model Predictions: The sources advocate for visualizing the predictions of the regression model, explaining that visual inspection can provide valuable insights into the model’s performance and potential areas for improvement. They suggest plotting the predicted values against the actual values, allowing users to assess how well the model captures the underlying relationship in the data.
    • Introducing Non-Linearities in Neural Networks: The sources introduce the concept of non-linearity in neural networks, explaining that real-world data often exhibits complex, non-linear relationships. They highlight that incorporating non-linear activation functions into neural network models can significantly enhance their ability to learn and represent these intricate patterns. They mention activation functions like ReLU (Rectified Linear Unit) as common choices for introducing non-linearity.
    • Encouraging Experimentation with Non-Linearities: The sources encourage readers to experiment with different non-linear activation functions, explaining that the choice of activation function can impact model performance. They suggest trying various activation functions and observing their effects on the model’s ability to learn from the data and make accurate predictions.
    • Highlighting the Role of Hyperparameters: The sources emphasize that various components of a neural network, such as the number of layers, number of units in each layer, learning rate, and activation functions, are hyperparameters that can be adjusted to influence model performance. They encourage experimentation with different hyperparameter settings to find optimal configurations for specific tasks.
    • Demonstrating the Impact of Adding Layers: The sources visually demonstrate the effect of adding more layers to a neural network model, explaining that increasing the model’s depth can enhance its ability to learn complex representations. They show how a deeper model, compared to a shallower one, can better capture the intricacies of the data and make more accurate predictions.
    • Illustrating the Addition of ReLU Activation Functions: The sources provide a visual illustration of incorporating ReLU activation functions into a neural network model. They show how ReLU introduces non-linearity by applying a thresholding operation to the output of linear layers, enabling the model to learn non-linear decision boundaries and better represent complex relationships in the data (a code sketch of such a model follows below).
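
    As a hedged illustration of the deeper, non-linear regression model described above (the layer sizes and dummy batch are arbitrary assumptions):

    import torch
    from torch import nn

    # A deeper regression model: linear layers interleaved with ReLU activations
    model = nn.Sequential(
        nn.Linear(in_features=1, out_features=10),
        nn.ReLU(),                                   # non-linearity after layer 1
        nn.Linear(in_features=10, out_features=10),
        nn.ReLU(),                                   # non-linearity after layer 2
        nn.Linear(in_features=10, out_features=1),   # single output for regression
    )

    loss_fn = nn.MSELoss()               # mean squared error, common for regression
    x = torch.rand(32, 1)                # dummy batch: 32 samples, 1 feature each
    print(model(x).shape)                # torch.Size([32, 1])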

    This section guides readers through the process of building, training, and evaluating a regression model in PyTorch, emphasizing the iterative nature of model development. The sources highlight the importance of visualizing predictions and the role of non-linear activation functions in enhancing model capabilities. They encourage experimentation with different architectures and hyperparameters, fostering a deeper understanding of the factors influencing model performance and promoting a data-driven approach to model building.

    Working with Tensors and Data in PyTorch: Pages 401-410

    The sources guide readers through various aspects of working with tensors and data in PyTorch, emphasizing the fundamental role tensors play in deep learning computations. They introduce techniques for creating, manipulating, and understanding tensors, highlighting their importance in representing and processing data for neural networks.

    • Creating Tensors in PyTorch: The sources detail methods for creating tensors in PyTorch, focusing on the torch.arange() function. They explain that torch.arange() generates a tensor containing a sequence of evenly spaced values within a specified range. They provide code examples illustrating the use of torch.arange() with various parameters like start, end, and step to control the generated sequence (a combined sketch of these tensor operations follows this list).
    • Understanding the Deprecation of torch.range(): The sources note that the torch.range() function, previously used for creating tensors with a range of values, has been deprecated in favor of torch.arange(). They encourage users to adopt torch.arange() for creating tensors containing sequences of values.
    • Exploring Tensor Shapes and Reshaping: The sources emphasize the significance of understanding tensor shapes in PyTorch, explaining that the shape of a tensor determines its dimensionality and the arrangement of its elements. They introduce the concept of reshaping tensors, using functions like torch.reshape() to modify a tensor’s shape while preserving its total number of elements. They provide code examples demonstrating how to reshape tensors to match specific requirements for various operations or layers in neural networks.
    • Stacking Tensors Together: The sources introduce the torch.stack() function, explaining its role in concatenating a sequence of tensors along a new dimension. They explain that torch.stack() takes a list of tensors as input and combines them into a higher-dimensional tensor, effectively stacking them together along a specified dimension. They illustrate the use of torch.stack() with code examples, highlighting how it can be used to combine multiple tensors into a single structure.
    • Permuting Tensor Dimensions: The sources explore the concept of permuting tensor dimensions, explaining that it involves rearranging the axes of a tensor. They introduce the torch.permute() function, which reorders the dimensions of a tensor according to specified indices. They demonstrate the use of torch.permute() with code examples, emphasizing its application in tasks like transforming image data from the format (Height, Width, Channels) to (Channels, Height, Width), which is often required by convolutional neural networks.
    • Visualizing Tensors and Their Shapes: The sources advocate for visualizing tensors and their shapes, explaining that visual inspection can aid in understanding the structure and arrangement of tensor data. They suggest using tools like matplotlib to create graphical representations of tensors, allowing users to better comprehend the dimensionality and organization of tensor elements.
    • Indexing and Slicing Tensors: The sources guide readers through techniques for indexing and slicing tensors, explaining how to access specific elements or sub-regions within a tensor. They demonstrate the use of square brackets ([]) for indexing tensors, illustrating how to retrieve elements based on their indices along various dimensions. They further explain how slicing allows users to extract a portion of a tensor by specifying start and end indices along each dimension. They provide code examples showcasing various indexing and slicing operations, emphasizing their role in manipulating and extracting data from tensors.
    • Introducing the Concept of Random Seeds: The sources introduce the concept of random seeds, explaining their significance in controlling the randomness in PyTorch operations that involve random number generation. They explain that setting a random seed ensures that the same sequence of random numbers is generated each time the code is run, promoting reproducibility of results. They provide code examples demonstrating how to set a random seed using torch.manual_seed(), highlighting its importance in maintaining consistency during model training and experimentation.
    • Exploring the torch.rand() Function: The sources explore the torch.rand() function, explaining its role in generating tensors filled with random numbers drawn from a uniform distribution between 0 and 1. They provide code examples demonstrating the use of torch.rand() to create tensors of various shapes filled with random values.
    • Discussing Running Tensors and GPUs: The sources introduce the concept of running tensors on GPUs (Graphics Processing Units), explaining that GPUs offer significant computational advantages for deep learning tasks compared to CPUs. They highlight that PyTorch provides mechanisms for transferring tensors to and from GPUs, enabling users to leverage GPU acceleration for training and inference.
    • Emphasizing Documentation and Extra Resources: The sources consistently encourage readers to refer to the PyTorch documentation for detailed information on functions, modules, and concepts. They also highlight the availability of supplementary resources, including online tutorials, blog posts, and research papers, to enhance understanding and provide deeper insights into various aspects of PyTorch.
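
    A short sketch tying several of these operations together, with made-up shapes and values:

    import torch

    sequence = torch.arange(start=0, end=10, step=1)   # tensor([0, 1, ..., 9])
    reshaped = sequence.reshape(2, 5)                  # same 10 elements, shape (2, 5)

    torch.manual_seed(42)                              # reproducible randomness
    random_tensor = torch.rand(2, 5)                   # uniform values in [0, 1)

    # Device-agnostic setup: use a GPU if one is available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    random_tensor = random_tensor.to(device)           # move the tensor to the device
    print(random_tensor.device)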

    This section guides readers through various techniques for working with tensors and data in PyTorch, highlighting the importance of understanding tensor shapes, reshaping, stacking, permuting, indexing, and slicing operations. They introduce concepts like random seeds and GPU acceleration, emphasizing the importance of leveraging available documentation and resources to enhance understanding and facilitate effective deep learning development using PyTorch.

    Constructing and Training Neural Networks with PyTorch: Pages 411-420

    The sources focus on building and training neural networks in PyTorch, specifically in the context of binary classification tasks. They guide readers through the process of creating a simple neural network architecture, defining a suitable loss function, setting up an optimizer, implementing a training loop, and evaluating the model’s performance on test data. They emphasize the use of activation functions, such as the sigmoid function, to introduce non-linearity into the network and enable it to learn complex decision boundaries.

    • Building a Neural Network for Binary Classification: The sources provide a step-by-step guide to constructing a neural network specifically for binary classification. They show the creation of a model with linear layers (nn.Linear) stacked sequentially, illustrating how to define the input and output dimensions of each layer. They emphasize that the output layer for binary classification tasks typically has a single output unit, representing the probability of the positive class (a combined sketch follows this list).
    • Using the Sigmoid Activation Function: The sources introduce the sigmoid activation function, explaining its role in transforming the output of linear layers into a probability value between 0 and 1. They highlight that the sigmoid function introduces non-linearity into the network, allowing it to model complex relationships between input features and the target class.
    • Creating a Training Loop for Binary Classification: The sources demonstrate the implementation of a training loop tailored for binary classification tasks. They outline the familiar steps involved: forward pass to calculate the loss, optimizer zeroing gradients, backpropagation to calculate gradients, and optimizer step to update model parameters.
    • Understanding Binary Cross-Entropy Loss: The sources explain the concept of binary cross-entropy loss, a common loss function used for binary classification tasks. They describe how binary cross-entropy loss measures the difference between the predicted probabilities and the true labels, guiding the model to learn to make accurate predictions.
    • Calculating Accuracy for Binary Classification: The sources demonstrate how to calculate accuracy for binary classification tasks. They show how to convert the model’s predicted probabilities into binary predictions using a threshold (typically 0.5), comparing these predictions to the true labels to determine the percentage of correctly classified instances.
    • Evaluating the Model on Test Data: The sources emphasize the importance of evaluating the trained model on a separate testing dataset to assess its ability to generalize to unseen data. They outline the steps involved in testing the model, including performing a forward pass on the test data, calculating the loss, and computing the accuracy.
    • Plotting Predictions and Decision Boundaries: The sources advocate for visualizing the model’s predictions and decision boundaries, explaining that visual inspection can provide valuable insights into the model’s behavior and performance. They suggest using plotting techniques to display the decision boundary learned by the model, illustrating how the model separates data points belonging to different classes.
    • Using Helper Functions to Simplify Code: The sources introduce the use of helper functions to organize and streamline the code for training and evaluating the model. They demonstrate how to encapsulate repetitive tasks, such as plotting predictions or calculating accuracy, into reusable functions, improving code readability and maintainability.
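
    A minimal binary-classification sketch along the lines described above; the architecture, dummy data, and accuracy_fn helper are illustrative assumptions:

    import torch
    from torch import nn

    model = nn.Sequential(
        nn.Linear(in_features=2, out_features=8),
        nn.Linear(in_features=8, out_features=1),
        nn.Sigmoid(),                      # squashes logits into (0, 1) probabilities
    )
    loss_fn = nn.BCELoss()                 # binary cross-entropy on probabilities

    def accuracy_fn(y_true, y_pred):
        # Helper: percentage of predictions that match the true labels
        correct = torch.eq(y_true, y_pred).sum().item()
        return (correct / len(y_pred)) * 100

    X = torch.rand(100, 2)                             # dummy features
    y = torch.randint(0, 2, (100, 1)).float()          # dummy binary labels
    probs = model(X)                                   # forward pass -> probabilities
    loss = loss_fn(probs, y)
    preds = (probs >= 0.5).float()                     # threshold at 0.5
    print(accuracy_fn(y, preds))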

    This section guides readers through the construction and training of neural networks for binary classification in PyTorch. The sources emphasize the use of activation functions to introduce non-linearity, the choice of suitable loss functions and optimizers, the implementation of a training loop, and the evaluation of the model on test data. They highlight the importance of visualizing predictions and decision boundaries and introduce techniques for organizing code using helper functions.

    Exploring Non-Linearities and Multi-Class Classification in PyTorch: Pages 421-430

    The sources continue the exploration of neural networks, focusing on incorporating non-linearities using activation functions and expanding into multi-class classification. They guide readers through the process of enhancing model performance by adding non-linear activation functions, transitioning from binary classification to multi-class classification, choosing appropriate loss functions and optimizers, and evaluating model performance with metrics such as accuracy.

    • Incorporating Non-Linearity with Activation Functions: The sources emphasize the crucial role of non-linear activation functions in enabling neural networks to learn complex patterns and relationships within data. They introduce the ReLU (Rectified Linear Unit) activation function, highlighting its effectiveness and widespread use in deep learning. They explain that ReLU introduces non-linearity by setting negative values to zero and passing positive values unchanged. This simple yet powerful activation function allows neural networks to model non-linear decision boundaries and capture intricate data representations.
    • Understanding the Importance of Non-Linearity: The sources provide insights into the rationale behind incorporating non-linearity into neural networks. They explain that without non-linear activation functions, a neural network, regardless of its depth, would essentially behave as a single linear layer, severely limiting its ability to learn complex patterns. Non-linear activation functions, like ReLU, introduce bends and curves into the model’s decision boundaries, allowing it to capture non-linear relationships and make more accurate predictions.
    • Transitioning to Multi-Class Classification: The sources smoothly transition from binary classification to multi-class classification, where the task involves classifying data into more than two categories. They explain the key differences between binary and multi-class classification, highlighting the need for adjustments in the model’s output layer and the choice of loss function and activation function.
    • Using Softmax for Multi-Class Classification: The sources introduce the softmax activation function, commonly used in the output layer of multi-class classification models. They explain that softmax transforms the raw output scores (logits) of the network into a probability distribution over the different classes, ensuring that the predicted probabilities for all classes sum up to one (see the sketch after this list).
    • Choosing an Appropriate Loss Function for Multi-Class Classification: The sources guide readers in selecting appropriate loss functions for multi-class classification. They discuss cross-entropy loss, a widely used loss function for multi-class classification tasks, explaining how it measures the difference between the predicted probability distribution and the true label distribution.
    • Implementing a Training Loop for Multi-Class Classification: The sources outline the steps involved in implementing a training loop for multi-class classification models. They demonstrate the familiar process of iterating through the training data in batches, performing a forward pass, calculating the loss, backpropagating to compute gradients, and updating the model’s parameters using an optimizer.
    • Evaluating Multi-Class Classification Models: The sources focus on evaluating the performance of multi-class classification models using metrics like accuracy. They explain that accuracy measures the percentage of correctly classified instances over the entire dataset, providing an overall assessment of the model’s predictive ability.
    • Visualizing Multi-Class Classification Results: The sources suggest visualizing the predictions and decision boundaries of multi-class classification models, emphasizing the importance of visual inspection for gaining insights into the model’s behavior and performance. They demonstrate techniques for plotting the decision boundaries learned by the model, showing how the model divides the feature space to separate data points belonging to different classes.
    • Highlighting the Interplay of Linear and Non-linear Functions: The sources emphasize the combined effect of linear transformations (performed by linear layers) and non-linear transformations (introduced by activation functions) in allowing neural networks to learn complex patterns. They explain that the interplay of linear and non-linear functions enables the model to capture intricate data representations and make accurate predictions across a wide range of tasks.
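
    A small sketch of the logits-to-probabilities flow for multi-class classification, using made-up logits and targets; note that nn.CrossEntropyLoss applies softmax internally and therefore expects raw logits:

    import torch
    from torch import nn

    logits = torch.randn(4, 3)                     # dummy batch: 4 samples, 3 classes
    probs = torch.softmax(logits, dim=1)           # probabilities sum to 1 per sample
    print(probs.sum(dim=1))                        # tensor([1., 1., 1., 1.])

    preds = probs.argmax(dim=1)                    # predicted class per sample

    loss_fn = nn.CrossEntropyLoss()                # takes raw logits, not probabilities
    targets = torch.tensor([0, 2, 1, 0])           # made-up true class indices
    loss = loss_fn(logits, targets)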

    This section guides readers through the process of incorporating non-linearity into neural networks using activation functions like ReLU and transitioning from binary to multi-class classification using the softmax activation function. The sources discuss the choice of appropriate loss functions for multi-class classification, demonstrate the implementation of a training loop, and highlight the importance of evaluating model performance using metrics like accuracy and visualizing decision boundaries to gain insights into the model’s behavior. They emphasize the critical role of combining linear and non-linear functions to enable neural networks to effectively learn complex patterns within data.

    Visualizing and Building Neural Networks for Multi-Class Classification: Pages 431-440

    The sources emphasize the importance of visualization in understanding data patterns and building intuition for neural network architectures. They guide readers through the process of visualizing data for multi-class classification, designing a simple neural network for this task, understanding input and output shapes, and selecting appropriate loss functions and optimizers. They introduce tools like PyTorch’s nn.Sequential container to structure models and highlight the flexibility of PyTorch for customizing neural networks.

    • Visualizing Data for Multi-Class Classification: The sources advocate for visualizing data before building models, especially for multi-class classification. They illustrate the use of scatter plots to display data points with different colors representing different classes. This visualization helps identify patterns, clusters, and potential decision boundaries that a neural network could learn.
    • Designing a Neural Network for Multi-Class Classification: The sources demonstrate the construction of a simple neural network for multi-class classification using PyTorch’s nn.Sequential container, which allows for a streamlined definition of the model’s architecture by stacking layers in a sequential order. They show how to define linear layers (nn.Linear) with appropriate input and output dimensions based on the number of features and the number of classes in the dataset (a minimal sketch follows this list).
    • Determining Input and Output Shapes: The sources guide readers in determining the input and output shapes for the different layers of the neural network. They explain that the input shape of the first layer is determined by the number of features in the dataset, while the output shape of the last layer corresponds to the number of classes. The input and output shapes of intermediate layers can be adjusted to control the network’s capacity and complexity. They highlight the importance of ensuring that the input and output dimensions of consecutive layers are compatible for a smooth flow of data through the network.
    • Selecting Loss Functions and Optimizers: The sources discuss the importance of choosing appropriate loss functions and optimizers for multi-class classification. They explain the concept of cross-entropy loss, a commonly used loss function for this type of classification task, and discuss its role in guiding the model to learn to make accurate predictions. They also mention optimizers like Stochastic Gradient Descent (SGD), highlighting their role in updating the model’s parameters to minimize the loss function.
    • Using PyTorch’s nn Module for Neural Network Components: The sources emphasize the use of PyTorch’s nn module, which contains building blocks for constructing neural networks. They specifically demonstrate the use of nn.Linear for creating linear layers and nn.Sequential for structuring the model by combining multiple layers in a sequential manner. They highlight that PyTorch offers a vast array of modules within the nn package for creating diverse and sophisticated neural network architectures.
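
    A minimal sketch of such a model; NUM_FEATURES and NUM_CLASSES are assumed values for illustration:

    import torch
    from torch import nn

    NUM_FEATURES = 2    # assumed: two input features per sample
    NUM_CLASSES = 4     # assumed: four classes in the dataset

    model = nn.Sequential(
        nn.Linear(in_features=NUM_FEATURES, out_features=8),   # input layer
        nn.Linear(in_features=8, out_features=8),              # hidden layer
        nn.Linear(in_features=8, out_features=NUM_CLASSES),    # one output per class
    )

    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)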

    This section encourages the use of visualization to gain insights into data patterns for multi-class classification and guides readers in designing simple neural networks for this task. The sources emphasize the importance of understanding and setting appropriate input and output shapes for the different layers of the network and provide guidance on selecting suitable loss functions and optimizers. They showcase PyTorch’s flexibility and its powerful nn module for constructing neural network architectures.

    Building a Multi-Class Classification Model: Pages 441-450

    The sources continue the discussion of multi-class classification, focusing on designing a neural network architecture and creating a custom MultiClassClassification model in PyTorch. They guide readers through the process of defining the input and output shapes of each layer based on the number of features and classes in the dataset, constructing the model using PyTorch’s nn.Linear and nn.Sequential modules, and testing the data flow through the model with a forward pass. They emphasize the importance of understanding how the shape of data changes as it passes through the different layers of the network.

    • Defining the Neural Network Architecture: The sources present a structured approach to designing a neural network architecture for multi-class classification. They outline the key components of the architecture:
    • Input layer shape: Determined by the number of features in the dataset.
    • Hidden layers: Allow the network to learn complex relationships within the data. The number of hidden layers and the number of neurons (hidden units) in each layer can be customized to control the network’s capacity and complexity.
    • Output layer shape: Corresponds to the number of classes in the dataset. Each output neuron represents a different class.
    • Output activation: Typically uses the softmax function for multi-class classification. Softmax transforms the network’s output scores (logits) into a probability distribution over the classes, ensuring that the predicted probabilities sum to one.
    • Creating a Custom MultiClassClassification Model in PyTorch: The sources guide readers in implementing a custom MultiClassClassification model using PyTorch. They demonstrate how to define the model class, inheriting from PyTorch’s nn.Module, and how to structure the model using nn.Sequential to stack layers in a sequential manner.
    • Using nn.Linear for Linear Transformations: The sources explain the use of nn.Linear for creating linear layers in the neural network. nn.Linear applies a linear transformation to the input data, calculating a weighted sum of the input features and adding a bias term. The weights and biases are the learnable parameters of the linear layer that the network adjusts during training to make accurate predictions.
    • Testing Data Flow Through the Model: The sources emphasize the importance of testing the data flow through the model to ensure that the input and output shapes of each layer are compatible. They demonstrate how to perform a forward pass with dummy data to verify that data can successfully pass through the network without encountering shape errors.
    • Troubleshooting Shape Issues: The sources provide tips for troubleshooting shape issues, highlighting the significance of paying attention to the error messages that PyTorch provides. Error messages related to shape mismatches often provide clues about which layers or operations need adjustments to ensure compatibility.
    • Visualizing Shape Changes with Print Statements: The sources suggest using print statements within the model’s forward method to display the shape of the data as it passes through each layer. This visual inspection helps confirm that data transformations are occurring as expected and aids in identifying and resolving shape-related issues, as shown in the sketch below.
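
    A hedged sketch of this pattern; the class name mirrors the section's MultiClassClassification model, but the layer sizes and dummy input are assumptions:

    import torch
    from torch import nn

    class MultiClassClassification(nn.Module):
        def __init__(self, input_features, output_features, hidden_units=8):
            super().__init__()
            self.linear_layer_stack = nn.Sequential(
                nn.Linear(input_features, hidden_units),
                nn.Linear(hidden_units, hidden_units),
                nn.Linear(hidden_units, output_features),
            )

        def forward(self, x):
            print(f"Input shape: {x.shape}")       # temporary shape-debugging print
            out = self.linear_layer_stack(x)
            print(f"Output shape: {out.shape}")
            return out

    model = MultiClassClassification(input_features=2, output_features=4)
    dummy = torch.rand(5, 2)        # 5 samples with 2 features each
    model(dummy)                    # verify data flows through without shape errors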

    This section guides readers through the process of designing and implementing a multi-class classification model in PyTorch. The sources emphasize the importance of understanding input and output shapes for each layer, utilizing PyTorch’s nn.Linear for linear transformations, using nn.Sequential for structuring the model, and verifying the data flow with a forward pass. They provide tips for troubleshooting shape issues and encourage the use of print statements to visualize shape changes, facilitating a deeper understanding of the model’s architecture and behavior.

    Training and Evaluating the Multi-Class Classification Model: Pages 451-460

    The sources shift focus to the practical aspects of training and evaluating the multi-class classification model in PyTorch. They guide readers through creating a training loop, setting up an optimizer and loss function, implementing a testing loop to evaluate model performance on unseen data, and calculating accuracy as a performance metric. The sources emphasize the iterative nature of model training, involving forward passes, loss calculation, backpropagation, and parameter updates using an optimizer.

    • Creating a Training Loop in PyTorch: The sources emphasize the importance of a training loop in machine learning, which is the process of iteratively training a model on a dataset. They guide readers in creating a training loop in PyTorch, incorporating the following key steps:
    1. Iterating over epochs: An epoch represents one complete pass through the entire training dataset. The number of epochs determines how many times the model will see the training data during the training process.
    2. Iterating over batches: The training data is typically divided into smaller batches to make the training process more manageable and efficient. Each batch contains a subset of the training data.
    3. Performing a forward pass: Passing the input data (a batch of data) through the model to generate predictions.
    4. Calculating the loss: Comparing the model’s predictions to the true labels to quantify how well the model is performing. This comparison is done using a loss function, such as cross-entropy loss for multi-class classification.
    5. Performing backpropagation: Calculating gradients of the loss function with respect to the model’s parameters. These gradients indicate how much each parameter contributes to the overall error.
    6. Updating model parameters: Adjusting the model’s parameters (weights and biases) using an optimizer, such as Stochastic Gradient Descent (SGD). The optimizer uses the calculated gradients to update the parameters in a direction that minimizes the loss function.
    • Setting up an Optimizer and Loss Function: The sources demonstrate how to set up an optimizer and a loss function in PyTorch. They explain that optimizers play a crucial role in updating the model’s parameters to minimize the loss function during training. They showcase the use of the Adam optimizer (torch.optim.Adam), a popular optimization algorithm for deep learning. For the loss function, they use the cross-entropy loss (nn.CrossEntropyLoss), a common choice for multi-class classification tasks.
    • Evaluating Model Performance with a Testing Loop: The sources guide readers in creating a testing loop in PyTorch to evaluate the trained model’s performance on unseen data (the test dataset). The testing loop follows a similar structure to the training loop but without the backpropagation and parameter update steps. It involves performing a forward pass on the test data, calculating the loss, and often using additional metrics like accuracy to assess the model’s generalization capability.
    • Calculating Accuracy as a Performance Metric: The sources introduce accuracy as a straightforward metric for evaluating classification model performance. Accuracy measures the proportion of correctly classified samples in the test dataset, providing a simple indication of how well the model generalizes to unseen data. A combined sketch of the training and testing loops, including an accuracy calculation, appears below.
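
    A combined training-and-testing-loop sketch with the Adam optimizer and cross-entropy loss; the placeholder model and dummy batches are assumptions standing in for a real dataset:

    import torch
    from torch import nn

    # Placeholder model and dummy batches stand in for a real dataset
    model = nn.Sequential(nn.Linear(2, 8), nn.Linear(8, 4))
    train_batches = [(torch.rand(16, 2), torch.randint(0, 4, (16,))) for _ in range(5)]
    test_batches = [(torch.rand(16, 2), torch.randint(0, 4, (16,)))]

    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    for epoch in range(3):                       # 1. iterate over epochs
        model.train()
        for X, y in train_batches:               # 2. iterate over batches
            y_logits = model(X)                  # 3. forward pass
            loss = loss_fn(y_logits, y)          # 4. calculate the loss
            optimizer.zero_grad()
            loss.backward()                      # 5. backpropagation
            optimizer.step()                     # 6. update parameters

        model.eval()
        correct, total = 0, 0
        with torch.no_grad():                    # testing loop: no gradients needed
            for X, y in test_batches:
                preds = model(X).argmax(dim=1)
                correct += (preds == y).sum().item()
                total += len(y)
        print(f"Epoch {epoch}: test accuracy = {correct / total:.2%}")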

    This section emphasizes the importance of the training loop, which iteratively improves the model’s performance by adjusting its parameters based on the calculated loss. It guides readers through implementing the training loop in PyTorch, setting up an optimizer and loss function, creating a testing loop to evaluate model performance, and calculating accuracy as a basic performance metric for classification tasks.

    Refining and Improving Model Performance: Pages 461-470

    The sources guide readers through various strategies for refining and improving the performance of the multi-class classification model. They cover techniques like adjusting the learning rate, experimenting with different optimizers, exploring the concept of nonlinear activation functions, and understanding the idea of running tensors on a Graphics Processing Unit (GPU) for faster training. They emphasize that model improvement in machine learning often involves experimentation, trial-and-error, and a systematic approach to evaluating and comparing different model configurations.

    • Adjusting the Learning Rate: The sources emphasize the importance of the learning rate in the training process. They explain that the learning rate controls the size of the steps the optimizer takes when updating model parameters during backpropagation. A high learning rate may cause the optimizer to overshoot the minimum of the loss function, while a very low learning rate can cause slow convergence, making the training process unnecessarily lengthy. The sources suggest experimenting with different learning rates to find an appropriate balance between speed and convergence.
    • Experimenting with Different Optimizers: The sources highlight the importance of choosing an appropriate optimizer for training neural networks. They mention that different optimizers use different strategies for updating model parameters based on the calculated gradients, and some optimizers might be more suitable than others for specific problems or datasets. The sources encourage readers to experiment with various optimizers available in PyTorch, such as Stochastic Gradient Descent (SGD), Adam, and RMSprop, to observe their impact on model performance (swapping optimizers is a one-line change, as sketched after this list).
    • Introducing Nonlinear Activation Functions: The sources introduce the concept of nonlinear activation functions and their role in enhancing the capacity of neural networks. They explain that linear layers alone can only model linear relationships within the data, limiting the complexity of patterns the model can learn. Nonlinear activation functions, applied to the outputs of linear layers, introduce nonlinearities into the model, enabling it to learn more complex relationships and capture nonlinear patterns in the data. The sources mention the sigmoid activation function as an example, but PyTorch offers a variety of nonlinear activation functions within the nn module.
    • Utilizing GPUs for Faster Training: The sources touch on the concept of running PyTorch tensors on a GPU (Graphics Processing Unit) to significantly speed up the training process. GPUs are specialized hardware designed for parallel computations, making them particularly well-suited for the matrix operations involved in deep learning. By utilizing a GPU, training times can be significantly reduced, allowing for faster experimentation and model development.
    • Improving a Model: The sources discuss the iterative process of improving a machine learning model, highlighting that model development rarely produces optimal results on the first attempt. They suggest a systematic approach involving the following:
    • Starting simple: Beginning with a simpler model architecture and gradually increasing complexity if needed.
    • Experimenting with hyperparameters: Tuning parameters like learning rate, batch size, and the number of hidden layers to find an optimal configuration.
    • Evaluating and comparing results: Carefully analyzing the model’s performance on the training and test datasets, using metrics like loss and accuracy to assess its effectiveness and generalization capabilities.
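
    For illustration, swapping optimizers or learning rates in PyTorch is a one-line change; the placeholder model and the specific learning rates below are assumptions:

    import torch
    from torch import nn

    model = nn.Linear(2, 1)   # placeholder model for illustration

    # Different optimizers (and learning rates) are easy to swap; which works
    # best depends on the problem and usually requires experimentation.
    optimizer_sgd = torch.optim.SGD(model.parameters(), lr=0.1)
    optimizer_adam = torch.optim.Adam(model.parameters(), lr=0.001)
    optimizer_rmsprop = torch.optim.RMSprop(model.parameters(), lr=0.01)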

    This section guides readers in exploring various strategies for refining and improving the multi-class classification model. The sources emphasize the importance of adjusting the learning rate, experimenting with different optimizers, introducing nonlinear activation functions for enhanced model capacity, and leveraging GPUs for faster training. They underscore the iterative nature of model improvement, encouraging readers to adopt a systematic approach involving experimentation, hyperparameter tuning, and thorough evaluation.

    Please note that specific recommendations about optimal learning rates or best optimizers for a given problem may vary depending on the dataset, model architecture, and other factors. These aspects often require experimentation and a deeper understanding of the specific machine learning problem being addressed.

    Exploring the PyTorch Workflow and Model Evaluation: Pages 471-480

    The sources guide readers through crucial aspects of the PyTorch workflow, focusing on saving and loading trained models, understanding common choices for loss functions and optimizers, and exploring additional classification metrics beyond accuracy. They delve into the concept of a confusion matrix as a valuable tool for evaluating classification models, providing deeper insights into the model’s performance across different classes. The sources advocate for a holistic approach to model evaluation, emphasizing that multiple metrics should be considered to gain a comprehensive understanding of a model’s strengths and weaknesses.

    • Saving and Loading Trained PyTorch Models: The sources emphasize the importance of saving trained models in PyTorch. They demonstrate the process of saving a model’s state dictionary, which contains the learned parameters (weights and biases), using torch.save(). They also showcase the process of loading a saved model using torch.load(), enabling users to reuse trained models for inference or further training. (A save/load sketch, followed by a torchmetrics confusion-matrix example, appears after this list.)
    • Common Choices for Loss Functions and Optimizers: The sources present a table summarizing common choices for loss functions and optimizers in PyTorch, specifically tailored for binary and multi-class classification tasks. They provide brief descriptions of each loss function and optimizer, highlighting key characteristics and situations where they are commonly used. For binary classification, they mention the Binary Cross Entropy Loss (nn.BCELoss) and the Stochastic Gradient Descent (SGD) optimizer as common choices. For multi-class classification, they mention the Cross Entropy Loss (nn.CrossEntropyLoss) and the Adam optimizer.
    • Exploring Additional Classification Metrics: The sources introduce additional classification metrics beyond accuracy, emphasizing the importance of considering multiple metrics for a comprehensive evaluation. They touch on precision, recall, the F1 score, confusion matrices, and classification reports as valuable tools for assessing model performance, particularly when dealing with imbalanced datasets or situations where different types of errors carry different weights.
    • Constructing and Interpreting a Confusion Matrix: The sources introduce the confusion matrix as a powerful tool for visualizing the performance of a classification model. They explain that a confusion matrix displays the counts (or proportions) of correctly and incorrectly classified instances for each class. The rows of the matrix typically represent the true classes, while the columns represent the predicted classes. Each cell counts the instances of a given true class that received a given predicted class, so the diagonal holds correct classifications and the off-diagonal cells reveal specific confusions. The sources guide readers through creating a confusion matrix in PyTorch using the torchmetrics library, which provides a dedicated ConfusionMatrix class. They emphasize that confusion matrices offer valuable insights into:
    • True positives (TP): Correctly predicted positive instances.
    • True negatives (TN): Correctly predicted negative instances.
    • False positives (FP): Incorrectly predicted positive instances (Type I errors).
    • False negatives (FN): Incorrectly predicted negative instances (Type II errors).
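
    A sketch of saving/loading a state dict and building a torchmetrics confusion matrix; the file name, placeholder model, and made-up predictions are assumptions, and the ConfusionMatrix signature shown matches recent torchmetrics versions but may vary:

    import torch
    from torch import nn
    from torchmetrics import ConfusionMatrix

    model = nn.Linear(2, 4)   # placeholder model for illustration

    # Save only the learned parameters (the state dict), then load them back
    torch.save(model.state_dict(), "model_0.pth")          # assumed file name
    loaded_model = nn.Linear(2, 4)                         # same architecture required
    loaded_model.load_state_dict(torch.load("model_0.pth"))

    # Confusion matrix via torchmetrics
    confmat = ConfusionMatrix(task="multiclass", num_classes=4)
    preds = torch.tensor([0, 1, 2, 3, 0])                  # made-up predictions
    target = torch.tensor([0, 1, 1, 3, 2])                 # made-up true labels
    print(confmat(preds, target))                          # 4x4 matrix of counts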

    This section highlights the practical steps of saving and loading trained PyTorch models, providing users with the ability to reuse trained models for different purposes. It presents common choices for loss functions and optimizers, aiding users in selecting appropriate configurations for their classification tasks. The sources expand the discussion on classification metrics, introducing additional measures like precision, recall, the F1 score, and the confusion matrix. They advocate for using a combination of metrics to gain a more nuanced understanding of model performance, particularly when addressing real-world problems where different types of errors have varying consequences.

    Visualizing and Evaluating Model Predictions: Pages 481-490

    The sources guide readers through the process of visualizing and evaluating the predictions made by the trained convolutional neural network (CNN) model. They emphasize the importance of going beyond overall accuracy and examining individual predictions to gain a deeper understanding of the model’s behavior and identify potential areas for improvement. The sources introduce techniques for plotting predictions visually, comparing model predictions to ground truth labels, and using a confusion matrix to assess the model’s performance across different classes.

    • Visualizing Model Predictions: The sources introduce techniques for visualizing model predictions on individual images from the test dataset. They suggest randomly sampling a set of images from the test dataset, obtaining the model’s predictions for these images, and then displaying both the images and their corresponding predicted labels. This approach allows for a qualitative assessment of the model’s performance, enabling users to visually inspect how well the model aligns with human perception (a plotting sketch follows this list).
    • Comparing Predictions to Ground Truth: The sources stress the importance of comparing the model’s predictions to the ground truth labels associated with the test images. By visually aligning the predicted labels with the true labels, users can quickly identify instances where the model makes correct predictions and instances where it errs. This comparison helps to pinpoint specific types of images or classes that the model might struggle with, providing valuable insights for further model refinement.
    • Creating a Confusion Matrix for Deeper Insights: The sources reiterate the value of a confusion matrix for evaluating classification models. They guide readers through creating a confusion matrix using libraries like torchmetrics and mlxtend, which offer tools for calculating and visualizing confusion matrices. The confusion matrix provides a comprehensive overview of the model’s performance across all classes, highlighting the counts of true positives, true negatives, false positives, and false negatives. This visualization helps to identify classes that the model might be confusing, revealing patterns of misclassification that can inform further model development or data augmentation strategies.
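
    A plotting sketch in the spirit described above; it assumes a trained model, a test_data dataset of (image, label) pairs, and a class_names list, none of which are defined here:

    import random
    import matplotlib.pyplot as plt
    import torch

    # Assumes a trained `model`, a `test_data` dataset of (image, label) pairs,
    # and a `class_names` list mapping label indices to names
    random.seed(42)
    samples = random.sample(list(test_data), k=9)

    plt.figure(figsize=(9, 9))
    for i, (image, label) in enumerate(samples):
        model.eval()
        with torch.no_grad():
            pred = model(image.unsqueeze(0)).argmax(dim=1).item()
        plt.subplot(3, 3, i + 1)
        plt.imshow(image.squeeze(), cmap="gray")
        color = "g" if pred == label else "r"   # green if correct, red if not
        plt.title(f"Pred: {class_names[pred]} | True: {class_names[label]}", color=color)
        plt.axis(False)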

    This section guides readers through practical techniques for visualizing and evaluating the predictions made by the trained CNN model. The sources advocate for a multi-faceted evaluation approach, emphasizing the value of visually inspecting individual predictions, comparing them to ground truth labels, and utilizing a confusion matrix to analyze the model’s performance across all classes. By combining qualitative and quantitative assessment methods, users can gain a more comprehensive understanding of the model’s capabilities, identify its strengths and weaknesses, and glean insights for potential improvements.

    Getting Started with Computer Vision and Convolutional Neural Networks: Pages 491-500

    The sources introduce the field of computer vision and convolutional neural networks (CNNs), providing readers with an overview of key libraries, resources, and the basic concepts involved in building computer vision models with PyTorch. They guide readers through setting up the necessary libraries, understanding the structure of CNNs, and preparing to work with image datasets. The sources emphasize a hands-on approach to learning, encouraging readers to experiment with code and explore the concepts through practical implementation.

    • Essential Computer Vision Libraries in PyTorch: The sources present several essential libraries commonly used for computer vision tasks in PyTorch, highlighting their functionalities and roles in building and training CNNs:
    • Torchvision: This library serves as the core domain library for computer vision in PyTorch. It provides utilities for data loading, image transformations, pre-trained models, and more. Within torchvision, several sub-modules are particularly relevant:
    • datasets: This module offers a collection of popular computer vision datasets, including ImageNet, CIFAR10, CIFAR100, MNIST, and FashionMNIST, readily available for download and use in PyTorch.
    • models: This module contains a variety of pre-trained CNN architectures, such as ResNet, AlexNet, VGG, and Inception, which can be used directly for inference or fine-tuned for specific tasks.
    • transforms: This module provides a range of image transformations, including resizing, cropping, flipping, and normalization, which are crucial for preprocessing image data before feeding it into a CNN.
    • utils: This module offers helper utilities for working with image tensors, such as arranging a batch of images into a grid for visualization (make_grid) and saving tensors as image files (save_image).
    • Matplotlib: This versatile plotting library is essential for visualizing images, plotting training curves, and exploring data patterns in computer vision tasks.
    • Exploring Convolutional Neural Networks: The sources provide a high-level introduction to CNNs, explaining that they are specialized neural networks designed for processing data with a grid-like structure, such as images. They highlight the key components of a CNN (a tiny CNN sketch in code follows this list):
    • Convolutional Layers: These layers apply a series of learnable filters (kernels) to the input image, extracting features like edges, textures, and patterns. The filters slide across the input image, performing convolutions to produce feature maps that highlight specific characteristics of the image.
    • Pooling Layers: These layers downsample the feature maps generated by convolutional layers, reducing their spatial dimensions while preserving important features. Pooling layers help to make the model more robust to variations in the position of features within the image.
    • Fully Connected Layers: These layers, often found in the final stages of a CNN, connect all the features extracted by the convolutional and pooling layers, enabling the model to learn complex relationships between these features and perform high-level reasoning about the image content.
    • Obtaining and Preparing Image Datasets: The sources guide readers through the process of obtaining image datasets for training computer vision models, emphasizing the importance of:
    • Choosing the right dataset: Selecting a dataset relevant to the specific computer vision task being addressed.
    • Understanding dataset structure: Familiarizing oneself with the organization of images and labels within the dataset, ensuring compatibility with PyTorch’s data loading mechanisms.
    • Preprocessing images: Applying necessary transformations to the images, such as resizing, cropping, normalization, and data augmentation, to prepare them for input into a CNN.
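
    A tiny CNN sketch showing the conv/pool/flatten/linear structure; the layer sizes assume 28x28 grayscale images and ten classes, purely for illustration:

    import torch
    from torch import nn

    cnn = nn.Sequential(
        nn.Conv2d(in_channels=1, out_channels=10, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2),           # 28x28 -> 14x14
        nn.Conv2d(in_channels=10, out_channels=10, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2),           # 14x14 -> 7x7
        nn.Flatten(),                          # 10 * 7 * 7 = 490 features
        nn.Linear(in_features=10 * 7 * 7, out_features=10),   # 10 output classes
    )

    dummy_images = torch.rand(32, 1, 28, 28)   # batch of fake grayscale images
    print(cnn(dummy_images).shape)             # torch.Size([32, 10])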

    This section serves as a starting point for readers venturing into the world of computer vision and CNNs using PyTorch. The sources introduce essential libraries, resources, and basic concepts, equipping readers with the foundational knowledge and tools needed to begin building and training computer vision models. They highlight the structure of CNNs, emphasizing the roles of convolutional, pooling, and fully connected layers in processing image data. The sources stress the importance of selecting appropriate image datasets, understanding their structure, and applying necessary preprocessing steps to prepare the data for training.

    Getting Hands-on with the FashionMNIST Dataset: Pages 501-510

    The sources walk readers through the practical steps involved in working with the FashionMNIST dataset for image classification using PyTorch. They cover checking library versions, exploring the torchvision.datasets module, setting up the FashionMNIST dataset for training, understanding data loaders, and visualizing samples from the dataset. The sources emphasize the importance of familiarizing oneself with the dataset’s structure, accessing its elements, and gaining insights into the images and their corresponding labels.

    • Checking Library Versions for Compatibility: The sources recommend checking the versions of the PyTorch and torchvision libraries to ensure compatibility and leverage the latest features. They provide code snippets to display the version numbers of both libraries using torch.__version__ and torchvision.__version__. This step helps to avoid potential issues arising from version mismatches and ensures a smooth workflow.
    • Exploring the torchvision.datasets Module: The sources introduce the torchvision.datasets module as a valuable resource for accessing a variety of popular computer vision datasets. They demonstrate how to explore the available datasets within this module, providing examples like Caltech101, CIFAR100, CIFAR10, MNIST, FashionMNIST, and ImageNet. The sources explain that these datasets can be easily downloaded and loaded into PyTorch using dedicated functions within the torchvision.datasets module.
    • Setting Up the FashionMNIST Dataset: The sources guide readers through the process of setting up the FashionMNIST dataset for training an image classification model. They outline the following steps:
    1. Importing Necessary Modules: Import the required modules from torchvision.datasets and torchvision.transforms.
    2. Downloading the Dataset: Download the FashionMNIST dataset using the FashionMNIST class from torchvision.datasets, specifying the desired root directory for storing the dataset.
    3. Applying Transformations: Apply transformations to the images using the transforms.Compose function. Common transformations include:
    • transforms.ToTensor(): Converts PIL images (a common format for image data) to PyTorch tensors, scaling pixel values from the 0-255 range down to 0.0-1.0.
    • transforms.Normalize(): Standardizes pixel values using a supplied per-channel mean and standard deviation, which can help to improve model training.
    • Understanding Data Loaders: The sources introduce data loaders as an essential component for efficiently loading and iterating through datasets in PyTorch. They explain that data loaders provide several benefits:
    • Batching: They allow you to easily create batches of data, which is crucial for training models on large datasets that cannot be loaded into memory all at once.
    • Shuffling: They can shuffle the data between epochs, helping to prevent the model from memorizing the order of the data and improving its ability to generalize.
    • Parallel Loading: They support parallel loading of data, which can significantly speed up the training process.
    • Visualizing Samples from the Dataset: The sources emphasize the importance of visualizing samples from the dataset to gain a better understanding of the data being used for training. They provide code examples for iterating through a data loader, extracting image tensors and their corresponding labels, and displaying the images using matplotlib. This visual inspection helps to ensure that the data has been loaded and preprocessed correctly and can provide insights into the characteristics of the images within the dataset.
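    The steps above can be combined into a short, runnable sketch. The root="data" directory and the batch size of 32 are illustrative choices, not values taken from the sources:

    ```python
    import torch
    import torchvision
    from torchvision import datasets, transforms
    from torch.utils.data import DataLoader
    import matplotlib.pyplot as plt

    print(torch.__version__, torchvision.__version__)  # check library versions

    # Download FashionMNIST and convert its PIL images to tensors
    train_data = datasets.FashionMNIST(
        root="data",
        train=True,
        download=True,
        transform=transforms.ToTensor(),
    )

    # Wrap the dataset in a DataLoader for batching and shuffling
    train_dataloader = DataLoader(train_data, batch_size=32, shuffle=True)

    # Visualize the first image of the first batch with its label
    images, labels = next(iter(train_dataloader))
    plt.imshow(images[0].squeeze(), cmap="gray")
    plt.title(train_data.classes[labels[0]])
    plt.show()
    ```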

    This section offers practical guidance on working with the FashionMNIST dataset for image classification. The sources emphasize the importance of checking library versions, exploring available datasets in torchvision.datasets, setting up the FashionMNIST dataset for training, understanding the role of data loaders, and visually inspecting samples from the dataset. By following these steps, readers can effectively load, preprocess, and visualize image data, laying the groundwork for building and training computer vision models.

    Mini-Batches and Building a Baseline Model with Linear Layers: Pages 511-520

    The sources introduce the concept of mini-batches in machine learning, explaining their significance in training models on large datasets. They guide readers through the process of creating mini-batches from the FashionMNIST dataset using PyTorch’s DataLoader class. The sources then demonstrate how to build a simple baseline model using linear layers for classifying images from the FashionMNIST dataset, highlighting the steps involved in setting up the model’s architecture, defining the input and output shapes, and performing a forward pass to verify data flow.

    • The Importance of Mini-Batches: The sources explain that mini-batches play a crucial role in training machine learning models, especially when dealing with large datasets. They break down the dataset into smaller, manageable chunks called mini-batches, which are processed by the model in each training iteration. Using mini-batches offers several advantages:
    • Efficient Memory Usage: Processing the entire dataset at once can overwhelm the computer’s memory, especially for large datasets. Mini-batches allow the model to work on smaller portions of the data, reducing memory requirements and making training feasible.
    • Faster Training: Updating the model’s parameters after each sample can be computationally expensive. Mini-batches enable the model to calculate gradients and update parameters based on a group of samples, leading to faster convergence and reduced training time.
    • Improved Generalization: Training on mini-batches introduces some randomness into the process, as the data is shuffled into different batches each epoch. This gradient noise can help the model learn more robust patterns and improve its ability to generalize to unseen data.
    • Creating Mini-Batches with DataLoader: The sources demonstrate how to create mini-batches from the FashionMNIST dataset using PyTorch’s DataLoader class. The DataLoader class provides a convenient way to iterate through the dataset in batches, handling shuffling, batching, and data loading automatically. It takes the dataset as input, along with the desired batch size and other optional parameters.
    • Building a Baseline Model with Linear Layers: The sources guide readers through the construction of a simple baseline model using linear layers for classifying images from the FashionMNIST dataset. They outline the following steps:
    1. Defining the Model Architecture: The sources start by creating a class called LinearModel that inherits from nn.Module, which is the base class for all neural network modules in PyTorch. Within the class, they define the following layers:
    • A linear layer (nn.Linear) that takes the flattened input image (784 features, representing the 28×28 pixels of a FashionMNIST image) and maps it to a hidden layer with a specified number of units.
    • Another linear layer that maps the hidden layer to the output layer, producing a tensor of scores for each of the 10 classes in FashionMNIST.
    2. Setting Up the Input and Output Shapes: The sources emphasize the importance of aligning the input and output shapes of the linear layers to ensure proper data flow through the model. They specify the input features and output features for each linear layer based on the dataset’s characteristics and the desired number of hidden units.
    3. Performing a Forward Pass: The sources demonstrate how to perform a forward pass through the model using a randomly generated tensor. This step verifies that the data flows correctly through the layers and helps to confirm the expected output shape. They print the output tensor and its shape, providing insights into the model’s behavior.
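    A minimal sketch of such a baseline model, assuming flattened 28×28 FashionMNIST inputs (the choice of 10 hidden units is illustrative):

    ```python
    import torch
    from torch import nn

    class LinearModel(nn.Module):
        def __init__(self, input_features=784, hidden_units=10, output_features=10):
            super().__init__()
            self.layer_stack = nn.Sequential(
                nn.Flatten(),  # [batch, 1, 28, 28] -> [batch, 784]
                nn.Linear(in_features=input_features, out_features=hidden_units),
                nn.Linear(in_features=hidden_units, out_features=output_features),
            )

        def forward(self, x):
            return self.layer_stack(x)

    model = LinearModel()
    dummy_batch = torch.randn(32, 1, 28, 28)  # random tensor shaped like a FashionMNIST batch
    print(model(dummy_batch).shape)           # expected: torch.Size([32, 10])
    ```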

    This section introduces the concept of mini-batches and their importance in machine learning, providing practical guidance on creating mini-batches from the FashionMNIST dataset using PyTorch’s DataLoader class. It then demonstrates how to build a simple baseline model using linear layers for classifying images, highlighting the steps involved in defining the model architecture, setting up the input and output shapes, and verifying data flow through a forward pass. This foundation prepares readers for building more complex convolutional neural networks for image classification tasks.

    Training and Evaluating a Linear Model on the FashionMNIST Dataset: Pages 521-530

    The sources guide readers through the process of training and evaluating the previously built linear model on the FashionMNIST dataset, focusing on creating a training loop, setting up a loss function and an optimizer, calculating accuracy, and implementing a testing loop to assess the model’s performance on unseen data.

    • Setting Up the Loss Function and Optimizer: The sources explain that a loss function quantifies how well the model’s predictions match the true labels, with lower loss values indicating better performance. They discuss common choices for loss functions and optimizers, emphasizing the importance of selecting appropriate options based on the problem and dataset.
    • The sources specifically recommend binary cross-entropy loss (BCE) for binary classification problems and cross-entropy loss (CE) for multi-class classification problems.
    • They highlight that PyTorch provides both nn.BCELoss and nn.CrossEntropyLoss implementations for these loss functions.
    • For the optimizer, the sources mention stochastic gradient descent (SGD) as a common choice, with PyTorch offering the torch.optim.SGD class for its implementation.
    • Creating a Training Loop: The sources outline the fundamental steps involved in a training loop, emphasizing the iterative process of adjusting the model’s parameters to minimize the loss and improve its ability to classify images correctly. The typical steps in a training loop include:
    1. Forward Pass: Pass a batch of data through the model to obtain predictions.
    2. Calculate the Loss: Compare the model’s predictions to the true labels using the chosen loss function.
    3. Optimizer Zero Grad: Reset the gradients calculated from the previous batch to avoid accumulating gradients across batches.
    4. Loss Backward: Perform backpropagation to calculate the gradients of the loss with respect to the model’s parameters.
    5. Optimizer Step: Update the model’s parameters based on the calculated gradients and the optimizer’s learning rate.
    • Calculating Accuracy: The sources introduce accuracy as a metric for evaluating the model’s performance, representing the percentage of correctly classified samples. They provide a code snippet to calculate accuracy by comparing the predicted labels to the true labels.
    • Implementing a Testing Loop: The sources explain the importance of evaluating the model’s performance on a separate set of data, the test set, that was not used during training. This helps to assess the model’s ability to generalize to unseen data and prevent overfitting, where the model performs well on the training data but poorly on new data. The testing loop follows similar steps to the training loop, but without updating the model’s parameters:
    1. Forward Pass: Pass a batch of test data through the model to obtain predictions.
    2. Calculate the Loss: Compare the model’s predictions to the true test labels using the loss function.
    3. Calculate Accuracy: Determine the percentage of correctly classified test samples.

    The sources provide code examples for implementing the training and testing loops, including detailed explanations of each step. They also emphasize the importance of monitoring the loss and accuracy values during training to track the model’s progress and ensure that it is learning effectively. These steps provide a comprehensive understanding of the training and evaluation process, enabling readers to apply these techniques to their own image classification tasks.
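    As a hedged sketch of what these loops can look like (the learning rate and epoch count are illustrative, and `model` and `train_dataloader` carry over from the earlier sketches; a `test_dataloader` built the same way as the training one is also assumed):

    ```python
    import torch
    from torch import nn

    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    def accuracy_fn(y_true, y_pred):
        return (y_pred == y_true).sum().item() / len(y_true) * 100

    for epoch in range(3):
        # Training loop
        model.train()
        for X, y in train_dataloader:
            y_logits = model(X)          # 1. forward pass
            loss = loss_fn(y_logits, y)  # 2. calculate the loss
            optimizer.zero_grad()        # 3. reset gradients from the previous batch
            loss.backward()              # 4. backpropagation
            optimizer.step()             # 5. update parameters

        # Testing loop: same forward pass and loss, but no parameter updates
        model.eval()
        test_loss, test_acc = 0.0, 0.0
        with torch.inference_mode():
            for X, y in test_dataloader:
                test_logits = model(X)
                test_loss += loss_fn(test_logits, y).item()
                test_acc += accuracy_fn(y, test_logits.argmax(dim=1))
        print(f"Epoch {epoch} | test loss: {test_loss / len(test_dataloader):.4f} "
              f"| test accuracy: {test_acc / len(test_dataloader):.2f}%")
    ```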

    Building and Training a Multi-Layer Model with Non-Linear Activation Functions: Pages 531-540

    The sources extend the image classification task by introducing non-linear activation functions and building a more complex multi-layer model. They emphasize the importance of non-linearity in enabling neural networks to learn complex patterns and improve classification accuracy. The sources guide readers through implementing the ReLU (Rectified Linear Unit) activation function and constructing a multi-layer model, demonstrating its performance on the FashionMNIST dataset.

    • The Role of Non-Linear Activation Functions: The sources explain that linear models, while straightforward, are limited in their ability to capture intricate relationships in data. Introducing non-linear activation functions between linear layers enhances the model’s capacity to learn complex patterns. Non-linear activation functions allow the model to approximate non-linear decision boundaries, enabling it to classify data points that are not linearly separable.
    • Introducing ReLU Activation: The sources highlight ReLU as a popular non-linear activation function, known for its simplicity and effectiveness. ReLU replaces negative values in the input tensor with zero, while retaining positive values. This simple operation introduces non-linearity into the model, allowing it to learn more complex representations of the data. The sources provide the code for implementing ReLU in PyTorch using nn.ReLU().
    • Constructing a Multi-Layer Model: The sources guide readers through building a more complex model with multiple linear layers and ReLU activations. They introduce a model with three linear layers separated by ReLU activations (sketched in code below):
    1. A linear layer that takes the flattened input image (784 features) and maps it to a hidden layer with a specified number of units.
    2. A ReLU activation function applied to the output of the first linear layer.
    3. Another linear layer that maps the activated hidden layer to a second hidden layer with a specified number of units.
    4. A ReLU activation function applied to the output of the second linear layer.
    5. A final linear layer that maps the activated second hidden layer to the output layer (10 units, representing the 10 classes in FashionMNIST).
    • Training and Evaluating the Multi-Layer Model: The sources demonstrate how to train and evaluate this multi-layer model using the same training and testing loops described in the previous pages summary. They emphasize that the inclusion of ReLU activations between the linear layers significantly enhances the model’s performance compared to the previous linear models. This improvement highlights the crucial role of non-linearity in enabling neural networks to learn complex patterns and achieve higher classification accuracy.

    The sources provide code examples for implementing the multi-layer model with ReLU activations, showcasing the steps involved in defining the model’s architecture, setting up the layers and activations, and training the model using the established training and testing loops. These examples offer practical guidance on building and training more complex models with non-linear activation functions, laying the foundation for understanding and implementing even more sophisticated architectures like convolutional neural networks.
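    A compact sketch of this architecture using nn.Sequential (10 hidden units per layer is an illustrative choice, not the sources' exact configuration):

    ```python
    from torch import nn

    model_nonlinear = nn.Sequential(
        nn.Flatten(),                                # 28x28 image -> 784 features
        nn.Linear(in_features=784, out_features=10),
        nn.ReLU(),                                   # non-linearity after layer 1
        nn.Linear(in_features=10, out_features=10),
        nn.ReLU(),                                   # non-linearity after layer 2
        nn.Linear(in_features=10, out_features=10),  # 10 output scores, one per class
    )
    ```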

    Improving Model Performance and Visualizing Predictions: Pages 541-550

    The sources discuss strategies for improving the performance of machine learning models, focusing on techniques to enhance a model’s ability to learn from data and make accurate predictions. They also guide readers through visualizing the model’s predictions, providing insights into its decision-making process and highlighting areas for potential improvement.

    • Improving a Model’s Performance: The sources acknowledge that achieving satisfactory results with machine learning models often involves an iterative process of experimentation and refinement. They outline several strategies to improve a model’s performance, emphasizing that the effectiveness of these techniques can vary depending on the complexity of the problem and the characteristics of the dataset. Some common approaches include:
    1. Adding More Layers: Increasing the depth of the neural network by adding more layers can enhance its capacity to learn complex representations of the data. However, adding too many layers can lead to overfitting, especially if the dataset is small.
    2. Adding More Hidden Units: Increasing the number of hidden units within each layer can also enhance the model’s ability to capture intricate patterns. Similar to adding more layers, adding too many hidden units can contribute to overfitting.
    3. Training for Longer: Allowing the model to train for a greater number of epochs can provide more opportunities to adjust its parameters and minimize the loss. However, excessive training can also lead to overfitting, especially if the model’s capacity is high.
    4. Changing the Learning Rate: The learning rate determines the step size the optimizer takes when updating the model’s parameters. A learning rate that is too high can cause the optimizer to overshoot the optimal values, while a learning rate that is too low can slow down convergence. Experimenting with different learning rates can improve the model’s ability to find the optimal parameter values.
    • Visualizing Model Predictions: The sources stress the importance of visualizing the model’s predictions to gain insights into its decision-making process. Visualizations can reveal patterns in the data that the model is capturing and highlight areas where it is struggling to make accurate predictions. The sources guide readers through creating visualizations using Matplotlib, demonstrating how to plot the model’s predictions for different classes and analyze its performance.

    The sources provide practical advice and code examples for implementing these improvement strategies, encouraging readers to experiment with different techniques to find the optimal configuration for their specific problem. They also emphasize the value of visualizing model predictions to gain a deeper understanding of its strengths and weaknesses, facilitating further model refinement and improvement. This section equips readers with the knowledge and tools to iteratively improve their models and enhance their understanding of the model’s behavior through visualizations.
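    One way such a prediction visualization can be sketched, assuming a trained `model` and a `test_data` dataset whose samples are tensors and which exposes a `classes` list (as torchvision datasets do):

    ```python
    import torch
    import matplotlib.pyplot as plt

    model.eval()
    fig = plt.figure(figsize=(9, 9))
    for i in range(9):
        image, label = test_data[i]
        with torch.inference_mode():
            pred = model(image.unsqueeze(0)).argmax(dim=1).item()
        ax = fig.add_subplot(3, 3, i + 1)
        ax.imshow(image.squeeze(), cmap="gray")
        # Green title for a correct prediction, red for an incorrect one
        ax.set_title(f"pred: {test_data.classes[pred]} | true: {test_data.classes[label]}",
                     fontsize=8, c="g" if pred == label else "r")
        ax.axis("off")
    plt.show()
    ```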

    Saving, Loading, and Evaluating Models: Pages 551-560

    The sources shift their focus to the practical aspects of saving, loading, and comprehensively evaluating trained models. They emphasize the importance of preserving trained models for future use, enabling the application of trained models to new data without retraining. The sources also introduce techniques for assessing model performance beyond simple accuracy, providing a more nuanced understanding of a model’s strengths and weaknesses.

    • Saving and Loading Trained Models: The sources highlight the significance of saving trained models to avoid the time and computational expense of retraining. They outline the process of saving a model’s state dictionary, which contains the learned parameters (weights and biases), using PyTorch’s torch.save() function. The sources provide a code example demonstrating how to save a model’s state dictionary to a file, typically with a .pth extension. They also explain how to load a saved model using torch.load(), emphasizing the need to create an instance of the model with the same architecture before loading the saved state dictionary.
    • Making Predictions With a Loaded Model: The sources guide readers through making predictions using a loaded model, emphasizing the importance of setting the model to evaluation mode (model.eval()) before making predictions. Evaluation mode deactivates certain layers, such as dropout, that are used during training but not during inference. They provide a code snippet illustrating the process of loading a saved model, setting it to evaluation mode, and using it to generate predictions on new data.
    • Evaluating Model Performance Beyond Accuracy: The sources acknowledge that accuracy, while a useful metric, can provide an incomplete picture of a model’s performance, especially when dealing with imbalanced datasets where some classes have significantly more samples than others. They introduce the concept of a confusion matrix as a valuable tool for evaluating classification models. A confusion matrix displays the number of correct and incorrect predictions for each class, providing a detailed breakdown of the model’s performance across different classes. The sources explain how to interpret a confusion matrix, highlighting its ability to reveal patterns in misclassifications and identify classes where the model is performing poorly.

    The sources guide readers through the essential steps of saving, loading, and evaluating trained models, equipping them with the skills to manage trained models effectively and perform comprehensive assessments of model performance beyond simple accuracy. This section focuses on the practical aspects of deploying and understanding the behavior of trained models, providing a valuable foundation for applying machine learning models to real-world tasks.
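    A minimal sketch of this save → load → predict pattern, reusing the LinearModel class and dummy batch from the earlier sketch (the file name is illustrative):

    ```python
    import torch

    # Save only the learned parameters (the state dict), not the whole model object
    torch.save(model.state_dict(), "fashion_model.pth")

    # To load, first create an instance with the same architecture
    loaded_model = LinearModel()
    loaded_model.load_state_dict(torch.load("fashion_model.pth"))
    loaded_model.eval()  # evaluation mode: deactivates layers like dropout

    with torch.inference_mode():
        preds = loaded_model(dummy_batch).argmax(dim=1)
    ```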

    Putting it All Together: A PyTorch Workflow and Building a Classification Model: Pages 561-570

    The sources guide readers through a comprehensive PyTorch workflow for building and training a classification model, consolidating the concepts and techniques covered in previous sections. They illustrate this workflow by constructing a binary classification model to classify data points generated using the make_circles dataset in scikit-learn.

    • PyTorch End-to-End Workflow: The sources outline a structured approach to developing PyTorch models, encompassing the following key steps:
    1. Data: Acquire, prepare, and transform data into a suitable format for training. This step involves understanding the dataset, loading the data, performing necessary preprocessing steps, and splitting the data into training and testing sets.
    2. Model: Choose or build a model architecture appropriate for the task, considering the complexity of the problem and the nature of the data. This step involves selecting suitable layers, activation functions, and other components of the model.
    3. Loss Function: Select a loss function that quantifies the difference between the model’s predictions and the actual target values. The choice of loss function depends on the type of problem (e.g., binary classification, multi-class classification, regression).
    4. Optimizer: Choose an optimization algorithm that updates the model’s parameters to minimize the loss function. Popular optimizers include stochastic gradient descent (SGD), Adam, and RMSprop.
    5. Training Loop: Implement a training loop that iteratively feeds the training data to the model, calculates the loss, and updates the model’s parameters using the chosen optimizer.
    6. Evaluation: Evaluate the trained model’s performance on the testing set using appropriate metrics, such as accuracy, precision, recall, and the confusion matrix.
    • Building a Binary Classification Model: The sources demonstrate this workflow by creating a binary classification model to classify data points generated using scikit-learn’s make_circles dataset. They guide readers through:
    1. Generating the Dataset: Using make_circles to create a dataset of data points arranged in concentric circles, with each data point belonging to one of two classes.
    2. Visualizing the Data: Employing Matplotlib to visualize the generated data points, providing a visual representation of the classification task.
    3. Building the Model: Constructing a multi-layer neural network with linear layers and ReLU activation functions. The output layer utilizes the sigmoid activation function to produce probabilities for the two classes.
    4. Choosing the Loss Function and Optimizer: Selecting the binary cross-entropy loss function (nn.BCELoss) and the stochastic gradient descent (SGD) optimizer for this binary classification task.
    5. Implementing the Training Loop: Implementing the training loop to train the model, including the steps for calculating the loss, backpropagation, and updating the model’s parameters.
    6. Evaluating the Model: Assessing the model’s performance using accuracy, precision, recall, and visualizing the predictions.

    The sources provide a clear and structured approach to developing PyTorch models for classification tasks, emphasizing the importance of a systematic workflow that encompasses data preparation, model building, loss function and optimizer selection, training, and evaluation. This section offers a practical guide to applying the concepts and techniques covered in previous sections to build a functioning classification model, preparing readers for more complex tasks and datasets.
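    The full workflow can be condensed into a runnable sketch. The sample counts, layer sizes, learning rate, and epoch count below are illustrative, not the sources' exact values:

    ```python
    import torch
    from torch import nn
    from sklearn.datasets import make_circles
    from sklearn.model_selection import train_test_split

    # 1. Data: two concentric circles, each point belonging to one of two classes
    X, y = make_circles(n_samples=1000, noise=0.03, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    X_train, y_train = torch.tensor(X_train, dtype=torch.float32), torch.tensor(y_train, dtype=torch.float32)
    X_test, y_test = torch.tensor(X_test, dtype=torch.float32), torch.tensor(y_test, dtype=torch.float32)

    # 2. Model: linear layers with ReLU, sigmoid output for a class probability
    model = nn.Sequential(
        nn.Linear(2, 10), nn.ReLU(),
        nn.Linear(10, 10), nn.ReLU(),
        nn.Linear(10, 1), nn.Sigmoid(),
    )

    # 3 & 4. Loss function and optimizer
    loss_fn = nn.BCELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # 5. Training loop (full-batch here for brevity)
    for epoch in range(1000):
        y_pred = model(X_train).squeeze()
        loss = loss_fn(y_pred, y_train)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # 6. Evaluation: accuracy on the held-out test set
    with torch.inference_mode():
        test_pred = (model(X_test).squeeze() > 0.5).float()
        print(f"Test accuracy: {(test_pred == y_test).float().mean() * 100:.1f}%")
    ```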

    Multi-Class Classification with PyTorch: Pages 571-580

    The sources introduce the concept of multi-class classification, expanding on the binary classification discussed in previous sections. They guide readers through building a multi-class classification model using PyTorch, highlighting the key differences and considerations when dealing with problems involving more than two classes. The sources utilize a synthetic dataset of multi-dimensional blobs created using scikit-learn’s make_blobs function to illustrate this process.

    • Multi-Class Classification: The sources distinguish multi-class classification from binary classification, explaining that multi-class classification involves assigning data points to one of several possible classes. They provide examples of real-world multi-class classification problems, such as classifying images into different categories (e.g., cats, dogs, birds) or identifying different types of objects in an image.
    • Building a Multi-Class Classification Model: The sources outline the steps for building a multi-class classification model in PyTorch, emphasizing the adjustments needed compared to binary classification:
    1. Generating the Dataset: Using scikit-learn’s make_blobs function to create a synthetic dataset with multiple classes, where each data point has multiple features and belongs to one specific class.
    2. Visualizing the Data: Utilizing Matplotlib to visualize the generated data points and their corresponding class labels, providing a visual understanding of the multi-class classification problem.
    3. Building the Model: Constructing a neural network with linear layers and ReLU activation functions. The key difference in multi-class classification lies in the output layer. Instead of a single output neuron with a sigmoid activation function, the output layer has multiple neurons, one for each class. The softmax activation function is applied to the output layer to produce a probability distribution over the classes.
    4. Choosing the Loss Function and Optimizer: Selecting an appropriate loss function for multi-class classification, such as the cross-entropy loss (nn.CrossEntropyLoss), and choosing an optimizer like stochastic gradient descent (SGD) or Adam.
    5. Implementing the Training Loop: Implementing the training loop to train the model, similar to binary classification but using the chosen loss function and optimizer for multi-class classification.
    6. Evaluating the Model: Evaluating the performance of the trained model using appropriate metrics for multi-class classification, such as accuracy and the confusion matrix. The sources emphasize that accuracy alone may not be sufficient for evaluating models on imbalanced datasets and suggest exploring other metrics like precision and recall.

    The sources provide a comprehensive guide to building and training multi-class classification models in PyTorch, highlighting the adjustments needed in model architecture, loss function, and evaluation metrics compared to binary classification. By working through a concrete example using the make_blobs dataset, the sources equip readers with the fundamental knowledge and practical skills to tackle multi-class classification problems using PyTorch.
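    A condensed sketch of this multi-class workflow (the cluster count, layer sizes, and hyperparameters are illustrative choices):

    ```python
    import torch
    from torch import nn
    from sklearn.datasets import make_blobs

    # Synthetic data: four blob clusters with two features each
    X, y = make_blobs(n_samples=1000, n_features=2, centers=4, cluster_std=1.5, random_state=42)
    X = torch.tensor(X, dtype=torch.float32)
    y = torch.tensor(y, dtype=torch.long)  # CrossEntropyLoss expects integer class labels

    model = nn.Sequential(
        nn.Linear(2, 8), nn.ReLU(),
        nn.Linear(8, 8), nn.ReLU(),
        nn.Linear(8, 4),  # one output unit per class (raw logits)
    )

    loss_fn = nn.CrossEntropyLoss()  # applies softmax internally, so the model outputs logits
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(100):
        logits = model(X)
        loss = loss_fn(logits, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Logits -> probabilities -> predicted class
    with torch.inference_mode():
        preds = torch.softmax(model(X), dim=1).argmax(dim=1)
        print(f"Accuracy: {(preds == y).float().mean() * 100:.1f}%")
    ```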

    Enhancing a Model and Introducing Nonlinearities: Pages 581-590

    The sources discuss strategies for improving the performance of machine learning models and introduce the concept of nonlinear activation functions, which play a crucial role in enabling neural networks to learn complex patterns in data. They explore ways to enhance a previously built multi-class classification model and introduce the ReLU (Rectified Linear Unit) activation function as a widely used nonlinearity in deep learning.

    • Improving a Model’s Performance: The sources acknowledge that achieving satisfactory results with a machine learning model often involves experimentation and iterative improvement. They present several strategies for enhancing a model’s performance, including:
    1. Adding More Layers: Increasing the depth of the neural network by adding more layers can allow the model to learn more complex representations of the data. The sources suggest that adding layers can be particularly beneficial for tasks with intricate data patterns.
    2. Increasing Hidden Units: Expanding the number of hidden units within each layer can provide the model with more capacity to capture and learn the underlying patterns in the data.
    3. Training for Longer: Extending the number of training epochs can give the model more opportunities to learn from the data and potentially improve its performance. However, training for too long can lead to overfitting, where the model performs well on the training data but poorly on unseen data.
    4. Using a Smaller Learning Rate: Decreasing the learning rate can lead to more stable training and allow the model to converge to a better solution, especially when dealing with complex loss landscapes.
    5. Adding Nonlinearities: Incorporating nonlinear activation functions between layers is essential for enabling neural networks to learn nonlinear relationships in the data. Without nonlinearities, the model would essentially be a series of linear transformations, limiting its ability to capture complex patterns.
    • Introducing the ReLU Activation Function: The sources introduce the ReLU activation function as a widely used nonlinearity in deep learning. They describe ReLU’s simple yet effective operation: it outputs the input directly if the input is positive and outputs zero if the input is negative. Mathematically, ReLU(x) = max(0, x).
    • The sources highlight the benefits of ReLU, including its computational efficiency and its tendency to mitigate the vanishing gradient problem, which can hinder training in deep networks.
    • Incorporating ReLU into the Model: The sources guide readers through adding ReLU activation functions to the previously built multi-class classification model. They demonstrate how to insert ReLU layers between the linear layers of the model, enabling the network to learn nonlinear decision boundaries and improve its ability to classify the data.

    The sources provide a practical guide to improving machine learning model performance and introduce the concept of nonlinearities, emphasizing the importance of ReLU activation functions in enabling neural networks to learn complex data patterns. By incorporating ReLU into the multi-class classification model, the sources showcase the power of nonlinearities in enhancing a model’s ability to capture and represent the underlying structure of the data.
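    The ReLU operation itself can be verified in a couple of lines:

    ```python
    import torch
    from torch import nn

    relu = nn.ReLU()
    x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
    print(relu(x))  # tensor([0.0000, 0.0000, 0.0000, 1.5000, 3.0000]) -- negatives become zero
    ```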

    Building and Evaluating Convolutional Neural Networks: Pages 591-600

    The sources transition from traditional feedforward neural networks to convolutional neural networks (CNNs), a specialized architecture particularly effective for computer vision tasks. They emphasize the power of CNNs in automatically learning and extracting features from images, eliminating the need for manual feature engineering. The sources utilize a simplified version of the VGG architecture, dubbed “TinyVGG,” to illustrate the building blocks of CNNs and their application in image classification.

    • Convolutional Neural Networks (CNNs): The sources introduce CNNs as a powerful type of neural network specifically designed for processing data with a grid-like structure, such as images. They explain that CNNs excel in computer vision tasks because they exploit the spatial relationships between pixels in an image, learning to identify patterns and features that are relevant for classification.
    • Key Components of CNNs: The sources outline the fundamental building blocks of CNNs:
    1. Convolutional Layers: Convolutional layers perform convolutions, a mathematical operation that involves sliding a filter (also called a kernel) over the input image to extract features. The filter acts as a pattern detector, learning to recognize specific shapes, edges, or textures in the image.
    2. Activation Functions: Non-linear activation functions, such as ReLU, are applied to the output of convolutional layers to introduce non-linearity into the network, enabling it to learn complex patterns.
    3. Pooling Layers: Pooling layers downsample the output of convolutional layers, reducing the spatial dimensions of the feature maps while retaining the most important information. Common pooling operations include max pooling and average pooling.
    4. Fully Connected Layers: Fully connected layers, similar to those in traditional feedforward networks, are often used in the final stages of a CNN to perform classification based on the extracted features.
    • Building TinyVGG: The sources guide readers through implementing a simplified version of the VGG architecture, named TinyVGG, to demonstrate how to build and train a CNN for image classification. They detail the architecture of TinyVGG, which consists of:
    1. Convolutional Blocks: Multiple convolutional blocks, each comprising convolutional layers, ReLU activation functions, and a max pooling layer.
    2. Classifier Layer: A final classifier layer consisting of a flattening operation followed by fully connected layers to perform classification.
    • Training and Evaluating TinyVGG: The sources provide code for training TinyVGG using the FashionMNIST dataset, a collection of grayscale images of clothing items. They demonstrate how to define the training loop, calculate the loss, perform backpropagation, and update the model’s parameters using an optimizer. They also guide readers through evaluating the trained model’s performance using accuracy and other relevant metrics.

    The sources provide a clear and accessible introduction to CNNs and their application in image classification, demonstrating the power of CNNs in automatically learning features from images without manual feature engineering. By implementing and training TinyVGG, the sources equip readers with the practical skills and understanding needed to build and work with CNNs for computer vision tasks.
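    One possible implementation of this architecture (the exact layer configuration in the sources may differ; the two-block structure and 10 hidden units follow the CNN Explainer style):

    ```python
    import torch
    from torch import nn

    class TinyVGG(nn.Module):
        def __init__(self, in_channels=1, hidden_units=10, num_classes=10, image_size=28):
            super().__init__()
            self.conv_block_1 = nn.Sequential(
                nn.Conv2d(in_channels, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2),  # e.g. 28x28 -> 14x14
            )
            self.conv_block_2 = nn.Sequential(
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2),  # e.g. 14x14 -> 7x7
            )
            # Two 2x2 max-pools shrink each spatial dimension by a factor of 4
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(hidden_units * (image_size // 4) ** 2, num_classes),
            )

        def forward(self, x):
            return self.classifier(self.conv_block_2(self.conv_block_1(x)))

    model = TinyVGG()
    print(model(torch.randn(32, 1, 28, 28)).shape)  # torch.Size([32, 10])
    ```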

    Visualizing CNNs and Building a Custom Dataset: Pages 601-610

    The sources emphasize the importance of understanding how convolutional neural networks (CNNs) operate and guide readers through visualizing the effects of convolutional layers, kernels, strides, and padding. They then transition to the concept of custom datasets, explaining the need to go beyond pre-built datasets and create datasets tailored to specific machine learning problems. The sources utilize the Food101 dataset, creating a smaller subset called “Food Vision Mini” to illustrate building a custom dataset for image classification.

    • Visualizing CNNs: The sources recommend using the CNN Explainer website (https://poloclub.github.io/cnn-explainer/) to gain a deeper understanding of how CNNs work.
    • They acknowledge that the mathematical operations involved in convolutions can be challenging to grasp. The CNN Explainer provides an interactive visualization that allows users to experiment with different CNN parameters and observe their effects on the input image.
    • Key Insights from CNN Explainer: The sources highlight the following key concepts illustrated by the CNN Explainer:
    1. Kernels: Kernels, also called filters, are small matrices that slide across the input image, extracting features by performing element-wise multiplications and summations. The values within the kernel represent the weights that the CNN learns during training.
    2. Strides: Strides determine how much the kernel moves across the input image in each step. Larger strides result in a larger downsampling of the input, reducing the spatial dimensions of the output feature maps.
    3. Padding: Padding involves adding extra pixels around the borders of the input image. Padding helps control the spatial dimensions of the output feature maps and can prevent information loss at the edges of the image.
    • Building a Custom Dataset: The sources recognize that many real-world machine learning problems require creating custom datasets that are not readily available. They guide readers through the process of building a custom dataset for image classification, using the Food101 dataset as an example.
    • Creating Food Vision Mini: The sources construct a smaller subset of the Food101 dataset called Food Vision Mini, which contains only three classes (pizza, steak, and sushi) and a reduced number of images. They advocate for starting with a smaller dataset for experimentation and development, scaling up to the full dataset once the model and workflow are established.
    • Standard Image Classification Format: The sources emphasize the importance of organizing the dataset into a standard image classification format, where images are grouped into separate folders corresponding to their respective classes. This standard format facilitates data loading and preprocessing using PyTorch’s built-in tools.
    • Loading Image Data using ImageFolder: The sources introduce PyTorch’s ImageFolder class, a convenient tool for loading image data that is organized in the standard image classification format. They demonstrate how to use ImageFolder to create dataset objects for the training and testing splits of Food Vision Mini.
    • They highlight the benefits of ImageFolder, including its automatic labeling of images based on their folder location and its ability to apply transformations to the images during loading.
    • Visualizing the Custom Dataset: The sources encourage visualizing the custom dataset to ensure that the images and labels are loaded correctly. They provide code for displaying random images and their corresponding labels from the training dataset, enabling a qualitative assessment of the dataset’s content.

    The sources offer a practical guide to understanding and visualizing CNNs and provide a step-by-step approach to building a custom dataset for image classification. By using the Food Vision Mini dataset as a concrete example, the sources equip readers with the knowledge and skills needed to create and work with datasets tailored to their specific machine learning problems.
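    A sketch of loading Food Vision Mini with ImageFolder, assuming the images sit in the standard train/test folder layout (the path and the 64×64 resize are illustrative):

    ```python
    from pathlib import Path
    from torchvision import datasets, transforms

    # Assumed layout: data/pizza_steak_sushi/train/<class_name>/*.jpg (and test/ likewise)
    data_path = Path("data/pizza_steak_sushi")

    simple_transform = transforms.Compose([
        transforms.Resize((64, 64)),
        transforms.ToTensor(),
    ])

    train_data = datasets.ImageFolder(root=data_path / "train", transform=simple_transform)
    test_data = datasets.ImageFolder(root=data_path / "test", transform=simple_transform)

    print(train_data.classes)  # labels inferred from folder names, e.g. ['pizza', 'steak', 'sushi']
    img, label = train_data[0]
    print(img.shape, train_data.classes[label])
    ```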

    Building a Custom Dataset Class and Exploring Data Augmentation: Pages 611-620

    The sources shift from using the convenient ImageFolder class to building a custom Dataset class in PyTorch, providing greater flexibility and control over data loading and preprocessing. They explain the structure and key methods of a custom Dataset class and demonstrate how to implement it for the Food Vision Mini dataset. The sources then explore data augmentation techniques, emphasizing their role in improving model generalization by artificially increasing the diversity of the training data.

    • Building a Custom Dataset Class: The sources guide readers through creating a custom Dataset class in PyTorch, offering a more versatile approach compared to ImageFolder for handling image data. They outline the essential components of a custom Dataset:
    1. Initialization (__init__): The initialization method sets up the necessary attributes of the dataset, such as the image paths, labels, and transformations.
    2. Length (__len__): The length method returns the total number of samples in the dataset, allowing PyTorch’s data loaders to determine the dataset’s size.
    3. Get Item (__getitem__): The get item method retrieves a specific sample from the dataset given its index. It typically involves loading the image, applying transformations, and returning the transformed image and its corresponding label.
    • Implementing the Custom Dataset: The sources provide a step-by-step implementation of a custom Dataset class for the Food Vision Mini dataset. They demonstrate how to:
    1. Collect Image Paths and Labels: Iterate through the image directories and store the paths to each image along with their corresponding labels.
    2. Define Transformations: Specify the desired image transformations to be applied during data loading, such as resizing, cropping, and converting to tensors.
    3. Implement __getitem__: Retrieve the image at the given index, apply transformations, and return the transformed image and label as a tuple.
    • Benefits of Custom Dataset Class: The sources highlight the advantages of using a custom Dataset class:
    1. Flexibility: Custom Dataset classes offer greater control over data loading and preprocessing, allowing developers to tailor the data handling process to their specific needs.
    2. Extensibility: Custom Dataset classes can be easily extended to accommodate various data formats and incorporate complex data loading logic.
    3. Code Clarity: Custom Dataset classes promote code organization and readability, making it easier to understand and maintain the data loading pipeline.
    • Data Augmentation: The sources introduce data augmentation as a crucial technique for improving the generalization ability of machine learning models. Data augmentation involves artificially expanding the training dataset by applying various transformations to the original images.
    • Purpose of Data Augmentation: The goal of data augmentation is to expose the model to a wider range of variations in the data, reducing the risk of overfitting and enabling the model to learn more robust and generalizable features.
    • Types of Data Augmentations: The sources showcase several common data augmentation techniques, including:
    1. Random Flipping: Flipping images horizontally or vertically.
    2. Random Cropping: Cropping images to different sizes and positions.
    3. Random Rotation: Rotating images by a random angle.
    4. Color Jitter: Adjusting image brightness, contrast, saturation, and hue.
    • Benefits of Data Augmentation: The sources emphasize the following benefits of data augmentation:
    1. Increased Data Diversity: Data augmentation artificially expands the training dataset, exposing the model to a wider range of image variations.
    2. Improved Generalization: Training on augmented data helps the model learn more robust features that generalize better to unseen data.
    3. Reduced Overfitting: Data augmentation can mitigate overfitting by preventing the model from memorizing specific examples in the training data.
    • Incorporating Data Augmentations: The sources guide readers through applying data augmentations to the Food Vision Mini dataset using PyTorch’s transforms module.
    • They demonstrate how to compose multiple transformations into a pipeline, applying them sequentially to the images during data loading.
    • Visualizing Augmented Images: The sources encourage visualizing the augmented images to ensure that the transformations are being applied as expected. They provide code for displaying random augmented images from the training dataset, allowing a qualitative assessment of the augmentation pipeline’s effects.

    The sources provide a comprehensive guide to building a custom Dataset class in PyTorch, empowering readers to handle data loading and preprocessing with greater flexibility and control. They then explore the concept and benefits of data augmentation, emphasizing its role in enhancing model generalization by introducing artificial diversity into the training data.
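    A sketch of a custom Dataset with an augmentation pipeline; the class name, glob pattern, and the specific transforms are illustrative rather than the sources' exact code:

    ```python
    from pathlib import Path
    from PIL import Image
    from torch.utils.data import Dataset
    from torchvision import transforms

    class ImageFolderCustom(Dataset):
        def __init__(self, targ_dir, transform=None):
            # 1. Collect image paths and derive class labels from folder names
            self.paths = sorted(Path(targ_dir).glob("*/*.jpg"))
            self.transform = transform
            self.classes = sorted({p.parent.name for p in self.paths})
            self.class_to_idx = {name: i for i, name in enumerate(self.classes)}

        def __len__(self):
            return len(self.paths)

        def __getitem__(self, index):
            # 3. Load the image, apply transforms, return (image, label)
            img = Image.open(self.paths[index]).convert("RGB")
            label = self.class_to_idx[self.paths[index].parent.name]
            return (self.transform(img), label) if self.transform else (img, label)

    # 2. Define transformations, including augmentations for the training split
    train_transform = transforms.Compose([
        transforms.Resize((64, 64)),
        transforms.RandomHorizontalFlip(p=0.5),                 # random flipping
        transforms.RandomRotation(degrees=15),                  # random rotation
        transforms.ColorJitter(brightness=0.2, contrast=0.2),   # color jitter
        transforms.ToTensor(),
    ])

    train_data = ImageFolderCustom("data/pizza_steak_sushi/train", transform=train_transform)
    ```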

    Constructing and Training a TinyVGG Model: Pages 621-630

    The sources guide readers through constructing a TinyVGG model, a simplified version of the VGG (Visual Geometry Group) architecture commonly used in computer vision. They explain the rationale behind TinyVGG’s design, detail its layers and activation functions, and demonstrate how to implement it in PyTorch. They then focus on training the TinyVGG model using the custom Food Vision Mini dataset. They highlight the importance of setting a random seed for reproducibility and illustrate the training process using a combination of code and explanatory text.

    • Introducing TinyVGG Architecture: The sources introduce the TinyVGG architecture as a simplified version of the VGG architecture, well-known for its performance in image classification tasks.
    • Rationale Behind TinyVGG: They explain that TinyVGG aims to capture the essential elements of the VGG architecture while using fewer layers and parameters, making it more computationally efficient and suitable for smaller datasets like Food Vision Mini.
    • Layers and Activation Functions in TinyVGG: The sources provide a detailed breakdown of the layers and activation functions used in the TinyVGG model:
    1. Convolutional Layers (nn.Conv2d): Multiple convolutional layers are used to extract features from the input images. Each convolutional layer applies a set of learnable filters (kernels) to the input, generating feature maps that highlight different patterns in the image.
    2. ReLU Activation Function (nn.ReLU): The rectified linear unit (ReLU) activation function is applied after each convolutional layer. ReLU introduces non-linearity into the model, allowing it to learn complex relationships between features. It is defined as f(x) = max(0, x), meaning it outputs the input directly if it is positive and outputs zero if the input is negative.
    3. Max Pooling Layers (nn.MaxPool2d): Max pooling layers downsample the feature maps by selecting the maximum value within a small window. This reduces the spatial dimensions of the feature maps while retaining the most salient features.
    4. Flatten Layer (nn.Flatten): The flatten layer converts the multi-dimensional feature maps from the convolutional layers into a one-dimensional feature vector. This vector is then fed into the fully connected layers for classification.
    5. Linear Layer (nn.Linear): The linear layer performs a matrix multiplication on the input feature vector, producing a set of scores for each class.
    • Implementing TinyVGG in PyTorch: The sources guide readers through implementing the TinyVGG architecture using PyTorch’s nn.Module class. They define a class called TinyVGG that inherits from nn.Module and implements the model’s architecture in its __init__ and forward methods.
    • __init__ Method: This method initializes the model’s layers, including convolutional layers, ReLU activation functions, max pooling layers, a flatten layer, and a linear layer for classification.
    • forward Method: This method defines the flow of data through the model, taking an input tensor and passing it through the various layers in the correct sequence.
    • Setting the Random Seed: The sources stress the importance of setting a random seed before training the model using torch.manual_seed(42). This ensures that the model’s initialization and training process are deterministic, making the results reproducible.
    • Training the TinyVGG Model: The sources demonstrate how to train the TinyVGG model on the Food Vision Mini dataset. They provide code for:
    1. Creating an Instance of the Model: Instantiating the TinyVGG class creates an object representing the model.
    2. Choosing a Loss Function: Selecting an appropriate loss function to measure the difference between the model’s predictions and the true labels.
    3. Setting up an Optimizer: Choosing an optimization algorithm to update the model’s parameters during training, aiming to minimize the loss function.
    4. Defining a Training Loop: Implementing a loop that iterates through the training data, performs forward and backward passes, updates model parameters, and tracks the training progress.

    The sources provide a practical walkthrough of constructing and training a TinyVGG model using the Food Vision Mini dataset. They explain the architecture’s design principles, detail its layers and activation functions, and demonstrate how to implement and train the model in PyTorch. They emphasize the importance of setting a random seed for reproducibility, enabling others to replicate the training process and results.
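    A condensed sketch of this setup, reusing the TinyVGG class sketched for pages 591-600 above (for 64×64 RGB Food Vision Mini images, in_channels=3 and image_size=64; the Adam optimizer and learning rate are illustrative):

    ```python
    import torch
    from torch import nn

    torch.manual_seed(42)  # make initialization and shuffling deterministic

    # 1. Create an instance of the model (3 classes: pizza, steak, sushi)
    model = TinyVGG(in_channels=3, hidden_units=10, num_classes=3, image_size=64)

    # 2 & 3. Loss function and optimizer
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    # 4. One epoch of training; assumes `train_dataloader` yields (image, label) batches
    model.train()
    for X, y in train_dataloader:
        y_logits = model(X)
        loss = loss_fn(y_logits, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    ```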

    Visualizing the Model, Evaluating Performance, and Comparing Results: Pages 631-640

    The sources move towards visualizing the TinyVGG model’s layers and their effects on input data, offering insights into how convolutional neural networks process information. They then focus on evaluating the model’s performance using various metrics, emphasizing the need to go beyond simple accuracy and consider measures like precision, recall, and F1 score for a more comprehensive assessment. Finally, the sources introduce techniques for comparing the performance of different models, highlighting the role of dataframes in organizing and presenting the results.

    • Visualizing TinyVGG’s Convolutional Layers: The sources explore how to visualize the convolutional layers of the TinyVGG model.
    • They leverage the CNN Explainer website, which offers an interactive tool for understanding the workings of convolutional neural networks.
    • The sources guide readers through creating dummy data in the same shape as the input data used in the CNN Explainer, allowing them to observe how the model’s convolutional layers transform the input.
    • The sources emphasize the importance of understanding hyperparameters like kernel size, stride, and padding and their influence on the convolutional operation.
    • Understanding Kernel Size, Stride, and Padding: The sources explain the significance of key hyperparameters involved in convolutional layers:
    1. Kernel Size: Refers to the size of the filter that slides across the input image. A larger kernel captures a wider receptive field, allowing the model to learn more complex features. However, a larger kernel also increases the number of parameters and computational complexity.
    2. Stride: Determines the step size at which the kernel moves across the input. A larger stride results in a smaller output feature map, effectively downsampling the input.
    3. Padding: Involves adding extra pixels around the input image to control the output size and prevent information loss at the edges. Different padding strategies, such as “same” padding or “valid” padding, influence how the kernel interacts with the image boundaries.
    • Evaluating Model Performance: The sources shift focus to evaluating the performance of the trained TinyVGG model. They emphasize that relying solely on accuracy may not provide a complete picture, especially when dealing with imbalanced datasets where one class might dominate the others.
    • Metrics Beyond Accuracy: The sources introduce several additional metrics for evaluating classification models:
    1. Precision: Measures the proportion of correctly predicted positive instances out of all instances predicted as positive. A high precision indicates that the model is good at avoiding false positives.
    2. Recall: Measures the proportion of correctly predicted positive instances out of all actual positive instances. A high recall suggests that the model is effective at identifying most of the positive instances.
    3. F1 Score: The harmonic mean of precision and recall, providing a balanced measure that considers both false positives and false negatives. It is particularly useful when dealing with imbalanced datasets where precision and recall might provide conflicting insights.
    • Confusion Matrix: The sources introduce the concept of a confusion matrix, a powerful tool for visualizing the performance of a classification model.
    • Structure of a Confusion Matrix: The confusion matrix is a table that shows the counts of true positives, true negatives, false positives, and false negatives for each class, providing a detailed breakdown of the model’s prediction patterns.
    • Benefits of Confusion Matrix: The confusion matrix helps identify classes that the model struggles with, providing insights into potential areas for improvement.
    • Comparing Model Performance: The sources explore techniques for comparing the performance of different models trained on the Food Vision Mini dataset. They demonstrate how to use Pandas dataframes to organize and present the results clearly and concisely.
    • Creating a Dataframe for Comparison: The sources guide readers through creating a dataframe that includes relevant metrics like training time, training loss, test loss, and test accuracy for each model. This allows for a side-by-side comparison of their performance.
    • Benefits of Dataframes: Dataframes provide a structured and efficient way to handle and analyze tabular data. They enable easy sorting, filtering, and visualization of the results, facilitating the process of model selection and comparison.

    The sources emphasize the importance of going beyond simple accuracy when evaluating classification models. They introduce a range of metrics, including precision, recall, and F1 score, and highlight the usefulness of the confusion matrix in providing a detailed analysis of the model’s prediction patterns. The sources then demonstrate how to use dataframes to compare the performance of multiple models systematically, aiding in model selection and understanding the impact of different design choices or training strategies.
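    A sketch of computing these metrics with the torchmetrics library and tabulating model results with pandas; the prediction tensors and every number in the dataframe are placeholders, not results from the sources:

    ```python
    import torch
    import pandas as pd
    from torchmetrics import Accuracy, Precision, Recall, F1Score

    num_classes = 3
    preds = torch.tensor([0, 2, 1, 1, 0, 2])    # placeholder predicted labels
    targets = torch.tensor([0, 1, 1, 1, 0, 2])  # placeholder true labels

    metrics = {
        "accuracy": Accuracy(task="multiclass", num_classes=num_classes),
        "precision": Precision(task="multiclass", num_classes=num_classes, average="macro"),
        "recall": Recall(task="multiclass", num_classes=num_classes, average="macro"),
        "f1": F1Score(task="multiclass", num_classes=num_classes, average="macro"),
    }
    print({name: metric(preds, targets).item() for name, metric in metrics.items()})

    # Side-by-side model comparison in a dataframe (values are placeholders)
    compare_df = pd.DataFrame([
        {"model": "baseline_linear", "train_time_s": 12.3, "test_loss": 0.48, "test_acc": 0.83},
        {"model": "tiny_vgg", "train_time_s": 45.6, "test_loss": 0.32, "test_acc": 0.89},
    ])
    print(compare_df.sort_values("test_acc", ascending=False))
    ```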

    Building, Training, and Evaluating a Multi-Class Classification Model: Pages 641-650

    The sources transition from binary classification, where models distinguish between two classes, to multi-class classification, which involves predicting one of several possible classes. They introduce the concept of multi-class classification, comparing it to binary classification, and use the Fashion MNIST dataset as an example, where models need to classify images into ten different clothing categories. The sources guide readers through adapting the TinyVGG architecture and training process for this multi-class setting, explaining the modifications needed for handling multiple classes.

    • From Binary to Multi-Class Classification: The sources explain the shift from binary to multi-class classification.
    • Binary Classification: Involves predicting one of two possible classes, like “cat” or “dog” in an image classification task.
    • Multi-Class Classification: Extends the concept to predicting one of multiple classes, as in the Fashion MNIST dataset, where models must classify images into classes like “T-shirt,” “Trouser,” “Pullover,” “Dress,” “Coat,” “Sandal,” “Shirt,” “Sneaker,” “Bag,” and “Ankle Boot.” [1, 2]
    • Adapting TinyVGG for Multi-Class Classification: The sources explain how to modify the TinyVGG architecture for multi-class problems.
    • Output Layer: The key change involves adjusting the output layer of the TinyVGG model. The number of output units in the final linear layer needs to match the number of classes in the dataset. For Fashion MNIST, this means having ten output units, one for each clothing category. [3]
    • Activation Function: They also recommend using the softmax activation function in the output layer for multi-class classification. The softmax function converts the raw output scores (logits) from the linear layer into a probability distribution over the classes, where each probability represents the model’s confidence in assigning the input to that particular class. [4]
    • Choosing the Right Loss Function and Optimizer: The sources guide readers through selecting appropriate loss functions and optimizers for multi-class classification:
    • Cross-Entropy Loss: They recommend using the cross-entropy loss function, a common choice for multi-class classification tasks. Cross-entropy loss measures the dissimilarity between the predicted probability distribution and the true label distribution. [5]
    • Optimizers: The sources discuss using optimizers like Stochastic Gradient Descent (SGD) or Adam to update the model’s parameters during training, aiming to minimize the cross-entropy loss. [5]
    • Training the Multi-Class Model: The sources demonstrate how to train the adapted TinyVGG model on the Fashion MNIST dataset, following a similar training loop structure used in previous sections:
    • Data Loading: Loading batches of image data and labels from the Fashion MNIST dataset using PyTorch’s DataLoader. [6, 7]
    • Forward Pass: Passing the input data through the model to obtain predictions (logits). [8]
    • Calculating Loss: Computing the cross-entropy loss between the predicted logits and the true labels. [8]
    • Backpropagation: Calculating gradients of the loss with respect to the model’s parameters. [8]
    • Optimizer Step: Updating the model’s parameters using the chosen optimizer, aiming to minimize the loss. [8]
    • Evaluating Performance: The sources reiterate the importance of evaluating model performance using metrics beyond simple accuracy, especially in multi-class settings.
    • Precision, Recall, F1 Score: They encourage considering metrics like precision, recall, and F1 score, which provide a more nuanced understanding of the model’s ability to correctly classify instances across different classes. [9]
    • Confusion Matrix: They highlight the usefulness of the confusion matrix, allowing visualization of the model’s prediction patterns and identification of classes the model struggles with. [10]

    The sources smoothly transition readers from binary to multi-class classification. They outline the key differences, provide clear instructions on adapting the TinyVGG architecture for multi-class tasks, and guide readers through the training process. They emphasize the need for comprehensive model evaluation, suggesting the use of metrics beyond accuracy and showcasing the value of the confusion matrix in analyzing the model’s performance.
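
    As a minimal sketch of these changes (the in_features value and batch size below are illustrative assumptions, not values from the sources): the final linear layer gets ten output units, nn.CrossEntropyLoss consumes the raw logits directly, and softmax is applied only when prediction probabilities are needed.

    import torch
    from torch import nn

    NUM_CLASSES = 10  # Fashion MNIST has ten clothing categories

    # Hypothetical classifier head: the out_features value is the key change
    classifier = nn.Sequential(
        nn.Flatten(),
        nn.Linear(in_features=10 * 7 * 7, out_features=NUM_CLASSES),  # illustrative in_features
    )

    # CrossEntropyLoss applies log-softmax internally, so it expects raw logits
    loss_fn = nn.CrossEntropyLoss()

    logits = torch.randn(32, NUM_CLASSES)      # stand-in for model outputs on a batch of 32
    pred_probs = torch.softmax(logits, dim=1)  # probability distribution over the 10 classes
    pred_labels = pred_probs.argmax(dim=1)     # predicted class index per sample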

    Evaluating Model Predictions and Understanding Data Augmentation: Pages 651-660

    The sources guide readers through evaluating model predictions on individual samples from the Fashion MNIST dataset, emphasizing the importance of visual inspection and understanding where the model succeeds or fails. They then introduce the concept of data augmentation as a technique for artificially increasing the diversity of the training data, aiming to improve the model’s generalization ability and robustness.

    • Visually Evaluating Model Predictions: The sources demonstrate how to make predictions on individual samples from the test set and visualize them alongside their true labels.
    • Selecting Random Samples: They guide readers through selecting random samples from the test data, preparing the images for visualization using matplotlib, and making predictions using the trained model.
    • Visualizing Predictions: They showcase a technique for creating a grid of images, displaying each test sample alongside its predicted label and its true label. This visual approach provides insights into the model’s performance on specific instances.
    • Analyzing Results: The sources encourage readers to analyze the visual results, looking for patterns in the model’s predictions and identifying instances where it might be making errors. This process helps understand the strengths and weaknesses of the model’s learned representations.
    • Confusion Matrix for Deeper Insights: The sources revisit the concept of the confusion matrix, introduced earlier, as a powerful tool for evaluating classification model performance.
    • Creating a Confusion Matrix: They guide readers through creating a confusion matrix using libraries like torchmetrics and mlxtend, which offer convenient functions for computing and visualizing confusion matrices.
    • Interpreting the Confusion Matrix: The sources explain how to interpret the confusion matrix, highlighting the patterns in the model’s predictions and identifying classes that might be easily confused.
    • Benefits of Confusion Matrix: They emphasize that the confusion matrix provides a more granular view of the model’s performance compared to simple accuracy, allowing for a deeper understanding of its prediction patterns.
    • Data Augmentation: The sources introduce the concept of data augmentation as a technique to improve model generalization and performance.
    • Definition of Data Augmentation: They define data augmentation as the process of artificially increasing the diversity of the training data by applying various transformations to the original images.
    • Benefits of Data Augmentation: The sources explain that data augmentation helps expose the model to a wider range of variations during training, making it more robust to changes in input data and improving its ability to generalize to unseen examples.
    • Common Data Augmentation Techniques: The sources discuss several commonly used data augmentation techniques:
    1. Random Cropping: Involves randomly selecting a portion of the image to use for training, helping the model learn to recognize objects regardless of their location within the image.
    2. Random Flipping: Horizontally flipping images, teaching the model to recognize objects even when they are mirrored.
    3. Random Rotation: Rotating images by a random angle, improving the model’s ability to handle different object orientations.
    4. Color Jitter: Adjusting the brightness, contrast, saturation, and hue of images, making the model more robust to variations in lighting and color.
    • Applying Data Augmentation in PyTorch: The sources demonstrate how to apply data augmentation using PyTorch’s transforms module, which offers a wide range of built-in transformations for image data. They create a custom transformation pipeline that includes random cropping, random horizontal flipping, and random rotation. They then visualize examples of augmented images, highlighting the diversity introduced by these transformations.
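
    A minimal version of such a pipeline might look like the following (the image size, flip probability, rotation range, and jitter strengths are illustrative assumptions):

    from torchvision import transforms

    train_transform = transforms.Compose([
        transforms.RandomResizedCrop(size=(64, 64)),   # random cropping + resize
        transforms.RandomHorizontalFlip(p=0.5),        # random horizontal flip
        transforms.RandomRotation(degrees=15),         # random rotation within ±15 degrees
        transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
        transforms.ToTensor(),                         # PIL image -> float tensor in [0, 1]
    ])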

    The sources guide readers through evaluating individual model predictions, showcasing techniques for visual inspection and analysis using matplotlib. They reiterate the importance of the confusion matrix as a tool for gaining deeper insights into the model’s prediction patterns. They then introduce the concept of data augmentation, explaining its purpose and benefits. The sources provide clear explanations of common data augmentation techniques and demonstrate how to apply them using PyTorch’s transforms module, emphasizing the role of data augmentation in improving model generalization and robustness.
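
    As a hedged sketch of the confusion-matrix workflow described above, assuming torchmetrics and mlxtend are installed and using random stand-in tensors in place of real model predictions:

    import torch
    from torchmetrics import ConfusionMatrix
    from mlxtend.plotting import plot_confusion_matrix

    class_names = ["T-shirt", "Trouser", "Pullover", "Dress", "Coat",
                   "Sandal", "Shirt", "Sneaker", "Bag", "Ankle Boot"]

    # Stand-ins for real tensors of predicted and true class indices
    y_preds = torch.randint(0, len(class_names), (100,))
    y_targets = torch.randint(0, len(class_names), (100,))

    confmat = ConfusionMatrix(task="multiclass", num_classes=len(class_names))
    confmat_tensor = confmat(y_preds, y_targets)

    fig, ax = plot_confusion_matrix(
        conf_mat=confmat_tensor.numpy(),  # mlxtend expects a NumPy array
        class_names=class_names,
        figsize=(10, 7),
    )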

    Building and Training a TinyVGG Model on a Custom Dataset: Pages 661-670

    The sources shift focus to building and training a TinyVGG convolutional neural network model on the custom food dataset (pizza, steak, sushi) prepared in the previous sections. They guide readers through the process of model definition, setting up a loss function and optimizer, and defining training and testing steps for the model. The sources emphasize a step-by-step approach, encouraging experimentation and understanding of the model’s architecture and training dynamics.

    • Defining the TinyVGG Architecture: The sources provide a detailed breakdown of the TinyVGG architecture, outlining the layers and their configurations:
    • Convolutional Blocks: They describe the arrangement of convolutional layers (nn.Conv2d), activation functions (typically ReLU – nn.ReLU), and max-pooling layers (nn.MaxPool2d) within convolutional blocks. They explain how these blocks extract features from the input images at different levels of abstraction.
    • Classifier Layer: They describe the classifier layer, consisting of a flattening operation (nn.Flatten) followed by fully connected linear layers (nn.Linear). This layer takes the extracted features from the convolutional blocks and maps them to the output classes (pizza, steak, sushi).
    • Model Implementation: The sources guide readers through implementing the TinyVGG model in PyTorch, showing how to define the model class by subclassing nn.Module (a condensed sketch follows this list):
    • __init__ Method: They demonstrate the initialization of the model’s layers within the __init__ method, setting up the convolutional blocks and the classifier layer.
    • forward Method: They explain the forward method, which defines the flow of data through the model during the forward pass, outlining how the input data passes through each layer and transformation.
    • Input and Output Shape Verification: The sources stress the importance of verifying the input and output shapes of each layer in the model. They encourage readers to print the shapes at different stages to ensure the data is flowing correctly through the network and that the dimensions are as expected. They also mention techniques for troubleshooting shape mismatches.
    • Introducing torchinfo Package: The sources introduce the torchinfo package as a helpful tool for summarizing the architecture of a PyTorch model, providing information about layer shapes, parameters, and the overall structure of the model. They demonstrate how to use torchinfo to get a concise overview of the defined TinyVGG model.
    • Setting Up the Loss Function and Optimizer: The sources guide readers through selecting a suitable loss function and optimizer for training the TinyVGG model:
    • Cross-Entropy Loss: They recommend using the cross-entropy loss function for the multi-class classification problem of the food dataset. They explain that cross-entropy loss is commonly used for classification tasks and measures the difference between the predicted probability distribution and the true label distribution.
    • Stochastic Gradient Descent (SGD) Optimizer: They suggest using the SGD optimizer for updating the model’s parameters during training. They explain that SGD is a widely used optimization algorithm that iteratively adjusts the model’s parameters to minimize the loss function.
    • Defining Training and Testing Steps: The sources provide code for defining the training and testing steps of the model training process:
    • train_step Function: They define a train_step function, which takes a batch of training data as input, performs a forward pass through the model, calculates the loss, performs backpropagation to compute gradients, and updates the model’s parameters using the optimizer. They emphasize accumulating the loss and accuracy over the batches within an epoch.
    • test_step Function: They define a test_step function, which takes a batch of testing data as input, performs a forward pass to get predictions, calculates the loss, and accumulates the loss and accuracy over the batches. They highlight that the test_step does not involve updating the model’s parameters, as it’s used for evaluation purposes.
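
    Pulling the pieces of this section together, a condensed sketch of a TinyVGG-style model might look like this (the 64×64 input size, hidden-unit count, padding choice, and learning rate are illustrative assumptions, not the exact values from the sources):

    import torch
    from torch import nn

    class TinyVGG(nn.Module):
        """Two convolutional blocks followed by a linear classifier."""
        def __init__(self, in_channels: int, hidden_units: int, num_classes: int):
            super().__init__()
            self.conv_block_1 = nn.Sequential(
                nn.Conv2d(in_channels, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2),  # halves height and width
            )
            self.conv_block_2 = nn.Sequential(
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2),
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                # padding=1 keeps conv shapes, so 64x64 is halved twice -> 16x16
                nn.Linear(hidden_units * 16 * 16, num_classes),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.classifier(self.conv_block_2(self.conv_block_1(x)))

    model = TinyVGG(in_channels=3, hidden_units=10, num_classes=3)  # pizza, steak, sushi
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(params=model.parameters(), lr=0.1)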

    The sources guide readers through the process of defining the TinyVGG architecture, verifying layer shapes, setting up the loss function and optimizer, and defining the training and testing steps for the model. They emphasize the importance of understanding the model’s structure and the flow of data through it. They encourage readers to experiment and pay attention to details to ensure the model is correctly implemented and set up for training.

    Training, Evaluating, and Saving the TinyVGG Model: Pages 671-680

    The sources guide readers through the complete training process of the TinyVGG model on the custom food dataset, highlighting techniques for visualizing training progress, evaluating model performance, and saving the trained model for later use. They emphasize practical considerations, such as setting up training loops, tracking loss and accuracy metrics, and making predictions on test data.

    • Implementing the Training Loop: The sources provide code for implementing the training loop, iterating through multiple epochs and performing training and testing steps for each epoch. They break down the training loop into clear steps:
    • Epoch Iteration: They use a for loop to iterate over the specified number of training epochs.
    • Setting Model to Training Mode: Before starting the training step for each epoch, they explicitly set the model to training mode using model.train(). They explain that this is important for activating certain layers, like dropout or batch normalization, which behave differently during training and evaluation.
    • Iterating Through Batches: Within each epoch, they use another for loop to iterate through the batches of data from the training data loader.
    • Calling the train_step Function: For each batch, they call the previously defined train_step function, which performs a forward pass, calculates the loss, performs backpropagation, and updates the model’s parameters.
    • Accumulating Loss and Accuracy: They accumulate the training loss and accuracy values over the batches within an epoch.
    • Setting Model to Evaluation Mode: Before starting the testing step, they set the model to evaluation mode using model.eval(). They explain that this deactivates training-specific behaviors of certain layers.
    • Iterating Through Test Batches: They iterate through the batches of data from the test data loader.
    • Calling the test_step Function: For each batch, they call the test_step function, which calculates the loss and accuracy on the test data.
    • Accumulating Test Loss and Accuracy: They accumulate the test loss and accuracy values over the test batches.
    • Calculating Average Loss and Accuracy: After iterating through all the training and testing batches, they calculate the average training loss, training accuracy, test loss, and test accuracy for the epoch.
    • Printing Epoch Statistics: They print the calculated statistics for each epoch, providing a clear view of the model’s progress during training.
    • Visualizing Training Progress: The sources emphasize the importance of visualizing the training process to gain insights into the model’s learning dynamics:
    • Creating Loss and Accuracy Curves: They guide readers through creating plots of the training loss and accuracy values over the epochs, allowing for visual inspection of how the model is improving.
    • Analyzing Loss Curves: They explain how to analyze the loss curves, looking for trends that indicate convergence or potential issues like overfitting. They suggest that a steadily decreasing loss curve generally indicates good learning progress.
    • Saving and Loading the Best Model: The sources highlight the importance of saving the model with the best performance achieved during training (a skeleton of this pattern follows the list):
    • Tracking the Best Test Loss: They introduce a variable to track the best test loss achieved so far during training.
    • Saving the Model When Test Loss Improves: They include a condition within the training loop to save the model’s state dictionary (model.state_dict()) whenever a new best test loss is achieved.
    • Loading the Saved Model: They demonstrate how to load the saved model’s state dictionary using torch.load() and use it to restore the model’s parameters for later use.
    • Evaluating the Loaded Model: The sources guide readers through evaluating the performance of the loaded model on the test data:
    • Performing a Test Pass: They use the test_step function to calculate the loss and accuracy of the loaded model on the entire test dataset.
    • Comparing Results: They compare the results of the loaded model with the results obtained during training to ensure that the loaded model performs as expected.
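
    A skeleton of this pattern might look like the following; it assumes the model, loss_fn, optimizer, data loaders, and the train_step/test_step functions described in the previous section already exist, and the save path is a hypothetical name:

    from pathlib import Path

    import torch

    EPOCHS = 10
    best_test_loss = float("inf")
    save_path = Path("models/best_tinyvgg.pth")  # hypothetical location
    save_path.parent.mkdir(parents=True, exist_ok=True)

    for epoch in range(EPOCHS):
        train_loss, train_acc = train_step(model, train_dataloader, loss_fn, optimizer)
        test_loss, test_acc = test_step(model, test_dataloader, loss_fn)
        print(f"Epoch {epoch} | train loss: {train_loss:.4f} | test loss: {test_loss:.4f}")

        # Save the state dict whenever a new best test loss is reached
        if test_loss < best_test_loss:
            best_test_loss = test_loss
            torch.save(model.state_dict(), save_path)

    # Later: restore the best parameters into a model of the same architecture
    model.load_state_dict(torch.load(save_path))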

    The sources provide a comprehensive walkthrough of the training process for the TinyVGG model, emphasizing the importance of setting up the training loop, tracking loss and accuracy metrics, visualizing training progress, saving the best model, and evaluating its performance. They offer practical tips and best practices for effective model training, encouraging readers to actively engage in the process, analyze the results, and gain a deeper understanding of how the model learns and improves.

    Understanding and Implementing Custom Datasets: Pages 681-690

    The sources shift focus to explaining the concept and implementation of custom datasets in PyTorch, emphasizing the flexibility and customization they offer for handling diverse types of data beyond pre-built datasets. They guide readers through the process of creating a custom dataset class, understanding its key methods, and visualizing samples from the custom dataset.

    • Introducing Custom Datasets: The sources introduce the concept of custom datasets in PyTorch, explaining that they allow for greater control and flexibility in handling data that doesn’t fit the structure of pre-built datasets. They highlight that custom datasets are especially useful when working with:
    • Data in Non-Standard Formats: Data that is not readily available in formats supported by pre-built datasets, requiring specific loading and processing steps.
    • Data with Unique Structures: Data with specific organizational structures or relationships that need to be represented in a particular way.
    • Data Requiring Specialized Transformations: Data that requires specific transformations or augmentations to prepare it for model training.
    • Using torchvision.datasets.ImageFolder: The sources acknowledge that the torchvision.datasets.ImageFolder class can handle many image classification datasets. They explain that ImageFolder works well when the data follows a standard directory structure, where images are organized into subfolders representing different classes. However, they also emphasize the need for custom dataset classes when dealing with data that doesn’t conform to this standard structure.
    • Building FoodVisionMini Custom Dataset: The sources guide readers through creating a custom dataset class called FoodVisionMini, designed to work with the smaller subset of the Food 101 dataset (pizza, steak, sushi) prepared earlier. They outline the key steps and considerations involved, sketched in code after this list:
    • Subclassing torch.utils.data.Dataset: They explain that custom dataset classes should inherit from the torch.utils.data.Dataset class, which provides the basic framework for representing a dataset in PyTorch.
    • Implementing Required Methods: They highlight the essential methods that need to be implemented in a custom dataset class:
    • __init__ Method: The __init__ method initializes the dataset, taking the necessary arguments, such as the data directory, transformations to be applied, and any other relevant information.
    • __len__ Method: The __len__ method returns the total number of samples in the dataset.
    • __getitem__ Method: The __getitem__ method retrieves a data sample at a given index. It typically involves loading the data, applying transformations, and returning the processed data and its corresponding label.
    • __getitem__ Method Implementation: The sources provide a detailed breakdown of implementing the __getitem__ method in the FoodVisionMini dataset:
    • Getting the Image Path: The method first determines the file path of the image to be loaded based on the provided index.
    • Loading the Image: It uses PIL.Image.open() to open the image file.
    • Applying Transformations: It applies the specified transformations (if any) to the loaded image.
    • Converting to Tensor: It converts the transformed image to a PyTorch tensor.
    • Returning Data and Label: It returns the processed image tensor and its corresponding class label.
    • Overriding the __len__ Method: The sources also explain the importance of overriding the __len__ method to return the correct number of samples in the custom dataset. They demonstrate a simple implementation that returns the length of the list of image file paths.
    • Visualizing Samples from the Custom Dataset: The sources emphasize the importance of visually inspecting samples from the custom dataset to ensure that the data is loaded and processed correctly. They guide readers through creating a function to display random images from the dataset, including their labels, to verify the dataset’s integrity and the effectiveness of applied transformations.
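
    A condensed sketch of such a class, assuming a root_dir/<class_name>/<image>.jpg folder layout (the attribute names here are illustrative, not necessarily those used in the sources):

    from pathlib import Path
    from typing import Tuple

    import torch
    from PIL import Image
    from torch.utils.data import Dataset

    class FoodVisionMini(Dataset):
        """Custom dataset over images stored as root_dir/<class_name>/<image>.jpg."""
        def __init__(self, root_dir: str, transform=None):
            self.paths = list(Path(root_dir).glob("*/*.jpg"))  # one subfolder per class
            self.transform = transform
            self.classes = sorted({p.parent.name for p in self.paths})
            self.class_to_idx = {name: i for i, name in enumerate(self.classes)}

        def __len__(self) -> int:
            return len(self.paths)  # total number of samples

        def __getitem__(self, index: int) -> Tuple[torch.Tensor, int]:
            image_path = self.paths[index]
            image = Image.open(image_path)
            label = self.class_to_idx[image_path.parent.name]
            if self.transform:
                image = self.transform(image)  # e.g. ToTensor() yields a tensor
            return image, label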

    The sources provide a detailed guide to understanding and implementing custom datasets in PyTorch. They explain the motivations for using custom datasets, the key methods to implement, and practical considerations for loading, processing, and visualizing data. They encourage readers to explore the flexibility of custom datasets and create their own to handle diverse data formats and structures for their specific machine learning tasks.

    Exploring Data Augmentation and Building the TinyVGG Model Architecture: Pages 691-700

    The sources introduce the concept of data augmentation, a powerful technique for enhancing the diversity and robustness of training datasets, and then guide readers through building the TinyVGG model architecture using PyTorch.

    • Visualizing the Effects of Data Augmentation: The sources demonstrate the visual effects of applying data augmentation techniques to images from the custom food dataset. They showcase examples where images have been:
    • Cropped: Portions of the original images have been removed, potentially changing the focus or composition.
    • Darkened/Brightened: The overall brightness or contrast of the images has been adjusted, simulating variations in lighting conditions.
    • Shifted: The content of the images has been moved within the frame, altering the position of objects.
    • Rotated: The images have been rotated by a certain angle, introducing variations in orientation.
    • Color-Modified: The color balance or saturation of the images has been altered, simulating variations in color perception.

    The sources emphasize that applying these augmentations randomly during training can help the model learn more robust and generalizable features, making it less sensitive to variations in image appearance and less prone to overfitting the training data.

    • Creating a Function to Display Random Transformed Images: The sources provide code for creating a function to display random images from the custom dataset after they have been transformed using data augmentation techniques. This function allows for visual inspection of the augmented images, helping readers understand the impact of different transformations on the dataset. They explain how this function can be used to:
    • Verify Transformations: Ensure that the intended augmentations are being applied correctly to the images.
    • Assess Augmentation Strength: Evaluate whether the strength or intensity of the augmentations is appropriate for the dataset and task.
    • Visualize Data Diversity: Observe the increased diversity in the dataset resulting from data augmentation.
    • Implementing the TinyVGG Model Architecture: The sources guide readers through implementing the TinyVGG model architecture, a convolutional neural network architecture known for its simplicity and effectiveness in image classification tasks. They outline the key building blocks of the TinyVGG model:
    • Convolutional Blocks (conv_block): The model uses multiple convolutional blocks, each consisting of:
    • Convolutional Layers (nn.Conv2d): These layers apply learnable filters to the input image, extracting features at different scales and orientations.
    • ReLU Activation Layers (nn.ReLU): These layers introduce non-linearity into the model, allowing it to learn complex patterns in the data.
    • Max Pooling Layers (nn.MaxPool2d): These layers downsample the feature maps, reducing their spatial dimensions while retaining the most important features.
    • Classifier Layer: The convolutional blocks are followed by a classifier layer, which consists of:
    • Flatten Layer (nn.Flatten): This layer converts the multi-dimensional feature maps from the convolutional blocks into a one-dimensional feature vector.
    • Linear Layer (nn.Linear): This layer performs a linear transformation on the feature vector, producing output logits that represent the model’s predictions for each class.

    The sources emphasize the hierarchical structure of the TinyVGG model, where the convolutional blocks progressively extract more abstract and complex features from the input image, and the classifier layer uses these features to make predictions. They explain that the TinyVGG model’s simple yet effective design makes it a suitable choice for various image classification tasks, and its modular structure allows for customization and experimentation with different layer configurations.

    • Troubleshooting Shape Mismatches: The sources address the common issue of shape mismatches that can occur when building deep learning models, emphasizing the importance of carefully checking the input and output dimensions of each layer:
    • Using Error Messages as Guides: They explain that error messages related to shape mismatches can provide valuable clues for identifying the source of the issue.
    • Printing Shapes for Verification: They recommend printing the shapes of tensors at various points in the model to verify that the dimensions are as expected and to trace the flow of data through the model.
    • Calculating Shapes Manually: They suggest calculating the expected output shapes of convolutional and pooling layers manually, considering factors like kernel size, stride, and padding, to ensure that the model is structured correctly.
    • Using torchinfo for Model Summary: The sources introduce the torchinfo package, a useful tool for visualizing the structure and parameters of a PyTorch model. They explain that torchinfo can provide a comprehensive summary of the model, including:
    • Layer Information: The type and configuration of each layer in the model.
    • Input and Output Shapes: The expected dimensions of tensors at each stage of the model.
    • Number of Parameters: The total number of trainable parameters in the model.
    • Memory Usage: An estimate of the model’s memory requirements.

    The sources demonstrate how to use torchinfo to summarize the TinyVGG model, highlighting its ability to provide insights into the model’s architecture and complexity, and assist in debugging shape-related issues.
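
    A sketch of such a summary call, assuming the TinyVGG model and 64×64 RGB inputs from the earlier sketch (the batch size is an arbitrary choice):

    from torchinfo import summary

    # (N, C, H, W): a batch of 32 RGB images at 64x64
    summary(model, input_size=(32, 3, 64, 64))

    # The same shapes can be checked by hand for any conv or pooling layer:
    # output_size = floor((input_size + 2 * padding - kernel_size) / stride) + 1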

    The sources provide a practical guide to understanding and implementing data augmentation techniques, building the TinyVGG model architecture, and troubleshooting common issues. They emphasize the importance of visualizing the effects of augmentations, carefully checking layer shapes, and utilizing tools like torchinfo for model analysis. These steps lay the foundation for training the TinyVGG model on the custom food dataset in subsequent sections.

    Training and Evaluating the TinyVGG Model on a Custom Dataset: Pages 701-710

    The sources guide readers through training and evaluating the TinyVGG model on the custom food dataset, explaining how to implement training and evaluation loops, track model performance, and visualize results.

    • Preparing for Model Training: The sources outline the steps to prepare for training the TinyVGG model:
    • Setting a Random Seed: They emphasize the importance of setting a random seed for reproducibility. This ensures that the random initialization of model weights and any data shuffling during training is consistent across different runs, making it easier to compare and analyze results. [1]
    • Creating a List of Image Paths: They generate a list of paths to all the image files in the custom dataset. This list will be used to access and process images during training. [1]
    • Visualizing Data with PIL: They demonstrate how to use the Python Imaging Library (PIL) to:
    • Open and Display Images: Load and display images from the dataset using PIL.Image.open(). [2]
    • Convert Images to Arrays: Transform images into numerical arrays using np.array(), enabling further processing and analysis. [3]
    • Inspect Color Channels: Examine the red, green, and blue (RGB) color channels of images, understanding how color information is represented numerically. [3]
    • Implementing Image Transformations: They review the concept of image transformations and their role in preparing images for model input, highlighting:
    • Conversion to Tensors: Transforming images into PyTorch tensors, the required data format for inputting data into PyTorch models. [3]
    • Resizing and Cropping: Adjusting image dimensions to ensure consistency and compatibility with the model’s input layer. [3]
    • Normalization: Scaling pixel values to a specific range, typically between 0 and 1, to improve model training stability and efficiency. [3]
    • Data Augmentation: Applying random transformations to images during training to increase data diversity and prevent overfitting. [4]
    • Utilizing ImageFolder for Data Loading: The sources demonstrate the convenience of using the torchvision.datasets.ImageFolder class for loading images from a directory laid out in the standard image classification format (see the sketch after this list). They explain how ImageFolder:
    • Organizes Data by Class: Automatically infers class labels based on the subfolder structure of the image directory, streamlining data organization. [5]
    • Provides Data Length: Offers a __len__ method to determine the number of samples in the dataset, useful for tracking progress during training. [5]
    • Enables Sample Access: Implements a __getitem__ method to retrieve a specific image and its corresponding label based on its index, facilitating data access during training. [5]
    • Creating DataLoader for Batch Processing: The sources emphasize the importance of using the torch.utils.data.DataLoader class to create data loaders, explaining their role in:
    • Batching Data: Grouping multiple images and labels into batches, allowing the model to process multiple samples simultaneously, which can significantly speed up training. [6]
    • Shuffling Data: Randomizing the order of samples within batches to prevent the model from learning spurious patterns based on the order of data presentation. [6]
    • Loading Data Efficiently: Optimizing data loading and transfer, especially when working with large datasets, to minimize training time and resource usage. [6]
    • Visualizing a Sample and Label: The sources guide readers through visualizing an image and its label from the custom dataset using Matplotlib, allowing for a visual confirmation that the data is being loaded and processed correctly. [7]
    • Understanding Data Shape and Transformations: The sources highlight the importance of understanding how data shapes change as they pass through different stages of the model:
    • Color Channels First (NCHW): PyTorch often expects images in the format “Batch Size (N), Color Channels (C), Height (H), Width (W).” [8]
    • Transformations and Shape: They reiterate the importance of verifying that image transformations result in the expected output shapes, ensuring compatibility with subsequent layers. [8]
    • Replicating ImageFolder Functionality: The sources provide code for replicating the core functionality of ImageFolder manually. They explain that this exercise can deepen understanding of how custom datasets are created and provide a foundation for building more specialized datasets in the future. [9]
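
    A minimal sketch of this loading pipeline (the directory paths, image size, and batch size are illustrative assumptions):

    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    data_transform = transforms.Compose([
        transforms.Resize(size=(64, 64)),
        transforms.ToTensor(),
    ])

    # Assumes train/ and test/ folders, each with one subfolder per class
    train_data = datasets.ImageFolder(root="data/pizza_steak_sushi/train", transform=data_transform)
    test_data = datasets.ImageFolder(root="data/pizza_steak_sushi/test", transform=data_transform)

    train_dataloader = DataLoader(train_data, batch_size=32, shuffle=True)
    test_dataloader = DataLoader(test_data, batch_size=32, shuffle=False)

    images, labels = next(iter(train_dataloader))
    print(images.shape)  # torch.Size([32, 3, 64, 64]) -> NCHW format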

    The sources meticulously guide readers through the essential steps of preparing data, loading it using ImageFolder, and creating data loaders for efficient batch processing. They emphasize the importance of data visualization, shape verification, and understanding the transformations applied to images. These detailed explanations set the stage for training and evaluating the TinyVGG model on the custom food dataset.

    Constructing the Training Loop and Evaluating Model Performance: Pages 711-720

    The sources focus on building the training loop and evaluating the performance of the TinyVGG model on the custom food dataset. They introduce techniques for tracking training progress, calculating loss and accuracy, and visualizing the training process.

    • Creating Training and Testing Step Functions: The sources explain the importance of defining separate functions for the training and testing steps. They guide readers through implementing these functions (both are sketched after this list):
    • train_step Function: This function outlines the steps involved in a single training iteration. It includes:
    1. Setting the Model to Train Mode: The model is set to training mode (model.train()) to enable gradient calculations and updates during backpropagation.
    2. Performing a Forward Pass: The input data (images) is passed through the model to obtain the output predictions (logits).
    3. Calculating the Loss: The predicted logits are compared to the true labels using a loss function (e.g., cross-entropy loss), providing a measure of how well the model’s predictions match the actual data.
    4. Calculating the Accuracy: The model’s accuracy is calculated by determining the percentage of correct predictions.
    5. Zeroing Gradients: The gradients from the previous iteration are reset to zero (optimizer.zero_grad()) to prevent their accumulation and ensure that each iteration’s gradients are calculated independently.
    6. Performing Backpropagation: The gradients of the loss function with respect to the model’s parameters are calculated (loss.backward()), tracing the path of error back through the network.
    7. Updating Model Parameters: The optimizer updates the model’s parameters (optimizer.step()) based on the calculated gradients, adjusting the model’s weights and biases to minimize the loss function.
    8. Returning Loss and Accuracy: The function returns the calculated loss and accuracy for the current training iteration, allowing for performance monitoring.
    • test_step Function: This function performs a similar process to the train_step function, but without gradient calculations or parameter updates. It is designed to evaluate the model’s performance on a separate test dataset, providing an unbiased assessment of how well the model generalizes to unseen data.
    • Implementing the Training Loop: The sources outline the structure of the training loop, which iteratively trains and evaluates the model over a specified number of epochs:
    • Looping through Epochs: The loop iterates through the desired number of epochs, allowing the model to see and learn from the training data multiple times.
    • Looping through Batches: Within each epoch, the loop iterates through the batches of data provided by the training data loader.
    • Calling train_step and test_step: For each batch, the train_step function is called to train the model, and periodically, the test_step function is called to evaluate the model’s performance on the test dataset.
    • Tracking and Accumulating Loss and Accuracy: The loss and accuracy values from each batch are accumulated to calculate the average loss and accuracy for the entire epoch.
    • Printing Progress: The training progress, including epoch number, loss, and accuracy, is printed to the console, providing a real-time view of the model’s performance.
    • Using tqdm for Progress Bars: The sources recommend using the tqdm library to create progress bars, which visually display the progress of the training loop, making it easier to track how long each epoch takes and estimate the remaining training time.
    • Visualizing Training Progress with Loss Curves: The sources emphasize the importance of visualizing the model’s training progress by plotting loss curves. These curves show how the loss function changes over time (epochs or batches), providing insights into:
    • Model Convergence: Whether the model is successfully learning and reducing the error on the training data, indicated by a decreasing loss curve.
    • Overfitting: If the loss on the training data continues to decrease while the loss on the test data starts to increase, it might indicate that the model is overfitting the training data and not generalizing well to unseen data.
    • Understanding Ideal and Problematic Loss Curves: The sources provide examples of ideal and problematic loss curves, helping readers identify patterns that suggest healthy training progress or potential issues that may require adjustments to the model’s architecture, hyperparameters, or training process.
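
    Hedged sketches of the two functions (the signatures and the accuracy calculation are illustrative; the sources may structure theirs differently). Wrapping the epoch loop in tqdm, e.g. for epoch in tqdm(range(epochs)):, adds the progress bar mentioned above.

    import torch
    from torch import nn
    from torch.utils.data import DataLoader

    def train_step(model: nn.Module, dataloader: DataLoader,
                   loss_fn: nn.Module, optimizer: torch.optim.Optimizer,
                   device: str = "cpu"):
        model.train()                                  # enable training behaviors
        train_loss, train_acc = 0.0, 0.0
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            y_logits = model(X)                        # forward pass
            loss = loss_fn(y_logits, y)                # calculate the loss
            train_loss += loss.item()
            train_acc += (y_logits.argmax(dim=1) == y).float().mean().item()
            optimizer.zero_grad()                      # reset accumulated gradients
            loss.backward()                            # backpropagation
            optimizer.step()                           # update parameters
        return train_loss / len(dataloader), train_acc / len(dataloader)

    def test_step(model: nn.Module, dataloader: DataLoader,
                  loss_fn: nn.Module, device: str = "cpu"):
        model.eval()                                   # evaluation mode: no updates
        test_loss, test_acc = 0.0, 0.0
        with torch.no_grad():                          # no gradient tracking needed
            for X, y in dataloader:
                X, y = X.to(device), y.to(device)
                y_logits = model(X)
                test_loss += loss_fn(y_logits, y).item()
                test_acc += (y_logits.argmax(dim=1) == y).float().mean().item()
        return test_loss / len(dataloader), test_acc / len(dataloader)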

    The sources provide a detailed guide to constructing the training loop, tracking model performance, and visualizing the training process. They explain how to implement training and testing steps, use tqdm for progress tracking, and interpret loss curves to monitor the model’s learning and identify potential issues. These steps are crucial for successfully training and evaluating the TinyVGG model on the custom food dataset.

    Experiment Tracking and Enhancing Model Performance: Pages 721-730

    The sources guide readers through tracking model experiments and exploring techniques to enhance the TinyVGG model’s performance on the custom food dataset. They explain methods for comparing results, adjusting hyperparameters, and introduce the concept of transfer learning.

    • Comparing Model Results: The sources introduce strategies for comparing the results of different model training experiments. They demonstrate how to:
    • Create a Dictionary to Store Results: Organize the results of each experiment, including loss, accuracy, and training time, into separate dictionaries for easy access and comparison.
    • Use Pandas DataFrames for Analysis: Leverage the power of Pandas DataFrames to:
    • Structure Results: Neatly organize the results from different experiments into a tabular format, facilitating clear comparisons.
    • Sort and Analyze Data: Sort and analyze the data to identify trends, such as which model configuration achieved the lowest loss or highest accuracy, and to observe how changes in hyperparameters affect performance.
    • Exploring Ways to Improve a Model: The sources discuss various techniques for improving the performance of a deep learning model, including:
    • Adjusting Hyperparameters: Modifying hyperparameters, such as the learning rate, batch size, and number of epochs, can significantly impact model performance. They suggest experimenting with these parameters to find optimal settings for a given dataset.
    • Adding More Layers: Increasing the depth of the model by adding more layers can potentially allow the model to learn more complex representations of the data, leading to improved accuracy.
    • Adding More Hidden Units: Increasing the number of hidden units in each layer can also enhance the model’s capacity to learn intricate patterns in the data.
    • Training for Longer: Training the model for more epochs can sometimes lead to further improvements, but it is crucial to monitor the loss curves for signs of overfitting.
    • Using a Different Optimizer: Different optimizers employ distinct strategies for updating model parameters. Experimenting with various optimizers, such as Adam or RMSprop, might yield better performance compared to the default stochastic gradient descent (SGD) optimizer.
    • Leveraging Transfer Learning: The sources introduce the concept of transfer learning, a powerful technique where a model pre-trained on a large dataset is used as a starting point for training on a smaller, related dataset. They explain how transfer learning can:
    • Improve Performance: Benefit from the knowledge gained by the pre-trained model, often resulting in faster convergence and higher accuracy on the target dataset.
    • Reduce Training Time: Leverage the pre-trained model’s existing feature representations, potentially reducing the need for extensive training from scratch.
    • Making Predictions on a Custom Image: The sources demonstrate how to use the trained model to make predictions on a custom image (see the sketch after this list). This involves:
    • Loading and Transforming the Image: Loading the image using PIL, applying the same transformations used during training (resizing, normalization, etc.), and converting the image to a PyTorch tensor.
    • Passing the Image through the Model: Inputting the transformed image tensor into the trained model to obtain the predicted logits.
    • Applying Softmax for Probabilities: Converting the raw logits into probabilities using the softmax function, indicating the model’s confidence in each class prediction.
    • Determining the Predicted Class: Selecting the class with the highest probability as the model’s prediction for the input image.
    • Understanding Model Performance: The sources emphasize the importance of evaluating the model’s performance both quantitatively and qualitatively:
    • Quantitative Evaluation: Using metrics like loss and accuracy to assess the model’s performance numerically, providing objective measures of its ability to learn and generalize.
    • Qualitative Evaluation: Examining predictions on individual images to gain insights into the model’s decision-making process. This can help identify areas where the model struggles and suggest potential improvements to the training data or model architecture.
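
    A sketch of those prediction steps, assuming a trained model and a class_names list already exist; the file name is a hypothetical placeholder, and the transform must match the one used during training:

    import torch
    from PIL import Image
    from torchvision import transforms

    image = Image.open("custom_image.jpg").convert("RGB")  # hypothetical file
    transform = transforms.Compose([
        transforms.Resize(size=(64, 64)),  # same size the model was trained on
        transforms.ToTensor(),
    ])
    image_tensor = transform(image).unsqueeze(dim=0)  # add batch dim -> [1, 3, 64, 64]

    model.eval()                                      # assumed: the trained model
    with torch.no_grad():
        logits = model(image_tensor)

    pred_probs = torch.softmax(logits, dim=1)                   # confidence per class
    pred_class = class_names[pred_probs.argmax(dim=1).item()]   # e.g. "pizza"
    print(pred_class)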

    The sources cover important aspects of tracking experiments, improving model performance, and making predictions. They explain methods for comparing results, discuss various hyperparameter tuning techniques and introduce transfer learning. They also guide readers through making predictions on custom images and emphasize the importance of both quantitative and qualitative evaluation to understand the model’s strengths and limitations.

    Building Custom Datasets with PyTorch: Pages 731-740

    The sources shift focus to constructing custom datasets in PyTorch. They explain the motivation behind creating custom datasets, walk through the process of building one for the food classification task, and highlight the importance of understanding the dataset structure and visualizing the data.

    • Understanding the Need for Custom Datasets: The sources explain that while pre-built datasets like FashionMNIST are valuable for learning and experimentation, real-world machine learning projects often require working with custom datasets specific to the problem at hand. Building custom datasets allows for greater flexibility and control over the data used for training models.
    • Creating a Custom ImageDataset Class: The sources guide readers through creating a custom dataset class named ImageDataset, which inherits from the Dataset class provided by PyTorch. They outline the key steps and methods involved:
    1. Initialization (__init__): This method initializes the dataset by:
    • Defining the root directory where the image data is stored.
    • Setting up the transformation pipeline to be applied to each image (e.g., resizing, normalization).
    • Creating a list of image file paths by recursively traversing the directory structure.
    • Generating a list of corresponding labels based on the image’s parent directory (representing the class).
    2. Calculating Dataset Length (__len__): This method returns the total number of samples in the dataset, determined by the length of the image file path list. This allows PyTorch’s data loaders to know how many samples are available.
    3. Getting a Sample (__getitem__): This method fetches a specific sample from the dataset given its index. It involves:
    • Retrieving the image file path and label corresponding to the provided index.
    • Loading the image using PIL.
    • Applying the defined transformations to the image.
    • Converting the image to a PyTorch tensor.
    • Returning the transformed image tensor and its associated label.
    • Mapping Class Names to Integers: The sources demonstrate a helper function that maps class names (e.g., “pizza”, “steak”, “sushi”) to integer labels (e.g., 0, 1, 2), as sketched after this list. This is necessary for PyTorch models, which typically work with numerical labels.
    • Visualizing Samples and Labels: The sources stress the importance of visually inspecting the data to gain a better understanding of the dataset’s structure and contents. They guide readers through creating a function to display random images from the custom dataset along with their corresponding labels, allowing for a qualitative assessment of the data.
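
    A sketch of such a helper (the name find_classes is an assumption; the sources may name it differently):

    from pathlib import Path
    from typing import Dict, List, Tuple

    def find_classes(directory: str) -> Tuple[List[str], Dict[str, int]]:
        """Map class folder names (e.g. 'pizza') to integer labels (e.g. 0)."""
        classes = sorted(entry.name for entry in Path(directory).iterdir() if entry.is_dir())
        if not classes:
            raise FileNotFoundError(f"Couldn't find any class folders in {directory}.")
        class_to_idx = {name: i for i, name in enumerate(classes)}
        return classes, class_to_idx

    # find_classes("data/pizza_steak_sushi/train")
    # -> (["pizza", "steak", "sushi"], {"pizza": 0, "steak": 1, "sushi": 2})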

    The sources provide a comprehensive overview of building custom datasets in PyTorch, specifically focusing on creating an ImageDataset class for image classification tasks. They outline the essential methods for initialization, calculating length, and retrieving samples, along with the process of mapping class names to integers and visualizing the data.

    Visualizing and Augmenting Custom Datasets: Pages 741-750

    The sources focus on visualizing data from the custom ImageDataset and introduce the concept of data augmentation as a technique to enhance model performance. They guide readers through creating a function to display random images from the dataset and explore various data augmentation techniques, specifically using the torchvision.transforms module.

    • Creating a Function to Display Random Images: The sources outline the steps involved in creating a function to visualize random images from the custom dataset, enabling a qualitative assessment of the data and the transformations applied. They provide detailed guidance on the following steps; a sketch of the function appears after this list:
    1. Function Definition: Define a function that accepts the dataset, class names, the number of images to display (defaulting to 10), and a boolean flag (display_shape) to optionally show the shape of each image.
    2. Limiting Display for Practicality: To keep the display readable, the function caps the number of images at 10; if the user requests more, it resets the count to 10 and disables the display_shape option.
    3. Random Sampling: Generate a list of random indices within the range of the dataset’s length using random.sample. The number of indices to sample is determined by the n parameter (number of images to display).
    4. Setting up the Plot: Create a Matplotlib figure with a size adjusted based on the number of images to display.
    5. Iterating through Samples: Loop through the randomly sampled indices, retrieving the corresponding image and label from the dataset using the __getitem__ method.
    6. Creating Subplots: For each image, create a subplot within the Matplotlib figure, arranging them in a single row.
    7. Displaying Images: Use plt.imshow to display the image within its designated subplot.
    8. Setting Titles: Set the title of each subplot to display the class name of the image.
    9. Optional Shape Display: If the display_shape flag is True, print the shape of each image tensor below its subplot.
    • Introducing Data Augmentation: The sources highlight the importance of data augmentation, a technique that artificially increases the diversity of training data by applying various transformations to the original images. Data augmentation helps improve the model’s ability to generalize and reduces the risk of overfitting. They provide a conceptual explanation of data augmentation and its benefits, emphasizing its role in enhancing model robustness and performance.
    • Exploring torchvision.transforms: The sources guide readers through the torchvision.transforms module, a valuable tool in PyTorch that provides a range of image transformations for data augmentation. They discuss specific transformations like:
    • RandomHorizontalFlip: Randomly flips the image horizontally with a given probability.
    • RandomRotation: Rotates the image by a random angle within a specified range.
    • ColorJitter: Randomly adjusts the brightness, contrast, saturation, and hue of the image.
    • RandomResizedCrop: Crops a random portion of the image and resizes it to a given size.
    • ToTensor: Converts the PIL image to a PyTorch tensor.
    • Normalize: Normalizes the image tensor using specified mean and standard deviation values.
    • Visualizing Transformed Images: The sources demonstrate how to visualize images after applying data augmentation transformations. They create a new transformation pipeline incorporating the desired augmentations and then use the previously defined function to display random images from the dataset after they have been transformed.
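
    A sketch of the function described in the steps above (the name, figure size, and font size are illustrative choices; it assumes the dataset returns CHW tensors, e.g. via ToTensor()):

    import random
    from typing import List

    import matplotlib.pyplot as plt
    from torch.utils.data import Dataset

    def display_random_images(dataset: Dataset, class_names: List[str],
                              n: int = 10, display_shape: bool = True):
        if n > 10:                 # cap the display at 10 images
            n = 10
            display_shape = False
        random_idxs = random.sample(range(len(dataset)), k=n)
        plt.figure(figsize=(16, 8))
        for i, idx in enumerate(random_idxs):
            image, label = dataset[idx]            # calls __getitem__
            plt.subplot(1, n, i + 1)
            plt.imshow(image.permute(1, 2, 0))     # CHW tensor -> HWC for matplotlib
            plt.axis("off")
            title = class_names[label]
            if display_shape:
                title += f"\n{tuple(image.shape)}"
            plt.title(title, fontsize=8)
        plt.show()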

    The sources provide valuable insights into visualizing custom datasets and leveraging data augmentation to improve model training. They explain the creation of a function to display random images, introduce data augmentation as a concept, and explore various transformations provided by the torchvision.transforms module. They also demonstrate how to visualize the effects of these transformations, allowing for a better understanding of how they augment the training data.

    Implementing a Convolutional Neural Network for Food Classification: Pages 751-760

    The sources shift focus to building and training a convolutional neural network (CNN) to classify images from the custom food dataset. They walk through the process of implementing a TinyVGG architecture, setting up training and testing functions, and evaluating the model’s performance.

    • Building a TinyVGG Architecture: The sources introduce the TinyVGG architecture as a simplified version of the popular VGG network, known for its effectiveness in image classification tasks. They provide a step-by-step guide to constructing the TinyVGG model using PyTorch:
    1. Defining Input Shape and Hidden Units: Establish the input shape of the images, considering the number of color channels, height, and width. Also, determine the number of hidden units to use in convolutional layers.
    2. Constructing Convolutional Blocks: Create two convolutional blocks, each consisting of:
    • A 2D convolutional layer (nn.Conv2d) to extract features from the input images.
    • A ReLU activation function (nn.ReLU) to introduce non-linearity.
    • Another 2D convolutional layer.
    • Another ReLU activation function.
    • A max-pooling layer (nn.MaxPool2d) to downsample the feature maps, reducing their spatial dimensions.
    3. Creating the Classifier Layer: Define the classifier layer, responsible for producing the final classification output. This layer comprises:
    • A flattening layer (nn.Flatten) to convert the multi-dimensional feature maps from the convolutional blocks into a one-dimensional feature vector.
    • A linear layer (nn.Linear) to perform the final classification, mapping the features to the number of output classes.
    • A ReLU activation function.
    • Another linear layer to produce the final output with the desired number of classes.
    4. Combining Layers in nn.Sequential: Utilize nn.Sequential to organize and connect the convolutional blocks and the classifier layer in a sequential manner, defining the flow of data through the model.
    • Verifying Model Architecture with torchinfo: The sources introduce the torchinfo package as a helpful tool for summarizing and verifying the architecture of a PyTorch model. They demonstrate its usage by passing the created TinyVGG model to torchinfo.summary, providing a concise overview of the model’s layers, input and output shapes, and the number of trainable parameters.
    • Setting up Training and Testing Functions: The sources outline the process of creating functions for training and testing the TinyVGG model. They provide a detailed explanation of the steps involved in each function:
    • Training Function (train_step): This function handles a single training step, accepting the model, data loader, loss function, optimizer, and device as input:
    1. Set the model to training mode (model.train()).
    2. Iterate through batches of data from the data loader.
    3. For each batch, send the input data and labels to the specified device.
    4. Perform a forward pass through the model to obtain predictions (logits).
    5. Calculate the loss using the provided loss function.
    6. Perform backpropagation to compute gradients.
    7. Update model parameters using the optimizer.
    8. Accumulate training loss for the epoch.
    9. Return the average training loss.
    • Testing Function (test_step): This function evaluates the model’s performance on a given dataset, accepting the model, data loader, loss function, and device as input:
    1. Set the model to evaluation mode (model.eval()).
    2. Disable gradient calculation using torch.no_grad().
    3. Iterate through batches of data from the data loader.
    4. For each batch, send the input data and labels to the specified device.
    5. Perform a forward pass through the model to obtain predictions.
    6. Calculate the loss.
    7. Accumulate testing loss.
    8. Return the average testing loss.
    • Training and Evaluating the Model: The sources guide readers through the process of training the TinyVGG model using the defined training function. They outline steps such as:
    1. Instantiating the model and moving it to the desired device (CPU or GPU).
    2. Defining the loss function (e.g., cross-entropy loss) and optimizer (e.g., SGD).
    3. Setting up the training loop for a specified number of epochs.
    4. Calling the train_step function for each epoch to train the model on the training data.
    5. Evaluating the model’s performance on the test data using the test_step function.
    6. Tracking and printing training and testing losses for each epoch.
    • Visualizing the Loss Curve: The sources emphasize the importance of visualizing the loss curve to monitor the model’s training progress and detect potential issues like overfitting or underfitting. They provide guidance on creating a plot showing the training loss over epochs, allowing users to observe how the loss decreases as the model learns (a plotting sketch follows this list).
    • Preparing for Model Improvement: The sources acknowledge that the initial performance of the TinyVGG model may not be optimal. They suggest various techniques to potentially improve the model’s performance in subsequent steps, paving the way for further experimentation and model refinement.
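
    A minimal plotting sketch, assuming per-epoch loss values were collected into plain Python lists during training:

    import matplotlib.pyplot as plt

    def plot_loss_curves(train_losses, test_losses):
        """Plot training and testing loss together to spot over- or underfitting."""
        epochs = range(len(train_losses))
        plt.plot(epochs, train_losses, label="train loss")
        plt.plot(epochs, test_losses, label="test loss")
        plt.xlabel("Epoch")
        plt.ylabel("Loss")
        plt.legend()
        plt.show()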

    The sources offer a comprehensive walkthrough of building and training a TinyVGG model for image classification using a custom food dataset. They detail the architecture of the model, explain the training and testing procedures, and highlight the significance of visualizing the loss curve. They also lay the foundation for exploring techniques to enhance the model’s performance in later stages.

    Improving Model Performance and Tracking Experiments: Pages 761-770

    The sources transition from establishing a baseline model to exploring techniques for enhancing its performance and introduce methods for tracking experimental results. They focus on data augmentation strategies using the torchvision.transforms module and creating a system for comparing different model configurations.

    • Evaluating the Custom ImageDataset: The sources revisit the custom ImageDataset created earlier, emphasizing the importance of assessing its functionality. They use the previously defined plot_random_images function to visually inspect a sample of images from the dataset, confirming that the images are loaded correctly and transformed as intended.
    • Data Augmentation for Enhanced Performance: The sources delve deeper into data augmentation as a crucial technique for improving the model’s ability to generalize to unseen data. They highlight how data augmentation artificially increases the diversity and size of the training data, leading to more robust models that are less prone to overfitting.
    • Exploring torchvision.transforms for Augmentation: The sources guide users through different data augmentation techniques available in the torchvision.transforms module. They explain the purpose and effects of various transformations, including:
    • RandomHorizontalFlip: Randomly flips the image horizontally, adding variability to the dataset.
    • RandomRotation: Rotates the image by a random angle within a specified range, exposing the model to different orientations.
    • ColorJitter: Randomly adjusts the brightness, contrast, saturation, and hue of the image, making the model more robust to variations in lighting and color.
    • Visualizing Augmented Images: The sources demonstrate how to visualize the effects of data augmentation by applying transformations to images and then displaying the transformed images. This visual inspection helps understand the impact of the augmentations and ensure they are applied correctly.
    • Introducing TrivialAugment: The sources introduce TrivialAugment, a data augmentation strategy that randomly applies a sequence of simple augmentations to each image. They explain that TrivialAugment has been shown to be effective in improving model performance, particularly when combined with other techniques. They provide a link to a research paper for further reading on TrivialAugment, encouraging users to explore the strategy in more detail.
    • Applying TrivialAugment to the Custom Dataset: The sources guide users through applying TrivialAugment to the custom food dataset. They create a new transformation pipeline incorporating TrivialAugment and then use the plot_random_images function to display a sample of augmented images, allowing users to visually assess the impact of the augmentations.
    • Creating a System for Comparing Model Results: The sources shift focus to establishing a structured approach for tracking and comparing the performance of different model configurations. They create a dictionary called compare_results to store results from various model experiments. This dictionary is designed to hold information such as training time, training loss, testing loss, and testing accuracy for each model.
    • Setting Up a Pandas DataFrame: The sources introduce Pandas DataFrames as a convenient tool for organizing and analyzing experimental results. They convert the compare_results dictionary into a Pandas DataFrame, providing a structured table-like representation of the results, making it easier to compare the performance of different models.
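
    Two short sketches for the ideas above: the first applies TrivialAugment via torchvision’s TrivialAugmentWide transform, and the second shows the compare_results-to-DataFrame pattern. The metric numbers are made-up placeholders for illustration, not real results:

    import pandas as pd
    from torchvision import transforms

    # TrivialAugment: one randomly chosen augmentation per image, applied before ToTensor
    train_transform = transforms.Compose([
        transforms.Resize(size=(64, 64)),
        transforms.TrivialAugmentWide(num_magnitude_bins=31),
        transforms.ToTensor(),
    ])

    # Experiment tracking: one entry per model run (placeholder values only)
    compare_results = {
        "model_0_baseline": {"train_loss": 1.10, "test_loss": 1.09, "test_acc": 0.39},
        "model_1_augmented": {"train_loss": 1.08, "test_loss": 1.07, "test_acc": 0.43},
    }
    results_df = pd.DataFrame(compare_results).T  # rows = experiments, columns = metrics
    print(results_df.sort_values("test_loss"))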

    The sources provide valuable insights into techniques for improving model performance, specifically focusing on data augmentation strategies. They guide users through various transformations available in the torchvision.transforms module, explain the concept and benefits of TrivialAugment, and demonstrate how to visualize the effects of these augmentations. Moreover, they introduce a structured approach for tracking and comparing experimental results using a dictionary and a Pandas DataFrame, laying the groundwork for systematic model experimentation and analysis.

    Predicting on a Custom Image and Wrapping Up the Custom Datasets Section: Pages 771-780

    The sources shift focus to making predictions on a custom image using the trained TinyVGG model and summarize the key concepts covered in the custom datasets section. They guide users through the process of preparing the image, making predictions, and analyzing the results.

    • Preparing a Custom Image for Prediction: The sources outline the steps for preparing a custom image for prediction (a short code sketch follows the list):
    1. Obtaining the Image: Acquire an image that aligns with the classes the model was trained on. In this case, the image should be of either pizza, steak, or sushi.
    2. Resizing and Converting to RGB: Ensure the image is resized to the dimensions expected by the model (64×64 in this case) and converted to RGB format. This resizing step is crucial as the model was trained on images with specific dimensions and expects the same input format during prediction.
    3. Converting to a PyTorch Tensor: Transform the image into a PyTorch tensor using torchvision.transforms.ToTensor(). This conversion is necessary to feed the image data into the PyTorch model.
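    A sketch of these three steps, assuming a placeholder file name custom_image.jpg:

    ```python
    from PIL import Image
    from torchvision import transforms

    # 1. Obtain an image of pizza, steak, or sushi (the path is a placeholder).
    image = Image.open("custom_image.jpg").convert("RGB")  # 2. ensure three channels

    prep = transforms.Compose([
        transforms.Resize((64, 64)),  # 2. match the training resolution
        transforms.ToTensor(),        # 3. convert to a float tensor in [0, 1]
    ])

    image_tensor = prep(image)        # shape: [3, 64, 64]
    ```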
    • Making Predictions with the Trained Model: The sources walk through the process of using the trained TinyVGG model to make predictions on the prepared custom image (sketched in code after the list):
    1. Setting the Model to Evaluation Mode: Switch the model to evaluation mode using model.eval(). This step ensures that the model behaves appropriately for prediction, deactivating functionalities like dropout that are only used during training.
    2. Performing a Forward Pass: Pass the prepared image tensor through the model to obtain the model’s predictions (logits).
    3. Applying Softmax to Obtain Probabilities: Convert the raw logits into prediction probabilities using the softmax function (torch.softmax()). Softmax transforms the logits into a probability distribution, where each value represents the model’s confidence in the image belonging to a particular class.
    4. Determining the Predicted Class: Identify the class with the highest predicted probability, representing the model’s final prediction for the input image.
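    A sketch of the prediction steps, assuming model and a class_names list (e.g. ["pizza", "steak", "sushi"]) exist from earlier in the workflow:

    ```python
    import torch

    model.eval()                                   # 1. evaluation mode
    with torch.inference_mode():
        logits = model(image_tensor.unsqueeze(0))  # 2. add a batch dimension -> [1, 3, 64, 64]

    probs = torch.softmax(logits, dim=1)           # 3. logits -> probability distribution
    pred_idx = torch.argmax(probs, dim=1).item()   # 4. class with the highest probability
    print(f"Prediction: {class_names[pred_idx]} ({probs.max().item():.3f})")
    ```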
    • Analyzing the Prediction Results: The sources emphasize the importance of carefully analyzing the prediction results, considering both quantitative and qualitative aspects. They highlight that even when the model’s accuracy is imperfect, a qualitative assessment of the predictions can provide valuable insights into the model’s behavior and potential areas for improvement.
    • Summarizing the Custom Datasets Section: The sources provide a comprehensive summary of the key concepts covered in the custom datasets section:
    1. Understanding Custom Datasets: They reiterate the importance of working with custom datasets, especially when dealing with domain-specific problems or when pre-trained models may not be readily available. They emphasize the ability of custom datasets to address unique challenges and tailor models to specific needs.
    2. Building a Custom Dataset: They recap the process of building a custom dataset using torchvision.datasets.ImageFolder. They highlight the benefits of ImageFolder for handling image data organized in standard image classification format, where images are stored in separate folders representing different classes.
    3. Creating a Custom ImageDataset Class: They review the steps involved in creating a custom ImageDataset class, demonstrating the flexibility and control this approach offers for handling and processing data. They explain the key methods required for a custom dataset, including __init__, __len__, and __getitem__, and how these methods interact with the data loader.
    4. Data Augmentation Techniques: They emphasize the importance of data augmentation for improving model performance, particularly in scenarios where the training data is limited. They reiterate the techniques explored earlier, including random horizontal flipping, random rotation, color jittering, and TrivialAugment, highlighting how these techniques can enhance the model’s ability to generalize to unseen data.
    5. Training and Evaluating Models: They summarize the process of training and evaluating models on custom datasets, highlighting the steps involved in setting up training loops, evaluating model performance, and visualizing results.
    • Introducing Exercises and Extra Curriculum: The sources conclude the custom datasets section by providing a set of exercises and extra curriculum resources to reinforce the concepts covered. They direct users to the learnpytorch.io website and the pytorch-deep-learning GitHub repository for exercise templates, example solutions, and additional learning materials.
    • Previewing Upcoming Sections: The sources briefly preview the upcoming sections of the course, hinting at topics like transfer learning, model experiment tracking, paper replicating, and more advanced architectures. They encourage users to continue their learning journey, exploring more complex concepts and techniques in deep learning with PyTorch.

    The sources provide a practical guide to making predictions on a custom image using a trained TinyVGG model, carefully explaining the preparation steps, prediction process, and analysis of results. Additionally, they offer a concise summary of the key concepts covered in the custom datasets section, reinforcing the understanding of custom datasets, data augmentation techniques, and model training and evaluation. Finally, they introduce exercises and extra curriculum resources to encourage further practice and learning while previewing the exciting topics to come in the remainder of the course.

    Setting Up a TinyVGG Model and Exploring Model Architectures: Pages 781-790

    The sources transition from data preparation and augmentation to building a convolutional neural network (CNN) model using the TinyVGG architecture. They guide users through the process of defining the model’s architecture, understanding its components, and preparing it for training.

    • Introducing the TinyVGG Architecture: The sources introduce TinyVGG, a simplified version of the VGG (Visual Geometry Group) architecture, known for its effectiveness in image classification tasks. They provide a visual representation of the TinyVGG architecture, outlining its key components, including:
    • Convolutional Blocks: The foundation of TinyVGG, composed of convolutional layers (nn.Conv2d) followed by ReLU activation functions (nn.ReLU) and max-pooling layers (nn.MaxPool2d). Convolutional layers extract features from the input images, ReLU introduces non-linearity, and max-pooling downsamples the feature maps, reducing their dimensionality and making the model more robust to variations in the input.
    • Classifier Layer: The final layer of TinyVGG, responsible for classifying the extracted features into different categories. It consists of a flattening layer (nn.Flatten), which converts the multi-dimensional feature maps from the convolutional blocks into a single vector, followed by a linear layer (nn.Linear) that outputs a score for each class.
    • Building a TinyVGG Model in PyTorch: The sources provide a step-by-step guide to building a TinyVGG model in PyTorch using the nn.Module class. They explain the structure of the model definition, outlining the key components (a full class sketch follows the list):
    1. __init__ Method: Initializes the model’s layers and components, including convolutional blocks and the classifier layer.
    2. forward Method: Defines the forward pass of the model, specifying how the input data flows through the different layers and operations.
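    A compact sketch of a TinyVGG-style class; the hidden-unit count and padding choices here are illustrative (the course’s exact layer settings may differ) and assume 64×64 RGB inputs:

    ```python
    import torch
    from torch import nn

    class TinyVGG(nn.Module):
        def __init__(self, input_channels: int = 3, hidden_units: int = 10,
                     output_classes: int = 3):
            super().__init__()
            # Two convolutional blocks: conv -> ReLU -> conv -> ReLU -> max pool.
            self.conv_block_1 = nn.Sequential(
                nn.Conv2d(input_channels, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2),  # 64x64 -> 32x32
            )
            self.conv_block_2 = nn.Sequential(
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2),  # 32x32 -> 16x16
            )
            # Classifier: flatten the feature maps, then a single linear layer.
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(hidden_units * 16 * 16, output_classes),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.classifier(self.conv_block_2(self.conv_block_1(x)))
    ```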
    • Understanding Input and Output Shapes: The sources emphasize the importance of understanding and verifying the input and output shapes of each layer in the model. They guide users through calculating the dimensions of the feature maps at different stages of the network, taking into account factors such as the kernel size, stride, and padding of the convolutional layers. This understanding of shape transformations is crucial for ensuring that data flows correctly through the network and for debugging potential shape mismatches.
    • Passing a Random Tensor Through the Model: The sources recommend passing a random tensor with the expected input shape through the model as a preliminary step to verify the model’s architecture and identify potential shape errors. This technique helps ensure that data can successfully flow through the network before proceeding with training.
    • Introducing torchinfo for Model Summary: The sources introduce the torchinfo package as a helpful tool for summarizing PyTorch models. They demonstrate how to use torchinfo.summary to obtain a concise overview of the model’s architecture, including the input and output shapes of each layer and the number of trainable parameters. This package provides a convenient way to visualize and verify the model’s structure, making it easier to understand and debug.
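    Both checks might look like this, reusing the TinyVGG sketch above (torchinfo is a third-party package, installable with pip install torchinfo):

    ```python
    import torch
    from torchinfo import summary

    model = TinyVGG(input_channels=3, hidden_units=10, output_classes=3)

    # Shape check: pass a random batch through the model first.
    dummy = torch.randn(1, 3, 64, 64)
    print(model(dummy).shape)  # expected: torch.Size([1, 3])

    # Layer-by-layer overview: input/output shapes and parameter counts.
    summary(model, input_size=(1, 3, 64, 64))
    ```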

    The sources provide a detailed walkthrough of building a TinyVGG model in PyTorch, explaining the architecture’s components, the steps involved in defining the model using nn.Module, and the significance of understanding input and output shapes. They introduce practical techniques like passing a random tensor through the model for verification and leverage the torchinfo package for obtaining a comprehensive model summary. These steps lay a solid foundation for building and understanding CNN models for image classification tasks.

    Training the TinyVGG Model and Evaluating its Performance: Pages 791-800

    The sources shift focus to training the constructed TinyVGG model on the custom food image dataset. They guide users through creating training and testing functions, setting up a training loop, and evaluating the model’s performance using metrics like loss and accuracy.

    • Creating Training and Testing Functions: The sources outline the process of creating separate functions for the training and testing steps, promoting modularity and code reusability (both functions are sketched after the two lists below).
    • train_step Function: This function performs a single training step, encompassing the forward pass, loss calculation, backpropagation, and parameter updates.
    1. Forward Pass: It takes a batch of data from the training dataloader, passes it through the model, and obtains the model’s predictions.
    2. Loss Calculation: It calculates the loss between the predictions and the ground truth labels using a chosen loss function (e.g., cross-entropy loss for classification).
    3. Backpropagation: It computes the gradients of the loss with respect to the model’s parameters using the loss.backward() method. Backpropagation determines how each parameter contributed to the error, guiding the optimization process.
    4. Parameter Updates: It updates the model’s parameters based on the computed gradients using an optimizer (e.g., stochastic gradient descent). The optimizer adjusts the parameters to minimize the loss, improving the model’s performance over time.
    5. Accuracy Calculation: It calculates the accuracy of the model’s predictions on the current batch of training data. Accuracy measures the proportion of correctly classified samples.
    • test_step Function: This function evaluates the model’s performance on a batch of test data, computing the loss and accuracy without updating the model’s parameters.
    1. Forward Pass: It takes a batch of data from the testing dataloader, passes it through the model, and obtains the model’s predictions. The model’s behavior is set to evaluation mode (model.eval()) before performing the forward pass to ensure that training-specific functionalities like dropout are deactivated.
    2. Loss Calculation: It calculates the loss between the predictions and the ground truth labels using the same loss function as in train_step.
    3. Accuracy Calculation: It calculates the accuracy of the model’s predictions on the current batch of testing data.
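    A sketch of both functions along these lines; the argmax-based accuracy calculation and the device argument are assumptions of this sketch:

    ```python
    import torch
    from torch import nn
    from torch.utils.data import DataLoader

    def train_step(model: nn.Module, dataloader: DataLoader, loss_fn: nn.Module,
                   optimizer: torch.optim.Optimizer, device: str = "cpu"):
        model.train()
        train_loss, train_acc = 0.0, 0.0
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            logits = model(X)                  # 1. forward pass
            loss = loss_fn(logits, y)          # 2. loss calculation
            optimizer.zero_grad()
            loss.backward()                    # 3. backpropagation
            optimizer.step()                   # 4. parameter update
            train_loss += loss.item()
            train_acc += (logits.argmax(dim=1) == y).float().mean().item()  # 5. accuracy
        return train_loss / len(dataloader), train_acc / len(dataloader)

    def test_step(model: nn.Module, dataloader: DataLoader, loss_fn: nn.Module,
                  device: str = "cpu"):
        model.eval()                           # evaluation mode: dropout etc. off
        test_loss, test_acc = 0.0, 0.0
        with torch.inference_mode():           # no gradient tracking needed
            for X, y in dataloader:
                X, y = X.to(device), y.to(device)
                logits = model(X)              # 1. forward pass
                test_loss += loss_fn(logits, y).item()  # 2. loss
                test_acc += (logits.argmax(dim=1) == y).float().mean().item()  # 3. accuracy
        return test_loss / len(dataloader), test_acc / len(dataloader)
    ```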
    • Setting up a Training Loop: The sources demonstrate the implementation of a training loop that iterates through the training data for a specified number of epochs, calling the train_step and test_step functions at each epoch.
    1. Epoch Iteration: The loop iterates for a predefined number of epochs, each epoch representing a complete pass through the entire training dataset.
    2. Training Phase: For each epoch, the loop iterates through the batches of training data provided by the training dataloader, calling the train_step function for each batch. The train_step function performs the forward pass, loss calculation, backpropagation, and parameter updates as described above. The training loss and accuracy values are accumulated across all batches within an epoch.
    3. Testing Phase: After each epoch, the loop iterates through the batches of testing data provided by the testing dataloader, calling the test_step function for each batch. The test_step function computes the loss and accuracy on the testing data without updating the model’s parameters. The testing loss and accuracy values are also accumulated across all batches.
    4. Printing Progress: The loop prints the training and testing loss and accuracy values at regular intervals, typically after each epoch or a set number of epochs. This step provides feedback on the model’s progress and allows for monitoring its performance over time.
    • Visualizing Training Progress: The sources highlight the importance of visualizing the training process, particularly the loss curves, to gain insights into the model’s behavior and identify potential issues like overfitting or underfitting. They suggest plotting the training and testing losses over epochs to observe how the loss values change during training.

    The sources guide users through setting up a robust training pipeline for the TinyVGG model, emphasizing modularity through separate training and testing functions and a structured training loop. They recommend monitoring and visualizing training progress, particularly using loss curves, to gain a deeper understanding of the model’s behavior and performance. These steps provide a practical foundation for training and evaluating CNN models on custom image datasets.

    Training and Experimenting with the TinyVGG Model on a Custom Dataset: Pages 801-810

    The sources guide users through training their TinyVGG model on the custom food image dataset using the training functions and loop set up in the previous steps. They emphasize the importance of tracking and comparing model results, including metrics like loss, accuracy, and training time, to evaluate performance and make informed decisions about model improvements.

    • Tracking Model Results: The sources recommend using a dictionary to store the training and testing results for each epoch, including the training loss, training accuracy, testing loss, and testing accuracy. This approach allows users to track the model’s performance over epochs and to easily compare the results of different models or training configurations. [1]
    • Setting Up the Training Process: The sources provide code for setting up the training process (a wrapper in this spirit is sketched after the list), including:
    1. Initializing a Results Dictionary: Creating a dictionary to store the model’s training and testing results. [1]
    2. Implementing the Training Loop: Utilizing the tqdm library to display a progress bar during training and iterating through the specified number of epochs. [2]
    3. Calling Training and Testing Functions: Invoking the train_step and test_step functions for each epoch, passing in the necessary arguments, including the model, dataloaders, loss function, optimizer, and device. [3]
    4. Updating the Results Dictionary: Storing the training and testing loss and accuracy values for each epoch in the results dictionary. [2]
    5. Printing Epoch Results: Displaying the training and testing results for each epoch. [3]
    6. Calculating and Printing Total Training Time: Measuring the total time taken for training and printing the result. [4]
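    A wrapper in this spirit, reusing the train_step and test_step sketches from earlier (tqdm is a third-party progress-bar package):

    ```python
    from timeit import default_timer as timer
    from tqdm.auto import tqdm

    def train(model, train_dataloader, test_dataloader, loss_fn, optimizer,
              epochs: int = 5, device: str = "cpu"):
        # 1. Results dictionary, one list per tracked metric.
        results = {"train_loss": [], "train_acc": [], "test_loss": [], "test_acc": []}
        start = timer()
        for epoch in tqdm(range(epochs)):              # 2. progress bar over epochs
            train_loss, train_acc = train_step(model, train_dataloader,
                                               loss_fn, optimizer, device)  # 3.
            test_loss, test_acc = test_step(model, test_dataloader, loss_fn, device)
            results["train_loss"].append(train_loss)   # 4. update the results dict
            results["train_acc"].append(train_acc)
            results["test_loss"].append(test_loss)
            results["test_acc"].append(test_acc)
            print(f"Epoch {epoch} | train_loss: {train_loss:.4f} | "
                  f"test_loss: {test_loss:.4f} | test_acc: {test_acc:.4f}")  # 5.
        print(f"Total training time: {timer() - start:.1f} s")               # 6.
        return results
    ```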
    • Evaluating and Comparing Model Results: The sources guide users through plotting the training and testing losses and accuracies over epochs to visualize the model’s performance. They explain how to analyze the loss curves for insights into the training process, such as identifying potential overfitting or underfitting. [5, 6] They also recommend comparing the results of different models trained with various configurations to understand the impact of different architectural choices or hyperparameters on performance. [7]
    • Improving Model Performance: Building upon the visualization and comparison of results, the sources discuss strategies for improving the model’s performance, including:
    1. Adding More Layers: Increasing the depth of the model to enable it to learn more complex representations of the data. [8]
    2. Adding More Hidden Units: Expanding the capacity of each layer to enhance its ability to capture intricate patterns in the data. [8]
    3. Training for Longer: Increasing the number of epochs to allow the model more time to learn from the data. [9]
    4. Using a Smaller Learning Rate: Adjusting the learning rate, which determines the step size during parameter updates, to potentially improve convergence and prevent oscillations around the optimal solution. [8]
    5. Trying a Different Optimizer: Exploring alternative optimization algorithms, each with its unique approach to updating parameters, to potentially find one that better suits the specific problem. [8]
    6. Using Learning Rate Decay: Gradually reducing the learning rate over epochs to fine-tune the model and improve convergence towards the optimal solution. [8]
    7. Adding Regularization Techniques: Implementing methods like dropout or weight decay to prevent overfitting, which occurs when the model learns the training data too well and performs poorly on unseen data. [8]
    • Visualizing Loss Curves: The sources emphasize the importance of understanding and interpreting loss curves to gain insights into the training process. They provide visual examples of different loss curve shapes and explain how to identify potential issues like overfitting or underfitting based on the curves’ behavior. They also offer guidance on interpreting ideal loss curves and discuss strategies for addressing problems like overfitting or underfitting, pointing to additional resources for further exploration. [5, 10]
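    A plotting sketch that consumes the results dictionary produced by the training wrapper above:

    ```python
    import matplotlib.pyplot as plt

    def plot_loss_curves(results: dict):
        epochs = range(len(results["train_loss"]))
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
        ax1.plot(epochs, results["train_loss"], label="train_loss")
        ax1.plot(epochs, results["test_loss"], label="test_loss")
        ax1.set(title="Loss", xlabel="Epoch")
        ax1.legend()
        ax2.plot(epochs, results["train_acc"], label="train_acc")
        ax2.plot(epochs, results["test_acc"], label="test_acc")
        ax2.set(title="Accuracy", xlabel="Epoch")
        ax2.legend()
        plt.show()
    ```

    Diverging curves (training loss falling while testing loss rises) typically signal overfitting; both losses staying high suggests underfitting.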

    The sources offer a structured approach to training and evaluating the TinyVGG model on a custom food image dataset, encouraging the use of dictionaries to track results, visualizing performance through loss curves, and comparing different model configurations. They discuss potential areas for model improvement and highlight resources for delving deeper into advanced techniques like learning rate scheduling and regularization. These steps empower users to systematically experiment, analyze, and enhance their models’ performance on image classification tasks using custom datasets.

    Evaluating Model Performance and Introducing Data Augmentation: Pages 811-820

    The sources emphasize the need to comprehensively evaluate model performance beyond just loss and accuracy. They introduce concepts like training time and tools for visualizing comparisons between different trained models. They also explore the concept of data augmentation as a strategy to improve model performance, focusing specifically on the “Trivial Augment” technique.

    • Comparing Model Results: The sources guide users through creating a Pandas DataFrame to organize and compare the results of different trained models. The DataFrame includes columns for metrics like training loss, training accuracy, testing loss, testing accuracy, and training time, allowing for a clear comparison of the models’ performance across various metrics.
    • Data Augmentation: The sources explain data augmentation as a technique for artificially increasing the diversity and size of the training dataset by applying various transformations to the original images. Data augmentation aims to improve the model’s generalization ability and reduce overfitting by exposing the model to a wider range of variations within the training data.
    • Trivial Augment: The sources focus on Trivial Augment [1], a data augmentation technique known for its simplicity and effectiveness. They guide users through implementing Trivial Augment using PyTorch’s torchvision.transforms module, showcasing how to apply transformations like random cropping, horizontal flipping, color jittering, and other augmentations to the training images. They provide code examples for defining a transformation pipeline using torchvision.transforms.Compose to apply a sequence of augmentations to the input images.
    • Visualizing Augmented Images: The sources recommend visualizing the augmented images to ensure that the applied transformations are appropriate and effective. They provide code using Matplotlib to display a grid of augmented images, allowing users to visually inspect the impact of the transformations on the training data.
    • Understanding the Benefits of Data Augmentation: The sources explain the potential benefits of data augmentation, including:
    • Improved Generalization: Exposing the model to a wider range of variations within the training data can help it learn more robust and generalizable features, leading to better performance on unseen data.
    • Reduced Overfitting: Increasing the diversity of the training data can mitigate overfitting, which occurs when the model learns the training data too well and performs poorly on new, unseen data.
    • Increased Effective Dataset Size: Artificially expanding the training dataset through augmentations can be beneficial when the original dataset is relatively small.

    The sources present a structured approach to evaluating and comparing model performance using Pandas DataFrames. They introduce data augmentation, particularly Trivial Augment, as a valuable technique for enhancing model generalization and performance. They guide users through implementing data augmentation pipelines using PyTorch’s torchvision.transforms module and recommend visualizing augmented images to ensure their effectiveness. These steps empower users to perform thorough model evaluation, understand the importance of data augmentation, and implement it effectively using PyTorch to potentially boost model performance on image classification tasks.

    Exploring Convolutional Neural Networks and Building a Custom Model: Pages 821-830

    The sources shift focus to the fundamentals of Convolutional Neural Networks (CNNs), introducing their key components and operations. They walk users through building a custom CNN model, incorporating concepts like convolutional layers, ReLU activation functions, max pooling layers, and flattening layers to create a model capable of learning from image data.

    • Introduction to CNNs: The sources provide an overview of CNNs, explaining their effectiveness in image classification tasks due to their ability to learn spatial hierarchies of features. They introduce the essential components of a CNN, including:
    1. Convolutional Layers: Convolutional layers apply filters to the input image to extract features like edges, textures, and patterns. These filters slide across the image, performing convolutions to create feature maps that capture different aspects of the input.
    2. ReLU Activation Function: ReLU (Rectified Linear Unit) is a non-linear activation function applied to the output of convolutional layers. It introduces non-linearity into the model, allowing it to learn complex relationships between features.
    3. Max Pooling Layers: Max pooling layers downsample the feature maps produced by convolutional layers, reducing their dimensionality while retaining important information. They help make the model more robust to variations in the input image.
    4. Flattening Layer: A flattening layer converts the multi-dimensional output of the convolutional and pooling layers into a one-dimensional vector, preparing it as input for the fully connected layers of the network.
    • Building a Custom CNN Model: The sources guide users through constructing a custom CNN model using PyTorch’s nn.Module class. They outline a step-by-step process, explaining how to define the model’s architecture:
    1. Defining the Model Class: Creating a Python class that inherits from nn.Module, setting up the model’s structure and layers.
    2. Initializing the Layers: Instantiating the convolutional layers (nn.Conv2d), ReLU activation function (nn.ReLU), max-pooling layers (nn.MaxPool2d), and flattening layer (nn.Flatten) within the model’s constructor (__init__).
    3. Implementing the Forward Pass: Defining the forward method, outlining the flow of data through the model’s layers during the forward pass, including the application of convolutional operations, activation functions, and pooling.
    4. Setting Model Input Shape: Determining the expected input shape for the model based on the dimensions of the input images, considering the number of color channels, height, and width.
    5. Verifying Input and Output Shapes: Ensuring that the input and output shapes of each layer are compatible, using techniques like printing intermediate shapes or utilizing tools like torchinfo to summarize the model’s architecture.
    • Understanding Input and Output Shapes: The sources highlight the importance of comprehending the input and output shapes of each layer in the CNN. They explain how to calculate the output shape of convolutional layers based on factors like kernel size, stride, and padding, providing resources for a deeper understanding of these concepts.
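    The standard formula is output_size = floor((input_size + 2 * padding - kernel_size) / stride) + 1, which can be verified directly in PyTorch:

    ```python
    import torch
    from torch import nn

    # For a 64x64 input with kernel_size=3, stride=1, padding=0:
    #   (64 + 2*0 - 3) / 1 + 1 = 62
    conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=3, stride=1, padding=0)
    out = conv(torch.randn(1, 3, 64, 64))
    print(out.shape)  # torch.Size([1, 10, 62, 62])
    ```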
    • Using torchinfo for Model Summary: The sources introduce the torchinfo package as a helpful tool for summarizing PyTorch models, visualizing their architecture, and verifying input and output shapes. They demonstrate how to use torchinfo to print a concise summary of the model’s layers, parameters, and input/output sizes, aiding in understanding the model’s structure and ensuring its correctness.

    The sources provide a clear and structured introduction to CNNs and guide users through building a custom CNN model using PyTorch. They explain the key components of CNNs, including convolutional layers, activation functions, pooling layers, and flattening layers. They walk users through defining the model’s architecture, understanding input/output shapes, and using tools like torchinfo to visualize and verify the model’s structure. These steps equip users with the knowledge and skills to create and work with CNNs for image classification tasks using custom datasets.

    Training and Evaluating the TinyVGG Model: Pages 831-840

    The sources walk users through the process of training and evaluating the TinyVGG model using the custom dataset created in the previous steps. They guide users through setting up training and testing functions, training the model for multiple epochs, visualizing the training progress using loss curves, and comparing the performance of the custom TinyVGG model to a baseline model.

    • Setting up Training and Testing Functions: The sources present Python functions for training and testing the model, highlighting the key steps involved in each phase:
    • train_step Function: This function performs a single training step, iterating through batches of training data and performing the following actions:
    1. Forward Pass: Passing the input data through the model to get predictions.
    2. Loss Calculation: Computing the loss between the predictions and the target labels using a chosen loss function.
    3. Backpropagation: Calculating gradients of the loss with respect to the model’s parameters.
    4. Optimizer Update: Updating the model’s parameters using an optimization algorithm to minimize the loss.
    5. Accuracy Calculation: Calculating the accuracy of the model’s predictions on the training batch.
    • test_step Function: Similar to the train_step function, this function evaluates the model’s performance on the test data, iterating through batches of test data and performing the forward pass, loss calculation, and accuracy calculation.
    • Training the Model: The sources guide users through training the TinyVGG model for a specified number of epochs, calling the train_step and test_step functions in each epoch. They showcase how to track and store the training and testing loss and accuracy values across epochs for later analysis and visualization.
    • Visualizing Training Progress with Loss Curves: The sources emphasize the importance of visualizing the training progress by plotting loss curves. They explain that loss curves depict the trend of the loss value over epochs, providing insights into the model’s learning process.
    • Interpreting Loss Curves: They guide users through interpreting loss curves, highlighting that a decreasing loss generally indicates that the model is learning effectively. They explain that if the training loss continues to decrease but the testing loss starts to increase or plateau, it might indicate overfitting, where the model performs well on the training data but poorly on unseen data.
    • Comparing Models and Exploring Hyperparameter Tuning: The sources compare the performance of the custom TinyVGG model to a baseline model, providing insights into the effectiveness of the chosen architecture. They suggest exploring techniques like hyperparameter tuning to potentially improve the model’s performance.
    • Hyperparameter Tuning: They briefly introduce hyperparameter tuning as the process of finding the optimal values for the model’s hyperparameters, such as learning rate, batch size, and the number of hidden units.

    The sources provide a comprehensive guide to training and evaluating the TinyVGG model using the custom dataset. They outline the steps involved in creating training and testing functions, performing the training process, visualizing training progress using loss curves, and comparing the model’s performance to a baseline model. These steps equip users with a structured approach to training, evaluating, and iteratively improving CNN models for image classification tasks.

    Saving, Loading, and Reflecting on the PyTorch Workflow: Pages 841-850

    The sources guide users through saving and loading the trained TinyVGG model, emphasizing the importance of preserving trained models for future use. They also provide a comprehensive reflection on the key steps involved in the PyTorch workflow for computer vision tasks, summarizing the concepts and techniques covered throughout the previous sections and offering insights into the overall process.

    • Saving and Loading the Trained Model: The sources highlight the significance of saving trained models to avoid retraining from scratch. They explain that saving the model’s state dictionary, which contains the learned parameters, allows for easy reloading and reuse (see the sketch after these bullets).
    • Using torch.save: They demonstrate how to use PyTorch’s torch.save function to save the model’s state dictionary to a file, specifying the file path and the state dictionary as arguments. This step ensures that the trained model’s parameters are stored persistently.
    • Using torch.load: They showcase how to use PyTorch’s torch.load function to load the saved state dictionary back into a new model instance. They explain the importance of creating a new model instance with the same architecture as the saved model before loading the state dictionary. This step allows for seamless restoration of the trained model’s parameters.
    • Verifying Loaded Model: They suggest making predictions using the loaded model to ensure that it performs as expected and the loading process was successful.
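    A minimal save-and-reload sketch; the file path is a placeholder and TinyVGG refers to the class sketched earlier:

    ```python
    import torch

    MODEL_PATH = "models/tiny_vgg.pth"  # illustrative path

    # Save only the learned parameters (the state dict), not the whole object.
    torch.save(obj=model.state_dict(), f=MODEL_PATH)

    # Restore: recreate the same architecture, then load the parameters into it.
    loaded_model = TinyVGG(input_channels=3, hidden_units=10, output_classes=3)
    loaded_model.load_state_dict(torch.load(MODEL_PATH))
    loaded_model.eval()  # verify with a prediction on known data
    ```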
    • Reflecting on the PyTorch Workflow: The sources provide a comprehensive recap of the essential steps involved in the PyTorch workflow for computer vision tasks, summarizing the concepts and techniques covered in the previous sections. They present a structured overview of the workflow, highlighting the following key stages:
    1. Data Preparation: Preparing the data, including loading, splitting into training and testing sets, and applying necessary transformations.
    2. Model Building: Constructing the neural network model, defining its architecture, layers, and activation functions.
    3. Loss Function and Optimizer Selection: Choosing an appropriate loss function to measure the model’s performance and an optimizer to update the model’s parameters during training.
    4. Training Loop: Implementing a training loop to iteratively train the model on the training data, performing forward passes, loss calculations, backpropagation, and optimizer updates.
    5. Model Evaluation: Evaluating the model’s performance on the test data, using metrics like loss and accuracy.
    6. Hyperparameter Tuning and Experimentation: Exploring different model architectures, hyperparameters, and data augmentation techniques to potentially improve the model’s performance.
    7. Saving and Loading the Model: Preserving the trained model by saving its state dictionary to a file for future use.
    • Encouraging Further Exploration and Practice: The sources emphasize that mastering the PyTorch workflow requires practice and encourage users to explore different datasets, models, and techniques to deepen their understanding. They recommend referring to the PyTorch documentation and online resources for additional learning and problem-solving.

    The sources provide clear guidance on saving and loading trained models, emphasizing the importance of preserving trained models for reuse. They offer a thorough recap of the PyTorch workflow for computer vision tasks, summarizing the key steps and techniques covered in the previous sections. They guide users through the process of saving the model’s state dictionary and loading it back into a new model instance. By emphasizing the overall workflow and providing practical examples, the sources equip users with a solid foundation for tackling computer vision projects using PyTorch. They encourage further exploration and experimentation to solidify understanding and enhance practical skills in building, training, and deploying computer vision models.

    Expanding the Horizons of PyTorch: Pages 851-860

    The sources shift focus from the specific TinyVGG model and custom dataset to a broader exploration of PyTorch’s capabilities. They introduce additional concepts, resources, and areas of study within the realm of deep learning and PyTorch, encouraging users to expand their knowledge and pursue further learning beyond the scope of the initial tutorial.

    • Advanced Topics and Resources for Further Learning: The sources recognize that the covered material represents a foundational introduction to PyTorch and deep learning, and they acknowledge that there are many more advanced topics and areas of specialization within this field.
    • Transfer Learning: The sources highlight transfer learning as a powerful technique that involves leveraging pre-trained models on large datasets to improve the performance on new, potentially smaller datasets.
    • Model Experiment Tracking: They introduce the concept of model experiment tracking, emphasizing the importance of keeping track of different model architectures, hyperparameters, and results for organized experimentation and analysis.
    • PyTorch Paper Replication: The sources mention the practice of replicating research papers that introduce new deep learning architectures or techniques using PyTorch. They suggest that this is a valuable way to gain deeper understanding and practical experience with cutting-edge advancements in the field.
    • Additional Chapters and Resources: The sources point to additional chapters and resources available on the learnpytorch.io website, indicating that the learning journey continues beyond the current section. They encourage users to explore these resources to deepen their understanding of various aspects of deep learning and PyTorch.
    • Encouraging Continued Learning and Exploration: The sources strongly emphasize the importance of continuous learning and exploration within the field of deep learning. They recognize that deep learning is a rapidly evolving field with new architectures, techniques, and applications emerging frequently.
    • Staying Updated with Advancements: They advise users to stay updated with the latest research papers, blog posts, and online courses to keep their knowledge and skills current.
    • Building Projects and Experimenting: The sources encourage users to actively engage in building projects, experimenting with different datasets and models, and participating in the deep learning community.

    The sources gracefully transition from the specific tutorial on TinyVGG and custom datasets to a broader perspective on the vast landscape of deep learning and PyTorch. They introduce additional topics, resources, and areas of study, encouraging users to continue their learning journey and explore more advanced concepts. By highlighting these areas and providing guidance on where to find further information, the sources empower users to expand their knowledge, skills, and horizons within the exciting and ever-evolving world of deep learning and PyTorch.

    Diving into Multi-Class Classification with PyTorch: Pages 861-870

    The sources introduce the concept of multi-class classification, a common task in machine learning where the goal is to categorize data into one of several possible classes. They contrast this with binary classification, which involves only two classes. The sources then present the FashionMNIST dataset, a collection of grayscale images of clothing items, as an example for demonstrating multi-class classification using PyTorch.

    • Multi-Class Classification: The sources distinguish multi-class classification from binary classification, explaining that multi-class classification involves assigning data points to one of multiple possible categories, while binary classification deals with only two categories. They emphasize that many real-world problems fall under the umbrella of multi-class classification. [1]
    • FashionMNIST Dataset: The sources introduce the FashionMNIST dataset, a widely used dataset for image classification tasks. This dataset comprises 70,000 grayscale images of 10 different clothing categories, including T-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot. The sources highlight that this dataset provides a suitable playground for experimenting with multi-class classification techniques using PyTorch. [1, 2]
    • Preparing the Data: The sources outline the steps involved in preparing the FashionMNIST dataset for use in PyTorch, emphasizing the importance of loading the data, splitting it into training and testing sets, and applying necessary transformations. They mention using PyTorch’s DataLoader class to efficiently handle data loading and batching during training and testing (a loading sketch follows this list). [2]
    • Building a Multi-Class Classification Model: The sources guide users through building a simple neural network model for multi-class classification using PyTorch. They discuss the choice of layers, activation functions, and the output layer’s activation function. They mention using a softmax activation function in the output layer to produce a probability distribution over the possible classes. [2]
    • Training the Model: The sources outline the process of training the multi-class classification model, highlighting the use of a suitable loss function (such as cross-entropy loss) and an optimization algorithm (such as stochastic gradient descent) to minimize the loss and improve the model’s accuracy during training. [2]
    • Evaluating the Model: The sources emphasize the need to evaluate the trained model’s performance on the test dataset, using metrics such as accuracy, precision, recall, and the F1-score to assess its effectiveness in classifying images into the correct categories. [2]
    • Visualization for Understanding: The sources advocate for visualizing the data and the model’s predictions to gain insights into the classification process. They suggest techniques like plotting the images and their corresponding predicted labels to qualitatively assess the model’s performance. [2]
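    A data-preparation sketch along the lines described above, using torchvision’s built-in FashionMNIST dataset:

    ```python
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    train_data = datasets.FashionMNIST(root="data", train=True, download=True,
                                       transform=transforms.ToTensor())
    test_data = datasets.FashionMNIST(root="data", train=False, download=True,
                                      transform=transforms.ToTensor())

    train_dataloader = DataLoader(train_data, batch_size=32, shuffle=True)
    test_dataloader = DataLoader(test_data, batch_size=32, shuffle=False)

    print(len(train_data), len(test_data))  # 60000 10000
    print(train_data.classes)               # ['T-shirt/top', 'Trouser', ...]
    ```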

    The sources effectively introduce the concept of multi-class classification and its relevance in various machine learning applications. They guide users through the process of preparing the FashionMNIST dataset, building a neural network model, training the model, and evaluating its performance. By emphasizing visualization and providing code examples, the sources equip users with the tools and knowledge to tackle multi-class classification problems using PyTorch.

    Beyond Accuracy: Exploring Additional Classification Metrics: Pages 871-880

    The sources introduce several additional metrics for evaluating the performance of classification models, going beyond the commonly used accuracy metric. They highlight the importance of considering multiple metrics to gain a more comprehensive understanding of a model’s strengths and weaknesses. The sources also emphasize that the choice of appropriate metrics depends on the specific problem and the desired balance between different types of errors.

    • Limitations of Accuracy: The sources acknowledge that accuracy, while a useful metric, can be misleading in situations where the classes are imbalanced. In such cases, a model might achieve high accuracy simply by correctly classifying the majority class, even if it performs poorly on the minority class.
    • Precision and Recall: The sources introduce precision and recall as two important metrics that provide a more nuanced view of a classification model’s performance, particularly when dealing with imbalanced datasets.
    • Precision: Precision measures the proportion of correctly classified positive instances out of all instances predicted as positive. A high precision indicates that the model is good at avoiding false positives.
    • Recall: Recall, also known as sensitivity or the true positive rate, measures the proportion of correctly classified positive instances out of all actual positive instances. A high recall suggests that the model is effective at identifying all positive instances.
    • F1-Score: The sources present the F1-score as the harmonic mean of precision and recall, providing a single metric that balances both. A high F1-score indicates a good balance between minimizing false positives and false negatives.
    • Confusion Matrix: The sources introduce the confusion matrix as a valuable tool for visualizing the performance of a classification model. A confusion matrix displays the counts of true positives, true negatives, false positives, and false negatives, providing a detailed breakdown of the model’s predictions across different classes.
    • Classification Report: The sources mention the classification report as a comprehensive summary of key classification metrics, including precision, recall, F1-score, and support (the number of instances of each class) for each class in the dataset.
    • TorchMetrics Module: The sources recommend exploring the torchmetrics module in PyTorch, which provides a wide range of pre-implemented classification metrics. Using this module simplifies the calculation and tracking of various metrics during model training and evaluation.
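    A small sketch of the torchmetrics usage pattern; the task/num_classes arguments follow the newer torchmetrics API and may differ in older releases:

    ```python
    import torch
    from torchmetrics import Accuracy, ConfusionMatrix, F1Score

    acc = Accuracy(task="multiclass", num_classes=10)
    f1 = F1Score(task="multiclass", num_classes=10)
    confmat = ConfusionMatrix(task="multiclass", num_classes=10)

    preds = torch.tensor([0, 2, 1, 3])   # hypothetical predicted labels
    target = torch.tensor([0, 1, 1, 3])  # hypothetical ground-truth labels

    print(acc(preds, target))      # tensor(0.7500): 3 of 4 correct
    print(f1(preds, target))       # balances precision and recall
    print(confmat(preds, target))  # 10x10 matrix of prediction counts
    ```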

    The sources effectively expand the discussion of classification model evaluation by introducing additional metrics that go beyond accuracy. They explain precision, recall, the F1-score, the confusion matrix, and the classification report, highlighting their importance in understanding a model’s performance, especially in cases of imbalanced datasets. By encouraging the use of the torchmetrics module, the sources provide users with practical tools to easily calculate and track these metrics during their machine learning workflows. They emphasize that choosing the right metrics depends on the specific problem and the relative importance of different types of errors.

    Exploring Convolutional Neural Networks and Computer Vision: Pages 881-890

    The sources mark a transition into the realm of computer vision, specifically focusing on Convolutional Neural Networks (CNNs), a type of neural network architecture highly effective for image-related tasks. They introduce core concepts of CNNs and showcase their application in image classification using the FashionMNIST dataset.

    • Introduction to Computer Vision: The sources acknowledge computer vision as a rapidly expanding field within deep learning, encompassing tasks like image classification, object detection, and image segmentation. They emphasize the significance of CNNs as a powerful tool for extracting meaningful features from image data, enabling machines to “see” and interpret visual information.
    • Convolutional Neural Networks (CNNs): The sources provide a foundational understanding of CNNs, highlighting their key components and how they differ from traditional neural networks.
    • Convolutional Layers: They explain how convolutional layers apply filters (also known as kernels) to the input image to extract features such as edges, textures, and patterns. These filters slide across the image, performing convolutions to produce feature maps.
    • Activation Functions: The sources discuss the use of activation functions like ReLU (Rectified Linear Unit) within CNNs to introduce non-linearity, allowing the network to learn complex relationships in the image data.
    • Pooling Layers: They explain how pooling layers, such as max pooling, downsample the feature maps, reducing their dimensionality while retaining essential information, making the network more computationally efficient and robust to variations in the input image.
    • Fully Connected Layers: The sources mention that after several convolutional and pooling layers, the extracted features are flattened and passed through fully connected layers, similar to those found in traditional neural networks, to perform the final classification.
    • Applying CNNs to FashionMNIST: The sources guide users through building a simple CNN model for image classification using the FashionMNIST dataset. They walk through the process of defining the model architecture, choosing appropriate layers and hyperparameters, and training the model using the training dataset.
    • Evaluation and Visualization: The sources emphasize evaluating the trained CNN model on the test dataset, using metrics like accuracy to assess its performance. They also encourage visualizing the model’s predictions and the learned feature maps to gain a deeper understanding of how the CNN is “seeing” and interpreting the images.
    • Importance of Experimentation: The sources highlight that designing and training effective CNNs often involves experimentation with different architectures, hyperparameters, and training techniques. They encourage users to explore different approaches and carefully analyze the results to optimize their models for specific computer vision tasks.

    Working with Tensors and Building Models in PyTorch: Pages 891-900

    The sources shift focus to the practical aspects of working with tensors in PyTorch and building neural network models for both regression and classification tasks. They emphasize the importance of understanding tensor operations, data manipulation, and building blocks of neural networks within the PyTorch framework.

    • Understanding Tensors: The sources reiterate the importance of tensors as the fundamental data structure in PyTorch, highlighting their role in representing data and model parameters. They discuss tensor creation, indexing, and various operations like stacking, permuting, and reshaping tensors to prepare data for use in neural networks.
    • Building a Regression Model: The sources walk through the steps of building a simple linear regression model in PyTorch to predict a continuous target variable from a set of input features. They explain:
    • Model Architecture: Defining a model class that inherits from PyTorch’s nn.Module, specifying the linear layers and activation functions that make up the model.
    • Loss Function: Choosing an appropriate loss function, such as Mean Squared Error (MSE), to measure the difference between the model’s predictions and the actual target values.
    • Optimizer: Selecting an optimizer, such as Stochastic Gradient Descent (SGD), to update the model’s parameters during training, minimizing the loss function.
    • Training Loop: Implementing a training loop that iterates through the training data, performs forward and backward passes, calculates the loss, and updates the model’s parameters using the optimizer.
    • Addressing Shape Errors: The sources address common shape errors that arise when working with tensors in PyTorch, emphasizing the importance of ensuring that tensor dimensions are compatible for operations like matrix multiplication. They provide examples of troubleshooting shape mismatches and adjusting tensor dimensions using techniques like reshaping or transposing.
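    A minimal illustration of the most common fix, transposing one operand so the inner dimensions match (see the matrix multiplication rules earlier in these notes):

    ```python
    import torch

    A = torch.randn(3, 2)
    B = torch.randn(3, 2)

    # A @ B raises a RuntimeError: inner dimensions (2 and 3) don't match.
    # Transposing B gives (3, 2) @ (2, 3) -> (3, 3).
    C = A @ B.T
    print(C.shape)  # torch.Size([3, 3])
    ```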
    • Visualizing Data and Predictions: The sources advocate for visualizing the data and the model’s predictions to gain insights into the regression process. They suggest plotting the input features against the target variable, along with the model’s predicted line, to visually assess the model’s fit and performance.
    • Introducing Non-linearities: The sources acknowledge the limitations of linear models in capturing complex relationships in data. They introduce the concept of non-linear activation functions, such as ReLU (Rectified Linear Unit), as a way to introduce non-linearity into the model, enabling it to learn more complex patterns. They explain how incorporating ReLU layers can enhance a model’s ability to fit non-linear data.

    The sources effectively transition from theoretical concepts to practical implementation by demonstrating how to work with tensors in PyTorch and build basic neural network models for both regression and classification tasks. They guide users through the essential steps of model definition, loss function selection, optimizer choice, and training loop implementation. By highlighting common pitfalls like shape errors and emphasizing visualization, the sources provide a hands-on approach to learning PyTorch and its application in building machine learning models. They also introduce the crucial concept of non-linear activation functions, laying the foundation for exploring more complex neural network architectures in subsequent sections.

    Two Ways to Improve a Model’s Performance

    Based on the provided sources, here are two ways to improve a model’s performance:

    • Add More Layers to the Model: Adding more layers gives the model more opportunities to learn about patterns in the data. If a model currently has two layers with approximately 20 parameters, adding more layers would increase the number of parameters the model uses to try to learn the patterns in the data [1].
    • Fit the Model for Longer: Every epoch is one pass through the data. Fitting the model for longer gives it more of a chance to learn. For example, if the model has only had 100 opportunities to look at a dataset, it may not be enough. Increasing the opportunities to 1,000 may improve the model’s results [2].

    How Loss Functions Measure Model Performance

    The sources explain that a loss function is crucial for training machine learning models. A loss function quantifies how “wrong” a model’s predictions are compared to the desired output. [1-6] The output of a loss function is a numerical value representing the error. Lower loss values indicate better performance.

    Here’s how the loss function works in practice:

    • Forward Pass: The model makes predictions on the input data. [7, 8] These predictions are often referred to as “logits” before further processing. [9-14]
    • Comparing Predictions to True Values: The loss function takes the model’s predictions and compares them to the true labels from the dataset. [4, 8, 15-19]
    • Calculating the Error: The loss function calculates a numerical value representing the difference between the predictions and the true labels. [1, 4-6, 8, 20-29] This value is the “loss,” and the specific calculation depends on the type of loss function used.
    • Guiding Model Improvement: The loss value is used by the optimizer to adjust the model’s parameters (weights and biases) to reduce the error in subsequent predictions. [3, 20, 24, 27, 30-38] This iterative process of making predictions, calculating the loss, and updating the parameters is what drives the model’s learning during training.

    The goal of training is to minimize the loss function, effectively bringing the model’s predictions closer to the true values. [4, 21, 27, 32, 37, 39-41]

    The sources explain that different loss functions are appropriate for different types of problems. [42-48] For example:

    • Regression problems (predicting a continuous numerical value) often use loss functions like Mean Absolute Error (MAE, also called L1 loss in PyTorch) or Mean Squared Error (MSE). [42, 44-46, 49, 50]
    • Classification problems (predicting a category or class label) might use loss functions like Binary Cross Entropy (BCE) for binary classification or Cross Entropy for multi-class classification. [42, 43, 45, 46, 48, 50, 51]
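    A sketch of the corresponding PyTorch loss modules; the tensor values are illustrative:

    ```python
    import torch
    from torch import nn

    # Regression: continuous predictions vs. continuous targets.
    mae = nn.L1Loss()   # Mean Absolute Error
    mse = nn.MSELoss()  # Mean Squared Error
    preds = torch.tensor([2.5, 0.0])
    targets = torch.tensor([3.0, -0.5])
    print(mae(preds, targets))  # tensor(0.5000)

    # Binary classification: BCEWithLogitsLoss combines a sigmoid with BCE,
    # so it takes raw logits directly.
    bce = nn.BCEWithLogitsLoss()

    # Multi-class classification: CrossEntropyLoss takes raw logits of shape
    # [batch, num_classes] plus integer class labels.
    ce = nn.CrossEntropyLoss()
    logits = torch.randn(4, 3)
    labels = torch.tensor([0, 2, 1, 1])
    print(ce(logits, labels))
    ```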

    The sources also highlight the importance of using the appropriate loss function for the chosen model and task. [44, 52, 53]

    Key takeaway: Loss functions serve as a feedback mechanism, providing a quantitative measure of how well a model is performing. By minimizing the loss, the model learns to make more accurate predictions and improve its overall performance.

    Main Steps in a PyTorch Training Loop

    The sources provide a detailed explanation of the PyTorch training loop, highlighting its importance in the machine learning workflow. The training loop is the process where the model iteratively learns from the data and adjusts its parameters to improve its predictions. The sources provide code examples and explanations for both regression and classification problems.

    Here is a breakdown of the main steps involved in a PyTorch training loop:

    1. Setting Up

    • Epochs: Define the number of epochs, which represent the number of times the model will iterate through the entire training dataset. [1]
    • Training Mode: Set the model to training mode using model.train(). This enables training-specific behavior in layers such as dropout and batch normalization, which act differently during training than during inference. [1, 2]
    • Data Loading: Prepare the data loader to feed batches of training data to the model. [3]

    2. Iterating Through Data Batches

    • Loop: Initiate a loop to iterate through each batch of data provided by the data loader. [1]

    3. The Optimization Loop (for each batch)

    • Forward Pass: Pass the input data through the model to obtain predictions (often referred to as “logits” before further processing). [4, 5]
    • Loss Calculation: Calculate the loss, which measures the difference between the model’s predictions and the true labels. Choose a loss function appropriate for the problem type (e.g., MSE for regression, Cross Entropy for classification). [5, 6]
    • Zero Gradients: Reset the gradients of the model’s parameters to zero. This step is crucial to ensure that gradients from previous batches do not accumulate and affect the current batch’s calculations. [5, 7]
    • Backpropagation: Calculate the gradients of the loss function with respect to the model’s parameters. This step involves going backward through the network, computing how much each parameter contributed to the loss. PyTorch handles this automatically using loss.backward(). [5, 7, 8]
    • Gradient Descent: Update the model’s parameters to minimize the loss function. This step uses an optimizer (e.g., SGD, Adam) to adjust the weights and biases in the direction that reduces the loss. PyTorch’s optimizer.step() performs this parameter update. [5, 7, 8]
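    The five steps above map one-to-one onto a few lines of PyTorch; model, dataloader, loss_fn, and optimizer are assumed to be defined already:

    ```python
    model.train()
    for X, y in dataloader:
        y_pred = model(X)          # 1. forward pass (raw logits)
        loss = loss_fn(y_pred, y)  # 2. calculate the loss
        optimizer.zero_grad()      # 3. reset accumulated gradients
        loss.backward()            # 4. backpropagation
        optimizer.step()           # 5. gradient descent step
    ```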

    4. Testing (Evaluation) Loop (typically performed after each epoch)

    • Evaluation Mode: Set the model to evaluation mode using model.eval(). This deactivates training-specific settings (like dropout) and prepares the model for inference. [2, 9]
    • Inference Mode: Use the torch.inference_mode() context manager to perform inference. This disables gradient calculations and other operations not required for testing, potentially improving speed and memory efficiency. [9, 10]
    • Forward Pass (on Test Data): Pass the test data through the model to obtain predictions. [4, 9]
    • Loss Calculation (on Test Data): Calculate the loss on the test data to assess the model’s performance on unseen data. [9, 11]
    • Performance Metrics: Calculate additional performance metrics relevant to the problem, such as accuracy for classification or other evaluation metrics suitable for the specific task. [9]
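
    A minimal sketch combining these evaluation steps (model, loss_fn, X_test, and y_test are assumed to be defined elsewhere):

    model.eval()                                # switch to evaluation mode
    with torch.inference_mode():                # disable gradient tracking
        test_pred = model(X_test)               # forward pass on test data
        test_loss = loss_fn(test_pred, y_test)  # loss on unseen data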

    5. Repetition

    The training loop repeats for the specified number of epochs, iterating through the data, performing the optimization steps, and periodically evaluating the model on the test data.

    Key Takeaways

    • The training loop is the heart of the model’s learning process in PyTorch.
    • Each step in the loop plays a vital role in enabling the model to learn from data and improve its predictions.
    • PyTorch simplifies many aspects of the training loop, handling backpropagation and gradient updates automatically, allowing you to focus on defining the model architecture, choosing appropriate loss functions and optimizers, and preparing the data.

    The sources emphasize the importance of understanding these steps and how they work together to train effective machine learning models in PyTorch. They provide code examples and detailed explanations for each step, illustrating the concepts with practical implementations.

    Reasons for PyTorch’s Popularity

    The sources offer several insights into why PyTorch has become so popular for deep learning and machine learning:

    • Research Favorite: The sources emphasize that PyTorch is the most popular deep learning research framework, as evidenced by its widespread use in published research papers and code repositories. [1, 2] For example, Papers with Code, a website tracking machine learning papers and their associated code, shows that 58% of the 65,000 papers with code they’ve tracked are implemented with PyTorch. [2] This popularity stems from PyTorch’s flexibility, ease of use, and ability to support cutting-edge research in various domains.
    • Pythonic and User-Friendly: PyTorch is written in Python, making it highly accessible to the vast community of Python developers and researchers. [3] Its intuitive and Pythonic API simplifies the process of defining, training, and evaluating models. The dynamic computational graph enables flexible experimentation and debugging, contributing to its popularity among researchers and practitioners.
    • Strong Industry Adoption: Beyond research, PyTorch has seen significant adoption by industry leaders like Tesla, Microsoft, OpenAI, and Facebook (Meta). [4-9] Tesla utilizes PyTorch for the computer vision models powering its Autopilot system. [5] OpenAI has standardized on PyTorch for its research and development. [6, 7] Facebook leverages PyTorch for its diverse machine learning applications. [9] This widespread industry adoption further validates PyTorch’s capabilities and reinforces its position as a leading deep learning framework.
    • GPU Acceleration: PyTorch allows developers to leverage the power of GPUs (Graphics Processing Units) for faster computations. [3, 10] This is crucial for deep learning tasks, which often involve large datasets and computationally intensive operations. PyTorch utilizes CUDA, NVIDIA’s parallel computing platform and API, to enable efficient execution on GPUs, significantly accelerating training and inference processes. [11, 12]
    • Rich Ecosystem and Resources: PyTorch benefits from a robust ecosystem of tools, libraries, and resources. Torch Hub provides access to pre-trained models, facilitating transfer learning. [3] Torchvision offers datasets, transformations, and model architectures specifically designed for computer vision tasks. [3, 13] PyTorch’s extensive documentation, active community forums, and numerous tutorials provide ample support for learners and practitioners at all levels.
    • Accessibility and Ease of Setup: Tools like Google Colab further enhance PyTorch’s accessibility by providing a cloud-based environment with free GPU access and pre-installed PyTorch. [14-21] This ease of setup lowers the barrier to entry for beginners and enables rapid experimentation and prototyping without requiring extensive local hardware setup.

    The sources, along with the conversation history, consistently portray PyTorch as a powerful, flexible, and user-friendly framework well-suited for both research and production-level deep learning applications. Its popularity is driven by a confluence of factors, including its research focus, Pythonic nature, strong industry support, GPU acceleration capabilities, a rich ecosystem, and ease of accessibility.

    One of the main reasons for PyTorch’s popularity in research is its prominence as the most popular deep learning research framework [1, 2]. The sources emphasize that PyTorch is widely used in published research papers and code repositories.

    For example, Papers with Code, a website tracking machine learning papers and their associated code, shows that 58% of the 65,000 papers with code they’ve tracked are implemented with PyTorch [3]. This demonstrates how often researchers choose PyTorch when implementing their models. This popularity likely stems from PyTorch’s flexibility, ease of use, and ability to support cutting-edge research in various domains.

    The Impact of Random Seeds on Neural Network Training

    The sources explain that randomness is a fundamental aspect of training neural networks in PyTorch. They highlight the use of random seeds as a mechanism for controlling this randomness and enabling reproducibility in experiments. Here’s how random seeds influence the training process:

    1. Initialization with Random Numbers

    • Random Parameters: Neural networks typically start with their parameters (weights and biases) initialized with random values. Source [1] emphasizes this, stating that “Random tensors are important because the way many neural networks learn is that they start with tensors full of random numbers and then adjust those random numbers to better represent the data.” The sources demonstrate this with code examples, using functions like torch.rand() to generate random tensors for weight and bias initialization.
    • Stochastic Gradient Descent: The sources mention “stochastic” gradient descent [2], highlighting that the optimization process itself involves randomness. The optimizer takes random steps to update the model’s parameters, gradually reducing the loss.

    2. The Role of the Random Seed

    • Flavoring the Randomness: Source [3] explains the concept of a random seed as a way to “flavor the randomness.” Because computers generate pseudo-random numbers, the random seed acts as a starting point for this pseudo-random number generation process. Setting a specific seed ensures that the sequence of random numbers generated will be the same every time the code is run.
    • Enabling Reproducibility: Using a random seed makes experiments reproducible. This means that if you share your code and the seed value with someone else, they should be able to obtain the same results as you, assuming they are using the same software and hardware environment. Source [4] illustrates this with the example of sharing a notebook with a friend and wanting to reduce the randomness in the results.
    • PyTorch Implementation: In PyTorch, you can set the random seed using torch.manual_seed(seed_value). Source [5] provides examples of setting the random seed and creating reproducible random tensors. Source [6] further notes that there is a separate seed for CUDA operations, torch.cuda.manual_seed(seed_value), to control randomness when using GPUs.
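
    A minimal reproducibility sketch (the seed value 42 is arbitrary):

    import torch

    SEED = 42
    torch.manual_seed(SEED)
    a = torch.rand(2, 3)

    torch.manual_seed(SEED)  # reset the seed before generating again
    b = torch.rand(2, 3)

    print(torch.equal(a, b))      # True: same seed, same "random" numbers
    torch.cuda.manual_seed(SEED)  # separate seed for CUDA operations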

    3. Effects on Training

    • Deterministic Initialization: Setting a random seed ensures that the model starts with the same initial parameters every time you run the training code. This removes variability in the starting point of the optimization process.
    • Consistent Optimization Path: With a fixed seed, the optimizer will take the same steps during training, leading to a more consistent learning trajectory. This can be helpful for debugging and understanding the model’s behavior.
    • Comparable Experiments: Using random seeds makes it easier to compare the results of different experiments. For example, if you are trying different model architectures or hyperparameters, using the same seed across experiments ensures that the observed differences are due to the changes you made rather than random variations in initialization or optimization.

    Key Takeaways

    • Randomness is inherent in neural network training, impacting both parameter initialization and the optimization process.
    • Random seeds provide a mechanism for controlling this randomness, making experiments reproducible and facilitating comparisons.
    • Setting a random seed doesn’t eliminate randomness entirely but ensures that the same sequence of “random” numbers is used, leading to more consistent and comparable results.

    The sources encourage developers to be mindful of randomness and leverage random seeds to enhance the reproducibility and reliability of their machine learning experiments. While complete determinism is often difficult to achieve due to factors beyond the random seed, using seeds is a valuable practice for improving the scientific rigor of deep learning research and development.

    Training a Neural Network in PyTorch: A Step-by-Step Guide

    The sources outline the primary steps involved in training a neural network using the PyTorch deep learning framework. These steps, often referred to as the PyTorch workflow, provide a structured approach to building, training, and evaluating models.

    1. Data Preparation and Loading

    • Data Acquisition: This initial step involves obtaining the data required for your machine-learning task. As noted in Source, data can take various forms, including structured data (e.g., spreadsheets), images, videos, audio, and even DNA sequences.
    • Data Exploration: Becoming familiar with your data is crucial. This might involve visualizing the data (e.g., plotting images, creating histograms) and understanding its distribution, patterns, and potential biases.
    • Data Preprocessing: Preparing the data for use with a PyTorch model often requires transformation and formatting. This could involve:
    • Numerical Encoding: Converting categorical data into numerical representations, as many machine learning models operate on numerical inputs.
    • Normalization: Scaling numerical features to a standard range (e.g., between 0 and 1) to prevent features with larger scales from dominating the learning process.
    • Reshaping: Restructuring data into the appropriate dimensions expected by the neural network.
    • Tensor Conversion: The sources emphasize that tensors are the fundamental building blocks of data in PyTorch. You’ll need to convert your data into PyTorch tensors using functions like torch.tensor().
    • Dataset and DataLoader: Source recommends using PyTorch’s Dataset and DataLoader classes to efficiently manage and load data during training. A Dataset object represents your dataset, while a DataLoader provides an iterable over the dataset, enabling batching, shuffling, and other data handling operations.
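
    A minimal sketch of this pattern, using invented tensors for the features and labels:

    import torch
    from torch.utils.data import TensorDataset, DataLoader

    X = torch.rand(100, 2)  # hypothetical features
    y = torch.rand(100, 1)  # hypothetical labels

    dataset = TensorDataset(X, y)  # wrap tensors as a Dataset
    loader = DataLoader(dataset, batch_size=32, shuffle=True)

    for X_batch, y_batch in loader:  # iterate over batches
        print(X_batch.shape)         # torch.Size([32, 2])
        break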

    2. Model Building or Selection

    • Model Architecture: This step involves defining the structure of your neural network. You’ll need to decide on:
    • Layer Types: PyTorch provides a wide range of layers in the torch.nn module, including linear layers (nn.Linear), convolutional layers (nn.Conv2d), recurrent layers (nn.LSTM), and more.
    • Number of Layers: The depth of your network, often determined through experimentation and the complexity of the task.
    • Number of Hidden Units: The dimensionality of the hidden representations within the network.
    • Activation Functions: Non-linear functions applied to the output of layers to introduce non-linearity into the model.
    • Model Implementation: You can build models from scratch, stacking layers together manually, or leverage pre-trained models from repositories like Torch Hub, particularly for tasks like image classification. Source showcases both approaches:
    • Subclassing nn.Module: This common pattern involves creating a Python class that inherits from nn.Module. You’ll define layers as attributes of the class and implement the forward() method to specify how data flows through the network.
    • Using nn.Sequential: Source demonstrates this simpler method for creating sequential models where data flows linearly through a sequence of layers.
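
    A minimal sketch of both approaches (the layer sizes here are invented for illustration):

    import torch
    from torch import nn

    class SimpleModel(nn.Module):  # subclassing nn.Module
        def __init__(self):
            super().__init__()
            self.layer_1 = nn.Linear(in_features=2, out_features=8)
            self.layer_2 = nn.Linear(in_features=8, out_features=1)

        def forward(self, x):  # defines how data flows through the network
            return self.layer_2(torch.relu(self.layer_1(x)))

    sequential_model = nn.Sequential(  # the same architecture with nn.Sequential
        nn.Linear(2, 8),
        nn.ReLU(),
        nn.Linear(8, 1),
    )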

    3. Loss Function and Optimizer Selection

    • Loss Function: The loss function measures how well the model is performing during training. It quantifies the difference between the model’s predictions and the actual target values. The choice of loss function depends on the nature of the problem:
    • Regression: Common loss functions include Mean Squared Error (MSE) and Mean Absolute Error (MAE).
    • Classification: Common loss functions include Cross-Entropy Loss and Binary Cross-Entropy Loss.
    • Optimizer: The optimizer is responsible for updating the model’s parameters (weights and biases) during training, aiming to minimize the loss function. Popular optimizers in PyTorch include Stochastic Gradient Descent (SGD) and Adam.
    • Hyperparameters: Both the loss function and optimizer often have hyperparameters that you’ll need to tune. For example, the learning rate for an optimizer controls the step size taken during parameter updates.

    4. Training Loop Implementation

    • Epochs: The training process is typically organized into epochs. An epoch involves iterating over the entire training dataset once. You’ll specify the number of epochs to train for.
    • Batches: To improve efficiency, data is often processed in batches rather than individually. You’ll set the batch size, determining the number of data samples processed in each iteration of the training loop.
    • Training Steps: The core of the training loop involves the following steps, repeated for each batch of data:
    • Forward Pass: Passing the input data through the model to obtain predictions.
    • Loss Calculation: Computing the loss by comparing predictions to the target values.
    • Backpropagation: Calculating gradients of the loss with respect to the model’s parameters. This identifies how each parameter contributed to the error.
    • Parameter Update: Using the optimizer to update the model’s parameters based on the calculated gradients. The goal is to adjust parameters in a direction that reduces the loss.
    • Evaluation: Periodically, you’ll evaluate the model’s performance on a separate validation set to monitor its progress and prevent overfitting (where the model learns the training data too well and performs poorly on unseen data).

    5. Model Saving and Loading

    • Saving: Once the model is trained to a satisfactory level, you’ll want to save it for later use. The sources describe methods for saving PyTorch models, including:
    • Saving the State Dictionary: This approach saves the model’s learned parameters in a dictionary-like object. It’s generally the recommended method as it’s more efficient and flexible.
    • Saving the Entire Model: This saves the entire model architecture and parameters. However, it can lead to larger file sizes and potential compatibility issues if the PyTorch version changes.
    • Loading: You can later load a saved model to reuse it for inference (making predictions on new data) or to continue training.
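
    A minimal save-and-load sketch using the state dictionary approach (the file name is hypothetical, and the model class from the earlier sketch is assumed):

    import torch

    MODEL_PATH = "model_0.pth"
    torch.save(model.state_dict(), MODEL_PATH)  # save only the learned parameters

    loaded_model = SimpleModel()  # recreate an instance of the model class first
    loaded_model.load_state_dict(torch.load(MODEL_PATH))
    loaded_model.eval()  # set to evaluation mode before inference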

    6. Additional Considerations

    • GPU Acceleration: The sources highlight the importance of utilizing GPUs for faster training, especially for large models and datasets. PyTorch provides mechanisms for transferring models and data to GPUs.
    • Experiment Tracking: As you train and experiment with different models and hyperparameters, it’s essential to keep track of your experiments. Tools like Weights & Biases, MLflow, and TensorBoard (mentioned in Source) can help you log metrics, visualize training progress, and compare different runs.
    • Deployment: Once you have a well-trained model, you might want to deploy it to make predictions in real-world applications. This often involves considerations related to model serving, optimization, and integration with other systems.

    Key Points from Sources

    • Iterative Workflow: The PyTorch workflow is iterative, meaning you’ll often go back and forth between different steps, experimenting with various model architectures, hyperparameters, and data preprocessing techniques to improve performance.
    • Experimentation: The sources encourage experimentation as a core part of the machine-learning process. Trying different things and observing their impact is crucial for finding the best model for your task.
    • Focus on Code: While a theoretical understanding of deep learning concepts is helpful, the sources prioritize a hands-on, code-centric approach, emphasizing that writing code and running experiments is the most effective way to learn PyTorch.
    • Abundant Resources: The sources emphasize the availability of extensive PyTorch documentation, tutorials, and community forums. These resources are invaluable for learning the framework and troubleshooting issues.

    Beyond the Basics

    The training steps described here provide a foundational understanding of how to train neural networks in PyTorch. As you progress in your deep learning journey, you’ll encounter more advanced topics and techniques. This might include:

    • Transfer Learning: Leveraging pre-trained models to jump-start your model’s performance, particularly when you have limited data.
    • Fine-tuning: Adapting pre-trained models to your specific task by modifying and training certain layers.
    • Custom Layers and Architectures: Designing and implementing your own specialized neural network layers and architectures to address unique challenges.
    • Distributed Training: Training models across multiple GPUs or machines to accelerate the process and handle extremely large datasets.

    The sources provide a strong foundation for getting started with PyTorch. They encourage an experimental mindset, highlighting the importance of writing code, running experiments, and iteratively refining your models to achieve optimal performance.

    Key Advantages of Utilizing PyTorch

    The sources highlight several key benefits of using PyTorch as a deep learning framework, particularly for research and building machine learning models. Here’s a summary of the main advantages:

    1. Research-Oriented and Widely Adopted:

    • PyTorch is positioned as the “most popular research deep learning framework” [1], with statistics from December 2021 indicating that 58% of papers with code on Papers With Code were implemented using PyTorch [2]. This popularity within the research community stems from its flexibility and ease of use, making it a preferred choice for developing state-of-the-art machine learning algorithms.
    • The widespread adoption of PyTorch is further evidenced by its use in prominent organizations like Tesla (for Autopilot computer vision models), OpenAI, Facebook (for in-house machine learning applications), and Microsoft [3-5].

    2. Pythonic and User-Friendly:

    • PyTorch is deeply integrated with Python, making it highly accessible for Python developers [1]. Its syntax and structure align closely with Pythonic conventions, reducing the learning curve for those already familiar with the language.
    • This user-friendliness is emphasized throughout the sources, advocating for a hands-on, code-centric approach to learning PyTorch and stressing that “if you know Python, it’s a very user-friendly programming language” [6].

    3. Dynamic Computational Graph and Debugging Ease:

    • PyTorch’s dynamic computational graph is a significant advantage. Unlike static-graph frameworks such as TensorFlow (at least in its earlier versions), PyTorch builds the graph as you execute the code [not stated in the sources]. This dynamic nature allows for greater flexibility during development, as you can modify the graph on the fly. It also simplifies debugging, as you can use standard Python debugging tools to inspect variables and step through the code.

    4. GPU Acceleration and Fast Computations:

    • PyTorch enables you to leverage the power of GPUs to accelerate computations [1, 7]. This is particularly crucial for deep learning, where training often involves vast amounts of data and computationally intensive operations.
    • PyTorch accomplishes GPU acceleration through CUDA, NVIDIA’s parallel computing platform and API [8, 9]. It also supports TPUs (Tensor Processing Units), although GPUs are more commonly used with PyTorch [9].
    • The sources highlight this benefit, noting that PyTorch “allows you to write fast deep learning code in Python” [1] and “enables you to run your code, your machine learning code accelerated on a GPU” [7].

    5. Extensive Ecosystem and Community Support:

    • PyTorch benefits from a rich ecosystem of tools, libraries, and resources. Source mentions Torch Hub, a repository for pre-trained models, and Torch Vision, a library specifically designed for computer vision tasks [6].
    • There’s also a vibrant PyTorch community, providing support through forums, tutorials, and online resources [not stated in the sources].

    6. Flexibility and Customization:

    • PyTorch’s flexibility allows you to build a wide range of models, from simple linear regressions to complex convolutional neural networks, and experiment with various architectures and techniques.
    • The sources demonstrate this flexibility through examples, showing how to create models by subclassing nn.Module, using nn.Sequential, and implementing custom layers.

    7. Focus on Tensors as Fundamental Building Blocks:

    • The sources repeatedly emphasize that tensors are the core data structures in PyTorch [10, 11]. This focus on tensors simplifies data handling and manipulation, as tensors provide a unified representation for various data types, including images, text, and numerical data.

    In summary, PyTorch offers a compelling combination of research-oriented features, user-friendliness, performance, and a robust ecosystem, making it a powerful and popular choice for deep learning practitioners.

    PyTorch and GPU Acceleration: A CUDA-Powered Partnership

    The sources provide a clear explanation of how PyTorch utilizes GPUs to achieve faster computations in deep learning. Here’s a breakdown:

    • GPUs are inherently designed for numerical computation. They were initially developed for video games, which require intensive graphical processing that involves a significant number of mathematical calculations. This inherent capability makes them well-suited for the complex operations involved in deep learning. [1]
    • PyTorch provides an interface to harness this computational power through CUDA. CUDA, which stands for Compute Unified Device Architecture, is NVIDIA’s parallel computing platform and API. It acts as a bridge between software, like PyTorch, and NVIDIA GPUs, enabling the software to use the GPU for general-purpose computing tasks. [2]
    • By leveraging CUDA, PyTorch can execute deep learning code on NVIDIA GPUs. This allows for significant speedups, particularly when dealing with large datasets and complex models. [3]
    • Device Agnostic Code: PyTorch promotes the concept of “device agnostic code”, which means the code can run on either a GPU (if available) or a CPU. This is achieved by setting a device variable that defaults to CUDA if a GPU is present and falls back to the CPU if not. [4, 5]
    • The to method plays a crucial role. This method is used to move tensors and models to the desired device (either CUDA for GPU or CPU). By moving the data and the model to the GPU, PyTorch ensures that all computations are performed on the faster hardware. [6, 7]
    • The sources strongly emphasize the performance benefits of using a GPU. They mention that it results in “faster computations”, allowing for faster pattern discovery in data, more experimentation, and ultimately, finding the best model. [8, 9]
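
    A minimal sketch of the device-agnostic pattern described above:

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU

    tensor = torch.rand(3, 3).to(device)  # move data to the target device
    # model = model.to(device)            # models are moved the same way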

    Key Takeaways:

    • PyTorch’s ability to utilize GPUs stems from its integration with CUDA.
    • The to method allows for seamless movement of data and models to the GPU.
    • Utilizing a GPU can lead to substantial speed improvements in deep learning tasks.

    It’s important to note that while GPUs generally offer significant performance gains, there are situations where the overhead of transferring data to and from the GPU might outweigh the computational benefits, particularly with smaller datasets and less complex models. [10]

    Top Three Errors in PyTorch

    The sources identify three major error types that you’re likely to encounter when working with PyTorch and deep learning:

    1. Tensor Data Type Mismatches

    • The Root of the Problem: PyTorch relies heavily on tensors for representing and manipulating data. Tensors have an associated data type, such as float32, int64, or bool. Many PyTorch functions and operations require tensors to have specific data types to work correctly. If the data types of tensors involved in a calculation are incompatible, PyTorch will raise an error.
    • Common Manifestations: You might encounter this error when:
    • Performing mathematical operations between tensors with mismatched data types (e.g., multiplying a float32 tensor by an int64 tensor) [1, 2].
    • Using a function that expects a particular data type but receiving a tensor of a different type (e.g., torch.mean requires a float32 tensor) [3-5].
    • Real-World Example: The sources illustrate this error with torch.mean. If you attempt to calculate the mean of a tensor that isn’t a floating-point type, PyTorch will throw an error. To resolve this, you need to convert the tensor to float32 using tensor.type(torch.float32) [4].
    • Debugging Strategies:
    • Carefully inspect the data types of the tensors involved in the operation or function call where the error occurs.
    • Use tensor.dtype to check a tensor’s data type.
    • Convert tensors to the required data type using tensor.type().
    • Key Insight: Pay close attention to data types. When in doubt, default to float32 as it’s PyTorch’s preferred data type [6].
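
    A minimal sketch of the torch.mean example and its fix:

    import torch

    x = torch.arange(1, 10)  # int64 by default
    # torch.mean(x)          # raises an error: mean requires a floating-point dtype
    mean = torch.mean(x.type(torch.float32))  # convert first, then aggregate
    print(x.dtype, mean)     # torch.int64 tensor(5.)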

    2. Tensor Shape Mismatches

    • The Core Issue: Tensors also have a shape, which defines their dimensionality. For example, a vector is a 1-dimensional tensor, a matrix is a 2-dimensional tensor, and an image with three color channels is often represented as a 3-dimensional tensor. Many PyTorch operations, especially matrix multiplications and neural network layers, have strict requirements regarding the shapes of input tensors.
    • Where It Goes Wrong:
    • Matrix Multiplication: The inner dimensions of matrices being multiplied must match [7, 8].
    • Neural Networks: The output shape of one layer needs to be compatible with the input shape of the next layer.
    • Reshaping Errors: Attempting to reshape a tensor into an incompatible shape (e.g., squeezing 9 elements into a shape of 1×7) [9].
    • Example in Action: The sources provide an example of a shape error during matrix multiplication using torch.matmul. If the inner dimensions don’t match, PyTorch will raise an error [8].
    • Troubleshooting Tips:
    • Shape Inspection: Thoroughly understand the shapes of your tensors using tensor.shape.
    • Visualization: When possible, visualize tensors (especially high-dimensional ones) to get a better grasp of their structure.
    • Reshape Carefully: Ensure that reshaping operations (tensor.reshape, tensor.view) result in compatible shapes.
    • Crucial Takeaway: Always verify shape compatibility before performing operations. Shape errors are prevalent in deep learning, so be vigilant.
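
    A minimal sketch of a shape error during matrix multiplication and one way to fix it:

    import torch

    a = torch.rand(3, 2)
    b = torch.rand(3, 2)
    # torch.matmul(a, b)      # error: inner dimensions (2 and 3) don't match
    c = torch.matmul(a, b.T)  # transpose b to shape (2, 3)
    print(c.shape)            # torch.Size([3, 3])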

    3. Device Mismatches (CPU vs. GPU)

    • The Device Divide: PyTorch supports both CPUs and GPUs for computation. GPUs offer significant performance advantages, but require data and models to reside in GPU memory. If you attempt to perform an operation between tensors or models located on different devices, PyTorch will raise an error.
    • Typical Scenarios:
    • Moving Data to GPU: You might forget to move your input data to the GPU using tensor.to(device), leading to an error when performing calculations with a model that’s on the GPU [10].
    • NumPy and GPU Tensors: NumPy operates on CPU memory, so you can’t directly use NumPy functions on GPU tensors [11]. You need to first move the tensor back to the CPU using tensor.cpu() [12].
    • Source Illustration: The sources demonstrate this issue when trying to use numpy.array() on a tensor that’s on the GPU. The solution is to bring the tensor back to the CPU using tensor.cpu() [12].
    • Best Practices:
    • Device-Agnostic Code: Use the device variable and the to() method to ensure that data and models are on the correct device [11, 13].
    • CPU-to-GPU Transfers: Minimize the number of data transfers between the CPU and GPU, as these transfers can introduce overhead.
    • Essential Reminder: Be device-aware. Always ensure that all tensors involved in an operation are on the same device (either CPU or GPU) to avoid errors.
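
    A minimal sketch of the NumPy device fix described above:

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tensor = torch.rand(3).to(device)

    # tensor.numpy()              # errors if the tensor lives on the GPU
    array = tensor.cpu().numpy()  # move back to the CPU first, then convert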

    The Big Three Errors in PyTorch and Deep Learning

    The sources dedicate significant attention to highlighting the three most common errors encountered when working with PyTorch for deep learning, emphasizing that mastering these will equip you to handle a significant portion of the challenges you’ll face in your deep learning journey.

    1. Tensor Not the Right Data Type

    • The Core of the Issue: Tensors, the fundamental building blocks of data in PyTorch, come with associated data types (dtype), such as float32, float16, int32, and int64 [1, 2]. These data types specify how much detail a single number is stored with in memory [3]. Different PyTorch functions and operations may require specific data types to work correctly [3, 4].
    • Why it’s Tricky: Sometimes operations may unexpectedly work even if tensors have different data types [4, 5]. However, other operations, especially those involved in training large neural networks, can be quite sensitive to data type mismatches and will throw errors [4].
    • Debugging and Prevention:
    • Awareness is Key: Be mindful of the data types of your tensors and the requirements of the operations you’re performing.
    • Check Data Types: Utilize tensor.dtype to inspect the data type of a tensor [6].
    • Conversion: If needed, convert tensors to the desired data type using tensor.type(desired_dtype) [7].
    • Real-World Example: The sources provide examples of using torch.mean, a function that requires a float32 tensor [8, 9]. If you attempt to use it with an integer tensor, PyTorch will throw an error. You’ll need to convert the tensor to float32 before calculating the mean.

    2. Tensor Not the Right Shape

    • The Heart of the Problem: Neural networks are essentially intricate structures built upon layers of matrix multiplications. For these operations to work seamlessly, the shapes (dimensions) of tensors must be compatible [10-12].
    • Shape Mismatch Scenarios: This error arises when:
    • The inner dimensions of matrices being multiplied don’t match, violating the fundamental rule of matrix multiplication [10, 13].
    • Neural network layers receive input tensors with incompatible shapes, preventing the data from flowing through the network as expected [11].
    • You attempt to reshape a tensor into a shape that doesn’t accommodate all its elements [14].
    • Troubleshooting and Best Practices:
    • Inspect Shapes: Make it a habit to meticulously examine the shapes of your tensors using tensor.shape [6].
    • Visualize: Whenever possible, try to visualize your tensors to gain a clearer understanding of their structure, especially for higher-dimensional tensors. This can help you identify potential shape inconsistencies.
    • Careful Reshaping: Exercise caution when using operations like tensor.reshape or tensor.view to modify the shape of a tensor. Always ensure that the resulting shape is compatible with the intended operation or layer.
    • Source Illustration: The sources offer numerous instances where shape errors occur during matrix multiplication and when passing data through neural network layers [13-18].

    3. Tensor Not on the Right Device

    • The Device Dilemma: PyTorch allows you to perform computations on either a CPU or a GPU, with GPUs offering substantial speed advantages for deep learning tasks [19, 20]. However, this flexibility introduces the potential for device mismatches, where you attempt to perform operations between tensors located on different devices (CPU or GPU), resulting in errors [19, 21].
    • Common Culprits:
    • Data on CPU, Model on GPU: You might load your data but forget to transfer it to the GPU using tensor.to(device), leading to an error when interacting with a model that’s on the GPU [22].
    • NumPy and GPUs: NumPy, a popular Python library for numerical computing, operates on CPU memory. Attempting to directly use NumPy functions on GPU tensors will result in errors. You’ll need to first move the tensor back to the CPU using tensor.cpu() [19].
    • Debugging and Solutions:
    • Device-Agnostic Code: Adopt the practice of writing device-agnostic code, using the device variable and the to() method. This will help ensure that your data and models are consistently on the correct device [17].
    • Minimize Transfers: Be mindful of the overhead associated with data transfers between the CPU and GPU. Try to minimize these transfers to optimize performance.
    • Source Walkthrough: The sources provide examples of device mismatch errors, particularly when attempting to use NumPy functions with tensors on the GPU [19] and when data and models are not explicitly moved to the same device [17, 22-24].

    The sources consistently emphasize that these three types of errors are incredibly prevalent in PyTorch and deep learning in general, often leading to a significant amount of troubleshooting and debugging. By understanding the nature of these errors, their common causes, and how to address them, you’ll be well-prepared to tackle a substantial portion of the challenges you’ll encounter while developing and training deep learning models with PyTorch.

    The Dynamic Duo: Gradient Descent and Backpropagation

    The sources highlight two fundamental algorithms that are at the heart of training neural networks: gradient descent and backpropagation. Let’s explore each of these in detail.

    1. Gradient Descent: The Optimizer

    • What it Does: Gradient descent is an optimization algorithm that aims to find the best set of parameters (weights and biases) for a neural network to minimize the loss function. The loss function quantifies how “wrong” the model’s predictions are compared to the actual target values.
    • The Analogy: Imagine you’re standing on a mountain and want to find the lowest point (the valley). Gradient descent is like taking small steps downhill, following the direction of the steepest descent. The “steepness” is determined by the gradient of the loss function.
    • In PyTorch: PyTorch provides the torch.optim module, which contains various implementations of gradient descent and other optimization algorithms. You specify the model’s parameters and a learning rate (which controls the size of the steps taken downhill). [1-3]
    • Variations: There are different flavors of gradient descent:
    • Stochastic Gradient Descent (SGD): Updates parameters based on the gradient calculated from a single data point or a small batch of data. This introduces some randomness (noise) into the optimization process, which can help escape local minima. [3]
    • Adam: A more sophisticated variant of SGD that uses momentum and adaptive learning rates to improve convergence speed and stability. [4, 5]
    • Key Insight: The choice of optimizer and its hyperparameters (like learning rate) can significantly influence the training process and the final performance of your model. Experimentation is often needed to find the best settings for a given problem.

    2. Backpropagation: The Gradient Calculator

    • Purpose: Backpropagation is the algorithm responsible for calculating the gradients of the loss function with respect to the neural network’s parameters. These gradients are then used by gradient descent to update the parameters in the direction that reduces the loss.
    • How it Works: Backpropagation uses the chain rule from calculus to efficiently compute gradients, starting from the output layer and propagating them backward through the network layers to the input.
    • The “Backward Pass”: In PyTorch, you trigger backpropagation by calling the loss.backward() method. This calculates the gradients and stores them in the grad attribute of each parameter tensor. [6-9]
    • PyTorch’s Magic: PyTorch’s autograd feature handles the complexities of backpropagation automatically. You don’t need to manually implement the chain rule or derivative calculations. [10, 11]
    • Essential for Learning: Backpropagation is the key to enabling neural networks to learn from data by adjusting their parameters in a way that minimizes prediction errors.

    The sources emphasize that gradient descent and backpropagation work in tandem: backpropagation computes the gradients, and gradient descent uses these gradients to update the model’s parameters, gradually improving its performance over time. [6, 10]
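
    A tiny sketch of the two algorithms working together on a single made-up parameter:

    import torch

    w = torch.tensor(2.0, requires_grad=True)  # one "weight", arbitrarily initialized
    loss = (w - 5.0) ** 2                      # loss is minimized at w = 5

    loss.backward()  # backpropagation: compute d(loss)/dw via the chain rule
    print(w.grad)    # tensor(-6.) because d/dw (w - 5)^2 = 2 * (w - 5)

    with torch.no_grad():  # gradient descent: one manual parameter update
        w -= 0.1 * w.grad  # the learning rate of 0.1 controls the step size
    w.grad.zero_()         # reset the gradient for the next iteration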

    Transfer Learning: Leveraging Existing Knowledge

    Transfer learning is a powerful technique in deep learning where you take a model that has already been trained on a large dataset for a particular task and adapt it to solve a different but related task. This approach offers several advantages, especially when dealing with limited data or when you want to accelerate the training process. The sources provide examples of how transfer learning can be applied and discuss some of the key resources within PyTorch that support this technique.

    The Core Idea: Instead of training a model from scratch, you start with a model that has already learned a rich set of features from a massive dataset (often called a pre-trained model). These pre-trained models are typically trained on datasets like ImageNet, which contains millions of images across thousands of categories.

    How it Works:

    1. Choose a Pre-trained Model: Select a pre-trained model that is relevant to your target task. For image classification, popular choices include ResNet, VGG, and Inception.
    2. Feature Extraction: Use the pre-trained model as a feature extractor. You can either:
    • Freeze the weights of the early layers of the model (which have learned general image features) and only train the later layers (which are more specific to your task).
    • Fine-tune the entire pre-trained model, allowing all layers to adapt to your target dataset.
    3. Transfer to Your Task: Replace the final layer(s) of the pre-trained model with layers that match the output requirements of your task. For example, if you’re classifying images into 10 categories, you’d replace the final layer with a layer that outputs 10 probabilities.
    4. Train on Your Data: Train the modified model on your dataset. Since the pre-trained model already has a good understanding of general image features, the training process can converge faster and achieve better performance, even with limited data.

    PyTorch Resources for Transfer Learning:

    • Torch Hub: A repository of pre-trained models that can be easily loaded and used. The sources mention Torch Hub as a valuable resource for finding models to use in transfer learning.
    • torchvision.models: Contains a collection of popular computer vision architectures (like ResNet and VGG) that come with pre-trained weights. You can easily load these models and modify them for your specific tasks.
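
    A minimal transfer learning sketch, assuming a recent torchvision version that supports the weights argument (older versions use pretrained=True instead) and a hypothetical 10-class task:

    import torch
    from torch import nn
    import torchvision

    model = torchvision.models.resnet18(weights="IMAGENET1K_V1")  # pre-trained

    for param in model.parameters():  # freeze the feature-extraction layers
        param.requires_grad = False

    # Replace the final layer to match the 10 classes of the new task
    model.fc = nn.Linear(in_features=model.fc.in_features, out_features=10)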

    Benefits of Transfer Learning:

    • Faster Training: Since you’re not starting from random weights, the training process typically requires less time.
    • Improved Performance: Pre-trained models often bring a wealth of knowledge that can lead to better accuracy on your target task, especially when you have a small dataset.
    • Less Data Required: Transfer learning can be highly effective even when your dataset is relatively small.

    Examples in the Sources:

    The sources provide a glimpse into how transfer learning can be applied to image classification problems. For instance, you could leverage a model pre-trained on ImageNet to classify different types of food images or to distinguish between different clothing items in fashion images.

    Key Takeaway: Transfer learning is a valuable technique that allows you to build upon the knowledge gained from training large models on extensive datasets. By adapting these pre-trained models, you can often achieve better results faster, particularly in scenarios where labeled data is scarce.

    When to Choose Machine Learning over Traditional Programming

    Here are some reasons why you might choose a machine learning algorithm over traditional programming:

    • When you have problems with long lists of rules, it can be helpful to use a machine learning or a deep learning approach. For example, the rules of driving would be very difficult to code into a traditional program, but machine learning and deep learning are currently being used in self-driving cars to manage these complexities [1].
    • Machine learning can be beneficial in continually changing environments because it can adapt to new data. For example, a machine learning model for self-driving cars could learn to adapt to new neighborhoods and driving conditions [2].
    • Machine learning and deep learning excel at discovering insights within large collections of data. For example, the Food 101 data set contains images of 101 different kinds of food, which would be very challenging to classify using traditional programming techniques [3].
    • If a problem can be solved with a simple set of rules, you should use traditional programming. For example, if you could write five steps to make your grandmother’s famous roast chicken, then it is better to do that than to use a machine learning algorithm [4, 5].

    Traditional programming is when you write code to define a set of rules that map inputs to outputs. For example, you could write a program to make your grandmother’s roast chicken by defining a set of steps that map the ingredients to the finished dish [6, 7].

    Machine learning, on the other hand, is when you give a computer a set of inputs and outputs, and it figures out the rules for itself. For example, you could give a machine learning algorithm a bunch of pictures of cats and dogs, and it would learn to distinguish between them [8, 9]. This is often described as supervised learning, because the algorithm is given both the inputs and the desired outputs, also known as features and labels. The algorithm’s job is to figure out the relationship between the features and the labels [8].

    Deep learning is a subset of machine learning that uses neural networks with many layers. This allows deep learning models to learn more complex patterns than traditional machine learning algorithms. Deep learning is typically better for unstructured data, such as images, text, and audio [10].

    Machine learning can be used for a wide variety of tasks, including:

    • Image classification: Identifying the objects in an image. [11]
    • Object detection: Locating objects in an image. [11]
    • Natural language processing: Understanding and processing human language. [12]
    • Speech recognition: Converting speech to text. [13]
    • Machine translation: Translating text from one language to another. [13]

    Overall, machine learning algorithms can be a powerful tool for solving complex problems that would be difficult or impossible to solve with traditional programming. However, it is important to remember that machine learning is not a silver bullet. There are many problems that are still best solved with traditional programming.

    Deep Learning for Problems with Long Lists of Rules

    Here are the key advantages of using deep learning for problems with long lists of rules:

    • Deep learning can excel at finding patterns in complex data, making it suitable for problems where it is difficult to explicitly code all of the rules. [1] For example, driving a car involves many rules, such as how to back out of a driveway, how to turn left, how to parallel park, and how to stop at an intersection. It would be extremely difficult to code all of these rules into a traditional program. [2]
    • Deep learning is also well-suited for problems that involve continually changing environments. [3] This is because deep learning models can continue to learn and adapt to new data. [3] For example, a self-driving car might need to adapt to new neighborhoods and driving conditions. [3]
    • Deep learning can be used to discover insights within large collections of data. [4] This is because deep learning models are able to learn complex patterns from large amounts of data. [4] For example, a deep learning model could be trained on a large dataset of food images to learn to classify different types of food. [4]

    However, there are also some potential drawbacks to using deep learning for problems with long lists of rules:

    • Deep learning models can be difficult to interpret. [5] This is because the patterns learned by a deep learning model are often represented as a large number of weights and biases, which can be difficult for humans to understand. [5]
    • Deep learning models can be computationally expensive to train. [5] This is because deep learning models often have a large number of parameters, which require a lot of computational power to train. [5]

    Overall, deep learning can be a powerful tool for solving problems with long lists of rules, but it is important to be aware of the potential drawbacks before using it.

    Deep Learning Models Learn by Adjusting Random Numbers

    Deep learning models learn by starting with tensors full of random numbers and then adjusting those random numbers to represent data better. [1] This process is repeated over and over, with the model gradually improving its representation of the data. [2] This is a fundamental concept in deep learning. [1]

    This process of adjusting random numbers is driven by two algorithms: gradient descent and backpropagation. [3, 4]

    • Gradient descent minimizes the difference between the model’s predictions and the actual outputs by adjusting model parameters (weights and biases). [3, 4] The learning rate is a hyperparameter that determines how large the steps are that the model takes during gradient descent. [5, 6]
    • Backpropagation calculates the gradients of the loss function with respect to the model’s parameters. [4] In other words, backpropagation tells the model how much each parameter needs to be adjusted to reduce the error. [4] PyTorch implements backpropagation behind the scenes, making it easier to build deep learning models without needing to understand the complex math involved. [4, 7]

    Deep learning models have many parameters, often thousands or even millions. [8, 9] These parameters represent the patterns that the model has learned from the data. [8, 10] By adjusting these parameters using gradient descent and backpropagation, the model can improve its performance on a given task. [1, 2]

    This learning process is similar to how humans learn. For example, when a child learns to ride a bike, they start by making random movements. Through trial and error, they gradually learn to coordinate their movements and balance on the bike. Similarly, a deep learning model starts with random parameters and gradually adjusts them to better represent the data it is trying to learn.

    In short, the main concept behind a deep learning model’s ability to learn is its ability to adjust a large number of random parameters to better represent the data, driven by gradient descent and backpropagation.

    Supervised and Unsupervised Learning Paradigms

    Supervised learning is a type of machine learning where you have data and labels. The labels are the desired outputs for each input. The goal of supervised learning is to train a model that can accurately predict the labels for new, unseen data. An example of supervised learning is training a model to discern between cat and dog photos using photos labeled as either “cat” or “dog”. [1, 2]

    Unsupervised and self-supervised learning are types of machine learning where you only have data, and no labels. The goal of unsupervised learning is to find patterns in the data without any guidance from labels. The goal of self-supervised learning is similar, but the algorithm attempts to learn an inherent representation of the data without being told what to look for. [2, 3] For example, a self-supervised learning algorithm could be trained on a dataset of dog and cat photos without being told which photos are of cats and which are of dogs. The algorithm would then learn to identify the underlying patterns in the data that distinguish cats from dogs. This representation of the data could then be used to train a supervised learning model to classify cats and dogs. [3, 4]

    Transfer learning is a type of machine learning where you take the patterns that one model has learned on one dataset and apply them to another dataset. This is a powerful technique that can be used to improve the performance of machine learning models on new tasks. For example, you could use a model that has been trained to classify images of dogs and cats to help train a model to classify images of birds. [4, 5]

    Reinforcement learning is another machine learning paradigm that does not fall into the categories of supervised, unsupervised, or self-supervised learning. [6] In reinforcement learning, an agent learns to interact with an environment by performing actions and receiving rewards or observations in return. [6, 7] An example of reinforcement learning is teaching a dog to urinate outside by rewarding it for urinating outside. [7]

    Underfitting in Machine Learning

    Underfitting occurs when a machine learning model is not complex enough to capture the patterns in the training data. As a result, an underfit model will have high training error and high test error. This means it will make inaccurate predictions on both the data it was trained on and new, unseen data.

    Here are some ways to identify underfitting:

    • The model’s loss on the training and test data sets is higher than it could be [1].
    • The loss curve does not decrease significantly over time, remaining relatively flat [1].
    • The accuracy of the model is lower than desired on both the training and test sets [2].

    Here’s an analogy to better understand underfitting: Imagine you are trying to learn to play a complex piano piece but are only allowed to use one finger. You can learn to play a simplified version of the song, but it will not sound very good. You are underfitting the data because your one-finger technique is not complex enough to capture the nuances of the original piece.

    Underfitting is often caused by using a model that is too simple for the data. For example, using a linear model to fit data with a non-linear relationship will result in underfitting [3]. It can also be caused by not training the model for long enough. If you stop training too early, the model may not have had enough time to learn the patterns in the data.

    Here are some ways to address underfitting:

    • Add more layers or units to your model: This will increase the complexity of the model and allow it to learn more complex patterns [4].
    • Train for longer: This will give the model more time to learn the patterns in the data [5].
    • Tweak the learning rate: If the learning rate is too high, the model may not be able to converge on a good solution. Reducing the learning rate can help the model learn more effectively [4].
    • Use transfer learning: Transfer learning can help to improve the performance of a model by using knowledge learned from a previous task [6].
    • Use less regularization: Regularization is a technique that can help to prevent overfitting, but if you use too much regularization, it can lead to underfitting. Reducing the amount of regularization can help the model learn more effectively [7].

    The goal in machine learning is to find the sweet spot between underfitting and overfitting, where the model is complex enough to capture the patterns in the data, but not so complex that it overfits. This is an ongoing challenge, and there is no one-size-fits-all solution. However, by understanding the concepts of underfitting and overfitting, you can take steps to improve the performance of your machine learning models.

    Impact of the Learning Rate on Gradient Descent

    The learning rate, often abbreviated as “LR”, is a hyperparameter that determines the size of the steps taken during the gradient descent algorithm [1-3]. Gradient descent, as previously discussed, is an iterative optimization algorithm that aims to find the optimal set of model parameters (weights and biases) that minimize the loss function [4-6].

    A smaller learning rate means the model parameters are adjusted in smaller increments during each iteration of gradient descent [7-10]. This leads to slower convergence, requiring more epochs to reach the optimal solution. However, a smaller learning rate can also be beneficial as it allows the model to explore the loss landscape more carefully, potentially avoiding getting stuck in local minima [11].

    Conversely, a larger learning rate results in larger steps taken during gradient descent [7-10]. This can lead to faster convergence, potentially reaching the optimal solution in fewer epochs. However, a large learning rate can also be detrimental as it can cause the model to overshoot the optimal solution, leading to oscillations or even divergence, where the loss increases instead of decreasing [7, 10, 12].

    Visualizing the learning rate’s effect can be helpful. Imagine trying to find the lowest point in a valley. A small learning rate is like taking small, careful steps down the slope, ensuring you don’t miss the bottom. A large learning rate is like taking large, confident strides, potentially reaching the bottom faster but risking stepping over it entirely.

    The choice of learning rate is crucial and often involves experimentation to find the optimal balance between convergence speed and stability [12-15]. Setting the learning rate too low can result in unnecessarily long training times, while setting it too high can prevent the model from converging at all [7, 10, 12].

    In the sources, a learning rate of 0.01 is described as “relatively large” [16]. A learning rate of 0.1 is mentioned as a common starting point but potentially too large for some scenarios [10, 17]. Smaller learning rates like 0.001 are also used, particularly with the Adam optimizer, which has a default learning rate of 0.001 [18-20].
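
    A minimal sketch of passing a learning rate to an optimizer (the model here is just a placeholder):

    import torch
    from torch import nn

    model = nn.Linear(2, 1)  # placeholder model

    sgd = torch.optim.SGD(model.parameters(), lr=0.01)     # a relatively large LR
    adam = torch.optim.Adam(model.parameters(), lr=0.001)  # Adam's default LR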

    The sources demonstrate the impact of the learning rate on the training process. In one instance, lowering the learning rate from 0.1 to 0.01 resulted in a lower loss on the test dataset [21]. Additionally, using a TensorFlow Playground visualization, varying the learning rate illustrated how it affects the speed of convergence and the ability to find the optimal solution [11, 22, 23].

    Therefore, selecting an appropriate learning rate is an important consideration when training machine learning models. Experimentation and monitoring the training process are often necessary to find the learning rate that yields the best results for a specific problem and model architecture.

    Benefits and Drawbacks of Deep Learning Compared to Traditional Machine Learning

    Deep learning, a subfield of machine learning, has gained immense popularity due to its ability to achieve state-of-the-art performance on various complex tasks. However, it’s essential to consider both its advantages and disadvantages compared to traditional machine learning algorithms before deciding which approach is suitable for a particular problem.

    Benefits of Deep Learning

    • Handles Unstructured Data: Deep learning excels in handling unstructured data such as images, text, and audio. Traditional machine learning algorithms often struggle with such data types, requiring extensive feature engineering to convert them into structured formats [1, 2]. Deep learning models, particularly convolutional neural networks (CNNs) for images and recurrent neural networks (RNNs) or transformers for text, automatically learn relevant features from raw data, simplifying the process [1-3].
    • Superior Performance on Complex Tasks: Deep learning models, due to their depth and complexity, can capture intricate patterns and relationships in data that traditional algorithms may miss [3-5]. This enables them to outperform traditional methods on tasks like image recognition, natural language processing, and speech synthesis [6-8].
    • Adaptability and Continuous Learning: Deep learning models can adapt to changing environments and learn from new data continuously [9]. This is crucial for applications like self-driving cars, where the model needs to adjust to new scenarios and learn from ongoing experiences [4, 9, 10].

    Drawbacks of Deep Learning

    • Black Box Nature and Explainability: Deep learning models often lack explainability. Their complex architectures and vast number of parameters make it challenging to interpret how they arrive at their predictions [11, 12]. This can be a concern in applications where understanding the reasoning behind a decision is critical, such as medical diagnosis or financial modeling [11, 13].
    • Data Hungry and Computationally Intensive: Deep learning models typically require massive amounts of data to train effectively [14, 15]. Gathering and preparing such large datasets can be time-consuming and expensive. Additionally, training deep learning models can be computationally demanding, requiring powerful hardware like GPUs or TPUs [16, 17].
    • Potential for Overfitting: Due to their complexity, deep learning models are susceptible to overfitting, where they learn the training data too well but generalize poorly to unseen data [18-20]. This requires careful tuning and techniques like regularization or dropout to mitigate [21, 22].

    Traditional Machine Learning Algorithms: Strengths and Limitations

    Traditional machine learning algorithms, like linear regression, support vector machines (SVMs), and decision trees, have been widely used and remain relevant for many applications. They often excel in scenarios where data is structured, explainability is crucial, or computational resources are limited.

    • Strengths: They generally require less data, are easier to interpret, and are computationally less demanding compared to deep learning models [23, 24].
    • Limitations: They may struggle with complex tasks involving unstructured data, require extensive feature engineering, and may not perform as well as deep learning on specific tasks.

    The sources primarily focus on deep learning, and there is limited information on specific advantages and disadvantages of traditional machine learning algorithms. You may want to independently verify this information.

    The choice between deep learning and traditional machine learning depends on the specific problem, available data, computational resources, and the importance of explainability. Deep learning often outperforms traditional methods on complex tasks with unstructured data. However, traditional algorithms may be a better choice when data is structured, interpretability is crucial, or computational resources are limited.

    Google’s Number One Rule of Machine Learning: Benefits of Simplicity

    Google’s number one rule of machine learning emphasizes choosing simplicity over complexity when possible. It states: “If you can build a simple rule-based system that doesn’t require machine learning, do that.” [1-3] This principle, although seemingly counterintuitive at the start of a machine learning course, highlights the importance of considering the trade-offs before deploying complex models. [4]

    Here’s why it’s crucial to consider Google’s number one rule:

    • Efficiency and Maintainability: Simple rule-based systems are often easier to implement, understand, maintain, and debug compared to machine learning models. They require less code, are more transparent in their decision-making process, and don’t need extensive training data or computational resources. [1-3]
    • Reliability and Predictability: Rule-based systems provide predictable outcomes based on predefined rules. In contrast, machine learning models, especially deep learning models, involve probabilistic predictions, meaning there’s always a chance of error. [5] For tasks requiring absolute certainty or where errors are unacceptable, a rule-based system might be a more suitable choice. [5]
    • Reduced Development Time and Costs: Building and deploying a machine learning model involves several steps, including data collection, preprocessing, model selection, training, and evaluation. This process can be time-consuming and resource-intensive. If a simple rule-based system can achieve the desired outcome, it can significantly reduce development time and costs. [1, 2]
    • Avoiding Unnecessary Complexity: Machine learning models, especially deep learning models, can become highly complex, making them challenging to interpret and debug. Using a machine learning model when a simpler solution exists introduces unnecessary complexity, potentially leading to difficulties in maintenance and troubleshooting. [4]

    The sources provide an analogy to illustrate this principle. If a simple set of five rules can accurately map ingredients to a Sicilian grandmother’s roast chicken recipe, there’s no need to employ a complex machine learning model. The rule-based system, in this case, would be more efficient and reliable. [1, 2]

    However, it’s important to acknowledge that rule-based systems have limitations. They may not be suitable for complex problems with a vast number of rules, constantly changing environments, or situations requiring insights from large datasets. [6, 7]

    Therefore, Google’s number one rule encourages a thoughtful approach to problem-solving, urging consideration of simpler alternatives before resorting to the complexity of machine learning. It emphasizes that machine learning, although powerful, is not a universal solution and should be applied judiciously when the problem demands it. [4, 7]

    Parameters vs. Hyperparameters in Machine Learning

    Parameters: Learned by the Model

    • Parameters are the internal values of a machine learning model that are learned automatically during the training process. [1]
    • They are responsible for capturing patterns and relationships within the data. [1]
    • Examples of parameters include weights and biases in a neural network. [1, 2]
    • Parameters are updated iteratively through optimization algorithms like gradient descent, guided by the loss function. [3, 4]
    • The number of parameters can vary significantly depending on the complexity of the model and the dataset. Models can have from a few parameters to millions or even billions. [2]
    • In the context of PyTorch, accessing model parameters can be done using model.parameters(). [5]
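    As a minimal sketch (assuming a simple nn.Linear model, which is illustrative and not from the sources), the learned parameters can be inspected like so:

    import torch
    from torch import nn

    model = nn.Linear(in_features=2, out_features=1)  # its weights and biases are parameters

    for name, param in model.named_parameters():
        print(name, param.shape, param.requires_grad)
    # weight torch.Size([1, 2]) True
    # bias torch.Size([1]) True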

    Hyperparameters: Set by the Machine Learning Engineer

    • Hyperparameters are external configurations that are set by the machine learning engineer or data scientist before training the model. [4]
    • They control the learning process and influence the behavior of the model, such as its complexity, learning speed, and ability to generalize. [6]
    • Examples of hyperparameters:
    • Learning rate (LR) [7]
    • Number of hidden layers [8]
    • Number of hidden units per layer [8]
    • Number of epochs [9]
    • Activation functions [8]
    • Loss function [8]
    • Optimizer [8]
    • Batch size [10]
    • Choosing appropriate hyperparameters is crucial for optimal model performance. [6]
    • Finding the best hyperparameter settings often involves experimentation and techniques like grid search or random search. [This information about grid search or random search is not from your provided sources and you may want to independently verify it.]
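    As a hedged sketch (the variable names and layer sizes are illustrative, not from the sources), hyperparameters are typically set as plain Python values before training begins:

    import torch
    from torch import nn

    # Hyperparameters: chosen by the engineer before training
    LEARNING_RATE = 0.01
    NUM_EPOCHS = 100
    BATCH_SIZE = 32
    HIDDEN_UNITS = 8

    model = nn.Sequential(
        nn.Linear(2, HIDDEN_UNITS),
        nn.ReLU(),                     # the activation function is also a hyperparameter choice
        nn.Linear(HIDDEN_UNITS, 1),
    )
    loss_fn = nn.BCEWithLogitsLoss()   # so is the loss function
    optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)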

    Key Differences

    • Learned vs. Set: The key difference is that parameters are learned by the model during training, while hyperparameters are set manually before training.
    • Internal vs. External: Parameters are internal to the model, representing its learned knowledge, whereas hyperparameters are external configurations that guide the learning process.
    • Optimization Target vs. Optimization Control: The model’s optimization algorithms aim to find the optimal parameter values, while hyperparameters control how this optimization process occurs.

    The sources provide a clear distinction between parameters and hyperparameters. Parameters are like the model’s internal settings that it adjusts to capture patterns in the data. Hyperparameters are the external knobs that the machine learning engineer tweaks to guide the model’s learning process. Understanding this distinction is essential for building and training effective machine learning models.

    Back Propagation and Gradient Descent: A Collaborative Learning Process

    Back propagation and gradient descent are two essential algorithms that work together to enable a machine learning model to learn from data and improve its performance. These concepts are particularly relevant to deep learning models, which involve complex architectures with numerous parameters that need to be optimized.

    Back Propagation: Calculating the Gradients

    Back propagation is an algorithm that calculates the gradients of the loss function with respect to each parameter in the model. The gradients represent the direction and magnitude of change needed in each parameter to minimize the loss function.

    • Forward Pass: It begins with a forward pass, where data is fed through the model’s layers, and predictions are generated.
    • Loss Calculation: The difference between these predictions and the actual target values is quantified using a loss function.
    • Backward Pass: The back propagation algorithm then works backward through the network, starting from the output layer and moving towards the input layer.
    • Chain Rule: It uses the chain rule of calculus to calculate the gradients of the loss function with respect to each parameter. This process involves calculating the partial derivatives of the loss function with respect to the outputs of each layer, and then using these derivatives to calculate the gradients for the parameters within that layer.
    • Gradient Accumulation: The gradients are accumulated during this backward pass, providing information about how each parameter contributes to the overall error.

    Gradient Descent: Updating the Parameters

    Gradient descent is an optimization algorithm that uses the gradients calculated by back propagation to update the model’s parameters iteratively. The goal is to find the parameter values that minimize the loss function, leading to improved model performance.

    • Learning Rate: The learning rate is a hyperparameter that determines the step size taken in the direction of the negative gradient. It controls how much the parameters are adjusted during each update.
    • Iterative Updates: Gradient descent starts with an initial set of parameter values (often randomly initialized) and repeatedly updates these values based on the calculated gradients.
    • Minimizing the Loss: The update rule involves moving the parameters in the opposite direction of the gradient, scaled by the learning rate. This process continues iteratively until the loss function reaches a minimum or a satisfactory level of convergence is achieved.
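    In symbols, the standard update rule described above is: new_parameter = old_parameter - learning_rate * gradient, i.e. each parameter takes a small step against its own gradient.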

    The Interplay

    Back propagation provides the essential information needed for gradient descent to operate. By calculating the gradients of the loss function with respect to each parameter, back propagation tells gradient descent which direction to move each parameter to reduce the error.

    The sources emphasize that PyTorch handles the intricate mathematics of back propagation and gradient descent behind the scenes. When you define your model, loss function, and optimizer in PyTorch, and execute the training loop, these algorithms are automatically triggered to update the model’s parameters. The loss.backward() function triggers back propagation, and the optimizer.step() function performs the parameter update using gradient descent.
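    A minimal sketch of where these two calls sit inside a PyTorch training step (assuming model, loss_fn, optimizer, X_train, and y_train are already defined):

    y_pred = model(X_train)            # forward pass
    loss = loss_fn(y_pred, y_train)    # calculate the loss
    optimizer.zero_grad()              # reset gradients from the previous step
    loss.backward()                    # back propagation: compute the gradients
    optimizer.step()                   # gradient descent: update the parameters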

    Example: Imagine a neural network trained to classify images of cats and dogs. During training, back propagation calculates how much each weight and bias parameter contributes to misclassifications. Gradient descent then uses this information to adjust these parameters slightly, moving them in a direction that reduces the error. This process continues iteratively, gradually improving the model’s ability to distinguish between cats and dogs.

    In summary, back propagation and gradient descent form a powerful duo in machine learning. Back propagation determines the direction and magnitude of change needed in the model’s parameters, and gradient descent uses this information to iteratively update the parameters, driving the model towards better performance. While the mathematical details can be complex, PyTorch simplifies this process by abstracting away the complexity, allowing you to focus on building and training models without manually implementing these algorithms.

    The Role of Random Seeds in Neural Networks

    The sources provide a detailed explanation of how randomness plays a crucial role in the initialization and training of neural networks. Here’s how random seeds influence this randomness:

    Random Initialization of Parameters:

    • Neural networks start with random values for their parameters, such as weights and biases. [1, 2] This random initialization is essential for breaking symmetry and allowing the model to explore different regions of the parameter space during training.
    • Without random initialization, all neurons in a layer would learn the same features, hindering the network’s ability to learn complex patterns. [This point is not explicitly mentioned in your sources, so you may want to independently verify it.]

    Sources of Randomness in PyTorch:

    • PyTorch uses pseudo-random number generators to create these random values. [3] Pseudo-randomness means that while the generated numbers appear random, they are actually determined by a deterministic algorithm.
    • Random Tensor Creation: When you create a random tensor in PyTorch using functions like torch.rand(), the underlying random number generator determines the values within that tensor. [1, 4] Each time you run the code, you get a different set of random values.

    The Impact of Random Seeds:

    • Reproducibility: The problem with this inherent randomness is that it makes it difficult to reproduce experiments. If you share your code with someone else, they will likely get different results due to the different random initializations.
    • Controlling the Randomness: A random seed allows you to “flavor” the randomness. [5] Setting a seed using torch.manual_seed() ensures that the random number generator starts from a specific point, producing the same sequence of random numbers every time you run the code. [6]
    • Flavors of Randomness: Think of each seed value as a different “flavor” of randomness. [6] While the numbers will still be random, they will be the same random numbers every time you use that specific seed.

    Benefits of Using Random Seeds:

    • Consistent Results: Using a random seed enables you to reproduce experiments and ensure consistency in your results. This is essential for debugging, sharing code, and comparing different model architectures or hyperparameter settings.
    • Controlled Experiments: Random seeds allow you to control the randomness in your experiments, isolating the effects of other factors like model architecture or hyperparameters.

    Limitations of Random Seeds:

    • Not True Randomness: It’s important to remember that even with a random seed, you’re not getting true randomness. The numbers are still generated by a deterministic algorithm.
    • Generalization: Over-reliance on a specific random seed may lead to a model that overfits to the specific random initialization, potentially harming its ability to generalize to unseen data. [This is a point worth noting but is not from your sources.]

    Practical Usage:

    • The sources recommend setting the random seed at the beginning of your code using torch.manual_seed(). [6] This ensures that all subsequent random operations in your code, such as creating random tensors or initializing model parameters, use the same “flavored” randomness.
    • CUDA Random Seed: If you’re using CUDA for GPU computations, you also need to set the CUDA random seed using torch.cuda.manual_seed() to control randomness on the GPU. [7]
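    A minimal sketch of setting the seeds described above and confirming the reproducible "flavor" of randomness:

    import torch

    RANDOM_SEED = 42                       # any integer works; 42 is a common choice
    torch.manual_seed(RANDOM_SEED)         # seeds the CPU random number generator
    torch.cuda.manual_seed(RANDOM_SEED)    # seeds the GPU generator (if using CUDA)

    tensor_a = torch.rand(3, 4)
    torch.manual_seed(RANDOM_SEED)         # re-seed to reproduce the same sequence
    tensor_b = torch.rand(3, 4)
    print(torch.equal(tensor_a, tensor_b)) # True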

    In summary, random seeds are a valuable tool for controlling the randomness inherent in neural networks, enabling reproducibility and facilitating controlled experiments. By setting a random seed, you ensure that the same “flavored” randomness is used every time you run your code, leading to consistent and reproducible results.

    Batch Size: A Balancing Act for Efficiency and Performance

    Batch size is a crucial hyperparameter in machine learning that determines how many samples are processed by the model before updating its parameters. Instead of processing the entire training dataset in one go, the data is divided into smaller groups called batches. The model iterates through these batches, updating its parameters after processing each batch.

    Impact of Batch Size on Training:

    • Computational Efficiency: The sources highlight that batch size significantly impacts computational efficiency. Processing a large batch of images requires significant memory and computational power. Using a smaller batch size can make training more manageable, especially when dealing with limited hardware resources or large datasets.
    • Gradient Update Frequency: A smaller batch size leads to more frequent updates to the model’s parameters because the gradients are calculated and applied after each batch. This can lead to faster convergence, especially in the early stages of training.
    • Generalization: Using smaller batch sizes can also improve the model’s ability to generalize to unseen data. This is because the model is exposed to a more diverse set of samples during each epoch, potentially leading to a more robust representation of the data.

    Choosing the Right Batch Size:

    • Hardware Constraints: The sources emphasize that hardware constraints play a significant role in determining the batch size. If you have a powerful GPU with ample memory, you can use larger batch sizes without running into memory issues. However, if you’re working with limited hardware, smaller batch sizes may be necessary.
    • Dataset Size: The size of your dataset also influences the choice of batch size. For smaller datasets, you might be able to use larger batch sizes, but for massive datasets, smaller batch sizes are often preferred.
    • Experimentation: Finding the optimal batch size often involves experimentation. The sources recommend starting with a common batch size like 32 and adjusting it based on the specific problem and hardware limitations.

    Mini-Batch Gradient Descent:

    • Efficiency and Performance Trade-off: The concept of using batches to train a neural network is called mini-batch gradient descent. Mini-batch gradient descent strikes a balance between the stable (but memory-hungry) gradient estimates of batch gradient descent, which processes the entire dataset in one go, and the frequent (but noisy) updates of stochastic gradient descent, which processes one sample at a time.
    • Advantages of Mini-Batches: The sources list two primary benefits of using mini-batches:
    1. Computational Efficiency: Mini-batches make it feasible to train models on large datasets that might not fit entirely in memory.
    2. More Frequent Gradient Updates: More frequent updates lead to potentially faster convergence and can help the model escape local minima during training.

    Example from the Sources:

    • In the context of image classification using the Fashion MNIST dataset, the sources demonstrate how a batch size of 32 is used to divide the 60,000 training images into smaller, manageable batches. This allows the model to process and learn from the data more efficiently.
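    A hedged sketch of that setup (the download arguments and file paths are illustrative):

    import torch
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    train_data = datasets.FashionMNIST(
        root="data", train=True, download=True,
        transform=transforms.ToTensor(),
    )

    BATCH_SIZE = 32
    train_dataloader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True)

    print(len(train_data))         # 60000 images
    print(len(train_dataloader))   # 1875 batches of 32 images each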

    Key Considerations When Choosing Batch Size:

    • Larger Batch Sizes:
    • Can be more computationally efficient because you’re processing more samples in one go.
    • Can potentially lead to smoother gradient updates, but may also require more memory.
    • Risk of overfitting if the batch size is too large and doesn’t allow the model to explore diverse samples.
    • Smaller Batch Sizes:
    • Lead to more frequent gradient updates, potentially leading to faster convergence, especially in the early stages of training.
    • Can help the model generalize better to unseen data due to exposure to more diverse samples during training.
    • May be less computationally efficient as you’re processing fewer samples at a time.

    In conclusion, batch size is a critical hyperparameter that significantly influences the efficiency of training a neural network. Choosing the right batch size involves considering hardware constraints, dataset size, and experimental findings. Mini-batch gradient descent, by processing the data in batches, offers a balance between computational efficiency and performance, enabling the training of complex models on large datasets.

    By Amjad Izhar
    Contact: amjad.izhar@gmail.com
    https://amjadizhar.blog

  • PyTorch Deep Learning & Machine Learning

    PyTorch Deep Learning & Machine Learning

    This PDF excerpt details a PyTorch deep learning course. The course teaches PyTorch fundamentals, including tensor manipulation and neural network architecture. It covers various machine learning concepts, such as linear and non-linear regression, classification (binary and multi-class), and computer vision. Practical coding examples using Google Colab are provided throughout, demonstrating model building, training, testing, saving, and loading. The course also addresses common errors and troubleshooting techniques, emphasizing practical application and experimentation.

    PyTorch Deep Learning Study Guide

    Quiz

    1. What is the difference between a scalar and a vector? A scalar is a single number, while a vector has magnitude and direction and is represented by multiple numbers in a single dimension.
    2. How can you determine the number of dimensions of a tensor? You can determine the number of dimensions of a tensor by counting the number of pairs of square brackets, or by checking the ndim attribute of a tensor (tensor.ndim).
    3. What is the purpose of the .shape attribute of a tensor? The .shape attribute of a tensor returns a tuple that represents the size of each dimension of the tensor. It indicates the number of elements in each dimension, providing information about the tensor’s structure.
    4. What does the dtype of a tensor represent? The dtype of a tensor represents the data type of the elements within the tensor, such as float32, float16, or int32. It specifies how the numbers are stored in memory, impacting precision and memory usage.
    5. What is the difference between reshape and view when manipulating tensors? Both reshape and view change the shape of a tensor. view never copies: it creates a new view of the existing tensor data (and requires contiguous memory), so changes through the view affect the original tensor. reshape returns a view when possible and copies the data to new memory only when necessary.
    6. Explain what tensor aggregation is and provide an example. Tensor aggregation involves reducing the number of elements in a tensor by applying an operation like min, max, or mean. For example, finding the minimum value in a tensor reduces all of the elements to a single number.
    7. What does the stack function do to tensors and how is it different from unsqueeze? The stack function concatenates a sequence of tensors along a new dimension, increasing the dimensionality of the result by one. The unsqueeze function adds a single dimension of size one to a target tensor at a specified position.
    8. What does the term “device agnostic code” mean, and why is it important in PyTorch? Device-agnostic code in PyTorch means writing code that can run on either a CPU or GPU without modification. This is important for portability and leveraging the power of GPUs when available.
    9. In PyTorch, what is a “parameter”, how is it created, and what special property does it have? A “parameter” is a special type of tensor created using nn.Parameter. When assigned as a module attribute, a parameter is automatically added to the module’s parameter list, enabling gradient tracking during training.
    10. Explain the primary difference between the training loop and the testing/evaluation loop in a neural network. The training loop involves the forward pass, loss calculation, backpropagation and updating the model’s parameters through optimization, whereas the testing/evaluation loop involves only the forward pass and loss and/or accuracy calculation without gradient calculation and parameter updates.

    Essay Questions

    1. Discuss the importance of tensor operations in deep learning. Provide specific examples of how reshaping, indexing, and aggregation are utilized.
    2. Explain the significance of data types in PyTorch tensors, and elaborate on the potential issues that can arise from data type mismatches during tensor operations.
    3. Compare and contrast the use of reshape, view, stack, squeeze, and unsqueeze when dealing with tensors. In what scenarios might one operation be preferable over another?
    4. Describe the key steps involved in the training loop of a neural network. Explain the role of the loss function, optimizer, and backpropagation in the learning process.
    5. Explain the purpose of the torch.utils.data.DataLoader and the advantages it provides. Discuss how it can improve the efficiency and ease of use of data during neural network training.

    Glossary

    Scalar: A single numerical value. It has no direction or multiple dimensions.

    Vector: A mathematical object that has both magnitude and direction, often represented as an ordered list of numbers, i.e. in one dimension.

    Matrix: A rectangular array of numbers arranged in rows and columns, i.e. in two dimensions.

    Tensor: A generalization of scalars, vectors, and matrices. It can have any number of dimensions.

    Dimension (dim): Refers to the number of indices needed to address individual elements in a tensor, which is also the number of bracket pairs.

    Shape: A tuple that describes the size of each dimension of a tensor.

    Dtype: The data type of the elements in a tensor, such as float32, int64, etc.

    Indexing: Selecting specific elements or sub-tensors from a tensor using their positions in the dimensions.

    Reshape: Changing the shape of a tensor while preserving the number of elements.

    View: Creating a new view of a tensor’s data without copying. Changing the view will change the original data, and vice versa.

    Aggregation: Reducing the number of elements in a tensor by applying an operation (e.g., min, max, mean).

    Stack: Combining multiple tensors along a new dimension.

    Squeeze: Removing dimensions of size 1 from a tensor.

    Unsqueeze: Adding a new dimension of size 1 to a tensor.

    Device: The hardware on which computations are performed (e.g., CPU, GPU).

    Device Agnostic Code: Code that can run on different devices (CPU or GPU) without modification.

    Parameter (nn.Parameter): A special type of tensor that can be tracked during training, is a module attribute and is automatically added to a module’s parameter list.

    Epoch: A complete pass through the entire training dataset.

    Training Loop: The process of iterating through the training data, calculating loss, and updating model parameters.

    Testing/Evaluation Loop: The process of evaluating model performance on a separate test dataset.

    DataLoader: A utility in PyTorch that creates an iterable over a dataset, managing batching and shuffling of the data.

    Flatten: A layer that flattens a multi-dimensional tensor into a single dimension.

    PyTorch Deep Learning Fundamentals

    Briefing Document: PyTorch Deep Learning Fundamentals

    Introduction:

    This document summarizes the core concepts and practical implementations of PyTorch for deep learning, as detailed in the provided course excerpts. The focus is on tensors, their properties, manipulations, and usage within the context of neural network building and training.

    I. Tensors: The Building Blocks

    • Definition: Tensors are the fundamental data structure in PyTorch, used to encode data as numbers. Traditional terms like scalars, vectors, and matrices are all represented as tensors in PyTorch.
    • “basically anytime you encode data into numbers, it’s of a tensor data type.”
    • Scalars: A single number.
    • “A single number, number of dimensions, zero.”
    • Vectors: Have magnitude and direction and typically have more than one number.
    • “a vector typically has more than one number”
    • “a number with direction, number of dimensions, one”
    • Matrices: Two-dimensional tensors.
    • “a matrix, a tensor.”
    • Dimensions (ndim): Represented by the number of square bracket pairings in the tensor’s definition.
    • “dimension is like number of square brackets…number of pairs of closing square brackets.”
    • Shape: Defines the size of each dimension in a tensor.
    • For example, a vector [1, 2] has a shape of (2,), i.e. two elements in one dimension. A matrix [[1, 2], [3, 4]] has a shape of (2, 2).
    • “the shape of the vector is two. So we have two by one elements.”
    • Data Type (dtype): Tensors have a data type (e.g., float32, float16, int32, long). The default dtype in PyTorch is float32.
    • “the default data type in pytorch, even if it’s specified as none is going to come out as float 32.”
    • It’s important to ensure tensors have compatible data types when performing operations to avoid errors.
    • Device: Tensors can reside on different devices, such as the CPU or GPU (CUDA). Device-agnostic code is recommended to handle this.

    II. Tensor Creation and Manipulation

    • Creation:
    • torch.tensor(): Creates tensors from lists or NumPy arrays.
    • torch.zeros(): Creates a tensor filled with zeros.
    • torch.ones(): Creates a tensor filled with ones.
    • torch.arange(): Creates a 1D tensor with a range of values.
    • torch.rand(): Creates a tensor with random values.
    • torch.randn(): Creates a tensor with random values from normal distribution.
    • torch.zeros_like()/torch.ones_like()/torch.rand_like(): Creates tensors with the same shape as another tensor.
    • Indexing: Tensors can be accessed via numerical indices, allowing one to extract elements or subsets.
    • “This is where the square brackets, the pairings come into play.”
    • Reshaping:
    • reshape(): Changes the shape of a tensor, provided the total number of elements remains the same.
    • view(): Creates a view of the tensor, sharing the same memory, but does not change the shape of the original tensor. Modifying a view changes the original tensor.
    • Stacking: torch.stack() concatenates tensors along a new dimension. torch.vstack() and torch.hstack() concatenate tensors along the existing vertical and horizontal dimensions respectively.
    • Squeezing and Unsqueezing: squeeze() removes dimensions of size 1, and unsqueeze() adds dimensions of size 1.
    • Element-wise operations: standard operations like +, -, *, / are applied element-wise.
    • If the result is reassigned to the same variable (e.g., tensor = tensor * 10), the variable now holds the updated values; the operation itself creates a new tensor rather than modifying the original in place.
    • Matrix Multiplication: Use @ operator (or .matmul() function). Inner dimensions must match for valid matrix multiplication.
    • “inner dimensions must match.”
    • Transpose: tensor.T will transpose a tensor (swap rows and columns).
    • Aggregation: Functions like torch.min(), torch.max(), torch.mean(), and their respective index finders like torch.argmin()/torch.argmax() reduce the tensor to scalar values.
    • “So you’re turning it from nine elements to one element, hence aggregation.”
    • Attributes: tensors have attributes like dtype, shape (or size), and can be retrieved with tensor.dtype or tensor.shape (or tensor.size())
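    A small runnable sketch (not from the sources) tying these creation, manipulation, and attribute operations together:

    import torch

    x = torch.arange(1., 10.)                  # tensor([1., 2., ..., 9.]), shape (9,)
    x_reshaped = x.reshape(3, 3)               # same 9 elements, new shape (3, 3)
    print(x_reshaped.shape, x_reshaped.dtype)  # torch.Size([3, 3]) torch.float32

    y = torch.rand(3, 3)
    z = x_reshaped @ y                         # matrix multiplication: inner dimensions (3) match
    print(z.max(), z.argmax())                 # aggregation: nine elements reduced to one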

    III. Neural Networks with PyTorch

    • torch.nn Module: The module provides building blocks for creating neural networks.
    • “nn is the building block layer for neural networks.”
    • nn.Module: The base class for all neural network modules. Custom models should inherit from this class.
    • Linear Layers (nn.Linear): Represents a linear transformation (y = Wx + b).
    • Activation Functions: Non-linear functions such as ReLU (Rectified Linear Unit) and Sigmoid, enable neural networks to learn complex patterns.
    • “one divided by one plus torch exponential of negative x.”
    • Parameter (nn.Parameter): A special type of tensor that is added to a module’s parameter list, allowing automatic gradient tracking
    • “Parameters are torch tensor subclasses…automatically added to the list of its parameters.”
    • It’s critical to set requires_grad=True for parameters that need to be optimized during training.
    • Sequential Container (nn.Sequential): A convenient way to create models by stacking layers in a sequence.
    • Forward Pass: The computation of the model’s output given the input data. This is implemented in the forward() method of a class inheriting from nn.Module.
    • “Do the forward pass.”
    • Loss Functions: Measure the difference between the predicted and actual values.
    • “Calculate the loss.”
    • Optimizers: Algorithms that update the model’s parameters based on the loss function during training (e.g., torch.optim.SGD).
    • “optimise a step, step, step.”
    • Use optimizer.zero_grad() to reset the gradients before each training step.
    • Training Loop: The iterative process of:
    1. Forward pass
    2. Calculate Loss
    3. Optimizer zero grad
    4. Loss backwards
    5. Optimizer Step
    • Evaluation Mode: Set the model to model.eval() before doing inference (testing/evaluation) to disable training-specific behavior such as dropout; use torch.inference_mode() or torch.no_grad() alongside it to turn off gradient tracking.
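    A minimal sketch combining these building blocks into a model (the layer sizes are illustrative, not from the sources):

    import torch
    from torch import nn

    class SimpleModel(nn.Module):
        def __init__(self):
            super().__init__()
            # nn.Sequential stacks layers; nn.Linear is y = Wx + b, ReLU adds non-linearity
            self.layers = nn.Sequential(
                nn.Linear(in_features=2, out_features=8),
                nn.ReLU(),
                nn.Linear(in_features=8, out_features=1),
            )

        def forward(self, x):  # the forward pass
            return self.layers(x)

    model = SimpleModel()
    print(model.state_dict().keys())  # the learned parameters: weights and biases per layer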

    IV. Data Handling

    • torch.utils.data.Dataset: A class for representing datasets, and custom datasets can be built using this.
    • torch.utils.data.DataLoader: An iterable to batch data for use during training.
    • “This creates a Python iterable over a data set.”
    • Transforms: Functions that modify data (e.g., images) before they are used in training. They can be composed together.
    • “This little transforms module, the torch vision library will change that back to 64 64.”
    • Device Agnostic Data: Send data to the appropriate device (CPU/GPU) using .to(device)
    • NumPy Interoperability: PyTorch can convert NumPy arrays with torch.from_numpy(), but note that NumPy arrays default to float64 while PyTorch defaults to torch.float32, so the data type often needs converting.
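    A short sketch of the device-agnostic and NumPy-interop patterns mentioned above:

    import numpy as np
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"   # device-agnostic setup

    array = np.arange(1.0, 8.0)                               # NumPy defaults to float64
    tensor = torch.from_numpy(array).type(torch.float32)      # convert to PyTorch's default dtype
    tensor = tensor.to(device)                                # send the data to the available device
    print(tensor.dtype, tensor.device)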

    V. Visualization

    • Matplotlib: Library is used for visualizing plots and images.
    • “Our data explorers motto is visualize, visualize, visualize.”
    • plt.imshow(): Displays images.
    • plt.plot(): Displays data in a line plot.

    VI. Key Practices

    • Visualize, Visualize, Visualize: Emphasized for data exploration.
    • Device-Agnostic Code: Aim to write code that can run on both CPU and GPU.
    • Typo Avoidance: Be careful to avoid typos as they can cause errors.

    VII. Specific Examples/Concepts Highlighted:

    • Image data: tensors are often (height, width, color_channels) or (batch_size, color_channels, height, width)
    • Linear regression: the formula y=weight * x + bias
    • Non linear transformations: using activation functions to introduce non-linearity
    • Multi-class data sets: Using make_blobs function to generate multiple data classes.
    • Convolutional layers (nn.Conv2d): For processing images, which require specific parameters like in-channels, out-channels, kernel size, stride, and padding.
    • Flatten layer (nn.Flatten): Used to flatten the input into a vector before a linear layer.
    • Data Loaders: Batches of data in an iterable for training or evaluation loops.

    Conclusion:

    This document provides a foundation for understanding the essential elements of PyTorch for deep learning. It highlights the importance of tensors, their manipulation, and their role in building and training neural networks. Key concepts such as the training loop, device-agnostic coding, and the value of visualization are also emphasized.

    This briefing should serve as a useful reference for anyone learning PyTorch and deep learning fundamentals from these course materials.

    PyTorch Fundamentals: Tensors and Neural Networks

    1. What is a tensor in PyTorch and how does it relate to scalars, vectors, and matrices?

    In PyTorch, a tensor is the fundamental data structure used to represent data. Think of it as a generalization of scalars, vectors, and matrices. A scalar is a single number (0 dimensions), a vector has magnitude and direction, and is represented by one dimension, while a matrix has two dimensions. Tensors can have any number of dimensions and can store numerical data of various types. In essence, when you encode any kind of data into numbers within PyTorch, it becomes a tensor. PyTorch uses the term tensor to refer to any of these data types.

    2. How are the dimensions and shape of a tensor determined?

    The dimension of a tensor can be determined by the number of square bracket pairs used to define it. For example, [1, 2, 3] is a vector with one dimension (one pair of square brackets), and [[1, 2], [3, 4]] is a matrix with two dimensions (two pairs). The shape of a tensor refers to the size of each dimension. For instance, [1, 2, 3] has a shape of (3,), meaning 3 elements in the first dimension, while [[1, 2], [3, 4]] has a shape of (2, 2), meaning 2 rows and 2 columns. Note: The shape is determined by the number of elements in each dimension.

    3. How do you create tensors with specific values in PyTorch?

    PyTorch provides various functions to create tensors:

    • torch.tensor([value1, value2, …]) directly creates a tensor from a Python list. You can control the data type (dtype) of the tensor during its creation by passing the dtype argument.
    • torch.zeros(size) creates a tensor filled with zeros of the specified size.
    • torch.ones(size) creates a tensor filled with ones of the specified size.
    • torch.rand(size) creates a tensor filled with random values from a uniform distribution (between 0 and 1) of the specified size.
    • torch.arange(start, end, step) creates a 1D tensor containing values from start to end (exclusive), incrementing by step.
    • torch.zeros_like(other_tensor) and torch.ones_like(other_tensor) create tensors with the same shape and dtype as the other_tensor, filled with zeros or ones respectively.

    4. What is the importance of data types (dtypes) in tensors, and how can they be changed?

    Data types determine how data is stored in memory, which has implications for precision and memory usage. The default data type in PyTorch is torch.float32. To change a tensor’s data type, you can use the .type() method, e.g. tensor.type(torch.float16) will convert a tensor to 16 bit float. While PyTorch can often automatically handle operations between different data types, using the correct data type can prevent unexpected errors or behaviors. It’s good to be explicit.

    5. What are tensor attributes such as shape, size, and Dtype and how do they relate to tensor manipulation?

    These are attributes that can be used to understand, manipulate, and diagnose issues with tensors.

    • Shape: An attribute that represents the dimensions of the tensor. For example, a matrix might have a shape of (3, 4), indicating it has 3 rows and 4 columns. You can access this information by using .shape
    • Size: Acts like .shape but is a method i.e. .size(). It will return the dimensions of the tensor.
    • Dtype: Stands for data type. This defines the way the data is stored and impacts precision and memory use. You can access this by using .dtype.

    These attributes can be used to diagnose issues, for example you might want to ensure all tensors have compatible data types and dimensions for multiplication.

    6. How do operations like reshape, view, stack, unsqueeze, and squeeze modify the shape of tensors?

    • reshape(new_shape): Changes the shape of a tensor to a new shape, as long as the total number of elements remains the same, a tensor with 9 elements can be reshaped into (3, 3) or (9, 1) for example.
    • view(new_shape): Similar to reshape, but it can only be used to change the dimensions of a contiguous tensor (a tensor that has elements in continuous memory) and will also share the same memory as the original tensor meaning changes will impact each other.
    • stack(tensors, dim): Concatenates multiple tensors along a new dimension (specified by dim) and increases the overall dimensionality by 1.
    • unsqueeze(dim): Inserts a new dimension of size one at a specified position, increasing the overall dimensionality by 1.
    • squeeze(): Removes all dimensions with size one in a tensor, reducing overall dimensionality of a tensor.
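    A compact sketch demonstrating each of these shape operations:

    import torch

    x = torch.arange(1., 10.)                  # shape (9,)
    print(x.reshape(3, 3).shape)               # torch.Size([3, 3])

    z = x.view(9, 1)                           # shares memory with x
    z[0, 0] = 99.                              # also changes x[0]

    stacked = torch.stack([x, x, x], dim=0)    # shape (3, 9): a new dimension is added
    unsqueezed = x.unsqueeze(dim=0)            # shape (1, 9)
    squeezed = unsqueezed.squeeze()            # back to shape (9,)
    print(stacked.shape, unsqueezed.shape, squeezed.shape)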

    7. What are the key components of a basic neural network training loop?

    The key components include:

    • Forward Pass: The input data goes through the model, producing the output.
    • Calculate Loss: The error is calculated by comparing the output to the true labels.
    • Zero Gradients: Previous gradients are cleared before starting a new iteration to prevent accumulating them across iterations.
    • Backward Pass: The error is backpropagated through the network to calculate gradients.
    • Optimize Step: The model’s parameters are updated based on the gradients using an optimizer.
    • Testing / Validation Step: The model’s performance is evaluated against a test or validation dataset.

    8. What is the purpose of torch.nn.Module and torch.nn.Parameter in PyTorch?

    • torch.nn.Module is a base class for creating neural network models. Modules provide a way to organize and group layers and functions, such as linear layers, activation functions, and other model components. It keeps track of learnable parameters.
    • torch.nn.Parameter is a special subclass of torch.Tensor used to represent the learnable parameters of a model. When parameters are assigned as module attributes, PyTorch automatically registers them for gradient tracking and optimization. Setting requires_grad=True on parameters tells PyTorch to calculate and store gradients for them during backpropagation.

    PyTorch: A Deep Learning Framework

    PyTorch is a machine learning framework written in Python that is used for deep learning and other machine learning tasks [1]. The framework is popular for research and allows users to write fast deep learning code that can be accelerated by GPUs [2, 3].

    Key aspects of PyTorch include:

    • Tensors: PyTorch uses tensors as a fundamental building block for numerical data representation. These can be of various types, and neural networks perform mathematical operations on them [4, 5].
    • Neural Networks: PyTorch is often used for building neural networks, including fully connected and convolutional neural networks [6]. These networks are constructed using layers from the torch.nn module [7].
    • GPU Acceleration: PyTorch can leverage GPUs via CUDA to accelerate machine learning code. GPUs are fast at numerical calculations, which are very important in deep learning [8-10].
    • Flexibility: The framework allows for customization, and users can combine layers in different ways to build various kinds of neural networks [6, 11].
    • Popularity: PyTorch is a popular research machine learning framework, with 58% of papers with code implemented using PyTorch [2, 12, 13]. It is used by major organizations such as Tesla, OpenAI, Facebook, and Microsoft [14-16].

    The typical workflow when using PyTorch for deep learning includes:

    • Data Preparation: The first step is getting the data ready, which can involve numerical encoding, turning the data into tensors, and loading the data [17-19].
    • Model Building: PyTorch models are built using the nn.Module class as a base and defining the forward computation [20-23]. This includes choosing appropriate layers and defining their interconnections [11].
    • Model Fitting: The model is fitted to the data using an optimization loop and a loss function [19]. This involves calculating gradients using back propagation and updating model parameters using gradient descent [24-27].
    • Model Evaluation: Model performance is evaluated by measuring how well the model performs on unseen data, using metrics such as accuracy and loss [28].
    • Saving and Loading: Trained models can be saved and reloaded using the torch.save, torch.load, and torch.nn.Module.load_state_dict functions [29, 30].

    Some additional notes on PyTorch include:

    • Reproducibility: Randomness is important in neural networks; it’s necessary to set random seeds to ensure reproducibility of experiments [31, 32].
    • Device Agnostic Code: It’s useful to write device agnostic code, which means code that can run on either a CPU or a GPU [33, 34].
    • Integration: PyTorch integrates well with other libraries, such as NumPy, which is useful for pre-processing and other numerical tasks [35, 36].
    • Documentation: The PyTorch website and documentation serve as the primary resource for learning about the framework [2, 37, 38].
    • Community Support: Online forums and communities provide places to ask questions and share code [38-40].

    Overall, PyTorch is a very popular and powerful tool for deep learning and machine learning [2, 12, 13]. It provides tools to enable users to build, train, and deploy neural networks with ease [3, 16, 41].

    Understanding Machine Learning Models

    Machine learning models learn patterns from data, which is converted into numerical representations, and then use these patterns to make predictions or classifications [1-4]. The models are built using code and math [1].

    Here are some key aspects of machine learning models based on the sources:

    • Data Transformation: Machine learning models require data to be converted into numbers, a process sometimes called numerical encoding [1-4]. This can include images, text, tables of numbers, audio files, or any other type of data [1].
    • Pattern Recognition: After data is converted to numbers, machine learning models use algorithms to find patterns in that data [1, 3-5]. These patterns can be complex and are often not interpretable by humans [6, 7]. The models can learn patterns through code, using algorithms to find the relationships in the numerical data [5].
    • Traditional Programming vs. Machine Learning: In traditional programming, rules are hand-written to manipulate input data and produce desired outputs [8]. In contrast, machine learning algorithms learn these rules from data [9, 10].
    • Supervised Learning: Many machine learning algorithms use supervised learning. This involves providing input data along with corresponding output data (features and labels), and then the algorithm learns the relationships between the inputs and outputs [9].
    • Parameters: Machine learning models learn parameters that represent the patterns in the data [6, 11]. Parameters are values that the model sets itself [12]. These are often numerical and can be large, sometimes numbering in the millions or even trillions [6].
    • Explainability: The patterns learned by a deep learning model are often uninterpretable by a human [6]. Sometimes, these patterns are lists of numbers in the millions, which is difficult for a person to understand [6, 7].
    • Model Evaluation: The performance of a machine learning model can be evaluated by making predictions and comparing those predictions to known labels or targets [13-15]. The goal of training a model is to move from some unknown parameters to a better, known representation of the data [16]. The loss function is used to measure how wrong a model’s predictions are compared to the ideal predictions [17].
    • Model Types: Machine learning models include:
    • Linear Regression: Models which use a linear formula to draw patterns in data [18]. These models use parameters such as weights and biases to perform forward computation [18].
    • Neural Networks: Neural networks are the foundation of deep learning [19]. These are typically used for unstructured data such as images [19, 20]. They use a combination of linear and non-linear functions to draw patterns in data [21-23].
    • Convolutional Neural Networks (CNNs): These are a type of neural network often used for computer vision tasks [19, 24]. They process images through a series of layers, identifying spatial features in the data [25].
    • Gradient Boosted Machines: Algorithms such as XGBoost are often used for structured data [26].
    • Use Cases: Machine learning can be applied to virtually any problem where data can be converted into numbers and patterns can be found [3, 4]. However, simple rule-based systems are preferred if they can solve a problem, and machine learning should not be used simply because it can [5, 27]. Machine learning is useful for complex problems with long lists of rules [28, 29].
    • Model Training: The training process is iterative and involves multiple steps, and it can also be seen as an experimental process [30, 31]. In each step, the machine learning model is used to make predictions and its parameters are adjusted to minimize error [13, 32].

    In summary, machine learning models are algorithms that can learn patterns from data by converting the data into numbers, using various algorithms, and adjusting parameters to improve performance. Models are typically evaluated against known data with a loss function, and there are many types of models and use cases depending on the type of problem [6, 9-11, 13, 32].

    Understanding Neural Networks

    Neural networks are a type of machine learning model inspired by the structure of the human brain [1]. They are comprised of interconnected nodes, or neurons, organized in layers, and they are used to identify patterns in data [1-3].

    Here are some key concepts for understanding neural networks:

    • Structure:
    • Layers: Neural networks are made of layers, including an input layer, one or more hidden layers, and an output layer [1, 2]. The ‘deep’ in deep learning comes from having multiple hidden layers [1, 4].
    • Nodes/Neurons: Each layer is composed of nodes or neurons [4, 5]. Each node performs a mathematical operation on the input it receives.
    • Connections: Nodes in adjacent layers are connected, and these connections have associated weights that are adjusted during the learning process [6].
    • Architecture: The arrangement of layers and connections determines the neural network’s architecture [7].
    • Function:
    • Forward Pass: In a forward pass, input data is passed through the network, layer by layer [8]. Each layer performs mathematical operations on the input, using linear and non-linear functions [5, 9].
    • Mathematical Operations: Each layer is typically a combination of linear (straight line) and nonlinear (non-straight line) functions [9].
    • Nonlinearity: Nonlinear functions, such as ReLU or sigmoid, are critical for enabling the network to learn complex patterns [9-11].
    • Representation Learning: The network learns a representation of the input data by manipulating patterns and features through its layers [6, 12]. This representation is also called a weight matrix or weight tensor [13].
    • Output: The output of the network is a representation of the learned patterns, which can be converted into a human-understandable format [12-14].
    • Learning Process:
    • Random Initialization: Neural networks start with random numbers as parameters, and they adjust those numbers to better represent the data [15, 16].
    • Loss Function: A loss function is used to measure how wrong the model’s predictions are compared to ideal predictions [17-19].
    • Backpropagation: Backpropagation is an algorithm that calculates the gradients of the loss with respect to the model’s parameters [20].
    • Gradient Descent: Gradient descent is an optimization algorithm used to update model parameters to minimize the loss function [20, 21].
    • Types of Neural Networks:
    • Fully Connected Neural Networks: These networks have connections between all nodes in adjacent layers [1, 22].
    • Convolutional Neural Networks (CNNs): CNNs are particularly useful for processing images and other visual data, and they use convolutional layers to identify spatial features [1, 23, 24].
    • Recurrent Neural Networks (RNNs): These are often used for sequence data [1, 25].
    • Transformers: Transformers have become popular in recent years and are used in natural language processing and other applications [1, 25, 26].
    • Customization: Neural networks are highly customizable, and they can be designed in many different ways [4, 25, 27]. The specific architecture and layers used are often tailored to the specific problem at hand [22, 24, 26-28].

    Neural networks are a core component of deep learning, and they can be applied to a wide range of problems including image recognition, natural language processing, and many others [22, 23, 25, 26]. The key to using neural networks effectively is to convert data into a numerical representation, design a network that can learn patterns from the data, and use optimization techniques to train the model.

    Machine Learning Model Training

    The model training process in machine learning involves using algorithms to adjust a model’s parameters so it can learn patterns from data and make accurate predictions [1, 2]. Here’s an overview of the key steps in training a model, according to the sources:

    • Initialization: The process begins with a model that has randomly assigned parameters, such as weights and biases [1, 3]. These parameters are what the model adjusts during training [4, 5].
    • Data Input: The training process requires input data to be passed through the model [1]. The data is typically split into a training set for learning and a test set for evaluation [6].
    • Forward Pass: Input data is passed through the model, layer by layer [7]. Each layer performs mathematical operations on the input, which may include both linear and nonlinear functions [8]. This forward computation produces a prediction, called the model’s output or sometimes logits [9, 10].
    • Loss Calculation: A loss function is used to measure how wrong the model’s predictions are compared to the ideal outputs [4, 11]. The loss function provides a numerical value that represents the error or deviation of the model’s predictions from the actual values [12]. The goal of the training process is to minimize this loss [12, 13].
    • Backpropagation: After the loss is calculated, the backpropagation algorithm computes the gradients of the loss with respect to the model’s parameters [2, 14, 15]. Gradients indicate the direction and magnitude of the change needed to reduce the loss [1].
    • Optimization: An optimizer uses the calculated gradients to update the model’s parameters [4, 11, 16]. Gradient descent is a commonly used optimization algorithm that adjusts the parameters to minimize the loss [1, 2, 15]. The learning rate is a hyperparameter that determines the size of the adjustments [5, 17].
    • Training Loop: The process of forward pass, loss calculation, backpropagation, and optimization is repeated iteratively through a training loop [11, 17, 18]. The training loop is where the model learns patterns on the training data [19]. Each iteration of the loop is called an epoch [20].
    • Evaluation: After training, the model’s performance is evaluated on a separate test data set [19]. This evaluation helps to measure how well the model has learned and whether it can generalize to unseen data [21].

    In PyTorch, the training loop typically involves these steps:

    1. Setting the model to training mode using model.train() [22, 23]. This enables training-specific behavior such as dropout, while gradients are tracked on the model’s parameters so they can be used to update the model’s parameters [23].
    2. Performing a forward pass by passing the data through the model.
    3. Calculating the loss by comparing the model’s prediction with the actual data labels.
    4. Setting gradients to zero using optimizer.zero_grad() [24].
    5. Performing backpropagation using loss.backward() [15, 24].
    6. Updating the model’s parameters using optimizer.step() [24].
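
    Putting the six steps together, a minimal training-loop sketch might look like this (the toy model, data, loss function, and optimizer are illustrative assumptions):

    import torch
    from torch import nn

    model = nn.Linear(in_features=1, out_features=1)            # toy model
    loss_fn = nn.MSELoss()                                      # loss function
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)    # optimizer
    X, y = torch.rand(10, 1), torch.rand(10, 1)                 # toy data

    for epoch in range(100):
        model.train()                # 1. training mode
        y_pred = model(X)            # 2. forward pass
        loss = loss_fn(y_pred, y)    # 3. calculate the loss
        optimizer.zero_grad()        # 4. zero gradients from the previous step
        loss.backward()              # 5. backpropagation
        optimizer.step()             # 6. update the parameters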

    During training, models can have two modes: train and evaluation.

    • The train mode tracks gradients and other settings to learn from the data [22, 23].
    • The evaluation mode (model.eval()) turns off settings not needed for evaluation, such as dropout; pairing it with torch.inference_mode() also turns off gradient tracking to make the code run faster [25, 26].

    Other key points about the model training process are:

    • Hyperparameters: The training process involves the use of hyperparameters, which are values set by the user, like the learning rate or the number of epochs [5, 23].
    • Experimentation: Model training is often an experimental process, with various parameters and settings being tried to find the best performing model [27, 28].
    • Data: The quality and quantity of the training data has a big effect on the model’s performance [29, 30].
    • Reproducibility: Randomness is an inherent part of training; to reproduce results, it is necessary to set random seeds (see the snippet after this list) [31, 32].
    • Visualization: Visualizing model training through metrics such as accuracy and loss curves is important in understanding whether the model is learning effectively [33-35].
    • Inference: When making predictions after training, the term inference is also used [36]. Inference uses a model to make predictions using unseen data [26, 36].
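
    A minimal seed-setting sketch (the seed value 42 is arbitrary):

    import torch

    torch.manual_seed(42)         # seed the CPU random number generator
    torch.cuda.manual_seed(42)    # seed the GPU generator too, if CUDA is used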

    In summary, the model training process in machine learning involves iteratively adjusting a model’s parameters to minimize error by using the techniques of gradient descent and backpropagation [1, 2, 14, 15].

    PyTorch Model Deployment

    The sources discuss model deployment in the context of saving and loading models, which is a key part of making a model usable in an application or other context. Here’s a breakdown of model deployment methods based on the sources:

    • Saving Models:
    • State Dictionary: The recommended way to save a PyTorch model is to save its state dictionary [1, 2]. The state dictionary contains the model’s learned parameters, such as weights and biases [3, 4]. This is more flexible than saving the entire model [2].
    • File Extension: PyTorch models are commonly saved with a .pth or .pt file extension [5].
    • Saving Process: The saving process involves creating a directory path, defining a model name, and then using torch.save() to save the state dictionary to the specified file path [6, 7].
    • Flexibility: Saving the state dictionary provides flexibility in how the model is loaded and used [8].
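
    A minimal saving sketch following the steps above (the directory, file name, and toy model are illustrative assumptions):

    from pathlib import Path
    import torch
    from torch import nn

    model = nn.Linear(1, 1)                         # stand-in for a trained model

    MODEL_PATH = Path("models")                     # create a directory path
    MODEL_PATH.mkdir(parents=True, exist_ok=True)
    MODEL_SAVE_PATH = MODEL_PATH / "model_0.pth"    # define a model name

    torch.save(obj=model.state_dict(), f=MODEL_SAVE_PATH)    # save the state dictionary
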
    • Loading Models:
    • Loading State Dictionary: To load a saved model, you must create a new instance of the model class and then load the saved state dictionary into that instance [4]. This is done using the load_state_dict() method, along with torch.load(), which reads the file containing the saved state dictionary [9, 10].
    • New Instance: When loading a model, it’s important to remember that you must create a new instance of the model class, and then load the saved parameters into that instance using the load_state_dict method [4, 9, 11].
    • Loading Process: The loading process involves creating a new instance of the model and then calling load_state_dict on the model with the file path to the saved model [12].
    • Inference Mode:
    • Evaluation Mode: Before loading a model for use, the model is typically set to evaluation mode by calling model.eval() [13, 14]. This turns off settings not needed for evaluation, such as dropout layers [15-17].
    • Gradient Tracking: It is also common to use inference mode via the context manager torch.inference_mode to turn off gradient tracking, which speeds up the process of making predictions [18-21]. This is used when you are not training the model, but rather using it to make predictions [19].
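
    A minimal loading-and-prediction sketch continuing the toy example above (the file path and input are illustrative assumptions):

    import torch
    from torch import nn

    # A new instance of the same model class must be created first;
    # here that is the toy nn.Linear model from the saving sketch.
    loaded_model = nn.Linear(1, 1)
    loaded_model.load_state_dict(torch.load(f="models/model_0.pth"))    # load saved parameters
    loaded_model.eval()                           # evaluation mode: dropout etc. disabled

    with torch.inference_mode():                  # gradient tracking off for faster predictions
        preds = loaded_model(torch.rand(5, 1))    # predictions on unseen data
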
    • Deployment Context:
    • Reusability: The sources mention that a saved model can be reused in the same notebook, sent to a friend to try out, or used again in a week’s time [22].
    • Cloud Deployment: Models can be deployed in applications or in the cloud [23].
    • Model Transfer:
    • Transfer Learning: The source mentions that parameters from one model could be used in another model; this process is called transfer learning [24].
    • Other Considerations:
    • Device Agnostic Code: It is recommended to write code that is device agnostic, so it can run on either a CPU or a GPU [25-27].
    • Reproducibility: Random seeds should be set for reproducibility [28, 29].
    • Model Equivalence: After loading a model, it is important to test that the loaded model is equivalent to the original model by comparing predictions [14, 30-32].

    In summary, model deployment involves saving the trained model’s parameters via its state dictionary, loading those parameters into a new model instance, and running the model in evaluation mode with inference mode enabled (gradient tracking off) to make predictions. The sources emphasize the importance of saving models for later use, sharing them, and deploying them in applications or cloud environments.

    PyTorch for Deep Learning & Machine Learning – Full Course

    By Amjad Izhar
    Contact: amjad.izhar@gmail.com
    https://amjadizhar.blog

  • Harvard CS50’s Artificial Intelligence with Python – Full University Course

    Harvard CS50’s Artificial Intelligence with Python – Full University Course

    This source explains how AI can be used for problem-solving, moving from explicit instructions to learning from data. It introduces supervised learning, where AI learns to map inputs to outputs using labeled datasets, covering classification tasks and nearest neighbor algorithms. The source also discusses linear regression, support vector machines, and techniques like perceptron learning. It transitions to reinforcement learning, where AI learns through rewards and punishments in an environment, and touches on unsupervised learning with clustering techniques like k-means. Finally, the document explores neural networks, detailing their structure, training via gradient descent and backpropagation, and their applications in various AI problems.

    Propositional Logic, Model Checking, and Beyond: A Comprehensive Study Guide

    I. Review of Key Concepts

    • Propositional Logic: A system for representing logical statements and reasoning about their truth values.
    • Propositional Symbols: Variables representing simple statements that can be either true or false (e.g., P, Q, R).
    • Logical Connectives: Symbols used to combine propositional symbols into more complex statements:
    • and (∧): Both statements must be true for the combined statement to be true.
    • or (∨): At least one statement must be true for the combined statement to be true.
    • not (¬): Reverses the truth value of a statement.
    • implies (→): If the first statement is true, then the second statement must also be true.
    • biconditional (↔): Both statements have the same truth value (both true or both false).
    • Knowledge Base (KB): A set of sentences representing facts known about the world.
    • Query (α): A question about the world that we want to answer using the KB.
    • Entailment (KB ⊨ α): The relationship between the KB and a query, meaning that the KB logically implies the query; whenever the KB is true, the query must also be true.
    • Model: An assignment of truth values (true or false) to all propositional symbols in the language. Represents a possible world or state.
    • Model Checking: An algorithm for determining entailment by enumerating all possible models and checking if, in every model where the KB is true, the query is also true.
    • Inference Algorithm: A procedure to derive new sentences from existing ones in the KB.
    • Inference Rules: Logical equivalences used to manipulate and simplify logical expressions (e.g., implication elimination, De Morgan’s laws, distributive law).
    • Soundness: An inference algorithm is sound if it only derives conclusions that are entailed by the KB.
    • Completeness: An inference algorithm is complete if it can derive all conclusions that are entailed by the KB.
    • Conjunctive Normal Form (CNF): A logical sentence expressed as a conjunction (AND) of clauses, where each clause is a disjunction (OR) of literals.
    • Clause: A disjunction of literals (e.g., P or not Q or R).
    • Literal: A propositional symbol or its negation (e.g., P, not Q).
    • Resolution: An inference rule that combines two clauses containing complementary literals to produce a new clause.
    • Factoring: Removing duplicate literals within a clause.
    • Empty Clause: The result of resolving two contradictory clauses, representing a contradiction (always false).
    • Inference by Resolution: An algorithm for proving entailment by converting the KB and the negation of the query to CNF, and then repeatedly applying the resolution rule until the empty clause is derived.
    • Joint Probability Distribution: A table showing the probabilities of all possible combinations of values for a set of random variables.
    • Inclusion-Exclusion Formula: A formula for calculating the probability of A or B: P(A or B) = P(A) + P(B) – P(A and B).
    • Marginalization: Calculating the probability of a variable by summing over all possible values of other variables: P(A) = Σ_b P(A and B=b).
    • Conditioning: Expressing the probability of A in terms of the conditional probability of A given B and the probability of B: P(A) = P(A|B) * P(B) + P(A|¬B) * P(¬B).
    • Conditional Probability: The probability of event A occurring given that event B has already occurred, denoted P(A|B).
    • Random Variable: A variable whose value is a numerical outcome of a random phenomenon.
    • Heuristic Function: An estimate of the “goodness” of a state (e.g., the distance to the goal).
    • Local Search: A class of optimization algorithms that start with an initial state and iteratively improve it by moving to neighboring states.
    • Hill Climbing: A local search algorithm that repeatedly moves to the neighbor with the highest value.
    • Steepest Ascent Hill Climbing: Chooses the best neighbor among all neighbors in each iteration.
    • Stochastic Hill Climbing: Chooses a neighbor randomly from the neighbors that are better than the current state.
    • First Choice Hill Climbing: Chooses the first neighbor with a higher value and moves there.
    • Random Restart Hill Climbing: Runs hill climbing multiple times with different initial states and returns the best result.
    • Local Beam Search: Keeps track of k best states and expands all of them in each iteration.
    • Local Maximum/Minimum: A state that is better than all its neighbors but not the best state overall.
    • Simulated Annealing: A local search algorithm that sometimes accepts worse neighbors with a probability that decreases over time (temperature).
    • Temperature (in Simulated Annealing): A parameter that controls the probability of accepting worse neighbors; high temperature means higher probability, and low temperature means lower probability.
    • Delta E (ΔE): The difference in value (or cost) between the current state and a neighboring state.
    • Traveling Salesman Problem (TSP): Finding the shortest possible route that visits every city and returns to the origin city.
    • NP-Complete Problems: A class of problems for which no known polynomial-time algorithm exists.
    • Linear Programming: A mathematical technique for optimizing a linear objective function subject to linear equality and inequality constraints.
    • Objective Function: A mathematical expression to be minimized or maximized in linear programming.
    • Constraints: Restrictions or limitations on the values of variables in linear programming.
    • Constraint Satisfaction Problem (CSP): A problem where the goal is to find values for a set of variables that satisfy a set of constraints.
    • Variables (in CSP): Entities with associated domains of possible values.
    • Domains (in CSP): The set of possible values that can be assigned to a variable.
    • Constraints (in CSP): Restrictions on the values that variables can take, specifying allowable combinations of values.
    • Unary Constraint: A constraint involving only one variable.
    • Binary Constraint: A constraint involving two variables.
    • Node Consistency: Ensuring that all values in a variable’s domain satisfy the variable’s unary constraints.
    • Arc Consistency: Ensuring that for every value in a variable’s domain, there exists a consistent value in the domain of each of its neighboring variables.
    • AC3: A common algorithm for enforcing arc consistency.
    • Backtracking Search: A recursive algorithm that explores possible solutions by trying different values for variables and backtracking when a constraint is violated.
    • Minimum Remaining Values (MRV) Heuristic: A variable selection strategy that chooses the variable with the fewest remaining legal values.
    • Degree Heuristic: A variable selection strategy that chooses the variable involved in the largest number of constraints on other unassigned variables.
    • Least Constraining Value Heuristic: A value selection strategy that chooses the value that rules out the fewest choices for neighboring variables in the constraint graph.
    • Supervised Machine Learning: A type of machine learning where an algorithm learns from labeled data to make predictions or classifications.
    • Inputs (x): The features or attributes used by a machine learning model to make predictions.
    • Outputs (y): The target variables or labels that a machine learning model is trained to predict.
    • Hypothesis Function (h): A mathematical function that maps inputs to outputs.
    • Weights (w): Parameters in a machine learning model that determine the importance of each input feature.
    • Learning Rate (α): A parameter that controls the step size during training.
    • Threshold Function: A function that outputs one value if the input is above a threshold and another value if the input is below the threshold.
    • Logistic Regression: A statistical method for binary classification using a logistic function to model the probability of a certain class or event.
    • Soft Threshold: A function that smoothly transitions between two values, allowing for outputs between 0 and 1.
    • Dot Product: A mathematical operation that multiplies corresponding elements of two vectors and sums the results.
    • Gradient Descent: An iterative optimization algorithm for finding the minimum of a function.
    • Stochastic Gradient Descent: An optimization algorithm that updates the parameters of a machine learning model using the gradient computed from a single randomly chosen data point.
    • Mini-Batch Gradient Descent: An optimization algorithm that updates the parameters of a machine learning model using the gradient computed from a small batch of data points.
    • Neural Networks: A type of machine learning model inspired by the structure of the human brain, consisting of interconnected nodes (neurons) organized in layers.
    • Activation Function: A function applied to the output of a neuron in a neural network to introduce non-linearity.
    • Layers (in Neural Networks): A level of nodes that receive input from other nodes and pass their output to additional nodes.
    • Natural Language Processing (NLP): The branch of AI that deals with the interaction between computers and human language.
    • Syntax: The set of rules that govern the structure of sentences in a language.
    • Semantics: The meaning of words, phrases, and sentences in a language.
    • Formal Grammar: A set of rules for generating sentences in a language.
    • Context-Free Grammar: A type of formal grammar where rules consist of a single non-terminal symbol on the left-hand side.
    • Terminal Symbol: A symbol that represents a word in a language.
    • Non-Terminal Symbol: A symbol that represents a phrase or category of words in a language.
    • Rewriting Rules: Rules that specify how non-terminal symbols can be replaced by other symbols.
    • Noun Phrase: A phrase that functions as a noun.
    • Verb Phrase: A phrase that functions as a verb.
    • Natural Language Toolkit (NLTK): A Python library for NLP.
    • Parsing: The process of analyzing a sentence according to the rules of a grammar.
    • Syntax Tree: A hierarchical representation of the structure of a sentence.
    • Statistical NLP: An approach to NLP that uses statistical models learned from data.
    • n-gram: A contiguous sequence of n items from a sample of text.
    • Markov Chain: A sequence of events where the probability of each event depends only on the previous event.
    • Tokenization: The process of splitting a sequence of characters into pieces (tokens).
    • Text Classification: The task of assigning a category label to a text.
    • Sentiment Analysis: Determining the emotional tone or attitude expressed in a piece of text.
    • Bag-of-Words Model: A text representation that represents a document as the counts of its words, disregarding grammar and word order.
    • Term Frequency (TF): The number of times a term appears in a document.
    • Inverse Document Frequency (IDF): A measure of how rare a term is across a collection of documents.
    • TF-IDF: A weight used in information retrieval and text mining that reflects how important a word is to a document in a corpus.
    • Stop Words: Common words that are often removed from text before processing.
    • Word Embeddings: Vector representations of words that capture semantic relationships.
    • One-Hot Representation: A vector representation where each word is represented by a vector with a 1 in the corresponding index and 0s elsewhere.
    • Distributed Representation: A vector representation where the meaning of a word is distributed across multiple values.
    • Word2Vec: A model for learning word embeddings.

    II. Short Answer Quiz

    1. Explain the difference between soundness and completeness in the context of inference algorithms. Soundness means that any conclusion drawn by the algorithm is actually entailed by the knowledge base. Completeness means that the algorithm is capable of deriving every conclusion that is entailed by the knowledge base.
    2. Describe the process of converting a logical sentence into Conjunctive Normal Form (CNF). The process involves eliminating bi-conditionals and implications, moving negations inward using De Morgan’s laws, and using the distributive law to get a conjunction of clauses where each clause is a disjunction of literals.
    3. What is the purpose of using the resolution inference rule in propositional logic? The resolution rule is used to derive new clauses from existing ones, aiming to ultimately derive the empty clause, which indicates a contradiction and proves entailment.
    4. Explain the marginalization rule and provide a simple example. Marginalization calculates the probability of a variable by summing over all possible values of other variables. For example, the probability that someone likes ice cream is the probability that they like ice cream and like chocolate, plus the probability that they like ice cream and do not like chocolate.
    5. What is the key idea behind local search algorithms? Local search algorithms start with an initial state and iteratively improve it by moving to neighboring states, based on some evaluation function, without necessarily keeping track of the path taken to reach the solution.
    6. Describe how simulated annealing helps to avoid local optima. Simulated annealing accepts worse neighbors with a probability that decreases over time, allowing the algorithm to escape local optima early in the search and converge towards a global optimum later.
    7. In linear programming, what are the roles of the objective function and constraints? The objective function is what we want to minimize or maximize, while constraints are limitations on the values of variables that must be satisfied.
    8. What is the purpose of enforcing arc consistency in a constraint satisfaction problem (CSP)? Enforcing arc consistency reduces the domains of variables by removing values that cannot be part of any solution due to binary constraints, making the search for a solution more efficient.
    9. Explain the difference between a one-hot representation and a distributed representation in NLP. A one-hot representation represents a word as a vector with a 1 in the corresponding index and 0s elsewhere, while a distributed representation distributes the meaning of a word across multiple values in a vector.
    10. How do word embedding models like Word2Vec capture semantic relationships between words? Word2Vec captures semantic relationships by training a model to predict the context words surrounding a given word in a large corpus, resulting in vector representations where similar words are located close to each other in vector space.

    III. Essay Questions

    1. Compare and contrast model checking and inference by resolution as methods for determining entailment in propositional logic. Discuss the advantages and disadvantages of each approach.
    2. Explain how local search algorithms can be applied to solve optimization problems. Discuss the challenges of local optima and describe techniques, such as simulated annealing, for overcoming these challenges.
    3. Describe the general framework of a constraint satisfaction problem (CSP). Discuss the role of variable and value selection heuristics in improving the efficiency of backtracking search for solving CSPs.
    4. Explain the process of training a machine learning model for sentiment analysis. Discuss the different text representation techniques, such as bag-of-words and TF-IDF, and the role of word embeddings.
    5. Describe the key concepts in Natural Language Processing (NLP), including syntax and semantics. Discuss how NLP techniques are used to understand and generate natural language.

    IV. Glossary of Key Terms

    • Activation Function: A function applied to the output of a neuron in a neural network to introduce non-linearity, enabling the network to learn complex patterns.
    • Arc Consistency: A constraint satisfaction technique ensuring that for every value in a variable’s domain, there exists a consistent value in the domain of each of its neighboring variables based on the problem constraints.
    • Backtracking Search: A recursive algorithm that explores possible solutions by trying different values for variables and backtracking when a constraint is violated, allowing the algorithm to systematically search the solution space.
    • Bag-of-Words Model: A text representation in NLP that represents a document as the counts of its words, disregarding grammar and word order, which helps quantify the content of texts for analysis.
    • Clause: In logic, a statement combining literals with OR; that is, a disjunction of literals.
    • Complete: An inference algorithm that can derive all conclusions entailed by the KB.
    • Conditioning: A probability rule that expresses the probability of an event in terms of conditional probabilities, e.g., P(A) = P(A|B) * P(B) + P(A|¬B) * P(¬B); it is used to find unknown probabilities from the information given.
    • Conjunctive Normal Form (CNF): A standardized logical sentence expressed as a conjunction (AND) of clauses, where each clause is a disjunction (OR) of literals, simplifying logical deductions.
    • Constraints: Restrictions on the values of variables in linear programming or constraint satisfaction problems.
    • Context-Free Grammar: A type of formal grammar where rules consist of a single non-terminal symbol on the left-hand side, used to define the syntax of programming languages.
    • Delta E (ΔE): The difference in value between the current state and its neighboring states.
    • Distributed Representation: A representation in which the meaning of a word is distributed across multiple values in a vector; this is the idea behind word embeddings.
    • Domain: The set of possible values that can be assigned to a variable.
    • Entailment (KB ⊨ α): KB logically implies α; whenever KB is true, α is also true. This is the relationship an inference algorithm checks to decide whether a conclusion follows from what is known.
    • Formal Grammar: A set of rules for generating sentences in a language, applied in language analysis to work out what is being said.
    • Heuristic Function: An estimate of the ‘goodness’ of a state (e.g., the distance to the goal), which lets search algorithms reach good results efficiently.
    • Hill Climbing: An iterative local search algorithm that repeatedly moves from the current state to the neighboring state with the highest value, until no neighbor is better.
    • Hypothesis Function (h): This function maps inputs to outputs and can be used to learn and predict.
    • Inclusion-Exclusion Formula: Used to find P(A or B): compute P(A), P(B), and P(A and B), and take P(A) + P(B) – P(A and B).
    • Inference Algorithm: A procedure to derive new sentences from existing ones in the KB.
    • Joint Probability Distribution: A table showing the probabilities of all possible combinations of values for a set of random variables.
    • Knowledge Base (KB): A set of sentences representing facts known about the world.
    • Layers (in Neural Networks): A level of nodes that receive input from other nodes and pass their output to additional nodes.
    • Learning Rate (α): A hyperparameter that controls the step size of parameter updates during training.
    • Linear Programming: A mathematical technique for optimizing a linear objective function subject to linear equality and inequality constraints.
    • Literal: A propositional symbol or its negation (e.g., P, not Q) that describes the condition of a statement.
    • Local Maximum/Minimum: A state that is better than all its neighbors but not the best state overall.
    • Local Search: A class of optimization algorithms that start with an initial state and iteratively improve it by moving to neighboring states.
    • Logistic Regression: A statistical method for binary classification using a logistic function to model the probability of a certain class or event.
    • Marginalization: Calculating the probability of a variable by summing over all possible values of other variables: P(A) = Σ_b P(A and B=b).
    • Markov Chain: A sequence of events where the probability of each event depends only on the previous event, allowing modeling of sequences over time.
    • Model: An assignment of truth values (true or false) to all propositional symbols in the language, representing a possible state of the world.
    • Model Checking: An algorithm for determining entailment by enumerating all possible models and checking if, in every model where the KB is true, the query is also true.
    • n-gram: A contiguous sequence of n items from a sample of text that helps in analyzing languages and predicting text.
    • Natural Language Processing (NLP): The field of AI that is related to the understanding of human language.
    • Noun Phrase: A phrase that functions as a noun, used in language parsing.
    • NP-Complete Problems: A class of problems for which no known polynomial-time algorithm exists.
    • Objective Function: A mathematical expression to be minimized or maximized in linear programming.
    • One-Hot Representation: A vector representation where each word is represented by a vector with a 1 in the corresponding index and 0s elsewhere.
    • Parsing: The process of analyzing a sentence according to the rules of a grammar.
    • Propositional Logic: A system for representing logical statements and reasoning about their truth values.
    • Query (α): The question that we want to answer using the KB.
    • Random Variable: A variable whose value is a numerical outcome of a random phenomenon.
    • Rewriting Rules: Rules that specify how non-terminal symbols can be replaced by other symbols.
    • Semantics: The meaning of words, phrases, and sentences in a language, which is needed to extract insight and understanding from text.
    • Simulated Annealing: A local search algorithm that sometimes accepts worse neighbors with a probability that decreases over time (temperature).
    • Soft Threshold: A function that smoothly transitions between two values, allowing for outputs between 0 and 1.
    • Soundness: An inference algorithm is sound if it only derives conclusions that are entailed by the KB.
    • Statistical NLP: An approach to NLP that uses statistical models learned from data.
    • Steepest Ascent Hill Climbing: Chooses the best neighbor among all neighbors in each iteration.
    • Stop Words: Common words that are often removed from text before processing.
    • Syntax: The set of rules that govern the structure of sentences in a language.
    • Syntax Tree: A hierarchical representation of the structure of a sentence, giving a graphical view of how the sentence is built.
    • Temperature (in Simulated Annealing): A parameter that controls the probability of accepting worse neighbors; high temperature means higher probability, and low temperature means lower probability.
    • Tokenization: The process of splitting a sequence of characters into pieces (tokens), a prerequisite for parsing and other machine processing of language.
    • Traveling Salesman Problem (TSP): Finding the shortest possible route that visits every city and returns to the origin city.
    • Unary Constraint: A constraint involving only one variable.
    • Verb Phrase: A phrase that functions as a verb, analyzed during parsing.
    • Weights (w): Parameters in a machine learning model that determine the importance (emphasis) placed on each input feature.
    • Word Embeddings: Vector representations of words that capture semantic relationships.
    • Word2Vec: A model for learning word embeddings from the contexts in which words appear, so that words with similar meanings end up with similar vectors.

    AI: Reasoning, Search, NLP, and Learning Techniques

    Here’s a briefing document summarizing the main themes and ideas from the provided sources.

    Briefing Document: Artificial Intelligence – Reasoning, Search, and Natural Language Processing

    Overview:

    The sources cover several fundamental concepts in Artificial Intelligence (AI), including logical reasoning, search algorithms, probabilistic reasoning, and natural language processing (NLP). They explore techniques for representing knowledge, drawing inferences, solving problems through search, handling uncertainty, and enabling computers to understand and generate human language.

    I. Logical Reasoning and Inference:

    • Entailment and Inference Algorithms: The core idea is that AI systems should be able to determine if a knowledge base (KB) entails a query (alpha). This means: “Given some query about the world…the question we want to ask…is does KB, our knowledge base, entail alpha? In other words, using only the information we know inside of our knowledge base…can we conclude that this sentence alpha is true?”
    • Model Checking: This is a basic inference algorithm. It involves enumerating all possible models (assignments of truth values to variables) and checking if, in every model where the knowledge base is true, the query (alpha) is also true. “If we wanted to determine if our knowledge base entails some query alpha, then we are going to enumerate all possible models…And if in every model where our knowledge base is true, alpha is also true, then we know that the knowledge base entails alpha.” (A code sketch appears after this list.)
    • Inference Rules: These are logical transformations used to derive new knowledge from existing knowledge. Examples include:
    • Implication Elimination: alpha implies beta can be transformed into not alpha or beta. “This is a way to translate if-then statements into or statements… if I have the implication, alpha implies beta, that I can draw the conclusion that either not alpha or beta”
    • Biconditional Elimination: a if and only if b becomes a implies b and b implies a.
    • De Morgan’s Laws: These laws relate ANDs and ORs through negation. not (alpha and beta) is equivalent to not alpha or not beta. And not (alpha or beta) is equivalent to not alpha and not beta. “If it is not true that alpha and beta, well, then either not alpha or not beta… if you have a negation in front of an and expression, you move the negation inwards, so to speak…and then flip the and into an or.”
    • Distributive Law: alpha and (beta or gamma) is equivalent to (alpha and beta) or (alpha and gamma).
    • Conjunctive Normal Form (CNF): A standard form for logical sentences where it is represented as a conjunction (AND) of clauses, where each clause is a disjunction (OR) of literals (propositional symbols or their negations). “A conjunctive normal form sentence is a logical sentence that is a conjunction of clauses…a conjunction of clauses means it is an and of individual clauses, each of which has ors in it.”
    • Resolution: An inference rule that applies to clauses in CNF. If you have P or Q and not P or R, you can resolve them to get Q or R. This involves dealing with factoring (removing duplicate literals) and the empty clause (representing a contradiction). “…if I have two clauses where there’s something that conflicts or something complementary between those two clauses, I can resolve them to get a new clause, to draw a new conclusion.”
    • Inference by Resolution: To prove that a knowledge base entails a query (alpha), we assume not alpha and try to derive a contradiction (the empty clause) using resolution. “We want to prove that our knowledge base entails some query alpha…we’re going to try to prove that if we know the knowledge and not alpha, that that would be a contradiction…To determine if our knowledge base entails some query alpha, we’re going to convert knowledge base and not alpha to conjunctive normal form”
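
    As referenced above, here is a minimal model-checking sketch. Encoding sentences as Boolean functions over a model dictionary is an illustrative choice, not something prescribed by the sources:

    from itertools import product

    def model_check(symbols, kb, query):
        """KB entails the query if the query is true in every model where the KB is true."""
        for values in product([True, False], repeat=len(symbols)):
            model = dict(zip(symbols, values))      # one possible world
            if kb(model) and not query(model):
                return False                        # KB holds but the query fails
        return True

    # Example: KB = (P and not Q); query = P
    kb = lambda m: m["P"] and not m["Q"]
    query = lambda m: m["P"]
    print(model_check(["P", "Q"], kb, query))       # True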

    II. Search Algorithms:

    • Search Problems: Defined by an initial state, actions, a transition model, a goal test, and a path cost function.
    • Local Search: Algorithms that operate on a single current state and move to neighbors. They don’t care about the path to the solution.
    • Hill Climbing: A simple local search algorithm that repeatedly moves to the neighbor with the highest value (or lowest cost). It suffers from problems with local maxima/minima. “Generally, what hill climbing is going to do is it’s going to consider the neighbors of that state…and pick the highest one I can…continually looking at all of my neighbors and picking the highest neighbor…until I get to a point…where I consider both of my neighbors and both of my neighbors have a lower value than I do.”
    • Variations: Steepest ascent, stochastic, first choice, random restart, local beam search.
    • Simulated Annealing: A local search algorithm that sometimes accepts worse moves to escape local optima. The probability of accepting a worse move depends on the “temperature” and the difference in cost (delta E). “whereas before, we never, ever wanted to take a move that made our situation worse, now we sometimes want to make a move that is actually going to make our situation worse…And so how do we do that? How do we decide to sometimes accept some state that might actually be worse? Well, we’re going to accept a worse state with some probability.”
    • Linear Programming: A family of problems where the goal is to minimize a cost function subject to linear constraints. “the goal of linear programming is to minimize a cost function…subject to particular constraints, subjects to equations that are of the form like this of some sequence of variables is less than a bound or is equal to some particular value”

    III. Constraint Satisfaction Problems (CSPs):

    • Definition: Problems defined by variables, domains (possible values for each variable), and constraints.
    • Node Consistency: Ensuring that all values in a variable’s domain satisfy the unary constraints (constraints involving only that variable). “…we can pick any of these values in the domain. And there won’t be a unary constraint that is violated as a result of it.”
    • Arc Consistency: Ensuring that all values in a variable’s domain satisfy the binary constraints (constraints involving two variables). “In order to make some variable x arc consistent with respect to some other variable y, we need to remove any element from x’s domain to make sure that every choice for x, every choice in x’s domain, has a possible choice for y.”
    • AC3: An algorithm for enforcing arc consistency across an entire CSP. It maintains a queue of arcs and revises domains to ensure consistency. “AC3 takes a constraint satisfaction problem. And it enforces arc consistency across the entire problem…It’s going to basically maintain a queue or basically just a line of all of the arcs that it needs to make consistent.” (A code sketch appears after this list.)
    • Backtracking Search: A depth-first search algorithm for solving CSPs. It assigns values to variables one at a time, backtracking when a constraint is violated.
    • Minimum Remaining Values (MRV): A heuristic for variable selection that chooses the variable with the fewest remaining legal values in its domain. “Select the variable that has the fewest legal values remaining in its domain…In the example of the classes and the exam slots, you would prefer to choose the class that can only meet on one possible day.”
    • Degree Heuristic: A tie-breaking heuristic for selecting which variable to assign next, preferring the variable involved in the most constraints. “The general approach is that in cases of ties, where two or more of the classes each can only have one possible day of the exam left, we want to choose the one that is involved in the most constraints, the one that we expect to potentially have the bigger impact on the overall problem”
    • Least Constraining Value: A heuristic for value selection that chooses the value that rules out the fewest choices for neighboring variables. “Loop over the values in the domain that we haven’t yet tried and pick the value that rules out the fewest values from the neighboring variables.”
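
    As referenced above, a compact AC3 sketch. Representing domains as sets and constraints as a dictionary mapping each arc (x, y) to an allowed(vx, vy) test is an illustrative assumption:

    from collections import deque

    def revise(domains, constraints, x, y):
        """Remove values from x's domain with no consistent choice in y's domain."""
        revised = False
        for vx in set(domains[x]):                       # copy, since we mutate the domain
            if not any(constraints[(x, y)](vx, vy) for vy in domains[y]):
                domains[x].remove(vx)
                revised = True
        return revised

    def ac3(domains, constraints):
        """Enforce arc consistency across the whole problem."""
        queue = deque(constraints)                       # all arcs to make consistent
        while queue:
            x, y = queue.popleft()
            if revise(domains, constraints, x, y):
                if not domains[x]:
                    return False                         # an emptied domain: no solution
                for (a, b) in constraints:
                    if b == x and a != y:
                        queue.append((a, b))             # recheck arcs pointing into x
        return True

    domains = {"A": {1, 2, 3}, "B": {1, 2, 3}}
    constraints = {("A", "B"): lambda a, b: a < b, ("B", "A"): lambda b, a: b > a}
    ac3(domains, constraints)    # domains become A: {1, 2}, B: {2, 3}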

    IV. Probabilistic Reasoning:

    • Joint Probability Distribution: A table showing the probabilities of all possible combinations of values for a set of random variables.
    • Inclusion-Exclusion Principle: Used to calculate the probability of A or B: P(A or B) = P(A) + P(B) – P(A and B). Deals with the problem of overcounting when calculating probabilities.
    • Marginalization: A rule used to calculate the probability of a variable by summing over all possible values of other variables. “I need to sum up not just over B and not B, but for all of the possible values that the other random variable could take on…I’m going to sum up over j, where j is going to range over all of the possible values that y can take on. Well, let’s look at the probability that x equals xi and y equals yj.”
    • Conditioning: Similar to marginalization, but uses conditional probabilities instead of joint probabilities.

    V. Supervised Learning:

    • Hypothesis Function: A function that maps inputs to outputs. In supervised learning the input consists of a set of labeled data points, each with multiple features and one associated value, or ‘label’. The job of supervised learning is to ‘learn’ a model that correctly maps an input consisting of a data point with multiple features to a corresponding output.
    • Weights: Parameters of the hypothesis function that determine the importance of different input features. “We’ll generally call that number a weight for how important should these variables be in trying to determine the answer.”
    • Threshold Function: A function that outputs one category if the weighted sum of inputs is above a threshold and another category otherwise. “If we do all this math, is it greater than or equal to 0? If so, we might categorize that data point as a rainy day. And otherwise, we might say, no rain.”
    • Logistic Regression: Uses a logistic function (sigmoid) instead of a hard threshold, allowing for probabilistic outputs between 0 and 1. “Instead of using this hard threshold type of function, we can use instead a logistic function…And as a result, the possible output values are no longer just 0 and 1…But you can actually get any real numbered value between 0 and 1.”
    • Gradient Descent: An iterative optimization algorithm used to find the optimal weights for a model by repeatedly updating the weights in the direction of the negative gradient of the cost function. “And we can use gradient descent to train a neural network, that gradient descent is going to tell us how to adjust the weights to try and lower that overall cost on all the data points.” (A code sketch appears after this list.)
    • Stochastic Gradient Descent: Updates the weights based on a single randomly chosen data point at each iteration.
    • Mini-Batch Gradient Descent: Updates the weights based on a small batch of data points at each iteration.
    • Neural Networks: A network of interconnected nodes (neurons) organized in layers. Each connection has a weight. Neural networks take an input and ‘learn’ to modify the weight of each connection to accurately map an input to an output. A simple neural network consists of an input layer and an output layer, while more complex neural networks consist of several hidden layers between input and output. “we create a network of nodes…and if we want, we can connect all of these nodes together such that every node in the first layer is connected to every node in the second layer…And each of these edges has a weight associated with it.”
    • Activation Function: A function applied to the output of each node in a neural network to introduce non-linearity. “You take the inputs, you multiply them by the weights, and then you typically are going to transform that value a little bit using what’s called an activation function.”
    • Multi-Class Classification: A classification problem with more than two categories. Can be handled using neural networks with multiple output nodes, each representing the probability of belonging to a particular class.
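
    As referenced above, a small sketch of logistic regression trained by batch gradient descent (the NumPy implementation and toy data are illustrative assumptions):

    import numpy as np

    def sigmoid(z):                       # the soft threshold: outputs between 0 and 1
        return 1 / (1 + np.exp(-z))

    def gradient_descent(X, y, lr=0.1, epochs=1000):
        """Logistic regression weights fitted by batch gradient descent."""
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            p = sigmoid(X @ w)                 # predicted probabilities
            grad = X.T @ (p - y) / len(y)      # gradient of the log loss
            w -= lr * grad                     # step against the gradient
        return w

    # Toy data: a bias column plus one feature
    X = np.array([[1, 0.0], [1, 1.0], [1, 2.0], [1, 3.0]])
    y = np.array([0, 0, 1, 1])
    print(gradient_descent(X, y))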

    VI. Natural Language Processing (NLP):

    • Syntax: The structure of language.
    • Semantics: The meaning of language. “While syntax is all about the structure of language, semantics is about the meaning of language. It’s not enough for a computer just to know that a sentence is well-structured if it doesn’t know what that sentence means.”
    • Formal Grammar: A system of rules for generating sentences in a language.
    • Context-Free Grammar (CFG): A type of formal grammar that defines rules for rewriting non-terminal symbols into terminal symbols (words) or other non-terminal symbols. “a context-free grammar is some system of rules for generating sentences in a language…We’re going to give the computer some rules that we know about language and have the computer use those rules to make sense of the structure of language.”
    • NLTK (Natural Language Toolkit): A Python library for NLP tasks.
    • N-grams: Contiguous sequences of n items (characters or words) from a sample of text.
    • Tokenization: The process of splitting a sequence of characters into pieces, such as words.
    • Markov Chain: A sequence of values where one value can be predicted based on the preceding values. Can be used for language generation. “Recall that a Markov chain is some sequence of values where we can predict one value based on the values that came before it…we can use that to predict what word might come next in a sequence of words.”
    • Text Classification: The problem of assigning a category or label to a piece of text.
    • Sentiment Analysis: A specific text classification task that involves determining the sentiment (positive, negative, neutral) of a piece of text.
    • Bag of Words: A representation of text as a collection of words, disregarding grammar and word order, but keeping track of word frequencies. “With the bag of words representation, I’m just going to keep track of the count of every single word, which I’m going to call features.”
    • TF-IDF (Term Frequency-Inverse Document Frequency): A weighting scheme that assigns higher weights to words that are frequent in a document but rare in the overall corpus. (A code sketch appears after this list.)
    • One-Hot Representation: A vector representation of a word where one element is 1 and all other elements are 0. “Each of these words now has a distinct vector representation. And this is what we often call a one-hot representation, a representation of the meaning of a word as a vector with a single 1 and all of the rest of the values are 0.”
    • Distributed Representation: A vector representation of a word where the meaning is distributed across multiple values, ideally in such a way that similar words have similar vector representations.
    • Word Embeddings: Distributed representations of words that capture semantic relationships.
    • Word2Vec: A model for generating word embeddings based on the context in which words appear. “we’re going to define the meaning of a word based on the words that appear around it, the context words around it…we’re going to say is because the words breakfast and lunch and dinner appear in a similar context, that they must have a similar meaning.”

    This briefing document provides a high-level overview of the concepts covered in the sources. It highlights key definitions, algorithms, and techniques used in AI.

    NLP, ML, and Problem Solving: FAQ

    Natural Language Processing, Machine Learning and Problem Solving: FAQ

    1. What is the core concept of “entailment” in the context of knowledge bases and inference algorithms, and how does model checking help determine entailment?

    Entailment refers to whether a knowledge base (KB) logically implies a query (alpha). In other words, can you conclude that alpha is true solely based on the information within the KB? Model checking is an algorithm that answers this by enumerating all possible models (assignments of true/false to propositional symbols). If, in every model where the KB is true, alpha is also true, then the KB entails alpha. Essentially, it exhaustively checks if alpha must be true whenever the KB is true.

    2. Explain the model checking algorithm, including how it enumerates models and determines if a knowledge base entails a query.

    The model checking algorithm involves the following steps:

    1. Enumerate all possible models: List every possible combination of truth values (true or false) for all propositional symbols in the knowledge base and query.
    2. Evaluate the knowledge base in each model: Determine if the knowledge base (KB) is true or false in each of the enumerated models.
    3. Check the query in models where the KB is true: For every model where the KB is true, check if the query (alpha) is also true.
    4. Determine entailment:
    • If alpha is true in every model where the KB is true, then the KB entails alpha.
    • If there exists at least one model where the KB is true but alpha is false, then the KB does not entail alpha.

    3. What are inference rules in propositional logic, and give examples of implication elimination, biconditional elimination, and De Morgan’s laws?

    Inference rules are logical equivalences that allow you to transform logical sentences into different, but logically equivalent, forms. This is useful for drawing new conclusions from existing knowledge. Here are some examples:

    • Implication Elimination: alpha implies beta is equivalent to not alpha or beta. This replaces an implication with an OR statement.
    • Biconditional Elimination: alpha if and only if beta is equivalent to (alpha implies beta) and (beta implies alpha). This breaks down a biconditional into two implications.
    • De Morgan’s Laws:not (alpha and beta) is equivalent to not alpha or not beta. The negation of a conjunction is the disjunction of the negations.
    • not (alpha or beta) is equivalent to not alpha and not beta. The negation of a disjunction is the conjunction of the negations.

    4. Describe the conjunctive normal form (CNF) and explain the steps to convert a logical formula into CNF.

    Conjunctive Normal Form (CNF) is a standard logical format where a sentence is represented as a conjunction (AND) of clauses, and each clause is a disjunction (OR) of literals. A literal is either a propositional symbol or its negation. The steps to convert a formula to CNF are:

    1. Eliminate Biconditionals: Replace all alpha <-> beta with (alpha -> beta) ^ (beta -> alpha).
    2. Eliminate Implications: Replace all alpha -> beta with ~alpha v beta.
    3. Move Negations Inwards: Use De Morgan’s laws to move negations inward, so they apply only to literals (e.g., ~ (alpha ^ beta) becomes ~alpha v ~beta).
    4. Distribute ORs over ANDs: Use the distributive law to transform the expression into a conjunction of clauses (e.g., alpha v (beta ^ gamma) becomes (alpha v beta) ^ (alpha v gamma)).
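
    If you want to check a conversion mechanically, the sympy library (an assumption here; the sources do not mention it) provides to_cnf, which applies exactly these steps:

    from sympy.abc import A, B, C
    from sympy.logic.boolalg import Implies, to_cnf

    print(to_cnf(Implies(A, B)))    # equivalent to ~A v B            (implication elimination)
    print(to_cnf(~(A | B)))         # equivalent to ~A ^ ~B           (De Morgan's laws)
    print(to_cnf(A | (B & C)))      # equivalent to (A v B) ^ (A v C) (distribution)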

    5. Explain the resolution inference rule and the resolution algorithm for proving entailment. What is “inference by resolution,” and how does the empty clause relate to contradiction?

    The resolution inference rule states that if you have two clauses, alpha OR beta and ~alpha OR gamma, you can infer beta OR gamma. It essentially eliminates a complementary pair of literals (alpha and ~alpha) and combines the remaining literals into a new clause. “Inference by resolution” uses this rule repeatedly to derive new clauses.

    The resolution algorithm for proving entailment involves:

    1. Negate the query: To prove KB entails alpha, assume ~alpha.
    2. Convert to CNF: Convert KB AND ~alpha into CNF.
    3. Resolution Loop: Repeatedly apply the resolution rule to pairs of clauses in the CNF formula. Add any new clauses generated back into the set of clauses. If factoring is needed, remove any duplicate literals in the resulting clause.
    4. Check for Empty Clause: If, at any point, you derive the “empty clause” (a clause with no literals, representing “false”), this means you’ve found a contradiction.
    5. Determine Entailment: If you derive the empty clause, then KB entails alpha (because KB AND ~alpha leads to a contradiction, so it must be the case that if KB is true, then alpha must be true). If you can no longer derive new clauses and haven’t found the empty clause, then KB does not entail alpha.

    The empty clause signifies a contradiction because it represents a situation where both P and NOT P are true, which is impossible. Finding the empty clause through resolution proves that the initial assumption (the negated query) was inconsistent with the knowledge base.
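
    A minimal sketch of the resolution rule, representing each clause as a frozenset of string literals (an illustrative encoding):

    def resolve(c1, c2):
        """Resolution rule on two clauses; 'P' and '~P' are complementary literals."""
        resolvents = []
        for lit in c1:
            comp = lit[1:] if lit.startswith("~") else "~" + lit
            if comp in c2:
                new = (c1 - {lit}) | (c2 - {comp})    # set union also handles factoring
                resolvents.append(frozenset(new))
        return resolvents

    # (P or Q) resolved with (~P or R) yields (Q or R)
    print(resolve(frozenset({"P", "Q"}), frozenset({"~P", "R"})))
    # Resolving P with ~P yields frozenset(): the empty clause, i.e., a contradiction
    print(resolve(frozenset({"P"}), frozenset({"~P"})))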

    6. Explain the inclusion-exclusion principle and the marginalization rule in probability theory, providing examples of their application.

    • Inclusion-Exclusion Principle: This principle calculates the probability of A OR B. The formula is: P(A or B) = P(A) + P(B) – P(A and B). It is used to correct for overcounting when calculating P(A or B).
    • Example: The probability of rolling a 6 on a red die (A) OR a 6 on a blue die (B). If you just add P(A) + P(B), you’re double-counting the case where both dice show 6. Subtracting P(A and B) (the probability of both dice showing 6) corrects for this.
    • Marginalization Rule: This rule allows you to calculate the probability of one variable (A) by summing over all possible values of another variable (B). The formula is: P(A) = Σ P(A and B).
    • Example: Probability of it being cloudy (A), given the joint distribution of cloudiness and raininess (B). We calculate P(cloudy) by summing P(cloudy and rainy) + P(cloudy and not rainy). We consider every possible case for B and sum the probability of A occurring in each case. This is useful for finding an individual (unconditional) probability from a joint probability distribution.
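
    A tiny sketch of marginalizing a joint distribution (the probabilities are illustrative):

    # Joint distribution over (sky, rain)
    joint = {("cloudy", "rain"): 0.3, ("cloudy", "no rain"): 0.2,
             ("clear", "rain"): 0.1, ("clear", "no rain"): 0.4}

    # Marginalization: P(cloudy) = P(cloudy and rain) + P(cloudy and no rain)
    p_cloudy = sum(p for (sky, _), p in joint.items() if sky == "cloudy")
    print(p_cloudy)    # 0.5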

    7. Describe the hill climbing algorithm, including its pseudocode, and discuss its limitations (local optima). Also explain variations like stochastic hill climbing and random restart hill climbing.

    The hill climbing algorithm is a local search technique used to find a maximum (or minimum) of a function. Its pseudocode is as follows:

    1. Start with a current state (often random).
    2. Loop:
    a. Find the neighbor of the current state with the highest (or lowest) value.
    b. If the neighbor is better than the current state, move to the neighbor (current = neighbor).
    c. If the neighbor is not better, terminate and return the current state.

    A major limitation of hill climbing is that it can get stuck in local optima: points that are better than their immediate neighbors but not the best overall solution.

    Variations:

    • Stochastic Hill Climbing: Randomly choose a neighbor with a better value, rather than always picking the best neighbor. This can help escape plateaus (areas of the search space with relatively equal value), but it will not always escape a local optimum.
    • Random Restart Hill Climbing: Run the hill climbing algorithm multiple times from different random starting states. Keep track of the best solution found across all runs. This increases the chance of finding the global optimum by exploring different regions of the search space.
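
    A minimal sketch of hill climbing with random restarts (the toy objective, neighbor function, and random starting states are illustrative assumptions):

    import random

    def hill_climb(initial, neighbors, value):
        """Basic hill climbing: move to the best neighbor until none is better."""
        current = initial
        while True:
            best = max(neighbors(current), key=value, default=current)
            if value(best) <= value(current):
                return current            # local optimum reached
            current = best

    def random_restart(runs, random_state, neighbors, value):
        """Run hill climbing from several random starts; keep the best result."""
        return max((hill_climb(random_state(), neighbors, value) for _ in range(runs)),
                   key=value)

    # Toy objective with a local optimum at x = 8 and a global optimum at x = 3
    value = lambda x: -(x - 3) ** 2 if x < 6 else -(x - 8) ** 2 - 1
    neighbors = lambda x: [x - 1, x + 1]
    print(random_restart(10, lambda: random.randint(0, 10), neighbors, value))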

    8. Explain the simulated annealing algorithm and how it can potentially escape local optima compared to simple hill climbing.

    Simulated Annealing is a metaheuristic optimization algorithm that can be used to find the global minimum of a function that may have several local minima. Simulated annealing works by first randomly picking a state. The algorithm calculates the cost of that state, then generates a neighboring state and calculates its cost as well. If the neighbor’s cost is better, the neighbor becomes the new current state. However, simulated annealing adds a twist: even if the neighbor’s cost is not better than the current state’s, there is still some probability of moving to the worse neighbor, to try to dislodge yourself from a local optimum.

    This probability is based on a temperature. At the beginning, the temperature is high so there is a better probability to dislodge yourself and explore the search space even if it may lead to worse results at first. As the algorithm iterates, the temperature starts to go down, so it slowly starts to look for better neighbors instead of just exploring and dislodging.

    Simulated annealing is thus better than simple hill climbing because hill climbing never moves to a state that is worse, and as a result it gets stuck in local optima, a problem simulated annealing is designed to avoid.
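
    A minimal simulated-annealing sketch along these lines (the cooling schedule, toy cost function, and neighbor function are illustrative assumptions):

    import math
    import random

    def simulated_annealing(initial, neighbor, cost, t0=10.0, cooling=0.99, steps=10_000):
        """Accept worse neighbors with probability exp(-delta_e / t);
        the temperature t decreases over time, so exploration fades out."""
        current, t = initial, t0
        for _ in range(steps):
            candidate = neighbor(current)
            delta_e = cost(candidate) - cost(current)
            if delta_e < 0 or random.random() < math.exp(-delta_e / t):
                current = candidate       # accept: better, or worse with some probability
            t *= cooling                  # cool down
        return current

    # Toy: minimize a bumpy 1-D cost function
    cost = lambda x: (x - 2) ** 2 + 3 * math.sin(3 * x)
    print(simulated_annealing(0.0, lambda x: x + random.uniform(-1, 1), cost))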

    Supervised Learning: Classification, Regression, and Evaluation

    Supervised learning is a type of machine learning where a computer is given access to a dataset of input-output pairs and learns a function that maps inputs to outputs. The computer uses the data to train its model and understand the relationships between inputs and outputs. The goal is for the AI to learn to predict outputs based on new input data.

    Key aspects of supervised learning:

    • Input-output pairs: The computer is provided with a dataset where each data point consists of an input and a corresponding desired output.
    • Function mapping: The goal is to find a function that accurately maps inputs to outputs, allowing the computer to make predictions on new, unseen data.
    • Training: The computer uses the provided data to train its model, adjusting its internal parameters to minimize the difference between its predictions and the actual outputs.

    Classification and regression are two common tasks within supervised learning.

    • Classification: Aims to map inputs into discrete categories. An example would be classifying a banknote as authentic or counterfeit based on its features.
    • Regression: Aims to predict continuous output values. For example, predicting sales based on advertising spending.

    Implementation and evaluation

    • Libraries such as Scikit-learn in Python provide tools to implement supervised learning algorithms.
    • The data is typically split into training and testing sets. The model is trained on the training set and evaluated on the testing set to assess its ability to generalize to new data.
    • Holdout cross-validation splits the data once into a training set and a testing set: the training set is used to fit the model, and the testing set measures how well the trained model performs on unseen data.
    • K-fold cross-validation divides the data into k folds and runs k experiments, each time holding out a different fold for testing and training on the remaining folds, then averaging the results.
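
    A short scikit-learn sketch of both evaluation schemes; the iris dataset and the k-nearest-neighbors model are stand-ins chosen purely for illustration:

        from sklearn.datasets import load_iris
        from sklearn.model_selection import train_test_split, cross_val_score
        from sklearn.neighbors import KNeighborsClassifier

        X, y = load_iris(return_X_y=True)

        # Holdout cross-validation: a single train/test split
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
        model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
        print("holdout accuracy:", model.score(X_test, y_test))

        # K-fold cross-validation: k different experiments (here k = 5)
        scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=5)
        print("5-fold accuracies:", scores)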

    Machine Learning: Algorithms, Techniques, and Applications

    Machine learning involves enabling computers to learn from data and experiences without explicit instructions. Instead of programming a computer with explicit rules, machine learning allows the computer to learn patterns from data and improve its performance on a specific task.

    Key aspects of machine learning:

    • Learning from Data: Machine learning algorithms use data to identify patterns, make predictions, and improve decision-making.
    • Algorithms and Techniques: Machine learning encompasses a wide range of algorithms and techniques that enable computers to learn from data.
    • Pattern Recognition: Machine learning algorithms identify underlying patterns and relationships within data.

    Machine learning comes in different forms, including supervised learning, reinforcement learning and unsupervised learning.

    • Supervised learning involves training a model on a labeled dataset consisting of input-output pairs, enabling the model to learn a function that maps inputs to outputs.
    • Reinforcement learning involves training an agent to make decisions in an environment to maximize a reward signal.
    • Unsupervised learning involves discovering patterns and relationships in unlabeled data without explicit guidance. Clustering is a task performed in unsupervised learning that involves organizing a set of objects into distinct clusters or groups of similar objects.

    Neural networks are a popular tool in machine learning: mathematical models for learning inspired by the structure of biological neural networks. Artificial neural networks can model mathematical functions, learn their parameters from data, and be very effective at certain tasks.

    TensorFlow is a library that can be used for creating neural networks, modeling them, and running them on sample data.

    Machine learning has a wide variety of applications, including recognizing faces in photos, playing games, understanding human language, spam detection, search and optimization problems, and more.

    Neural Networks: Models, Training, and Applications

    Neural networks are a popular tool in modern machine learning that draw inspiration from the way human brains learn and reason. They are a type of model that learns, from a set of input data, how to calculate some function from inputs to outputs.

    Key aspects of neural networks:

    • Mathematical Model: A neural network is a mathematical model for learning inspired by biological neural networks.
    • Units: Instead of biological neurons, neural networks use units inside of the network. The units can be represented as nodes in a graph.
    • Layers: Neural networks are composed of multiple layers of interconnected nodes or units, including an input layer, one or more hidden layers, and an output layer.
    • Weights: Connections between units are defined by weights. The weights determine how signals are passed between connected nodes.
    • Activation Functions: Activation functions introduce non-linearity into the network, allowing it to learn complex patterns and relationships in the data.
    • Backpropagation: Backpropagation is a key algorithm that makes training multi-layered neural networks possible. The backpropagation algorithm is used to adjust the weights in the network during training to minimize the difference between predicted and actual outputs.
    • Versatility: Neural networks are versatile tools applicable to a number of domains.

    There are different types of neural networks, each designed for specific tasks:

    • Feed-forward neural networks have connections that only move in one direction. The inputs pass through hidden layers and ultimately produce an output.
    • Convolutional neural networks (CNNs) are designed for processing grid-like data, such as images. CNNs apply convolutional layers and pooling layers to extract features from images.
    • Recurrent neural networks (RNNs) are designed for processing sequential data, such as text or time series. RNNs have connections that loop back into themselves, allowing them to maintain a hidden state that captures information about the sequence. The long short-term memory (LSTM) network is a popular type of RNN.
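
    A minimal PyTorch sketch of a feed-forward network with one hidden layer (the layer sizes here are arbitrary choices for illustration):

        import torch
        from torch import nn

        # Inputs flow strictly forward: input layer -> hidden layer -> output layer
        feedforward = nn.Sequential(
            nn.Linear(4, 16),   # input layer -> hidden layer
            nn.ReLU(),          # non-linear activation
            nn.Linear(16, 3),   # hidden layer -> output layer
        )

        x = torch.rand(8, 4)       # a batch of 8 samples with 4 features each
        logits = feedforward(x)    # output shape: (8, 3)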

    Training Neural Networks:

    • Gradient descent is a technique used to train neural networks by minimizing a loss function. Gradient descent involves iteratively adjusting the weights of the network based on the gradient of the loss function with respect to the weights.
    • Stochastic gradient descent randomly chooses one data point at a time on which to calculate the gradient, instead of calculating it on all of the data points at once.
    • Mini-batch gradient descent divides the dataset into small batches (groups of data points) and calculates the gradient on one batch at a time.
    • Overfitting occurs when a neural network is too complex and fits the training data too closely, resulting in poor generalization to new data.
    • Dropout is a technique used to combat overfitting by randomly removing units from the neural network during training.
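
    A hedged PyTorch sketch tying these pieces together: mini-batch gradient descent via a DataLoader, and dropout that is active only in training mode (the data and sizes here are random and purely illustrative):

        import torch
        from torch import nn
        from torch.utils.data import DataLoader, TensorDataset

        model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(),
                              nn.Dropout(p=0.5),   # randomly zeroes units while training
                              nn.Linear(16, 1))
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        loss_fn = nn.MSELoss()

        data = TensorDataset(torch.rand(100, 4), torch.rand(100, 1))
        loader = DataLoader(data, batch_size=10, shuffle=True)  # mini-batches of 10

        model.train()  # enables dropout
        for X_batch, y_batch in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(X_batch), y_batch)
            loss.backward()    # gradients via backpropagation
            optimizer.step()   # weight update
        model.eval()   # disables dropout for evaluation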


    Understanding Gradient Descent in Neural Networks

    Gradient descent is an algorithm inspired by calculus for minimizing loss when training a neural network. In the context of neural networks, “loss” refers to how poorly a hypothesis function models data.

    Key aspects of gradient descent:

    • Loss Function: Gradient descent aims to minimize a loss function, which quantifies how poorly the neural network performs.
    • Gradient Calculation: The algorithm calculates the gradient of the loss function with respect to the network’s weights. The gradient indicates the direction in which the weights should be adjusted to reduce the loss.
    • Weight Update: The weights are updated by taking a small step in the direction opposite to the gradient. The size of this step can vary and is chosen when training the neural network.
    • Iterative Process: This process is repeated iteratively, adjusting the weights little by little based on the data points, with the aim of converging towards a good solution.

    There are variations to the standard gradient descent algorithm:

    • Stochastic Gradient Descent: Instead of looking at all data points at once, stochastic gradient descent randomly chooses one data point at a time to calculate the gradient. This provides a less accurate gradient estimate but is faster to compute.
    • Mini-Batch Gradient Descent: This approach is a middle ground between standard and stochastic gradient descent, where the data set is divided into small batches and the gradient is calculated based on these batches.
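
    The core update can be illustrated with PyTorch autograd on a toy loss; the quadratic loss and the step size of 0.1 are arbitrary choices for this sketch:

        import torch

        w = torch.tensor([5.0], requires_grad=True)
        lr = 0.1  # step size, chosen when training

        for _ in range(50):
            loss = (w ** 2).sum()      # toy loss; minimum at w = 0
            loss.backward()            # compute gradient d(loss)/dw
            with torch.no_grad():
                w -= lr * w.grad       # small step opposite to the gradient
            w.grad.zero_()             # clear gradient for the next iteration

        print(w)  # close to 0 after repeated small steps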

    Understanding Neural Network Hidden Layers

    Hidden layers are intermediate layers of artificial neurons or units within a neural network between the input layer and the output layer.

    Here’s more about hidden layers and how they contribute to neural network functionality:

    • Structure and Function: In a neural network, the input layer receives the initial data, and the output layer produces the final result. The hidden layers lie in between, performing complex transformations on the input data to help the network learn non-linear relationships.
    • Nodes and Connections: Each hidden layer contains a certain number of nodes or units, each connected to the nodes in the preceding and following layers. The connections between nodes have weights, which are adjusted during training to optimize the network’s performance.
    • Activation: Each unit calculates its output by applying an activation function to a weighted (linear) combination of all of its inputs. The advantage of layering units like this is the ability to model more complex functions.

    Backpropagation: One of the challenges of neural networks is training networks that contain hidden layers. The training data provides values for all of the inputs and what the value of the output should be, but it does not provide what the values of the nodes in the hidden layers should be. The key algorithm that makes training the hidden layers of neural networks possible is called backpropagation.

    Deep Neural Networks: Neural networks that contain multiple hidden layers are called deep neural networks. The presence of multiple hidden layers allows the network to model more complex functions. Each layer can learn different features of the input, and these features can be combined to produce the desired output. However, complex networks are at greater risk of overfitting.

    Dropout: Dropout is a technique that can combat overfitting in neural networks. It involves temporarily removing units from the network during training to prevent over-reliance on any single node.

    Harvard CS50’s Artificial Intelligence with Python – Full University Course

    The Original Text

    This course from Harvard University explores the concepts and algorithms at the foundation of modern artificial intelligence, diving into the ideas that give rise to technologies like game-playing engines, handwriting recognition, and machine translation. You’ll gain exposure to the theory behind graph search algorithms, classification, optimization, reinforcement learning, and other topics in artificial intelligence and machine learning. Brian Yu teaches this course. Hello, world. This is CS50, and this is an introduction to artificial intelligence with Python with CS50’s own Brian Yu. This course picks up where CS50 itself leaves off and explores the concepts and algorithms at the foundation of modern AI. We’ll start with a look at how AI can search for solutions to problems, whether those problems are learning how to play a game or trying to find driving directions to a destination. We’ll then look at how AI can represent information, both knowledge that our AI is certain about, but also information and events about which our AI might be uncertain, learning how to represent that information, but more importantly, how to use that information to draw inferences and new conclusions as well. We’ll explore how AI can solve various types of optimization problems, trying to maximize profits or minimize costs or satisfy some other constraints before turning our attention to the fast-growing field of machine learning, where we won’t tell our AI exactly how to solve a problem, but instead, give our AI access to data and experiences so that our AI can learn on its own how to perform these tasks. In particular, we’ll look at neural networks, one of the most popular tools in modern machine learning, inspired by the way that human brains learn and reason as well before finally taking a look at the world of natural language processing so that it’s not just us humans learning to learn how artificial intelligence is able to speak, but also AI learning how to understand and interpret human language as well. We’ll explore these ideas and algorithms, and along the way, give you the opportunity to build your own AI programs to implement all of this and more. This is CS50. All right. Welcome, everyone, to an introduction to artificial intelligence with Python. My name is Brian Yu, and in this class, we’ll explore some of the ideas and techniques and algorithms that are at the foundation of artificial intelligence. Now, artificial intelligence covers a wide variety of types of techniques. Anytime you see a computer do something that appears to be intelligent or rational in some way, like recognizing someone’s face in a photo, or being able to play a game better than people can, or being able to understand human language when we talk to our phones and they understand what we mean and are able to respond back to us, these are all examples of AI, or artificial intelligence. And in this class, we’ll explore some of the ideas that make that AI possible. So we’ll begin our conversations with search, the problem of we have an AI, and we would like the AI to be able to search for solutions to some kind of problem, no matter what that problem might be. Whether it’s trying to get driving directions from point A to point B, or trying to figure out how to play a game, given a tic-tac-toe game, for example, figuring out what move it ought to make. After that, we’ll take a look at knowledge. 
Ideally, we want our AI to be able to know information, to be able to represent that information, and more importantly, to be able to draw inferences from that information, to be able to use the information it knows and draw additional conclusions. So we’ll talk about how AI can be programmed in order to do just that. Then we’ll explore the topic of uncertainty, talking about ideas of what happens if a computer isn’t sure about a fact, but maybe is only sure with a certain probability. So we’ll talk about some of the ideas behind probability, and how computers can begin to deal with uncertain events in order to be a little bit more intelligent in that sense as well. After that, we’ll turn our attention to optimization, problems of when the computer is trying to optimize for some sort of goal, especially in a situation where there might be multiple ways that a computer might solve a problem, but we’re looking for a better way, or potentially the best way, if that’s at all possible. Then we’ll take a look at machine learning, or learning more generally, and looking at how, when we have access to data, our computers can be programmed to be quite intelligent by learning from data and learning from experience, being able to perform a task better and better based on greater access to data. So your email, for example, where your email inbox somehow knows which of your emails are good emails and which of your emails are spam. These are all examples of computers being able to learn from past experiences and past data. We’ll take a look, too, at how computers are able to draw inspiration from human intelligence, looking at the structure of the human brain, and how neural networks can be a computer analog to that sort of idea, and how, by taking advantage of a certain type of structure of a computer program, we can write neural networks that are able to perform tasks very, very effectively. And then finally, we’ll turn our attention to language, not programming languages, but human languages that we speak every day. And taking a look at the challenges that come about as a computer tries to understand natural language, and how it is some of the natural language processing that occurs in modern artificial intelligence can actually work. But today, we’ll begin our conversation with search, this problem of trying to figure out what to do when we have some sort of situation that the computer is in, some sort of environment that an agent is in, so to speak, and we would like for that agent to be able to somehow look for a solution to that problem. Now, these problems can come in any number of different types of formats. One example, for instance, might be something like this classic 15 puzzle with the sliding tiles that you might have seen. Where you’re trying to slide the tiles in order to make sure that all the numbers line up in order. This is an example of what you might call a search problem. The 15 puzzle begins in an initially mixed up state, and we need some way of finding moves to make in order to return the puzzle to its solved state. But there are similar problems that you can frame in other ways. Trying to find your way through a maze, for example, is another example of a search problem. You begin in one place, you have some goal of where you’re trying to get to, and you need to figure out the correct sequence of actions that will take you from that initial state to the goal. 
And while this is a little bit abstract, any time we talk about maze solving in this class, you can translate it to something a little more real world. Something like driving directions. If you ever wonder how Google Maps is able to figure out what is the best way for you to get from point A to point B, and what turns to make at what time, depending on traffic, for example, it’s often some sort of search algorithm. You have an AI that is trying to get from an initial position to some sort of goal by taking some sequence of actions. So we’ll start our conversations today by thinking about these types of search problems and what goes in to solving a search problem like this in order for an AI to be able to find a good solution. In order to do so, though, we’re going to need to introduce a little bit of terminology, some of which I’ve already used. But the first term we’ll need to think about is an agent. An agent is just some entity that perceives its environment. It somehow is able to perceive the things around it and act on that environment in some way. So in the case of the driving directions, your agent might be some representation of a car that is trying to figure out what actions to take in order to arrive at a destination. In the case of the 15 puzzle with the sliding tiles, the agent might be the AI or the person that is trying to solve that puzzle to try and figure out what tiles to move in order to get to that solution. Next, we introduce the idea of a state. A state is just some configuration of the agent in its environment. So in the 15 puzzle, for example, any state might be any one of these three, for example. A state is just some configuration of the tiles. And each of these states is different and is going to require a slightly different solution. A different sequence of actions will be needed in each one of these in order to get from this initial state to the goal, which is where we’re trying to get. So the initial state, then, what is that? The initial state is just the state where the agent begins. It is one such state where we’re going to start from. And this is going to be the starting point for our search algorithm, so to speak. We’re going to begin with this initial state and then start to reason about it, to think about what actions might we apply to that initial state in order to figure out how to get from the beginning to the end, from the initial position to whatever our goal happens to be. And how do we make our way from that initial position to the goal? Well, ultimately, it’s via taking actions. Actions are just choices that we can make in any given state. And in AI, we’re always going to try to formalize these ideas a little bit more precisely, such that we could program them a little bit more mathematically, so to speak. So this will be a recurring theme. And we can more precisely define actions as a function. We’re going to effectively define a function called actions that takes an input, s, where s is going to be some state that exists inside of our environment. And actions of s is going to take the state as input and return as output the set of all actions that can be executed in that state. And so it’s possible that some actions are only valid in certain states and not in other states. And we’ll see examples of that soon, too. So in the case of the 15 puzzle, for example, there are generally going to be four possible actions that we can do most of the time. 
We can slide a tile to the right, slide a tile to the left, slide a tile up, or slide a tile down, for example. And those are going to be the actions that are available to us. So somehow our AI, our program, needs some encoding of the state, which is often going to be in some numerical format, and some encoding of these actions. But it also needs some encoding of the relationship between these things. How do the states and actions relate to one another? And in order to do that, we’ll introduce to our AI a transition model, which will be a description of what state we get after we perform some available action in some other state. And again, we can be a little bit more precise about this, define this transition model a little bit more formally, again, as a function. The function is going to be a function called result that this time takes two inputs. Input number one is s, some state. And input number two is a, some action. And the output of this function result is it is going to give us the state that we get after we perform action a in state s. So let’s take a look at an example to see more precisely what this actually means. Here is an example of a state, of the 15 puzzle, for example. And here is an example of an action, sliding a tile to the right. What happens if we pass these as inputs to the result function? Again, the result function takes this board, this state, as its first input. And it takes an action as a second input. And of course, here, I’m describing things visually so that you can see visually what the state is and what the action is. In a computer, you might represent one of these actions as just some number that represents the action. Or if you’re familiar with enums that allow you to enumerate multiple possibilities, it might be something like that. And this state might just be represented as an array or two-dimensional array of all of these numbers that exist. But here, we’re going to show it visually just so you can see it. But when we take this state and this action, pass it into the result function, the output is a new state. The state we get after we take a tile and slide it to the right, and this is the state we get as a result. If we had a different action and a different state, for example, and pass that into the result function, we’d get a different answer altogether. So the result function needs to take care of figuring out how to take a state and take an action and get what results. And this is going to be our transition model that describes how it is that states and actions are related to each other. If we take this transition model and think about it more generally and across the entire problem, we can form what we might call a state space. The set of all of the states we can get from the initial state via any sequence of actions, by taking 0 or 1 or 2 or more actions in addition to that, so we could draw a diagram that looks something like this, where every state is represented here by a game board, and there are arrows that connect every state to every other state we can get to from that state. And the state space is much larger than what you see just here. This is just a sample of what the state space might actually look like. And in general, across many search problems, whether they’re this particular 15 puzzle or driving directions or something else, the state space is going to look something like this. We have individual states and arrows that are connecting them. 
And oftentimes, just for simplicity, we’ll simplify our representation of this entire thing as a graph, some sequence of nodes and edges that connect nodes. But you can think of this more abstract representation as the exact same idea. Each of these little circles or nodes is going to represent one of the states inside of our problem. And the arrows here represent the actions that we can take in any particular state, taking us from one particular state to another state, for example. All right. So now we have this idea of nodes that are representing these states, actions that can take us from one state to another, and a transition model that defines what happens after we take a particular action. So the next step we need to figure out is how we know when the AI is done solving the problem. The AI needs some way to know when it gets to the goal that it’s found the goal. So the next thing we’ll need to encode into our artificial intelligence is a goal test, some way to determine whether a given state is a goal state. In the case of something like driving directions, it might be pretty easy. If you’re in a state that corresponds to whatever the user typed in as their intended destination, well, then you know you’re in a goal state. In the 15 puzzle, it might be checking the numbers to make sure they’re all in ascending order. But the AI needs some way to encode whether or not any state they happen to be in is a goal. And some problems might have one goal, like a maze where you have one initial position and one ending position, and that’s the goal. In other more complex problems, you might imagine that there are multiple possible goals. That there are multiple ways to solve a problem, and we might not care which one the computer finds, as long as it does find a particular goal. However, sometimes the computer doesn’t just care about finding a goal, but finding a goal well, or one with a low cost. And it’s for that reason that the last piece of terminology that we’ll use to define these search problems is something called a path cost. You might imagine that in the case of driving directions, it would be pretty annoying if I said I wanted directions from point A to point B, and the route that Google Maps gave me was a long route with lots of detours that were unnecessary that took longer than it should have for me to get to that destination. And it’s for that reason that when we’re formulating search problems, we’ll often give every path some sort of numerical cost, some number telling us how expensive it is to take this particular option, and then tell our AI that instead of just finding a solution, some way of getting from the initial state to the goal, we’d really like to find one that minimizes this path cost. That is, less expensive, or takes less time, or minimizes some other numerical value. We can represent this graphically if we take a look at this graph again, and imagine that each of these arrows, each of these actions that we can take from one state to another state, has some sort of number associated with it. That number being the path cost of this particular action, where some of the costs for any particular action might be more expensive than the cost for some other action, for example. Although this will only happen in some sorts of problems. In other problems, we can simplify the diagram and just assume that the cost of any particular action is the same. 
And this is probably the case in something like the 15 puzzle, for example, where it doesn't really make a difference whether I'm moving right or moving left. The only thing that matters is the total number of steps that I have to take to get from point A to point B. And each of those steps is of equal cost. We can just assume it's of some constant cost like one. And so this now forms the basis for what we might consider to be a search problem. A search problem has some sort of initial state, some place where we begin, some sort of action that we can take or multiple actions that we can take in any given state. And it has a transition model. Some way of defining what happens when we go from one state and take one action, what state do we end up with as a result. In addition to that, we need some goal test to know whether or not we've reached a goal. And then we need a path cost function that tells us for any particular path, by following some sequence of actions, how expensive is that path. What is its cost in terms of money or time or some other resource that we are trying to minimize our usage of. And the goal ultimately is to find a solution. Where a solution in this case is just some sequence of actions that will take us from the initial state to the goal state. And ideally, we'd like to find not just any solution but the optimal solution, which is a solution that has the lowest path cost among all of the possible solutions. And in some cases, there might be multiple optimal solutions. But an optimal solution just means that there is no way that we could have done better in terms of finding that solution. So now we've defined the problem. And now we need to begin to figure out how it is that we're going to solve this kind of search problem. And in order to do so, you'll probably imagine that our computer is going to need to represent a whole bunch of data about this particular problem. We need to represent data about where we are in the problem. And we might need to be considering multiple different options at once. And oftentimes, when we're trying to package a whole bunch of data related to a state together, we'll do so using a data structure that we're going to call a node. A node is a data structure that is just going to keep track of a variety of different values. And specifically, in the case of a search problem, it's going to keep track of these four values in particular. Every node is going to keep track of a state, the state we're currently on. And every node is also going to keep track of a parent. A parent being the state before us or the node that we used in order to get to this current state. And this is going to be relevant because eventually, once we reach the goal node, once we get to the end, we want to know what sequence of actions we use in order to get to that goal. And the way we'll know that is by looking at these parents to keep track of what led us to the goal and what led us to that state and what led us to the state before that, so on and so forth, backtracking our way to the beginning so that we know the entire sequence of actions we needed in order to get from the beginning to the end. The node is also going to keep track of what action we took in order to get from the parent to the current state. And the node is also going to keep track of a path cost. In other words, it's going to keep track of the number that represents how long it took to get from the initial state to the state that we currently happen to be at.
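
As a sketch, the node described here might look like the following in Python (the dataclass form and field names are illustrative, not the lecture's code):

    from dataclasses import dataclass
    from typing import Any, Optional

    @dataclass
    class Node:
        state: Any                 # the configuration this node represents
        parent: Optional["Node"]   # the node that preceded this one (None for the initial state)
        action: Any                # the action taken to get from the parent to this state
        path_cost: float = 0.0     # cost accumulated from the initial state to this state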
And we’ll see why this is relevant as we start to talk about some of the optimizations that we can make in terms of these search problems more generally. So this is the data structure that we’re going to use in order to solve the problem. And now let’s talk about the approach. How might we actually begin to solve the problem? Well, as you might imagine, what we’re going to do is we’re going to start at one particular state, and we’re just going to explore from there. The intuition is that from a given state, we have multiple options that we could take, and we’re going to explore those options. And once we explore those options, we’ll find that more options than that are going to make themselves available. And we’re going to consider all of the available options to be stored inside of a single data structure that we’ll call the frontier. The frontier is going to represent all of the things that we could explore next that we haven’t yet explored or visited. So in our approach, we’re going to begin the search algorithm by starting with a frontier that just contains one state. The frontier is going to contain the initial state, because at the beginning, that’s the only state we know about. That is the only state that exists. And then our search algorithm is effectively going to follow a loop. We’re going to repeat some process again and again and again. The first thing we’re going to do is if the frontier is empty, then there’s no solution. And we can report that there is no way to get to the goal. And that’s certainly possible. There are certain types of problems that an AI might try to explore and realize that there is no way to solve that problem. And that’s useful information for humans to know as well. So if ever the frontier is empty, that means there’s nothing left to explore. And we haven’t yet found a solution, so there is no solution. There’s nothing left to explore. Otherwise, what we’ll do is we’ll remove a node from the frontier. So right now at the beginning, the frontier just contains one node representing the initial state. But over time, the frontier might grow. It might contain multiple states. And so here, we’re just going to remove a single node from that frontier. If that node happens to be a goal, then we found a solution. So we remove a node from the frontier and ask ourselves, is this the goal? And we do that by applying the goal test that we talked about earlier, asking if we’re at the destination. Or asking if all the numbers of the 15 puzzle happen to be in order. So if the node contains the goal, we found a solution. Great. We’re done. And otherwise, what we’ll need to do is we’ll need to expand the node. And this is a term of art in artificial intelligence. To expand the node just means to look at all of the neighbors of that node. In other words, consider all of the possible actions that I could take from the state that this node is representing and what nodes could I get to from there. We’re going to take all of those nodes, the next nodes that I can get to from this current one I’m looking at, and add those to the frontier. And then we’ll repeat this process. So at a very high level, the idea is we start with a frontier that contains the initial state. 
And we’re constantly removing a node from the frontier, looking at where we can get to next and adding those nodes to the frontier, repeating this process over and over until either we remove a node from the frontier and it contains a goal, meaning we’ve solved the problem, or we run into a situation where the frontier is empty, at which point we’re left with no solution. So let’s actually try and take the pseudocode, put it into practice by taking a look at an example of a sample search problem. So right here, I have a sample graph. A is connected to B via this action. B is connected to nodes C and D. C is connected to E. D is connected to F. And what I’d like to do is have my AI find a path from A to E. We want to get from this initial state to this goal state. So how are we going to do that? Well, we’re going to start with a frontier that contains the initial state. This is going to represent our frontier. So our frontier initially will just contain A, that initial state where we’re going to begin. And now we’ll repeat this process. If the frontier is empty, no solution. That’s not a problem, because the frontier is not empty. So we’ll remove a node from the frontier as the one to consider next. There’s only one node in the frontier. So we’ll go ahead and remove it from the frontier. But now A, this initial node, this is the node we’re currently considering. We follow the next step. We ask ourselves, is this node the goal? No, it’s not. A is not the goal. E is the goal. So we don’t return the solution. So instead, we go to this last step, expand the node, and add the resulting nodes to the frontier. What does that mean? Well, it means take this state A and consider where we could get to next. And after A, what we could get to next is only B. So that’s what we get when we expand A. We find B. And we add B to the frontier. And now B is in the frontier. And we repeat the process again. We say, all right, the frontier is not empty. So let’s remove B from the frontier. B is now the node that we’re considering. We ask ourselves, is B the goal? No, it’s not. So we go ahead and expand B and add its resulting nodes to the frontier. What happens when we expand B? In other words, what nodes can we get to from B? Well, we can get to C and D. So we’ll go ahead and add C and D to the frontier. And now we have two nodes in the frontier, C and D. And we repeat the process again. We remove a node from the frontier. For now, I’ll do so arbitrarily just by picking C. We’ll see why later, how choosing which node you remove from the frontier is actually quite an important part of the algorithm. But for now, I’ll arbitrarily remove C, say it’s not the goal. So we’ll add E, the next one, to the frontier. Then let’s say I remove E from the frontier. And now I check: I’m currently looking at state E. Is it a goal state? It is, because I’m trying to find a path from A to E. So I would return the goal. And that now would be the solution, that I’m now able to return the solution. And I have found a path from A to E. So this is the general idea, the general approach of this search algorithm, to follow these steps, constantly removing nodes from the frontier, until we’re able to find a solution. So the next question you might reasonably ask is, what could go wrong here? What are the potential problems with an approach like this? And here’s one example of a problem that could arise from this sort of approach. Imagine this same graph, same as before, with one change.
The change being now, instead of just an arrow from A to B, we also have an arrow from B to A, meaning we can go in both directions. And this is true in something like the 15 puzzle, where when I slide a tile to the right, I could then slide a tile to the left to get back to the original position. I could go back and forth between A and B. And that’s what these double arrows symbolize, the idea that from one state, I can get to another, and then I can get back. And that’s true in many search problems. What’s going to happen if I try to apply the same approach now? Well, I’ll begin with A, same as before. And I’ll remove A from the frontier. And then I’ll consider where I can get to from A. And after A, the only place I can get to is B. So B goes into the frontier. Then I’ll say, all right, let’s take a look at B. That’s the only thing left in the frontier. Where can I get to from B? Before, it was just C and D. But now, because of that reverse arrow, I can get to A or C or D. So all three, A, C, and D, all of those now go into the frontier. They are places I can get to from B. And now I remove one from the frontier. And maybe I’m unlucky, and maybe I pick A. And now I’m looking at A again. And I consider, where can I get to from A? And from A, well, I can get to B. And now we start to see the problem. But if I’m not careful, I go from A to B, and then back to A, and then to B again. And I could be going in this infinite loop, where I never make any progress, because I’m constantly just going back and forth between two states that I’ve already seen. So what is the solution to this? We need some way to deal with this problem. And the way that we can deal with this problem is by somehow keeping track of what we’ve already explored. And the logic is going to be, well, if we’ve already explored the state, there’s no reason to go back to it. Once we’ve explored a state, don’t go back to it. Don’t bother adding it to the frontier. There’s no need to. So here’s going to be our revised approach, a better way to approach this sort of search problem. And it’s going to look very similar, just with a couple of modifications. We’ll start with a frontier that contains the initial state, same as before. But now we’ll start with another data structure, which will just be a set of nodes that we’ve already explored. So what are the states we’ve explored? Initially, it’s empty. We have an empty explored set. And now we repeat. If the frontier is empty, no solution, same as before. We remove a node from the frontier. We check to see if it’s a goal state, return the solution. None of this is any different so far. But now what we’re going to do is we’re going to add the node to the explored set. So if it happens to be the case that we remove a node from the frontier and it’s not the goal, we’ll add it to the explored set so that we know we’ve already explored it. We don’t need to go back to it again if it happens to come up later. And then the final step, we expand the node and we add the resulting nodes to the frontier. But before, we just always added the resulting nodes to the frontier. We’re going to be a little clever about it this time. We’re only going to add the nodes to the frontier if they aren’t already in the frontier and if they aren’t already in the explored set. So we’ll check both the frontier and the explored set, make sure that the node isn’t already in one of those two. And so long as it isn’t, then we’ll go ahead and add it to the frontier, but not otherwise.
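
This revised approach could be sketched as follows, assuming the Node sketch from earlier, a hypothetical problem object exposing initial, actions, result, and goal_test, hashable states, and a frontier offering add, empty, remove, and contains_state (all names are illustrative):

    def graph_search(problem, frontier):
        """Generic search: the frontier's removal order decides the algorithm."""
        frontier.add(Node(problem.initial, parent=None, action=None))
        explored = set()
        while True:
            if frontier.empty():
                return None                      # frontier empty: no solution
            node = frontier.remove()
            if problem.goal_test(node.state):
                return node                      # backtrack via parents to recover the path
            explored.add(node.state)
            for action in problem.actions(node.state):
                child = problem.result(node.state, action)
                # Only add states not already in the frontier or the explored set
                if child not in explored and not frontier.contains_state(child):
                    frontier.add(Node(child, parent=node, action=action))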
And so that revised approach is ultimately what’s going to help make sure that we don’t go back and forth between two nodes. Now, the one point that I’ve kind of glossed over here so far is this step here, removing a node from the frontier. Before, I just chose arbitrarily. Like, let’s just remove a node and that’s it. But it turns out it’s actually quite important how we decide to structure our frontier, how we add and how we remove our nodes. The frontier is a data structure and we need to make a choice about in what order are we going to be removing elements. And one of the simplest data structures for adding and removing elements is something called a stack. And a stack is a data structure that is a last in, first out data type, which means the last thing that I add to the frontier is going to be the first thing that I remove from the frontier. So the most recent thing to go into the stack or the frontier in this case is going to be the node that I explore. So let’s see what happens if I apply this stack-based approach to something like this problem, finding a path from A to E. What’s going to happen? Well, again, we’ll start with A and we’ll say, all right, let’s go ahead and look at A first. And then notice this time, we’ve added A to the explored set. A is something we’ve now explored. We have this data structure that’s keeping track. We then say from A, we can get to B. And all right, from B, what can we do? Well, from B, we can explore B and get to both C and D. So we added C and then D. So now, when we explore a node, we’re going to treat the frontier as a stack, last in, first out. D was the last one to come in. So we’ll go ahead and explore that next and say, all right, where can we get to from D? Well, we can get to F. And so all right, we’ll put F into the frontier. And now, because the frontier is a stack, F is the most recent thing that’s gone in the stack. So F is what we’ll explore next. We’ll explore F and say, all right, where can we get to from F? Well, we can’t get anywhere, so nothing gets added to the frontier. So now, what was the new most recent thing added to the frontier? Well, it’s now C, the only thing left in the frontier. We’ll explore that from which we can see, all right, from C, we can get to E. So E goes into the frontier. And then we say, all right, let’s look at E. And E is now the solution. And now, we’ve solved the problem. So when we treat the frontier like a stack, a last in, first out data structure, that’s the result we get. We go from A to B to D to F. And then we sort of backed up and went down to C and then E. And it’s important to get a visual sense for how this algorithm is working. We went very deep in this search tree, so to speak, all the way until the bottom where we hit a dead end. And then we effectively backed up and explored this other route that we didn’t try before. And it’s this going very deep in the search tree idea, this way the algorithm ends up working when we use a stack that we call this version of the algorithm depth first search. Depth first search is the search algorithm where we always explore the deepest node in the frontier. We keep going deeper and deeper through our search tree. And then if we hit a dead end, we back up and we try something else instead. But depth first search is just one of the possible search options that we could use. It turns out that there’s another algorithm called breadth first search, which behaves very similarly to depth first search with one difference. 
Instead of always exploring the deepest node in the search tree, the way the depth first search does, breadth first search is always going to explore the shallowest node in the frontier. So what does that mean? Well, it means that instead of using a stack which depth first search or DFS used, where the most recent item added to the frontier is the one we’ll explore next, in breadth first search or BFS, we’ll instead use a queue, where a queue is a first in first out data type, where the very first thing we add to the frontier is the first one we’ll explore and they effectively form a line or a queue, where the earlier you arrive in the frontier, the earlier you get explored. So what would that mean for the same exact problem, finding a path from A to E? Well, we start with A, same as before, then we’ll go ahead and have explored A and say, where can we get to from A? Well, from A, we can get to B, same as before. From B, same as before, we can get to C and D. So C and D get added to the frontier. This time, though, we added C to the frontier before D. So we’ll explore C first. So C gets explored. And from C, where can we get to? Well, we can get to E. So E gets added to the frontier. But because D was added to the frontier before E, we’ll look at D next. So we’ll explore D and say, where can we get to from D? We can get to F. And only then will we say, all right, now we can get to E. And so what breadth first search or BFS did is we started here, we looked at both C and D, and then we looked at E. Effectively, we’re looking at things one away from the initial state, then two away from the initial state, and only then, things that are three away from the initial state, unlike depth first search, which just went as deep as possible into the search tree until it hit a dead end and then ultimately had to back up. So these now are two different search algorithms that we could apply in order to try and solve a problem. And let’s take a look at how these would actually work in practice with something like maze solving, for example. So here’s an example of a maze. These empty cells represent places where our agent can move. These darkened gray cells represent walls that the agent can’t pass through. And ultimately, our agent, our AI, is going to try to find a way to get from position A to position B via some sequence of actions, where those actions are left, right, up, and down. What will depth first search do in this case? Well, depth first search will just follow one path. If it reaches a fork in the road where it has multiple different options, depth first search is just, in this case, going to choose one. It doesn’t have a real preference, but it’s going to keep following one until it hits a dead end. And when it hits a dead end, depth first search effectively goes back to the last decision point and tries the other path, fully exhausting this entire path. And when it realizes that, OK, the goal is not here, then it turns its attention to this path. It goes as deep as possible. When it hits a dead end, it backs up and then tries this other path, keeps going as deep as possible down one particular path. And when it realizes that that’s a dead end, then it’ll back up, and then ultimately find its way to the goal. And maybe you got lucky, and maybe you made a different choice earlier on. But ultimately, this is how depth first search is going to work. It’s going to keep following until it hits a dead end. And when it hits a dead end, it backs up and looks for a different solution.
And so one thing you might reasonably ask is, is this algorithm always going to work? Will it always actually find a way to get from the initial state to the goal? And it turns out that as long as our maze is finite, as long as there are only finitely many spaces where we can travel, then, yes, depth first search is going to find a solution. Because eventually, it’ll just explore everything. If the maze happens to be infinite and there’s an infinite state space, which does exist in certain types of problems, then it’s a slightly different story. But as long as our maze has finitely many squares, we’re going to find a solution. The next question, though, that we want to ask is, is it going to be a good solution? Is it the optimal solution that we can find? And the answer there is not necessarily. And let’s take a look at an example of that. In this maze, for example, we’re again trying to find our way from A to B. And you notice here there are multiple possible solutions. We could go this way or we could go up in order to make our way from A to B. Now, if we’re lucky, depth first search will choose this way and get to B. But there’s no reason necessarily why depth first search would choose between going up or going to the right. It’s sort of an arbitrary decision point because both are going to be added to the frontier. And ultimately, if we get unlucky, depth first search might choose to explore this path first because it’s just a random choice at this point. It’ll explore, explore, explore. And it’ll eventually find the goal, this particular path, when in actuality there was a better path. There was a more optimal solution that used fewer steps, assuming we’re measuring the cost of a solution based on the number of steps that we need to take. So depth first search, if we’re unlucky, might end up not finding the best solution when a better solution is available. So that’s DFS, depth first search. How does BFS, or breadth first search, compare? How would it work in this particular situation? Well, the algorithm is going to look very different visually in terms of how BFS explores. Because BFS looks at shallower nodes first, the idea is going to be, BFS will first look at all of the nodes that are one away from the initial state. Look here and look here, for example, just at the two nodes that are immediately next to this initial state. Then it’ll explore nodes that are two away, looking at this state and that state, for example. Then it’ll explore nodes that are three away, this state and that state. Whereas depth first search just picked one path and kept following it, breadth first search, on the other hand, is taking the option of exploring all of the possible paths kind of at the same time, bouncing back between them, looking deeper and deeper at each one, but making sure to explore the shallower ones or the ones that are closer to the initial state earlier. So we’ll keep following this pattern, looking at things that are four away, looking at things that are five away, looking at things that are six away, until eventually we make our way to the goal. And in this case, it’s true we had to explore some states that ultimately didn’t lead us anywhere, but the path that we found to the goal was the optimal path. This is the shortest way that we could get to the goal. And so what might happen then in a larger maze? Well, let’s take a look at something like this and how breadth first search is going to behave.
Well, breadth first search, again, will just keep following the states until it reaches a decision point. It could go either left or right. And while DFS just picked one and kept following that until it hit a dead end, BFS, on the other hand, will explore both. It’ll say look at this node, then this node, and it’ll look at this node, then that node. So on and so forth. And when it hits a decision point here, rather than picking one, left or right, and exploring that path, it’ll again explore both, alternating between them, going deeper and deeper. We’ll explore here, and then maybe here and here, and then keep going. Explore here and slowly make our way, you can visually see, further and further out. Once we get to this decision point, we’ll explore both up and down until ultimately we make our way to the goal. And what you’ll notice is, yes, breadth first search did find our way from A to B by following this particular path, but it needed to explore a lot of states in order to do so. And so we see some trade offs here between DFS and BFS, that in DFS, there may be some cases where there is some memory savings as compared to a breadth first approach, where breadth first search in this case had to explore a lot of states. But maybe that won’t always be the case. So now let’s actually turn our attention to some code and look at the code that we could actually write in order to implement something like depth first search or breadth first search in the context of solving a maze, for example. So I’ll go ahead and go into my terminal. And what I have here inside of maze.py is an implementation of this same idea of maze solving. I’ve defined a class called node that in this case is keeping track of the state, the parent, in other words, the state before the state, and the action. In this case, we’re not keeping track of the path cost because we can calculate the cost of the path at the end after we found our way from the initial state to the goal. In addition to this, I’ve defined a class called a stack frontier. And if unfamiliar with a class, a class is a way for me to define a way to generate objects in Python. It refers to an idea of object oriented programming, where the idea here is that I would like to create an object that is able to store all of my frontier data. And I would like to have functions, otherwise known as methods, on that object that I can use to manipulate the object. And so what’s going on here, if unfamiliar with the syntax, is I have a function that initially creates a frontier that I’m going to represent using a list. And initially, my frontier is represented by the empty list. There’s nothing in my frontier to begin with. I have an add function that adds something to the frontier as by appending it to the end of the list. I have a function that checks if the frontier contains a particular state. I have an empty function that checks if the frontier is empty. If the frontier is empty, that just means the length of the frontier is 0. And then I have a function for removing something from the frontier. I can’t remove something from the frontier if the frontier is empty, so I check for that first. But otherwise, if the frontier isn’t empty, recall that I’m implementing this frontier as a stack, a last in first out data structure, which means the last thing I add to the frontier, in other words, the last thing in the list, is the item that I should remove from this frontier. So what you’ll see here is that I remove the last item of the list.
And if you index into a Python list with negative 1, that gets you the last item in the list. Since 0 is the first item, negative 1 kind of wraps around and gets you to the last item in the list. So we give that a name. We call that node. We update the frontier here on line 28 to say, go ahead and remove that node that you just removed from the frontier. And then we return the node as a result. So this class here effectively implements the idea of a frontier. It gives me a way to add something to a frontier and a way to remove something from the frontier as a stack. I’ve also, just for good measure, implemented an alternative version of the same thing called a queue frontier, which in parentheses you’ll see here, it inherits from a stack frontier, meaning it’s going to do all the same things that the stack frontier did, except the way we remove a node from the frontier is going to be slightly different. Instead of removing from the end of the list the way we would in a stack, we’re instead going to remove from the beginning of the list. self.frontier[0] will get me the first node in the frontier, the first one that was added, and that is going to be the one that we return in the case of a queue. Then under here, I have a definition of a class called maze. This is going to handle the process of taking a sequence, a maze-like text file, and figuring out how to solve it. So it will take as input a text file that looks something like this, for example, where we see hash marks that are here representing walls, and I have the character A representing the starting position and the character B representing the ending position. And you can take a look at the code for parsing this text file right now. That’s the less interesting part. The more interesting part is this solve function here. The solve function is going to figure out how to actually get from point A to point B. And here we see an implementation of the exact same idea we saw from a moment ago. We’re going to keep track of how many states we’ve explored, just so we can report that data later. But I start with a node that represents just the start state. And I start with a frontier that, in this case, is a stack frontier. And given that I’m treating my frontier as a stack, you might imagine that the algorithm I’m using here is now depth-first search, because depth-first search, or DFS, uses a stack as its data structure. And initially, this frontier is just going to contain the start state. We initialize an explored set that initially is empty. There’s nothing we’ve explored so far. And now here’s our loop, that notion of repeating something again and again. First, we check if the frontier is empty by calling that empty function that we saw the implementation of a moment ago. And if the frontier is indeed empty, we’ll go ahead and raise an exception, or a Python error, to say, sorry, there is no solution to this problem. Otherwise, we’ll go ahead and remove a node from the frontier as by calling frontier.remove and update the number of states we’ve explored, because now we’ve explored one additional state. So we say self.num_explored plus equals 1, adding 1 to the number of states we’ve explored. Once we remove a node from the frontier, recall that the next step is to see whether or not it’s the goal, the goal test. And in the case of the maze, the goal is pretty easy. I check to see whether the state of the node is equal to the goal.
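
Condensed from this description, the two frontier classes could be sketched like so (a paraphrase of the idea, not the verbatim maze.py source):

    class StackFrontier:
        def __init__(self):
            self.frontier = []                  # empty list to begin with

        def add(self, node):
            self.frontier.append(node)          # append to the end of the list

        def contains_state(self, state):
            return any(node.state == state for node in self.frontier)

        def empty(self):
            return len(self.frontier) == 0

        def remove(self):
            if self.empty():
                raise Exception("empty frontier")
            node = self.frontier[-1]            # last in, first out (stack -> DFS)
            self.frontier = self.frontier[:-1]
            return node

    class QueueFrontier(StackFrontier):
        def remove(self):
            if self.empty():
                raise Exception("empty frontier")
            node = self.frontier[0]             # first in, first out (queue -> BFS)
            self.frontier = self.frontier[1:]
            return node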
Initially, when I set up the maze, I set up this value called goal, which is a property of the maze, so I can just check to see if the node is actually the goal. And if it is the goal, then what I want to do is backtrack my way towards figuring out what actions I took in order to get to this goal. And how do I do that? Well, recall that every node stores its parent, the node that came before it that we used to get to this node, and also the action used in order to get there. So I can create this loop where I’m constantly just looking at the parent of every node and keeping track, for each node, of what action I took to get from the parent to this current node. This loop is going to keep repeating this process of looking at parent nodes until we get back to the initial state, which has no parent, where node.parent is going to be equal to None. As I do so, I’m going to be building up the list of all of the actions that I’m following and the list of all the cells that are part of the solution. But I’ll reverse them at the end, because I’m building the sequence up from the goal back to the initial state, and what I want is the sequence of actions from the initial state to the goal. And that is ultimately going to be the solution. So all of that happens if the current state is equal to the goal. Otherwise, if it’s not the goal, well, then I’ll go ahead and add this state to the explored set to say, I’ve explored this state now; no need to go back to it if I come across it in the future. And then this logic here implements the idea of adding neighbors to the frontier. I’m saying, look at all of my neighbors, and I implemented a function called neighbors that you can take a look at. And for each of those neighbors, I’m going to check: is the state already in the frontier? Is the state already in the explored set? And if it’s not in either of those, then I’ll go ahead and add this new child node to the frontier. So there’s a fair amount of syntax here, but the key is not to understand all the nuances of the syntax; feel free to take a closer look at this file on your own to get a sense for how it is working. The key is to see how this is an implementation of the same pseudocode, the same idea, that we were describing a moment ago on the screen when we were looking at the steps we might follow in order to solve this kind of search problem. So now let’s actually see this in action. I’ll go ahead and run maze.py on maze1.txt, for example. And what we’ll see here is a printout of what the maze initially looked like, and then, down below, what it looks like after we’ve solved it. We had to explore 11 states in order to do it, and we found a path from A to B. And in this program, I happened to generate a graphical representation of this as well. So I can open up maze.png, which is generated by this program, where the darker color represents the walls, red is the initial state, green is the goal, and yellow is the path that was followed. We found a path from the initial state to the goal. But now let’s take a look at a more sophisticated maze to see what might happen instead. Let’s look now at maze2.txt. Here we have a much larger maze. Again, we’re trying to find our way from point A to point B, but now you might imagine that depth-first search won’t be so lucky. It might not get to the goal on the first try.
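For reference, the solve method just walked through might look roughly like this (a sketch following the description above; self.start, self.goal, and self.neighbors are the maze properties mentioned in the lecture, though their exact implementation is assumed):

```python
def solve(self):
    """Finds a path from the start to the goal, if one exists."""
    self.num_explored = 0

    # Initialize the frontier to just the starting position.
    start = Node(state=self.start, parent=None, action=None)
    frontier = StackFrontier()   # a stack means depth-first search
    frontier.add(start)

    # Initialize an empty explored set.
    self.explored = set()

    # Keep looping until a solution is found or the frontier is empty.
    while True:
        # If nothing is left in the frontier, there is no path.
        if frontier.empty():
            raise Exception("no solution")

        # Remove a node from the frontier and count it as explored.
        node = frontier.remove()
        self.num_explored += 1

        # Goal test: if this is the goal, backtrack through the parents.
        if node.state == self.goal:
            actions = []
            cells = []
            while node.parent is not None:
                actions.append(node.action)
                cells.append(node.state)
                node = node.parent
            actions.reverse()   # built goal-to-start, so reverse it
            cells.reverse()
            self.solution = (actions, cells)
            return

        # Mark the state as explored.
        self.explored.add(node.state)

        # Add neighbors to the frontier if they are not already
        # in the frontier or in the explored set.
        for action, state in self.neighbors(node.state):
            if not frontier.contains_state(state) and state not in self.explored:
                child = Node(state=state, parent=node, action=action)
                frontier.add(child)
```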
Back to the larger maze: depth-first search might have to follow one path, then backtrack and explore something else a little bit later. So let’s try this. We’ll run python maze.py on maze2.txt, this time trying this other maze. And now depth-first search is able to find a solution. Here, as indicated by the stars, is a way to get from A to B. And we can represent this visually by opening up this maze. Here’s what that maze looks like, and highlighted in yellow is the path that was found from the initial state to the goal. But how many states did we have to explore before we found that path? Well, recall that in my program, I was keeping track of the number of states we’ve explored so far. And so I can go back to the terminal and see that, all right, in order to solve this problem, we had to explore 399 different states. And in fact, if I make one small modification to the program: when we output this image at the end, I added an argument called show_explored. And if I set show_explored equal to True and rerun this program, python maze.py, running it on maze2, and then open the maze, what you’ll see, highlighted in red, are all of the states that had to be explored to get from the initial state to the goal. Depth-first search, or DFS, didn’t find its way to the goal right away. It made a choice to first explore this direction. And when it explored this direction, it had to follow every conceivable path all the way to the very end, even this long and winding one, in order to realize that, you know what, that’s a dead end, and the program needed to backtrack. After going this direction, it must have gone this direction. It got lucky here by just not choosing this path, but it got unlucky here, exploring this direction, exploring a bunch of states it didn’t need to, and then likewise exploring all of this top part of the graph when it probably didn’t need to do that either. So all in all, depth-first search here is really not performing efficiently; it’s probably exploring more states than it needs to. It finds an optimal solution, the best path to the goal, but the number of states it needed to explore in order to do so, the number of steps it had to take, was much higher. So let’s compare: how would breadth-first search, or BFS, do on this exact same maze instead? It’s a very easy change. The algorithm for DFS and BFS is identical with the exception of what data structure we use to represent the frontier: in DFS, I used a stack frontier, last in, first out, whereas in BFS, I’m going to use a queue frontier, first in, first out, where the first thing I add to the frontier is the first thing that I remove. So I’ll go back to the terminal, rerun this program on the same maze, and now you’ll see that the number of states we had to explore was only 77, as compared to almost 400 when we used depth-first search. And we can see exactly why. We can see what happened if we open up maze.png now and take a look. Again, the yellow highlight is the solution that breadth-first search found, which incidentally is the same solution that depth-first search found; they both find the best solution here. But notice all the white unexplored cells: there were far fewer states that needed to be explored in order to make our way to the goal, because breadth-first search operates a little more shallowly. It’s exploring things that are close to the initial state without exploring things that are further away.
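Concretely, that DFS-to-BFS switch is a one-line change in the solve sketch above:

```python
# Depth-first search uses a stack:
# frontier = StackFrontier()

# Breadth-first search uses a queue; everything else in solve stays the same:
frontier = QueueFrontier()
```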
So if the goal is not too far away, then breadth-first search can actually behave quite effectively on a maze that looks a little something like this. Now, in this case, both BFS and DFS ended up finding the same solution, but that won’t always be the case. In fact, let’s take a look at one more example: maze3.txt. In maze3.txt, notice that there are multiple ways you could get from A to B. It’s a relatively small maze, but let’s look at what happens. I’ll go ahead and turn off show_explored so we just see the solution. If I use BFS, breadth-first search, to solve maze3.txt, well, then we find a solution, and if I open up the maze, here is the solution that we found. It is the optimal one: with just four steps, we can get from the initial state to the goal. But what happens if we try to use depth-first search, or DFS, instead? Well, again, I’ll go back up to my QueueFrontier, which meant we were using breadth-first search, and I’ll change it to a StackFrontier, which means that now we’ll be using depth-first search. I’ll rerun python maze.py, and now you’ll see that we find a solution, but it is not the optimal solution; this instead is what our algorithm finds. Maybe depth-first search could have found the optimal solution; it’s possible, but it’s not guaranteed. If we just happen to be unlucky, if we choose this state instead of that state, then depth-first search might find a longer route to get from the initial state to the goal. So we do see some trade-offs here, where depth-first search might not find the optimal solution. So at that point, it seems like breadth-first search is pretty good. Is that the best we can do, where it’s going to find us the optimal solution, and we don’t have to worry about finding a longer path to the solution than what actually exists? Well, think back to the larger maze: where the goal is far away from the initial state, and we might have to take lots of steps in order to get from the initial state to the goal, what ended up happening is that this algorithm, BFS, ended up exploring basically the entire graph, having to go through the entire maze in order to find its way from the initial state to the goal state. What we’d ultimately like is for our algorithm to be a little bit more intelligent. And what would it mean for our algorithm to be a little bit more intelligent in this case? Well, let’s look back to where breadth-first search might have been able to make a different decision and consider human intuition in this process as well. What might a human do when solving this maze that is different from what BFS ultimately chose to do? Well, the very first decision point that BFS faced was right here, after it made five steps and ended up in a position where it had a fork in the road. It could either go left or it could go right. In those initial couple of steps, there was no choice; there was only one action that could be taken from each of those states. And so the search algorithm did the only thing that any search algorithm could do, which is to keep following one state after the next. But this decision point is where things get a little bit interesting. Depth-first search, that very first search algorithm we looked at, chose to say, let’s pick one path and exhaust that path, see if the goal is anywhere along it, and if not, then let’s try the other way.
Breadth-first search, on the other hand, took the alternative approach of saying, you know what, let’s explore things that are shallow, close to us, first. Look left and right, then one step further left and one step further right, so on and so forth, alternating between our options in the hopes of finding something nearby. But ultimately, what might a human do if confronted with a situation like this, of go left or go right? Well, a human might visually see that, all right, I’m trying to get to state B, which is way up there, and going right just feels like it’s closer to the goal. It feels like going right should be better than going left, because I’m making progress towards getting to that goal. Now, of course, there are a couple of assumptions I’m making here. I’m making the assumption that we can represent this maze as a two-dimensional grid where I know the coordinates of everything: I know that A is at coordinate 0, 0, and B is at some other coordinate pair, and I know what coordinate I’m at now. So I can calculate that, yeah, going this way, that is closer to the goal. And that might be a reasonable assumption for some types of search problems, but maybe not in others. But for now, we’ll go ahead and assume that: that I know what my current coordinate pair is, and I know the x, y coordinate of the goal that I’m trying to get to. And in this situation, I’d like an algorithm that is a little bit more intelligent, that somehow knows I should be making progress towards the goal, and that this is probably the way to do it, because in a maze, moving in the coordinate direction of the goal is usually, though not always, a good thing. And so here we draw a distinction between two different types of search algorithms: uninformed search and informed search. Uninformed search algorithms are algorithms like DFS and BFS, the two algorithms we just looked at, which are search strategies that don’t use any problem-specific knowledge to be able to solve the problem. DFS and BFS didn’t really care about the structure of the maze or anything about the way a maze is in order to solve the problem; they just look at the actions available and choose from those actions. It doesn’t matter whether it’s a maze or some other problem; the way they try to solve the problem is fundamentally the same. What we’re going to take a look at now is an improvement upon uninformed search: informed search. Informed search algorithms are search strategies that use knowledge specific to the problem to be able to better find a solution. And in the case of a maze, this problem-specific knowledge is something like: being in a square that is geographically closer to the goal is better than being in a square that is geographically further away. And this is something we can only know by thinking about this problem and reasoning about what knowledge might be helpful for our AI agent to know a little something about. There are a number of different types of informed search. First, we’re going to look at a particular algorithm called greedy best-first search, often abbreviated GBFS. Greedy best-first search is a search algorithm that, instead of expanding the deepest node like DFS or the shallowest node like BFS, is always going to expand the node that it thinks is closest to the goal. Now, the search algorithm isn’t going to know for sure whether it is the closest thing to the goal.
Because if we knew for sure what was closest to the goal, we would already have a solution: with knowledge of what is close to the goal, we could just follow those steps in order to get from the initial position to the goal. But if we don’t know the solution, meaning we don’t know exactly what’s closest to the goal, we can instead use an estimate of what’s closest to the goal, otherwise known as a heuristic: just some way of estimating whether or not we’re close to the goal. And we’ll do so using a heuristic function, conventionally called h of n, that takes a state as input and returns our estimate of how close we are to the goal. So what might this heuristic function actually look like in the case of a maze-solving algorithm? When we’re trying to solve a maze, what does the heuristic look like? Well, the heuristic needs to answer a question: between these two cells, C and D, which one is better? Which one would I rather be in if I’m trying to find my way to the goal? Well, any human could probably look at this and tell you, you know what, D looks like it’s better. Even if the maze is convoluted and you haven’t thought about all the walls, D is probably better. And why is D better? Well, if you ignore the walls, let’s just pretend the walls don’t exist for a moment and relax the problem, so to speak, then D, just in terms of coordinate pairs, is closer to the goal. It’s fewer steps that I would need to take to get to the goal as compared to C, even ignoring the walls. If you just know the xy-coordinate of C and the xy-coordinate of the goal, and likewise you know the xy-coordinate of D, you can calculate that D, just geographically, ignoring the walls, looks like it’s better. And so this is the heuristic function that we’re going to use. It’s something called the Manhattan distance, one specific type of heuristic, where the heuristic is how many squares vertically and horizontally I would need to travel, not allowing myself to go diagonally, just up, down, left, or right: how many steps do I need to take to get from each of these cells to the goal? Well, as it turns out, D is much closer; there are fewer steps. It only needs six steps in order to get to that goal. Again, we’re ignoring the walls here; we’ve relaxed the problem a little bit. We’re just concerned with, if you do the math to subtract the x values from each other and the y values from each other, what is our estimate of how far away we are? We can estimate that D is closer to the goal than C is. And so now we have an approach. We have a way of picking which node to remove from the frontier: at each stage in our algorithm, we’re going to remove the node that has the smallest value for this heuristic function, the smallest Manhattan distance to the goal. So what would this actually look like? Well, let me first label this maze with a number representing the value of this heuristic function, the value of the Manhattan distance, for each of these cells. So from this cell, for example, we’re one away from the goal. From this cell, we’re two away from the goal, three away, four away. Here, we’re five away, because we have to go one to the right and then four up. From somewhere like here, the Manhattan distance is two. We’re only two squares away from the goal geographically, even though in practice, we’re going to have to take a longer path. But we don’t know that yet.
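In code, the Manhattan distance heuristic is a short function (a sketch, assuming states are (row, column) pairs, as in the maze representation used above):

```python
def manhattan_distance(state, goal):
    """h(n): estimated number of grid steps to the goal, ignoring walls."""
    row, col = state
    goal_row, goal_col = goal
    return abs(row - goal_row) + abs(col - goal_col)
```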
The heuristic is just some easy way to estimate how far we are away from the goal. And maybe our heuristic is overly optimistic: it thinks that, yeah, we’re only two steps away, when in practice, once you consider the walls, it might be more steps. So the important thing here is that the heuristic isn’t a guarantee of how many steps it’s going to take. It is an estimate, an attempt at an approximation. And it does seem generally to be the case that squares that are closer to the goal have smaller values for the heuristic function than squares that are further away. So now, using greedy best-first search, what might this algorithm actually do? Well, again, for these first five steps, there’s not much of a choice. We start at this initial state A, and we say, all right, we have to explore these five states. But now we have a decision point, a choice between going left and going right. Before, DFS and BFS would just pick arbitrarily, because it just depends on the order you throw these two nodes into the frontier, and we didn’t specify what order you put them into the frontier, only the order you take them out. But here, we can look at 13 and 11 and say that, all right, this square is a distance of 11 away from the goal according to our heuristic, according to our estimate, and this one we estimate to be 13 away from the goal. So between those two options, between those two choices, I’d rather have the 11. I’d rather be 11 steps away from the goal, so I’ll go to the right. We’re able to make an informed decision, because we know a little something more about this problem. So then we keep following: 10, 9, 8. Between the two 7s, we don’t really have much of a way to decide, so we do just have to make an arbitrary choice. And you know what, maybe we choose wrong. But that’s OK, because now we can still say, all right, let’s try this 7. We say 7, then 6, making this choice even though it increases the value of the heuristic function. But now we have another decision point, between 6 and 8. And really, we’re also considering this 13, but that’s much higher. Between 6, 8, and 13, well, the 6 is the smallest value, so we’d rather take the 6. We’re able to make an informed decision that going this way, to the right, is probably better than going down. So we turn this way and go to 5. And now we reach a decision point where we’ll actually make a decision that we might not want to make, but there’s unfortunately not too much of a way around this. We see 4 and 6. 4 looks closer to the goal, right? It’s going up, and the goal is further up. So we end up taking that route, which ultimately leads us to a dead end. But that’s OK, because we can still say, all right, now let’s try the 6, and now follow this route that will ultimately lead us to the goal. And so this is how greedy best-first search might try to approach this problem: by saying, whenever we have a decision between multiple nodes that we could explore, let’s explore the node that has the smallest value of h of n, this heuristic function that is estimating how far I have to go. And it just so happens that in this case, we end up doing better, in terms of the number of states we needed to explore, than BFS did. BFS explored all of this section and all of that section, but we were able to eliminate that by taking advantage of this heuristic, this knowledge about how close we are to the goal, or some estimate of that idea. So this seems much better.
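One way to sketch greedy best-first search on top of the earlier frontier classes is a frontier that always removes the node with the smallest heuristic value (GreedyFrontier is a hypothetical name for illustration, not something from the course code):

```python
class GreedyFrontier(StackFrontier):
    """Removes the node whose state has the smallest heuristic value h(n)."""

    def __init__(self, heuristic):
        super().__init__()
        self.heuristic = heuristic   # function mapping a state to an estimate

    def remove(self):
        if self.empty():
            raise Exception("empty frontier")
        # A linear scan keeps the idea clear; a priority queue would be faster.
        best = min(self.frontier, key=lambda node: self.heuristic(node.state))
        self.frontier.remove(best)
        return best
```

Using it inside solve would again just be a matter of swapping the frontier, for example frontier = GreedyFrontier(lambda state: manhattan_distance(state, self.goal)).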
So wouldn’t we always prefer an algorithm like this over an algorithm like breadth-first search? Well, one thing to take into consideration is that we need to come up with a good heuristic: how good the heuristic is will affect how good this algorithm is, and coming up with a good heuristic can oftentimes be challenging. But the other thing to consider is to ask the question, just as we did with the prior two algorithms: is this algorithm optimal? Will it always find the shortest path from the initial state to the goal? And to answer that question, let’s take a look at this example for a moment. Again, we’re trying to get from A to B, and again, I’ve labeled each of the cells with its Manhattan distance from the goal, the number of squares, up and to the right, you would need to travel in order to get from that square to the goal. And let’s think about whether greedy best-first search, which always picks the smallest number, would end up finding the optimal solution. What is the shortest solution, and would this algorithm find it? The important thing to realize is that right here is the decision point. We’re estimated to be 12 away from the goal, and we have two choices: we can go to the left, which we estimate to be 13 away from the goal, or we can go up, which we estimate to be 11 away from the goal. And between those two, greedy best-first search is going to say the 11 looks better than the 13. In doing so, greedy best-first search will end up finding this path to the goal. But it turns out this path is not optimal. There is a way to get to the goal using fewer steps, and it’s actually this way, a way that ultimately involves fewer steps, even though it meant, at that one moment, choosing the worse option between the two, or what we estimated to be the worse option based on the heuristic. And so this is what we mean when we say this is a greedy algorithm: it’s making the best decision locally. At this decision point, it looks like it’s better to go here than it is to go to the 13. But in the big picture, that’s not necessarily optimal: it might find a solution when, in actuality, there was a better solution available. So we would like some way to solve this problem. We like the idea of this heuristic, of being able to estimate the distance between us and the goal; that helps us to make better decisions and to eliminate having to search through entire parts of the state space. But we would like to modify the algorithm so that it can be optimal. What is the way to do this? What is the intuition here? Well, let’s take a look at this problem. In this initial problem, greedy best-first search found us this solution here, this long path. And the reason why it wasn’t great is that, yes, the heuristic numbers went down pretty low, but later on, they started to build back up: 8, 9, 10, 11, all the way up to 12 in this case. So how might we go about trying to improve this algorithm? Well, one thing we might realize is that if we follow this path all the way to the 12, we’ve had to take a lot of steps, who knows how many, just to get to this 12. We could have also, as an alternative, taken far fewer steps, just six steps, and ended up at this 13 here. And yes, 13 is more than 12, so it looks like it’s not as good. But it required far fewer steps.
It only took six steps to get to this 13, versus many more steps to get to this 12. And while greedy best-first search says, oh, well, 12 is better than 13, so pick the 12, we might more intelligently say, I’d rather be somewhere that heuristically looks like it takes slightly longer if I can get there much more quickly. And we’re going to encode that general idea into a more formal algorithm known as A star search. A star search is going to solve this problem by, instead of just considering the heuristic, also considering how long it took us to get to any particular state. So the distinction is this: in greedy best-first search, if I am in a state right now, the only thing I care about is the estimated distance, the heuristic value, between me and the goal. Whereas A star search takes into consideration two pieces of information: how far do I estimate I am from the goal, but also, how far did I have to travel in order to get here? Because that is relevant, too. So A star search works by expanding the node with the lowest value of g of n plus h of n. h of n is that same heuristic we were talking about a moment ago, which is going to vary based on the problem. But g of n is the cost to reach the node: how many steps I had to take, in this case, to get to my current position. So what does that search algorithm look like in practice? Well, let’s take a look. Again, we’ve got the same maze, and again, I’ve labeled the cells with their Manhattan distance. This value is the h of n value, the heuristic estimate of how far each of these squares is from the goal. But now, as we begin to explore states, we care not just about this heuristic value but also about g of n, the number of steps I had to take in order to get there, and I care about summing those two numbers together. So what does that look like? On this very first step, I have taken one step, and I am estimated to be 16 steps away from the goal, so the total value here is 17. Then I take one more step. I’ve now taken two steps, and I estimate myself to be 15 away from the goal: again, a total value of 17. Now I’ve taken three steps, and I’m estimated to be 14 away from the goal, so on and so forth. Four steps, an estimate of 13. Five steps, an estimate of 12. And now here’s a decision point. I could either be six steps from the start with a heuristic of 13, for a total of 19, or I could be six steps from the start with a heuristic of 11, for a total of 17. So between 19 and 17, I’d rather take the 17, the 6 plus 11. So far, this is no different from what we saw before: we’re still taking this option because it appears to be better, and I keep taking this option because it appears to be better. But it’s right about here that things get a little bit different. Now I could be 15 steps from the start with an estimated distance of 6, so 15 plus 6, a total value of 21. Alternatively, I could be six steps from the start (this square is five steps away, so this one is six) with a heuristic estimate of 13. So 6 plus 13, that’s 19. So here, we would evaluate g of n plus h of n to be 19, 6 plus 13, whereas here, we would be at 15 plus 6, or 21. And so the intuition is that 19 is less than 21, so pick here. The idea is that I’d rather have taken fewer steps and be at a 13 than have taken 15 steps and be at a 6, because the latter means I’ve had to take more steps in order to get there. Maybe there’s a better path this way.
So instead, we’ll explore this route. Now, if we go one more, this is seven steps plus 14, which is 21. So between those two, it’s sort of a toss-up; we might end up exploring that one anyway. But after that, as the totals up that way start to get bigger and the heuristic values down this path start to get smaller, you’ll find that we’ll actually keep exploring down this path. And you can do the math to see that at every decision point, A star search is going to make a choice based on the sum of how many steps it took me to get to my current position and how far I estimate I am from the goal. So while we did have to explore some of these states, the ultimate solution we found was, in fact, an optimal solution. It found us the quickest possible way to get from the initial state to the goal. And it turns out that A star is an optimal search algorithm under certain conditions. The first condition is that h of n, my heuristic, needs to be admissible. What does it mean for a heuristic to be admissible? Well, a heuristic is admissible if it never overestimates the true cost: h of n always needs to either get it exactly right, in terms of how far away I am, or underestimate. We saw an example before where the heuristic value was much smaller than the actual cost it would take; that’s totally fine. But the heuristic value should never overestimate; it should never think that I’m further away from the goal than I actually am. And meanwhile, to make a stronger statement, h of n also needs to be consistent. What does it mean for it to be consistent? Mathematically, it means that for every node, which we’ll call n, and successor, the node after it, which I’ll call n prime, where it takes a cost of c to make that step, the heuristic value of n needs to be less than or equal to the heuristic value of n prime plus the cost: h(n) ≤ h(n') + c. It’s a lot of math, but in words, what it ultimately means is that, if I am at this state right now, the heuristic value from me to the goal shouldn’t be more than the heuristic value of my successor, the next place I could go, plus however much it would cost me just to make that single step. This is just making sure that my heuristic is consistent across all of these steps that I might take. So as long as this is true, A star search is going to find me an optimal solution. And this is where much of the challenge of solving these search problems can sometimes come in: A star search is an algorithm that is well known, and you could write the code fairly easily, but it’s choosing the heuristic that can be the interesting challenge. The better the heuristic is, the better I’ll be able to solve the problem, and the fewer states I’ll have to explore. And I need to make sure that the heuristic satisfies these particular constraints. So all in all, these are some examples of search algorithms, and certainly there are many more than just these. A star, for example, does have a tendency to use quite a bit of memory, so there are alternative approaches to A star that ultimately use less memory than this version happens to use, and there are other search algorithms optimized for other cases as well. But so far, we’ve only been looking at search algorithms where there is one agent: I am trying to find a solution to a problem. I am trying to navigate my way through a maze. I am trying to solve a 15 puzzle. I am trying to find driving directions from point A to point B.
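As code, the A star rule of expanding the node with the smallest g of n plus h of n could be sketched as one more frontier variant (again building on the hypothetical classes above; storing the path cost on each node would be the usual optimization, but here g(n) is recovered by walking back through the parents):

```python
class AStarFrontier(StackFrontier):
    """Removes the node minimizing g(n) + h(n)."""

    def __init__(self, heuristic):
        super().__init__()
        self.heuristic = heuristic   # must be admissible: never overestimate

    def path_cost(self, node):
        # g(n): number of steps taken so far, counted by walking to the start.
        g = 0
        while node.parent is not None:
            g += 1
            node = node.parent
        return g

    def remove(self):
        if self.empty():
            raise Exception("empty frontier")
        best = min(self.frontier,
                   key=lambda node: self.path_cost(node) + self.heuristic(node.state))
        self.frontier.remove(best)
        return best
```

On a grid, Manhattan distance never overestimates, since walls can only add steps, so it is an admissible heuristic for this maze problem.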
Sometimes in search situations, though, we’ll enter an adversarial situation, where I am an agent trying to make intelligent decisions, and there’s someone else fighting against me, so to speak, who has the opposite objective: I am trying to succeed, and someone else wants me to fail. This is most popular in something like a game, a game like tic-tac-toe, where we’ve got this 3 by 3 grid, and x and o take turns writing an x or an o in any one of these squares. And the goal is to get three x’s in a row if you’re the x player, or three o’s in a row if you’re the o player. And computers have gotten quite good at playing games: tic-tac-toe very easily, but even more complex games as well. So you might imagine, what does an intelligent decision in a game look like? Maybe x makes an initial move in the middle, and o plays up here. What does an intelligent move for x now become? Where should you move if you were x? It turns out there are a couple of possibilities. But if an AI is playing this game optimally, then the AI might play somewhere like the upper right. In this situation, o has the opposite objective of x: x is trying to win the game by getting three in a row diagonally here, and o is trying to stop that objective. And so o is going to place here to try to block. But now, x has a pretty clever move. x can make a move like this, where now x has two possible ways to win the game: x could win by getting three in a row across here, or x could win by getting three in a row vertically this way. So it doesn’t matter where o makes their next move. o could play here, for example, blocking the three in a row horizontally, but then x is going to win the game by getting three in a row vertically. And so there’s a fair amount of reasoning going on here in order for the computer to be able to solve a problem. And it’s similar in spirit to the problems we’ve looked at so far: there are actions, there’s some sort of state of the board, and there’s some transition from one action to the next. But it’s different in the sense that this is not just a classical search problem anymore; it’s an adversarial search problem. I, as the x player, am trying to find the best moves to make, but I know that there is some adversary trying to stop me. So we need some sort of algorithm to deal with these adversarial types of search situations. And the algorithm we’re going to take a look at is called Minimax, which works very well for deterministic games with two players. It can work for other types of games as well, but we’ll look right now at games where I make a move, then my opponent makes a move, and I am trying to win while my opponent is trying to win also, or, in other words, my opponent is trying to get me to lose. So what do we need in order to make this algorithm work? Well, any time we try to translate this human concept of playing a game, winning, and losing to a computer, we want to translate it into terms that the computer can understand. Ultimately, the computer really just understands numbers. And so we want some way of translating a game of x’s and o’s on a grid to something numerical, something the computer can understand. The computer doesn’t normally understand notions of win or lose, but it does understand the concept of bigger and smaller.
And so what we might do is take each of the possible ways that a tic-tac-toe game can unfold and assign a value, or a utility, to each one of those possible outcomes. In a tic-tac-toe game, and in many types of games, there are three possible outcomes: o wins, x wins, or nobody wins. So player one wins, player two wins, or nobody wins. And for now, let’s go ahead and assign each of these possible outcomes a different value. We’ll say o winning has a value of negative 1, nobody winning has a value of 0, and x winning has a value of 1. So we’ve just assigned numbers to each of these three possible outcomes. And now we have two players, the x player and the o player. We’re going to call the x player the max player, and we’ll call the o player the min player. And the reason why is that in the minimax algorithm, the max player, which in this case is x, is aiming to maximize the score. The possible options for the score are negative 1, 0, and 1. x wants to maximize the score, meaning, if at all possible, x would like the situation where x wins the game, which we give a score of 1. But if this isn’t possible, if x needs to choose between negative 1, meaning o winning, and 0, meaning nobody winning, x would rather that nobody wins, a score of 0, than have a score of negative 1, o winning. So this notion of winning and losing and tying has been reduced mathematically to just this idea of trying to maximize the score: the x player always wants the score to be bigger. And on the flip side, the min player, in this case o, is aiming to minimize the score: the o player wants the score to be as small as possible. So now we’ve taken this game of x’s and o’s and winning and losing and turned it into something mathematical, where x is trying to maximize the score and o is trying to minimize the score. Let’s now look at all of the parts of the game that we need in order to encode it in an AI, so that an AI can play a game like tic-tac-toe. The game is going to need a couple of things. We’ll need some sort of initial state, which we’ll call s0 in this case: how the game begins, like an empty tic-tac-toe board, for example. We’ll also need a function called player, where the player function takes as input a state, here represented by s, and the output of the player function is which player’s turn it is. We need to be able to give a tic-tac-toe board to the computer, run it through a function, and have that function tell us whose turn it is. We’ll need some notion of actions that we can take; we’ll see examples of that in just a moment. We need some notion of a transition model, same as before: if I have a state and I take an action, I need to know what results as a consequence. I need some way of knowing when the game is over: this is equivalent to kind of like a goal test, but here I need a terminal test, some way to check whether a state is a terminal state, where a terminal state means the game is over. In a classic game of tic-tac-toe, a terminal state means either someone has gotten three in a row or all of the squares of the board are filled. Either of those conditions makes it a terminal state. In a game of chess, it might be something like when there is checkmate, or if checkmate is no longer possible, that becomes a terminal state.
And then finally, we’ll need a utility function: a function that takes a state and gives us a numerical value for that terminal state, some way of saying, if x wins the game, that has a value of 1; if o has won the game, that has a value of negative 1; and if nobody has won the game, that has a value of 0. So let’s take a look at each of these in turn. The initial state, we can just represent in tic-tac-toe as the empty game board. This is where we begin; it’s the place from which we begin this search. And again, I’ll be representing these things visually, but you can imagine this really just being something like a two-dimensional array of all of these possible squares. Then we need the player function that, again, takes a state and tells us whose turn it is. Assuming x makes the first move, if I have an empty game board, then my player function is going to return x. And if I have a game board where x has made a move, then my player function is going to return o. The player function takes a tic-tac-toe game board and tells us whose turn it is. Next up, we’ll consider the actions function. The actions function, much like it did in classical search, takes a state and gives us the set of all of the possible actions we can take in that state. So let’s imagine it’s o’s turn to move in a game board that looks like this. What happens when we pass it into the actions function? The actions function takes this state of the game as input, and the output is a set of possible actions. It’s a set: I could move in the upper left, or I could move in the bottom middle. Those are the two possible action choices I have when I begin in this particular state. Now, just as before, when we had states and actions, we need some sort of transition model to tell us, when we take this action in this state, what new state we get. And here, we define that using the result function, which takes a state as input as well as an action. When we apply the result function to this state, saying let o move in this upper left corner, the new state we get is this resulting state where o is in the upper left corner. Now, this seems obvious to someone who knows how to play tic-tac-toe: of course, you play in the upper left corner, that’s the board you get. But all of this information needs to be encoded into the AI. The AI doesn’t know how to play tic-tac-toe until you tell the AI how the rules of tic-tac-toe work. Defining this function allows us to tell the AI how this game actually works and how actions actually affect the outcome of the game. So the AI needs to know how the game works. The AI also needs to know when the game is over, by defining a function called terminal that takes as input a state s, such that if we take a game that is not yet over and pass it into the terminal function, the output is false: the game is not over. But if we take a game that is over, because x has gotten three in a row along the diagonal, and pass that into the terminal function, then the output is going to be true, because the game now is, in fact, over. And finally, we’ve told the AI how the game works, in terms of what moves can be made and what happens when you make those moves, and we’ve told the AI when the game is over. Now we need to tell the AI what the value of each of those states is. And we do that by defining this utility function that takes a state s and tells us the score, or utility, of that state.
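Putting those pieces together for tic-tac-toe, the game definition might look like this (a sketch; the board encoding and helper names here are assumptions for illustration, not the course’s exact code):

```python
import copy

X, O, EMPTY = "X", "O", None

def initial_state():
    """s0: an empty 3x3 board."""
    return [[EMPTY] * 3 for _ in range(3)]

def player(board):
    """Returns whose turn it is; X moves first, so X plays when counts are equal."""
    x_count = sum(row.count(X) for row in board)
    o_count = sum(row.count(O) for row in board)
    return X if x_count == o_count else O

def actions(board):
    """Returns the set of (i, j) cells still available."""
    return {(i, j) for i in range(3) for j in range(3) if board[i][j] == EMPTY}

def result(board, action):
    """Transition model: the board that results from the current player moving."""
    i, j = action
    new_board = copy.deepcopy(board)
    new_board[i][j] = player(board)
    return new_board

def winner(board):
    """Returns X or O if someone has three in a row, otherwise None."""
    lines = [list(row) for row in board]                   # rows
    lines += [list(col) for col in zip(*board)]            # columns
    lines += [[board[i][i] for i in range(3)],
              [board[i][2 - i] for i in range(3)]]         # diagonals
    for line in lines:
        if line[0] is not EMPTY and line.count(line[0]) == 3:
            return line[0]
    return None

def terminal(board):
    """The game is over if someone has won or every square is filled."""
    return winner(board) is not None or not actions(board)

def utility(board):
    """1 if X has won, -1 if O has won, 0 otherwise."""
    w = winner(board)
    return 1 if w == X else -1 if w == O else 0
```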
So again, we said that if x wins the game, that utility is a value of 1, whereas if o wins the game, then the utility of that is negative 1. And the AI needs to know, for each of these terminal states where the game is over, what is the utility of that state? So if I give you a game board like this where the game is, in fact, over, and I ask the AI to tell me what the value of that state is, it could do so. The value of the state is 1. Where things get interesting, though, is if the game is not yet over. Let’s imagine a game board like this, where in the middle of the game, it’s o’s turn to make a move. So how do we know it’s o’s turn to make a move? We can calculate that using the player function. We can say player of s, pass in the state, o is the answer. So we know it’s o’s turn to move. And now, what is the value of this board and what action should o take? Well, that’s going to depend. We have to do some calculation here. And this is where the minimax algorithm really comes in. Recall that x is trying to maximize the score, which means that o is trying to minimize the score. So o would like to minimize the total value that we get at the end of the game. And because this game isn’t over yet, we don’t really know just yet what the value of this game board is. We have to do some calculation in order to figure that out. And so how do we do that kind of calculation? Well, in order to do so, we’re going to consider, just as we might in a classical search situation, what actions could happen next and what states will that take us to. And it turns out that in this position, there are only two open squares, which means there are only two open places where o can make a move. o could either make a move in the upper left or o can make a move in the bottom middle. And minimax doesn’t know right out of the box which of those moves is going to be better. So it’s going to consider both. But now, we sort of run into the same situation. Now, I have two more game boards, neither of which is over. What happens next? And now, it’s in this sense that minimax is what we’ll call a recursive algorithm. It’s going to now repeat the exact same process, although now considering it from the opposite perspective. It’s as if I am now going to put myself, if I am the o player, I’m going to put myself in my opponent’s shoes, my opponent as the x player, and consider what would my opponent do if they were in this position? What would my opponent do, the x player, if they were in that position? And what would then happen? Well, the other player, my opponent, the x player, is trying to maximize the score, whereas I am trying to minimize the score as the o player. So x is trying to find the maximum possible value that they can get. And so what’s going to happen? Well, from this board position, x only has one choice. x is going to play here, and they’re going to get three in a row. And we know that that board, x winning, that has a value of 1. If x wins the game, the value of that game board is 1. And so from this position, if this state can only ever lead to this state, it’s the only possible option, and this state has a value of 1, then the maximum possible value that the x player can get from this game board is also 1. From here, the only place we can get is to a game with a value of 1, so this game board also has a value of 1. Now we consider this one over here. What’s going to happen now? Well, x needs to make a move. The only move x can make is in the upper left, so x will go there. 
And in this game, no one wins; nobody has three in a row. And so the value of that game board is 0, since nobody has won. And so again, by the same logic, if from this board position the only place we can get to is a board where the value is 0, then this state must also have a value of 0. And now here comes the choice part, the idea of trying to minimize. I, as the o player, now know that if I make this choice, moving in the upper left, that is going to result in a game with a value of 1, assuming everyone plays optimally. And if I instead play in the lower middle, choosing this fork in the road, that is going to result in a game board with a value of 0. I have two options, a 1 and a 0 to choose from, and I need to pick. And as the min player, I would rather choose the option with the minimum value. So whenever a player has multiple choices, the min player will choose the option with the smallest value, and the max player will choose the option with the largest value. Between the 1 and the 0, the 0 is smaller, meaning I’d rather tie the game than lose the game. And so this game board, we’ll say, also has a value of 0, because if I am playing optimally, I will pick this fork in the road: I’ll place my o here to block x’s three in a row, x will move in the upper left, and the game will be over, and no one will have won. So this is the logic of minimax: to consider all of the possible actions that I can take, and then to put myself in my opponent’s shoes. I decide what move I’m going to make now by considering what move my opponent will make on the next turn. And to do that, I consider what move I would make on the turn after that, so on and so forth, until I get all the way down to the end of the game, to one of these so-called terminal states. In fact, this very decision point, where I, as the o player, am trying to decide what to do, might itself have been part of the logic that the x player, my opponent, was using the move before me. This might be part of some larger tree, where x is trying to make a move in this situation and needs to pick between three different options in order to make a decision about what happens. And the further away we are from the end of the game, the deeper this tree has to go, because every level in this tree is going to correspond to one move: one move or action that I take, then one move or action that my opponent takes, in order to decide what happens. And in fact, it turns out that if I am the x player in this position and I recursively do this logic, I see that I have three choices, one of which leads to a value of 0: if I play here, and everyone plays optimally, the game will be a tie. If I play here, then o is going to win, and I’ll lose, playing optimally. Or I can play here, where I, the x player, can win. Well, between a score of 0, negative 1, and 1, I’d rather pick the board with a value of 1, because that’s the maximum value I can get. And so this board would also have a value of 1. And so this tree can get very, very deep, especially as the game starts to have more and more moves. And this logic works not just for tic-tac-toe but for any of these sorts of games where I make a move, my opponent makes a move, and ultimately we have these adversarial objectives. And we can simplify the diagram into one that looks like this.
This is a more abstract version of the minimax tree, where these are each states, but I’m no longer representing them as tic-tac-toe boards; this is just representing some generic game that might be tic-tac-toe or might be some other game altogether. Any of these green arrows pointing up represents a maximizing state: I would like the score to be as big as possible. And any of these red arrows pointing down represents a minimizing state, where the player is the min player and is trying to make the score as small as possible. So imagine in this situation that I am the maximizing player, this player here, and I have three choices. One choice gives me a score of 5, one gives me a score of 3, and one gives me a score of 9. Well, then, between those three choices, my best option is to choose this 9 over here, the maximum of all three options. And so I can give this state a value of 9, because among my three options, that is the best choice available to me. So that’s my decision now; you can imagine it’s like one move away from the end of the game. But then you could also ask a reasonable question: what might my opponent do two moves away from the end of the game? My opponent is the minimizing player, trying to make the score as small as possible. Imagine what would happen if they had to pick which choice to make. One choice leads to this state, where I, the maximizing player, am going to opt for 9, the biggest score that I can get. And one leads to this state, where I, the maximizing player, would choose 8, which is then the largest score I can get there. Now the minimizing player, forced to choose between a 9 and an 8, is going to choose the smallest possible score, which in this case is the 8. And that is how this process would unfold: the minimizing player, in this case, considers both of their options and all of the options that would happen as a result of those. So this is a general picture of what the minimax algorithm looks like. Let’s now try to formalize it using a little bit of pseudocode. What exactly is happening in the minimax algorithm? Well, given a state s, we need to decide what happens. If it’s the max player’s turn, then max is going to pick an action a in actions of s. Recall that actions is a function that takes a state and gives me back all of the possible actions I can take; it tells me all of the moves that are possible. The max player is going to specifically pick the action a in this set of actions that gives me the highest value of min value of result of s and a. So what does that mean? Well, it means that I want to pick the option that gives me the highest score, out of all of the actions a. But what score will an action have? To calculate that, I need to know what my opponent, the min player, is going to do when they try to minimize the value of the state that results. So we ask: what state results after I take this action, and what happens when the min player tries to minimize the value of that state? I consider that for all of my possible options, and after I’ve considered it for all of my possible options, I pick the action a that has the highest value. Likewise, the min player is going to do the same thing, but backwards: they’re also going to consider all of the possible actions they can take if it’s their turn, and they’re going to pick the action a that has the smallest possible value of all the options.
And the way they know the smallest possible value of all the options is by considering what the max player is going to do: asking, what’s the result of applying this action to the current state, and what value would the max player calculate for that particular state? So everyone makes their decision based on trying to estimate what the other person would do. And now we need to turn our attention to these two functions, max value and min value. How do you actually calculate the value of a state if you’re trying to maximize its value? And how do you calculate the value of a state if you’re trying to minimize it? If we can do that, then we have an entire implementation of this minimax algorithm. So let’s try it. Let’s try to implement this max value function, which takes a state and returns as output the value of that state if I’m trying to maximize it. Well, the first thing I can check is whether the game is over, because if the game is over, in other words, if the state is a terminal state, then this is easy: I already have this utility function that tells me what the value of the board is. If the game is over, I just check: did x win, did o win, is it a tie? The utility function just knows the value of the state. What’s trickier is if the game isn’t over, because then I need to do this recursive reasoning, thinking about what my opponent is going to do on the next move. I want to calculate the value of this state, and I want the value of the state to be as high as possible. I’ll keep track of that value in a variable called v. And if I want the value to be as high as possible, I need to give v an initial value, and initially, I’ll just go ahead and set it to be as low as possible, because I don’t know what options are available to me yet. So initially, I’ll set v equal to negative infinity, which seems a little bit strange, but the idea is that I want the value initially to be as low as possible, because as I consider my actions, I’m always going to try to do better than v, and if I set v to negative infinity, I know I can always do better than that. So now I consider my actions, and this is going to be some kind of loop where, for every action in actions of state (recall, actions is a function that takes my state and gives me all the possible actions I can take in that state), for each one of those actions, I’m going to say: v is equal to the maximum of v and this expression. So what is this expression? Well, first, get the result of taking the action in the state, and then get the min value of that. In other words, from that resulting state, what is the best the min player can do, since they’re going to try to minimize the score? Whatever that resulting score is, the min value of that state, I compare it to my current best value and pick the maximum of those two, because I am trying to maximize the value. In short, what these three lines of code are doing is going through all of my possible actions and asking: how do I maximize the score, given what my opponent is going to try to do? After this entire loop, I can just return v, and that is now the value of that particular state. And for the min player, it’s the exact opposite, the same logic just backwards. To calculate the minimum value of a state, first we check if it’s a terminal state. If it is, we return its utility.
Otherwise, we’re going to try to minimize the value of the state, given all of my possible actions. So I need an initial value for v, the value of the state, and initially, I’ll set it to infinity, because I know I can always get something less than infinity. By starting with v equals infinity, I make sure that the very first action value I find will be less than this value of v. And then I do the same thing: loop over all of my possible actions, and for each of the results we could get when the max player makes their decision, take the minimum of that and the current value of v. So after all is said and done, I get the smallest possible value of v, which I then return back to the user. So that, in effect, is the pseudocode for minimax. That is how we take a game and figure out the best move to make, by recursively using these max value and min value functions, where max value calls min value and min value calls max value, back and forth, all the way until we reach a terminal state, at which point our algorithm can simply return the utility of that particular state. What you might imagine is that this is going to start to be a long process, especially as games start to get more complex, as we start to add more moves and more possible options, and as games start to last quite a bit longer. So the next question to ask is: what sort of optimizations can we make here? How can we do better, using less space or less time, to be able to solve this kind of problem? We’ll take a look at a couple of possible optimizations. For one, we’ll take a look at this example. Again, returning to these up arrows and down arrows, let’s imagine that I am now the max player, this green arrow. I am trying to make this score as high as possible. And this is an easy game where there are just two moves: I make a move, one of these three options, and then my opponent makes a move, one of these three options, based on what move I make. And as a result, we get some value. Let’s look at the order in which I do these calculations and figure out if there are any optimizations I might be able to make to this calculation process. I’m going to have to look at these states one at a time. So let’s say I start here on the left and consider: what will the min player, my opponent, try to do here? Well, the min player is going to look at all three of their possible actions and their values, because these are terminal states, the end of the game. And so they’ll see: this node has a value of four, this one a value of eight, this one a value of five. And the min player is going to say, well, all right, between these three options, four, eight, and five, I’ll take the smallest one: the four. So this state now has a value of four. Then I, as the max player, say, all right, if I take this action, it will have a value of four. That’s the best I can do, because the min player is going to try to minimize my score. So now, what if I take this option? We’ll explore this next, and now explore what the min player would do if I choose this action. And the min player is going to say, all right, what are the three options? The min player has options between nine, three, and seven, and three is the smallest among nine, three, and seven, so we’ll go ahead and say this state has a value of three. So now I, as the max player, have explored two of my three options. I know that one of my options will guarantee me a score of four, at least.
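Before finishing that walkthrough, here is the minimax pseudocode just described, written out as Python (a sketch reusing the hypothetical tic-tac-toe functions from the earlier sketch):

```python
import math

def minimax(board):
    """Returns the optimal action for the player whose turn it is."""
    if terminal(board):
        return None
    if player(board) == X:
        # Max picks the action whose resulting state has the highest min value.
        return max(actions(board), key=lambda a: min_value(result(board, a)))
    else:
        # Min picks the action whose resulting state has the lowest max value.
        return min(actions(board), key=lambda a: max_value(result(board, a)))

def max_value(board):
    if terminal(board):
        return utility(board)
    v = -math.inf   # start as low as possible; any real option beats this
    for action in actions(board):
        v = max(v, min_value(result(board, action)))
    return v

def min_value(board):
    if terminal(board):
        return utility(board)
    v = math.inf    # start as high as possible; any real option beats this
    for action in actions(board):
        v = min(v, max_value(result(board, action)))
    return v
```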
And one of my options will guarantee me a score of three. And now I consider my third option and say, all right, what happens here? Same exact logic. The min player is going to look at these three states, two, four, and six, and say the minimum possible option is two. So the min player wants the two. Now I, as the max player, have calculated all of the information by looking two layers deep, by looking at all of these nodes. And I can now say, between the four, the three, and the two, you know what? I’d rather take the four. Because if I choose this option, if my opponent plays optimally, they will try and get me to the four. But that’s the best I can do. I can’t guarantee a higher score. Because if I pick either of these two options, I might get a three or I might get a two. And it’s true that down here is a nine, and that’s the highest score out of any of the scores. So I might be tempted to say, you know what? Maybe I should take this option, because I might get the nine. But if the min player is playing intelligently, if they’re making the best moves at each possible option they have when they get to make a choice, I’ll be left with a three. Whereas, playing optimally, I could have guaranteed that I would get the four. So that is, in effect, the logic that I would use as the max player in Minimax, trying to maximize my score from that node there. But it turns out it took quite a bit of computation for me to figure that out. I had to reason through all of these nodes in order to draw this conclusion. And this is for a pretty simple game where I have three choices, my opponent has three choices, and then the game’s over. So what I’d like to do is come up with some way to optimize this. Maybe I don’t need to do all of this calculation to still reach the conclusion that, you know what, this action to the left, that’s the best that I could do. Let’s go ahead and try again and try to be a little more intelligent about how I go about doing this. So first, I start the exact same way. I don’t know what to do initially, so I just have to consider one of the options and consider what the min player might do. Min has three options, four, eight, and five. And between those three options, min says four is the best they can do, because they want to try to minimize the score. Now I, the max player, will consider my second option, making this move here, and considering what my opponent would do in response. What will the min player do? Well, the min player is going to, from that state, look at their options. And they would say, all right, nine is an option, three is an option. And if I am doing the math from this initial state, doing all this calculation, when I see a three, that should immediately be a red flag for me. Because when I see a three down here at this state, I know that the value of this state is going to be at most three. It’s going to be three or something less than three, even though I haven’t yet looked at this last action, or even further actions if there were more actions that could be taken here. How do I know that? Well, I know that the min player is going to try to minimize my score. And if they see a three, the only way this could be something other than a three is if this remaining thing that I haven’t yet looked at is less than three, which means there is no way for this value to be anything more than three, because the min player can already guarantee a three and they are trying to minimize my score. So what does that tell me?
Well, it tells me that if I choose this action, my score is going to be three, or maybe even less than three if I’m unlucky. But I already know that this action will guarantee me a four. And so given that I know that this action guarantees me a score of four, and this action means I can’t do better than three, if I’m trying to maximize my options, there is no need for me to consider this triangle here. There is no value, no number that could go here, that would change my mind between these two options. I’m always going to opt for this path that gets me a four, as opposed to this path where the best I can do is a three if my opponent plays optimally. And this is going to be true for all the future states that I look at, too. If I look over here at what the min player might do over here, and I see that this state is a two, I know that this state is at most a two, because the only way this value could be something other than two is if one of these remaining states is less than a two, and so the min player would opt for that instead. So even without looking at these remaining states, I, as the maximizing player, can know that choosing this path on the left is going to be better than choosing either of those two paths on the right, because this one can’t be better than three, this one can’t be better than two, and so four in this case is the best that I can do. So I can effectively cut off this part of the search, and I can say now that this state has a value of four. In order to do this type of calculation, I was doing a little bit more bookkeeping, keeping track of things, keeping track all the time of the best that I can do and the worst that I can do, and for each of these states saying, all right, well, if I already know that I can get a four, then if the best I can do at this state is a three, there’s no reason for me to consider it. I can effectively prune this leaf and anything below it from the tree. And it’s for that reason that this approach, this optimization to Minimax, is called alpha-beta pruning. Alpha and beta stand for the two values that you’ll have to keep track of: the best you can do so far and the worst you can do so far. And pruning is the idea that if I have a big, long, deep search tree, I might be able to search it more efficiently if I don’t need to search through everything, if I can remove some of the nodes to try and optimize the way that I look through this entire search space. So alpha-beta pruning can definitely save us a lot of time as we go about the search process by making our searches more efficient. But even then, it’s still not great as games get more complex. Tic-tac-toe, fortunately, is a relatively simple game. And we might reasonably ask a question like, how many total possible tic-tac-toe games are there? You can think about it. You can try and estimate how many moves there are at any given point, how many moves long the game can last. It turns out there are about 255,000 possible tic-tac-toe games that can be played. But compare that to a more complex game, something like a game of chess, for example. Far more pieces, far more moves, games that last much longer. How many total possible chess games could there be? It turns out that after just four moves each, four moves by the white player and four moves by the black player, there are 288 billion possible chess games that can result from that situation, after just four moves each.
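As a rough sketch, that bookkeeping might look like this in Python, with alpha tracking the best value the max player can already guarantee on the current path and beta the best the min player can guarantee (the terminal, utility, actions, and result helpers are again placeholders assumed to come from the game):

```python
import math

# A sketch of Minimax with alpha-beta pruning.
# alpha = best value the max player can already guarantee on this path
# beta  = best value the min player can already guarantee on this path

def max_value(state, alpha=-math.inf, beta=math.inf):
    if terminal(state):
        return utility(state)
    v = -math.inf
    for action in actions(state):
        v = max(v, min_value(result(state, action), alpha, beta))
        alpha = max(alpha, v)
        if alpha >= beta:
            break  # the min player will never let us reach this branch: prune it
    return v

def min_value(state, alpha=-math.inf, beta=math.inf):
    if terminal(state):
        return utility(state)
    v = math.inf
    for action in actions(state):
        v = min(v, max_value(result(state, action), alpha, beta))
        beta = min(beta, v)
        if beta <= alpha:
            break  # the max player already has something better elsewhere
    return v
```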
And going even further, if you look at entire chess games and how many possible chess games there could be as a result, there are more than 10 to the 29,000 possible chess games, far more chess games than could ever be considered. And this is a pretty big problem for the Minimax algorithm, because the Minimax algorithm starts with an initial state, considers all the possible actions, and all the possible actions after that, all the way until we get to the end of the game. And that’s going to be a problem if the computer is going to need to look through this many states, which is far more than any computer could ever do in any reasonable amount of time. So what do we do in order to solve this problem? Instead of looking through all these states, which is totally intractable for a computer, we need some better approach. And it turns out that better approach generally takes the form of something called depth-limited Minimax, whereas normally Minimax is depth-unlimited: we just keep going layer after layer, move after move, until we get to the end of the game. Depth-limited Minimax is instead going to say, you know what, after a certain number of moves, maybe I’ll look 10 moves ahead, maybe I’ll look 12 moves ahead, but after that point, I’m going to stop and not consider additional moves that might come after that, just because it would be computationally intractable to consider all of those possible options. But what do we do after we get 10 or 12 moves deep, when we arrive at a situation where the game’s not over? Minimax still needs a way to assign a score to that game board or game state to figure out what its current value is, which is easy to do if the game is over, but not so easy to do if the game is not yet over. So in order to do that, we need to add one additional feature to depth-limited Minimax called an evaluation function, which is just some function that is going to estimate the expected utility of a game from a given state. So in a game like chess, if you imagine that a game value of 1 means white wins, negative 1 means black wins, and 0 means it’s a draw, then you might imagine that a score of 0.8 means white is very likely to win, though certainly not guaranteed. And you would have an evaluation function that estimates how good the game state happens to be. And depending on how good that evaluation function is, that is ultimately what’s going to constrain how good the AI is. The better the AI is at estimating how good or how bad any particular game state is, the better the AI is going to be able to play that game. If the evaluation function is worse at estimating what the expected utility is, then it’s going to be a whole lot harder for the AI to play well. And you can imagine trying to come up with these evaluation functions. In chess, for example, you might write an evaluation function based on how many pieces you have as compared to how many pieces your opponent has, because each one has a value. And your evaluation function probably needs to be a little bit more complicated than that, to consider other possible situations that might arise as well. And there are many other variants on Minimax that add additional features in order to help it perform better under these larger, more computationally intractable situations where we couldn’t possibly explore all of the possible moves. So we need to figure out how to use evaluation functions and other techniques to be able to play these games ultimately better.
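One way depth-limited Minimax with an evaluation function might be sketched, assuming a hypothetical evaluate(state) that returns an estimate in the same range as utility (for example, roughly between negative 1 and 1 in a chess-like game):

```python
import math

# Sketch of depth-limited Minimax: when the depth budget runs out on a
# non-terminal state, a hypothetical evaluate(state) estimates the expected
# utility instead of searching deeper. min_value mirrors max_value.

def max_value(state, depth):
    if terminal(state):
        return utility(state)
    if depth == 0:
        return evaluate(state)  # estimate rather than search further
    v = -math.inf
    for action in actions(state):
        v = max(v, min_value(result(state, action), depth - 1))
    return v

def min_value(state, depth):
    if terminal(state):
        return utility(state)
    if depth == 0:
        return evaluate(state)
    v = math.inf
    for action in actions(state):
        v = min(v, max_value(result(state, action), depth - 1))
    return v
```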
But this now was a look at this kind of adversarial search, these search problems where we have situations where I am trying to play against some sort of opponent. And these search problems show up all over the place throughout artificial intelligence. We’ve been talking a lot today about more classical search problems, like trying to find directions from one location to another. But any time an AI is faced with trying to make a decision, like what do I do now in order to do something that is rational, or do something that is intelligent, or trying to play a game, like figuring out what move to make, these sorts of algorithms can really come in handy. It turns out that for tic-tac-toe, the solution is pretty simple, because it’s a small game. XKCD famously put together a web comic that tells you exactly the optimal move to make, no matter what your opponent happens to do. This type of thing is not quite possible for a much larger game like checkers or chess, for example, where chess is totally computationally intractable for most computers to be able to explore all the possible states. So we really need our AI to be far more intelligent about how it goes about trying to deal with these problems and how it goes about taking this environment that it finds itself in and ultimately searching for one of these solutions. So this, then, was a look at search in artificial intelligence. Next time, we’ll take a look at knowledge, thinking about how it is that our AIs are able to know information, reason about that information, and draw conclusions, all in our look at AI and the principles behind it. We’ll see you next time. [“AIMS INTRO MUSIC”] All right, welcome back, everyone, to an introduction to artificial intelligence with Python. Last time, we took a look at search problems, in particular, where we have AI agents that are trying to solve some sort of problem by taking actions in some sort of environment, whether that environment is trying to take actions by playing moves in a game, or whether those actions are something like trying to figure out where to make turns in order to get driving directions from point A to point B. This time, we’re going to turn our attention more generally to just this idea of knowledge, the idea that a lot of intelligence is based on knowledge, especially if we think about human intelligence. People know information. We know facts about the world. And using that information that we know, we’re able to draw conclusions, reason about the information that we know in order to figure out how to do something, or figure out some other piece of information that we conclude based on the information we already have available to us. What we’d like to focus on now is the ability to take this idea of knowledge and being able to reason based on knowledge and apply those ideas to artificial intelligence. In particular, we’re going to be building what are known as knowledge-based agents, agents that are able to reason and act by representing knowledge internally. Somehow inside of our AI, they have some understanding of what it means to know something. And ideally, they have some algorithms or some techniques they can use based on that knowledge that they know in order to figure out the solution to a problem, or figure out some additional piece of information that can be helpful in some sense. So what do we mean by reasoning based on knowledge to be able to draw conclusions?
Well, let’s look at a simple example drawn from the world of Harry Potter. We take one sentence that we know to be true: if it didn’t rain, then Harry visited Hagrid today. So that’s one fact that we might know about the world. And then we take another fact: Harry visited Hagrid or Dumbledore today, but not both. So it tells us something about the world, that Harry either visited Hagrid but not Dumbledore, or Harry visited Dumbledore but not Hagrid. And now we have a third piece of information about the world, that Harry visited Dumbledore today. So we now have three pieces of information, three facts, inside of a knowledge base, so to speak: information that we know. And now we, as humans, can try and reason about this and figure out, based on this information, what additional information can we begin to conclude? Well, looking at these last two statements, Harry either visited Hagrid or Dumbledore but not both, and we know that Harry visited Dumbledore today. Well, then it’s pretty reasonable that we could draw the conclusion that, you know what, Harry must not have visited Hagrid today. Because based on a combination of these two statements, we can draw this inference, so to speak, a conclusion that Harry did not visit Hagrid today. But it turns out we can even do a little bit better than that, get some more information, by taking a look at this first statement and reasoning about that. This first statement says, if it didn’t rain, then Harry visited Hagrid today. So what does that mean? In all cases where it didn’t rain, we know that Harry visited Hagrid. But if we also know now that Harry did not visit Hagrid, then that tells us something about our initial premise that we were thinking about. In particular, it tells us that it did rain today, because we can reason that if it didn’t rain, Harry would have visited Hagrid, but we know for a fact that Harry did not visit Hagrid today. So it’s this kind of reasoning, this sort of logical reasoning, where we use logic based on the information that we know in order to take information and reach conclusions, that is going to be the focus of what we’re going to be talking about today. How can we make our artificial intelligence logical, so that they can perform the same kinds of deduction, the same kinds of reasoning, that we’ve been doing so far? Of course, humans reason about logic generally in terms of human language: I just now was speaking in English, talking in English about these sentences and trying to reason through how it is that they relate to one another. We’re going to need to be a little bit more formal when we turn our attention to computers and being able to encode this notion of logic and truth and falsehood inside of a machine. So we’re going to need to introduce a few more terms and a few symbols that will help us reason through this idea of logic inside of an artificial intelligence. And we’ll begin with the idea of a sentence. Now, a sentence in a natural language like English is just something that I’m saying, like what I’m saying right now. In the context of AI, though, a sentence is just an assertion about the world in what we’re going to call a knowledge representation language, some way of representing knowledge inside of our computers. And the way that we’re going to spend most of today reasoning about knowledge is through a type of logic known as propositional logic. There are a number of different types of logic, some of which we’ll touch on.
But propositional logic is based on a logic of propositions, or just statements about the world. And so we begin in propositional logic with a notion of propositional symbols. We will have certain symbols that are oftentimes just letters, something like P or Q or R, where each of those symbols is going to represent some fact or sentence about the world. So P, for example, might represent the fact that it is raining. And so P is going to be a symbol that represents that idea. And Q, for example, might represent Harry visited Hagrid today. Each of these propositional symbols represents some sentence or some fact about the world. But in addition to just having individual facts about the world, we want some way to connect these propositional symbols together in order to reason more complexly about other facts that might exist inside of the world in which we’re reasoning. So in order to do that, we’ll need to introduce some additional symbols that are known as logical connectives. Now, there are a number of these logical connectives. But five of the most important, and the ones we’re going to focus on today, are these five up here, each represented by a logical symbol: not is represented by the ¬ symbol, and is represented by sort of an upside-down V shape (∧), and or is represented by a V shape (∨). Implication, and we’ll talk about what that means in just a moment, is represented by an arrow (→). And the biconditional, again, we’ll talk about what that means in a moment, is represented by a double arrow (↔). But these five logical connectives are the main ones we’re going to be focusing on in terms of thinking about how it is that a computer can reason about facts and draw conclusions based on the facts that it knows. But in order to get there, we need to take a look at each of these logical connectives and build up an understanding of what it is that they actually mean. So let’s go ahead and begin with the not symbol. And what we’re going to show for each of these logical connectives is what we’re going to call a truth table, a table that demonstrates what this word not means when we attach it to a propositional symbol or any sentence inside of our logical language. And so the truth table for not is shown right here. If P, some propositional symbol, or some other sentence even, is false, then not P is true. And if P is true, then not P is false. So you can imagine that placing this not symbol in front of some sentence of propositional logic just says the opposite of that. So if, for example, P represented it is raining, then not P would represent the idea that it is not raining. And as you might expect, if P is false, meaning if the sentence, it is raining, is false, well then the sentence not P must be true. The sentence that it is not raining is therefore true. So not, you can imagine, just takes whatever is in P and inverts it. It turns false into true and true into false, much analogously to what the English word not means, just taking whatever comes after it and inverting it to mean the opposite. Next up, and also very English-like, is this idea of and, represented by this upside-down V shape. And, as opposed to just taking a single argument the way not does (we have P, and we have not P), and is going to combine two different sentences in propositional logic together. So I might have one sentence P and another sentence Q, and I want to combine them together to say P and Q.
And the general logic for what P and Q means is that both of its operands are true: P is true and also Q is true. And so here’s what that truth table looks like. This time we have two variables, P and Q. And when we have two variables, each of which can be in two possible states, true or false, that leads to two squared, or four, possible combinations of truth and falsehood. So we have P is false and Q is false. We have P is false and Q is true. P is true and Q is false. And then P and Q both are true. And those are the only four possibilities for what P and Q could mean. And in each of those situations, this third column here, P and Q, is telling us a little bit about what it actually means for P and Q to be true. And we see that the only case where P and Q is true is in this fourth row here, where P happens to be true and Q also happens to be true. In all other situations, P and Q is going to evaluate to false. So this, again, is much in line with what our intuition of and might mean. If I say P and Q, I probably mean that I expect both P and Q to be true. Next up, also potentially consistent with what we mean, is this word or, represented by this V shape, sort of an upside-down and symbol. And or, as the name might suggest, is true if either of its arguments is true: as long as P is true or Q is true, then P or Q is going to be true. Which means the only time that P or Q is false is if both of its operands are false. If P is false and Q is false, then P or Q is going to be false. But in all other cases, at least one of the operands is true, and maybe they’re both true, in which case P or Q is going to evaluate to true. Now, this is mostly consistent with the way that most people might use the word or, in the sense of speaking the word or in normal English, though there are sometimes when we might say or where we mean P or Q but not both, where we mean it can only be one or the other. It’s important to note that this symbol here, this or, means P or Q or both; those are totally OK. As long as either or both of them are true, then the or is going to evaluate to be true as well. It’s only in the case where all of the operands are false that P or Q ultimately evaluates to false. In logic, there’s another symbol known as the exclusive or, which encodes this idea of exclusivity of one or the other, but not both. But we’re not going to be focusing on that today. Whenever we talk about or, we’re always talking about either or both, in this case, as represented by this truth table here. So that now is not, and, and or. And next up is what we might call implication, as denoted by this arrow symbol. So we have P → Q, and this sentence here will generally be read as P implies Q. And what P implies Q means is that if P is true, then Q is also true. So I might say something like, if it is raining, then I will be indoors. Meaning, it is raining implies I will be indoors, is the logical sentence that I’m saying there. And the truth table for this can sometimes be a little bit tricky. Obviously, if P is true and Q is true, then P implies Q is true. That definitely makes sense. And it should also stand to reason that when P is true and Q is false, then P implies Q is false. Because if I said to you, if it is raining, then I will be indoors, and it is raining, but I’m not indoors? Well, then it would seem that my original statement was not true. P implies Q means that if P is true, then Q also needs to be true.
And if it’s not, well, then the statement is false. What’s also worth noting, though, is what happens when P is false. When P is false, the implication makes no claim at all. If I say something like, if it is raining, then I will be indoors, and it turns out it’s not raining, then in that case, I am not making any statement as to whether or not I will be indoors. P implies Q just means that if P is true, Q must be true. But if P is not true, then we make no claim about whether or not Q is true at all. So in either case, if P is false, it doesn’t matter what Q is. Whether it’s false or true, we’re not making any claim about Q whatsoever. We can still evaluate the implication to true. The only way that the implication is ever false is if our premise, P, is true, but the conclusion that we’re drawing, Q, happens to be false. So in that case, we would say P does not imply Q. Finally, the last connective that we’ll discuss is the biconditional. You can think of a biconditional as a condition that goes in both directions. So originally, when I said something like, if it is raining, then I will be indoors, I didn’t say what would happen if it wasn’t raining. Maybe I’ll be indoors, maybe I’ll be outdoors. The biconditional, you can read as an if and only if. So I can say, I will be indoors if and only if it is raining, meaning if it is raining, then I will be indoors, and if I am indoors, it’s reasonable to conclude that it is also raining. So this biconditional is only true when P and Q are the same. If P is true and Q is true, then this biconditional is also true: P implies Q, but also the reverse is true, Q also implies P. And if P and Q both happen to be false, we would still say the biconditional is true. But in either of the other two situations, this P if and only if Q is going to ultimately evaluate to false. So a lot of trues and falses going on there, but these five basic logical connectives are going to form the core of the language of propositional logic, the language that we’re going to use in order to describe ideas, and the language that we’re going to use in order to reason about those ideas in order to draw conclusions. So let’s now take a look at some of the additional terms that we’ll need to know about in order to go about trying to form this language of propositional logic and writing AI that’s actually able to understand this sort of logic. The next thing we’re going to need is the notion of what is actually true about the world. We have a whole bunch of propositional symbols, P and Q and R and maybe others, but we need some way of knowing what actually is true in the world. Is P true or false? Is Q true or false? So on and so forth. And to do that, we’ll introduce the notion of a model. A model just assigns a truth value, where a truth value is either true or false, to every propositional symbol. In other words, it’s creating what we might call a possible world. So let me give an example. If, for example, I have two propositional symbols, P is it is raining and Q is it is a Tuesday, a model just takes each of these two symbols and assigns a truth value to them, either true or false. So here’s a sample model. In this model, in other words, in this possible world, it is possible that P is true, meaning it is raining, and Q is false, meaning it is not a Tuesday. But there are other possible worlds or other models as well: there is some model where both of these variables are true, some model where both of these variables are false.
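All five of these truth tables can be checked mechanically with plain Python booleans; here is a small sketch (the implies and iff helpers are just illustrative names for the standard definitions):

```python
from itertools import product

# Reproduce the five truth tables using plain Python booleans.
# Implication P -> Q is equivalent to (not P) or Q; the biconditional
# P <-> Q is true exactly when P and Q have the same truth value.

def implies(p, q):
    return (not p) or q

def iff(p, q):
    return p == q

print("P      Q      not P  P and Q  P or Q  P -> Q  P <-> Q")
for p, q in product([False, True], repeat=2):
    print(p, q, not p, p and q, p or q, implies(p, q), iff(p, q))
```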
In fact, if there are n variables that are propositional symbols like this, each either true or false, then the number of possible models is 2 to the n, because each of these variables within my model could be set to either true or false if I don’t know any information about it. So now that I have the symbols and the connectives that I’m going to need in order to construct these pieces of knowledge, we need some way to represent that knowledge. And to do so, we’re going to allow our AI access to what we’ll call a knowledge base. And a knowledge base is really just a set of sentences that our AI knows to be true, some set of sentences in propositional logic that are things that our AI knows about the world. And so we might tell our AI some information, information about a situation that it finds itself in, or a situation about a problem that it happens to be trying to solve. And we would give that information to the AI, which the AI would store inside of its knowledge base. And what happens next is the AI would like to use that information in the knowledge base to be able to draw conclusions about the rest of the world. And what do those conclusions look like? Well, to understand those conclusions, we’ll need to introduce one more idea, one more symbol, and that is the notion of entailment. So this sentence here, with this double turnstile and these Greek letters: this is the Greek letter alpha and the Greek letter beta. And we read this as alpha entails beta. And alpha and beta here are just sentences in propositional logic. And what this means is that alpha entails beta means that in every model, in other words, in every possible world in which sentence alpha is true, sentence beta is also true. So if something entails something else, if alpha entails beta, it means that if I know alpha to be true, then beta must therefore also be true. So if my alpha is something like I know that it is a Tuesday in January, then a reasonable beta might be something like I know that it is January. Because in all worlds where it is a Tuesday in January, I know for sure that it must be January, just by definition. This first statement or sentence about the world entails the second statement. And we can reasonably use deduction based on that first sentence to figure out that the second sentence is, in fact, true as well. And ultimately, it’s this idea of entailment that we’re going to try and encode into our computer. We want our AI agent to be able to figure out what the possible entailments are. We want our AI to be able to take these three sentences, sentences like, if it didn’t rain, Harry visited Hagrid; Harry visited Hagrid or Dumbledore, but not both; and Harry visited Dumbledore. And just using that information, we’d like our AI to be able to infer, or figure out, that using these three sentences inside of a knowledge base, we can draw some conclusions. In particular, we can draw the conclusions here that, one, Harry did not visit Hagrid today, and, two, that it did, in fact, rain today. And this process is known as inference, the process of deriving new sentences from old ones: I give you these three sentences, you put them in the knowledge base of, say, the AI, and the AI is able to use some sort of inference algorithm to figure out that these two sentences must also be true. And that is how we define inference.
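The 2 to the n count is easy to see in code; with the two symbols above, a short sketch enumerating every model might look like this:

```python
from itertools import product

# Enumerate every model (possible world) over two symbols.
# P: it is raining; Q: it is a Tuesday.
symbols = ["P", "Q"]
models = [dict(zip(symbols, values))
          for values in product([False, True], repeat=len(symbols))]

print(len(models))  # 2 ** 2 == 4 possible worlds
for model in models:
    print(model)    # e.g. {'P': True, 'Q': False}: raining, not a Tuesday
```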
So let’s take a look at an inference example to see how we might actually go about inferring things in a human sense, before we take a more algorithmic approach to see how we could encode this idea of inference in AI. And we’ll see there are a number of ways that we can actually achieve this. So again, we’ll deal with a couple of propositional symbols. We’ll deal with P, Q, and R. P is it is a Tuesday. Q is it is raining. And R is Harry will go for a run. Three propositional symbols that we are just defining to mean this. We’re not saying anything yet about whether they’re true or false. We’re just defining what they are. Now, we’ll give ourselves, or an AI, access to a knowledge base, abbreviated to KB, the knowledge that we know about the world. We know this statement. All right, so let’s try to parse it. The parentheses here are just used for precedence, so we can see what associates with what. But you would read this as P and not Q implies R. All right, so what does that mean? Let’s take it piece by piece. P is it is a Tuesday. Q is it is raining, so not Q is it is not raining. And R is Harry will go for a run. So the way to read this entire sentence, in human natural language at least, is: if it is a Tuesday and it is not raining, then Harry will go for a run. And that is now inside of our knowledge base. And let’s now imagine that our knowledge base has two other pieces of information as well. It has the information that P is true, that it is a Tuesday. And we also have the information not Q, that it is not raining, that this sentence Q, it is raining, happens to be false. And those are the three sentences that we have access to: P and not Q implies R; P; and not Q. Using that information, we should be able to draw some inferences. P and not Q is only true if both P and not Q are true. All right, we know that P is true and we know that not Q is true. So we know that this whole expression is true. And the definition of implication is that if this whole thing on the left is true, then this thing on the right must also be true. So if we know that P and not Q is true, then R must be true as well. So the inference we should be able to draw from all of this is that R is true, and we know that Harry will go for a run, by taking this knowledge inside of our knowledge base and being able to reason based on that idea. And so this ultimately is the beginning of what we might consider to be some sort of inference algorithm, some process that we can use to try and figure out whether or not we can draw some conclusion. And ultimately, what these inference algorithms are going to answer is the central question about entailment: given some query about the world, something we’re wondering about the world, and we’ll call that query alpha, the question we want to ask using these inference algorithms is, does KB, our knowledge base, entail alpha? In other words, using only the information we know inside of our knowledge base, the knowledge that we have access to, can we conclude that this sentence alpha is true? That’s ultimately what we would like to do. So how can we do that? How can we go about writing an algorithm that can look at this knowledge base and figure out whether or not this query alpha is actually true? Well, it turns out there are a couple of different algorithms for doing so. And one of the simplest, perhaps, is known as model checking.
Now, remember that a model is just some assignment of all of the propositional symbols inside of our language to a truth value, true or false. And you can think of a model as a possible world: there are many possible worlds where different things might be true or false, and we can enumerate all of them. And the model checking algorithm does exactly that. So what does our model checking algorithm do? Well, if we wanted to determine if our knowledge base entails some query alpha, then we are going to enumerate all possible models. In other words, consider all possible values of true and false for our variables, all possible states in which our world can be. And if in every model where our knowledge base is true, alpha is also true, then we know that the knowledge base entails alpha. So let’s take a closer look at that sentence and try and figure out what it actually means. If we know that in every model, in other words, in every possible world, no matter what assignment of true and false to variables you give, whenever our knowledge is true, this query alpha is also true, well, then it stands to reason that as long as our knowledge base is true, alpha must also be true. And so this is going to form the foundation of our model checking algorithm. We’re going to enumerate all of the possible worlds and ask ourselves, whenever the knowledge base is true, is alpha true? And if that’s the case, then we know alpha to be true. And otherwise, there is no entailment; our knowledge base does not entail alpha. All right, so this is a little bit abstract, but let’s take a look at an example to try and put real propositional symbols to this idea. So again, we’ll work with the same example. P is it is a Tuesday. Q is it is raining. R is Harry will go for a run. Our knowledge base contains these pieces of information: P and not Q implies R. We also know P, it is a Tuesday, and not Q, it is not raining. And our query, our alpha in this case, the thing we want to ask, is R. We want to know, is it guaranteed, is it entailed, that Harry will go for a run? So the first step is to enumerate all of the possible models. We have three propositional symbols here, P, Q, and R, which means we have 2 to the third power, or eight, possible models: false, false, false; false, false, true; false, true, false; false, true, true; et cetera. Eight possible ways you could assign true and false across these three symbols. And we might ask, in each one of them, is the knowledge base true? Here are the set of things that we know. To which of these worlds could this knowledge base possibly apply? In which world is this knowledge base true? Well, in the knowledge base, for example, we know P. We know it is a Tuesday, which means we know that these first four rows, where P is false, none of those are going to work for this particular knowledge base. Our knowledge base is not true in those worlds. Likewise, we also know not Q. We know that it is not raining. So any of these models where Q is true, like these two and these two here, those aren’t going to work either, because we know that Q is not true. And finally, we also know that P and not Q implies R, which means that when P is true, as it is here, and Q is false, as it is in these two rows, then R must be true. And if P is true and Q is false but R is also false, well, that doesn’t satisfy this implication; that implication does not hold true under those situations.
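That eight-row enumeration can be carried out directly in a few lines of plain Python; this is a sketch of the idea, not the lecture’s library:

```python
from itertools import product

# Enumerate all 2**3 = 8 models for P, Q, R and check whether the
# knowledge base {(P and not Q) -> R, P, not Q} entails the query R.

def kb(p, q, r):
    # All three sentences in the knowledge base must hold;
    # the implication A -> B is encoded as (not A) or B.
    return ((not (p and not q)) or r) and p and (not q)

# Entailment: in every model where the KB is true, R must be true.
entailed = all(r for p, q, r in product([False, True], repeat=3) if kb(p, q, r))
print(entailed)  # True: the only model satisfying the KB has R true
```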
So we could say, for our knowledge base, under which of these possible worlds is our knowledge base true, and under which of the possible worlds is our knowledge base false. And it turns out there is only one possible world where our knowledge base is actually true. In some cases, there might be multiple possible worlds where the knowledge base is true. But in this case, it just so happens that there’s only one, one possible world where we can definitively say something about our knowledge base. And in this case, we would look at the query. The query is R: is R true? In that world, R is true, and so as a result, we can draw that conclusion. And so this is the idea of model checking: enumerate all the possible models, and look in those possible models to see whether or not, whenever our knowledge base is true, the query in question is true as well. So let’s now take a look at how we might actually go about writing this in a programming language like Python. Take a look at some actual code that would encode this notion of propositional symbols and logic and these connectives, like and and or and not and implication and so forth, and see what that code might actually look like. So I’ve written in advance a logic library that’s more detailed than we need to worry about entirely today. But the important thing is that we have one class for every type of logical symbol or connective that we might have. So we just have one class for logical symbols, for example, where every symbol is going to represent and store some name for that particular symbol. And we also have a class for Not that takes an operand. So we might say Not of one symbol, to say that something is not true, or some other sentence is not true. We have one for And, one for Or, so on and so forth. And I’ll just demonstrate how this works. And you can take a look at the actual logic.py later on. But I’ll go ahead and call this file harry.py. We’re going to store information about this world of Harry Potter, for example. So I’ll go ahead and import from my logic module. I’ll import everything. And in this library, in order to create a symbol, you use capital-S Symbol. And I’ll create a symbol for rain, to mean it is raining, for example. And I’ll create a symbol for Hagrid; Harry visited Hagrid is what this symbol is going to mean. So this symbol means it is raining, and this symbol means Harry visited Hagrid. And I’ll add another symbol called Dumbledore, for Harry visited Dumbledore. Now, I’d like to save these symbols so that I can use them later as I do some logical analysis. So I’ll go ahead and save each one of them inside of a variable, like rain, hagrid, and dumbledore, though you could call the variables anything. And now that I have these logical symbols, I can use logical connectives to combine them together. So for example, if I have a sentence like And(rain, hagrid), which is not necessarily true, but just for demonstration, I can now try and print out sentence.formula(), which is a function I wrote that takes a sentence in propositional logic and just prints it out, so that we, the programmers, can see it in order to get an understanding for how it actually works. So if I run python harry.py, what we’ll see is this sentence in propositional logic, rain and Hagrid. This is the logical representation of what we have here in our Python program, of saying And whose arguments are rain and hagrid. So we’re saying rain and Hagrid by encoding that idea.
And this is quite common in Python object-oriented programming, where you have a number of different classes, and you pass arguments into them in order to create a new And object, for example, in order to represent this idea. But now what I’d like to do is somehow encode the knowledge that I have about the world, in order to solve that problem from the beginning of class, where we talked about trying to figure out who Harry visited and trying to figure out if it’s raining or if it’s not raining. And so what knowledge do I have? I’ll go ahead and create a new variable called knowledge. And what do I know? Well, I know the very first sentence that we talked about was the idea that if it is not raining, then Harry will visit Hagrid. So all right, how do I encode the idea that it is not raining? Well, I can use Not and then the rain symbol. So here’s me saying that it is not raining. And now the implication is that if it is not raining, then Harry visited Hagrid. So I’ll wrap this inside of an Implication to say, if it is not raining, which is the first argument to the implication, then Harry visited Hagrid. So I’m saying Implication: the premise is that it’s not raining, and if it is not raining, then Harry visited Hagrid. And I can print out knowledge.formula() to see the logical formula equivalent of that same idea. So I run python harry.py, and this is the logical formula that we see as a result, which is a text-based version of what we were looking at before: if it is not raining, then that implies that Harry visited Hagrid. But there was additional information that we had access to as well. In this case, we had access to the fact that Harry visited either Hagrid or Dumbledore. So how do I encode that? Well, this means that in my knowledge, I’ve really got multiple pieces of knowledge going on. I know one thing, and another thing, and another thing. So I’ll go ahead and wrap all of my knowledge inside of an And, and I’ll move things onto new lines just for good measure. But I know multiple things, so I’m saying knowledge is an And of multiple different sentences: I know multiple different sentences to be true. One such sentence that I know to be true is this implication, that if it is not raining, then Harry visited Hagrid. Another such sentence that I know to be true is Or(hagrid, dumbledore). In other words, Hagrid or Dumbledore is true, because I know that Harry visited Hagrid or Dumbledore. But I know more than that, actually. That initial sentence from before said that Harry visited Hagrid or Dumbledore, but not both. So now I want a sentence that will encode the idea that Harry didn’t visit both Hagrid and Dumbledore. Well, the notion of Harry visiting Hagrid and Dumbledore would be represented like this: And(hagrid, dumbledore). And if that is not true, if I want to say not that, then I’ll just wrap this whole thing inside of a Not. So now these three lines: line 8 says that if it is not raining, then Harry visited Hagrid; line 9 says Harry visited Hagrid or Dumbledore; and line 10 says Harry didn’t visit both Hagrid and Dumbledore, that it is not true that both the Hagrid symbol and the Dumbledore symbol are true. Only one of them can be true. And finally, the last piece of information that I knew was the fact that Harry visited Dumbledore. So these now are the pieces of knowledge that I know, one sentence and another sentence and another and another. And I can print out what I know, just to see it a little bit more visually.
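Put together, harry.py as described might look roughly like this. This is sketched from the description above, assuming logic.py provides Symbol, Not, And, Or, Implication, and a formula() method, as the lecture says:

```python
from logic import *

# The three propositional symbols described above.
rain = Symbol("rain")              # it is raining
hagrid = Symbol("hagrid")          # Harry visited Hagrid
dumbledore = Symbol("dumbledore")  # Harry visited Dumbledore

# Everything we know, wrapped in a single And.
knowledge = And(
    Implication(Not(rain), hagrid),  # if it is not raining, Harry visited Hagrid
    Or(hagrid, dumbledore),          # Harry visited Hagrid or Dumbledore...
    Not(And(hagrid, dumbledore)),    # ...but not both
    dumbledore                       # Harry visited Dumbledore
)

print(knowledge.formula())
```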
And here now is a logical representation of the information that my computer is now internally representing, using these various different Python objects. And again, take a look at logic.py if you want to see exactly how it’s implementing this, but no need to worry too much about all of the details there. We’re here saying that if it is not raining, then Harry visited Hagrid. We’re saying that Hagrid or Dumbledore is true. And we’re saying it is not the case that Hagrid and Dumbledore is true, that they’re not both true. And we also know that Dumbledore is true. So this long logical sentence represents our knowledge base. It is the thing that we know. And now what we’d like to do is use model checking to ask a query, to ask a question like, based on this information, do I know whether or not it’s raining? And we as humans were able to logic our way through it and figure out that, all right, based on these sentences, we can conclude this and that, to figure out that, yes, it must have been raining. But now we’d like the computer to do that as well. So let’s take a look at the model checking algorithm that is going to follow that same pattern that we drew out in pseudocode a moment ago. So I’ve defined a function here in logic.py that you can take a look at, called model_check. Model_check takes two arguments: the knowledge that I already know, and the query. And the idea is, in order to do model checking, I need to enumerate all of the possible models. And for each of the possible models, I need to ask myself, is the knowledge base true, and is the query true? So the first thing I need to do is somehow enumerate all of the possible models, meaning for all possible symbols that exist, I need to assign true and false to each one of them and see whether or not the entailment still holds. And so here is the way we’re going to do that. I’ve defined another helper function internally that we’ll get to in just a moment. But this function starts by getting all of the symbols in both the knowledge and the query, by figuring out what symbols I am dealing with. In this case, the symbols I’m dealing with are rain and Hagrid and Dumbledore, but there might be other symbols depending on the problem. And we’ll take a look soon at some examples of situations where we’re ultimately going to need some additional symbols in order to represent the problem. And then we’re going to run this check_all function, which is a helper function that’s basically going to recursively call itself, checking every possible configuration of propositional symbols. So we start out by looking at this check_all function. And what do we do? So if not symbols means we’ve finished assigning all of the symbols; we’ve assigned every symbol a value. So far we haven’t done that, but if we ever do, then we check: in this model, is the knowledge true? That’s what this line is saying. If we evaluate the knowledge propositional logic formula using the model’s assignment of truth values, is the knowledge true? If the knowledge is true, then we should return true only if the query is true. Because if the knowledge is true, we want the query to be true as well in order for there to be entailment. Otherwise, there won’t be an entailment: if there’s ever a situation where what we know in our knowledge base is true, but the query, the thing we’re asking, happens to be false, then there is no entailment.
So this line here is checking that same idea, that in all worlds where the knowledge is true, the query must also be true. Otherwise, we can just return true, because if the knowledge isn’t true, then we don’t care. This is equivalent to when we were enumerating this table from a moment ago. In all situations where the knowledge base wasn’t true, all of these seven rows here, we didn’t care whether or not our query was true. We only care to check whether the query is true when the knowledge base is actually true, which was just this green highlighted row right there. So that logic is encoded using that statement there. And otherwise, if we haven’t assigned all the symbols yet, which at first we haven’t, then the first thing we do is pop one of the symbols. I make a copy of the symbols first, just to preserve the existing copy, but I pop one symbol off of the remaining symbols, so that I just pick one symbol. And I create one copy of the model where that symbol is true, and I create a second copy of the model where that symbol is false. So I now have two copies of the model, one where the symbol is true and one where the symbol is false. And I need to make sure that this entailment holds in both of those models. So I recursively run check_all on the model where the symbol is true, and check_all on the model where the symbol is false. So again, you can take a look at that function to try to get a sense for how exactly this logic is working. But in effect, what it’s doing is recursively calling this check_all function again and again and again. And on every level of the recursion, we’re saying, let’s pick a new symbol that we haven’t yet assigned, assign it to true and assign it to false, and then check to make sure that the entailment holds in both cases. Because ultimately, I need to check every possible world. I need to take every combination of symbols and try every combination of true and false in order to figure out whether the entailment relation actually holds. So that function we’ve written for you. But in order to use that function inside of harry.py, what I’ll write is something like this: I would like to model check based on the knowledge, and then I provide as a second argument what the query is, what the thing I want to ask is. And what I want to ask in this case is, is it raining? So model_check again takes two arguments. The first argument is the information that I know, this knowledge, which in this case is the information that was given to me at the beginning. And the second argument, rain, is encoding the idea of the query: what am I asking? I would like to ask, based on this knowledge, do I know for sure that it is raining? And I can try and print out the result of that. And when I run this program, I see that the answer is true. Based on this information, I can conclusively say that it is raining, because using this model checking algorithm, we were able to check that in every world where this knowledge is true, it is raining. In other words, there is no world where this knowledge is true and it is not raining. So you can conclude that it is, in fact, raining.
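Here is a sketch of model_check following the recursive structure just described. The actual logic.py may differ in details; sentences are assumed to provide evaluate(model) and symbols() methods, as implied above:

```python
# Sketch of the model-checking function described above.

def model_check(knowledge, query):
    def check_all(knowledge, query, symbols, model):
        if not symbols:
            # Every symbol has a value: entailment only needs to hold
            # in the models where the knowledge base is true.
            if knowledge.evaluate(model):
                return query.evaluate(model)
            return True  # knowledge false here, so this world doesn't matter
        # Pick one unassigned symbol and branch on true and false.
        remaining = symbols.copy()
        p = remaining.pop()
        model_true = model.copy()
        model_true[p] = True
        model_false = model.copy()
        model_false[p] = False
        # Entailment must hold in both branches.
        return (check_all(knowledge, query, remaining, model_true) and
                check_all(knowledge, query, remaining, model_false))

    symbols = set.union(knowledge.symbols(), query.symbols())
    return check_all(knowledge, query, symbols, dict())

# Usage, with the knowledge base from harry.py:
# print(model_check(knowledge, rain))  # True: it must be raining
```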
And this sort of logic can be applied to a number of different types of problems: if you’re confronted with a problem where some sort of logical deduction can be used in order to try to solve it, you might try thinking about what propositional symbols you might need in order to represent that information, and what statements in propositional logic you might use in order to encode the information which you know. And this process of trying to take a problem and figure out what propositional symbols to use in order to encode that idea, or how to represent it logically, is known as knowledge engineering: software engineers and AI engineers will take a problem and try and figure out how to distill it down into knowledge that is representable by a computer. And if we can take any general-purpose problem, some problem that we find in the human world, and turn it into a problem that computers know how to solve by using any number of different variables, well, then we can take a computer that is able to do something like model checking, or some other inference algorithm, and actually figure out how to solve that problem. So now we’ll take a look at two or three examples of knowledge engineering in practice, of taking some problem and figuring out how we can apply logical symbols and use logical formulas to be able to encode that idea. And we’ll start with a very popular board game in the US and the UK known as Clue. Now, in the game of Clue, there are a number of different factors that are going on. But the basic premise of the game, if you’ve never played it before, is that there are a number of different people. For now, we’ll just use three: Colonel Mustard, Professor Plum, and Miss Scarlet. There are a number of different rooms, like a ballroom, a kitchen, and a library. And there are a number of different weapons: a knife, a revolver, and a wrench. And three of these, one person, one room, and one weapon, are the solution to the mystery: the murderer, what room they were in, and what weapon they happened to use. And what happens at the beginning of the game is that all these cards are randomly shuffled together. And three of them, one person, one room, and one weapon, are placed into a sealed envelope that we don’t get to see. And we would like to figure out, using some sort of logical process, what’s inside the envelope: which person, which room, and which weapon. And we do so by looking at some, but not all, of these cards here, to try and figure out what might be going on. And so this is a very popular game. But let’s now try and formalize it and see if we could train a computer to play this game by reasoning through it logically. So in order to do this, we’ll begin by thinking about what propositional symbols we’re ultimately going to need. Remember, again, that propositional symbols are just some symbol, some variable, that can be either true or false in the world. And so in this case, the propositional symbols are really just going to correspond to each of the possible things that could be inside the envelope. Mustard is a propositional symbol that, in this case, will just be true if Colonel Mustard is inside the envelope, if he is the murderer, and false otherwise. And likewise for Plum, for Professor Plum, and Scarlet, for Miss Scarlet. And likewise for each of the rooms and for each of the weapons: we have one propositional symbol for each of these ideas.
Then, using those propositional symbols, we can begin to create logical sentences, create knowledge that we know about the world. So for example, we know that someone is the murderer, that one of the three people is, in fact, the murderer. And how would we encode that? Well, we don’t know for sure who the murderer is, but we know it is one person or the second person or the third person. So I could say something like this: Mustard or Plum or Scarlet. And this piece of knowledge encodes that one of these three people is the murderer. We don’t know which, but one of these three things must be true. What other information do we know? Well, we know that, for example, one of the rooms must have been the room in the envelope. The crime was committed either in the ballroom or the kitchen or the library. Again, right now, we don’t know which. But this is knowledge we know at the outset, knowledge that one of these three must be inside the envelope. And likewise, we can say the same thing about the weapon: it was either the knife or the revolver or the wrench, that one of those weapons must have been the weapon of choice and therefore the weapon in the envelope. And then as the game progresses, the gameplay works by people getting various different cards, and using those cards, you can deduce information. If someone gives you a card, for example, and I have the Professor Plum card in my hand, then I know the Professor Plum card can’t be inside the envelope. I know that Professor Plum is not the criminal, so I know a piece of information like not Plum, for example. I know that Professor Plum has to be false; this propositional symbol is not true. And sometimes I might not know for sure that a particular card is not in the middle, but sometimes someone will make a guess and I’ll know that one of three possibilities is not true. Someone will guess Colonel Mustard in the library with the revolver, or something to that effect. And in that case, a card might be revealed that I don’t see. But since that card is either Colonel Mustard or the revolver or the library, I know that at least one of them can’t be in the middle. So I know something like: it is either not Mustard, or it is not the library, or it is not the revolver. Now maybe multiple of these are not true, but I know that at least one of Mustard, Library, and Revolver must, in fact, be false. And so this now is a propositional logic representation of this game of Clue, a way of encoding the knowledge that we know inside this game using propositional logic that a computer algorithm, something like model checking that we saw a moment ago, can actually look at and understand. So let’s now take a look at some code to see how this algorithm might actually work in practice. All right, so I’m now going to open up a file called clue.py, which I’ve started already. And what we’ll see here is that I’ve defined a couple of things. I’ve defined some symbols initially: notice I have a symbol for Colonel Mustard, a symbol for Professor Plum, a symbol for Miss Scarlet, all of which I’ve put inside of this list of characters. I have a symbol for Ballroom and Kitchen and Library, inside of a list of rooms. And then I have symbols for Knife and Revolver and Wrench; these are my weapons. And so all of these characters and rooms and weapons altogether, those are my symbols. And now I also have this check_knowledge function. And what the check_knowledge function does is it takes my knowledge and tries to draw conclusions about what I know.
For example, we'll loop over all of the possible symbols and check: do I know that that symbol is true? A symbol is going to be something like Professor Plum, or the knife, or the library. If I know that it is true, in other words, I know it must be the card in the envelope, then I'm going to print out the word YES, using a function called cprint, which prints things in color, and I'll print it in green just to make it very clear to us. If we're not sure that the symbol is true, maybe I can check whether I'm sure the symbol is not true, like if I know for sure that it is not Professor Plum, for example. I do that by running model check again, this time checking whether my knowledge entails not the symbol, whether I know for sure that the symbol is false. And if I don't know for sure that the symbol is false, because I say if not model check, meaning I'm not sure the symbol is false, then I'll go ahead and print out MAYBE next to the symbol. Because maybe the symbol is true, maybe it's not; I don't actually know. So what knowledge do I actually have? Let's try to represent my knowledge now. I know a couple of things, so I'll put them in an And. I know that one of the three people must be the criminal, so I know Or of Mustard, Plum, Scarlet: that's my way of encoding that it is either Colonel Mustard or Professor Plum or Miss Scarlet. I know that it must have happened in one of the rooms, so I know Or of ballroom, kitchen, library. And I know that one of the weapons must have been used as well, so I know Or of knife, revolver, wrench. So that might be my initial knowledge: it must have been one of the people, in one of the rooms, with one of the weapons. I can see what that knowledge looks like as a formula by printing out knowledge.formula. So I'll run python clue.py, and here is the information I know in logical format: it is Colonel Mustard or Professor Plum or Miss Scarlet; it is the ballroom or the kitchen or the library; and it is the knife or the revolver or the wrench. But I don't know much more than that; I can't really draw any firm conclusions. In fact, we can see that if I run my check_knowledge function, the function I just wrote that loops over all of the symbols and tries to see what conclusions I can draw about each of them, and then run clue.py, it seems that I don't really know anything for sure. All three people are maybes, all three rooms are maybes, and all three weapons are maybes. I don't know anything for certain just yet. But now let me try to add some additional information and see whether additional knowledge can help us logically reason our way through this process. We are just going to provide the information; our AI is going to take care of doing the inference and figuring out what conclusions it's able to draw. So I start with some cards, and those cards tell me something. If I have the Colonel Mustard card, for example, I know that the Mustard symbol must be false. In other words, Mustard is not the one in the envelope, is not the criminal.
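A sketch of that check_knowledge function and the initial knowledge base, under the same assumptions as above (And, Or, Not, and a model_check(knowledge, query) function that returns whether the knowledge entails the query are assumed from the lecture's library; the formula() helper and readable symbol printing are also assumptions; cprint is from the real termcolor package):

from logic import And, Or, Not, model_check  # assumed library, as above
from termcolor import cprint

def check_knowledge(knowledge):
    for symbol in symbols:
        if model_check(knowledge, symbol):
            # Entailed: this card must be in the envelope
            cprint(f"{symbol}: YES", "green")
        elif not model_check(knowledge, Not(symbol)):
            # Neither the symbol nor its negation is entailed
            print(f"{symbol}: MAYBE")

knowledge = And(
    Or(mustard, plum, scarlet),      # one of the people is the murderer
    Or(ballroom, kitchen, library),  # the crime happened in one of the rooms
    Or(knife, revolver, wrench)      # one of the weapons was used
)

print(knowledge.formula())
check_knowledge(knowledge)  # at this point, everything is still a MAYBE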
So I can encode that: every And in this library supports an add method, which is a way of adding an additional logical sentence to an And clause. So I can say knowledge.add, not Mustard. I happen to know, because I have the Mustard card, that Colonel Mustard is not the suspect. And maybe I have a couple of other cards too. Maybe I also have a card for the kitchen, so I know it's not the kitchen. And maybe I have another card that says it is not the revolver. So I have three cards: Colonel Mustard, the kitchen, and the revolver. I encode that into my AI by saying it's not Colonel Mustard, it's not the kitchen, and it's not the revolver, and I know those to be true. Now, when I rerun clue.py, we'll see that I've been able to eliminate some possibilities. Before, I wasn't sure if it was the knife or the revolver or the wrench: the knife was a maybe, the revolver was a maybe, the wrench was a maybe. Now I'm down to just the knife and the wrench. Between those two, I don't know which one it is; they're both maybes. But I've been able to eliminate the revolver, which I know to be false because I have the revolver card. Additional information might be acquired over the course of the game, and we represent that just by adding knowledge to the knowledge base we've been building here. If, for example, we additionally learned that someone made a guess, say Miss Scarlet in the library with the wrench, and we know that a card was revealed, then at least one of those three cards, Miss Scarlet or the library or the wrench, must not be inside of the envelope. So I could add some knowledge with knowledge.add, and I'm going to add an Or clause, because I don't know for sure which one it's not, but I know one of them is not in the envelope. So it's either not Scarlet, or it's not the library, or it's not the wrench, and Or supports multiple arguments. At least one of those three, Scarlet, library, and wrench, needs to be false. I don't know which; maybe it's multiple, maybe it's just one, but at least one, I know, needs to hold. Now if I rerun clue.py, I don't actually have any additional conclusions just yet, nothing I can say conclusively. I still know that maybe it's Professor Plum, maybe it's Miss Scarlet; I haven't eliminated any options. But let's imagine that I get some more information, that someone shows me the Professor Plum card, for example. So I go back and say knowledge.add, not Plum. I have the Professor Plum card, so I know Professor Plum is not in the envelope. I rerun clue.py, and now I'm able to draw some conclusions. I've been able to eliminate Professor Plum, and the only remaining person it could be is Miss Scarlet. So I know that, yes, the Scarlet variable must be true, and I've been able to infer that based on the information I already had. Now, between the ballroom and the library, and between the knife and the wrench, I'm still not sure. So let's add one more piece of information. Let's say I know it's not the ballroom; someone has shown me the ballroom card. At this point, I should be able to conclude that it's the library. Let's see. I'll say knowledge.add, not ballroom, and we'll go ahead and run that.
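Collected together, the updates from this walkthrough might look like the following sketch (same assumed library as above; the specific cards and the guess are the ones described in the walkthrough):

# Cards in my hand: these three are definitely not in the envelope
knowledge.add(Not(mustard))
knowledge.add(Not(kitchen))
knowledge.add(Not(revolver))

# Someone guessed "Scarlet, library, wrench" and a card was privately revealed,
# so at least one of those three is not in the envelope
knowledge.add(Or(Not(scarlet), Not(library), Not(wrench)))

# Cards revealed to me later in the game
knowledge.add(Not(plum))
knowledge.add(Not(ballroom))

check_knowledge(knowledge)  # should now entail: Scarlet, library, knife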
And it turns out that after all of this, not only can I conclude that it's the library, but I also know that the weapon was the knife. That might have been a trickier inference, something I wouldn't have realized immediately, but the AI, via this model checking algorithm, is able to draw the conclusion that it must be Miss Scarlet in the library with the knife. How did we know that? Well, we know it from that Or clause up above: it's either not Scarlet, or it's not the library, or it's not the wrench. Given that we know it is Miss Scarlet, and we know it is the library, the only remaining option is that it is not the wrench, which means it must be the knife. So we as humans can now go back and reason through that, even though it might not have been immediately clear. That's one of the advantages of using an AI or some sort of algorithm for this: the computer can exhaust all of these possibilities and figure out what the solution actually should be. For that reason, it's often helpful to represent knowledge in this way. This is knowledge engineering: a situation where we can use a computer to represent knowledge and draw conclusions based on that knowledge. Any time we can translate something into propositional logic symbols like this, this type of approach can be useful. You might be familiar with logic puzzles, where you have to puzzle your way through trying to figure something out. A classic logic puzzle might look something like this: Gilderoy, Minerva, Pomona, and Horace each belong to a different one of the four houses: Gryffindor, Hufflepuff, Ravenclaw, and Slytherin. Then we get some information: Gilderoy belongs to Gryffindor or Ravenclaw, Pomona does not belong to Slytherin, and Minerva belongs to Gryffindor. Using that information, we need to draw some conclusions about which person should be assigned to which house. Again, we can use the exact same idea to implement this. We need some propositional symbols, and in this case the propositional symbols are going to get a little more complex, although we'll see ways to make this cleaner later on. We'll need 16 propositional symbols, one for each combination of person and house. Remember, every propositional symbol is either true or false. So GilderoyGryffindor is either true or false: either he's in Gryffindor or he is not. Likewise, GilderoyHufflepuff is either true or false. And that holds for every combination of person and house we could come up with: we have a propositional symbol for each one. Using this type of knowledge, we can then begin to think about what kinds of logical sentences we can say about the puzzle. Before we even consider the information we were given, we can think about the premise of the problem, that every person is assigned to a different house. What does that tell us? Well, it tells us sentences like this: PomonaSlytherin implies not PomonaHufflepuff. If Pomona is in Slytherin, then we know that Pomona is not in Hufflepuff.
And we know this for all four people and for all combinations of houses: no matter which person you pick, if they're in one house, then they're not in some other house. So I'll have a whole bunch of knowledge statements of this form: if Pomona is in Slytherin, then Pomona is not in Hufflepuff. We were also given the information that each person is in a different house, so I also have pieces of knowledge that look like this: MinervaRavenclaw implies not GilderoyRavenclaw. If they're all in different houses, then if Minerva is in Ravenclaw, we know that Gilderoy is not in Ravenclaw as well. And I have a whole bunch of similar sentences expressing that idea for the other people and other houses. In addition to sentences of this form, I also have the knowledge that was given to me, information like: Gilderoy is in Gryffindor or in Ravenclaw. That would be represented as GilderoyGryffindor or GilderoyRavenclaw. Using these sorts of sentences, I can begin to draw some conclusions about the world. So let's see an example of this. We'll actually implement this logic puzzle to see if we can figure out the answer. I'll open up puzzle.py, where I've already started to implement this idea. I've defined a list of people and a list of houses, and so far I've created one symbol for every combination of person and house. That's what this double for loop is doing: looping over all people, looping over all houses, and creating a new symbol for each pair. Then I've added some information. I know that every person belongs to a house, so for every person I've added the information that the person is in Gryffindor, or in Hufflepuff, or in Ravenclaw, or in Slytherin: one of those four things must be true. What other information do I know? I also know that each person belongs to only one house; no person belongs to multiple houses. How does this work? Well, this is going to be true for all people, so I'll loop over every person, and then I need to loop over all different pairs of houses. The idea is that I want to encode: if Minerva is in Gryffindor, then Minerva can't be in Ravenclaw. So I'll loop over all houses, h1, and I'll loop over all houses again, h2, and as long as they're different, h1 not equal to h2, I'll add this piece of information to my knowledge base: an implication, in other words an if-then, that if the person is in h1, then they are not in house h2. So these lines encode the notion that, for every person, if they belong to one house, then they are not in another house. The other piece of logic we need to encode is the idea that every house can only have one person. In other words, if Pomona is in Hufflepuff, then nobody else is allowed to be in Hufflepuff either. That's the same logic, but backwards: I loop over all of the houses and loop over all different pairs of people. I loop over people once, loop over people again, and only do this when the people are different, p1 not equal to p2. And I add the knowledge that, as given by the implication, if person one belongs to the house, then it is not the case that person two belongs to the same house. So here I'm just encoding the knowledge that represents the problem's constraints: everyone is in a different house, and any person can only belong to one house.
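A sketch of what puzzle.py might look like, collected in one place and using the same assumed Symbol, And, Or, Not, Implication, and model_check names as before. It also includes the given information and the final entailment check that the walkthrough below adds step by step:

from logic import Symbol, And, Or, Not, Implication, model_check  # assumed library

people = ["Gilderoy", "Pomona", "Minerva", "Horace"]
houses = ["Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin"]

# One symbol per (person, house) combination: 16 in total
symbols = [Symbol(f"{person}{house}") for person in people for house in houses]

knowledge = And()

# Every person belongs to a house
for person in people:
    knowledge.add(Or(*[Symbol(f"{person}{house}") for house in houses]))

# Only one house per person
for person in people:
    for h1 in houses:
        for h2 in houses:
            if h1 != h2:
                knowledge.add(Implication(Symbol(f"{person}{h1}"), Not(Symbol(f"{person}{h2}"))))

# Only one person per house
for house in houses:
    for p1 in people:
        for p2 in people:
            if p1 != p2:
                knowledge.add(Implication(Symbol(f"{p1}{house}"), Not(Symbol(f"{p2}{house}"))))

# The information we were given
knowledge.add(Or(Symbol("GilderoyGryffindor"), Symbol("GilderoyRavenclaw")))
knowledge.add(Not(Symbol("PomonaSlytherin")))
knowledge.add(Symbol("MinervaGryffindor"))

# Print every assignment the knowledge base entails
for symbol in symbols:
    if model_check(knowledge, symbol):
        print(symbol)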
I can now take my knowledge and print out the information that I happen to know. So I'll print out knowledge.formula, just to see this in action, and I'll skip ahead for now; we'll come back to this in a second. Let's print out the knowledge I know by running python puzzle.py. It's a lot of information, a lot to scroll through, because there are 16 different variables in play. But the basic idea, if we scroll up to the very top, is that I see my initial information: Gilderoy is either in Gryffindor, or in Hufflepuff, or in Ravenclaw, or in Slytherin, and then a whole lot more information as well. So this is quite messy, more than we really want to be looking at, and soon we'll see ways of representing this more nicely using logic. But for now, these are the variables we're dealing with. Now we'd like to add some information. The information we're going to add is that Gilderoy is in Gryffindor or he is in Ravenclaw; that knowledge was given to us. So I'll say knowledge.add, and I add that either GilderoyGryffindor or GilderoyRavenclaw is true. I also know that Pomona is not in Slytherin, so I can say knowledge.add of not the PomonaSlytherin symbol. And then I can add the knowledge that Minerva is in Gryffindor by adding the symbol MinervaGryffindor. Those are the pieces of knowledge I know. The loop at the bottom just loops over all of my symbols, checks whether the knowledge entails each symbol by calling the model check function again, and if it does, if we know the symbol is true, it prints the symbol out. So now I can run python puzzle.py, and Python is going to solve this puzzle for me. We're able to conclude that Gilderoy belongs to Ravenclaw, Pomona belongs to Hufflepuff, Minerva to Gryffindor, and Horace to Slytherin, just by encoding this knowledge inside the computer, although it was quite tedious to do in this case. And as a result, we were able to get the conclusion from it as well. You can imagine this being applied to many different deductive situations: not only puzzles with Harry Potter characters, but also games like Mastermind, where you're trying to figure out which order different colors go in and make predictions about it. For example, let's play a simplified version of Mastermind with four colors, red, blue, green, and yellow, in some order, but I'm not telling you what order. You just make a guess, and I'll tell you how many of red, blue, green, and yellow you got in the right position. In this simplified version of the game, you might make a guess like red, blue, green, yellow, and I would tell you something like: two of those four are in the correct position, but the other two are not. Then you could reasonably make another guess, say blue, red, green, yellow, trying to switch two of them around, and this time maybe I tell you that none of those are in the correct position. The question then is: what is the correct order of these four colors? We as humans could begin to reason this through.
All right, well, if none of the second guess was correct, but two of the first guess were correct, it must be because I switched the red and the blue. So red and blue in the first guess must have been the two correct ones, which means green and yellow must not be correct in those positions. You can begin to do this sort of deductive reasoning, and we can equivalently encode it inside of our computer as well. It's going to be very similar to the logic puzzle we just did a moment ago, so I won't spend too much time on this code. Again, we have a whole bunch of colors and four different positions those colors can be in, and then we have some additional knowledge, and I encode all of that knowledge. You can take a look at this code on your own time. But I just want to demonstrate that when we run this code, run python mastermind.py, we ultimately are able to compute that red is in position 0, blue is in position 1, yellow is in position 2, and green is in position 3 as the ordering of those symbols. Now, what you might have noticed is that this process was taking quite a long time. And in fact, model checking is not a particularly efficient algorithm. What I need to do in order to model check is take all of my variables and enumerate all of the possibilities they could be in. If I have n variables, I have 2 to the n possible worlds that I need to look through in order to perform the model checking algorithm. This is not tractable, especially as we get to much larger problems where many more variables are at play. Here we only have a relatively small number of variables, so this sort of approach can actually work. But as the number of variables increases, model checking becomes a less and less practical way of solving these sorts of problems. So while it might have been okay for something like Mastermind to conclude that this is indeed the correct sequence, what we'd like is to come up with better ways of making inferences, rather than just enumerating all of the possibilities. To do so, we'll transition next to the idea of inference rules: rules that we can apply to take knowledge that already exists and translate it into new forms of knowledge. The general way we'll structure an inference rule is with a horizontal line: anything above the line represents a premise, something we know to be true, and anything below the line is the conclusion we can arrive at after we apply the logic of the inference rule. We'll introduce some of these inference rules by demonstrating them in English first, and then translating them into propositional logic so you can see what they actually look like. For example, let's imagine that I have access to two pieces of information. I know that if it is raining, then Harry is inside. And let's say I also know that it is raining. Most of us could reasonably look at this information and conclude that Harry must be inside. This inference rule is known as modus ponens, and it's phrased more formally in logic as follows.
If we know that alpha implies beta, in other words, if alpha, then beta, and we also know that alpha is true, then we should be able to conclude that beta is also true. We can apply this inference rule to take those two pieces of information and generate this new piece of information. Notice that this is a totally different approach from model checking, where the approach was to look at all of the possible worlds and see what's true in each of them. Here, we're not dealing with any specific world; we're just dealing with the knowledge we know and the conclusions we can arrive at based on that knowledge: I know that alpha implies beta, I know alpha, and the conclusion is beta. This should seem like a relatively obvious rule. Of course, if alpha implies beta, and we know alpha, then we can conclude beta. That's going to be true for many, maybe even all, of the inference rules we'll take a look at: you should be able to look at each one and say, yes, of course that's true. But it's putting them all together, figuring out the right combination of inference rules to apply, that ultimately allows us to generate interesting knowledge inside of our AI. So that's modus ponens, this application of implication: if we know alpha, and we know that alpha implies beta, then we can conclude beta. Let's take a look at another example, something fairly straightforward: Harry is friends with Ron and Hermione. Based on that information, we can reasonably conclude that Harry is friends with Hermione; that must also be true. This inference rule is known as and elimination. What and elimination says is that if we have a situation where alpha and beta are both true, then just alpha is true, or likewise, just beta is true. If I know that both parts are true, then each of those parts must also be true. Again, something obvious from the point of view of human intuition, but a computer needs to be told this kind of information. For the computer to apply an inference rule, we need to tell it that this is a rule it can use, so that it has access to it and can use it to translate information from one form to another. Let's take a look at another example of an inference rule, something like: it is not true that Harry did not pass the test. A bit of a tricky sentence to parse, so I'll read it again: it is not true, or it is false, that Harry did not pass the test. Well, if it is false that Harry did not pass the test, then the only reasonable conclusion is that Harry did pass the test. This, instead of and elimination, is what we call double negation elimination: if we have two negatives inside of our premise, we can remove them altogether; they cancel each other out. One turns true to false, and the other turns false back into true. Phrased a little more formally: if the premise is not not alpha, then the conclusion we can draw is just alpha; we can say that alpha is true. We'll take a look at a couple more of these. If I have: if it is raining, then Harry is inside, how do I reframe this? This one is a little trickier. But if I know that if it is raining, then Harry is inside, then I can conclude that one of two things must be true: either it is not raining, or Harry is inside. Now, this one's trickier, so let's think about it a little bit.
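For reference, here are the three rules so far, written in the horizontal-line (premise over conclusion) notation the lecture describes, in LaTeX:

\frac{\alpha \rightarrow \beta, \quad \alpha}{\beta} \quad \text{(modus ponens)}

\frac{\alpha \land \beta}{\alpha} \quad \text{(and elimination)}

\frac{\neg(\neg\alpha)}{\alpha} \quad \text{(double negation elimination)}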
Let's think it through. This first premise, if it is raining, then Harry is inside, is saying that if I know it is raining, then Harry must be inside. So what is the other possible case? Well, if Harry is not inside, then I know it must not be raining. So one of those two situations must be true: either it's not raining, or it is raining, in which case Harry is inside. So the conclusion I can draw is: either it is not raining, or Harry is inside. This is a way to translate if-then statements into or statements, and it's known as implication elimination. This is similar to what we actually did at the beginning, when we were first looking at those very first sentences about Harry and Hagrid and Dumbledore. Phrased a little more formally, this says that if I have the implication alpha implies beta, I can draw the conclusion: not alpha or beta. There are only two possibilities: either alpha is true, or alpha is not true. One of those possibilities is that alpha is not true. But if alpha is true, then beta must be true. So either alpha is not true, or alpha is true, in which case beta is also true. This is one way to turn an implication into a statement using or. In addition to eliminating implications, we can also eliminate biconditionals. Let's take an English example, something like: it is raining if and only if Harry is inside. This if and only if really sounds like the biconditional, that double arrow sign we saw in propositional logic not too long ago. What does this actually mean if we were to translate it? Well, it means that if it is raining, then Harry is inside, and if Harry is inside, then it is raining; the implication goes both ways. This is what we call biconditional elimination: I can take a biconditional, alpha if and only if beta, and translate it into: alpha implies beta, and beta implies alpha. So many of these inference rules take logic that uses certain symbols and turn it into different symbols: taking an implication and turning it into an or, or taking a biconditional and turning it into implications. Another example would be something like this: it is not true that both Harry and Ron passed the test. How do we translate that? What does it mean? Well, if it is not true that both of them passed the test, then the reasonable conclusion is that at least one of them didn't pass. So the conclusion is: either Harry did not pass the test, or Ron did not pass the test, or both; this is not an exclusive or. This type of law is one of De Morgan's laws, quite famous in logic, where the idea is that we can turn an and into an or. We take this and, that both Harry and Ron passed the test, and turn it into an or by moving the nots around. So if it is not true that both Harry and Ron passed the test, then either Harry did not pass the test or Ron did not pass the test. The way we frame that more formally in logic is to say: if it is not true that alpha and beta, then either not alpha or not beta.
The way I like to think about this is that if you have a negation in front of an and expression, you move the negation inwards, so to speak, moving the negation into each of the individual sentences, and then flip the and into an or. So I go from not (alpha and beta) to not alpha or not beta. And there's a reverse of De Morgan's law that goes in the other direction, for something like this: if I say it is not true that Harry or Ron passed the test, meaning neither of them passed the test, then the conclusion I can draw is that Harry did not pass the test and Ron did not pass the test. In this case, instead of turning an and into an or, we're turning an or into an and, but the idea is the same, and this, again, is another example of De Morgan's laws. It works the same way: if I have not (alpha or beta), the same logic applies. I move the negation inwards, and this time I flip the or into an and. So from not (alpha or beta), I conclude not alpha and not beta, moving the negation inwards to make that conclusion. So those are De Morgan's laws, and there are a couple of other inference rules worth taking a look at. One is the distributive law, which works this way: if I have alpha and (beta or gamma), then much in the same way that you can use the distributive law in math to distribute operations like multiplication over addition, I can do a similar thing here. Alpha and (beta or gamma) becomes (alpha and beta) or (alpha and gamma); I've distributed the and throughout the expression. This is the distributive property, or the distributive law, as applied to logic, in much the same way that you would distribute a multiplication over an addition. It works the other way, too. If, for example, I have alpha or (beta and gamma), I can distribute the or throughout the expression: (alpha or beta) and (alpha or gamma). So the distributive law works in that direction as well, and it's helpful if I want to take an or and move it into the expression. We'll see an example soon of why we might actually care to do that. All right, so now we've seen a lot of different inference rules, and the question is: how can we use those inference rules to actually draw conclusions, to actually prove something about entailment, proving that given some initial knowledge base, we can show that a query is true? Well, one way to think about it is to think back to what we talked about last time, when we talked about search problems. Recall that search problems have some initial state; actions you can take from one state to another, as defined by a transition model that tells you how to get from one state to another; a test to see if you are at a goal; and some path cost function that measures how many steps you had to take, or how costly the solution you found was. Now that we have these inference rules, which take some set of sentences in propositional logic and get us a new set of sentences in propositional logic, we can actually treat those sets of sentences as states inside of a search problem.
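Before we use them, here are the remaining rules from this stretch in the same horizontal-line notation (LaTeX):

\frac{\alpha \rightarrow \beta}{\neg\alpha \lor \beta} \quad \text{(implication elimination)}

\frac{\alpha \leftrightarrow \beta}{(\alpha \rightarrow \beta) \land (\beta \rightarrow \alpha)} \quad \text{(biconditional elimination)}

\frac{\neg(\alpha \land \beta)}{\neg\alpha \lor \neg\beta} \qquad \frac{\neg(\alpha \lor \beta)}{\neg\alpha \land \neg\beta} \quad \text{(De Morgan's laws)}

\frac{\alpha \land (\beta \lor \gamma)}{(\alpha \land \beta) \lor (\alpha \land \gamma)} \qquad \frac{\alpha \lor (\beta \land \gamma)}{(\alpha \lor \beta) \land (\alpha \lor \gamma)} \quad \text{(distributive laws)}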
So if we want to prove that some query is true, prove that some logical theorem is true, we can treat theorem proving as a form of a search problem. We begin in some initial state, where that initial state is the knowledge base we start with, the set of all the sentences we know to be true. What actions are available to us? The actions are any of the inference rules we can apply at any given time. The transition model tells us that after we apply an inference rule, the new state is the new set of all the knowledge we have: the old set of knowledge, plus the additional inference we've been able to draw, much the same way we saw conclusions get added to the knowledge base when we applied those inference rules. What is the goal test? Our goal test is checking whether we have proved the statement we're trying to prove, whether the thing we're trying to prove is inside of our knowledge base. And the path cost function, the thing we're trying to minimize, might be the number of inference rules we needed to use, the number of steps, so to speak, inside of our proof. So here we've been able to apply the same types of ideas we saw last time with search problems to proving something about knowledge, by framing the problem in terms we can understand as a search problem: with an initial state, with actions, with a transition model. This shows a couple of things, one being how versatile search problems are: the same types of algorithms that we use to solve a maze, or to figure out how to get from point A to point B in driving directions, can also be used as a theorem-proving method, taking some starting knowledge base and trying to prove something about that knowledge. So this is a second way, in addition to model checking, to try to prove that certain statements are true. But it turns out there's yet another way we can apply inference, and we'll talk about it now. It's not the only way, but it's certainly one of the most common, and it's known as resolution. Resolution is based on another inference rule, quite a powerful one, that will let us prove anything that can be proven about a knowledge base. It's based on this basic idea: let's say I know that either Ron is in the Great Hall or Hermione is in the library, and let's say I also know that Ron is not in the Great Hall. Based on those two pieces of information, what can I conclude? Well, I could pretty reasonably conclude that Hermione must be in the library. How do I know that? Because these two statements, what we'll call complementary literals, literals that complement each other, are opposites that conflict with each other. The first sentence tells us that either Ron is in the Great Hall or Hermione is in the library. So if we know that Ron is not in the Great Hall, that conflicts with the first part of the sentence, which means Hermione must be in the library. We can frame this as a more general rule known as the unit resolution rule: if we know that P or Q is true, and we also know not P, then from that we can reasonably conclude Q. If P or Q is true, and we know that P is not true, the only possibility is for Q to be true.
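In the same notation (LaTeX), here is unit resolution, along with the general form that the next paragraphs build up to:

\frac{P \lor Q, \quad \neg P}{Q} \quad \text{(unit resolution)}

\frac{P \lor Q_1 \lor \dots \lor Q_n, \quad \neg P \lor R_1 \lor \dots \lor R_m}{Q_1 \lor \dots \lor Q_n \lor R_1 \lor \dots \lor R_m} \quad \text{(resolution)}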
And this, it turns out, is quite a powerful inference rule in terms of what it can do, in part because we can quickly start to generalize it. This Q doesn't need to be just a single propositional symbol; it could be multiple symbols, all chained together in a single clause, as we'll call it. So if I had something like P or Q1 or Q2 or Q3, and so on up until Qn, n different other variables, and I also have not P, then when these two complement each other, the two clauses resolve, so to speak, to produce a new clause that is just Q1 or Q2 all the way up to Qn. And in an or, the order of the arguments doesn't actually matter: the P doesn't need to be the first thing; it could have been in the middle. The idea is that if I have P in one clause and not P in the other clause, then I know that one of the remaining things must be true; I've resolved the two clauses to produce a new one. It turns out we can generalize this idea even further and get even more power out of the resolution rule. Let's take another example. Say I know the same piece of information, that either Ron is in the Great Hall or Hermione is in the library, and the second piece of information I know is that Ron is not in the Great Hall or Harry is sleeping. So it's not just a single literal this time; I have two full clauses, and we'll define clauses more precisely in just a moment. What do I know here? Well, again, for any propositional symbol like Ron is in the Great Hall, there are only two possibilities. Either Ron is in the Great Hall, in which case, based on resolution, we know that Harry must be sleeping; or Ron is not in the Great Hall, in which case we know, based on the same rule, that Hermione must be in the library. Based on those two premises in combination, I can conclude that either Hermione is in the library or Harry is sleeping. Again, because these two literals conflict with each other, I know that one of those two must be true. You can take a closer look and reason through that logic; make sure you convince yourself that you believe this conclusion. Stated more generally, this resolution rule says that if we know P or Q is true, and we also know that not P or R is true, we can resolve these two clauses together to get a new clause, Q or R: either Q or R must be true. And again, much as in the last case, Q and R don't need to be single propositional symbols; they could be multiple symbols. If I had a clause with P or Q1 or Q2 or Q3, and so on up until Qn, where n is just some number, and likewise I had not P or R1 or R2, and so on up until Rm, where m is just some other number, I can resolve these two clauses together to conclude that Q1 or Q2 up until Qn, or R1 or R2 up until Rm, must be true. This is just a generalization of the same rule we saw before. Each of these things is what we're going to call a clause, where a clause is formally defined as a disjunction of literals. A disjunction means things connected with or; a conjunction, meanwhile, means things connected with and. And a literal is either a propositional symbol or the negation of a propositional symbol: something like P, or Q, or not P, or not Q.
And we call those literals. So a clause is just something like P or Q or R, for example. Meanwhile, what this gives us the ability to do is turn any logical sentence into something called conjunctive normal form. A conjunctive normal form sentence is a logical sentence that is a conjunction of clauses. Recall, again, that conjunction means things connected to one another using and. So a conjunction of clauses means an and of individual clauses, each of which has ors in it. Something like this: (A or B or C) and (D or not E) and (F or G). Everything in parentheses is one clause; all of the clauses are connected to each other using and; and everything inside a clause is separated using or. This is just a standard form we can translate a logical sentence into, one that makes it easy to work with and easy to manipulate. And it turns out that we can take any sentence in logic and turn it into conjunctive normal form just by applying inference rules and transformations to it. So we'll take a look at how we can actually do that. What is the process for taking a logical formula and converting it into conjunctive normal form, otherwise known as CNF? The process looks a little something like this. We need to take all of the symbols that are not part of conjunctive normal form, the biconditionals and the implications and so forth, and turn them into something closer to conjunctive normal form. The first step will be to eliminate biconditionals, those if-and-only-if double arrows. We know how to eliminate biconditionals, because we saw an inference rule to do just that: any time I have an expression like alpha if and only if beta, I can turn it into (alpha implies beta) and (beta implies alpha), based on the inference rule we saw before. Likewise, I can eliminate implications, the if-then arrows, using the inference rule we saw before as well: taking alpha implies beta and turning it into not alpha or beta, because that is logically equivalent. Then we can move nots inwards, because we don't want nots on the outsides of our expressions. Conjunctive normal form requires that it's just clause and clause and clause and clause; any nots need to be immediately next to propositional symbols. We can move those nots around using De Morgan's laws, by taking something like not (A and B) and turning it into not A or not B, for example. After that, all we'll be left with are ands and ors, and those are easy to deal with: we can use the distributive law to distribute the ors so that the ors end up on the inside of the expression, so to speak, and the ands end up on the outside. So this is the general pattern for taking a formula and converting it into conjunctive normal form. Let's now take a look at an example of how we would do this, and then explore why we would want to. Take this formula: (P or Q) implies R. I'd like to convert it into conjunctive normal form, where it's all ands of clauses, and every clause is a disjunction, things joined by ors. So what's the first thing I need to do? Well, this is an implication, so let me remove it. Using the implication elimination rule, I can turn (P or Q) implies R into not (P or Q) or R.
So that's the first step: I've gotten rid of the implication. Next, I can get rid of the not on the outside of this expression, too. I can move the nots inwards, so they're closer to the literals themselves, by using De Morgan's laws. De Morgan's law says that not (P or Q) is equivalent to not P and not Q. Again, here I'm just applying the inference rules we've already seen in order to transform these statements. Now I have two things separated by an or, where the thing on the inside is an and. What I'd really like is to move the ors to the inside, because conjunctive normal form means I need clause and clause and clause and clause. To do that, I can use the distributive law: if I have (not P and not Q) or R, I can distribute the or R to both parts to get (not P or R) and (not Q or R). This, here at the bottom, is in conjunctive normal form: it is a conjunction, an and, of clauses, each of which is a disjunction, literals separated by ors. This process can be applied to any formula, to take a logical sentence and turn it into conjunctive normal form, where I have clause and clause and clause and so on. So why is this helpful? Why do we even care about taking all these sentences and converting them into this form? It's because once they're in this form, these clauses are the inputs to the resolution inference rule we saw a moment ago: if I have two clauses where there's something that conflicts, something complementary, between them, I can resolve them to get a new clause, to draw a new conclusion. We call this process inference by resolution: using the resolution rule to draw some sort of inference. It's based on the same idea: if I have the clause P or Q, and I have the clause not P or R, I can resolve those two clauses together to get Q or R, a new piece of information I didn't have before. Now, a couple of key points are worth noting before we talk about the actual algorithm. One thing: let's imagine we have P or Q or S, and I also have not P or R or S. The resolution rule says that because this P conflicts with this not P, we resolve to put everything else together and get Q or S or R or S. But it turns out that the double S is redundant; the repeated or S doesn't change the meaning of the sentence. So when we do this resolution process, we'll usually also do a process known as factoring, where we take any duplicate literals that show up and eliminate them. So Q or S or R or S just becomes Q or R or S; the S only needs to appear once. One final question worth considering: what happens if I try to resolve P and not P together? If I know that P is true and I know that not P is true, resolution says I can merge these clauses together and look at everything else. Well, in this case there is nothing else, so I'm left with what we might call the empty clause; I'm left with nothing. And the empty clause is always false; the empty clause is equivalent to false. That's pretty reasonable, because it's impossible for both P and not P to hold at the same time: P is either true or it's not true, which means that if one of them is true, the other must be false. There is no way for both of them to hold at the same time.
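To recap, here is the whole conversion example written out as one derivation (LaTeX):

(P \lor Q) \rightarrow R

\neg(P \lor Q) \lor R \quad \text{(eliminate the implication)}

(\neg P \land \neg Q) \lor R \quad \text{(De Morgan's law)}

(\neg P \lor R) \land (\neg Q \lor R) \quad \text{(distribute the or)}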
So if ever I try to resolve two contradictory clauses like P and not P, I end up getting this empty clause, where the empty clause is equivalent to false. And this idea, that resolving two contradictory terms produces the empty clause, is the basis for our inference-by-resolution algorithm. Here's how we're going to perform inference by resolution at a very high level. We want to prove that our knowledge base entails some query alpha: that based on the knowledge we have, we can prove conclusively that alpha is true. How are we going to do that? We're going to try to show that the knowledge base together with not alpha is a contradiction. This is a common technique in computer science more generally, this idea of proving something by contradiction. If I want to prove that something is true, I can do so by first assuming that it is false and showing that this assumption leads to a contradiction. If the thing I'm trying to prove, when assumed false, leads to a contradiction, then it must be true. That's the logic behind a proof by contradiction, and that's what we're going to do here. We want to prove that the query alpha is true, so we assume that it's not true, we assume not alpha, and we try to derive a contradiction. If we do get a contradiction, then we know that our knowledge entails the query alpha. If we don't get a contradiction, there is no entailment. This is the idea of a proof by contradiction: assume the opposite of what you're trying to prove, and if you can demonstrate that the assumption is contradictory, then what you're proving must be true. But more formally, how do we actually do this? How do we check that the knowledge base and not alpha lead to a contradiction? Here is where resolution comes into play. To determine if our knowledge base entails some query alpha, we convert (knowledge base and not alpha) to conjunctive normal form, that form where we have a whole bunch of clauses all anded together. When we have these individual clauses, we can keep checking whether we can use resolution to produce a new clause: we take any pair of clauses and check whether some literal in one of them is the complement of a literal in the other. For example, a P in one clause and a not P in another clause, or an R in one clause and a not R in another. If ever I have that situation, where, once I've converted to conjunctive normal form, I see two clauses that I can resolve to produce a new clause, then I'll do so. This process occurs in a loop: I keep checking whether I can use resolution to produce a new clause, and I keep using those new clauses to try to generate more new clauses after that. Now, it may happen that eventually we produce the empty clause, the clause we were talking about before. If I resolve P and not P together, that produces the empty clause, and the empty clause we know to be false, because there's no way for both P and not P to be simultaneously true. So if ever we produce the empty clause, then we have a contradiction, and a contradiction is exactly what we were looking for in a proof by contradiction. If we have a contradiction, then we know that our knowledge base must entail the query alpha.
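Here is a minimal sketch of that loop in Python. It assumes everything has already been converted to CNF, and it represents each clause as a frozenset of string literals with a leading "-" marking negation; this representation is an illustration of the technique, not the lecture's actual code:

def complement(literal):
    # "-A" and "A" are complementary literals
    return literal[1:] if literal.startswith("-") else "-" + literal

def resolve(c1, c2):
    # Yield every clause obtainable by resolving c1 with c2.
    # The set union performs factoring (duplicate literals collapse) for free.
    for lit in c1:
        if complement(lit) in c2:
            yield frozenset((c1 - {lit}) | (c2 - {complement(lit)}))

def entails(kb_clauses, query_literal):
    # Proof by contradiction: add the negated query, then resolve until
    # we derive the empty clause (contradiction) or run out of new clauses.
    clauses = set(kb_clauses) | {frozenset({complement(query_literal)})}
    while True:
        new = set()
        for c1 in clauses:
            for c2 in clauses:
                if c1 != c2:
                    for resolvent in resolve(c1, c2):
                        if not resolvent:   # empty clause: contradiction found
                            return True
                        new.add(resolvent)
        if new <= clauses:                  # nothing new: no entailment
            return False
        clauses |= new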
So if we do derive the empty clause, we know that alpha must be true. It turns out, and we won't go into the proof here, that you can show the converse: if you don't produce the empty clause, then there is no entailment. If we run into a situation where there are no more new clauses to add, we've done all the resolution we can do, and yet we still haven't produced the empty clause, then there is no entailment. And this is the resolution algorithm. It's very abstract looking, especially this idea of what it even means to have the empty clause, so let's take a look at an example and actually try to prove some entailment using this inference-by-resolution process. Here's our question. We have this knowledge base: (A or B) and (not B or C) and (not C). And we want to know if all of this entails A. So this whole logical sentence is our knowledge base, and our query alpha is just the propositional symbol A. What do we do? Well, first, we want to prove by contradiction, so we assume that A is false and see if that leads to some sort of contradiction. Here is what we start with: A or B, and not B or C, and not C; that's our knowledge base. And we assume not A: we assume that the thing we're trying to prove is, in fact, false. This is now in conjunctive normal form, and I have four different clauses: A or B; not B or C; not C; and not A. Now I can begin to pick pairs of clauses to resolve and apply the resolution rule to them. Looking at these four clauses, I see two that I can resolve, because there are complementary literals in them: there's a C in one and a not C in the other. Just looking at those two clauses: if I know that not B or C is true, and I know that C is not true, then I can resolve them to conclude that not B must be true. I can generate this new clause as a new piece of information that I now know. All right, now I can repeat the process. Can I use resolution again to get some new conclusion? It turns out I can. I can use the new clause I just generated, along with the clause A or B. There are complementary literals: the B is complementary to, or conflicts with, the not B. So if I know that A or B is true, and I know that B is not true, then the only remaining possibility is that A must be true. So now we have A; that's a new clause I've been able to generate. And I can do this one more time, looking for two clauses that can be resolved; programmatically, you might do this by looping over all possible pairs of clauses and checking each pair for complementary literals. Here I find two clauses, not A and A, that conflict with each other. When I resolve these two together, it's the same as when we were resolving P and not P before: I get rid of the As, and I'm left with the empty clause. The empty clause we know to be false, which means we have a contradiction, which means we can safely say that this knowledge base does entail A. If the knowledge base is true, then we know that A, for sure, is also true. So this, inference by resolution, is an entirely different way to take some statement and try to prove that it is, in fact, true.
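Running a sketch like the one above on this exact example would look something like this (same hypothetical clause representation):

kb = {frozenset({"A", "B"}), frozenset({"-B", "C"}), frozenset({"-C"})}
print(entails(kb, "A"))  # True: the knowledge base entails A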
Instead of enumerating all of the possible worlds we might be in, in order to figure out in which cases the knowledge base is true and in which cases the query is true, we use this resolution algorithm to keep figuring out what conclusions we can draw and see if we reach a contradiction. If we reach a contradiction, that tells us whether our knowledge entails the query. It turns out there are many different algorithms that can be used for inference; what we've just looked at are only a couple of them. And in fact, all of this so far is based on one particular type of logic, propositional logic, where we have these individual symbols and we connect them using and, or, not, implication, and biconditionals. But propositional logic is not the only kind of logic that exists, and in fact we've seen that there are limitations in propositional logic, especially in examples like Mastermind, or the logic puzzle where we had people who each belonged to a different Hogwarts house and we were trying to figure out who belonged to which house. There were a lot of different propositional symbols that we needed just to represent some fairly basic ideas. So the final topic we'll take a look at before we end class today is one final type of logic, different from propositional logic, known as first order logic. First order logic is a little more powerful than propositional logic and is going to make it easier for us to express certain types of ideas. In propositional logic, if we think back to that puzzle with the people and the Hogwarts houses, we had a whole bunch of symbols, and every symbol could only be true or false. We had a symbol for MinervaGryffindor, which was true if Minerva was in Gryffindor and false otherwise, and likewise for MinervaHufflepuff and MinervaRavenclaw and MinervaSlytherin and so forth. But this was starting to get quite redundant. We wanted some way to express that there is a relationship between these propositional symbols, that Minerva shows up in all of them. And I also would have liked not to need so many different symbols to represent what really was a fairly straightforward problem. So first order logic gives us a different way of dealing with this idea, by giving us two different types of symbols. We're going to have constant symbols, which represent objects, like people or houses. And we'll have predicate symbols, which you can think of as relations or functions that take inputs and evaluate to true or false, that tell us whether or not some property of some constant, or some pair of constants, or multiple constants, actually holds. We'll see an example of that in just a moment. For now, in this same problem, our constant symbols might be objects like people or houses. So Minerva, Pomona, Horace, and Gilderoy are all constant symbols, as are my four houses: Gryffindor, Hufflepuff, Ravenclaw, and Slytherin. Predicate symbols, meanwhile, are properties that might hold true or false of these individual constants. So Person might hold true of Minerva, but it would be false for Gryffindor, because Gryffindor is not a person. And House is going to hold true for Ravenclaw, but it's not going to hold true for Horace, because Horace is a person.
BelongsTo, meanwhile, is going to be a relation that relates people to their houses; it tells me whether someone belongs to a house or does not. So let's take a look at some examples of what a sentence in first order logic might actually look like. A sentence might look like this: Person(Minerva), with Minerva in parentheses, Person being a predicate symbol, and Minerva being a constant symbol. This sentence in first order logic effectively means Minerva is a person, or the Person property applies to the Minerva object. So if I want to say something like Minerva is a person, this is how I express that idea using first order logic. Meanwhile, I can say something like House(Gryffindor) to likewise express the idea that Gryffindor is a house. And all of the same logical connectives we saw in propositional logic work here too: and, or, implication, biconditional, not. In fact, I can use not to say something like not House(Minerva), and this sentence in first order logic means something like: Minerva is not a house; it is not true that the House property applies to Minerva. Meanwhile, in addition to these predicate symbols that take a single argument, some of our predicate symbols are going to express binary relations, relations between two arguments. So I could say something like BelongsTo, with two inputs, Minerva and Gryffindor, to express the idea that Minerva belongs to Gryffindor. And here is the key difference, or one of the key differences, between this and propositional logic. In propositional logic, I needed one symbol for MinervaGryffindor, one symbol for MinervaHufflepuff, and one symbol for every other combination of person and house. In first order logic, I just need one symbol for each of my people and one symbol for each of my houses, and then I can express something like BelongsTo(Minerva, Gryffindor) as a predicate, to say that Minerva belongs to Gryffindor house. So already we can see that first order logic is quite expressive, able to capture these sorts of sentences using the constant symbols and predicates that already exist, while minimizing the number of new symbols I need to create: I can use just eight symbols, four for people and four for houses, instead of 16 symbols for every possible combination. But first order logic gives us a couple of additional features that let us express even more complex ideas, features generally known as quantifiers. There are two main quantifiers in first order logic, the first of which is universal quantification. Universal quantification lets me express the idea that something is going to be true for all values of a variable: for all values of x, some statement is going to hold true. So what might a sentence using universal quantification look like? We're going to use an upside-down A to mean for all. So for all x, where x is any object, the following holds: BelongsTo(x, Gryffindor) implies not BelongsTo(x, Hufflepuff). Let's try to parse that. It means that for all values of x, if x belongs to Gryffindor, then x does not belong to Hufflepuff. So translated into English, this sentence says something like: for all objects x, if x belongs to Gryffindor, then x does not belong to Hufflepuff.
Or, phrased even more simply: anyone in Gryffindor is not in Hufflepuff, a simplified way of saying the same thing. So this universal quantification lets us express an idea like: something is going to hold true for all values of a particular variable. In addition to universal quantification, though, we also have existential quantification. Whereas universal quantification said that something is going to be true for all values of a variable, existential quantification says that some expression is going to be true for some value of a variable, at least one value of the variable. So let's take a look at a sample sentence using existential quantification. One such sentence looks like this: there exists an x, where this backwards E (∃) stands for 'exists.' And here we're saying there exists an x such that House(x) and BelongsTo(Minerva, x). In other words, there exists some object x where x is a house and Minerva belongs to x. Or, phrased a little more succinctly in English, I'm here just saying Minerva belongs to a house: there's some object that is a house, and Minerva belongs to it. And combining universal and existential quantification, we can create far more sophisticated logical statements than we were able to using just propositional logic. I could combine these to say something like this: for all x, Person(x) implies there exists a y such that House(y) and BelongsTo(x, y). All right, so a lot of stuff going on there, a lot of symbols. Let's try and parse it out and just understand what it's saying. Here we're saying that for all values of x, if x is a person, then this is true. So in other words, I'm saying: for all people, and we call that person x, this statement is going to be true. What statement is true of all people? Well, there exists a y that is a house, so there exists some house, and x belongs to y. In other words, I'm saying that for all people out there, there exists some house such that x, the person, belongs to y, the house. Phrased more succinctly, I'm saying that every person belongs to a house: for all x, if x is a person, then there exists a house that x belongs to. And so we can now express far more powerful ideas using this idea of first-order logic. And it turns out there are many other kinds of logic out there. There's second-order logic and other higher-order logics, each of which allows us to express more and more complex ideas. But all of it, in this case, is really in pursuit of the same goal, which is the representation of knowledge. We want our AI agents to be able to know information, to represent that information, whether that's using propositional logic or first-order logic or some other logic, and then be able to reason based on that: to be able to draw conclusions, make inferences, figure out whether there's some sort of entailment relationship, by using some sort of inference algorithm, something like inference by resolution or model checking or any number of these other algorithms that we can use in order to take information that we know and translate it into additional conclusions. So all of this has helped us to create AI that is able to represent information about what it knows and what it doesn't know.
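As a small illustration of how these quantifiers behave, here is a minimal, hand-rolled Python sketch (not from the lecture) that evaluates the sentence above over a tiny made-up model. Universal quantification maps to all() and existential quantification to any(); the house assignments are hypothetical, chosen only to make the sentence come out true.

```python
# Evaluate "for all x, Person(x) implies there exists a y such that
# House(y) and BelongsTo(x, y)" over a small, made-up model.
people = {"Minerva", "Pomona", "Horace", "Gilderoy"}
houses = {"Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin"}

# BelongsTo as a set of (person, house) pairs; assignments are hypothetical.
belongs_to = {("Minerva", "Gryffindor"), ("Pomona", "Hufflepuff"),
              ("Horace", "Slytherin"), ("Gilderoy", "Ravenclaw")}

def person(x): return x in people   # predicate Person(x)
def house(x): return x in houses    # predicate House(x)

domain = people | houses
# "P implies Q" is written as "(not P) or Q"; ∀ becomes all(), ∃ becomes any().
sentence = all(
    (not person(x)) or any(house(y) and (x, y) in belongs_to for y in domain)
    for x in domain
)
print(sentence)  # True for this model
```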
Next time, though, we’ll take a look at how we can make our AI even more powerful by not just encoding information that we know for sure to be true and not to be true, but also to take a look at uncertainty, to look at what happens if AI thinks that something might be probable or maybe not very probable or somewhere in between those two extremes, all in the pursuit of trying to build our intelligent systems to be even more intelligent. We’ll see you next time. Thank you. All right, welcome back, everyone, to an introduction to artificial intelligence with Python. And last time, we took a look at how it is that AI inside of our computers can represent knowledge. We represented that knowledge in the form of logical sentences in a variety of different logical languages. And the idea was we wanted our AI to be able to represent knowledge or information and somehow use those pieces of information to be able to derive new pieces of information by inference, to be able to take some information and deduce some additional conclusions based on the information that it already knew for sure. But in reality, when we think about computers and we think about AI, very rarely are our machines going to be able to know things for sure. Oftentimes, there’s going to be some amount of uncertainty in the information that our AIs or our computers are dealing with, where it might believe something with some probability, as we’ll soon discuss what probability is all about and what it means, but not entirely for certain. And we want to use the information that it has some knowledge about, even if it doesn’t have perfect knowledge, to still be able to make inferences, still be able to draw conclusions. So you might imagine, for example, in the context of a robot that has some sensors and is exploring some environment, it might not know exactly where it is or exactly what’s around it, but it does have access to some data that can allow it to draw inferences with some probability. There’s some likelihood that one thing is true or another. Or you can imagine in context where there is a little bit more randomness and uncertainty, something like predicting the weather, where you might not be able to know for sure what tomorrow’s weather is with 100% certainty, but you can probably infer with some probability what tomorrow’s weather is going to be based on maybe today’s weather and yesterday’s weather and other data that you might have access to as well. And so oftentimes, we can distill this in terms of just possible events that might happen and what the likelihood of those events are. This comes a lot in games, for example, where there is an element of chance inside of those games. So you imagine rolling a dice. You’re not sure exactly what the die roll is going to be, but you know it’s going to be one of these possibilities from 1 to 6, for example. And so here now, we introduce the idea of probability theory. And what we’ll take a look at today is beginning by looking at the mathematical foundations of probability theory, getting an understanding for some of the key concepts within probability, and then diving into how we can use probability and the ideas that we look at mathematically to represent some ideas in terms of models that we can put into our computers in order to program an AI that is able to use information about probability to draw inferences, to make some judgments about the world with some probability or likelihood of being true. 
So probability ultimately boils down to this idea that there are possible worlds, which we're here representing using this little Greek letter omega (ω). And the idea of a possible world is that when I roll a die, there are six possible worlds that could result from it: I could roll a 1, or a 2, or a 3, or a 4, or a 5, or a 6, and each of those is a possible world. And each of those possible worlds has some probability of being true: the probability that I do roll a 1, or a 2, or a 3, or something else. And we represent that probability like this, using the capital letter P, and then, in parentheses, what it is that we want the probability of. So P(ω) would be the probability of some possible world as represented by the little letter omega. Now, there are a couple of basic axioms of probability that become relevant as we consider how we deal with probability and how we think about it. First and foremost, every probability value must range between 0 and 1 inclusive. So the smallest value any probability can have is the number 0, which represents an impossible event: something like, I roll a die and the roll I get is a 7. If the die only has numbers 1 through 6, the event that I roll a 7 is impossible, so it would have probability 0. And on the other end of the spectrum, probability can range all the way up to 1, meaning an event is certain to happen: that I roll a die and the number is less than 10, for example. That is an event that is guaranteed to happen if the only sides on my die are 1 through 6. And probabilities can take on any real number in between these two values, where, generally speaking, a higher value for the probability means the event is more likely to take place, and a lower value means the event is less likely to take place. And the other key rule for probability looks a little bit like this. This sigma notation, if you haven't seen it before, refers to summation, the idea that we're going to be adding up a whole sequence of values. And this sigma notation is going to come up a couple of times today, because as we deal with probability, oftentimes we're adding up a whole bunch of individual values or individual probabilities to get some other value. So we'll see this come up a couple of times. But what this notation means is that if I sum up the probabilities of all of the possible worlds omega in big Omega, which represents the set of all possible worlds, what I ultimately get is the number 1. So if I take all the possible worlds and add up what each of their probabilities is, I should get the number 1 at the end, meaning all the probabilities need to sum to 1. So, for example, if you imagine I have a fair die with numbers 1 through 6 and I roll the die, each one of these rolls has an equal probability of taking place, and that probability is 1 over 6. So each of these probabilities is between 0 and 1, 0 meaning impossible and 1 meaning certain, and if you add up all of these probabilities for all of the possible worlds, you get the number 1. And we can represent any one of those probabilities like this: the probability that we roll the number 2, for example, is just 1 over 6. Every six times we roll the die, we'd expect that one time the die might come up as a 2. Its probability is not certain, but it's a little more than nothing.
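Both axioms are easy to check mechanically. Here is a small sketch of my own (not from the lecture) using exact fractions for the fair-die distribution just described:

```python
from fractions import Fraction

# The sample space for a fair six-sided die: each possible world omega
# gets probability 1/6.
probabilities = {omega: Fraction(1, 6) for omega in range(1, 7)}

# Axiom 1: every probability lies between 0 and 1 inclusive.
assert all(0 <= p <= 1 for p in probabilities.values())

# Axiom 2: summing over all possible worlds gives exactly 1.
assert sum(probabilities.values()) == 1

print(probabilities[2])  # P(roll = 2) = 1/6
```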
And so this is all fairly straightforward for just a single die. But things get more interesting as our models of the world get a little bit more complex. Let's imagine now that we're not just dealing with a single die, but with two dice. I have a red die here and a blue die there, and I care not just about what the individual roll is, but about the sum of the two rolls. In this case, the sum of the two rolls is the number 3. How do I begin to reason about what the probabilities look like if, instead of having one die, I now have two dice? Well, what we might imagine is that we could first consider all of the possible worlds. And in this case, all of the possible worlds are just every combination of the red and blue die that I could come up with. The red die could be a 1 or a 2 or a 3 or a 4 or a 5 or a 6, and for each of those possibilities, the blue die, likewise, could also be either 1 or 2 or 3 or 4 or 5 or 6. And it just so happens that in this particular case, each of these possible combinations is equally likely; all of these various possible worlds are equally likely. That's not always going to be the case. If you imagine more complex models that we could try to build of things in the real world, it's probably not going to be the case that every single possible world is always equally likely. But in the case of fair dice, where in any given die roll any one number has just as good a chance of coming up as any other number, we can consider all of these possible worlds to be equally likely. But even though all of the possible worlds are equally likely, that doesn't necessarily mean that their sums are equally likely. So if we consider the sums of these pairs, 1 plus 1 is 2, 2 plus 1 is 3, and so on, and look, for each of these possible pairs of numbers, at what their sum ultimately is, we can notice that there are some patterns here, and it's not the case that every sum comes up equally likely. If you consider 7, for example, what's the probability that when I roll two dice, their sum is 7? There are several ways this can happen: there are six possible worlds where the sum is 7. It could be a 1 and a 6, or a 2 and a 5, or a 3 and a 4, or a 4 and a 3, and so forth. But if you instead consider the probability that I roll two dice and the sum of those two die rolls is 12, looking at this diagram, there's only one possible world in which that can happen, and that's the possible world where both the red die and the blue die come up as sixes, to give us a sum total of 12. So just from taking a look at this diagram, we see that some of these probabilities must be different: the probability that the sum is a 7 must be greater than the probability that the sum is a 12. And we can represent that even more formally by saying, OK, the probability that we sum to 12 is 1 out of 36. Out of the 36 equally likely possible worlds (6 squared, because we have six options for the red die and six options for the blue die), only one of them sums to 12. Whereas, on the other hand, for the probability that the two dice sum up to the number 7, well, out of those 36 possible worlds, there were six worlds where the sum was 7, and so we get 6 over 36, which we can simplify as a fraction to just 1 over 6.
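Since the 36 worlds are easy to enumerate, the two sums just discussed can be checked directly. A short enumeration sketch of my own:

```python
from fractions import Fraction

# All 36 equally likely possible worlds for a red die and a blue die.
worlds = [(red, blue) for red in range(1, 7) for blue in range(1, 7)]
p_world = Fraction(1, 36)

p_sum_7 = sum(p_world for red, blue in worlds if red + blue == 7)
p_sum_12 = sum(p_world for red, blue in worlds if red + blue == 12)

print(p_sum_7)   # 1/6  (six worlds sum to 7)
print(p_sum_12)  # 1/36 (only red = 6, blue = 6 sums to 12)
```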
So here now, we’re able to represent these different ideas of probability, representing some events that might be more likely and then other events that are less likely as well. And these sorts of judgments, where we’re figuring out just in the abstract what is the probability that this thing takes place, are generally known as unconditional probabilities. Some degree of belief we have in some proposition, some fact about the world, in the absence of any other evidence. Without knowing any additional information, if I roll a die, what’s the chance it comes up as a 2? Or if I roll two dice, what’s the chance that the sum of those two die rolls is a 7? But usually when we’re thinking about probability, especially when we’re thinking about training in AI to intelligently be able to know something about the world and make predictions based on that information, it’s not unconditional probability that our AI is dealing with, but rather conditional probability, probability where rather than having no original knowledge, we have some initial knowledge about the world and how the world actually works. So conditional probability is the degree of belief in a proposition given some evidence that has already been revealed to us. So what does this look like? Well, it looks like this in terms of notation. We’re going to represent conditional probability as probability of A and then this vertical bar and then B. And the way to read this is the thing on the left-hand side of the vertical bar is what we want the probability of. Here now, I want the probability that A is true, that it is the real world, that it is the event that actually does take place. And then on the right side of the vertical bar is our evidence, the information that we already know for certain about the world. For example, that B is true. So the way to read this entire expression is what is the probability of A given B, the probability that A is true, given that we already know that B is true. And this type of judgment, conditional probability, the probability of one thing given some other fact, comes up quite a lot when we think about the types of calculations we might want our AI to be able to do. For example, we might care about the probability of rain today given that we know that it rained yesterday. We could think about the probability of rain today just in the abstract. What is the chance that today it rains? But usually, we have some additional evidence. I know for certain that it rained yesterday. And so I would like to calculate the probability that it rains today given that I know that it rained yesterday. Or you might imagine that I want to know the probability that my optimal route to my destination changes given the current traffic condition. So whether or not traffic conditions change, that might change the probability that this route is actually the optimal route. Or you might imagine in a medical context, I want to know the probability that a patient has a particular disease given some results of some tests that have been performed on that patient. And I have some evidence, the results of that test, and I would like to know the probability that a patient has a particular disease. So this notion of conditional probability comes up everywhere. So we begin to think about what we would like to reason about, but being able to reason a little more intelligently by taking into account evidence that we already have. 
We’re more able to get an accurate result for what is the likelihood that someone has this disease if we know this evidence, the results of the test, as opposed to if we were just calculating the unconditional probability of saying, what is the probability they have the disease without any evidence to try and back up our result one way or the other. So now that we’ve got this idea of what conditional probability is, the next question we have to ask is, all right, how do we calculate conditional probability? How do we figure out mathematically, if I have an expression like this, how do I get a number from that? What does conditional probability actually mean? Well, the formula for conditional probability looks a little something like this. The probability of a given b, the probability that a is true, given that we know that b is true, is equal to this fraction, the probability that a and b are true, divided by just the probability that b is true. And the way to intuitively try to think about this is that if I want to know the probability that a is true, given that b is true, well, I want to consider all the ways they could both be true out of the only worlds that I care about are the worlds where b is already true. I can sort of ignore all the cases where b isn’t true, because those aren’t relevant to my ultimate computation. They’re not relevant to what it is that I want to get information about. So let’s take a look at an example. Let’s go back to that example of rolling two dice and the idea that those two dice might sum up to the number 12. We discussed earlier that the unconditional probability that if I roll two dice and they sum to 12 is 1 out of 36, because out of the 36 possible worlds that I might care about, in only one of them is the sum of those two dice 12. It’s only when red is 6 and blue is also 6. But let’s say now that I have some additional information. I now want to know what is the probability that the two dice sum to 12, given that I know that the red die was a 6. So I already have some evidence. I already know the red die is a 6. I don’t know what the blue die is. That information isn’t given to me in this expression. But given the fact that I know that the red die rolled a 6, what is the probability that we sum to 12? And so we can begin to do the math using that expression from before. Here, again, are all of the possibilities, all of the possible combinations of red die being 1 through 6 and blue die being 1 through 6. And I might consider first, all right, what is the probability of my evidence, my B variable, where I want to know, what is the probability that the red die is a 6? Well, the probability that the red die is a 6 is just 1 out of 6. So these 1 out of 6 options are really the only worlds that I care about here now. All the rest of them are irrelevant to my calculation, because I already have this evidence that the red die was a 6, so I don’t need to care about all of the other possibilities that could result. So now, in addition to the fact that the red die rolled as a 6 and the probability of that, the other piece of information I need to know in order to calculate this conditional probability is the probability that both of my variables, A and B, are true. The probability that both the red die is a 6, and they all sum to 12. So what is the probability that both of these things happen? Well, it only happens in one possible case in 1 out of these 36 cases, and it’s the case where both the red and the blue die are equal to 6. 
That, as we discussed, is a piece of information we already knew, and so this probability is equal to 1 over 36. And so to get the conditional probability that the sum is 12, given that I know the red die is equal to 6, I just divide these two values, and 1 over 36 divided by 1 over 6 gives us a probability of 1 over 6. Given that I know the red die rolled a value of 6, the probability that the sum of the two dice is 12 is also 1 over 6. And that probably makes intuitive sense to you, too, because if the red die is a 6, the only way for me to get to a 12 is if the blue die also rolls a 6, and we know that the probability of the blue die rolling a 6 is 1 over 6. So in this case, the conditional probability seems fairly straightforward. But this idea of calculating a conditional probability by looking at the probability that both of these events take place is an idea that's going to come up again and again. This is the definition of conditional probability, and we're going to use that definition as we think about probability more generally, to be able to draw conclusions about the world. This, again, is that formula: the probability of A given B is equal to the probability that A and B take place, divided by the probability of B. And you'll see this formula sometimes written in a couple of different ways. You could imagine algebraically multiplying both sides of this equation by the probability of B to get rid of the fraction, and you'll get an expression like this: the probability of A and B is just the probability of B times the probability of A given B. Or you could represent this equivalently: since A and B in this expression are interchangeable, A and B is the same thing as B and A, you could also represent the probability of A and B as the probability of A times the probability of B given A, just switching all of the As and Bs. These are all equivalent ways of representing the relationship between joint probability, the probability of A and B, and conditional probability. And so you'll sometimes see all of these equations, and they might be useful to you as you begin to reason about probability and to think about what values might be taking place in the real world. Now, sometimes when we deal with probability, we don't just care about a Boolean event, like did this happen or did this not happen. Sometimes we might want the ability to represent variable values in a probability space, where some variable might take on multiple different possible values. In probability theory, we call such a variable a random variable: some variable that has a domain of values it can take on. So what do I mean by this? Well, I might have a random variable that is just called roll, for example, that has six possible values. Roll is my variable, and the possible values, the domain of values it can take on, are 1, 2, 3, 4, 5, and 6. And I might like to know the probability of each. In this case, they happen to all be the same. But for other random variables, that might not be the case. For example, I might have a random variable to represent the weather, where the domain of values it could take on are things like sun or cloudy or rainy or windy or snowy, and each of those might have a different probability. And I care about knowing what is the probability that the weather equals sun, or that the weather equals clouds, for instance.
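To tie the dice example back to the definition before returning to random variables, here is a small enumeration sketch of my own (not from the lecture) that checks both the conditional-probability formula and the product rule on the 36 dice worlds:

```python
from fractions import Fraction

worlds = [(red, blue) for red in range(1, 7) for blue in range(1, 7)]
p = Fraction(1, 36)  # each world is equally likely

p_red6 = sum(p for red, blue in worlds if red == 6)       # P(B) = 1/6
p_red6_and_sum12 = sum(p for red, blue in worlds
                       if red == 6 and red + blue == 12)  # P(A and B) = 1/36

# Definition of conditional probability: P(A | B) = P(A and B) / P(B).
p_sum12_given_red6 = p_red6_and_sum12 / p_red6
print(p_sum12_given_red6)  # 1/6

# Product rule: P(A and B) = P(B) * P(A | B).
assert p_red6_and_sum12 == p_red6 * p_sum12_given_red6
```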
And I might like to do some mathematical calculations based on those weather probabilities. Other random variables might be something like traffic: what are the odds that there is no traffic, or light traffic, or heavy traffic? Traffic, in this case, is my random variable, and the values that random variable can take on are none, light, or heavy. And I, the person doing these calculations, I, the person encoding these random variables into my computer, need to make the decision as to what these possible values actually are. You might imagine, for example, that if I care about whether or not my flight is on time, my flight has a couple of possible values that it could take on: my flight could be on time, my flight could be delayed, my flight could be canceled. So flight, in this case, is my random variable, and these are the values that it can take on. And often, I want to know something about the probability that my random variable takes on each of those possible values. And this is what we call a probability distribution. A probability distribution takes a random variable and gives me the probability for each of the possible values in its domain. So in the case of this flight, for example, my probability distribution might look something like this. My probability distribution says the probability that the random variable flight is equal to the value on time is 0.6; or, put into more English, human-friendly terms, the likelihood that my flight is on time is 60%, for example. And in this case, the probability that my flight is delayed is 30%, and the probability that my flight is canceled is 10%, or 0.1. And if you sum up all of these possible values, the sum is going to be 1, right? If you take all of the possible worlds, here my three possible worlds for the value of the random variable flight, and add them all up together, the result needs to be the number 1, per that axiom of probability theory we discussed before. So this is one way of representing this probability distribution for the random variable flight. Sometimes you'll see it represented a little bit more concisely, since this is pretty verbose for really just trying to express three possible values. And so often, you'll instead see the same idea represented using a vector. And all a vector is is a sequence of values: as opposed to just a single value, I might have multiple values. And so I could instead represent this idea this way: bold P, a capital P, generally meaning the probability distribution of this variable flight, is equal to this vector represented in angle brackets. The probability distribution is 0.6, 0.3, and 0.1, and I would just have to know that this probability distribution is in the order on time, delayed, canceled to know how to interpret this vector: the first value in the vector is the probability that my flight is on time, the second value is the probability that my flight is delayed, and the third value is the probability that my flight is canceled. And so this is just an alternate way of representing the same idea, a little more concisely. But oftentimes, you'll see us just talk about a probability distribution over a random variable. And whenever we talk about that, what we're really doing is trying to figure out the probabilities of each of the possible values that the random variable can take on.
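Here is how that flight distribution might be written in plain Python, once keyed by value and once as the ordered vector just described. This is a sketch of the notation only, not any particular library's API:

```python
# The flight distribution from the lecture, written two ways.
P_flight = {"on time": 0.6, "delayed": 0.3, "canceled": 0.1}
P_flight_vector = (0.6, 0.3, 0.1)  # order: on time, delayed, canceled

# Per the axioms, the values of a distribution must sum to 1.
assert abs(sum(P_flight.values()) - 1.0) < 1e-9
```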
This vector notation is just a little bit more succinct, even though it can sometimes be a little confusing, depending on the context in which you see it. So we'll start to look at examples where we use this sort of notation to describe probability and to describe events that might take place. There are a couple of other important ideas to know with regard to probability theory. One is this idea of independence. And independence refers to the idea that knowledge of one event doesn't influence the probability of another event. So, for example, in the context of my two dice rolls, those two events, the roll of the red die and the roll of the blue die, are independent. Knowing the result of the red die doesn't change the probabilities for the blue die; it doesn't give me any additional information about what the value of the blue die is ultimately going to be. But that's not always going to be the case. You might imagine that in the case of weather, something like clouds and rain are probably not independent: if it is cloudy, that might increase the probability that later in the day it's going to rain. So some information informs some other event or some other random variable. So independence refers to the idea that one event doesn't influence the other, and if they're not independent, then there might be some relationship. So mathematically, formally, what does independence actually mean? Well, recall this formula from before: the probability of A and B is the probability of A times the probability of B given A. And the intuitive way to think about this is that to know how likely it is that A and B both happen, let's first figure out the likelihood that A happens, and then, given that we know A happens, figure out the likelihood that B happens, and multiply those two things together. But if A and B are independent, meaning knowing A doesn't change anything about the likelihood that B is true, then the probability of B given A, the probability that B is true given that I know A is true, shouldn't really depend on A at all: A shouldn't influence B. So the probability of B given A is really just the probability of B, if it is true that A and B are independent. And so this right here is one definition of what it means for A and B to be independent: the probability of A and B is just the probability of A times the probability of B. Anytime you find two events A and B where this relationship holds, you can say that A and B are independent. So an example of that might be the dice that we were taking a look at before. Here, if I wanted the probability of red being a 6 and blue being a 6, that's just the probability that red is a 6 multiplied by the probability that blue is a 6; both sides work out to 1 over 36, so I can say that these two events are independent. This case, as we talked about before, has a probability of 1 over 36. But what wouldn't be independent would be a case like this: the probability that the red die rolls a 6 and the red die rolls a 4. If I'm only rolling the die once, you might imagine the naive approach is to say, well, each of these has a probability of 1 over 6, so multiply them together and the probability is 1 over 36.
But of course, if you’re only rolling the red die once, there’s no way you could get two different values for the red die. It couldn’t both be a 6 and a 4. So the probability should be 0. But if you were to multiply probability of red 6 times probability of red 4, well, that would equal 1 over 36. But of course, that’s not true. Because we know that there is no way, probability 0, that when we roll the red die once, we get both a 6 and a 4, because only one of those possibilities can actually be the result. And so we can say that the event that red roll is 6 and the event that red roll is 4, those two events are not independent. If I know that the red roll is a 6, I know that the red roll cannot possibly be a 4, so these things are not independent. And instead, if I wanted to calculate the probability, I would need to use this conditional probability as the regular definition of the probability of two events taking place. And the probability of this now, well, the probability of the red roll being a 6, that’s 1 over 6. But what’s the probability that the roll is a 4 given that the roll is a 6? Well, this is just 0, because there’s no way for the red roll to be a 4, given that we already know the red roll is a 6. And so the value, if we do add all that multiplication, is we get the number 0. So this idea of conditional probability is going to come up again and again, especially as we begin to reason about multiple different random variables that might be interacting with each other in some way. And this gets us to one of the most important rules in probability theory, which is known as Bayes rule. And it turns out that just using the information we’ve already learned about probability and just applying a little bit of algebra, we can actually derive Bayes rule for ourselves. But it’s a very important rule when it comes to inference and thinking about probability in the context of what it is that a computer can do or what a mathematician could do by having access to information about probability. So let’s go back to these equations to be able to derive Bayes rule ourselves. We know the probability of A and B, the likelihood that A and B take place, is the likelihood of B, and then the likelihood of A, given that we know that B is already true. And likewise, the probability of A given A and B is the probability of A times the probability of B, given that we know that A is already true. This is sort of a symmetric relationship where it doesn’t matter the order of A and B and B and A mean the same thing. And so in these equations, we can just swap out A and B to be able to represent the exact same idea. So we know that these two equations are already true. We’ve seen that already. And now let’s just do a little bit of algebraic manipulation of this stuff. Both of these expressions on the right-hand side are equal to the probability of A and B. So what I can do is take these two expressions on the right-hand side and just set them equal to each other. If they’re both equal to the probability of A and B, then they both must be equal to each other. So probability of A times probability of B given A is equal to the probability of B times the probability of A given B. And now all we’re going to do is do a little bit of division. I’m going to divide both sides by P of A. And now I get what is Bayes’ rule. The probability of B given A is equal to the probability of B times the probability of A given B divided by the probability of A. 
And sometimes in Bayes’ rule, you’ll see the order of these two arguments switched. So instead of B times A given B, it’ll be A given B times B. That ultimately doesn’t matter because in multiplication, you can switch the order of the two things you’re multiplying, and it doesn’t change the result. But this here right now is the most common formulation of Bayes’ rule. The probability of B given A is equal to the probability of A given B times the probability of B divided by the probability of A. And this rule, it turns out, is really important when it comes to trying to infer things about the world, because it means you can express one conditional probability, the conditional probability of B given A, using knowledge about the probability of A given B, using the reverse of that conditional probability. So let’s first do a little bit of an example with this, just to see how we might use it, and then explore what this means a little bit more generally. So we’re going to construct a situation where I have some information. There are two events that I care about, the idea that it’s cloudy in the morning and the idea that it is rainy in the afternoon. Those are two different possible events that could take place, cloudy in the morning, or the AM, rainy in the PM. And what I care about is, given clouds in the morning, what is the probability of rain in the afternoon? A reasonable question I might ask, in the morning, I look outside, or an AI’s camera looks outside and sees that there are clouds in the morning. And we want to conclude, we want to figure out what is the probability that in the afternoon, there is going to be rain. Of course, in the abstract, we don’t have access to this kind of information, but we can use data to begin to try and figure this out. So let’s imagine now that I have access to some pieces of information. I have access to the idea that 80% of rainy afternoons start out with a cloudy morning. And you might imagine that I could have gathered this data just by looking at data over a sequence of time, that I know that 80% of the time when it’s raining in the afternoon, it was cloudy that morning. I also know that 40% of days have cloudy mornings. And I also know that 10% of days have rainy afternoons. And now using this information, I would like to figure out, given clouds in the morning, what is the probability that it rains in the afternoon? I want to know the probability of afternoon rain given morning clouds. And I can do that, in particular, using this fact, the probability of, so if I know that 80% of rainy afternoons start with cloudy mornings, then I know the probability of cloudy mornings given rainy afternoons. So using sort of the reverse conditional probability, I can figure that out. Expressed in terms of Bayes rule, this is what that would look like. Probability of rain given clouds is the probability of clouds given rain times the probability of rain divided by the probability of clouds. Here I’m just substituting in for the values of a and b from that equation of Bayes rule from before. And then I can just do the math. I have this information. I know that 80% of the time, if it was raining, then there were clouds in the morning. So 0.8 here. Probability of rain is 0.1, because 10% of days were rainy, and 40% of days were cloudy. I do the math, and I can figure out the answer is 0.2. So the probability that it rains in the afternoon, given that it was cloudy in the morning, is 0.2 in this case. 
And this now is an application of Bayes' rule: the idea that using one conditional probability, we can get the reverse conditional probability. And this is often useful when one of the conditional probabilities might be easier for us to know about, or easier for us to have data about, and using that information, we can calculate the other conditional probability. So what does this look like? Well, it means that knowing the probability of cloudy mornings given rainy afternoons, we can calculate the probability of rainy afternoons given cloudy mornings. Or, more generally, if we know the probability of some visible effect, some effect that we can see and observe, given some unknown cause that we're not sure about, then we can calculate the probability of that unknown cause given the visible effect. So what might that look like? Well, in the context of medicine, for example, I might know the probability of some medical test result given a disease. Like, I know that if someone has a disease, then x% of the time the medical test result will show up a certain way, for instance. And using that information, I can then calculate: all right, given that I know the medical test result, what is the likelihood that someone has the disease? The first piece of information is usually easier to know, easier to immediately have access to data for, and the second is the information I actually want to calculate. Or I might know, for example, that some percentage of counterfeit bills have blurry text around the edges, because counterfeit printers aren't nearly as good at printing text precisely. So I have some information: given that something is a counterfeit bill, x% of counterfeit bills have blurry text, for example. And using that information, I can then calculate some piece of information that I might want to know: given that I know there's blurry text on a bill, what is the probability that the bill is counterfeit? So given one conditional probability, I can calculate the other conditional probability as well. And so now we've taken a look at a couple of different types of probability. We've looked at unconditional probability, where I just ask what is the probability of this event occurring, given no additional evidence that I might have access to. And we've also looked at conditional probability, where I have some sort of evidence and I would like to use that evidence to calculate some other probability as well. And the other kind of probability that will be important for us to think about is joint probability, which is when we're considering the likelihood of multiple different events simultaneously. So what do we mean by this? For example, I might have probability distributions that look a little something like this: say I want to know the probability distribution of clouds in the morning, and that distribution looks like this. 40% of the time, C, which is my random variable here, takes the value cloudy, and 60% of the time it's not cloudy. So here is just a simple probability distribution that is effectively telling me that 40% of the time, it's cloudy. I might also have a probability distribution for rain in the afternoon, where 10% of the time, or with probability 0.1, it is raining in the afternoon, and with probability 0.9, it is not raining in the afternoon.
And using just these two pieces of information, I don't actually have a whole lot of information about how these two variables relate to each other. But I could, if I had access to their joint probability: meaning, for every combination of these two things, morning cloudy and afternoon rain, morning cloudy and afternoon not rain, morning not cloudy and afternoon rain, and morning not cloudy and afternoon not rain, if I had access to values for each of those four, I'd have more information. That information would be organized in a table like this, and this, rather than just a probability distribution, is a joint probability distribution. It tells me the probability of each of the possible combinations of values that these random variables can take on. So if I want to know the probability that on any given day it is both cloudy and rainy, well, I would say: all right, we're looking at the cases where it is cloudy and the cases where it is raining, and the intersection of that row and that column is 0.08. So that is the probability that it is both cloudy and rainy, using that information. And using this joint probability table, I can begin to derive other pieces of information, things like conditional probability. So I might ask a question like: what is the probability distribution of clouds given that I know that it is raining? Meaning, I know for sure that it's raining; tell me the probability distribution over whether it's cloudy or not, given that I already know that it is, in fact, raining. And here I'm using C to stand for that random variable. I'm looking for a distribution, meaning the answer is not going to be a single value; it's going to be two values, a vector of two values, where the first value is the probability of clouds and the second value is the probability that it is not cloudy, and the sum of those two values is going to be 1, because when you add up the probabilities of all of the possible worlds, the result you get must be the number 1. And what do we know about how to calculate a conditional probability? Well, we know that the probability of A given B is the probability of A and B divided by the probability of B. So what does this mean? It means that I can calculate the probability of clouds given that it's raining as the probability of clouds and rain, divided by the probability of rain. And this comma here, in the probability distribution of clouds and rain, sort of stands in for the word 'and': you'll see the logical operator for and and the comma used interchangeably. So this means the probability distribution over clouds, together with the fact that it is raining, divided by the probability of rain. And the interesting thing to note here, and what we'll often do in order to simplify our mathematics, is that the probability of rain here is just some numerical constant: it is some number. Dividing by the probability of rain is just dividing by some constant, or, in other words, multiplying by the inverse of that constant. And it turns out that oftentimes we can just not worry about what the exact value of that constant is, and just know that it is, in fact, a constant value. And we'll see why in a moment.
So instead of expressing this as this joint probability divided by the probability of rain, sometimes we'll just represent it as alpha times the numerator here: the probability distribution of C, this variable, together with the fact that we know it is raining. So all we've done here is say that this value, 1 over the probability of rain, is really just a constant that we're going to divide by, or equivalently multiply by the inverse of, at the end. We'll just call it alpha for now and deal with it a little bit later. But the key idea here, and this is an idea that's going to come up again, is that the conditional distribution of C given rain is proportional to, meaning just some factor multiplied by, the joint probability of C and rain. And so how do we figure this out? Well, this is going to be the joint probability that it is cloudy and raining, which is 0.08, and the joint probability that it is not cloudy and raining, which is 0.02. And so we get alpha times this distribution: 0.08 for cloudy and rain, 0.02 for not cloudy and rain. But of course, 0.08 and 0.02 don't sum up to the number 1, and we know that in a probability distribution, if you consider all of the possible values, they must sum up to a probability of 1. And so we know that we just need to figure out some constant to normalize, so to speak, these values, something we can multiply or divide by so that all these probabilities sum up to 1. And it turns out that if we multiply both numbers by 10, then we get the result of 0.8 and 0.2. The proportions are still equivalent, but now 0.8 plus 0.2 sums up to the number 1. So take a look at this and see if you can understand, step by step, how it is we're getting from one point to another. The key idea here is that by using the joint probabilities, the probability that it is both cloudy and rainy, and the probability that it is not cloudy and rainy, I can figure out the conditional probability given that it's raining, what is the chance that it's cloudy versus not cloudy, just by multiplying by some normalization constant, so to speak. And this is what a computer can begin to use to be able to interact with these various different types of probabilities. And it turns out there are a number of other probability rules that are going to be useful to us as we begin to explore how we can actually use this information to encode into our computers some more complex analysis that we might want to do about probability and distributions and random variables that we might be interacting with. So here are a couple of those important probability rules. One of the simplest rules is just this negation rule: what is the probability of not event A? So A is an event that has some probability, and I would like to know the probability that A does not occur. And it turns out it's just 1 minus P of A, which makes sense, because either A happens or A doesn't happen, and when you add up those two cases, you must get 1; P of A and P of not A must sum up to the number 1, since they include all of the possible cases. So P of not A must just be 1 minus P of A. We've seen an expression for calculating the probability of A and B. We might also reasonably want to calculate the probability of A or B: what is the probability that one thing happens or another thing happens?
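Before working through that disjunction question, here is a small sketch of the normalization trick from just above. The 0.08 and 0.02 cells are given in the text; the other two cells of the joint table are filled in so that P(cloudy) = 0.4, P(rain) = 0.1, and everything sums to 1, as the distributions stated earlier require:

```python
# Joint distribution over (Cloudy, Rain), per the lecture's table.
joint = {
    ("cloudy", "rain"): 0.08,     ("cloudy", "no rain"): 0.32,
    ("not cloudy", "rain"): 0.02, ("not cloudy", "no rain"): 0.58,
}

# P(C | rain) is proportional to the joint values in the rain column...
unnormalized = {c: joint[(c, "rain")] for c in ("cloudy", "not cloudy")}

# ...and alpha is whatever constant makes the values sum to 1 (here, 10).
alpha = 1 / sum(unnormalized.values())
P_C_given_rain = {c: alpha * p for c, p in unnormalized.items()}
print(P_C_given_rain)  # approximately {'cloudy': 0.8, 'not cloudy': 0.2}
```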
So, for example, I might want to calculate the probability that if I roll two dice, a red die and a blue die, the red die is a 6 or the blue die is a 6, one or the other. And what you might imagine you could do, and the wrong way to approach it, would be to say: all right, well, the red die comes up as a 6 with probability 1 over 6; the same for the blue die, it's also 1 over 6; add them together, and you get 2 over 6, otherwise known as 1/3. But this suffers from a problem of overcounting: we've double-counted the case where both the red die and the blue die come up as a 6, and I've counted that instance twice. So to resolve this, the actual expression for calculating the probability of A or B uses what we call the inclusion-exclusion formula. I take the probability of A, add it to the probability of B, that's all the same as before, but then I need to exclude the cases that I've double-counted, so I subtract from that the probability of A and B. And that gets me the result for A or B: I consider all the cases where A is true and all the cases where B is true, and if you imagine this as a Venn diagram of cases where A is true and cases where B is true, I just need to subtract out the middle to get rid of the cases that I have overcounted by counting them inside both of these individual expressions. One other rule that's going to be quite helpful is a rule called marginalization. Marginalization answers the question: how do I figure out the probability of A using some other variable that I might have access to, like B, even if I don't know additional information about it? I know that B, some event, can have two possible states: either B happens or B doesn't happen, assuming it's a Boolean, true or false. And what that means is that for me to be able to calculate the probability of A, there are only two cases: either A happens and B happens, or A happens and B doesn't happen. And those two cases are disjoint, meaning they can't both happen together: either B happens or B doesn't happen; they're separate cases. And so I can figure out the probability of A just by adding up those two cases: the probability that A is true is the probability that A and B is true, plus the probability that A is true and B isn't true. So by marginalizing, I've looked at the two possible cases that might take place, either B happens or B doesn't happen, and in each of those cases, I've looked at the probability that A happens. And if I add those together, I get the probability that A happens as a whole. So take a look at that rule. It doesn't matter what B is or how it's related to A: so long as I know these joint distributions, I can figure out the overall probability of A. And this can be a useful way, if I have a joint distribution like the joint distribution of A and B, to figure out some unconditional probability, like the probability of A. And we'll see examples of this soon as well. Now, sometimes these might not just be events that either happened or didn't happen, like B is here. They might be random variables with some broader probability distribution, where there are multiple possible values. And so, in order to use this marginalization rule there, I need to sum up not just over B and not B, but over all of the possible values that the other random variable could take on. And so here, we'll see a version of this rule for random variables.
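Before that random-variable version, here is a quick check of the disjunction rule and the Boolean marginalization rule on the 36 dice worlds, an enumeration sketch of my own:

```python
from fractions import Fraction

worlds = [(red, blue) for red in range(1, 7) for blue in range(1, 7)]
p = Fraction(1, 36)

p_red6 = sum(p for red, blue in worlds if red == 6)                # 1/6
p_blue6 = sum(p for red, blue in worlds if blue == 6)              # 1/6
p_both = sum(p for red, blue in worlds if red == 6 and blue == 6)  # 1/36

# Inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A and B).
p_either = p_red6 + p_blue6 - p_both
print(p_either)  # 11/36, not the over-counted 12/36

# Marginalization over a Boolean B: P(A) = P(A, B) + P(A, not B).
p_red6_and_notblue6 = sum(p for red, blue in worlds
                          if red == 6 and blue != 6)
assert p_red6 == p_both + p_red6_and_notblue6
```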
And it’s going to include that summation notation to indicate that I’m summing up, adding up a whole bunch of individual values. So here’s the rule. Looks a lot more complicated, but it’s actually the equivalent exactly the same rule. What I’m saying here is that if I have two random variables, one called x and one called y, well, the probability that x is equal to some value x sub i, this is just some value that this variable takes on. How do I figure it out? Well, I’m going to sum up over j, where j is going to range over all of the possible values that y can take on. Well, let’s look at the probability that x equals xi and y equals yj. So the exact same rule, the only difference here is now I’m summing up over all of the possible values that y can take on, saying let’s add up all of those possible cases and look at this joint distribution, this joint probability, that x takes on the value I care about, given all of the possible values for y. And if I add all those up, then I can get this unconditional probability of what x is equal to, whether or not x is equal to some value x sub i. So let’s take a look at this rule, because it does look a little bit complicated. Let’s try and put a concrete example to it. Here again is that same joint distribution from before. I have cloud, not cloudy, rainy, not rainy. And maybe I want to access some variable. I want to know what is the probability that it is cloudy. Well, marginalization says that if I have this joint distribution and I want to know what is the probability that it is cloudy, well, I need to consider the other variable, the variable that’s not here, the idea that it’s rainy. And I consider the two cases, either it’s raining or it’s not raining. And I just sum up the values for each of those possibilities. In other words, the probability that it is cloudy is equal to the sum of the probability that it’s cloudy and it’s rainy and the probability that it’s cloudy and it is not raining. And so these now are values that I have access to. These are values that are just inside of this joint probability table. What is the probability that it is both cloudy and rainy? Well, it’s just the intersection of these two here, which is 0.08. And the probability that it’s cloudy and not raining is, all right, here’s cloudy, here’s not raining. It’s 0.32. So it’s 0.08 plus 0.32, which just gives us equal to 0.4. That is the unconditional probability that it is, in fact, cloudy. And so marginalization gives us a way to go from these joint distributions to just some individual probability that I might care about. And you’ll see a little bit later why it is that we care about that and why that’s actually useful to us as we begin doing some of these calculations. Last rule we’ll take a look at before transitioning to something a little bit different is this rule of conditioning, very similar to the marginalization rule. But it says that, again, if I have two events, a and b, but instead of having access to their joint probabilities, I have access to their conditional probabilities, how they relate to each other. Well, again, if I want to know the probability that a happens, and I know that there’s some other variable b, either b happens or b doesn’t happen, and so I can say that the probability of a is the probability of a given b times the probability of b, meaning b happened. And given that I know b happened, what’s the likelihood that a happened? And then I consider the other case, that b didn’t happen. So here’s the probability that b didn’t happen. 
And here’s the probability that a happens, given that I know that b didn’t happen. And this is really the equivalent rule just using conditional probability instead of joint probability, where I’m saying let’s look at both of these two cases and condition on b. Look at the case where b happens, and look at the case where b doesn’t happen, and look at what probabilities I get as a result. And just as in the case of marginalization, where there was an equivalent rule for random variables that could take on multiple possible values in a domain of possible values, here, too, conditioning has the same equivalent rule. Again, there’s a summation to mean I’m summing over all of the possible values that some random variable y could take on. But if I want to know what is the probability that x takes on this value, then I’m going to sum up over all the values j that y could take on, and say, all right, what’s the chance that y takes on that value yj? And multiply it by the conditional probability that x takes on this value, given that y took on that value yj. So equivalent rule just using conditional probabilities instead of joint probabilities. And using the equation we know about joint probabilities, we can translate between these two. So all right, we’ve seen a whole lot of mathematics, and we’ve just laid the foundation for mathematics. And no need to worry if you haven’t seen probability in too much detail up until this point. These are the foundations of the ideas that are going to come up as we begin to explore how we can now take these ideas from probability and begin to apply them to represent something inside of our computer, something inside of the AI agent we’re trying to design that is able to represent information and probabilities and the likelihoods between various different events. So there are a number of different probabilistic models that we can generate, but the first of the models we’re going to talk about are what are known as Bayesian networks. And a Bayesian network is just going to be some network of random variables, connected random variables that are going to represent the dependence between these random variables. The odds are most random variables in this world are not independent from each other, but there’s some relationship between things that are happening that we care about. If it is rainy today, that might increase the likelihood that my flight or my train gets delayed, for example. There are some dependence between these random variables, and a Bayesian network is going to be able to capture those dependencies. So what is a Bayesian network? What is its actual structure, and how does it work? Well, a Bayesian network is going to be a directed graph. And again, we’ve seen directed graphs before. They are individual nodes with arrows or edges that connect one node to another node pointing in a particular direction. And so this directed graph is going to have nodes as well, where each node in this directed graph is going to represent a random variable, something like the weather, or something like whether my train was on time or delayed. And we’re going to have an arrow from a node x to a node y to mean that x is a parent of y. So that’ll be our notation. If there’s an arrow from x to y, x is going to be considered a parent of y. And the reason that’s important is because each of these nodes is going to have a probability distribution that we’re going to store along with it, which is the distribution of x given some evidence, given the parents of x. 
So the way to more intuitively think about these per-node distributions is that the parents can be thought of as sort of causes for some effect that we’re going to observe. And so let’s take a look at an actual example of a Bayesian network and think about the types of logic that might be involved in reasoning about that network. Let’s imagine for a moment that I have an appointment out of town, and I need to take a train in order to get to that appointment. So what are the things I might care about? Well, I care about getting to my appointment on time, whether I make it to my appointment and I’m able to attend it, or I miss the appointment. And you might imagine that that’s influenced by the train, that the train is either on time or it’s delayed, for example. But that train itself is also influenced. Whether the train is on time or not depends maybe on the rain. Is there no rain? Is it light rain? Is there heavy rain? And it might also be influenced by other variables too. It might be influenced as well by whether or not there’s maintenance on the train track, for example. If there is maintenance on the train track, that probably increases the likelihood that my train is delayed. And so we can represent all of these ideas using a Bayesian network that looks a little something like this. Here I have four nodes representing four random variables that I would like to keep track of. I have one random variable called rain that can take on three possible values in its domain, either none or light or heavy, for no rain, light rain, or heavy rain. I have a variable called maintenance for whether or not there is maintenance on the train track, which has two possible values, just either yes or no. Either there is maintenance or there’s no maintenance happening on the track. Then I have a random variable for the train indicating whether or not the train was on time or not. That random variable has two possible values in its domain. The train is either on time or the train is delayed. And then finally, I have a random variable for whether I make it to my appointment. For my appointment down here, I have a random variable called appointment that itself has two possible values, attend and miss. And so here are the possible values. Here are my four nodes, each of which represents a random variable, each of which has a domain of possible values that it can take on. And the arrows, the edges pointing from one node to another, encode some notion of dependence inside of this graph, that whether I make it to my appointment or not is dependent upon whether the train is on time or delayed. And whether the train is on time or delayed is dependent on two things, given by the two arrows pointing at this node. It is dependent on whether or not there was maintenance on the train track. And it is also dependent upon whether or not it is raining. And just to make things a little complicated, let’s say as well that whether or not there is maintenance on the track, this too might be influenced by the rain. That if there’s heavier rain, well, maybe it’s less likely that there’s going to be maintenance on the train track that day, because they’re more likely to want to do maintenance on the track on days when it’s not raining, for example. And so these nodes might have different relationships between them. But the idea is that we can come up with a probability distribution for any of these nodes based only upon its parents. And so let’s look node by node at what this probability distribution might actually look like.
And we’ll go ahead and begin with this root node, this rain node here, which is at the top and has no arrows pointing into it, which means its probability distribution is not going to be a conditional distribution. It’s not based on anything. I just have some probability distribution over the possible values for the rain random variable. And that distribution might look a little something like this. None, light, and heavy each have a probability associated with them. Here I’m saying the likelihood of no rain is 0.7, of light rain is 0.2, of heavy rain is 0.1, for example. So here is a probability distribution for this root node in this Bayesian network. And let’s now consider the next node in the network, maintenance. Track maintenance is yes or no. And the general idea of what this distribution is going to encode, at least in this story, is the idea that the heavier the rain is, the less likely it is that there’s going to be maintenance on the track. Because the people that are doing maintenance on the track probably want to wait until a day when it’s not as rainy in order to do the track maintenance, for example. And so what might that probability distribution look like? Well, this now is going to be a conditional probability distribution, where here are the three possible values for the rain random variable, which I’m here just going to abbreviate to R, either no rain, light rain, or heavy rain. And for each of those possible values, either there is yes track maintenance or no track maintenance. And those have probabilities associated with them. I see here that if it is not raining, then there is a probability of 0.4 that there’s track maintenance and a probability of 0.6 that there isn’t. But if there’s heavy rain, then here the chance that there is track maintenance is 0.1 and the chance that there is not track maintenance is 0.9. Each of these rows is going to sum up to 1, because each of them represents a different value of whether or not it’s raining, the three possible values that that random variable can take on. And each is associated with its own probability distribution that is ultimately going to add up to the number 1. So that there is our distribution for this random variable called maintenance, about whether or not there is maintenance on the train track. And now let’s consider the next variable. Here we have a node inside of our Bayesian network called train that has two possible values, on time and delayed. And this node is going to be dependent upon the two nodes that are pointing towards it, that whether or not the train is on time or delayed depends on whether or not there is track maintenance. And it depends on whether or not there is rain, that heavier rain probably means more likely that my train is delayed. And if there is track maintenance, that also probably means it’s more likely that my train is delayed as well. And so you could construct a larger probability distribution, a conditional probability distribution, that instead of conditioning on just one variable, as was the case here, is now conditioning on two variables, conditioning both on rain, represented by r, and on maintenance, represented by m. Again, each of these rows has two values that sum up to the number 1, one for whether the train is on time, one for whether the train is delayed. And here I can say something like, all right, if I know there was light rain and track maintenance, well, OK, that would be r is light and m is yes.
Well, then there is a probability of 0.6 that my train is on time, and a probability of 0.4 that the train is delayed. And you can imagine gathering this data just by looking at real world data, looking at data about, all right, if I knew that it was light rain and there was track maintenance, how often was a train delayed or not delayed? And you could begin to construct this thing. The interesting thing is being able to intelligently figure out how you might go about ordering these things, what things might influence other nodes inside of this Bayesian network. And the last thing I care about is whether or not I make it to my appointment. So did I attend or miss the appointment? Ultimately, whether I attend or miss the appointment is influenced by track maintenance, because there’s indirectly this idea that, all right, if there is track maintenance, well, then my train might more likely be delayed. And if my train is more likely to be delayed, then I’m more likely to miss my appointment. But what we encode in this Bayesian network are just what we might consider to be more direct relationships. So the train has a direct influence on the appointment. And given that I know whether the train is on time or delayed, knowing whether there’s track maintenance isn’t going to give me any additional information that I didn’t already have. If I know the value of train, these other nodes that are up above aren’t really going to influence the result. And so here we might represent it using another conditional probability distribution that looks a little something like this. The train can take on two possible values. Either my train is on time or my train is delayed. And for each of those two possible values, I have a distribution for what are the odds that I’m able to attend the meeting and what are the odds that I miss the meeting. And obviously, if my train is on time, I’m much more likely to be able to attend the meeting than if my train is delayed, in which case I’m more likely to miss that meeting. So all of these nodes put all together here represent this Bayesian network, this network of random variables whose values I ultimately care about, and that have some sort of relationship between them, some sort of dependence where these arrows from one node to another indicate some dependence, that I can calculate the probability of some node given the parents that happen to exist there. So now that we’ve been able to describe the structure of this Bayesian network and the relationships between each of these nodes by associating each of the nodes in the network with a probability distribution, whether that’s an unconditional probability distribution in the case of this root node here, like rain, or a conditional probability distribution in the case of all of the other nodes whose probabilities are dependent upon the values of their parents, we can begin to do some computation and calculation using the information inside of those tables. So let’s imagine, for example, that I just wanted to compute something simple like the probability of light rain. How would I get the probability of light rain? Well, rain here is a root node. And so if I wanted to calculate that probability, I could just look at the probability distribution for rain and extract from it the probability of light rain, just a single value that I already have access to. But we could also imagine wanting to compute more complex joint probabilities, like the probability that there is light rain and also no track maintenance.
This is a joint probability of two values, light rain and no track maintenance. And the way I might do that is first by starting by saying, all right, well, let me get the probability of light rain. But now I also want the probability of no track maintenance. But of course, this node is dependent upon the value of rain. So what I really want is the probability of no track maintenance, given that I know that there was light rain. And so the expression for calculating this, the probability of light rain and no track maintenance, is really just the probability of light rain times the probability that there is no track maintenance, given that I know that there already is light rain. So I take the unconditional probability of light rain, multiply it by the conditional probability of no track maintenance, given that I know there is light rain. And you can continue to do this again and again for every variable that you want to add into this joint probability that I might want to calculate. If I wanted to know the probability of light rain and no track maintenance and a delayed train, well, that’s going to be the probability of light rain, multiplied by the probability of no track maintenance, given light rain, multiplied by the probability of a delayed train, given light rain and no track maintenance. Because whether the train is on time or delayed is dependent upon both of these other two variables, I have two pieces of evidence that go into the calculation of that conditional probability. And each of these three values is just a value that I can look up by looking at one of these individual probability distributions that is encoded into my Bayesian network. And if I wanted a joint probability over all four of the variables, something like the probability of light rain and no track maintenance and a delayed train and I miss my appointment, well, that’s going to be multiplying four different values, one from each of these individual nodes. It’s going to be the probability of light rain, then of no track maintenance given light rain, then of a delayed train, given light rain and no track maintenance. And then finally, for this node here, for whether I make it to my appointment or not, it’s not dependent upon those first two variables, given that I know whether or not the train is on time. I only need to care about the conditional probability that I miss my appointment, given that the train happens to be delayed. And so that’s represented here by four probabilities, each of which is located inside of one of these probability distributions for each of the nodes, all multiplied together. And so I can take a network like that and figure out what the joint probability is by multiplying a whole bunch of these individual probabilities from the Bayesian network. But of course, just as with last time, where what I really wanted to do was to be able to get new pieces of information, here, too, this is what we’re going to want to do with our Bayesian network. In the context of knowledge, we talked about the problem of inference. Given things that I know to be true, can I draw conclusions, make deductions about other facts about the world that I also know to be true? And what we’re going to do now is apply the same sort of idea to probability. Using information about which I have some knowledge, whether some evidence or some probabilities, can I figure out, not the values of other variables for certain, but the probabilities of other variables taking on particular values?
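To make that chain-rule computation concrete, here is a minimal sketch that multiplies CPT entries along the network. Only a handful of rows are quoted numerically in the text, so every entry marked assumed below is an illustrative placeholder rather than one of the lecture’s actual numbers.

```python
# Joint probability by the chain rule over the network:
# P(light, no, delayed, miss) =
#   P(light) * P(no | light) * P(delayed | light, no) * P(miss | delayed).
p_rain = {"none": 0.7, "light": 0.2, "heavy": 0.1}

p_maintenance = {  # P(maintenance | rain)
    "none":  {"yes": 0.4, "no": 0.6},
    "light": {"yes": 0.2, "no": 0.8},  # assumed
    "heavy": {"yes": 0.1, "no": 0.9},
}

p_train = {  # P(train | rain, maintenance)
    ("none", "yes"):  {"on time": 0.8, "delayed": 0.2},
    ("light", "yes"): {"on time": 0.6, "delayed": 0.4},
    ("light", "no"):  {"on time": 0.7, "delayed": 0.3},  # assumed
    # remaining parent combinations omitted; they aren't needed here
}

p_appointment = {  # P(appointment | train); values quoted later in the text
    "on time": {"attend": 0.9, "miss": 0.1},
    "delayed": {"attend": 0.6, "miss": 0.4},
}

# One probability per node, multiplied together.
p = (p_rain["light"]
     * p_maintenance["light"]["no"]
     * p_train[("light", "no")]["delayed"]
     * p_appointment["delayed"]["miss"])
print(p)
```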
And so here, we introduce the problem of inference in a probabilistic setting, in a case where variables might not necessarily be true for sure, but they might be random variables that take on different values with some probability. So how do we formally define what exactly this inference problem actually is? Well, the inference problem has a couple of parts to it. We have some query, some variable x that we want to compute the distribution for. Maybe I want the probability that I miss my train, or I want the probability that there is track maintenance, something that I want information about. And then I have some evidence variables. Maybe it’s just one piece of evidence. Maybe it’s multiple pieces of evidence. But I’ve observed certain variables for some sort of event. So for example, I might have observed that it is raining. This is evidence that I have. I know that there is light rain, or I know that there is heavy rain. And that is evidence I have. And using that evidence, I want to know what is the probability that my train is delayed, for example. And that is a query that I might want to ask based on this evidence. So I have a query, some variable, and evidence, which are some other variables that I have observed inside of my Bayesian network. And of course, that does leave some hidden variables, Y. These are variables that are not evidence variables and not query variables. So you might imagine, in the case where I know whether or not it’s raining, and I want to know whether my train is going to be delayed or not, the hidden variable, the thing I don’t have access to, is something like, is there maintenance on the track? Or am I going to make or not make my appointment, for example? These are variables that I don’t have access to. They’re hidden because they’re not things I observed, and they’re also not the query, the thing that I’m asking. And so ultimately, what we want to calculate is the probability distribution of x given e, the evidence that I observed. So given that I observed some event, I observed that it is raining, I would like to know what is the distribution over the possible values of the train random variable. Is it on time? Is it delayed? What’s the likelihood it’s going to be there? And it turns out we can do this calculation just using a lot of the probability rules that we’ve already seen in action. And ultimately, we’re going to take a look at the math at a little bit of a high level, at an abstract level. But we can allow computers and programming libraries that already exist to do some of this math for us, though it’s good to get a general sense for what’s actually happening when this inference process takes place. Let’s imagine, for example, that I want to compute the probability distribution of the appointment random variable given some evidence, given that I know that there was light rain and no track maintenance. So there’s my evidence, these two variables that I observe the values of. I observe the value of rain. I know there’s light rain. And I know that there is no track maintenance going on today. And what I care about knowing, my query, is this random variable appointment. I want to know the distribution of this random variable appointment, like what is the chance that I’m able to attend my appointment? What is the chance that I miss my appointment, given this evidence? And the hidden variable, the information that I don’t have access to, is this variable train.
This is information that is not part of the evidence that I see, not something that I observe. But it is also not the query that I’m asking for. And so what might this inference procedure look like? Well, if you recall back from when we were defining conditional probability and doing math with conditional probabilities, we know that a conditional probability is proportional to the joint probability. And we remembered this by recalling that the probability of A given B is just some constant factor alpha multiplied by the probability of A and B. That constant factor alpha turns out to be 1 divided by the probability of B. But the important thing is that it’s just some constant multiplied by the joint distribution, the probability that all of these individual things happen. So in this case, I can take the probability of the appointment random variable given light rain and no track maintenance and say that is just going to be proportional, some constant alpha, multiplied by the joint probability, the probability of a particular value for the appointment random variable and light rain and no track maintenance. Well, all right, how do I calculate this probability of appointment and light rain and no track maintenance, when I really need all four of these values to be able to calculate a joint distribution across everything, because the value of appointment depends upon the value of train? Well, in order to do that, here I can begin to use that marginalization trick, that there are only two ways I can get any configuration of an appointment, light rain, and no track maintenance. Either this particular setting of variables happens and the train is on time, or this particular setting of variables happens and the train is delayed. Those are the two possible cases that I would want to consider. And if I add those two cases up, well, then I get the result just by adding up all of the possibilities for the hidden variable, or variables, if there are multiple. But since there’s only one hidden variable here, train, all I need to do is iterate over all the possible values for that hidden variable train and add up their probabilities. So this probability expression here becomes the joint probability of the appointment value, light rain, no track maintenance, and the train being on time, plus the joint probability of the appointment value, light rain, no track maintenance, and the train being delayed, for example. So I take both of the possible values for train and go ahead and add them up. These are just joint probabilities that we saw earlier how to calculate, just by going parent by parent and calculating those probabilities and multiplying them together. And then you’ll need to normalize them at the end, speaking at a high level, to make sure that everything adds up to the number 1. So the formula for how you do this, in a process known as inference by enumeration, looks a little bit complicated, but ultimately it looks like this. And let’s now try to distill what it is that all of these symbols actually mean. Let’s start here. What I care about knowing is the probability of x, my query variable, given some sort of evidence. What do I know about conditional probabilities? Well, a conditional probability is proportional to the joint probability. So it is some alpha, some normalizing constant, multiplied by this joint probability of x and evidence. And how do I calculate that?
Well, to do that, I’m going to marginalize over all of the hidden variables, all the variables that I don’t directly observe the values for. I’m basically going to iterate over all of the possibilities that could happen and just sum them all up. And so I can translate this into a sum over all y, which ranges over all the possible hidden variables and the values that they could take on, and adds up all of those possible individual probabilities. And that is going to allow me to do this process of inference by enumeration. Now, ultimately, it’s pretty annoying if we as humans have to do all this math for ourselves. But it turns out this is where computers and AI can be particularly helpful, that we can program a computer to understand a Bayesian network, to be able to understand these inference procedures, and to be able to do these calculations. And using the information you’ve seen here, you could implement a Bayesian network from scratch yourself. But it turns out there are a lot of libraries, especially written in Python, that make it easier for us to do this sort of probabilistic inference, to be able to take a Bayesian network and do these sorts of calculations, so that you don’t need to know and understand all of the underlying math, though it’s helpful to have a general sense for how it works. You just need to be able to describe the structure of the network and make queries in order to be able to produce the result. And so let’s take a look at an example of that right now. It turns out that there are a lot of possible libraries that exist in Python for doing this sort of inference. It doesn’t matter too much which specific library you use. They all behave in fairly similar ways. But the library I’m going to use here is one known as pomegranate. And here inside of model.py, I have defined a Bayesian network, just using the structure and the syntax that the pomegranate library expects. And what I’m effectively doing is just, in Python, creating nodes to represent each of the nodes of the Bayesian network that you saw me describe a moment ago. So here on line four, after I’ve imported pomegranate, I’m defining a variable called rain that is going to represent a node inside of my Bayesian network. It’s going to be a node that follows this distribution, where there are three possible values, none for no rain, light for light rain, heavy for heavy rain. And these are the probabilities of each of those taking place: 0.7 is the likelihood of no rain, 0.2 for light rain, 0.1 for heavy rain. Then after that, we go to the next variable, the variable for track maintenance, for example, which is dependent upon that rain variable. And this, instead of being an unconditional distribution, is a conditional distribution, as indicated by a conditional probability table here. And the idea is that this is conditional on the distribution of rain. So if there is no rain, then the chance that there is, yes, track maintenance is 0.4. If there’s no rain, the chance that there is no track maintenance is 0.6. Likewise, for light rain, I have a distribution. For heavy rain, I have a distribution as well. I’m effectively encoding the same information you saw represented graphically a moment ago, but I’m telling this Python program that the maintenance node obeys this particular conditional probability distribution. And we do the same thing for the other random variables as well. Train was a node inside my network whose distribution was a conditional probability table with two parents.
It was dependent not only on rain, but also on track maintenance. And so here I’m saying something like, given that there is no rain and, yes, track maintenance, the probability that my train is on time is 0.8, and the probability that it’s delayed is 0.2. And likewise, I can do the same thing for all of the other possible values of the parents of the train node inside of my Bayesian network by saying, for all of those possible values, here is the distribution that the train node should follow. Then I do the same thing for appointment, based on the distribution of the variable train. Then at the end, what I do is actually construct this network by describing what the states of the network are and by adding edges between the dependent nodes. So I create a new Bayesian network, add states to it, one for rain, one for maintenance, one for the train, one for the appointment, and then I add edges connecting the related pieces. Rain has an arrow to maintenance because rain influences track maintenance. Rain also influences the train. Maintenance also influences the train. And train influences whether I make it to my appointment. And then bake just finalizes the model and does some additional computation. The specific syntax of this is not really the important part. Pomegranate just happens to be one of several different libraries that can all be used for similar purposes. And you could describe and define a library for yourself that implemented similar things. But the key idea here is that someone can design a library for a general Bayesian network that has nodes that are based upon their parents. And then all a programmer needs to do, using one of those libraries, is to define what those nodes and what those probability distributions are, and we can begin to do some interesting logic based on it. So let’s try doing that conditional or joint probability calculation that we did by hand before by going into likelihood.py, where here I’m importing the model that I just defined a moment ago. And here I’d just like to calculate model.probability, which calculates the probability for a given observation. And I’d like to calculate the probability of no rain, no track maintenance, my train is on time, and I’m able to attend the meeting. So it’s sort of the optimal scenario: there is no rain and no maintenance on the track, my train is on time, and I’m able to attend the meeting. What is the probability that all of that actually happens? And I can calculate that using the library and just print out its probability. And so I’ll go ahead and run python likelihood.py. And I see that, OK, the probability is about 0.34. So about a third of the time, everything goes right for me in this case. No rain, no track maintenance, train is on time, and I’m able to attend the meeting. But I could experiment with this, try and calculate other probabilities as well. What’s the probability that everything goes right up until the train, but I still miss my meeting? So no rain, no track maintenance, train is on time, but I miss the appointment. Let’s calculate that probability. And all right, that has a probability of about 0.04. So about 4% of the time, the train will be on time, there won’t be any rain, no track maintenance, and yet I’ll still miss the meeting. And so this is really just an implementation of the calculation of the joint probabilities that we did before.
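Putting those pieces together, a model definition and likelihood query in the spirit of model.py and likelihood.py might look like the sketch below. It follows the older pomegranate (v0.x) API being described here (Node, DiscreteDistribution, ConditionalProbabilityTable, BayesianNetwork); the CPT rows not quoted in the text are illustrative assumptions, chosen so that the first query reproduces the ~0.34 mentioned above.

```python
# A sketch of model.py / likelihood.py as described above, using the
# older pomegranate (v0.x) API. CPT rows marked "assumed" are not
# quoted in the text and are illustrative.
from pomegranate import (BayesianNetwork, ConditionalProbabilityTable,
                         DiscreteDistribution, Node)

# Rain is a root node: an unconditional distribution.
rain = Node(DiscreteDistribution({
    "none": 0.7, "light": 0.2, "heavy": 0.1
}), name="rain")

# Maintenance is conditional on rain.
maintenance = Node(ConditionalProbabilityTable([
    ["none", "yes", 0.4], ["none", "no", 0.6],
    ["light", "yes", 0.2], ["light", "no", 0.8],  # assumed
    ["heavy", "yes", 0.1], ["heavy", "no", 0.9],
], [rain.distribution]), name="maintenance")

# Train is conditional on both rain and maintenance.
train = Node(ConditionalProbabilityTable([
    ["none", "yes", "on time", 0.8], ["none", "yes", "delayed", 0.2],
    ["none", "no", "on time", 0.9], ["none", "no", "delayed", 0.1],      # assumed
    ["light", "yes", "on time", 0.6], ["light", "yes", "delayed", 0.4],
    ["light", "no", "on time", 0.7], ["light", "no", "delayed", 0.3],    # assumed
    ["heavy", "yes", "on time", 0.4], ["heavy", "yes", "delayed", 0.6],  # assumed
    ["heavy", "no", "on time", 0.5], ["heavy", "no", "delayed", 0.5],    # assumed
], [rain.distribution, maintenance.distribution]), name="train")

# Appointment is conditional on train alone.
appointment = Node(ConditionalProbabilityTable([
    ["on time", "attend", 0.9], ["on time", "miss", 0.1],
    ["delayed", "attend", 0.6], ["delayed", "miss", 0.4],
], [train.distribution]), name="appointment")

# Assemble the network: add states, add edges, then bake to finalize.
model = BayesianNetwork()
model.add_states(rain, maintenance, train, appointment)
model.add_edge(rain, maintenance)
model.add_edge(rain, train)
model.add_edge(maintenance, train)
model.add_edge(train, appointment)
model.bake()

# The likelihood.py-style query: probability of one full observation.
print(model.probability([["none", "no", "on time", "attend"]]))
```

With these numbers, the query works out to 0.7 × 0.6 × 0.9 × 0.9 ≈ 0.34, and swapping attend for miss gives 0.7 × 0.6 × 0.9 × 0.1 ≈ 0.04, matching both results quoted above.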
What this library is likely doing is first figuring out the probability of no rain, then figuring out the probability of no track maintenance given no rain, then the probability that my train is on time given both of these values, and then the probability that I miss my appointment given that I know that the train was on time. So this, again, is the calculation of that joint probability. And it turns out we can also begin to have our computer solve inference problems as well, to begin to infer, based on information, evidence that we see, what is the likelihood of other variables also being true. So let’s go into inference.py, for example. Where here, I’m again importing that exact same model from before, importing all the nodes and all the edges and the probability distributions that are encoded there as well. And now there’s a function for doing some sort of prediction. And here, into this model, I pass in the evidence that I observe. So here, I’ve encoded into this Python program the evidence that I have observed. I have observed the fact that the train is delayed. And that is the value for one of the four random variables inside of this Bayesian network. And using that information, I would like to be able to draw inferences and figure out the values of the other random variables that are inside of my Bayesian network. I would like to make predictions about everything else. So all of the actual computational logic is happening in just these three lines, where I’m making this call to this prediction. Down below, I’m just iterating over all of the states and all the predictions and just printing them out so that we can visually see what the results are. But let’s find out, given that the train is delayed, what can I predict about the values of the other random variables? Let’s go ahead and run python inference.py. I run that, and all right, here is the result that I get. Given the fact that I know that the train is delayed, this is evidence that I have observed, there is about a 46% chance that there was no rain, a 31% chance there was light rain, and a 23% chance there was heavy rain. And I can see a probability distribution over track maintenance and a probability distribution over whether I’m able to attend or miss my appointment. Now, we know that whether I attend or miss the appointment, that is only dependent upon the train being delayed or not delayed. It shouldn’t depend on anything else. So let’s imagine, for example, that I knew that there was heavy rain. That shouldn’t affect the distribution for making the appointment. And indeed, if I go up here and add some evidence, say that I know that the value of rain is heavy, that is evidence that I now have access to. I now have two pieces of evidence. I know that the rain is heavy, and I know that my train is delayed. I can calculate the probability by running this inference procedure again and seeing the result. I know that the rain is heavy. I know my train is delayed. The probability distribution for track maintenance changed. Given that I know that there’s heavy rain, now it’s more likely that there is no track maintenance, 88%, as opposed to 64% from before. And now, what is the probability that I make the appointment? Well, that’s the same as before. It’s still going to be attend the appointment with probability 0.6, miss the appointment with probability 0.4, because it was only dependent upon whether or not my train was on time or delayed.
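A prediction call like this can be understood as plain inference by enumeration. Here is a self-contained, hand-rolled sketch computing P(Appointment | train = delayed) for this network; as before, CPT entries not quoted in the text are illustrative assumptions.

```python
# Inference by enumeration: P(Appointment | train = delayed) is
# proportional to the sum over the hidden variables (rain, maintenance)
# of the joint probability. Entries marked "assumed" are illustrative.
P_RAIN = {"none": 0.7, "light": 0.2, "heavy": 0.1}
P_MAINT = {("none", "yes"): 0.4, ("none", "no"): 0.6,
           ("light", "yes"): 0.2, ("light", "no"): 0.8,   # assumed
           ("heavy", "yes"): 0.1, ("heavy", "no"): 0.9}
P_TRAIN = {("none", "yes", "on time"): 0.8, ("none", "yes", "delayed"): 0.2,
           ("none", "no", "on time"): 0.9, ("none", "no", "delayed"): 0.1,      # assumed
           ("light", "yes", "on time"): 0.6, ("light", "yes", "delayed"): 0.4,
           ("light", "no", "on time"): 0.7, ("light", "no", "delayed"): 0.3,    # assumed
           ("heavy", "yes", "on time"): 0.4, ("heavy", "yes", "delayed"): 0.6,  # assumed
           ("heavy", "no", "on time"): 0.5, ("heavy", "no", "delayed"): 0.5}    # assumed
P_APPT = {("on time", "attend"): 0.9, ("on time", "miss"): 0.1,
          ("delayed", "attend"): 0.6, ("delayed", "miss"): 0.4}

def joint(rain, maint, train, appt):
    """Chain rule: one probability per node, multiplied together."""
    return (P_RAIN[rain] * P_MAINT[(rain, maint)]
            * P_TRAIN[(rain, maint, train)] * P_APPT[(train, appt)])

# Enumerate the hidden variables with the evidence (train) fixed.
unnormalized = {}
for appt in ["attend", "miss"]:
    unnormalized[appt] = sum(joint(r, m, "delayed", appt)
                             for r in P_RAIN
                             for m in ["yes", "no"])

alpha = 1 / sum(unnormalized.values())  # the normalizing constant
distribution = {k: alpha * v for k, v in unnormalized.items()}
print(distribution)  # attend: 0.6, miss: 0.4, matching the result above
```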
And so this here is implementing that idea of that inference algorithm, to be able to figure out, based on the evidence that I have, what can we infer about the values of the other variables that exist as well. So inference by enumeration is one way of doing this inference procedure, just looping over all of the values the hidden variables could take on and figuring out what the probability is. Now, it turns out this is not particularly efficient. There are definitely optimizations you can make by avoiding repeated work: if you’re calculating the same sort of probability multiple times, there are ways of optimizing the program to avoid having to recalculate the same probabilities again and again. But even then, as the number of variables gets large, and as the number of possible values those variables could take on gets large, we’re going to start to have to do a lot of computation, a lot of calculation, to be able to do this inference. And at that point, it might start to get unreasonable in terms of the amount of time that it would take to be able to do this sort of exact inference. And it’s for that reason that oftentimes, when it comes to probability and things we’re not entirely sure about, we don’t always care about doing exact inference and knowing exactly what the probability is. If we can approximate the inference procedure, do some sort of approximate inference, that can be pretty good as well. If I don’t know the exact probability, but I have a general sense for the probability, one that I can make increasingly accurate with more time, that’s probably pretty good, especially if I can get it to happen even faster. So how could I do approximate inference inside of a Bayesian network? Well, one method is through a procedure known as sampling. In the process of sampling, I’m going to take a sample of all of the variables inside of this Bayesian network here. And how am I going to sample? Well, I’m going to sample one of the values from each of these nodes according to their probability distribution. So how might I take a sample of all these nodes? Well, I’ll start at the root. I’ll start with rain. Here’s the distribution for rain. And I’ll go ahead and, using a random number generator or something like it, randomly pick one of these three values. I’ll pick none with probability 0.7, light with probability 0.2, and heavy with probability 0.1. So I’ll randomly just pick one of them according to that distribution. And maybe in this case, I pick none, for example. Then I do the same thing for the other variable. Maintenance also has a probability distribution, and I’m going to sample. Now, there are three probability distributions here, one row for each value of rain. But I’m only going to sample from this first row here, because I’ve observed already in my sample that the value of rain is none. So given that rain is none, I’m going to sample from this distribution to say, all right, what should the value of maintenance be? And in this case, maintenance is going to be, let’s just say, yes, which happens 40% of the time in the event that there is no rain, for example. And we’ll sample all of the rest of the nodes in this way as well. I want to sample from the train distribution, and I’ll sample from this first row here, where there is no rain, but there is track maintenance. And 80% of the time, I’ll say the train is on time; 20% of the time, I’ll say the train is delayed. And finally, we’ll do the same thing for whether I make it to my appointment or not.
Did I attend or miss the appointment? We’ll sample based on this distribution and maybe say that in this case, I attend the appointment, which happens 90% of the time when the train is actually on time. So by going through these nodes, I can very quickly just do some sampling and get a sample of the possible values that could come up from going through this entire Bayesian network according to those probability distributions. And where this becomes powerful is if I do this not once, but thousands or tens of thousands of times, and generate a whole bunch of samples all using this distribution. I get different samples. Maybe some of them are the same. But I get a value for each of the possible variables that could come up. And so then if I’m ever faced with a question, a question like, what is the probability that the train is on time, you could do an exact inference procedure. This is no different than the inference problem we had before, where I could just marginalize, look at all the possible other values of the variables, and do the computation of inference by enumeration to find out this probability exactly. But I could also, if I don’t care about the exact probability, just sample it, approximate it to get close. And this is a powerful tool in AI, where we don’t need to be right 100% of the time or exactly right. If we just need to be right with some probability, we can often do so more effectively, more efficiently. And so if here now are all of those possible samples, I’ll highlight the ones where the train is on time, ignoring the ones where the train is delayed. And in this case, six out of the eight samples have the train arriving on time. And so maybe in this case, I can say that six out of eight, that’s my estimate of the likelihood that the train is on time. And with eight samples, that might not be a great prediction. But if I had thousands upon thousands of samples, then this could be a much better inference procedure to be able to do these sorts of calculations. So this is a direct sampling method, to just do a bunch of samples and then figure out what the probability of some event is. Now, this from before was an unconditional probability. What is the probability that the train is on time? And I did that by looking at all the samples and figuring out, right, here are the ones where the train is on time. But sometimes what I want to calculate is not an unconditional probability, but rather a conditional probability, something like what is the probability that there is light rain, given that the train is on time, something to that effect. And to do that kind of calculation, well, here are all the samples that I have, and I want to calculate a probability distribution, given that I know that the train is on time. So to be able to do that, I can look at the two cases where the train was delayed and ignore or reject them, sort of exclude them from the possible samples that I’m considering. And now I want to look at these remaining cases where the train is on time. Here are the cases where there is light rain. And I say, OK, these are two out of the six possible cases. That can give me an approximation for the probability of light rain, given the fact that I know the train was on time.
And I do that in almost exactly the same way, just by adding an additional step, by saying that, all right, when I take each sample, let me reject all of the samples that don’t match my evidence and only consider the samples that do match what it is that I have in my evidence that I want to make some sort of calculation about. And it turns out, using the libraries that we’ve had for Bayesian networks, we can begin to implement this same sort of idea, to implement rejection sampling, which is what this method is called, to be able to figure out some probability, not via direct inference, but instead by sampling. So what I have here is a program called sample.py. It imports the exact same model. And what I define first is a function to generate a sample. And the way I generate a sample is just by looping over all of the states. The states need to be in some sort of order to make sure I’m looping in the correct order. But effectively, if it is a conditional distribution, I’m going to sample based on the parents. And otherwise, I’m just going to directly sample the variable, like rain, which has no parents; it’s just an unconditional distribution. And I keep track of all those parent samples and return the final sample. The exact syntax of this, again, is not particularly important. It just happens to be part of the implementation details of this particular library. The interesting logic is down below. Now that I have the ability to generate a sample, if I want to know the distribution of the appointment random variable, given that the train is delayed, well, then I can begin to do calculations like this. Let me take 10,000 samples and assemble all my results in this list called data. I’ll go ahead and loop n times, in this case, 10,000 times. I’ll generate a sample. And I want to know the distribution of appointment, given that the train is delayed. So according to rejection sampling, I’m only going to consider samples where the train is delayed. If the train is not delayed, I’m not going to consider those values at all. So I’m going to say, all right, if I take the sample and look at the value of the train random variable, if the train is delayed, well, let me go ahead and add to my data that I’m collecting the value of the appointment random variable that it took on in this particular sample. So I’m only considering the samples where the train is delayed, and for each of those samples, considering what the value of appointment is. And then at the end, I’m using a Python class called Counter, which quickly counts up all the values inside of a data set. So I can take this list of data and figure out how many times my appointment was made and how many times my appointment was missed. And so this here, with just a couple lines of code, is an implementation of rejection sampling. And I can run it by going ahead and running python sample.py. And when I do that, here is the result I get. This is the result of the counter: 1,251 times, I was able to attend the meeting, and 856 times, I missed the meeting. And you can imagine, by doing more and more samples, I’ll be able to get a better and better, more accurate result. And this is a randomized process. It’s going to be an approximation of the probability. If I run it a different time, you’ll notice the numbers are similar, 1,272 and 905, but they’re not identical, because there’s some randomization, some likelihood that things might be higher or lower.
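The same rejection-sampling logic can be written without the library. This is a minimal sketch in the spirit of sample.py, not its actual code: the pick and generate_sample helpers are hypothetical names of my own, and the CPT rows not quoted in the text are the same illustrative assumptions used earlier.

```python
import random
from collections import Counter

def pick(dist):
    """Sample one key from a {value: probability} dictionary."""
    values = list(dist)
    return random.choices(values, weights=[dist[v] for v in values])[0]

def generate_sample():
    """Forward-sample the network top-down, parents before children."""
    rain = pick({"none": 0.7, "light": 0.2, "heavy": 0.1})
    maint = pick({"none": {"yes": 0.4, "no": 0.6},
                  "light": {"yes": 0.2, "no": 0.8},   # assumed
                  "heavy": {"yes": 0.1, "no": 0.9}}[rain])
    train = pick({("none", "yes"): {"on time": 0.8, "delayed": 0.2},
                  ("none", "no"): {"on time": 0.9, "delayed": 0.1},    # assumed
                  ("light", "yes"): {"on time": 0.6, "delayed": 0.4},
                  ("light", "no"): {"on time": 0.7, "delayed": 0.3},   # assumed
                  ("heavy", "yes"): {"on time": 0.4, "delayed": 0.6},  # assumed
                  ("heavy", "no"): {"on time": 0.5, "delayed": 0.5}}   # assumed
                 [(rain, maint)])
    appt = pick({"on time": {"attend": 0.9, "miss": 0.1},
                 "delayed": {"attend": 0.6, "miss": 0.4}}[train])
    return {"rain": rain, "maintenance": maint,
            "train": train, "appointment": appt}

data = []
for _ in range(10000):
    sample = generate_sample()
    if sample["train"] == "delayed":  # the rejection step
        data.append(sample["appointment"])

print(Counter(data))  # roughly 60% attend, 40% miss
```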
And this randomness is why we generally want to try and use more samples, so that we can have a greater amount of confidence in our result, be more sure that the result we’re getting accurately reflects or represents the actual underlying probabilities that are inherent inside of this distribution. And so this, then, was an instance of rejection sampling. And it turns out there are a number of other sampling methods that you could use. One problem that rejection sampling has is that if the evidence you’re looking for is a fairly unlikely event, well, you’re going to be rejecting a lot of samples. If I’m looking for the probability of x given some evidence e, and e is very unlikely to occur, like it occurs maybe once every 1,000 times, then I’m only going to be considering 1 out of every 1,000 samples that I do, which is a pretty inefficient method for trying to do this sort of calculation. I’m throwing away a lot of samples, and it takes computational effort to be able to generate those samples. So I’d like to not have to do something like that. And there are other sampling methods that can try and address this. One such sampling method is called likelihood weighting. In likelihood weighting, we follow a slightly different procedure, and the goal is to avoid needing to throw out samples that didn’t match the evidence. And so what we’ll do is we’ll start by fixing the values for the evidence variables. Rather than sample everything, we’re going to fix the values of the evidence variables and not sample those. Then we’re going to sample all the other non-evidence variables in the same way, just using the Bayesian network, looking at the probability distributions, sampling all the non-evidence variables. But then what we need to do is weight each sample by its likelihood. If our evidence is really unlikely, we want to make sure that we’ve taken into account how likely the evidence was to actually show up in the sample. If I have a sample where the evidence was much more likely to show up than in another sample, then I want to weight the more likely one higher. So we’re going to weight each sample by its likelihood, where likelihood is just defined as the probability of all the evidence: given the rest of the sample, what is the probability that this evidence would happen? So before, all of our samples were weighted equally. They all had a weight of 1 when we were calculating the overall average. In this case, we’re going to weight each sample, multiply each sample by its likelihood, in order to get the more accurate distribution. So what would this look like? Well, if I ask the same question, what is the probability of light rain, given that the train is on time, when I do the sampling procedure and start by trying to sample, I’m going to start by fixing the evidence variable. I’m already going to have in my sample that the train is on time. That way, I don’t have to throw out anything. I’m only sampling things where I know the values of the variables that are my evidence are what I expect them to be. So I’ll go ahead and sample from rain. And maybe this time, I sample light rain instead of no rain. Then I’ll sample from track maintenance and say, maybe, yes, there’s track maintenance. Then for train, well, I’ve already fixed it in place. Train was an evidence variable, so I’m not going to bother sampling again. I’ll just go ahead and move on. I’ll move on to appointment and go ahead and sample from appointment as well.
So now I’ve generated a sample. I’ve generated a sample by fixing this evidence variable and sampling the other three. And the last step is now weighting the sample. How much weight should it have? And the weight is based on how probable it is that the train was actually on time, that this evidence actually happened, given the values of these other variables, light rain and the fact that, yes, there was track maintenance. Well, to do that, I can just go back to the train variable and say, all right, if there was light rain and track maintenance, the likelihood of my evidence, the likelihood that my train was on time, is 0.6. And so this particular sample would have a weight of 0.6. And I could repeat the sampling procedure again and again. Each time, every sample would be given a weight according to the probability of the evidence that I see associated with it, as in the sketch below. And there are other sampling methods that exist as well, but all of them are designed to try and get at the same idea, to approximate the inference procedure of figuring out the value of a variable. So we’ve now dealt with probability as it pertains to particular variables that have these discrete values. But what we haven’t really considered is how values might change over time. We’ve considered something like a variable for rain, where rain can take on values of none or light rain or heavy rain. But in practice, usually when we consider values for variables like rain, we’d like to consider them over time: how do the values of these variables change? What do we do when we’re dealing with uncertainty over a period of time? This can come up in the context of weather, for example, if I have sunny days and I have rainy days, and I’d like to know not just what is the probability that it’s raining now, but what is the probability that it rains tomorrow, or the day after that, or the day after that. And so to do this, we’re going to introduce a slightly different kind of model, where we’re going to have a random variable, not just one for the weather, but one for every possible time step. And you can define a time step however you like. A simple way is just to use days as your time step. And so we can define a variable called x sub t, which is going to be the weather at time t. So x sub 0 might be the weather on day 0. x sub 1 might be the weather on day 1, so on and so forth. x sub 2 is the weather on day 2. But as you can imagine, if we start to do this over longer and longer periods of time, there’s an incredible amount of data that might go into this. If you’re keeping track of data about the weather for a year, now suddenly you might be trying to predict the weather tomorrow, given 365 days of previous pieces of evidence. And that’s a lot of evidence to have to deal with and manipulate and calculate. Probably nobody knows what the exact conditional probability distribution is for all of those combinations of variables. And so when we’re trying to do this inference inside of a computer, when we’re trying to reasonably do this sort of analysis, it’s helpful to make some simplifying assumptions, some assumptions about the problem that we can just assume are true, to make our lives a little bit easier. Even if they’re not totally accurate assumptions, if they’re close to accurate or approximate, they’re usually pretty good. And the assumption we’re going to make is called the Markov assumption, which is the assumption that the current state depends only on a finite fixed number of previous states.
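Here is that sketch: a minimal, hand-rolled version of the likelihood-weighting procedure just described, estimating P(Rain | train = on time). The evidence is fixed rather than sampled, and each sample contributes its weight, the probability of the evidence given its sampled parents. The helper names are my own, and CPT rows not quoted in the text are illustrative assumptions.

```python
import random
from collections import defaultdict

# P(train = on time | rain, maintenance); rows marked assumed are
# illustrative, the (light, yes) value of 0.6 is from the text.
P_ON_TIME = {("none", "yes"): 0.8, ("none", "no"): 0.9,    # "no" row assumed
             ("light", "yes"): 0.6, ("light", "no"): 0.7,  # "no" row assumed
             ("heavy", "yes"): 0.4, ("heavy", "no"): 0.5}  # both assumed

def pick(dist):
    values = list(dist)
    return random.choices(values, weights=[dist[v] for v in values])[0]

weights = defaultdict(float)
for _ in range(10000):
    # Sample the non-evidence ancestors as usual.
    rain = pick({"none": 0.7, "light": 0.2, "heavy": 0.1})
    maint = pick({"none": {"yes": 0.4, "no": 0.6},
                  "light": {"yes": 0.2, "no": 0.8},   # assumed
                  "heavy": {"yes": 0.1, "no": 0.9}}[rain])
    # Train is evidence: fix it, and weight by its likelihood.
    # (Appointment is neither evidence nor the query, so sampling it
    # would not change this estimate; it is omitted here.)
    weights[rain] += P_ON_TIME[(rain, maint)]

total = sum(weights.values())
print({rain: w / total for rain, w in weights.items()})  # approx P(rain | on time)
```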
So under the Markov assumption, the current day’s weather depends not on all the previous days’ weather across all of history; the current day’s weather I can predict just based on yesterday’s weather, or just based on the last two days’ weather, or the last three days’ weather. But oftentimes, we’re going to deal with just the one previous state that helps to predict this current state. And by putting a whole bunch of these random variables together, using this Markov assumption, we can create what’s called a Markov chain, where a Markov chain is just some sequence of random variables where each of the variables’ distributions follows that Markov assumption. And so we’ll do an example of this where the Markov assumption is that I can predict the weather, whether it is sunny or rainy. And we’ll just consider those two possibilities for now, even though there are other types of weather. I can predict each day’s weather just from the prior day’s weather: using today’s weather, I can come up with a probability distribution for tomorrow’s weather. And here’s what this transition model might look like. It’s formatted in terms of a matrix, as you might describe it, as rows and columns of values, where on the left-hand side, I have today’s weather, represented by the variable x sub t, and over here in the columns, I have tomorrow’s weather, represented by the variable x sub t plus 1. And what this matrix is saying is, if today is sunny, well, then it’s more likely than not that tomorrow is also sunny. Oftentimes, the weather stays consistent for multiple days in a row. And for example, let’s say that if today is sunny, our model says that tomorrow, with probability 0.8, it will also be sunny, and with probability 0.2, it will be raining. And likewise, if today is raining, then it’s more likely than not that tomorrow is also raining. With probability 0.7, it’ll be raining. With probability 0.3, it will be sunny. So this matrix, this description of how it is we transition from one state to the next state, is what we’re going to call the transition model. And using the transition model, you can begin to construct this Markov chain by just predicting, given today’s weather, what’s the likelihood of tomorrow’s weather happening. And you can imagine doing a similar sampling procedure, where you take this information, you sample what tomorrow’s weather is going to be, and using that, you sample the next day’s weather. And the result of that is you can form this Markov chain of x0, x1, x2, and so on: day zero is sunny, the next day is sunny, maybe the next day it changes to raining, then raining, then raining. And the pattern that this Markov chain follows, given the distribution that we had access to, this transition model here, is that when it’s sunny, it tends to stay sunny for a little while. The next couple of days tend to be sunny too. And when it’s raining, it tends to be raining as well. And so you get a Markov chain that looks like this, and you can do analysis on this. You can say, given that today is raining, what is the probability that tomorrow is raining? Or you can begin to ask probability questions like, what is the probability of this sequence of five values, sun, sun, rain, rain, rain, and answer those sorts of questions too. And it turns out there are, again, many Python libraries for interacting with models like this, models of probabilities that have distributions and random variables that are based on previous variables according to this Markov assumption.
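Before looking at the library version, here is a minimal hand-rolled sketch of sampling from this Markov chain, using the transition model just described and the 50-50 starting distribution that the model file below also uses.

```python
import random

# The transition model from the text:
# sun -> sun 0.8, sun -> rain 0.2; rain -> rain 0.7, rain -> sun 0.3.
TRANSITIONS = {
    "sun":  {"sun": 0.8, "rain": 0.2},
    "rain": {"sun": 0.3, "rain": 0.7},
}

def sample_chain(n_days):
    """Sample a sequence of states, each depending only on the last."""
    state = random.choice(["sun", "rain"])  # 50-50 starting distribution
    chain = [state]
    for _ in range(n_days - 1):
        nxt = TRANSITIONS[state]
        state = random.choices(list(nxt), weights=list(nxt.values()))[0]
        chain.append(state)
    return chain

print(sample_chain(50))  # e.g. runs of sunny days, then runs of rainy days
```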
And pomegranate, too, has ways of dealing with these sorts of variables. So I’ll go ahead and go into the chain directory, where I have some information about Markov chains. And here, I’ve defined a file called model.py, where I’ve defined, in a very similar syntax, this Markov chain. And again, the exact syntax doesn’t matter so much as the idea that I’m encoding this information into a Python program so that the program has access to these distributions. I’ve here defined some starting distribution. Every Markov model begins at some point in time, and I need to give it some starting distribution. And so we’ll just say that at the start, it’s 50-50 between sunny and rainy: sunny 50% of the time, rainy 50% of the time. And then down below, I’ve defined the transition model, how it is that I transition from one day to the next. And here, I’ve encoded that exact same matrix from before, that if it was sunny today, then with probability 0.8, it will be sunny tomorrow, and it’ll be rainy tomorrow with probability 0.2. And I likewise have another distribution for if it was raining today instead. And so that alone defines the Markov model. You can begin to answer questions using that model. But one thing I’ll just do is sample from the Markov chain. It turns out there is a method built into this Markov chain library that allows me to sample 50 states from the chain, basically just simulating 50 instances of weather. And so let me go ahead and run this: python model.py. And when I run it, what I get is that it’s going to sample from this Markov chain 50 states, 50 days’ worth of weather that it’s just going to randomly sample. And you can imagine sampling many times to be able to get more data, to be able to do more analysis. But here, for example, it’s sunny two days in a row, rainy a whole bunch of days in a row before it changes back to sun. And so you get this model that follows the distribution that we originally described, the distribution where sunny days tend to lead to more sunny days and rainy days tend to lead to more rainy days. And that then is a Markov model. And Markov models rely on us knowing the values of these individual states. I know that today is sunny or that today is raining, and using that information, I can draw some sort of inference about what tomorrow is going to be like. But in practice, this often isn’t the case. It often isn’t the case that I know for certain what the exact state of the world is. Oftentimes, the exact state of the world is unknown, but I’m able to somehow sense some information about that state. A robot or an AI doesn’t have exact knowledge about the world around it, but it has some sort of sensor, whether that sensor is a camera, or sensors that detect distance, or just a microphone that is sensing audio, for example. It is sensing data, and that data is somehow related to the state of the world, even if our AI doesn’t actually know what the underlying true state of the world is. And for that, we need to get into the world of sensor models, a way of describing how the hidden state, the underlying true state of the world, relates to the observation, what it is that the AI knows or has access to. And so, for example, a hidden state might be a robot’s position. If a robot is exploring new, uncharted territory, the robot likely doesn’t know exactly where it is. But it does have an observation.
It has robot sensor data, where it can sense how far away are possible obstacles around it. And using that observed information, it can infer something about the hidden state, because what the true hidden state is influences those observations. Whatever the robot’s true position is has some effect upon what sensor data the robot is able to collect, even if the robot doesn’t actually know for certain what its true position is. Likewise, if you think about a voice recognition or a speech recognition program that listens to you and is able to respond to you, something like Alexa or what Apple and Google are doing with their voice recognition as well, you might imagine that the hidden state, the underlying state, is what words are actually spoken. The true nature of the world contains you saying a particular sequence of words, but your phone or your smart home device doesn’t know for sure exactly what words you said. The only observation that the AI has access to is some audio waveforms. And those audio waveforms are, of course, dependent upon this hidden state. And you can infer, based on those audio waveforms, what the words spoken likely were, but you might not know with 100% certainty what that hidden state actually is. And the task might be to try and predict, given this observation, given these audio waveforms, what the actual words spoken are. And likewise, you might imagine that on a website, true user engagement might be information you don’t directly have access to. But you can observe data, like website or app analytics, about how often this button was clicked or how often people are interacting with a page in a particular way. And you can use that to infer things about your users as well. So this type of problem comes up all the time when we’re dealing with AI and trying to infer things about the world, that often AI doesn’t really know the hidden true state of the world. All the AI has access to is some observation that is related to the hidden true state. But it’s not direct. There might be some noise there. The audio waveform might have some additional noise that might be difficult to parse. The sensor data might not be exactly correct. There’s some noise that might not allow you to conclude with certainty what the hidden state is, but can allow you to infer what it might be. And so the simple example we’ll take a look at here is imagining the hidden state as the weather, whether it’s sunny or rainy. And imagine you are programming an AI inside of a building that maybe has access to just a camera inside the building. And all you have access to is an observation as to whether or not employees are bringing an umbrella into the building. And using that information, you want to predict whether it’s sunny or rainy, even if you don’t know what the underlying weather is. So the underlying weather might be sunny or rainy. And if it’s raining, obviously people are more likely to bring an umbrella. And so whether or not people bring an umbrella, your observation, tells you something about the hidden state. And of course, this is a bit of a contrived example, but the idea here is to think about this more generally: any time you observe something, it has to do with some underlying hidden state.
And so to try and model this type of idea, where we have these hidden states and observations, rather than just use a Markov model, which has state, state, state, state, each of which is connected by that transition matrix that we described before, we're going to use what we call a hidden Markov model. It's very similar to a Markov model, but this is going to allow us to model a system that has hidden states that we don't directly observe, along with some observed event that we do actually see. And so in addition to that transition model that we still need, saying, given the underlying state of the world, if it's sunny or rainy, what's the probability of tomorrow's weather, we also need another model that, given some state, is going to give us an observation: green, yes, someone brings an umbrella into the office, or red, no, nobody brings umbrellas into the office. And so the observation might be that if it's sunny, then odds are nobody is going to bring an umbrella to the office. But maybe some people are just being cautious, and they do bring an umbrella to the office anyways. And if it's raining, then with much higher probability, people are going to bring umbrellas into the office. But maybe if the rain was unexpected, people didn't bring an umbrella. And so it might have some other probability as well. And so using the observations, you can begin to predict with reasonable likelihood what the underlying state is, even if you don't actually get to observe the underlying state, if you don't get to see what the hidden state is actually equal to. This here we'll often call the sensor model. It's also often called the emission probabilities, because the underlying state emits some sort of emission that you then observe. And so that can be another way of describing that same idea. And the sensor Markov assumption that we're going to use is the assumption that the evidence variable, the thing we observe, the emission that gets produced, depends only on the corresponding state, meaning we can predict whether or not people will bring umbrellas based entirely on whether it is sunny or rainy today. Of course, again, this assumption might not hold in practice. In practice, whether or not people bring umbrellas might depend not just on today's weather, but also on yesterday's weather and the day before. But for simplification purposes, it can be helpful to apply this sort of assumption, just to allow us to reason about these probabilities a little more easily. And if we're able to approximate it, we can still often get a very good answer. And so what these hidden Markov models end up looking like is a little something like this, where now, rather than just have one chain of states, like sun, sun, rain, rain, rain, we instead have this upper level, which is the underlying state of the world: is it sunny or is it rainy? And those are connected by that transition matrix we described before. But each of these states produces an emission, produces an observation that I see: that on this day, it was sunny and people didn't bring umbrellas; on this day, it was sunny, but people did bring umbrellas; on this day, it was raining and people did bring umbrellas; and so on and so forth. And so each of these underlying states, represented by x sub t for t equal to 0, 1, 2, and so on, produces some sort of observation or emission, which is what the e stands for: e sub 0, e sub 1, e sub 2, and so on.
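In symbols, writing X sub t for the hidden state and E sub t for the emission at time t, one common way to state the sensor Markov assumption just described is:

$$P(E_t \mid X_{0:t}, E_{0:t-1}) = P(E_t \mid X_t)$$

that is, once you condition on today's state, the full history of past states and past observations adds nothing to the prediction of today's observation.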
And so this, too, is a way of trying to represent this idea. And what you want to think about is that these underlying states are the true nature of the world, like the robot's position as it moves over time, which produces some sort of sensor data that might be observed, or what people are actually saying, where you use the emission data, the audio waveforms you detect, in order to process that data and try and figure it out. And there are a number of possible tasks that you might want to do given this kind of information. And one of the simplest is trying to infer something about the future or the past or about these sorts of hidden states that might exist. And so the tasks that you'll often see, and we're not going to go into the mathematics of these tasks, are all based on the same idea of conditional probabilities and using the probability distributions we have to draw these sorts of conclusions. One task is called filtering, which is: given observations from the start until now, calculate the distribution for the current state. Meaning, given information, from the beginning of time until now, about which days people brought an umbrella and which days they didn't, can I calculate the probability of the current state, that today, is it sunny or is it raining? Another task that might be possible is prediction, which is looking towards the future: given observations about people bringing umbrellas from the beginning of when we started counting time until now, can I figure out the distribution that tomorrow, is it sunny or is it raining? And you can also go backwards as well, via smoothing, where I can say: given observations from start until now, calculate the distribution for some past state. Like, I know that today people brought umbrellas and tomorrow people brought umbrellas, and so given two days' worth of data of people bringing umbrellas, what's the probability that yesterday it was raining? Knowing that people brought umbrellas today might inform that decision as well. It might influence those probabilities. And there's also a most likely explanation task, in addition to other tasks that might exist as well, which is: given observations from the start up until now, figure out the most likely sequence of states. And this is what we're going to take a look at now, this idea that if I have all these observations, umbrella, no umbrella, umbrella, no umbrella, can I calculate the most likely sequence of states, sun, rain, sun, rain, and whatnot, that actually represented the true weather that would produce these observations? And this is quite common when you're trying to do something like voice recognition, for example: you have these emissions of the audio waveforms, and you would like to calculate, based on all of the observations that you have, the most likely sequence of actual words, or syllables, or sounds that the user actually made when they were speaking to this particular device, in addition to other tasks that might come up in that context as well. And so we can try this out by going ahead and going into the HMM directory, HMM for hidden Markov model. And here, what I've done is I've defined a model where this model first defines my possible states, sun and rain, along with their emission probabilities, the observation model or the emission model, where here, given that I know that it's sunny, the probability that I see people bring an umbrella is 0.2, and the probability of no umbrella is 0.8.
And likewise, if it’s raining, then people are more likely to bring an umbrella. Umbrella has probability 0.9, no umbrella has probability 0.1. So the actual underlying hidden states, those states are sun and rain, but the things that I observe, the observations that I can see, are either umbrella or no umbrella as the things that I observe as a result. So this then, I also need to add to it a transition matrix, same as before, saying that if today is sunny, then tomorrow is more likely to be sunny. And if today is rainy, then tomorrow is more likely to be raining. As of before, I give it some starting probabilities, saying at first, 50-50 chance for whether it’s sunny or rainy. And then I can create the model based on that information. Again, the exact syntax of this is not so important, so much as it is the data that I am now encoding into a program, such that now I can begin to do some inference. So I can give my program, for example, a list of observations, umbrella, umbrella, no umbrella, umbrella, umbrella, so on and so forth, no umbrella, no umbrella. And I would like to calculate, I would like to figure out the most likely explanation for these observations. What is likely is whether rain, rain, is this rain, or is it more likely that this was actually sunny, and then it switched back to it being rainy? And that’s an interesting question. We might not be sure, because it might just be that it just so happened on this rainy day, people decided not to bring an umbrella. Or it could be that it switched from rainy to sunny back to rainy, which doesn’t seem too likely, but it certainly could happen. And using the data we give to the hidden Markov model, our model can begin to predict these answers, can begin to figure it out. So we’re going to go ahead and just predict these observations. And then for each of those predictions, go ahead and print out what the prediction is. And this library just so happens to have a function called predict that does this prediction process for me. So I’ll run python sequence.py. And the result I get is this. This is the prediction based on the observations of what all of those states are likely to be. And it’s likely to be rain and rain. In this case, it thinks that what most likely happened is that it was sunny for a day and then went back to being rainy. But in different situations, if it was rainy for longer maybe, or if the probabilities were slightly different, you might imagine that it’s more likely that it was rainy all the way through. And it just so happened on one rainy day, people decided not to bring umbrellas. And so here, too, Python libraries can begin to allow for the sort of inference procedure. And by taking what we know and by putting it in terms of these tasks that already exist, these general tasks that work with hidden Markov models, then any time we can take an idea and formulate it as a hidden Markov model, formulate it as something that has hidden states and observed emissions that result from those states, then we can take advantage of these algorithms that are known to exist for trying to do this sort of inference. So now we’ve seen a couple of ways that AI can begin to deal with uncertainty. We’ve taken a look at probability and how we can use probability to describe numerically things that are likely or more likely or less likely to happen than other events or other variables. 
So now we've seen a couple of ways that AI can begin to deal with uncertainty. We've taken a look at probability, and how we can use probability to describe numerically things that are more likely or less likely to happen than other events or other variables. And using that information, we can begin to construct these standard types of models, things like Bayesian networks and Markov chains and hidden Markov models, which all allow us to describe how particular events relate to other events, or how the values of particular variables relate to other variables, not for certain, but with some sort of probability distribution. And by formulating things in terms of these models that already exist, we can take advantage of Python libraries that implement these sorts of models already and just use them to produce some sort of resulting effect. So all of this then allows our AI to begin to deal with these sorts of uncertain problems, so that our AI doesn't need to know things for certain, but can infer what it doesn't know based on the information it does have. Next time, we'll take a look at additional types of problems that we can solve by taking advantage of AI-related algorithms, even beyond the world of the types of problems we've already explored. We'll see you next time. OK. Welcome back, everyone, to an introduction to artificial intelligence with Python. Now, so far, we've taken a look at a couple of different types of problems. We've seen classical search problems, where we're trying to get from an initial state to a goal by figuring out some optimal path. We've taken a look at adversarial search, where we have a game-playing agent that is trying to make the best move. We've seen knowledge-based problems, where we're trying to use logic and inference to draw some additional conclusions. And we've seen some probabilistic models as well, where we might not have certain information about the world, but we want to use the knowledge about probabilities that we do have to draw some conclusions. Today, we're going to turn our attention to another category of problems, generally known as optimization problems, where optimization is really all about choosing the best option from a set of possible options. And we've already seen optimization in some contexts, like game playing, where we're trying to create an AI that chooses the best move out of a set of possible moves. But what we'll take a look at today is a category of problems, and algorithms to solve them, that can be used to deal with a broader range of potential optimization problems. And the first of the algorithms that we'll take a look at is known as local search. Local search differs from the search algorithms we've seen before in the sense that the search algorithms we've looked at so far, things like breadth-first search or A-star search, for example, generally maintain a whole bunch of different paths that we're simultaneously exploring, looking at a bunch of different paths at once, trying to find our way to the solution. On the other hand, local search is a search algorithm that's really just going to maintain a single node, looking at a single state. And we'll generally run this algorithm by maintaining that single node and then moving ourselves to one of the neighboring nodes throughout this search process. And this is generally useful in contexts unlike the problems we've seen before, like a maze-solving situation, where we're trying to find our way from the initial state to the goal by following some path. Local search is most applicable when we really don't care about the path at all, and all we care about is what the solution is.
And in the case of solving a maze, the solution was always obvious. You could point to the solution. You know exactly what the goal is, and the real question is, what is the path to get there? But local search is going to come up in cases where figuring out exactly what the solution is, exactly what the goal looks like, is actually the heart of the challenge. And to give an example of one of these kinds of problems, we'll consider a scenario where we have two types of buildings: houses and hospitals. And our goal might be, in a world that's formatted as this grid, where we have a whole bunch of houses, a house here, a house here, two houses over there, to try and find a way to place two hospitals on this map, maybe a hospital here and a hospital there. So the problem now is that we want to place two hospitals on the map, but we want to do so with some sort of objective. And our objective in this case is to try and minimize the distance of any of the houses from a hospital. So you might imagine, all right, what's the distance from each of the houses to their nearest hospital? There are a number of ways we could calculate that distance. But one way is using a heuristic we've looked at before, the Manhattan distance: how many rows and columns would you have to move inside of this grid layout in order to get to a hospital, for example? And it turns out, if you take each of these four houses and figure out how close they are to their nearest hospital, you get something like this, where this house is three away from a hospital, this house is six away, and these two houses are each four away. And if you add all those numbers up together, you get a total cost of 17, for example. So for this particular configuration of hospitals, a hospital here and a hospital there, that state, we might say, has a cost of 17. And the goal of this problem, which we would now like to apply a search algorithm to, is: can you solve this problem and find a way to minimize that cost, to minimize the total amount if you sum up all of the distances from all the houses to their nearest hospital? How can we minimize that final value?
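Made concrete, that cost calculation is only a few lines of Python; this is an illustrative sketch, with made-up grid coordinates rather than the lecture's actual map:

```python
# Cost of a configuration: sum over houses of the Manhattan distance
# to the nearest hospital. Coordinates are (row, column) pairs and
# are hypothetical, not the map from the lecture.
def total_cost(houses, hospitals):
    return sum(
        min(abs(hr - pr) + abs(hc - pc) for (pr, pc) in hospitals)
        for (hr, hc) in houses
    )

houses = [(0, 0), (2, 5), (6, 1), (7, 4)]
hospitals = [(1, 3), (5, 5)]
print(total_cost(houses, hospitals))
```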
And if we think about this problem a little bit more abstractly, abstracting away from this specific problem and thinking more generally about problems like it, you can often formulate these problems by thinking about them as a state-space landscape, as we'll soon call it. Here, in this diagram of a state-space landscape, each of these vertical bars represents a particular state that our world could be in. So, for example, each of these vertical bars represents a particular configuration of two hospitals. And the height of each vertical bar is generally going to represent some function of that state, some value of that state. So maybe in this case, the height of the vertical bar represents the cost of this particular configuration of hospitals, in terms of the sum total of all the distances from all of the houses to their nearest hospital. And generally speaking, when we have a state-space landscape, we want to do one of two things. We might be trying to maximize the value of this function, trying to find a global maximum, so to speak, of this state-space landscape: a single state whose value is higher than all of the other states that we could possibly choose from. And generally, in this case, when we're trying to find a global maximum, we'll call the function that we're trying to optimize an objective function, some function that measures, for any given state, how good that state is, such that we can take any state, pass it into the objective function, and get a value for how good that state is. And ultimately, our goal is to find one of these states that has the highest possible value for that objective function. An equivalent but reversed problem is the problem of finding a global minimum: some state that has a value, after you pass it into this function, that is lower than all of the other possible values that we might choose from. And generally speaking, when we're trying to find a global minimum, we call the function that we're calculating a cost function. Generally, each state has some sort of cost, whether that cost is a monetary cost, or a time cost, or, in the case of the houses and hospitals we've been looking at just now, a distance cost in terms of how far away each of the houses is from a hospital. And we're trying to minimize the cost, to find the state that has the lowest possible value of that cost. So these are the general types of goals we might be trying to go for within a state-space landscape: trying to find a global maximum, or trying to find a global minimum. And how exactly do we do that? You'll recall that in local search, we generally operate this algorithm by maintaining just a single state, some current state represented inside of some node, maybe inside of a data structure where we're keeping track of where we are currently. And then, ultimately, what we're going to do is, from that state, move to one of its neighbor states, represented in this one-dimensional space by just the state immediately to the left or to the right of it. But for any different problem, you might define what it means for there to be a neighbor of a particular state. In the case of the hospitals example that we were just looking at, a neighbor might be moving one hospital one space to the left or to the right or up or down: some state that is close to our current state, but slightly different, and that, as a result, might have a slightly different value in terms of its objective function or its cost function. So this is going to be our general strategy in local search: take a state, maintain some current node, and move where we're looking in the state-space landscape, in order to try to find a global maximum or a global minimum somehow. And perhaps the simplest of the algorithms that we could use to implement this idea of local search is an algorithm known as hill climbing. And the basic idea of hill climbing is this: let's say I'm trying to maximize the value of my state, trying to figure out where the global maximum is. I'm going to start at a state. And generally, what hill climbing is going to do is consider the neighbors of that state: from this state, all right, I could go left or I could go right, and this neighbor happens to be higher and this neighbor happens to be lower. And in hill climbing, if I'm trying to maximize the value, I'll generally pick the highest one I can between the states to the left and right of me. This one is higher. So I'll go ahead and move myself to consider that state instead.
And then I’ll repeat this process, continually looking at all of my neighbors and picking the highest neighbor, doing the same thing, looking at my neighbors, picking the highest of my neighbors, until I get to a point like right here, where I consider both of my neighbors and both of my neighbors have a lower value than I do. This current state has a value that is higher than any of its neighbors. And at that point, the algorithm terminates. And I can say, all right, here I have now found the solution. And the same thing works in exactly the opposite way for trying to find a global minimum. But the algorithm is fundamentally the same. If I’m trying to find a global minimum and say my current state starts here, I’ll continually look at my neighbors, pick the lowest value that I possibly can, until I eventually, hopefully, find that global minimum, a point at which when I look at both of my neighbors, they each have a higher value. And I’m trying to minimize the total score or cost or value that I get as a result of calculating some sort of cost function. So we can formulate this graphical idea in terms of pseudocode. And the pseudocode for hill climbing might look like this. We define some function called hill climb that takes as input the problem that we’re trying to solve. And generally, we’re going to start in some sort of initial state. So I’ll start with a variable called current that is keeping track of my initial state, like an initial configuration of hospitals. And maybe some problems lend themselves to an initial state, some place where you begin. In other cases, maybe not, in which case we might just randomly generate some initial state, just by choosing two locations for hospitals at random, for example, and figuring out from there how we might be able to improve. But that initial state, we’re going to store inside of current. And now, here comes our loop, some repetitive process we’re going to do again and again until the algorithm terminates. And what we’re going to do is first say, let’s figure out all of the neighbors of the current state. From my state, what are all of the neighboring states for some definition of what it means to be a neighbor? And I’ll go ahead and choose the highest value of all of those neighbors and save it inside of this variable called neighbor. So keep track of the highest-valued neighbor. This is in the case where I’m trying to maximize the value. In the case where I’m trying to minimize the value, you might imagine here, you’ll pick the neighbor with the lowest possible value. But these ideas are really fundamentally interchangeable. And it’s possible, in some cases, there might be multiple neighbors that each have an equally high value or an equally low value in the minimizing case. And in that case, we can just choose randomly from among them. Choose one of them and save it inside of this variable neighbor. And then the key question to ask is, is this neighbor better than my current state? And if the neighbor, the best neighbor that I was able to find, is not better than my current state, well, then the algorithm is over. And I’ll just go ahead and return the current state. If none of my neighbors are better, then I may as well stay where I am, is the general logic of the hill climbing algorithm. But otherwise, if the neighbor is better, then I may as well move to that neighbor. 
So let's take a look at a real example of this with these houses and hospitals. We've seen now that if we put the hospitals in these two locations, that has a total cost of 17. And now we need to define, if we're going to implement this hill climbing algorithm, what it means to take this particular configuration of hospitals, this particular state, and get a neighbor of that state. And a simple definition of neighbor might be just: let's pick one of the hospitals and move it by one square to the left or right or up or down, for example. And that would mean we have six possible neighbors from this particular configuration. We could take this hospital and move it to any of these three possible squares, or we could take this hospital and move it to any of those three possible squares. And each of those would generate a neighbor. And what I might do is say, all right, here are the locations and the distances between each of the houses and their nearest hospital. Let me consider all of the neighbors and see if any of them can do better than a cost of 17. And it turns out there are a couple of ways that we could do that, and if several moves are tied for the best, it doesn't matter which of them we randomly choose. But one such possible way is by taking a look at this hospital here and considering the directions in which it might move. If we hold the other hospital constant, and we take this hospital and move it one square up, for example, that doesn't really help us. It gets closer to the house up here, but it gets further away from the house down here. And it doesn't really change anything for the two houses along the left-hand side. But if we take this hospital on the right and move it one square down, it's the opposite problem: it gets further away from the house up above, and it gets closer to the house down below. The right move, the goal, should be to take this hospital and move it one square to the left. By moving it one square to the left, we move it closer to both of these houses on the right without changing anything for the houses on the left. For them, this hospital is still the closer one, so they aren't affected. So we're able to improve the situation by picking a neighbor that results in a decrease in our total cost. And so we might do that: move ourselves from this current state to a neighbor by just taking that hospital and moving it. And at this point, there's not a whole lot more that can be done with this hospital. But there are still other optimizations we can make, other neighbors we can move to that are going to have a better value. If we consider this hospital, for example, we might imagine that right now it's a bit far up, and that both of these houses are a little bit lower. So we might be able to do better by taking this hospital and moving it one square down, so that now, instead of a cost of 15, we're down to a cost of 13 for this particular configuration. And we can do even better by taking the hospital and moving it one square to the left. Now, instead of a cost of 13, we have a cost of 11, because this house is one away from the hospital. This one is four away.
This one is three away. And this one is also three away. So we've been able to do much better than that initial cost we had with the initial configuration, just by taking every state and asking ourselves the question: can we do better by making small, incremental changes, moving to a neighbor, then another neighbor, and then another neighbor after that? And now, at this point, we can see that the algorithm is going to terminate. There's actually no neighbor we can move to that is going to improve the situation, that will get us a cost less than 11. Because if we take this hospital and move it up or to the right, well, that's going to make it further away. If we take it and move it down, that doesn't really change the situation: it gets further away from this house but closer to that house. And likewise, the same story is true for this hospital. Any neighbor we move it to, up, left, down, or right, is either going to make it further away from the houses and increase the cost, or it's going to have no effect on the cost whatsoever. And so the question we might now ask is: is this the best we could do? Is this the best placement of the hospitals we could possibly have? And it turns out the answer is no, because there's a better way that we could place these hospitals. And in particular, there are a number of ways you could do this, but one of them is by taking this hospital here and moving it to this square, moving it diagonally by one square, which was not part of our definition of neighbor. We could only move left, right, up, or down. But this is, in fact, better. It has a total cost of 9. It is now closer to both of these houses, and as a result, the total cost is less. But we weren't able to find it, because in order to get there, we had to go through a state that actually wasn't any better than the current state we had been in previously. And so this appears to be a limitation, or at least a concern you might have as you go about trying to implement a hill climbing algorithm: it might not always give you the optimal solution. If we're trying to maximize the value of any particular state, trying to find the global maximum, a concern might be that we could get stuck at one of the local maxima, highlighted here in blue, where a local maximum is any state whose value is higher than any of its neighbors. If we ever find ourselves at one of these two states when we're trying to maximize the value of the state, we're not going to make any changes. We're not going to move left or right here, because those states are worse. But yet, we haven't found the global optimum. We haven't done the best we could do. And likewise, in the case of the hospitals, what we're ultimately trying to do is find a global minimum, find a value that is lower than all of the others. But we have the potential to get stuck at one of the local minima: any of these states whose value is lower than all of its neighbors, but still not as low as the global minimum. And so the takeaway here is that it's not always going to be the case that, when we run this naive hill climbing algorithm, we're going to find the optimal solution. There are things that could go wrong.
If we started here, for example, and tried to maximize our value as much as possible, we might move to the highest possible neighbor, move to the highest possible neighbor again, and again, and then stop, never realizing that there's actually a better state way over there that we could have gone to instead. And other problems you might notice just by taking a look at this state-space landscape are these various different types of plateaus, something like this flat local maximum here, where all six of these states each have the exact same value. In the case of the algorithm we showed before, none of the neighbors are better, so we might just get stuck at this flat local maximum. And even if you allowed yourself to move to one of the neighbors, it wouldn't be clear which neighbor you would ultimately move to, and you could get stuck here as well. And there's another one over here. This one is called a shoulder. It's not really a local maximum, because there are still places where we can go higher, and it's not a local minimum, because we can go lower. So we can still make progress, but it's still this flat area where, if you have a local search algorithm, there's potential to get lost here, unable to make upward or downward progress, depending on whether we're trying to maximize or minimize, and therefore another way we might end up with a solution that is not actually the optimal one. And so, because of this potential, the potential that hill climbing has to not always find us the optimal result, it turns out there are a number of different varieties and variations on the hill climbing algorithm that help to solve the problem better depending on the context. And depending on the specific type of problem, some of these variants might be more applicable than others. What we've taken a look at so far is a version of hill climbing generally called steepest-ascent hill climbing, where the idea is that we are going to choose the highest-valued neighbor in the case where we're trying to maximize, or the lowest-valued neighbor in cases where we're trying to minimize. But generally speaking, if I have five neighbors and they're all better than my current state, I will pick the best one of those five. Now, sometimes that might work pretty well. It's sort of a greedy approach of trying to take the best operation at any particular time step. But it might not always work. There might be cases where I actually want to choose an option that is slightly better than me, but maybe not the best one, because that might later on lead to a better outcome ultimately. So there are other variants that we might consider of this basic hill climbing algorithm. One is known as stochastic hill climbing. And in this case, we choose randomly from all of our higher-valued neighbors. So if I'm at my current state and there are five neighbors that are all better than I am, rather than choosing the best one, as steepest ascent would do, stochastic hill climbing will just choose randomly from one of them, thinking that if it's better, then it's better, and maybe there's a potential to make forward progress, even if it is not locally the best option I could possibly choose. First-choice hill climbing, operating on a similar idea, ends up just choosing the very first higher-valued neighbor that it finds: rather than considering all of the neighbors, as soon as we find a neighbor that is better than our current state, we go ahead and move there. There may be some efficiency improvements there, and it may have the potential to find a solution that the other strategies weren't able to find.
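As tiny sketches of how those two selection rules differ (reusing the assumed problem interface from the hill_climb sketch above; these are illustrative, not the lecture's code):

```python
import random

def stochastic_choice(problem, current, neighbors):
    # Stochastic hill climbing: choose randomly among ALL better neighbors
    better = [n for n in neighbors
              if problem.value(n) > problem.value(current)]
    return random.choice(better) if better else None

def first_choice(problem, current, neighbors):
    # First-choice hill climbing: take the first better neighbor found
    for n in neighbors:
        if problem.value(n) > problem.value(current):
            return n
    return None
```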
And with all of these variants, we still suffer from the same potential risk, this risk that we might end up at a local minimum or a local maximum. And we can reduce that risk by repeating the process multiple times. So one variant of hill climbing is random-restart hill climbing, where the general idea is that we'll conduct hill climbing multiple times. If we apply steepest-ascent hill climbing, for example, we'll start at some random state, try to solve the problem, and figure out what local maximum or local minimum we get to. And then we'll just randomly restart and try again, choose a new starting configuration, figure out what the local maximum or minimum is, and do this some number of times. And after we've done it some number of times, we can pick the best one out of all of the ones that we've taken a look at. So that's another option we have access to as well. And then, although I said that local search will usually just keep track of a single node and then move to one of its neighbors, there are variants of hill climbing known as local beam searches, where rather than keeping track of just one current best state, we keep track of the k highest-valued neighbors, such that rather than starting at one random initial configuration, I might start with 3 or 4 or 5, randomly generate all the neighbors, and then pick the 3 or 4 or 5 best of all of the neighbors that I find, and continually repeat this process, with the idea being that now I have more options that I'm considering, more ways that I could potentially navigate myself to the optimal solution that might exist for a particular problem. So let's now take a look at some actual code that can implement some of these kinds of ideas, something like steepest-ascent hill climbing, for example, for trying to solve this hospital problem. So I'm going to go ahead and go into my hospitals directory, where I've set up the basic framework for solving this type of problem. I'll go ahead and go into hospitals.py, and we'll take a look at the code we've created here. I've defined a class that is going to represent the state space. So the space has a height, and a width, and also some number of hospitals; you can configure how big your map is and how many hospitals should go there. We have a function for adding a new house to the state space, and then some functions that are going to get me all of the available spaces, for when I want to randomly place hospitals in particular locations. And here now is the hill climbing algorithm. So what are we going to do in the hill climbing algorithm? Well, we're going to start by randomly initializing where the hospitals are going to go. We don't know where the hospitals should actually be, so let's just randomly place them. So here, I'm running a loop for each of the hospitals that I have, and I'm going to go ahead and add a new hospital at some random location: I get all of the available spaces and randomly choose one of them as the place to add this particular hospital. I have some logging output, and I'm generating some images, which we'll take a look at a little bit later. But here is the key idea. I'm going to just keep repeating this algorithm. I could specify a maximum number of times I want it to run, or I could just run it until it hits a local maximum or local minimum.
And now we’ll basically consider all of the hospitals that could potentially move. So consider each of the two hospitals or more hospitals if they’re more than that. And consider all of the places where that hospital could move to, some neighbor of that hospital that we can move the neighbor to. And then see, is this going to be better than where we were currently? So if it is going to be better, then we’ll go ahead and update our best neighbor and keep track of this new best neighbor that we found. And then afterwards, we can ask ourselves the question, if best neighbor cost is greater than or equal to the cost of the current set of hospitals, meaning if the cost of our best neighbor is greater than the current cost, meaning our best neighbor is worse than our current state, well, then we shouldn’t make any changes at all. And we should just go ahead and return the current set of hospitals. But otherwise, we can update our hospitals in order to change them to one of the best neighbors. And if there are multiple that are all equivalent, I’m here using random.choice to say go ahead and choose one randomly. So this is really just a Python implementation of that same idea that we were just talking about, this idea of taking a current state, some current set of hospitals, generating all of the neighbors, looking at all of the ways we could take one hospital and move it one square to the left or right or up or down, and then figuring out, based on all of that information, which is the best neighbor or the set of all the best neighbors, and then choosing from one of those. And each time, we go ahead and generate an image in order to do that. And so now what we’re doing is if we look down at the bottom, I’m going to randomly generate a space with height 10 and width 20. And I’ll say go ahead and put three hospitals somewhere in the space. I’ll randomly generate 15 houses that I just go ahead and add in random locations. And now I’m going to run this hill climbing algorithm in order to try and figure out where we should place those hospitals. So we’ll go ahead and run this program by running Python hospitals. And we see that we started. Our initial state had a cost of 72, but we were able to continually find neighbors that were able to decrease that cost, decrease to 69, 66, 63, so on and so forth, all the way down to 53, as the best neighbor we were able to ultimately find. And we can take a look at what that looked like by just opening up these files. So here, for example, was the initial configuration. We randomly selected a location for each of these 15 different houses and then randomly selected locations for one, two, three hospitals that were just located somewhere inside of the state space. And if you add up all the distances from each of the houses to their nearest hospital, you get a total cost of about 72. And so now the question is, what neighbors can we move to that improve the situation? And it looks like the first one the algorithm found was by taking this house that was over there on the right and just moving it to the left. And that probably makes sense because if you look at the houses in that general area, really these five houses look like they’re probably the ones that are going to be closest to this hospital over here. Moving it to the left decreases the total distance, at least to most of these houses, though it does increase that distance for one of them. 
And so we’re able to make these improvements to the situation by continually finding ways that we can move these hospitals around until we eventually settle at this particular state that has a cost of 53, where we figured out a position for each of the hospitals. And now none of the neighbors that we could move to are actually going to improve the situation. We can take this hospital and this hospital and that hospital and look at each of the neighbors. And none of those are going to be better than this particular configuration. And again, that’s not to say that this is the best we could do. There might be some other configuration of hospitals that is a global minimum. And this might just be a local minimum that is the best of all of its neighbors, but maybe not the best in the entire possible state space. And you could search through the entire state space by considering all of the possible configurations for hospitals. But ultimately, that’s going to be very time intensive, especially as our state space gets bigger and there might be more and more possible states. It’s going to take quite a long time to look through all of them. And so being able to use these sort of local search algorithms can often be quite good for trying to find the best solution we can do. And especially if we don’t care about doing the best possible and we just care about doing pretty good and finding a pretty good placement of those hospitals, then these methods can be particularly powerful. But of course, we can try and mitigate some of this concern by instead of using hill climbing to use random restart, this idea of rather than just hill climb one time, we can hill climb multiple times and say, try hill climbing a whole bunch of times on the exact same map and figure out what is the best one that we’ve been able to find. And so I’ve here implemented a function for random restart that restarts some maximum number of times. And what we’re going to do is repeat that number of times this process of just go ahead and run the hill climbing algorithm, figure out what the cost is of getting from all the houses to the hospitals, and then figure out is this better than we’ve done so far. So I can try this exact same idea where instead of running hill climbing, I’ll go ahead and run random restart. And I’ll randomly restart maybe 20 times, for example. And we’ll go ahead and now I’ll remove all the images and then rerun the program. And now we started by finding a original state. When we initially ran hill climbing, the best cost we were able to find was 56. Each of these iterations is a different iteration of the hill climbing algorithm. We’re running hill climbing not one time, but 20 times here, each time going until we find a local minimum in this case. And we look and see each time did we do better than we did the best time we’ve done so far. So we went from 56 to 46. This one was greater, so we ignored it. This one was 41, which was less, so we went ahead and kept that one. And for all of the remaining 16 times that we tried to implement hill climbing and we tried to run the hill climbing algorithm, we couldn’t do any better than that 41. Again, maybe there is a way to do better that we just didn’t find, but it looks like that way ended up being a pretty good solution to the problem. That was attempt number three, starting from counting at zero. So we can take a look at that, open up number three. 
That was attempt number three, counting from zero. So we can take a look at that, open up number three. And this was the state that happened to have a cost of 41: after running the hill climbing algorithm on some particular random initial configuration of hospitals, this is what we found to be the local minimum, in terms of trying to minimize the cost. And it looks like we did pretty well. This hospital is pretty close to this region. This one is pretty close to these houses here. And this hospital looks about as good as we can do for trying to capture those houses over on that side. So these sorts of algorithms can be quite useful for trying to solve these problems. But the real problem with many of these different types of hill climbing, steepest-ascent, stochastic, first-choice, and so forth, is that they never make a move that makes our situation worse. They're always going to take our current state, look at the neighbors, consider whether we can do better than our current state, and move to one of those neighbors. Which of those neighbors we choose might vary among these various different types of algorithms, but we never go from a current position to a position that is worse than our current position. And ultimately, sometimes moving to a worse state is what we're going to need to do if we want to be able to find the global maximum or the global minimum. Because sometimes, if we get stuck, we want to find some way of dislodging ourselves from our local maximum or local minimum, in order to find the global maximum or the global minimum, or at least increase the probability that we do find it. And so the most popular technique for trying to approach the problem from that angle is a technique known as simulated annealing, simulated because it's modeled after a real physical process of annealing. You can think about this in terms of physics: a physical situation where you have some system of particles. And you might imagine that when you heat up a particular physical system, there's a lot of energy there, and things are moving around quite randomly. But over time, as the system cools down, it eventually settles into some final position. And that's going to be the general idea of simulated annealing. We're going to simulate that process of some high-temperature system, where things are moving around randomly quite frequently, but over time decrease that temperature until we eventually settle at our ultimate solution. And the idea is going to be this: if we have some state-space landscape that looks like this, and we begin at this initial state here, and we're looking for a global maximum, trying to maximize the value of the state, our traditional hill climbing algorithms would just take the state, look at the two neighboring ones, and always pick the one that is going to increase the value of the state. But if we want some chance of being able to find the global maximum, we can't always make good moves. We have to sometimes make bad moves, and allow ourselves to make a move in a direction that actually seems, for now, to make our situation worse, such that later we can find our way up to that global maximum in terms of trying to solve that problem. Of course, once we get up to this global maximum, once we've done a whole lot of the searching, then we probably don't want to be moving to states that are worse than our current state. And so this is where the metaphor of annealing comes in: we want to start out making more random moves and, over time, make fewer of those random moves, based on a particular temperature schedule. So the basic outline looks something like this.
Early on in simulated annealing, we have a higher temperature state. And what we mean by a higher temperature state is that we are more likely to accept neighbors that are worse than our current state. We might look at our neighbors. And if one of our neighbors is worse than the current state, especially if it’s not all that much worse, if it’s pretty close but just slightly worse, then we might be more likely to accept that and go ahead and move to that neighbor anyways. But later on as we run simulated annealing, we’re going to decrease that temperature. And at a lower temperature, we’re going to be less likely to accept neighbors that are worse than our current state. Now to formalize this and put a little bit of pseudocode to it, here is what that algorithm might look like. We have a function called simulated annealing that takes as input the problem we’re trying to solve and also potentially some maximum number of times we might want to run the simulated annealing process, how many different neighbors we’re going to try and look for. And that value is going to vary based on the problem you’re trying to solve. We’ll, again, start with some current state that will be equal to the initial state of the problem. But now we need to repeat this process over and over for max number of times. Repeat some process some number of times where we’re first going to calculate a temperature. And this temperature function takes the current time t starting at 1 going all the way up to max and then gives us some temperature that we can use in our computation, where the idea is that this temperature is going to be higher early on and it’s going to be lower later on. So there are a number of ways this temperature function could often work. One of the simplest ways is just to say it is like the proportion of time that we still have remaining. Out of max units of time, how much time do we have remaining? You start off with a lot of that time remaining. And as time goes on, the temperature is going to decrease because you have less and less of that remaining time still available to you. So we calculate a temperature for the current time. And then we pick a random neighbor of the current state. No longer are we going to be picking the best neighbor that we possibly can or just one of the better neighbors that we can. We’re going to pick a random neighbor. It might be better. It might be worse. But we’re going to calculate that. We’re going to calculate delta E, E for energy in this case, which is just how much better is the neighbor than the current state. So if delta E is positive, that means the neighbor is better than our current state. If delta E is negative, that means the neighbor is worse than our current state. And so we can then have a condition that looks like this. If delta E is greater than 0, that means the neighbor state is better than our current state. And if ever that situation arises, we’ll just go ahead and update current to be that neighbor. Same as before, move where we are currently to be the neighbor because the neighbor is better than our current state. We’ll go ahead and accept that. 
But now the difference is that whereas before, we never, ever wanted to take a move that made our situation worse, now we sometimes want to make a move that is actually going to make our situation worse, because sometimes we're going to need to dislodge ourselves from a local minimum or local maximum to increase the probability that we're able to find the global minimum or the global maximum a little bit later. And so how do we do that? How do we decide to sometimes accept some state that might actually be worse? Well, we're going to accept a worse state with some probability. And that probability needs to be based on a couple of factors. It needs to be based in part on the temperature: if the temperature is higher, we're more likely to move to a worse neighbor, and if the temperature is lower, we're less likely to move to a worse neighbor. But it also, to some degree, should be based on delta E: if the neighbor is much worse than the current state, we probably want to be less likely to choose it than if the neighbor is just a little bit worse than the current state. So again, there are a couple of ways you could calculate this, but it turns out one of the most popular is just to calculate e to the power of delta E over T, where e is just a constant (Euler's number), and delta E and T are the values we just computed. We calculate that value, and for a worse neighbor, where delta E is negative, it'll be some value between 0 and 1. And that is the probability with which we should just say: all right, let's go ahead and move to that neighbor. And it turns out that if you do the math for this value, when delta E is such that the neighbor is not that much worse than the current state, it's going to be more likely that we go ahead and move to that state. And likewise, when the temperature is lower, we're going to be less likely to move to that neighboring state as well. So now this is the big picture for simulated annealing, this process of taking the problem and repeatedly generating random neighbors. We'll always move to a neighbor if it's better than our current state. But even if the neighbor is worse than our current state, we'll sometimes move there, depending on how much worse it is and also based on the temperature. And as a result, the hope, the goal of this whole process, is that as we begin to try and find our way to the global maximum or the global minimum, we can dislodge ourselves if we ever get stuck at a local maximum or local minimum, in order to eventually make our way to exploring the part of the state space that is going to be the best. And then, as the temperature decreases, eventually we settle there, without moving around too much from what we've found to be the globally best thing that we can do thus far. So at the very end, we just return whatever the current state happens to be. And that is the conclusion of this algorithm. We've been able to figure out what the solution is.
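Putting the pieces of that pseudocode together, here is a minimal simulated annealing sketch for a maximization problem; the problem interface and the linear cooling schedule are assumptions for illustration, not the lecture's code:

```python
import math
import random

def simulated_annealing(problem, maximum):
    current = problem.initial()
    for t in range(1, maximum + 1):
        # Temperature as the proportion of time remaining: high early on,
        # cooling toward zero as t approaches maximum
        temperature = (maximum + 1 - t) / maximum
        neighbor = random.choice(problem.neighbors(current))
        # delta E > 0 means the neighbor is better than the current state
        delta_e = problem.value(neighbor) - problem.value(current)
        # Always accept a better neighbor; accept a worse one with
        # probability e^(delta E / T), which shrinks as T cools or as
        # the neighbor gets much worse
        if delta_e > 0 or random.random() < math.exp(delta_e / temperature):
            current = neighbor
    return current
```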
And these types of algorithms have a lot of different applications. Any time you can take a problem and formulate it as something where you can explore a particular configuration, ask whether any of the neighbors are better than the current configuration, and have some way of measuring that, there is an applicable case for these hill climbing and simulated annealing types of algorithms. So sometimes it can be for facility-location-type problems, like when you're trying to plan a city and figure out where the hospitals should be. But there are definitely other applications as well. And one of the most famous problems in computer science is the traveling salesman problem. The traveling salesman problem is generally formulated like this. I have a whole bunch of cities, here indicated by these dots. And what I'd like to do is find some route that takes me through all of the cities and ends up back where I started: some route that starts here, goes through all these cities, and ends up back where I originally started. And what I might like to do is minimize the total distance that I have to travel, or the total cost of taking this entire path. And you can imagine this is a problem that's very applicable in situations like when delivery companies are trying to deliver things to a whole bunch of different houses: they want to figure out how to get from the warehouse to all these various different houses and back again, all using as little time and distance and energy as possible. So you might want to try to solve these sorts of problems. But it turns out that solving this particular kind of problem is very computationally difficult; it is a very computationally expensive task to figure it out. This falls under the category of what are known as NP-complete problems, problems for which there is no known efficient way to find a solution. And so what we ultimately have to do is come up with some approximation, some way of trying to find a good solution, even if we're not going to find the globally best solution that we possibly can, at least not in a feasible or tractable amount of time. And so what we could do is take the traveling salesman problem and try to formulate it using local search and ask a question like: all right, I can pick some state, some configuration, some route between all of these nodes, and I can measure the cost of that state, figure out what the distance is. I might now want to try to minimize that cost as much as possible. And then the only question is: what does it mean to have a neighbor of this state? What does it mean to take this particular route and have some neighboring route that is close to it, but slightly different, such that it might have a different total distance? And there are a number of different definitions for what a neighbor of a traveling salesman configuration might look like. But one way is just to say: a neighbor is what happens if we pick two of these edges between nodes and effectively switch them. So, for example, I might pick these two edges here and go ahead and switch them. And what that process will generally look like is removing both of these edges from the graph, taking this node, and connecting it to the node it wasn't connected to, so connecting it up here instead. We'll need to take these arrows that were originally going this way and reverse them, so they move in the other direction, and then just fill in that last remaining blank, adding an arrow that goes in that direction instead. So by taking two edges and just switching them, I have been able to consider one possible neighbor of this particular configuration. And it looks like this neighbor is actually better: it looks like this route probably travels a shorter distance in order to get through all the cities than the current state did.
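One common way to implement that edge-switching neighbor move (often called a 2-opt move) is on a route stored as a list of city indices, where reversing the segment between the two cut points is exactly the arrow-flipping described above; this is an illustrative sketch, not the lecture's code:

```python
def two_opt_neighbor(route, i, j):
    # Remove the edges leaving positions i and j, then reconnect by
    # reversing the segment between them; the reversed segment's arrows
    # now point the other way
    return route[:i + 1] + route[i + 1:j + 1][::-1] + route[j + 1:]

route = [0, 1, 2, 3, 4, 5]
print(two_opt_neighbor(route, 1, 3))   # [0, 1, 3, 2, 4, 5]
```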
And so you could imagine implementing this idea inside of a hill climbing or simulated annealing algorithm, where we repeat this process to take a state of this traveling salesman problem, look at all the neighbors, and then move to the neighbors if they're better, or maybe even move to the neighbors if they're worse, until we eventually settle upon the best solution that we've been able to find. And it turns out that these types of approximation algorithms, even if they don't always find the very best solution, can often do pretty well at finding solutions that are helpful too. So that then was a look at local search, a particular category of algorithms that can be used for solving a particular type of problem, where we don't really care about the path to the solution. I didn't care about the steps I took to decide where the hospitals should go; I just cared about the solution itself, about where the hospitals should be, or what the route through the traveling salesman journey really ought to be. Another category of problems that might come up is what's known as linear programming. Linear programming often comes up in contexts where we're trying to optimize for some mathematical function, and often when we're dealing with real-numbered values. So it's not just discrete fixed values that we might have, but any decimal values that we might want to be able to calculate. Linear programming is a family of problems where we might have a situation that looks like this, where the goal is to minimize a cost function. (You can invert the numbers and instead try to maximize, but often we'll frame it as trying to minimize a cost function.) That cost function has some number of variables, x1, x2, x3, all the way up to xn, just some number of variables that are involved, things that I want to know the values of. And this cost function might have coefficients in front of those variables. This is what we would call a linear equation, where we just have all of these variables, each possibly multiplied by a coefficient, added together. We're not going to square anything or cube anything, because that would give us different types of equations. With linear programming, we're just dealing with linear equations, in addition to linear constraints, where a constraint is going to look something like: some linear combination of all of these variables, summed up, is less than or equal to some bound b. And we might have a whole number of these various different constraints placed onto our linear programming exercise. And likewise, just as we can have constraints saying that a linear equation is less than or equal to some bound b, it might also be equal to something: if you want some combination of variables to be equal to a value, you can specify that. And we can also specify that each variable has lower and upper bounds, that it needs to be a positive number, for example, or a number less than 50. There are a number of other choices we can make there for defining what the bounds of a variable are.
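Written out in math notation, the general form described above looks like this:

$$
\begin{aligned}
\text{minimize}\quad & c_1 x_1 + c_2 x_2 + \cdots + c_n x_n \\
\text{subject to}\quad & a_1 x_1 + a_2 x_2 + \cdots + a_n x_n \le b \quad \text{(one such inequality per constraint)}\\
& a_1 x_1 + a_2 x_2 + \cdots + a_n x_n = b \quad \text{(optionally, equality constraints)}\\
& l_i \le x_i \le u_i \quad \text{(optionally, bounds on individual variables)}
\end{aligned}
$$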
But it turns out that if you can take a problem and formulate it in these terms, formulate the problem as: your goal is to minimize a cost function, and you're minimizing that cost function subject to particular constraints, subject to equations of the form where some combination of variables is less than a bound or is equal to some particular value, then there are a number of algorithms that already exist for solving these sorts of problems. So let's go ahead and take a look at an example. Here's an example of a problem that might come up in the world of linear programming. Often this is going to come up when we're trying to optimize for something, and we want to be able to do some calculations, and we have constraints on what we're trying to optimize. And so it might be something like this. In the context of a factory, we have two machines, x1 and x2. x1 costs $50 an hour to run. x2 costs $80 an hour to run. And our goal, what we're trying to do, our objective, is to minimize the total cost. So that's what we'd like to do. But we need to do so subject to certain constraints. So there might be a labor constraint: x1 requires 5 units of labor per hour, x2 requires 2 units of labor per hour, and we have a total of 20 units of labor that we have to spend. So this is a constraint. We have no more than 20 units of labor that we can spend, and we have to spend it across x1 and x2, each of which requires a different amount of labor. And we might also have a constraint like this that tells us x1 is going to produce 10 units of output per hour, x2 is going to produce 12 units of output per hour, and the company needs 90 units of output. So we have some goal, something we need to achieve. We need to achieve 90 units of output, but there are some constraints: x1 can only produce 10 units of output per hour, and x2 produces 12 units of output per hour. These types of problems come up quite frequently, and you can start to notice patterns in them: problems where I am trying to optimize for some goal, minimizing cost, maximizing output, maximizing profits, or something like that, and there are constraints placed on that process. And so now we just need to formulate this problem in terms of linear equations. So let's start with this first point: two machines, x1 and x2; x1 costs $50 an hour, x2 costs $80 an hour. Here we can come up with an objective function, our cost function, that might look like this: 50 times x1 plus 80 times x2, where x1 is going to be a variable representing how many hours we run machine x1 for, and x2 is going to be a variable representing how many hours we run machine x2 for. And what we're trying to minimize is this cost function, which is just how much it costs to run each of these machines per hour, summed up. This is an example of a linear equation, just some combination of these variables with coefficients placed in front of them, and I would like to minimize that total value. But I need to do so subject to these constraints. x1 requires 5 units of labor per hour, x2 requires 2, and we have a total of 20 units of labor to spend. And so that gives us a constraint of this form: 5 times x1 plus 2 times x2 is less than or equal to 20. 20 is the total number of units of labor we have to spend, and that's spent across x1 and x2, each of which requires a different number of units of labor per hour. And finally, we have this constraint here.
x1 produces 10 units of output per hour, x2 produces 12, and we need 90 units of output. And so this might look something like this: 10x1 plus 12x2, the amount of output per hour, needs to be at least 90. We can do better than that, but it needs to be at least 90. And if you recall from my formulation before, I said that generally speaking in linear programming, we deal with equality constraints or less-than-or-equal-to constraints. So we have a greater-than-or-equal-to sign here. That's not a problem. Whenever we have a greater-than-or-equal-to sign, we can just multiply the equation by negative 1, and that'll flip it around to less than or equal to negative 90, for example, instead of greater than or equal to 90. And that's going to be an equivalent expression that we can use to represent this problem. So now that we have this cost function and these constraints that it's subject to, it turns out there are a number of algorithms that can be used to solve these types of problems. These problems go a little bit more into geometry and linear algebra than we're really going to get into. But the most popular of these types of algorithms is simplex, which was one of the first algorithms discovered for solving linear programs. And later on, a class of interior point algorithms was developed that can be used to solve this type of problem as well. The key is not to understand exactly how these algorithms work, but to realize that these algorithms exist for efficiently finding solutions any time we have a problem of this particular form. And so we can take a look, for example, at the production directory, where I have a file called production.py. Here I'm using scipy, a library for a lot of science-related functions within Python. And I can go ahead and just call its optimization function in order to run a linear program. scipy.optimize.linprog here is going to try and solve this linear program for me, where I provide to this function call all of the data about my linear program. It needs to be in a particular format, which might be a little confusing at first. The first argument to scipy.optimize.linprog is the cost function, which in this case is just an array or a list containing 50 and 80, because my original cost function was 50 times x1 plus 80 times x2. So I just tell Python: 50 and 80, those are the coefficients that I am now trying to optimize for. And then I provide all of the constraints. The constraints, which I wrote up above in comments, are: constraint 1 is 5x1 plus 2x2 is less than or equal to 20, and constraint 2 is negative 10x1 plus negative 12x2 is less than or equal to negative 90. And scipy expects these constraints to be in a particular format. It first expects me to provide all of the coefficients for the upper bound equations (ub is just short for upper bound), where the coefficients of the first equation are 5 and 2, because we have 5x1 and 2x2, and the coefficients for the second equation are negative 10 and negative 12, because I have negative 10x1 plus negative 12x2. And then, as a separate argument just to keep things separate, we provide what the actual bound is, the upper bound for each of these constraints. For the first constraint, the upper bound is 20; that was constraint number 1. And then for constraint number 2, the upper bound is negative 90. So it's a bit of a cryptic way of representing it. It's not quite as simple as just writing the mathematical equations.
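Pieced together from that description, the call might look something like this. (scipy.optimize.linprog is the real scipy function, but the exact contents of production.py here are a reconstruction, not the verbatim file.)

```python
import scipy.optimize

# Objective: minimize 50x_1 + 80x_2
# Constraint 1: 5x_1 + 2x_2 <= 20
# Constraint 2: -10x_1 + -12x_2 <= -90  (i.e., 10x_1 + 12x_2 >= 90)
result = scipy.optimize.linprog(
    [50, 80],                    # coefficients of the cost function
    A_ub=[[5, 2], [-10, -12]],   # coefficients of the upper-bound constraints
    b_ub=[20, -90],              # the upper bound for each constraint
)

if result.success:
    print(f"X1: {round(result.x[0], 2)} hours")
    print(f"X2: {round(result.x[1], 2)} hours")
else:
    print("No solution")
```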
What really is being expected here are all of the coefficients and all of the numbers that are in these equations: first the coefficients for the cost function, then all the coefficients for the inequality constraints, and then all of the upper bounds for those inequality constraints. And once all of that information is there, then we can run any of these interior point algorithms or the simplex algorithm. Even if you don't understand how it works, you can just run the function and figure out what the result should be. And here, I said if the result is a success, we were able to solve this problem: go ahead and print out what the values of x1 and x2 should be. Otherwise, go ahead and print out no solution. And so if I run this program by running python production.py, it takes a second to calculate, but then we see what the optimal solution should be: x1 should run for 1.5 hours, and x2 should run for 6.25 hours. And we were able to do this by just formulating the problem as a linear equation that we were trying to optimize, some cost that we were trying to minimize, and then some constraints that were placed on that. Many, many problems fall into this category, problems that you can solve if you can just figure out how to use equations and constraints to represent that general idea. And that's a theme that's going to come up a couple of times today: we want to be able to take some problem and reduce it down to some problem we know how to solve, in order to begin to find a solution, and to use existing methods to find a solution more effectively or more efficiently. And it turns out that these types of problems, where we have constraints, show up in other ways too. There's an entire class of problems more generally known as constraint satisfaction problems, and we're going to now take a look at how you might formulate a constraint satisfaction problem and how you might go about solving one. The basic idea of a constraint satisfaction problem is that we have some number of variables that need to take on some values, and we need to figure out what values each of those variables should take on. But those variables are subject to particular constraints that are going to limit what values they can actually take on. So let's take a look at a real-world example: exam scheduling. I have four students here, students 1, 2, 3, and 4. Each of them is taking some number of different classes, and classes here are going to be represented by letters. So student 1 is enrolled in courses A, B, and C; student 2 is enrolled in courses B, D, and E; so on and so forth. And now a university, for example, is trying to schedule exams for all of these courses, but there are only three exam slots: Monday, Tuesday, and Wednesday. And we have to schedule an exam for each of these courses. The constraint we have to deal with in the scheduling is that we don't want anyone to have to take two exams on the same day. We would like to try and minimize that, or eliminate it if at all possible. So how do we begin to represent this idea? How do we structure this in a way that a computer with an AI algorithm can begin to try and solve? Well, let's look in particular at these classes and represent each of the courses as a node inside of a graph.
And what we'll do is create an edge between two nodes in this graph if there is a constraint between those two nodes. So what does this mean? Well, we can start with student 1, who's enrolled in courses A, B, and C. What that means is that A and B can't have an exam at the same time, A and C can't have an exam at the same time, and B and C also can't have an exam at the same time. And I can represent that in this graph by just drawing edges: one edge between A and B, one between B and C, and then one between C and A. And that encodes the idea that between those nodes there is a constraint. In particular, the constraint happens to be that these two can't be equal to each other, though there are other types of constraints that are possible, depending on the type of problem you're trying to solve. And then we can do the same thing for each of the other students. So for student 2, who's enrolled in courses B, D, and E: B, D, and E all need edges that connect each other as well. Student 3 is enrolled in courses C, E, and F, so we'll go ahead and take C, E, and F and connect those by drawing edges between them too. And then finally, student 4 is enrolled in courses E, F, and G, and we can represent that by drawing edges between E, F, and G, although E and F already had an edge between them. We don't need another one, because this constraint is just encoding the idea that course E and course F cannot have an exam on the same day. So this, then, is what we might call the constraint graph: a graphical representation of all of my variables, so to speak, and the constraints between those variables, where in this particular case each of the constraints represents an inequality constraint. An edge between B and D means that whatever value the variable B takes on cannot be the value that the variable D takes on as well. So what, then, actually is a constraint satisfaction problem? Well, a constraint satisfaction problem is just some set of variables, x1 all the way through xn; some set of domains for each of those variables, since every variable needs to take on some values (maybe every variable has the same domain, but maybe each variable has a slightly different domain); and then a set of constraints, which we'll just call a set C, that are placed upon these variables, like x1 is not equal to x2. But there could be other forms too, like maybe x1 equals x2 plus 1, if these variables are taking on numerical values in their domain, for example. The types of constraints are going to vary based on the types of problems. And constraint satisfaction shows up all over the place, in any situation where we have variables that are subject to particular constraints. One popular game is Sudoku, for example: the 9 by 9 grid where you need to fill in numbers in each of the cells, but you want to make sure there's never a duplicate number in any row, or in any column, or in any 3 by 3 grid of cells. So what might this look like as a constraint satisfaction problem? Well, my variables are all of the empty squares in the puzzle, represented here as (x, y) coordinates, for example: all of the squares where I need to plug in a value, where I don't know what value it should take on. The domain is just going to be all of the numbers from 1 through 9, any value that I could fill into one of these cells. So that is going to be the domain for each of these variables.
And then the constraints are going to be of the form: this cell can't be equal to this cell, can't be equal to this cell, and all of these need to be different, for example, and the same for all of the rows, and the columns, and the 3 by 3 squares as well. So those constraints are going to enforce what values are actually allowed. And we can formulate the same idea in the case of this exam scheduling problem, where the variables we have are the different courses, A up through G. The domain for each of these variables is going to be Monday, Tuesday, and Wednesday. Those are the possible values each of the variables can take on, which in this case just represent when the exam for that class is. And then the constraints are of this form: A is not equal to B, A is not equal to C, meaning A and B can't have an exam on the same day, and A and C can't have an exam on the same day. Or, more formally, these two variables cannot take on the same value within their domain. So that, then, is the formulation of a constraint satisfaction problem that we can begin to use to try and solve this problem. And constraints can come in a number of different forms. There are hard constraints, which are constraints that must be satisfied for a correct solution. So something like in the Sudoku puzzle: you cannot have this cell and this cell, which are in the same row, take on the same value. That is a hard constraint. But problems can also have soft constraints, which are constraints that express some notion of preference: maybe A and B can't have an exam on the same day, but someone has a preference that A's exam be earlier than B's exam. It doesn't need to be that way; it's some expression of the idea that one solution is better than another solution. And in that case, you might formulate the problem as trying to optimize for maximizing people's preferences: you want people's preferences to be satisfied as much as possible. In this case, though, we'll mostly just deal with hard constraints, constraints that must be met in order to have a correct solution to the problem. So we want to figure out some assignment of these variables to particular values that is ultimately going to give us a solution to the problem, by allowing us to assign some day to each of the classes such that we don't have any conflicts between classes. It turns out that we can classify the constraints in a constraint satisfaction problem into a number of different categories. The first of those categories, perhaps the simplest type of constraint, is known as a unary constraint: a constraint that involves just a single variable. For example, a unary constraint might be something like A does not equal Monday, meaning course A cannot have its exam on Monday. If for some reason the instructor for the course isn't available on Monday, you might have a constraint in your problem that looks like this, something that just has a single variable A in it. Maybe it says A is not equal to Monday, or A is equal to something, or, in the case of numbers, greater than or less than something. A constraint that just has one variable, we consider to be a unary constraint. And this is in contrast to something like a binary constraint, which is a constraint that involves two variables. So this would be a constraint like the ones we were looking at before.
Something like A does not equal B is an example of a binary constraint, because it is a constraint that has two variables involved in it, A and B. And we represented that using some arc, some edge, that connects variable A to variable B. And using this knowledge of what a unary constraint is and what a binary constraint is, there are different things we can say about a particular constraint satisfaction problem. One thing we can say is that we can try to make the problem node consistent. So what does node consistency mean? Node consistency means that all of the values in a variable's domain satisfy that variable's unary constraints. So if, for each of the variables inside of our constraint satisfaction problem, all of the values satisfy the unary constraints for that particular variable, we can say that the entire problem is node consistent. Or we can even say that a particular variable is node consistent, if we just want to make one node consistent within itself. So what does that actually look like? Let's look now at a simplified example, where instead of having a whole bunch of different classes, we just have two classes, A and B, each of which has an exam on either Monday or Tuesday or Wednesday. So this is the domain for the variable A, and this is the domain for the variable B. And now let's imagine we have these constraints: A not equal to Monday, B not equal to Tuesday, B not equal to Monday, A not equal to B. So those are the constraints that we have on this particular problem. And what we can now try to do is enforce node consistency, which just means making sure that all of the values in any variable's domain satisfy its unary constraints. And so we could start by trying to make node A node consistent. Is it consistent? Does every value inside of A's domain satisfy its unary constraints? Well, initially we'll see that Monday does not satisfy A's unary constraints, because we have a unary constraint here that A is not equal to Monday, but Monday is still in A's domain. And so this is something that is not node consistent, because we have Monday in the domain, but this is not a valid value for this particular node. So how do we make this node consistent? Well, to make the node consistent, we'll just go ahead and remove Monday from A's domain. Now A can only be on Tuesday or Wednesday, because we had this constraint that said A is not equal to Monday. And at this point, A is node consistent. For each of the values that A can take on, Tuesday and Wednesday, there is no unary constraint that conflicts with that idea. There is no constraint that says that A can't be Tuesday, and there is no unary constraint that says that A cannot be on Wednesday. And so now we can turn our attention to B. B also has a domain: Monday, Tuesday, and Wednesday. And we can begin to see whether those values satisfy the unary constraints as well. Well, here is a unary constraint: B is not equal to Tuesday. And that does not appear to be satisfied by this domain of Monday, Tuesday, and Wednesday, because Tuesday, this possible value that the variable B could take on, is not consistent with the unary constraint that B is not equal to Tuesday. So to solve that problem, we'll go ahead and remove Tuesday from B's domain. Now B's domain only contains Monday and Wednesday. But as it turns out, there's yet another unary constraint that we placed on the variable B, which is here: B is not equal to Monday.
And that means that this value, Monday, inside of B's domain, is not consistent with B's unary constraints, because we have a constraint that says B cannot be Monday. And so we can remove Monday from B's domain. And now we've made it through all of the unary constraints. We've not yet considered this constraint, which is a binary constraint, but we've considered all of the unary constraints, all of the constraints that involve just a single variable, and we've made sure that every node is consistent with those unary constraints. So we can say that we have now enforced node consistency: for each of these nodes, we can pick any of the values in its domain, and there won't be a unary constraint that is violated as a result. So node consistency is fairly easy to enforce: we just take each node and make sure the values in its domain satisfy its unary constraints. Where things get a little bit more interesting is when we consider different types of consistency, something like arc consistency. Arc consistency refers to when all of the values in a variable's domain satisfy that variable's binary constraints. So when we're looking at trying to make A arc consistent, we're no longer just considering the unary constraints that involve A; we're considering all of the binary constraints that involve A as well: any edge that connects A to another variable inside of the constraint graph we were looking at before. Put a little bit more formally (and arc really is just another word for an edge that connects two of these nodes inside of our constraint graph), we can define arc consistency like this: to make some variable x arc consistent with respect to some other variable y, we need to remove any element from x's domain such that every choice for x, every choice in x's domain, has a possible choice for y. Put another way, if I have a variable x and I want to make x arc consistent, then I'm going to look at all of the possible values that x can take on and make sure that, for all of those possible values, there is still some choice I can make for y, if there's some arc between x and y, so that y has a possible option I can choose as well. So let's look at an example of that, going back to the example from before. We enforced node consistency already by saying that A can only be on Tuesday or Wednesday, because we knew that A could not be on Monday. And we also said that B's domain only consists of Wednesday, because we know that B does not equal Tuesday and also that B does not equal Monday. So now let's begin to consider arc consistency. Let's try to make A arc consistent with B. And what that means is that for any choice we make in A's domain, there is some choice we can make in B's domain that is going to be consistent. And we can try that. For A, we can choose Tuesday as a possible value. If I choose Tuesday for A, is there a value for B that satisfies the binary constraint? Well, yes: choosing Wednesday for B would satisfy the constraint that A does not equal B, because Tuesday does not equal Wednesday. However, if we chose Wednesday for A, then there is no choice in B's domain that satisfies the binary constraint. There is no way I can choose something for B that satisfies A does not equal B, because I know B must be Wednesday.
And so if I ever run into a situation like this, where I see that here is a possible value for A such that there is no choice of value for B that satisfies the binary constraint, then this is not arc consistent. And to make it arc consistent, I would need to take Wednesday and remove it from A's domain, because Wednesday was not going to be a possible choice I could make for A: it wasn't consistent with this binary constraint with B. There was no way I could choose Wednesday for A and still have an available solution by choosing something for B as well. So here now, I've been able to enforce arc consistency, and in doing so, I've actually solved this entire problem: given these constraints, where A and B can have exams on either Monday or Tuesday or Wednesday, the only solution, as it would appear, is that A's exam must be on Tuesday and B's exam must be on Wednesday. That is the only option available to me. So if we want to apply arc consistency to a larger graph, not just looking at one particular pair, there are ways we can do that too. And we can begin to formalize what the pseudocode would look like for an algorithm that enforces arc consistency. We'll start by defining a function called revise. Revise is going to take as input a csp, otherwise known as a constraint satisfaction problem, and also two variables, x and y. And what revise is going to do is make x arc consistent with respect to y, meaning remove anything from x's domain that doesn't allow for a possible option for y. How does this work? Well, we'll first keep track of whether or not we've made a revision. Revise is ultimately going to return true or false: it'll return true in the event that we did make a revision to x's domain, and it'll return false if we didn't make any change to x's domain. We'll see in a moment why that's going to be helpful. But we start by saying revised equals false; we haven't made any changes. Then we'll say, all right, let's loop over all of the possible values in x's domain, over each value little x in x's domain. I want to make sure that for each of those choices, I have some available choice in y that satisfies the binary constraints defined inside of my csp, my constraint satisfaction problem. So if it's ever the case that there is no value little y in y's domain that satisfies the constraint for x and y, then that means this value little x shouldn't be in x's domain. So we'll go ahead and delete little x from x's domain, and I'll set revised equal to true, because I did change x's domain by removing little x. And I removed little x because it wasn't arc consistent: there was no way I could choose a value for y that would satisfy this x-y constraint. So in this case, we set revised equal to true. And we'll do this again and again for every value in x's domain. Sometimes a value might be fine; in other cases, it might not allow for a possible choice for y, in which case we need to remove it from x's domain. And at the end, we just return revised, to indicate whether or not we actually made a change. So this revise function is effectively an implementation of what you saw me do graphically a moment ago: it makes one variable, x, arc consistent with another variable, in this case, y.
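In Python, that might look roughly like this. (A sketch under an assumed interface: csp.domains maps each variable to a set of values, and csp.constraints[x, y] is a function testing whether a pair of values satisfies the constraint between x and y. Neither is a standard library object.)

```python
def revise(csp, x, y):
    """Make variable x arc consistent with respect to variable y.

    Removes any value from x's domain that leaves no possible choice
    for y; returns True if x's domain was changed, False otherwise.
    """
    revised = False
    for value_x in set(csp.domains[x]):  # iterate over a copy so we can remove
        # Keep value_x only if some value for y satisfies the (x, y) constraint.
        if not any(csp.constraints[x, y](value_x, value_y)
                   for value_y in csp.domains[y]):
            csp.domains[x].remove(value_x)
            revised = True
    return revised
```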
But generally speaking, when we want to enforce arc consistency, we'll often want to enforce it not just for a single arc, but for the entire constraint satisfaction problem. And it turns out there's an algorithm to do that as well, known as AC3. AC3 takes a constraint satisfaction problem and enforces arc consistency across the entire problem. How does it do that? Well, it's going to maintain a queue, basically just a line, of all of the arcs that it needs to make consistent. And over time, we might remove things from that queue as we begin enforcing arc consistency, and we might need to add things to that queue as well, if there are more things we need to make arc consistent. So we'll start with a queue that contains all of the arcs in the constraint satisfaction problem, all of the edges that connect two nodes that have some sort of binary constraint between them. And now, as long as the queue is non-empty, there is work to be done; the queue holds all of the things that we still need to make arc consistent. So what do we have to do? Well, we'll start by dequeuing from the queue, removing something from it. (Strictly speaking, it doesn't need to be a queue, but a queue is the traditional way of doing this.) Dequeuing gives us an arc: two variables x and y, where I would like to make x arc consistent with y. How do we do that? Well, we can just use the revise function we talked about a moment ago. We call revise, passing as input the constraint satisfaction problem and also these variables x and y, because I want to make x arc consistent with y; in other words, remove any values from x's domain that don't leave an available option for y. And recall what revise returns: it returns true if we actually made a change, if we removed something from x's domain because there wasn't an available option for y, and it returns false if we didn't make any change to x's domain at all. And it turns out that if revise returns false, if we didn't make any changes, then there's not a whole lot more work to be done for this arc; we can just move ahead to the next arc in the queue. But if we did make a change, if we did reduce x's domain by removing values from it, then what we might realize is that this creates potential problems later on: some arc that was arc consistent with x might no longer be arc consistent, because while there used to be an option we could choose for x, now there might not be, since we might have removed something from x's domain that was necessary for some other arc to be arc consistent. And so if we ever did revise x's domain, we're going to need to add some additional arcs to the queue to check. How do we do that? Well, the first thing we want to check is that the size of x's domain is not 0. If x's domain is empty, that means there are no available options for x at all, and that means there's no way to solve the constraint satisfaction problem. If we've removed everything from x's domain, we'll just return false here, to indicate there's no way to solve the problem, because there's nothing left in x's domain.
But otherwise, if there are things left in x's domain, but fewer things than before, then what we'll do is loop over each variable z in all of x's neighbors, except for y, which we already handled. We'll consider all of x's other neighbors and ask ourselves: might the arc from each of those z's to x no longer be arc consistent? Because while for each z there might have been a possible option we could choose for x to correspond with each of z's possible values, now there might not be, because we removed some elements from x's domain. And so what we'll do is enqueue, add to the queue, the arc (z, x) for all of those neighbors z. So we need to add some arcs back to the queue in order to continue to enforce arc consistency. At the very end, if we make it through this whole process, then we can return true. This, then, is AC3, the algorithm for enforcing arc consistency on a constraint satisfaction problem. And the big idea is really just: keep track of all of the arcs that we might need to make arc consistent, make each one arc consistent by calling the revise function, and if we did revise it, add the arcs that might be affected back to the queue, to make sure that everything is still arc consistent even after we've removed some of the elements from a particular variable's domain.
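Continuing the sketch from the revise function above (same assumed csp interface, here with hypothetical csp.arcs() and csp.neighbors() helpers that return the constrained pairs and the set of variables sharing a constraint with a given variable):

```python
from collections import deque

def ac3(csp):
    """Enforce arc consistency across the whole problem.

    Returns False if some variable's domain is emptied (the problem
    is unsolvable); returns True otherwise.
    """
    # Start with every arc (x, y) that has a binary constraint.
    queue = deque(csp.arcs())
    while queue:
        x, y = queue.popleft()
        if revise(csp, x, y):
            if len(csp.domains[x]) == 0:
                return False  # nothing left for x: no solution possible
            # x's domain shrank, so arcs pointing into x must be rechecked.
            for z in csp.neighbors(x) - {y}:
                queue.append((z, x))
    return True
```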
So what then would happen if we tried to enforce arc consistency on a graph like this, a graph where each of these variables has a domain of Monday, Tuesday, and Wednesday? Well, it turns out that while enforcing arc consistency can solve some types of problems, here nothing actually changes. For any particular arc, just considering two variables, there's always a way for me, for any of the choices I make for one of them, to make a choice for the other one, because there are three options, and I just need the two to be different from each other. So it's actually quite easy to take any single arc and declare that it is arc consistent, because if I pick Monday for D, then I just pick something that isn't Monday for B. In arc consistency, we only consider a binary constraint between two nodes; we're not really considering all of the rest of the nodes yet. So just using AC3, the enforcement of arc consistency, can sometimes have the effect of reducing domains to make it easier to find solutions, but it will not always actually solve the problem. We might still need to somehow search to find a solution, and we can use classical, traditional search algorithms to try to do so. You'll recall that a search problem generally consists of these parts: some initial state, some actions, a transition model that takes me from one state to another state, a goal test to tell me whether I've satisfied my objective correctly, and then some path cost function, because in the case of maze solving, I was trying to get to my goal as quickly as possible. So you could formulate a CSP, a constraint satisfaction problem, as one of these types of search problems. The initial state will just be an empty assignment, where an assignment is just a way for me to assign any particular variable to any particular value; an empty assignment has no variables assigned to values yet. Then the action I can take is adding some new variable equals value pair to that assignment, saying: for this assignment, let me add a new value for this variable. And the transition model just defines what happens when you take that action: you get a new assignment that has that variable equal to that value inside of it. The goal test is just checking to make sure all the variables have been assigned and all the constraints have been satisfied. And the path cost function is sort of irrelevant: I don't really care about what the path is; I just care about finding some assignment that actually satisfies all of the constraints. So really, all the paths have the same cost. I don't care about the path to the goal; I just care about the solution itself, much as we've talked about before. The problem, though, is that if we just implement this naive search algorithm, just by implementing breadth-first search or depth-first search, it's going to be very, very inefficient. And there are ways we can take advantage of efficiencies in the structure of a constraint satisfaction problem itself. One of the key ideas is that the order in which we assign variables doesn't matter: the assignment a equals 2 and then b equals 8 is identical to the assignment b equals 8 and then a equals 2. Switching the order doesn't change anything about the fundamental nature of that assignment. And so there are some ways we can revise this idea of a search algorithm to apply it specifically to a problem like a constraint satisfaction problem. It turns out the search algorithm we'll generally use when talking about constraint satisfaction problems is something known as backtracking search. And the big idea of backtracking search is that we'll make assignments from variables to values, and if we ever get stuck, if we arrive at a place where there is no way we can make any forward progress while still preserving the constraints that we need to enforce, we'll go ahead and backtrack and try something else instead. So the very basic sketch of backtracking search looks like this: a function called backtrack that takes as input an assignment and a constraint satisfaction problem. Initially, we don't have any assigned variables, so when we begin backtracking search, this assignment is just going to be the empty assignment, with no variables inside of it. But we'll see later that this is going to be a recursive function. So backtrack takes as input the assignment and the problem. If the assignment is complete, meaning all of the variables have been assigned, we just return that assignment. That, of course, won't be true initially, because we start with an empty assignment, but over time we might add things to that assignment. So if the assignment ever actually is complete, then we're done; just go ahead and return that assignment. But otherwise, there is some work to be done. We'll need to select an unassigned variable for this particular problem: take the problem, look at the variables that have already been assigned, and pick a variable that has not yet been assigned. Then I need to consider all of the values in that variable's domain. So we'll call this domain values function (we'll talk a little more about that later), which takes a variable and gives me back an ordered list of all of the values in its domain. So I've taken some unassigned variable, and I'm going to loop over all of its possible values.
And the idea is: let me just try all of these values as possible values for the variable. If the value is consistent with the assignment so far, if it doesn't violate any of the constraints, then let's go ahead and add variable equals value to the assignment, because it's consistent so far. And now let's recursively call backtrack to try and make the rest of the assignments consistent as well. So I'll call backtrack on this new assignment that I've added variable equals value to, and see what the result is. And if the result isn't a failure, then let me just return that result. Otherwise, what else could happen? Well, if the result was a failure, that means this value was probably a bad choice for this particular variable, because when I assigned this variable that value, eventually, down the road, I ran into a situation where I violated constraints and there was nothing more I could do. So now I'll remove variable equals value from the assignment, effectively backtracking, to say: all right, that value didn't work, let's try another value instead. And then at the very end, if we were never able to return a complete assignment, we'll just return failure, because that means none of the values worked for this particular variable. This, then, is the idea of backtracking search: take each of the variables, try values for them, and recursively call backtracking search to see if we can make progress. And if we ever run into a dead end, a situation where there is no possible value we can choose that satisfies the constraints, we return failure, and that propagates up, and eventually we make a different choice by going back and trying something else instead. So let's put this algorithm into practice. Let's actually try to use backtracking search to solve this problem: I need to figure out how to assign each of these courses to an exam slot on Monday or Tuesday or Wednesday, in such a way that it satisfies these constraints, where each of these edges means those two classes cannot have an exam on the same day. So I can start at a node. It doesn't really matter which I start with, but in this case, I'll just start with A, and I'll loop over the values in its domain. We'll just go in order: Monday, Tuesday, Wednesday. So in this case, I'll start with Monday and say, all right, let's go ahead and assign A to Monday. And now let's consider node B. I've made an assignment to A, so I recursively call backtrack with this new partial assignment, and now I'm looking to pick another unassigned variable, like B. And I'll say, all right, maybe I'll start with Monday, because that's the very first value in B's domain. And I ask: does Monday violate any constraints? It turns out, yes, it does. It violates this constraint here between A and B, because A and B are now both on Monday, and that doesn't work, because B can't be on the same day as A. So that doesn't work. We might instead try Tuesday, the next value in B's domain. Is that consistent with the assignment so far? Well, yeah: B Tuesday, A Monday, that is consistent so far, because they're not on the same day. So that's good. Now we can recursively call backtrack, try again, pick another unassigned variable, something like D, and go through its possible values. Is Monday consistent with this assignment?
Well, yes, it is: B and D are on different days, Monday versus Tuesday, and A and B are also on different days, Monday versus Tuesday. So that's fine so far, too. We'll go ahead and try again. Maybe we'll go to this variable here, E, and ask: can we make that consistent? Let's go through the possible values. We've recursively called backtrack, and we might start with Monday and say, all right, that's not consistent, because D and E would now have exams on the same day. So we might try Tuesday instead. Is that consistent? Well, no, it's not, because B and E would have exams on the same day. And so we ask: is Wednesday consistent? And it turns out, yes, it is. Wednesday is consistent, because D and E now have exams on different days, and B and E now have exams on different days. All seems to be well so far. I recursively call backtrack, select another unassigned variable, say C this time, and try the values that C could take on. Let's start with Monday. It turns out that's not consistent, because A and C would both have exams on the same day. So I try Tuesday, and that's not consistent either, because B and C would have exams on the same day. And then I say, all right, let's try Wednesday. But that's not consistent either, because C and E would have exams on the same day too. So now we've gone through all the possible values for C, Monday, Tuesday, and Wednesday, and none of them is consistent. There is no way we can have a consistent assignment. Backtrack, in this case, will return a failure. And so then we'd say, all right, we have to backtrack back to here. Well, now for E, we've tried all of Monday, Tuesday, and Wednesday, and none of those work, because Wednesday, which seemed to work, turned out to be a failure. So that means there's no possible way we can assign E; that's a failure too. We have to go back up to D, which means that the Monday assignment to D must be wrong; we must try something else. So we can try: what if, instead of Monday, we try Tuesday? Tuesday, it turns out, is not consistent, because B and D would now have an exam on the same day. But Wednesday, as it turns out, works. And now we can begin to make forward progress again. We go back to E and ask which of these values works. Monday turns out to work, by not violating any constraints. Then we go up to C now. Monday doesn't work, because it violates a constraint (it violates two, actually). Tuesday doesn't work, because it violates a constraint as well. But Wednesday does work. Then we can go to the next variable, F, and ask: does Monday work? Well, no, it violates a constraint. But Tuesday does work. And then finally, we can look at the last variable, G, recursively calling backtrack one more time. Monday is inconsistent; that violates a constraint. Tuesday also violates a constraint. But Wednesday doesn't violate a constraint. And so now, at this point, we recursively call backtrack one last time, and we have a satisfactory assignment of all of the variables. At this point, we can say that we are done. We have been able to successfully assign a value to each one of these variables in such a way that we're not violating any constraints. We're going to have classes A and E take their exams on Monday, classes B and F take their exams on Tuesday, and classes C, D, and G take their exams on Wednesday.
And there are no violated constraints that come up there. So that, then, was a graphical look at how this might work. Let's now take a look at some code we could use to actually solve this problem as well. So here I'll go ahead and go into the scheduling directory, and we'll start by looking at schedule0.py. Here, I define a list of variables, A, B, C, D, E, F, G; those are all the different classes. Then, underneath that, I define my list of constraints. So constraint A and B: that is a constraint, because they can't be on the same day. Likewise A and C, B and C, so on and so forth, enforcing those exact same constraints. And here, then, is what the backtracking function might look like. First, if the assignment is complete, if I've made an assignment of every variable to a value, go ahead and just return that assignment. Then we'll select an unassigned variable, one not yet in that assignment. Then, for each of the possible values in the domain, Monday, Tuesday, Wednesday, let's create a new assignment that assigns the variable to that value. I'll call this consistent function, which I'll show you in a moment, that just checks to make sure this new assignment is consistent. If it is consistent, we'll go ahead and call backtrack to continue trying to run the backtracking search. And as long as the result is not None, meaning it wasn't a failure, we can go ahead and return that result. But if we make it through all the values and nothing works, then it is a failure; there's no solution, and we return None. What do these helper functions do? Select unassigned variable is just going to choose a variable not yet assigned: it loops over all the variables, and if one is not already assigned, it returns that variable. And what does the consistent function do? Well, the consistent function goes through all the constraints, and if we have a situation where we've assigned values to both variables in a constraint but those values are the same, then that is a violation of the constraint, in which case we return False. But if nothing is inconsistent, then the assignment is consistent, and we return True. And then all the program does is call backtrack on an empty assignment, an empty dictionary with no variables or values yet, save the result inside a solution variable, and print out that solution. So by running python schedule0.py, I get as a result an assignment of all these variables to values, and it turns out we assign A to Monday, as we would expect, B to Tuesday, C to Wednesday: exactly the same type of thing we were talking about before, an assignment of each of these variables to values that doesn't violate any constraints. Now, I had to do a fair amount of work to implement this idea myself. I had to write the backtrack function that went through this process of recursively performing the backtracking search.
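Reconstructed from that description, schedule0.py might look something like this (a sketch; the actual file may differ in small details):

```python
VARIABLES = ["A", "B", "C", "D", "E", "F", "G"]
CONSTRAINTS = [
    ("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("B", "E"),
    ("C", "E"), ("C", "F"), ("D", "E"), ("E", "F"), ("E", "G"), ("F", "G"),
]

def backtrack(assignment):
    """Run backtracking search to find a complete, consistent assignment."""
    if len(assignment) == len(VARIABLES):
        return assignment
    var = select_unassigned_variable(assignment)
    for value in ["Monday", "Tuesday", "Wednesday"]:
        new_assignment = assignment.copy()
        new_assignment[var] = value
        if consistent(new_assignment):
            result = backtrack(new_assignment)
            if result is not None:
                return result
    return None  # no value worked for this variable

def select_unassigned_variable(assignment):
    """Choose a variable not yet in the assignment."""
    for variable in VARIABLES:
        if variable not in assignment:
            return variable

def consistent(assignment):
    """Check that no two constrained variables share a value."""
    for x, y in CONSTRAINTS:
        if x in assignment and y in assignment and assignment[x] == assignment[y]:
            return False
    return True

solution = backtrack(dict())
print(solution)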
But it turns out that constraint satisfaction problems are so popular that there exist many libraries that already implement this type of idea. Again, as with before, the specific library is not as important as the fact that such libraries exist. This is just one example of a Python constraint library, and inside of schedule1.py, rather than doing all the work from scratch, I'm just taking advantage of a library that implements a lot of these ideas already. So here, I create a new problem and add variables to it with particular domains. I add a whole bunch of these individual constraints, where I call addConstraint and pass in a function describing what the constraint is. And the constraint is basically a function that takes two variables, x and y, and makes sure that x is not equal to y, enforcing the idea that these two classes cannot have exams on the same day. And then, for any constraint satisfaction problem, I can call getSolutions to get all the solutions to that problem, and for each of those solutions, print out what that solution happens to be. And if I run python schedule1.py, I now see that there are actually a number of different solutions that can be used to solve the problem. There are, in fact, six different solutions, assignments of variables to values that will give me a satisfactory answer to this constraint satisfaction problem. So this, then, was an implementation of a very basic backtracking search method, where really we just went through each of the variables, picked one that wasn't assigned, and tried the possible values the variable could take on. And then, if it worked, if it didn't violate any constraints, we kept trying other variables. And if we ever hit a dead end, we had to backtrack. But ultimately, we might be able to be a little bit more intelligent about how we do this, in order to improve the efficiency of how we solve these sorts of problems. And one thing we might imagine trying to do is go back to this idea of inference, using the knowledge we have to draw conclusions, in order to make the rest of the problem-solving process a little bit easier. Let's now go back to where we got stuck in this problem the first time. When we were solving this constraint satisfaction problem, we dealt with B, and then we went on to D, and we went ahead and just assigned D to Monday, because that seemed to work with the assignment so far; it didn't violate any constraints. But it turned out that later on, that choice was a bad one, that it wasn't consistent with the rest of the values the variables could take on. And the question is: is there anything we could do to avoid getting into a situation like this, to avoid going down a path that's ultimately not going to lead anywhere, by taking advantage of knowledge that we have initially? And it turns out we do have that kind of knowledge. We can look at just the structure of this graph so far, and we can say that right now C's domain, for example, contains the values Monday, Tuesday, and Wednesday. And based on those values, we can say that this graph is not arc consistent. Recall that arc consistency is all about making sure that for every possible value for a particular node, there is some other value that we are able to choose. And as we can see here, Monday and Tuesday are not going to be possible values for C. They're not going to be consistent with a node like B, for example, because B is equal to Tuesday, which means that C cannot be Tuesday; and because A is equal to Monday, C also cannot be Monday. So using that information, by making C arc consistent with A and B, we can remove Monday and Tuesday from C's domain and just leave C with Wednesday. And if we continued to try and enforce arc consistency, we'd see there are some other conclusions we can draw as well. We see that B's only option is Tuesday and C's only option is Wednesday. And so if we want to make E arc consistent, well, E can't be Tuesday, because that wouldn't be arc consistent with B.
And E can't be Wednesday, because that wouldn't be arc consistent with C. So we can go ahead and set E equal to Monday, for example. And then we can begin to do this process again and again: in order to make D arc consistent with B and E, D would have to be Wednesday; that's the only possible option. And likewise, we can make the same judgments for F and G as well. And it turns out that without having to do any additional search, just by enforcing arc consistency, we were able to figure out what the assignment of all the variables should be, without needing to backtrack at all. And the way we did that was by interleaving the search process and the inference step, this step of enforcing arc consistency. The algorithm to do this is often just called the maintaining arc consistency algorithm, which enforces arc consistency every time we make a new assignment of a value to a variable. So sometimes we can enforce arc consistency using that AC3 algorithm at the very beginning of the problem, before we even begin searching, in order to limit the domains of the variables and make it easier to search. But we can also take advantage of interleaving the enforcement of arc consistency with search, such that every time in the search process we make a new assignment, we go ahead and enforce arc consistency as well, to make sure that we're eliminating possible values from domains whenever possible. And how do we do this? Well, this is really equivalent to saying: every time we make a new assignment to a variable x, we'll go ahead and call our AC3 algorithm, this algorithm that enforces arc consistency on a constraint satisfaction problem, but we start it with a queue, not of all of the arcs, which we did originally, but just of all of the arcs that we want to make arc consistent with x, this variable we have just made an assignment to: all arcs (y, x), where y is a neighbor of x, something that shares a constraint with x. And by maintaining arc consistency in the backtracking search process, we can ultimately make our search process a little bit more efficient. And so this is the revised version of the backtrack function. It's the same as before; the changes here are highlighted in yellow. Every time we add a new variable equals value to our assignment, we'll go ahead and run this inference procedure, which might do a number of different things. One thing it could do is call the maintaining arc consistency algorithm, to make sure we're able to enforce arc consistency on the problem. And we might be able to draw new inferences as a result of that process, get new guarantees that this variable needs to be equal to that value, for example. That might happen one time; it might happen many times. And as long as those inferences are not a failure, as long as they don't lead to a situation where there is no possible way to make forward progress, we can go ahead and add those inferences, that new knowledge, those new pieces of knowledge about what variables should be assigned to what values, to the assignment, in order to more quickly make forward progress by taking advantage of information that I can just deduce, information I know based on the rest of the structure of the constraint satisfaction problem.
And the only other change I’ll need to make now is, if it turns out this value doesn’t work, well, then down here, I’ll need to remove not only variable equals value, but also any of those inferences that I made, from the assignment as well. So here, then, we’re often able to solve the problem by backtracking less than we might originally have needed to, just by taking advantage of the fact that every time we make a new assignment of one variable to one value, that might reduce the domains of other variables as well. And we can use that information to begin to more quickly draw conclusions in order to try and solve the problem more efficiently as well. And it turns out there are other heuristics we can use to try and improve the efficiency of our search process as well. And it really boils down to a couple of these functions that I’ve talked about, but we haven’t really talked about how they’re working. And one of them is this function here, select unassigned variable, where we’re selecting some variable in the constraint satisfaction problem that has not yet been assigned. So far, I’ve sort of just been selecting variables randomly, just picking some unassigned variable in order to decide, all right, this is the variable that we’re going to assign next, and then going from there. But it turns out that by being a little bit intelligent, by following certain heuristics, we might be able to make the search process much more efficient just by choosing very carefully which variable we should explore next. So some of those heuristics include the minimum remaining values, or MRV, heuristic, which generally says that if I have a choice between which variable I should select, I should select the variable with the smallest domain, the variable that has the fewest remaining values left. With the idea being, if there are only two remaining values left, well, I may as well prune one of them very quickly in order to get to the other, because one of those two has got to be the solution, if a solution does exist. Sometimes minimum remaining values might not give a conclusive result if all the nodes have the same number of remaining values, for example. And in that case, another heuristic that can be helpful to look at is the degree heuristic. The degree of a node is the number of nodes that are attached to that node, the number of nodes that are constrained by that particular node. And if you imagine which variable should I choose, should I choose a variable that has a high degree, that is connected to a lot of different things, or a variable with a low degree, that is not connected to a lot of different things, well, it can often make sense to choose the variable that has the highest degree, that is connected to the most other nodes, as the thing you would search first. Why is that the case? Well, it’s because by choosing a variable with a high degree, that is immediately going to constrain the rest of the variables more, and it’s more likely to be able to eliminate large sections of the state space that you don’t need to search through at all. So what could this actually look like? Let’s go back to this search problem here. In this particular case, I’ve made an assignment here. I’ve made an assignment here. And the question is, what should I look at next? And according to the minimum remaining values heuristic, what I should choose is the variable that has the fewest remaining possible values.
And in this case, that’s this node here, node C, that only has one value left in its domain, which in this case is Wednesday. That’s a very reasonable choice of a next assignment to make, because I know it’s the only option, for example. I know that the only possible option for C is Wednesday, so I may as well make that assignment and then potentially explore the rest of the space after that. But meanwhile, at the very start of the problem, when I didn’t have any knowledge of what nodes should have what values yet, I still had to pick what node should be the first one that I try and assign a value to. And I arbitrarily just chose the one at the top, node A, originally. But we can be more intelligent about that. We can look at this particular graph. All of them have domains of the same size, domain of size 3. So minimum remaining values doesn’t really help us there. But we might notice that node E has the highest degree. It is connected to the most things. And so perhaps it makes sense to begin our search, rather than starting at node A at the very top, with the node that has the highest degree. Start by searching from node E, because from there, that’s going to much more easily allow us to enforce the constraints that are nearby, eliminating large portions of the search space that I might not need to search through. And in fact, by starting with E, we can immediately then assign other variables. And following that, we can actually assign the rest of the variables without needing to do any backtracking at all, even if I’m not using this inference procedure. Just by starting with a node that has a high degree, that is going to very quickly restrict the possible values that other nodes can take on. So that then is how we can go about selecting an unassigned variable in a particular order. Rather than randomly picking a variable, if we’re a little bit intelligent about how we choose it, we can make our search process much, much more efficient by making sure we don’t have to search through portions of the search space that ultimately aren’t going to matter. The other function we haven’t really talked about is this domain values function, a function that takes a variable and gives me back a sequence of all of the values inside of that variable’s domain. The naive way to approach it is what we did before, which is just go in order: go Monday, then Tuesday, then Wednesday. But the problem is that going in that order might not be the most efficient order to search in, that sometimes it might be more efficient to choose values that are likely to be solutions first and then go to other values. Now, how do you assess whether a value is likelier to lead to a solution or less likely to lead to a solution? Well, one thing you can take a look at is how many constraints get added, how many things get removed from domains, as you make this new assignment of a variable to this particular value. And the heuristic we can use here is the least constraining value heuristic, which is the idea that we should return values in order based on the number of choices that are ruled out for neighboring variables. And I want to start with the least constraining value, the value that rules out the fewest possible options.
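Here is a minimal sketch of how these two pieces might look in Python, under the same hypothetical `csp` object as the backtrack sketch above; the value counting in `order_domain_values` assumes not-equal constraints like the exam scheduling example:

```python
def select_unassigned_variable(assignment, csp):
    """Pick the next variable by minimum remaining values, breaking ties by degree."""
    unassigned = [v for v in csp.variables if v not in assignment]
    return min(
        unassigned,
        key=lambda var: (
            len(csp.domains[var]),     # MRV: fewest remaining values first
            -len(csp.neighbors[var]),  # degree: most-connected variable first
        ),
    )

def order_domain_values(var, assignment, csp):
    """Order values by the least constraining value heuristic."""
    def rules_out(value):
        # With not-equal constraints, assigning `value` to `var` removes it
        # from the domain of every unassigned neighbor that still has it.
        return sum(
            value in csp.domains[y]
            for y in csp.neighbors[var]
            if y not in assignment
        )
    return sorted(csp.domains[var], key=rules_out)
```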
And the idea there is that if all I care about doing is finding a solution, if I start with a value that rules out a lot of other choices, I’m ruling out a lot of possibilities that maybe is going to make it less likely that this particular choice leads to a solution. Whereas on the other hand, if I have a variable and I start by choosing a value that doesn’t rule out very much, well, then I still have a lot of space where there might be a solution that I could ultimately find. And this might seem a little bit counterintuitive and a little bit at odds with what we were talking about before, where I said, when you’re picking a variable, you should pick the variable that is going to have the fewest possible values remaining. But here, I want to pick the value for the variable that is the least constraining. But the general idea is that when I am picking a variable, I would like to prune large portions of the search space by choosing a variable that is going to allow me to quickly eliminate possible options. Whereas here, within a particular variable, as I’m considering values that that variable could take on, I would like to just find a solution. And so what I want to do is ultimately choose a value that keeps the possibility of me finding a solution as likely as possible. By not ruling out many options, I leave open the possibility that I can still find a solution without needing to go back later and backtrack. So an example of that might be in this particular situation here. If I’m trying to choose a value for node C here, C is equal to either Tuesday or Wednesday. We know it can’t be Monday, because it conflicts with this domain here, where we already know that A is Monday, so C must be Tuesday or Wednesday. And the question is, should I try Tuesday first, or should I try Wednesday first? And if I try Tuesday, what gets ruled out? Well, one option gets ruled out here, a second option gets ruled out here, and a third option gets ruled out here. So choosing Tuesday would rule out three possible options. And what about choosing Wednesday? Well, choosing Wednesday would rule out one option here, and it would rule out one option there. And so I have two choices. I can choose Tuesday, which rules out three options, or Wednesday, which rules out two options. And according to the least constraining value heuristic, what I should probably do is go ahead and choose Wednesday, the one that rules out the fewest number of possible options, leaving open as many chances as possible for me to eventually find the solution inside of the state space. And ultimately, if you continue this process, we will find the solution, an assignment of variables to values that allows us to give each of these classes an exam date that doesn’t conflict with anyone that happens to be enrolled in two classes at the same time. So the big takeaway now with all of this is that there are a number of different ways we can formulate a problem. The ways we’ve looked at today are: we can formulate a problem as a local search problem, a problem where we’re looking at a current node and moving to a neighbor based on whether that neighbor is better or worse than the current node that we are looking at. We looked at formulating problems as linear programs, where just by putting things in terms of equations and constraints, we’re able to solve problems a little bit more efficiently.
And we saw formulating a problem as a constraint satisfaction problem, creating this graph of all of the constraints that connect two variables that have some constraint between them, and using that information to be able to figure out what the solution should be. And so the takeaway of all of this now is that if we have some problem in artificial intelligence that we would like to use AI to be able to solve, whether that’s trying to figure out where hospitals should be, or trying to solve the traveling salesman problem, trying to optimize production and costs and whatnot, or trying to figure out how to satisfy certain constraints, whether that’s in a Sudoku puzzle, or whether that’s in trying to figure out how to schedule exams for a university, or any number of a wide variety of types of problems, if we can formulate that problem as one of these sorts of problems, then we can use these known algorithms, these algorithms for enforcing arc consistency and backtracking search, these hill climbing and simulated annealing algorithms, these simplex algorithms and interior point algorithms that can be used to solve linear programs, that we can use those techniques to begin to solve a whole wide variety of problems, all in this world of optimization inside of artificial intelligence. This was an introduction to artificial intelligence with Python for today. We will see you next time. All right. Welcome back, everyone, to an introduction to artificial intelligence with Python. Now, so far in this class, we’ve used AI to solve a number of different problems, giving AI instructions for how to search for a solution, or how to satisfy certain constraints, in order to find its way from some input point to some output point in order to solve some sort of problem. Today, we’re going to turn to the world of learning, in particular the idea of machine learning, which generally refers to the idea where we are not going to give the computer explicit instructions for how to perform a task, but rather we are going to give the computer access to information in the form of data, with patterns that it can learn from, and let the computer try and figure out what those patterns are, try and understand that data to be able to perform a task on its own. Now, machine learning comes in a number of different forms, and it’s a very wide field. So today, we’ll explore some of the foundational algorithms and ideas that are behind a lot of the different areas within machine learning. And one of the most popular is the idea of supervised machine learning, or just supervised learning. And supervised learning is a particular type of task. It refers to the task where we give the computer access to a data set, where that data set consists of input-output pairs. And what we would like the computer to do is we would like our AI to be able to figure out some function that maps inputs to outputs. So we have a whole bunch of data that generally consists of some kind of input, some evidence, some information that the computer will have access to. And we would like the computer, based on that input information, to predict what some output is going to be. And we’ll give it some data so that the computer can train its model on it and begin to understand how it is that this information works and how it is that the inputs and outputs relate to each other. But ultimately, we hope that our computer will be able to figure out some function that, given those inputs, is able to get those outputs.
There are a couple of different tasks within supervised learning. The one we’ll focus on and start with is known as classification. And classification is the problem where, if I give you a whole bunch of inputs, you need to figure out some way to map those inputs into discrete categories, where you can decide what those categories are, and it’s the job of the computer to predict what those categories are going to be. So that might be, for example, I give you information about a bank note, like a US dollar, and I’m asking you to predict for me, does it belong to the category of authentic bank notes, or does it belong to the category of counterfeit bank notes? You need to categorize the input, and we want to train the computer to figure out some function to be able to do that calculation. Another example might be the case of weather, something we’ve talked about a little bit so far in this class, where we would like to predict, on a given day, is it going to rain on that day? Is it going to be cloudy on that day? And before, we’ve seen how we could do this if we give the computer all the exact probabilities: if these are the conditions, what’s the probability of rain? Oftentimes, we don’t have access to that information, though. But what we do have access to is a whole bunch of data. So if we wanted to be able to predict something like, is it going to rain or is it not going to rain, we would give the computer historical information about days when it was raining and days when it was not raining and ask the computer to look for patterns in that data. So what might that data look like? Well, we could structure that data in a table like this. This might be what our table looks like, where for any particular day, going back, we have information about that day’s humidity, that day’s air pressure, and then, importantly, we have a label, something where a human has said that on this particular day, it was raining or it was not raining. So you could fill in this table with a whole bunch of data. And what makes this what we would call a supervised learning exercise is that a human has gone in and labeled each of these data points, said that on this day, when these were the values for the humidity and pressure, that day was a rainy day, and this day was a not rainy day. And what we would like the computer to be able to do then is to figure out, given these inputs, given the humidity and the pressure, can the computer predict what label should be associated with that day? Does that day look more like it’s going to be a day that rains, or does it look more like a day when it’s not going to rain? Put a little bit more mathematically, you can think of this as a function that takes two inputs, the inputs being the data points that our computer will have access to, things like humidity and pressure. So we could write a function f that takes as input both humidity and pressure. And then the output is going to be what category we would ascribe to these particular input points, what label we would associate with that input. So we’ve seen a couple of example data points here, where given this value for humidity and this value for pressure, we predict, is it going to rain or is it not going to rain? And that’s information that we just gathered from the world. We measured on various different days what the humidity and pressure were. We observed whether or not we saw rain on that particular day. And this function f is what we would like to approximate.
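As a tiny illustration of the kind of labeled data set being described, with entirely made-up numbers, those input-output pairs might be represented in Python like this:

```python
# Hypothetical labeled data: each entry is ((humidity, pressure), label).
# The numbers are invented purely for illustration.
data = [
    ((0.93, 0.994), "rain"),
    ((0.49, 1.006), "no rain"),
    ((0.80, 0.998), "rain"),
    ((0.39, 1.009), "no rain"),
]

# The unknown target function f maps (humidity, pressure) to a label;
# the goal of supervised learning is to approximate it from data like this.
```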
Now, the computer and we humans don’t really know exactly how this function f works. It’s probably quite a complex function. So what we’re going to do instead is attempt to estimate it. We would like to come up with a hypothesis function, h, which is going to try to approximate what f does. We want to come up with some function h that will also take the same inputs and will also produce an output, rain or no rain. And ideally, we’d like these two functions to agree as much as possible. So the goal, then, of the supervised learning classification task is going to be to figure out, what does that function h look like? How can we begin to estimate, given all of this information, all of this data, what category or what label should be assigned to a particular data point? So where could you begin doing this? Well, a reasonable thing to do, especially in this situation where I have two numerical values, is to try to plot this on a graph that has two axes, an x-axis and a y-axis. And in this case, we’re just going to be using two numerical values as input. But these same types of ideas scale as you add more and more inputs as well. We’ll be plotting things in two dimensions. But as we’ll soon see, you could add more inputs and just imagine things in multiple dimensions. And while we humans have trouble conceptualizing anything really beyond three dimensions, at least visually, a computer has no problem with trying to imagine things in many, many more dimensions; for a computer, each dimension is just some separate number that it is keeping track of. So it wouldn’t be unreasonable for a computer to think in 10 dimensions or 100 dimensions to be able to try to solve a problem. But for now, we’ve got two inputs. So we’ll graph things along two axes, an x-axis, which will here represent humidity, and a y-axis, which here represents pressure. And what we might do is say, let’s take all of the days that were raining and just try to plot them on this graph and see where they fall on this graph. And here might be all of the rainy days, where each rainy day is one of these blue dots here that corresponds to a particular value for humidity and a particular value for pressure. And then I might do the same thing with the days that were not rainy. So take all the not rainy days, figure out what their values were for each of these two inputs, and go ahead and plot them on this graph as well. And I’ve here plotted them in red. So blue here stands for a rainy day. Red here stands for a not rainy day. And this, then, is the input that my computer has access to, all of this input. And what I would like the computer to be able to do is to train a model such that if I’m ever presented with a new input that doesn’t have a label associated with it, something like this white dot here, I would like to predict, given those values for each of the two inputs, should we classify it as a blue dot, a rainy day, or should we classify it as a red dot, a not rainy day? And if you’re just looking at this picture graphically, trying to say, all right, this white dot, does it look like it belongs to the blue category, or does it look like it belongs to the red category, I think most people would agree that it probably belongs to the blue category. And why is that? Well, it looks like it’s close to other blue dots. And that’s not a very formal notion, but it’s a notion that we’ll formalize in just a moment.
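As a quick aside, a picture like the one being described could be produced with a few lines of matplotlib, reusing the hypothetical data list from the sketch above (this is just one plausible way to draw it, not the lecture’s own code):

```python
import matplotlib.pyplot as plt

# Split the made-up data by label, then plot each group in its own color.
rain = [inputs for inputs, label in data if label == "rain"]
no_rain = [inputs for inputs, label in data if label == "no rain"]

plt.scatter([h for h, p in rain], [p for h, p in rain], color="blue", label="rain")
plt.scatter([h for h, p in no_rain], [p for h, p in no_rain], color="red", label="no rain")
plt.xlabel("humidity")
plt.ylabel("pressure")
plt.legend()
plt.show()
```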
Because it seems to be close to this blue dot here, and nothing else is closer to it, we might say that it should be categorized as blue. It should fall into that category of, I think that day is going to be a rainy day, based on that input. Might not be totally accurate, but it’s a pretty good guess. And this type of algorithm is actually a very popular and common machine learning algorithm known as nearest neighbor classification. It’s an algorithm for solving these classification-type problems. And what nearest neighbor classification will do is, given an input, choose the class of the nearest data point to that input. By class, we just here mean category, like rain or no rain, counterfeit or not counterfeit. And we choose the category or the class based on the nearest data point. So given all the data we just looked at, is the nearest data point a blue point or is it a red point? And depending on the answer to that question, we were able to make some sort of judgment. We were able to say something like, we think it’s going to be blue, or we think it’s going to be red. So likewise, we could apply this to other data points that we encounter as well. If suddenly this data point comes about, well, its nearest data point is red. So we would go ahead and classify this as a red point, not raining. Things get a little bit trickier, though, when you look at a point like this white point over here and you ask the same sort of question. Should it belong to the category of blue points, the rainy days? Or should it belong to the category of red points, the not rainy days? Now, nearest neighbor classification would say the way you solve this problem is to look at which point is nearest to that point. You look at this nearest point and say it’s red. It’s a not rainy day. And therefore, according to nearest neighbor classification, I would say that this unlabeled point, well, that should also be red. It should also be classified as a not rainy day. Now, your intuition might say that’s a reasonable judgment to make, that the closest thing is a not rainy day, so we may as well guess that it’s a not rainy day. But it’s probably also reasonable to look at the bigger picture of things and say, yes, it is true that the nearest point to it was a red point, but it’s surrounded by a whole bunch of other blue points. So looking at the bigger picture, there’s potentially an argument to be made that this point should actually be blue. And with only this data, we actually don’t know for sure. We are given some input, something we’re trying to predict. And we don’t necessarily know what the output is going to be. So in this case, which one is correct is difficult to say. But oftentimes, considering more than just a single neighbor, considering multiple neighbors, can sometimes give us a better result. And so there’s a variant on the nearest neighbor classification algorithm that is known as the K nearest neighbor classification algorithm, where K is some parameter, some number that we choose, for how many neighbors we are going to look at. So one nearest neighbor classification is what we saw before: just pick the one nearest neighbor and use that category. But with K nearest neighbor classification, where K might be 3, or 5, or 7, we look at the 3, or 5, or 7 closest neighbors, the closest data points to that point. This algorithm works a little bit differently: we’ll give it an input and
choose the most common class out of the K nearest data points to that input. So if we look at the five nearest points, and three of them say it’s raining, and two of them say it’s not raining, we’ll go with the three instead of the two, because each one effectively gets one vote towards what they believe the category ought to be. And ultimately, you choose the category that has the most votes as a consequence of that. So K nearest neighbor classification is a fairly straightforward one to understand intuitively. You just look at the neighbors and figure out what the answer might be. And it turns out this can work very, very well for solving a whole variety of different types of classification problems. But not every model is going to work under every situation. And so one of the things we’ll take a look at today, especially in the context of supervised machine learning, is that there are a number of different approaches to machine learning, a number of different algorithms that we can apply, all solving the same type of problem, all solving some kind of classification problem where we want to take inputs and organize them into different categories. And no one algorithm is necessarily always going to be better than some other algorithm. They each have their trade-offs. And maybe, depending on the data, one type of algorithm is going to be better suited to trying to model that information than some other algorithm. And so this is what a lot of machine learning research ends up being about: that when you’re trying to apply machine learning techniques, you’re often looking not just at one particular algorithm, but trying multiple different algorithms, trying to see what is going to give you the best results for trying to predict some function that maps inputs to outputs. So what, then, are the drawbacks of K nearest neighbor classification? Well, there are a couple. One might be that, in a naive approach at least, it could be fairly slow to have to go through and measure the distance between a point and every single one of these points that exist here. Now, there are ways of trying to get around that. There are data structures that can help make it quicker to find these neighbors. There are also techniques you can use to try and prune some of this data, remove some of the data points, so that you’re only left with the relevant data points, just to make it a little bit easier. But ultimately, what we might like to do is come up with another way of trying to do this classification. And one way of trying to do the classification was looking at what the neighboring points are. But another way might be to try to look at all of the data and see if we can come up with some decision boundary, some boundary that will separate the rainy days from the not rainy days. And in the case of two dimensions, we can do that by drawing a line, for example. So what we might want to try to do is just find some line, find some separator, that divides the rainy days, the blue points over here, from the not rainy days, the red points over there. We’re now trying a different approach, in contrast with the nearest neighbor approach, which just looked at local data around the input data point that we cared about. Now what we’re doing is trying to use a technique known as linear regression to find some sort of line that will separate the two halves from each other. Now, sometimes it’ll actually be possible to come up with some line that perfectly separates all the rainy days from the not rainy days.
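Before going further with decision boundaries, here is a minimal sketch of the k-nearest-neighbor idea just described, reusing the hypothetical data list from earlier (the function name and structure are made up for illustration):

```python
import math
from collections import Counter

def knn_classify(point, data, k=3):
    """Classify `point` by majority vote among its k nearest labeled neighbors."""
    # Sort the labeled examples by Euclidean distance to the query point.
    neighbors = sorted(data, key=lambda pair: math.dist(pair[0], point))[:k]
    # Each of the k nearest points gets one vote; the most common class wins.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# With k=1 this is plain nearest-neighbor classification.
print(knn_classify((0.85, 0.996), data, k=3))
```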
Realistically, though, a picture like that is probably cleaner than many data sets will actually be. Oftentimes, data is messier. There are outliers. There’s random noise that happens inside of a particular system. And what we’d like to do is still be able to figure out what a line might look like. So in practice, the data will not always be linearly separable, where linearly separable refers to some data set where I could draw a line just to separate the two halves of it perfectly. Instead, you might have a situation like this, where there are some rainy points that are on this side of the line and some not rainy points that are on that side of the line. And there may not be a line that perfectly separates one half of the inputs from the other half, that perfectly separates all the rainy days from the not rainy days. But we can still say that this line does a pretty good job. And we’ll try to formalize a little bit later what we mean when we say something like this line does a pretty good job of trying to make that prediction. But for now, let’s just say we’re looking for a line that does as good a job as we can at trying to separate one category of things from another category of things. So let’s now try to formalize this a little bit more mathematically. We want to come up with some sort of function, some way we can define this line. And our inputs are things like humidity and pressure in this case. So for our inputs, we might say x1 is going to represent humidity, and x2 is going to represent pressure. These are inputs that we are going to provide to our machine learning algorithm. And given those inputs, we would like for our model to be able to predict some sort of output. And we are going to predict that using our hypothesis function, which we called h. Our hypothesis function is going to take as input x1 and x2, humidity and pressure in this case. And you can imagine, if we didn’t just have two inputs but had three or four or five inputs or more, we could have this hypothesis function take all of those as input. And we’ll see examples of that a little bit later as well. And now the question is, what does this hypothesis function do? Well, it really just needs to measure, is this data point on one side of the boundary, or is it on the other side of the boundary? And how do we formalize that boundary? Well, the boundary is generally going to be a linear combination of these input variables, at least in this particular case. So what we’re trying to do when we say linear combination is take each of these inputs and multiply them by some number that we’re going to have to figure out. We’ll generally call that number a weight, for how important these variables should be in trying to determine the answer. So we’ll weight each of these variables with some weight, and we might add a constant to it just to try and make the function a little bit different. And then we just need to compare the result: is it greater than 0, or is it less than 0, to say, does it belong on one side of the line or the other side of the line? So what that mathematical expression might look like is this. We would take each of my variables, x1 and x2, multiply them by some weight. I don’t yet know what that weight is, but it’s going to be some number, weight 1 and weight 2. And maybe we just want to add some other weight 0 to it, because the function might require us to shift the entire value up or down by a certain amount. And then we just compare. If we do all this math, is it greater than or equal to 0?
If so, we might categorize that data point as a rainy day. And otherwise, we might say, no rain. So the key here, then, is that this expression is how we are going to calculate whether it’s a rainy day or not. We’re going to do a bunch of math where we take each of the variables, multiply them by a weight, maybe add an extra weight to it, and see if the result is greater than or equal to 0. And using the result of that expression, we’re able to determine whether it’s raining or not raining. This expression here is, in this case, going to refer to just some line. If you were to plot it graphically, it would just be some line. And what the line actually looks like depends upon these weights. x1 and x2 are the inputs, but these weights are really what determine the shape of that line, the slope of that line, and what that line actually looks like. So we then would like to figure out what these weights should be. We can choose whatever weights we want, but we want to choose weights in such a way that if you pass in a rainy day’s humidity and pressure, then you end up with a result that is greater than or equal to 0. And we would like it such that if we passed into our hypothesis function a not rainy day’s inputs, then the output that we get should be not raining. So before we get there, let’s try and formalize this a little bit more mathematically, just to get a sense for how it is that you’ll often see this if you ever go further into supervised machine learning and explore this idea. One thing is that generally, for these categories, we’ll sometimes just use the names of the categories, like rain and not rain. Often, mathematically, if we’re trying to do comparisons between these things, it’s easier just to deal in the world of numbers. So we could just say 1 and 0: 1 for raining, 0 for not raining. So we do all this math. And if the result is greater than or equal to 0, we’ll go ahead and say our hypothesis function outputs 1, meaning raining. And otherwise, it outputs 0, meaning not raining. And oftentimes, this type of expression will instead be expressed using vector mathematics. And all a vector is, if you’re not familiar with the term, is a sequence of numerical values. You could represent that in Python using a list of numerical values or a tuple with numerical values. And here, we have a couple of sequences of numerical values. One of our vectors, one of our sequences of numerical values, is all of these individual weights: w0, w1, and w2. So we could construct what we’ll call a weight vector, and we’ll see why this is useful in a moment, called w, generally represented using a boldface w, that is just a sequence of these three weights: weight 0, weight 1, and weight 2. And to be able to calculate, based on those weights, whether we think a day is raining or not raining, we’re going to multiply each of those weights by one of our input variables. That w2, this weight, is going to be multiplied by input variable x2. w1 is going to be multiplied by input variable x1. And w0, well, it’s not being multiplied by anything. But to make sure the vectors are the same length, and we’ll see why that’s useful in just a second, we’ll just go ahead and say w0 is being multiplied by 1. Because you can multiply something by 1, and you end up getting the exact same number. So in addition to the weight vector w, we’ll also have an input vector that we’ll call x that has three values: 1, again, because we’re just multiplying w0 by 1 eventually, and then x1 and x2.
So here, then, we’ve represented two distinct vectors: a vector of weights that we need to somehow learn. The goal of our machine learning algorithm is to learn what this weight vector is supposed to be. We could choose any arbitrary set of numbers, and it would produce a function that tries to predict rain or not rain, but it probably wouldn’t be very good. What we want to do is come up with a good choice of these weights so that we’re able to make accurate predictions. And then this input vector represents a particular input to the function, a data point for which we would like to estimate, is that day a rainy day, or is that day a not rainy day? And so that’s going to vary just depending on what input is provided to our function, what it is that we are trying to estimate. And then, to do the calculation, we want to calculate this expression here, and it turns out that expression is what we would call the dot product of these two vectors. The dot product of two vectors just means taking each of the terms in the vectors and multiplying them together: w0 multiplied by 1, w1 multiplied by x1, w2 multiplied by x2. And that’s why these vectors need to be the same length. And then we just add all of the results together. So the dot product of w and x, our weight vector and our input vector, that’s just going to be w0 times 1, or just w0, plus w1 times x1, multiplying those two terms together, plus w2 times x2, multiplying those terms together. So we have our weight vector, which we need to figure out. We need our machine learning algorithm to figure out what the weights should be. We have the input vector, representing the data point that we’re trying to predict a category for, predict a label for. And we’re able to do that calculation by taking this dot product, which you’ll often see represented in vector form. But if you haven’t seen vectors before, you can think of it as identical to just this mathematical expression, just doing the multiplication, adding the results together, and then seeing whether the result is greater than or equal to 0 or not. This expression here is identical to the expression that we’re calculating to see whether or not that answer is greater than or equal to 0 in this case. And so for that reason, you’ll often see the hypothesis function written as something like this, a simpler representation, where the hypothesis takes as input some input vector x, some humidity and pressure for some day. And we want to predict an output like rain or no rain, or 1 or 0 if we choose to represent things numerically. And the way we do that is by taking the dot product of the weights and our input. If it’s greater than or equal to 0, we’ll go ahead and say the output is 1. Otherwise, the output is going to be 0. And this hypothesis, we say, is parameterized by the weights. Depending on what weights we choose, we’ll end up getting a different hypothesis. If we choose the weights randomly, we’re probably not going to get a very good hypothesis function. We’ll get a 1 or a 0, but it’s probably not accurately going to reflect whether we think a day is going to be rainy or not rainy. But if we choose the weights right, we can often do a pretty good job of trying to estimate whether we think the output of the function should be a 1 or a 0. And so the question, then, is how to figure out what these weights should be, how to be able to tune those parameters. And there are a number of ways you can do that. One of the most common is known as the perceptron learning rule.
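In code, that hypothesis might look something like the following minimal sketch (plain Python, no particular library assumed):

```python
def dot(w, x):
    """Dot product of two equal-length sequences of numbers."""
    return sum(wi * xi for wi, xi in zip(w, x))

def hypothesis(w, x):
    """h(x): 1 (rain) if w . x >= 0, otherwise 0 (no rain).

    Here w = [w0, w1, w2] and x = [1, x1, x2], with the leading 1
    pairing up with w0 so the two vectors have the same length.
    """
    return 1 if dot(w, x) >= 0 else 0
```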
And we’ll see more of the perceptron learning rule later. But the idea of it, and we’re not going to get too deep into the mathematics, we’ll mostly just introduce it more conceptually, is to say that, given some data point that we would like to learn from, some data point that has an input x and an output y, where y is like 1 for rain or 0 for not rain, then we’re going to update the weights. And we’ll look at the formula in just a moment. But the big picture idea is that we can start with random weights but then learn from the data. Take the data points one at a time. And for each one of the data points, figure out, all right, what parameters do we need to change inside of the weights in order to better match that input point. And so that is the value of having access to a lot of data in a supervised machine learning algorithm: you take each of the data points, and maybe look at them multiple times, and constantly try to figure out whether you need to shift your weights in order to create some weight vector that is able to correctly, or more accurately, estimate what the output should be, whether we think it’s going to be raining or whether we think it’s not going to be raining. So what does that weight update look like? Without going into too much of the mathematics, we’re going to update each of the weights to be the result of the original weight plus some additional expression. And to understand this expression: y, well, y is what the actual output is. And hypothesis of x, the input, that’s going to be what we predicted the output was. And so I can replace this by saying what the actual value was minus what our estimate was. And based on the difference between the actual value and what our estimate was, we might want to change our hypothesis, change the way that we do that estimation. If the actual value and the estimate were the same thing, meaning we were correctly able to predict what category this data point belonged to, well, then actual value minus estimate, that’s just going to be 0, which means this whole term on the right-hand side goes to 0, and the weight doesn’t change. Weight i, where i is like weight 1 or weight 2 or weight 0, weight i just stays at weight i. And none of the weights change if we were able to correctly predict what category the input belonged to. But if our hypothesis didn’t correctly predict what category the input belonged to, well, then maybe we need to make some changes, adjust the weights so that we’re better able to predict this kind of data point in the future. And what is the way we might do that? Well, if the actual value was bigger than the estimate, and for now we’ll go ahead and assume that these x’s are positive values, that means we need to increase the weight in order to make it such that the output is bigger, and therefore we’re more likely to get to the right actual value. And so if the actual value is bigger than the estimate, then actual value minus estimate will be a positive number. And so you imagine we’re just adding some positive number to the weight, just to increase it ever so slightly. And likewise, the inverse case is true: if the actual value was less than the estimate, the actual value was 0 but we estimated 1, meaning it actually was not raining, but we predicted it was going to be raining.
Well, then we want to decrease the value of the weight, because then, in that case, we want to try and lower the total value of computing that dot product in order to make it less likely that we would predict that it would actually be raining. So no need to get too deep into the mathematics of that, but the general idea is that every time we encounter some data point, we can adjust these weights accordingly to try and make the weights better line up with the actual data that we have access to. And you can repeat this process with data point after data point until eventually, hopefully, your algorithm converges to some set of weights that do a pretty good job of trying to figure out whether a day is going to be rainy or not. And just as a final point about this particular equation, this value alpha here is generally what we’ll call the learning rate. It’s just some parameter, some number we choose, for how quickly we’re actually going to be updating these weight values. So if alpha is bigger, then we’re going to update these weight values by a lot. And if alpha is smaller, then we’ll update the weight values by less. And you can choose a value of alpha. Depending on the problem, different values might suit the situation better or worse than others. So after all of that, after we’ve done this training process of taking all this data and, using this learning rule, looking at all the pieces of data, using each piece of data as an indication to us of whether the weights stay the same, whether we increase the weights, or whether we decrease the weights, and if so, by how much, what you end up with is effectively a threshold function. And we can look at what that threshold function looks like. On the x-axis here, we have the output of that function: the result of taking the weights and taking the dot product of them with the input. And on the y-axis, we have what the output is going to be: 0, which in this case represented not raining, and 1, which in this case represented raining. And the way that our hypothesis function works is it calculates this value. And if it’s greater than 0, or greater than some threshold value, then we declare that it’s a rainy day. And otherwise, we declare that it’s a not rainy day. And this, then, graphically, is what that function looks like: initially, when the value of this dot product is small, it’s not raining, it’s not raining, it’s not raining. But as soon as it crosses that threshold, we suddenly say, OK, now it’s raining, now it’s raining, now it’s raining. And the way to interpret this kind of representation is that anything on this side of the line, that would be the category of data points where we say, yes, it’s raining. Anything that falls on this side of the line are the data points where we would say, it’s not raining. And again, we want to choose some value for the weights that results in a function that does a pretty good job of trying to do this estimation. But one tricky thing with this type of hard threshold is that it only leaves two possible outcomes. We plug in some data as input, and the output we get is raining or not raining. And there’s no room for anywhere in between. And maybe that’s what you want. Maybe all you want is, given some data point, you would like to be able to classify it into one of two or more of these various different categories. But it might also be the case that you care about knowing how strong that prediction is, for example.
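Putting those pieces together, the update rule being described is w_i <- w_i + alpha * (actual - estimate) * x_i. Here is a minimal sketch of what repeatedly applying it might look like, reusing the hypothesis function and the hypothetical data list from the earlier sketches:

```python
def perceptron_update(w, x, y, alpha=0.1):
    """One step of the perceptron learning rule:
    w_i <- w_i + alpha * (actual - estimate) * x_i
    """
    estimate = hypothesis(w, x)  # our prediction: 1 or 0
    error = y - estimate         # 0 if correct; +1 or -1 if wrong
    return [wi + alpha * error * xi for wi, xi in zip(w, x)]

# Sweep over the labeled data repeatedly, nudging the weights each time.
w = [0.0, 0.0, 0.0]  # start with arbitrary weights (here, all zero)
for _ in range(100):
    for (x1, x2), label in data:
        y = 1 if label == "rain" else 0
        w = perceptron_update(w, [1, x1, x2], y)
```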
So if we go back to this instance here, where we have rainy days on this side of the line and not rainy days on that side of the line, let’s look now at these two white data points: this data point here, that we would like to predict a label or a category for, and this data point over here, that we would also like to predict a label or a category for. It seems likely that you could pretty confidently say that this data point should be a rainy day. It seems close to the other rainy days, if we’re going by the nearest neighbor strategy. And it’s on this side of the line, if we’re going by the strategy of just asking which side of the line it falls on by figuring out what those weights should be. And if we’re using that line strategy of just which side of the line a point falls on, which side of this decision boundary, well, we’d also say that this point over here is a rainy day, because it falls on the side of the line that corresponds to rainy days. But it’s likely that even in this case, we would know that we don’t feel nearly as confident about this data point on the left as compared to this data point on the right. That for this one on the right, we can feel very confident that, yes, it’s a rainy day. This one, it’s pretty close to the line, if we’re judging just by distance. And so you might be less sure. But our threshold function doesn’t allow for a notion of less sure or more sure about something. It’s what we would call a hard threshold. Once you’ve crossed this line, then immediately we say, yes, this is going to be a rainy day. Anywhere before it, we’re going to say it’s not a rainy day. And that may not be helpful in a number of cases. One, this is not a particularly easy function to deal with. As you get deeper into the world of machine learning and are trying to do things like taking derivatives of these curves, this type of function makes things challenging. But the other challenge is that we don’t really have any notion of gradation between things. We don’t have a notion of, yes, this is a very strong belief that it’s going to be raining, as opposed to, it’s probably more likely than not that it’s going to be raining, but maybe not totally sure about that either. So what we can do, by taking advantage of a technique known as logistic regression, is instead of using this hard threshold type of function, we can use a logistic function instead, something we might call a soft threshold. And that’s going to transform this into something that looks a little more like this, something that curves more nicely. And as a result, the possible output values are no longer just 0 and 1, 0 for not raining, 1 for raining. You can actually get any real numbered value between 0 and 1. If you’re way over on this side, then you get a value of 0: OK, it’s not going to be raining, and we’re pretty sure about that. And if you’re over on this side, you get a value of 1: yes, we’re very sure that it’s going to be raining. But in between, you could get some real numbered value, where a value like 0.7 might mean we think it’s going to rain. It’s more probable that it’s going to rain than not, based on the data, but we’re not as confident as some of the other data points might be. So one of the advantages of the soft threshold is that it allows us to have an output that could be some real number that potentially reflects some sort of probability, the likelihood that we think that this particular data point belongs to that particular category.
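That soft threshold is the standard logistic (sigmoid) function, 1 / (1 + e^(-x)). A minimal sketch, reusing the dot helper from earlier:

```python
import math

def soft_threshold(w, x):
    """Logistic (sigmoid) output: a real number between 0 and 1,
    rather than a hard 0 or 1."""
    z = dot(w, x)                  # same dot product as before
    return 1 / (1 + math.exp(-z))  # e.g. 0.7 ~ probably rain, but not certain
```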
And there are some other nice mathematical properties of that as well. So those, then, are two different approaches to trying to solve this type of classification problem. One is this nearest neighbor type of approach, where you just take a data point and look at the data points that are nearby to try and estimate what category we think it belongs to. And the other approach is the approach of saying, all right, let’s just try and use linear regression, figure out what these weights should be, adjust the weights in order to figure out what line or what decision boundary is going to best separate these two categories. It turns out that another popular approach, a very popular approach if you just have a data set and you want to start trying to do some learning on it, is what we call the support vector machine. And we’re not going to go too much into the mathematics of the support vector machine, but we’ll at least explore it graphically to see what it is that it looks like. And the idea, or the motivation, behind the support vector machine is that there are actually a lot of different lines that we could draw, a lot of different decision boundaries that we could draw, to separate two groups. So for example, say I had the red data points over here and the blue data points over here. One possible line I could draw is a line like this, that this line here would separate the red points from the blue points. And it does so perfectly. All the red points are on one side of the line. All the blue points are on the other side of the line. But this should probably make you a little bit nervous, if you come up with a model and the model comes up with a line that looks like this. And the reason why is that you worry about how well it’s going to generalize to other data points that are not necessarily in the data set that we have access to. For example, if there was a point that fell right here, on the right side of the line, well, then based on that, we might want to guess that it is, in fact, a red point, but it falls on the side of the line where we would instead estimate that it’s a blue point. And so based on that, this line is probably not a great choice, just because it is so close to these various data points. We might instead prefer a diagonal line that just goes diagonally through the data set, like we’ve seen before. But there, too, there are a lot of diagonal lines that we could draw as well. For example, I could draw this diagonal line here, which also successfully separates all the red points from all of the blue points. From the perspective of something like just trying to figure out some setting of weights that allows us to predict the correct output, this line will predict the correct output for this particular set of data every single time, because the red points are on one side, the blue points are on the other. But yet again, you should probably be a little nervous, because this line is so close to these red points. Even though we’re able to correctly predict on the input data, if there was a point that fell somewhere in this general area, our algorithm, this model, would say that, yeah, we think it’s a blue point, when in actuality, it might belong to the red category instead, just because it looks like it’s close to the other red points. What we really want, to be able to generalize from this data as best as possible, is to come up with a line like this, one that seems like the intuitive line to draw.
And the reason why it’s intuitive is that it seems to be as far apart as possible from the red data and the blue data. So if we generalize a little bit and assume that maybe we have some points that are different from the input but still slightly further away, we can still say that something on this side is probably red, something on that side is probably blue, and we can make those judgments that way. And that is what support vector machines are designed to do. They’re designed to try and find what we call the maximum margin separator, where the maximum margin separator is just some boundary that maximizes the distance between the groups of points, rather than coming up with some boundary that’s very close to one set or the other, whereas in the case before, we wouldn’t have cared: as long as we’re categorizing the input well, that seemed to be all we needed to do. The support vector machine will try and find this maximum margin separator, some way of trying to maximize that particular distance. And it does so by finding what we call the support vectors, which are the vectors that are closest to the line, and trying to maximize the distance between the line and those particular points. And it works that way in two dimensions. It also works in higher dimensions, where we’re not looking for some line that separates the two sets of data points, but instead looking for what we generally call a hyperplane, some decision boundary, effectively, that separates one set of data from the other set of data. And this ability of support vector machines to work in higher dimensions actually has a number of other applications as well. But one is that it helpfully deals with cases where data may not be linearly separable. So we talked about linear separability before, this idea that you can take data and just draw a line, or some linear combination of the inputs, that allows us to perfectly separate the two sets from each other. There are some data sets that are not linearly separable, and in some of them, you would not be able to find a good line at all that would do that kind of separation. Something like this, for example, if you imagine here are the red points, and the blue points around them. If you try to find a line that divides the red points from the blue points, it’s actually going to be difficult, if not impossible, to do that with any line you choose. If you draw a line here, then you miscategorize all of these points that should actually be blue and not red. Anywhere else you draw a line, there’s going to be a lot of error, a lot of mistakes, a lot of what we’ll soon call loss to that line that you draw, a lot of points that you’re going to categorize incorrectly. What we really want is to be able to find a better decision boundary that may not be just a straight line through this two-dimensional space. And what support vector machines can do is they can begin to operate in higher dimensions and be able to find some other decision boundary, like the circle in this case, that actually is able to separate one of these sets of data from the other set of data a lot better. So oftentimes, in data sets where the data is not linearly separable, support vector machines, by working in higher dimensions, can actually figure out a way to solve that kind of problem effectively. So that, then, is three different approaches to trying to solve these sorts of problems. We’ve seen support vector machines.
We’ve seen trying to use linear regression and the perceptron learning rule to be able to figure out how to categorize inputs and outputs. We’ve seen the nearest neighbor approach. And no one of them is necessarily better than any other; again, it’s going to depend on the data set, the information you have access to, and what the function looks like that you’re ultimately trying to predict. And this is where a lot of research and experimentation can be involved, in trying to figure out how best to perform that kind of estimation. But classification is only one of the tasks that you might encounter in supervised machine learning. Because in classification, what we’re trying to predict is some discrete category. We’re trying to predict red or blue, rain or not rain, authentic or counterfeit. But sometimes what we want to predict is a real numbered value. And for that, we have a related problem, not classification, but instead known as regression. And regression is the supervised learning problem where we try and learn a function mapping inputs to outputs, same as before. But instead of the outputs being discrete categories, things like rain or not rain, in a regression problem, the output values are generally continuous values, some real number that we would like to predict. This happens all the time as well. You might imagine that a company might take this approach if it’s trying to figure out, for instance, what the effect of its advertising is. How do advertising dollars spent translate into sales for the company’s product, for example? And so they might like to try to predict some function that takes as input the amount of money spent on advertising. And here, we’re just going to use one input. But again, you could scale this up to many more inputs as well, if you have a lot of different kinds of data you have access to. And the goal is to learn a function that says, given this amount of spending on advertising, we’re going to get this amount in sales. And you might imagine, based on having access to a whole bunch of data, like, for every past month, here is how much we spent on advertising, and here is what sales were, that we would like to come up with some sort of hypothesis function that, again, given the amount spent on advertising, can predict, in this case, some real number, some estimate of how much in sales we expect that company to do in this month or in this quarter, or whatever unit of time we’re choosing to measure things in. And so again, to solve this type of problem, we could try using a linear regression type approach, where we take this data and we just plot it. On the x-axis, we have advertising dollars spent. On the y-axis, we have sales. And we might just want to try and draw a line that does a pretty good job of trying to estimate this relationship between advertising and sales. And in this case, unlike before, we’re not trying to separate the data points into discrete categories. Instead, in this case, we’re just trying to find a line that approximates this relationship between advertising and sales, so that if we want to figure out what the estimated sales are for a particular advertising budget, we just look it up in this line: for this amount of advertising, we would have this amount of sales, and we just try to make the estimate that way. And so you can try and come up with a line, again figuring out how to modify the weights using various different techniques, to try and make it so that this line fits as well as possible.
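A minimal sketch of fitting such a line, using NumPy’s least-squares polynomial fit on some entirely made-up (advertising, sales) figures:

```python
import numpy as np

# Hypothetical past observations: advertising dollars spent and resulting sales.
advertising = np.array([1200.0, 1500.0, 1800.0, 2100.0, 2500.0])
sales = np.array([2400.0, 2800.0, 3100.0, 3600.0, 4100.0])

# Fit sales ~= w1 * advertising + w0 by least squares (a degree-1 polynomial).
w1, w0 = np.polyfit(advertising, sales, deg=1)

def predict_sales(budget):
    """Look up the estimated sales on the fitted line for a given ad budget."""
    return w1 * budget + w0

print(predict_sales(2000.0))
```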
So with all of these approaches, then, to trying to solve machine learning style problems, the question becomes, how do we evaluate these approaches? How do we evaluate the various different hypotheses that we could come up with? Because each of these algorithms will give us some sort of hypothesis, some function that maps inputs to outputs, and we want to know, how well does that function work?

You can think of evaluating these hypotheses, and trying to get a better hypothesis, as kind of like an optimization problem. In an optimization problem, as you recall from before, we were either trying to maximize some objective function by trying to find a global maximum, or we were trying to minimize some cost function by trying to find some global minimum. And in the case of evaluating these hypotheses, one thing we might say is that this cost function, the thing we’re trying to minimize, is what we would call a loss function. A loss function is a function that is going to estimate for us how poorly our function performs. More formally, it’s a loss of utility: whenever we predict something that is wrong, that is a loss of utility, and that’s going to add to the output of our loss function. You could come up with any loss function that you want, just some mathematical way of estimating, given each of these data points, given what the actual output is and given what our projected output is, our estimate, some sort of numerical loss. But there are a couple of popular loss functions that are worth discussing, just so that you’ve seen them before.

When it comes to discrete categories, things like rain or not rain, counterfeit or not counterfeit, one approach is the 0-1 loss function. The way that works is that for each of the data points, our loss function takes as input what the actual output is, like whether it was actually raining or not raining, and takes our prediction into account. Did we predict, given this data point, that it was raining or not raining? If the actual value equals the prediction, well, then the 0-1 loss function will just say the loss is 0. There was no loss of utility, because we were able to predict correctly. And otherwise, if the actual value was not the same thing as what we predicted, well, then in that case, our loss is 1. We lost something, lost some utility, because what we predicted, the output of the function, was not what it actually was. And the goal, then, in a situation like this would be to come up with some hypothesis that minimizes the total empirical loss, the total amount that we’ve lost, if you add up for all these data points what the actual output is and what your hypothesis would have predicted.

So in this case, for example, if we go back to classifying days as raining or not raining, and we came up with this decision boundary, how would we evaluate this decision boundary? How much better is it than drawing the line here or drawing the line there? Well, we could take each of the input data points, where each input data point has a label, whether it was raining or whether it was not raining, and we could compare it to the prediction, whether we predicted it would be raining or not raining, and assign it a numerical value as a result. So for example, these points over here, they were all rainy days, and we predicted they would be raining, because they fall on the bottom side of the line. So they have a loss of 0, nothing lost from those situations.
And likewise, the same is true for some of these points over here, where it was not raining and we predicted it would not be raining either. Where we do have loss are points like this point here and that point there, where we predicted that it would not be raining, but in actuality, it’s a blue point; it was raining. Or likewise here, we predicted that it would be raining, but in actuality, it’s a red point; it was not raining. And so as a result, we miscategorized these data points that we were trying to train on, and there is some loss here. One loss here, there, here, and there, for a total loss of 4, for example, in this case. And that might be how we would say that this line is better than a line that goes somewhere else or a line that’s further down, because this line might minimize the loss: there is no way to do better than just these four points of loss if you’re just drawing a straight line through our space.

So the 0-1 loss function checks: did we get it right, or did we get it wrong? If we got it right, the loss is 0, nothing lost. If we got it wrong, then our loss function for that data point says 1. And we add up all of those losses across all of our data points to get some sort of empirical loss, how much we have lost across all of these original data points that our algorithm had access to.

There are other forms of loss as well that work especially well when we deal with more real-valued cases, cases like the mapping between advertising budget and the amount that we do in sales, for example. Because in that case, you care not just whether you get the number exactly right, but how close you were to the actual value. If the actual value is that you did like $2,800 in sales and you predicted that you would do $2,900 in sales, maybe that’s pretty good. That’s much better than if you had predicted you’d do $1,000 in sales, for example. And so we would like our loss function to take that into account as well, to take into account not just whether the actual value and the expected value are exactly the same, but also how far apart they were. For that, one approach is what we call L1 loss. L1 loss doesn’t just look at whether actual and predicted are equal to each other; we take the absolute value of the actual value minus the predicted value. In other words, we just ask how far apart the actual and predicted values were, and we sum that up across all of the data points to be able to get what our answer ultimately is.

So what might this actually look like for our data set? Well, if we go back to this representation where we had advertising along the x-axis and sales along the y-axis, our line was our prediction, our estimate, for any given amount of advertising, of what we predicted sales were going to be. And our L1 loss is just how far apart, vertically along the sales axis, our prediction was from each of the data points. So we could figure out exactly how far apart our prediction was from each of the data points, and figure out as a result of that what our loss is overall for this particular hypothesis, just by adding up all of these various different individual losses for each of these data points. Our goal then is to try and minimize that loss, to try and come up with some line that minimizes what the utility loss is, by judging how far away our estimated amount of sales is from the actual amount of sales. And it turns out there are other loss functions as well. One that’s quite popular is the L2 loss.
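Before turning to it, here are the 0-1 loss and the L1 loss just described, written as plain-Python sketches; the numbers in the examples are made up:

def zero_one_loss(actuals, predictions):
    # 0 for every correct prediction, 1 for every incorrect one
    return sum(0 if actual == predicted else 1
               for actual, predicted in zip(actuals, predictions))

def l1_loss(actuals, predictions):
    # Sum of absolute differences: how far apart actual and predicted were
    return sum(abs(actual - predicted)
               for actual, predicted in zip(actuals, predictions))

print(zero_one_loss(["rain", "rain", "no rain"], ["rain", "no rain", "no rain"]))  # 1
print(l1_loss([2800, 1500], [2900, 1000]))  # 100 + 500 = 600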
The L2 loss, instead of just using the absolute value, like how far away the actual value is from the predicted value, uses the square of actual minus predicted. So how far apart are the actual and predicted values? It squares that value, effectively penalizing much more harshly anything that is a worse prediction. So imagine you have two data points that you predict as being one value away from their actual values, as opposed to one data point that you predict as being two away from its actual value. The L2 loss function will more harshly penalize the one that is two away, because it’s going to square however much the difference is between the actual value and the predicted value. And depending on the situation, you might want to choose a loss function based on what you care about minimizing. If you really care about minimizing the error on outlier cases, then you might want to consider something like the L2 loss. But if you’ve got a lot of outliers and you don’t necessarily care about modeling them, then maybe an L1 loss function is preferable. There are trade-offs here that you need to decide based on a particular set of data.

But what you do run the risk of with any of these loss functions, with anything that we’re trying to do, is a problem known as overfitting. Overfitting is a big problem that you can encounter in machine learning, which happens any time a model fits too closely to a data set and, as a result, fails to generalize. We would like our model to accurately predict outputs for the input-output pairs in the data we have access to. But the reason we want to do so is because we want our model to generalize well to data that we haven’t seen before. I would like to take data from the past year of whether it was raining or not raining and use that data to generalize towards the future: in the future, is it going to be raining or not raining? Or if I have a whole bunch of data on what counterfeit and not counterfeit US dollar bills have looked like in the past when people have encountered them, I’d like to train a computer to be able to, in the future, generalize to other dollar bills that I might see as well. And the problem with overfitting is that if you tie yourself too closely to the data set that you’re training your model on, you can end up not generalizing very well.

So what does this look like? Well, we might imagine the rainy day and not rainy day example again from before, where the blue points indicate rainy days and the red points indicate not rainy days. And we decided that we felt pretty comfortable with drawing a line like this as the decision boundary between rainy days and not rainy days, so we can pretty comfortably say that points on this side are more likely to be rainy days and points on that side are more likely to be not rainy days. But the loss, the empirical loss, isn’t zero in this particular case, because we didn’t categorize everything perfectly. There was this one outlier, this one day where it wasn’t raining, but our model still predicts that it is raining. That doesn’t necessarily mean our model is bad; it just means the model isn’t 100% accurate. If you really wanted to try and find a hypothesis that resulted in minimizing the loss, you could come up with a different decision boundary. It wouldn’t be a line, but it would look something like this.
This decision boundary does separate all of the red points from all of the blue points, because the red points fall on one side of this decision boundary and the blue points fall on the other side. But this, we would probably argue, is not as good of a prediction. Even though it seems to be more accurate based on all of the available training data that we have for training this machine learning model, we might say that it’s probably not going to generalize well. If there were other data points like here and there, we might still want to consider those to be rainy days, because we think this one was probably just an outlier. So if the only thing you care about is minimizing the loss on the data you have available to you, you run the risk of overfitting.

And this can happen in the classification case. It can also happen in the regression case. Here we predicted what we thought was a pretty good line relating advertising to sales, trying to predict what sales were going to be for a given amount of advertising. But I could come up with a line that does a better job of predicting the training data, and it would be something that looks like this, just connecting all of the various different data points. And now there is no loss at all. Now I’ve perfectly predicted, given any advertising, what sales are, and for all the data available to me, it’s going to be accurate. But it’s probably not going to generalize very well. I have overfit my model on the training data that is available to me.

And so in general, we want to avoid overfitting. We’d like strategies to make sure that we haven’t overfit our model to a particular data set, and there are a number of ways that you could try to do this. One way is by examining what it is that we’re optimizing for. In an optimization problem, all we do is say there is some cost, and I want to minimize that cost. And so far, we’ve defined that cost function, the cost of a hypothesis, as just being equal to the empirical loss of that hypothesis, like how far away the actual data points, the outputs, are from what I predicted them to be based on that particular hypothesis. And if all you’re trying to do is minimize cost, meaning minimizing the loss in this case, then the result is going to be that you might overfit: to minimize cost, you’re going to try and find a way to perfectly match all the input data, and that might happen as a result of overfitting on that particular input data.

So in order to address this, you could add something to the cost function. What counts as cost will be not just loss, but also some measure of the complexity of the hypothesis, where the complexity of the hypothesis is something that you would need to define for how complicated our line looks. This is sort of an Occam’s razor-style approach, where we want to give preference to a simpler decision boundary, like a straight line, for example, some simpler curve, as opposed to something far more complex that might represent the training data better but might not generalize as well. We’ll generally say that a simpler solution is probably the better solution, and probably the one that is more likely to generalize well to other inputs. So we measure what the loss is, but we also measure the complexity, and that all gets taken into account when we consider the overall cost: yes, something might have less loss if it better predicts the training data, but if it’s much more complex, it still might not be the best option that we have.
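Written out, the cost just described takes this general shape:

cost(hypothesis) = loss(hypothesis) + complexity(hypothesis)

and, once the weighting parameter discussed next is added:

cost(hypothesis) = loss(hypothesis) + lambda × complexity(hypothesis)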
And we need to come up with some balance between loss and complexity. For that reason, you’ll often see this represented as multiplying the complexity by some parameter that we have to choose, a parameter lambda in this case, where we’re saying that if lambda is a greater value, then we really want to penalize more complex hypotheses, whereas if lambda is smaller, we’re only going to penalize more complex hypotheses a little bit. It’s up to the machine learning programmer to decide where they want to set that value of lambda, for how much they want to penalize a more complex hypothesis that might fit the data a little better. Again, there’s no one right answer to a lot of these things; depending on the data set, depending on the data you have available to you and the problem you’re trying to solve, your choice of these parameters may vary, and you may need to experiment a little bit to figure out what the right choice is ultimately going to be.

This process, then, of considering not only loss but also some measure of the complexity is known as regularization. Regularization is the process of penalizing a hypothesis that is more complex in order to favor a simpler hypothesis that is more likely to generalize well, more likely to be able to apply to other situations that are dealing with other input points unlike the ones that we’ve necessarily seen before. So oftentimes, you’ll see us add some regularizing term to what we’re trying to minimize in order to avoid this problem of overfitting.

Now, another way of making sure we don’t overfit is to run some experiments and to see whether or not we are able to generalize the model that we’ve created to other data sets as well. And it’s for that reason that oftentimes when you’re doing a machine learning experiment, when you’ve got some data and you want to try and come up with some function that predicts, given some input, what the output is going to be, you don’t necessarily want to do your training on all of the data you have available to you. Instead, you can employ a method known as holdout cross-validation, where we split up our data into a training set and a testing set. The training set is the set of data that we’re going to use to train our machine learning model, and the testing set is the set of data that we’re going to use in order to test how well our machine learning model actually performed. So the learning happens on the training set. We figure out what the parameters should be; we figure out what the right model is. And then we see, all right, now that we’ve trained the model, how well does it do at predicting things inside of the testing set, some set of data that we haven’t seen before?

The hope, then, is that we’re going to be able to predict the testing set pretty well if we’re able to generalize based on the training data that’s available to us. If we’ve overfit the training data, though, and we’re not able to generalize, well, then when we look at the testing set, it’s likely going to be the case that we’re not going to predict things in the testing set nearly as effectively. So this is one method of cross-validation, validating to make sure that the work we have done is actually going to generalize to other data sets as well. And there are other statistical techniques we can use as well.
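In code, holdout cross-validation is just a shuffle and a slice. A minimal sketch, where all_examples stands in for whatever labeled data you have (the name is hypothetical):

import random

def holdout_split(data, test_fraction=0.5):
    # Shuffle, reserve a fraction for testing, train on the rest
    data = list(data)
    random.shuffle(data)
    holdout = int(test_fraction * len(data))
    return data[holdout:], data[:holdout]  # (training set, testing set)

training, testing = holdout_split(all_examples)

scikit-learn ships the same idea as train_test_split, and its KFold and cross_val_score utilities handle the k-fold variant described next.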
One of the downsides of this plain holdout cross-validation, where you say, I just split it 50-50, I train using 50% of the data and test using the other 50% (or you could choose other percentages as well), is that there is a fair amount of data that I am now not using to train, data from which I might have been able to get a better model. So one approach is known as k-fold cross-validation. In k-fold cross-validation, rather than just divide things into two sets and run one experiment, we divide things into k different sets. So maybe I divide things up into 10 different sets and then run 10 different experiments. If I split up my data into 10 different sets of data, then what I’ll do is, each time, for each of my 10 experiments, I will hold out one of those sets of data, where I’ll say, let me train my model on these nine sets and then test to see how well it predicts on set number 10. And then pick another nine sets to train on, and test on the other one that I held out. Each time, I train the model on everything minus the one set that I’m holding out, and then test to see how well the model performs on the set that I did hold out. What you end up getting is 10 different results, 10 different answers for how accurately the model worked, and oftentimes you can just take the average of those 10 to get an approximation for how well we think our model performs overall.

But the key idea is separating the training data from the testing data, because you want to test your model on data that is different from what you trained the model on. In the training, you want to avoid overfitting; you want to be able to generalize. And the way you test whether you’re able to generalize is by looking at some data that you haven’t seen before and seeing how well you’re actually able to perform.

And so if we want to actually implement any of these techniques inside of a programming language like Python, there are a number of ways we could do that. We could write this from scratch on our own, but there are libraries out there that allow us to take advantage of existing implementations of these algorithms, so that we can use the same types of algorithms in a lot of different situations. There’s a very popular library known as Scikit-learn, which allows us in Python to very quickly get set up with a lot of these different machine learning models. This library has already implemented algorithms for nearest neighbor classification, for perceptron learning, and for a bunch of other types of inference and supervised learning that we haven’t yet talked about. But using it, we can begin to try actually testing how these methods work and how accurately they perform.

So let’s go ahead and take a look at one approach to trying to solve this type of problem. All right, so I’m first going to pull up banknotes.csv, which is a whole bunch of data provided by UC Irvine, information about various different banknotes: people took pictures of various different banknotes and measured various different properties of those banknotes. And in particular, some human categorized each of those banknotes as either a counterfeit banknote or as not counterfeit. So what you’re looking at here is each row representing one banknote. This is formatted as a CSV spreadsheet, with comma-separated values separating each of these various different fields.
We have four different input values for each of these data points, just information, some measurements that were made on the banknote. What those measurements exactly are isn’t as important as the fact that we do have access to this data. But more importantly, we have access, for each of these data points, to a label, where 0 indicates that this was not a counterfeit bill, meaning it was an authentic bill, and a data point labeled 1 means that it is a counterfeit bill, at least according to the human researcher who labeled this particular data. So we have a whole bunch of data points, each of which has these various different measurements that were made on that particular bill, and each of which has an output value, 0 or 1: 0 meaning it was a genuine bill, 1 meaning it was a counterfeit bill. And what we would like to do is use supervised learning to begin to predict or model some sort of function that can take these four values as input and predict what the output would be. We want our learning algorithm to find some sort of pattern that is able to predict, based on these measurements, something that you could measure just by taking a photo of a bill, whether that bill is authentic or whether that bill is counterfeit.

So how can we do that? Well, I’m first going to open up banknote0.py and see how it is that we do this. I’m first importing a lot of things from Scikit-learn, but importantly, I’m going to set my model equal to the perceptron model, which is one of those models that we talked about before. We’re just going to try and figure out some setting of weights that is able to divide our data into two different groups. Then I’m going to go ahead and read the data in from my file, banknotes.csv. Basically, for every row, I’m going to separate that row into the first four values of that row, which are the evidence for that row, and then the label, where if the final column in that row is a 0, the label is authentic, and otherwise it’s going to be counterfeit. So I’m effectively reading data in from the CSV file and dividing it into a whole bunch of rows, where each row has some evidence, those four input values that are going to be inputs to my hypothesis function, and then the label, the output, whether it is authentic or counterfeit, the thing that I am then trying to predict.

So the next step is that I would like to split up my data set into a training set and a testing set, some set of data that I would like to train my machine learning model on, and some set of data that I would like to use to test that model to see how well it performed. What I’ll do is I’ll go ahead and figure out the length of the data, how many data points I have. I’ll go ahead and take half of them and save that number in a variable called holdout. That is how many items I’m going to hold out from my data set to save for the testing phase. I’ll randomly shuffle the data so it’s in some random order. And then I’ll say my testing set will be all of the data up to the holdout: I’ll take holdout many data items, and that will be my testing set. My training data will be everything else, the information that I’m going to train my model on. And then I’ll say I need to divide my training data into two different sets: I need to divide it into my x values, where x here represents the inputs.
So the x values that I’m going to train on are, basically, for every row in my training set, the evidence for that row, those four values, where it’s basically a vector of four numbers; that is going to be all of the input. And then I need the y values. What are the outputs that I want to learn from, the labels that belong to each of these various different input points? Well, that’s going to be the same thing: for each row in the training data, this time I take that row and get what its label is, whether it is authentic or counterfeit. So I end up with one list of all of these vectors of my input data, and one list, which follows the same order, of all of the labels that correspond with each of those vectors. And then to train my model, which in this case is just this perceptron model, I just call model.fit, passing in the training data and what the labels for those training data are. And Scikit-learn will take care of fitting the model; it will do the entire algorithm for me.

And then when it’s done, I can test to see how well that model performed. So I can say, let me get all of the input vectors that I want to test on: for each row in my testing data set, go ahead and get the evidence. And the y values, those are what the actual values were for each of the rows in the testing data set, what the actual label is. But then I’m going to generate some predictions: I’m going to use this model and try to predict, based on the testing vectors, what the output is. And my goal then is to compare y testing with predictions. I want to see how well my predictions, based on the model, actually reflect the y values, the outputs, that were actually labeled. Because I have this labeled data, I can assess how well the algorithm worked.

And so now I can just compute how well we did. This zip function basically just lets me look through two different lists in parallel, one element from each at a time. So for each actual value and for each predicted value, if the actual is the same thing as what I predicted, I’ll go ahead and increment my correct counter by one. Otherwise, I’ll increment my incorrect counter by one. And at the end, I can print out the results: here’s how many I got right, here’s how many I got wrong, and here was my overall accuracy, for example.

So I can go ahead and run this. I can run python banknote0.py, and it’s going to train on half the data set and then test on the other half. And here are the results for my perceptron model. In this case, it was able to correctly classify 679 bills as either authentic or counterfeit, and incorrectly classified seven of them, for an overall accuracy of close to 99%. So on this particular data set, using this perceptron model, we were able to predict very well what the output was going to be.

And we can try different models, too; Scikit-learn makes it very easy to just swap out one model for another model. So instead of the perceptron model, I can use the support vector machine, using the SVC, otherwise known as a support vector classifier, to classify things into two different groups. And now see, all right, how well does this perform? This time, we were able to correctly predict 682 and incorrectly predicted four, for an accuracy of 99.4%. And we could even try the k-neighbors classifier as the model instead.
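Putting the whole program together as described — read the CSV, split by hand, fit, predict, count — it looks roughly like this. This is a reconstruction from the description above rather than the original file, so details like the header row and the exact column layout are assumptions:

import csv
import random

from sklearn.linear_model import Perceptron
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

model = Perceptron()  # swap in SVC() or KNeighborsClassifier(n_neighbors=1) to compare

# Read data in from the file: four measurements, then a 0/1 label
with open("banknotes.csv") as f:
    reader = csv.reader(f)
    next(reader)  # assumes a header row
    data = []
    for row in reader:
        data.append({
            "evidence": [float(cell) for cell in row[:4]],
            "label": "Authentic" if row[4] == "0" else "Counterfeit"
        })

# Separate data into a training set and a testing set
holdout = int(0.5 * len(data))
random.shuffle(data)
testing = data[:holdout]
training = data[holdout:]

# Train the model on the training set
X_training = [row["evidence"] for row in training]
y_training = [row["label"] for row in training]
model.fit(X_training, y_training)

# Make predictions on the testing set
X_testing = [row["evidence"] for row in testing]
y_testing = [row["label"] for row in testing]
predictions = model.predict(X_testing)

# Compute how well we performed
correct = sum(1 for actual, predicted in zip(y_testing, predictions)
              if actual == predicted)
incorrect = len(y_testing) - correct
print(f"Results for model {type(model).__name__}")
print(f"Correct: {correct}")
print(f"Incorrect: {incorrect}")
print(f"Accuracy: {100 * correct / len(y_testing):.2f}%")

The simplified version mentioned below replaces the manual shuffle-and-slice with scikit-learn’s train_test_split(X, y, test_size=0.5).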
The k-neighbors classifier takes a parameter, n_neighbors, for how many neighbors you want to look at. Let’s just look at one neighbor, the one nearest neighbor, and use that to predict. Go ahead and run this as well. And it looks like, based on the k-neighbors classifier, looking at just one neighbor, we were able to correctly classify 685 data points and incorrectly classified one. Maybe let’s try three neighbors instead of just one, doing more of a k-nearest-neighbors approach, where I look at the three nearest neighbors and see how that performs. And that one, in this case, seems to have gotten 100% of the predictions correct, with every bill classified as either authentic or counterfeit correctly. Now, we could run these experiments multiple times; because I’m randomly reorganizing the data every time, we’re technically training these on slightly different data sets, and so you might want to run multiple experiments to really see how well they’re actually going to perform. But in short, they all perform very well. And while some of them perform slightly better than others here, that might not always be the case for every data set. But you can begin to test now, by very quickly putting together these machine learning models using Scikit-learn, to be able to train on some training set and then test on some testing set as well.

And this splitting up into training groups and testing groups happens so often that Scikit-learn has functions built in for doing it. I did it all by hand just now, but if we take a look at banknote1.py, we take advantage of some other features that exist in Scikit-learn that let us really simplify a lot of our logic. There is a function built into Scikit-learn called train_test_split, which will automatically split data into a training group and a testing group; I just have to say what proportion should be in the testing group, something like 0.5, half the data inside the testing group. Then I can fit the model on the training data, make the predictions on the testing data, and then just count up, and Scikit-learn has some nice methods for doing this, how many times our testing data matched the predictions and how many times it didn’t. So very quickly, you can write programs with not all that many lines of code, maybe like 40 lines, to get through all of these predictions and then, as a result, see how well we’re able to do. So these types of libraries can allow us, without really knowing the implementation details of these algorithms, to use the algorithms in a very practical way to solve these types of problems.

So that, then, was supervised learning, this task of, given a whole set of data, some input-output pairs, we would like to learn some function that maps those inputs to those outputs. But it turns out there are other forms of learning as well. Another popular type of machine learning, especially nowadays, is known as reinforcement learning. The idea of reinforcement learning is that rather than just being given a whole data set of input-output pairs at the beginning, reinforcement learning is all about learning from experience. In reinforcement learning, our agent, whether it’s a physical robot that’s trying to take actions in the world or just some virtual agent that is a program running somewhere, is going to be given a set of rewards or punishments in the form of numerical values; but you can think of them as reward or punishment.
And based on that, it learns what actions to take in the future. Our agent, our AI, will be put in some sort of environment. It will take some actions, and based on the actions that it takes, it learns something: it either gets a reward when it does something well or gets a punishment when it does something poorly, and it learns what to do or what not to do in the future based on those individual experiences.

So what this will often look like is that we start with some agent, some AI, which might, again, be a physical robot, if you’re imagining a physical robot moving around, but it can also just be a program. Our agent is situated in its environment, where the environment is where the agent is going to take its actions, and it’s what’s going to give the agent rewards or punishments for the various actions it takes. So for example, the environment is going to start off by putting our agent inside of a state. Our agent has some state that, in a game, might be the state of the game that the agent is playing; in a world that the agent is exploring, the state might be some position inside of a grid representing that world. The agent is in some sort of state, and in that state, the agent needs to choose to take an action. The agent likely has multiple actions it can choose from, but it picks an action. So it takes an action in a particular state, and as a result of that, the agent will generally get two things in response, as we model them. The agent gets a new state that it finds itself in: after being in this state and taking one action, it ends up in some other state. And it’s also given some sort of numerical reward: positive meaning reward, meaning it was a good thing, and negative generally meaning it did something bad, that it received some sort of punishment. That is all the information the agent has. It’s told what state it’s in, it takes some sort of action, and based on that, it ends up in another state and gets some particular reward. And it needs to learn, based on that information, what actions to begin to take in the future.

And so you could imagine generalizing this to a lot of different situations. This is oftentimes how you train those robots you may have seen that are now able to walk around the way humans do. It would be quite difficult to program such a robot in exactly the right way to get it to walk the way humans do. You could instead train it through reinforcement learning: give it some sort of numerical reward every time it does something good, like take steps forward, and punish it every time it does something bad, like fall over, and then let the AI just learn, based on that sequence of rewards, based on trying to take various different actions, what to do in the future and what not to do.

So in order to begin to formalize this, the first thing we need to do is formalize this notion of states and actions and rewards: what does this world look like? Oftentimes, we’ll formulate this world as what’s known as a Markov decision process, similar in spirit to Markov chains, which you might recall from before. A Markov decision process is a model that we can use for decision making, for an agent trying to make decisions in its environment.
And it’s a model that allows us to represent the various different states that an agent can be in, the various different actions that it can take, and also what the reward is for taking one action as opposed to another action. So what does it actually look like? Well, if you recall a Markov chain from before, a Markov chain looked a little something like this, where we had a whole bunch of individual states, and each state immediately transitioned to another state based on some probability distribution. We saw this in the context of the weather before, where if it was sunny, we said that with some probability it’ll be sunny the next day, and with some other probability it’ll be rainy, for example. But we could also imagine generalizing this. It’s not just sun and rain anymore; we just have these states, where one state leads to another state according to some probability distribution. In this original model, though, there was no agent that had any control over this process. It was entirely probability based: with some probability, we moved to this next state, but maybe it’s going to be some other state with some other probability.

What we’ll now have is the ability for the agent in a state to choose from a set of actions, where maybe instead of just one path forward, it has three different choices of actions that each lead down different paths. And even this is a bit of an oversimplification, because in each of these states, you might imagine more branching points where there are more decisions that can be taken as well. So we’ve extended the Markov chain to say that from a state, you now have available action choices, and each of those actions might be associated with its own probability distribution over the various different states. Then in addition, we’ll add another extension: any time you move from a state, taking an action, going into this other state, we can associate a reward with that outcome, saying either that r is positive, meaning some positive reward, or that r is negative, meaning there was some sort of punishment.

This, then, is what we’ll consider to be a Markov decision process. A Markov decision process has some set of states, the states in the world that we can be in. We have some set of actions, such that given a state, I can ask, what are the actions that are available to me in that state, the actions that I can choose from? Then we have some transition model. The transition model before just said, given my current state, what is the probability that I end up in that next state or this other state? The transition model now effectively has two things we’re conditioning on: we’re saying, given that I’m in this state and that I take this action, what’s the probability that I end up in this next state? Now, maybe we live in a very deterministic world in this Markov decision process, where given a state and given an action, we know for sure what next state we’ll end up in. But maybe there’s some randomness in the world, where, when you’re in a state and you take an action, you might not always end up in the exact same state; there might be some probabilities involved there as well. The Markov decision process can handle both of those possible cases. And then finally, we have a reward function, generally called r, that in this case says, what is the reward for being in this state, taking this action, and then getting to s prime, this next state? So I’m in this original state, I take this action, and I get to this next state.
What is the reward for doing that process? And you can add up these rewards every time you take an action to get the total amount of rewards that an agent might get from interacting in a particular environment modeled using this Markov decision process.

So what might this actually look like in practice? Well, let’s just create a little simulated world here where I have this agent that is just trying to navigate its way around. This agent is this yellow dot here, like a robot in the world, trying to navigate its way through this grid, and ultimately trying to find its way to the goal. If it gets to the green goal, then it’s going to get some sort of reward. But then we might also have some red squares that are places where you get some sort of punishment, some bad place where we don’t want the agent to go. And if it ends up in a red square, then our agent is going to get some sort of punishment as a result.

But the agent originally doesn’t know all of these details. It doesn’t know that these states are associated with punishments. Maybe it knows that this state is associated with a reward; maybe it doesn’t. It just needs to interact with the environment to try and figure out what to do and what not to do. So the first thing the agent might do is, given no additional information, not knowing what the punishments are or where the rewards are, just try to take an action. And it takes an action and ends up realizing that it got some sort of punishment. So what does it learn from that experience? Well, it might learn that when you’re in this state in the future, don’t take the action of moving to the right, that that is a bad action to take; in the future, if you ever find yourself back in this state, don’t take this action of going to the right, because it leads to punishment. That might be the intuition, at least. And so you could try doing other actions. You move up; all right, that didn’t lead to any immediate rewards. Maybe try something else, then maybe try something else again. And all right, now you found that you got another punishment, and so you learn something from that experience. The next time you go through this whole process, you know that if you ever end up in this square, you shouldn’t take the down action, because being in this state and taking that action ultimately leads to some sort of punishment, a negative reward, in other words.

And this process repeats. You might imagine just letting our agent explore the world, learning over time which states tend to correspond with poor actions, until eventually, if it tries enough things randomly, it might find that when you get to this state, if you take the up action, you actually get a reward. And what it can learn from that is that if you’re in this state, you should take the up action, because that leads to a reward. And over time, it can also learn that if you’re in this state, you should take the left action, because that leads to the state that eventually lets you get to the reward. So you begin to learn over time not only which actions are good in particular states, but also which actions are bad, such that once you know some sequence of good actions that leads you to some sort of reward, our agent can just follow those instructions, follow the experience that it has learned.
We didn’t tell the agent what the goal was. We didn’t tell the agent where the punishments were. But the agent can begin to learn from this experience and begin to perform these sorts of tasks better in the future.

So let’s now try to formalize this idea, formalize the idea that we would like to be able to learn, for this state and this action, whether taking that action is a good thing or a bad thing. There are lots of different models for reinforcement learning; we’re just going to look at one of them today, a method known as Q-learning. What Q-learning is all about is learning a function, a function Q, that takes inputs s and a, where s is a state and a is an action that you take in that state. And what this Q function is going to do is estimate the value: how much reward will I get from taking this action in this state? Originally, we don’t know what this Q function should be, but over time, based on experience, based on trying things out and seeing what the result is, I would like to learn what Q(s, a) is for any particular state and any particular action that I might take in that state.

So what is the approach? Well, the approach is to originally start with Q(s, a) equal to 0 for all states s and for all actions a. Initially, before I’ve started anything, before I’ve had any experiences, I don’t know the value of taking any action in any given state, so I’m going to assume that the value is just 0 all across the board. But then, as I interact with the world, as I experience rewards or punishments, or maybe I go to a cell where I don’t get either a reward or a punishment, I want to somehow update my estimate of Q(s, a). I want to continually update my estimate of Q(s, a) based on the experiences and rewards and punishments that I’ve received, such that in the future, my knowledge of which actions are good in which states will be better.

So when we take an action and receive some sort of reward, I want to estimate the new value of Q(s, a). And I estimate that based on a couple of different things. I estimate it based on the reward that I’m getting from taking this action and getting into the next state. But assuming the situation isn’t over, assuming there are still future actions that I might take as well, I also need to take into account the expected future rewards. If you imagine an agent interacting with the environment, sometimes you’ll take an action and get a reward, but then you can keep taking more actions and get more rewards, and both of these are relevant: both the current reward I’m getting from this current step and also my future rewards. It might be the case that I’ll want to take a step that doesn’t immediately lead to a reward, because later on down the line, I know it will lead to more rewards. So there’s a balancing act between current rewards that the agent experiences and future rewards that the agent experiences as well.

And then we need to update Q(s, a): we estimate the value of Q(s, a) based on the current reward and the expected future rewards, and then we need to update this Q function to take into account this new estimate. As we go through this process, we’ll already have an estimate for what we think the value is; now we have a new estimate, and then somehow we need to combine these two estimates together. We’ll look at more formal ways that we can actually do that.
So to actually show you what this formula looks like, here is the approach we’ll take with Q-learning. We’re going to, again, start with Q(s, a) being equal to 0 for all states. And then, every time we take an action a in state s and observe a reward r, we’re going to update our value, our estimate, for Q(s, a). And the idea is that we’re going to figure out what the new value estimate is, minus what our existing value estimate is. We have some preconceived notion of the value of taking this action in this state; maybe our expectation is that the value is currently 10. But then we’re going to estimate what we now think it’s going to be; maybe the new value estimate is something like 20. So there’s a delta of 10: our new value estimate is 10 points higher than our current value estimate happens to be.

And so we have a couple of options here. We need to decide how much we want to adjust our current expectation of the value of taking this action in this particular state. And what that difference is, how much we add or subtract from our existing notion of how much we expect the value to be, is dependent on a parameter alpha, also called a learning rate. Alpha represents, in effect, how much we value new information compared to how much we value old information. An alpha value of 1 means we really value new information: if we have a new estimate, then it doesn’t matter what our old estimate is; we’re only going to consider our new estimate, because we always just want to take into consideration our new information. The way that works is that if you imagine alpha being 1, well, then we’re taking the old value of Q(s, a) and adding 1 times the new value minus the old value, and that just leaves us with the new value. So when alpha is 1, all we take into consideration is what our new estimate happens to be. But over time, as we go through a lot of experiences, we already have some existing information. We might have tried taking this action nine times already, and now we’ve just tried it a tenth time. We don’t only want to consider this tenth experience; my prior nine experiences, those were meaningful too, and that’s data I don’t necessarily want to lose. So this alpha controls that decision, controls how important the new information is. An alpha of 0 would mean to ignore all the new information and just keep the Q value the same; 1 means to replace the old information entirely with the new information; and somewhere in between keeps some sort of balance between these two values.

We can put this equation a little bit more formally as well. The old value estimate is our old estimate for the value of taking this action in a particular state; that’s just Q(s, a). So we have it once here, and we’re going to add something to it: alpha times the new value estimate minus the old value estimate, where the old value estimate we just look up by calling this Q function. And what then is the new value estimate? Based on this experience we have just had, what is our new estimate for the value of taking this action in this particular state? Well, it’s going to be composed of two parts: what reward I just got from taking this action in this state, and what I can expect my future rewards to be from this point forward. So it’s going to be r, some reward I’m getting right now, plus whatever I estimate I’m going to get in the future.
And how do I estimate what I’m going to get in the future? Well, it’s a bit of another call to this Q function: take the maximum across all possible actions I could take next and ask, of all of these possible actions I could take, which one is going to have the highest reward? This then looks a little bit complicated, but this is going to be our notion for how we’re going to perform this kind of update. I have some estimate, some old estimate, of the value of taking this action in this state, and I’m going to update it based on new information: I experience some reward, I predict what my future reward is going to be, and using that, I update my estimate of the reward for taking this action in this particular state.

And there are other additions you might make to this algorithm as well. Sometimes you might not want to weight future rewards equally to current rewards. Maybe you want an agent that values reward now over reward later. And so sometimes you can even add another term in here, some other parameter, where you discount future rewards and say future rewards are not as valuable as immediate rewards, that getting a reward in the current time step is better than waiting a year and getting rewards later. But that’s up to the programmer to decide what that parameter ought to be. The big picture idea of this entire formula is to say that every time we experience some new reward, we take it into account. We update our estimate of how good this action is, and then in the future, we can make decisions based on that estimate. Once we have some good estimate, for every state and for every action, of the value of taking that action, then we can do something like implement a greedy decision-making policy: if I am in a state and I want to know what action to take in that state, I consider, for all of my possible actions, what the value of Q(s, a) is, my estimated value of taking that action in that state, and I just pick the action that has the highest value after I evaluate that expression. At any given state that I’m in, I can just greedily say, across all my actions, this action gives me the highest expected value, and so I’ll choose that action.

But there is a downside to this kind of approach, and that downside comes up in a situation like this, where we know that there is some solution that gets me to the reward, and our agent has been able to figure that out. But it might not necessarily be the best way, or the fastest way. If the agent is allowed to explore a little bit more, it might find that it can get the reward faster by taking some other route instead, by going through some other path that is a faster way to get to that ultimate goal, and maybe we would like for the agent to be able to figure that out as well. But if the agent always takes the actions that it knows to be best, well, when it gets to a particular square, it doesn’t know that some action is good, because it’s never really tried it. It knows that going down eventually leads its way to the reward, so it might learn that it should just always take this route, and it’s never going to explore and go along that other route instead. So in reinforcement learning, there is this tension between exploration and exploitation.
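Before digging into that tension, here is the update just described as a small sketch, with a plain dictionary standing in for the Q function; the function names are hypothetical, and alpha is the learning rate from above:

q = {}  # maps (state, action) pairs to value estimates

def get_q(state, action):
    return q.get((state, action), 0)  # 0 for anything we have never tried

def update(state, action, new_state, reward, next_actions, alpha=0.5):
    # Expected future rewards: the best Q value available from the new state
    future = max((get_q(new_state, a) for a in next_actions), default=0)
    old_estimate = get_q(state, action)
    new_estimate = reward + future
    # Q(s, a) <- old estimate + alpha * (new estimate - old estimate)
    q[(state, action)] = old_estimate + alpha * (new_estimate - old_estimate)

def best_action(state, actions):
    # Greedy policy: pick the action with the highest estimated value
    return max(actions, key=lambda a: get_q(state, a))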
And exploitation generally refers to using knowledge that the AI already has: the AI already knows that this is a move that leads to reward, so it will go ahead and use that move. And exploration is all about exploring other actions that we may not have explored as thoroughly before, because maybe one of these actions, even if I don’t know anything about it, might lead to better rewards faster or to more rewards in the future. An agent that only ever exploits information and never explores might be able to get reward, but it might not maximize its rewards, because it doesn’t know what other possibilities are out there, possibilities that we only know about by taking advantage of exploration.

So how can we try and address this? Well, one possible solution is known as the epsilon-greedy algorithm, where we set epsilon equal to how often we want to just make a random move, in order to say, let’s try to explore and see what happens. The logic of the algorithm is then: with probability 1 minus epsilon, choose the estimated best move. In a purely greedy case, we’d always choose the best move; but in epsilon-greedy, most of the time we’re going to choose the best move, and sometimes, with probability epsilon, we’re going to choose a random move instead. So every time we’re faced with the ability to take an action, sometimes we’re going to choose the best move, and sometimes we’re just going to choose a random move. This type of algorithm can be quite powerful in a reinforcement learning context, by not always just choosing the best possible move right now but sometimes, especially early on, allowing yourself to make random moves that let you explore various different possible states and actions more. And maybe over time, you might decrease your value of epsilon, more and more often choosing the best move once you’re more confident that you’ve explored what all of the possibilities actually are.

So we can put this into practice, and one very common application of reinforcement learning is in game playing. If you want to teach an agent how to play a game, you just let the agent play the game a whole bunch. The reward signal happens at the end of the game: when the game is over, if our AI won the game, it gets a reward of, like, 1, and if it lost the game, it gets a reward of negative 1. And from that, it begins to learn what actions are good and what actions are bad. You don’t have to tell the AI what’s good and what’s bad; the AI figures it out based on that reward. Winning the game is one signal, losing the game is another, and based on all of that, it begins to figure out what decisions it should actually make.

So one very simple game, which you may have played before, is a game called Nim. In the game of Nim, you’ve got a whole bunch of objects in a whole bunch of different piles, where here I’ve represented each pile as an individual row. You’ve got one object in the first pile, three in the second pile, five in the third pile, and seven in the fourth pile. The game of Nim is a two-player game where players take turns removing objects from piles, and the rule is that on any given turn, you are allowed to remove as many objects as you want from any one of these piles, any one of these rows. You have to remove at least one object, but you remove as many as you want from exactly one of the piles. And whoever takes the last object loses.
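Returning to epsilon-greedy for a moment, it is only a few lines on top of the get_q sketch above (again, a sketch rather than a full implementation):

import random

def choose_action(state, actions, epsilon=0.1):
    # With probability epsilon, explore: make a random move.
    # Otherwise (probability 1 - epsilon), exploit: make the estimated best move.
    if random.random() < epsilon:
        return random.choice(list(actions))
    return max(actions, key=lambda a: get_q(state, a))

In the Nim games that follow, a policy like this is what lets the AI stumble onto good moves early on and then exploit them once epsilon is dialed down.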
So player one might remove four from this pile here. Player two might remove four from this pile here. So now we’ve got four piles left: one, three, one, and three. Player one might remove the entirety of the second pile. Player two, if they’re being strategic, might remove two from the third pile. Now we’ve got three piles left, each with one object left. Player one might remove one from one pile, player two removes one from another pile, and now player one is left with choosing the one object from the last pile, at which point player one loses the game. So, a fairly simple game: piles of objects, on any turn you choose how many objects to remove from a single pile, and whoever removes the last object loses. And this is the type of game you could encode into an AI fairly easily, because the states are really just four numbers: every state is just how many objects are in each of the four piles. And the actions are things like, how many am I going to remove from which one of these individual piles? And the reward happens at the end: if you were the player that had to remove the last object, then you get some sort of punishment; and if you were not, and the other player had to remove the last object, then you get some sort of reward.

So we can actually try a demonstration of this; I’ve implemented an AI to play the game of Nim. All right, so here, what we’re going to do is create an AI as a result of training the AI on some number of games, where the AI is going to play against itself. The idea is that the AI will play games against itself, learn from each of those experiences, and learn what to do in the future. And then I, the human, will play against the AI. Initially, we’ll say train zero times, meaning we’re not going to let the AI play any practice games against itself in order to learn from its experiences; we’re just going to see how well it plays. It looks like there are four piles, and I can choose how many I remove from any one of the piles. So maybe from pile three, I will remove five objects, for example. Now the AI chose to take one item from pile zero, so I’m left with these piles. Here I could choose to say, from pile two, I’ll remove all five of them, for example. The AI chose to take two away from pile one. Now I’m left with one pile that has one object and one pile that has two objects. So from pile three, I will remove two objects, and now I’ve left the AI with no choice but to take that last one. And so the game is over, and I was able to win. But I did so because the AI was really just playing randomly; it didn’t have any prior experience that it was using in order to make these sorts of judgments.

Now let me let the AI train itself on 10,000 games. I’m going to let the AI play 10,000 games of Nim against itself; every time it wins or loses, it’s going to learn from that experience, and learn in the future what to do and what not to do. So here, then, I’ll go ahead and run this again, and now you see the AI running through a whole bunch of training games, 10,000 training games against itself. And now it’s going to let me make these sorts of decisions. So now I’m going to play against the AI. Maybe I’ll remove one from pile three. And the AI took everything from pile three, so I’m left with three piles. I’ll go ahead and, from pile two, maybe remove three items. The AI removes one item from pile zero. I’m left with two piles, each of which has two items in it. I’ll remove one from pile one, I guess.
And the AI took two from pile two, leaving me with no choice but to take one away from pile one. So it seems like, after playing 10,000 games of Nim against itself, the AI has learned something about which states and which actions tend to be good, and has begun to learn some sort of pattern for predicting which actions are going to be good and which are going to be bad in any given state. So reinforcement learning can be a very powerful technique for building these sorts of game-playing agents, agents that are able to play a game well just by learning from experience, whether that's playing against other people or playing against itself. Now, Nim is a fairly easy game to use reinforcement learning for because there are so few states: the only states are the different counts of objects across the various piles. You might imagine it's going to be harder for a game like chess, or other games where there are many, many more states and many, many more actions you could take, where it's not going to be as easy to learn, for every state and for every action, what the value is going to be. Often in that case, we can't learn exactly what the value is for every state and every action, but we can approximate it. Much as we saw with minimax, where we could use a depth-limiting approach to stop calculating at a certain point, we can do a similar type of approximation, known as function approximation, in a reinforcement learning context. Instead of learning a value of Q for every state and every action, we just have some function that estimates the value of taking a given action in a given state, based on various features of the state the agent happens to be in, where you might have to choose what those features actually are. But you can begin to learn patterns that generalize beyond one specific state and one specific action, learning whether certain features tend to be good things or bad things. Reinforcement learning can then, using a very similar mechanism, generalize beyond one particular state and say: if this other state looks kind of like this state, then maybe the types of actions that worked in one state will also work in the other. This type of approach can be quite helpful as you begin to deal with reinforcement learning problems that exist in larger and larger state spaces, where it's just not feasible to explore all of the possible states that could actually exist.
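The transcript mentions learning a value of Q for every state and every action; one standard way to implement that update, and a reasonable guess at what an agent like this is built on, is Q-learning. This is a minimal sketch under that assumption, not the demo's exact code:

```python
def update_q(q, state, action, reward, best_future, alpha=0.5):
    """One Q-learning update: nudge the old estimate for (state, action)
    by a step of size alpha toward the observed reward plus the best
    value we estimate to be achievable afterwards."""
    old = q.get((state, action), 0)
    q[(state, action)] = old + alpha * (reward + best_future - old)

# Hypothetical Nim usage: a state is a tuple of pile sizes, an action is a
# (pile, count) pair, and the reward is +1 or -1 only at the end of a game.
q = {}
update_q(q, state=(1, 3, 5, 7), action=(2, 5), reward=0, best_future=0.4)
```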
So those, then, are two of the main categories of machine learning: supervised learning, where you have labeled input and output pairs, and reinforcement learning, where an agent learns from rewards or punishments that it receives. The third major category of machine learning, which we'll just touch on briefly, is known as unsupervised learning. Unsupervised learning happens when we have data without any additional feedback, without labels. In the supervised learning case, all of our data had labels: we labeled each data point as a rainy day or a not-rainy day, and using those labels we were able to infer the pattern. Or we labeled data as a counterfeit banknote or not a counterfeit, and using those labels we were able to draw inferences to figure out what a genuine banknote looks like versus a counterfeit one. In unsupervised learning, we don't have access to any of those labels, but we would still like to learn some of those patterns. One of the tasks you might want to perform in unsupervised learning is clustering, which is just the task of, given some set of objects, organizing them into distinct clusters: groups of objects that are similar to one another. And there are lots of applications for clustering. It comes up in genetic research, where you might have a whole bunch of different genes and want to cluster similar genes together when analyzing them across a population or across species. It comes up in image processing, if you want to take all the pixels of an image and cluster them into different parts of the image. It comes up a lot in market research, if you want to divide your consumers into groups so you know which groups to target with certain types of product advertisements, and in a number of other contexts as well. One technique for clustering is an algorithm known as k-means clustering. What k-means clustering does is divide all of our data points into k different clusters, and it does so by repeating a process of assigning points to clusters and then moving around those clusters' centers. We define a cluster by its center, the middle of the cluster, and then assign points to a cluster based on which center is closest to each point. I'll show you an example of that now. Here, for example, I have a whole bunch of unlabeled data, just various data points in some sort of graphical space, and I would like to group them into different clusters, but I don't know at the outset how to do that. Let's say I want to assign three clusters to this group; you have to choose how many clusters you want in k-means clustering, though you could try multiple values and see how well each performs. I'll start just by randomly picking places to put the centers of those clusters. Maybe I have a blue cluster, a red cluster, and a green cluster, with the centers of those clusters starting at these three locations here. And what k-means clustering tells us to do is, once I have the centers of the clusters, assign every point to a cluster based on which cluster center it is closest to. So we end up with something like this, where all of these points are closer to the blue cluster center than to any other, all of these points here are closer to the green cluster center than to any other, and these two points plus these points over here are all closest to the red cluster center instead. Here, then, is one possible assignment of all these points to three different clusters. But it's not great: in this red cluster the points are kind of far apart, and in this green cluster the points are kind of far apart. It might not be my ideal choice of how to cluster these data points. But k-means clustering is an iterative process: after I've assigned all of the points to the cluster center each is nearest to, there is a next step, which is to re-center the clusters, meaning take the cluster centers, these diamond shapes here, and move them to the middle, or the average, effectively, of all of the points that are in that cluster.
So we'll take this blue center and move it to the middle of all of the points that were assigned to the blue cluster, moving it slightly to the right in this case. We'll do the same thing for red, moving the cluster center to the middle of all of its points; there are more points over here, so the red center ends up pulled a little bit further that way. And likewise for the green center: there are many more points on this side of it, so the green center ends up pulled a little bit further in this direction. So we re-center all of the clusters, and then we repeat the process: we reassign all of the points to the cluster center they are now closest to. Now that we've moved the cluster centers around, these cluster assignments might change: this point was originally closer to the red cluster center, but now it's actually closer to the blue cluster center, and the same goes for this point as well. And these three points that were originally closer to the green cluster center are now closer to the red cluster center instead. So we reassign which clusters each of these data points belongs to, and then repeat the process again, moving each of these cluster centers, the middles of the clusters, to the mean, the average, of all of the points assigned to it, and then once more assigning each of the points to the cluster it is closest to. Once we reach a point where we've assigned all of the points to the cluster they are nearest to and nothing changes, we've reached a sort of equilibrium: no points are changing their allegiance, and as a result, we can declare the algorithm over. We now have an assignment of each of these points into three different clusters, and it looks like we did a pretty good job of identifying which points are more similar to one another than they are to points in other groups. We have the green cluster down here, this blue cluster here, and then this red cluster over there as well. And we did so without any labels telling us what these clusters were; we just used an algorithm in an unsupervised sense, without any of those labels, to figure out which points belonged to which groups. Again, there are lots of applications for this type of clustering technique.
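As a compact sketch of the loop just described, assuming two-dimensional points represented as (x, y) tuples:

```python
import random

def k_means(points, k, iterations=100):
    """A minimal k-means sketch: alternate between assigning points to
    their nearest center and moving each center to the mean of the
    points assigned to it, until nothing changes."""
    centers = random.sample(points, k)          # random starting centers
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x, y in points:
            nearest = min(range(k),
                          key=lambda i: (x - centers[i][0]) ** 2 + (y - centers[i][1]) ** 2)
            clusters[nearest].append((x, y))
        new_centers = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
                       if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:              # equilibrium: nothing changed
            break
        centers = new_centers
    return centers, clusters
```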
And there are many more algorithms in each of these various fields within machine learning, supervised and reinforcement and unsupervised. But those are many of the big-picture, foundational ideas that underlie a lot of these techniques: these are the problems we're trying to solve, and we try to solve them using a number of different methods of taking data and learning patterns in that data, whether that's finding neighboring data points that are similar, or minimizing some sort of loss function, or any number of other techniques. That, then, was a look at some of the principles at the foundation of modern machine learning: this ability to take data and learn from it, so that the computer can perform a task even if it hasn't explicitly been given instructions to do so. Next time, we'll continue this conversation about machine learning, looking at other techniques we can use for solving these sorts of problems. We'll see you then. All right, welcome back, everyone, to an introduction to artificial intelligence with Python. Last time, we took a look at machine learning, a set of techniques that computers can use to take a set of data, learn patterns inside that data, and learn how to perform a task, even if we, the programmers, didn't give the computer explicit instructions for how to perform it. Today, we transition to one of the most popular techniques and tools within machine learning, that of neural networks. Neural networks were inspired, as early as the 1940s, by researchers who were thinking about how it is that humans learn: studying neuroscience and the human brain, and trying to see whether we could apply those same ideas to computers as well, modeling computer learning off of human learning. So how is the brain structured? Very simply put, the brain consists of a whole bunch of neurons, and those neurons are connected to one another and communicate with one another in some way. In particular, if you think about the structure of a biological neural network, there are a couple of key properties that scientists observed. One was that these neurons are connected to each other and receive electrical signals from one another: one neuron can propagate electrical signals to another neuron. Another is that neurons process those input signals and can then be activated: a neuron becomes activated at a certain point and can then propagate further signals onto other neurons. And so the question became, could we take this biological idea of how humans learn, with brains and with neurons, and apply it to a machine as well, in effect designing an artificial neural network, or ANN: a mathematical model for learning inspired by these biological neural networks? What artificial neural networks allow us to do, first, is model some sort of mathematical function. Every neural network, as we'll see more of later today, is really just some mathematical function that maps certain inputs to particular outputs based on the structure of the network; depending on where we place particular units inside the network, that's going to determine how the network functions. In particular, artificial neural networks lend themselves to a way that we can learn what the network's parameters should be. We'll see more on that in just a moment, but in effect, we want a model such that it is easy to write code that lets the network figure out how to model the right mathematical function, given a particular set of input data. So, to create our artificial neural network, instead of using biological neurons, we're going to use what we'll call units, units inside a neural network, which we can represent kind of like a node in a graph, drawn here as a blue circle. And these artificial units, these artificial neurons, can be connected to one another. Here, for instance, we have two units connected by this edge inside of this graph, effectively.
What we're going to do now is think of this idea as some sort of mapping from inputs to outputs: we have one unit connected to another unit, where we might think of this side as the input and that side as the output. What we're trying to do, then, is figure out how to solve a problem, how to model some sort of mathematical function. This might take the form of something we saw last time: we have certain inputs, like variables x1 and x2, and given those inputs, we want to perform some sort of task, like predicting whether or not it's going to rain. Ideally, we'd like some way, given these inputs x1 and x2, which stand for some variables to do with the weather, to predict, in this case, a Boolean classification: is it going to rain, or is it not going to rain? We did this last time by way of a mathematical function. We defined some hypothesis function h that took as input x1 and x2, the two inputs we cared about processing, in order to determine whether we thought it was going to rain or not. The question then becomes, what does this hypothesis function do in order to make that determination? We decided last time to use a linear combination of these input variables to determine what the output should be. So our hypothesis function was equal to something like this: weight 0, plus weight 1 times x1, plus weight 2 times x2, that is, w0 + w1·x1 + w2·x2. What's going on here is that x1 and x2 are the input variables, the inputs to this hypothesis function, and each of those input variables is multiplied by some weight, which is just some number: x1 is multiplied by weight 1, and x2 by weight 2. And we have this additional weight, weight 0, that doesn't get multiplied by an input variable at all; it just serves to move the function's value up or down. You can think of it as a weight that's multiplied by a dummy value, like the number 1, so its value passes through unchanged. Or, as you'll sometimes see in the literature, people call this variable weight 0 a bias, so that you can think of these variables as slightly different: we have weights that are multiplied by the inputs, and we separately add some bias to the result as well. You'll hear both of those terminologies used when people talk about neural networks and machine learning. In effect, to define a hypothesis function, we just need to decide and figure out what these weights should be, to determine what values to multiply by our inputs to get some sort of result. Of course, at the end of this, what we need to do is make some sort of classification, like rainy or not rainy, and to do that, we use some sort of function that defines a threshold. We saw, for instance, the step function, which is defined as 1 if the result of multiplying the weights by the inputs is at least 0, and 0 otherwise. You can think of the threshold as a dividing line: the function stays at 0 all the way up to one point, and then it steps, or jumps, up to 1. So it's 0 before it reaches the threshold, and 1 after it reaches it.
This was one way we could define what we'll come to call an activation function: a function that determines when this output becomes active, changing to a 1 instead of a 0. But we also saw that if we didn't want a purely binary classification, purely 1 or 0, and instead wanted to allow for in-between, real-numbered values, we could use a different function. There are a number of choices, but the one we looked at was the logistic sigmoid function, which has an s-shaped curve, where we can represent the output as a probability somewhere in between: maybe the probability of rain is something like 0.5, and a little later the probability of rain is 0.8. So rather than just a binary classification of 0 or 1, we allow for numbers in between as well. And it turns out there are many other types of activation functions, where an activation function just takes the result of multiplying the weights by the inputs and adding the bias, and figures out what the actual output should be. Another popular one is the rectified linear unit, otherwise known as ReLU, and the way that works is that it just takes its input and returns the maximum of that input and 0. So if the input is positive, it remains unchanged, but if it's negative, it levels out at 0. And there are other activation functions we could choose as well. In short, you can think of each of these activation functions as a function that gets applied to the result of all of this computation: we take some function g and apply it to the result of all of that calculation.
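The three activation functions mentioned so far are easy to write out directly; as a small reference sketch:

```python
import math

def step(z):
    return 1 if z >= 0 else 0        # hard threshold: jumps from 0 to 1

def sigmoid(z):
    return 1 / (1 + math.exp(-z))    # s-shaped curve: output in (0, 1)

def relu(z):
    return max(0, z)                 # positive inputs unchanged; negative -> 0
```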
This, then, is what we saw last time: a way of defining some hypothesis function that takes in inputs, calculates some linear combination of those inputs, and passes it through some sort of activation function to get our output. And this turns out to be the model for the simplest of neural networks: we're going to represent this mathematical idea graphically, using a structure like this. Here is a neural network that has two inputs, which we can think of as x1 and x2, and one output, which we can think of as classifying whether or not we think it's going to rain in this particular instance, for example. How exactly does this model work? Each of these two inputs represents one of our input variables, x1 and x2. Notice that these inputs are connected to the output via these edges, which each have a weight associated with them, weight 1 and weight 2. The output unit then calculates an output based on those inputs and those weights: it multiplies each input by its weight, adds in the bias term, which you can think of as an extra w0 term, and then passes the result through an activation function. This is just a graphical way of representing the same idea we saw mathematically last time, and we'll call it a very simple neural network. We'd like this neural network to be able to learn how to calculate some function: we want some function for the neural network to learn, and the neural network is going to learn, what should the values of w0, w1, and w2 be, and what should the activation function be, in order to get the result we would expect? So we can actually take a look at an example of this. What, then, is a very simple function we might calculate? If we recall from when we were looking at propositional logic, one of the simplest functions we looked at was the OR function, which takes two inputs, x and y, and outputs 1, otherwise known as true, if either one of the inputs or both of them are 1, and outputs 0 if both of the inputs are 0, or false. So this is the OR function, and this was its truth table: as long as either of the inputs is 1, the output of the function is 1, and the only case where the output is 0 is where both of the inputs are 0. The question is, how could we train a neural network to learn this particular function? What would those weights look like? Well, we could do something like this. Here's our neural network, and I'll propose that in order to calculate the OR function, we use a value of 1 for each of the weights, a bias of negative 1, and the step function as our activation function. How does this work? If I wanted to calculate something like 0 OR 0, which we know to be 0 because false or false is false, our output unit is going to calculate this input multiplied by its weight: 0 times 1, that's 0. Same thing here: 0 times 1, that's 0. And we add to that the bias of negative 1, which gives us a result of negative 1. If we plot that on our activation function, negative 1 is here, before the threshold, which means the output is 0; the output is only 1 once we reach the threshold. Since negative 1 is before the threshold, the output this unit provides is 0, and that's what we would expect: 0 OR 0 should be 0. What if instead we had 1 OR 0? In that case, to calculate the output, we again do this weighted sum: 1 times 1 is 1, 0 times 1 is 0, so the sum so far is 1. Add negative 1 to that, and the result is 0. If we plot 0 on the step function, 0 ends up right at the threshold, and so the output here is going to be 1, because 1 OR 0 is 1. That's what we would expect as well. And for one more example, if I had 1 OR 1, what would the result be? 1 times 1 is 1, 1 times 1 is 1, the sum of those is 2, and adding the bias gives me the number 1. 1 plotted on this graph is way over there, well beyond the threshold, so this output is going to be 1 as well. The output is always 0 or 1, depending on whether or not we're past the threshold. This neural network, then, models the OR function, a very simple function, certainly, but it still models it correctly: if I give it the inputs, it will tell me the value of x1 OR x2. And you could imagine trying to do this for other functions as well, like the AND function, which takes two inputs and calculates whether both x and y are true. So if x is 1 and y is 1, then the output of x AND y is 1, but in all the other cases, the output is 0. How could we model that inside a neural network as well? It turns out we can do it in the same way, except instead of negative 1 as the bias, we use negative 2 as the bias instead. What does that look like? Well, 1 AND 1 should be 1, because true and true is true. I take 1 times 1, that's 1; 1 times 1 is 1; so I get a total sum of 2 so far.
Now I add the bias of negative 2, and I get the value 0. And 0, when I plot it on the activation function, is right at the threshold, so the output is going to be 1. But if I had any other input, for example 1 AND 0, the weighted sum is 1 plus 0, which is 1; minus 2 gives us negative 1, and negative 1 is not past the threshold, so the output is going to be 0. Those, then, are some very simple functions that we can model using a neural network with two inputs and one output, where our goal is to figure out what those weights should be in order to determine what the output should be.
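Putting the last few paragraphs together, here is a small sketch of a single unit with a step activation, using exactly the weights and biases proposed above for OR and AND:

```python
def perceptron(inputs, weights, bias):
    """A single unit: weighted sum of the inputs plus a bias,
    passed through the step activation function."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total >= 0 else 0

# The weights and biases proposed above:
OR = lambda x1, x2: perceptron([x1, x2], [1, 1], bias=-1)
AND = lambda x1, x2: perceptron([x1, x2], [1, 1], bias=-2)

assert [OR(0, 0), OR(1, 0), OR(0, 1), OR(1, 1)] == [0, 1, 1, 1]
assert [AND(0, 0), AND(1, 0), AND(0, 1), AND(1, 1)] == [0, 0, 0, 1]
```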
And you could imagine generalizing this to calculate more complex functions as well: maybe, given the humidity and the pressure, we want to calculate the probability that it's going to rain. Or we might want to do a regression-style problem: given some amount of advertising, and maybe what month it is, we want to predict our expected sales for that month. So you could imagine these inputs and outputs being different as well. It turns out that in some problems, we're not just going to have two inputs, and the nice thing about these neural networks is that we can compose multiple units together, making our networks more complex just by adding more units. The network we've been looking at has two inputs and one output, but we could just as easily have three inputs, or even more; we can have arbitrarily many inputs to our problem, all feeding into some output whose value we care about figuring out. How does the math work for figuring out that output? It works in a very similar way. In the case of two inputs, we had two weights, indicated by these edges, and we multiplied the weights by the inputs and added the bias term. We do the same thing in the other cases as well: if I have three inputs, imagine multiplying each of those three inputs by its weight. If I have five inputs, same thing: here I'm saying sum up, for i from 1 to 5, xi multiplied by weight i, that is, w1·x1 + w2·x2 + ... + w5·x5, and then add the bias to that. So this would be a case where there are five inputs into the neural network, but there could be arbitrarily many input nodes, where each time we just sum up all of the input variables multiplied by their weights and add the bias term at the very end. This allows us to represent problems with even more inputs, just by growing the size of our neural network. Now, the next question we might ask is how we train these neural networks. In the case of the OR function and the AND function, they were simple enough that I could just tell you what the weights should be, and you could probably reason through yourself what weights would calculate the output you want. But in general, with functions like predicting sales or predicting whether or not it's going to rain, these are much trickier functions to figure out. We would like the computer to have some mechanism for calculating what the weights should be, some way of setting the weights so that our neural network accurately models the function we care about estimating. It turns out that the strategy for doing this, inspired by the domain of calculus, is a technique called gradient descent. Gradient descent is an algorithm for minimizing loss when you're training a neural network. Recall that loss refers to how bad our hypothesis function happens to be: we can define certain loss functions, and we saw some examples last time, that give us a number for any particular hypothesis, saying how poorly it models the data, how many examples it gets wrong, how much worse or better it is compared to other hypothesis functions we might define. This loss function is just a mathematical function, and when you have a mathematical function, in calculus you can calculate something known as the gradient, which you can think of as a slope: the direction in which the loss function is changing at any particular point. What it tells us is in which direction we should move these weights in order to minimize the amount of loss. Generally speaking, we won't get into the calculus of it, but the high-level idea of gradient descent looks something like this. If we want to train a neural network, we start just by choosing the weights randomly: pick random values for all of the weights in the network. Then we use the input data we have access to in order to train the network, to figure out what the weights should actually be. We repeat this process again and again. The first step is to calculate the gradient based on all of the data points: we look at all the data and figure out the gradient at the place where we currently are, for the current setting of the weights, meaning in which direction we should move the weights to minimize the total loss and make our solution better. Once we've calculated that gradient, which direction we should move in the loss function, we can update the weights accordingly: take a small step in that direction, to try to make our solution a little bit better. The size of the step we take can vary, and you can choose it when you're training a particular neural network. But in short, the idea is: take all the data points, figure out from them in what direction the weights should move, and move the weights one small step in that direction. If you repeat that process over and over, adjusting the weights a little bit at a time based on all the data points, eventually you should end up with a pretty good solution to this sort of problem, at least that's what we would hope to happen. Now, if you look at this algorithm, a good question to ask any time you're analyzing an algorithm is: what is going to be the expensive part of the calculation? What's going to take a lot of work to compute?
In the case of gradient descent, the really expensive part is the "all data points" part: having to take all of the data points and use them to figure out the gradient at this particular setting of the weights. Odds are, in a big machine learning problem where you're trying to solve a big problem with a lot of data, you have a lot of data points to calculate with, and figuring out the gradient based on all of them is expensive. And you'll have to do it many times: you'll likely repeat this process again and again, going through all the data points and taking one small step, over and over, as you figure out the optimal setting of the weights. Ideally, we'd like to train our neural networks faster, to converge more quickly to a good solution. So there are alternatives to standard gradient descent, which looks at all of the data points at once. We can employ a method like stochastic gradient descent, which randomly chooses just one data point at a time to calculate the gradient from, instead of calculating it based on all of the data points. The idea there is that we have some setting of the weights; we pick a data point, and based on that one data point, we figure out in which direction to move all of the weights and move them a small step in that direction; then we take another data point and do it again, repeating the process, maybe looking at each of the data points multiple times, but each time only using one data point to calculate the gradient. Now, using one data point instead of all of them probably gives us a less accurate estimate of what the gradient actually is, but on the plus side, it's much faster to calculate: we can much more quickly compute the gradient from one data point, without doing all of that computational work again and again. So there are trade-offs here between looking at all of the data points and looking at just one. A middle ground that is also quite popular is a technique called mini-batch gradient descent, where the idea is that instead of looking at all of the data or just a single point, we divide our data set into small batches, groups of data points, where you can decide how big a batch is. In short, you look at a small number of points at any given time, hopefully getting a more accurate estimate of the gradient than a single point provides, while not requiring all of the computational effort of looking at every single data point. Gradient descent, then, is the technique we use to train these neural networks, to figure out what the setting of all of these weights should be, if we want an accurate notion of how this function should work, some way of modeling how to transform the inputs into particular outputs.
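To make the three variants concrete, here is a minimal sketch, where `gradient(weights, batch)` is a hypothetical function assumed to return the slope of the loss with respect to each weight on the given batch:

```python
import random

def minibatch_gradient_descent(weights, data, gradient, steps=1000, batch_size=32, lr=0.01):
    """A minimal sketch of mini-batch gradient descent.
    batch_size=len(data) gives standard gradient descent;
    batch_size=1 gives stochastic gradient descent."""
    for _ in range(steps):
        batch = random.sample(data, min(batch_size, len(data)))  # a small group of points
        grads = gradient(weights, batch)
        weights = [w - lr * g for w, g in zip(weights, grads)]   # one small step downhill
    return weights
```

The learning rate `lr` here is the "size of the step" discussed above; choosing it is part of configuring the training process.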
Now, so far, the networks we've taken a look at have all been structured like this: some number of inputs, maybe two or three or five or more, and then one output, predicting, say, rain or no rain, just one particular value. But often in machine learning problems, we don't just care about one output; we might care about an output that has multiple different values associated with it. So in the same way that we could add units to the input layer of a neural network, we can likewise add outputs to the output layer as well. Instead of just one output, you could imagine two outputs, or four outputs, for example, where in each case, as we add more inputs or outputs, if we want to keep the network fully connected between the two layers, we just need to add more weights, so that now each of the input nodes has four weights, one associated with each of the four outputs, and that's true for each of the input nodes. As we add nodes, we add more weights, to make sure that each of the inputs is connected to each of the outputs, so that each output value can be calculated based on what the values of the inputs happen to be. So what might a case be where we want multiple output values? In the case of weather prediction, for example, we might not just care whether it's raining or not raining; there might be multiple categories of weather we'd like to sort the weather into. With just a single output variable, we can do a binary classification, like rain or no rain, 1 or 0, but it doesn't allow us to do much more than that. With multiple output variables, I might be able to use each one to predict something a little different. Maybe I want to categorize the weather into one of four categories: is it going to be raining, sunny, cloudy, or snowy? I now have four output variables, which can represent, say, the probability that it is rainy, as opposed to sunny, as opposed to cloudy or snowy. How would this neural network work? We have some input variables that represent data we've collected about the weather, and each of those inputs gets multiplied by each of these various weights. We have more multiplications to do, but these are fairly quick mathematical operations to perform. Then, after passing the results through some sort of activation function in the outputs, we end up getting numbers that you might interpret as probabilities, the probability of one category as opposed to another. So here we're saying that, based on the inputs, we think there's a 10% chance it's raining, a 60% chance it's sunny, a 20% chance it's cloudy, and a 10% chance it's snowy. Given that output, if these represent a probability distribution, you could just pick whichever one has the highest value, in this case sunny, and say that, most likely, this categorization of inputs means the output should be sunny, which is what we would expect the weather to be in this particular instance. This allows us to do multi-class classification, where instead of just a binary classification, 1 or 0, we can have as many different categories as we want, and we can have our neural network output these probabilities over which categories are more likely than others.
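The transcript doesn't say how raw output values get turned into a probability distribution; a common choice is the softmax function, sketched here with hypothetical raw outputs chosen so the result roughly matches the 10/60/20/10 split above:

```python
import math

def softmax(values):
    """One common way to turn raw output values into a probability
    distribution: exponentiate each value and normalize."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

categories = ["rain", "sun", "cloud", "snow"]
probs = softmax([0.1, 1.9, 0.8, 0.1])             # roughly [0.10, 0.60, 0.20, 0.10]
prediction = categories[probs.index(max(probs))]  # pick the most likely: "sun"
```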
And using that data, we're able to draw some sort of inference about what we should do. So this was the idea of supervised machine learning: I can give this neural network a whole bunch of input data paired with labels, output data, like, we know it was raining on this day and sunny on that day, and using all of that data, the algorithm can use gradient descent to figure out what all of the weights should be in order to create a model that hopefully lets us predict what we think the weather is going to be. But neural networks have a lot of other applications as well. You could imagine applying the same sort of idea to a reinforcement learning example, where you remember that in reinforcement learning, what we want to do is train an agent to learn what action to take depending on the state it currently happens to be in. Depending on the current state of the world, we want the agent to pick from the actions available to it. You might model that by having each of the input variables represent some information about the state, some data about the state our agent is currently in, and the outputs, for example, could be the various actions our agent could take: action 1, 2, 3, and 4. The network would work in the same way: based on these particular inputs, we calculate values for each of the outputs, and those outputs could model which actions are better than others; we could then choose, by looking at those outputs, which action to take. So these neural networks are very broadly applicable: all they're really doing is modeling some mathematical function. Anything we can frame as a mathematical function, like classifying inputs into categories, or deciding, based on some input state, which action to take, is something we can attempt to model by taking advantage of this neural network structure, and in particular, by taking advantage of this technique, gradient descent, to figure out what the weights should be. Now, how would you go about training a neural network that has multiple outputs instead of just one? With a single output, we could see what the output should be and update all of the weights that contributed to it. With multiple outputs, at least in this particular case, we can really think of this as four separate neural networks: we have one network with these three inputs and the three weights corresponding to this one output value. The same is true for this output value: it effectively defines another neural network with the same three inputs but a different set of weights corresponding to its output. Likewise, this output has its own set of weights, and the same goes for the fourth output too. So if you wanted to train a neural network with four outputs instead of one, in this case where the inputs are directly connected to the outputs, you could really think of it as training four independent neural networks.
We know, from our input data, what the outputs for each of the four should be, and using that data, we can begin to figure out what all of these individual weights should be. Maybe there's an additional step at the end to turn these values into a probability distribution, so we can interpret which category is more likely than another. So this seems like it does a pretty good job of taking inputs and predicting what the outputs should be, and we'll see some real examples of this in just a moment. But it's important to think about the limitations of this approach, of just taking a linear combination of inputs and passing it through an activation function. It turns out that when we do this in the case of binary classification, trying to predict whether something belongs to one category or another, we can only predict things that are linearly separable. Because we're taking a linear combination of inputs and using it to define some decision boundary, or threshold, what we get is a situation where, with this set of data, we can learn a line that linearly separates the red points from the blue points; but a single unit making a binary classification, otherwise known as a perceptron, can't deal with a situation like this, where there is no straight line that goes through the data and divides the red points from the blue points. It's a more complex decision boundary: the boundary somehow needs to capture the things inside this circle, and there isn't a line that will let us do that. So this is the limitation of the perceptron, these units that make binary decisions based on their inputs: a single perceptron is only capable of learning a linearly separable decision boundary. All it can do is define a line. Sure, it can give us probabilities based on how close we are to that decision boundary, but it can only decide based on a linear boundary. And this doesn't seem like it will generalize well to situations involving real-world data, because real-world data often isn't linearly separable; it often isn't the case that we can just draw a line through the data and divide it into multiple groups. So what is the solution to this? What was proposed was the idea of a multilayer neural network. So far, all of the neural networks we've seen have a set of inputs and a set of outputs, with the inputs connected directly to those outputs. A multilayer neural network is an artificial neural network that still has an input layer and an output layer, but also one or more hidden layers in between: other layers of artificial neurons, or units, that calculate their own values as well. So instead of a neural network that looks like this, with three inputs and one output, you might imagine injecting a hidden layer in the middle, something like this. This is a hidden layer with four nodes; you can choose how many nodes or units go into the hidden layer, and you can have multiple hidden layers as well. Now each of the inputs isn't directly connected to the output: each input is connected to this hidden layer, and then all of the nodes in the hidden layer are connected to the one output.
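Given that structure, the forward computation (described in more detail next) might be sketched like this, with illustrative names and a single hidden layer:

```python
def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias, g):
    """Sketch of a forward pass: inputs -> one hidden layer -> single output.
    `g` is the activation function; each hidden unit takes a weighted sum of
    all inputs plus its bias, and the output unit does the same over the
    hidden units' activations."""
    hidden = [g(sum(w * x for w, x in zip(ws, inputs)) + b)
              for ws, b in zip(hidden_weights, hidden_biases)]
    return g(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)
```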
This is just another step we can take towards calculating more complex functions. Each of these hidden units will calculate its output value, otherwise known as its activation, based on a linear combination of all the inputs. And once we have values for all of these hidden nodes, we do the same thing again: calculate the output for the final node by multiplying each of the hidden units' values by their weights as well. So in effect, the way this works is that we start with the inputs; they get multiplied by weights in order to calculate values for the hidden nodes; and those get multiplied by weights to figure out what the ultimate output is going to be. The advantage of layering things like this is that it gives us the ability to model more complex functions. Instead of just a single decision boundary, a single line dividing the red points from the blue points, each of these hidden nodes can learn a different decision boundary, and we can combine those decision boundaries to figure out what the ultimate output should be. And as you imagine more complex situations, you could imagine each of these nodes learning some useful property, some useful feature, of all of the inputs, with us somehow learning how to combine those features to get the output we actually want. Now, the natural question when we look at this is: how do we train a neural network that has hidden layers inside of it? This initially turns out to be a bit of a tricky question, because the input data we are given provides values for all of the inputs and tells us what the value of the output should be, what the category is, for example, but the data doesn't tell us what the values for all of these hidden nodes should be. We don't know how far off each of these hidden nodes is, because we're only given data for the inputs and the outputs. The reason this is called the hidden layer is that the data made available to us doesn't tell us what the values of these intermediate nodes should actually be. The strategy people came up with was to say: if you know the error, or the loss, on the output node, then, based on what these weights are, if one of these weights is higher than another, you can calculate an estimate for how much of that error was due to each part of the hidden layer, based on the values of those weights. In effect, based on the error from the output, I can backpropagate the error and figure out an estimate for the error of each of the nodes in the hidden layer as well. There's some more calculus here that we won't get into the details of, but the idea of this algorithm is known as backpropagation: an algorithm for training a neural network with one or more hidden layers. The pseudocode for it, if we want to run gradient descent with backpropagation, is this: we start with a random choice of weights, as before, and then repeat the training process again and again. But what we do each time is first calculate the error for the output layer: we know the output and what it should be, and we know what we calculated, so we can figure out the error there.
Then we repeat, for every layer, starting with the output layer, moving back into the last hidden layer, then the hidden layer before that if there are multiple, going all the way back to the very first hidden layer: we propagate the error back one layer. Whatever the error was at the output, we figure out what the error should be one layer before that, based on the values of the weights, and then we can update those weights. Graphically, the way you might think about this is that we first start with the output: we know what the output should be and what output we calculated, and based on that, we can figure out how to update those weights, backpropagating the error to these nodes; and using that, we can figure out how we should update these earlier weights. If there are multiple layers, we can repeat this process again and again to figure out how all of the weights should be updated. This backpropagation algorithm is really the key algorithm that makes neural networks possible: it makes it possible to take these multi-level structures and train them, figuring out how to go about updating the weights in order to create some function that minimizes the total amount of loss, some good setting of the weights that takes the inputs and translates them into the output we expect. And this works, as we said, not just for a single hidden layer: you can imagine multiple hidden layers, where in each hidden layer we define however many nodes we want, and the nodes in one layer connect to the nodes in the next layer, defining more and more complex networks that can model more and more complex types of functions. This type of network is what we might call a deep neural network, part of a larger family of deep learning algorithms, if you've ever heard that term. All deep learning is about is using multiple layers to predict and model higher-level features inside the input, to figure out what the output should be. A deep neural network is just a neural network with multiple of these hidden layers, where we start at the input, calculate values for this layer, then this layer, then this layer, and ultimately get an output. This allows us to model more and more sophisticated types of functions: each of these layers can calculate something a little bit different, and we can combine that information to figure out what the output should be.
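To make that pseudocode concrete, here is a minimal numeric sketch of gradient descent with backpropagation for a tiny network with one hidden layer of sigmoid units and a squared-error loss. It's illustrative only: the XOR data set, the layer sizes, and the learning rate are choices of this sketch, and a given random start may need more units or steps to converge.

```python
import math
import random

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Network: 2 inputs -> 2 hidden sigmoid units -> 1 sigmoid output.
w_h = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]  # hidden weights
b_h = [0.0, 0.0]                                                     # hidden biases
w_o = [random.uniform(-1, 1) for _ in range(2)]                      # output weights
b_o = 0.0
lr = 0.5                                                             # step size

# XOR: a function a single perceptron cannot learn (not linearly separable)
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

for _ in range(10000):
    for x, y in data:
        # Forward pass: inputs -> hidden activations -> output
        h = [sigmoid(sum(w * xi for w, xi in zip(w_h[j], x)) + b_h[j]) for j in range(2)]
        out = sigmoid(sum(w * hj for w, hj in zip(w_o, h)) + b_o)

        # Backward pass: error at the output, then propagated back
        # through the output weights to estimate each hidden unit's error.
        d_out = (out - y) * out * (1 - out)
        d_h = [d_out * w_o[j] * h[j] * (1 - h[j]) for j in range(2)]

        # Update all weights one small step against the gradient.
        for j in range(2):
            w_o[j] -= lr * d_out * h[j]
            b_h[j] -= lr * d_h[j]
            for i in range(2):
                w_h[j][i] -= lr * d_h[j] * x[i]
        b_o -= lr * d_out
```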
Of course, as with any machine learning situation, as we make our models more and more complex in order to model more and more complex functions, the risk we run is something like overfitting. We talked about overfitting last time, in the context of training our models to learn some sort of decision boundary, where overfitting happens when we fit too closely to the training data and, as a result, don't generalize well to other situations. One of the risks we run with a far more complex neural network, with many, many different nodes, is that we might overfit based on the input data: we might grow over-reliant on certain nodes, calculating things based purely on the input data in a way that doesn't generalize well to the output. There are a number of strategies for dealing with overfitting, but one of the most popular in the context of neural networks is a technique known as dropout. What dropout does, while we're training the neural network, is temporarily remove units, these artificial neurons, from our network, chosen at random. The goal is to prevent over-reliance on certain units. What generally happens in overfitting is that we begin to over-rely on certain units inside the neural network to tell us how to interpret the input data. Dropout randomly removes some of these units, reducing the chance that we over-rely on any one of them and making the neural network more robust, able to handle things even when particular neurons are dropped out entirely. The way that might work is: we have a network like this, and as we're training it, when we go to update the weights the first time, we randomly pick some percentage of the nodes to drop out of the network, as if those nodes, and the weights associated with them, aren't there at all, and we train it that way. The next time we update the weights, we pick a different set and train that way, and then again randomly choose and train with other nodes dropped out as well. The goal is that, after the training process, if you trained by dropping out random nodes inside the network, you hopefully end up with a network that's a little bit more robust, one that doesn't rely too heavily on any one particular node, but more generally learns how to approximate a function.
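A small sketch of the idea, using the common "inverted dropout" formulation (which the lecture doesn't name: surviving activations are scaled up so the expected total stays the same):

```python
import random

def dropout(activations, rate=0.5, training=True):
    """Sketch of (inverted) dropout: during training, zero out each unit's
    activation with probability `rate`, and scale up the survivors so the
    expected total stays the same. At prediction time, do nothing."""
    if not training or rate == 0:
        return activations
    return [0.0 if random.random() < rate else a / (1 - rate) for a in activations]
```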
So that, then, is a look at some of the techniques we can use to implement a neural network, to take input and pass it through these various layers in order to produce some sort of output. What we'd like to do now is take those ideas and put them into code. To do that, there are a number of machine learning libraries, neural network libraries, that give us access to someone else's implementation of backpropagation and all of these hidden layers. One of the most popular, developed by Google, is known as TensorFlow: a library we can use for quickly creating neural networks, modeling them, and running them on sample data to see what the output is going to be. Before we actually start writing code, we'll take a look at TensorFlow's playground, which is an opportunity for us to play around with this idea of neural networks and layers, to get a sense of what we can do by taking advantage of them. So let's go into TensorFlow's playground, which you can get to by visiting that URL from before. What we're going to do now is try to learn the decision boundary for this particular output: I want to learn to separate the orange points from the blue points, and I'd like to learn some setting of weights inside a neural network that will separate them from each other. The features we have access to, our input data, are the x value and the y value, the two values along each of the two axes. What I'll do now is set particular parameters, like which activation function I'd like to use, and then just press play and see what happens. What happens here is that, just by using these two input features, the x value and the y value, with no hidden layers, just taking the inputs and figuring out what the decision boundary is, our neural network learns pretty quickly that in order to divide these two sets of points, we should just use this line. This line acts as a decision boundary that separates this group of points from that group, and it does it very well. You can see up here what the loss is: the training loss is 0, meaning we were able to perfectly separate these two groups from each other inside our training data. This was a fairly simple case to apply a neural network to, because the data is very clean and very nicely linearly separable; we could just draw a line that separates all of those points from each other. Let's now consider a more complex case. I'll pause the simulation, and we'll look at this data set here. This data set is a little more complex: we still have blue and orange points that we'd like to separate from each other, but there's no single line we can draw that will separate the blue from the orange, because the blue points are located in these two quadrants, and the orange here and here. It's a more complex function to learn. So let's see what happens if we just try to predict, based on those inputs, the x and y coordinates, what the output should be. I'll press Play, and what you'll notice is that we're not really able to draw much of a conclusion: we're not able to see any clean division of the orange points from the blue points. It seems we don't have enough sophistication inside our network to model something that complex; we need a better model for this neural network. I'll do that by adding a hidden layer. So now I have a hidden layer with two neurons inside of it: two inputs that go to two neurons in a hidden layer that then go to our output. Now I'll press Play, and what you'll notice here is that we're able to do slightly better: we can now say these points are definitely blue and these points are definitely orange, though we're still struggling a little bit with these points up here. And what we can do is look at each of these hidden neurons and see what exactly each one is doing. Each hidden neuron is learning its own decision boundary, and we can see what that boundary is. This first neuron is learning this line that seems to separate some of the blue points from the rest of the points; this other hidden neuron is learning another line that seems to separate the orange points in the lower right from the rest. That's why we're able to figure out these two areas in the bottom region, but we're still not able to perfectly classify all of the points. So let's add another neuron: now we've got three neurons inside our hidden layer, and let's see what we're able to learn now. All right, now we seem to be doing a better job.
By learning three different decision boundaries, one with each of the three neurons inside of our hidden layer, we’re able to much better figure out how to separate these blue points from the orange points. And we can see what each of these hidden neurons is learning. Each one is learning a slightly different decision boundary. And then we’re combining those decision boundaries together to figure out what the overall output should be. And then we can try it one more time by adding a fourth neuron there and trying to learn that. And it seems like now we can do even better at trying to separate the blue points from the orange points. But we were only able to do this by adding a hidden layer, by adding some layer that is learning some other boundaries and combining those boundaries to determine the output. And the strength, the size and thickness, of these lines indicates how large these weights are, how important each of these inputs is for making this sort of calculation. And we can do maybe one more simulation. Let’s go ahead and try this on a data set that looks like this. Go ahead and get rid of the hidden layer. Here now we’re trying to separate the blue points from the orange points, where all the blue points are located, again, inside of a circle, effectively. So we’re not going to be able to learn a line. Notice, when I press Play, we’re really not able to draw any sort of classification at all, because there is no line that cleanly separates the blue points from the orange points. So let’s try to solve this by introducing a hidden layer. I’ll go ahead and press Play. And all right, with two neurons in a hidden layer, we’re able to do a little better, because we effectively learned two different decision boundaries. We learned this line here, and we learned this line on the right-hand side. And right now we’re just saying, all right, well, if it’s in between, we’ll call it blue, and if it’s outside, we’ll call it orange. So not great, but certainly better than before, in that we’re learning one decision boundary and another, and based on those, we can figure out what the output should be. But let’s now go ahead and add a third neuron and see what happens now. I go ahead and train it. And now, using three different decision boundaries that are learned by each of these hidden neurons, we’re able to much more accurately model this distinction between blue points and orange points. With these three decision boundaries combined together, you can imagine figuring out what the output should be and how to make that sort of classification. And so the goal here is just to get a sense that having more neurons in these hidden layers allows us to learn more structure in the data, allows us to figure out what the relevant and important decision boundaries are. And then, using this backpropagation algorithm, we’re able to figure out what the values of these weights should be in order to train this network to separate one category of points from another. And this is ultimately what we’re going to be trying to do whenever we’re training a neural network. So let’s go ahead and actually see an example of this. You’ll recall from last time that we had this banknotes file that included information about counterfeit banknotes as opposed to authentic banknotes, where I had four different values for each banknote and then a categorization of whether that banknote is considered to be authentic or a counterfeit note.
And what I wanted to do was figure out some function that, based on that input information, could calculate what category it belonged to. And what I’ve written here in banknotes.py is a neural network that will learn just that, a network that learns, based on all of the input, whether or not we should categorize a banknote as authentic or as counterfeit. The first step is the same as what we saw from last time. I’m really just reading the data in and getting it into an appropriate format. And so this is where more of the writing Python code on your own comes in, in terms of manipulating this data, massaging the data into a format that will be understood by a machine learning library like scikit-learn or like TensorFlow. And so here I separate it into a training and a testing set. And now what I’m doing down below is I’m creating a neural network. Here I’m using tf, which stands for TensorFlow. Up above, I said import tensorflow as tf, tf being just an abbreviation that we’ll often use so we don’t need to write out TensorFlow every time we want to use anything inside of the library. I’m using tf.keras. Keras is an API, a set of functions that we can use in order to manipulate neural networks inside of TensorFlow. And it turns out there are other machine learning libraries that also use the Keras API. But here I’m saying, all right, go ahead and give me a model that is a sequential model, a sequential neural network, meaning one layer after another. And now I’m going to add to that model what layers I want inside of my neural network. So here I’m saying model.add: go ahead and add a dense layer. And when we say a dense layer, we mean a layer where each of the nodes inside of the layer is going to be connected to each of the nodes from the previous layer. So we have a densely connected layer. This layer is going to have eight units inside of it. So it’s going to be a hidden layer inside of a neural network with eight different units, eight artificial neurons, each of which might learn something different. And I just sort of chose eight arbitrarily. You could choose a different number of hidden nodes inside of the layer. And as we saw before, depending on the number of units there are inside of your hidden layer, more units means you can learn more complex functions, so maybe you can more accurately model the training data. But it comes at a cost: more units means more weights that you need to figure out how to update, so it might be more expensive to do that calculation. And you also run the risk of overfitting on the data. If you have too many units and you learn to just overfit on the training data, that’s not good either. So there is a balance. And there’s often a testing process where you’ll train on some data and maybe validate how well you’re doing on a separate set of data, often called a validation set, to see, all right, which setting of parameters (how many layers should I have? how many units should be in each layer?) performs the best on the validation set. So you can do some testing to figure out what these so-called hyperparameters should be equal to. Next, I specify what the input shape is, meaning, all right, what does my input look like? My input has four values. And so the input shape is just four, because we have four inputs. And then I specify what the activation function is. And the activation function, again, we can choose. There are a number of different activation functions.
Here I’m using relu, which you might recall from earlier. And then I’ll add an output layer. So I have my hidden layer. Now I’m adding one more layer that will just have one unit, because all I want to do is predict something like counterfeit bill or authentic bill. So I just need a single unit. And the activation function I’m going to use here is that sigmoid activation function, which, again, was that S-shaped curve that just gives us a probability: what is the probability that this is a counterfeit bill, as opposed to an authentic bill? So that, then, is the structure of my neural network: a sequential neural network that has one hidden layer with eight units inside of it, and then one output layer that just has a single unit inside of it. And I can choose how many units there are. I can choose the activation function. Then I’m going to compile this model. TensorFlow gives you a choice of how you would like to optimize the weights (there are various different algorithms for doing that) and what type of loss function you want to use (again, many different options for doing that). And then there’s how I want to evaluate my model. Well, I care about accuracy. I care about how many of my points I’m able to classify correctly versus not correctly as counterfeit or not counterfeit. And I would like it to report to me how accurately my model is performing. Then, now that I’ve defined that model, I call model.fit to say go ahead and train the model. Train it on all the training data plus all of the training labels, so labels for each of those pieces of training data. And I’m saying run it for 20 epochs, meaning go ahead and go through each of these training points 20 times, effectively. Go through the data 20 times and keep trying to update the weights. If I did it for more, I could train for even longer and maybe get a more accurate result. But then after I fit it on all the data, I’ll go ahead and just test it. I’ll evaluate my model using model.evaluate, built into TensorFlow, which is just going to tell me how well I perform on the testing data. So ultimately, this is just going to give me some numbers that tell me how well we did in this particular case. So now what I’m going to do is go into banknotes and go ahead and run banknotes.py. And what’s going to happen now is it’s going to read in all of that training data. It’s going to generate a neural network with all my inputs, my eight hidden units inside my layer, and then an output unit. And now what it’s doing is it’s training. It’s training 20 times. And each time you can see how my accuracy is increasing on my training data. It starts off the very first time not very accurate, though better than random: something like 79% of the time it’s able to accurately classify one bill from another. But as I keep training, notice this accuracy value improves and improves and improves until after I’ve trained through all the data points 20 times, it looks like my accuracy is above 99% on the training data. And here’s where I tested it on a whole bunch of testing data. And it looks like in this case, I was also about 99.8% accurate. So just using that, I was able to generate a neural network that can detect counterfeit bills from authentic bills based on this input data 99.8% of the time, at least based on this particular testing data. And I might want to test it with more data as well, just to be confident about that. But this is really the value of using a machine learning library like TensorFlow.
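To make that concrete, here is a minimal sketch of what a model like the one just described might look like in tf.keras. The hyperparameters (8 hidden units, relu, a sigmoid output, 20 epochs) come from the description above; the random placeholder data, the variable names, and the choice of the adam optimizer and binary cross-entropy loss are assumptions made for illustration, not the author’s exact file.

# A sketch of the banknotes model described above; the random placeholder data
# stands in for reading the real CSV file, and variable names are illustrative.
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Placeholder data with the right shapes: 4 features per banknote, binary label
X = np.random.rand(1000, 4).astype("float32")
y = np.random.randint(0, 2, size=(1000,))
X_training, X_testing, y_training, y_testing = train_test_split(X, y, test_size=0.4)

model = tf.keras.models.Sequential()

# Hidden layer: 8 densely connected units reading the 4 input features
model.add(tf.keras.layers.Dense(8, input_shape=(4,), activation="relu"))

# Output layer: one sigmoid unit giving the probability of "counterfeit"
model.add(tf.keras.layers.Dense(1, activation="sigmoid"))

# Choose an optimizer, a loss function, and the metric to report
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# 20 epochs: go through the training data 20 times, then evaluate on held-out data
model.fit(X_training, y_training, epochs=20)
model.evaluate(X_testing, y_testing, verbose=2)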
And there are others available for Python and other languages as well. But all I have to do is define the structure of the network and define the data that I’m going to pass into the network. And then TensorFlow runs the backpropagation algorithm for learning what all of those weights should be, for figuring out how to train this neural network to figure out, as accurately as possible, what the output values should be. And so this then was a look at what it is that neural networks can do just using these sequences of layer after layer after layer. And you can begin to imagine applying these to much more general problems. And one big problem in computing and artificial intelligence more generally is the problem of computer vision. Computer vision is all about computational methods for analyzing and understanding images. You might have pictures that you want the computer to figure out how to deal with, how to process those images and figure out how to produce some sort of useful result out of them. You’ve seen this in the context of social media websites that are able to look at a photo that contains a whole bunch of faces and figure out what’s a picture of whom, labeling and tagging them with the appropriate people. This is becoming increasingly relevant as we begin to discuss self-driving cars: these cars now have cameras, and we would like for the computer to have some sort of algorithm that looks at the image and figures out what color the light is, what cars are around us, and in what direction, for example. And so computer vision is all about taking an image and figuring out what sort of computation, what sort of calculation we can do with that image. It’s also relevant in the context of something like handwriting recognition. This, what you’re looking at, is an example of the MNIST data set. It’s a big data set just of handwritten digits that we could ideally use to try and figure out, given a photo of a digit that someone has drawn, whether it’s a 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9, for example. So handwriting recognition is yet another task that we might want to apply computer vision tools towards, a task that we might care about. So how, then, can we use neural networks to be able to solve a problem like this? Well, neural networks rely upon some sort of input where that input is just numerical data. We have a whole bunch of units where each one of them just represents some sort of number. And so in the context of something like handwriting recognition, or in the context of just an image, you might imagine that an image is really just a grid of pixels, a grid of dots, where each dot has some sort of color. And in the context of something like handwriting recognition, you might imagine that if you just fill in each of these dots in a particular way, you can generate a 2 or an 8, for example, based on which dots happen to be shaded in and which dots are not. And we can represent each of these pixel values just using numbers. Depending on how you’re representing color, it’s common to represent color values on a 0 to 255 range, so that you can represent a color using 8 bits for a particular value, like how much white is in the image. So for a particular pixel, 0 might represent entirely black, and 255 might represent entirely white.
And somewhere in between might represent some shade of gray, for example. But you might imagine not just having a single slider that determines how much white is in the image; if you had a color image, you might imagine three different numerical values, a red, green, and blue value, where the red value controls how much red is in the pixel, one value controls how much green is in the pixel, and one value controls how much blue is in the pixel as well. And depending on how it is that you set these values of red, green, and blue, you can get a different color. And so any pixel can really be represented, in this case, by three numerical values: a red value, a green value, and a blue value. And if you take a whole bunch of these pixels and assemble them together inside of a grid of pixels, then you really just have a whole bunch of numerical values that you can use in order to perform some sort of prediction task. And so what you might imagine doing is using the same techniques we talked about before: just design a neural network with a lot of inputs, where for each of the pixels, we have one input, or three different inputs in the case of a color image, each connected to a deep neural network, for example. And this deep neural network might take all of the pixels inside of the image of what digit a person drew. And the output might be something like 10 neurons that classify it as a 0, or a 1, or a 2, or a 3, or just tell us in some way what that digit happens to be. Now, there are a couple of drawbacks to this approach. The first drawback to the approach is just the size of this input array, that we have a whole bunch of inputs. If we have a big image that has a lot of different channels, we’re looking at a lot of inputs, and therefore a lot of weights that we have to calculate. And a second problem is the fact that by flattening everything into just this structure of all the pixels, we’ve lost access to a lot of the information about the structure of the image that’s relevant. Really, when a person looks at an image, they’re looking at particular features of the image. They’re looking at curves. They’re looking at shapes. They’re looking at what things they can identify in different regions of the image, and maybe they put those things together in order to get a better picture of what the overall image is about. And by just turning it into pixel values for each of the pixels, sure, you might be able to learn that structure, but it might be challenging to do so. It might be helpful to take advantage of the fact that you can use properties of the image itself, the fact that it’s structured in a particular way, to be able to improve the way that we learn based on that image too. So in order to figure out how we can train our neural networks to better be able to deal with images, we’ll introduce a couple of ideas, a couple of algorithms that we can apply that allow us to take the image and extract some useful information out of that image. And the first idea we’ll introduce is the notion of image convolution. And what image convolution is all about is filtering an image, sort of extracting useful or relevant features out of the image. And the way we do that is by applying a particular filter that basically combines the value of every pixel with the values of all of the neighboring pixels, according to some sort of kernel matrix, which, as we’ll see in a moment, is going to allow us to weight these pixels in various different ways.
And the goal of image convolution, then, is to extract some sort of interesting or useful features out of an image, to be able to take a pixel and, based on its neighboring pixels, maybe predict some sort of valuable information. Something like taking a pixel and looking at its neighboring pixels, you might be able to predict whether or not there’s some sort of curve inside the image, or whether it’s forming the outline of a particular line or a shape, for example. And that might be useful if you’re trying to use all of these various different features to combine them to say something meaningful about an image as a whole. So how, then, does image convolution work? Well, we start with a kernel matrix. And the kernel matrix looks something like this. And the idea is that, given a pixel that will be the middle pixel, we’re going to multiply it and each of its neighboring pixels by the corresponding values in the kernel, and sum all the numbers together to get some sort of result. So if I take this kernel, which you can think of as a filter that I’m going to apply to the image, and let’s say that I take this image. This is a 4 by 4 image. We’ll think of it as just a black and white image, where each cell is just a single pixel value, somewhere between 0 and 255, for example. So we have a whole bunch of individual pixel values like this. And what I’d like to do is apply this kernel, this filter, so to speak, to this image. And the way I’ll do that is, all right, the kernel is 3 by 3. You can imagine a 5 by 5 kernel or a larger kernel, too. And I’ll take it and just first apply it to the first 3 by 3 section of the image. And what I’ll do is I’ll take each of these pixel values, multiply it by its corresponding value in the filter matrix, and add all of the results together. So here, for example, I’ll say 10 times 0, plus 20 times negative 1, plus 30 times 0, so on and so forth, doing all of this calculation. And at the end, if I take all these values, multiply them by their corresponding value in the kernel, and add the results together, for this particular set of 9 pixels I get the value of 10, for example. And then what I’ll do is I’ll slide this 3 by 3 grid, effectively, over. I’ll slide the kernel by 1 to look at the next 3 by 3 section. Here, I’m just sliding it over by 1 pixel. But you might imagine a different stride length, where maybe I jump by multiple pixels at a time if I really wanted to. You have different options here. But here, I’m just sliding over, looking at the next 3 by 3 section. And I’ll do the same math: 20 times 0, plus 30 times negative 1, plus 40 times 0, plus 20 times negative 1, so on and so forth, plus 30 times 5. And what I end up getting is the number 20. Then you can imagine shifting over to this one, doing the same thing, calculating the number 40, for example, and then doing the same thing here, and calculating a value there as well. And so what we have now is what we’ll call a feature map. We have taken this kernel, applied it to each of these various different regions, and what we get is some representation of a filtered version of that image. And so to give a more concrete example of why it is that this kind of thing could be useful, let’s take this kernel matrix, for example, which is quite a famous one, that has an 8 in the middle, and then all of the neighboring pixels get a negative 1. And let’s imagine we wanted to apply that to a 3 by 3 part of an image that looks like this, where all the values are the same. They’re all 20, for instance.
Well, in this case, if you do 20 times 8, and then subtract 20, subtract 20, subtract 20 for each of the eight neighbors, well, the result of that is you just get that expression, which comes out to be 0. You multiplied 20 by 8, but then you subtracted 20 eight times, according to that particular kernel. The result of all that is just 0. So the takeaway here is that when a lot of the pixels are the same value, we end up getting a value close to 0. If, though, we had something like this, where 20 is along the first row, then 50 is in the second row, and 50 is in the third row, well, then when you do the same kind of math, 20 times negative 1, 20 times negative 1, so on and so forth, I get a higher value, a value like 90 in this particular case. And so the more general idea here is that by applying this kernel, negative 1s, 8 in the middle, and then negative 1s, what I get is that when this middle value is very different from the neighboring values, like 50 is greater than these 20s, then you’ll end up with a value higher than 0. If this number is higher than its neighbors, you end up getting a bigger output. But if this value is the same as all of its neighbors, then you get a lower output, something like 0. And it turns out that this sort of filter can therefore be used in something like detecting edges in an image. Or if I want to detect the boundaries between various different objects inside of an image, I might use a filter like this, which is able to tell whether the value of this pixel is different from the values of the neighboring pixels, if it’s greater than the values of the pixels that happen to surround it. And so we can use this in terms of image filtering. And so I’ll show you an example of that. I have here in filter.py a file that uses the Python Imaging Library, or PIL, to do some image filtering. I go ahead and open an image. And then all I’m going to do is apply a kernel to that image. It’s going to be a 3 by 3 kernel, the same kind of kernel we saw before. And here is the kernel. This is just a list representation of the same matrix that I showed you a moment ago: it’s negative 1, negative 1, negative 1; the second row is negative 1, 8, negative 1; and the third row is all negative 1s. And then at the end, I’m going to go ahead and show the filtered image. So if, for example, I go into the convolution directory and I open up an image, like bridge.png, this is what an input image might look like, just an image of a bridge over a river. Now I’m going to go ahead and run this filter program on the bridge. And what I get is this image here. Just by taking the original image and applying that filter to each 3 by 3 grid, I’ve extracted all of the boundaries, all of the edges inside the image that separate one part of the image from another. So here I’ve got a representation of boundaries between particular parts of the image. And you might imagine that if a machine learning algorithm is trying to learn what an image is of, a filter like this could be pretty useful. Maybe the machine learning algorithm doesn’t care about all of the details of the image. It just cares about certain useful features. It cares about particular shapes that are able to help it determine that, based on the image, this is going to be a bridge, for example. And so this idea of image convolution can allow us to apply filters to images that let us extract useful results out of those images, taking an image and extracting its edges, for example.
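Based on the description above, a program along the lines of filter.py might look roughly like this, using Pillow’s ImageFilter.Kernel. The grayscale conversion and command-line handling are assumptions made so the sketch is self-contained, not a claim about the exact original file.

# A sketch of edge-detection filtering with the 3 by 3 kernel described above,
# using the Pillow (PIL) library. Run as: python filter.py bridge.png
import sys
from PIL import Image, ImageFilter

# Open the image named on the command line and convert it to grayscale,
# since PIL's kernel filters operate on single-channel or RGB images
image = Image.open(sys.argv[1]).convert("L")

# The kernel: 8 in the middle, negative 1 for every neighbor, so regions of
# uniform color come out near 0 and pixels that differ from their neighbors
# (edges and boundaries) come out bright
filtered = image.filter(ImageFilter.Kernel(
    size=(3, 3),
    kernel=[-1, -1, -1,
            -1,  8, -1,
            -1, -1, -1],
    scale=1  # keep the raw weighted sum rather than rescaling it
))

filtered.show()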
And you might imagine many other filters that could be applied to an image that are able to extract particular values as well. And a filter might have separate kernels for the red values, the green values, and the blue values that are all summed together at the end, such that you could have particular filters looking for, is there red in this part of the image? Is there green in other parts of the image? You can begin to assemble these relevant and useful filters that are able to do these calculations as well. So that then was the idea of image convolution: applying some sort of filter to an image to be able to extract some useful features out of that image. But all the while, these images are still pretty big. There are a lot of pixels involved in the image. And realistically speaking, if you’ve got a really big image, that poses a couple of problems. One, it means a lot of input going into the neural network. But two, it also means that we really have to care about what’s in each particular pixel. Whereas realistically, if you’re looking at an image, you often don’t care whether something is in one particular pixel versus the pixel immediately to the right of it. They’re pretty close together. You really just care about whether there’s a particular feature in some region of the image, and maybe you don’t care about exactly which pixel it happens to be in. And so there’s a technique we can use known as pooling. And what pooling means is reducing the size of an input by sampling from regions inside of the input. So we’re going to take a big image and turn it into a smaller image by using pooling. And in particular, one of the most popular types of pooling is called max pooling. And what max pooling does is it pools just by choosing the maximum value in a particular region. So for example, let’s imagine I had this 4 by 4 image, but I wanted to reduce its dimensions. I wanted to make it a smaller image so that I have fewer inputs to work with. Well, what I could do is apply a 2 by 2 max pool, where the idea would be that I’m going to first look at this 2 by 2 region and say, what is the maximum value in that region? Well, it’s the number 50. So we’ll go ahead and just use the number 50. And then we’ll look at this 2 by 2 region. What is the maximum value here? It’s 110, so that’s going to be my value. Likewise here, the maximum value looks like 20. Go ahead and put that there. Then for this last region, the maximum value was 40. So we’ll go ahead and use that. And what I have now is a smaller representation of this same original image that I obtained just by picking the maximum value from each of these regions. So again, the advantages here are that now I only have to deal with a 2 by 2 input instead of a 4 by 4, and you can imagine shrinking the size of an image even more. But in addition to that, I’m now able to make my analysis independent of whether a particular value was in this pixel or that pixel. I don’t care if the 50 was here or here. As long as it was generally in this region, I’ll still get access to that value. So it makes our algorithms a little bit more robust as well. So that then is pooling: taking the image and reducing its size a little bit by sampling from particular regions inside of the image.
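Here is a quick sketch of that 2 by 2 max pooling operation in NumPy; the particular pixel values in the grid are just made up for illustration, not the ones from the slide.

# A minimal sketch of 2x2 max pooling on a 4x4 grid of pixel values (NumPy).
import numpy as np

image = np.array([
    [10, 20, 30, 40],
    [50, 60, 25, 15],
    [ 5, 10, 35, 40],
    [15, 20, 30, 25],
])

# Carve the 4x4 grid into four 2x2 blocks, then take each block's maximum
pooled = image.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[60 40]
#  [20 40]]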
And now we can put all of these ideas together, pooling, image convolution, and neural networks, all together into another type of neural network called a convolutional neural network, or a CNN, which is a neural network that uses this convolution step, usually in the context of analyzing an image, for example. And so the way that a convolutional neural network works is that we start with some sort of input image, some grid of pixels. But rather than immediately put that into the neural network layers that we’ve seen before, we’ll start by applying a convolution step, where the convolution step involves applying some number of different image filters to our original image in order to get what we call a feature map, the result of applying some filter to an image. And we could do this once, but in general, we’ll do this multiple times, getting a whole bunch of different feature maps, each of which might extract some different relevant feature out of the image, some different important characteristic of the image that we might care about using in order to calculate what the result should be. And in the same way that we can train neural networks to learn the weights between particular units inside of the neural network, we can also train neural networks to learn what those filters should be, what the values of the filters should be, in order to get the most useful, most relevant information out of the original image, just by figuring out what setting of those filter values, the values inside of that kernel, results in minimizing the loss function, minimizing how poorly our hypothesis actually performs in figuring out the classification of a particular image, for example. So we first apply this convolution step and get a whole bunch of these various different feature maps. But these feature maps are quite large. There are a lot of pixel values that happen to be here. And so a logical next step to take is a pooling step, where we reduce the size of these images by using max pooling, for example, extracting the maximum value from any particular region. There are other pooling methods that exist as well, depending on the situation. You could use something like average pooling, where instead of taking the maximum value from a region, you take the average value from a region, which has its uses as well. But in effect, what pooling will do is take these feature maps and reduce their dimensions, so that we end up with smaller grids with fewer pixels. And this then is going to be easier for us to deal with. It’s going to mean fewer inputs that we have to worry about. And it’s also going to mean we’re more resilient, more robust against potential shifts of particular values by just one pixel, when ultimately we really don’t care about those one-pixel differences that might arise in the original image. And now, after we’ve done this pooling step, we have a whole bunch of values that we can then flatten out and just put into a more traditional neural network. So we go ahead and flatten it, and then we end up with a traditional neural network that has one input for each of the values in each of the resulting feature maps after we do the convolution and after we do the pooling step. And so this then is the general structure of a convolutional network. We begin with the image, apply convolution, apply pooling, flatten the results, and then put that into a more traditional neural network that might itself have hidden layers.
You can have deep convolutional networks that have hidden layers in between this flattened layer and the eventual output to be able to calculate various different features of those values. But this then can help us use convolution and pooling, to use our knowledge about the structure of an image, to be able to get better results, to be able to train our networks faster in order to better capture particular parts of the image. And there’s no reason necessarily why you can only use these steps once. In fact, in practice, you’ll often use convolution and pooling multiple times in multiple different steps. So what you might imagine doing is starting with an image, first applying convolution to get a whole bunch of feature maps, then applying pooling, then applying convolution again, because these maps are still pretty big. You can apply convolution to try and extract relevant features out of this result, then take those results, apply pooling in order to reduce their dimensions, and then take that and feed it into a neural network that maybe has fewer inputs. So here I have two different convolution and pooling steps. I do convolution and pooling once, and then I do convolution and pooling a second time, each time extracting useful features from the layer before it, each time using pooling to reduce the dimensions of what you’re ultimately looking at. And the goal now of this sort of model is that in each of these steps, you can begin to learn different types of features of the original image. Maybe in the first step, you learn very low-level features, just looking for features like edges and curves and shapes, because based on pixels and their neighboring values, you can figure out, all right, what are the edges, what are the curves, what are the various different shapes that might be present there? But then once you have a mapping that just represents where the edges and curves and shapes happen to be, you can imagine applying the same sort of process again to begin to look for higher-level features: look for objects, maybe look for people’s eyes in facial recognition, for example, maybe look for more complex shapes like the curves on a particular number if you’re trying to recognize a digit in a handwriting recognition sort of scenario. And then, after all of that, now that you have these results that represent these higher-level features, you can pass them into a neural network, which is really just a deep neural network that looks like this, where you might imagine making a binary classification, or classifying into multiple categories, or performing various different tasks on this sort of model. So convolutional neural networks can be quite powerful and quite popular when it comes to analyzing images. We don’t strictly need them. We could have just used a vanilla neural network that just operates with layer after layer, as we’ve seen before.
But these convolutional neural networks can be quite helpful, in particular because of the way they model the way a human might look at an image: instead of looking at every single pixel simultaneously and trying to process all of them at once, what convolution is really doing is looking at various different regions of the image and extracting relevant information and features out of those parts of the image, the same way that a human might have visual receptors that are looking at particular parts of what they see, and combining them to figure out what meaning they can draw from all of those various different inputs. And so you might imagine applying this to a situation like handwriting recognition. So we’ll go ahead and see an example of that now, where I’ll go ahead and open up handwriting.py. Again, what we do here is we first import TensorFlow. And then TensorFlow, it turns out, has a few data sets that are built into the library that you can just immediately access. And one of the most famous data sets in machine learning is the MNIST data set, which is just a data set of a whole bunch of samples of people’s handwritten digits. I showed you a slide of that a little while ago. And what we can do is just immediately access that data set, which is built into the library, so that if I want to do something like train on a whole bunch of handwritten digits, I can just use the data set that is provided to me. Of course, if I had my own data set of handwritten images, I could apply the same idea. I’d first just need to take those images and turn them into an array of pixels, because that’s the way that these are going to be formatted. They’re going to be formatted as, effectively, an array of individual pixels. Now there’s a bit of reshaping I need to do, just turning the data into a format that I can put into my convolutional neural network. So this is doing things like taking all the values and dividing them by 255. If you remember, these color values tend to range from 0 to 255. So I can divide them by 255 just to put them into the 0 to 1 range, which might be a little bit easier to train on. And then I do various other modifications to the data just to get it into a nice usable format. But here’s the interesting and important part. Here is where I create the convolutional neural network, the CNN, where I’m saying, go ahead and use a sequential model. And whereas before I used model.add to say add a layer, add a layer, add a layer, another way I can define it is just by passing as input to this sequential neural network a list of all of the layers that I want. And so here, the very first layer in my model is a convolution layer, where I’m first going to apply convolution to my image. I’m going to use 32 different filters. So my model is going to learn 32 different filters on the input image, where each filter is going to be a 3 by 3 kernel. So we saw those 3 by 3 kernels before, where we could multiply each value in a 3 by 3 grid by its corresponding kernel value and add all the results together. So here, I’m going to learn 32 of these 3 by 3 filters. I can, again, specify my activation function. And I specify what my input shape is. My input shape in the banknotes case was just 4. I had 4 inputs. My input shape here is going to be 28, 28, 1, because of the way the MNIST data set organizes its data: each image is a 28 by 28 pixel grid.
So we’re going to have a 28 by 28 pixel grid. And each one of those images only has one channel value. These handwritten digits are just black and white, so there’s just a single color value representing how much black or how much white. You might imagine that in a color image, if you were doing this sort of thing, you might have three different channels, a red, a green, and a blue channel, for example. But in the case of just handwriting recognition, recognizing a digit, we’re just going to use a single value for, like, shaded in or not shaded in. It might range, but it’s just a single color value. And that, then, is the very first layer of our neural network, a convolutional layer that will take the input and learn a whole bunch of different filters that we can apply to the input to extract meaningful features. The next step is going to be a max pooling layer, also built right into TensorFlow, where this is going to be a layer that uses a pool size of 2 by 2, meaning we’re going to look at 2 by 2 regions inside of the image and just extract the maximum value. Again, we’ve seen why this can be helpful: it’ll help to reduce the size of our input. And once we’ve done that, we’ll go ahead and flatten all of the units just into a single layer that we can then pass into the rest of the neural network. And now, here’s the rest of the neural network. Here, I’m saying, let’s add a hidden layer to my neural network with 128 units, so a whole bunch of hidden units inside of the hidden layer. And just to prevent overfitting, I can add a dropout to that, to say, you know what, when you’re training, randomly drop out half of the nodes from this hidden layer, just to make sure we don’t become over-reliant on any particular node and we begin to really generalize and stop ourselves from overfitting. So TensorFlow allows us, just by adding a single line, to add dropout into our model as well, such that when it’s training, it will perform this dropout step in order to help make sure that we don’t overfit on this particular data. And then finally, I add an output layer. The output layer is going to have 10 units, one for each category that I would like to classify digits into, so 0 through 9, 10 different categories. And the activation function I’m going to use here is called the softmax activation function. In short, what the softmax activation function is going to do is take the output and turn it into a probability distribution. So ultimately, it’s going to tell me, what do we estimate the probability is that this is a 2 versus a 3 versus a 4? And so it will turn it into that probability distribution for me. Next up, I’ll go ahead and compile my model and fit it on all of my training data. And then I can evaluate how well the neural network performs. And then I’ve added to my Python program that, if I’ve provided a command line argument like the name of a file, I’m going to go ahead and save the model to a file. And so this can be quite useful too. Once you’ve done the training step, which could take some time, going through the data, running backpropagation with gradient descent to be able to say, all right, how should we adjust the weights of this particular model, you end up calculating values for these weights, calculating values for these filters. You’d like to remember that information so you can use it later.
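Putting those pieces together, a model along the lines of handwriting.py might look roughly like this. The layer choices (32 filters, 3 by 3 kernels, 2 by 2 pooling, 128 hidden units, 0.5 dropout, 10 softmax outputs) come from the walk-through above; the optimizer, loss function, and data-preparation details are assumptions made so the sketch runs on its own, not the author’s exact file.

# A sketch of the convolutional network described above, trained on MNIST.
import tensorflow as tf

# MNIST is built into TensorFlow: 28x28 grayscale images of handwritten digits
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Scale pixel values from the 0-255 range into 0-1 and add the single
# grayscale channel, so each image has shape (28, 28, 1)
x_train = x_train.reshape(-1, 28, 28, 1) / 255.0
x_test = x_test.reshape(-1, 28, 28, 1) / 255.0

# One-hot encode the labels 0 through 9 into vectors of length 10
y_train = tf.keras.utils.to_categorical(y_train)
y_test = tf.keras.utils.to_categorical(y_test)

model = tf.keras.models.Sequential([
    # Convolution: learn 32 different 3x3 filters over the input image
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    # Max pooling: reduce each 2x2 region to its maximum value
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    # Flatten the pooled feature maps into a single layer of units
    tf.keras.layers.Flatten(),
    # Hidden layer of 128 units, with dropout to discourage over-reliance
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    # Output layer: softmax turns 10 outputs into a probability distribution
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10)
model.evaluate(x_test, y_test, verbose=2)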
And so TensorFlow allows us to just save a model to a file, such that later, if we want to use the model we’ve learned, use the weights that we’ve learned, to make some sort of new prediction, we can just use the model that already exists. So what we’re doing here is, after we’ve done all the calculation, we go ahead and save the model to a file, such that we can use it a little bit later. So for example, if I go into digits, I’m going to run handwriting.py. I won’t save it this time. We’ll just run it and go ahead and see what happens. What will happen is we need to go through the model in order to train on all of these samples of handwritten digits. The MNIST data set gives us thousands and thousands of sample handwritten digits in the same format that we can use in order to train. And so now what you’re seeing is this training process. And unlike the banknotes case, where there were far fewer data points and the data was very, very simple, here the data is more complex and this training process takes time. And so this is another one of those cases where, when training neural networks, computational power is so important: oftentimes you see people wanting to use sophisticated GPUs in order to be able to do this sort of neural network training more efficiently. It also speaks to the reason why more data can be helpful. The more sample data points you have, the better you can begin to do this training. So here we’re going through 60,000 different samples of handwritten digits. And I said we’re going to go through them 10 times. We’re going to go through the data set 10 times, training each time, hopefully improving upon our weights with every time we run through this data set. And we can see over here on the right what the accuracy is each time we go ahead and run this model: the first time, it looks like we got an accuracy of about 92% of the digits correct based on this training set. We increased that to 96% or 97%. And every time we run this, we’re going to see, hopefully, the accuracy improve, as we continue to try and use that gradient descent, that process of trying to run the algorithm to minimize the loss that we get, in order to more accurately predict what the output should be. And what this process is doing is it’s learning not only the weights, but it’s learning the features to use, the kernel matrix to use, when performing that convolution step. Because this is a convolutional neural network, where I’m first performing those convolutions and then doing the more traditional neural network structure, it’s going to learn all of those individual steps as well. And so here we see that TensorFlow provides me with some very nice output, telling me how many seconds are left with each of these training runs, which allows me to see just how well we’re doing. So we’ll go ahead and see how this network performs. It looks like we’ve gone through the data set seven times. We’re going through it an eighth time now. And at this point, the accuracy is pretty high. We saw we went from 92% up to 97%. Now it looks like 98%. And at this point, it seems like things are starting to level out. There’s probably a limit to how accurate we can ultimately be without running the risk of overfitting. Of course, with enough nodes, you could just memorize the inputs and overfit upon them. But we’d like to avoid doing that, and dropout will help us with this. But now we see we’re almost done finishing our training step. We’re at 55,000. All right, we finished training.
And now it’s going to go ahead and test for us on 10,000 samples. And it looks like on the testing set, we were 98.8% accurate. So we ended up doing pretty well, it seems, on this testing set to see how accurately we can predict these handwritten digits. And so what we could do then is actually test it out. I’ve written a program called recognition.py using Pygame. If you pass it a model that’s been trained, and I pre-trained an example model using this input data, what we can do is see whether or not we’ve been able to train this convolutional neural network to be able to predict handwriting, for example. So I can try just drawing a handwritten digit. I’ll go ahead and draw the number 2, for example. So there’s my number 2. Again, this is messy. If you tried to imagine how you would write a program with just ifs and thens to be able to do this sort of calculation, it would be tricky to do so. But here I’ll press Classify, and all right, it seems it was able to correctly classify that what I drew was the number 2. I’ll go ahead and reset it and try it again. We’ll draw an 8, for example. So here is an 8. Press Classify. And all right, it predicts that the digit that I drew was an 8. And the key here is that this really begins to show the power of what the neural network is doing: somehow looking at various different features of these different pixels, figuring out what the relevant features are, and figuring out how to combine them to get a classification. And this would be a difficult task to provide explicit instructions to the computer on how to do, to use a whole bunch of ifs and thens to process all these pixel values to figure out what the handwritten digit is. Everyone’s going to draw their 8s a little bit differently. If I drew the 8 again, it would look a little bit different. And yet, ideally, we want to train a network to be robust enough so that it begins to learn these patterns on its own. All I said was, here is the structure of the network, and here is the data on which to train the network. And the network learning algorithm just tries to figure out what is the optimal set of weights, what is the optimal set of filters to use, in order to be able to accurately classify a digit into one category or another. It just goes to show the power of these sorts of convolutional neural networks. And so that then was a look at how we can use convolutional neural networks to begin to solve problems with regard to computer vision, the ability to take an image and begin to analyze it. So this is the type of analysis you might imagine happening in self-driving cars, which have to figure out what filters to apply to an image to understand what it is that the computer is looking at, or the same type of idea that might be applied to facial recognition in social media, to be able to determine how to recognize faces in an image as well. You can imagine a neural network that, instead of classifying into one of 10 different digits, could instead classify something like, is this person A or is this person B, trying to tell those people apart just based on convolution. And so now what we’ll take a look at is yet another type of neural network that can be quite popular for certain types of tasks. But to do so, we’ll try to generalize and think about our neural network a little bit more abstractly.
Here we have a sample deep neural network, where we have this input layer, a whole bunch of different hidden layers that are performing certain types of calculations, and then an output layer that generates some sort of output that we care about calculating. But we could imagine representing this a little more simply, like this. Here is just a more abstract representation of our neural network. We have some input, which might be a vector of a whole bunch of different values. That gets passed into a network that performs some sort of calculation or computation, and that network produces some sort of output. That output might be a single value. It might be a whole bunch of different values. But this is the general structure of the neural network that we’ve seen: there is some sort of input that gets fed into the network, and using that input, the network calculates what the output should be. And this sort of model for a neural network is what we might call a feed-forward neural network. Feed-forward neural networks have connections only in one direction. They move from one layer to the next layer to the layer after that, such that the inputs pass through various different hidden layers and then ultimately produce some sort of output. So feed-forward neural networks were very helpful for solving the types of classification problems that we saw before. We have a whole bunch of input, and we want to learn what setting of weights will allow us to calculate the output effectively. But there are some limitations on feed-forward neural networks that we’ll see in a moment. In particular, the input needs to be of a fixed shape, like a fixed number of neurons in the input layer, and there’s a fixed shape for the output, like a fixed number of neurons in the output layer. And that has some limitations of its own. And a possible solution to this, and we’ll see examples of the types of problems we can solve with this in just a second, is that instead of just a feed-forward neural network, where there are only connections in one direction, from left to right effectively across the network, we could also imagine a recurrent neural network, where a recurrent neural network generates output that gets fed back into itself as input for future runs of that network. So whereas in a traditional neural network, we have inputs that get fed into the network, which then produces the output, and the only things that determine the output are the original input and the calculation we do inside of the network itself, this goes in contrast with a recurrent neural network, where you can imagine output from the network feeding back to itself into the network again as input for the next time you do the calculations inside of the network. What this allows is for the network to maintain some sort of state, to store some sort of information that can be used on future runs of the network. Previously, the network just defined some weights, and we passed inputs through the network, and it generated outputs. But the network wasn’t saving any information based on those inputs to be able to remember for future iterations or for future runs. What a recurrent neural network will let us do is let the network store information that gets passed back in as input to the network again, the next time we try and perform some sort of action. And this is particularly helpful when dealing with sequences of data.
So we’ll see a real-world example of this right now, actually. Microsoft has developed an AI known as CaptionBot. And what CaptionBot does is it says, I can understand the content of any photograph, and I’ll try to describe it as well as any human. I’ll analyze your photo, but I won’t store it or share it. And so what Microsoft’s CaptionBot seems to be claiming to do is take an image, figure out what’s in the image, and just give us a caption to describe it. So let’s try it out. Here, for example, is an image of Harvard Square, some people walking in front of one of the buildings at Harvard Square. I’ll go ahead and take the URL for that image, paste it into CaptionBot, and just press Go. So CaptionBot is analyzing the image, and then it says, I think it’s a group of people walking in front of a building, which seems amazing. The AI is able to look at this image and figure out what’s in the image. And the important thing to recognize here is that this is no longer just a classification task. We saw being able to classify images with a convolutional neural network, where the job was to take the image and then figure out, is it a 0 or a 1 or a 2, or is it this person’s face or that person’s face? What seems to be happening here is that the input is an image, and we know how to get networks to take images as input, but the output is text. It’s a sentence. It’s a phrase, like “a group of people walking in front of a building.” And this would seem to pose a challenge for our more traditional feed-forward neural networks, the reason being that in traditional neural networks, we just have a fixed-size input and a fixed-size output. There are a certain number of neurons in the input to our neural network and a certain number of outputs for our neural network, and then some calculation that goes on in between. But the number of values in the input and the number of values in the output are always going to be fixed, based on the structure of the neural network. And that makes it difficult to imagine how a neural network could take an image like this and say it’s a group of people walking in front of a building, because the output is text; it’s a sequence of words. Now, it might be possible for a neural network to output one word. One word you could represent as a vector of values, and you can imagine ways of doing that. Next time, we’ll talk a little bit more about AI as it relates to language and language processing. But a sequence of words is much more challenging, because depending on the image, you might imagine the output being a different number of words. We could have sequences of different lengths, and somehow we still want to be able to generate the appropriate output. And so the strategy here is to use a recurrent neural network, a neural network that can feed its own output back into itself as input for the next time. And this allows us to do what we call a one-to-many relationship for inputs to outputs. In vanilla, more traditional neural networks, these are what we might consider to be one-to-one neural networks: you pass in one set of values as input, and you get one vector of values as the output. But in this case, we want to pass in one value as input, the image, and we want to get a sequence, many values, as output, where each value is like one of these words that gets produced by this particular algorithm.
And so the way we might do this is we might imagine starting by providing input, the image, into our neural network. And the neural network is going to generate output, but the output is not going to be the whole sequence of words, because we can’t represent the whole sequence of words using just a fixed set of neurons. Instead, the output is just going to be the first word. We’re going to train the network to output what the first word of the caption should be. And you could imagine that Microsoft has trained this by running a whole bunch of training samples through the AI, giving it a whole bunch of pictures and what the appropriate caption was, and having the AI begin to learn from that. But now, because the network generates output that can be fed back into itself, you could imagine the output of the network being fed back into the same network. This here looks like a separate network, but it’s really the same network that’s just getting different input: this network’s output gets fed back into itself, and it’s going to generate another output. And that other output is going to be the second word in the caption. And this recurrent neural network is then going to generate other output that can be fed back into itself to generate yet another word, and fed back in again to generate another word. And so recurrent neural networks allow us to represent this one-to-many structure. You provide one image as input, and the neural network can pass data into the next run of the network, and then again and again, such that you could run the network multiple times, each time generating a different output, still based on that original input. And this is where recurrent neural networks become particularly useful: when dealing with sequences of inputs or outputs. My output is a sequence of words, and since I can’t very easily represent outputting an entire sequence of words, I’ll instead output that sequence one word at a time, by allowing my network to pass information about what still needs to be said about the photo into the next stage of running the network. So you could run the network multiple times, the same network with the same weights, just getting different input each time: first getting input from the image, and then getting input from the network itself, as additional information about what still needs to be said in a particular caption, for example. So this then is a one-to-many relationship inside of a recurrent neural network. But it turns out there are other models that we can use, other ways we can try and use recurrent neural networks, to be able to represent data that might be stored in other forms as well. We saw how we could use neural networks in order to analyze images, in the context of convolutional neural networks that take an image, figure out various different properties of the image, and are able to draw some sort of conclusion based on that. But you might imagine that somewhere like YouTube, they need to be able to do a lot of learning based on video. They need to look through videos to detect things like copyright violations, or they need to be able to look through videos to maybe identify what particular items are inside of the video, for example. And video, you might imagine, is much more difficult to put in as input to a neural network, because whereas with an image, you could just treat each pixel as a different value, videos are sequences. They’re sequences of images, and each sequence might be of a different length.
And so it might be challenging to represent that entire video as a single vector of values that you could pass in to a neural network. And so here, too, recurrent neural networks can be a valuable solution for trying to solve this type of problem. Instead of just passing a single input into our neural network, we could pass in the input one frame at a time, you might imagine. First, taking the first frame of the video, passing it into the network, and then maybe not having the network output anything at all yet. Let it take in another input, and this time, pass it into the network. But the network gets information from the last time we provided an input into the network. Then we pass in a third input, and then a fourth input, where each time, what the network gets is the most recent input, like each frame of the video, but it also gets information the network processed from all of the previous iterations. So on frame number four, you end up getting the input for frame number four plus information the network has calculated from the first three frames. And using all of that data combined, this recurrent neural network can begin to learn how to extract patterns from a sequence of data as well. And so you might imagine, if you want to classify a video into a number of different genres, like an educational video, or a music video, or different types of videos, that’s a classification task, where you want to take as input each of the frames of the video, and you want to output something like what category it happens to belong to. And you can imagine doing this sort of thing, this sort of many-to-one learning, any time your input is a sequence; a sketch of this many-to-one pattern in code follows below. And so input is a sequence in the context of video. It could be in the context of, say, someone having typed a message that you want to be able to categorize, like if you’re trying to take a movie review and classify it as a positive review or a negative review. That input is a sequence of words, and the output is a classification, positive or negative. There, too, a recurrent neural network might be helpful for analyzing sequences of words, and they’re quite popular when it comes to dealing with language. They could even be used for spoken language as well, since spoken language is an audio waveform that can be segmented into distinct chunks, and each of those could be passed in as an input into a recurrent neural network to be able to classify someone’s voice, for instance. If you want to do voice recognition to say, is this one person or is this another, here too are cases where you might want this many-to-one architecture for a recurrent neural network. And then as one final problem, just to take a look at in terms of what we can do with these sorts of networks, imagine what Google Translate is doing. What Google Translate is doing is taking some text written in one language and converting it into text written in some other language, where now this input is a sequence of data. It’s a sequence of words. And the output is a sequence of words as well. It’s also a sequence. So here we want effectively a many-to-many relationship. Our input is a sequence and our output is a sequence as well. And it’s not quite going to work to just take each word in the input and translate it into a word in the output, because ultimately, different languages put their words in different orders. And maybe one language uses two words for something, whereas another language only uses one.
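Here is the many-to-one sketch referenced above: a variable-length sequence of frames (or word embeddings) goes in, and only the final hidden state is turned into a label. The dimensions and the five-genre setup are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Many-to-one sketch: a sequence in, one classification out.
FRAME_DIM, HIDDEN_DIM, NUM_CLASSES = 1024, 256, 5  # e.g. 5 hypothetical video genres

rnn = nn.GRU(FRAME_DIM, HIDDEN_DIM, batch_first=True)
classifier = nn.Linear(HIDDEN_DIM, NUM_CLASSES)

def classify_sequence(frames):
    # frames: (batch, seq_len, FRAME_DIM); seq_len can differ between calls,
    # which is exactly what makes the recurrent approach attractive here
    _, final_hidden = rnn(frames)               # final_hidden: (1, batch, HIDDEN_DIM)
    return classifier(final_hidden.squeeze(0))  # one set of class scores per sequence

video = torch.randn(1, 37, FRAME_DIM)           # a 37-frame "video"
print(classify_sequence(video).argmax(dim=1))   # untrained, so the label is arbitrary
```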
So we really want some way to take this information, this input, encode it somehow, and use that encoding to generate what the output ultimately should be. And this has been one of the big advancements in automated translation technology: the ability to use neural networks to do this instead of older, more traditional methods. And this has improved accuracy dramatically. And the way you might imagine doing this is, again, using a recurrent neural network with multiple inputs and multiple outputs. We start by passing in all the input. Input goes into the network. Another input, like another word, goes into the network. And we do this multiple times, once for each word in the input that I’m trying to translate. And only after all of that is done does the network start to generate output, like the first word of the translated sentence, and the next word of the translated sentence, so on and so forth, where each time the network passes information to itself by allowing for this model of giving some sort of state from one run of the network to the next run, assembling information about all the inputs, and then passing in information about which part of the output to generate next. And there are a number of different types of these sorts of recurrent neural networks. One of the most popular is known as the long short-term memory neural network, otherwise known as the LSTM. But in general, these types of networks can be very, very powerful whenever we’re dealing with sequences, whether those are sequences of images or especially sequences of words when it comes to dealing with natural language. And so those, then, were just some of the different types of neural networks that can be used to do all sorts of different computations. And these are incredibly versatile tools that can be applied to a number of different domains. We only looked at a couple of the most popular types of neural networks: more traditional feed-forward neural networks, convolutional neural networks, and recurrent neural networks. But there are other types as well. There are adversarial networks, where networks compete with each other to try and be able to generate new types of data, as well as other networks that can solve other tasks based on what they happen to be structured and adapted for. And these are very powerful tools in machine learning, from being able to very easily learn based on some set of input data and to be able to, therefore, figure out how to calculate some function from inputs to outputs, whether it’s input to some sort of classification, like analyzing an image and getting a digit, or machine translation, where the input is in one language and the output is in another. These tools have a lot of applications for machine learning more generally. Next time, we’ll look at machine learning and AI in particular in the context of natural language. We talked a little bit about this today, but we’ll be looking at how it is that our AI can begin to understand natural language and can begin to be able to analyze and do useful tasks with regards to human language, which turns out to be a challenging and interesting task. So we’ll see you next time. And welcome back, everybody, to our final class in an introduction to artificial intelligence with Python. Now, so far in this class, we’ve been taking problems that we want to solve intelligently and framing them in ways that computers are going to be able to make sense of.
We’ve been taking problems and framing them as search problems or constraint satisfaction problems or optimization problems, for example. In essence, we have been trying to communicate about problems in ways that our computer is going to be able to understand. Today, the goal is going to be to get computers to understand the way you and I communicate naturally, via our own natural languages, languages like English. But natural language contains a lot of nuance and complexity that’s going to make it challenging for computers to be able to understand. So we’ll need to explore some new tools and some new techniques to allow computers to make sense of natural language. So what is it exactly that we’re trying to get computers to do? Well, these tasks all fall under the general heading of natural language processing, getting computers to work with natural language. And these tasks include tasks like automatic summarization: given a long text, can we train the computer to be able to come up with a shorter representation of it? Information extraction: getting the computer to pull relevant facts or details out of some text. Machine translation, like Google Translate: translating some text from one language into another language. Question answering: if you’ve ever asked a question to your phone or had a conversation with an AI chatbot, you provide some text to the computer, and the computer is able to understand that text and then generate some text in response. Text classification, where we provide some text to the computer and the computer assigns it a label: positive or negative, inbox or spam, for example. And there are several other kinds of tasks that all fall under this heading of natural language processing. But before we take a look at how the computer might try to solve these kinds of tasks, it might be useful for us to think about language in general. What are the kinds of challenges that we might need to deal with as we start to think about language and getting a computer to be able to understand it? So one part of language that we’ll need to consider is the syntax of language. Syntax is all about the structure of language. Language is composed of individual words, and those words are composed together in some kind of structured whole. And if our computer is going to be able to understand language, it’s going to need to understand something about that structure. So let’s take a couple of examples. Here, for instance, is a sentence: “Just before 9 o’clock, Sherlock Holmes stepped briskly into the room.” That sentence is made up of words, and those words together form a structured whole. This is syntactically valid as a sentence. But we could take some of those same words, rearrange them, and come up with a sentence that is not syntactically valid. Here, for example, “just before Sherlock Holmes 9 o’clock stepped briskly the room” is still composed of valid words, but they don’t form any kind of logical whole. This is not a syntactically well-formed sentence. Another interesting challenge is that some sentences will have multiple possible valid structures. Here’s a sentence, for example: “I saw the man on the mountain with a telescope.” This is a valid sentence, but it actually has two different possible structures that lend themselves to two different interpretations and two different meanings. Maybe I, the one doing the seeing, am the one with the telescope. Or maybe the man on the mountain is the one with the telescope. And so natural language is ambiguous.
Sometimes the same sentence can be interpreted in multiple ways, and that’s something that we’ll need to think about as well. And this lends itself to another problem within language that we’ll need to think about, which is semantics. While syntax is all about the structure of language, semantics is about the meaning of language. It’s not enough for a computer just to know that a sentence is well-structured if it doesn’t know what that sentence means. And so semantics is going to concern itself with the meaning of words and the meaning of sentences. So if we go back to that same sentence as before, “just before 9 o’clock, Sherlock Holmes stepped briskly into the room,” I could come up with another sentence, say, “a few minutes before 9, Sherlock Holmes walked quickly into the room.” And those are two different sentences, with some of the words the same and some of the words different, but the two sentences have essentially the same meaning. And so ideally, whatever model we build will be able to understand that these two sentences, while different, mean something very similar. Some syntactically well-formed sentences don’t mean anything at all. A famous example from linguist Noam Chomsky is the sentence “colorless green ideas sleep furiously.” This is a syntactically, structurally well-formed sentence. We’ve got adjectives modifying a noun, ideas. We’ve got a verb and an adverb in the correct positions. But when taken as a whole, the sentence doesn’t really mean anything. And so if our computers are going to be able to work with natural language and perform tasks in natural language processing, these are some concerns we’ll need to think about. We’ll need to be thinking about syntax, and we’ll need to be thinking about semantics. So how could we go about trying to teach a computer how to understand the structure of natural language? Well, one approach we might take is to start by thinking about the rules of natural language. Our natural languages have rules. In English, for example, nouns tend to come before verbs. Nouns can be modified by adjectives, for example. And so if only we could formalize those rules, then we could give those rules to a computer, and the computer would be able to make sense of them and understand them. And so let’s try to do exactly that. We’re going to try to define a formal grammar, where a formal grammar is some system of rules for generating sentences in a language. This is going to be a rule-based approach to natural language processing: we’re going to give the computer some rules that we know about language and have the computer use those rules to make sense of the structure of language. And there are a number of different types of formal grammars, each one of them with slightly different use cases. But today, we’re going to focus specifically on one kind of grammar known as a context-free grammar. So how does the context-free grammar work? Well, here is a sentence that we might want a computer to generate: “she saw the city.” And we’re going to call each of these words a terminal symbol. A terminal symbol, because once our computer has generated the word, there’s nothing else for it to generate. Once it’s generated the sentence, the computer is done. We’re going to associate each of these terminal symbols with a non-terminal symbol that generates it. So here we’ve got N, which stands for noun, like she or city. We’ve got V as a non-terminal symbol, which stands for a verb. And then we have D, which stands for determiner.
A determiner is a word like the or a or an in English, for example. So each of these non-terminal symbols can generate the terminal symbols that we ultimately care about generating. But how do we know, or how does the computer know, which non-terminal symbols are associated with which terminal symbols? Well, to do that, we need some kind of rule. Here are some of what we call rewriting rules, which have a non-terminal symbol on the left-hand side of an arrow; on the right side is what that non-terminal symbol can be replaced with. So here we’re saying the non-terminal symbol N, again, which stands for noun, could be replaced by any of these options, separated by vertical bars: N could be replaced by she or city or car or Harry. D, for determiner, could be replaced by the, a, or an, and so forth. Each of these non-terminal symbols could be replaced by any of these words. We can also have non-terminal symbols that are replaced by other non-terminal symbols. Here is an interesting rule: NP → N | D N. So what does that mean? Well, NP stands for a noun phrase. Sometimes when we have a noun phrase in a sentence, it’s not just a single word; it could be multiple words. And so here we’re saying a noun phrase could be just a noun, or it could be a determiner followed by a noun. So we might have a noun phrase that’s just a noun, like she; that’s a noun phrase. Or we could have a noun phrase that’s multiple words: something like the city also acts as a noun phrase, but in this case, it’s composed of two words, a determiner, the, and a noun, city. We could do the same for verb phrases. A verb phrase, or VP, might be just a verb, or it might be a verb followed by a noun phrase. So we could have a verb phrase that’s just a single word, like the word walked, or we could have a verb phrase that is an entire phrase, something like saw the city. A sentence, meanwhile, we might then define as a noun phrase followed by a verb phrase. And so this would allow us to generate a sentence like she saw the city: an entire sentence made up of a noun phrase, which is just the word she, and then a verb phrase, which is saw the city; saw, which is a verb, and then the city, which itself is also a noun phrase. And so if we could give these rules to a computer, explaining to it what non-terminal symbols could be replaced by what other symbols, then a computer could take a sentence and begin to understand the structure of that sentence. And so let’s take a look at an example of how we might do that. And to do that, we’re going to use a Python library called NLTK, or the Natural Language Toolkit, which we’ll see a couple of times today. It contains a lot of helpful features and functions that we can use for trying to deal with and process natural language. So here we’ll take a look at how we can use NLTK in order to parse a context-free grammar. So let’s go ahead and open up cfg0.py, cfg standing for context-free grammar. And what you’ll see in this file is that I first import NLTK, the Natural Language Toolkit. And the first thing I do is define a context-free grammar, saying that a sentence is a noun phrase followed by a verb phrase. I’m defining what a noun phrase is, defining what a verb phrase is, and then giving some examples of what I can do with these non-terminal symbols: D for determiner, N for noun, and V for verb. We’re going to use NLTK to parse that grammar. Then we’ll ask the user for some input in the form of a sentence and split it into words. A sketch of what this file plausibly looks like appears below.
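Here is a plausible reconstruction of cfg0.py, built from the rules described above. The grammar comes straight from the text; the exact file contents are an assumption.

```python
import nltk

# A reconstruction of cfg0.py as described: the grammar from the text,
# parsed with NLTK's chart parser. Not necessarily the course's exact file.
grammar = nltk.CFG.fromstring("""
    S -> NP VP

    NP -> N | D N
    VP -> V | V NP

    D -> "the" | "a" | "an"
    N -> "she" | "city" | "car" | "Harry"
    V -> "saw" | "walked"
""")

parser = nltk.ChartParser(grammar)

sentence = input("Sentence: ").split()
try:
    for tree in parser.parse(sentence):
        tree.pretty_print()  # draw the syntax tree as text
except ValueError:
    print("No parse tree possible.")
```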
And then we’ll use this context-free grammar parser to try to parse that sentence and print out the resulting syntax tree. So let’s take a look at an example. We’ll go ahead and go into my cfg directory, and we’ll run cfg0.py. And here I’m asked to type in a sentence. Let’s say I type in “she walked.” And when I do that, I see that “she walked” is a valid sentence, where she is a noun phrase, and walked is the corresponding verb phrase. I could try to do this with a more complex sentence, too. I could do something like “she saw the city.” And here we see that she is the noun phrase, and then saw the city is the entire verb phrase that makes up this sentence. So that was a very simple grammar. Let’s take a look at a slightly more complex grammar. Here is cfg1.py, where a sentence is still a noun phrase followed by a verb phrase, but I’ve added some other possible non-terminal symbols, too. I have AP for adjective phrase and PP for prepositional phrase. And we specify that we could have an adjective phrase before a noun phrase, or a prepositional phrase after a noun, for example. So there are lots of additional ways that we might try to structure a sentence, and to interpret and parse one of those resulting sentences. So let’s see that one in action. We’ll go ahead and run cfg1.py with this new grammar. And we’ll try a sentence like “she saw the wide street.” Here, Python’s NLTK is able to parse that sentence and identify that “she saw the wide street” has this particular structure: a sentence with a noun phrase and a verb phrase, where that verb phrase contains a noun phrase that has an adjective within it. And so it’s able to get some sense for what the structure of this language actually is. Let’s try another example. Let’s say “she saw the dog with the binoculars.” And we’ll try that sentence. And here, we get one possible syntax tree for “she saw the dog with the binoculars.” But notice that this sentence is actually a little bit ambiguous in our own natural language. Who has the binoculars? Is it she who has the binoculars, or the dog who has the binoculars? And NLTK is able to identify both possible structures for the sentence. In this case, the dog with the binoculars is an entire noun phrase; it’s all underneath this NP here, so it’s the dog that has the binoculars. But we also get an alternative parse tree, where the dog is just the noun phrase, and with the binoculars is a prepositional phrase modifying saw. So she saw the dog, and she used the binoculars in order to see the dog. So this allows us to get a sense for the structure of natural language, but it relies on us writing all of these rules. And it would take a lot of effort to write all of the rules for any possible sentence that someone might write or say in the English language. Language is complicated, and as a result, there are going to be some very complex rules. So what else might we try? We might try to take a statistical lens towards approaching this problem of natural language processing. If we were able to give the computer a lot of existing data of sentences written in the English language, what could we try to learn from that data? Well, it might be difficult to try and interpret long pieces of text all at once. So instead, what we might want to do is break up that longer text into smaller pieces of information. In particular, we might try to create n-grams out of a longer sequence of text. An n-gram is just some contiguous sequence of n items from a sample of text.
It might be n characters in a row or n words in a row, for example. So let’s take a passage from Sherlock Holmes, and let’s look for all of the trigrams. A trigram is an n-gram where n is equal to 3. So in this case, we’re looking for sequences of three words in a row. So the trigrams here would be phrases like how often have. That’s three words in a row. Often have I is another trigram. Have I said, I said to, said to you, to you that. These are all trigrams, sequences of three words that appear in sequence. And if we could give the computer a large corpus of text and have it pull out all of the trigrams, in this case, it could get a sense for what sequences of three words tend to appear next to each other in our own natural language and, as a result, get some sense for what the structure of the language actually is. So let’s take a look at an example of that. How can we use NLTK to try to get access to information about n-grams? Here, we’re going to open up ngrams.py. And this is a Python program that’s going to load a corpus of data, just some text files, into our computer’s memory. And then we’re going to use NLTK’s ngrams function, which is going to go through the corpus of text, pulling out all of the n-grams for a particular value of n. And then, by using Python’s Counter class, we’re going to figure out the most common n-grams inside of this entire corpus of text. And we’re going to need a data set in order to do this. I’ve prepared a data set of some of the stories of Sherlock Holmes. So it’s just a bunch of text files, a lot of words for it to analyze. And as a result, we’ll get a sense for which sequences of two words or three words tend to be most common in natural language. So let’s give this a try. We’ll go into my ngrams directory, and we’ll run ngrams.py. We’ll try an n value of 2, so we’re looking for sequences of two words in a row. And we’ll use our corpus of stories from Sherlock Holmes. And when we run this program, we get a list of the most common n-grams where n is equal to 2, otherwise known as bigrams. So the most common one is of the. That’s a sequence of two words that appears quite frequently in natural language. Then in the. And it was. These are all common sequences of two words that appear in a row. Let’s instead now try running ngrams with n equal to 3. Let’s get all of the trigrams and see what we get. And now we see the most common trigrams are it was a. One of the. I think that. These are all sequences of three words that appear quite frequently. And we were able to do this essentially via a process known as tokenization. Tokenization is the process of splitting a sequence of characters into pieces. In this case, we’re splitting a long sequence of text into individual words and then looking at sequences of those words to get a sense for the structure of natural language. So once we’ve done this, once we’ve done the tokenization, once we’ve built up our corpus of n-grams, what can we do with that information? One thing that we might try is to build a Markov chain, which you might recall from when we talked about probability. Recall that a Markov chain is some sequence of values where we can predict one value based on the values that came before it. And as a result, if we know all of the common n-grams in the English language, what words tend to be associated with what other words in sequence, we can use that to predict what word might come next in a sequence of words.
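Backing up for a moment, the n-gram counting just demonstrated can be sketched in a few lines. This is a simplification of ngrams.py, not the course’s exact file; the tiny inline string stands in for the directory of Sherlock Holmes text files.

```python
from collections import Counter

import nltk  # may require: nltk.download("punkt") the first time

# Sketch of the n-gram counting idea: tokenize, pull out every n-gram
# with nltk.ngrams, and count them. File loading is omitted here.
def top_ngrams(text, n, k=10):
    # keep only tokens that contain at least one letter (drops punctuation)
    words = [w.lower() for w in nltk.word_tokenize(text)
             if any(c.isalpha() for c in w)]
    return Counter(nltk.ngrams(words, n)).most_common(k)

corpus = "How often have I said to you that when you have eliminated the impossible..."
for ngram, count in top_ngrams(corpus, n=3):
    print(count, ngram)
```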
And so we could build a Markov chain for language in order to try to generate natural language that follows the same statistical patterns as some input data. So let’s take a look at that and build a Markov chain for natural language. As input, I’m going to use the works of William Shakespeare. So here I have a file, Shakespeare.txt, which is just a bunch of the works of William Shakespeare. It’s a long text file, so plenty of data to analyze. And here in generator.py, I’m using a third-party Python library in order to do this analysis. We’re going to read in the sample of text, and then we’re going to train a Markov model based on that text. And then we’re going to have the Markov chain generate some sentences: sentences that don’t appear in the original text but that follow the same statistical patterns, generated based on the n-grams by trying to predict what word is likely to come next, given those statistical patterns. So we’ll go ahead and go into our Markov directory and run this generator with the works of William Shakespeare as input. And what we’re going to get are five new sentences, where these sentences are not necessarily sentences from the original input text itself, but sentences that follow the same statistical patterns. It’s predicting what word is likely to come next based on the input data that we’ve seen and the types of words that tend to appear in sequence there, too. And so we’re able to generate these sentences. Of course, so far, there’s no guarantee that any of the sentences that are generated actually mean anything or make any sense. They just happen to follow the statistical patterns that our computer is already aware of. So we’ll return to this issue of how to generate text in perhaps a more accurate or more meaningful way a little bit later. So let’s now turn our attention to a slightly different problem, and that’s the problem of text classification. Text classification is the problem where we have some text and we want to put that text into some kind of category; we want to apply some sort of label to that text. And this kind of problem shows up in a wide variety of places. A common place might be your email inbox, for example. You get an email and you want your computer to be able to identify whether the email belongs in your inbox or whether it should be filtered out into spam. So we need to classify the text: is it a good email or is it spam? Another common use case is sentiment analysis. We might want to know whether the sentiment of some text is positive or negative. And so how might we do that? This comes up in situations like product reviews, where we might have a bunch of reviews for a product on some website: my grandson loved it, so much fun; product broke after a few days; one of the best games I’ve played in a long time; and kind of cheap and flimsy, not worth it. Here are some example sentences that you might see on a product review website. And you and I could pretty easily look at this list of product reviews and decide which ones are positive and which ones are negative. We might say the first one and the third one seem like positive sentiment messages, but the second one and the fourth one seem like negative sentiment messages. But how did we know that? And how could we train a computer to be able to figure that out as well? Well, you might have clued your eye in on particular key words, where those particular words tend to mean something positive or negative.
So you might have identified that words like loved and fun and best tend to be associated with positive messages, and words like broke and cheap and flimsy tend to be associated with negative messages. So if only we could train a computer to be able to learn what words tend to be associated with positive versus negative messages, then maybe we could train a computer to do this kind of sentiment analysis as well. So we’re going to try to do just that. We’re going to use a model known as the bag of words model, which is a model that represents text as just an unordered collection of words. For the purpose of this model, we’re not going to worry about the sequence and the ordering of the words, which word came first, second, or third. We’re just going to treat the text as a collection of words in no particular order. And we’re losing information there, right? The order of words is important, and we’ll come back to that a little bit later. But for now, to simplify our model, it’ll help us tremendously just to think about text as some unordered collection of words. And in particular, we’re going to use the bag of words model to build something known as a naive Bayes classifier. So what is a naive Bayes classifier? Well, it’s a tool that’s going to allow us to classify text based on Bayes rule, which, again, you might remember from when we talked about probability. Bayes rule says that the probability of B given A is equal to the probability of A given B, multiplied by the probability of B, divided by the probability of A. So how are we going to use this rule to be able to analyze text? Well, what are we interested in? We’re interested in the probability that a message has a positive sentiment and the probability that a message has a negative sentiment, which, for simplicity, I’m going to represent here just with these emoji, happy face and frown face, as positive and negative sentiment. And so if I had a review, something like my grandson loved it, then what I’m interested in is not just the probability that a message has positive sentiment, but the conditional probability that a message has positive sentiment given that this is the message: my grandson loved it. But how do I go about calculating this value, the probability that the message is positive given that the review is this sequence of words? Well, here’s where the bag of words model comes in. Rather than treat this review as an ordered sequence of words, we’re just going to treat it as an unordered collection of words. We’re going to try to calculate the probability that the review is positive given that all of these words, my, grandson, loved, it, are in the review in no particular order, just this unordered collection of words. And this is a conditional probability, which we can then apply Bayes rule to try to make sense of. So according to Bayes rule, this conditional probability is equal to what? It’s equal to the probability that all of these four words are in the review, given that the review is positive, multiplied by the probability that the review is positive, divided by the probability that all of these words happen to be in the review. So this is the value now that we’re going to try to calculate. Now, one thing you might notice is that the denominator here, the probability that all of these words appear in the review, doesn’t actually depend on whether we’re looking at the positive sentiment or negative sentiment case. So we can actually get rid of this denominator. We don’t need to calculate it.
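In symbols, here is what the last paragraph sets up, written as a restatement of the prose (with “pos” standing for positive sentiment):

```latex
% Bayes' rule:
P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A)}

% Applied to the review, treated as a bag of words:
P(\text{pos} \mid \text{my, grandson, loved, it})
    = \frac{P(\text{my, grandson, loved, it} \mid \text{pos}) \cdot P(\text{pos})}
           {P(\text{my, grandson, loved, it})}
```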
We can just say that this probability is proportional to the numerator. And then, at the end, we’re going to need to normalize the probability distribution to make sure that all of the values sum up to the value 1. So now, how do we calculate this value? Well, this is the probability of all of these words given positive, times the probability of positive. And that, by the definition of joint probability, is just one big joint probability: the probability that all of these things are the case, that it’s a positive review and that all four of these words are in the review. But still, it’s not entirely obvious how we calculate that value. And here is where we need to make one more assumption. This is where the naive part of naive Bayes comes in. We’re going to make the assumption that all of the words are independent of each other. And by that, I mean that if the word grandson is in the review, that doesn’t change the probability that the word loved is in the review, or that the word it is in the review, for example. In practice, this assumption might not be true. It’s almost certainly the case that the probabilities of words do depend on each other. But it’s going to simplify our analysis and still give us reasonably good results just to assume that the words are independent of each other and that they only depend on whether the review is positive or negative. You might, for example, expect the word loved to appear more often in a positive review than in a negative review. So what does that mean? Well, if we make this assumption, then we can say that the probability we’re interested in is not directly proportional to, but naively proportional to, this value: the probability that the review is positive, times the probability that my is in the review given that it’s positive, times the probability that grandson is in the review given that it’s positive, and so on for the other two words that happen to be in this review. And now this value, which looks a little more complex, is actually a value that we can calculate pretty easily. So how are we going to estimate the probability that the review is positive? Well, if we have some training data, some example reviews where each one has already been labeled as positive or negative, then we can estimate the probability that a review is positive just by counting the number of positive samples and dividing by the total number of samples in our training data. And for the conditional probabilities, like the probability of loved given that the review is positive, that’s going to be the number of positive samples with loved in them, divided by the total number of positive samples. So let’s take a look at an actual example to see how we could try to calculate these values. Here I’ve put together some sample data. The way to interpret the sample data is that, based on the training data, 49% of the reviews are positive and 51% are negative. And then, over here in this table, we have some conditional probabilities: if the review is positive, then there is a 30% chance that my appears in it, and if the review is negative, there is a 20% chance that my appears in it. And based on our training data, among the positive reviews, 1% of them contain the word grandson, and among the negative reviews, 2% contain the word grandson. So using this data, let’s try to calculate the value we’re interested in. And to do that, we’ll need to multiply all of these values together.
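Carrying out that multiplication in code, as a sketch: the probabilities for my and grandson are the ones given above, but the values for loved and it are not stated in the text, so the numbers used for them here are assumptions, chosen to be consistent with the 0.68 result quoted below.

```python
# The naive Bayes arithmetic from the text. Values for "loved" and "it"
# are ASSUMED (not given in the text); the rest come from the table above.
p_positive, p_negative = 0.49, 0.51
cond = {  # word: (P(word | positive), P(word | negative))
    "my":       (0.30, 0.20),
    "grandson": (0.01, 0.02),
    "loved":    (0.32, 0.08),  # assumed
    "it":       (0.30, 0.40),  # assumed
}

score_pos, score_neg = p_positive, p_negative
for p_pos, p_neg in cond.values():
    score_pos *= p_pos
    score_neg *= p_neg

# Normalize so the two scores form a probability distribution.
total = score_pos + score_neg
print(f"positive: {score_pos / total:.2f}")  # ~0.68
print(f"negative: {score_neg / total:.2f}")  # ~0.32
```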
The probability of positive, and then all of these positive conditional probabilities. And when we do that, we get some value. And then we can do the same thing for the negative case: take the probability that it’s negative, multiply it by all of these conditional probabilities, and we’re going to get some other value. And now these values don’t sum to 1. They’re not a probability distribution yet. But I can normalize them and get some values. And that tells me that for my grandson loved it, we’re going to predict there’s a 68% chance, probability 0.68, that it is a positive sentiment review, and a 0.32 probability that it’s a negative review. So what problems might we run into here? What could potentially go wrong when doing this kind of analysis in order to decide whether text has a positive or negative sentiment? Well, a couple of problems might arise. One problem might be: what if the word grandson never appears in any of the positive reviews? If that were the case, then when we try to calculate the value, the probability that we think the review is positive, we’re going to multiply all these values together and just get 0 for the positive case, because we’re ultimately going to multiply by that 0 value. And so we’re going to say that we think there is no chance that the review is positive, because it contains the word grandson, and in our training data, we’ve never seen the word grandson appear in a positive sentiment message before. And that’s probably not the right analysis, because with rare words, it might simply be that nowhere in our training data did we happen to see the word grandson appear in a message with positive sentiment. So what can we do to solve this problem? Well, one thing we’ll often do is some kind of additive smoothing, where we add some value alpha to each value in our distribution just to smooth out the data a little bit. And a common form of this is Laplace smoothing, where we add 1 to each value in our distribution. In essence, we pretend we’ve seen each value one more time than we actually have. So if we’ve never seen the word grandson in a positive review, we pretend we’ve seen it once. If we’ve seen it once, we pretend we’ve seen it twice, just to avoid the possibility that we might multiply by 0 and, as a result, get results we don’t want in our analysis. So let’s see what this looks like in practice. Let’s try to do some naive Bayes classification in order to classify text as either positive or negative. We’ll take a look at sentiment.py. What this is going to do is load some sample data into memory, some examples of positive reviews and negative reviews. And then we’re going to train a naive Bayes classifier on all of this training data, training data that includes all of the words we see in positive reviews and all of the words we see in negative reviews. And then we’re going to try to classify some input. And so we’re going to do this based on a corpus of data. I have some example positive reviews: it was great, so much fun, for example. And then some negative reviews: not worth it, kind of cheap. These are some examples of negative reviews. So now let’s try to run this classifier and see how it would classify particular text as either positive or negative. We’ll go ahead and run our sentiment analysis on this corpus. And we need to provide it with a review. So I’ll say something like, I enjoyed it.
And we see that the classifier says there is about a 0.92 probability that we think this particular review is positive. Let’s try something negative. We’ll try kind of overpriced. And we see that there is a 0.96 probability now that we think this particular review is negative. And so our naive Bayes classifier has learned what kinds of words tend to appear in positive reviews and what kinds of words tend to appear in negative reviews. And as a result of that, we’ve been able to design a classifier that can predict whether a particular review is positive or negative. And so this definitely is a useful tool that we can use to try and make some predictions. But we had to make some assumptions in order to get there. So what if we want to try to build some more sophisticated models, use some tools from machine learning to try and take better advantage of language data, to be able to draw more accurate conclusions and solve new kinds of tasks and new kinds of problems? Well, we’ve seen a couple of times now that when we want to take some data, some input, and put it in a way that the computer is going to be able to make sense of, it can be helpful to take that data and turn it into numbers, ultimately. And so what we might want to try to do is come up with some word representation, some way to take a word and translate its meaning into numbers. Because, for example, if we wanted to use a neural network to be able to process language, to give our language to a neural network and have it make some predictions or perform some analysis there, a neural network takes as its input, and produces as its output, a vector of values, a vector of numbers. And so what we might want to do is take our data, take our words, and somehow convert them into some kind of numeric representation. So how might we do that? How might we take words and turn them into numbers? Let’s take a look at an example. Here’s a sentence: he wrote a book. And let’s say I wanted to take each of those words and turn it into a vector of values. Here’s one way I might do that. We’ll say he is going to be a vector that has a 1 in the first position and the rest of the values are 0. Wrote will have a 1 in the second position and the rest of the values are 0. A has a 1 in the third position with the rest of the values 0. And book has a 1 in the fourth position with the rest of the values 0. So each of these words now has a distinct vector representation. And this is what we often call a one-hot representation: a representation of the meaning of a word as a vector with a single 1, where all of the rest of the values are 0. (A quick code illustration of this scheme follows below.) And so, when doing this, we now have a numeric representation for every word, and we could pass those vector representations into a neural network or other models that require some kind of numeric data as input. But this one-hot representation actually has a couple of problems, and it’s not ideal for a few reasons. One reason is that here we’re just looking at four words, but if you imagine a vocabulary of thousands of words or more, these vectors are going to get quite long in order to have a distinct vector for every possible word in the vocabulary. And as a result, these longer vectors are going to be more difficult to deal with, more difficult to train, and so forth. And so that might be a problem. Another problem is a little bit more subtle.
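Before looking at that second problem, here is the one-hot scheme just described, in code. The four-word vocabulary comes from the example above; a real vocabulary would have thousands of entries, which is exactly the length problem just mentioned.

```python
# One-hot vectors for the example sentence "he wrote a book".
vocabulary = ["he", "wrote", "a", "book"]

def one_hot(word):
    vector = [0] * len(vocabulary)       # all zeros...
    vector[vocabulary.index(word)] = 1   # ...except a single 1
    return vector

for word in vocabulary:
    print(word, one_hot(word))
# he    [1, 0, 0, 0]
# wrote [0, 1, 0, 0]
# a     [0, 0, 1, 0]
# book  [0, 0, 0, 1]
```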
If we want to represent a word as a vector, and in particular the meaning of a word as a vector, then ideally it should be the case that words that have similar meanings also have similar vector representations, so that they’re close to each other inside a vector space. But that’s not really going to be the case with these one-hot representations, because if we take some similar words, say the word wrote and the word authored, which mean similar things, they have entirely different vector representations. Likewise, book and novel, those two words mean somewhat similar things, but they have entirely different vector representations, because they each have a 1 in some different position. And so that’s not ideal either. So what we might be interested in instead is some kind of distributed representation. A distributed representation is a representation of the meaning of a word distributed across multiple values, instead of just being one-hot with a 1 in one position. Here is what a distributed representation of words might be: each word is associated with some vector of values, with the meaning distributed across multiple values, ideally in such a way that similar words have a similar vector representation. But how are we going to come up with those values? Where do those values come from? How can we define the meaning of a word in this distributed sequence of numbers? Well, to do that, we’re going to draw inspiration from a quote from British linguist J.R. Firth, who said, you shall know a word by the company it keeps. In other words, we’re going to define the meaning of a word based on the words that appear around it, the context words around it. Take, for example, this context: for blank he ate. You might wonder what words could reasonably fill in that blank. Well, it might be words like breakfast or lunch or dinner. All of those could reasonably fill in that blank. And so what we’re going to say is that because the words breakfast and lunch and dinner appear in a similar context, they must have a similar meaning. And that’s something our computer could understand and try to learn. A computer could look at a big corpus of text, look at what words tend to appear in similar contexts to each other, and use that to identify which words have a similar meaning and should therefore appear close to each other inside a vector space. And so one common model for doing this is known as the Word2Vec model. It’s a model for generating word vectors, a vector representation for every word, by looking at data and looking at the context in which a word appears. The idea is going to be this: if you start out with all of the words just in some random position in space and train the model on some training data, what the Word2Vec model will do is start to learn what words appear in similar contexts, and it will move these vectors around in such a way that hopefully words with similar meanings, breakfast, lunch, and dinner; book, memoir, novel, will end up near to each other as vectors as well. So let’s now take a look at what Word2Vec might look like in practice when implemented in code. What I have here inside of words.txt is a pre-trained model, where each of these words has some vector representation trained by Word2Vec. Each of these words has some sequence of values representing its meaning, hopefully in such a way that similar words are represented by similar vectors.
I also have this file vectors.py, which is going to open up the words and form them into a dictionary. And we also define some useful functions like distance to get the distance between two word vectors and closest words to find which words are nearby in terms of having close vectors to each other. And so let’s give this a try. We’ll go ahead and open a Python interpreter. And I’m going to import these vectors. And we might say, all right, what is the vector representation of the word book? And we get this big long vector that represents the word book as a sequence of values. And this sequence of values by itself is not all that meaningful. But it is meaningful in the context of comparing it to other vectors for other words. So we could use this distance function, which is going to get us the distance between two word vectors. And we might say, what is the distance between the vector representation for the word book and the vector representation for the word novel? And we see that it’s 0.34. You can kind of interpret 0 as being really close together and 1 being very far apart. And so now, what is the distance between book and, let’s say, breakfast? Well, book and breakfast are more different from each other than book and novel are. So I would hopefully expect the distance to be larger. And in fact, it is 0.64 approximately. These two words are further away from each other. And what about now the distance between, let’s say, lunch and breakfast? Well, that’s about 0.2. Those are even closer together. They have a meaning that is closer to each other. Another interesting thing we might do is calculate the closest words. We might say, what are the closest words, according to Word2Vec, to the word book? And let’s say, let’s get the 10 closest words. What are the 10 closest vectors to the vector representation for the word book? And when we perform that analysis, we get this list of words. The closest one is book itself, but we also have books plural, and then essay, memoir, essays, novella, anthology, and so on. All of these words mean something similar to the word book, according to Word2Vec, at least, because they have a similar vector representation. So it seems like we’ve done a pretty good job of trying to capture this kind of vector representation of word meaning. One other interesting side effect of Word2Vec is that it’s also able to capture something about the relationships between words as well. Let’s take a look at an example. Here, for instance, are two words, man and king. And these are each represented by Word2Vec as vectors. So what might happen if I subtracted one from the other, calculated the value king minus man? Well, that will be the vector that will take us from man to king, somehow represent this relationship between the vector representation of the word man and the vector representation of the word king. And that’s what this value, king minus man, represents. So what would happen if I took the vector representation of the word woman and added that same value, king minus man, to it? What would we get as the closest word to that, for example? Well, we could try it. Let’s go ahead and go back to our Python interpreter and give this a try. I could say, what is the closest word to the vector representation of the word king minus the representation of the word man plus the representation of the word woman? And we see that the closest word is the word queen. We’ve somehow been able to capture the relationship between king and man. 
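The interpreter session above can be reproduced with a sketch along these lines. The file format of words.txt and the exact implementations in vectors.py are assumptions based on the description; only the function names distance and closest_words come from the text.

```python
import numpy as np
from scipy.spatial.distance import cosine

# Sketch of vectors.py. Assumes words.txt has one word per line followed
# by its vector components, which is a guess at the actual format.
words = {}
with open("words.txt") as f:
    for line in f:
        parts = line.split()
        words[parts[0]] = np.array(parts[1:], dtype=np.float64)

def distance(w1, w2):
    return cosine(w1, w2)  # 0 means very close; larger means further apart

def closest_words(embedding, n=10):
    # brute-force nearest-neighbor search over the whole vocabulary
    return sorted(words, key=lambda w: distance(embedding, words[w]))[:n]

print(distance(words["book"], words["novel"]))      # small: similar meaning
print(distance(words["book"], words["breakfast"]))  # larger: less related
print(closest_words(words["book"]))
# And the analogy from the text:
print(closest_words(words["king"] - words["man"] + words["woman"], n=1))
```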
And then, when we apply it to the word woman, we get, as the result, the word queen. So Word2Vec has been able to capture not just the words and how they’re similar to each other, but also something about the relationships between words and how those words are connected to each other. So now that we have this vector representation of words, what can we do with it? Now we can represent words as numbers, and so we might try to pass those words as input to, say, a neural network. Neural networks, we’ve seen, are very powerful tools for identifying patterns and making predictions. Recall that a neural network you can think of as all of these units, but really what the neural network is doing is taking some input, passing it into the network, and then producing some output. And by providing the neural network with training data, we’re able to update the weights inside of the network, so that the neural network can do a more accurate job of translating those inputs into those outputs. And now that we can represent words as numbers that could be the input or output, you could imagine passing a word in as input to a neural network and getting a word as output. So when might that be useful? One common use for neural networks is in machine translation, when we want to translate text from one language into another, say translating English into French by passing English into the neural network and getting some French output. You might imagine, for instance, that we could take the English word for lamp, pass it into the neural network, and get the French word for lamp as output. But in practice, when we’re translating text from one language to another, we’re usually not interested in just translating a single word from one language to another, but a sequence, say a sentence or a paragraph of words. Here, for example, is another sentence, again taken from Sherlock Holmes, written in English. And what I might want to do is take that entire sentence, pass it into the neural network, and get as output a French translation of the same sentence. But recall that a neural network’s input and output need to be of some fixed size, and a sentence is not of a fixed size: it’s variable. You might have shorter sentences, and you might have longer sentences. So somehow, we need to solve the problem of translating a sequence into another sequence by means of a neural network. And that’s going to be true not only for machine translation, but also for other problems, problems like question answering. If I want to pass as input a question, something like what is the capital of Massachusetts, and feed that as input into the neural network, I would hope that what I would get as output is a sentence like the capital is Boston, again translating some sequence into some other sequence. And if you’ve ever had a conversation with an AI chatbot, or have ever asked your phone a question, it needs to do something like this. It needs to understand the sequence of words that you, the human, provided as input, and then the computer needs to generate some sequence of words as output. So how can we do this? Well, one tool that we can use is the recurrent neural network, which we took a look at last time, which is a way for us to provide a sequence of values to a neural network by running the neural network multiple times. And each time we run the neural network, what we’re going to do is keep track of some hidden state.
And that hidden state is going to be passed from one run of the neural network to the next run of the neural network, keeping track of all of the relevant information. And so let’s take a look at how we can apply that to something like this. In particular, we’re going to look at an architecture known as an encoder-decoder architecture, where we’re going to encode this question into some kind of hidden state, and then use a decoder to decode that hidden state into the output that we’re interested in. So what’s that going to look like? We’ll start with the first word, the word what. That goes into our neural network, and it’s going to produce some hidden state. This is some information about the word what that our neural network is going to need to keep track of. Then, when the second word comes along, we’re going to feed it into that same encoder neural network, but it’s going to get as input that hidden state as well. So we pass in the second word, we also get the information about the hidden state, and that’s going to produce a new hidden state. And so then, when we get to the third word, the, that goes into the encoder. It also gets access to the hidden state, and then it produces a new hidden state that gets passed into the next run, when we use the word capital. And the same thing is going to repeat for the other words that appear in the input. So of and Massachusetts go in, producing one final piece of hidden state. Now, somehow, we need to signal the fact that we’re done. There’s nothing left in the input. And we typically do this by passing some kind of special token, say an end token, into the neural network. And now the decoding process is going to start. We’re going to generate the word the. But in addition to generating the word the, this decoder network is also going to generate some kind of hidden state. And so what happens the next time? Well, to generate the next word, it might be helpful to know what the first word was. So we might pass the first word, the, back into the decoder network. It’s going to get as input this hidden state, and it’s going to generate the next word, capital. And that’s also going to generate some hidden state. And we’ll repeat that, passing capital into the network to generate the third word, is, and then one more time in order to get the fourth word, Boston. And at that point, we’re done. But how do we know we’re done? Usually, we’ll do this one more time, pass Boston into the decoder network, and get as output some end token to indicate that that is the end of our output. And so this, then, is how we could use a recurrent neural network to take some input, encode it into some hidden state, and then use that hidden state to decode it into the output we’re interested in. To visualize it in a slightly different way: we have some input sequence, just some sequence of words. That input sequence goes into the encoder, which in this case is a recurrent neural network generating these hidden states along the way, until we generate some final hidden state, at which point we start the decoding process. Again using a recurrent neural network, that’s going to generate the output sequence as well. So we’ve got the encoder, which is encoding the information about the input sequence into this hidden state, and then the decoder, which takes that hidden state and uses it in order to generate the output sequence. But there are some problems. And for many years, this was the state of the art.
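Before turning to those problems, here is the skeleton of the encoder-decoder loop just described, as a minimal PyTorch sketch. The GRU cells, sizes, greedy decoding, and token handling are all assumptions made for the sketch, not the only way to build this.

```python
import torch
import torch.nn as nn

# Encoder-decoder skeleton: fold the input into one hidden state, then
# decode one token at a time, feeding each output token back in.
VOCAB, EMBED, HIDDEN, END_TOKEN = 10_000, 128, 256, 1  # illustrative sizes

embed = nn.Embedding(VOCAB, EMBED)
encoder = nn.GRUCell(EMBED, HIDDEN)
decoder = nn.GRUCell(EMBED, HIDDEN)
to_vocab = nn.Linear(HIDDEN, VOCAB)

def answer(input_tokens, max_len=20):
    # Encoding: the whole input sequence becomes one hidden state.
    hidden = torch.zeros(1, HIDDEN)
    for token in input_tokens:
        hidden = encoder(embed(torch.tensor([token])), hidden)

    # Decoding: each generated token (plus the evolving hidden state)
    # is fed back in to produce the next token.
    output, token = [], torch.tensor([END_TOKEN])
    for _ in range(max_len):
        hidden = decoder(embed(token), hidden)
        token = to_vocab(hidden).argmax(dim=1)
        if token.item() == END_TOKEN:  # the network signals it is done
            break
        output.append(token.item())
    return output

print(answer([42, 7, 3, 99]))  # untrained, so the output is arbitrary
```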
The recurrent neural network, and variants on this approach, were some of the best ways we knew to perform tasks in natural language processing. But there are some problems that we might want to try to deal with, and that have been dealt with over the years, to try and improve upon this kind of model. And one problem you might notice happens in this encoder stage. We’ve taken this input sequence, the sequence of words, and encoded it all into this final piece of hidden state. And that final piece of hidden state needs to contain all of the information from the input sequence that we need in order to generate the output sequence. And while that’s possible, it becomes increasingly difficult as the sequence gets larger and larger. For larger and larger input sequences, it’s going to become more and more difficult to store all of the information we need about the input inside this single hidden state piece of context. That’s a lot of information to pack into just a single value. It might be useful for us, when generating output, to not just refer to this one value, but to all of the previous hidden values that have been generated by the encoder. And so that might be useful, but how could we do that? We’ve got a lot of different values, and we need to combine them somehow. You could imagine adding them together or taking the average of them, for example. But doing that would assume that all of these pieces of hidden state are equally important, and that’s not necessarily true. Some of these pieces of hidden state are going to be more important than others, depending on what word they most closely correspond to. This piece of hidden state very closely corresponds to the first word of the input sequence; this one very closely corresponds to the second word of the input sequence, for example. And some of those are going to be more important than others. To make matters more complicated, depending on which word of the output sequence we’re generating, different input words might be more or less important. And so what we really want is some way to decide for ourselves which of the input values are worth paying attention to at what point in time. And this is the key idea behind a mechanism known as attention. Attention is all about letting us decide which values are important to pay attention to when generating, in this case, the next word in our sequence. So let’s take a look at an example of that. Here’s a sentence: what is the capital of Massachusetts? Same sentence as before. And let’s imagine that we were trying to answer that question by generating tokens of output. So what would the output look like? Well, it’s going to look like something like the capital is. And let’s say we’re now trying to generate this last word here. What is that last word? How is the computer going to figure it out? Well, what it’s going to need to do is decide which values it’s going to pay attention to. And so the attention mechanism will allow us to calculate some attention scores for each word, some value corresponding to each word, determining how relevant it is for us to pay attention to that word right now. And in this case, when generating the fourth word of the output sequence, the most important words to pay attention to might be capital and Massachusetts, for example; those words are going to be particularly relevant. And there are a number of different mechanisms that have been used in order to calculate these attention scores.
It could be something as simple as a dot product to see how similar two vectors are, or we could train an entire neural network to calculate the attention scores. The key idea is that during the training process, our model is going to learn what is important to pay attention to in order to decide what the next word should be.

The result of calculating these attention scores is that we get a value for each input word, determining how important it is to pay attention to that particular word. Recall that each input word is also associated with one of these hidden state context vectors, capturing information about the sentence up to that point, but focused primarily on that word in particular. So now that we have all of these vectors, along with values representing how important each one is, we can take a weighted average: multiply each vector by its attention score and add them up, getting a new vector that represents the context from the input while specifically paying attention to the words we think are most important. Once we've done that, this context vector can be fed into our decoder to say that the next word should be, in this case, "Boston."

So attention is a very powerful tool: whenever we're decoding a word, it lets us decide which words from the input we should pay attention to in order to determine what's important for generating the next word of the output. One of the first places this was really used was in the field of machine translation. The paper that introduced this idea includes a diagram of translating English sentences into French: the input English sentence runs along the top, the output French equivalent along the left side, and each square shows the attention scores visualized, where a lighter square indicates a higher attention score. What you notice is a strong correspondence between each French word and its English equivalent: the French word for "agreement" is really paying attention to the English word "agreement" when deciding what French word to generate at that point in time. Sometimes a word pays attention to multiple words: the French word for "economic" primarily attends to the English word "economic," but also to the English word "European." Attention scores are very easy to visualize, which gives a sense of what our machine learning model is really paying attention to, what information it uses to determine what's important and what's not when deciding what the ultimate output token should be.

When we combine the attention mechanism with a recurrent neural network, we get very powerful and useful results, generating an output sequence while paying attention to the input sequence too.
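As a minimal sketch of that weighted average, with dot-product scores turned into attention weights by a softmax (all shapes and values here are illustrative):

import torch
import torch.nn.functional as F

enc_states = torch.randn(7, 64)      # one hidden state per input word
dec_state = torch.randn(64)          # decoder state while generating the next word

scores = enc_states @ dec_state      # dot product: similarity of each input state, shape (7,)
weights = F.softmax(scores, dim=0)   # attention scores that sum to 1
context = weights @ enc_states       # weighted average of the hidden states, shape (64,)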
But there are other problems with this approach of using a recurrent neural network. In particular, notice that every run of the neural network depends on the output of the previous step. That sequential dependence was important for capturing the sequence and ordering of the words, but it means we can't run one unit of the network until after we've calculated the hidden state from the run before it, from the previous input token. And that makes this process very difficult to parallelize. As input sequences get longer and longer, we might want to use parallelism to speed up training the neural network and making sense of all of this language data, but with a recurrent neural network that's difficult and slow, because all of it needs to be performed in sequence.

That has become an increasing challenge as we've started to get larger and larger language models. The more language data we have available to train our machine learning models, the more accurate they can be, the better representation and understanding of language they can have, and the better the results we can see. So we've seen the growth of large language models trained on larger and larger data sets, and as a result they take longer and longer to train. The fact that recurrent neural networks are not easy to parallelize has become an increasing problem, and that was one of the main motivations for a different architecture for dealing with natural language, known as the transformer architecture. The transformer has been a significant milestone in natural language processing, really increasing both how well we can perform these kinds of tasks and how quickly we can train a machine learning model to produce effective results. There are a number of different types of transformers; what we'll look at here is a basic architecture, to get a sense for what's involved and what we're doing.

Let's start with the model we were looking at before, specifically the encoder part of the encoder-decoder architecture, where we used a recurrent neural network to take the input sequence and capture all of the hidden state information we need to know about it. Right now it all has to happen in a linear progression. What the transformer allows us to do is process each of the words independently, in a way that's easy to parallelize, rather than have each word wait for some other word. Each word goes through the same neural network and produces some kind of encoded representation of that particular input word, and all of this happens in parallel. It happens for all of the words at once, but we'll focus on what happens for one word to make it clear; know that whatever you see happen for this one word happens for all of the other input words too.

So what's going on here? We start with some input word. That word goes into the neural network, and the output is, hopefully, some encoded representation of the input word: the information we need to know about it that will be relevant as we're generating the output. And because we do this for each word independently, it's easy to parallelize; we don't have to wait for the previous word before we run this word through the neural network.
But what did we lose by trying to parallelize this whole thing? We've lost all notion of word ordering, and the order of words is important. The sentence "Sherlock Holmes gave the book to Watson" has a different meaning than "Watson gave the book to Sherlock Holmes." So we want to keep track of that information about word position. In the recurrent neural network, that happened for us automatically, because we ran the words one at a time and passed the hidden state on to the next run. That's not the case here with the transformer, where each word is processed independently of all the others.

One thing we can do to solve that problem is add some kind of positional encoding to the input word. The positional encoding is some vector that represents the position of the word in the sentence: this is the first word, the second word, the third word, and so forth. We add that to the input word, and the result is a vector that captures multiple pieces of information: the input word itself as well as where in the sentence it appears. We can then pass that sum, the input word plus the positional encoding, into the neural network. That way the network knows the word and where it appears in the sentence, and can use both pieces of information to determine how best to represent the meaning of that word in its encoded representation.

In addition to the positional encoding and this feed-forward neural network, we're also going to add one additional component: a self-attention step. This is attention where we pay attention to the other input words, because the meaning or interpretation of an input word might vary depending on the other words in the input. We allow each word in the input to decide which other input words it should pay attention to in order to decide on its encoded representation. That gives us a better encoded representation for each word, because words are defined by their context, by the words around them and how they're used in that particular context. This kind of self-attention is so valuable, in fact, that oftentimes the transformer will use multiple self-attention layers at the same time, allowing the model to pay attention to multiple facets of the input at once. We call this multi-headed attention, where each attention head can pay attention to something different, so the network can learn to pay attention to many different parts of the input for this word all at the same time.

And in the spirit of deep learning, these two steps, the multi-headed self-attention layer and the neural network layer, can themselves be repeated multiple times, in order to learn deeper patterns within the input text and ultimately get a better representation of language and more useful encoded representations of all the input words. So this is the process a transformer might use to take an input word and get its encoded representation, and the key idea is to really rely on the attention step to gather the information that's useful for deciding how to encode that word.
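A rough sketch of those pieces in PyTorch; the sizes are illustrative, and the learned positional embedding below is just one option (the original transformer paper used fixed sinusoidal encodings):

import torch
import torch.nn as nn

SEQ_LEN, D_MODEL, HEADS = 6, 64, 8                   # toy sizes

word_vecs = torch.randn(1, SEQ_LEN, D_MODEL)         # embedded input words
pos_embed = nn.Embedding(SEQ_LEN, D_MODEL)           # positional encoding (learned variant)
positions = torch.arange(SEQ_LEN).unsqueeze(0)       # 0, 1, 2, ...: where each word appears

x = word_vecs + pos_embed(positions)                 # the word plus its position

# multi-headed self-attention: every word attends to every word, in parallel
self_attn = nn.MultiheadAttention(D_MODEL, HEADS, batch_first=True)
attended, attn_weights = self_attn(x, x, x)          # query, key, and value are all the input

feed_forward = nn.Linear(D_MODEL, D_MODEL)           # the per-word neural network step
encoded = feed_forward(attended)                     # one encoded representation per word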
That process repeats for all of the words in the input sequence: we take all of the input words, encode them with some kind of positional encoding, and feed them through these self-attention and feed-forward neural networks to ultimately get encoded representations of the words. That's the result of the encoder: a set of encoded representations that will be useful when it comes time to decode all of this information into the output sequence we're interested in. Again, this might take place in the context of machine translation, where the output is the same sentence in a different language, or it might be an answer to a question in the case of an AI chatbot, for example.

So now let's take a look at how the decoder works. Ultimately, it has a very similar structure. Any time we're trying to generate the next output word, we need to know the previous output word as well as its positional encoding: where in the output sequence are we? And we have the same steps, self-attention, because we might want an output word to be able to pay attention to other words in that same output, as well as a neural network, and that might itself repeat multiple times.

But in the decoder we add one additional attention step. Instead of self-attention, where the output word pays attention to other output words, this step allows the output word to pay attention to the encoded representations. Recall that the encoder takes all of the input words and transforms them into these encoded representations; it's important for us to be able to decide which of those representations to pay attention to when generating any particular token of the output sequence, and that's what this additional attention step allows. Every time we generate a word of the output, we can pay attention to the other words in the output, because we might want to know what we've generated previously, and we also care about paying attention to the input words, deciding which of their encoded representations are relevant for generating the next step.

So these two pieces combine: an encoder that takes all of the input words and produces encoded representations, and a decoder that takes the previous output word, pays attention to that encoded input, and generates the next output word. This is one of the possible architectures we could use for a transformer, with the key idea being these attention steps that allow words to pay attention to each other. During training, we can much more easily parallelize this, because we don't have to process all of the words in sequence, and the model can learn how to perform these attention steps: what it needs to pay attention to in order to be more accurate at predicting the output word. This has proved to be a tremendously effective model for conversational AI agents and for building machine translation systems.
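PyTorch packages this whole encoder-decoder arrangement as nn.Transformer; here is a minimal sketch with random stand-in embeddings (a real model would add token embeddings, positional encodings, and masks):

import torch
import torch.nn as nn

model = nn.Transformer(d_model=64, nhead=8,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src = torch.randn(1, 7, 64)   # embedded input sequence (e.g. the question)
tgt = torch.randn(1, 4, 64)   # embedded output words generated so far

out = model(src, tgt)         # decoder output: attends to itself and to the encoded input
print(out.shape)              # torch.Size([1, 4, 64])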
Many variants on this model have been proposed, too. Some transformers only use an encoder, some only use a decoder, and some use other combinations of these features, but the key ideas ultimately remain the same: a real focus on trying to pay attention to what is most important. The world of natural language processing is fast growing and fast evolving; year after year, we keep coming up with new models that do an even better job of performing these natural language tasks, all in the service of solving a tricky problem, which is our own natural language. We've seen how the syntax and semantics of our language are ambiguous, and how that introduces new challenges we need to think about if we're going to design AI agents that can work with language effectively.

So as we think about where we've been in this class, we've now looked at artificial intelligence in a wide variety of forms. We started by taking a look at search problems, where AI can search for solutions, play games, and find the optimal decision to make. We talked about knowledge: how AI can represent information it knows and use that information to generate new knowledge as well. Then we looked at what AI can do when it's less certain, when it doesn't know things for sure and has to represent them in terms of probability. We took a look at optimization problems, seeing how a lot of problems in AI can be boiled down to maximizing or minimizing some function, and at the strategies AI can use to do that kind of maximizing and minimizing. We then looked at the world of machine learning: learning from data in order to figure out patterns and identify how to perform a task by looking at the training data available. One of the most powerful tools there was the neural network, a sequence of units whose weights can be trained to let us very effectively go from input to output by learning these underlying patterns. And today we took a look at language itself: how we can train the computer to understand our natural language, to understand syntax and semantics, and to make sense of and generate natural language, which introduces a number of interesting problems too.

We've really just scratched the surface of artificial intelligence; there is so much interesting research, and so many interesting new techniques, algorithms, and ideas being introduced to try to solve these types of problems. I hope you enjoyed this exploration into the world of artificial intelligence. A huge thanks to all of the course's teaching staff and production team for making the class possible. This was an introduction to artificial intelligence with Python.

    By Amjad Izhar
    Contact: amjad.izhar@gmail.com
    https://amjadizhar.blog

  • Data Science and Machine Learning Foundations

    Data Science and Machine Learning Foundations

    This PDF excerpt details a machine learning foundations course. It covers core concepts like supervised and unsupervised learning, regression and classification models, and essential algorithms. The curriculum also explores practical skills, including Python programming with relevant libraries, natural language processing (NLP), and model evaluation metrics. Several case studies illustrate applying these techniques to various problems, such as house price prediction and customer segmentation. Finally, career advice is offered on navigating the data science job market and building a strong professional portfolio.

    Data Science & Machine Learning Study Guide

    Quiz

    1. How can machine learning improve crop yields for farmers? Machine learning can analyze data to optimize crop yields by monitoring soil health and making decisions about planting, fertilizing, and other practices. This can lead to increased revenue for farmers by improving the efficiency of their operations and reducing costs.
    2. Explain the purpose of the Central Limit Theorem in statistical analysis. The Central Limit Theorem states that the distribution of sample means will approximate a normal distribution as the sample size increases, regardless of the original population distribution. This allows for statistical inference about a population based on sample data.
    3. What is the primary difference between supervised and unsupervised learning? In supervised learning, a model is trained using labeled data to predict outcomes. In unsupervised learning, a model is trained on unlabeled data to find patterns or clusters within the data without a specific target variable.
    4. Name three popular supervised learning algorithms. Three popular supervised learning algorithms are K-Nearest Neighbors (KNN), Decision Trees, and Random Forest. These algorithms are used for both classification and regression tasks.
    5. Explain the concept of “bagging” in machine learning. Bagging, short for bootstrap aggregating, involves training multiple models on different subsets of the training data, and then combining their predictions. This technique reduces variance in predictions and creates a more stable prediction model.
    6. What are two metrics used to evaluate the performance of a regression model? Two metrics used to evaluate regression models include Residual Sum of Squares (RSS) and R-squared. The RSS measures the sum of the squared differences between predicted and actual values, while R-squared quantifies the proportion of variance explained by the model.
    7. Define entropy as it relates to decision trees. In the context of decision trees, entropy measures the impurity or randomness of a data set. A higher entropy value indicates a more mixed class distribution, and decision trees attempt to reduce entropy by splitting data into more pure subsets.
    8. What are dummy variables and why are they used in linear regression? Dummy variables are binary variables (0 or 1) used to represent categorical variables in a regression model. They are used to include categorical data in linear regression without misinterpreting the nature of the categorical variables.
    9. Why is it necessary to split data into training and testing sets? Splitting data into training and testing sets allows for training the model on one subset of data and then evaluating its performance on a different, unseen subset. This prevents overfitting and helps determine how well the model generalizes to new, real-world data.
    10. What is the role of the learning rate in gradient descent? The learning rate (or step size) determines how much the model's parameters are adjusted during each iteration of gradient descent. A smaller learning rate takes smaller, steadier steps toward the minimum, while a rate that is too large can overshoot the minimum or oscillate around it. The learning rate is also not the same thing as momentum, which is a separate technique layered on top of gradient descent.

    Answer Key

    1. Machine learning algorithms can analyze data related to crop health and soil conditions to make data-driven recommendations, which allows farmers to optimize their yield and revenue by using resources more effectively.
    2. The Central Limit Theorem is important because it allows data scientists to make inferences about a population by analyzing a sample, and it allows them to understand the distribution of sample means which is a building block to statistical analysis.
    3. Supervised learning uses labeled data with defined inputs and outputs for model training, while unsupervised learning works with unlabeled data to discover structures and patterns without predefined results.
    4. K-Nearest Neighbors, Decision Trees, and Random Forests are some of the most popular supervised learning algorithms. Each can be used for classification or regression problems.
    5. Bagging involves creating multiple training sets using resampling techniques, which allows multiple models to train before their outputs are averaged or voted on. This increases the stability and robustness of the final output.
    6. Residual Sum of Squares (RSS) measures error while R-squared measures goodness of fit.
    7. Entropy in decision trees measures the impurity or disorder of a dataset. The lower the entropy, the more pure the classification for a given subset of data and vice-versa.
    8. Dummy variables are numerical values (0 or 1) that can represent string or categorical variables in an algorithm. This transformation is often required for regression models that are designed to read numerical inputs.
    9. Data should be split into training and test sets to prevent overfitting, train and evaluate the model, and ensure that it can generalize well to real-world data that it has not seen.
    10. The learning rate is the size of the step taken in each iteration of gradient descent, which determines how quickly the algorithm converges towards the local or global minimum of the error function.
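    To make the learning-rate idea from question 10 concrete, here is a tiny, self-contained gradient-descent loop that minimizes f(w) = w**2 (all numbers are illustrative):

    # the gradient of f(w) = w**2 is 2*w
    w = 5.0
    learning_rate = 0.1        # try 1.1 instead: the steps overshoot and w diverges
    for _ in range(50):
        gradient = 2 * w
        w = w - learning_rate * gradient
    print(w)                   # close to 0, the minimum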

    Essay Questions

    1. Discuss the importance of data preprocessing in machine learning projects. What are some common data preprocessing techniques, and why are they necessary?
    2. Compare and contrast the strengths and weaknesses of different types of machine learning algorithms (e.g., supervised vs. unsupervised, linear vs. non-linear, etc.). Provide specific examples to illustrate your points.
    3. Explain the concept of bias and variance in machine learning. How can these issues be addressed when building predictive models?
    4. Describe the process of building a recommendation system, including the key challenges and techniques involved. Consider different data sources and evaluation methods.
    5. Discuss the ethical considerations that data scientists should take into account when working on machine learning projects. How can fairness and transparency be ensured in the development of AI systems?

    Glossary

    • Adam: An optimization algorithm that combines the benefits of AdaGrad and RMSprop, often used for training neural networks.
    • Bagging: A machine learning ensemble method that creates multiple models using random subsets of the training data to reduce variance.
    • Boosting: A machine learning ensemble method that combines weak learners into a strong learner by iteratively focusing on misclassified samples.
    • Central Limit Theorem: A theorem stating that the distribution of sample means approaches a normal distribution as the sample size increases.
    • Classification: A machine learning task that involves predicting the category or class of a given data point.
    • Clustering: An unsupervised learning technique that groups similar data points into clusters.
    • Confidence Interval: A range of values that is likely to contain the true population parameter with a certain level of confidence.
    • Cosine Similarity: A measure of similarity between two non-zero vectors, often used in recommendation systems.
    • DBSCAN: A density-based clustering algorithm that identifies clusters based on data point density.
    • Decision Trees: A supervised learning algorithm that uses a tree-like structure to make decisions based on input features.
    • Dummy Variable: A binary variable (0 or 1) used to represent categorical variables in a regression model.
    • Entropy: A measure of disorder or randomness in a dataset, particularly used in decision trees.
    • Feature Engineering: The process of transforming raw data into features that can be used in machine learning models.
    • Gradient Descent: An optimization algorithm used to minimize the error function of a model by iteratively updating parameters.
    • Heteroskedasticity: A condition in which the variance of the error terms in a regression model is not constant across observations.
    • Homoskedasticity: A condition in which the variance of the error terms in a regression model is constant across observations.
    • Hypothesis Testing: A statistical method used to determine whether there is enough evidence to reject a null hypothesis.
    • Inferential Statistics: A branch of statistics that deals with drawing conclusions about a population based on a sample of data.
    • K-Means: A clustering algorithm that partitions data points into a specified number of clusters based on their distance from cluster centers.
    • K-Nearest Neighbors (KNN): A supervised learning algorithm that classifies or predicts data based on the majority class among its nearest neighbors.
    • Law of Large Numbers: A theorem stating that as the sample size increases, the sample mean will converge to the population mean.
    • Linear Discriminant Analysis (LDA): A dimensionality reduction and classification technique that finds linear combinations of features to separate classes.
    • Logarithm: The inverse operation of exponentiation, used to find the exponent required to reach a certain value.
    • Mini-batch Gradient Descent: An optimization method that updates parameters based on a subset of the training data in each iteration.
    • Momentum (in Gradient Descent): A technique used with gradient descent that adds a fraction of the previous parameter update to the current update, which reduces oscillations during the search for local or global minima.
    • Multicollinearity: A condition in which independent variables in a regression model are highly correlated with each other.
    • Ordinary Least Squares (OLS): A method for estimating the parameters of a linear regression model by minimizing the sum of squared residuals.
    • Overfitting: When a model learns the training data too well and cannot generalize to unseen data.
    • P-value: The probability of obtaining a result as extreme as the observed result, assuming the null hypothesis is true.
    • Random Forest: An ensemble learning method that combines multiple decision trees to make predictions.
    • Regression: A machine learning task that involves predicting a continuous numerical output.
    • Residual: The difference between the actual value of the dependent variable and the value predicted by a regression model.
    • Residual Sum of Squares (RSS): A metric that calculates the sum of the squared differences between the actual and predicted values.
    • RMSprop: An optimization algorithm that adapts the learning rate for each parameter based on the root mean square of past gradients.
    • R-squared (R²): A statistical measure that indicates the proportion of variance in the dependent variable that is explained by the independent variables in a regression model.
    • Standard Deviation: A measure of the amount of variation or dispersion in a set of values.
    • Statistical Significance: A concept that determines if a given finding is likely not due to chance; statistical significance is determined through the calculation of a p-value.
    • Stochastic Gradient Descent (SGD): An optimization algorithm that updates parameters based on a single random sample of the training data in each iteration.
    • Stop Words: Common words in a language that are often removed from text during preprocessing (e.g., “the,” “is,” “a”).
    • Supervised Learning: A type of machine learning where a model is trained using labeled data to make predictions.
    • Unsupervised Learning: A type of machine learning where a model is trained using unlabeled data to discover patterns or clusters.

    AI, Machine Learning, and Data Science Foundations


    Briefing Document: AI, Machine Learning, and Data Science Foundations

    Overview

    This document summarizes key concepts and techniques discussed in the provided material. The sources primarily cover a range of topics, including: foundational mathematical and statistical concepts, various machine learning algorithms, deep learning and generative AI, model evaluation techniques, practical application examples in customer segmentation and sales analysis, and finally optimization methods and concepts related to building a recommendation system. The materials appear to be derived from a course or a set of educational resources aimed at individuals seeking to develop skills in AI, machine learning and data science.

    Key Themes and Ideas

    1. Foundational Mathematics and Statistics
    • Essential Math Concepts: A strong foundation in mathematics is crucial. The materials emphasize the importance of understanding exponents, logarithms, the mathematical constant "e," and pi. Crucially, understanding how these concepts transform when taking derivatives is critical for many machine learning algorithms. For instance, the material stresses knowing what a logarithm is at base 2, base e, and base 10, and how exponents and logarithms transform when you take their derivatives.
    • Statistical Foundations: The course emphasizes descriptive and inferential statistics. Descriptive measures include "distance measures" and "variational measures." Inferential statistics requires an understanding of theorems such as the central limit theorem and the law of large numbers. There is also the need to grasp "population sample," "unbiased sample," "hypothesis testing," "confidence interval," and "statistical significance," and to know how to test different theories using the idea of statistical significance.
    2. Machine Learning Algorithms:
    • Supervised Learning: The course covers various supervised learning algorithms, including:
    • “Linear discriminant analysis” (LDA): Used for classification by combining multiple features to predict outcomes, as shown in the example of predicting movie preferences by combining movie length and genre.
    • “K-Nearest Neighbors” (KNN)
    • “Decision Trees”: Used for both classification and regression tasks.
    • “Random Forests”: An ensemble method that combines multiple decision trees.
    • Boosting Algorithms (e.g., LightGBM, GBM, XGBoost): Another approach to improving model performance by sequentially training models, where each model's training incorporates the previous model's (stump's) errors.
    • Unsupervised Learning: "K-Means": A clustering algorithm for grouping data points. An example is given for customer segmentation by transaction history: you can, for instance, use K-Means, DBSCAN, or hierarchical clustering, then evaluate the clustering algorithms and select the one that performs best.
    • "DBSCAN": A density-based clustering algorithm, noted for its increasing popularity.
    • “Hierarchical Clustering”: Another approach to clustering.
    • Bagging: An ensemble method used to reduce variance and create more stable predictions, exemplified through a weight loss prediction based on “daily calorie intake and workout duration.”
    • AdaBoost: An algorithm where “each stump is made by using the previous stump’s errors”, also used for building prediction models, exemplified with a housing price prediction project.
    3. Deep Learning and Generative AI
    • Optimization Algorithms: The material introduces the need for optimization techniques such as AdamW and RMSprop.
    • Generative Models: The course touches upon more advanced topics, including variational autoencoders and large language models.
    • Natural Language Processing (NLP): It emphasizes the importance of understanding concepts like "n-grams," "attention mechanisms" (both self-attention and multi-head self-attention), the "encoder-decoder architecture of Transformers," and related models such as GPT and BERT. The sources emphasize that if you want to move toward the NLP side of generative AI and understand how ChatGPT was invented and how GPT and BERT models work, you will definitely need to get into the topic of language models.
    4. Model Evaluation
    • Regression Metrics: The document introduces "residual sum of squares" (RSS) as a common metric for evaluating linear regression models, defined as $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ (a short numeric sketch follows this list).
    • Clustering Metrics: The course mentions entropy, and the silhouette score, which is "a measure of the similarity of the data point to its own cluster compared to the other clusters".
    • Regularization: The use of L2 regularization is mentioned, where lambda, the tuning parameter or penalty, is always non-negative (λ ≥ 0) and "serves to control the relative impact of the penalty on the regression coefficient estimates."
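    As a quick numeric sketch of the regression metrics above, with made-up values and scikit-learn's r2_score for the goodness-of-fit part:

    import numpy as np
    from sklearn.metrics import r2_score

    y = np.array([3.0, 5.0, 7.0, 9.0])        # actual values (made up)
    y_hat = np.array([2.8, 5.3, 6.9, 9.4])    # model predictions (made up)

    rss = np.sum((y - y_hat) ** 2)            # residual sum of squares
    r2 = r2_score(y, y_hat)                   # proportion of variance explained
    print(rss, r2)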
    5. Practical Applications and Case Studies:
    • Customer Segmentation: Clustering algorithms (K-Means, DBSCAN) can be used to segment customers based on transaction history.
    • Sales Analysis: The material includes analysis of customer types, “consumer, corporate, and home office”, top spending customers, and sales trends over time. There is a suggestion that “a seasonal Trend” might be apparent if a longer time period is considered.
    • Geographic Sales Mapping: The material includes using maps to visualize sales per state, which is deemed helpful for companies looking to expand into new geographic areas.
    • Housing Price Prediction: A linear regression model is applied to predict house prices using features like median income, average rooms, and proximity to the ocean. An important note is made about the definition of "residual" in this context: do not confuse the error with the residual. The error can never be observed or calculated; what you can do is predict the error, and that prediction is the residual.
    6. Linear Regression and OLS
    • Regression Model: The document explains that the linear regression model aims to estimate the relationship between independent and dependent variables. It emphasizes that beta-zero is not a variable: it is called the intercept or constant, an unknown number that is not in our data and is one of the parameters the linear regression model should estimate.
    • Ordinary Least Squares (OLS): OLS is a core method to minimize the “sum of squared residuals”. The material states that “the OLS tries to find the line that will minimize its value”.
    • Assumptions: The materials mention the assumption of constant variance (homoskedasticity) of the errors, noting that you can check this assumption by plotting the residuals and looking for a funnel-like pattern. The importance of using a correct statistical test when considering p-values is also highlighted.
    • Dummy Variables: The need to transform categorical features into dummy variables for use in linear regression models, with the warning that "you always need to drop at least one of the categories" due to the multicollinearity problem. The process uses the get_dummies function from pandas to go from one categorical variable to five separate variables, one per category (a short sketch follows this list).
    • Variable Interpretation: Coefficients in a linear regression model represent the impact of an independent variable on the dependent variable. For example, the material notes that when total_rooms increases by one additional unit (one more room), the house value decreases by 2.67, corresponding to a coefficient of -2.67.
    • Model Summary Output: The materials discuss interpreting model output metrics such as R-squared, the metric that showcases the goodness of fit of the model, and also mention how to interpret p-values.
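    A minimal pandas sketch of the dummy-variable step described above; drop_first=True drops one category as the baseline, avoiding the multicollinearity problem (the category values are made up):

    import pandas as pd

    df = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "INLAND", "ISLAND", "INLAND"]})

    # one 0/1 column per category, with the first category dropped as the baseline
    dummies = pd.get_dummies(df["ocean_proximity"], drop_first=True)
    print(dummies)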
    7. Recommendation Systems
    • Feature Engineering: A critical step is identifying and engineering the appropriate features, with the recommendation system based on “data points you use to make decisions about what to recommend”.
    • Text Preprocessing: Text data must be cleaned and preprocessed, including removing "stop words" and vectorizing using TF-IDF or similar methods. The example given walks through counting which vocabulary words appear in a movie description (some words zero times, some once), which yields a binary vector such as 0 0 1 1 1 1 0 0 0.
    • Cosine Similarity: A technique to find the similarity between text vectors. The cosine similarity is defined as the dot product of two vectors divided by the product of the magnitudes of the two vectors (a short sketch follows this list).
    • Recommending: The system then recommends the items with the highest cosine similarity scores; as mentioned, we might recommend five movies, though you could just as well recommend 10 or 50, which is completely up to you.
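    A small sketch of the cosine-similarity computation, matching the definition above; the two word-occurrence vectors are made up:

    import numpy as np

    def cosine_similarity(a, b):
        # dot product of the two vectors over the product of their magnitudes
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    movie_a = np.array([0, 0, 1, 1, 1, 1, 0, 0, 0])   # e.g. binary word-occurrence vectors
    movie_b = np.array([0, 1, 1, 0, 1, 1, 0, 0, 0])
    print(cosine_similarity(movie_a, movie_b))        # 0.75: fairly similar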
    8. Career Advice and Perspective
    • The Importance of a Plan: The material emphasizes the value of creating a career plan and focusing on actionable steps. The advice is that this kind of plan makes you focus; if you are not focusing, you can go just about anywhere and lose your way.
    • Learning by Doing: The speaker advocates doing smaller projects to prove your abilities, especially as a junior data scientist. As they put it, the best way is to just do the work, even smaller projects that might seem boring or not lead anywhere, because that kind of work shows your abilities.
    • Business Acumen: Data scientists should focus on how their work provides value to the business; a data scientist is someone who brings value to the business and helps drive its decisions.
    • Personal Branding: Building a personal brand is also seen as important, with the recommendation that “having a newsletter and having a LinkedIn following” can help. Technical portfolio sites like “GitHub” are recommended.
    • Data Scientist Skills: The ability to show your thought process and motivation is important in data science interviews. As the speaker notes, interviewers want to know how your thought process went: what motivated you to do this kind of project, to write this kind of code, and to present this kind of result.
    • Future of Data Science: The future of data science is predicted to become “invaluable to the business”, especially given the current rapid development of AI.
    • Business Fundamentals: The importance of thinking about the needs-based aspect of a business: it must be something people need, as in the example, "if my roof was leaking and it's raining outside and I'm in my house and water is pouring on my head, I have to fix that whether I'm broke or not."
    • Entrepreneurship: The importance of planning, which was inspired by being a pilot where “pilots don’t take off unless we know where we’re going”.
    • Growth: The experience at GE emphasized that “growing so fast it was doubling in size every three years and that that really informed my thinking about growth”.
    • Mergers and Acquisitions (M&A): The business principle of using debt to buy underpriced assets that can later be sold at a higher multiple for profit.
    9. Optimization
    • Gradient Descent (GD): The updated weight is equal to the current weight minus the learning rate times the gradient, and the same update is applied to the second parameter, the bias factor (a short sketch follows this list).
    • Stochastic Gradient Descent (SGD): SGD differs from GD in that it "uses the gradient from a single data point which is just one observation in order to update our parameters". This makes it "much faster and computationally much less expensive compared to the GD".
    • SGD With Momentum: SGD with momentum addresses the disadvantages of the basic SGD algorithm.
    • Mini-Batch Gradient Descent: A trade-off between the two, and “it tries to strike a balance by selecting smaller batches and calculating the gradient over them”.
    • RMSprop: RMSprop is introduced as an algorithm for controlling learning rates: for parameters with small gradients, the learning rate is increased to ensure that the gradient does not vanish.
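    In PyTorch these update rules come packaged as optimizers; here is a minimal sketch of the update (new weight = current weight minus learning rate times gradient) on a toy loss, with the variants discussed above available as one-line swaps:

    import torch

    w = torch.tensor([5.0], requires_grad=True)   # a single illustrative parameter

    optimizer = torch.optim.SGD([w], lr=0.1)                  # plain (stochastic) gradient descent
    # optimizer = torch.optim.SGD([w], lr=0.1, momentum=0.9)  # SGD with momentum
    # optimizer = torch.optim.RMSprop([w], lr=0.1)            # adaptive per-parameter learning rate

    for _ in range(50):
        loss = (w ** 2).sum()   # toy loss with its minimum at w = 0
        optimizer.zero_grad()
        loss.backward()         # compute the gradient
        optimizer.step()        # apply the update rule
    print(w.item())             # close to 0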

    Conclusion

    These materials provide a broad introduction to data science, machine learning, and AI. They cover mathematical and statistical foundations, various algorithms (both supervised and unsupervised), deep learning concepts, model evaluation, and provide case studies to illustrate the practical application of such techniques. The inclusion of career advice and reflections makes it a very holistic learning experience. The information is designed to build a foundational understanding and introduce more complex concepts.

    Essential Concepts in Machine Learning

    Frequently Asked Questions

    • What are some real-world applications of machine learning, as discussed in the context of this course? Machine learning has diverse applications, including optimizing crop yields by monitoring soil health, and predicting customer preferences, such as in the entertainment industry as seen with Netflix’s recommendations. It’s also useful in customer segmentation (identifying “good”, “better”, and “best” customers based on transaction history) and creating personalized recommendations (like prioritizing movies based on a user’s preferred genre). Further, machine learning can help companies decide which geographic areas are most promising for their products based on sales data and can help investors identify which features of a house are correlated with its value.
    • What are the core mathematical concepts that are essential for understanding machine learning and data science? A foundational understanding of several mathematical concepts is critical. This includes: the idea of using variables with different exponents (e.g., X, X², X³), understanding logarithms at different bases (base 2, base e, base 10), comprehending the meaning of ‘e’ and ‘Pi’, mastering exponents and logarithms and how they transform when taking derivatives. A fundamental understanding of descriptive (distance measures, variational measures) and inferential statistics (central limit theorem, law of large numbers, population vs. sample, hypothesis testing) is also essential.
    • What specific machine learning algorithms should I be familiar with, and what are their uses? The course highlights the importance of both supervised and unsupervised learning techniques. For supervised learning, you should know linear discriminant analysis (LDA), K-Nearest Neighbors (KNN), decision trees (for both classification and regression), random forests, and boosting algorithms like LightGBM, GBM, and XGBoost. For unsupervised learning, understanding K-Means clustering, DBSCAN, and hierarchical clustering is crucial. These algorithms are used in various applications like classification, clustering, and regression.
    • How can I assess the performance of my machine learning models? Several metrics are used to evaluate model performance, depending on the task at hand. For regression models, the residual sum of squares (RSS) is crucial; it measures the difference between predicted and actual values. Metrics like entropy and the Gini index are used for evaluating classification models, while the silhouette score (which measures the similarity of a data point to its own cluster vs. other clusters) is used for clustering. Additionally, concepts like the penalty term, used to control the impact of model complexity, and the L2 norm used in regression are highlighted as important for proper evaluation.
    • What is the significance of linear regression and what key concepts should I know? Linear regression is used to model the relationship between a dependent variable (Y) and one or more independent variables (X). A crucial aspect is estimating coefficients (betas) and intercepts which quantify these relationships. It is key to understand concepts like the residuals (differences between predicted and actual values), and how ordinary least squares (OLS) is used to minimize the sum of squared residuals. In understanding linear regression, it is also important not to confuse errors (which are never observed and can’t be calculated) with residuals (which are predictions of errors). It’s also crucial to be aware of assumptions about your errors and their variance.
    • What are dummy variables, and why are they used in modeling? Dummy variables are binary (0 or 1) variables used to represent categorical data in regression models. When transforming categorical variables like ocean proximity (with categories such as near bay, inland, etc.), each category becomes a separate dummy variable. The “1” indicates that a condition is met, and a “0” indicates that it is not. It is essential to drop one of these dummy variables to avoid perfect multicollinearity (where one variable is predictable from other variables) which could cause an OLS violation.
    • What are some of the main ideas behind recommendation systems as discussed in the course? Recommendation systems rely on data points to identify similarities between items to generate personalized results. Text data preprocessing is often done using techniques like tokenization, removing stop words, and stemming to convert data into vectors. Cosine similarity is used to measure the angle between two vector representations. This allows one to calculate how similar different data points (such as movies) are, based on common features (like genre, plot keywords). For example, a movie can be represented as a vector in a high-dimensional space that captures different properties about the movie. This approach enables recommendations based on calculated similarity scores.
    • What key steps and strategies are recommended for aspiring data scientists? The course emphasizes several critical steps. It's important to start with projects to demonstrate the ability to apply data science skills. This includes going beyond basic technical knowledge and considering the "why" behind projects. A focus on building a personal brand, which can be done through online platforms like LinkedIn, GitHub, and Medium, is recommended. Understanding the business value of data science is key, which includes communicating project findings effectively. Also emphasized are creating a career plan and taking responsibility for your career choices. Finally, focusing on a niche or specific sector is recommended to ensure that one's technical skills match the business needs.

    Fundamentals of Machine Learning

    Machine learning (ML) is a branch of artificial intelligence (AI) that builds models based on data, learns from that data, and makes decisions [1]. ML is used across many industries, including healthcare, finance, entertainment, marketing, and transportation [2-9].

    Key Concepts in Machine Learning:

    • Supervised Learning: Algorithms are trained using labeled data [10]. Examples include regression and classification models [11].
    • Regression: Predicts continuous values, such as house prices [12, 13].
    • Classification: Predicts categorical values, such as whether an email is spam [12, 14].
    • Unsupervised Learning: Algorithms are trained using unlabeled data, and the model must find patterns without guidance [11]. Examples include clustering and outlier detection techniques [12].
    • Semi-Supervised Learning: A combination of supervised and unsupervised learning [15].

    Machine Learning Algorithms:

    • Linear Regression: A statistical or machine learning method used to model the impact of a change in a variable [16, 17]. It can be used for causal analysis and predictive analytics [17].
    • Logistic Regression: Used for classification, especially with binary outcomes [14, 15, 18].
    • K-Nearest Neighbors (KNN): A classification algorithm [19, 20].
    • Decision Trees: Can be used for both classification and regression [19, 21]. They are transparent and handle diverse data, making them useful in various industries [22-25].
    • Random Forest: An ensemble learning method that combines multiple decision trees, suitable for classification and regression [19, 26, 27].
    • Boosting Algorithms: Such as AdaBoost, LightGBM, GBM, and XGBoost; these build trees using information from previous trees to improve performance [19, 28, 29].
    • K-Means: A clustering algorithm [19, 30].
    • DBSCAN: A clustering algorithm that is becoming increasingly popular [19].
    • Hierarchical Clustering: Another clustering technique [19, 30].

    Important Steps in Machine Learning:

    • Data Preparation: This involves splitting data into training and test sets and handling missing values [31-33] (see the splitting sketch after this list).
    • Feature Engineering: Identifying and selecting the most relevant data points (features) to be used by the model to generate the most accurate results [34, 35].
    • Model Training: Selecting an appropriate algorithm and training it on the training data [36].
    • Model Evaluation: Assessing model performance using appropriate metrics [37].
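    As a sketch of the data-splitting step with scikit-learn (the arrays are made up):

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
    y = np.arange(10)

    # hold out 20% of the data for testing; fix the seed for reproducibility
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)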

    Model Evaluation Metrics:

    • Regression Models:
    • Residual Sum of Squares (RSS) [38].
    • Mean Squared Error (MSE) [38, 39].
    • Root Mean Squared Error (RMSE) [38, 39].
    • Mean Absolute Error (MAE) [38, 39].
    • Classification Models:
    • Accuracy: Proportion of correctly classified instances [40].
    • Precision: Measures the accuracy of positive predictions [40].
    • Recall: Measures the model’s ability to identify all positive instances [40].
    • F1 Score: Combines precision and recall into a single metric [39, 40].
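    A short sketch of the classification metrics above, on made-up labels and predictions:

    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual labels (made up)
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions (made up)

    print(accuracy_score(y_true, y_pred))   # fraction classified correctly
    print(precision_score(y_true, y_pred))  # of predicted positives, how many were right
    print(recall_score(y_true, y_pred))     # of actual positives, how many were found
    print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall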

    Bias-Variance Tradeoff:

    • Bias: The inability of a model to capture the true relationship in the data [41]. Complex models tend to have low bias but high variance [41-43].
    • Variance: The sensitivity of a model to changes in the training data [41-43]. Simpler models have low variance but high bias [41-43].
    • Overfitting: Occurs when a model learns the training data too well, including noise [44, 45]. This results in poor performance on unseen data [44].
    • Underfitting: Occurs when a model is too simple to capture the underlying patterns in the data [45].

    Techniques to address overfitting:

    • Reducing model complexity: Using simpler models to reduce the chances of overfitting [46].
    • Cross-validation: Using different subsets of data for training and testing to get a more realistic measure of model performance [46].
    • Early stopping: Monitoring model performance and stopping the training process when validation performance begins to decrease [47].
    • Regularization techniques: Such as L1 and L2 regularization, which help prevent overfitting by adding penalty terms that reduce the complexity of the model [48-50].
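    As a sketch of the regularization idea in PyTorch, the optimizer's weight_decay argument applies an L2 penalty to the weights on every update (the model and numbers are illustrative):

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)   # a deliberately simple model

    # weight_decay adds an L2 penalty term, discouraging large weights
    # and thereby reducing the model's tendency to overfit
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)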

    Python and Machine Learning:

    • Python is a popular programming language for machine learning because it has a lot of libraries, including:
    • Pandas: For data manipulation and analysis [51].
    • NumPy: For numerical operations [51, 52].
    • Scikit-learn (sklearn): For machine learning algorithms and tools [13, 51-59].
    • SciPy: For scientific computing [51].
    • NLTK: For natural language processing [51].
    • TensorFlow and PyTorch: For deep learning [51, 60, 61].
    • Matplotlib: For data visualization [52, 62, 63].
    • Seaborn: For data visualization [62].

    Natural Language Processing (NLP):

    • NLP is used to process and analyze text data [64, 65].
    • Key steps include: text cleaning (lowercasing, punctuation removal, tokenization, stemming, and lemmatization), and converting text to numerical data with techniques such as TF-IDF, word embeddings, subword embeddings, and character embeddings [66-68] (see the sketch after this list).
    • NLP is used in applications such as chatbots, virtual assistants, and recommender systems [7, 8, 66].
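    A small sketch of the text-to-numbers step using scikit-learn's TfidfVectorizer, which lowercases, tokenizes, and can drop English stop words in one call (the two documents are made up):

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "The capital of Massachusetts is Boston.",
        "Boston is a city in Massachusetts.",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")  # drops "the", "is", "a", ...
    X = vectorizer.fit_transform(docs)                  # one TF-IDF vector per document
    print(vectorizer.get_feature_names_out())
    print(X.shape)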

    Deep Learning:

    • Deep learning is an advanced form of machine learning that uses neural networks with multiple layers [7, 60, 68].
    • Examples include:
    • Recurrent Neural Networks (RNNs) [69, 70].
    • Artificial Neural Networks (ANNs) [69].
    • Convolutional Neural Networks (CNNs) [69, 70].
    • Generative Adversarial Networks (GANs) [69].
    • Transformers [8, 61, 71-74].

    Practical Applications of Machine Learning:

    • Recommender Systems: Suggesting products, movies, or jobs to users [6, 9, 64, 75-77].
    • Predictive Analytics: Using data to forecast future outcomes, such as house prices [13, 17, 78].
    • Fraud Detection: Identifying fraudulent transactions in finance [4, 27, 79].
    • Customer Segmentation: Grouping customers based on their behavior [30, 80].
    • Image Recognition: Classifying images [14, 81, 82].
    • Autonomous Vehicles: Enabling self-driving cars [7].
    • Chatbots and virtual assistants: Providing automated customer support using NLP [8, 18, 83].

    Career Paths in Machine Learning:

    • Machine Learning Researcher: Focuses on developing and testing new machine learning algorithms [84, 85].
    • Machine Learning Engineer: Focuses on implementing and deploying machine learning models [85-87].
    • AI Researcher: Similar to machine learning researcher but focuses on more advanced models like deep learning and generative AI [70, 74, 88].
    • AI Engineer: Similar to machine learning engineer but works with more advanced AI models [70, 74, 88].
    • Data Scientist: A broad role that uses data analysis, statistics, and machine learning to solve business problems [54, 89-93].

    Additional Considerations:

    • It’s important to develop not only technical skills, but also communication skills, business acumen, and the ability to translate business needs into data science problems [91, 94-96].
    • A strong data science portfolio is key for getting into the field [97].
    • Continuous learning is essential to keep up with the latest technology [98, 99].
    • Personal branding can open up many opportunities [100].

    This overview should provide a strong foundation in the fundamentals of machine learning.

    A Comprehensive Guide to Data Science

    Data science is a field that uses data analysis, statistics, and machine learning to solve business problems [1, 2]. It is a broad field with many applications, and it is becoming increasingly important in today’s world [3]. Data science is not just about crunching numbers; it also involves communication, business acumen, and translation skills [4].

    Key Aspects of Data Science:

    • Data Analysis: Examining data to understand patterns and insights [5, 6].
    • Statistics: Applying statistical methods to analyze data, test hypotheses and make inferences [7, 8].
    • Descriptive statistics, which includes measures like mean, median, and standard deviation, helps in summarizing data [8].
• Inferential statistics, which involves concepts like the central limit theorem and hypothesis testing, helps in drawing conclusions about a population based on a sample [9].
    • Probability distributions are also important in understanding machine learning concepts [10].
    • Machine Learning (ML): Using algorithms to build models based on data, learn from it, and make decisions [2, 11-13].
    • Supervised learning involves training algorithms on labeled data for tasks like regression and classification [13-16]. Regression is used to predict continuous values, while classification is used to predict categorical values [13, 17].
    • Unsupervised learning involves training algorithms on unlabeled data to identify patterns, as in clustering and outlier detection [13, 18, 19].
    • Programming: Using programming languages such as Python to implement data science techniques [20]. Python is popular due to its versatility and many libraries [20, 21].
    • Libraries such as Pandas and NumPy are used for data manipulation [22, 23].
    • Scikit-learn is used for implementing machine learning models [22, 24, 25].
    • TensorFlow and PyTorch are used for deep learning [22, 26].
    • Libraries such as Matplotlib and Seaborn are used for data visualization [17, 25, 27, 28].
    • Data Visualization: Representing data through charts, graphs, and other visual formats to communicate insights [25, 27].
    • Business Acumen: Understanding business needs and translating them into data science problems and solutions [4, 29].

    The Data Science Process:

    1. Data Collection: Gathering relevant data from various sources [30].
2. Data Preparation: Cleaning and preprocessing data (see the sketch after this list), which involves:
• Handling missing values by removing or imputing them [31, 32].
• Identifying and removing outliers [32-35].
• Data wrangling: transforming and cleaning data for analysis [6].
• Data exploration: using descriptive statistics and data visualization to understand the data [36-39].
• Data splitting: dividing data into training, validation, and test sets [14].
3. Feature Engineering: Identifying, selecting, and transforming variables [40, 41].
4. Model Training: Selecting an appropriate algorithm, training it on the training data, and optimizing it with validation data [14].
5. Model Evaluation: Assessing model performance using relevant metrics on the test data [14, 42].
6. Deployment and Communication: Communicating results and translating them into actionable insights for stakeholders [43].
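A minimal sketch of the data preparation and data splitting steps, using pandas and scikit-learn on a tiny made-up table (column names and values are hypothetical):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "size_sqm": [50, 60, None, 80, 1000],   # one missing value, one outlier
    "price": [150, 180, 170, 240, 300],
})
df["size_sqm"] = df["size_sqm"].fillna(df["size_sqm"].median())           # impute missing value
q1, q3 = df["size_sqm"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["size_sqm"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]           # drop IQR outliers
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)  # hold out test data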

    Applications of Data Science:

    • Business and Finance: Customer segmentation, fraud detection, credit risk assessment [44-46].
    • Healthcare: Disease diagnosis, risk prediction, treatment planning [46, 47].
    • Operations Management: Optimizing decision-making using data [44].
    • Engineering: Fault diagnosis [46-48].
    • Biology: Classification of species [47-49].
    • Customer service: Developing troubleshooting guides and chatbots [47-49].
    • Recommender systems are used in entertainment, marketing, and other industries to suggest products or movies to users [30, 50, 51].
    • Predictive Analytics are used to forecast future outcomes [24, 41, 52].

    Key Skills for Data Scientists:

• Technical Skills: Proficiency in programming languages such as Python, knowledge of relevant libraries, and expertise in statistics, mathematics, and machine learning [20].
    • Communication Skills: Ability to communicate results to technical and non-technical audiences [4, 43].
    • Business Skills: Understanding business requirements and translating them into data-driven solutions [4, 29].
    • Problem-solving skills: Ability to define, analyze, and solve complex problems [4, 29].

    Career Paths in Data Science:

    • Data Scientist
    • Machine Learning Engineer
    • AI Engineer
    • Data Science Manager
    • NLP Engineer
    • Data Analyst

    Additional Considerations:

• A strong portfolio demonstrating data science projects is essential to showcase practical skills [53-56].
    • Continuous learning is necessary to keep up with the latest technology in the field [57].
    • Personal branding can enhance opportunities in data science [58-61].
    • Data scientists must be able to adapt to the evolving landscape of AI and machine learning [62, 63].

    This information should give a comprehensive overview of the field of data science.

    Artificial Intelligence: Applications Across Industries

    Artificial intelligence (AI) has a wide range of applications across various industries [1, 2]. Machine learning, a branch of AI, is used to build models based on data and learn from this data to make decisions [1].

    Here are some key applications of AI:

    • Healthcare: AI is used in the diagnosis of diseases, including cancer, and for identifying severe effects of illnesses [3]. It also helps with drug discovery, personalized medicine, treatment plans, and improving hospital operations [3, 4]. Additionally, AI helps in predicting the number of patients that a hospital can expect in the emergency room [4].
    • Finance: AI is used for fraud detection in credit card and banking operations [5]. It is also used in trading, combined with quantitative finance, to help traders make decisions about stocks, bonds, and other assets [5].
    • Retail: AI helps in understanding and estimating demand for products, determining the most appropriate warehouses for shipping, and building recommender systems and search engines [5, 6].
    • Marketing: AI is used to understand consumer behavior and target specific groups, which helps reduce marketing costs and increase conversion rates [7, 8].
    • Transportation: AI is used in autonomous vehicles and self-driving cars [8].
    • Natural Language Processing (NLP): AI is behind applications such as chatbots, virtual assistants, and large language models [8, 9]. These tools use text data to answer questions and provide information [9].
    • Smart Home Devices: AI powers smart home devices like Alexa [9].
    • Agriculture: AI is used to estimate weather conditions, predict crop production, monitor soil health, and optimize crop yields [9, 10].
    • Entertainment: AI is used to build recommender systems that suggest movies and other content based on user data. Netflix is a good example of a company that uses AI in this way [10, 11].
    • Customer service: AI powers chatbots that can categorize customer inquiries and provide appropriate responses, reducing wait times and improving support efficiency [12-15].
    • Game playing: AI is used to design AI opponents in games [13, 14, 16].
    • E-commerce: AI is used to provide personalized product recommendations [14, 16].
    • Human Resources: AI helps to identify factors influencing employee retention [16, 17].
    • Fault Diagnosis: AI helps isolate the cause of malfunctions in complex systems by analyzing sensor data [12, 18].
    • Biology: AI is used to categorize species based on characteristics or DNA sequences [12, 15].
    • Remote Sensing: AI is used to analyze satellite imagery and classify land cover types [12, 15].

    In addition to these, AI is also used in many areas of data science, such as customer segmentation [19-21], fraud detection [19-22], credit risk assessment [19-21], and operations management [19, 21, 23, 24].

    Overall, AI is a powerful technology with a wide range of applications that improve efficiency, decision-making, and customer experience in many areas [11].

    Essential Python Libraries for Data Science

    Python libraries are essential tools in data science, machine learning, and AI, providing pre-written functions and modules that streamline complex tasks [1]. Here’s an overview of the key Python libraries mentioned in the sources:

    • Pandas: This library is fundamental for data manipulation and analysis [2, 3]. It provides data structures like DataFrames, which are useful for data wrangling, cleaning, and preprocessing [3, 4]. Pandas is used for tasks such as reading data, handling missing values, identifying outliers, and performing data filtering [3, 5].
    • NumPy: NumPy is a library for numerical computing in Python [2, 3, 6]. It is used for working with arrays and matrices and performing mathematical operations [3, 7]. NumPy is essential for data visualization and other tasks in machine learning [3].
    • Matplotlib: This library is used for creating visualizations like plots, charts, and histograms [6-8]. Specifically, pyplot is a module within Matplotlib used for plotting [9, 10].
    • Seaborn: Seaborn is another data visualization library that is known for creating more appealing visualizations [8, 11].
• Scikit-learn (sklearn): This library provides a wide range of machine learning algorithms and tools for tasks like regression, classification, clustering, and model evaluation [2, 6, 10, 12]. It includes modules for model selection, ensemble learning, and metrics [13]. Scikit-learn also includes tools for data preprocessing, such as splitting the data into training and testing sets [14, 15].
    • Statsmodels: This library is used for statistical modeling and econometrics and has capabilities for linear regression [12, 16]. It is particularly useful for causal analysis because it provides detailed statistical summaries of model results [17, 18].
    • NLTK (Natural Language Toolkit): This library is used for natural language processing tasks [2]. It is helpful for text data cleaning, such as tokenization, stemming, lemmatization, and stop word removal [19, 20]. NLTK also assists in text analysis and processing [21].
    • TensorFlow and PyTorch: These are deep learning frameworks used for building and training neural networks and implementing deep learning models [2, 22, 23]. They are essential for advanced machine learning tasks, such as building large language models [2].
    • Pickle: This library is used for serializing and deserializing Python objects, which is useful for saving and loading models and data [24, 25].
    • Requests: This library is used for making HTTP requests, which is useful for fetching data from web APIs, like movie posters [25].

    These libraries facilitate various stages of the data science workflow [26]:

    • Data loading and preparation: Libraries like Pandas and NumPy are used to load, clean, and transform data [2, 26].
    • Data visualization: Libraries like Matplotlib and Seaborn are used to create plots and charts that help to understand data and communicate insights [6-8].
    • Model training and evaluation: Libraries like Scikit-learn and Statsmodels are used to implement machine learning algorithms, train models, and evaluate their performance [2, 12, 26].
    • Deep learning: Frameworks such as TensorFlow and PyTorch are used for building complex neural networks and deep learning models [2, 22].
    • Natural language processing: Libraries such as NLTK are used for processing and analyzing text data [2, 27].

    Mastering these Python libraries is crucial for anyone looking to work in data science, machine learning, or AI [1, 26]. They provide the necessary tools for implementing a wide array of tasks, from basic data analysis to advanced model building [1, 2, 22, 26].

    Machine Learning Model Evaluation

    Model evaluation is a crucial step in the machine learning process that assesses the performance and effectiveness of a trained model [1, 2]. It involves using various metrics to quantify how well the model is performing, which helps to identify whether the model is suitable for its intended purpose and how it can be improved [2-4]. The choice of evaluation metrics depends on the specific type of machine learning problem, such as regression or classification [5].

    Key Concepts in Model Evaluation:

• Performance Metrics: These are measures used to evaluate how well a model is performing; different metrics are appropriate for different types of tasks [5, 6] (several are computed in the sketch after this list).
    • For regression models, common metrics include:
    • Residual Sum of Squares (RSS): Measures the sum of the squares of the differences between the predicted and true values [6-8].
    • Mean Squared Error (MSE): Calculates the average of the squared differences between predicted and true values [6, 7].
    • Root Mean Squared Error (RMSE): The square root of the MSE, which provides a measure of the error in the same units as the target variable [6, 7].
    • Mean Absolute Error (MAE): Calculates the average of the absolute differences between predicted and true values. MAE is less sensitive to outliers compared to MSE [6, 7, 9].
    • For classification models, common metrics include:
    • Accuracy: Measures the proportion of correct predictions made by the model [9, 10].
    • Precision: Measures the proportion of true positive predictions among all positive predictions made by the model [7, 9, 10].
    • Recall: Measures the proportion of true positive predictions among all actual positive instances [7, 9, 11].
    • F1 Score: The harmonic mean of precision and recall, providing a balanced measure of a model’s performance [7, 9].
• Area Under the Curve (AUC): The area under the Receiver Operating Characteristic (ROC) curve, used to assess the performance of binary classification models [12].
    • Cross-entropy: A loss function used to measure the difference between the predicted and true probability distributions, often used in classification problems [7, 13, 14].
    • Bias and Variance: These concepts are essential for understanding model performance [3, 15].
    • Bias refers to the error introduced by approximating a real-world problem with a simplified model, which can cause the model to underfit the data [3, 4].
    • Variance measures how much the model’s predictions vary for different training data sets; high variance can cause the model to overfit the data [3, 16].
    • Overfitting and Underfitting: These issues can affect model accuracy [17, 18].
    • Overfitting occurs when a model learns the training data too well, including noise, and performs poorly on new, unseen data [17-19].
    • Underfitting occurs when a model is too simple and cannot capture the underlying patterns in the training data [17, 18].
    • Training, Validation, and Test Sets: Data is typically split into three sets [2, 20]:
    • Training Set: Used to train the model.
    • Validation Set: Used to tune model hyperparameters and prevent overfitting.
    • Test Set: Used to evaluate the final model’s performance on unseen data [20-22].
    • Hyperparameter Tuning: Adjusting model parameters to minimize errors and optimize performance, often using the validation set [21, 23, 24].
    • Cross-Validation: A resampling technique that allows the model to be trained and tested on different subsets of the data to assess its generalization ability [7, 25].
    • K-fold cross-validation divides the data into k subsets or folds and iteratively trains and evaluates the model by using each fold as the test set once [7].
    • Leave-one-out cross-validation uses each data point as a test set, training the model on all the remaining data points [7].
    • Early Stopping: A technique where the model’s performance on a validation set is monitored during the training process, and training is stopped when the performance starts to decrease [25, 26].
• Ensemble Methods: Techniques that combine multiple models to improve performance and reduce overfitting. Random forests and boosting methods such as AdaBoost, Gradient Boosting Machines (GBM), and XGBoost build ensembles of decision trees [26]. Bagging is an ensemble technique that reduces variance by training multiple models and averaging their results [27-29].
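A short sketch showing how several of the metrics above can be computed with scikit-learn (the toy values are illustrative only):

from sklearn.metrics import (accuracy_score, f1_score, mean_absolute_error,
                             mean_squared_error, precision_score, recall_score)

# Regression metrics: compare predictions with true values
y_true_reg = [3.0, 5.0, 7.0]
y_pred_reg = [2.5, 5.5, 8.0]
mse = mean_squared_error(y_true_reg, y_pred_reg)
print(mse, mse ** 0.5, mean_absolute_error(y_true_reg, y_pred_reg))  # MSE, RMSE, MAE

# Classification metrics: compare predicted labels with true labels
y_true_cls = [1, 0, 1, 1, 0]
y_pred_cls = [1, 0, 0, 1, 0]
print(accuracy_score(y_true_cls, y_pred_cls),
      precision_score(y_true_cls, y_pred_cls),
      recall_score(y_true_cls, y_pred_cls),
      f1_score(y_true_cls, y_pred_cls))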

    Step-by-Step Process for Model Evaluation:

    1. Data Splitting: Divide the data into training, validation, and test sets [2, 20].
    2. Algorithm Selection: Choose an appropriate algorithm based on the problem and data characteristics [24].
    3. Model Training: Train the selected model using the training data [24].
    4. Hyperparameter Tuning: Adjust model parameters using the validation data to minimize errors [21].
    5. Model Evaluation: Evaluate the model’s performance on the test data using chosen metrics [21, 22].
    6. Analysis and Refinement: Analyze the results, make adjustments, and retrain the model if necessary [3, 17, 30].

    Importance of Model Evaluation:

    • Ensures Model Generalization: It helps to ensure that the model performs well on new, unseen data, rather than just memorizing the training data [22].
    • Identifies Model Issues: It helps in detecting issues like overfitting, underfitting, and bias [17-19].
    • Guides Model Improvement: It provides insights into how the model can be improved through hyperparameter tuning, data collection, or algorithm selection [21, 24, 25].
    • Validates Model Reliability: It validates the model’s ability to provide accurate and reliable results [2, 15].

    Additional Notes:

    • Statistical significance is an important concept in model evaluation to ensure that the results are unlikely to have occurred by random chance [31, 32].
    • When evaluating models, it is important to understand the trade-off between model complexity and generalizability [33, 34].
    • It is important to check the assumptions of the model, for example, when using linear regression, it is essential to check assumptions such as linearity, exogeneity, and homoscedasticity [35-39].
    • Different types of machine learning models should be evaluated using appropriate metrics. For example, classification models use metrics like accuracy, precision, recall, and F1 score, while regression models use metrics like MSE, RMSE, and MAE [6, 9].

    By carefully evaluating machine learning models, one can build reliable systems that address real-world problems effectively [2, 3, 40, 41].

    AI Foundations Course – Python, Machine Learning, Deep Learning, Data Science

    By Amjad Izhar
    Contact: amjad.izhar@gmail.com
    https://amjadizhar.blog

  • PyTorch Deep Learning & Machine Learning

    PyTorch Deep Learning & Machine Learning

    This PDF excerpt details a PyTorch deep learning course. The course teaches PyTorch fundamentals, including tensor manipulation and neural network architecture. It covers various machine learning concepts, such as linear and non-linear regression, classification (binary and multi-class), and computer vision. Practical coding examples using Google Colab are provided throughout, demonstrating model building, training, testing, saving, and loading. The course also addresses common errors and troubleshooting techniques, emphasizing practical application and experimentation.

    PyTorch Deep Learning Study Guide

    Quiz

    1. What is the difference between a scalar and a vector? A scalar is a single number, while a vector has magnitude and direction and is represented by multiple numbers in a single dimension.
2. How can you determine the number of dimensions of a tensor? You can determine the number of dimensions of a tensor by counting the number of pairs of square brackets, or by checking the tensor's ndim attribute.
    3. What is the purpose of the .shape attribute of a tensor? The .shape attribute of a tensor returns a tuple that represents the size of each dimension of the tensor. It indicates the number of elements in each dimension, providing information about the tensor’s structure.
    4. What does the dtype of a tensor represent? The dtype of a tensor represents the data type of the elements within the tensor, such as float32, float16, or int32. It specifies how the numbers are stored in memory, impacting precision and memory usage.
5. What is the difference between reshape and view when manipulating tensors? Both reshape and view change the shape of a tensor. view always returns a view that shares memory with the original tensor, so changes in the view affect the original data; reshape returns a view when possible but may copy the data and allocate new memory when the tensor's memory layout requires it.
    6. Explain what tensor aggregation is and provide an example. Tensor aggregation involves reducing the number of elements in a tensor by applying an operation like min, max, or mean. For example, finding the minimum value in a tensor reduces all of the elements to a single number.
7. What does the stack function do to tensors and how is it different from unsqueeze? The stack function concatenates a sequence of tensors along a new dimension, increasing the number of dimensions by one. unsqueeze adds a single dimension of size one to a target tensor at a specified position.
    8. What does the term “device agnostic code” mean, and why is it important in PyTorch? Device-agnostic code in PyTorch means writing code that can run on either a CPU or GPU without modification. This is important for portability and leveraging the power of GPUs when available.
9. In PyTorch, what is a “parameter”, how is it created, and what special property does it have? A “parameter” is a special type of tensor created using nn.Parameter. When assigned as a module attribute, parameters are automatically added to the module’s parameter list, enabling gradient tracking during training.
    10. Explain the primary difference between the training loop and the testing/evaluation loop in a neural network. The training loop involves the forward pass, loss calculation, backpropagation and updating the model’s parameters through optimization, whereas the testing/evaluation loop involves only the forward pass and loss and/or accuracy calculation without gradient calculation and parameter updates.

    Essay Questions

    1. Discuss the importance of tensor operations in deep learning. Provide specific examples of how reshaping, indexing, and aggregation are utilized.
    2. Explain the significance of data types in PyTorch tensors, and elaborate on the potential issues that can arise from data type mismatches during tensor operations.
    3. Compare and contrast the use of reshape, view, stack, squeeze, and unsqueeze when dealing with tensors. In what scenarios might one operation be preferable over another?
    4. Describe the key steps involved in the training loop of a neural network. Explain the role of the loss function, optimizer, and backpropagation in the learning process.
    5. Explain the purpose of the torch.utils.data.DataLoader and the advantages it provides. Discuss how it can improve the efficiency and ease of use of data during neural network training.

    Glossary

    Scalar: A single numerical value. It has no direction or multiple dimensions.

    Vector: A mathematical object that has both magnitude and direction, often represented as an ordered list of numbers, i.e. in one dimension.

    Matrix: A rectangular array of numbers arranged in rows and columns, i.e. in two dimensions.

    Tensor: A generalization of scalars, vectors, and matrices. It can have any number of dimensions.

    Dimension (dim): Refers to the number of indices needed to address individual elements in a tensor, which is also the number of bracket pairs.

    Shape: A tuple that describes the size of each dimension of a tensor.

    Dtype: The data type of the elements in a tensor, such as float32, int64, etc.

    Indexing: Selecting specific elements or sub-tensors from a tensor using their positions in the dimensions.

    Reshape: Changing the shape of a tensor while preserving the number of elements.

    View: Creating a new view of a tensor’s data without copying. Changing the view will change the original data, and vice versa.

    Aggregation: Reducing the number of elements in a tensor by applying an operation (e.g., min, max, mean).

    Stack: Combining multiple tensors along a new dimension.

    Squeeze: Removing dimensions of size 1 from a tensor.

    Unsqueeze: Adding a new dimension of size 1 to a tensor.

    Device: The hardware on which computations are performed (e.g., CPU, GPU).

    Device Agnostic Code: Code that can run on different devices (CPU or GPU) without modification.

    Parameter (nn.Parameter): A special type of tensor that can be tracked during training, is a module attribute and is automatically added to a module’s parameter list.

    Epoch: A complete pass through the entire training dataset.

    Training Loop: The process of iterating through the training data, calculating loss, and updating model parameters.

    Testing/Evaluation Loop: The process of evaluating model performance on a separate test dataset.

    DataLoader: A utility in PyTorch that creates an iterable over a dataset, managing batching and shuffling of the data.

    Flatten: A layer that flattens a multi-dimensional tensor into a single dimension.

    PyTorch Deep Learning Fundamentals

Here is a detailed briefing document summarizing the key themes and ideas from the provided source, with relevant quotes included:

    Briefing Document: PyTorch Deep Learning Fundamentals

    Introduction:

    This document summarizes the core concepts and practical implementations of PyTorch for deep learning, as detailed in the provided course excerpts. The focus is on tensors, their properties, manipulations, and usage within the context of neural network building and training.

    I. Tensors: The Building Blocks

    • Definition: Tensors are the fundamental data structure in PyTorch, used to encode data as numbers. Traditional terms like scalars, vectors, and matrices are all represented as tensors in PyTorch.
    • “basically anytime you encode data into numbers, it’s of a tensor data type.”
    • Scalars: A single number.
    • “A single number, number of dimensions, zero.”
    • Vectors: Have magnitude and direction and typically have more than one number.
    • “a vector typically has more than one number”
    • “a number with direction, number of dimensions, one”
    • Matrices: Two-dimensional tensors.
    • “a matrix, a tensor.”
    • Dimensions (ndim): Represented by the number of square bracket pairings in the tensor’s definition.
    • “dimension is like number of square brackets…number of pairs of closing square brackets.”
    • Shape: Defines the size of each dimension in a tensor.
• For example, a vector [1, 2] has a shape of (2,). A matrix [[1, 2], [3, 4]] has a shape of (2, 2).
    • “the shape of the vector is two. So we have two by one elements.”
    • Data Type (dtype): Tensors have a data type (e.g., float32, float16, int32, long). The default dtype in PyTorch is float32.
    • “the default data type in pytorch, even if it’s specified as none is going to come out as float 32.”
    • It’s important to ensure tensors have compatible data types when performing operations to avoid errors.
    • Device: Tensors can reside on different devices, such as the CPU or GPU (CUDA). Device-agnostic code is recommended to handle this.

    II. Tensor Creation and Manipulation

• Creation:
• torch.tensor(): Creates tensors from lists or NumPy arrays.
    • torch.zeros(): Creates a tensor filled with zeros.
    • torch.ones(): Creates a tensor filled with ones.
    • torch.arange(): Creates a 1D tensor with a range of values.
    • torch.rand(): Creates a tensor with random values.
• torch.randn(): Creates a tensor with random values drawn from a normal distribution.
    • torch.zeros_like()/torch.ones_like()/torch.rand_like(): Creates tensors with the same shape as another tensor.
    • Indexing: Tensors can be accessed via numerical indices, allowing one to extract elements or subsets.
    • “This is where the square brackets, the pairings come into play.”
• Reshaping:
• reshape(): Changes the shape of a tensor, provided the total number of elements remains the same.
    • view(): Creates a view of the tensor, sharing the same memory, but does not change the shape of the original tensor. Modifying a view changes the original tensor.
• Stacking: torch.stack() concatenates tensors along a new dimension. torch.vstack() and torch.hstack() concatenate along existing vertical and horizontal axes.
    • Squeezing and Unsqueezing: squeeze() removes dimensions of size 1, and unsqueeze() adds dimensions of size 1.
    • Element-wise operations: standard operations like +, -, *, / are applied element-wise.
• These operations return new tensors; reassigning the variable (e.g., tensor = tensor * 10) rebinds the name to the result rather than modifying the original tensor in place.
    • Matrix Multiplication: Use @ operator (or .matmul() function). Inner dimensions must match for valid matrix multiplication.
    • “inner dimensions must match.”
• Transpose: tensor.T will transpose a tensor (swap rows and columns).
    • Aggregation: Functions like torch.min(), torch.max(), torch.mean(), and their respective index finders like torch.argmin()/torch.argmax() reduce the tensor to scalar values.
    • “So you’re turning it from nine elements to one element, hence aggregation.”
• Attributes: Tensors have attributes like dtype and shape, retrieved with tensor.dtype and tensor.shape (or tensor.size()). Several of the operations above are pulled together in the sketch after this list.
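The following minimal sketch combines several of these operations (illustrative values; note that mean() requires a floating-point dtype, echoing the dtype caveat above):

import torch

x = torch.arange(1, 10)                 # tensor([1, 2, ..., 9]), shape (9,)
matrix = x.reshape(3, 3)                # same 9 elements, new shape (3, 3)
stacked = torch.stack([x, x], dim=0)    # new dimension: shape (2, 9)
col = x.unsqueeze(dim=1)                # add a size-1 dimension: shape (9, 1)
flat = col.squeeze()                    # remove size-1 dimensions: back to (9,)
zeros = torch.zeros_like(matrix)        # zeros with matrix's shape and dtype
product = matrix @ matrix.T             # inner dimensions match: (3, 3) @ (3, 3)
print(x.min(), x.max(), x.argmax())     # aggregation to single values
print(x.float().mean())                 # mean() needs a floating-point dtype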

    III. Neural Networks with PyTorch

    • torch.nn Module: The module provides building blocks for creating neural networks.
    • “nn is the building block layer for neural networks.”
    • nn.Module: The base class for all neural network modules. Custom models should inherit from this class.
    • Linear Layers (nn.Linear): Represents a linear transformation (y = Wx + b).
    • Activation Functions: Non-linear functions such as ReLU (Rectified Linear Unit) and Sigmoid, enable neural networks to learn complex patterns.
    • “one divided by one plus torch exponential of negative x.”
    • Parameter (nn.Parameter): A special type of tensor that is added to a module’s parameter list, allowing automatic gradient tracking
    • “Parameters are torch tensor subclasses…automatically added to the list of its parameters.”
    • It’s critical to set requires_grad=True for parameters that need to be optimized during training.
    • Sequential Container (nn.Sequential): A convenient way to create models by stacking layers in a sequence.
    • Forward Pass: The computation of the model’s output given the input data. This is implemented in the forward() method of a class inheriting from nn.Module.
    • “Do the forward pass.”
    • Loss Functions: Measure the difference between the predicted and actual values.
    • “Calculate the loss.”
    • Optimizers: Algorithms that update the model’s parameters based on the loss function during training (e.g., torch.optim.SGD).
    • “optimise a step, step, step.”
    • Use optimizer.zero_grad() to reset the gradients before each training step.
    • Training Loop: The iterative process of:
    1. Forward pass
    2. Calculate Loss
    3. Optimizer zero grad
    4. Loss backwards
    5. Optimizer Step
• Evaluation Mode: Set the model to evaluation mode with model.eval() before inference (testing/evaluation); this switches layers such as dropout to evaluation behavior. Gradient tracking is turned off separately, e.g. with torch.inference_mode() (a model-plus-training-loop sketch follows this list).
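A minimal, self-contained sketch of these building blocks on a toy binary-classification problem (the synthetic data, architecture, and hyperparameters are illustrative, not the course's exact model). The loop comments mirror the five training steps listed above.

import torch
from torch import nn

torch.manual_seed(42)
X = torch.randn(100, 2)
y = (X.sum(dim=1, keepdim=True) > 0).float()    # toy binary labels

model = nn.Sequential(                          # nn.Sequential stacks layers in order
    nn.Linear(2, 8),
    nn.ReLU(),                                  # non-linear activation
    nn.Linear(8, 1),
)
loss_fn = nn.BCEWithLogitsLoss()                # binary classification loss on logits
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(100):
    y_logits = model(X)                 # 1. forward pass
    loss = loss_fn(y_logits, y)         # 2. calculate the loss
    optimizer.zero_grad()               # 3. optimizer zero grad
    loss.backward()                     # 4. loss backwards (backpropagation)
    optimizer.step()                    # 5. optimizer step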

    IV. Data Handling

    • torch.utils.data.Dataset: A class for representing datasets, and custom datasets can be built using this.
    • torch.utils.data.DataLoader: An iterable to batch data for use during training.
    • “This creates a Python iterable over a data set.”
    • Transforms: Functions that modify data (e.g., images) before they are used in training. They can be composed together.
    • “This little transforms module, the torch vision library will change that back to 64 64.”
    • Device Agnostic Data: Send data to the appropriate device (CPU/GPU) using .to(device)
• NumPy Interoperability: PyTorch can consume NumPy arrays via torch.from_numpy(), but the data type usually needs converting from NumPy's default float64 to PyTorch's default torch.float32 (shown in the sketch after this list).
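A small sketch combining these points: a NumPy array converted to float32, wrapped in a Dataset, batched with a DataLoader, and sent to the available device (the data here is made up for illustration):

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

array = np.arange(10.0)                                   # NumPy defaults to float64
features = torch.from_numpy(array).type(torch.float32)    # convert to PyTorch's default
labels = torch.zeros(10)
dataset = TensorDataset(features.unsqueeze(1), labels)    # pair features with labels
loader = DataLoader(dataset, batch_size=4, shuffle=True)  # Python iterable over batches

device = "cuda" if torch.cuda.is_available() else "cpu"
for batch_features, batch_labels in loader:
    batch_features = batch_features.to(device)            # device-agnostic data
    batch_labels = batch_labels.to(device)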

    V. Visualization

    • Matplotlib: Library is used for visualizing plots and images.
    • “Our data explorers motto is visualize, visualize, visualize.”
    • plt.imshow(): Displays images.
    • plt.plot(): Displays data in a line plot.

    VI. Key Practices

    • Visualize, Visualize, Visualize: Emphasized for data exploration.
• Device-Agnostic Code: Aim to write code that can run on both CPU and GPU (see the snippet after this list).
    • Typo Avoidance: Be careful to avoid typos as they can cause errors.
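The standard device-agnostic pattern looks like this (a minimal sketch):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"   # pick the GPU when available
tensor = torch.rand(3, 3).to(device)                      # same code runs on CPU or GPU
model = torch.nn.Linear(3, 1).to(device)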

    VII. Specific Examples/Concepts Highlighted:

    • Image data: tensors are often (height, width, color_channels) or (batch_size, color_channels, height, width)
• Linear regression: the formula y = weight * x + bias
    • Non linear transformations: using activation functions to introduce non-linearity
    • Multi-class data sets: Using make_blobs function to generate multiple data classes.
• Convolutional layers (nn.Conv2d): For processing images; these require parameters such as in_channels, out_channels, kernel_size, stride, and padding (a small sketch follows this list).
    • Flatten layer (nn.Flatten): Used to flatten the input into a vector before a linear layer.
    • Data Loaders: Batches of data in an iterable for training or evaluation loops.
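As a rough illustration of the convolutional and flatten layers mentioned above, here is a tiny hypothetical CNN that traces the tensor shapes (channel counts and sizes are arbitrary):

import torch
from torch import nn

images = torch.rand(8, 3, 64, 64)    # (batch_size, color_channels, height, width)
cnn = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),     # halves height and width: 64x64 -> 32x32
    nn.Flatten(),                    # (8, 16, 32, 32) -> (8, 16*32*32)
    nn.Linear(16 * 32 * 32, 10),     # 10 output classes
)
print(cnn(images).shape)             # torch.Size([8, 10])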

    Conclusion:

    This document provides a foundation for understanding the essential elements of PyTorch for deep learning. It highlights the importance of tensors, their manipulation, and their role in building and training neural networks. Key concepts such as the training loop, device-agnostic coding, and the value of visualization are also emphasized.

    This briefing should serve as a useful reference for anyone learning PyTorch and deep learning fundamentals from these course materials.

    PyTorch Fundamentals: Tensors and Neural Networks

    1. What is a tensor in PyTorch and how does it relate to scalars, vectors, and matrices?

    In PyTorch, a tensor is the fundamental data structure used to represent data. Think of it as a generalization of scalars, vectors, and matrices. A scalar is a single number (0 dimensions), a vector has magnitude and direction, and is represented by one dimension, while a matrix has two dimensions. Tensors can have any number of dimensions and can store numerical data of various types. In essence, when you encode any kind of data into numbers within PyTorch, it becomes a tensor. PyTorch uses the term tensor to refer to any of these data types.

    2. How are the dimensions and shape of a tensor determined?

The dimension of a tensor can be determined by the number of square bracket pairs used to define it. For example, [1, 2, 3] is a vector with one dimension (one pair of square brackets), and [[1, 2], [3, 4]] is a matrix with two dimensions (two pairs). The shape of a tensor refers to the size of each dimension. For instance, [1, 2, 3] has a shape of (3,), meaning 3 elements in its single dimension, while [[1, 2], [3, 4]] has a shape of (2, 2), meaning 2 rows and 2 columns. Note: The shape is determined by the number of elements in each dimension.

    3. How do you create tensors with specific values in PyTorch?

    PyTorch provides various functions to create tensors:

    • torch.tensor([value1, value2, …]) directly creates a tensor from a Python list. You can control the data type (dtype) of the tensor during its creation by passing the dtype argument.
    • torch.zeros(size) creates a tensor filled with zeros of the specified size.
    • torch.ones(size) creates a tensor filled with ones of the specified size.
    • torch.rand(size) creates a tensor filled with random values from a uniform distribution (between 0 and 1) of the specified size.
    • torch.arange(start, end, step) creates a 1D tensor containing values from start to end (exclusive), incrementing by step.
    • torch.zeros_like(other_tensor) and torch.ones_like(other_tensor) create tensors with the same shape and dtype as the other_tensor, filled with zeros or ones respectively.

    4. What is the importance of data types (dtypes) in tensors, and how can they be changed?

    Data types determine how data is stored in memory, which has implications for precision and memory usage. The default data type in PyTorch is torch.float32. To change a tensor’s data type, you can use the .type() method, e.g. tensor.type(torch.float16) will convert a tensor to 16 bit float. While PyTorch can often automatically handle operations between different data types, using the correct data type can prevent unexpected errors or behaviors. It’s good to be explicit.

    5. What are tensor attributes such as shape, size, and Dtype and how do they relate to tensor manipulation?

    These are attributes that can be used to understand, manipulate, and diagnose issues with tensors.

    • Shape: An attribute that represents the dimensions of the tensor. For example, a matrix might have a shape of (3, 4), indicating it has 3 rows and 4 columns. You can access this information by using .shape
    • Size: Acts like .shape but is a method i.e. .size(). It will return the dimensions of the tensor.
    • Dtype: Stands for data type. This defines the way the data is stored and impacts precision and memory use. You can access this by using .dtype.

    These attributes can be used to diagnose issues, for example you might want to ensure all tensors have compatible data types and dimensions for multiplication.

    6. How do operations like reshape, view, stack, unsqueeze, and squeeze modify the shape of tensors?

• reshape(new_shape): Changes the shape of a tensor to a new shape, as long as the total number of elements remains the same; a tensor with 9 elements can be reshaped into (3, 3) or (9, 1), for example.
• view(new_shape): Similar to reshape, but it can only be applied to a contiguous tensor (one whose elements are laid out in contiguous memory), and it shares the same memory as the original tensor, so changes to one affect the other.
    • stack(tensors, dim): Concatenates multiple tensors along a new dimension (specified by dim) and increases the overall dimensionality by 1.
    • unsqueeze(dim): Inserts a new dimension of size one at a specified position, increasing the overall dimensionality by 1.
• squeeze(): Removes all dimensions of size one from a tensor, reducing its overall dimensionality. (These operations are demonstrated in the sketch below.)
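A short sketch demonstrating these shape operations, including view's memory sharing (values are illustrative):

import torch

x = torch.arange(1, 10)                      # shape (9,)
print(x.reshape(3, 3).shape)                 # torch.Size([3, 3])
v = x.view(9, 1)                             # shares memory with x
v[0, 0] = 100                                # also changes x[0]
print(x[0])                                  # tensor(100)
print(torch.stack([x, x], dim=0).shape)      # torch.Size([2, 9]) -- new dimension
print(x.unsqueeze(dim=0).shape)              # torch.Size([1, 9])
print(x.unsqueeze(dim=0).squeeze().shape)    # back to torch.Size([9])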

    7. What are the key components of a basic neural network training loop?

    The key components include:

    • Forward Pass: The input data goes through the model, producing the output.
    • Calculate Loss: The error is calculated by comparing the output to the true labels.
    • Zero Gradients: Previous gradients are cleared before starting a new iteration to prevent accumulating them across iterations.
    • Backward Pass: The error is backpropagated through the network to calculate gradients.
    • Optimize Step: The model’s parameters are updated based on the gradients using an optimizer.
    • Testing / Validation Step: The model’s performance is evaluated against a test or validation dataset.

    8. What is the purpose of torch.nn.Module and torch.nn.Parameter in PyTorch?

    • torch.nn.Module is a base class for creating neural network models. Modules provide a way to organize and group layers and functions, such as linear layers, activation functions, and other model components. It keeps track of learnable parameters.
• torch.nn.Parameter is a special subclass of torch.Tensor used to represent the learnable parameters of a model. When parameters are assigned as module attributes, PyTorch automatically registers them for gradient tracking and optimization. Setting requires_grad=True on a parameter tells PyTorch to calculate and store gradients for it during backpropagation (see the sketch below).
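A minimal sketch of a custom module using nn.Parameter, following the linear-regression formula y = weight * x + bias discussed elsewhere in these notes (the class name is illustrative):

import torch
from torch import nn

class LinearRegressionModel(nn.Module):         # hypothetical name for illustration
    def __init__(self):
        super().__init__()
        # assigning nn.Parameter as attributes registers them automatically
        self.weight = nn.Parameter(torch.randn(1), requires_grad=True)
        self.bias = nn.Parameter(torch.randn(1), requires_grad=True)

    def forward(self, x):
        return self.weight * x + self.bias      # y = weight * x + bias

model = LinearRegressionModel()
print(list(model.parameters()))                 # both tensors appear in the list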

    PyTorch: A Deep Learning Framework

    PyTorch is a machine learning framework written in Python that is used for deep learning and other machine learning tasks [1]. The framework is popular for research and allows users to write fast deep learning code that can be accelerated by GPUs [2, 3].

    Key aspects of PyTorch include:

    • Tensors: PyTorch uses tensors as a fundamental building block for numerical data representation. These can be of various types, and neural networks perform mathematical operations on them [4, 5].
    • Neural Networks: PyTorch is often used for building neural networks, including fully connected and convolutional neural networks [6]. These networks are constructed using layers from the torch.nn module [7].
    • GPU Acceleration: PyTorch can leverage GPUs via CUDA to accelerate machine learning code. GPUs are fast at numerical calculations, which are very important in deep learning [8-10].
    • Flexibility: The framework allows for customization, and users can combine layers in different ways to build various kinds of neural networks [6, 11].
    • Popularity: PyTorch is a popular research machine learning framework, with 58% of papers with code implemented using PyTorch [2, 12, 13]. It is used by major organizations such as Tesla, OpenAI, Facebook, and Microsoft [14-16].

    The typical workflow when using PyTorch for deep learning includes:

    • Data Preparation: The first step is getting the data ready, which can involve numerical encoding, turning the data into tensors, and loading the data [17-19].
    • Model Building: PyTorch models are built using the nn.Module class as a base and defining the forward computation [20-23]. This includes choosing appropriate layers and defining their interconnections [11].
• Model Fitting: The model is fitted to the data using an optimization loop and a loss function [19]. This involves calculating gradients using backpropagation and updating model parameters using gradient descent [24-27].
    • Model Evaluation: Model performance is evaluated by measuring how well the model performs on unseen data, using metrics such as accuracy and loss [28].
    • Saving and Loading: Trained models can be saved and reloaded using the torch.save, torch.load, and torch.nn.Module.load_state_dict functions [29, 30].

    Some additional notes on PyTorch include:

    • Reproducibility: Randomness is important in neural networks; it’s necessary to set random seeds to ensure reproducibility of experiments [31, 32].
    • Device Agnostic Code: It’s useful to write device agnostic code, which means code that can run on either a CPU or a GPU [33, 34].
    • Integration: PyTorch integrates well with other libraries, such as NumPy, which is useful for pre-processing and other numerical tasks [35, 36].
    • Documentation: The PyTorch website and documentation serve as the primary resource for learning about the framework [2, 37, 38].
    • Community Support: Online forums and communities provide places to ask questions and share code [38-40].

    Overall, PyTorch is a very popular and powerful tool for deep learning and machine learning [2, 12, 13]. It provides tools to enable users to build, train, and deploy neural networks with ease [3, 16, 41].

    Understanding Machine Learning Models

    Machine learning models learn patterns from data, which is converted into numerical representations, and then use these patterns to make predictions or classifications [1-4]. The models are built using code and math [1].

    Here are some key aspects of machine learning models based on the sources:

    • Data Transformation: Machine learning models require data to be converted into numbers, a process sometimes called numerical encoding [1-4]. This can include images, text, tables of numbers, audio files, or any other type of data [1].
    • Pattern Recognition: After data is converted to numbers, machine learning models use algorithms to find patterns in that data [1, 3-5]. These patterns can be complex and are often not interpretable by humans [6, 7]. The models can learn patterns through code, using algorithms to find the relationships in the numerical data [5].
    • Traditional Programming vs. Machine Learning: In traditional programming, rules are hand-written to manipulate input data and produce desired outputs [8]. In contrast, machine learning algorithms learn these rules from data [9, 10].
    • Supervised Learning: Many machine learning algorithms use supervised learning. This involves providing input data along with corresponding output data (features and labels), and then the algorithm learns the relationships between the inputs and outputs [9].
    • Parameters: Machine learning models learn parameters that represent the patterns in the data [6, 11]. Parameters are values that the model sets itself [12]. These are often numerical and can be large, sometimes numbering in the millions or even trillions [6].
    • Explainability: The patterns learned by a deep learning model are often uninterpretable by a human [6]. Sometimes, these patterns are lists of numbers in the millions, which is difficult for a person to understand [6, 7].
    • Model Evaluation: The performance of a machine learning model can be evaluated by making predictions and comparing those predictions to known labels or targets [13-15]. The goal of training a model is to move from some unknown parameters to a better, known representation of the data [16]. The loss function is used to measure how wrong a model’s predictions are compared to the ideal predictions [17].
    • Model Types: Machine learning models include:
    • Linear Regression: Models which use a linear formula to draw patterns in data [18]. These models use parameters such as weights and biases to perform forward computation [18].
    • Neural Networks: Neural networks are the foundation of deep learning [19]. These are typically used for unstructured data such as images [19, 20]. They use a combination of linear and non-linear functions to draw patterns in data [21-23].
    • Convolutional Neural Networks (CNNs): These are a type of neural network often used for computer vision tasks [19, 24]. They process images through a series of layers, identifying spatial features in the data [25].
    • Gradient Boosted Machines: Algorithms such as XGBoost are often used for structured data [26].
    • Use Cases: Machine learning can be applied to virtually any problem where data can be converted into numbers and patterns can be found [3, 4]. However, simple rule-based systems are preferred if they can solve a problem, and machine learning should not be used simply because it can [5, 27]. Machine learning is useful for complex problems with long lists of rules [28, 29].
    • Model Training: The training process is iterative and involves multiple steps, and it can also be seen as an experimental process [30, 31]. In each step, the machine learning model is used to make predictions and its parameters are adjusted to minimize error [13, 32].

    In summary, machine learning models are algorithms that can learn patterns from data by converting the data into numbers, using various algorithms, and adjusting parameters to improve performance. Models are typically evaluated against known data with a loss function, and there are many types of models and use cases depending on the type of problem [6, 9-11, 13, 32].

    Understanding Neural Networks

Neural networks are a type of machine learning model inspired by the structure of the human brain [1]. They consist of interconnected nodes, or neurons, organized in layers, and are used to identify patterns in data [1-3].

    Here are some key concepts for understanding neural networks:

    • Structure:
    • Layers: Neural networks are made of layers, including an input layer, one or more hidden layers, and an output layer [1, 2]. The ‘deep’ in deep learning comes from having multiple hidden layers [1, 4].
    • Nodes/Neurons: Each layer is composed of nodes or neurons [4, 5]. Each node performs a mathematical operation on the input it receives.
    • Connections: Nodes in adjacent layers are connected, and these connections have associated weights that are adjusted during the learning process [6].
    • Architecture: The arrangement of layers and connections determines the neural network’s architecture [7].
    • Function:
    • Forward Pass: In a forward pass, input data is passed through the network, layer by layer [8]. Each layer performs mathematical operations on the input, using linear and non-linear functions [5, 9].
    • Mathematical Operations: Each layer is typically a combination of linear (straight line) and nonlinear (non-straight line) functions [9].
    • Nonlinearity: Nonlinear functions, such as ReLU or sigmoid, are critical for enabling the network to learn complex patterns [9-11].
    • Representation Learning: The network learns a representation of the input data by manipulating patterns and features through its layers [6, 12]. This representation is also called a weight matrix or weight tensor [13].
    • Output: The output of the network is a representation of the learned patterns, which can be converted into a human-understandable format [12-14].
    • Learning Process:
    • Random Initialization: Neural networks start with random numbers as parameters, and they adjust those numbers to better represent the data [15, 16].
    • Loss Function: A loss function is used to measure how wrong the model’s predictions are compared to ideal predictions [17-19].
    • Backpropagation: Backpropagation is an algorithm that calculates the gradients of the loss with respect to the model’s parameters [20].
    • Gradient Descent: Gradient descent is an optimization algorithm used to update model parameters to minimize the loss function [20, 21].
    • Types of Neural Networks:
    • Fully Connected Neural Networks: These networks have connections between all nodes in adjacent layers [1, 22].
    • Convolutional Neural Networks (CNNs): CNNs are particularly useful for processing images and other visual data, and they use convolutional layers to identify spatial features [1, 23, 24].
    • Recurrent Neural Networks (RNNs): These are often used for sequence data [1, 25].
    • Transformers: Transformers have become popular in recent years and are used in natural language processing and other applications [1, 25, 26].
    • Customization: Neural networks are highly customizable, and they can be designed in many different ways [4, 25, 27]. The specific architecture and layers used are often tailored to the specific problem at hand [22, 24, 26-28].

    Neural networks are a core component of deep learning, and they can be applied to a wide range of problems including image recognition, natural language processing, and many others [22, 23, 25, 26]. The key to using neural networks effectively is to convert data into a numerical representation, design a network that can learn patterns from the data, and use optimization techniques to train the model.

    Machine Learning Model Training

    The model training process in machine learning involves using algorithms to adjust a model’s parameters so it can learn patterns from data and make accurate predictions [1, 2]. Here’s an overview of the key steps in training a model, according to the sources:

    • Initialization: The process begins with a model that has randomly assigned parameters, such as weights and biases [1, 3]. These parameters are what the model adjusts during training [4, 5].
    • Data Input: The training process requires input data to be passed through the model [1]. The data is typically split into a training set for learning and a test set for evaluation [6].
    • Forward Pass: Input data is passed through the model, layer by layer [7]. Each layer performs mathematical operations on the input, which may include both linear and nonlinear functions [8]. This forward computation produces a prediction, called the model’s output or sometimes logits [9, 10].
    • Loss Calculation: A loss function is used to measure how wrong the model’s predictions are compared to the ideal outputs [4, 11]. The loss function provides a numerical value that represents the error or deviation of the model’s predictions from the actual values [12]. The goal of the training process is to minimize this loss [12, 13].
    • Backpropagation: After the loss is calculated, the backpropagation algorithm computes the gradients of the loss with respect to the model’s parameters [2, 14, 15]. Gradients indicate the direction and magnitude of the change needed to reduce the loss [1].
    • Optimization: An optimizer uses the calculated gradients to update the model’s parameters [4, 11, 16]. Gradient descent is a commonly used optimization algorithm that adjusts the parameters to minimize the loss [1, 2, 15]. The learning rate is a hyperparameter that determines the size of the adjustments [5, 17].
• Training Loop: The process of forward pass, loss calculation, backpropagation, and optimization is repeated iteratively through a training loop [11, 17, 18]. The training loop is where the model learns patterns from the training data [19]. Each complete pass through the training data is called an epoch [20].
    • Evaluation: After training, the model’s performance is evaluated on a separate test data set [19]. This evaluation helps to measure how well the model has learned and whether it can generalize to unseen data [21].

    In PyTorch, the training loop typically involves these steps:

1. Setting the model to training mode using model.train() [22, 23]. This puts the model into training mode so that settings needed for learning, such as gradient tracking and dropout behavior, are active [23].
    2. Performing a forward pass by passing the data through the model.
    3. Calculating the loss by comparing the model’s prediction with the actual data labels.
    4. Setting gradients to zero using optimizer.zero_grad() [24].
    5. Performing backpropagation using loss.backward() [15, 24].
    6. Updating the model’s parameters using optimizer.step() [24].

    During training, models can have two modes: train and evaluation.

    • The train mode tracks gradients and other settings to learn from the data [22, 23].
• The evaluation mode turns off settings not needed for evaluation, such as dropout; gradient tracking is turned off separately (for example, with torch.inference_mode()) to make the code run faster [25, 26] (both modes appear in the sketch below).
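Here is a compact sketch showing the two modes on a toy regression problem (synthetic data; hyperparameters are illustrative):

import torch
from torch import nn

torch.manual_seed(42)
X_train, X_test = torch.randn(80, 1), torch.randn(20, 1)
y_train, y_test = 2 * X_train + 1, 2 * X_test + 1

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

for epoch in range(100):
    model.train()                        # training mode
    y_pred = model(X_train)              # forward pass
    loss = loss_fn(y_pred, y_train)      # calculate the loss
    optimizer.zero_grad()                # reset gradients
    loss.backward()                      # backpropagation
    optimizer.step()                     # update parameters

    model.eval()                         # evaluation mode (e.g., dropout disabled)
    with torch.inference_mode():         # gradient tracking off for speed
        test_loss = loss_fn(model(X_test), y_test)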

    Other key points about the model training process are:

    • Hyperparameters: The training process involves the use of hyperparameters, which are values set by the user, like the learning rate or the number of epochs [5, 23].
    • Experimentation: Model training is often an experimental process, with various parameters and settings being tried to find the best performing model [27, 28].
    • Data: The quality and quantity of the training data has a big effect on the model’s performance [29, 30].
    • Reproducibility: Randomness is an important part of training; to reproduce results, it is necessary to set random seeds [31, 32].
    • Visualization: Visualizing model training through metrics such as accuracy and loss curves is important in understanding whether the model is learning effectively [33-35].
    • Inference: When making predictions after training, the term inference is also used [36]. Inference uses a model to make predictions using unseen data [26, 36].

    In summary, the model training process in machine learning involves iteratively adjusting a model’s parameters to minimize error by using the techniques of gradient descent and backpropagation [1, 2, 14, 15].

    PyTorch Model Deployment

    The sources discuss model deployment in the context of saving and loading models, which is a key part of making a model usable in an application or other context. Here’s a breakdown of model deployment methods based on the sources:

• Saving Models:
• State Dictionary: The recommended way to save a PyTorch model is to save its state dictionary [1, 2]. The state dictionary contains the model's learned parameters, such as weights and biases [3, 4]. This is more flexible than saving the entire model [2]. (An end-to-end save/load sketch follows this list.)
    • File Extension: PyTorch models are commonly saved with a .pth or .pt file extension [5].
    • Saving Process: The saving process involves creating a directory path, defining a model name, and then using torch.save() to save the state dictionary to the specified file path [6, 7].
    • Flexibility: Saving the state dictionary provides flexibility in how the model is loaded and used [8].
    • Loading Models:
    • Loading State Dictionary: To load a saved model, you must create a new instance of the model class and then load the saved state dictionary into that instance [4]. This is done using the load_state_dict() method, along with torch.load(), which reads the file containing the saved state dictionary [9, 10].
    • New Instance: When loading a model, it’s important to remember that you must create a new instance of the model class, and then load the saved parameters into that instance using the load_state_dict method [4, 9, 11].
    • Loading Process: The loading process involves creating a new instance of the model and then calling load_state_dict on the model with the file path to the saved model [12].
    • Inference Mode:
    • Evaluation Mode: Before loading a model for use, the model is typically set to evaluation mode by calling model.eval() [13, 14]. This turns off settings not needed for evaluation, such as dropout layers [15-17].
    • Gradient Tracking: It is also common to use inference mode via the context manager torch.inference_mode to turn off gradient tracking, which speeds up the process of making predictions [18-21]. This is used when you are not training the model, but rather using it to make predictions [19].
    • Deployment Context:
    • Reusability: The sources mention that a saved model can be reused in the same notebook, sent to a friend to try out, or used again in a week’s time [22].
    • Cloud Deployment: Models can be deployed in applications or in the cloud [23].
    • Model Transfer:
    • Transfer Learning: The source mentions that parameters from one model could be used in another model; this process is called transfer learning [24].
    • Other Considerations:
    • Device Agnostic Code: It is recommended to write code that is device agnostic, so it can run on either a CPU or a GPU [25-27].
    • Reproducibility: Random seeds should be set for reproducibility [28, 29].
    • Model Equivalence: After loading a model, it is important to test that the loaded model is equivalent to the original model by comparing predictions [14, 30-32].
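
    A minimal sketch of the save → load → predict workflow described above. The directory, file name, and model class are illustrative choices, not from the sources:

    import torch
    from torch import nn
    from pathlib import Path

    class LinearModel(nn.Module):  # illustrative model class
        def __init__(self):
            super().__init__()
            self.linear = nn.Linear(1, 1)

        def forward(self, x):
            return self.linear(x)

    model_0 = LinearModel()

    # Save: write only the state dictionary to a .pth file
    MODEL_PATH = Path("models")
    MODEL_PATH.mkdir(parents=True, exist_ok=True)
    MODEL_SAVE_PATH = MODEL_PATH / "model_0.pth"
    torch.save(obj=model_0.state_dict(), f=MODEL_SAVE_PATH)

    # Load: create a new instance, then load the saved parameters into it
    loaded_model = LinearModel()
    loaded_model.load_state_dict(torch.load(f=MODEL_SAVE_PATH))

    # Predict: evaluation mode plus inference mode (gradient tracking off)
    loaded_model.eval()
    with torch.inference_mode():
        preds = loaded_model(torch.randn(5, 1))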

    In summary, model deployment involves saving the trained model’s parameters via its state dictionary, loading these parameters into a new model instance, and using the model in evaluation mode with inference mode enabled (gradient tracking off) to make predictions. The sources emphasize the importance of saving models for later use, sharing them, and deploying them in applications or cloud environments.

    PyTorch for Deep Learning & Machine Learning – Full Course

    By Amjad Izhar
    Contact: amjad.izhar@gmail.com
    https://amjadizhar.blog

  • Harvard CS50’s Artificial Intelligence with Python – Full University Course

    Harvard CS50’s Artificial Intelligence with Python – Full University Course

    This source explains how AI can be used for problem-solving, moving from explicit instructions to learning from data. It introduces supervised learning, where AI learns to map inputs to outputs using labeled datasets, covering classification tasks and nearest neighbor algorithms. The source also discusses linear regression, support vector machines, and techniques like perceptron learning. It transitions to reinforcement learning, where AI learns through rewards and punishments in an environment, and touches on unsupervised learning with clustering techniques like k-means. Finally, the document explores neural networks, detailing their structure, training via gradient descent and backpropagation, and their applications in various AI problems.

    Propositional Logic, Model Checking, and Beyond: A Comprehensive Study Guide

    I. Review of Key Concepts

    • Propositional Logic: A system for representing logical statements and reasoning about their truth values.
    • Propositional Symbols: Variables representing simple statements that can be either true or false (e.g., P, Q, R).
    • Logical Connectives: Symbols used to combine propositional symbols into more complex statements:
    • and (∧): Both statements must be true for the combined statement to be true.
    • or (∨): At least one statement must be true for the combined statement to be true.
    • not (¬): Reverses the truth value of a statement.
    • implies (→): If the first statement is true, then the second statement must also be true.
    • biconditional (↔): Both statements have the same truth value (both true or both false).
    • Knowledge Base (KB): A set of sentences representing facts known about the world.
    • Query (α): A question about the world that we want to answer using the KB.
    • Entailment (KB ⊨ α): The relationship between the KB and a query, meaning that the KB logically implies the query; whenever the KB is true, the query must also be true.
    • Model: An assignment of truth values (true or false) to all propositional symbols in the language. Represents a possible world or state.
    • Model Checking: An algorithm for determining entailment by enumerating all possible models and checking if, in every model where the KB is true, the query is also true.
    • Inference Algorithm: A procedure to derive new sentences from existing ones in the KB.
    • Inference Rules: Logical equivalences used to manipulate and simplify logical expressions (e.g., implication elimination, De Morgan’s laws, distributive law).
    • Soundness: An inference algorithm is sound if it only derives conclusions that are entailed by the KB.
    • Completeness: An inference algorithm is complete if it can derive all conclusions that are entailed by the KB.
    • Conjunctive Normal Form (CNF): A logical sentence expressed as a conjunction (AND) of clauses, where each clause is a disjunction (OR) of literals.
    • Clause: A disjunction of literals (e.g., P or not Q or R).
    • Literal: A propositional symbol or its negation (e.g., P, not Q).
    • Resolution: An inference rule that combines two clauses containing complementary literals to produce a new clause.
    • Factoring: Removing duplicate literals within a clause.
    • Empty Clause: The result of resolving two contradictory clauses, representing a contradiction (always false).
    • Inference by Resolution: An algorithm for proving entailment by converting the KB and the negation of the query to CNF, and then repeatedly applying the resolution rule until the empty clause is derived.
    • Joint Probability Distribution: A table showing the probabilities of all possible combinations of values for a set of random variables.
    • Inclusion-Exclusion Formula: A formula for calculating the probability of A or B: P(A or B) = P(A) + P(B) – P(A and B).
    • Marginalization: Calculating the probability of a variable by summing over all possible values of other variables: P(A) = Σ P(A and B).
    • Conditioning: Expressing the probability of A in terms of the conditional probability of A given B and the probability of B: P(A) = P(A|B) * P(B) + P(A|¬B) * P(¬B).
    • Conditional Probability: The probability of event A occurring given that event B has already occurred, denoted P(A|B).
    • Random Variable: A variable whose value is a numerical outcome of a random phenomenon.
    • Heuristic Function: An estimate of the “goodness” of a state (e.g., the distance to the goal).
    • Local Search: A class of optimization algorithms that start with an initial state and iteratively improve it by moving to neighboring states.
    • Hill Climbing: A local search algorithm that repeatedly moves to the neighbor with the highest value.
    • Steepest Ascent Hill Climbing: Chooses the best neighbor among all neighbors in each iteration.
    • Stochastic Hill Climbing: Chooses a neighbor randomly from the neighbors that are better than the current state.
    • First Choice Hill Climbing: Chooses the first neighbor with a higher value and moves there.
    • Random Restart Hill Climbing: Runs hill climbing multiple times with different initial states and returns the best result.
    • Local Beam Search: Keeps track of k best states and expands all of them in each iteration.
    • Local Maximum/Minimum: A state that is better than all its neighbors but not the best state overall.
    • Simulated Annealing: A local search algorithm that sometimes accepts worse neighbors with a probability that decreases over time (temperature).
    • Temperature (in Simulated Annealing): A parameter that controls the probability of accepting worse neighbors; high temperature means higher probability, and low temperature means lower probability.
    • Delta E (ΔE): The difference in value (or cost) between the current state and a neighboring state.
    • Traveling Salesman Problem (TSP): Finding the shortest possible route that visits every city and returns to the origin city.
    • NP-Complete Problems: A class of problems for which no polynomial-time algorithm is known.
    • Linear Programming: A mathematical technique for optimizing a linear objective function subject to linear equality and inequality constraints.
    • Objective Function: A mathematical expression to be minimized or maximized in linear programming.
    • Constraints: Restrictions or limitations on the values of variables in linear programming.
    • Constraint Satisfaction Problem (CSP): A problem where the goal is to find values for a set of variables that satisfy a set of constraints.
    • Variables (in CSP): Entities with associated domains of possible values.
    • Domains (in CSP): The set of possible values that can be assigned to a variable.
    • Constraints (in CSP): Restrictions on the values that variables can take, specifying allowable combinations of values.
    • Unary Constraint: A constraint involving only one variable.
    • Binary Constraint: A constraint involving two variables.
    • Node Consistency: Ensuring that all values in a variable’s domain satisfy the variable’s unary constraints.
    • Arc Consistency: Ensuring that for every value in a variable’s domain, there exists a consistent value in the domain of each of its neighboring variables.
    • AC3: A common algorithm for enforcing arc consistency.
    • Backtracking Search: A recursive algorithm that explores possible solutions by trying different values for variables and backtracking when a constraint is violated.
    • Minimum Remaining Values (MRV) Heuristic: A variable selection strategy that chooses the variable with the fewest remaining legal values.
    • Degree Heuristic: A variable selection strategy that chooses the variable involved in the largest number of constraints on other unassigned variables.
    • Least Constraining Value Heuristic: A value selection strategy that chooses the value that rules out the fewest choices for neighboring variables in the constraint graph.
    • Supervised Machine Learning: A type of machine learning where an algorithm learns from labeled data to make predictions or classifications.
    • Inputs (x): The features or attributes used by a machine learning model to make predictions.
    • Outputs (y): The target variables or labels that a machine learning model is trained to predict.
    • Hypothesis Function (h): A mathematical function that maps inputs to outputs.
    • Weights (w): Parameters in a machine learning model that determine the importance of each input feature.
    • Learning Rate (α): A parameter that controls the step size during training.
    • Threshold Function: A function that outputs one value if the input is above a threshold and another value if the input is below the threshold.
    • Logistic Regression: A statistical method for binary classification using a logistic function to model the probability of a certain class or event.
    • Soft Threshold: A function that smoothly transitions between two values, allowing for outputs between 0 and 1.
    • Dot Product: A mathematical operation that multiplies corresponding elements of two vectors and sums the results.
    • Gradient Descent: An iterative optimization algorithm for finding the minimum of a function.
    • Stochastic Gradient Descent: An optimization algorithm that updates the parameters of a machine learning model using the gradient computed from a single randomly chosen data point.
    • Mini-Batch Gradient Descent: An optimization algorithm that updates the parameters of a machine learning model using the gradient computed from a small batch of data points.
    • Neural Networks: A type of machine learning model inspired by the structure of the human brain, consisting of interconnected nodes (neurons) organized in layers.
    • Activation Function: A function applied to the output of a neuron in a neural network to introduce non-linearity.
    • Layers (in Neural Networks): A level of nodes that receive input from other nodes and pass their output to additional nodes.
    • Natural Language Processing (NLP): The branch of AI that deals with the interaction between computers and human language.
    • Syntax: The set of rules that govern the structure of sentences in a language.
    • Semantics: The meaning of words, phrases, and sentences in a language.
    • Formal Grammar: A set of rules for generating sentences in a language.
    • Context-Free Grammar: A type of formal grammar where rules consist of a single non-terminal symbol on the left-hand side.
    • Terminal Symbol: A symbol that represents a word in a language.
    • Non-Terminal Symbol: A symbol that represents a phrase or category of words in a language.
    • Rewriting Rules: Rules that specify how non-terminal symbols can be replaced by other symbols.
    • Noun Phrase: A phrase that functions as a noun.
    • Verb Phrase: A phrase that functions as a verb.
    • Natural Language Toolkit (NLTK): A Python library for NLP.
    • Parsing: The process of analyzing a sentence according to the rules of a grammar.
    • Syntax Tree: A hierarchical representation of the structure of a sentence.
    • Statistical NLP: An approach to NLP that uses statistical models learned from data.
    • n-gram: A contiguous sequence of n items from a sample of text.
    • Markov Chain: A sequence of events where the probability of each event depends only on the previous event.
    • Tokenization: The process of splitting a sequence of characters into pieces (tokens).
    • Text Classification: The task of assigning a category label to a text.
    • Sentiment Analysis: Determining the emotional tone or attitude expressed in a piece of text.
    • Bag-of-Words Model: A text representation that represents a document as the counts of its words, disregarding grammar and word order.
    • Term Frequency (TF): The number of times a term appears in a document.
    • Inverse Document Frequency (IDF): A measure of how rare a term is across a collection of documents.
    • TF-IDF: A weight used in information retrieval and text mining that reflects how important a word is to a document in a corpus.
    • Stop Words: Common words that are often removed from text before processing.
    • Word Embeddings: Vector representations of words that capture semantic relationships.
    • One-Hot Representation: A vector representation where each word is represented by a vector with a 1 in the corresponding index and 0s elsewhere.
    • Distributed Representation: A vector representation where the meaning of a word is distributed across multiple values.
    • Word2Vec: A model for learning word embeddings.

    II. Short Answer Quiz

    1. Explain the difference between soundness and completeness in the context of inference algorithms. Soundness means that any conclusion drawn by the algorithm is actually entailed by the knowledge base. Completeness means that the algorithm is capable of deriving every conclusion that is entailed by the knowledge base.
    2. Describe the process of converting a logical sentence into Conjunctive Normal Form (CNF). The process involves eliminating bi-conditionals and implications, moving negations inward using De Morgan’s laws, and using the distributive law to get a conjunction of clauses where each clause is a disjunction of literals.
    3. What is the purpose of using the resolution inference rule in propositional logic? The resolution rule is used to derive new clauses from existing ones, aiming to ultimately derive the empty clause, which indicates a contradiction and proves entailment.
    4. Explain the marginalization rule and provide a simple example. Marginalization calculates the probability of a variable by summing over all possible values of other variables. For example, the probability that someone likes ice cream is P(ice cream and chocolate) + P(ice cream and not chocolate), summing the joint probabilities over whether or not they like chocolate.
    5. What is the key idea behind local search algorithms? Local search algorithms start with an initial state and iteratively improve it by moving to neighboring states, based on some evaluation function, without necessarily keeping track of the path taken to reach the solution.
    6. Describe how simulated annealing helps to avoid local optima. Simulated annealing accepts worse neighbors with a probability that decreases over time, allowing the algorithm to escape local optima early in the search and converge towards a global optimum later.
    7. In linear programming, what are the roles of the objective function and constraints? The objective function is what we want to minimize or maximize, while constraints are limitations on the values of variables that must be satisfied.
    8. What is the purpose of enforcing arc consistency in a constraint satisfaction problem (CSP)? Enforcing arc consistency reduces the domains of variables by removing values that cannot be part of any solution due to binary constraints, making the search for a solution more efficient.
    9. Explain the difference between a one-hot representation and a distributed representation in NLP. A one-hot representation represents a word as a vector with a 1 in the corresponding index and 0s elsewhere, while a distributed representation distributes the meaning of a word across multiple values in a vector.
    10. How do word embedding models like Word2Vec capture semantic relationships between words? Word2Vec captures semantic relationships by training a model to predict the context words surrounding a given word in a large corpus, resulting in vector representations where similar words are located close to each other in vector space.

    III. Essay Questions

    1. Compare and contrast model checking and inference by resolution as methods for determining entailment in propositional logic. Discuss the advantages and disadvantages of each approach.
    2. Explain how local search algorithms can be applied to solve optimization problems. Discuss the challenges of local optima and describe techniques, such as simulated annealing, for overcoming these challenges.
    3. Describe the general framework of a constraint satisfaction problem (CSP). Discuss the role of variable and value selection heuristics in improving the efficiency of backtracking search for solving CSPs.
    4. Explain the process of training a machine learning model for sentiment analysis. Discuss the different text representation techniques, such as bag-of-words and TF-IDF, and the role of word embeddings.
    5. Describe the key concepts in Natural Language Processing (NLP), including syntax and semantics. Discuss how NLP techniques are used to understand and generate natural language.

    IV. Glossary of Key Terms

    • Activation Function: A function applied to the output of a neuron in a neural network to introduce non-linearity, enabling the network to learn complex patterns.
    • Arc Consistency: A constraint satisfaction technique ensuring that for every value in a variable’s domain, there exists a consistent value in the domain of each of its neighboring variables based on the problem constraints.
    • Backtracking Search: A recursive algorithm that explores possible solutions by trying different values for variables and backtracking when a constraint is violated, allowing the algorithm to systematically search the solution space.
    • Bag-of-Words Model: A text representation in NLP that represents a document as the counts of its words, disregarding grammar and word order, which helps quantify the content of texts for analysis.
    • Clause: In logic, a statement that combines literals with the “or” relationship (a disjunction of literals).
    • Complete: An inference algorithm that can derive all conclusions entailed by the KB.
    • Conditioning: A probability rule that expresses the probability of one event in terms of conditional probabilities, P(A) = P(A|B)P(B) + P(A|¬B)P(¬B); it is used to find unknown probabilities from the information given.
    • Conjunctive Normal Form (CNF): A standardized logical sentence expressed as a conjunction (AND) of clauses, where each clause is a disjunction (OR) of literals, simplifying logical deductions.
    • Constraints: Restrictions on the values of variables in linear programming or constraint satisfaction problems.
    • Context-Free Grammar: A type of formal grammar where rules consist of a single non-terminal symbol on the left-hand side, used to define the syntax of programming languages.
    • Delta E (ΔE): The difference in value between the current state and its neighboring states.
    • Distributed Representation: A representation in which the meaning of a word is distributed over multiple values in a vector; this is the idea behind word embeddings.
    • Domain: The set of possible values that can be assigned to a variable.
    • Entailment (KB ⊨ α): KB logically implies α; whenever KB is true, α is also true. This is the relationship a machine must establish to determine whether a conclusion is correct.
    • Formal Grammar: A set of rules for generating sentences in a language; in language analysis, those rules are applied to work out what a sentence is saying.
    • Heuristic Function: An estimate of the ‘goodness’ of a state (e.g., the distance to the goal), allowing search algorithms to reach good results efficiently.
    • Hill Climbing: An iterative optimization algorithm that repeatedly moves from the current state to the neighboring state with the highest value, continually searching for a better solution.
    • Hypothesis Function (h): A function that maps inputs to outputs, learned from data and used to make predictions.
    • Inclusion-Exclusion Formula: Used to find P(A or B): P(A or B) = P(A) + P(B) – P(A and B).
    • Inference Algorithm: A procedure to derive new sentences from existing ones in the KB.
    • Joint Probability Distribution: A table showing the probabilities of all possible combinations of values for a set of random variables.
    • Knowledge Base (KB): A set of sentences representing facts known about the world.
    • Layers (in Neural Networks): A level of nodes that receive input from other nodes and pass their output to additional nodes.
    • Learning Rate (α): A hyperparameter that controls the step size of each update during learning.
    • Linear Programming: A mathematical technique for optimizing a linear objective function subject to linear equality and inequality constraints.
    • Literal: A propositional symbol or its negation (e.g., P, not Q) that describes the condition of a statement.
    • Local Maximum/Minimum: A state that is better than all its neighbors but not the best state overall.
    • Local Search: A class of optimization algorithms that start with an initial state and iteratively improve it by moving to neighboring states.
    • Logistic Regression: A statistical method for binary classification using a logistic function to model the probability of a certain class or event.
    • Marginalization: Calculating the probability of a variable by summing over all possible values of other variables: P(A) = Σ P(A and B).
    • Markov Chain: A sequence of events where the probability of each event depends only on the previous event, allowing modeling of sequences over time.
    • Model: An assignment of truth values (true or false) to all propositional symbols in the language, representing a possible world or state.
    • Model Checking: An algorithm for determining entailment by enumerating all possible models and checking if, in every model where the KB is true, the query is also true.
    • n-gram: A contiguous sequence of n items from a sample of text that helps in analyzing languages and predicting text.
    • Natural Language Processing (NLP): The field of AI that is related to the understanding of human language.
    • Noun Phrase: A phrase that functions as a noun, used as a constituent in parsing.
    • NP-Complete Problems: A class of problems for which no polynomial-time algorithm is known.
    • Objective Function: A mathematical expression to be minimized or maximized in linear programming.
    • One-Hot Representation: A vector representation where each word is represented by a vector with a 1 in the corresponding index and 0s elsewhere.
    • Parsing: The process of analyzing a sentence according to the rules of a grammar in NLP.
    • Propositional Logic: A system for representing logical statements and reasoning about their truth values.
    • Query (α): The question that we want to answer using the KB.
    • Random Variable: A variable whose value is a numerical outcome of a random phenomenon.
    • Rewriting Rules: Rules that specify how non-terminal symbols can be replaced by other symbols.
    • Semantics: The meaning of words, phrases, and sentences in a language, which is what must be extracted to genuinely understand language.
    • Simulated Annealing: A local search algorithm that sometimes accepts worse neighbors with a probability that decreases over time (temperature).
    • Soft Threshold: A function that smoothly transitions between two values, allowing for outputs between 0 and 1.
    • Soundness: An inference algorithm is sound if it only derives conclusions that are entailed by the KB.
    • Statistical NLP: An approach to NLP that uses statistical models learned from data.
    • Steepest Ascent Hill Climbing: Chooses the best neighbor among all neighbors in each iteration.
    • Stop Words: Common words that are often removed from text before processing.
    • Syntax: The set of rules that govern the structure of sentences in a language.
    • Syntax Tree: A hierarchical representation of the structure of a sentence, giving a graphical view of how the sentence is built up.
    • Temperature (in Simulated Annealing): A parameter that controls the probability of accepting worse neighbors; high temperature means higher probability, and low temperature means lower probability.
    • Tokenization: The process of splitting a sequence of characters into pieces (tokens), a prerequisite for parsing and most other machine processing of text.
    • Traveling Salesman Problem (TSP): Finding the shortest possible route that visits every city and returns to the origin city.
    • Unary Constraint: A constraint involving only one variable.
    • Verb Phrase: A phrase that functions as a verb, analyzed as a constituent in parsing.
    • Weights (w): Parameters in a machine learning model that determine the importance (emphasis) placed on each input feature.
    • Word Embeddings: Vector representations of words that capture semantic relationships.
    • Word2Vec: A model for learning word embeddings from the contexts in which words appear, so that similar words end up with similar vectors.

    AI: Reasoning, Search, NLP, and Learning Techniques

    Here’s a briefing document summarizing the main themes and ideas from the provided sources.

    Briefing Document: Artificial Intelligence – Reasoning, Search, and Natural Language Processing

    Overview:

    The sources cover several fundamental concepts in Artificial Intelligence (AI), including logical reasoning, search algorithms, probabilistic reasoning, and natural language processing (NLP). They explore techniques for representing knowledge, drawing inferences, solving problems through search, handling uncertainty, and enabling computers to understand and generate human language.

    I. Logical Reasoning and Inference:

    • Entailment and Inference Algorithms: The core idea is that AI systems should be able to determine if a knowledge base (KB) entails a query (alpha). This means: “Given some query about the world…the question we want to ask…is does KB, our knowledge base, entail alpha? In other words, using only the information we know inside of our knowledge base…can we conclude that this sentence alpha is true?”
    • Model Checking: This is a basic inference algorithm. It involves enumerating all possible models (assignments of truth values to variables) and checking if, in every model where the knowledge base is true, the query (alpha) is also true. “If we wanted to determine if our knowledge base entails some query alpha, then we are going to enumerate all possible models…And if in every model where our knowledge base is true, alpha is also true, then we know that the knowledge base entails alpha.”
    • Inference Rules: These are logical transformations used to derive new knowledge from existing knowledge. Examples include:
    • Implication Elimination: alpha implies beta can be transformed into not alpha or beta. “This is a way to translate if-then statements into or statements… if I have the implication, alpha implies beta, that I can draw the conclusion that either not alpha or beta”
    • Biconditional Elimination: a if and only if b becomes a implies b and b implies a.
    • De Morgan’s Laws: These laws relate ANDs and ORs through negation. not (alpha and beta) is equivalent to not alpha or not beta. And not (alpha or beta) is equivalent to not alpha and not beta. “If it is not true that alpha and beta, well, then either not alpha or not beta… if you have a negation in front of an and expression, you move the negation inwards, so to speak…and then flip the and into an or.”
    • Distributive Law: alpha and (beta or gamma) is equivalent to (alpha and beta) or (alpha and gamma).
    • Conjunctive Normal Form (CNF): A standard form for logical sentences where it is represented as a conjunction (AND) of clauses, where each clause is a disjunction (OR) of literals (propositional symbols or their negations). “A conjunctive normal form sentence is a logical sentence that is a conjunction of clauses…a conjunction of clauses means it is an and of individual clauses, each of which has ors in it.”
    • Resolution: An inference rule that applies to clauses in CNF. If you have P or Q and not P or R, you can resolve them to get Q or R. This involves dealing with factoring (removing duplicate literals) and the empty clause (representing a contradiction). “…if I have two clauses where there’s something that conflicts or something complementary between those two clauses, I can resolve them to get a new clause, to draw a new conclusion.”
    • Inference by Resolution: To prove that a knowledge base entails a query (alpha), we assume not alpha and try to derive a contradiction (the empty clause) using resolution. “We want to prove that our knowledge base entails some query alpha…we’re going to try to prove that if we know the knowledge and not alpha, that that would be a contradiction…To determine if our knowledge base entails some query alpha, we’re going to convert knowledge base and not alpha to conjunctive normal form”

    II. Search Algorithms:

    • Search Problems: Defined by an initial state, actions, a transition model, a goal test, and a path cost function.
    • Local Search: Algorithms that operate on a single current state and move to neighbors. They don’t care about the path to the solution.
    • Hill Climbing: A simple local search algorithm that repeatedly moves to the neighbor with the highest value (or lowest cost). It suffers from problems with local maxima/minima. “Generally, what hill climbing is going to do is it’s going to consider the neighbors of that state…and pick the highest one I can…continually looking at all of my neighbors and picking the highest neighbor…until I get to a point…where I consider both of my neighbors and both of my neighbors have a lower value than I do.”
    • Variations: Steepest ascent, stochastic, first choice, random restart, local beam search.
    • Simulated Annealing: A local search algorithm that sometimes accepts worse moves to escape local optima. The probability of accepting a worse move depends on the “temperature” and the difference in cost (delta E). “whereas before, we never, ever wanted to take a move that made our situation worse, now we sometimes want to make a move that is actually going to make our situation worse…And so how do we do that? How do we decide to sometimes accept some state that might actually be worse? Well, we’re going to accept a worse state with some probability.”
    • Linear Programming: A family of problems where the goal is to minimize a cost function subject to linear constraints. “the goal of linear programming is to minimize a cost function…subject to particular constraints, subjects to equations that are of the form like this of some sequence of variables is less than a bound or is equal to some particular value”

    III. Constraint Satisfaction Problems (CSPs):

    • Definition: Problems defined by variables, domains (possible values for each variable), and constraints.
    • Node Consistency: Ensuring that all values in a variable’s domain satisfy the unary constraints (constraints involving only that variable). “…we can pick any of these values in the domain. And there won’t be a unary constraint that is violated as a result of it.”
    • Arc Consistency: Ensuring that all values in a variable’s domain satisfy the binary constraints (constraints involving two variables). “In order to make some variable x arc consistent with respect to some other variable y, we need to remove any element from x’s domain to make sure that every choice for x, every choice in x’s domain, has a possible choice for y.”
    • AC3: An algorithm for enforcing arc consistency across an entire CSP. It maintains a queue of arcs and revises domains to ensure consistency (see the sketch after this list). “AC3 takes a constraint satisfaction problem. And it enforces arc consistency across the entire problem…It’s going to basically maintain a queue or basically just a line of all of the arcs that it needs to make consistent.”
    • Backtracking Search: A depth-first search algorithm for solving CSPs. It assigns values to variables one at a time, backtracking when a constraint is violated.
    • Minimum Remaining Values (MRV): A heuristic for variable selection that chooses the variable with the fewest remaining legal values in its domain. “Select the variable that has the fewest legal values remaining in its domain…In the example of the classes and the exam slots, you would prefer to choose the class that can only meet on one possible day.”
    • Degree Heuristic: A tie-breaking heuristic for choosing which variable to assign next. “The general approach is that in cases of ties, where two or more of the classes each can only have one possible day of the exam left, we want to choose the one that is involved in the most constraints, the one that we expect to potentially have the bigger impact on the overall problem”
    • Least Constraining Value: A heuristic for value selection that chooses the value that rules out the fewest choices for neighboring variables. “Loop over the values in the domain that we haven’t yet tried and pick the value that rules out the fewest values from the neighboring variables.”
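
    As a concrete illustration of AC3’s revise-and-requeue mechanics, here is a small Python sketch. The helper names and the toy not-equal constraint are our own, not from the sources:

    from collections import deque

    def revise(domains, x, y, constraint):
        # Remove values from x's domain with no consistent value in y's domain
        revised = False
        for vx in set(domains[x]):
            if not any(constraint(vx, vy) for vy in domains[y]):
                domains[x].remove(vx)
                revised = True
        return revised

    def ac3(domains, arcs, constraint):
        queue = deque(arcs)  # all arcs that still need to be made consistent
        while queue:
            x, y = queue.popleft()
            if revise(domains, x, y, constraint):
                if not domains[x]:
                    return False  # a domain emptied out: no solution possible
                # x's domain shrank, so re-check every arc pointing at x
                queue.extend((z, w) for (z, w) in arcs if w == x and z != y)
        return True

    # Toy usage: two variables whose values must differ
    domains = {"x": {1, 2, 3}, "y": {1, 2, 3}}
    arcs = [("x", "y"), ("y", "x")]
    print(ac3(domains, arcs, lambda a, b: a != b), domains)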

    IV. Probabilistic Reasoning:

    • Joint Probability Distribution: A table showing the probabilities of all possible combinations of values for a set of random variables.
    • Inclusion-Exclusion Principle: Used to calculate the probability of A or B: P(A or B) = P(A) + P(B) – P(A and B). Deals with the problem of overcounting when calculating probabilities.
    • Marginalization: A rule used to calculate the probability of a variable by summing over all possible values of other variables. “I need to sum up not just over B and not B, but for all of the possible values that the other random variable could take on…I’m going to sum up over j, where j is going to range over all of the possible values that y can take on. Well, let’s look at the probability that x equals xi and y equals yj.”
    • Conditioning: Similar to marginalization, but uses conditional probabilities instead of joint probabilities.

    V. Supervised Learning:

    • Hypothesis Function: A function that maps inputs to outputs. In supervised learning, the input is a set of labeled data points, each with multiple features and one associated value, or ‘label’; the job of supervised learning is to learn a model that correctly maps a data point’s features to the corresponding output.
    • Weights: Parameters of the hypothesis function that determine the importance of different input features. “We’ll generally call that number a weight for how important should these variables be in trying to determine the answer.”
    • Threshold Function: A function that outputs one category if the weighted sum of inputs is above a threshold and another category otherwise. “If we do all this math, is it greater than or equal to 0? If so, we might categorize that data point as a rainy day. And otherwise, we might say, no rain.”
    • Logistic Regression: Uses a logistic function (sigmoid) instead of a hard threshold, allowing for probabilistic outputs between 0 and 1. “Instead of using this hard threshold type of function, we can use instead a logistic function…And as a result, the possible output values are no longer just 0 and 1…But you can actually get any real numbered value between 0 and 1.”
    • Gradient Descent: An iterative optimization algorithm used to find the optimal weights for a model by repeatedly updating the weights in the direction of the negative gradient of the cost function. “And we can use gradient descent to train a neural network, that gradient descent is going to tell us how to adjust the weights to try and lower that overall cost on all the data points.”
    • Stochastic Gradient Descent: Updates the weights based on a single randomly chosen data point at each iteration.
    • Mini-Batch Gradient Descent: Updates the weights based on a small batch of data points at each iteration.
    • Neural Networks: A network of interconnected nodes (neurons) organized in layers. Each connection has a weight. Neural networks take an input and ‘learn’ to modify the weight of each connection to accurately map an input to an output. A simple neural network consists of an input layer and an output layer, while more complex neural networks consist of several hidden layers between input and output. “we create a network of nodes…and if we want, we can connect all of these nodes together such that every node in the first layer is connected to every node in the second layer…And each of these edges has a weight associated with it.”
    • Activation Function: A function applied to the output of each node in a neural network to introduce non-linearity. “You take the inputs, you multiply them by the weights, and then you typically are going to transform that value a little bit using what’s called an activation function.”
    • Multi-Class Classification: A classification problem with more than two categories. Can be handled using neural networks with multiple output nodes, each representing the probability of belonging to a particular class.

    VI. Natural Language Processing (NLP):

    • Syntax: The structure of language.
    • Semantics: The meaning of language. “While syntax is all about the structure of language, semantics is about the meaning of language. It’s not enough for a computer just to know that a sentence is well-structured if it doesn’t know what that sentence means.”
    • Formal Grammar: A system of rules for generating sentences in a language.
    • Context-Free Grammar (CFG): A type of formal grammar that defines rules for rewriting non-terminal symbols into terminal symbols (words) or other non-terminal symbols. “a context-free grammar is some system of rules for generating sentences in a language…We’re going to give the computer some rules that we know about language and have the computer use those rules to make sense of the structure of language.”
    • NLTK (Natural Language Toolkit): A Python library for NLP tasks (see the short parsing sketch after this list).
    • N-grams: Contiguous sequences of n items (characters or words) from a sample of text.
    • Tokenization: The process of splitting a sequence of characters into pieces, such as words.
    • Markov Chain: A sequence of values where one value can be predicted based on the preceding values. Can be used for language generation. “Recall that a Markov chain is some sequence of values where we can predict one value based on the values that came before it…we can use that to predict what word might come next in a sequence of words.”
    • Text Classification: The problem of assigning a category or label to a piece of text.
    • Sentiment Analysis: A specific text classification task that involves determining the sentiment (positive, negative, neutral) of a piece of text.
    • Bag of Words: A representation of text as a collection of words, disregarding grammar and word order, but keeping track of word frequencies. “With the bag of words representation, I’m just going to keep track of the count of every single word, which I’m going to call features.”
    • TF-IDF (Term Frequency-Inverse Document Frequency): A weighting scheme that assigns higher weights to words that are frequent in a document but rare in the overall corpus.
    • One-Hot Representation: A vector representation of a word where one element is 1 and all other elements are 0. “Each of these words now has a distinct vector representation. And this is what we often call a one-hot representation, a representation of the meaning of a word as a vector with a single 1 and all of the rest of the values are 0.”
    • Distributed Representation: A vector representation of a word where the meaning is distributed across multiple values, ideally in such a way that similar words have similar vector representations.
    • Word Embeddings: Distributed representations of words that capture semantic relationships.
    • Word2Vec: A model for generating word embeddings based on the context in which words appear. “we’re going to define the meaning of a word based on the words that appear around it, the context words around it…we’re going to say is because the words breakfast and lunch and dinner appear in a similar context, that they must have a similar meaning.”
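
    A short NLTK sketch of parsing with a context-free grammar. The toy grammar and sentence are illustrative:

    import nltk

    grammar = nltk.CFG.fromstring("""
        S -> NP VP
        NP -> D N | N
        VP -> V | V NP
        D -> "the" | "a"
        N -> "she" | "city" | "car"
        V -> "saw" | "walked"
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse("she saw the city".split()):
        tree.pretty_print()  # draws the syntax tree as text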

    This briefing document provides a high-level overview of the concepts covered in the sources. It highlights key definitions, algorithms, and techniques used in AI.

    NLP, ML, and Problem Solving: FAQ

    Natural Language Processing, Machine Learning and Problem Solving: FAQ

    1. What is the core concept of “entailment” in the context of knowledge bases and inference algorithms, and how does model checking help determine entailment?

    Entailment refers to whether a knowledge base (KB) logically implies a query (alpha). In other words, can you conclude that alpha is true solely based on the information within the KB? Model checking is an algorithm that answers this by enumerating all possible models (assignments of true/false to propositional symbols). If, in every model where the KB is true, alpha is also true, then the KB entails alpha. Essentially, it exhaustively checks if alpha must be true whenever the KB is true.

    2. Explain the model checking algorithm, including how it enumerates models and determines if a knowledge base entails a query.

    The model checking algorithm involves the following steps:

    1. Enumerate all possible models: List every possible combination of truth values (true or false) for all propositional symbols in the knowledge base and query.
    2. Evaluate the knowledge base in each model: Determine if the knowledge base (KB) is true or false in each of the enumerated models.
    3. Check the query in models where the KB is true: For every model where the KB is true, check if the query (alpha) is also true.
    4. Determine entailment: If alpha is true in every model where the KB is true, then the KB entails alpha. If there exists at least one model where the KB is true but alpha is false, then the KB does not entail alpha.
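
    A toy Python sketch of this enumeration, with a model represented as a dictionary of truth values. The example KB and query are our own:

    from itertools import product

    symbols = ["P", "Q"]

    def kb(model):     # example KB: P and (P implies Q)
        return model["P"] and ((not model["P"]) or model["Q"])

    def query(model):  # example query alpha: Q
        return model["Q"]

    def entails(kb, query, symbols):
        # Enumerate every assignment of True/False to the symbols
        for values in product([True, False], repeat=len(symbols)):
            model = dict(zip(symbols, values))
            # Entailment fails if the KB can be true while alpha is false
            if kb(model) and not query(model):
                return False
        return True

    print(entails(kb, query, symbols))  # True: {P, P -> Q} entails Q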

    3. What are inference rules in propositional logic, and give examples of implication elimination, biconditional elimination, and De Morgan’s laws?

    Inference rules are logical equivalences that allow you to transform logical sentences into different, but logically equivalent, forms. This is useful for drawing new conclusions from existing knowledge. Here are some examples:

    • Implication Elimination: alpha implies beta is equivalent to not alpha or beta. This replaces an implication with an OR statement.
    • Biconditional Elimination: alpha if and only if beta is equivalent to (alpha implies beta) and (beta implies alpha). This breaks down a biconditional into two implications.
    • De Morgan’s Laws: not (alpha and beta) is equivalent to not alpha or not beta. The negation of a conjunction is the disjunction of the negations.
    • not (alpha or beta) is equivalent to not alpha and not beta. The negation of a disjunction is the conjunction of the negations.

    4. Describe the conjunctive normal form (CNF) and explain the steps to convert a logical formula into CNF.

    Conjunctive Normal Form (CNF) is a standard logical format where a sentence is represented as a conjunction (AND) of clauses, and each clause is a disjunction (OR) of literals. A literal is either a propositional symbol or its negation. The steps to convert a formula to CNF are:

    1. Eliminate Biconditionals: Replace all alpha <-> beta with (alpha -> beta) ^ (beta -> alpha).
    2. Eliminate Implications: Replace all alpha -> beta with ~alpha v beta.
    3. Move Negations Inwards: Use De Morgan’s laws to move negations inward, so they apply only to literals (e.g., ~ (alpha ^ beta) becomes ~alpha v ~beta).
    4. Distribute ORs over ANDs: Use the distributive law to transform the expression into a conjunction of clauses (e.g., alpha v (beta ^ gamma) becomes (alpha v beta) ^ (alpha v gamma)).
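
    For example, converting (P v Q) -> R: there are no biconditionals; eliminating the implication gives ~(P v Q) v R; moving the negation inwards with De Morgan’s laws gives (~P ^ ~Q) v R; and distributing the OR over the AND yields (~P v R) ^ (~Q v R), which is in CNF.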

    5. Explain the resolution inference rule and the resolution algorithm for proving entailment. What is “inference by resolution,” and how does the empty clause relate to contradiction?

    The resolution inference rule states that if you have two clauses, alpha OR beta and ~alpha OR gamma, you can infer beta OR gamma. It essentially eliminates a complementary pair of literals (alpha and ~alpha) and combines the remaining literals into a new clause. “Inference by resolution” uses this rule repeatedly to derive new clauses.

    The resolution algorithm for proving entailment involves:

    1. Negate the query: To prove KB entails alpha, assume ~alpha.
    2. Convert to CNF: Convert KB AND ~alpha into CNF.
    3. Resolution Loop: Repeatedly apply the resolution rule to pairs of clauses in the CNF formula, adding any new clauses back into the set. If factoring is needed, remove duplicate literals from the resulting clause.
    4. Check for Empty Clause: If, at any point, you derive the “empty clause” (a clause with no literals, representing “false”), this means you’ve found a contradiction.
    5. Determine Entailment: If you derive the empty clause, then KB entails alpha (because KB AND ~alpha leads to a contradiction, so it must be the case that if KB is true, then alpha must be true). If you can no longer derive new clauses and haven’t found the empty clause, then KB does not entail alpha.

    The empty clause signifies a contradiction because it represents a situation where both P and NOT P are true, which is impossible. Finding the empty clause through resolution proves that the initial assumption (the negated query) was inconsistent with the knowledge base.
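
    As a small worked example: suppose the KB is {P v Q, ~P} and the query is Q. Negating the query adds ~Q. Resolving P v Q with ~P yields Q; resolving Q with ~Q yields the empty clause, a contradiction, so the KB entails Q.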

    6. Explain the inclusion-exclusion principle and the marginalization rule in probability theory, providing examples of their application.

    • Inclusion-Exclusion Principle: This principle calculates the probability of A OR B. The formula is: P(A or B) = P(A) + P(B) – P(A and B). It is used to correct for overcounting when calculating P(A or B).
    • Example: The probability of rolling a 6 on a red die (A) OR a 6 on a blue die (B). If you just add P(A) + P(B), you’re double-counting the case where both dice show 6. Subtracting P(A and B) (the probability of both dice showing 6) corrects for this.
    • Marginalization Rule: This rule allows you to calculate the probability of one variable (A) by summing over all possible values of another variable (B). The formula is: P(A) = Σ P(A and B).
    • Example: Probability of it being cloudy (A), given the joint distribution of cloudiness and raininess (B). We calculate P(cloudy) by summing P(cloudy and rainy) + P(cloudy and not rainy): we consider every possible case for the other variable and add up the probability of A occurring in each case. This is useful for finding an individual (unconditional) probability from a joint probability distribution.
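
    Working the dice example through with numbers: P(A) = 1/6, P(B) = 1/6, and P(A and B) = 1/36, so P(A or B) = 1/6 + 1/6 – 1/36 = 11/36.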

    7. Describe the hill climbing algorithm, including its pseudocode, and discuss its limitations (local optima). Also explain variations like stochastic hill climbing and random restart hill climbing.

    The hill climbing algorithm is a local search technique used to find a maximum (or minimum) of a function. Its pseudocode is as follows:

    1. Start with a current state (often random).
    2. Loop: a. Find the neighbor of the current state with the highest (or lowest) value. b. If the neighbor is better than the current state, move to the neighbor (current = neighbor). c. If the neighbor is not better, terminate and return the current state.
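
    A minimal Python sketch of this pseudocode. The generic helper and the toy objective below are illustrative choices:

    import random

    def hill_climb(initial, neighbors, value):
        current = initial
        while True:
            # a. find the best-valued neighbor of the current state
            best = max(neighbors(current), key=value, default=None)
            # c. if no neighbor is better, terminate and return the current state
            if best is None or value(best) <= value(current):
                return current
            # b. otherwise move to that neighbor
            current = best

    # Toy usage: maximize f(x) = -(x - 3)^2 over integer states
    f = lambda x: -(x - 3) ** 2
    print(hill_climb(random.randint(-10, 10), lambda x: [x - 1, x + 1], f))  # 3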

    A major limitation of hill climbing is that it can get stuck in local optima: points that are better than their immediate neighbors but not the best overall solution.

    Variations:

    • Stochastic Hill Climbing: Randomly choose a neighbor with a better value, rather than always picking the best neighbor. This can help on plateaus (areas of the search space with relatively equal value), but will not always escape a local optimum.
    • Random Restart Hill Climbing: Run the hill climbing algorithm multiple times from different random starting states. Keep track of the best solution found across all runs. This increases the chance of finding the global optimum by exploring different regions of the search space.

    8. Explain the simulated annealing algorithm and how it can potentially escape local optima compared to simple hill climbing.

    Simulated Annealing is a metaheuristic optimization algorithm that can be used to find the global minimum of a function that may possess several local minima. Simulated annealing works by first randomly picking a state. The algorithm computes the cost of that state, then picks a neighboring state and computes its cost as well. If the neighbor’s cost is better, the neighbor becomes the new current state. However, simulated annealing adds a twist: even if the neighbor’s cost is not better than the current state’s, there is still some probability of moving to the worse neighbor, to try to dislodge the search from a local optimum.

    This probability is based on a temperature. At the beginning, the temperature is high so there is a better probability to dislodge yourself and explore the search space even if it may lead to worse results at first. As the algorithm iterates, the temperature starts to go down, so it slowly starts to look for better neighbors instead of just exploring and dislodging.

    Simulated annealing is thus better than simple hill climbing, which never moves to a state that may lead to worse results and therefore gets stuck in the local optima described above; simulated annealing’s occasional worse moves let it escape them.
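
    A minimal Python sketch of simulated annealing for minimization. The temperature schedule and toy problem are illustrative choices:

    import math
    import random

    def simulated_annealing(initial, neighbor, cost, t0=10.0, steps=10000):
        current = initial
        for step in range(1, steps + 1):
            t = t0 / step                              # temperature cools over time
            candidate = neighbor(current)
            delta_e = cost(current) - cost(candidate)  # positive if candidate is better
            # Always accept improvements; accept worse moves with probability e^(dE/T)
            if delta_e > 0 or random.random() < math.exp(delta_e / t):
                current = candidate
        return current

    # Toy usage: minimize f(x) = (x - 3)^2 with random-step neighbors
    f = lambda x: (x - 3) ** 2
    print(simulated_annealing(0.0, lambda x: x + random.uniform(-1, 1), f))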

    Supervised Learning: Classification, Regression, and Evaluation

    Supervised learning is a type of machine learning where a computer is given access to a dataset of input-output pairs and learns a function that maps inputs to outputs. The computer uses the data to train its model and understand the relationships between inputs and outputs. The goal is for the AI to learn to predict outputs based on new input data.

    Key aspects of supervised learning:

    • Input-output pairs: The computer is provided with a dataset where each data point consists of an input and a corresponding desired output.
    • Function mapping: The goal is to find a function that accurately maps inputs to outputs, allowing the computer to make predictions on new, unseen data.
    • Training: The computer uses the provided data to train its model, adjusting its internal parameters to minimize the difference between its predictions and the actual outputs.

    Classification and regression are two common tasks within supervised learning.

    • Classification: Aims to map inputs into discrete categories. An example would be classifying a banknote as authentic or counterfeit based on its features.
    • Regression: Aims to predict continuous output values. For example, predicting sales based on advertising spending.

    Implementation and evaluation

    • Libraries such as Scikit-learn in Python provide tools to implement supervised learning algorithms.
    • The data is typically split into training and testing sets. The model is trained on the training set and evaluated on the testing set to assess its ability to generalize to new data.
    • Holdout cross-validation splits the data once: the training set is used to fit the machine learning model, and the testing set measures how well it performs.
    • K-fold cross-validation divides data into k different sets and runs k different experiments.
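
    A small Scikit-learn sketch of both evaluation strategies. The dataset and classifier are illustrative:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import KFold, cross_val_score, train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    model = KNeighborsClassifier(n_neighbors=1)

    # Holdout: a single train/test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    model.fit(X_train, y_train)
    print("holdout accuracy:", model.score(X_test, y_test))

    # K-fold: k experiments, each fold serving once as the test set
    scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True))
    print("5-fold accuracies:", scores)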

    Machine Learning: Algorithms, Techniques, and Applications

    Machine learning involves enabling computers to learn from data and experiences without explicit instructions. Instead of programming a computer with explicit rules, machine learning allows the computer to learn patterns from data and improve its performance on a specific task.

    Key aspects of machine learning:

    • Learning from Data: Machine learning algorithms use data to identify patterns, make predictions, and improve decision-making.
    • Algorithms and Techniques: Machine learning encompasses a wide range of algorithms and techniques that enable computers to learn from data.
    • Pattern Recognition: Machine learning algorithms identify underlying patterns and relationships within data.

    Machine learning comes in different forms, including supervised learning, reinforcement learning and unsupervised learning.

    • Supervised learning involves training a model on a labeled dataset consisting of input-output pairs, enabling the model to learn a function that maps inputs to outputs.
    • Reinforcement learning involves training an agent to make decisions in an environment to maximize a reward signal.
    • Unsupervised learning involves discovering patterns and relationships in unlabeled data without explicit guidance. Clustering is a task performed in unsupervised learning that involves organizing a set of objects into distinct clusters or groups of similar objects.

    Neural networks are a popular tool in machine learning inspired by the structure of the human brain and can be very effective at certain tasks. A neural network is a mathematical model for learning inspired by biological neural networks. Artificial neural networks can model mathematical functions and learn network parameters.

    TensorFlow is a library that can be used for creating neural networks, modeling them, and running them on sample data.

    Machine learning has a wide variety of applications including: recognizing faces in photos, playing games, understanding human language, spam detection, search and optimization problems, and more.

    Neural Networks: Models, Training, and Applications

    Neural networks are a popular tool in modern machine learning that draw inspiration from the way human brains learn and reason. They are a type of model that is effective at learning from some set of input data to figure out how to calculate some function from inputs to outputs.

    Key aspects of neural networks:

    • Mathematical Model: A neural network is a mathematical model for learning inspired by biological neural networks.
    • Units: Instead of biological neurons, neural networks use units inside of the network. The units can be represented like nodes in a graph.
    • Layers: Neural networks are composed of multiple layers of interconnected nodes or units, including an input layer, one or more hidden layers, and an output layer (see the sketch after this list).
    • Weights: Connections between units are defined by weights. The weights determine how signals are passed between connected nodes.
    • Activation Functions: Activation functions introduce non-linearity into the network, allowing it to learn complex patterns and relationships in the data.
    • Backpropagation: Backpropagation is a key algorithm that makes training multi-layered neural networks possible. The backpropagation algorithm is used to adjust the weights in the network during training to minimize the difference between predicted and actual outputs.
    • Versatility: Neural networks are versatile tools applicable to a number of domains.
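
    A minimal PyTorch sketch of how these pieces fit together; the layer sizes are illustrative assumptions:

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(in_features=4, out_features=8),  # weights connecting 4 input units to 8 hidden units
        nn.ReLU(),                                 # activation function introduces non-linearity
        nn.Linear(in_features=8, out_features=1),  # weights connecting the hidden units to 1 output unit
    )

    x = torch.randn(1, 4)  # one sample with 4 input features
    print(model(x))        # forward pass through input, hidden, and output layers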

    There are different types of neural networks, each designed for specific tasks:

    • Feed-forward neural networks have connections that only move in one direction. The inputs pass through hidden layers and ultimately produce an output.
    • Convolutional neural networks (CNNs) are designed for processing grid-like data, such as images. CNNs apply convolutional layers and pooling layers to extract features from images (see the sketch after this list).
    • Recurrent neural networks (RNNs) are designed for processing sequential data, such as text or time series. RNNs have connections that loop back into themselves, allowing them to maintain a hidden state that captures information about the sequence. The long short-term memory (LSTM) network is a popular type of RNN.
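
    A minimal PyTorch CNN sketch; the 28x28 image size, channel counts, and 10 output classes are illustrative assumptions:

    import torch
    import torch.nn as nn

    cnn = nn.Sequential(
        nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1),  # convolutional layer extracts features
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2),  # pooling layer downsamples 28x28 -> 14x14
        nn.Flatten(),
        nn.Linear(8 * 14 * 14, 10),  # classify the flattened feature maps
    )

    images = torch.randn(4, 1, 28, 28)  # a batch of 4 fake grayscale images
    print(cnn(images).shape)            # torch.Size([4, 10])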

    Training Neural Networks:

    • Gradient descent is a technique used to train neural networks by minimizing a loss function. Gradient descent involves iteratively adjusting the weights of the network based on the gradient of the loss function with respect to the weights.
    • Stochastic gradient descent randomly chooses one data point at a time to calculate the gradient based on, instead of calculating it based on all of the data points.
    • Mini-batch gradient descent divides the data set into small batches, groups of data points, and calculates the gradient based on those batches (see the training-loop sketch after this list).
    • Overfitting occurs when a neural network is too complex and fits the training data too closely, resulting in poor generalization to new data.
    • Dropout is a technique used to combat overfitting by randomly removing units from the neural network during training.
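
    A small PyTorch sketch of mini-batch gradient descent; the random data, layer sizes, learning rate, and batch size are all illustrative assumptions:

    import torch
    import torch.nn as nn

    X = torch.randn(100, 4)  # 100 samples, 4 features
    y = torch.randn(100, 1)  # continuous targets (a regression setup)
    model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    batch_size = 10
    for epoch in range(5):
        for start in range(0, len(X), batch_size):
            xb = X[start:start + batch_size]  # one small batch of data points
            yb = y[start:start + batch_size]
            optimizer.zero_grad()             # clear gradients from the previous step
            loss = loss_fn(model(xb), yb)     # forward pass and loss on this batch only
            loss.backward()                   # gradients via backpropagation
            optimizer.step()                  # adjust weights against the gradient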

    Understanding Gradient Descent in Neural Networks

    Gradient descent is a calculus-based algorithm for minimizing loss when training a neural network. In the context of neural networks, “loss” refers to how poorly a hypothesis function models the data.

    Key aspects of gradient descent:

    • Loss Function: Gradient descent aims to minimize a loss function, which quantifies how poorly the neural network performs.
    • Gradient Calculation: The algorithm calculates the gradient of the loss function with respect to the network’s weights. The gradient indicates the direction in which the weights should be adjusted to reduce the loss.
    • Weight Update: The weights are updated by taking a small step in the direction opposite to the gradient. The size of this step can vary and is chosen when training the neural network.
    • Iterative Process: This process is repeated iteratively, adjusting the weights little by little based on the data points, with the aim of converging towards a good solution (a small worked example follows this list).
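
    A tiny worked example of these steps on the loss function loss(w) = (w - 3)^2, whose gradient is 2(w - 3); the learning rate of 0.1 is an arbitrary choice of step size:

    w = 0.0
    learning_rate = 0.1
    for step in range(3):
        gradient = 2 * (w - 3)            # gradient of the loss at the current weight
        w = w - learning_rate * gradient  # small step opposite to the gradient
        print(step, round(w, 4))          # w moves toward the minimum at w = 3: 0.6, 1.08, 1.464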

    There are variations to the standard gradient descent algorithm:

    • Stochastic Gradient Descent: Instead of looking at all data points at once, stochastic gradient descent randomly chooses one data point at a time to calculate the gradient. This provides a less accurate gradient estimate but is faster to compute.
    • Mini-Batch Gradient Descent: This approach is a middle ground between standard and stochastic gradient descent, where the data set is divided into small batches and the gradient is calculated based on these batches.

    Understanding Neural Network Hidden Layers

    Hidden layers are intermediate layers of artificial neurons or units within a neural network between the input layer and the output layer.

    Here’s more about hidden layers and how they contribute to neural network functionality:

    • Structure and Function: In a neural network, the input layer receives the initial data, and the output layer produces the final result. The hidden layers lie in between, performing complex transformations on the input data to help the network learn non-linear relationships.
    • Nodes and Connections: Each hidden layer contains a certain number of nodes or units, each connected to the nodes in the preceding and following layers. The connections between nodes have weights, which are adjusted during training to optimize the network’s performance.
    • Activation: Each unit calculates its output by applying an activation function to a weighted linear combination of its inputs. Layering units in this way gives the network the ability to model more complex functions (see the sketch after this list).
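
    A small sketch of what one hidden layer computes, using made-up sizes and random weights for illustration:

    import torch

    x = torch.tensor([1.0, 2.0, 3.0])  # 3 input values
    W1 = torch.randn(4, 3)             # 4 hidden units, each connected to all 3 inputs
    b1 = torch.randn(4)
    hidden = torch.relu(W1 @ x + b1)   # weighted combination of inputs, then an activation
    W2 = torch.randn(1, 4)             # output unit connected to all 4 hidden units
    b2 = torch.randn(1)
    print(W2 @ hidden + b2)            # the network's output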

    Backpropagation: One of the challenges of neural networks is training networks that have hidden layers. The training data provides values for all of the inputs and for what the output should be, but it does not provide what the values of the nodes in the hidden layers should be. The key algorithm that makes training the hidden layers of neural networks possible is called backpropagation.

    Deep Neural Networks: Neural networks that contain multiple hidden layers are called deep neural networks. The presence of multiple hidden layers allows the network to model more complex functions. Each layer can learn different features of the input, and these features can be combined to produce the desired output. However, complex networks are at greater risk of overfitting.

    Dropout: Dropout is a technique that can combat overfitting in neural networks. It involves temporarily removing units from the network during training to prevent over-reliance on any single node.
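
    A small sketch of dropout’s behavior in PyTorch; the drop probability p = 0.5 is an illustrative choice:

    import torch
    import torch.nn as nn

    dropout = nn.Dropout(p=0.5)
    x = torch.ones(8)

    dropout.train()     # training mode: units are randomly zeroed
    print(dropout(x))   # roughly half zeros; survivors scaled by 1 / (1 - p) = 2
    dropout.eval()      # evaluation mode: dropout is disabled
    print(dropout(x))   # all ones again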

    Harvard CS50’s Artificial Intelligence with Python – Full University Course

    The Original Text

    This course from Harvard University explores the concepts and algorithms at the foundation of modern artificial intelligence, diving into the ideas that give rise to technologies like game-playing engines, handwriting recognition, and machine translation. You’ll gain exposure to the theory behind graph search algorithms, classification, optimization, reinforcement learning, and other topics in artificial intelligence and machine learning. Brian Yu teaches this course. Hello, world. This is CS50, and this is an introduction to artificial intelligence with Python with CS50’s own Brian Yu. This course picks up where CS50 itself leaves off and explores the concepts and algorithms at the foundation of modern AI. We’ll start with a look at how AI can search for solutions to problems, whether those problems are learning how to play a game or trying to find driving directions to a destination. We’ll then look at how AI can represent information, both knowledge that our AI is certain about, but also information and events about which our AI might be uncertain, learning how to represent that information, but more importantly, how to use that information to draw inferences and new conclusions as well. We’ll explore how AI can solve various types of optimization problems, trying to maximize profits or minimize costs or satisfy some other constraints before turning our attention to the fast-growing field of machine learning, where we won’t tell our AI exactly how to solve a problem, but instead, give our AI access to data and experiences so that our AI can learn on its own how to perform these tasks. In particular, we’ll look at neural networks, one of the most popular tools in modern machine learning, inspired by the way that human brains learn and reason as well before finally taking a look at the world of natural language processing so that it’s not just us humans learning to learn how artificial intelligence is able to speak, but also AI learning how to understand and interpret human language as well. We’ll explore these ideas and algorithms, and along the way, give you the opportunity to build your own AI programs to implement all of this and more. This is CS50. All right. Welcome, everyone, to an introduction to artificial intelligence with Python. My name is Brian Yu, and in this class, we’ll explore some of the ideas and techniques and algorithms that are at the foundation of artificial intelligence. Now, artificial intelligence covers a wide variety of types of techniques. Anytime you see a computer do something that appears to be intelligent or rational in some way, like recognizing someone’s face in a photo, or being able to play a game better than people can, or being able to understand human language when we talk to our phones and they understand what we mean and are able to respond back to us, these are all examples of AI, or artificial intelligence. And in this class, we’ll explore some of the ideas that make that AI possible. So we’ll begin our conversations with search, the problem of we have an AI, and we would like the AI to be able to search for solutions to some kind of problem, no matter what that problem might be. Whether it’s trying to get driving directions from point A to point B, or trying to figure out how to play a game, given a tic-tac-toe game, for example, figuring out what move it ought to make. After that, we’ll take a look at knowledge. 
Ideally, we want our AI to be able to know information, to be able to represent that information, and more importantly, to be able to draw inferences from that information, to be able to use the information it knows and draw additional conclusions. So we’ll talk about how AI can be programmed in order to do just that. Then we’ll explore the topic of uncertainty, talking about ideas of what happens if a computer isn’t sure about a fact, but maybe is only sure with a certain probability. So we’ll talk about some of the ideas behind probability, and how computers can begin to deal with uncertain events in order to be a little bit more intelligent in that sense as well. After that, we’ll turn our attention to optimization, problems of when the computer is trying to optimize for some sort of goal, especially in a situation where there might be multiple ways that a computer might solve a problem, but we’re looking for a better way, or potentially the best way, if that’s at all possible. Then we’ll take a look at machine learning, or learning more generally, and looking at how, when we have access to data, our computers can be programmed to be quite intelligent by learning from data and learning from experience, being able to perform a task better and better based on greater access to data. So your email, for example, where your email inbox somehow knows which of your emails are good emails and which of your emails are spam. These are all examples of computers being able to learn from past experiences and past data. We’ll take a look, too, at how computers are able to draw inspiration from human intelligence, looking at the structure of the human brain, and how neural networks can be a computer analog to that sort of idea, and how, by taking advantage of a certain type of structure of a computer program, we can write neural networks that are able to perform tasks very, very effectively. And then finally, we’ll turn our attention to language, not programming languages, but human languages that we speak every day. And taking a look at the challenges that come about as a computer tries to understand natural language, and how it is some of the natural language processing that occurs in modern artificial intelligence can actually work. But today, we’ll begin our conversation with search, this problem of trying to figure out what to do when we have some sort of situation that the computer is in, some sort of environment that an agent is in, so to speak, and we would like for that agent to be able to somehow look for a solution to that problem. Now, these problems can come in any number of different types of formats. One example, for instance, might be something like this classic 15 puzzle with the sliding tiles that you might have seen. Where you’re trying to slide the tiles in order to make sure that all the numbers line up in order. This is an example of what you might call a search problem. The 15 puzzle begins in an initially mixed up state, and we need some way of finding moves to make in order to return the puzzle to its solved state. But there are similar problems that you can frame in other ways. Trying to find your way through a maze, for example, is another example of a search problem. You begin in one place, you have some goal of where you’re trying to get to, and you need to figure out the correct sequence of actions that will take you from that initial state to the goal. 
And while this is a little bit abstract, any time we talk about maze solving in this class, you can translate it to something a little more real world. Something like driving directions. If you ever wonder how Google Maps is able to figure out what is the best way for you to get from point A to point B, and what turns to make at what time, depending on traffic, for example, it’s often some sort of search algorithm. You have an AI that is trying to get from an initial position to some sort of goal by taking some sequence of actions. So we’ll start our conversations today by thinking about these types of search problems and what goes in to solving a search problem like this in order for an AI to be able to find a good solution. In order to do so, though, we’re going to need to introduce a little bit of terminology, some of which I’ve already used. But the first term we’ll need to think about is an agent. An agent is just some entity that perceives its environment. It somehow is able to perceive the things around it and act on that environment in some way. So in the case of the driving directions, your agent might be some representation of a car that is trying to figure out what actions to take in order to arrive at a destination. In the case of the 15 puzzle with the sliding tiles, the agent might be the AI or the person that is trying to solve that puzzle to try and figure out what tiles to move in order to get to that solution. Next, we introduce the idea of a state. A state is just some configuration of the agent in its environment. So in the 15 puzzle, for example, any state might be any one of these three, for example. A state is just some configuration of the tiles. And each of these states is different and is going to require a slightly different solution. A different sequence of actions will be needed in each one of these in order to get from this initial state to the goal, which is where we’re trying to get. So the initial state, then, what is that? The initial state is just the state where the agent begins. It is one such state where we’re going to start from. And this is going to be the starting point for our search algorithm, so to speak. We’re going to begin with this initial state and then start to reason about it, to think about what actions might we apply to that initial state in order to figure out how to get from the beginning to the end, from the initial position to whatever our goal happens to be. And how do we make our way from that initial position to the goal? Well, ultimately, it’s via taking actions. Actions are just choices that we can make in any given state. And in AI, we’re always going to try to formalize these ideas a little bit more precisely, such that we could program them a little bit more mathematically, so to speak. So this will be a recurring theme. And we can more precisely define actions as a function. We’re going to effectively define a function called actions that takes an input, s, where s is going to be some state that exists inside of our environment. And actions of s is going to take the state as input and return as output the set of all actions that can be executed in that state. And so it’s possible that some actions are only valid in certain states and not in other states. And we’ll see examples of that soon, too. So in the case of the 15 puzzle, for example, there are generally going to be four possible actions that we can do most of the time. 
We can slide a tile to the right, slide a tile to the left, slide a tile up, or slide a tile down, for example. And those are going to be the actions that are available to us. So somehow our AI, our program, needs some encoding of the state, which is often going to be in some numerical format, and some encoding of these actions. But it also needs some encoding of the relationship between these things. How do the states and actions relate to one another? And in order to do that, we’ll introduce to our AI a transition model, which will be a description of what state we get after we perform some available action in some other state. And again, we can be a little bit more precise about this, define this transition model a little bit more formally, again, as a function. The function is going to be a function called result that this time takes two inputs. Input number one is s, some state. And input number two is a, some action. And the output of this function result is it is going to give us the state that we get after we perform action a in state s. So let’s take a look at an example to see more precisely what this actually means. Here is an example of a state, of the 15 puzzle, for example. And here is an example of an action, sliding a tile to the right. What happens if we pass these as inputs to the result function? Again, the result function takes this board, this state, as its first input. And it takes an action as a second input. And of course, here, I’m describing things visually so that you can see visually what the state is and what the action is. In a computer, you might represent one of these actions as just some number that represents the action. Or if you’re familiar with enums that allow you to enumerate multiple possibilities, it might be something like that. And this state might just be represented as an array or two-dimensional array of all of these numbers that exist. But here, we’re going to show it visually just so you can see it. But when we take this state and this action, pass it into the result function, the output is a new state. The state we get after we take a tile and slide it to the right, and this is the state we get as a result. If we had a different action and a different state, for example, and pass that into the result function, we’d get a different answer altogether. So the result function needs to take care of figuring out how to take a state and take an action and get what results. And this is going to be our transition model that describes how it is that states and actions are related to each other. If we take this transition model and think about it more generally and across the entire problem, we can form what we might call a state space. The set of all of the states we can get from the initial state via any sequence of actions, by taking 0 or 1 or 2 or more actions in addition to that, so we could draw a diagram that looks something like this, where every state is represented here by a game board, and there are arrows that connect every state to every other state we can get to from that state. And the state space is much larger than what you see just here. This is just a sample of what the state space might actually look like. And in general, across many search problems, whether they’re this particular 15 puzzle or driving directions or something else, the state space is going to look something like this. We have individual states and arrows that are connecting them. 
And oftentimes, just for simplicity, we’ll simplify our representation of this entire thing as a graph, some sequence of nodes and edges that connect nodes. But you can think of this more abstract representation as the exact same idea. Each of these little circles or nodes is going to represent one of the states inside of our problem. And the arrows here represent the actions that we can take in any particular state, taking us from one particular state to another state, for example. All right. So now we have this idea of nodes that are representing these states, actions that can take us from one state to another, and a transition model that defines what happens after we take a particular action. So the next step we need to figure out is how we know when the AI is done solving the problem. The AI needs some way to know when it gets to the goal that it’s found the goal. So the next thing we’ll need to encode into our artificial intelligence is a goal test, some way to determine whether a given state is a goal state. In the case of something like driving directions, it might be pretty easy. If you’re in a state that corresponds to whatever the user typed in as their intended destination, well, then you know you’re in a goal state. In the 15 puzzle, it might be checking the numbers to make sure they’re all in ascending order. But the AI needs some way to encode whether or not any state they happen to be in is a goal. And some problems might have one goal, like a maze where you have one initial position and one ending position, and that’s the goal. In other more complex problems, you might imagine that there are multiple possible goals. That there are multiple ways to solve a problem, and we might not care which one the computer finds, as long as it does find a particular goal. However, sometimes the computer doesn’t just care about finding a goal, but finding a goal well, or one with a low cost. And it’s for that reason that the last piece of terminology that we’ll use to define these search problems is something called a path cost. You might imagine that in the case of driving directions, it would be pretty annoying if I said I wanted directions from point A to point B, and the route that Google Maps gave me was a long route with lots of detours that were unnecessary that took longer than it should have for me to get to that destination. And it’s for that reason that when we’re formulating search problems, we’ll often give every path some sort of numerical cost, some number telling us how expensive it is to take this particular option, and then tell our AI that instead of just finding a solution, some way of getting from the initial state to the goal, we’d really like to find one that minimizes this path cost. That is, less expensive, or takes less time, or minimizes some other numerical value. We can represent this graphically if we take a look at this graph again, and imagine that each of these arrows, each of these actions that we can take from one state to another state, has some sort of number associated with it. That number being the path cost of this particular action, where some of the costs for any particular action might be more expensive than the cost for some other action, for example. Although this will only happen in some sorts of problems. In other problems, we can simplify the diagram and just assume that the cost of any particular action is the same. 
And this is probably the case in something like the 15 puzzle, for example, where it doesn’t really make a difference whether I’m moving right or moving left. The only thing that matters is the total number of steps that I have to take to get from point A to point B. And each of those steps is of equal cost. We can just assume it’s of some constant cost like one. And so this now forms the basis for what we might consider to be a search problem. A search problem has some sort of initial state, some place where we begin, some sort of action that we can take or multiple actions that we can take in any given state. And it has a transition model. Some way of defining what happens when we go from one state and take one action, what state do we end up with as a result. In addition to that, we need some goal test to know whether or not we’ve reached a goal. And then we need a path cost function that tells us for any particular path, by following some sequence of actions, how expensive is that path. What is its cost in terms of money or time or some other resource that we are trying to minimize our usage of. And the goal ultimately is to find a solution. Where a solution in this case is just some sequence of actions that will take us from the initial state to the goal state. And ideally, we’d like to find not just any solution but the optimal solution, which is a solution that has the lowest path cost among all of the possible solutions. And in some cases, there might be multiple optimal solutions. But an optimal solution just means that there is no way that we could have done better in terms of finding that solution. So now we’ve defined the problem. And now we need to begin to figure out how it is that we’re going to solve this kind of search problem. And in order to do so, you’ll probably imagine that our computer is going to need to represent a whole bunch of data about this particular problem. We need to represent data about where we are in the problem. And we might need to be considering multiple different options at once. And oftentimes, when we’re trying to package a whole bunch of data related to a state together, we’ll do so using a data structure that we’re going to call a node. A node is a data structure that is just going to keep track of a variety of different values. And specifically, in the case of a search problem, it’s going to keep track of these four values in particular. Every node is going to keep track of a state, the state we’re currently on. And every node is also going to keep track of a parent. A parent being the state before us or the node that we used in order to get to this current state. And this is going to be relevant because eventually, once we reach the goal node, once we get to the end, we want to know what sequence of actions we used in order to get to that goal. And the way we’ll know that is by looking at these parents to keep track of what led us to the goal and what led us to that state and what led us to the state before that, so on and so forth, backtracking our way to the beginning so that we know the entire sequence of actions we needed in order to get from the beginning to the end. The node is also going to keep track of what action we took in order to get from the parent to the current state. And the node is also going to keep track of a path cost. In other words, it’s going to keep track of the number that represents how long it took to get from the initial state to the state that we currently happen to be at.
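
Here is one way to write that node structure down in Python; this is a sketch, not the course’s own code, and the field names are just a reasonable choice:

from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Node:
    state: Any                # the configuration this node represents
    parent: Optional["Node"]  # the node we expanded to reach this one (None for the initial state)
    action: Any               # the action taken to get from the parent to this state
    path_cost: float = 0.0    # cost accumulated from the initial state to this state
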
And we’ll see why this is relevant as we start to talk about some of the optimizations that we can make in terms of these search problems more generally. So this is the data structure that we’re going to use in order to solve the problem. And now let’s talk about the approach. How might we actually begin to solve the problem? Well, as you might imagine, what we’re going to do is we’re going to start at one particular state, and we’re just going to explore from there. The intuition is that from a given state, we have multiple options that we could take, and we’re going to explore those options. And once we explore those options, we’ll find that more options than that are going to make themselves available. And we’re going to consider all of the available options to be stored inside of a single data structure that we’ll call the frontier. The frontier is going to represent all of the things that we could explore next that we haven’t yet explored or visited. So in our approach, we’re going to begin the search algorithm by starting with a frontier that just contains one state. The frontier is going to contain the initial state, because at the beginning, that’s the only state we know about. That is the only state that exists. And then our search algorithm is effectively going to follow a loop. We’re going to repeat some process again and again and again. The first thing we’re going to do is if the frontier is empty, then there’s no solution. And we can report that there is no way to get to the goal. And that’s certainly possible. There are certain types of problems that an AI might try to explore and realize that there is no way to solve that problem. And that’s useful information for humans to know as well. So if ever the frontier is empty, that means there’s nothing left to explore. And we haven’t yet found a solution, so there is no solution. There’s nothing left to explore. Otherwise, what we’ll do is we’ll remove a node from the frontier. So right now at the beginning, the frontier just contains one node representing the initial state. But over time, the frontier might grow. It might contain multiple states. And so here, we’re just going to remove a single node from that frontier. If that node happens to be a goal, then we found a solution. So we remove a node from the frontier and ask ourselves, is this the goal? And we do that by applying the goal test that we talked about earlier, asking if we’re at the destination. Or asking if all the numbers of the 15 puzzle happen to be in order. So if the node contains the goal, we found a solution. Great. We’re done. And otherwise, what we’ll need to do is we’ll need to expand the node. And this is a term of art in artificial intelligence. To expand the node just means to look at all of the neighbors of that node. In other words, consider all of the possible actions that I could take from the state that this node is representing and what nodes could I get to from there. We’re going to take all of those nodes, the next nodes that I can get to from this current one I’m looking at, and add those to the frontier. And then we’ll repeat this process. So at a very high level, the idea is we start with a frontier that contains the initial state. 
And we’re constantly removing a node from the frontier, looking at where we can get to next and adding those nodes to the frontier, repeating this process over and over until either we remove a node from the frontier and it contains a goal, meaning we’ve solved the problem, or we run into a situation where the frontier is empty, at which point we’re left with no solution. So let’s actually try and take the pseudocode, put it into practice by taking a look at an example of a sample search problem. So right here, I have a sample graph. A is connected to B via this action. B is connected to nodes C and D. C is connected to E. D is connected to F. And what I’d like to do is have my AI find a path from A to E. We want to get from this initial state to this goal state. So how are we going to do that? Well, we’re going to start with a frontier that contains the initial state. This is going to represent our frontier. So our frontier initially will just contain A, that initial state where we’re going to begin. And now we’ll repeat this process. If the frontier is empty, no solution. That’s not a problem, because the frontier is not empty. So we’ll remove a node from the frontier as the one to consider next. There’s only one node in the frontier. So we’ll go ahead and remove it from the frontier. But now A, this initial node, this is the node we’re currently considering. We follow the next step. We ask ourselves, is this node the goal? No, it’s not. A is not the goal. E is the goal. So we don’t return the solution. So instead, we go to this last step, expand the node, and add the resulting nodes to the frontier. What does that mean? Well, it means take this state A and consider where we could get to next. And after A, what we could get to next is only B. So that’s what we get when we expand A. We find B. And we add B to the frontier. And now B is in the frontier. And we repeat the process again. We say, all right, the frontier is not empty. So let’s remove B from the frontier. B is now the node that we’re considering. We ask ourselves, is B the goal? No, it’s not. So we go ahead and expand B and add its resulting nodes to the frontier. What happens when we expand B? In other words, what nodes can we get to from B? Well, we can get to C and D. So we’ll go ahead and add C and D from the frontier. And now we have two nodes in the frontier, C and D. And we repeat the process again. We remove a node from the frontier. For now, I’ll do so arbitrarily just by picking C. We’ll see why later, how choosing which node you remove from the frontier is actually quite an important part of the algorithm. But for now, I’ll arbitrarily remove C, say it’s not the goal. So we’ll add E, the next one, to the frontier. Then let’s say I remove E from the frontier. And now I check I’m currently looking at state E. Is it a goal state? It is, because I’m trying to find a path from A to E. So I would return the goal. And that now would be the solution, that I’m now able to return the solution. And I have found a path from A to E. So this is the general idea, the general approach of this search algorithm, to follow these steps, constantly removing nodes from the frontier, until we’re able to find a solution. So the next question you might reasonably ask is, what could go wrong here? What are the potential problems with an approach like this? And here’s one example of a problem that could arise from this sort of approach. Imagine this same graph, same as before, with one change. 
The change being now, instead of just an arrow from A to B, we also have an arrow from B to A, meaning we can go in both directions. And this is true in something like the 15 puzzle, where when I slide a tile to the right, I could then slide a tile to the left to get back to the original position. I could go back and forth between A and B. And that’s what these double arrows symbolize, the idea that from one state, I can get to another, and then I can get back. And that’s true in many search problems. What’s going to happen if I try to apply the same approach now? Well, I’ll begin with A, same as before. And I’ll remove A from the frontier. And then I’ll consider where I can get to from A. And after A, the only place I can get to is B. So B goes into the frontier. Then I’ll say, all right, let’s take a look at B. That’s the only thing left in the frontier. Where can I get to from B? Before, it was just C and D. But now, because of that reverse arrow, I can get to A or C or D. So all three, A, C, and D, all of those now go into the frontier. They are places I can get to from B. And now I remove one from the frontier. And maybe I’m unlucky, and maybe I pick A. And now I’m looking at A again. And I consider, where can I get to from A? And from A, well, I can get to B. And now we start to see the problem. But if I’m not careful, I go from A to B, and then back to A, and then to B again. And I could be going in this infinite loop, where I never make any progress, because I’m constantly just going back and forth between two states that I’ve already seen. So what is the solution to this? We need some way to deal with this problem. And the way that we can deal with this problem is by somehow keeping track of what we’ve already explored. And the logic is going to be, well, if we’ve already explored the state, there’s no reason to go back to it. Once we’ve explored a state, don’t go back to it. Don’t bother adding it to the frontier. There’s no need to. So here’s going to be our revised approach, a better way to approach this sort of search problem. And it’s going to look very similar, just with a couple of modifications. We’ll start with a frontier that contains the initial state, same as before. But now we’ll start with another data structure, which will just be a set of nodes that we’ve already explored. So what are the states we’ve explored? Initially, it’s empty. We have an empty explored set. And now we repeat. If the frontier is empty, no solution, same as before. We remove a node from the frontier. We check to see if it’s a goal state, return the solution. None of this is any different so far. But now what we’re going to do is we’re going to add the node to the explored set. So if it happens to be the case that we remove a node from the frontier and it’s not the goal, we’ll add it to the explored set so that we know we’ve already explored it. We don’t need to go back to it again if it happens to come up later. And then the final step, we expand the node and we add the resulting nodes to the frontier. But before, we just always added the resulting nodes to the frontier. We’re going to be a little clever about it this time. We’re only going to add the nodes to the frontier if they aren’t already in the frontier and if they aren’t already in the explored set. So we’ll check both the frontier and the explored set, make sure that the node isn’t already in one of those two. And so long as it isn’t, then we’ll go ahead and add it to the frontier, but not otherwise.
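
Here is a minimal sketch of that revised loop, under a few assumptions: neighbors(state) is a hypothetical helper yielding (action, next_state) pairs, states are hashable, and the node mirrors the structure sketched earlier. Popping from the end of the list is one choice of removal order; the next paragraphs explain why that choice matters:

from collections import namedtuple

Node = namedtuple("Node", ["state", "parent", "action"])

def search(initial_state, is_goal, neighbors):
    frontier = [Node(initial_state, None, None)]  # frontier starts with just the initial state
    explored = set()                              # states we have already expanded
    while True:
        if not frontier:
            return None                           # frontier empty: no solution
        node = frontier.pop()                     # popping from the end treats the frontier as a stack
        if is_goal(node.state):
            return node                           # follow .parent links to recover the actions
        explored.add(node.state)
        for action, state in neighbors(node.state):
            in_frontier = any(n.state == state for n in frontier)
            if state not in explored and not in_frontier:
                frontier.append(Node(state, node, action))
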
And so that revised approach is ultimately what’s going to help make sure that we don’t go back and forth between two nodes. Now, the one point that I’ve kind of glossed over here so far is this step here, removing a node from the frontier. Before, I just chose arbitrarily. Like, let’s just remove a node and that’s it. But it turns out it’s actually quite important how we decide to structure our frontier, how we add and how we remove our nodes. The frontier is a data structure and we need to make a choice about in what order are we going to be removing elements. And one of the simplest data structures for adding and removing elements is something called a stack. And a stack is a data structure that is a last in, first out data type, which means the last thing that I add to the frontier is going to be the first thing that I remove from the frontier. So the most recent thing to go into the stack or the frontier in this case is going to be the node that I explore. So let’s see what happens if I apply this stack-based approach to something like this problem, finding a path from A to E. What’s going to happen? Well, again, we’ll start with A and we’ll say, all right, let’s go ahead and look at A first. And then notice this time, we’ve added A to the explored set. A is something we’ve now explored. We have this data structure that’s keeping track. We then say from A, we can get to B. And all right, from B, what can we do? Well, from B, we can explore B and get to both C and D. So we added C and then D. So now, when we explore a node, we’re going to treat the frontier as a stack, last in, first out. D was the last one to come in. So we’ll go ahead and explore that next and say, all right, where can we get to from D? Well, we can get to F. And so all right, we’ll put F into the frontier. And now, because the frontier is a stack, F is the most recent thing that’s gone in the stack. So F is what we’ll explore next. We’ll explore F and say, all right, where can we get to from F? Well, we can’t get anywhere, so nothing gets added to the frontier. So now, what was the new most recent thing added to the frontier? Well, it’s now C, the only thing left in the frontier. We’ll explore that from which we can see, all right, from C, we can get to E. So E goes into the frontier. And then we say, all right, let’s look at E. And E is now the solution. And now, we’ve solved the problem. So when we treat the frontier like a stack, a last in, first out data structure, that’s the result we get. We go from A to B to D to F. And then we sort of backed up and went down to C and then E. And it’s important to get a visual sense for how this algorithm is working. We went very deep in this search tree, so to speak, all the way until the bottom where we hit a dead end. And then we effectively backed up and explored this other route that we didn’t try before. And it’s this going very deep in the search tree idea, this way the algorithm ends up working when we use a stack that we call this version of the algorithm depth first search. Depth first search is the search algorithm where we always explore the deepest node in the frontier. We keep going deeper and deeper through our search tree. And then if we hit a dead end, we back up and we try something else instead. But depth first search is just one of the possible search options that we could use. It turns out that there’s another algorithm called breadth first search, which behaves very similarly to depth first search with one difference. 
Instead of always exploring the deepest node in the search tree, the way that depth first search does, breadth first search is always going to explore the shallowest node in the frontier. So what does that mean? Well, it means that instead of using a stack which depth first search or DFS used, where the most recent item added to the frontier is the one we’ll explore next, in breadth first search or BFS, we’ll instead use a queue, where a queue is a first in first out data type, where the very first thing we add to the frontier is the first one we’ll explore, and they effectively form a line or a queue, where the earlier you arrive in the frontier, the earlier you get explored. So what would that mean for the same exact problem, finding a path from A to E? Well, we start with A, same as before, then we’ll go ahead and have explored A and say, where can we get to from A? Well, from A, we can get to B, same as before. From B, same as before, we can get to C and D. So C and D get added to the frontier. This time, though, we added C to the frontier before D. So we’ll explore C first. So C gets explored. And from C, where can we get to? Well, we can get to E. So E gets added to the frontier. But because D was added to the frontier before E, we’ll look at D next. So we’ll explore D and say, where can we get to from D? We can get to F. And only then will we say, all right, now we can get to E. And so what breadth first search or BFS did is we started here, we looked at both C and D, and then we looked at E. Effectively, we’re looking at things one away from the initial state, then two away from the initial state, and only then, things that are three away from the initial state, unlike depth first search, which just went as deep as possible into the search tree until it hit a dead end and then ultimately had to back up. So these now are two different search algorithms that we could apply in order to try and solve a problem. And let’s take a look at how these would actually work in practice with something like maze solving, for example. So here’s an example of a maze. These empty cells represent places where our agent can move. These darkened gray cells represent walls that the agent can’t pass through. And ultimately, our agent, our AI, is going to try to find a way to get from position A to position B via some sequence of actions, where those actions are left, right, up, and down. What will depth first search do in this case? Well, depth first search will just follow one path. If it reaches a fork in the road where it has multiple different options, depth first search is just, in this case, going to choose one. It doesn’t have a real preference. But it’s going to keep following one until it hits a dead end. And when it hits a dead end, depth first search effectively goes back to the last decision point and tries the other path, fully exhausting this entire path. And when it realizes that, OK, the goal is not here, then it turns its attention to this path. It goes as deep as possible. When it hits a dead end, it backs up and then tries this other path, keeps going as deep as possible down one particular path. And when it realizes that that’s a dead end, then it’ll back up, and then ultimately find its way to the goal. And maybe you got lucky, and maybe you made a different choice earlier on. But ultimately, this is how depth first search is going to work. It’s going to keep following until it hits a dead end. And when it hits a dead end, it backs up and looks for a different solution.
And so one thing you might reasonably ask is, is this algorithm always going to work? Will it always actually find a way to get from the initial state to the goal? And it turns out that as long as our maze is finite, as long as there are only finitely many spaces where we can travel, then, yes, depth first search is going to find a solution. Because eventually, it’ll just explore everything. If the maze happens to be infinite and there’s an infinite state space, which does exist in certain types of problems, then it’s a slightly different story. But as long as our maze has finitely many squares, we’re going to find a solution. The next question, though, that we want to ask is, is it going to be a good solution? Is it the optimal solution that we can find? And the answer there is not necessarily. And let’s take a look at an example of that. In this maze, for example, we’re again trying to find our way from A to B. And you notice here there are multiple possible solutions. We could go this way or we could go up in order to make our way from A to B. Now, if we’re lucky, depth first search will choose this way and get to B. But there’s no reason necessarily why depth first search would choose between going up or going to the right. It’s sort of an arbitrary decision point because both are going to be added to the frontier. And ultimately, if we get unlucky, depth first search might choose to explore this path first because it’s just a random choice at this point. It’ll explore, explore, explore. And it’ll eventually find the goal, this particular path, when in actuality there was a better path. There was a more optimal solution that used fewer steps, assuming we’re measuring the cost of a solution based on the number of steps that we need to take. So depth first search, if we’re unlucky, might end up not finding the best solution when a better solution is available. So that’s DFS, depth first search. How does BFS, or breadth first search, compare? How would it work in this particular situation? Well, the algorithm is going to look very different visually in terms of how BFS explores. Because BFS looks at shallower nodes first, the idea is going to be, BFS will first look at all of the nodes that are one away from the initial state. Look here and look here, for example, just at the two nodes that are immediately next to this initial state. Then it’ll explore nodes that are two away, looking at this state and that state, for example. Then it’ll explore nodes that are three away, this state and that state. Whereas depth first search just picked one path and kept following it, breadth first search, on the other hand, is taking the option of exploring all of the possible paths kind of at the same time, bouncing back between them, looking deeper and deeper at each one, but making sure to explore the shallower ones or the ones that are closer to the initial state earlier. So we’ll keep following this pattern, looking at things that are four away, looking at things that are five away, looking at things that are six away, until eventually we make our way to the goal. And in this case, it’s true we had to explore some states that ultimately didn’t lead us anywhere, but the path that we found to the goal was the optimal path. This is the shortest way that we could get to the goal. And so what might happen then in a larger maze? Well, let’s take a look at something like this and how breadth first search is going to behave.
Well, breadth first search, again, will just keep following the states until it reaches a decision point. It could go either left or right. And while DFS just picked one and kept following that until it hit a dead end, BFS, on the other hand, will explore both. It’ll say look at this node, then this node, and it’ll look at this node, then that node. So on and so forth. And when it hits a decision point here, rather than pick one, left or right, and explore that path, it’ll again explore both, alternating between them, going deeper and deeper. We’ll explore here, and then maybe here and here, and then keep going. Explore here and slowly make our way, you can visually see, further and further out. Once we get to this decision point, we’ll explore both up and down until ultimately we make our way to the goal. And what you’ll notice is, yes, breadth first search did find our way from A to B by following this particular path, but it needed to explore a lot of states in order to do so. And so we see some trade-offs here between DFS and BFS, that in DFS, there may be some cases where there are some memory savings as compared to a breadth first approach, where breadth first search in this case had to explore a lot of states. But maybe that won’t always be the case. So now let’s actually turn our attention to some code and look at the code that we could actually write in order to implement something like depth first search or breadth first search in the context of solving a maze, for example. So I’ll go ahead and go into my terminal. And what I have here inside of maze.py is an implementation of this same idea of maze solving. I’ve defined a class called node that in this case is keeping track of the state, the parent, in other words, the state before this state, and the action. In this case, we’re not keeping track of the path cost because we can calculate the cost of the path at the end after we found our way from the initial state to the goal. In addition to this, I’ve defined a class called a stack frontier. And if unfamiliar with a class, a class is a way for me to define a way to generate objects in Python. It refers to an idea of object oriented programming, where the idea here is that I would like to create an object that is able to store all of my frontier data. And I would like to have functions, otherwise known as methods, on that object that I can use to manipulate the object. And so what’s going on here, if unfamiliar with the syntax, is I have a function that initially creates a frontier that I’m going to represent using a list. And initially, my frontier is represented by the empty list. There’s nothing in my frontier to begin with. I have an add function that adds something to the frontier as by appending it to the end of the list. I have a function that checks if the frontier contains a particular state. I have an empty function that checks if the frontier is empty. If the frontier is empty, that just means the length of the frontier is 0. And then I have a function for removing something from the frontier. I can’t remove something from the frontier if the frontier is empty, so I check for that first. But otherwise, if the frontier isn’t empty, recall that I’m implementing this frontier as a stack, a last in first out data structure, which means the last thing I add to the frontier, in other words, the last thing in the list, is the item that I should remove from this frontier. So what you’ll see here is I have removed the last item of a list.
And if you index into a Python list with negative 1, that gets you the last item in the list. Since 0 is the first item, negative 1 kind of wraps around and gets you to the last item in the list. So we give that the name node; we call that node. We update the frontier here on line 28 to say, go ahead and remove that node that you just removed from the frontier. And then we return the node as a result. So this class here effectively implements the idea of a frontier. It gives me a way to add something to a frontier and a way to remove something from the frontier as a stack. I’ve also, just for good measure, implemented an alternative version of the same thing called a queue frontier, which in parentheses you’ll see here, it inherits from a stack frontier, meaning it’s going to do all the same things that the stack frontier did, except the way we remove a node from the frontier is going to be slightly different. Instead of removing from the end of the list the way we would in a stack, we’re instead going to remove from the beginning of the list. self.frontier[0] will get me the first node in the frontier, the first one that was added, and that is going to be the one that we return in the case of a queue. Then under here, I have a definition of a class called maze. This is going to handle the process of taking a maze-like text file and figuring out how to solve it. So it will take as input a text file that looks something like this, for example, where we see hash marks that are here representing walls, and I have the character A representing the starting position and the character B representing the ending position. And you can take a look at the code for parsing this text file right now. That’s the less interesting part. The more interesting part is this solve function here. The solve function is going to figure out how to actually get from point A to point B. And here we see an implementation of the exact same idea we saw from a moment ago. We’re going to keep track of how many states we’ve explored, just so we can report that data later. But I start with a node that represents just the start state. And I start with a frontier that, in this case, is a stack frontier. And given that I’m treating my frontier as a stack, you might imagine that the algorithm I’m using here is now depth-first search, because depth-first search, or DFS, uses a stack as its data structure. And initially, this frontier is just going to contain the start state. We initialize an explored set that initially is empty. There’s nothing we’ve explored so far. And now here’s our loop, that notion of repeating something again and again. First, we check if the frontier is empty by calling that empty function that we saw the implementation of a moment ago. And if the frontier is indeed empty, we’ll go ahead and raise an exception, or a Python error, to say, sorry, there is no solution to this problem. Otherwise, we’ll go ahead and remove a node from the frontier as by calling frontier.remove and update the number of states we’ve explored, because now we’ve explored one additional state. So we say self.num_explored plus equals 1, adding 1 to the number of states we’ve explored. Once we remove a node from the frontier, recall that the next step is to see whether or not it’s the goal, the goal test. And in the case of the maze, the goal is pretty easy. I check to see whether the state of the node is equal to the goal.
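
Here is a reconstruction of the two frontier classes consistent with that walkthrough; the actual maze.py may differ in small details:

class StackFrontier:
    def __init__(self):
        self.frontier = []  # initially, the frontier is an empty list

    def add(self, node):
        self.frontier.append(node)  # add to the end of the list

    def contains_state(self, state):
        return any(node.state == state for node in self.frontier)

    def empty(self):
        return len(self.frontier) == 0

    def remove(self):
        if self.empty():
            raise Exception("empty frontier")
        node = self.frontier[-1]           # last in, first out: take the last item
        self.frontier = self.frontier[:-1]
        return node

class QueueFrontier(StackFrontier):        # inherits everything except remove
    def remove(self):
        if self.empty():
            raise Exception("empty frontier")
        node = self.frontier[0]            # first in, first out: take the first item
        self.frontier = self.frontier[1:]
        return node
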
Initially, when I set up the maze, I set up this value called goal, which is a property of the maze, so I can just check to see if the node is actually the goal. And if it is the goal, then what I want to do is backtrack my way towards figuring out what actions I took in order to get to this goal. And how do I do that? Well, recall that every node stores its parent, the node that came before it that we used to get to this node, and also the action used in order to get there. So I can create this loop where I’m constantly just looking at the parent of every node and keeping track for all of the parents what action I took to get from the parent to this current node. So this loop is going to keep repeating this process of looking through all of the parent nodes until we get back to the initial state, which has no parent, where node.parent is going to be equal to None. As I do so, I’m going to be building up the list of all of the actions that I’m following and the list of all the cells that are part of the solution. But I’ll reverse them, because I’m building them up going from the goal back to the initial state, and I want the sequence of actions from the initial state to the goal. And that is ultimately going to be the solution. So all of that happens if the current state is equal to the goal. And otherwise, if it’s not the goal, well, then I’ll go ahead and add this state to the explored set to say, I’ve explored this state now. No need to go back to it if I come across it in the future. And then this logic here implements the idea of adding neighbors to the frontier. I’m saying, look at all of my neighbors, and I implemented a function called neighbors that you can take a look at. And for each of those neighbors, I’m going to check, is the state already in the frontier? Is the state already in the explored set? And if it’s not in either of those, then I’ll go ahead and add this new child node, this new node, to the frontier. So there’s a fair amount of syntax here, but the key here is not to understand all the nuances of the syntax. So feel free to take a closer look at this file on your own to get a sense for how it is working. But the key is to see how this is an implementation of the same pseudocode, the same idea that we were describing a moment ago on the screen when we were looking at the steps that we might follow in order to solve this kind of search problem. So now let’s actually see this in action. I’ll go ahead and run maze.py on maze1.txt, for example. And what we’ll see is here, we have a printout of what the maze initially looked like. And then here down below is after we’ve solved it. We had to explore 11 states in order to do it, and we found a path from A to B. And in this program, I just happened to generate a graphical representation of this as well. So I can open up maze.png, which is generated by this program, that shows you where in the darker color here are the walls, red is the initial state, green is the goal, and yellow is the path that was followed. We found a path from the initial state to the goal. But now let’s take a look at a more sophisticated maze to see what might happen instead. Let’s look now at maze2.txt. We’re now here. We have a much larger maze. Again, we’re trying to find our way from point A to point B. But now you’ll imagine that depth-first search might not be so lucky. It might not get the goal on the first try.
It might have to follow one path, then backtrack and explore something else a little bit later. So let’s try this. We’ll run python maze.py on maze2.txt, this time trying this other maze. And now depth-first search is able to find a solution. Here, as indicated by the stars, is a way to get from A to B. And we can represent this visually by opening up this maze. Here’s what that maze looks like, and highlighted in yellow is the path that was found from the initial state to the goal. But how many states did we have to explore before we found that path? Well, recall that in my program, I was keeping track of the number of states that we’ve explored so far. So I can go back to the terminal and see that, all right, in order to solve this problem, we had to explore 399 different states. And in fact, if I make one small modification to the program and tell it what to do at the end, when we output this image: I added an argument called show_explored. If I set show_explored equal to True and rerun this program, python maze.py, running it on maze2, and then I open the maze, what you’ll see, highlighted in red, is all of the states that had to be explored to get from the initial state to the goal. Depth-first search, or DFS, didn’t find its way to the goal right away. It made a choice to first explore this direction. And when it explored this direction, it had to follow every conceivable path all the way to the very end, even this long and winding one, in order to realize that, you know what? That’s a dead end. And instead, the program needed to backtrack. After going this direction, it must have gone this direction. It got lucky here by just not choosing this path, but it got unlucky here, exploring this direction, exploring a bunch of states it didn’t need to, and then likewise exploring all of this top part of the graph when it probably didn’t need to do that either. So all in all, depth-first search here is really not performing optimally; it’s probably exploring more states than it needs to. It happens to find an optimal solution, the best path to the goal, but the number of states it needed to explore in order to do so, the number of steps it had to take, was much higher. So let’s compare. How would breadth-first search, or BFS, do on this exact same maze instead? It’s a very easy change: the algorithm for DFS and BFS is identical with the exception of what data structure we use to represent the frontier. In DFS, I used a stack frontier, last in, first out, whereas in BFS, I’m going to use a queue frontier, first in, first out, where the first thing I add to the frontier is the first thing that I remove. So I’ll go back to the terminal, rerun this program on the same maze, and now you’ll see that the number of states we had to explore was only 77, as compared to almost 400 when we used depth-first search. And we can see exactly why, we can see what happened, if we open up maze.png now and take a look. Again, the yellow highlight is the solution that breadth-first search found, which incidentally is the same solution that depth-first search found; they’re both finding the best solution. But notice all the white unexplored cells. There were far fewer states that needed to be explored in order to make our way to the goal, because breadth-first search operates a little more shallowly. It’s exploring things that are close to the initial state without exploring things that are further away.
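In code terms, that swap is a one-line change inside the solve method sketched earlier; everything else stays exactly the same.

# Depth-first search: the frontier is a stack (last in, first out).
frontier = StackFrontier()

# Breadth-first search: the frontier is a queue (first in, first out).
# This is the only line that changes.
frontier = QueueFrontier()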
So if the goal is not too far away, then breadth-first search can actually behave quite effectively on a maze that looks a little something like this. Now, in this case, both BFS and DFS ended up finding the same solution, but that won’t always be the case. And in fact, let’s take a look at one more example. For instance, maze3.txt. In maze3.txt, notice that there are multiple ways you could get from A to B. It’s a relatively small maze, but let’s look at what happens. I’ll go ahead and turn off show_explored, so we just see the solution. If I use BFS, breadth-first search, to solve maze3.txt, well, then we find a solution, and if I open up the maze, here is the solution that we found. It is the optimal one: with just four steps, we can get from the initial state to the goal. But what happens if we tried to use depth-first search, or DFS, instead? Well, again, I’ll go back up to my queue frontier, where the queue frontier means that we’re using breadth-first search, and I’ll change it to a stack frontier, which means that now we’ll be using depth-first search. I’ll rerun python maze.py, and now you’ll see that we find a solution, but it is not the optimal solution. This instead is what our algorithm finds. And maybe depth-first search would have found the same solution; it’s possible. But it’s not guaranteed: if we just happen to be unlucky, if we choose this state instead of that state, then depth-first search might find a longer route to get from the initial state to the goal. So we do see some trade-offs here, where depth-first search might not find the optimal solution. So at that point, it seems like breadth-first search is pretty good. Is that the best we can do, where it’s going to find us the optimal solution, and we don’t have to worry about situations where we might end up finding a longer path to the solution than what actually exists? Well, not necessarily. Consider a situation where the goal is far away from the initial state, and we might have to take lots of steps in order to get from the initial state to the goal. What ended up happening in the larger maze is that this algorithm, BFS, ended up exploring basically the entire graph, having to go through the entire maze in order to find its way from the initial state to the goal state. What we’d ultimately like is for our algorithm to be a little bit more intelligent. And what would it mean for our algorithm to be a little bit more intelligent in this case? Well, let’s look back to where breadth-first search might have been able to make a different decision, and consider human intuition in this process as well. What might a human do when solving this maze that is different from what BFS ultimately chose to do? Well, the very first decision point that BFS faced was right here, after it made five steps and ended up in a position where it had a fork in the road. It could either go left or it could go right. In those initial couple of steps, there was no choice; there was only one action that could be taken from each of those states. And so the search algorithm did the only thing that any search algorithm could do, which is to keep following one state after the next. But this decision point is where things get a little bit interesting. Depth-first search, that very first search algorithm we looked at, chose to say: let’s pick one path and exhaust that path, see if anything that way has the goal, and if not, then let’s try the other way.
Breadth-first search took the alternative approach of saying, you know what, let’s explore things that are shallow, close to us, first. Look left and right, then back left and back right, so on and so forth, alternating between our options in the hopes of finding something nearby. But ultimately, what might a human do if confronted with a situation like this, of go left or go right? Well, a human might visually see that, all right, I’m trying to get to state B, which is way up there, and going right just feels like it’s closer to the goal. It feels like going right should be better than going left, because I’m making progress towards getting to that goal. Now, of course, there are a couple of assumptions that I’m making here. I’m making the assumption that we can represent this maze as a two-dimensional grid where I know the coordinates of everything: I know that A is at coordinate (0, 0), B is at some other coordinate pair, and I know what coordinate I’m at now. So I can calculate that, yeah, going this way, that is closer to the goal. And that might be a reasonable assumption for some types of search problems, but maybe not in others. But for now, we’ll go ahead and assume it: I know what my current coordinate pair is, and I know the coordinate pair, x, y, of the goal that I’m trying to get to. And in this situation, I’d like an algorithm that is a little bit more intelligent, that somehow knows that I should be making progress towards the goal, because in a maze, moving in the coordinate direction of the goal is usually, though not always, a good thing. And so here we draw a distinction between two different types of search algorithms: uninformed search and informed search. Uninformed search algorithms are algorithms like DFS and BFS, the two algorithms we just looked at, which are search strategies that don’t use any problem-specific knowledge to be able to solve the problem. DFS and BFS didn’t really care about the structure of the maze, or anything about the way a maze works, in order to solve the problem. They just look at the actions available and choose from those actions, and it doesn’t matter whether it’s a maze or some other problem; the way they try to solve the problem is really fundamentally going to be the same. What we’re going to take a look at now is an improvement upon uninformed search: informed search. Informed search algorithms are search strategies that use knowledge specific to the problem to be able to better find a solution. And in the case of a maze, this problem-specific knowledge is something like: if I’m in a square that is geographically closer to the goal, that is better than being in a square that is geographically further away. And this is something we can only know by thinking about this problem and reasoning about what knowledge might be helpful for our AI agent to know a little something about. There are a number of different types of informed search. Specifically, first, we’re going to look at a particular type of search algorithm called greedy best-first search. Greedy best-first search, often abbreviated GBFS, is a search algorithm that, instead of expanding the deepest node like DFS or the shallowest node like BFS, is always going to expand the node that it thinks is closest to the goal. Now, the search algorithm isn’t going to know for sure whether it is the closest thing to the goal.
Because if we knew what was closest to the goal all the time, then we would already have a solution; with knowledge of what is close to the goal, we could just follow those steps in order to get from the initial position to the solution. But if we don’t know the solution, meaning we don’t know exactly what’s closest to the goal, we can instead use an estimate of what’s closest to the goal, otherwise known as a heuristic: just some way of estimating whether or not we’re close to the goal. And we’ll do so using a heuristic function, conventionally called h(n), that takes a state as input and returns our estimate of how close we are to the goal. So what might this heuristic function actually look like in the case of a maze-solving algorithm? When we’re trying to solve a maze, what does the heuristic look like? Well, the heuristic needs to answer a question: between these two cells, C and D, which one is better? Which one would I rather be in if I’m trying to find my way to the goal? Well, any human could probably look at this and tell you, you know what, D looks like it’s better. Even if the maze is convoluted and you haven’t thought about all the walls, D is probably better. And why is D better? Well, if you ignore the walls, let’s just pretend the walls don’t exist for a moment and relax the problem, so to speak, D, just in terms of coordinate pairs, is closer to the goal. It’s fewer steps that I would need to take to get to the goal as compared to C, even if you ignore the walls. If you just know the xy-coordinate of C and the xy-coordinate of the goal, and likewise you know the xy-coordinate of D, you can calculate that D, just geographically, ignoring the walls, looks like it’s better. And so this is the heuristic function that we’re going to use. It’s called the Manhattan distance, one specific type of heuristic, where the heuristic is how many squares, vertically and horizontally, I would need to move to get from each of these cells to the goal, not allowing myself to go diagonally, just up, down, left, or right. Well, as it turns out, D is much closer; there are fewer steps. It only needs six steps in order to get to that goal. Again, we’re ignoring the walls here; we’ve relaxed the problem a little bit. We’re just concerned with: if you do the math to subtract the x values from each other and the y values from each other, what is our estimate of how far away we are? We can estimate that D is closer to the goal than C is. And so now we have an approach, a way of picking which node to remove from the frontier. At each stage in our algorithm, we’re going to remove from the frontier, and explore, the node that has the smallest value for this heuristic function, the smallest Manhattan distance to the goal. So what would this actually look like? Well, let me first label this maze with a number in each cell representing the value of this heuristic function, the Manhattan distance from that cell. So from this cell, for example, we’re one away from the goal. From this cell, we’re two away from the goal, three away, four away. Here, we’re five away, because we have to go one to the right and then four up. From somewhere like here, the Manhattan distance is two. We’re only two squares away from the goal geographically, even though in practice, we’re going to have to take a longer path. But we don’t know that yet.
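In code, that heuristic is tiny. A minimal sketch, assuming states are (row, col) pairs like the rest of the maze code:

def manhattan_distance(state, goal):
    # h(n): grid steps to the goal, ignoring walls, moving only
    # up, down, left, or right (no diagonals).
    (r1, c1), (r2, c2) = state, goal
    return abs(r1 - r2) + abs(c1 - c2)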
The heuristic is just some easy way to estimate how far we are away from the goal. And maybe our heuristic is overly optimistic. It thinks that, yeah, we’re only two steps away, when in practice, once you consider the walls, it might be more steps. So the important thing here is that the heuristic isn’t a guarantee of how many steps it’s going to take. It is an estimate, an attempt at approximating. And it does seem generally the case that the squares that look closer to the goal have smaller values for the heuristic function than squares that are further away. So now, using greedy best-first search, what might this algorithm actually do? Well, again, for these first five steps, there’s not much of a choice. We start at this initial state A, and we say, all right, we have to explore these five states. But now we have a decision point, a choice between going left and going right. And whereas before, DFS and BFS would just pick arbitrarily, because it just depends on the order you throw these two nodes into the frontier, and we didn’t specify what order you put them into the frontier, only the order you take them out, here we can look at 13 and 11 and say: all right, this square is a distance of 11 away from the goal, according to our heuristic, according to our estimate, and this one we estimate to be 13 away from the goal. So between those two options, between these two choices, I’d rather have the 11. I’d rather be 11 steps away from the goal, so I’ll go to the right. We’re able to make an informed decision, because we know a little something more about this problem. So then we keep following: 10, 9, 8. Between the two 7s, we don’t really have much of a way to decide, so then we do just have to make an arbitrary choice. And you know what, maybe we choose wrong. But that’s OK, because now we can still say, all right, let’s try this 7. We say 7, then 6, and we keep making this choice even when it increases the value of the heuristic function. But now we have another decision point, between 6 and 8. And really, we’re also considering this 13, but that’s much higher. Between 6, 8, and 13, well, the 6 is the smallest value, so we’d rather take the 6. We’re able to make an informed decision that going this way, to the right, is probably better than going down. So we turn this way, and we go to 5. And now we find a decision point where we’ll actually make a decision that we might not want to make, but there’s unfortunately not too much of a way around it. We see 4 and 6. 4 looks closer to the goal, right? It’s going up, and the goal is further up. So we end up taking that route, which ultimately leads us to a dead end. But that’s OK, because we can still say, all right, now let’s try the 6, and now follow this route that will ultimately lead us to the goal. And so this is how greedy best-first search might try to approach this problem: by saying, whenever we have a decision between multiple nodes that we could explore, let’s explore the node that has the smallest value of h(n), this heuristic function that is estimating how far I have to go. And it just so happens that in this case, we end up doing better, in terms of the number of states we needed to explore, than BFS did. BFS explored all of this section and all of that section, but we were able to eliminate that by taking advantage of this heuristic, this knowledge about how close we are to the goal, or some estimate of that idea. So this seems much better.
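One way to implement that selection rule is a frontier that always removes the node with the smallest h(n). A sketch using Python’s heapq; the GreedyFrontier name and the counter tie-breaker are my own plumbing, not necessarily how the lecture’s file does it:

import heapq
import itertools

class GreedyFrontier():
    # Frontier for greedy best-first search: remove() returns the
    # node whose state has the smallest heuristic value h(n).
    def __init__(self, h):
        self.h = h                        # e.g. manhattan_distance to the goal
        self.frontier = []
        self.counter = itertools.count()  # tie-breaker so equal-h entries never compare nodes

    def add(self, node):
        heapq.heappush(self.frontier, (self.h(node.state), next(self.counter), node))

    def contains_state(self, state):
        return any(node.state == state for _, _, node in self.frontier)

    def empty(self):
        return len(self.frontier) == 0

    def remove(self):
        if self.empty():
            raise Exception("empty frontier")
        _, _, node = heapq.heappop(self.frontier)
        return node

Dropping this in where the solve method previously created a StackFrontier or QueueFrontier, with h bound to the Manhattan distance to the goal, turns the same loop into greedy best-first search.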
So wouldn’t we always prefer an algorithm like this over an algorithm like breadth-first search? Well, one thing to take into consideration is that we need to come up with a good heuristic; how good the heuristic is is going to affect how good this algorithm is, and coming up with a good heuristic can oftentimes be challenging. But the other thing to consider is to ask the question, just as we did with the prior two algorithms: is this algorithm optimal? Will it always find the shortest path from the initial state to the goal? To answer that question, let’s take a look at this example for a moment. Again, we’re trying to get from A to B, and again, I’ve labeled each of the cells with its Manhattan distance from the goal: the number of squares, vertically and horizontally, you would need to travel in order to get from that square to the goal. And let’s think about: would greedy best-first search, which always picks the smallest number, end up finding the optimal solution? What is the shortest solution, and would this algorithm find it? The important thing to realize is that right here is the decision point. We’re estimated to be 12 away from the goal, and we have two choices. We can go to the left, which we estimate to be 13 away from the goal, or we can go up, which we estimate to be 11 away from the goal. And between those two, greedy best-first search is going to say the 11 looks better than the 13. In doing so, greedy best-first search will end up finding this path to the goal. But it turns out this path is not optimal. There is a way to get to the goal using fewer steps, and it’s actually this way, a way that ultimately involves fewer steps, even though it meant, at this moment, choosing the worse option between the two, or what we estimated to be the worse option based on the heuristic. And so this is what we mean when we say this is a greedy algorithm: it’s making the best decision locally. At this decision point, it looks like it’s better to go to the 11 than to the 13, but in the big picture, that’s not necessarily optimal. It might find a solution when, in actuality, there was a better solution available. So we would like some way to solve this problem. We like the idea of this heuristic, of being able to estimate the distance between us and the goal; that helps us make better decisions and eliminate having to search through entire parts of the state space. But we would like to modify the algorithm so that it can be optimal. And what is the way to do this? What is the intuition here? Well, let’s take a look at this problem. In this initial problem, greedy best-first search found us this solution here, this long path. And the reason why it wasn’t great is because, yes, the heuristic numbers went down pretty low, but later on, they started to build back up: 8, 9, 10, 11, all the way up to 12 in this case. And so how might we go about trying to improve this algorithm? Well, one thing we might realize is that if we go all the way through this path and end up going to the 12, we’ve had to take this many steps, who knows how many steps that is, just to get to this 12. We could have also, as an alternative, taken far fewer steps, just six steps, and ended up at this 13 here. And yes, 13 is more than 12, so it looks like it’s not as good. But it required far fewer steps.
It only took six steps to get to this 13, versus many more steps to get to this 12. And while greedy best-first search says, oh, well, 12 is better than 13, so pick the 12, we might more intelligently say, I’d rather be somewhere that heuristically looks like it takes slightly longer if I can get there much more quickly. And we’re going to encode that general idea into a more formal algorithm known as A* search. A* search is going to solve this problem by, instead of just considering the heuristic, also considering how long it took us to get to any particular state. So the distinction is this: in greedy best-first search, if I am in a state right now, the only thing I care about is the estimated distance, the heuristic value, between me and the goal. Whereas A* search will take into consideration two pieces of information: how far do I estimate I am from the goal, but also, how far did I have to travel in order to get here? Because that is relevant, too. So A* search will work by expanding the node with the lowest value of g(n) plus h(n). h(n) is that same heuristic we were talking about a moment ago, which is going to vary based on the problem. But g(n) is the cost to reach the node: how many steps I had to take, in this case, to get to my current position. So what does that search algorithm look like in practice? Well, let’s take a look. Again, we’ve got the same maze, and again, I’ve labeled the cells with their Manhattan distance. This value is the h(n) value, the heuristic estimate of how far each of these squares is from the goal. But now, as we begin to explore states, we care not just about this heuristic value, but also about g(n), the number of steps I had to take in order to get there, and I care about summing those two numbers together. So what does that look like? On this very first step, I have taken one step, and I am estimated to be 16 steps away from the goal, so the total value here is 17. Then I take one more step. I’ve now taken two steps, and I estimate myself to be 15 away from the goal; again, a total value of 17. Now I’ve taken three steps, and I’m estimated to be 14 away from the goal, so on and so forth. Four steps, an estimate of 13. Five steps, estimate of 12. And now here’s a decision point. I could either be six steps in with a heuristic of 13, for a total of 19, or six steps in with a heuristic of 11, for a total of 17. So between 19 and 17, I’d rather take the 17, the 6 plus 11. So far, no different from what we saw before: we’re still taking this option because it appears to be better, and I keep taking this option because it appears to be better. But it’s right about here that things get a little bit different. Now I could be 15 steps in, with an estimated distance of 6 from the goal: 15 plus 6, a total value of 21. Alternatively, I could go back to that other branch: that square was five steps in, so this one is six steps in, with a heuristic estimate of 13. So 6 plus 13, that’s 19. Here, we would evaluate g(n) plus h(n) to be 19, 6 plus 13, whereas over there, we would be at 15 plus 6, or 21. And since 19 is less than 21, we pick here. The idea is, ultimately, I’d rather have taken only six steps and be at a 13 than have taken 15 steps and be at a 6, because the latter means I’ve had to take many more steps to get there. Maybe there’s a better path this way.
So instead, we’ll explore this route. Now, if we go one more step, this is seven steps plus 14, which is 21. So between those two, it’s sort of a toss-up; we might end up exploring that one anyway. But after that, as the step counts along the long path keep growing while the heuristic values along this new path keep shrinking, you’ll find that we’ll actually keep exploring down this path. And you can do the math to see that at every decision point, A* search is going to make a choice based on the sum of how many steps it took me to get to my current position and how far I estimate I am from the goal. So while we did have to explore some of these states, the ultimate solution we found was, in fact, an optimal solution. It did find us the quickest possible way to get from the initial state to the goal. And it turns out that A* is an optimal search algorithm under certain conditions. The conditions are that h(n), my heuristic, needs to be admissible. What does it mean for a heuristic to be admissible? Well, a heuristic is admissible if it never overestimates the true cost. h(n) always needs to either get it exactly right, in terms of how far away I am, or underestimate. We saw an example before where the heuristic value was much smaller than the actual cost it would take; that’s totally fine, but the heuristic value should never overestimate. It should never think that I’m further away from the goal than I actually am. And meanwhile, to make a stronger statement, h(n) also needs to be consistent. What does it mean for it to be consistent? Mathematically, it means that for every node, which we’ll call n, and successor, the node after me, which I’ll call n′, where it takes a cost of c to make that step, the heuristic value of n needs to be less than or equal to the heuristic value of n′ plus the cost: h(n) ≤ h(n′) + c. It’s a bit of math, but in words, what that ultimately means is that if I am at this state right now, the heuristic value from me to the goal shouldn’t be more than the heuristic value of my successor, the next place I could go, plus however much it would cost me to just make that one step. This is just making sure that my heuristic is consistent across all of the steps I might take. So as long as those conditions hold, A* search is going to find me an optimal solution. And this is where much of the challenge of solving these search problems can sometimes come in: A* search is a known algorithm, and you could write the code fairly easily, but choosing the heuristic can be the interesting challenge. The better the heuristic is, the better I’ll be able to solve the problem and the fewer states I’ll have to explore, and I need to make sure the heuristic satisfies these particular constraints. So all in all, these are some examples of search algorithms that might work, and certainly there are many more than just these. A*, for example, does have a tendency to use quite a bit of memory, so there are alternative approaches to A* that ultimately use less memory than this version happens to use, and there are other search algorithms that are optimized for other cases as well. But so far, we’ve only been looking at search algorithms where there is one agent: I am trying to find a solution to a problem. I am trying to navigate my way through a maze. I am trying to solve a 15 puzzle. I am trying to find driving directions from point A to point B.
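The A* version of the frontier is nearly identical to the greedy one; only the priority changes. A sketch, assuming each Node also carries a cost attribute for g(n), set to parent.cost + 1 when a child node is created; that attribute is an assumed extension of the Node class sketched earlier, not something the lecture’s file necessarily has:

import heapq
import itertools

class AStarFrontier():
    # Frontier for A* search: remove() returns the node with the
    # smallest g(n) + h(n), where g(n) is the cost to reach the node.
    def __init__(self, h):
        self.h = h
        self.frontier = []
        self.counter = itertools.count()

    def add(self, node):
        # node.cost plays the role of g(n): steps taken from the start.
        # For optimality, h must be admissible (never overestimate) and,
        # more strongly, consistent: h(n) <= h(n') + c for every step.
        priority = node.cost + self.h(node.state)
        heapq.heappush(self.frontier, (priority, next(self.counter), node))

    def empty(self):
        return len(self.frontier) == 0

    def remove(self):
        if self.empty():
            raise Exception("empty frontier")
        _, _, node = heapq.heappop(self.frontier)
        return node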
Sometimes in search situations, though, we’ll enter an adversarial situation, where I am an agent trying to make intelligent decisions, and there’s someone else fighting against me, so to speak, who has the opposite objective: I am trying to succeed, and someone else wants me to fail. This is most familiar in something like a game, a game like tic-tac-toe, where we’ve got this 3-by-3 grid, and X and O take turns, each writing their letter in any one of these squares. And the goal is to get three X’s in a row if you’re the X player, or three O’s in a row if you’re the O player. Computers have gotten quite good at playing games: tic-tac-toe very easily, but even more complex games, too. And so you might imagine, what does an intelligent decision in a game look like? So maybe X makes an initial move in the middle, and O plays up here. What does an intelligent move for X now become? Where should you move if you were X? It turns out there are a couple of possibilities. But if an AI is playing this game optimally, then the AI might play somewhere like the upper right. In this situation, O has the opposite objective of X: X is trying to win the game by getting three in a row diagonally here, and O is trying to stop that objective. So O is going to place here to try to block. But now, X has a pretty clever move. X can make a move like this, where now X has two possible ways to win the game: X could win by getting three in a row across here, or X could win by getting three in a row vertically this way. So it doesn’t matter where O makes their next move. O could play here, for example, blocking the three in a row horizontally, but then X is going to win the game by getting three in a row vertically. And so there’s a fair amount of reasoning going on here in order for the computer to be able to solve a problem. And it’s similar in spirit to the problems we’ve looked at so far: there are actions, there’s some sort of state of the board, and there’s some transition from one state to the next as actions are taken. But it’s different in the sense that this is not just a classical search problem anymore; it’s an adversarial search problem. I, as the X player, am trying to find the best moves to make, but I know that there is some adversary trying to stop me. So we need some sort of algorithm to deal with these adversarial types of search situations. And the algorithm we’re going to take a look at is called Minimax, which works very well for deterministic games where there are two players. It can work for other types of games as well, but we’ll look right now at games where I make a move, then my opponent makes a move, and I am trying to win while my opponent is trying to win also, or, in other words, my opponent is trying to get me to lose. So what do we need in order to make this algorithm work? Well, any time we try to translate this human concept of playing a game, winning, and losing to a computer, we want to translate it into terms the computer can understand. Ultimately, the computer really just understands numbers. And so we want some way of translating a game of X’s and O’s on a grid into something numerical, something the computer can understand. The computer doesn’t naturally understand notions of win or lose, but it does understand the concept of bigger and smaller.
And so what we might do is take each of the possible ways that a tic-tac-toe game can unfold and assign a value, or a utility, to each one of them. In a tic-tac-toe game, as in many types of games, there are three possible outcomes: O wins, X wins, or nobody wins. So player one wins, player two wins, or nobody wins. For now, let’s go ahead and assign each of these possible outcomes a different value. We’ll say O winning has a value of negative 1, nobody winning has a value of 0, and X winning has a value of 1. So we’ve just assigned numbers to each of these three possible outcomes. And now we have two players, the X player and the O player, and we’re going to call the X player the max player and the O player the min player. The reason why is that in the Minimax algorithm, the max player, which in this case is X, is aiming to maximize the score. These are the possible options for the score: negative 1, 0, and 1. X wants to maximize the score, meaning if at all possible, X would like the situation where X wins the game, which we give a score of 1. But if this isn’t possible, if X needs to choose between negative 1, meaning O winning, and 0, meaning nobody winning, X would rather that nobody win, a score of 0, than take a score of negative 1, O winning. So this notion of winning, losing, and tying has been reduced mathematically to just this idea: try to maximize the score. The X player always wants the score to be bigger. And on the flip side, the min player, in this case O, is aiming to minimize the score. The O player wants the score to be as small as possible. So now we’ve taken this game of X’s and O’s and winning and losing and turned it into something mathematical, where X is trying to maximize the score and O is trying to minimize the score. Let’s now look at all of the parts of the game that we need in order to encode it so that an AI can play a game like tic-tac-toe. The game is going to need a couple of things. We’ll need some sort of initial state, which we’ll call s0 in this case, which is how the game begins: an empty tic-tac-toe board, for example. We’ll also need a function called player, where the player function takes as input a state, here represented by s, and the output of the player function is which player’s turn it is. We need to be able to give a tic-tac-toe board to the computer, run it through a function, and have that function tell us whose turn it is. We’ll need some notion of actions that we can take; we’ll see examples of that in just a moment. We’ll need some notion of a transition model, same as before: if I have a state and I take an action, I need to know what results as a consequence. I’ll need some way of knowing when the game is over. This is equivalent to kind of like a goal test: I need some terminal test, some way to check whether a state is a terminal state, where a terminal state means the game is over. In a classic game of tic-tac-toe, a terminal state means either someone has gotten three in a row or all of the squares of the board are filled; either of those conditions makes it a terminal state. In a game of chess, it might be something like when there is checkmate, or when checkmate is no longer possible; that becomes a terminal state.
And then finally, we’ll need a utility function, a function that takes a state and gives us a numerical value for that terminal state: some way of saying, if X wins the game, that has a value of 1; if O has won the game, that has a value of negative 1; if nobody has won the game, that has a value of 0. So let’s take a look at each of these in turn. The initial state, we can just represent in tic-tac-toe as the empty game board. This is where we begin; it’s the place from which we begin this search. And again, I’ll be representing these things visually, but you can imagine this really just being something like a two-dimensional array of all of these possible squares. Then we need the player function that, again, takes a state and tells us whose turn it is. Assuming X makes the first move, if I have an empty game board, then my player function is going to return X, and if I have a game board where X has made a move, then my player function is going to return O. The player function takes a tic-tac-toe game board and tells us whose turn it is. Next up, we’ll consider the actions function. The actions function, much like it did in classical search, takes a state and gives us the set of all of the possible actions we can take in that state. So let’s imagine it’s O’s turn to move in a game board that looks like this. What happens when we pass it into the actions function? The actions function takes this state of the game as input, and the output is a set of possible actions. It’s a set of two: I could move in the upper left, or I could move in the bottom middle. Those are the two possible action choices I have when I begin in this particular state. Now, just as before, when we had states and actions, we need some sort of transition model to tell us: when we take this action in this state, what is the new state we get? Here, we define that using the result function, which takes as input a state as well as an action. And when we apply the result function to this state, saying let O move in the upper-left corner, the new state we get is this resulting state, where O is in the upper-left corner. Now, this seems obvious to someone who knows how to play tic-tac-toe: of course, you play in the upper-left corner, and that’s the board you get. But all of this information needs to be encoded into the AI. The AI doesn’t know how to play tic-tac-toe until you tell it how the rules of tic-tac-toe work. Defining this function allows us to tell the AI how this game actually works and how actions actually affect the outcome of the game. So the AI needs to know how the game works. The AI also needs to know when the game is over, by defining a function called terminal that takes as input a state s, such that if we take a game that is not yet over and pass it into the terminal function, the output is false, the game is not over; but if we take a game that is over, because X has gotten three in a row along that diagonal, and pass that into the terminal function, then the output is going to be true, because the game now is, in fact, over. And finally, we’ve told the AI how the game works, in terms of what moves can be made and what happens when you make those moves, and we’ve told the AI when the game is over. Now we need to tell the AI what the value of each of those states is. And we do that by defining this utility function, which takes a state s and tells us the score, or utility, of that state.
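Collected into code, those pieces might look like this for tic-tac-toe. A sketch, representing the board as a 3x3 list of lists with X, O, or EMPTY in each cell; winner() is an assumed helper that checks rows, columns, and diagonals and returns X, O, or None:

X, O, EMPTY = "X", "O", None

def initial_state():
    # s0: the empty 3x3 board.
    return [[EMPTY] * 3 for _ in range(3)]

def player(board):
    # Whose turn is it? X moves first, so X plays when the counts are equal.
    xs = sum(row.count(X) for row in board)
    os = sum(row.count(O) for row in board)
    return X if xs == os else O

def actions(board):
    # The set of (row, col) cells still available.
    return {(i, j) for i in range(3) for j in range(3) if board[i][j] is EMPTY}

def result(board, action):
    # Transition model: the board that results from making a move.
    i, j = action
    new_board = [row[:] for row in board]   # copy; don't mutate the original
    new_board[i][j] = player(board)
    return new_board

def terminal(board):
    # Game over if someone has three in a row or the board is full.
    return winner(board) is not None or all(
        cell is not EMPTY for row in board for cell in row)

def utility(board):
    # 1 if X has won, -1 if O has won, 0 otherwise.
    w = winner(board)
    return 1 if w == X else -1 if w == O else 0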
So again, we said that if X wins the game, that utility is a value of 1, whereas if O wins the game, the utility is negative 1. And the AI needs to know, for each of these terminal states where the game is over, what the utility of that state is. So if I give the AI a game board like this, where the game is, in fact, over, and I ask it to tell me the value of that state, it could do so: the value of the state is 1. Where things get interesting, though, is when the game is not yet over. Let’s imagine a game board like this, where in the middle of the game, it’s O’s turn to make a move. And how do we know it’s O’s turn to make a move? We can calculate that using the player function: we say player of s, pass in the state, and O is the answer. So we know it’s O’s turn to move. And now, what is the value of this board, and what action should O take? Well, that’s going to depend; we have to do some calculation here. And this is where the Minimax algorithm really comes in. Recall that X is trying to maximize the score, which means that O is trying to minimize the score. O would like to minimize the total value that we get at the end of the game. And because this game isn’t over yet, we don’t really know just yet what the value of this game board is; we have to do some calculation in order to figure that out. So how do we do that kind of calculation? Well, in order to do so, we’re going to consider, just as we might in a classical search situation, what actions could happen next and what states they will take us to. And it turns out that in this position, there are only two open squares, which means there are only two open places where O can make a move: O could either make a move in the upper left, or O can make a move in the bottom middle. And Minimax doesn’t know, right out of the box, which of those moves is going to be better, so it’s going to consider both. But now, we sort of run into the same situation. Now I have two more game boards, neither of which is over. What happens next? It’s in this sense that Minimax is what we call a recursive algorithm. It’s going to repeat the exact same process, although now considering it from the opposite perspective. It’s as if I, as the O player, am now going to put myself in my opponent’s shoes, my opponent being the X player, and consider: what would my opponent do if they were in this position? What would my opponent, the X player, do if they were in that position? And what would then happen? Well, the other player, my opponent the X player, is trying to maximize the score, whereas I am trying to minimize the score as the O player. So X is trying to find the maximum possible value that they can get. And so what’s going to happen? Well, from this board position, X only has one choice. X is going to play here, and they’re going to get three in a row. And we know that that board, X winning, has a value of 1. If X wins the game, the value of that game board is 1. And so from this position, if this state can only ever lead to that state, it’s the only possible option, and that state has a value of 1, then the maximum possible value that the X player can get from this game board is also 1. From here, the only place we can get to is a game with a value of 1, so this game board also has a value of 1. Now we consider this other one over here. What’s going to happen now? Well, X needs to make a move, and the only move X can make is in the upper left, so X will go there.
And in this game, no one wins. Nobody has three in a row, and so the value of that game board is 0, because nobody has won. And so again, by the same logic, if from this board position the only place we can get to is a board where the value is 0, then this state must also have a value of 0. And now here comes the choice part, the idea of trying to minimize. I, as the O player, now know that if I make this choice, moving in the upper left, that is going to result in a game with a value of 1, assuming everyone plays optimally. And if I instead play in the lower middle, choosing that fork in the road, that is going to result in a game board with a value of 0. I have two options, a 1 and a 0 to choose from, and I need to pick. And as the min player, I would rather choose the option with the minimum value. So whenever a player has multiple choices, the min player will choose the option with the smallest value, and the max player will choose the option with the largest value. Between the 1 and the 0, the 0 is smaller, meaning I’d rather tie the game than lose the game. And so this game board, we’ll say, also has a value of 0, because if I am playing optimally, I will pick this fork in the road: I’ll place my O here to block X’s three in a row, X will move in the upper left, the game will be over, and no one will have won. So this is the logic of Minimax: consider all of the possible actions that I can take, and then put myself in my opponent’s shoes. I decide what move I’m going to make now by considering what move my opponent will make on the next turn. And to do that, I consider what move I would make on the turn after that, and so on and so forth, until I get all the way down to the end of the game, to one of these so-called terminal states. In fact, this very decision point, where I am trying to decide as the O player what move to make, might itself have been just a part of the logic that the X player, my opponent, was using the move before me. This might be part of some larger tree, where X is trying to make a move in this situation and needs to pick between three different options in order to decide what happens. And the further away we are from the end of the game, the deeper this tree has to go, because every level in this tree corresponds to one move: one move or action that I take, then one move or action that my opponent takes. And in fact, it turns out that if I am the X player in this position and I recursively run this logic, I’ll see that I have a choice, three choices in fact, one of which leads to a value of 0: if I play here, and everyone plays optimally, the game will be a tie. If I play here, then O is going to win, and I’ll lose, playing optimally. Or I could play here, where I, the X player, can win. Well, between a score of 0, negative 1, and 1, I’d rather pick the board with a value of 1, because that’s the maximum value I can get, and so this board would also have a value of 1. And so this tree can get very, very deep, especially as the game starts to have more and more moves. This logic works not just for tic-tac-toe, but for any of these sorts of games where I make a move, my opponent makes a move, and ultimately we have these adversarial objectives. And we can simplify the diagram into one that looks like this.
This is a more abstract version of the Minimax tree, where these are each states, but I’m no longer representing them as tic-tac-toe boards exactly. This is just representing some generic game, which might be tic-tac-toe or might be some other game altogether. Any of these green arrows pointing up represents a maximizing state: I would like the score to be as big as possible. And any of these red arrows pointing down are minimizing states, where the player is the min player and is trying to make the score as small as possible. So imagine in this situation that I am the maximizing player, this player here, and I have three choices. One choice gives me a score of 5, one choice gives me a score of 3, and one choice gives me a score of 9. Well, then, between those three choices, my best option is to choose this 9 over here, the largest score out of all three options. So I can give this state a value of 9, because among my three options, that is the best choice available to me. That’s my decision if I’m, say, one move away from the end of the game. But then you could also ask a reasonable question: what might my opponent do two moves away from the end of the game? My opponent is the minimizing player, trying to make the score as small as possible. Imagine what would happen if they had to pick which choice to make. One choice leads to this state, where I, the maximizing player, am going to opt for the 9, the biggest score I can get, and one leads to this state, where I, the maximizing player, would choose the 8, which is the largest score I can get there. Now the minimizing player, forced to choose between a 9 and an 8, is going to choose the smallest possible score, which in this case is the 8. And that is how this process would unfold: the minimizing player, in this case, considers both of their options, and then all of the options that would happen as a result. So this is a general picture of what the Minimax algorithm looks like. Let’s now try to formalize it using a little bit of pseudocode. What exactly is happening in the Minimax algorithm? Well, given a state s, if it’s the max player’s turn, then max is going to pick an action a in actions(s). Recall that actions is a function that takes a state and gives me back all of the possible actions I can take; it tells me all of the moves that are possible. The max player is going to specifically pick the action a, in this set of actions, that gives the highest value of min_value(result(s, a)). So what does that mean? Well, it means that I want to take the option that gives me the highest score out of all of the actions a. But what score will each action have? To calculate that, I need to know what my opponent, the min player, is going to do when they try to minimize the value of the state that results. So I ask: what state results after I take this action, and what happens when the min player tries to minimize the value of that state? I consider that for all of my possible options, and after I’ve considered it for all of my possible options, I pick the action a that has the highest value. Likewise, the min player is going to do the same thing, but backwards: they’re also going to consider all of the possible actions they can take if it’s their turn, and they’re going to pick the action a that has the smallest possible value of all the options.
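Collecting that pseudocode into Python, using the game-interface functions sketched earlier and assuming, as in the lecture, that X is the max player; the next two paragraphs walk through the max_value and min_value halves step by step:

def minimax(state):
    # Return the optimal action for whichever player moves in state.
    if terminal(state):
        return None
    if player(state) == X:   # max player: highest min_value(result(s, a))
        return max(actions(state), key=lambda a: min_value(result(state, a)))
    else:                    # min player: lowest max_value(result(s, a))
        return min(actions(state), key=lambda a: max_value(result(state, a)))

def max_value(state):
    # Value of state if the max player moves optimally from here.
    if terminal(state):
        return utility(state)
    v = float("-inf")        # start as low as possible; any action beats this
    for action in actions(state):
        v = max(v, min_value(result(state, action)))
    return v

def min_value(state):
    # Value of state if the min player moves optimally from here.
    if terminal(state):
        return utility(state)
    v = float("inf")         # start as high as possible
    for action in actions(state):
        v = min(v, max_value(result(state, action)))
    return v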
And the way they know the smallest possible value of all the options is by considering what the max player is going to do: asking, what’s the result of applying this action to the current state, and what value would the max player calculate for that particular state? So everyone makes their decision based on trying to estimate what the other person would do. And now we need to turn our attention to these two functions, max_value and min_value. How do you actually calculate the value of a state if you’re trying to maximize its value, and how do you calculate the value of a state if you’re trying to minimize it? If we can do that, then we have an entire implementation of this Minimax algorithm. So let’s try it. Let’s try to implement this max_value function, which takes a state and returns as output the value of that state if I’m trying to maximize it. Well, the first thing I can check is whether the game is over, because if the game is over, in other words, if the state is a terminal state, then this is easy. I already have this utility function that tells me what the value of the board is: if the game is over, I just check, did X win, did O win, is it a tie, and the utility function knows what the value of the state is. What’s trickier is if the game isn’t over, because then I need to do this recursive reasoning about what my opponent is going to do on the next move. I want to calculate the value of this state, and I want the value of the state to be as high as possible. I’ll keep track of that value in a variable called v. And if I want the value to be as high as possible, I need to give v an initial value, and initially I’ll just go ahead and set it to be as low as possible, because I don’t know yet what options are available to me. So initially, I’ll set v equal to negative infinity, which seems a little bit strange, but the idea is that I want the value initially to be as low as possible, because as I consider my actions, I’m always going to try to do better than v, and if I set v to negative infinity, I know I can always do better than that. So now I consider my actions, and this is going to be some kind of loop, where for every action in actions(state), recalling that actions is a function that takes my state and gives me all of the possible actions I can use in that state, I update v and say: v is going to be equal to the maximum of v and this expression. So what is this expression? Well, first, get the result of taking the action in the state, and then get the min_value of that. In other words, I find out, from that resulting state, the best that the min player can do, because they’re going to try to minimize the score. Whatever that resulting score is, I compare it to my current best value and pick the maximum of the two, because I am trying to maximize the value. In short, what these three lines of code are doing is going through all of my possible actions and asking the question: how do I maximize the score, given what my opponent is going to try to do? After this entire loop, I can just return v, and that is now the value of that particular state. And for the min player, it’s the exact opposite, the same logic just backwards. To calculate the minimum value of a state, first we check if it’s a terminal state. If it is, we return its utility.
Otherwise, we’re going to try to minimize the value of the state, given all of my possible actions. So I need an initial value for v, the value of the state, and initially I’ll set it to infinity, because I know I can always get something less than infinity. By starting with v equal to infinity, I make sure that the very first value I find will be less than this v. Then I do the same thing: loop over all of my possible actions, and for each of the results we could get when the max player makes their decision, take the minimum of that and the current value of v. So after all is said and done, I get the smallest possible value of v, which I then return. So that, in effect, is the pseudocode for Minimax. That is how we take a game and figure out what the best move to make is: by recursively using these max_value and min_value functions, where max_value calls min_value, min_value calls max_value, back and forth, all the way until we reach a terminal state, at which point our algorithm can simply return the utility of that particular state. What you might imagine is that this is going to start to be a long process, especially as games get more complex, as we start to add more moves and more possible options, and as games last quite a bit longer. So the next question to ask is: what sort of optimizations can we make here? How can we do better, use less space, or take less time to solve this kind of problem? We’ll take a look at a couple of possible optimizations. For one, we’ll take a look at this example. Again, returning to these up arrows and down arrows, let’s imagine that I am now the max player, this green arrow. I am trying to make this score as high as possible, and this is an easy game where there are just two moves: I make a move, one of these three options, and then my opponent makes a move, one of these three options, based on what move I made, and as a result, we get some value. Let’s look at the order in which I do these calculations and figure out if there are any optimizations I might be able to make to this calculation process. I’m going to have to look at these states one at a time. So let’s say I start here on the left and ask: all right, what will the min player, my opponent, try to do here? Well, the min player is going to look at all three of their possible actions and look at their values, because these are terminal states; they’re the end of the game. And so they’ll see: all right, this node has a value of 4, this one a value of 8, this one a value of 5. And the min player is going to say, well, all right, between these three options, 4, 8, and 5, I’ll take the smallest one; I’ll take the 4. So this state now has a value of 4. Then I, as the max player, say: all right, if I take this action, it will have a value of 4. That’s the best I can do, because the min player is going to try to minimize my score. So now, what if I take this other option? We’ll explore that next, considering what the min player would do if I choose this action. And the min player is going to say: all right, what are the three options? The min player has options between 9, 3, and 7, and 3 is the smallest among 9, 3, and 7, so we’ll go ahead and say this state has a value of 3. So now I, as the max player, have explored two of my three options. I know that one of my options will guarantee me a score of 4, at least.
And one of my options will guarantee me a score of 3. And now I consider my third option and say: all right, what happens here? Same exact logic: the min player is going to look at these three states, 2, 4, and 6, and the minimum possible option is 2, so the min player wants the 2. Now I, as the max player, have calculated all of the information by looking two layers deep, by looking at all of these nodes, and I can say: between the 4, the 3, and the 2, you know what? I’d rather take the 4, because if I choose this option and my opponent plays optimally, they will get me to the 4, and that’s the best I can do. I can’t guarantee a higher score, because if I pick either of the other two options, I might get a 3 or I might get a 2. And it’s true that down here is a 9, the highest score out of any of these scores, so I might be tempted to say: you know what, maybe I should take this option, because I might get the 9. But if the min player is playing intelligently, making the best move at each choice they get to make, I’ll be left with a 3, whereas by playing optimally I could have guaranteed that I would get the 4. So that is, in effect, the logic I would use as the max player in Minimax, trying to maximize my score from that node there. But it turns out it took quite a bit of computation for me to figure that out; I had to reason through all of these nodes in order to draw this conclusion. And this is for a pretty simple game, where I have three choices, my opponent has three choices, and then the game is over. So what I’d like to do is come up with some way to optimize this. Maybe I don’t need to do all of this calculation to still reach the conclusion that, you know what, this action to the left is the best that I can do. Let’s go ahead and try again, and try to be a little more intelligent about how I go about this. So first, I start the exact same way. I don’t know what to do initially, so I just have to consider one of the options and consider what the min player might do. Min has three options: 4, 8, and 5. And between those three options, min says 4 is the best they can do, because they want to minimize the score. Now I, the max player, will consider my second option, making this move here, and consider what my opponent would do in response. What will the min player do? Well, the min player is going to look at their options from that state, and they’d say: all right, 9 is an option, 3 is an option. And if I am doing the math from this initial state, doing all this calculation, when I see a 3, that should immediately be a red flag for me. Because when I see a 3 down here at this state, I know that the value of this state is going to be at most 3; it’s going to be 3 or something less than 3, even though I haven’t yet looked at this last action, or even further actions if more actions could be taken here. How do I know that? Well, I know that the min player is going to try to minimize my score, and if they see a 3, the only way this could be something other than a 3 is if the remaining thing that I haven’t yet looked at is less than 3, which means there is no way for this value to be anything more than 3, because the min player can already guarantee a 3, and they are trying to minimize my score. So what does that tell me?
Well, it tells me that if I choose this action, my score is going to be three, or maybe even less than three if I’m unlucky. But I already know that this other action will guarantee me a four. And given that this action guarantees me a score of four and this action means I can’t do better than three, if I’m trying to maximize my options, there is no need for me to consider this triangle here. There is no value, no number that could go here, that would change my mind between these two options. I’m always going to opt for the path that gets me a four, as opposed to the path where the best I can do is a three if my opponent plays optimally. And this is going to be true for all the future states I look at, too. If I look over here at what the min player might do, and I see that this state is a two, I know that this state is at most a two, because the only way this value could be something other than two is if one of these remaining states is less than two, in which case the min player would opt for that instead. So even without looking at these remaining states, I, as the maximizing player, can know that choosing this path to the left is going to be better than choosing either of the two paths to the right, because this one can’t be better than three and this one can’t be better than two. So four, in this case, is the best I can do, and I can say now that this state has a value of four. In order to do this type of calculation, I was doing a little bit more bookkeeping, keeping track all the time of the best I can do and the worst I can do, and for each of these states saying: if I already know I can get a four, and the best I can do at this state is a three, there’s no reason for me to consider it. I can effectively prune this leaf, and anything below it, from the tree. And it’s for that reason that this optimization to Minimax is called alpha-beta pruning. Alpha and beta stand for the two values you have to keep track of: the best you can do so far and the worst you can do so far. And pruning is the idea that if I have a big, long, deep search tree, I might be able to search it more efficiently if I don’t need to search through everything, if I can remove some of the nodes to optimize the way I look through this entire search space. So alpha-beta pruning can definitely save us a lot of time by making our searches more efficient. But even then, it’s still not great as games get more complex. Tic-tac-toe, fortunately, is a relatively simple game, and we might reasonably ask: how many total possible tic-tac-toe games are there? You can try to estimate: how many moves are there at any given point, and how many moves long can the game last? It turns out there are about 255,000 possible tic-tac-toe games that can be played. But compare that to a more complex game like chess: far more pieces, far more moves, games that last much longer. How many total possible chess games could there be? It turns out that after just four moves each, four by the white player and four by the black player, there are 288 billion possible chess games that can result.
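A sketch of alpha-beta pruning layered onto the same toy-tree Minimax as before: alpha tracks the best value the max player can already guarantee, beta the best (lowest) value the min player can. The nested-list representation is, again, an illustrative simplification.

```python
def max_value(state, alpha=float("-inf"), beta=float("inf")):
    if isinstance(state, (int, float)):
        return state
    v = float("-inf")
    for child in state:
        v = max(v, min_value(child, alpha, beta))
        if v >= beta:
            return v          # min player will avoid this branch: prune
        alpha = max(alpha, v)
    return v

def min_value(state, alpha=float("-inf"), beta=float("inf")):
    if isinstance(state, (int, float)):
        return state
    v = float("inf")
    for child in state:
        v = min(v, max_value(child, alpha, beta))
        if v <= alpha:
            return v          # max player already has something better: prune
        beta = min(beta, v)
    return v

tree = [[4, 8, 5], [9, 3, 7], [2, 4, 6]]
print(max_value(tree))  # still 4, but the 7, and the 4 and 6, are never visited
```

In the middle branch, the 3 triggers the cutoff described above (min can already guarantee at most 3, and max already has a 4), and the right branch is cut off as soon as the 2 is seen.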
And going even further, if you look at entire chess games, there are more than 10 to the 29,000 possible chess games, far more than could ever be considered. And this is a pretty big problem for the Minimax algorithm, because Minimax starts with an initial state and considers all the possible actions, and all the possible actions after that, all the way until we get to the end of the game. That’s going to be a problem if the computer needs to look through this many states, which is far more than any computer could ever do in any reasonable amount of time. So what do we do? Instead of looking through all these states, which is totally intractable for a computer, we need some better approach. And that better approach generally takes the form of something called depth-limited Minimax. Normally, Minimax is depth-unlimited: we just keep going layer after layer, move after move, until we get to the end of the game. Depth-limited Minimax instead says: after a certain number of moves, maybe 10 moves ahead, maybe 12, I’m going to stop and not consider additional moves that might come after that, just because it would be computationally intractable to consider all of those possible options. But what do we do after we get 10 or 12 moves deep and arrive at a situation where the game’s not over? Minimax still needs a way to assign a score to that game state to figure out its current value, which is easy to do if the game is over, but not so easy if the game is not yet over. So we need to add one additional feature to depth-limited Minimax called an evaluation function, some function that estimates the expected utility of a game from a given state. In a game like chess, if a value of 1 means white wins, negative 1 means black wins, and 0 means a draw, then you might imagine that a score of 0.8 means white is very likely to win, though certainly not guaranteed. The evaluation function estimates how good the game state happens to be, and how good that evaluation function is is ultimately what constrains how good the AI is. The better the AI is at estimating how good or bad any particular game state is, the better it will be able to play. If the evaluation function is worse at estimating the expected utility, then playing well is going to be a whole lot harder. And you can imagine trying to come up with these evaluation functions. In chess, for example, you might write an evaluation function based on how many pieces you have compared to how many pieces your opponent has, because each piece has a value. But your evaluation function probably needs to be a bit more complicated than that to consider other situations that might arise as well. There are many other variants on Minimax that add additional features to help it perform better under these larger, more computationally intractable situations where we couldn’t possibly explore all the possible moves. So we need to figure out how to use evaluation functions and other techniques to be able to play these games better.
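A sketch of depth-limited Minimax in the same toy representation. The evaluate function here is a hypothetical stand-in heuristic (it just averages the leaves below a state); a real one would, as described above, score the position itself, for example by counting material in chess.

```python
def evaluate(state):
    # Hypothetical heuristic: average the leaves below this state as a
    # crude estimate of its expected utility. Not a serious evaluator.
    if isinstance(state, (int, float)):
        return state
    values = [evaluate(child) for child in state]
    return sum(values) / len(values)

def max_value(state, depth):
    if isinstance(state, (int, float)):
        return state                 # true utility at a terminal state
    if depth == 0:
        return evaluate(state)       # out of search budget: estimate instead
    return max(min_value(c, depth - 1) for c in state)

def min_value(state, depth):
    if isinstance(state, (int, float)):
        return state
    if depth == 0:
        return evaluate(state)
    return min(max_value(c, depth - 1) for c in state)

tree = [[4, 8, 5], [9, 3, 7], [2, 4, 6]]
print(max_value(tree, 1))  # ~6.33 here, versus the exact 4 from full search
```

The gap between the depth-1 answer and the exact answer illustrates the point above: the quality of the evaluation function directly constrains the quality of play.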
But this was a look at adversarial search, search problems where I am trying to play against some sort of opponent. These search problems show up all over the place throughout artificial intelligence. We’ve been talking a lot today about more classical search problems, like trying to find directions from one location to another. But any time an AI is faced with trying to make a decision, like what to do now in order to act rationally or intelligently, or trying to play a game and figure out what move to make, these sorts of algorithms can really come in handy. It turns out that for tic-tac-toe, the solution is pretty simple, because it’s a small game. XKCD has famously put together a web comic that tells you exactly the optimal move to make no matter what your opponent happens to do. That sort of thing is not feasible for a much larger game like checkers or chess, where chess is totally computationally intractable for most computers to explore all the possible states. So we really need our AI to be far more intelligent about how it goes about dealing with these problems, taking the environment it finds itself in and ultimately searching for one of these solutions. So this, then, was a look at search in artificial intelligence. Next time, we’ll take a look at knowledge: how it is that our AIs are able to know information, reason about that information, and draw conclusions, all in our look at AI and the principles behind it. We’ll see you next time. [INTRO MUSIC] All right, welcome back, everyone, to an introduction to artificial intelligence with Python. Last time, we took a look at search problems, in particular where we have AI agents trying to solve some sort of problem by taking actions in some sort of environment, whether those actions are playing moves in a game or figuring out where to make turns to get driving directions from point A to point B. This time, we’re going to turn our attention more generally to the idea of knowledge, the idea that a lot of intelligence is based on knowledge, especially if we think about human intelligence. People know information, facts about the world, and using that information, we’re able to draw conclusions, reasoning about what we know in order to figure out how to do something, or to figure out some other piece of information that we can conclude based on the information we already have. What we’d like to focus on now is the ability to take this idea of knowledge, and reasoning based on knowledge, and apply it to artificial intelligence. In particular, we’re going to be building what are known as knowledge-based agents: agents that are able to reason and act by representing knowledge internally. Somehow, inside of our AI, it has some understanding of what it means to know something, and ideally some algorithms or techniques it can use based on that knowledge to figure out the solution to a problem, or some additional piece of information that can be helpful in some sense. So what do we mean by reasoning based on knowledge to be able to draw conclusions?
Well, let’s look at a simple example drawn from the world of Harry Potter. We take one sentence that we know to be true: if it didn’t rain, then Harry visited Hagrid today. That’s one fact we might know about the world. And then we take another fact: Harry visited Hagrid or Dumbledore today, but not both. So that tells us that Harry either visited Hagrid but not Dumbledore, or visited Dumbledore but not Hagrid. And now we have a third piece of information: Harry visited Dumbledore today. So we now have three facts inside of a knowledge base, so to speak, information that we know. And we, as humans, can try to reason about this and figure out what additional information we can conclude. Looking at the last two statements, Harry either visited Hagrid or Dumbledore but not both, and Harry visited Dumbledore today, it’s pretty reasonable to conclude that Harry must not have visited Hagrid today. Based on a combination of those two statements, we can draw this inference, a conclusion that Harry did not visit Hagrid today. But it turns out we can do even better than that, and get some more information, by taking a look at the first statement and reasoning about it. The first statement says: if it didn’t rain, then Harry visited Hagrid today. What does that mean? In all cases where it didn’t rain, we know that Harry visited Hagrid. But if we also now know that Harry did not visit Hagrid, then that tells us something about our initial premise. In particular, it tells us that it did rain today, because if it hadn’t rained, Harry would have visited Hagrid, and we know for a fact that Harry did not visit Hagrid today. It’s this kind of logical reasoning, using logic based on the information we know in order to reach conclusions, that is going to be the focus of what we’re talking about today. How can we make our artificial intelligence logical, so that it can perform the same kinds of deduction, the same kinds of reasoning, that we’ve been doing so far? Of course, humans generally reason about logic in terms of human language: I was just now speaking in English about these sentences and reasoning through how they relate to one another. We’re going to need to be a little more formal when we turn our attention to computers and encode this notion of logic, and of truth and falsehood, inside of a machine. So we’re going to introduce a few terms and a few symbols that will help us reason through this idea of logic inside of an artificial intelligence. And we’ll begin with the idea of a sentence. Now, a sentence in a natural language like English is just something that I’m saying, like what I’m saying right now. In the context of AI, though, a sentence is an assertion about the world in what we’re going to call a knowledge representation language, some way of representing knowledge inside of our computers. And the way we’re going to spend most of today reasoning about knowledge is through a type of logic known as propositional logic. There are a number of different types of logic, some of which we’ll touch on.
But propositional logic is based on a logic of propositions, or just statements about the world. And so we begin in propositional logic with a notion of propositional symbols. We will have certain symbols, oftentimes just letters like P or Q or R, where each symbol is going to represent some fact or sentence about the world. So P, for example, might represent the fact that it is raining, and Q might represent the fact that Harry visited Hagrid today. Each of these propositional symbols represents some sentence or fact about the world. But in addition to individual facts about the world, we want some way to connect these propositional symbols together in order to reason more complexly about other facts that might exist inside of the world in which we’re reasoning. To do that, we’ll need to introduce some additional symbols known as logical connectives. There are a number of these logical connectives, but five of the most important, and the ones we’re going to focus on today, are these five up here, each represented by a logical symbol: not (¬), and (∧, sort of an upside-down V), or (∨, a V shape), implication (an arrow, →), and the biconditional (a double arrow, ↔). We’ll talk about what implication and the biconditional mean in just a moment. These five logical connectives are the main ones we’re going to focus on in terms of thinking about how a computer can reason about facts and draw conclusions based on the facts it knows. But to get there, we need to take a look at each of these logical connectives and build up an understanding of what they actually mean. So let’s begin with the not symbol. What we’re going to show for each of these logical connectives is what’s called a truth table, a table that demonstrates what this word not means when we attach it to a propositional symbol, or any sentence inside of our logical language. And the truth table for not is shown right here. If P, some propositional symbol or even some other sentence, is false, then not P is true. And if P is true, then not P is false. So placing this not symbol in front of some sentence of propositional logic just says the opposite of that sentence. If, for example, P represented “it is raining”, then not P would represent the idea that it is not raining. And as you might expect, if P is false, meaning the sentence “it is raining” is false, then the sentence not P must be true: the sentence “it is not raining” is therefore true. So not just takes whatever is in P and inverts it, turning false into true and true into false, much as the English word “not” takes whatever comes after it and inverts it to mean the opposite. Next up, and also very English-like, is this idea of and, represented by this upside-down V shape. As opposed to taking a single argument the way not does (we have P, and we have not P), and is going to combine two different sentences in propositional logic together. So I might have one sentence P and another sentence Q, and I want to combine them together to say P and Q.
And the general logic for what P and Q means is that both of its operands are true: P is true and also Q is true. Here’s what that truth table looks like. This time we have two variables, P and Q, and when we have two variables, each of which can be in two possible states, true or false, that leads to two squared, or four, possible combinations of truth and falsehood: P is false and Q is false; P is false and Q is true; P is true and Q is false; and P and Q both are true. Those are the only four possibilities. And in each of those situations, this third column, P and Q, tells us what it actually means for P and Q to be true. We see that the only case where P and Q is true is in the fourth row, where P happens to be true and Q also happens to be true. In all other situations, P and Q is going to evaluate to false. So this, again, is much in line with our intuition of what and might mean: if I say P and Q, I probably mean that I expect both P and Q to be true. Next up, also potentially consistent with what we mean in English, is the word or, represented by this V shape, sort of an upside-down and symbol. And or, as the name might suggest, is true if either of its arguments is true: as long as P is true or Q is true, then P or Q is going to be true. Which means the only time P or Q is false is if both of its operands are false: if P is false and Q is false, then P or Q is going to be false. In all other cases, at least one of the operands is true, maybe both, in which case P or Q is going to evaluate to true. Now, this is mostly consistent with the way most people might use the word or in normal English, though sometimes when we say or we mean P or Q but not both, where it can only be one or the other. It’s important to note that this symbol here means P or Q or both; those are totally OK. As long as either or both of them are true, then the or is going to evaluate to true. It’s only in the case where all of the operands are false that P or Q ultimately evaluates to false. In logic, there’s another symbol known as the exclusive or, which encodes this idea of exclusivity, one or the other but not both, but we’re not going to be focusing on that today. Whenever we talk about or, we’re always talking about either or both, as represented by this truth table here. So that now is not, and, and or. Next up is what we might call implication, as denoted by this arrow symbol. So we have P and Q, and this sentence will generally be read as P implies Q. And what P implies Q means is that if P is true, then Q is also true. So I might say something like: if it is raining, then I will be indoors. Meaning “it is raining” implies “I will be indoors” as the logical sentence I’m saying there. And the truth table for this can sometimes be a little bit tricky. Obviously, if P is true and Q is true, then P implies Q is true; that definitely makes sense. And it should also stand to reason that when P is true and Q is false, then P implies Q is false. Because if I said to you, if it is raining, then I will be indoors, and it is raining but I’m not indoors, well, then it would seem that my original statement was not true. P implies Q means that if P is true, then Q also needs to be true.
And if it’s not, then the statement is false. What’s also worth noting, though, is what happens when P is false. When P is false, the implication makes no claim at all. If I say something like, if it is raining, then I will be indoors, and it turns out it’s not raining, then I am not making any statement as to whether or not I will be indoors. P implies Q just means that if P is true, Q must be true; if P is not true, we make no claim about whether Q is true at all. So if P is false, it doesn’t matter what Q is, false or true; we’re not making any claim about Q whatsoever, and we can still evaluate the implication to true. The only way the implication is ever false is if our premise, P, is true, but the conclusion we’re drawing, Q, happens to be false. In that case, we would say P does not imply Q. Finally, the last connective we’ll discuss is the biconditional. You can think of a biconditional as a condition that goes in both directions. Originally, when I said something like, if it is raining, then I will be indoors, I didn’t say what would happen if it wasn’t raining: maybe I’ll be indoors, maybe I’ll be outdoors. The biconditional you can read as an “if and only if”. So I can say, I will be indoors if and only if it is raining, meaning if it is raining, then I will be indoors, and if I am indoors, it’s reasonable to conclude that it is also raining. So the biconditional is only true when P and Q are the same. If P is true and Q is true, then the biconditional is also true: P implies Q, but the reverse is also true, Q implies P. And if P and Q both happen to be false, we would still say the biconditional is true. But in either of the other two situations, P if and only if Q is going to evaluate to false. So a lot of trues and falses going on there, but these five basic logical connectives are going to form the core of the language of propositional logic, the language we’re going to use to describe ideas and to reason about those ideas in order to draw conclusions. So let’s now take a look at some of the additional terms we’ll need to know about in order to form this language of propositional logic and write AI that’s actually able to understand this sort of logic. The next thing we’re going to need is the notion of what is actually true about the world. We have a whole bunch of propositional symbols, P and Q and R and maybe others, but we need some way of knowing what actually is true in the world. Is P true or false? Is Q true or false? And so on. To do that, we’ll introduce the notion of a model. A model just assigns a truth value, either true or false, to every propositional symbol. In other words, it creates what we might call a possible world. Let me give an example. If I have two propositional symbols, P is “it is raining” and Q is “it is a Tuesday”, a model takes each of these two symbols and assigns a truth value to it, either true or false. So here’s a sample model. In this model, in other words in this possible world, it is possible that P is true, meaning it is raining, and Q is false, meaning it is not a Tuesday. But there are other possible worlds, or other models, as well: some model where both of these variables are true, some model where both of these variables are false.
In fact, if there are n propositional symbols like this, each either true or false, then the number of possible models is 2 to the n, because each of these variables within my model could be set to either true or false if I don’t know any information about it. So now that I have the symbols and the connectives I’m going to need in order to construct these pieces of knowledge, we need some way to represent that knowledge. And to do so, we’re going to allow our AI access to what we’ll call a knowledge base. A knowledge base is really just a set of sentences that our AI knows to be true, some set of sentences in propositional logic that are things our AI knows about the world. So we might tell our AI some information about a situation it finds itself in, or about a problem it happens to be trying to solve, and the AI would store that information inside of its knowledge base. What happens next is that the AI would like to use the information in the knowledge base to draw conclusions about the rest of the world. And what do those conclusions look like? To understand them, we’ll need to introduce one more idea, one more symbol, and that is the notion of entailment. So this sentence here, with this double turnstile and these Greek letters, the Greek letter alpha and the Greek letter beta, we read as “alpha entails beta”. Alpha and beta here are just sentences in propositional logic. And alpha entails beta means that in every model, in other words in every possible world, in which sentence alpha is true, sentence beta is also true. So if alpha entails beta, it means that if I know alpha to be true, then beta must also be true. If my alpha is something like “I know that it is a Tuesday in January”, then a reasonable beta might be “I know that it is January”, because in all worlds where it is a Tuesday in January, I know for sure that it must be January, just by definition. The first statement entails the second, and we can reasonably use deduction based on the first sentence to figure out that the second sentence is, in fact, true as well. And ultimately, it’s this idea of entailment that we’re going to try to encode into our computer. We want our AI agent to be able to figure out what the possible entailments are. We want our AI to be able to take these three sentences: that if it didn’t rain, Harry visited Hagrid; that Harry visited Hagrid or Dumbledore, but not both; and that Harry visited Dumbledore. And using just that information inside of a knowledge base, we’d like our AI to be able to infer, or figure out, some conclusions. In particular, we can draw the conclusion, one, that Harry did not visit Hagrid today, and the entailment, two, that it did, in fact, rain today. This process is known as inference, the process of deriving new sentences from old ones: I give you these three sentences, you put them in the knowledge base of, say, the AI, and the AI is able to use some sort of inference algorithm to figure out that these two other sentences must also be true. And that is how we define inference.
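To make both the 2-to-the-n models and the connectives concrete, here is a small self-contained sketch that enumerates every model over two symbols and evaluates each connective with ordinary Python booleans. Implication is rewritten as not P or Q, and the biconditional as equality of truth values; this is an illustration, not part of the lecture's logic library.

```python
from itertools import product

# With two symbols, P ("it is raining") and Q ("it is a Tuesday"),
# there are 2**2 = 4 possible models, i.e. possible worlds.
symbols = ["P", "Q"]
for values in product([True, False], repeat=len(symbols)):
    model = dict(zip(symbols, values))
    p, q = model["P"], model["Q"]
    print(model,
          "| not P:", not p,
          "| and:", p and q,
          "| or:", p or q,
          "| implies:", (not p) or q,   # P -> Q
          "| iff:", p == q)             # P <-> Q
```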
So let’s take a look at an inference example to see how we might go about inferring things in a human sense before we take a more algorithmic approach and see how we could encode this idea of inference in AI. And we’ll see there are a number of ways we can actually achieve this. Again, we’ll deal with a couple of propositional symbols: P, Q, and R. P is “it is a Tuesday”, Q is “it is raining”, and R is “Harry will go for a run”, three propositional symbols that we are just defining to mean this. We’re not saying anything yet about whether they’re true or false; we’re just defining what they are. Now, we’ll give ourselves, or an AI, access to a knowledge base, abbreviated KB, the knowledge that we know about the world. We know this statement, so let’s try to parse it. The parentheses here are just used for precedence, so we can see what associates with what. You would read this as: P and not Q implies R. What does that mean? Let’s take it piece by piece. P is “it is a Tuesday”; Q is “it is raining”, so not Q is “it is not raining”; and R is “Harry will go for a run”. So the way to read this entire sentence in natural language is: if it is a Tuesday and it is not raining, then Harry will go for a run. And that is now inside of our knowledge base. Let’s now imagine that our knowledge base has two other pieces of information as well. It has the information that P is true, that it is a Tuesday, and the information not Q, that it is not raining, that this sentence Q, “it is raining”, happens to be false. Those are the three sentences we have access to: P and not Q implies R; P; and not Q. Using that information, we should be able to draw some inferences. P and not Q is only true if both P and not Q are true. We know that P is true and we know that not Q is true, so we know that this whole expression is true. And the definition of implication is that if the whole thing on the left is true, then the thing on the right must also be true. So if we know that P and not Q is true, then R must be true as well. The inference we should be able to draw from all of this is that R is true, and we know that Harry will go for a run, by taking the knowledge inside of our knowledge base and reasoning based on it. And so this, ultimately, is the beginning of what we might consider to be some sort of inference algorithm, some process we can use to try to figure out whether or not we can draw some conclusion. Ultimately, what these inference algorithms are going to answer is the central question about entailment: given some query about the world, something we’re wondering about, which we’ll call alpha, does KB, our knowledge base, entail alpha? In other words, using only the information inside of our knowledge base, can we conclude that this sentence alpha is true? That’s ultimately what we would like to do. So how can we go about writing an algorithm that can look at this knowledge base and figure out whether or not this query alpha is actually true? Well, it turns out there are a couple of different algorithms for doing so, and one of the simplest, perhaps, is known as model checking.
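Before walking through the algorithm in detail, the idea can be sketched in miniature with plain Python: represent each sentence as a function of a model (a dict of truth values), and check that the query holds in every model where the knowledge base holds. This is an illustrative sketch, not the lecture's implementation.

```python
from itertools import product

def entails(kb, query, symbols):
    # KB entails the query iff the query is true in every model
    # (truth assignment) in which the KB itself is true.
    for values in product([True, False], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        if kb(model) and not query(model):
            return False  # a world where KB holds but the query fails
    return True

# The example above: (P and not Q) -> R, together with P and not Q.
def kb(m):
    implication = (not (m["P"] and not m["Q"])) or m["R"]  # rewrite of ->
    return implication and m["P"] and not m["Q"]

print(entails(kb, lambda m: m["R"], ["P", "Q", "R"]))  # True: R follows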
Now, remember that a model is just some assignment of all of the propositional symbols inside of our language to a truth value, true or false. And you can think of a model as a possible world: there are many possible worlds where different things might be true or false, and we can enumerate all of them. The model checking algorithm does exactly that. So what does our model checking algorithm do? Well, if we want to determine whether our knowledge base entails some query alpha, then we are going to enumerate all possible models, in other words, consider all possible assignments of true and false to our variables, all possible states our world can be in. And if, in every model where our knowledge base is true, alpha is also true, then we know that the knowledge base entails alpha. Let’s take a closer look at that sentence and figure out what it means. If we know that in every possible world, no matter what assignment of true and false to variables you give, whenever our knowledge is true, this query alpha is also true, then it stands to reason that as long as our knowledge base is true, alpha must also be true. So this is going to form the foundation of our model checking algorithm: enumerate all of the possible worlds and ask, whenever the knowledge base is true, is alpha true? If that’s the case in every such world, then the knowledge base entails alpha; otherwise, there is no entailment, and our knowledge base does not entail alpha. All right, this is a little bit abstract, so let’s take a look at an example to put real propositional symbols to this idea. Again, we’ll work with the same example: P is “it is a Tuesday”, Q is “it is raining”, R is “Harry will go for a run”. Our knowledge base contains these pieces of information: P and not Q implies R; P, it is a Tuesday; and not Q, it is not raining. And our query, our alpha in this case, the thing we want to ask, is R: we want to know whether it is entailed that Harry will go for a run. So the first step is to enumerate all of the possible models. We have three propositional symbols here, P, Q, and R, which means we have 2 to the third power, or eight, possible models: all three false; false, false, true; false, true, false; and so on, eight possible ways to assign true and false to these symbols. And we might ask, in each one of them, is the knowledge base true? Here is the set of things that we know; in which of these worlds is this knowledge base true? Well, in the knowledge base, we know P, we know it is a Tuesday, which means the first four rows, where P is false, are not going to work for this particular knowledge base; our knowledge base is not true in those worlds. Likewise, we also know not Q, we know that it is not raining, so any of the models where Q is true aren’t going to work either. And finally, we also know that P and not Q implies R, which means that when P is true and Q is false, as in these two rows, then R must be true. If P is true and Q is false but R is also false, that doesn’t satisfy this implication; the implication does not hold under that situation.
So we can conclude under which of these possible worlds our knowledge base is true and under which it is false. And it turns out there is only one possible world where our knowledge base is actually true. In some cases, there might be multiple possible worlds where the knowledge base is true, but in this case it just so happens that there’s only one, one possible world where we can definitively say something about our knowledge base. And in that world, we look at the query. The query is R; R is true in that world, and so, as a result, we can draw that conclusion. And this is the idea of model checking: enumerate all the possible models, and look in those possible models to see whether, whenever our knowledge base is true, the query in question is true as well. So let’s now take a look at how we might actually write this in a programming language like Python, at some actual code that encodes this notion of propositional symbols and logic and these connectives, like and and or and not and implication, and see what that code might look like. I’ve written in advance a logic library that’s more detailed than we need to worry about entirely today. But the important thing is that we have one class for every type of logical symbol or connective that we might have. We have one class for logical symbols, for example, where every symbol stores some name for that particular symbol. We also have a class for Not that takes an operand, so we might say not some symbol, or not some other sentence, to say that it is not true. We have one class for And, one for Or, and so on and so forth. I’ll just demonstrate how this works, and you can take a look at the actual logic.py later on. I’ll go ahead and call this file harry.py; we’re going to store information about this world of Harry Potter, for example. So I’ll import everything from my logic module. In this library, in order to create a symbol, you use Symbol with a capital S. I’ll create a symbol for rain, to mean it is raining; a symbol for Hagrid, to mean Harry visited Hagrid; and another symbol called Dumbledore, for Harry visited Dumbledore. Now, I’d like to save these symbols so that I can use them later as I do some logical analysis, so I’ll save each one of them inside of a variable, like rain, hagrid, and dumbledore, though you could call the variables anything. Now that I have these logical symbols, I can use logical connectives to combine them together. For example, if I have a sentence like And(rain, hagrid), which is not necessarily true, but just for demonstration, I can try to print out sentence.formula(), which is a function I wrote that takes a sentence in propositional logic and prints it out so that we, the programmers, can see it and get an understanding of how it actually works. So if I run python harry.py, what we’ll see is this sentence in propositional logic: rain and Hagrid. This is the logical representation of what we have here in our Python program, an And whose arguments are rain and hagrid. So we’re saying rain and Hagrid by encoding that idea.
And this is quite common in Python object-oriented programming, where you have a number of different classes and you pass arguments into them in order to create a new object, a new And object, for example, to represent this idea. But now what I’d like to do is somehow encode the knowledge that I have about the world, in order to solve that problem from the beginning of class, where we talked about trying to figure out who Harry visited and whether or not it’s raining. So what knowledge do I have? I’ll create a new variable called knowledge. Well, the very first sentence we talked about was the idea that if it is not raining, then Harry will visit Hagrid. How do I encode the idea that it is not raining? I can use Not and then the rain symbol. And the implication is that if it is not raining, then Harry visited Hagrid, so I’ll wrap this inside of an Implication: the premise, the first argument to the implication, is that it’s not raining, and if it is not raining, then Harry visited Hagrid. And I can print out knowledge.formula() to see the logical formula equivalent of that same idea. So I run python harry.py, and this is the logical formula we see as a result, a text-based version of what we were looking at before: if it is not raining, then that implies that Harry visited Hagrid. But there was additional information we had access to as well. In this case, we had access to the fact that Harry visited either Hagrid or Dumbledore. How do I encode that? Well, this means that in my knowledge, I’ve really got multiple pieces of knowledge going on: I know one thing and another thing and another thing. So I’ll wrap all of my knowledge inside of an And, and I’ll move things onto new lines just for good measure. I’m saying knowledge is an And of multiple different sentences, multiple sentences I know to be true. One such sentence is this implication, that if it is not raining, then Harry visited Hagrid. Another such sentence is Or(hagrid, dumbledore); in other words, Hagrid or Dumbledore is true, because I know that Harry visited Hagrid or Dumbledore. But I actually know more than that: the initial sentence from before said that Harry visited Hagrid or Dumbledore, but not both. So now I want a sentence that encodes the idea that Harry didn’t visit both Hagrid and Dumbledore. The notion of Harry visiting both would be represented as And(hagrid, dumbledore), and if I want to say that that is not true, I’ll wrap the whole thing inside of a Not. So now these three lines: line 8 says that if it is not raining, then Harry visited Hagrid; line 9 says Harry visited Hagrid or Dumbledore; and line 10 says Harry didn’t visit both Hagrid and Dumbledore, that it is not true that both the hagrid symbol and the dumbledore symbol are true; only one of them can be true. And finally, the last piece of information I knew was the fact that Harry visited Dumbledore. So these now are the pieces of knowledge that I know, one sentence and another and another and another. And I can print out what I know, just to see it a little more visually.
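Collected together, the harry.py built up over these steps might look like the following sketch, assuming the logic module exposes Symbol, Not, And, Or, Implication, and a formula() method as described in the walkthrough:

```python
from logic import *  # the lecture's logic module (logic.py)

rain = Symbol("rain")              # it is raining
hagrid = Symbol("hagrid")          # Harry visited Hagrid
dumbledore = Symbol("dumbledore")  # Harry visited Dumbledore

knowledge = And(
    Implication(Not(rain), hagrid),  # if not raining, Harry visited Hagrid
    Or(hagrid, dumbledore),          # Hagrid or Dumbledore...
    Not(And(hagrid, dumbledore)),    # ...but not both
    dumbledore                       # Harry visited Dumbledore
)

print(knowledge.formula())
```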
And here now is a logical representation of the information that my computer is internally representing using these various Python objects. Again, take a look at logic.py if you want to see exactly how it’s implemented, but there’s no need to worry too much about all of the details there. We’re saying that if it is not raining, then Harry visited Hagrid; that Hagrid or Dumbledore is true; that it is not the case that both Hagrid and Dumbledore are true; and that Dumbledore is true. So this long logical sentence represents our knowledge base; it is the thing that we know. And now what we’d like to do is use model checking to ask a query, a question like: based on this information, do I know whether or not it’s raining? We as humans were able to logic our way through it and figure out that, based on these sentences, it must have been raining. But now we’d like the computer to do that as well. So let’s take a look at the model checking algorithm, which is going to follow the same pattern we drew out in pseudocode a moment ago. I’ve defined a function in logic.py that you can take a look at, called model_check. Model check takes two arguments: the knowledge that I already know, and the query. And the idea is that in order to do model checking, I need to enumerate all of the possible models, and for each of the possible models, ask: is the knowledge base true, and is the query true? So the first thing I need to do is enumerate all of the possible models, meaning for all the symbols that exist, I need to assign true and false to each one of them and see whether the entailment still holds. And here’s the way we’re going to do that. This function starts by getting all of the symbols in both the knowledge and the query, figuring out which symbols I’m dealing with. In this case, the symbols are rain and hagrid and dumbledore, but there might be other symbols depending on the problem, and we’ll soon see examples of situations where we need additional symbols to represent the problem. And then we run this check_all function, a helper function that’s basically going to recursively call itself, checking every possible configuration of propositional symbols. So we start by looking at this check_all function. What do we do? “If not symbols” means we’ve finished assigning all of the symbols, we’ve assigned every symbol a value. So far we haven’t done that, but once we have, then we check: in this model, is the knowledge true? That’s what this line is saying. If we evaluate the knowledge’s propositional logic formula using the model’s assignment of truth values and the knowledge is true, then we should return true only if the query is true, because if the knowledge is true, we need the query to be true as well in order for there to be entailment. Otherwise, there won’t be an entailment: if there’s ever a situation where what we know in our knowledge is true, but the query, the thing we’re asking, happens to be false, then there is no entailment.
So this line here is checking that same idea, that in all worlds where the knowledge is true, the query must also be true. Otherwise, we can just return true, because if the knowledge isn’t true, then we don’t care. This is equivalent to when we were enumerating the table a moment ago: in all situations where the knowledge base wasn’t true, all of those seven rows, we didn’t care whether or not our query was true. We only check whether the query is true when the knowledge base is actually true, which was just the one green highlighted row there. So that logic is encoded using that statement there. Otherwise, if we haven’t yet assigned all of the symbols, then the first thing we do is pop one of the symbols. I make a copy of the symbols first, just to preserve the existing set, and then pop one symbol off of the remaining symbols, picking one arbitrarily. And I create one copy of the model where that symbol is true, and a second copy where that symbol is false. So I now have two copies of the model, one where the symbol is true and one where the symbol is false, and I need to make sure that the entailment holds in both of those models. So I recursively call check_all on the model where the symbol is true, and check_all on the model where the symbol is false. Again, you can take a look at that function to get a sense for exactly how this logic is working, but in effect, it’s recursively calling this check_all function again and again. On every level of the recursion, we pick a new symbol that we haven’t yet assigned, assign it to true and assign it to false, and then check that the entailment holds in both cases, because ultimately I need to check every possible world, every combination of symbols with every combination of true and false, to figure out whether the entailment relation actually holds. So that function we’ve written for you. But in order to use that function inside of harry.py, I’ll write something like this: I would like to model check based on the knowledge, and then I provide as a second argument the query, the thing I want to ask, which in this case is: is it raining? So model check takes two arguments: the first is the information that I know, this knowledge, which in this case was given to me at the beginning; and the second, rain, encodes the query, asking: based on this knowledge, do I know for sure that it is raining? And I can print out the result of that. When I run this program, I see that the answer is true: based on this information, I can conclusively say that it is raining, because using this model checking algorithm, we were able to check that in every world where this knowledge is true, it is raining. In other words, there is no world where this knowledge is true and it is not raining. So we can conclude that it is, in fact, raining.
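Pieced together from that description, the model_check function itself might look like this sketch, assuming each sentence object exposes evaluate(model) and symbols() methods as the lecture's logic module is described as doing:

```python
def model_check(knowledge, query):
    """Does the knowledge base entail the query? (A sketch of the
    recursive enumeration described above.)"""

    def check_all(knowledge, query, symbols, model):
        if not symbols:
            # Every symbol now has a truth value. If the KB holds in
            # this model, the query must hold too; if the KB doesn't
            # hold, this world can't be a counterexample.
            if knowledge.evaluate(model):
                return query.evaluate(model)
            return True
        # Pick one unassigned symbol and try it both ways.
        remaining = symbols.copy()
        p = remaining.pop()
        model_true = {**model, p: True}
        model_false = {**model, p: False}
        return (check_all(knowledge, query, remaining, model_true) and
                check_all(knowledge, query, remaining, model_false))

    symbols = set.union(knowledge.symbols(), query.symbols())
    return check_all(knowledge, query, symbols, dict())
```

With the harry.py knowledge base above, print(model_check(knowledge, rain)) would print True, matching the run described here.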
And this sort of logic can be applied to a number of different types of problems: if you’re confronted with a problem where some sort of logical deduction can be used to try to solve it, you might think about what propositional symbols you would need to represent that information, and what statements in propositional logic you might use to encode the information that you know. And this process of taking a problem and figuring out what propositional symbols to use to encode it, how to represent it logically, is known as knowledge engineering. Software engineers and AI engineers will take a problem and try to figure out how to distill it down into knowledge that is representable by a computer. And if we can take any general-purpose problem, some problem that we find in the human world, and turn it into a problem that computers know how to solve by using any number of different variables, then we can take a computer that is able to do something like model checking, or some other inference algorithm, and actually figure out how to solve that problem. So now we’ll take a look at a few examples of knowledge engineering in practice, of taking some problem and figuring out how to apply logical symbols and logical formulas to encode that idea. And we’ll start with a very popular board game in the US and the UK known as Clue. Now, in the game of Clue, there are a number of different factors going on, but the basic premise, if you’ve never played it before, is that there are a number of different people (for now, we’ll just use three: Colonel Mustard, Professor Plum, and Miss Scarlet), a number of different rooms (like a ballroom, a kitchen, and a library), and a number of different weapons (a knife, a revolver, and a wrench). And three of these, one person, one room, and one weapon, are the solution to the mystery: the murderer, the room they were in, and the weapon they used. At the beginning of the game, all these cards are shuffled together, and three of them, one person, one room, and one weapon, are placed into a sealed envelope that we don’t see. We would like to figure out, using some sort of logical process, what’s inside the envelope: which person, which room, and which weapon. And we do so by looking at some, but not all, of the other cards, to try to figure out what might be going on. So this is a very popular game, but let’s now try to formalize it and see if we could train a computer to play this game by reasoning through it logically. To do this, we’ll begin by thinking about what propositional symbols we’re ultimately going to need. Remember, again, that propositional symbols are just symbols, variables, that can be either true or false in the world. In this case, the propositional symbols are really just going to correspond to each of the possible things that could be inside the envelope. Mustard is a propositional symbol that will be true if Colonel Mustard is inside the envelope, if he is the murderer, and false otherwise. And likewise for Plum, for Professor Plum, and Scarlet, for Miss Scarlet, and likewise for each of the rooms and each of the weapons. We have one propositional symbol for each of these ideas.
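In code, mirroring the clue.py walkthrough that follows, those symbols might be declared like this (same assumptions about the logic module's interface; the specific names are illustrative):

```python
from logic import *

mustard = Symbol("ColMustard")
plum = Symbol("ProfPlum")
scarlet = Symbol("MsScarlet")
characters = [mustard, plum, scarlet]

ballroom = Symbol("ballroom")
kitchen = Symbol("kitchen")
library = Symbol("library")
rooms = [ballroom, kitchen, library]

knife = Symbol("knife")
revolver = Symbol("revolver")
wrench = Symbol("wrench")
weapons = [knife, revolver, wrench]
```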
Then, using those propositional symbols, we can begin to create logical sentences, knowledge that we know about the world. For example, we know that someone is the murderer, that one of the three people is, in fact, the murderer. How would we encode that? Well, we don’t know for sure who the murderer is, but we know it is one person or the second person or the third person. So I could say something like: Mustard or Plum or Scarlet. This piece of knowledge encodes that one of these three people is the murderer; we don’t know which, but one of these three things must be true. What other information do we know? Well, we know that one of the rooms must be the room in the envelope: the crime was committed either in the ballroom or the kitchen or the library. Again, we don’t know which, but this is knowledge we know at the outset. And likewise, we can say the same thing about the weapon: it was either the knife or the revolver or the wrench. One of those weapons must have been the weapon of choice and therefore the weapon in the envelope. Then, as the game progresses, people get various cards, and using those cards, you can deduce information. If someone gives you a card, for example, if I have the Professor Plum card in my hand, then I know the Professor Plum card can’t be inside the envelope; I know that Professor Plum is not the criminal. So I know a piece of information like not Plum: this propositional symbol is false. And sometimes I might not know for sure that a particular card is not in the envelope, but someone will make a guess, something like Colonel Mustard in the library with the revolver, and a card might be revealed that I don’t see. If a card was revealed and it is either Colonel Mustard or the library or the revolver, then I know that at least one of them can’t be in the envelope. So I know something like: it is either not Mustard, or it is not the library, or it is not the revolver. Maybe multiple of these are not true, but I know that at least one of Mustard, Library, and Revolver must, in fact, be false. And so this now is a propositional logic representation of this game of Clue, a way of encoding the knowledge inside this game using propositional logic that a computer algorithm, something like the model checking we saw a moment ago, can actually look at and understand. So let’s take a look at some code to see how this might work in practice. All right, I’m now going to open up a file called clue.py, which I’ve started already. What we’ll see here is that I’ve defined a couple of things. I’ve defined some symbols initially: notice I have a symbol for Colonel Mustard, a symbol for Professor Plum, and a symbol for Miss Scarlet, all of which I’ve put inside of this list of characters. I have symbols for Ballroom and Kitchen and Library inside of a list of rooms, and then symbols for Knife and Revolver and Wrench; these are my weapons. So all of these characters and rooms and weapons together, those are my symbols. And now I also have this check_knowledge function, and what check_knowledge does is take my knowledge and try to draw conclusions about what I know.
For example, we’ll loop over all of the possible symbols, and for each one we’ll check: do I know that that symbol is true? A symbol is going to be something like Professor Plum or the knife or the library. And if I know that it is true, in other words, if I know it must be the card in the envelope, then I’m going to print out, using a function called cprint, which prints things in color, the word “yes” in green, just to make it very clear to us. If we’re not sure that the symbol is true, maybe I can check whether I’m sure the symbol is not true, like if I know for sure that it is not Professor Plum, for example. I do that by running model check again, this time checking whether my knowledge entails not the symbol, whether I know for sure that the symbol is false. And if I don’t know for sure that the symbol is false, because I say if not model check, then I’ll print out “maybe” next to the symbol: maybe the symbol is true, maybe it’s not, I don’t actually know. So what knowledge do I actually have? Let’s try to represent it. I know a couple of things, so I’ll put them in an And. I know that one of the three people must be the criminal, so I know Or(mustard, plum, scarlet): this is my way of encoding that it is either Colonel Mustard or Professor Plum or Miss Scarlet. I know that it must have happened in one of the rooms, so I know Or(ballroom, kitchen, library). And I know that one of the weapons must have been used, so I know Or(knife, revolver, wrench). So that might be my initial knowledge: it must have been one of the people, in one of the rooms, with one of the weapons. I can see what that knowledge looks like as a formula by printing out knowledge.formula(). So I’ll run python clue.py, and here now is the information I know in logical format: it is Colonel Mustard or Professor Plum or Miss Scarlet; it is the ballroom, the kitchen, or the library; and it is the knife, the revolver, or the wrench. But I don’t know much more than that; I can’t really draw any firm conclusions. And in fact, we can see that if I run my check_knowledge function on my knowledge, the function I just wrote that loops over all of the symbols and tries to see what conclusions I can draw, it seems that I don’t really know anything for sure. All three people are maybes, all three rooms are maybes, all three weapons are maybes. I don’t know anything for certain just yet. But now let me add some additional information and see if additional knowledge can help us logically reason our way through this process. We are just going to provide the information, and our AI is going to take care of doing the inference and figuring out what conclusions it’s able to draw. So I start with some cards, and those cards tell me something. If I have the Colonel Mustard card, for example, I know that the mustard symbol must be false. In other words, Mustard is not the one in the envelope, is not the criminal.
So I can say — knowledge supports something called add; every And in this library supports .add — which is a way of adding knowledge, or adding an additional logical sentence, to an And clause. So I can say knowledge.add, not Mustard. I happen to know, because I have the Mustard card, that Colonel Mustard is not the suspect. And maybe I have a couple of other cards too. Maybe I also have a card for the kitchen. So I know it’s not the kitchen. And maybe I have another card that says that it is not the revolver. So I have three cards, Colonel Mustard, the kitchen, and the revolver. And I encode that into my AI this way by saying, it’s not Colonel Mustard, it’s not the kitchen, and it’s not the revolver. And I know those to be true. So now, when I rerun clue.py, we’ll see that I’ve been able to eliminate some possibilities. Before, I wasn’t sure if it was the knife or the revolver or the wrench. The knife was a maybe, the revolver was a maybe, the wrench was a maybe. Now I’m down to just the knife and the wrench. Between those two, I don’t know which one it is. They’re both maybes. But I’ve been able to eliminate the revolver, which is one that I know to be false, because I have the revolver card. And so additional information might be acquired over the course of this game. And we would represent that just by adding knowledge to our knowledge set or knowledge base that we’ve been building here. So if, for example, we additionally got the information that someone made a guess, someone guessed like Miss Scarlet in the library with the wrench. And we know that a card was revealed, which means that one of those three cards, either Miss Scarlet or the library or the wrench, one of those at minimum must not be inside of the envelope. So I could add some knowledge, say knowledge.add. And I’m going to add an or clause, because I don’t know for sure which one it’s not, but I know one of them is not in the envelope. So it’s either not Scarlet, or it’s not the library — and Or supports multiple arguments, so I can say it’s also or not the wrench. So at least one of those — Scarlet, the library, and the wrench — at least one of those needs to be false. I don’t know which, though. Maybe it’s multiple. Maybe it’s just one, but at least one I know needs to hold. And so now if I rerun clue.py, I don’t actually have any additional information just yet. Nothing I can say conclusively. I still know that maybe it’s Professor Plum, maybe it’s Miss Scarlet. I haven’t eliminated any options. But let’s imagine that I get some more information, that someone shows me the Professor Plum card, for example. So I say, all right, let’s go back here, knowledge.add, not Plum. So I have the Professor Plum card. I know that Professor Plum is not in the middle. I rerun clue.py. And right now, I’m able to draw some conclusions. Now I’ve been able to eliminate Professor Plum, and the only person left remaining is Miss Scarlet. So I know, yes, Miss Scarlet, this variable must be true. And I’ve been able to infer that based on the information I already had. Now between the ballroom and the library, and the knife and the wrench, I’m still not sure. So let’s add one more piece of information. Let’s say that I know that it’s not the ballroom. Someone has shown me the ballroom card, so I know it’s not the ballroom. Which means at this point, I should be able to conclude that it’s the library. Let’s see. I’ll say knowledge.add, not the ballroom. And we’ll go ahead and run that.
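Put together, the incremental updates described in this passage might look like the following, still a sketch against the same assumed API:

```python
# Cards in my hand: Colonel Mustard, the kitchen, and the revolver
# can't be in the envelope.
knowledge.add(Not(mustard))
knowledge.add(Not(kitchen))
knowledge.add(Not(revolver))

# Someone guessed Scarlet in the library with the wrench, and a card
# was revealed, so at least one of the three is not in the envelope.
knowledge.add(Or(Not(scarlet), Not(library), Not(wrench)))

# Cards shown to me directly later in the game.
knowledge.add(Not(plum))
knowledge.add(Not(ballroom))

check_knowledge(knowledge)  # should now conclude: Scarlet, library, knife
```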
And it turns out that after all of this, not only can I conclude that I know that it’s the library, but I also know that the weapon was the knife. And that might have been an inference that was a little bit trickier, something I wouldn’t have realized immediately, but the AI, via this model checking algorithm, is able to draw that conclusion, that we know for sure that it must be Miss Scarlet in the library with the knife. And how did we know that? Well, we know it from this or clause up here, that we know that it’s either not Scarlet, or it’s not the library, or it’s not the wrench. And given that we know that it is Miss Scarlet, and we know that it is the library, then the only remaining option for the weapon is that it is not the wrench, which means that it must be the knife. So we as humans now can go back and reason through that, even though it might not have been immediately clear. And that’s one of the advantages of using an AI or some sort of algorithm in order to do this, is that the computer can exhaust all of these possibilities and try and figure out what the solution actually should be. And so for that reason, it’s often helpful to be able to represent knowledge in this way. This is knowledge engineering: a situation where we can use a computer to represent knowledge and draw conclusions based on that knowledge. And any time we can translate something into propositional logic symbols like this, this type of approach can be useful. So you might be familiar with logic puzzles, where you have to puzzle your way through trying to figure something out. This is what a classic logic puzzle might look like. Something like: Gilderoy, Minerva, Pomona, and Horace each belong to a different one of the four houses, Gryffindor, Hufflepuff, Ravenclaw, and Slytherin. And then we have some information. Gilderoy belongs to Gryffindor or Ravenclaw, Pomona does not belong to Slytherin, and Minerva belongs to Gryffindor. So we have a couple pieces of information. And using that information, we need to be able to draw some conclusions about which person should be assigned to which house. And again, we can use the exact same idea to try and implement this notion. So we need some propositional symbols. And in this case, the propositional symbols are going to get a little more complex, although we’ll see ways to make this a little bit cleaner later on. But we’ll need 16 propositional symbols, one for each pair of person and house. So we need to say — remember, every propositional symbol is either true or false — Gilderoy Gryffindor is either true or false. Either he’s in Gryffindor or he is not. Likewise, Gilderoy Hufflepuff is also true or false. Either it is true or it’s false. And that’s true for every combination of person and house that we could come up with. We have some sort of propositional symbol for each one of those. Using this type of knowledge, we can then begin to think about what types of logical sentences we can say about the puzzle. Before we even think about the information we were given, we can think about the premise of the problem, that every person is assigned to a different house. So what does that tell us? Well, it tells us sentences like this. It tells us something like Pomona Slytherin implies not Pomona Hufflepuff. Something like if Pomona is in Slytherin, then we know that Pomona is not in Hufflepuff.
And we know this for all four people and for all combinations of houses, that no matter what person you pick, if they’re in one house, then they’re not in some other house. So I’ll probably have a whole bunch of knowledge statements that are of this form, that if we know Pomona is in Slytherin, then we know Pomona is not in Hufflepuff. We were also given the information that each person is in a different house. So I also have pieces of knowledge that look something like this. Minerva Ravenclaw implies not Gilderoy Ravenclaw. If they’re all in different houses, then if Minerva is in Ravenclaw, then we know that Gilderoy is not in Ravenclaw as well. And I have a whole bunch of similar sentences like this that are expressing that idea for other people and other houses as well. And so in addition to sentences of this form, I also have the knowledge that was given to me. Information like Gilderoy was in Gryffindor or in Ravenclaw would be represented like this: Gilderoy Gryffindor or Gilderoy Ravenclaw. And then using these sorts of sentences, I can begin to draw some conclusions about the world. So let’s see an example of this. We’ll go ahead and actually try and implement this logic puzzle to see if we can figure out what the answer is. I’ll go ahead and open up puzzle.py, where I’ve already started to implement this sort of idea. I’ve defined a list of people and a list of houses. And I’ve so far created one symbol for every combination of person and house. That’s what this double for loop is doing, looping over all people, looping over all houses, creating a new symbol for each pair. And then I’ve added some information. I know that every person belongs to a house, so I’ve added the information for every person that person Gryffindor or person Hufflepuff or person Ravenclaw or person Slytherin, that one of those four things must be true. Every person belongs to a house. What other information do I know? I also know that there is only one house per person, so no person belongs to multiple houses. So how does this work? Well, this is going to be true for all people. So I’ll loop over every person. And then I need to loop over all different pairs of houses. The idea is I want to encode the idea that if Minerva is in Gryffindor, then Minerva can’t be in Ravenclaw. So I’ll loop over all houses, h1. And I’ll loop over all houses again, h2. And as long as they’re different, h1 not equal to h2, then I’ll add to my knowledge base this piece of information. That implication, in other words, an if-then: if the person is in h1, then I know that they are not in house h2. So these lines here are encoding the notion that for every person, if they belong to house one, then they are not in house two. And the other piece of logic we need to encode is the idea that every house can only have one person. In other words, if Pomona is in Hufflepuff, then nobody else is allowed to be in Hufflepuff either. And that’s the same logic, but sort of backwards. I loop over all of the houses and loop over all different pairs of people. So I loop over people once, loop over people again, and only do this when the people are different, p1 not equal to p2. And I add the knowledge that, as given by the implication, if person one belongs to the house, then it is not the case that person two belongs to the same house. So here I’m just encoding the knowledge that represents the problem’s constraints. I know that everyone’s in a different house. I know that any person can only belong to one house.
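A sketch of puzzle.py as described here, including the given information and the query loop that come up in the next paragraph (same hypothetical `logic` module as before; treat the exact names as assumptions):

```python
# puzzle.py -- sketch of the encoding described above.
from logic import And, Implication, Not, Or, Symbol, model_check

people = ["Gilderoy", "Pomona", "Minerva", "Horace"]
houses = ["Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin"]

# One propositional symbol per (person, house) pair: 16 in total.
symbols = [Symbol(f"{person}{house}") for person in people for house in houses]

knowledge = And()  # assumed to start empty and grow via .add

# Each person belongs to at least one house.
for person in people:
    knowledge.add(Or(*(Symbol(f"{person}{house}") for house in houses)))

# Only one house per person.
for person in people:
    for h1 in houses:
        for h2 in houses:
            if h1 != h2:
                knowledge.add(Implication(Symbol(f"{person}{h1}"),
                                          Not(Symbol(f"{person}{h2}"))))

# Only one person per house.
for house in houses:
    for p1 in people:
        for p2 in people:
            if p1 != p2:
                knowledge.add(Implication(Symbol(f"{p1}{house}"),
                                          Not(Symbol(f"{p2}{house}"))))

# The given information.
knowledge.add(Or(Symbol("GilderoyGryffindor"), Symbol("GilderoyRavenclaw")))
knowledge.add(Not(Symbol("PomonaSlytherin")))
knowledge.add(Symbol("MinervaGryffindor"))

# Print every symbol the knowledge base entails.
for symbol in symbols:
    if model_check(knowledge, symbol):
        print(symbol)
```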
And I can now take my knowledge and try and print out the information that I happen to know. So I’ll go ahead and print out knowledge.formula, just to see this in action, and I’ll go ahead and skip this for now. But we’ll come back to this in a second. Let’s print out the knowledge that I know by running python puzzle.py. It’s a lot of information, a lot that I have to scroll through, because there are 16 different variables all going on. But the basic idea, if we scroll up to the very top, is I see my initial information. Gilderoy is either in Gryffindor, or Gilderoy is in Hufflepuff, or Gilderoy is in Ravenclaw, or Gilderoy is in Slytherin, and then way more information as well. So this is quite messy, more than we really want to be looking at. And soon, too, we’ll see ways of representing this a little bit more nicely using logic. But for now, we can just say these are the variables that we’re dealing with. And now we’d like to add some information. So the information we’re going to add is Gilderoy is in Gryffindor, or he is in Ravenclaw. So that knowledge was given to us. So I’ll go ahead and say knowledge.add, or Gilderoy Gryffindor, Gilderoy Ravenclaw. One of those two things must be true. I also know that Pomona was not in Slytherin, so I can say knowledge.add not this symbol, not the Pomona Slytherin symbol. And then I can add the knowledge that Minerva is in Gryffindor by adding the symbol Minerva Gryffindor. So those are the pieces of knowledge that I know. And this loop here at the bottom just loops over all of my symbols, checks to see if the knowledge entails that symbol by calling this model check function again. And if it does, if we know the symbol is true, we print out the symbol. So now I can run python puzzle.py, and Python is going to solve this puzzle for me. We’re able to conclude that Gilderoy belongs to Ravenclaw, Pomona belongs to Hufflepuff, Minerva to Gryffindor, and Horace to Slytherin, just by encoding this knowledge inside the computer, although it was quite tedious to do in this case. And as a result, we were able to get the conclusion from that as well. And you can imagine this being applied to many sorts of different deductive situations. So not only these situations where we’re trying to deal with Harry Potter characters in this puzzle, but if you’ve ever played games like Mastermind, where you’re trying to figure out which order different colors go in and trying to make predictions about it. I could tell you, for example, let’s play a simplified version of Mastermind where there are four colors, red, blue, green, and yellow, and they’re in some order, but I’m not telling you what order. You just have to make a guess, and I’ll tell you, of red, blue, green, and yellow, how many of the four you got in the right position. So in a simplified version of this game, you might make a guess like red, blue, green, yellow, and I would tell you something like two of those four are in the correct position, but the other two are not. And then you could reasonably make a guess and say, all right, look at this, blue, red, green, yellow. Try switching two of them around, and this time maybe I tell you, you know what, none of those are in the correct position. And the question then is, all right, what is the correct order of these four colors? And we as humans could begin to reason this through.
All right, well, if none of these were correct, but two of these were correct, well, it must have been because I switched the red and the blue, which means red and blue here must be correct, which means green and yellow are probably not correct. You can begin to do this sort of deductive reasoning. And we can also equivalently try and take this and encode it inside of our computer as well. And it’s going to be very similar to the logic puzzle that we just did a moment ago. So I won’t spend too much time on this code because it is fairly similar. But again, we have a whole bunch of colors and four different positions in which those colors can be. And then we have some additional knowledge. And I encode all of that knowledge. And you can take a look at this code on your own time. But I just want to demonstrate that when we run this code — run python mastermind.py and see what we get — we ultimately are able to compute red in position 0, blue in position 1, yellow in position 2, and green in position 3 as the ordering of those symbols. Now, ultimately, what you might have noticed is that this process was taking quite a long time. And in fact, model checking is not a particularly efficient algorithm, right? What I need to do in order to model check is take all of my possible different variables and enumerate all of the possibilities that they could be in. If I have n variables, I have 2 to the n possible worlds that I need to be looking through in order to perform this model checking algorithm. And this is not tractable, especially as we start to get to much larger and larger sets of data where you have many, many more variables that are at play. Right here, we only have a relatively small number of variables, so this sort of approach can actually work. But as the number of variables increases, model checking becomes a less and less effective way of trying to solve these sorts of problems. So while it might have been OK for something like Mastermind to conclude that this is indeed the correct sequence where all four are in the correct position, what we’d like to do is come up with some better ways to be able to make inferences rather than just enumerate all of the possibilities. And to do so, what we’ll transition to next is the idea of inference rules, some sort of rules that we can apply to take knowledge that already exists and translate it into new forms of knowledge. And the general way we’ll structure an inference rule is by having a horizontal line here. Anything above the line is going to represent a premise, something that we know to be true. And then anything below the line will be the conclusion that we can arrive at after we apply the logic from the inference rule that we’re going to demonstrate. So we’ll do some of these inference rules by demonstrating them in English first, but then translating them into the world of propositional logic so you can see what those inference rules actually look like. So for example, let’s imagine that I have access to two pieces of information. I know, for example, that if it is raining, then Harry is inside. And let’s say I also know it is raining. Then most of us could reasonably look at this information and conclude that, all right, Harry must be inside. This inference rule is known as modus ponens, and it’s phrased more formally in logic as this.
If we know that alpha implies beta — in other words, if alpha, then beta — and we also know that alpha is true, then we should be able to conclude that beta is also true. We can apply this inference rule to take these two pieces of information and generate this new piece of information. Notice that this is a totally different approach from the model checking approach, where the approach was look at all of the possible worlds and see what’s true in each of these worlds. Here, we’re not dealing with any specific world. We’re just dealing with the knowledge that we know and what conclusions we can arrive at based on that knowledge. I know that alpha implies beta, and I know alpha, and the conclusion is beta. And this should seem like a relatively obvious rule. But of course, if alpha, then beta, and we know alpha, then we should be able to conclude that beta is also true. And that’s going to be true for many, maybe even all, of the inference rules that we’ll take a look at. You should be able to look at them and say, yeah, of course that’s going to be true. But it’s putting these all together, figuring out the right combination of inference rules that can be applied, that ultimately is going to allow us to generate interesting knowledge inside of our AI. So that’s modus ponens, this application of implication: if we know alpha and we know that alpha implies beta, then we can conclude beta. Let’s take a look at another example. Fairly straightforward, something like Harry is friends with Ron and Hermione. Based on that information, we can reasonably conclude Harry is friends with Hermione. That must also be true. And this inference rule is known as and elimination. And what and elimination says is that if we have a situation where alpha and beta are both true — I have information alpha and beta — well then, just alpha is true. Or likewise, just beta is true. That if I know that both parts are true, then one of those parts must also be true. Again, something obvious from the point of view of human intuition, but a computer needs to be told this kind of information. To be able to apply the inference rule, we need to tell the computer that this is an inference rule that you can apply, so the computer has access to it and is able to use it in order to translate information from one form to another. In addition to that, let’s take a look at another example of an inference rule, something like: it is not true that Harry did not pass the test. Bit of a tricky sentence to parse. I’ll read it again. It is not true — or it is false — that Harry did not pass the test. Well, if it is false that Harry did not pass the test, then the only reasonable conclusion is that Harry did pass the test. And so this, instead of being and elimination, is what we call double negation elimination. That if we have two negatives inside of our premise, then we can just remove them altogether. They cancel each other out. One turns true to false, and the other one turns false back into true. Phrased a little bit more formally, we say that if the premise is not not alpha, then the conclusion we can draw is just alpha. We can say that alpha is true. We’ll take a look at a couple more of these. If I have: if it is raining, then Harry is inside — how do I reframe this? Well, this one is a little bit trickier. But if I know if it is raining, then Harry is inside, then I can conclude one of two things must be true. Either it is not raining, or Harry is inside. Now, this one’s trickier. So let’s think about it a little bit.
This first premise here, if it is raining, then Harry is inside, is saying that if I know that it is raining, then Harry must be inside. So what is the other possible case? Well, if Harry is not inside, then I know that it must not be raining. So one of those two situations must be true. Either it’s not raining, or it is raining, in which case Harry is inside. So the conclusion I can draw is that either it is not raining, or Harry is inside. And so this is a way to translate if-then statements into or statements. And this is known as implication elimination. And this is similar to what we actually did in the beginning when we were first looking at those very first sentences about Harry and Hagrid and Dumbledore. And phrased a little bit more formally, this says that if I have the implication, alpha implies beta, then I can draw the conclusion that either not alpha or beta, because there are only two possibilities. Either alpha is true or alpha is not true. So one of those possibilities is alpha is not true. But if alpha is true, well, then we can draw the conclusion that beta must be true. So either alpha is not true, or alpha is true, in which case beta is also true. So this is one way to turn an implication into just a statement about or. In addition to eliminating implications, we can also eliminate biconditionals as well. So let’s take an English example, something like: it is raining if and only if Harry is inside. And this if and only if really sounds like that biconditional, that double arrow sign that we saw in propositional logic not too long ago. And what does this actually mean if we were to translate this? Well, this means that if it is raining, then Harry is inside. And if Harry is inside, then it is raining — this implication goes both ways. And this is what we would call biconditional elimination, that I can take a biconditional, alpha if and only if beta, and translate that into something like this: alpha implies beta, and beta implies alpha. So many of these inference rules are taking logic that uses certain symbols and turning them into different symbols, taking an implication and turning it into an or, or taking a biconditional and turning it into implications. And another example of it would be something like this. It is not true that both Harry and Ron passed the test. Well, all right, how do we translate that? What does that mean? Well, if it is not true that both of them passed the test, well, then the reasonable conclusion we might draw is that at least one of them didn’t pass the test. So the conclusion is either Harry did not pass the test or Ron did not pass the test, or both. This is not an exclusive or. But if it is true that it is not true that both Harry and Ron passed the test, well, then either Harry didn’t pass the test or Ron didn’t pass the test. And this type of law is one of De Morgan’s laws, quite famous in logic, where the idea is that we can turn an and into an or. We can take this and — that both Harry and Ron passed the test — and turn it into an or by moving the nots around. So if it is not true that Harry and Ron passed the test, well, then either Harry did not pass the test or Ron did not pass the test. And the way we frame that more formally using logic is to say this: if it is not true that alpha and beta, well, then either not alpha or not beta.
The way I like to think about this is that if you have a negation in front of an and expression, you move the negation inwards, so to speak, moving the negation into each of these individual sentences, and then flip the and into an or. So the negation moves inwards and the and flips into an or. So I go from not (a and b) to not a or not b. And there’s actually a reverse of De Morgan’s law that goes in the other direction for something like this. If I say it is not true that Harry or Ron passed the test, meaning neither of them passed the test, well, then the conclusion I can draw is that Harry did not pass the test and Ron did not pass the test. So in this case, instead of turning an and into an or, we’re turning an or into an and. But the idea is the same. And this, again, is another example of De Morgan’s laws. And the way that works is that if I have not (a or b) this time, the same logic is going to apply. I’m going to move the negation inwards. And I’m going to flip, this time, the or into an and. So if I have not (alpha or beta), meaning it is not true that alpha or beta, then I can say not alpha and not beta, moving the negation inwards in order to make that conclusion. So those are De Morgan’s laws. And there are a couple of other inference rules that are worth just taking a look at. One is the distributive law, which works this way. So if I have alpha and (beta or gamma), well, then much in the same way that in math you can use the distributive law to distribute a multiplication over an addition, I can do a similar thing here, where I can say that alpha and (beta or gamma) is the same as (alpha and beta) or (alpha and gamma), that I’ve been able to distribute this and sign throughout this expression. So this is an example of the distributive property, or the distributive law, as applied to logic, in much the same way that you would distribute a multiplication over the addition of something, for example. This works the other way too. So if, for example, I have alpha or (beta and gamma), I can distribute the or throughout the expression. I can say (alpha or beta) and (alpha or gamma). So the distributive law works in that way too. And it’s helpful if I want to take an or and move it into the expression. And we’ll see an example soon of why it is that we might actually care to do something like that. All right, so now we’ve seen a lot of different inference rules. And the question now is, how can we use those inference rules to actually try and draw some conclusions, to actually try and prove something about entailment — proving that given some initial knowledge base, we would like to find some way to prove that a query is true? Well, one way to think about it is actually to think back to what we talked about last time when we talked about search problems. Recall again that search problems have some sort of initial state. They have actions that you can take from one state to another, as defined by a transition model that tells you how to get from one state to another. We talked about testing to see if you were at a goal. And then some path cost function to see how many steps you had to take, or how costly the solution you found was. Now that we have these inference rules that take some set of sentences in propositional logic and get us some new set of sentences in propositional logic, we can actually treat those sentences, or those sets of sentences, as states inside of a search problem.
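Since each of these equivalences involves only a handful of symbols, they’re easy to sanity-check by brute force over every truth assignment — a quick, self-contained check in Python:

```python
from itertools import product

# Verify De Morgan's laws and the distributive laws over every
# assignment of truth values to a, b, and c.
for a, b, c in product([False, True], repeat=3):
    assert (not (a and b)) == ((not a) or (not b))       # De Morgan: not(a and b)
    assert (not (a or b)) == ((not a) and (not b))       # De Morgan: not(a or b)
    assert (a and (b or c)) == ((a and b) or (a and c))  # distribute and over or
    assert (a or (b and c)) == ((a or b) and (a or c))   # distribute or over and

print("All four equivalences hold for every truth assignment.")
```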
So if we want to prove that some query is true, prove that some logical theorem is true, we can treat theorem proving as a form of a search problem. I can say that we begin in some initial state, where that initial state is the knowledge base that I begin with, the set of all of the sentences that I know to be true. What actions are available to me? Well, the actions are any of the inference rules that I can apply at any given time. The transition model just tells me that after I apply the inference rule, here is the new set of all of the knowledge that I have, which will be the old set of knowledge plus some additional inference that I’ve been able to draw, much in the same way we saw what we got when we applied those inference rules and got some sort of conclusion. That conclusion gets added to our knowledge base, and our transition model will encode that. What is the goal test? Well, our goal test is checking to see if we have proved the statement we’re trying to prove, if the thing we’re trying to prove is inside of our knowledge base. And the path cost function, the thing we’re trying to minimize, is maybe the number of inference rules that we needed to use, the number of steps, so to speak, inside of our proof. And so here we’ve been able to apply the same types of ideas that we saw last time with search problems to something like trying to prove something about knowledge, by taking our knowledge and framing it in terms that we can understand as a search problem with an initial state, with actions, with a transition model. So this shows a couple of things, one being how versatile search problems are: the same types of algorithms that we use to solve a maze, or to figure out how to get from point A to point B in driving directions, for example, can also be used as a theorem-proving method, taking some sort of starting knowledge base and trying to prove something about that knowledge. So this, yet again, is a second way, in addition to model checking, to try and prove that certain statements are true. But it turns out there’s yet another way that we can try and apply inference. And we’ll talk about it now. It’s not the only way, but it’s certainly one of the most common, and it’s known as resolution. And resolution is based on another inference rule that we’ll take a look at now, quite a powerful inference rule that will let us prove anything that can be proven about a knowledge base. And it’s based on this basic idea. Let’s say I know that either Ron is in the Great Hall or Hermione is in the library. And let’s say I also know that Ron is not in the Great Hall. Based on those two pieces of information, what can I conclude? Well, I could pretty reasonably conclude that Hermione must be in the library. How do I know that? Well, it’s because these two statements, these two what we’ll call complementary literals — literals that complement each other, they’re opposites of each other — seem to conflict with each other. This sentence tells us that either Ron is in the Great Hall or Hermione is in the library. So if we know that Ron is not in the Great Hall, that conflicts with this one, which means Hermione must be in the library. And this we can frame as a more general rule known as the unit resolution rule, a rule that says that if we have p or q, and we also know not p, well then from that we can reasonably conclude q. That if p or q is true, and we know that p is not true, the only possibility is for q to then be true.
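One simple way to play with this rule concretely is to represent a clause as a frozenset of literal strings, with a leading "-" marking negation — a representation I’m choosing here for illustration, not the lecture’s:

```python
def negate(literal):
    # "-P" -> "P", and "P" -> "-P"
    return literal[1:] if literal.startswith("-") else "-" + literal

def unit_resolve(clause, unit_literal):
    """Unit resolution: from (p or q1 or ... or qn) and not p,
    conclude (q1 or ... or qn). Returns None if the rule doesn't apply."""
    if negate(unit_literal) in clause:
        return clause - {negate(unit_literal)}
    return None

# "Ron is in the Great Hall or Hermione is in the library", plus
# "Ron is not in the Great Hall", resolves to "Hermione is in the library".
clause = frozenset({"RonInGreatHall", "HermioneInLibrary"})
print(unit_resolve(clause, "-RonInGreatHall"))  # frozenset({'HermioneInLibrary'})
```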
And this, it turns out, is quite a powerful inference rule in terms of what it can do, in part because we can quickly start to generalize this rule. This q right here doesn’t need to just be a single propositional symbol. It could be multiple, all chained together in a single clause, as we’ll call it. So if I had something like p or q1 or q2 or q3, so on and so forth, up until qn — so I had n different other variables — and I have not p, well then what happens when these two complement each other is that these two clauses resolve, so to speak, to produce a new clause that is just q1 or q2 all the way up to qn. And in an or, the order of the arguments in the or doesn’t actually matter. The p doesn’t need to be the first thing. It could have been in the middle. But the idea here is that if I have p in one clause and not p in the other clause, well then I know that one of these remaining things must be true. I’ve resolved them in order to produce a new clause. But it turns out we can generalize this idea even further, in fact, and display even more power that we can have with this resolution rule. So let’s take another example. Let’s say, for instance, that I know the same piece of information, that either Ron is in the Great Hall or Hermione is in the library. And the second piece of information I know is that Ron is not in the Great Hall or Harry is sleeping. So it’s not just a single piece of information. I have two different clauses. And we’ll define clauses more precisely in just a moment. What do I know here? Well again, for any propositional symbol like Ron is in the Great Hall, there are only two possibilities. Either Ron is in the Great Hall, in which case, based on resolution, we know that Harry must be sleeping, or Ron is not in the Great Hall, in which case we know based on the same rule that Hermione must be in the library. Based on those two things in combination, I can conclude from these two premises that either Hermione is in the library or Harry is sleeping. So again, because these two conflict with each other, I know that one of these two must be true. And you can take a closer look and try and reason through that logic. Make sure you convince yourself that you believe this conclusion. Stated more generally, we can name this resolution rule by saying that if we know p or q is true, and we also know that not p or r is true, we resolve these two clauses together to get a new clause, q or r: either q or r must be true. And again, much as in the last case, q and r don’t need to just be single propositional symbols. They could be multiple symbols. So if I had a clause that had p or q1 or q2 or q3, so on and so forth, up until qn, where n is just some number, and likewise I had not p or r1 or r2, so on and so forth, up until rm, where m, again, is just some other number, I can resolve these two clauses together to get a new clause saying that one of these must be true: q1 or q2 up until qn, or r1 or r2 up until rm. And this is just a generalization of that same rule we saw before. Each of these things here is what we’re going to call a clause, where a clause is formally defined as a disjunction of literals, where a disjunction means it’s a bunch of things that are connected with or. Disjunction means things connected with or; conjunction, meanwhile, means things connected with and. And a literal is either a propositional symbol or the negation of a propositional symbol — something like p, or q, or not p, or not q. Those are all propositional symbols or negations of propositional symbols.
And we call those literals. And so a clause is just something like this: p or q or r, for example. Meanwhile, what this gives us the ability to do is turn logic — any logical sentence — into something called conjunctive normal form. A conjunctive normal form sentence is a logical sentence that is a conjunction of clauses. Recall, again, conjunction means things are connected to one another using and. And so a conjunction of clauses means it is an and of individual clauses, each of which has ors in it. So something like this: (a or b or c) and (d or not e) and (f or g). Everything in parentheses is one clause. All of the clauses are connected to each other using an and. And everything in the clause is separated using an or. And this is just a standard form that we can translate a logical sentence into that just makes it easy to work with and easy to manipulate. And it turns out that we can take any sentence in logic and turn it into conjunctive normal form just by applying some inference rules and transformations to it. So we’ll take a look at how we can actually do that. So what is the process for taking a logical formula and converting it into conjunctive normal form, otherwise known as CNF? Well, the process looks a little something like this. We need to take all of the symbols that are not part of conjunctive normal form — the biconditionals and the implications and so forth — and turn them into something that is more closely like conjunctive normal form. So the first step will be to eliminate biconditionals, those if-and-only-if double arrows. And we know how to eliminate biconditionals because we saw there was an inference rule to do just that. Any time I have an expression like alpha if and only if beta, I can turn that into alpha implies beta and beta implies alpha, based on that inference rule we saw before. Likewise, in addition to eliminating biconditionals, I can eliminate implications as well, the if-then arrows. And I can do that using the same inference rule we saw before too, taking alpha implies beta and turning that into not alpha or beta, because that is logically equivalent to this first thing here. Then we can move nots inwards, because we don’t want nots on the outsides of our expressions. Conjunctive normal form requires that it’s just clause and clause and clause and clause. Any nots need to be immediately next to propositional symbols. But we can move those nots around using De Morgan’s laws, by taking something like not (A and B) and turning it into not A or not B, for example, using De Morgan’s laws to manipulate that. And after that, all we’ll be left with are ands and ors. And those are easy to deal with. We can use the distributive law to distribute the ors so that the ors end up on the inside of the expression, so to speak, and the ands end up on the outside. So this is the general pattern for how we’ll take a formula and convert it into conjunctive normal form. And let’s now take a look at an example of how we would do this, and explore then why it is that we would want to do something like this. Here’s how we can do it. Let’s take this formula, for example: (P or Q) implies R. And I’d like to convert this into conjunctive normal form, where it’s all ands of clauses, and every clause is a disjunction, things ored together. So what’s the first thing I need to do? Well, this is an implication. So let me go ahead and remove that implication. Using the implication elimination rule, I can turn (P or Q) implies R into not (P or Q) or R.
So that’s the first step. I’ve gotten rid of the implication. And next, I can get rid of the not on the outside of this expression, too. I can move the nots inwards, so they’re closer to the literals themselves, by using De Morgan’s laws. And De Morgan’s law says that not (P or Q) is equivalent to not P and not Q. Again, here, just applying the inference rules that we’ve already seen in order to translate these statements. And now, I have two things that are separated by an or, where this thing on the inside is an and. What I’d really like is to move the ors so the ors are on the inside, because conjunctive normal form means I need clause and clause and clause and clause. And so to do that, I can use the distributive law. If I have (not P and not Q) or R, I can distribute the or R to both of these to get (not P or R) and (not Q or R), using the distributive law. And this now, here at the bottom, is in conjunctive normal form. It is a conjunction — an and — of clauses, each of which is a disjunction, just things separated by ors. So this process can be used on any formula to take a logical sentence and turn it into this conjunctive normal form, where I have clause and clause and clause and clause and clause and so on. So why is this helpful? Why do we even care about taking all these sentences and converting them into this form? It’s because once they’re in this form where we have these clauses, these clauses are the inputs to the resolution inference rule that we saw a moment ago, that if I have two clauses where there’s something that conflicts, or something complementary, between those two clauses, I can resolve them to get a new clause, to draw a new conclusion. And we call this process inference by resolution, using the resolution rule to draw some sort of inference. And it’s based on the same idea, that if I have P or Q, this clause, and I have not P or R, then I can resolve these two clauses together to get Q or R as the resulting clause, a new piece of information that I didn’t have before. Now, a couple of key points that are worth noting about this before we talk about the actual algorithm. One thing is that, let’s imagine we have P or Q or S, and I also have not P or R or S. The resolution rule says that because this P conflicts with this not P, we would resolve to put everything else together to get Q or S or R or S. But it turns out that this double S is redundant — the or S here and the or S there. It doesn’t change the meaning of the sentence. So in resolution, when we do this resolution process, we’ll usually also do a process known as factoring, where we take any duplicate variables that show up and just eliminate them. So Q or S or R or S just becomes Q or R or S. The S only needs to appear once, no need to include it multiple times. Now, one final question worth considering is what happens if I try to resolve P and not P together? If I know that P is true and I know that not P is true, well, resolution says I can merge these clauses together and look at everything else. Well, in this case, there is nothing else, so I’m left with what we might call the empty clause. I’m left with nothing. And the empty clause is always false. The empty clause is equivalent to just being false. And that’s pretty reasonable, because it’s impossible for both P and not P to hold at the same time. P is either true or it’s not true, which means that if P is true, then not P must be false. And if not P is true, then P must be false. There is no way for both of these to hold at the same time.
So if ever I try and resolve these two, it’s a contradiction, and I’ll end up getting this empty clause, where the empty clause I can call equivalent to false. And this idea, that if I resolve these two contradictory terms I get the empty clause, is the basis for our inference by resolution algorithm. Here’s how we’re going to perform inference by resolution at a very high level. We want to prove that our knowledge base entails some query alpha, that based on the knowledge we have, we can prove conclusively that alpha is going to be true. How are we going to do that? Well, in order to do that, we’re going to try to prove that if we know the knowledge and not alpha, that that would be a contradiction. And this is a common technique in computer science more generally, this idea of proving something by contradiction. If I want to prove that something is true, I can do so by first assuming that it is false and showing that it would be contradictory, showing that it leads to some contradiction. And if the thing I’m trying to prove, when I assume it’s false, leads to a contradiction, then it must be true. And that’s the logical approach, or the idea, behind a proof by contradiction. And that’s what we’re going to do here. We want to prove that this query alpha is true. So we’re going to assume that it’s not true. We’re going to assume not alpha. And we’re going to try and prove that it’s a contradiction. If we do get a contradiction, well, then we know that our knowledge entails the query alpha. If we don’t get a contradiction, there is no entailment. This is the idea of a proof by contradiction, of assuming the opposite of what you’re trying to prove. And if you can demonstrate that that’s a contradiction, then what you’re proving must be true. But more formally, how do we actually do this? How do we check that knowledge base and not alpha is going to lead to a contradiction? Well, here is where resolution comes into play. To determine if our knowledge base entails some query alpha, we’re going to convert knowledge base and not alpha to conjunctive normal form, that form where we have a whole bunch of clauses that are all anded together. And when we have these individual clauses, now we can keep checking to see if we can use resolution to produce a new clause. We can take any pair of clauses and check: is there some literal in one of them that is the opposite of, or complementary to, a literal in the other? For example, I have a P in one clause and a not P in another clause. Or an R in one clause and a not R in another clause. If ever I have that situation, where once I convert to conjunctive normal form and I have a whole bunch of clauses, I see two clauses that I can resolve to produce a new clause, then I’ll do so. This process occurs in a loop. I’m going to keep checking to see if I can use resolution to produce a new clause, and keep using those new clauses to try to generate more new clauses after that. Now, it may just so happen that eventually we produce the empty clause, the clause we were talking about before. If I resolve P and not P together, that produces the empty clause, and the empty clause we know to be false, because we know that there’s no way for both P and not P to simultaneously be true. So if ever we produce the empty clause, then we have a contradiction. And if we have a contradiction, that’s exactly what we were trying to do in a proof by contradiction. If we have a contradiction, then we know that our knowledge base must entail this query alpha.
And we know that alpha must be true. And it turns out — and we won’t go into the proof here — that you can show that otherwise, if you don’t produce the empty clause, then there is no entailment. If we run into a situation where there are no more new clauses to add, we’ve done all the resolution that we can do, and yet we still haven’t produced the empty clause, then there is no entailment in this case. And this now is the resolution algorithm. And it’s very abstract looking, especially this idea of, like, what does it even mean to have the empty clause? So let’s take a look at an example, and actually try and prove some entailment by using this inference by resolution process. So here’s our question. We have this knowledge base. Here is the knowledge that we know: A or B, and not B or C, and not C. And we want to know if all of this entails A. So this is our knowledge base here, this whole long expression. And our query alpha is just this propositional symbol, A. So what do we do? Well, first, we want to prove by contradiction. So we want to first assume that A is false, and see if that leads to some sort of contradiction. So here is what we’re going to start with: A or B, and not B or C, and not C. This is our knowledge base. And we’re going to assume not A. We’re going to assume that the thing we’re trying to prove is, in fact, false. And so this is now in conjunctive normal form, and I have four different clauses. I have A or B. I have not B or C. I have not C. And I have not A. And now, I can begin to just pick two clauses that I can resolve, and apply the resolution rule to them. And so looking at these four clauses, I see, all right, these two clauses are ones I can resolve. I can resolve them because there are complementary literals that show up in them. There’s a C here, and a not C here. So just looking at these two clauses, if I know that not B or C is true, and I know that C is not true, well, then I can resolve these two clauses to say, all right, not B, that must be true. I can generate this new clause as a new piece of information that I now know to be true. And all right, now I can repeat this process, do the process again. Can I use resolution again to get some new conclusion? Well, it turns out I can. I can use that new clause I just generated, along with this one here. There are complementary literals. This B is complementary to, or conflicts with, this not B over here. And so if I know that A or B is true, and I know that B is not true, well, then the only remaining possibility is that A must be true. So now we have A. That is a new clause that I’ve been able to generate. And now, I can do this one more time. I’m looking for two clauses that can be resolved — and you might programmatically do this by just looping over all possible pairs of clauses and checking for complementary literals in each. And here, I can say, all right, I found two clauses, not A and A, that conflict with each other. And when I resolve these two together, well, this is the same as when we were resolving P and not P from before. When I resolve these two clauses together, I get rid of the As, and I’m left with the empty clause. And the empty clause we know to be false, which means we have a contradiction, which means we can safely say that this whole knowledge base does entail A. That if this sentence is true, then we know that A for sure is also true. So this now, using inference by resolution, is an entirely different way to take some statement and try and prove that it is, in fact, true.
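A compact sketch of the whole procedure, run on the example just described, again with clauses as frozensets of literal strings (the same illustrative representation as before — a real implementation would first convert arbitrary formulas to CNF):

```python
from itertools import combinations

def negate(literal):
    return literal[1:] if literal.startswith("-") else "-" + literal

def resolve(c1, c2):
    """Return every clause obtainable by resolving c1 with c2 on a
    complementary pair of literals. The set representation performs
    factoring (duplicate literals collapse) automatically."""
    results = []
    for lit in c1:
        if negate(lit) in c2:
            results.append((c1 - {lit}) | (c2 - {negate(lit)}))
    return results

def entails(clauses, query_literal):
    """Proof by contradiction: add the negated query, then resolve until
    we derive the empty clause (entailment) or run out of new clauses."""
    clauses = set(clauses) | {frozenset({negate(query_literal)})}
    while True:
        new = set()
        for c1, c2 in combinations(clauses, 2):
            for resolvent in resolve(c1, c2):
                if not resolvent:          # empty clause: contradiction
                    return True
                new.add(resolvent)
        if new <= clauses:                 # nothing new: no entailment
            return False
        clauses |= new

# Knowledge base: (A or B) and (not B or C) and (not C). Query: A.
kb = [frozenset({"A", "B"}), frozenset({"-B", "C"}), frozenset({"-C"})]
print(entails(kb, "A"))  # True
```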
Instead of enumerating all of the possible worlds that we might be in, in order to try to figure out in which cases the knowledge base is true and in which cases the query is true, instead we use this resolution algorithm to say: let’s keep trying to figure out what conclusions we can draw and see if we reach a contradiction. And if we reach a contradiction, then that tells us something about whether our knowledge actually entails the query or not. And it turns out there are many different algorithms that can be used for inference. What we’ve just looked at here are just a couple of them. And in fact, all of this is just based on one particular type of logic. It’s based on propositional logic, where we have these individual symbols and we connect them using and, or, not, implication, and biconditionals. But propositional logic is not the only kind of logic that exists. And in fact, we see that there are limitations that exist in propositional logic, especially as we saw in examples like Mastermind, or the logic puzzle where we had different people belonging to different Hogwarts houses and we were trying to figure out who belonged to which house. There were a lot of different propositional symbols that we needed in order to represent some fairly basic ideas. So the final topic that we’ll take a look at, just before we end class today, is one final type of logic, different from propositional logic, known as first order logic, which is a little bit more powerful than propositional logic and is going to make it easier for us to express certain types of ideas. In propositional logic, if we think back to that puzzle with the people in the Hogwarts houses, we had a whole bunch of symbols. And every symbol could only be true or false. We had a symbol for Minerva Gryffindor, which was true if Minerva was in Gryffindor and false otherwise, and likewise for Minerva Hufflepuff and Minerva Ravenclaw and Minerva Slytherin and so forth. But this was starting to get quite redundant. We wanted some way to be able to express that there is a relationship between these propositional symbols, that Minerva shows up in all of them. And also, I would have liked not to have had so many different symbols to represent what really was a fairly straightforward problem. So first order logic will give us a different way of trying to deal with this idea, by giving us two different types of symbols. We’re going to have constant symbols that are going to represent objects like people or houses. And then predicate symbols, which you can think of as relations or functions that take an input and evaluate to true or false — that tell us whether or not some property of some constant, or some pair of constants, or multiple constants, actually holds. So we’ll see an example of that in just a moment. For now, in this same problem, our constant symbols might be objects, things like people or houses. So Minerva, Pomona, Horace, Gilderoy — those are all constant symbols, as are my four houses, Gryffindor, Hufflepuff, Ravenclaw, and Slytherin. Predicates, meanwhile — these predicate symbols — are going to be properties that might hold true or false of these individual constants. So person might hold true of Minerva, but it would be false for Gryffindor, because Gryffindor is not a person. And house is going to hold true for Ravenclaw, but it’s not going to hold true for Horace, for example, because Horace is a person.
And belongs to, meanwhile, is going to be some relation that is going to relate people to their houses. And it’s going to tell me when someone belongs to a house or does not. So let’s take a look at some examples of what a sentence in first order logic might actually look like. A sentence might look like something like this: Person(Minerva), with Minerva in parentheses, and Person being a predicate symbol, Minerva being a constant symbol. This sentence in first order logic effectively means Minerva is a person, or the person property applies to the Minerva object. So if I want to say something like Minerva is a person, here is how I express that idea using first order logic. Meanwhile, I can likewise say something like House(Gryffindor) to express the idea that Gryffindor is a house. I can do that this way. And all of the same logical connectives that we saw in propositional logic are going to work here too: and, or, implication, biconditional, not. In fact, I can use not to say something like not House(Minerva). And this sentence in first order logic means something like: Minerva is not a house. It is not true that the house property applies to Minerva. Meanwhile, in addition to some of these predicate symbols that just take a single argument, some of our predicate symbols are going to express binary relations, relations between two of their arguments. So I could say something like BelongsTo, and then two inputs, Minerva and Gryffindor, to express the idea that Minerva belongs to Gryffindor. And so now here’s the key difference, or one of the key differences, between this and propositional logic. In propositional logic, I needed one symbol for Minerva Gryffindor, and one symbol for Minerva Hufflepuff, and one symbol for every other combination of person and house. In this case, I just need one symbol for each of my people, and one symbol for each of my houses. And then I can express as a predicate something like BelongsTo, and say BelongsTo(Minerva, Gryffindor), to express the idea that Minerva belongs to Gryffindor house. So already we can see that first order logic is quite expressive, in being able to express these sorts of sentences using the existing constant symbols and predicates that already exist, while minimizing the number of new symbols that I need to create. I can just use eight symbols — four for people and four for houses — instead of 16 symbols, one for every possible combination of each. But first order logic gives us a couple of additional features that we can use to express even more complex ideas. And these additional features are generally known as quantifiers. And there are two main quantifiers in first order logic, the first of which is universal quantification. Universal quantification lets me express an idea like: something is going to be true for all values of a variable. Like, for all values of x, some statement is going to hold true. So what might a sentence in universal quantification look like? Well, we’re going to use this upside-down A to mean for all. So upside-down A x means: for all values of x, where x is any object, this is going to hold true. BelongsTo(x, Gryffindor) implies not BelongsTo(x, Hufflepuff). So let’s try and parse this out. This means that for all values of x, if this holds true — if x belongs to Gryffindor — then this does not hold true: x does not belong to Hufflepuff. So translated into English, this sentence is saying something like: for all objects x, if x belongs to Gryffindor, then x does not belong to Hufflepuff, for example.
Or phrased even more simply: anyone in Gryffindor is not in Hufflepuff — a simplified way of saying the same thing. So this universal quantification lets us express an idea like: something is going to hold true for all values of a particular variable. In addition to universal quantification, though, we also have existential quantification. Whereas universal quantification said that something is going to be true for all values of a variable, existential quantification says that some expression is going to be true for some value of a variable, at least one value of the variable. So let’s take a look at a sample sentence using existential quantification. One such sentence looks like this: there exists an x — this backwards E stands for exists. And here we’re saying there exists an x such that House(x) and BelongsTo(Minerva, x). In other words, there exists some object x where x is a house and Minerva belongs to x. Or, phrased a little more succinctly in English, I’m here just saying Minerva belongs to a house. There’s some object that is a house, and Minerva belongs to it. And combining this universal and existential quantification, we can create far more sophisticated logical statements than we were able to just using propositional logic. I could combine these to say something like this: for all x, Person(x) implies there exists a y such that House(y) and BelongsTo(x, y). All right, so a lot of stuff going on there, a lot of symbols. Let’s try and parse it out and just understand what it’s saying. Here we’re saying that for all values of x, if x is a person, then this is true. So in other words, I’m saying: for all people — and we call that person x — this statement is going to be true. What statement is true of all people? Well, there exists a y that is a house — so there exists some house — and x belongs to y. In other words, I’m saying that for all people out there, there exists some house such that x, the person, belongs to y, the house. Phrased more succinctly, I’m saying that every person belongs to a house: for all x, if x is a person, then there exists a house that x belongs to. And so we can now express a lot more powerful ideas using this idea of first order logic. And it turns out there are many other kinds of logic out there. There’s second order logic and other higher order logics, each of which allows us to express more and more complex ideas. But all of it, in this case, is really in pursuit of the same goal, which is the representation of knowledge. We want our AI agents to be able to know information, to represent that information — whether that’s using propositional logic or first order logic or some other logic — and then be able to reason based on that: to be able to draw conclusions, make inferences, figure out whether there’s some sort of entailment relationship, by using some sort of inference algorithm, something like inference by resolution or model checking or any number of these other algorithms that we can use in order to take information that we know and translate it into additional conclusions. So all of this has helped us to create AI that is able to represent information about what it knows and what it doesn’t know.
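Written out in standard notation, the three quantified sentences from this passage are:

```latex
% Anyone in Gryffindor is not in Hufflepuff:
\forall x \,\bigl(\text{BelongsTo}(x, \text{Gryffindor}) \rightarrow \neg\,\text{BelongsTo}(x, \text{Hufflepuff})\bigr)

% Minerva belongs to some house:
\exists x \,\bigl(\text{House}(x) \land \text{BelongsTo}(\text{Minerva}, x)\bigr)

% Every person belongs to some house:
\forall x \,\bigl(\text{Person}(x) \rightarrow \exists y \,(\text{House}(y) \land \text{BelongsTo}(x, y))\bigr)
```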
Next time, though, we'll take a look at how we can make our AI even more powerful by not just encoding information that we know for sure to be true or not true, but also taking into account uncertainty: what happens if AI thinks that something might be probable, or maybe not very probable, or somewhere in between those two extremes, all in pursuit of building intelligent systems that are even more intelligent. We'll see you next time. All right, welcome back, everyone, to an introduction to artificial intelligence with Python. Last time, we took a look at how AI inside our computers can represent knowledge. We represented that knowledge in the form of logical sentences in a variety of different logical languages, and the idea was that we wanted our AI to be able to represent knowledge or information and somehow use those pieces of information to derive new pieces of information by inference, to take some information and deduce additional conclusions based on what it already knew for sure. But in reality, when we think about computers and AI, very rarely are our machines going to be able to know things for sure. Oftentimes, there's going to be some amount of uncertainty in the information that our AIs or our computers are dealing with, where they might believe something with some probability, as we'll soon discuss, but not entirely for certain. And we want to use the information the AI has some knowledge about, even if it doesn't have perfect knowledge, to still make inferences and draw conclusions. So you might imagine, for example, a robot that has some sensors and is exploring some environment. It might not know exactly where it is or exactly what's around it, but it does have access to some data that allows it to draw inferences with some probability: some likelihood that one thing or another is true. Or you can imagine contexts where there is a little more randomness and uncertainty, something like predicting the weather, where you might not be able to know tomorrow's weather with 100% certainty, but you can probably infer, with some probability, what tomorrow's weather is going to be, based on today's weather, yesterday's weather, and other data you have access to. And so, oftentimes, we can distill this into possible events that might happen and the likelihood of those events. This comes up a lot in games, where there is an element of chance. Imagine rolling a die: you're not sure exactly what the roll is going to be, but you know it's going to be one of the possibilities from 1 to 6, for example. And so here we introduce the idea of probability theory. What we'll take a look at today begins with the mathematical foundations of probability theory, getting an understanding of some of the key concepts within probability, and then dives into how we can use those mathematical ideas to build models we can put into our computers, in order to program an AI that is able to use information about probability to draw inferences and make judgments about the world with some probability or likelihood of being true.
So probability ultimately boils down to this idea that there are possible worlds, which we represent using the little Greek letter omega (ω). The idea of a possible world is that when I roll a die, there are six possible worlds that could result: I could roll a 1, a 2, a 3, a 4, a 5, or a 6, and each of those is a possible world. Each of those possible worlds has some probability of being true: the probability that I do roll a 1, or a 2, or something else. And we represent that probability using the capital letter P, with the thing we want the probability of in parentheses: P(ω) is the probability of some possible world ω. Now, there are a couple of basic axioms of probability that become relevant as we consider how we deal with probability and how we think about it. First and foremost, every probability value must range between 0 and 1 inclusive: 0 ≤ P(ω) ≤ 1. The smallest value any probability can have is 0, which represents an impossible event, something like: I roll a die, and the roll that I get is a 7. If the die only has the numbers 1 through 6, the event that I roll a 7 is impossible, so it has probability 0. On the other end of the spectrum, probability can range all the way up to 1, meaning an event is certain to happen: I roll a die and the number is less than 10, for example. That is guaranteed to happen if the only sides on my die are 1 through 6. And probabilities can take on any real number in between these two values, where, generally speaking, a higher value means an event is more likely to take place, and a lower value means it is less likely. The other key rule of probability uses sigma notation, which, if you haven't seen it before, refers to summation: adding up a whole sequence of values. This sigma notation is going to come up a couple of times today, because as we deal with probability, we're often adding up a whole bunch of individual probabilities to get some other value. What this rule says is that if I sum up the probabilities of all the possible worlds ω in big Omega (Ω), the set of all possible worlds, I get exactly 1: Σ over ω in Ω of P(ω) = 1. So if I take all the possible worlds and add up each of their probabilities, I should get the number 1 at the end; all the probabilities need to sum to 1. So, for example, if I have a fair die with numbers 1 through 6 and I roll it, each of those rolls has an equal probability of 1 over 6. Each of those probabilities is between 0 and 1, with 0 meaning impossible and 1 meaning certain, and if you add up the probabilities for all the possible worlds, you get the number 1. And we can represent any one of those probabilities like this: the probability that we roll the number 2, for example, is just 1 over 6. Roughly speaking, every six times we roll the die, we'd expect that one time the die might come up as a 2: not certain, but a little more than nothing.
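To make the axioms concrete, here is a tiny Python sketch (an illustration, not lecture code) that encodes the fair die and checks both axioms exactly, using fractions to avoid floating-point rounding:

```python
from fractions import Fraction

# A fair six-sided die: each possible world omega has probability 1/6.
P = {omega: Fraction(1, 6) for omega in range(1, 7)}

# Axiom 1: every probability lies between 0 and 1 inclusive.
assert all(0 <= p <= 1 for p in P.values())

# Axiom 2: the probabilities of all possible worlds sum to 1.
assert sum(P.values()) == 1

print(P[2])  # 1/6, the probability of rolling a 2
```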
And so this is all fairly straightforward for just a single die. But things get more interesting as our models of the world get a little more complex. Let's imagine now that we're not dealing with a single die, but with two dice: I have a red die here and a blue die there, and I care not just about the individual rolls, but about the sum of the two rolls. In this case, say, the sum of the two rolls is the number 3. How do I begin to reason about what the probabilities look like if, instead of one die, I now have two dice? Well, we could first consider all of the possible worlds. In this case, the possible worlds are just every combination of the red and blue die that I could come up with: the red die could be a 1 or a 2 or a 3 or a 4 or a 5 or a 6, and for each of those possibilities, the blue die could likewise be 1 through 6. And it just so happens that in this particular case, each of these possible combinations is equally likely. That's not always going to be the case: for more complex models of the real world, it's probably not true that every single possible world is equally likely. But in the case of fair dice, where on any given roll any one number has just as good a chance of coming up as any other, we can consider all of these possible worlds to be equally likely. But even though all of the possible worlds are equally likely, that doesn't necessarily mean that their sums are equally likely. If we compute the sum for each of these pairs (1 plus 1 is 2, 2 plus 1 is 3, and so on), we notice some patterns: it's not the case that every sum comes up equally often. Consider 7, for example: what's the probability that when I roll two dice, their sum is 7? There are several ways this can happen; there are six possible worlds where the sum is 7. It could be a 1 and a 6, or a 2 and a 5, or a 3 and a 4, a 4 and a 3, and so forth. But if you instead ask for the probability that the sum of the two rolls is 12, looking at this diagram, there's only one possible world in which that can happen: the world where both the red die and the blue die come up as sixes, to give a sum of 12. So just from this diagram we can see that some of these probabilities differ: the probability that the sum is 7 must be greater than the probability that the sum is 12. We can represent that more formally by saying the probability that we sum to 12 is 1 out of 36: out of the 36 equally likely possible worlds (6 squared, because we have six options for the red die and six options for the blue die), only one of them sums to 12. Whereas the probability that two dice rolls sum to 7 is 6 out of those 36 possible worlds, which we can simplify as a fraction to just 1 over 6.
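Here is the same two-dice reasoning done by enumeration in Python, a small sketch that counts worlds to recover the 1/6 and 1/36 figures:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely possible worlds for a red die and a blue die.
worlds = list(product(range(1, 7), repeat=2))

def p(event):
    """Probability of an event: matching worlds out of all equally likely worlds."""
    return Fraction(sum(1 for w in worlds if event(w)), len(worlds))

print(p(lambda w: w[0] + w[1] == 7))   # 1/6  (six of the 36 worlds sum to 7)
print(p(lambda w: w[0] + w[1] == 12))  # 1/36 (only (6, 6) sums to 12)
```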
So here now, we’re able to represent these different ideas of probability, representing some events that might be more likely and then other events that are less likely as well. And these sorts of judgments, where we’re figuring out just in the abstract what is the probability that this thing takes place, are generally known as unconditional probabilities. Some degree of belief we have in some proposition, some fact about the world, in the absence of any other evidence. Without knowing any additional information, if I roll a die, what’s the chance it comes up as a 2? Or if I roll two dice, what’s the chance that the sum of those two die rolls is a 7? But usually when we’re thinking about probability, especially when we’re thinking about training in AI to intelligently be able to know something about the world and make predictions based on that information, it’s not unconditional probability that our AI is dealing with, but rather conditional probability, probability where rather than having no original knowledge, we have some initial knowledge about the world and how the world actually works. So conditional probability is the degree of belief in a proposition given some evidence that has already been revealed to us. So what does this look like? Well, it looks like this in terms of notation. We’re going to represent conditional probability as probability of A and then this vertical bar and then B. And the way to read this is the thing on the left-hand side of the vertical bar is what we want the probability of. Here now, I want the probability that A is true, that it is the real world, that it is the event that actually does take place. And then on the right side of the vertical bar is our evidence, the information that we already know for certain about the world. For example, that B is true. So the way to read this entire expression is what is the probability of A given B, the probability that A is true, given that we already know that B is true. And this type of judgment, conditional probability, the probability of one thing given some other fact, comes up quite a lot when we think about the types of calculations we might want our AI to be able to do. For example, we might care about the probability of rain today given that we know that it rained yesterday. We could think about the probability of rain today just in the abstract. What is the chance that today it rains? But usually, we have some additional evidence. I know for certain that it rained yesterday. And so I would like to calculate the probability that it rains today given that I know that it rained yesterday. Or you might imagine that I want to know the probability that my optimal route to my destination changes given the current traffic condition. So whether or not traffic conditions change, that might change the probability that this route is actually the optimal route. Or you might imagine in a medical context, I want to know the probability that a patient has a particular disease given some results of some tests that have been performed on that patient. And I have some evidence, the results of that test, and I would like to know the probability that a patient has a particular disease. So this notion of conditional probability comes up everywhere. So we begin to think about what we would like to reason about, but being able to reason a little more intelligently by taking into account evidence that we already have. 
We’re more able to get an accurate result for what is the likelihood that someone has this disease if we know this evidence, the results of the test, as opposed to if we were just calculating the unconditional probability of saying, what is the probability they have the disease without any evidence to try and back up our result one way or the other. So now that we’ve got this idea of what conditional probability is, the next question we have to ask is, all right, how do we calculate conditional probability? How do we figure out mathematically, if I have an expression like this, how do I get a number from that? What does conditional probability actually mean? Well, the formula for conditional probability looks a little something like this. The probability of a given b, the probability that a is true, given that we know that b is true, is equal to this fraction, the probability that a and b are true, divided by just the probability that b is true. And the way to intuitively try to think about this is that if I want to know the probability that a is true, given that b is true, well, I want to consider all the ways they could both be true out of the only worlds that I care about are the worlds where b is already true. I can sort of ignore all the cases where b isn’t true, because those aren’t relevant to my ultimate computation. They’re not relevant to what it is that I want to get information about. So let’s take a look at an example. Let’s go back to that example of rolling two dice and the idea that those two dice might sum up to the number 12. We discussed earlier that the unconditional probability that if I roll two dice and they sum to 12 is 1 out of 36, because out of the 36 possible worlds that I might care about, in only one of them is the sum of those two dice 12. It’s only when red is 6 and blue is also 6. But let’s say now that I have some additional information. I now want to know what is the probability that the two dice sum to 12, given that I know that the red die was a 6. So I already have some evidence. I already know the red die is a 6. I don’t know what the blue die is. That information isn’t given to me in this expression. But given the fact that I know that the red die rolled a 6, what is the probability that we sum to 12? And so we can begin to do the math using that expression from before. Here, again, are all of the possibilities, all of the possible combinations of red die being 1 through 6 and blue die being 1 through 6. And I might consider first, all right, what is the probability of my evidence, my B variable, where I want to know, what is the probability that the red die is a 6? Well, the probability that the red die is a 6 is just 1 out of 6. So these 1 out of 6 options are really the only worlds that I care about here now. All the rest of them are irrelevant to my calculation, because I already have this evidence that the red die was a 6, so I don’t need to care about all of the other possibilities that could result. So now, in addition to the fact that the red die rolled as a 6 and the probability of that, the other piece of information I need to know in order to calculate this conditional probability is the probability that both of my variables, A and B, are true. The probability that both the red die is a 6, and they all sum to 12. So what is the probability that both of these things happen? Well, it only happens in one possible case in 1 out of these 36 cases, and it’s the case where both the red and the blue die are equal to 6. 
This is a piece of information we already knew, and this probability is equal to 1 over 36. So to get the conditional probability that the sum is 12, given that I know the red die is equal to 6, I just divide these two values: 1/36 divided by 1/6 gives a probability of 1/6. Given that the red die rolled a 6, the probability that the sum of the two dice is 12 is also 1 over 6. And that probably makes intuitive sense, too, because if the red die is a 6, the only way for me to get to 12 is if the blue die also rolls a 6, and we know that the probability of the blue die rolling a 6 is 1 over 6. So in this case, the conditional probability is fairly straightforward. But this idea of calculating a conditional probability by looking at the probability that both events take place is going to come up again and again. This is the definition of conditional probability, and we're going to use that definition as we think about probability more generally to draw conclusions about the world. This, again, is the formula: P(A | B) = P(A and B) / P(B). And you'll see this formula sometimes written in a couple of different ways. You could imagine algebraically multiplying both sides of this equation by P(B) to get rid of the fraction, which gives P(A and B) = P(B) P(A | B). Or, equivalently, since A and B in this expression are interchangeable (A and B is the same thing as B and A), you could also write P(A and B) = P(A) P(B | A), just switching all the A's and B's. These are all equivalent ways of representing what joint probability means, and you'll sometimes see all of these equations; they might be useful as you begin to reason about probability and think about what values might hold in the real world. Now, sometimes when we deal with probability, we don't just care about a Boolean event, did this happen or did this not happen. Sometimes we want the ability to represent variables in a probability space that can take on multiple different possible values. In probability theory, we call such a variable a random variable: some variable that has a domain of values it can take on. What do I mean by this? Well, I might have a random variable that is just called Roll, for example, with six possible values: Roll is my variable, and the domain of values it can take on is 1, 2, 3, 4, 5, and 6. And I might like to know the probability of each. In this case, they happen to all be the same, but for other random variables that might not be the case. For example, I might have a random variable to represent the weather, where the domain of values it could take on includes things like sun, cloudy, rainy, windy, or snowy, and each of those might have a different probability. And I care about knowing the probability that the weather equals sun, or that the weather equals clouds, for instance.
And I might like to do some mathematical calculations based on that information. Other random variables might be something like traffic: what are the odds that there is no traffic, light traffic, or heavy traffic? Traffic, in this case, is my random variable, and the values it can take on are none, light, or heavy. And I, the person doing these calculations, the person encoding these random variables into my computer, need to decide what these possible values actually are. You might imagine, for example, that if I care about whether or not my flight is on time, my flight has a couple of possible values it could take on: my flight could be on time, my flight could be delayed, or my flight could be canceled. So Flight, in this case, is my random variable, and those are the values it can take on. And often, I want to know the probability that my random variable takes on each of those possible values. This is what we then call a probability distribution: a probability distribution takes a random variable and gives me the probability for each of the possible values in its domain. So in the case of this flight, my probability distribution might look something like this. It says the probability that the random variable Flight is equal to the value "on time" is 0.6, or, put in more human-friendly terms, the likelihood that my flight is on time is 60%. The probability that my flight is delayed is 30%, and the probability that my flight is canceled is 10%, or 0.1. And if you sum up all of these possible values, the sum is going to be 1: if you take all of the possible worlds (here, my three possible values for the random variable Flight) and add them all up, the result needs to be the number 1, per that axiom of probability theory we discussed before. So this is one way of representing the probability distribution for the random variable Flight. Sometimes you'll see it represented a little more concisely, since this notation is pretty verbose for expressing just three possible values. Often, you'll instead see the same idea represented using a vector, and all a vector is is a sequence of values: as opposed to just a single value, I have multiple values. So I could instead represent this idea with a bold, capital P, generally meaning the probability distribution of the variable Flight, equal to a vector written in angle brackets: the probability distribution is ⟨0.6, 0.3, 0.1⟩. And I would just have to know that this probability distribution is in the order on time, delayed, canceled in order to interpret the vector: the first value is the probability that my flight is on time, the second is the probability that it is delayed, and the third is the probability that it is canceled. So this is just an alternate, more compact way of representing the same idea. Whenever we talk about a probability distribution over a random variable, what we're really doing is figuring out the probabilities of each of the possible values that random variable can take on.
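In code, both representations of this distribution are easy to write down; a small sketch (the variable names here are just for illustration):

```python
# The flight distribution from the text, as a dictionary keyed by value,
# and as the more concise vector form (order: on time, delayed, canceled).
flight = {"on time": 0.6, "delayed": 0.3, "canceled": 0.1}
flight_vector = (0.6, 0.3, 0.1)

# All probabilities over the possible values must sum to 1.
assert abs(sum(flight.values()) - 1.0) < 1e-9
```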
But this notation is just a little more succinct, even though it can sometimes be confusing depending on the context in which you see it. So we'll start to look at examples where we use this sort of notation to describe probabilities and the events that might take place. A couple of other important ideas to know with regard to probability theory. One is this idea of independence. Independence refers to the idea that knowledge of one event doesn't influence the probability of another event. For example, in the context of my two dice rolls, the red die and the blue die, those two events are independent: knowing the result of the red die doesn't change the probabilities for the blue die. It doesn't give me any additional information about what the value of the blue die is ultimately going to be. But that's not always going to be the case. You might imagine that in the case of weather, something like clouds and rain, those are probably not independent: if it is cloudy, that might increase the probability that later in the day it's going to rain. Some information informs some other event or some other random variable. So independence refers to the idea that one event doesn't influence the other; and if they're not independent, then there is some relationship between them. So mathematically, formally, what does independence actually mean? Well, recall this formula from before: P(A and B) = P(A) P(B | A). The intuitive way to think about this is that to know how likely it is that A and B both happen, first figure out the likelihood that A happens, and then, given that A happens, figure out the likelihood that B happens, and multiply those two together. But if A and B are independent, meaning knowing A doesn't change anything about the likelihood that B is true, then the probability of B given A (the probability that B is true, given that I know A is true) shouldn't depend on knowing A at all: the probability of B given A is really just the probability of B, if A and B are independent. And so this gives one definition of what it means for A and B to be independent: P(A and B) = P(A) P(B). Any time you find two events A and B where this relationship holds, you can say that A and B are independent. An example of that is the dice we were looking at before. If I want the probability of red being a 6 and blue being a 6, that's just the probability that red is a 6 multiplied by the probability that blue is a 6: each is 1/6, and multiplied together they give 1/36, which matches the joint probability we calculated before, so these two events are independent. What wouldn't be independent would be a case like this: the probability that the red die rolls a 6 and the red die rolls a 4. If you naively said each of these has a probability of 1 over 6, you might multiply them together and get a probability of 1 over 36.
But of course, if you’re only rolling the red die once, there’s no way you could get two different values for the red die. It couldn’t both be a 6 and a 4. So the probability should be 0. But if you were to multiply probability of red 6 times probability of red 4, well, that would equal 1 over 36. But of course, that’s not true. Because we know that there is no way, probability 0, that when we roll the red die once, we get both a 6 and a 4, because only one of those possibilities can actually be the result. And so we can say that the event that red roll is 6 and the event that red roll is 4, those two events are not independent. If I know that the red roll is a 6, I know that the red roll cannot possibly be a 4, so these things are not independent. And instead, if I wanted to calculate the probability, I would need to use this conditional probability as the regular definition of the probability of two events taking place. And the probability of this now, well, the probability of the red roll being a 6, that’s 1 over 6. But what’s the probability that the roll is a 4 given that the roll is a 6? Well, this is just 0, because there’s no way for the red roll to be a 4, given that we already know the red roll is a 6. And so the value, if we do add all that multiplication, is we get the number 0. So this idea of conditional probability is going to come up again and again, especially as we begin to reason about multiple different random variables that might be interacting with each other in some way. And this gets us to one of the most important rules in probability theory, which is known as Bayes rule. And it turns out that just using the information we’ve already learned about probability and just applying a little bit of algebra, we can actually derive Bayes rule for ourselves. But it’s a very important rule when it comes to inference and thinking about probability in the context of what it is that a computer can do or what a mathematician could do by having access to information about probability. So let’s go back to these equations to be able to derive Bayes rule ourselves. We know the probability of A and B, the likelihood that A and B take place, is the likelihood of B, and then the likelihood of A, given that we know that B is already true. And likewise, the probability of A given A and B is the probability of A times the probability of B, given that we know that A is already true. This is sort of a symmetric relationship where it doesn’t matter the order of A and B and B and A mean the same thing. And so in these equations, we can just swap out A and B to be able to represent the exact same idea. So we know that these two equations are already true. We’ve seen that already. And now let’s just do a little bit of algebraic manipulation of this stuff. Both of these expressions on the right-hand side are equal to the probability of A and B. So what I can do is take these two expressions on the right-hand side and just set them equal to each other. If they’re both equal to the probability of A and B, then they both must be equal to each other. So probability of A times probability of B given A is equal to the probability of B times the probability of A given B. And now all we’re going to do is do a little bit of division. I’m going to divide both sides by P of A. And now I get what is Bayes’ rule. The probability of B given A is equal to the probability of B times the probability of A given B divided by the probability of A. 
And sometimes in Bayes’ rule, you’ll see the order of these two arguments switched. So instead of B times A given B, it’ll be A given B times B. That ultimately doesn’t matter because in multiplication, you can switch the order of the two things you’re multiplying, and it doesn’t change the result. But this here right now is the most common formulation of Bayes’ rule. The probability of B given A is equal to the probability of A given B times the probability of B divided by the probability of A. And this rule, it turns out, is really important when it comes to trying to infer things about the world, because it means you can express one conditional probability, the conditional probability of B given A, using knowledge about the probability of A given B, using the reverse of that conditional probability. So let’s first do a little bit of an example with this, just to see how we might use it, and then explore what this means a little bit more generally. So we’re going to construct a situation where I have some information. There are two events that I care about, the idea that it’s cloudy in the morning and the idea that it is rainy in the afternoon. Those are two different possible events that could take place, cloudy in the morning, or the AM, rainy in the PM. And what I care about is, given clouds in the morning, what is the probability of rain in the afternoon? A reasonable question I might ask, in the morning, I look outside, or an AI’s camera looks outside and sees that there are clouds in the morning. And we want to conclude, we want to figure out what is the probability that in the afternoon, there is going to be rain. Of course, in the abstract, we don’t have access to this kind of information, but we can use data to begin to try and figure this out. So let’s imagine now that I have access to some pieces of information. I have access to the idea that 80% of rainy afternoons start out with a cloudy morning. And you might imagine that I could have gathered this data just by looking at data over a sequence of time, that I know that 80% of the time when it’s raining in the afternoon, it was cloudy that morning. I also know that 40% of days have cloudy mornings. And I also know that 10% of days have rainy afternoons. And now using this information, I would like to figure out, given clouds in the morning, what is the probability that it rains in the afternoon? I want to know the probability of afternoon rain given morning clouds. And I can do that, in particular, using this fact, the probability of, so if I know that 80% of rainy afternoons start with cloudy mornings, then I know the probability of cloudy mornings given rainy afternoons. So using sort of the reverse conditional probability, I can figure that out. Expressed in terms of Bayes rule, this is what that would look like. Probability of rain given clouds is the probability of clouds given rain times the probability of rain divided by the probability of clouds. Here I’m just substituting in for the values of a and b from that equation of Bayes rule from before. And then I can just do the math. I have this information. I know that 80% of the time, if it was raining, then there were clouds in the morning. So 0.8 here. Probability of rain is 0.1, because 10% of days were rainy, and 40% of days were cloudy. I do the math, and I can figure out the answer is 0.2. So the probability that it rains in the afternoon, given that it was cloudy in the morning, is 0.2 in this case. 
And this now is an application of Bayes' rule: the idea that using one conditional probability, we can get the reverse conditional probability. This is often useful when one of the conditional probabilities is easier for us to know about or to have data about, and using that information, we can calculate the other conditional probability. So what does this look like? Well, it means that knowing the probability of cloudy mornings given rainy afternoons, we can calculate the probability of rainy afternoons given cloudy mornings. Or, more generally: if we know the probability of some visible effect, some effect we can see and observe, given some unknown cause that we're not sure about, then we can calculate the probability of that unknown cause given the visible effect. What might that look like? In the context of medicine, for example, I might know the probability of some medical test result given a disease: I know that if someone has the disease, then some percentage of the time the test result will come back this way, for instance. Using that information, I can calculate what I actually want to know: given the medical test result, what is the likelihood that someone has the disease? The first piece of information is usually easier to know, easier to immediately have data for, and the second is the information I actually want to calculate. Or I might know that some proportion of counterfeit bills have blurry text around the edges, because counterfeit printers aren't nearly as good at printing text precisely. So I have some information of the form: given that something is a counterfeit bill, some percentage of counterfeit bills have blurry text. And using that information, I can calculate something I might want to know: given that I see blurry text on a bill, what is the probability that the bill is counterfeit? So given one conditional probability, I can calculate the other conditional probability as well. And so now we've taken a look at a couple of different types of probability. We've looked at unconditional probability, where I just ask the probability of an event occurring given no additional evidence, and we've looked at conditional probability, where I have some evidence and would like to use it to calculate some other probability. The other kind of probability that will be important for us to think about is joint probability, which considers the likelihood of multiple different events simultaneously. What do we mean by this? For example, I might have probability distributions that look a little something like this. I want to know the probability distribution of clouds in the morning, and that distribution looks like this: 40% of the time, C, which is my random variable here, is equal to cloudy, and 60% of the time it's not cloudy. So that's a simple probability distribution effectively telling me that 40% of the time it's cloudy. I might also have a probability distribution for rain in the afternoon, where with probability 0.1 it is raining in the afternoon, and with probability 0.9 it is not raining in the afternoon.
And using just these two pieces of information, I don't actually have a whole lot of information about how these two variables relate to each other. But I would if I had access to their joint probability: a value for every combination of these two things, meaning morning cloudy and afternoon rain, morning cloudy and afternoon not rain, morning not cloudy and afternoon rain, and morning not cloudy and afternoon not rain. If I had access to values for each of those four, I'd have more information. That information, organized in a table, is a joint probability distribution: rather than just a probability distribution, it tells me the probability of each of the possible combinations of values that these random variables can take on. So if I want to know the probability that on any given day it is both cloudy and rainy, I look at the cases where it is cloudy and the cases where it is raining, and the intersection of that row and that column is 0.08. So that is the probability that it is both cloudy and rainy, using that information. And using this joint probability table, I can begin to derive other pieces of information, like conditional probabilities. So I might ask a question like: what is the probability distribution of clouds given that I know it is raining? Meaning I know for sure that it's raining; tell me the probability distribution over whether it's cloudy or not, given that I already know it is, in fact, raining. Here I'm using C to stand for that random variable, and I'm looking for a distribution, meaning the answer is not going to be a single value: it's going to be two values, a vector of two values, where the first value is the probability of clouds and the second is the probability that it is not cloudy, and the sum of those two values is going to be 1, because when you add up the probabilities of all the possible worlds, the result must be the number 1. And what do we know about how to calculate a conditional probability? We know that P(A | B) = P(A and B) / P(B). So it means I can calculate the probability distribution of clouds given that it's raining as the probability of clouds and raining, divided by the probability of rain. The comma here, in the probability distribution of clouds and rain, stands in for the word "and": you'll see the logical operator "and" and the comma used interchangeably. This means the probability distribution over clouds, together with the knowledge that it is raining, divided by the probability of rain. And the interesting thing to note here, and what we'll often do to simplify our mathematics, is that dividing by the probability of rain is just dividing by some numerical constant. It is some number; dividing by it is just dividing by a constant, or, in other words, multiplying by the inverse of that constant. And it turns out that oftentimes we can just not worry about its exact value and simply know that it is, in fact, a constant value. We'll see why in a moment.
So instead of expressing this as the joint probability divided by the probability of rain, sometimes we'll just represent it as alpha (α) times the numerator: α times the probability distribution of C together with the knowledge that it is raining. All we've done here is say that this value, 1 over the probability of rain, is really just a constant we're going to divide by (or, equivalently, multiply by the inverse of) at the end; we'll just call it alpha for now and deal with it a little later. But the key idea here, and this is an idea that's going to come up again, is that the conditional distribution of C given rain is proportional to (meaning, just some factor multiplied by) the joint probability of C and rain. So how do we figure this out? Well, the joint probability that it is cloudy and raining is 0.08, and the joint probability that it is not cloudy and raining is 0.02. So we get alpha times that probability distribution: 0.08 for cloudy and rain, 0.02 for not cloudy and rain. But of course, 0.08 and 0.02 don't sum up to the number 1, and we know that in a probability distribution, all the possible values must sum to 1. So we just need to figure out some constant to normalize these values, so to speak: something we can multiply or divide by so that all the probabilities sum to 1. And it turns out that if we multiply both numbers by 10, we get 0.8 and 0.2. The proportions are still equivalent, but now 0.8 plus 0.2 sums to 1. So take a look at this and see if you can understand, step by step, how we're getting from one point to another. The key idea is that by using the joint probabilities (that it is both cloudy and rainy, and that it is not cloudy and rainy), I can figure out the conditional probability given that it's raining, the chance that it's cloudy versus not cloudy, just by multiplying by some normalization constant. And this is what a computer can begin to use to work with these various different types of probabilities.
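Here is that normalization trick as a short sketch in code, using only the two joint entries given in the text:

```python
# The joint entries where it is raining, from the table in the text.
joint_with_rain = {"cloudy": 0.08, "not cloudy": 0.02}

# Normalize: alpha is 1 over the sum of the entries (here 1 / 0.1 = 10).
alpha = 1 / sum(joint_with_rain.values())
conditional = {c: round(alpha * p, 2) for c, p in joint_with_rain.items()}
print(conditional)  # {'cloudy': 0.8, 'not cloudy': 0.2}
```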
And it turns out there are a number of other probability rules that are going to be useful as we explore how to encode into our computers more complex analysis of probabilities, distributions, and random variables. So here are a couple of those important probability rules. One of the simplest is the negation rule: what is the probability of not A? A is an event that has some probability, and I would like to know the probability that A does not occur. It turns out it's just 1 minus P(A), which makes sense: either A happens or A doesn't happen, so when you add up those two cases you must get 1, covering all the possible cases, which means P(not A) must be 1 minus P(A). We've seen an expression for calculating the probability of A and B; we might also reasonably want to calculate the probability of A or B: the probability that one thing happens or another thing happens. So, for example, I might want to calculate: if I roll two dice, a red die and a blue die, what is the likelihood that the red die is a 6 or the blue die is a 6, one or the other? The wrong way to approach it would be to say: the red die comes up as a 6 with probability 1 over 6, the same for the blue die, so add them together and get 2 over 6, otherwise known as 1/3. But this suffers from a problem of overcounting: we've double-counted the case where both the red die and the blue die come up as a 6, counting that instance twice. To resolve this, the actual expression for calculating the probability of A or B uses what we call the inclusion-exclusion formula. I take the probability of A and add it to the probability of B, same as before, but then I need to exclude the cases I've double-counted, so I subtract the probability of A and B. That gives the result: P(A or B) = P(A) + P(B) - P(A and B). I consider all the cases where A is true and all the cases where B is true, and, if you picture a Venn diagram of those cases, I subtract out the middle to get rid of the cases I've overcounted by counting them inside both of the individual expressions. For the two dice, that works out to 1/6 + 1/6 - 1/36 = 11/36.
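A brute-force check of that result, a small sketch enumerating the 36 worlds:

```python
from fractions import Fraction
from itertools import product

worlds = list(product(range(1, 7), repeat=2))  # (red, blue) pairs

def p(event):
    return Fraction(sum(1 for w in worlds if event(w)), len(worlds))

red6 = lambda w: w[0] == 6
blue6 = lambda w: w[1] == 6

# Inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A and B)
p_or = p(red6) + p(blue6) - p(lambda w: red6(w) and blue6(w))
print(p_or)                              # 11/36, not 2/6
print(p(lambda w: red6(w) or blue6(w)))  # 11/36, counted directly
```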
And it’s going to include that summation notation to indicate that I’m summing up, adding up a whole bunch of individual values. So here’s the rule. Looks a lot more complicated, but it’s actually the equivalent exactly the same rule. What I’m saying here is that if I have two random variables, one called x and one called y, well, the probability that x is equal to some value x sub i, this is just some value that this variable takes on. How do I figure it out? Well, I’m going to sum up over j, where j is going to range over all of the possible values that y can take on. Well, let’s look at the probability that x equals xi and y equals yj. So the exact same rule, the only difference here is now I’m summing up over all of the possible values that y can take on, saying let’s add up all of those possible cases and look at this joint distribution, this joint probability, that x takes on the value I care about, given all of the possible values for y. And if I add all those up, then I can get this unconditional probability of what x is equal to, whether or not x is equal to some value x sub i. So let’s take a look at this rule, because it does look a little bit complicated. Let’s try and put a concrete example to it. Here again is that same joint distribution from before. I have cloud, not cloudy, rainy, not rainy. And maybe I want to access some variable. I want to know what is the probability that it is cloudy. Well, marginalization says that if I have this joint distribution and I want to know what is the probability that it is cloudy, well, I need to consider the other variable, the variable that’s not here, the idea that it’s rainy. And I consider the two cases, either it’s raining or it’s not raining. And I just sum up the values for each of those possibilities. In other words, the probability that it is cloudy is equal to the sum of the probability that it’s cloudy and it’s rainy and the probability that it’s cloudy and it is not raining. And so these now are values that I have access to. These are values that are just inside of this joint probability table. What is the probability that it is both cloudy and rainy? Well, it’s just the intersection of these two here, which is 0.08. And the probability that it’s cloudy and not raining is, all right, here’s cloudy, here’s not raining. It’s 0.32. So it’s 0.08 plus 0.32, which just gives us equal to 0.4. That is the unconditional probability that it is, in fact, cloudy. And so marginalization gives us a way to go from these joint distributions to just some individual probability that I might care about. And you’ll see a little bit later why it is that we care about that and why that’s actually useful to us as we begin doing some of these calculations. Last rule we’ll take a look at before transitioning to something a little bit different is this rule of conditioning, very similar to the marginalization rule. But it says that, again, if I have two events, a and b, but instead of having access to their joint probabilities, I have access to their conditional probabilities, how they relate to each other. Well, again, if I want to know the probability that a happens, and I know that there’s some other variable b, either b happens or b doesn’t happen, and so I can say that the probability of a is the probability of a given b times the probability of b, meaning b happened. And given that I know b happened, what’s the likelihood that a happened? And then I consider the other case, that b didn’t happen. So here’s the probability that b didn’t happen. 
And here’s the probability that a happens, given that I know that b didn’t happen. And this is really the equivalent rule just using conditional probability instead of joint probability, where I’m saying let’s look at both of these two cases and condition on b. Look at the case where b happens, and look at the case where b doesn’t happen, and look at what probabilities I get as a result. And just as in the case of marginalization, where there was an equivalent rule for random variables that could take on multiple possible values in a domain of possible values, here, too, conditioning has the same equivalent rule. Again, there’s a summation to mean I’m summing over all of the possible values that some random variable y could take on. But if I want to know what is the probability that x takes on this value, then I’m going to sum up over all the values j that y could take on, and say, all right, what’s the chance that y takes on that value yj? And multiply it by the conditional probability that x takes on this value, given that y took on that value yj. So equivalent rule just using conditional probabilities instead of joint probabilities. And using the equation we know about joint probabilities, we can translate between these two. So all right, we’ve seen a whole lot of mathematics, and we’ve just laid the foundation for mathematics. And no need to worry if you haven’t seen probability in too much detail up until this point. These are the foundations of the ideas that are going to come up as we begin to explore how we can now take these ideas from probability and begin to apply them to represent something inside of our computer, something inside of the AI agent we’re trying to design that is able to represent information and probabilities and the likelihoods between various different events. So there are a number of different probabilistic models that we can generate, but the first of the models we’re going to talk about are what are known as Bayesian networks. And a Bayesian network is just going to be some network of random variables, connected random variables that are going to represent the dependence between these random variables. The odds are most random variables in this world are not independent from each other, but there’s some relationship between things that are happening that we care about. If it is rainy today, that might increase the likelihood that my flight or my train gets delayed, for example. There are some dependence between these random variables, and a Bayesian network is going to be able to capture those dependencies. So what is a Bayesian network? What is its actual structure, and how does it work? Well, a Bayesian network is going to be a directed graph. And again, we’ve seen directed graphs before. They are individual nodes with arrows or edges that connect one node to another node pointing in a particular direction. And so this directed graph is going to have nodes as well, where each node in this directed graph is going to represent a random variable, something like the weather, or something like whether my train was on time or delayed. And we’re going to have an arrow from a node x to a node y to mean that x is a parent of y. So that’ll be our notation. If there’s an arrow from x to y, x is going to be considered a parent of y. And the reason that’s important is because each of these nodes is going to have a probability distribution that we’re going to store along with it, which is the distribution of x given some evidence, given the parents of x. 
So the more intuitive way to think about this is that the parents can be thought of as causes for some effect that we're going to observe. Let's take a look at an actual example of a Bayesian network and think about the types of reasoning that might be involved. Imagine for a moment that I have an appointment out of town, and I need to take a train in order to get to that appointment. So what are the things I might care about? Well, I care about getting to my appointment on time: whether I make it to my appointment and am able to attend it, or whether I miss it. And you might imagine that that's influenced by the train: the train is either on time or it's delayed. But the train itself is also influenced: whether the train is on time or not depends maybe on the rain (is there no rain, light rain, or heavy rain?), and it might also be influenced by other variables, like whether or not there's maintenance on the train track. If there is maintenance on the train track, that probably increases the likelihood that my train is delayed. And so we can represent all of these ideas using a Bayesian network that looks a little something like this. Here I have four nodes representing four random variables that I would like to keep track of. I have one random variable called Rain that can take on three possible values in its domain: none, light, or heavy, for no rain, light rain, or heavy rain. I have a variable called Maintenance for whether or not there is maintenance on the train track, which has two possible values, yes or no. Then I have a random variable called Train indicating whether or not the train was on time, with two possible values in its domain: the train is either on time or delayed. And finally, I have a random variable for whether I make it to my appointment: a variable called Appointment, with two possible values, attend and miss. So here are my four nodes, each of which represents a random variable, each of which has a domain of possible values it can take on. And the arrows, the edges pointing from one node to another, encode some notion of dependence inside this graph: whether I make it to my appointment is dependent on whether the train is on time or delayed, and whether the train is on time or delayed is dependent on two things, given by the two arrows pointing at that node: whether there was maintenance on the train track, and whether it is raining. And just to make things a little more complicated, let's say as well that whether or not there is maintenance on the track is itself influenced by the rain: if there's heavier rain, maybe it's less likely that there's going to be maintenance on the train track that day, because the crew is more likely to want to do maintenance on days when it's not raining, for example. And so these nodes have different relationships between them, but the idea is that we can come up with a probability distribution for any of these nodes based only on its parents. So let's look, node by node, at what these probability distributions might actually look like.
And we'll go ahead and begin with this root node, this rain node here, which is at the top and has no arrows pointing into it, which means its probability distribution is not going to be a conditional distribution. It's not based on anything. I just have some probability distribution over the possible values for the rain random variable. And that distribution might look a little something like this. None, light, and heavy each have a probability. Here I'm saying the likelihood of no rain is 0.7, of light rain is 0.2, of heavy rain is 0.1, for example. So here is a probability distribution for this root node in this Bayesian network. And let's now consider the next node in the network, maintenance. Track maintenance is yes or no. And the general idea of what this distribution is going to encode, at least in this story, is the idea that the heavier the rain is, the less likely it is that there's going to be maintenance on the track, because the people that are doing maintenance on the track probably want to wait until a day when it's not as rainy in order to do the track maintenance, for example. And so what might that probability distribution look like? Well, this now is going to be a conditional probability distribution. Here are the three possible values for the rain random variable, which I'm here just going to abbreviate to R: either no rain, light rain, or heavy rain. And for each of those possible values, either there is yes, track maintenance, or no track maintenance. And those have probabilities associated with them. I see here that if it is not raining, then there is a probability of 0.4 that there's track maintenance and a probability of 0.6 that there isn't. But if there's heavy rain, then here the chance that there is track maintenance is 0.1 and the chance that there is not track maintenance is 0.9. Each of these rows is going to sum up to 1, because each of these rows represents a different value of whether or not it's raining, the three possible values that that random variable can take on, and each is associated with its own probability distribution that is ultimately going to add up to the number 1. So there is our distribution for this random variable called maintenance, about whether or not there is maintenance on the train track. And now let's consider the next variable. Here we have a node inside of our Bayesian network called train that has two possible values, on time and delayed. And this node is going to be dependent upon the two nodes that are pointing towards it: whether the train is on time or delayed depends on whether or not there is track maintenance, and it depends on whether or not there is rain. Heavier rain probably means it's more likely that my train is delayed, and if there is track maintenance, that also probably means it's more likely that my train is delayed as well. And so you could construct a larger probability distribution, a conditional probability distribution, that instead of conditioning on just one variable, as was the case here, is now conditioning on two variables, conditioning both on rain, represented by r, and on maintenance, represented by m. Again, each of these rows has two values that sum up to the number 1, one for whether the train is on time, one for whether the train is delayed. And here I can say something like, all right, if I know there was light rain and track maintenance, OK, that would be r is light and m is yes.
Well, then there is a probability of 0.6 that my train is on time, and a probability of 0.4 that the train is delayed. And you can imagine gathering this data just by looking at real-world data, looking at data about, all right, if I knew that it was light rain and there was track maintenance, how often was the train delayed or not delayed? And you could begin to construct this thing. The interesting challenge is being able to intelligently figure out how you might go about ordering these things, what things might influence other nodes inside of this Bayesian network. And the last thing I care about is whether or not I make it to my appointment. So did I attend or miss the appointment? And ultimately, whether I attend or miss the appointment is influenced by track maintenance, but only indirectly: if there is track maintenance, then my train is more likely to be delayed, and if my train is more likely to be delayed, then I'm more likely to miss my appointment. But what we encode in this Bayesian network are just what we might consider to be the more direct relationships. So the train has a direct influence on the appointment. And given that I know whether the train is on time or delayed, knowing whether there's track maintenance isn't going to give me any additional information that I didn't already have. If I know the value of train, these other nodes up above aren't really going to influence the result. And so here we might represent it using another conditional probability distribution that looks a little something like this. The train can take on two possible values: either my train is on time or my train is delayed. And for each of those two possible values, I have a distribution for what are the odds that I'm able to attend the meeting and what are the odds that I miss the meeting. And obviously, if my train is on time, I'm much more likely to be able to attend the meeting than if my train is delayed, in which case I'm more likely to miss that meeting. So all of these nodes put together here represent this Bayesian network, this network of random variables whose values I ultimately care about, and that have some sort of relationship between them, some sort of dependence, where these arrows from one node to another indicate some dependence, such that I can calculate the probability of some node given the parents that happen to exist there. So now that we've been able to describe the structure of this Bayesian network and the relationships between each of these nodes, by associating each of the nodes in the network with a probability distribution, whether that's an unconditional probability distribution in the case of a root node like rain, or a conditional probability distribution in the case of all of the other nodes, whose probabilities are dependent upon the values of their parents, we can begin to do some computation and calculation using the information inside of those tables. So let's imagine, for example, that I just wanted to compute something simple like the probability of light rain. How would I get the probability of light rain? Well, rain here is a root node. And so if I wanted to calculate that probability, I could just look at the probability distribution for rain and extract from it the probability of light rain, just a single value that I already have access to. But we could also imagine wanting to compute more complex joint probabilities, like the probability that there is light rain and also no track maintenance.
This is a joint probability of two values, light rain and no track maintenance. And the way I might do that is first by starting by saying, all right, well, let me get the probability of light rain. But now I also want the probability of no track maintenance. But of course, this node is dependent upon the value of rain. So what I really want is the probability of no track maintenance, given that I know that there was light rain. And so the expression for calculating this, the probability of light rain and no track maintenance, is really just the probability of light rain multiplied by the probability that there is no track maintenance, given that I know that there is light rain. So I take the unconditional probability of light rain and multiply it by the conditional probability of no track maintenance, given that I know there is light rain. And you can continue to do this again and again for every variable that you want to add into this joint probability that I might want to calculate. If I wanted to know the probability of light rain and no track maintenance and a delayed train, well, that's going to be the probability of light rain, multiplied by the probability of no track maintenance, given light rain, multiplied by the probability of a delayed train, given light rain and no track maintenance. Because whether the train is on time or delayed is dependent upon both of these other two variables, I have two pieces of evidence that go into the calculation of that conditional probability. And each of these three values is just a value that I can look up in one of the individual probability distributions that are encoded into my Bayesian network. And if I wanted a joint probability over all four of the variables, something like the probability of light rain and no track maintenance and a delayed train and that I miss my appointment, well, that's going to be multiplying four different values, one from each of these individual nodes. It's going to be the probability of light rain, then of no track maintenance given light rain, then of a delayed train given light rain and no track maintenance. And then finally, for this node here, for whether I make it to my appointment or not, it's not dependent upon those first two variables, given that I know whether or not the train is on time. I only need to care about the conditional probability that I miss my appointment, given that the train happens to be delayed. And so that's represented here by four probabilities, each of which is located inside of one of these probability distributions for each of the nodes, all multiplied together. And so I can take a query like that and figure out what the joint probability is by multiplying a whole bunch of these individual probabilities from the Bayesian network. But of course, just as with last time, where what I really wanted to do was to be able to get new pieces of information, here, too, this is what we're going to want to do with our Bayesian network. In the context of knowledge, we talked about the problem of inference: given things that I know to be true, can I draw conclusions, make deductions about other facts about the world that I also know to be true? And what we're going to do now is apply the same sort of idea to probability. Using information about which I have some knowledge, whether some evidence or some probabilities, can I figure out, not other variables for certain, but the probabilities of other variables taking on particular values?
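For reference, that last four-variable product, written in symbols (with "light" for light rain and "no" for no track maintenance), is:

$$
P(\text{light}, \text{no}, \text{delayed}, \text{miss}) = P(\text{light}) \cdot P(\text{no} \mid \text{light}) \cdot P(\text{delayed} \mid \text{light}, \text{no}) \cdot P(\text{miss} \mid \text{delayed})
$$

Each factor on the right is a single lookup in one node's table.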
And so here, we introduce the problem of inference in a probabilistic setting, in a case where variables might not necessarily be true for sure, but they might be random variables that take on different values with some probability. So how do we formally define what exactly this inference problem actually is? Well, the inference problem has a couple of parts to it. We have some query, some variable x that we want to compute the distribution for. Maybe I want the probability that I miss my train, or I want the probability that there is track maintenance, something that I want information about. And then I have some evidence variables. Maybe it's just one piece of evidence. Maybe it's multiple pieces of evidence. But I've observed certain variables for some sort of event. So for example, I might have observed that it is raining. This is evidence that I have. I know that there is light rain, or I know that there is heavy rain. And using that evidence, I want to know, what is the probability that my train is delayed, for example? That is a query that I might want to ask based on this evidence. So I have a query, some variable, and evidence, which are some other variables that I have observed inside of my Bayesian network. And of course, that does leave some hidden variables, y. These are variables that are not evidence variables and not query variables. So you might imagine, in the case where I know whether or not it's raining, and I want to know whether my train is going to be delayed or not, the hidden variable, the thing I don't have access to, is something like, is there maintenance on the track? Or am I going to make or not make my appointment, for example? These are variables that I don't have access to. They're hidden because they're not things I observed, and they're also not the query, the thing that I'm asking. And so ultimately, what we want to calculate is the probability distribution of x given e, the event that I observed. So given that I observed some event, say, that it is raining, I would like to know what is the distribution over the possible values of the train random variable. Is it on time? Is it delayed? What's the likelihood of each? And it turns out we can do this calculation just using a lot of the probability rules that we've already seen in action. And ultimately, we're going to take a look at the math at a little bit of a high level, at an abstract level. We can allow computers and programming libraries that already exist to do some of this math for us, but it's good to get a general sense for what's actually happening when this inference process takes place. Let's imagine, for example, that I want to compute the probability distribution of the appointment random variable given some evidence, given that I know that there was light rain and no track maintenance. So there's my evidence, these two variables that I observe the values of. I observe the value of rain: I know there's light rain. And I know that there is no track maintenance going on today. And what I care about knowing, my query, is this random variable appointment. I want to know the distribution of this random variable appointment: what is the chance that I'm able to attend my appointment, and what is the chance that I miss my appointment, given this evidence? And the hidden variable, the information that I don't have access to, is this variable train.
This is information that is not part of the evidence that I see, not something that I observe. But it is also not the query that I'm asking for. And so what might this inference procedure look like? Well, if you recall back from when we were defining conditional probability and doing math with conditional probabilities, we know that a conditional probability is proportional to the joint probability. And we remembered this by recalling that the probability of a given b is just some constant factor alpha multiplied by the probability of a and b. That constant factor alpha turns out to be 1 divided by the probability of b. But the important thing is that it's just some constant multiplied by the joint distribution, the probability that all of these individual things happen. So in this case, I can take the probability of the appointment random variable given light rain and no track maintenance and say that it is just going to be proportional, by some constant alpha, to the joint probability: the probability of a particular value for the appointment random variable and light rain and no track maintenance. Well, all right, how do I calculate this probability of appointment and light rain and no track maintenance, when I need all four of these values to be able to calculate a joint distribution across everything, because, in particular, appointment depends upon the value of train? Well, in order to do that, here I can begin to use that marginalization trick: there are only two ways I can get any configuration of appointment, light rain, and no track maintenance. Either this particular setting of variables happens and the train is on time, or this particular setting of variables happens and the train is delayed. Those are the two possible cases that I would want to consider. And if I add those two cases up, well, then I get the result just by adding up all of the possibilities for the hidden variable, or variables, if there are multiple. But since there's only one hidden variable here, train, all I need to do is iterate over all the possible values for that hidden variable train and add up their probabilities. So this probability expression here becomes the joint probability of appointment, light rain, no track maintenance, and the train being on time, plus the joint probability of appointment, light rain, no track maintenance, and the train being delayed. So I take both of the possible values for train and go ahead and add them up. These are just joint probabilities, which we saw earlier how to calculate, just by going parent by parent and multiplying those probabilities together. And then you'll need to normalize them at the end, speaking at a high level, to make sure that everything adds up to the number 1. So the formula for how you do this, in a process known as inference by enumeration, looks a little bit complicated, but ultimately it looks like this. And let's now try to distill what it is that all of these symbols actually mean. Let's start here. What I care about knowing is the probability of x, my query variable, given some sort of evidence. What do I know about conditional probabilities? Well, a conditional probability is proportional to the joint probability. So it is some alpha, some normalizing constant, multiplied by this joint probability of x and the evidence. And how do I calculate that?
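For reference, the formula being distilled here is the standard inference-by-enumeration identity:

$$
P(X \mid e) = \alpha\, P(X, e) = \alpha \sum_{y} P(X, e, y)
$$

where X is the query variable, e is the observed evidence, y ranges over the values of the hidden variables, and α is the normalizing constant that makes the resulting distribution sum to 1.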
Well, to do that, I'm going to marginalize over all of the hidden variables, all the variables that I don't directly observe the values for. I'm basically going to iterate over all of the possibilities that could happen and just sum them all up. And so I can translate this into a sum over all y, which ranges over all the possible hidden variables and the values that they could take on, and add up all of those individual probabilities. And that is going to allow me to do this process of inference by enumeration. Now, ultimately, it's pretty annoying if we as humans have to do all this math for ourselves. But it turns out this is where computers and AI can be particularly helpful: we can program a computer to understand a Bayesian network, to be able to understand these inference procedures, and to be able to do these calculations. And using the information you've seen here, you could implement a Bayesian network from scratch yourself. But it turns out there are a lot of libraries, especially written in Python, that make it easier to do this sort of probabilistic inference, to be able to take a Bayesian network and do these sorts of calculations, so that you don't need to know and understand all of the underlying math, though it's helpful to have a general sense for how it works. You just need to be able to describe the structure of the network and make queries in order to produce the result. And so let's take a look at an example of that right now. It turns out that there are a lot of possible libraries that exist in Python for doing this sort of inference. It doesn't matter too much which specific library you use; they all behave in fairly similar ways. But the library I'm going to use here is one known as pomegranate. And here inside of model.py, I have defined a Bayesian network, just using the structure and the syntax that the pomegranate library expects. And what I'm effectively doing is just, in Python, creating nodes to represent each of the nodes of the Bayesian network that you saw me describe a moment ago. So here on line four, after I've imported pomegranate, I'm defining a variable called rain that is going to represent a node inside of my Bayesian network. It's going to be a node that follows this distribution, where there are three possible values, none for no rain, light for light rain, heavy for heavy rain, and these are the probabilities of each of those taking place: 0.7 is the likelihood of no rain, 0.2 for light rain, 0.1 for heavy rain. Then after that, we go to the next variable, the variable for track maintenance, for example, which is dependent upon that rain variable. And this, instead of being an unconditional distribution, is a conditional distribution, as indicated by a conditional probability table here. And the idea is that this distribution is conditional on the distribution of rain. So if there is no rain, then the chance of yes, track maintenance, is 0.4, and the chance of no track maintenance is 0.6. Likewise, for light rain, I have a distribution, and for heavy rain, I have a distribution as well. I'm effectively encoding the same information you saw represented graphically a moment ago, but I'm telling this Python program that the maintenance node obeys this particular conditional probability distribution. And we do the same thing for the other random variables as well. Train is a node inside my network whose distribution is a conditional probability table with two parents.
It is dependent not only on rain, but also on track maintenance. And so here I'm saying something like, given that there is no rain and, yes, track maintenance, the probability that my train is on time is 0.8, and the probability that it's delayed is 0.2. And likewise, I can do the same thing for all of the other possible values of the parents of the train node inside of my Bayesian network by saying, for all of those possible values, here is the distribution that the train node should follow. Then I do the same thing for appointment, based on the distribution of the variable train. Then at the end, what I do is actually construct this network by describing what the states of the network are and by adding edges between the dependent nodes. So I create a new Bayesian network, add states to it, one for rain, one for maintenance, one for the train, one for the appointment, and then I add edges connecting the related pieces. Rain has an arrow to maintenance because rain influences track maintenance. Rain also influences the train. Maintenance also influences the train. And train influences whether I make it to my appointment. A final call to bake finalizes the model and does some additional computation. The specific syntax of this is not really the important part. Pomegranate just happens to be one of several different libraries that can all be used for similar purposes, and you could describe and define a library for yourself that implemented similar things. But the key idea here is that someone can design a library for a general Bayesian network whose nodes are defined in terms of their parents. And then all a programmer needs to do, using one of those libraries, is to define what those nodes and those probability distributions are, and we can begin to do some interesting logic based on it. So let's try doing that conditional or joint probability calculation that we did by hand before by going into likelihood.py, where here I'm importing the model that I just defined a moment ago. And here I'd just like to calculate model.probability, which calculates the probability for a given observation. And I'd like to calculate the probability of no rain, no track maintenance, my train is on time, and I'm able to attend the meeting. So, sort of the optimal scenario: there is no rain and no maintenance on the track, my train is on time, and I'm able to attend the meeting. What is the probability that all of that actually happens? And I can calculate that using the library and just print out its probability. And so I'll go ahead and run python likelihood.py. And I see that, OK, the probability is about 0.34. So about a third of the time, everything goes right for me in this case: no rain, no track maintenance, the train is on time, and I'm able to attend the meeting. But I could experiment with this and try to calculate other probabilities as well. What's the probability that everything goes right up until the train, but I still miss my meeting? So no rain, no track maintenance, the train is on time, but I miss the appointment. Let's calculate that probability. And all right, that has a probability of about 0.04. So about 4% of the time, the train will be on time, there won't be any rain, no track maintenance, and yet I'll still miss the meeting. And so this is really just an implementation of the calculation of the joint probabilities that we did before.
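To make that concrete, here is a minimal, library-free sketch of the same calculation. The dictionaries encode the four distributions described earlier; the rows not quoted in this passage (such as the 0.9 chance the train is on time with no rain and no maintenance) are assumptions filled in for illustration, chosen to be consistent with the 0.34 result above.

```python
# P(Rain): unconditional distribution for the root node.
P_rain = {"none": 0.7, "light": 0.2, "heavy": 0.1}

# P(Maintenance | Rain): one row per rain value; each row sums to 1.
P_maint = {"none": {"yes": 0.4, "no": 0.6},
           "light": {"yes": 0.2, "no": 0.8},   # assumed row
           "heavy": {"yes": 0.1, "no": 0.9}}

# P(Train | Rain, Maintenance): conditioned on both parents.
P_train = {("none", "yes"): {"on time": 0.8, "delayed": 0.2},
           ("none", "no"): {"on time": 0.9, "delayed": 0.1},    # assumed row
           ("light", "yes"): {"on time": 0.6, "delayed": 0.4},
           ("light", "no"): {"on time": 0.7, "delayed": 0.3},   # assumed row
           ("heavy", "yes"): {"on time": 0.4, "delayed": 0.6},  # assumed row
           ("heavy", "no"): {"on time": 0.5, "delayed": 0.5}}   # assumed row

# P(Appointment | Train): attending depends only on the train.
P_appt = {"on time": {"attend": 0.9, "miss": 0.1},
          "delayed": {"attend": 0.6, "miss": 0.4}}

# The joint probability is one lookup per node, parents before children.
p = (P_rain["none"]
     * P_maint["none"]["no"]
     * P_train[("none", "no")]["on time"]
     * P_appt["on time"]["attend"])
print(round(p, 4))  # 0.3402, i.e. "about 0.34", matching the output above
```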
What this library is likely doing is first figuring out the probability of no rain, then figuring out the probability of no track maintenance given no rain, then the probability that my train is on time given both of these values, and then the probability that I miss my appointment given that I know that the train was on time. So this, again, is the calculation of that joint probability. And it turns out we can also begin to have our computer solve inference problems as well, to begin to infer, based on evidence that we see, what is the likelihood of other variables also being true. So let's go into inference.py, for example. Here, I'm again importing that exact same model from before, importing all the nodes and all the edges and the probability distributions that are encoded there as well. And now there's a function for doing some sort of prediction. And here, into this model, I pass in the evidence that I observe. So here, I've encoded into this Python program the evidence that I have observed: I have observed the fact that the train is delayed. And that is the value for one of the four random variables inside of this Bayesian network. And using that information, I would like to be able to draw inferences about the values of the other random variables that are inside of my Bayesian network. I would like to make predictions about everything else. So all of the actual computational logic is happening in just these three lines, where I'm making this call to this prediction. Down below, I'm just iterating over all of the states and all the predictions and printing them out so that we can visually see what the results are. But let's find out: given that the train is delayed, what can I predict about the values of the other random variables? Let's go ahead and run python inference.py. I run that, and all right, here is the result that I get. Given the fact that I know that the train is delayed, this being evidence that I have observed, there is about a 46% chance that there was no rain, a 31% chance there was light rain, and a 23% chance there was heavy rain. And I can also see a probability distribution over track maintenance and a probability distribution over whether I'm able to attend or miss my appointment. Now, we know that whether I attend or miss the appointment is only dependent upon the train being delayed or not delayed. It shouldn't depend on anything else. So let's imagine, for example, that I knew that there was heavy rain. That shouldn't affect the distribution for making the appointment. And indeed, if I go up here and add some evidence, saying that I know that the value of rain is heavy, that is evidence that I now have access to. I now have two pieces of evidence: I know that the rain is heavy, and I know that my train is delayed. I can calculate the probability by running this inference procedure again and seeing the result. I know that the rain is heavy. I know my train is delayed. The probability distribution for track maintenance changed: given that I know that there's heavy rain, now it's more likely that there is no track maintenance, 88%, as opposed to 64% before. And now, what is the probability that I make the appointment? Well, that's the same as before. It's still going to be attend the appointment with probability 0.6 and miss the appointment with probability 0.4, because it was only dependent upon whether my train was on time or delayed.
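Tying this back to the inference-by-enumeration formula from earlier, here is a library-free sketch of the appointment query worked through above, with light rain and no track maintenance as the evidence, reusing the CPT dictionaries (including the assumed rows) from the earlier sketch:

```python
# Inference by enumeration for P(Appointment | Rain = light, Maint = no):
# sum out the hidden variable (train), then normalize with alpha.
dist = {}
for appt in ["attend", "miss"]:
    dist[appt] = sum(
        P_rain["light"]
        * P_maint["light"]["no"]
        * P_train[("light", "no")][train]
        * P_appt[train][appt]
        for train in ["on time", "delayed"]   # the hidden variable
    )
alpha = 1 / sum(dist.values())                # normalizing constant
print({appt: alpha * p for appt, p in dist.items()})
```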
And so this inference.py program is implementing that idea of an inference algorithm, figuring out, based on the evidence that I have, what we can infer about the values of the other variables as well. So inference by enumeration is one way of doing this inference procedure: just looping over all of the values the hidden variables could take on and figuring out what the probability is. Now, it turns out this is not particularly efficient. There are definitely optimizations you can make by avoiding repeated work: if you're calculating the same sort of probability multiple times, there are ways of optimizing the program to avoid having to recalculate the same probabilities again and again. But even then, as the number of variables gets large, and as the number of possible values those variables can take on gets large, we're going to have to do a lot of computation, a lot of calculation, to be able to do this inference. And at that point, it might start to get unreasonable in terms of the amount of time that it would take to do this sort of exact inference. And it's for that reason that oftentimes, when it comes to probability and things we're not entirely sure about, we don't always care about doing exact inference and knowing exactly what the probability is. If we can approximate the inference procedure, do some sort of approximate inference, that can be pretty good as well. If I don't know the exact probability, but I have a general sense for the probability, one that I can make increasingly accurate with more time, that's probably pretty good, especially if I can get it even faster. So how could I do approximate inference inside of a Bayesian network? Well, one method is through a procedure known as sampling. In the process of sampling, I'm going to take a sample of all of the variables inside of this Bayesian network here. And how am I going to sample? Well, I'm going to sample one of the values from each of these nodes according to their probability distributions. So how might I take a sample of all these nodes? Well, I'll start at the root. I'll start with rain. Here's the distribution for rain. And I'll go ahead and, using a random number generator or something like it, randomly pick one of these three values. I'll pick none with probability 0.7, light with probability 0.2, and heavy with probability 0.1. So I'll randomly just pick one of them according to that distribution. And maybe in this case, I pick none, for example. Then I do the same thing for the other variables. Maintenance also has a probability distribution. There are three probability distributions here, one per value of rain, but I'm only going to sample from this first row here, because I've already observed in my sample that the value of rain is none. So given that rain is none, I'm going to sample from this distribution to say, all right, what should the value of maintenance be? And in this case, maintenance is going to be, let's just say, yes, which happens 40% of the time in the event that there is no rain, for example. And we'll sample all of the rest of the nodes in this way as well. I want to sample from the train distribution, and I'll sample from this first row here, where there is no rain but there is track maintenance. And I'll sample: 80% of the time, I'll say the train is on time; 20% of the time, I'll say the train is delayed. And finally, we'll do the same thing for whether I make it to my appointment or not.
Did I attend or miss the appointment? We'll sample based on this distribution and maybe say that in this case, I attend the appointment, which happens 90% of the time when the train is actually on time. So by going through these nodes, I can very quickly just do some sampling and get a sample of the possible values that could come up from going through this entire Bayesian network, according to those probability distributions. And where this becomes powerful is if I do this not once, but thousands or tens of thousands of times, and generate a whole bunch of samples all using this distribution. I get different samples. Maybe some of them are the same. But I get a value for each of the possible variables that could come up. And so then if I'm ever faced with a question, a question like, what is the probability that the train is on time? You could do an exact inference procedure. This is no different than the inference problem we had before, where I could just marginalize, look at all the possible other values of the variables, and do the computation of inference by enumeration to find out this probability exactly. But I could also, if I don't care about the exact probability, just sample it, approximate it, to get close. And this is a powerful tool in AI: when we don't need to be exactly right, when we just need to be right with some probability, we can often do so more effectively, more efficiently. And so if here now are all of those possible samples, I'll highlight the ones where the train is on time, ignoring the ones where the train is delayed. And in this case, six out of the eight samples have the train arriving on time. And so in this case, I can estimate the probability that the train is on time as six out of eight. With eight samples, that might not be a great prediction. But if I had thousands upon thousands of samples, then this could be a much better inference procedure for doing these sorts of calculations. So this is a direct sampling method: just do a bunch of samples and then figure out what the probability of some event is. Now, this from before was an unconditional probability, what is the probability that the train is on time, and I did that by looking at all the samples and figuring out which are the ones where the train is on time. But sometimes what I want to calculate is not an unconditional probability, but rather a conditional probability, something like, what is the probability that there is light rain, given that the train is on time? And to do that kind of calculation, well, here are all the samples that I have, and I want to calculate a probability distribution given that I know that the train is on time. So to be able to do that, I can look at the two cases where the train was delayed and ignore or reject them, excluding them from the possible samples that I'm considering. And now I want to look at the remaining cases, where the train is on time. Here are the cases where there is light rain, and I say, OK, these are two out of the six possible cases. That can give me an approximation for the probability of light rain, given the fact that I know the train was on time.
And I do that in almost exactly the same way, just by adding an additional step: when I take each sample, I reject all of the samples that don't match my evidence, and only consider the samples that do match the evidence about which I want to make some sort of calculation. And it turns out, using the libraries that we've had for Bayesian networks, we can begin to implement this same sort of idea, implement rejection sampling, which is what this method is called, to be able to figure out some probability, not via direct inference, but instead by sampling. So what I have here is a program called sample.py. It imports the exact same model. And what I define first is a function to generate a sample. And the way I generate a sample is just by looping over all of the states. The states need to be in some sort of order, to make sure I'm looping in the correct order. But effectively, if it is a conditional distribution, I'm going to sample based on the parents. And otherwise, I'm just going to directly sample the variable, like rain, which has no parents and is just an unconditional distribution. I keep track of all of those parent samples and return the final sample. The exact syntax of this, again, is not particularly important. It just happens to be part of the implementation details of this particular library. The interesting logic is down below. Now that I have the ability to generate a sample, if I want to know the distribution of the appointment random variable given that the train is delayed, well, then I can begin to do calculations like this. Let me take 10,000 samples and assemble all my results in this list called data. I'll go ahead and loop n times, in this case, 10,000 times. I'll generate a sample. And I want to know the distribution of appointment given that the train is delayed. So according to rejection sampling, I'm only going to consider samples where the train is delayed. If the train is not delayed, I'm not going to consider those values at all. So I'm going to say, all right, if I take the sample and look at the value of the train random variable, and the train is delayed, well, let me go ahead and add to the data that I'm collecting the value of the appointment random variable that it took on in this particular sample. So I'm only considering the samples where the train is delayed, and for each of those samples, considering what the value of appointment is. Then at the end, I'm using a Python class called Counter, which quickly counts up all the values inside of a data set, so I can take this list of data and figure out how many times my appointment was made and how many times my appointment was missed. And so this here, with just a couple lines of code, is an implementation of rejection sampling. And I can run it by going ahead and running python sample.py. And when I do that, here is the result I get. This is the result of the Counter: 1,251 times, I attended the meeting, and 856 times, I missed the meeting. And you can imagine, by doing more and more samples, I'll be able to get a better, more accurate result. And this is a randomized process; it's going to be an approximation of the probability. If I run it a different time, you'll notice the numbers are similar, 1,272 and 905, but they're not identical, because there's some randomization, some likelihood that things might be higher or lower.
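A library-free version of this same procedure, reusing the CPT dictionaries from the earlier sketch, might look like this: a forward-sampling function that visits each node after its parents, and a rejection-sampling loop that keeps only the samples matching the evidence (the helper names here are hypothetical; sample.py does this through pomegranate's own objects):

```python
import random
from collections import Counter

def draw(dist):
    """Sample one key from a {value: probability} dictionary."""
    values = list(dist)
    return random.choices(values, weights=[dist[v] for v in values])[0]

def sample():
    """One forward sample through the network, parents before children."""
    rain = draw(P_rain)
    maint = draw(P_maint[rain])
    train = draw(P_train[(rain, maint)])
    appt = draw(P_appt[train])
    return {"rain": rain, "maintenance": maint,
            "train": train, "appointment": appt}

# Rejection sampling for P(Appointment | Train = delayed): generate samples,
# reject the ones that don't match the evidence, and count what's left.
data = [s["appointment"] for s in (sample() for _ in range(10_000))
        if s["train"] == "delayed"]
print(Counter(data))  # e.g. Counter({'attend': ..., 'miss': ...})
```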
And so this is why we generally want to try and use more samples, so that we can have a greater amount of confidence in our result, and be more sure that the result we're getting accurately reflects the actual underlying probabilities that are inherent inside of this distribution. So this, then, was an instance of rejection sampling. And it turns out there are a number of other sampling methods that you could use as well. One problem that rejection sampling has is that if the evidence you're looking for is a fairly unlikely event, well, you're going to be rejecting a lot of samples. If I'm looking for the probability of x given some evidence e, and e is very unlikely to occur, say it occurs maybe once every 1,000 times, then I'm only going to be considering 1 out of every 1,000 samples that I generate, which is a pretty inefficient method for trying to do this sort of calculation. I'm throwing away a lot of samples, and it takes computational effort to be able to generate those samples. So I'd like to not have to do something like that. So there are other sampling methods that can try and address this. One such sampling method is called likelihood weighting. In likelihood weighting, we follow a slightly different procedure, and the goal is to avoid needing to throw out samples that didn't match the evidence. And so what we'll do is start by fixing the values for the evidence variables: rather than sample everything, we're going to fix the values of the evidence variables and not sample those. Then we're going to sample all the other non-evidence variables in the same way, just using the Bayesian network and looking at the probability distributions, sampling all the non-evidence variables. But then what we need to do is weight each sample by its likelihood. If our evidence is really unlikely, we want to make sure that we take into account how likely the evidence was to actually show up in the sample. If I have a sample where the evidence was much more likely to show up than in another sample, then I want to weight the more likely one higher. So we're going to weight each sample by its likelihood, where likelihood is just defined as the probability of all the evidence: given that particular sample, what is the probability that the evidence would occur? Before, all of our samples were weighted equally; they all had a weight of 1 when we were calculating the overall average. In this case, we're going to weight each sample, multiply each sample by its likelihood, in order to get the more accurate distribution. So what would this look like? Well, if I ask the same question, what is the probability of light rain, given that the train is on time, when I do the sampling procedure, I'm going to start by fixing the evidence variable. I'm already going to have in my sample that the train is on time. That way, I don't have to throw out anything; I'm only sampling things where I know the values of the variables that are my evidence are what I expect them to be. So I'll go ahead and sample from rain. And maybe this time, I sample light rain instead of no rain. Then I'll sample from track maintenance and say, maybe, yes, there's track maintenance. Then for train, well, I've already fixed it in place. Train was an evidence variable, so I'm not going to bother sampling again. I'll just go ahead and move on. I'll move on to appointment and go ahead and sample from appointment as well.
So now I've generated a sample. I've generated a sample by fixing this evidence variable and sampling the other three. And the last step is now weighting the sample. How much weight should it have? And the weight is based on how probable it is that the train was actually on time, that this evidence actually happened, given the values of these other variables: light rain and the fact that, yes, there was track maintenance. Well, to do that, I can just go back to the train variable and say, all right, if there was light rain and track maintenance, the likelihood of my evidence, the likelihood that my train was on time, is 0.6. And so this particular sample would have a weight of 0.6. And I could repeat the sampling procedure again and again; each time, every sample would be given a weight according to the probability of the evidence that I see associated with it. And there are other sampling methods that exist as well, but all of them are designed to get at the same idea: to approximate the inference procedure of figuring out the value of a variable. So we've now dealt with probability as it pertains to particular variables that have these discrete values. But what we haven't really considered is how values might change over time. We've considered something like a variable for rain, where rain can take on values of none or light rain or heavy rain. But in practice, when we consider values for variables like rain, we usually want to consider them over time: how do the values of these variables change? What do we do when we're dealing with uncertainty over a period of time? This can come up in the context of weather, for example, if I have sunny days and I have rainy days, and I'd like to know not just what is the probability that it's raining now, but what is the probability that it rains tomorrow, or the day after that, or the day after that. And so to do this, we're going to introduce a slightly different kind of model. Here, we're going to have a random variable for the weather, not just one, but one for every possible time step. And you can define a time step however you like; a simple way is just to use days as your time step. And so we can define a variable called x sub t, which is going to be the weather at time t. So x sub 0 might be the weather on day 0, x sub 1 might be the weather on day 1, x sub 2 the weather on day 2, and so on and so forth. But as you can imagine, if we start to do this over longer and longer periods of time, there's an incredible amount of data that might go into this. If you're keeping track of data about the weather for a year, now suddenly you might be trying to predict the weather tomorrow given 365 days of previous pieces of evidence. And that's a lot of evidence to have to deal with and manipulate and calculate with. Probably nobody knows what the exact conditional probability distribution is for all of those combinations of variables. And so when we're trying to do this inference inside of a computer, when we're trying to reasonably do this sort of analysis, it's helpful to make some simplifying assumptions, some assumptions about the problem that we can just assume are true to make our lives a little bit easier. Even if they're not totally accurate assumptions, if they're close to accurate, or approximate, they're usually pretty good. And the assumption we're going to make is called the Markov assumption, which is the assumption that the current state depends only on a finite fixed number of previous states.
So the current day's weather depends not on all the previous days' weather going back through all of history; rather, I can predict the current day's weather just based on yesterday's weather, or just based on the last two days' weather, or the last three days' weather. But oftentimes, we're going to deal with just the one previous state that helps to predict this current state. And by putting a whole bunch of these random variables together, using this Markov assumption, we can create what's called a Markov chain, where a Markov chain is just some sequence of random variables where each variable's distribution follows that Markov assumption. And so we'll do an example of this where the Markov assumption is: I can predict the weather, is it sunny or rainy (and we'll just consider those two possibilities for now, even though there are other types of weather), based just on the prior day's weather. Using today's weather, I can come up with a probability distribution for tomorrow's weather. And here's what this might look like. It's formatted in terms of a matrix, as you might describe it, as rows and columns of values, where on the left-hand side, I have today's weather, represented by the variable x sub t, and over here in the columns, I have tomorrow's weather, represented by the variable x sub t plus 1. And what this matrix is saying is that if today is sunny, well, then it's more likely than not that tomorrow is also sunny. Oftentimes, the weather stays consistent for multiple days in a row. For example, let's say that if today is sunny, our model says that tomorrow, with probability 0.8, it will also be sunny, and with probability 0.2, it will be raining. And likewise, if today is raining, then it's more likely than not that tomorrow is also raining: with probability 0.7, it'll be raining; with probability 0.3, it will be sunny. So this matrix, this description of how it is we transition from one state to the next state, is what we're going to call the transition model. And using the transition model, you can begin to construct this Markov chain by just predicting, given today's weather, what's the likelihood of tomorrow's weather. And you can imagine doing a similar sampling procedure, where you take this information and sample what tomorrow's weather is going to be, and using that, you sample the next day's weather. And the result of that is that you can form this Markov chain of values x0, x1, x2, and so on: day zero is sunny, the next day is sunny, maybe the next day it changes to raining, then raining, then raining. And the pattern that this Markov chain follows, given the distribution that we had access to, this transition model here, is that when it's sunny, it tends to stay sunny for a little while; the next couple of days tend to be sunny too. And when it's raining, it tends to stay raining as well. And so you get a Markov chain that looks like this, and you can do analysis on it. You can say, given that today is raining, what is the probability that tomorrow is raining? Or you can begin to ask probability questions like, what is the probability of this sequence of five values, sun, sun, rain, rain, rain, and answer those sorts of questions too. And it turns out there are, again, many Python libraries for interacting with models like this: models with distributions and random variables whose values are based on previous variables according to this Markov assumption.
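Before turning to a library, here is what sampling such a chain might look like in plain Python, a minimal sketch using the transition model above, with a hypothetical 50-50 starting pick:

```python
import random

# Transition model: P(tomorrow's weather | today's weather).
transitions = {
    "sun":  {"sun": 0.8, "rain": 0.2},
    "rain": {"sun": 0.3, "rain": 0.7},
}

def sample_chain(days):
    """Sample a sequence of weather states under the Markov assumption."""
    chain = [random.choice(["sun", "rain"])]  # 50-50 starting distribution
    while len(chain) < days:
        row = transitions[chain[-1]]          # condition only on yesterday
        nxt = random.choices(list(row), weights=list(row.values()))[0]
        chain.append(nxt)
    return chain

print(sample_chain(50))  # 50 days of simulated weather
```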
And pomegranate, too, has ways of dealing with these sorts of variables. So I'll go ahead and go into the chain directory, where I have some information about Markov chains. And here, I've defined a file called model.py, where I've defined, in a very similar syntax, a Markov chain. And again, the exact syntax doesn't matter so much as the idea that I'm encoding this information into a Python program so that the program has access to these distributions. I've here defined some starting distribution; every Markov model begins at some point in time, and I need to give it some starting distribution. And so we'll just say that at the start, you can pick 50-50 between sunny and rainy: it's sunny 50% of the time, rainy 50% of the time. And then down below, I've defined the transition model, how it is that I transition from one day to the next. And here, I've encoded that exact same matrix from before: if it was sunny today, then with probability 0.8, it will be sunny tomorrow, and it'll be rainy tomorrow with probability 0.2. And I likewise have another distribution for if it was raining today instead. And so that alone defines the Markov model, and you can begin to answer questions using that model. But one thing I'll just do is sample from the Markov chain. It turns out there is a method built into this Markov chain library that allows me to sample 50 states from the chain, basically just simulating 50 instances of weather. And so let me go ahead and run this: python model.py. And when I run it, it's going to sample from this Markov chain 50 states, 50 days' worth of weather that it's just going to randomly sample. And you can imagine sampling many times to be able to get more data and do more analysis. But here, for example, it's sunny two days in a row, then rainy a whole bunch of days in a row before it changes back to sun. And so you get this model that follows the distribution that we originally described: sunny days tend to lead to more sunny days, and rainy days tend to lead to more rainy days. And that, then, is a Markov model. And Markov models rely on us knowing the values of these individual states. I know that today is sunny or that today is raining, and using that information, I can draw some sort of inference about what tomorrow is going to be like. But in practice, this often isn't the case. It often isn't the case that I know for certain what the exact state of the world is. Oftentimes, the exact state of the world is unknown, but I'm able to somehow sense some information about that state. A robot or an AI doesn't have exact knowledge about the world around it, but it has some sort of sensor, whether that sensor is a camera, or sensors that detect distance, or just a microphone that is sensing audio, for example. It is sensing data, and that data is somehow related to the state of the world, even if the AI doesn't actually know what the underlying true state of the world is. And for that, we need to get into the world of sensor models: a way of describing how we relate the hidden state, the underlying true state of the world, to the observation, the thing that the AI actually knows or has access to. And so, for example, a hidden state might be a robot's position. If a robot is exploring new, uncharted territory, the robot likely doesn't know exactly where it is. But it does have an observation.
It has sensor data, where it can sense how far away possible obstacles around it are. And using that information, using the observed information that it has, it can infer something about the hidden state, because what the true hidden state is influences those observations. Whatever the robot's true position is has some effect upon the sensor data the robot is able to collect, even if the robot doesn't actually know for certain what its true position is. Likewise, if you think about a voice recognition or speech recognition program that listens to you and is able to respond to you, something like Alexa, or what Apple and Google are doing with their voice recognition as well, you might imagine that the hidden state, the underlying state, is what words are actually spoken. The true nature of the world contains you saying a particular sequence of words, but your phone or your smart home device doesn't know for sure exactly what words you said. The only observation that the AI has access to is some audio waveform. And that audio waveform is, of course, dependent upon this hidden state, and you can infer, based on the audio waveform, what the words spoken likely were. But you might not know with 100% certainty what that hidden state actually is, and the task might be to predict, given this observation, given these audio waveforms, what the actual words spoken are. And likewise, you might imagine that on a website, true user engagement might be information you don't directly have access to. But you can observe data, like website or app analytics, about how often this button was clicked, or how often people are interacting with a page in a particular way, and you can use that to infer things about your users as well. So this type of problem comes up all the time when we're dealing with AI and trying to infer things about the world. Often, AI doesn't really know the hidden true state of the world. All the AI has access to is some observation that is related to the hidden true state. But it's not direct. There might be some noise there: the audio waveform might have some additional noise that might be difficult to parse, and the sensor data might not be exactly correct. There's some noise that might not allow you to conclude with certainty what the hidden state is, but can allow you to infer what it might be. And so the simple example we'll take a look at here imagines the hidden state as the weather, whether it's sunny or rainy. And imagine you are programming an AI inside of a building that has access to just a camera inside the building, and all you have access to is an observation as to whether or not employees are bringing umbrellas into the building. And using that information, you want to predict whether it's sunny or rainy, even if you don't know what the underlying weather is. So the underlying weather might be sunny or rainy, and if it's raining, obviously people are more likely to bring an umbrella. And so whether or not people bring an umbrella, your observation, tells you something about the hidden state. And of course, this is a bit of a contrived example, but the idea is to think about this more broadly: any time you observe something, that observation has to do with some underlying hidden state.
And so to try and model this type of idea, where we have these hidden states and observations, rather than just use a Markov model, which has state, state, state, state, each connected by that transition matrix that we described before, we're going to use what we call a hidden Markov model. It's very similar to a Markov model, but this is going to allow us to model a system that has hidden states that we don't directly observe, along with some observed event that we do actually see. And so in addition to that transition model that we still need (given the underlying state of the world, sunny or rainy, what's the probability of tomorrow's weather?), we also need another model that, given some state, is going to give us an observation: green, yes, someone brings an umbrella into the office, or red, no, nobody brings umbrellas into the office. And so the observation might be that if it's sunny, then odds are nobody is going to bring an umbrella to the office, but maybe some people are just being cautious, and they do bring an umbrella to the office anyway. And if it's raining, then with much higher probability, people are going to bring umbrellas into the office. But maybe, if the rain was unexpected, people didn't bring an umbrella, and so it might have some other probability as well. And so using the observations, you can begin to predict, with reasonable likelihood, what the underlying state is, even if you don't actually get to observe the underlying state, if you don't get to see what the hidden state actually is. This here we'll often call the sensor model. It's also often called the emission probabilities, because the underlying state emits some sort of emission that you then observe; that can be another way of describing the same idea. And the sensor Markov assumption that we're going to use is the assumption that the evidence variable, the thing we observe, the emission that gets produced, depends only on the corresponding state, meaning I can predict whether or not people will bring umbrellas based entirely on whether it is sunny or rainy today. Of course, again, this assumption might not hold in practice: whether or not people bring umbrellas might depend not just on today's weather, but also on yesterday's weather and the day before. But for simplification purposes, it can be helpful to apply this sort of assumption, just to allow us to reason about these probabilities a little more easily. And if we're able to approximate it, we can still often get a very good answer. And so what these hidden Markov models end up looking like is a little something like this, where now, rather than just have one chain of states, like sun, sun, rain, rain, rain, we instead have this upper level, which is the underlying state of the world: is it sunny or is it rainy? And those are connected by that transition matrix we described before. But each of these states produces an emission, produces an observation that I see: on this day, it was sunny and people didn't bring umbrellas; on this day, it was sunny, but people did bring umbrellas; on this day, it was raining and people did bring umbrellas; and so on and so forth. And so each of these underlying states, represented by x sub t for t equals 0, 1, 2, and so on, produces some sort of observation or emission, which is what the e stands for: e sub 0, e sub 1, e sub 2, and so forth.
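Concretely, a hidden Markov model for this example is just three pieces: a starting distribution, the transition model, and the sensor (emission) model. Here is a minimal, library-free sketch using the probabilities from this lecture, together with the standard Viterbi algorithm, which is the usual way to compute the "most likely explanation" task discussed below. This is an illustrative sketch under those assumptions, not the pomegranate implementation:

```python
start = {"sun": 0.5, "rain": 0.5}
transition = {"sun": {"sun": 0.8, "rain": 0.2},
              "rain": {"sun": 0.3, "rain": 0.7}}
emission = {"sun": {"umbrella": 0.2, "no umbrella": 0.8},
            "rain": {"umbrella": 0.9, "no umbrella": 0.1}}

def most_likely_states(observations):
    """Viterbi: the most probable hidden-state sequence for the observations."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (start[s] * emission[s][observations[0]], [s]) for s in start}
    for obs in observations[1:]:
        step = {}
        for s in start:
            # Pick the predecessor that maximizes the probability of a path
            # ending in state s, then extend that path with s.
            prev = max(best, key=lambda p: best[p][0] * transition[p][s])
            prob = best[prev][0] * transition[prev][s] * emission[s][obs]
            step[s] = (prob, best[prev][1] + [s])
        best = step
    return max(best.values())[1]

print(most_likely_states(
    ["umbrella", "umbrella", "no umbrella", "umbrella", "umbrella"]))
```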
And so this, too, is a way of trying to represent this idea. And what you want to think about is that these underlying states are the true nature of the world, like the robot's position as it moves over time, and that produces some sort of sensor data that might be observed; or what people are actually saying, where you use the emission data, the audio waveforms you detect, to process that data and try to figure it out. And there are a number of possible tasks that you might want to do given this kind of information. One of the simplest is trying to infer something about the future or the past or about these hidden states that might exist. The tasks that you'll often see, and we're not going to go into the mathematics of these tasks, are all based on the same idea of conditional probabilities, using the probability distributions we have to draw these sorts of conclusions. One task is called filtering, which is: given observations from the start until now, calculate the distribution for the current state. Meaning, given information from the beginning of time until now about which days people brought an umbrella and which days they didn't, can I calculate the probability of the current state, whether today it is sunny or rainy? Another task is prediction, which is looking toward the future: given observations about people bringing umbrellas from the beginning of time until now, can I figure out the distribution for whether tomorrow it is sunny or rainy? And you can also go backwards via smoothing, where I can say: given observations from the start until now, calculate the distribution for some past state. Like, I know that people brought umbrellas yesterday and today, so given two days' worth of umbrella data, what's the probability that yesterday it was raining? The fact that I know people brought umbrellas today might inform that answer as well; it might influence those probabilities. And there's also a most likely explanation task, in addition to other tasks that might exist, which combines some of these: given observations from the start up until now, figure out the most likely sequence of states. And this is what we're going to take a look at now, this idea that if I have all these observations, umbrella, no umbrella, umbrella, no umbrella, can I calculate the most likely sequence of states, sun, rain, sun, rain, and so on, that actually represented the true weather that would produce these observations? This is quite common when you're trying to do something like voice recognition, for example: you have these emissions of audio waveforms, and you would like to calculate, based on all of the observations you have, the most likely sequence of actual words, or syllables, or sounds that the user made when they were speaking to this particular device, among other tasks that might come up in that context. And so we can try this out by going into the HMM directory, HMM for hidden Markov model. And here, what I've done is define a model that first defines my possible states, sun and rain, along with their emission probabilities, the observation model or the emission model, where, given that I know it's sunny, the probability that I see people bring an umbrella is 0.2 and the probability of no umbrella is 0.8.
And likewise, if it's raining, then people are more likely to bring an umbrella: umbrella has probability 0.9, no umbrella has probability 0.1. So the actual underlying hidden states are sun and rain, but the things that I observe, the observations I can see, are either umbrella or no umbrella. To this I also need to add a transition matrix, same as before, saying that if today is sunny, then tomorrow is more likely to be sunny, and if today is rainy, then tomorrow is more likely to be rainy. As before, I give it some starting probabilities, saying that at first there's a 50-50 chance of it being sunny or rainy. And then I can create the model based on that information. Again, the exact syntax of this is not so important so much as the data that I am now encoding into a program, such that now I can begin to do some inference. So I can give my program, for example, a list of observations: umbrella, umbrella, no umbrella, umbrella, umbrella, and so forth, no umbrella, no umbrella. And I would like to figure out the most likely explanation for these observations. Is it more likely that it was rain the whole way through, or is it more likely that on some day it was actually sunny and then switched back to being rainy? That's an interesting question. We might not be sure, because it might just be that on a rainy day people happened to decide not to bring an umbrella. Or it could be that it switched from rainy to sunny and back to rainy, which doesn't seem too likely, but it certainly could happen. And using the data we give to the hidden Markov model, our model can begin to predict these answers, can begin to figure it out. So we're going to go ahead and predict these observations, and then for each of those predictions, print out what the prediction is. This library just so happens to have a function called predict that does this prediction process for me. So I'll run python sequence.py, and the result I get is the prediction, based on the observations, of what all of those states are likely to be. In this case, it thinks that what most likely happened is that it was rainy, then sunny for a day, and then went back to being rainy. But in different situations, if it was rainy for longer, maybe, or if the probabilities were slightly different, you might imagine that it's more likely that it was rainy all the way through, and it just so happened that on one rainy day people decided not to bring umbrellas. And so here, too, Python libraries can allow for this sort of inference procedure. By taking what we know and putting it in terms of these general tasks that already exist for hidden Markov models, then any time we can take an idea and formulate it as a hidden Markov model, formulate it as something that has hidden states and observed emissions that result from those states, we can take advantage of the algorithms that are known to exist for trying to do this sort of inference.
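To make the most likely explanation task concrete, here is a minimal from-scratch sketch of the same idea using the Viterbi algorithm, rather than a library's predict function. The emission probabilities (0.2/0.8 for sun, 0.9/0.1 for rain) and the 50-50 start come from the example above; the transition values (0.8 for sun staying sunny, 0.7 for rain staying rainy) and the observation list are assumptions for illustration, since the text only says that tomorrow's weather tends to match today's:

```python
start = {"sun": 0.5, "rain": 0.5}
transitions = {
    "sun":  {"sun": 0.8, "rain": 0.2},   # assumed values
    "rain": {"sun": 0.3, "rain": 0.7},   # assumed values
}
emissions = {
    "sun":  {"umbrella": 0.2, "no umbrella": 0.8},
    "rain": {"umbrella": 0.9, "no umbrella": 0.1},
}

def most_likely_states(observations):
    """Viterbi: track, for each state, the best path ending there."""
    first = observations[0]
    best = {s: (start[s] * emissions[s][first], [s]) for s in start}
    for obs in observations[1:]:
        new_best = {}
        for s in transitions:
            # Pick the predecessor that maximizes the path probability.
            prob, path = max(
                (best[prev][0] * transitions[prev][s], best[prev][1])
                for prev in transitions
            )
            new_best[s] = (prob * emissions[s][obs], path + [s])
        best = new_best
    return max(best.values())[1]

# A stand-in observation sequence, not the lecture's exact list.
print(most_likely_states(["umbrella", "umbrella", "no umbrella", "umbrella"]))
```

With a longer run of umbrella days, the rainy-throughout explanation tends to win out, matching the intuition above.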
So now we've seen a couple of ways that AI can begin to deal with uncertainty. We've taken a look at probability and how we can use probability to describe numerically things that are more likely or less likely to happen than other events or variables. And using that information, we can begin to construct these standard types of models, things like Bayesian networks and Markov chains and hidden Markov models, that all allow us to describe how particular events relate to other events, or how the values of particular variables relate to other variables, not for certain, but with some sort of probability distribution. And by formulating things in terms of these models that already exist, we can take advantage of Python libraries that implement these sorts of models already and allow us to use them to produce some sort of resulting effect. So all of this allows our AI to begin to deal with these sorts of uncertain problems, so that our AI doesn't need to know things for certain but can infer based on information it doesn't know. Next time, we'll take a look at additional types of problems that we can solve by taking advantage of AI-related algorithms, even beyond the world of the types of problems we've already explored. We'll see you next time. OK. Welcome back, everyone, to an introduction to artificial intelligence with Python. So far, we've taken a look at a couple of different types of problems. We've seen classical search problems, where we're trying to get from an initial state to a goal by figuring out some optimal path. We've taken a look at adversarial search, where we have a game-playing agent trying to make the best move. We've seen knowledge-based problems, where we're trying to use logic and inference to draw some additional conclusions. And we've seen some probabilistic models as well, where we might not have certain information about the world, but we want to use the knowledge about probabilities that we do have to draw some conclusions. Today, we're going to turn our attention to another category of problems, generally known as optimization problems, where optimization is really all about choosing the best option from a set of possible options. We've already seen optimization in some contexts, like game playing, where we're trying to create an AI that chooses the best move out of a set of possible moves. But what we'll take a look at today is a category of problems, and algorithms to solve them, that can be used to deal with a broader range of potential optimization problems. The first of the algorithms we'll take a look at is known as local search. Local search differs from the search algorithms we've seen before in that algorithms like breadth-first search or A-star search generally maintain a whole bunch of different paths that we're simultaneously exploring, looking at a bunch of different paths at once while trying to find our way to the solution. In local search, on the other hand, the algorithm is really just going to maintain a single node, looking at a single state, and we'll generally run it by maintaining that single node and then moving ourselves to one of the neighboring nodes throughout the search process. And this is generally useful in contexts unlike the problems we've seen before, such as a maze-solving situation where we're trying to find our way from the initial state to the goal by following some path. Local search is most applicable when we really don't care about the path at all, and all we care about is the solution itself.
And in the case of solving a maze, the solution was always obvious: you could point to the solution, you know exactly what the goal is, and the real question is what the path to get there is. But local search comes up in cases where figuring out exactly what the solution is, exactly what the goal looks like, is actually the heart of the challenge. To give an example of one of these kinds of problems, we'll consider a scenario where we have two types of buildings: houses and hospitals. Our goal might be, in a world that's formatted as a grid with a whole bunch of houses, a house here, a house there, two houses over there, to find a way to place two hospitals on this map, maybe a hospital here and a hospital there. And we want to do so with some sort of objective, which in this case is to minimize the distance of any of the houses from a hospital. So you might imagine, all right, what's the distance from each of the houses to their nearest hospital? There are a number of ways we could calculate that distance, but one way is using a heuristic we've looked at before: the Manhattan distance, this idea of how many rows and columns you would have to move inside of this grid layout in order to get to a hospital. And it turns out, if you take each of these four houses and figure out how close they are to their nearest hospital, you get something like this, where this house is three away from a hospital, this house is six away, and these two houses are each four away. If you add all those numbers up together, you get a total cost of 17. So for this particular configuration of hospitals, a hospital here and a hospital there, that state, we might say, has a cost of 17. And the goal of this problem, which we would like to apply a search algorithm to figure out, is: can we find a way to minimize that cost, the total you get if you sum up all of the distances from all the houses to their nearest hospital? How can we minimize that final value?
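As a quick sketch, that cost-of-17 calculation, summing each house's Manhattan distance to its nearest hospital, looks like this in Python (the coordinates here are made up for illustration, not the ones from the example grid):

```python
def cost(houses, hospitals):
    """Sum of each house's Manhattan distance to its nearest hospital."""
    return sum(
        min(abs(hr - r) + abs(hc - c) for r, c in hospitals)
        for hr, hc in houses
    )

# Hypothetical (row, column) positions.
houses = [(0, 1), (2, 8), (5, 2), (6, 7)]
hospitals = [(3, 3), (4, 9)]
print(cost(houses, hospitals))
```

This is the kind of cost function that the search algorithms below will try to drive down.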
And if we think about this problem a little more abstractly, abstracting away from this specific problem and thinking more generally about problems like it, you can often formulate these problems as what we'll call a state-space landscape. In a diagram of a state-space landscape, each vertical bar represents a particular state our world could be in; for example, each vertical bar represents a particular configuration of two hospitals. And the height of a vertical bar generally represents some function of that state, some value of that state. Maybe in this case, the height represents the cost of that particular configuration of hospitals, in terms of the sum total of all the distances from all of the houses to their nearest hospital. And generally speaking, when we have a state-space landscape, we want to do one of two things. We might be trying to maximize the value of this function, trying to find a global maximum of the state-space landscape, a single state whose value is higher than all of the other states we could possibly choose from. And in that case, we'll call the function we're trying to optimize an objective function, some function that measures, for any given state, how good that state is, such that we can take any state, pass it into the objective function, and get a value for how good it is. Ultimately, our goal is to find one of these states with the highest possible value for that objective function. An equivalent but reversed problem is the problem of finding a global minimum, some state that has a value, after you pass it into this function, that is lower than all of the other possible values we might choose from. And generally speaking, when we're trying to find a global minimum, we call the function we're calculating a cost function. Each state has some sort of cost, whether that cost is a monetary cost, or a time cost, or, in the case of the houses and hospitals we've been looking at just now, a distance cost in terms of how far away each of the houses is from a hospital. And we're trying to minimize the cost, to find the state that has the lowest possible value of that cost. So these are the general kinds of things we might be going for within a state-space landscape: trying to find a global maximum, or trying to find a global minimum. And how exactly do we do that? Recall that in local search, we generally operate by maintaining just a single state, some current state represented inside of some node, maybe inside of a data structure where we're keeping track of where we currently are. Then, ultimately, we move from that state to one of its neighbor states. In this one-dimensional space, that's just the state immediately to the left or the right of it, but for any different problem, you might define what it means for there to be a neighbor of a particular state. In the case of the hospitals we were just looking at, for example, a neighbor might be moving one hospital one space to the left or right or up or down: some state that is close to our current state but slightly different, and that as a result might have a slightly different value in terms of its objective function or cost function. So this is going to be our general strategy in local search: take a state, maintain some current node, and move where we're looking in the state-space landscape, in order to try to find a global maximum or a global minimum. Perhaps the simplest algorithm we could use to implement this idea of local search is an algorithm known as hill climbing. The basic idea of hill climbing is, let's say I'm trying to maximize the value of my state, trying to figure out where the global maximum is. I'm going to start at a state, and hill climbing considers the neighbors of that state: from this state, I could go left or I could go right, and this neighbor happens to be higher and this neighbor happens to be lower. In hill climbing, if I'm trying to maximize the value, I'll generally pick the highest one I can between the states to the left and right of me. This one is higher, so I'll go ahead and move myself to consider that state instead.
And then I’ll repeat this process, continually looking at all of my neighbors and picking the highest neighbor, doing the same thing, looking at my neighbors, picking the highest of my neighbors, until I get to a point like right here, where I consider both of my neighbors and both of my neighbors have a lower value than I do. This current state has a value that is higher than any of its neighbors. And at that point, the algorithm terminates. And I can say, all right, here I have now found the solution. And the same thing works in exactly the opposite way for trying to find a global minimum. But the algorithm is fundamentally the same. If I’m trying to find a global minimum and say my current state starts here, I’ll continually look at my neighbors, pick the lowest value that I possibly can, until I eventually, hopefully, find that global minimum, a point at which when I look at both of my neighbors, they each have a higher value. And I’m trying to minimize the total score or cost or value that I get as a result of calculating some sort of cost function. So we can formulate this graphical idea in terms of pseudocode. And the pseudocode for hill climbing might look like this. We define some function called hill climb that takes as input the problem that we’re trying to solve. And generally, we’re going to start in some sort of initial state. So I’ll start with a variable called current that is keeping track of my initial state, like an initial configuration of hospitals. And maybe some problems lend themselves to an initial state, some place where you begin. In other cases, maybe not, in which case we might just randomly generate some initial state, just by choosing two locations for hospitals at random, for example, and figuring out from there how we might be able to improve. But that initial state, we’re going to store inside of current. And now, here comes our loop, some repetitive process we’re going to do again and again until the algorithm terminates. And what we’re going to do is first say, let’s figure out all of the neighbors of the current state. From my state, what are all of the neighboring states for some definition of what it means to be a neighbor? And I’ll go ahead and choose the highest value of all of those neighbors and save it inside of this variable called neighbor. So keep track of the highest-valued neighbor. This is in the case where I’m trying to maximize the value. In the case where I’m trying to minimize the value, you might imagine here, you’ll pick the neighbor with the lowest possible value. But these ideas are really fundamentally interchangeable. And it’s possible, in some cases, there might be multiple neighbors that each have an equally high value or an equally low value in the minimizing case. And in that case, we can just choose randomly from among them. Choose one of them and save it inside of this variable neighbor. And then the key question to ask is, is this neighbor better than my current state? And if the neighbor, the best neighbor that I was able to find, is not better than my current state, well, then the algorithm is over. And I’ll just go ahead and return the current state. If none of my neighbors are better, then I may as well stay where I am, is the general logic of the hill climbing algorithm. But otherwise, if the neighbor is better, then I may as well move to that neighbor. 
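Translated into Python, that pseudocode might look like the following minimal sketch, where `neighbors` and `value` are hypothetical problem-specific helpers (the set of neighboring states and the objective function being maximized):

```python
import random

def hill_climb(initial_state, neighbors, value):
    """Repeatedly move to the best neighbor until none is better."""
    current = initial_state
    while True:
        candidates = neighbors(current)
        best_value = max(value(n) for n in candidates)
        # Break ties randomly among equally good neighbors.
        neighbor = random.choice(
            [n for n in candidates if value(n) == best_value]
        )
        if value(neighbor) <= value(current):
            return current  # no neighbor is better: a local maximum
        current = neighbor
```

For a minimization problem like the hospitals, you would flip the comparisons: take the lowest-valued neighbor and stop once it is no lower than the current state.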
So you might imagine setting current equal to neighbor, where the general idea is that if I'm at a current state and I see a neighbor that is better than me, I'll go ahead and move there. Then I repeat the process, continually moving to a better neighbor, until I reach a point at which none of my neighbors are better than I am, and at that point the algorithm can just terminate. So let's take a look at a real example of this with the houses and hospitals. We've seen that if we put the hospitals in these two locations, that has a total cost of 17. And now, if we're going to implement this hill climbing algorithm, we need to define what it means to take this particular configuration of hospitals, this particular state, and get a neighbor of that state. A simple definition of neighbor might be: pick one of the hospitals and move it by one square to the left, right, up, or down. That would mean we have six possible neighbors from this particular configuration: we could take this hospital and move it to any of its three open adjacent squares, or take the other hospital and move it to any of its three open adjacent squares. Each of those generates a neighbor. And what I might do is say, all right, here are the locations and the distances between each of the houses and their nearest hospital. Let me consider all of the neighbors and see if any of them can do better than a cost of 17. It turns out there are a couple of ways we could do that, and if multiple ways are tied for best, we can just choose among them randomly. One such way is to take a look at this hospital on the right and consider the directions in which it might move. If we take this hospital and move it one square up, for example, that doesn't really help us: it gets closer to the house up here, but further away from the house down here, and it doesn't change anything for the two houses along the left-hand side. If we move it one square down, it's the opposite problem: it gets further away from the house up above, and closer to the house down below. The real goal should be to take this hospital and move it one square to the left. By moving it one square to the left, we move it closer to both of these houses on the right without changing anything about the houses on the left: for them, the other hospital is still the closer one, so they aren't affected. So we're able to improve the situation by picking a neighbor that results in a decrease in our total cost, and we might do that: move ourselves from this current state to a neighbor by just taking that hospital and moving it. At this point, there's not a whole lot more that can be done with this hospital, but there are still other optimizations we can make, other neighbors we can move to that are going to have a better value. If we consider the other hospital, for example, we might notice that right now it's a bit far up, and both of its houses are a little bit lower. So we might be able to do better by taking this hospital and moving it one square down, so that now, instead of a cost of 15, we're down to a cost of 13 for this particular configuration. And we can do even better by taking the hospital and moving it one square to the left, so that now, instead of a cost of 13, we have a cost of 11.
This house is now one away from the hospital, this one is four away, this one is three away, and this one is also three away. So we've been able to do much better than that initial cost of 17, just by taking every state and asking ourselves whether we can do better by making small incremental changes, moving to a neighbor, then another neighbor, then another neighbor after that. And at this point, we can see that the algorithm is going to terminate: there's actually no neighbor we can move to that improves the situation, that gets us a cost less than 11. If we take this hospital and move it up or to the right, that's going to move it further away from the houses. If we move it down, that doesn't really change the situation: it gets further away from this house but closer to that house. And likewise, the same story is true for the other hospital: any neighbor we move it to, up, left, down, or right, is either going to move it further from the houses and increase the cost, or have no effect on the cost whatsoever. And so the question we might now ask is: is this the best we could do? Is this the best placement of the hospitals we could possibly have? It turns out the answer is no, because there's a better way to place these hospitals. There are a number of ways you could do it, but one of them is by taking this hospital here and moving it to this square, moving it diagonally by one square, which was not part of our definition of neighbor; we could only move left, right, up, or down. But this placement is, in fact, better: it has a total cost of 9. It is now closer to both of these houses, and as a result, the total cost is less. But we weren't able to find it, because in order to get there, we had to go through a state that actually wasn't any better than the current state we were at previously. And so this appears to be a limitation, or a concern you might have as you go about trying to implement a hill climbing algorithm: it might not always give you the optimal solution. If we're trying to maximize the value of a state, trying to find the global maximum, a concern is that we could get stuck at one of the local maxima, highlighted here in blue, where a local maximum is any state whose value is higher than all of its neighbors. If we ever find ourselves at one of these two states when we're trying to maximize value, we're not going to make any changes: we won't move left or right, because those states are worse. But we haven't found the global optimum; we haven't done as well as we could. And likewise, in the case of the hospitals, what we're ultimately trying to do is find a global minimum, a value lower than all the others, but we have the potential to get stuck at one of the local minima: any state whose value is lower than all of its neighbors, but still not as low as the global minimum. And so the takeaway here is that it's not always going to be the case that, when we run this naive hill climbing algorithm, we find the optimal solution. There are things that could go wrong.
If we started here, for example, and tried to maximize our value as much as possible, we might move to the highest possible neighbor, then the highest possible neighbor again, and again, and stop, never realizing that there's actually a better state way over there that we could have gone to instead. And other problems you might notice just by looking at this state-space landscape are various kinds of plateaus, something like this flat local maximum here, where all six of these states have the exact same value. In the algorithm we showed before, none of the neighbors are better, so we might just get stuck at this flat local maximum. And even if you allowed yourself to move to one of the neighbors, it wouldn't be clear which neighbor you would move to, and you could get stuck here as well. There's another one over here, called a shoulder. It's not really a local maximum, because there are still places we can go higher, and not a local minimum, because we can go lower. We can still make progress, but it's still a flat area where a local search algorithm has the potential to get lost, unable to make upward or downward progress, depending on whether we're trying to maximize or minimize, and therefore another way we might end up with a solution that isn't actually optimal. And so, because of this potential for hill climbing to not always find the optimal result, it turns out there are a number of varieties and variations of the hill climbing algorithm that help to solve the problem better depending on the context; depending on the specific type of problem, some of these variants might be more applicable than others. What we've taken a look at so far is a version of hill climbing generally called steepest-ascent hill climbing, where the idea is that we choose the highest-valued neighbor in the case where we're trying to maximize, or the lowest-valued neighbor in cases where we're trying to minimize. Generally speaking, if I have five neighbors and they're all better than my current state, I will pick the best one of those five. Now, sometimes that might work pretty well. It's sort of a greedy approach of trying to take the best operation at any particular time step. But it might not always work: there might be cases where I actually want to choose an option that is slightly better than me, but not the best one, because that might later lead to a better outcome ultimately. So there are other variants we might consider of this basic hill climbing algorithm. One is known as stochastic hill climbing, where we choose randomly from all of our higher-valued neighbors. So if I'm at my current state and there are five neighbors that are all better than I am, rather than choosing the best one, as steepest ascent would do, stochastic hill climbing will just choose randomly from one of them, thinking that if it's better, then it's better, and maybe there's a potential to make forward progress even if it is not locally the best option I could possibly choose. First-choice hill climbing operates on a similar idea, but just takes the very first better neighbor it finds: rather than considering all of the neighbors, as soon as we find a neighbor that is better than our current state, we'll go ahead and move there. Both of these selection rules are easy to express in code, as sketched below.
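Here is a minimal sketch of the two selection rules just described, again assuming the same hypothetical `neighbors` and `value` helpers as before; both return None when no better neighbor exists:

```python
import random

def stochastic_choice(current, neighbors, value):
    """Stochastic hill climbing: pick randomly among all better neighbors."""
    better = [n for n in neighbors(current) if value(n) > value(current)]
    return random.choice(better) if better else None

def first_choice(current, neighbors, value):
    """First-choice hill climbing: take the first better neighbor found."""
    for n in neighbors(current):
        if value(n) > value(current):
            return n  # no need to score the remaining neighbors
    return None
```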
There may be some efficiency improvements there, and first choice maybe has the potential to find a solution that the other strategies weren't able to find. And with all of these variants, we still suffer from the same potential risk: that we might end up at a local minimum or a local maximum. We can reduce that risk by repeating the process multiple times. So one variant of hill climbing is random-restart hill climbing, where the general idea is that we conduct hill climbing multiple times. If we apply steepest-ascent hill climbing, for example, we'll start at some random state, try to solve the problem, and figure out what local maximum or minimum we get to; then we'll just randomly restart and try again with a new starting configuration, figure out what the local maximum or minimum is from there, and do this some number of times. After that, we can pick the best result out of all the ones we've taken a look at. So that's another option we have access to. And then, although I said that local search will usually just keep track of a single node and then move to one of its neighbors, there are variants of hill climbing known as local beam searches, where rather than keeping track of just one current best state, we keep track of the k highest-valued states. So rather than starting at one random initial configuration, I might start with three or four or five, generate all of their neighbors, and then pick the three or four or five best of all the neighbors I find, and continually repeat this process, the idea being that now I have more options under consideration, more ways I could potentially navigate myself to the optimal solution that might exist for a particular problem. So let's now take a look at some actual code that can implement some of these kinds of ideas, something like steepest-ascent hill climbing, for example, for trying to solve this hospital problem. I'm going to go into my hospitals directory, where I've set up the basic framework for solving this type of problem. I'll open up hospitals.py, and we'll take a look at the code we've created here. I've defined a class that represents the state space. The space has a height and a width, and also some number of hospitals, so you can configure how big your map is and how many hospitals should go there. We have a function for adding a new house to the state space, and then some functions that get me all of the available spaces, for when I want to randomly place hospitals in particular locations. And here now is the hill climbing algorithm. What are we going to do in the hill climbing algorithm? Well, we're going to start by randomly initializing where the hospitals will go. We don't know where the hospitals should actually be, so let's just randomly place them: here I'm running a loop for each of the hospitals that I have, and for each one I get all of the available spaces and randomly choose one of them as where I would like to add that particular hospital. I have some logging output and I generate some images, which we'll take a look at a little bit later. But here is the key idea: I'm going to just keep repeating this algorithm. I could specify a maximum number of times I want it to run, or I could just run it until it hits a local maximum or local minimum.
And now we consider all of the hospitals that could potentially move, so each of the two hospitals, or more if there are more than that, and all of the places that hospital could move to, some neighboring position we could move it to. And then we see: is this going to be better than where we were currently? If it is, then we update our best neighbor and keep track of this new best neighbor that we found. And then afterwards, we ask the question: if the best neighbor's cost is greater than or equal to the cost of the current set of hospitals, meaning our best neighbor is no better than our current state, well, then we shouldn't make any changes at all, and we just go ahead and return the current set of hospitals. But otherwise, we can update our hospitals to one of the best neighbors, and if there are multiple that are all equivalent, I'm here using random.choice to choose one of them randomly. So this is really just a Python implementation of the same idea we were just talking about: take a current state, some current set of hospitals; generate all of the neighbors, looking at all of the ways we could take one hospital and move it one square to the left or right or up or down; and then figure out, based on all of that information, which is the best neighbor, or the set of best neighbors, and choose one of those. And each time, we generate an image of the current configuration. So now, if we look down at the bottom, I'm going to randomly generate a space with height 10 and width 20, and say go ahead and put three hospitals somewhere in that space. I'll randomly generate 15 houses that I add in random locations. And now I'm going to run this hill climbing algorithm to try to figure out where we should place those hospitals. So we'll run this program by running python hospitals.py, and we see that our initial state had a cost of 72, but we were able to continually find neighbors that decreased that cost: down to 69, 66, 63, and so on, all the way down to 53 as the best neighbor we were ultimately able to find. And we can take a look at what that looked like by just opening up these files. Here, for example, was the initial configuration: we randomly selected a location for each of the 15 different houses, and then randomly selected locations for one, two, three hospitals that were just located somewhere inside of the state space. If you add up all the distances from each of the houses to their nearest hospital, you get a total cost of 72. So now the question is, which neighbors can we move to that improve the situation? It looks like the first move the algorithm found was taking the hospital that was over there on the right and just moving it to the left. That probably makes sense, because if you look at the houses in that general area, these five houses look like they're probably the ones that are going to be closest to this hospital over here. Moving it to the left decreases the total distance to most of those houses, though it does increase the distance for one of them.
And so we're able to make these improvements to the situation by continually finding ways to move these hospitals around, until we eventually settle at this particular state with a cost of 53, where we've figured out a position for each of the hospitals, and none of the neighbors we could move to will improve the situation. We can take each of the three hospitals and look at each of their neighbors, and none of those is going to be better than this particular configuration. And again, that's not to say this is the best we could do. There might be some other configuration of hospitals that is a global minimum, and this might just be a local minimum that is the best of all of its neighbors, but maybe not the best in the entire state space. You could search through the entire state space by considering all of the possible configurations of hospitals, but ultimately that's going to be very time-intensive, especially as our state space gets bigger and there are more and more possible states; it would take quite a long time to look through all of them. So these local search algorithms can often be quite good for finding the best solution we can, especially if we don't care about doing the best possible and just care about doing pretty well, finding a pretty good placement of those hospitals; then these methods can be particularly powerful. But of course, we can try to mitigate some of this concern by using random restart instead of plain hill climbing: rather than hill climbing one time, we hill climb multiple times on the exact same map and take the best result we've been able to find. So I've implemented a function for random restart that restarts some maximum number of times, and what it does is repeat, that number of times, this process of running the hill climbing algorithm, figuring out the cost of getting from all the houses to the hospitals, and checking whether that's better than the best we've done so far. So I can try the exact same idea, except that instead of running hill climbing, I'll run random restart, and I'll randomly restart maybe 20 times, for example. I'll remove all the old images and then rerun the program. This time, the first iteration of hill climbing found a best cost of 56. Each of these lines is a different iteration of the hill climbing algorithm: we're running hill climbing not once but 20 times, each time going until we find a local minimum, and each time checking whether we did better than the best we've done so far. So we went from 56 to 46. The next one was greater, so we ignored it. Then one was 41, which was less, so we kept that one. And for all of the remaining attempts, we couldn't do any better than that 41. Again, maybe there is a way to do better that we just didn't find, but that looks like a pretty good solution to the problem. The 41 came from attempt number three, counting from zero, so we can take a look at that and open up image number three.
And this was the state that happened to have a cost of 41: after running the hill climbing algorithm on some particular random initial configuration of hospitals, this is the local minimum we found in terms of trying to minimize the cost. And it looks like we did pretty well. This hospital is pretty close to this region, this one is pretty close to these houses here, and this hospital looks about as good as we can do for trying to capture those houses over on that side. So these sorts of algorithms can be quite useful for trying to solve these problems. But the real problem with many of these different types of hill climbing, steepest ascent, stochastic, first choice, and so forth, is that they never make a move that makes our situation worse. They always take our current state, look at the neighbors, consider whether we can do better, and move to one of those neighbors. Which of those neighbors we choose might vary among these different algorithms, but we never go from our current position to a position that is worse. And ultimately, that's what we're going to need to do if we want to be able to find a global maximum or a global minimum, because sometimes if we get stuck, we want some way of dislodging ourselves from a local maximum or local minimum in order to find the global maximum or global minimum, or at least increase the probability that we find it. The most popular technique for approaching the problem from that angle is a technique known as simulated annealing: simulated because it's modeled after a real physical process of annealing. You can think about this in terms of physics, a physical situation where you have some system of particles. When you heat up that physical system, there's a lot of energy there, and things are moving around quite randomly. But over time, as the system cools down, it eventually settles into some final position. That's going to be the general idea of simulated annealing: we're going to simulate a process that starts as a high-temperature system, where things move around randomly quite frequently, and over time we decrease that temperature until we eventually settle at our ultimate solution. The idea is that if we have some state-space landscape that looks like this, and we begin at this initial state, then if we're looking for a global maximum and trying to maximize the value of the state, our traditional hill climbing algorithms would look at the two neighbors and always pick the one that increases the value of the state. But if we want some chance of finding the global maximum, we can't always make good moves: we have to sometimes make bad moves, allowing ourselves to move in a direction that for now actually seems to make our situation worse, so that later we can find our way up to that global maximum. Of course, once we get up to the global maximum, once we've done a whole lot of the searching, we probably don't want to keep moving to states that are worse than our current state. And this is where the metaphor of annealing comes in: we want to start out making more random moves and, over time, make fewer of those random moves, based on a particular temperature schedule. So the basic outline looks something like this.
Early on in simulated annealing, we have a higher temperature state. And what we mean by a higher temperature state is that we are more likely to accept neighbors that are worse than our current state. We might look at our neighbors. And if one of our neighbors is worse than the current state, especially if it’s not all that much worse, if it’s pretty close but just slightly worse, then we might be more likely to accept that and go ahead and move to that neighbor anyways. But later on as we run simulated annealing, we’re going to decrease that temperature. And at a lower temperature, we’re going to be less likely to accept neighbors that are worse than our current state. Now to formalize this and put a little bit of pseudocode to it, here is what that algorithm might look like. We have a function called simulated annealing that takes as input the problem we’re trying to solve and also potentially some maximum number of times we might want to run the simulated annealing process, how many different neighbors we’re going to try and look for. And that value is going to vary based on the problem you’re trying to solve. We’ll, again, start with some current state that will be equal to the initial state of the problem. But now we need to repeat this process over and over for max number of times. Repeat some process some number of times where we’re first going to calculate a temperature. And this temperature function takes the current time t starting at 1 going all the way up to max and then gives us some temperature that we can use in our computation, where the idea is that this temperature is going to be higher early on and it’s going to be lower later on. So there are a number of ways this temperature function could often work. One of the simplest ways is just to say it is like the proportion of time that we still have remaining. Out of max units of time, how much time do we have remaining? You start off with a lot of that time remaining. And as time goes on, the temperature is going to decrease because you have less and less of that remaining time still available to you. So we calculate a temperature for the current time. And then we pick a random neighbor of the current state. No longer are we going to be picking the best neighbor that we possibly can or just one of the better neighbors that we can. We’re going to pick a random neighbor. It might be better. It might be worse. But we’re going to calculate that. We’re going to calculate delta E, E for energy in this case, which is just how much better is the neighbor than the current state. So if delta E is positive, that means the neighbor is better than our current state. If delta E is negative, that means the neighbor is worse than our current state. And so we can then have a condition that looks like this. If delta E is greater than 0, that means the neighbor state is better than our current state. And if ever that situation arises, we’ll just go ahead and update current to be that neighbor. Same as before, move where we are currently to be the neighbor because the neighbor is better than our current state. We’ll go ahead and accept that. 
But now the difference is that, whereas before we never, ever wanted to take a move that made our situation worse, now we sometimes want to make a move that actually makes our situation worse, because sometimes we need to dislodge ourselves from a local minimum or local maximum to increase the probability that we find the global minimum or global maximum a little bit later. So how do we do that? How do we decide to sometimes accept a state that might actually be worse? Well, we're going to accept a worse state with some probability, and that probability needs to be based on a couple of factors. It needs to be based in part on the temperature: if the temperature is higher, we're more likely to move to a worse neighbor, and if the temperature is lower, we're less likely to. But it also, to some degree, should be based on delta E: if the neighbor is much worse than the current state, we probably want to be less likely to choose it than if the neighbor is just a little bit worse. There are a couple of ways you could calculate this, but one of the most popular is to compute e^(ΔE/T), where e is just the mathematical constant, and ΔE and T are the energy difference and temperature we just calculated. Since ΔE is negative in this case (the neighbor is worse), that expression works out to some value between 0 and 1, and that is the probability with which we say, all right, let's go ahead and move to that neighbor. And if you do the math on this value, when ΔE is such that the neighbor is not that much worse than the current state, it's more likely that we move to that state; and likewise, when the temperature is lower, we're less likely to move to that neighboring state. So this is the big picture for simulated annealing: this process of taking the problem and generating random neighbors, always moving to a neighbor if it's better than our current state, and, even if the neighbor is worse, sometimes moving there too, depending on how much worse it is and on the temperature. And as a result, the hope, the goal of this whole process, is that as we try to find our way to the global maximum or global minimum, we can dislodge ourselves if we ever get stuck at a local maximum or local minimum, in order to eventually make our way to exploring the part of the state space that is going to be the best. Then, as the temperature decreases, we eventually settle there, without moving around too much from what we've found to be the globally best thing we can do thus far. At the very end, we just return whatever the current state happens to be, and that is the conclusion of this algorithm: we've been able to figure out what the solution is. And these types of algorithms have a lot of different applications. Any time you can take a problem and formulate it as something where you can explore a particular configuration, ask whether any of the neighbors are better than the current configuration, and have some way of measuring that, there's an applicable case for these hill climbing and simulated annealing types of algorithms. Sometimes it's facility-location-type problems, like when you're trying to plan a city and figure out where the hospitals should be, but there are definitely other applications as well.
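Pulling the pseudocode together, a minimal sketch of simulated annealing might look like this, again with the hypothetical `neighbors` and `value` helpers, and with the simple fraction-of-time-remaining temperature schedule described above:

```python
import math
import random

def simulated_annealing(initial_state, neighbors, value, max_steps):
    current = initial_state
    for t in range(1, max_steps + 1):
        # Temperature cools from near 1 toward 0 as time runs out.
        temperature = (max_steps - t) / max_steps
        neighbor = random.choice(neighbors(current))
        delta_e = value(neighbor) - value(current)
        if delta_e > 0:
            current = neighbor  # always accept a better neighbor
        elif temperature > 0 and random.random() < math.exp(delta_e / temperature):
            current = neighbor  # sometimes accept a worse one
    return current
```

Because delta_e is negative in the second branch, math.exp(delta_e / temperature) lies between 0 and 1, and it shrinks both as the neighbor gets worse and as the temperature falls, exactly the two factors described above.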
And one of the most famous problems in computer science is the traveling salesman problem. The traveling salesman problem is generally formulated like this: I have a whole bunch of cities, here indicated by these dots, and I'd like to find some route that takes me through all of the cities and ends up back where I started; some route that starts here, goes through all these cities, and ends up back where I originally started. And what I might like to do is minimize the total distance I have to travel, or the total cost of taking this entire path. You can imagine this is a problem that's very applicable in situations like delivery companies trying to deliver things to a whole bunch of different houses: they want to figure out how to get from the warehouse to all of these various houses and back again, using as little time and distance and energy as possible. So you might want to try to solve these sorts of problems. But it turns out that solving this particular kind of problem is very computationally difficult; it is a very computationally expensive task to figure out. This falls under the category of what are known as NP-complete problems, problems for which there is no known efficient way to find a solution. And so what we ultimately have to do is come up with some approximation, some way of trying to find a good solution, even if we're not going to find the globally best solution we possibly could, at least not in a feasible or tractable amount of time. And so we could take the traveling salesman problem and try to formulate it using local search: I can pick some state, some configuration, some route between all of these nodes, and I can measure the cost of that state, figure out what the distance is; I might then want to minimize that cost as much as possible. Then the only question is, what does it mean to have a neighbor of this state? What does it mean to take this particular route and have some neighboring route that is close to it but slightly different, such that it might have a different total distance? There are a number of different definitions for what a neighbor of a traveling salesman configuration might look like, but one way is to say: a neighbor is what happens if we pick two of the edges between cities and effectively switch them. So, for example, I might pick these two edges here, the one going into this node and the one going into that node, and go ahead and switch them. What that process generally looks like is removing both of these edges from the graph, taking this node and connecting it to the node it wasn't connected to, so connecting it up here instead. We'll need to take the arrows that were originally going this way and reverse them, so they go the other way, and then just fill in the last remaining blank, adding an arrow that goes in that direction instead. So by taking two edges and just switching them, I've been able to consider one possible neighbor of this particular configuration. And it looks like this neighbor is actually better: it looks like this route probably travels a shorter distance to get through all the cities than the current state did.
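That edge-swapping move is often called a 2-opt move, and it has a compact implementation: if a route is a list of cities in visiting order, removing two edges and reconnecting them amounts to reversing the segment between them. A sketch, with made-up city names:

```python
def two_opt_neighbor(route, i, j):
    """Reverse route[i:j+1], which removes two edges and reconnects the tour."""
    return route[:i] + route[i:j + 1][::-1] + route[j + 1:]

route = ["A", "B", "C", "D", "E", "F"]
print(two_opt_neighbor(route, 1, 3))  # ['A', 'D', 'C', 'B', 'E', 'F']
```

Enumerating all pairs (i, j) gives the full neighborhood of a route, which is what a hill climbing or simulated annealing loop would then search over.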
And so you could imagine implementing this idea inside of a hill climbing or simulated annealing algorithm, where we repeat this process to try to take a state of this traveling salesman problem, look at all of the neighbors, and then move to a neighbor if it's better, or maybe even move to a neighbor if it's worse, until we eventually settle upon the best solution we've been able to find. And it turns out that these types of approximation algorithms, even if they don't always find the very best solution, can often do pretty well at finding solutions that are helpful too. So that, then, was a look at local search, a particular category of algorithms that can be used for solving a particular type of problem where we don't really care about the path to the solution. I didn't care about the steps I took to decide where the hospitals should go; I just cared about the solution itself, about where the hospitals should be, or what the route through the traveling salesman journey ought to be. Another category of problems that might come up is known as linear programming. Linear programming often comes up when we're trying to optimize for some mathematical function, and oftentimes with real-numbered values: not just discrete fixed values, but any decimal values that we might want to be able to calculate. And so linear programming is a family of problems where the goal is to minimize a cost function (you can invert the numbers and try to maximize instead, but often we'll frame it as minimizing a cost function) that has some number of variables, x1, x2, x3, all the way up to xn: just some number of variables that are involved, things whose values I want to know. And this cost function might have coefficients in front of those variables. This is what we would call a linear equation: all of these variables may be multiplied by a coefficient and added together. We're not going to square anything or cube anything, because that would give us different types of equations. With linear programming, we're just dealing with linear equations, in addition to linear constraints, where a constraint looks something like: the sum of a particular linear combination of all these variables is less than or equal to some bound b. And we might have a whole bunch of these different constraints placed onto our linear programming exercise. Likewise, just as we can have constraints saying that a linear equation is less than or equal to some bound b, a constraint might also be an equality: if you want some combination of variables to be equal to a value, you can specify that. And we can also specify that each variable has lower and upper bounds: that it needs to be a positive number, for example, or a number less than 50. There are a number of other choices we can make there for defining what the bounds of a variable are.
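In symbols, the general form just described can be written as follows, where the $c_i$, $a_i$, $b$, $l_i$, and $u_i$ are given numbers and the $x_i$ are the values being solved for (one standard presentation; a problem may have several inequality and equality constraints of these shapes):

$$
\begin{aligned}
\text{minimize}\quad & c_1 x_1 + c_2 x_2 + \cdots + c_n x_n \\
\text{subject to}\quad & a_1 x_1 + a_2 x_2 + \cdots + a_n x_n \le b \\
& d_1 x_1 + d_2 x_2 + \cdots + d_n x_n = e \\
& l_i \le x_i \le u_i \quad \text{for each variable } x_i
\end{aligned}
$$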
But it turns out that if you can take a problem and formulate it in these terms, formulate the problem so that your goal is to minimize a cost function, and you're minimizing that cost function subject to particular constraints, subject to equations of this form, where some linear combination of variables is less than or equal to a bound or is equal to some particular value, then there are a number of algorithms that already exist for solving these sorts of problems. So let's go ahead and take a look at an example. Here's an example of a problem that might come up in the world of linear programming. Often, this is going to come up when we're trying to optimize for something, and we want to be able to do some calculations, and we have constraints on what we're trying to optimize. And so it might be something like this. In the context of a factory, we have two machines, x1 and x2. x1 costs $50 an hour to run. x2 costs $80 an hour to run. And our goal, what we're trying to do, our objective, is to minimize the total cost. So that's what we'd like to do. But we need to do so subject to certain constraints. So there might be a labor constraint: x1 requires 5 units of labor per hour, x2 requires 2 units of labor per hour, and we have a total of 20 units of labor that we have to spend. So this is a constraint. We have no more than 20 units of labor that we can spend, and we have to spend it across x1 and x2, each of which requires a different amount of labor. And we might also have a constraint like this that tells us x1 is going to produce 10 units of output per hour, x2 is going to produce 12 units of output per hour, and the company needs 90 units of output. So we have some goal, something we need to achieve. We need to achieve 90 units of output, but there are some constraints: x1 can only produce 10 units of output per hour, and x2 produces 12 units of output per hour. These types of problems come up quite frequently, and you can start to notice patterns in these types of problems, problems where I am trying to optimize for some goal, minimizing cost, maximizing output, maximizing profits, or something like that, and there are constraints that are placed on that process. And so now we just need to formulate this problem in terms of linear equations. So let's start with this first point. Two machines, x1 and x2; x1 costs $50 an hour, x2 costs $80 an hour. Here we can come up with an objective function, or rather a cost function, that might look like this: 50 times x1 plus 80 times x2, where x1 is going to be a variable representing how many hours we run machine x1 for, and x2 is going to be a variable representing how many hours we run machine x2 for. And what we're trying to minimize is this cost function, which is just how much it costs to run each of these machines per hour, summed up. This is an example of a linear equation, just some combination of these variables with coefficients placed in front of them. And I would like to minimize that total value. But I need to do so subject to these constraints. x1 requires 5 units of labor per hour, x2 requires 2, and we have a total of 20 units of labor to spend. And so that gives us a constraint of this form: 5 times x1 plus 2 times x2 is less than or equal to 20. 20 is the total number of units of labor we have to spend, and that's spent across x1 and x2, each of which requires a different number of units of labor per hour, for example. And finally, we have this constraint here.
x1 produces 10 units of output per hour, x2 produces 12, and we need 90 units of output. And so this might look something like this: 10 times x1 plus 12 times x2, which is the amount of output per hour, needs to be at least 90. We can do more than that, but it needs to be at least 90. And if you recall from my formulation before, I said that generally speaking in linear programming, we deal with equality constraints or less-than-or-equal-to constraints. So we have a greater-than-or-equal-to sign here. That's not a problem. Whenever we have a greater-than-or-equal-to sign, we can just multiply the equation by negative 1, and that'll flip it around to a less-than-or-equal-to negative 90, for example, instead of a greater-than-or-equal-to 90. And that's going to be an equivalent expression that we can use to represent this problem. So now that we have this cost function and these constraints that it's subject to, it turns out there are a number of algorithms that can be used to solve these types of problems. And these problems go a little bit more into geometry and linear algebra than we're really going to get into. But the most popular of these types of algorithms is simplex, which was one of the first algorithms discovered for trying to solve linear programs. And later on, a class of interior-point algorithms can be used to solve this type of problem as well. The key is not to understand exactly how these algorithms work, but to realize that these algorithms exist for efficiently finding solutions any time we have a problem of this particular form. And so we can take a look, for example, at the production directory here, where I have a file called production.py, where I'm using scipy, which is a library for a lot of science-related functions within Python. And I can go ahead and just run this optimization function in order to run a linear program. scipy.optimize.linprog here is going to try and solve this linear program for me, where I provide to this function call all of the data about my linear program. So it needs to be in a particular format, which might be a little confusing at first. But the first argument to scipy.optimize.linprog is the cost function, which is in this case just an array or a list that has 50 and 80, because my original cost function was 50 times x1 plus 80 times x2. So I just tell Python, 50 and 80, those are the coefficients that I am now trying to optimize for. And then I provide all of the constraints. So the constraints, and I wrote them up above in comments, are: constraint 1 is 5x1 plus 2x2 is less than or equal to 20, and constraint 2 is negative 10x1 plus negative 12x2 is less than or equal to negative 90. And scipy expects these constraints to be in a particular format. It first expects me to provide all of the coefficients for the upper-bound equations, ub just for upper bound, where the coefficients of the first equation are 5 and 2, because we have 5x1 and 2x2, and the coefficients for the second equation are negative 10 and negative 12, because I have negative 10x1 plus negative 12x2. And then, as a separate argument, just to keep things separate, we provide what the actual bound is, the upper bound for each of these constraints. Well, for the first constraint, the upper bound is 20. That was constraint number 1. And then for constraint number 2, the upper bound is negative 90. So a bit of a cryptic way of representing it. It's not quite as simple as just writing the mathematical equations.
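As a sketch, the production.py file being described probably looks something like this (the exact file isn't reproduced here, but this matches the arguments just walked through):

```python
import scipy.optimize

# Objective function: 50x_1 + 80x_2
# Constraint 1: 5x_1 + 2x_2 <= 20
# Constraint 2: -10x_1 + -12x_2 <= -90

result = scipy.optimize.linprog(
    [50, 80],                   # cost function coefficients
    A_ub=[[5, 2], [-10, -12]],  # coefficients for the upper-bound inequalities
    b_ub=[20, -90],             # upper bounds for those inequalities
)

if result.success:
    print(f"X1: {round(result.x[0], 2)} hours")
    print(f"X2: {round(result.x[1], 2)} hours")
else:
    print("No solution")
```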
What really is being expected here are all of the coefficients and all of the numbers that are in these equations: first the coefficients for the cost function, then all the coefficients for the inequality constraints, and then all of the upper bounds for those inequality constraints. And once all of that information is there, then we can run any of these interior-point algorithms or the simplex algorithm. Even if you don't understand how they work, you can just run the function and figure out what the result should be. And here, I said if the result is a success, we were able to solve this problem: go ahead and print out what the value of x1 and x2 should be. Otherwise, go ahead and print out no solution. And so if I run this program by running python production.py, it takes a second to calculate, but then we see what the optimal solution should be: x1 should run for 1.5 hours, and x2 should run for 6.25 hours. And we were able to do this by just formulating the problem as a linear equation that we were trying to optimize, some cost that we were trying to minimize, and then some constraints that were placed on that. And many, many problems fall into this category, problems that you can solve if you can just figure out how to use equations and constraints to represent that general idea. And that's a theme that's going to come up a couple of times today, where we want to be able to take some problem and reduce it down to some problem we know how to solve, in order to begin to find a solution and to use existing methods to find a solution more effectively or more efficiently. And it turns out that these types of problems, where we have constraints, show up in other ways too. And there's an entire class of problems more generally known as constraint satisfaction problems. And we're going to now take a look at how you might formulate a constraint satisfaction problem and how you might go about solving one. But the basic idea of a constraint satisfaction problem is that we have some number of variables that need to take on some values, and we need to figure out what values each of those variables should take on. But those variables are subject to particular constraints that are going to limit what values those variables can actually take on. So let's take a look at a real-world example: exam scheduling. I have four students here, students 1, 2, 3, and 4. Each of them is taking some number of different classes. Classes here are going to be represented by letters. So student 1 is enrolled in courses A, B, and C; student 2 is enrolled in courses B, D, and E; so on and so forth. And now say a university is trying to schedule exams for all of these courses, but there are only three exam slots: Monday, Tuesday, and Wednesday. And we have to schedule an exam for each of these courses. But the constraint now, the constraint we have to deal with in the scheduling, is that we don't want anyone to have to take two exams on the same day. We would like to try and minimize that, or eliminate it if at all possible. So how do we begin to represent this idea? How do we structure this in a way that a computer with an AI algorithm can begin to try and solve? Well, let's in particular just look at these classes and represent each of the courses as some node inside of a graph.
And what we’ll do is we’ll create an edge between two nodes in this graph if there is a constraint between those two nodes. So what does this mean? Well, we can start with student 1, who’s enrolled in courses A, B, and C. What that means is that A and B can’t have an exam at the same time. A and C can’t have an exam at the same time. And B and C also can’t have an exam at the same time. And I can represent that in this graph by just drawing edges. One edge between A and B, one between B and C, and then one between C and A. And that encodes now the idea that between those nodes, there is a constraint. And in particular, the constraint happens to be that these two can’t be equal to each other, though there are other types of constraints that are possible, depending on the type of problem that you’re trying to solve. And then we can do the same thing for each of the other students. So for student 2, who’s enrolled in courses B, D, and E, well, that means B, D, and E, those all need to have edges that connect each other as well. Student 3 is enrolled in courses C, E, and F. So we’ll go ahead and take C, E, and F and connect those by drawing edges between them too. And then finally, student 4 is enrolled in courses E, F, and G. And we can represent that by drawing edges between E, F, and G, although E and F already had an edge between them. We don’t need another one, because this constraint is just encoding the idea that course E and course F cannot have an exam on the same day. So this then is what we might call the constraint graph. There’s some graphical representation of all of my variables, so to speak, and the constraints between those possible variables. Where in this particular case, each of the constraints represents an inequality constraint, that an edge between B and D means whatever value the variable B takes on cannot be the value that the variable D takes on as well. So what then actually is a constraint satisfaction problem? Well, a constraint satisfaction problem is just some set of variables, x1 all the way through xn, some set of domains for each of those variables. So every variable needs to take on some values. Maybe every variable has the same domain, but maybe each variable has a slightly different domain. And then there’s a set of constraints, and we’ll just call a set C, that is some constraints that are placed upon these variables, like x1 is not equal to x2. But there could be other forms too, like maybe x1 equals x2 plus 1 if these variables are taking on numerical values in their domain, for example. The types of constraints are going to vary based on the types of problems. And constraint satisfaction shows up all over the place as well, in any situation where we have variables that are subject to particular constraints. So one popular game is Sudoku, for example, this 9 by 9 grid where you need to fill in numbers in each of these cells, but you want to make sure there’s never a duplicate number in any row, or in any column, or in any grid of 3 by 3 cells, for example. So what might this look like as a constraint satisfaction problem? Well, my variables are all of the empty squares in the puzzle. So represented here is just like an x comma y coordinate, for example, as all of the squares where I need to plug in a value, where I don’t know what value it should take on. The domain is just going to be all of the numbers from 1 through 9, any value that I could fill in to one of these cells. So that is going to be the domain for each of these variables. 
And then the constraints are going to be of the form: this cell can't be equal to this cell, can't be equal to this cell, and all of these need to be different, for example, and the same for all of the rows, and the columns, and the 3 by 3 squares as well. So those constraints are going to enforce what values are actually allowed. And we can formulate the same idea in the case of this exam scheduling problem, where the variables we have are the different courses, a up through g. The domain for each of these variables is going to be Monday, Tuesday, and Wednesday. Those are the possible values each of the variables can take on, which in this case just represent when the exam for that class is. And then the constraints are of this form: a is not equal to b, a is not equal to c, meaning a and b can't have an exam on the same day, and a and c can't have an exam on the same day. Or, more formally, these two variables cannot take on the same value within their domain. So that then is the formulation of a constraint satisfaction problem that we can begin to use to try and solve this problem. And constraints can come in a number of different forms. There are hard constraints, which are constraints that must be satisfied for a correct solution. So something like, in the Sudoku puzzle, you cannot have this cell and this cell that are in the same row take on the same value. That is a hard constraint. But problems can also have soft constraints, which are constraints that express some notion of preference: maybe a and b can't have an exam on the same day, but maybe someone has a preference that a's exam is earlier than b's exam. It doesn't need to be satisfied for a solution to be correct, but it's an expression of the idea that some solution is better than another solution. And in that case, you might formulate the problem as trying to optimize for maximizing people's preferences. You want people's preferences to be satisfied as much as possible. In this case, though, we'll mostly just deal with hard constraints, constraints that must be met in order to have a correct solution to the problem. So we want to figure out some assignment of these variables to their particular values that is ultimately going to give us a solution to the problem, by allowing us to assign some day to each of the classes such that we don't have any conflicts between classes. So it turns out that we can classify the constraints in a constraint satisfaction problem into a number of different categories. The first of those categories are perhaps the simplest types of constraints, which are known as unary constraints, where a unary constraint is a constraint that just involves a single variable. For example, a unary constraint might be something like a does not equal Monday, meaning course A cannot have its exam on Monday. If for some reason the instructor for the course isn't available on Monday, you might have a constraint in your problem that looks like this, something that just has a single variable a in it, and says maybe a is not equal to Monday, or a is equal to something, or, in the case of numbers, greater than or less than something. A constraint that just has one variable we consider to be a unary constraint. And this is in contrast to something like a binary constraint, which is a constraint that involves two variables, for example. So this would be a constraint like the ones we were looking at before.
Something like a does not equal b is an example of a binary constraint, because it is a constraint that has two variables involved in it, a and b. And we represented that using some arc, or some edge, that connects variable a to variable b. And using this knowledge of, OK, what is a unary constraint, what is a binary constraint, there are different things we can say about a particular constraint satisfaction problem. And one thing we can say is we can try and make the problem node consistent. So what does node consistency mean? Node consistency means that all of the values in a variable's domain satisfy that variable's unary constraints. So for each of the variables inside of our constraint satisfaction problem, if all of the values satisfy the unary constraints for that particular variable, we can say that the entire problem is node consistent, or we can even say that a particular variable is node consistent if we just want to make one node consistent within itself. So what does that actually look like? Let's look at a simplified example, where instead of having a whole bunch of different classes, we just have two classes, a and b, each of which has an exam on either Monday or Tuesday or Wednesday. So this is the domain for the variable a, and this is the domain for the variable b. And now let's imagine we have these constraints: a not equal to Monday, b not equal to Tuesday, b not equal to Monday, a not equal to b. So those are the constraints that we have on this particular problem. And what we can now try to do is enforce node consistency. And node consistency just means we make sure that all of the values in any variable's domain satisfy its unary constraints. And so we could start by trying to make node a node consistent. Is it consistent? Does every value inside of a's domain satisfy its unary constraints? Well, initially, we'll see that Monday does not satisfy a's unary constraints, because we have a unary constraint here that a is not equal to Monday. But Monday is still in a's domain. And so this is something that is not node consistent, because we have Monday in the domain, but this is not a valid value for this particular node. And so how do we make this node consistent? Well, to make the node consistent, what we'll do is we'll just go ahead and remove Monday from a's domain. Now a can only be on Tuesday or Wednesday, because we had this constraint that said a is not equal to Monday. And at this point, a is node consistent. For each of the values that a can take on, Tuesday and Wednesday, there is no unary constraint that conflicts with that idea. There is no unary constraint that says that a can't be Tuesday; there is no unary constraint that says that a cannot be on Wednesday. And so now we can turn our attention to b. b also has a domain: Monday, Tuesday, and Wednesday. And we can begin to see whether those values satisfy the unary constraints as well. Well, here is a unary constraint: b is not equal to Tuesday. And that does not appear to be satisfied by this domain of Monday, Tuesday, and Wednesday, because Tuesday, this possible value that the variable b could take on, is not consistent with this unary constraint that b is not equal to Tuesday. So to solve that problem, we'll go ahead and remove Tuesday from b's domain. Now b's domain only contains Monday and Wednesday. But as it turns out, there's yet another unary constraint that we placed on the variable b, which is here: b is not equal to Monday.
And that means that this value, Monday, inside of b's domain, is not consistent with b's unary constraints, because we have a constraint that says that b cannot be Monday. And so we can remove Monday from b's domain. And now we've made it through all of the unary constraints. We've not yet considered this constraint, which is a binary constraint, but we've considered all of the unary constraints, all of the constraints that involve just a single variable, and we've made sure that every node is consistent with those unary constraints. So we can say that now we have enforced node consistency: for each of these possible nodes, we can pick any of these values in the domain, and there won't be a unary constraint that is violated as a result of it. So node consistency is fairly easy to enforce: we just take each node and make sure the values in the domain satisfy the unary constraints. Where things get a little bit more interesting is when we consider different types of consistency, something like arc consistency, for example. And arc consistency refers to when all of the values in a variable's domain satisfy that variable's binary constraints. So when we're trying to make a arc consistent, we're no longer just considering the unary constraints that involve a; we're considering all of the binary constraints that involve a as well, any edge that connects a to another variable inside of that constraint graph that we were taking a look at before. Put a little bit more formally: arc really is just another word for an edge that connects two of these nodes inside of our constraint graph, and we can define arc consistency a little more precisely like this. In order to make some variable x arc consistent with respect to some other variable y, we need to remove any element from x's domain to make sure that every choice for x, every choice in x's domain, has a possible choice for y. So put another way, if I have a variable x and I want to make x arc consistent, then I'm going to look at all of the possible values that x can take on and make sure that, for all of those possible values, there is still some choice that I can make for y, if there's some arc between x and y, to make sure that y has a possible option that I can choose as well. So let's look at an example of that, going back to this example from before. We enforced node consistency already by saying that a can only be on Tuesday or Wednesday, because we knew that a could not be on Monday. And we also said that b's domain only consists of Wednesday, because we know that b does not equal Tuesday and also b does not equal Monday. So now let's begin to consider arc consistency. Let's try and make a arc consistent with b. And what that means is that, for any choice we make in a's domain, there is some choice we can make in b's domain that is going to be consistent. And we can try that. For a, we can choose Tuesday as a possible value. If I choose Tuesday for a, is there a value for b that satisfies the binary constraint? Well, yes: b being Wednesday would satisfy this constraint that a does not equal b, because Tuesday does not equal Wednesday. However, if we chose Wednesday for a, well, then there is no choice in b's domain that satisfies this binary constraint. There is no way I can choose something for b that satisfies a does not equal b, because I know b must be Wednesday.
And so if ever I run into a situation like this, where I see that here is a possible value for a such that there is no choice of value for b that satisfies the binary constraint, well, then this is not arc consistent. And to make it arc consistent, I would need to take Wednesday and remove it from a's domain, because Wednesday was not going to be a possible choice I could make for a. It wasn't consistent with this binary constraint with b; there was no way I could choose Wednesday for a and still have an available solution by choosing something for b as well. So here now, I've been able to enforce arc consistency. And in doing so, I've actually solved this entire problem: given these constraints, where a and b can have exams on either Monday or Tuesday or Wednesday, the only solution, as it would appear, is that a's exam must be on Tuesday and b's exam must be on Wednesday. That is the only option available to me. So if we want to apply arc consistency to a larger graph, not just looking at one particular pair of variables, there are ways we can do that too. And we can begin to formalize what the pseudocode would look like for trying to write an algorithm that enforces arc consistency. And we'll start by defining a function called revise. Revise is going to take as input a CSP, otherwise known as a constraint satisfaction problem, and also two variables, x and y. And what revise is going to do is make x arc consistent with respect to y, meaning remove anything from x's domain that doesn't allow for a possible option for y. How does this work? Well, we'll go ahead and first keep track of whether or not we've made a revision. Revise is ultimately going to return true or false. It'll return true in the event that we did make a revision to x's domain; it'll return false if we didn't make any change to x's domain. And we'll see in a moment why that's going to be helpful. But we start by saying revised equals false: we haven't made any changes. Then we'll say, all right, let's go ahead and loop over all of the possible values in x's domain. So loop over x's domain, for each little x in x's domain. I want to make sure that for each of those choices, I have some available choice in y that satisfies the binary constraints that are defined inside of my CSP, inside of my constraint satisfaction problem. So if ever it's the case that there is no value y in y's domain that satisfies the constraint for x and y, well, if that's the case, that means that this value x shouldn't be in x's domain. So we'll go ahead and delete x from x's domain, and I'll set revised equal to true, because I did change x's domain. I changed x's domain by removing little x. And I removed little x because it wasn't arc consistent; there was no way I could choose a value for y that would satisfy this xy constraint. So in this case, we'll go ahead and set revised equal to true. And we'll do this again and again for every value in x's domain. Sometimes it might be fine; in other cases, it might not allow for a possible choice for y, in which case we need to remove this value from x's domain. And at the end, we just return revised to indicate whether or not we actually made a change. So this revise function, then, is effectively an implementation of what you saw me do graphically a moment ago. It makes one variable, x, arc consistent with another variable, in this case, y.
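To make this concrete, here's a minimal Python sketch of revise, together with the AC3 loop that the next paragraphs walk through. It assumes a simple representation (not from the lecture): `domains` is a dict mapping each variable to a set of values, `neighbors` maps each variable to the variables it shares a constraint with, and `constraint(x_val, y_val)` returns True when the two values are compatible (for the exam problem, simply `lambda a, b: a != b`):

```python
def revise(domains, constraint, x, y):
    """Make x arc consistent with respect to y. Returns True if
    anything was removed from x's domain."""
    revised = False
    for x_val in set(domains[x]):  # copy, since we may remove values
        # If no value in y's domain is compatible with x_val,
        # then x_val can't be part of any solution: delete it.
        if not any(constraint(x_val, y_val) for y_val in domains[y]):
            domains[x].remove(x_val)
            revised = True
    return revised

def ac3(domains, neighbors, constraint):
    """Enforce arc consistency across the entire problem."""
    # Start with a queue of every arc in the problem.
    queue = [(x, y) for x in domains for y in neighbors[x]]
    while queue:
        x, y = queue.pop(0)  # dequeue an arc
        if revise(domains, constraint, x, y):
            if not domains[x]:
                return False  # x has no values left: no solution exists
            # x's domain shrank, so arcs pointing at x must be rechecked.
            for z in neighbors[x]:
                if z != y:
                    queue.append((z, x))
    return True
```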
But generally speaking, when we want to enforce arc consistency, we'll often want to enforce it not just for a single arc, but for the entire constraint satisfaction problem. And it turns out there's an algorithm to do that as well. And that algorithm is known as AC3. AC3 takes a constraint satisfaction problem and enforces arc consistency across the entire problem. How does it do that? Well, it's going to basically maintain a queue, basically just a line, of all of the arcs that it needs to make consistent. And over time, we might remove things from that queue as we make arcs consistent, and we might need to add things to that queue as well if there are more things we need to make arc consistent. So we'll go ahead and start with a queue that contains all of the arcs in the constraint satisfaction problem, all of the edges that connect two nodes that have some sort of binary constraint between them. And now, as long as the queue is non-empty, there is work to be done. The queue is all of the things that we need to make arc consistent. So as long as the queue is non-empty, there are still things we have to do. What do we have to do? Well, we'll start by dequeuing from the queue, removing something from the queue. And strictly speaking, it doesn't need to be a queue, but a queue is the traditional way of doing this. We'll dequeue from the queue, and that'll give us an arc, x and y, these two variables where I would like to make x arc consistent with y. So how do we make x arc consistent with y? Well, we can go ahead and just use that revise function that we talked about a moment ago. We call the revise function, passing as input the constraint satisfaction problem and also these variables x and y, because I want to make x arc consistent with y; in other words, remove any values from x's domain that don't leave an available option for y. And recall, what does revise return? Well, it returns true if we actually made a change, if we removed something from x's domain because there wasn't an available option for y, for example. And it returns false if we didn't make any change to x's domain at all. And it turns out that if revise returns false, if we didn't make any changes, well, then there's not a whole lot more work to be done here for this arc, and we can just move ahead to the next arc that's in the queue. But if we did make a change, if we did reduce x's domain by removing values from it, well, then what we might realize is that this creates potential problems later on: some arc that was arc consistent with x might no longer be arc consistent with x, because while there used to be an option that we could choose for x, now there might not be, because we might have removed something from x's domain that was necessary for some other arc to be arc consistent. And so if ever we did revise x's domain, we're going to need to add some things to the queue, some additional arcs that we might want to check. How do we do that? Well, the first thing we want to check is to make sure that x's domain is not empty. If x's domain is empty, that means there are no available options for x at all, and that means that there's no way you can solve the constraint satisfaction problem. If we've removed everything from x's domain, we'll go ahead and just return false here to indicate there's no way to solve the problem, because there's nothing left in x's domain.
But otherwise, if there are things left in x's domain, but fewer things than before, well, then what we'll do is loop over each variable z in all of x's other neighbors, except for y; y we already handled. But we'll consider all of x's other neighbors and ask ourselves, all right, that arc from each of those z's to x might no longer be arc consistent, because while for each z there might have been a possible option we could choose for x to correspond with each of z's possible values, now there might not be, because we removed some elements from x's domain. And so what we'll do here is go ahead and enqueue, adding to the queue, the arc zx for all of those neighbors z. So we need to add back some arcs to the queue in order to continue to enforce arc consistency. At the very end, if we make it through all this process, then we can return true. But this now is AC3, this algorithm for enforcing arc consistency on a constraint satisfaction problem. And the big idea is really just to keep track of all of the arcs that we might need to make arc consistent, make each one arc consistent by calling the revise function, and, if we did revise it, then there are some new arcs that might need to be added to the queue in order to make sure that everything is still arc consistent, even after we've removed some of the elements from a particular variable's domain. So what then would happen if we tried to enforce arc consistency on a graph like this, on a graph where each of these variables has a domain of Monday, Tuesday, and Wednesday? Well, it turns out that by enforcing arc consistency on this graph, while it can solve some types of problems, nothing actually changes here. For any particular arc, just considering two variables, there's always a way for me, for any of the choices I make for one of them, to make a choice for the other one, because there are three options, and I just need the two to be different from each other. So it's actually quite easy to take an arc and declare that it is arc consistent, because if I pick Monday for D, then I just pick something that isn't Monday for B. In arc consistency, we only consider the binary constraint between two nodes; we're not really considering all of the rest of the nodes yet. So just using AC3, the enforcement of arc consistency, can sometimes have the effect of reducing domains to make it easier to find solutions, but it will not always actually solve the problem. We might still need to somehow search to try and find a solution. And we can use classical, traditional search algorithms to try to do so. You'll recall that a search problem generally consists of these parts: we have some initial state, some actions, a transition model that takes me from one state to another state, a goal test to tell me whether I've satisfied my objective correctly, and then some path cost function, because in the case of maze solving, for example, I was trying to get to my goal as quickly as possible. So you could formulate a CSP, or a constraint satisfaction problem, as one of these types of search problems. The initial state will just be an empty assignment, where an assignment is just a way for me to assign any particular variable to any particular value. So if an empty assignment has no variables assigned to any values yet, then the action I can take is adding some new variable-equals-value pair to that assignment, saying, for this assignment, let me add a new value for this variable.
And the transition model just defines what happens when you take that action: you get a new assignment that has that variable equal to that value inside of it. The goal test is just checking to make sure all the variables have been assigned and making sure all the constraints have been satisfied. And the path cost function is sort of irrelevant. I don't really care about what the path is; I just care about finding some assignment that actually satisfies all of the constraints. So really, all the paths have the same cost. I don't really care about the path to the goal; I just care about the solution itself, much as we've talked about before. The problem here, though, is that if we just implement this naive search algorithm, just by implementing breadth-first search or depth-first search, this is going to be very, very inefficient. And there are ways we can take advantage of efficiencies in the structure of a constraint satisfaction problem itself. And one of the key ideas is that it doesn't matter what order we assign variables in. The assignment a equals 2 and then b equals 8 is identical to the assignment of b equals 8 and then a equals 2; switching the order doesn't really change anything about the fundamental nature of that assignment. And so there are some ways that we can try and revise this idea of a search algorithm to apply it specifically to a problem like a constraint satisfaction problem. And it turns out the search algorithm we'll generally use when talking about constraint satisfaction problems is something known as backtracking search. And the big idea of backtracking search is that we'll go ahead and make assignments from variables to values, and if ever we get stuck, if we arrive at a place where there is no way we can make any forward progress while still preserving the constraints that we need to enforce, we'll go ahead and backtrack and try something else instead. So the very basic sketch of what backtracking search looks like is this: a function called backtrack that takes as input an assignment and a constraint satisfaction problem. So initially, we don't have any assigned variables, so when we begin backtracking search, this assignment is just going to be the empty assignment with no variables inside of it. But we'll see later that this is going to be a recursive function. So backtrack takes as input the assignment and the problem. If the assignment is complete, meaning all of the variables have been assigned, we just return that assignment. That, of course, won't be true initially, because we start with an empty assignment, but over time we might add things to that assignment. So if ever the assignment actually is complete, then we're done; just go ahead and return that assignment. But otherwise, there is some work to be done. So what we'll need to do is select an unassigned variable for this particular problem. So we need to take the problem, look at the variables that have already been assigned, and pick a variable that has not yet been assigned. And I'll go ahead and take that variable. And then I need to consider all of the values in that variable's domain. So we'll go ahead and call this domain values function, which we'll talk a little more about later, that takes a variable and just gives me back an ordered list of all of the values in its domain. So I've taken some unassigned variable, and I'm going to loop over all of the possible values.
And the idea is, let me just try all of these values as possible values for the variable. So if the value is consistent with the assignment so far, if it doesn't violate any of the constraints, well, then let's go ahead and add variable equals value to the assignment, because it's so far consistent. And now let's recursively call backtrack to try and make the rest of the assignments also consistent. So I'll go ahead and call backtrack on this new assignment that I've added the variable-equals-value to, and I recursively call backtrack and see what the result is. And if the result isn't a failure, well, then let me just return that result. And otherwise, what else could happen? Well, if it turns out the result was a failure, then that means this value was probably a bad choice for this particular variable, because when I assigned this variable equal to that value, eventually, down the road, I ran into a situation where I violated constraints. There was nothing more I could do. So now I'll remove variable equals value from the assignment, effectively backtracking, to say, all right, that value didn't work; let's try another value instead. And then at the very end, if we were never able to return a complete assignment, we'll just go ahead and return failure, because that means none of the values worked for this particular variable. This, then, is the idea for backtracking search: take each of the variables, try values for them, and recursively call backtracking search to see if we can make progress. And if ever we run into a dead end, we run into a situation where there is no possible value we can choose that satisfies the constraints, we return failure. And that propagates up, and eventually we make a different choice by going back and trying something else instead. So let's put this algorithm into practice. Let's actually try and use backtracking search to solve this problem now, where I need to figure out how to assign each of these courses to an exam slot on Monday or Tuesday or Wednesday in such a way that it satisfies these constraints, where each of these edges means those two classes cannot have an exam on the same day. So I can start by just picking a node. It doesn't really matter which I start with, but in this case, I'll just start with A. And I'll ask the question, all right, let me loop over the values in the domain. And maybe in this case I'll just start with Monday and say, all right, let's go ahead and assign A to Monday. We'll just go in order: Monday, Tuesday, Wednesday. And now let's consider node B. So I've made an assignment to A, so I recursively call backtrack with this new part of the assignment. And now I'm looking to pick another unassigned variable, like B. And I'll say, all right, maybe I'll start with Monday, because that's the very first value in B's domain. And I ask, all right, does Monday violate any constraints? And it turns out, yes, it does. It violates this constraint here between A and B, because A and B are now both on Monday, and that doesn't work, because B can't be on the same day as A. So that doesn't work. So we might instead try Tuesday, the next value in B's domain. And is that consistent with the assignment so far? Well, yeah: B Tuesday, A Monday, that is consistent so far, because they're not on the same day. So that's good. Now we can recursively call backtrack and try again. Pick another unassigned variable, something like D, and say, all right, let's go through its possible values. Is Monday consistent with this assignment?
Well, yes, it is. B and D are on different days, Monday versus Tuesday. And A and B are also on different days, Monday versus Tuesday. So that's fine so far, too. We'll go ahead and try again. Maybe we'll go to this variable here, E. Can we make that consistent? Let's go through the possible values. We've recursively called backtrack. We might start with Monday and say, all right, that's not consistent, because D and E now have exams on the same day. So we might try Tuesday instead, going to the next one, and ask, is that consistent? Well, no, it's not, because B and E would have exams on the same day. And so we try, all right, is Wednesday consistent? And it turns out, yes, it is. Wednesday is consistent, because D and E now have exams on different days, and B and E now have exams on different days. All seems to be well so far. I recursively call backtrack and select another unassigned variable; we'll say maybe we choose C this time. And say, all right, let's try the values that C could take on. Let's start with Monday. And it turns out that's not consistent, because now A and C both have exams on the same day. So I try Tuesday, and say, that's not consistent either, because B and C now have exams on the same day. And then I say, all right, let's go ahead and try Wednesday. But that's not consistent either, because C and E would have exams on the same day too. So now we've gone through all the possible values for C, Monday, Tuesday, and Wednesday, and none of them are consistent. There is no way we can have a consistent assignment. Backtrack, in this case, will return a failure. And so then we'd say, all right, we have to backtrack back to here. Well, now for E, we've tried all of Monday, Tuesday, and Wednesday, and none of those work, because Wednesday, which seemed to work, turned out to be a failure. So that means there's no possible way we can assign E. So that's a failure too. We have to go back up to D, which means that Monday assignment to D, that must be wrong. We must try something else. So we can try, all right, what if instead of Monday, we try Tuesday? Tuesday, it turns out, is not consistent, because B and D would now have an exam on the same day. But Wednesday, as it turns out, works. And now we can begin to make forward progress again. We go back to E and say, all right, which of these values works? Monday turns out to work by not violating any constraints. Then we go up to C now. Monday doesn't work, because it violates a constraint; it violates two, actually. Tuesday doesn't work, because it violates a constraint as well. But Wednesday does work. Then we can go to the next variable, F, and say, all right, does Monday work? Well, no; it violates a constraint. But Tuesday does work. And then finally, we can look at the last variable, G, recursively calling backtrack one more time. Monday is inconsistent; that violates a constraint. Tuesday also violates a constraint. But Wednesday doesn't violate a constraint. And so now, at this point, we recursively call backtrack one last time. We now have a satisfactory assignment of all of the variables, and at this point, we can say that we are done. We have now been able to successfully assign a value to each one of these variables in such a way that we're not violating any constraints. We're going to go ahead and have classes A and E have their exams on Monday, classes B and F can have their exams on Tuesday, and classes C, D, and G can have their exams on Wednesday.
And there’s no violated constraints that might come up there. So that then was a graphical look at how this might work. Let’s now take a look at some code we could use to actually try and solve this problem as well. So here I’ll go ahead and go into the scheduling directory. We’re here now. We’ll start by looking at schedule0.py. We’re here. I define a list of variables, A, B, C, D, E, F, G. Those are all different classes. Then underneath that, I define my list of constraints. So constraint A and B. That is a constraint because they can’t be on the same day. Likewise, A and C, B and C, so on and so forth, enforcing those exact same constraints. And here then is what the backtracking function might look like. First, if the assignment is complete, if I’ve made an assignment of every variable to a value, go ahead and just return that assignment. Then we’ll select an unassigned variable from that assignment. Then for each of the possible values in the domain, Monday, Tuesday, Wednesday, let’s go ahead and create a new assignment that assigns the variable to that value. I’ll call this consistent function, which I’ll show you in a moment, that just checks to make sure this new assignment is consistent. But if it is consistent, we’ll go ahead and call backtrack to go ahead and continue trying to run backtracking search. And as long as the result is not none, meaning it wasn’t a failure, we can go ahead and return that result. But if we make it through all the values and nothing works, then it is a failure. There’s no solution. We go ahead and return none here. What do these functions do? Select unassigned variable is just going to choose a variable not yet assigned. So it’s going to loop over all the variables. And if it’s not already assigned, we’ll go ahead and just return that variable. And what does the consistent function do? Well, the consistent function goes through all the constraints. And if we have a situation where we’ve assigned both of those values to variables, but they are the same, well, then that is a violation of the constraint, in which case we’ll return false. But if nothing is inconsistent, then the assignment is consistent and will return true. And then all the program does is it calls backtrack on an empty assignment, an empty dictionary that has no variable assigned and no values yet, save that inside a solution, and then print out that solution. So by running this now, I can run Python schedule0.py. And what I get as a result of that is an assignment of all these variables to values. And it turns out we assign a to Monday as we would expect, b to Tuesday, c to Wednesday, exactly the same type of thing we were talking about before, an assignment of each of these variables to values that doesn’t violate any constraints. And I had to do a fair amount of work in order to implement this idea myself. I had to write the backtrack function that went ahead and went through this process of recursively trying to do this backtracking search. But it turns out the constraint satisfaction problems are so popular that there exist many libraries that already implement this type of idea. Again, as with before, the specific library is not as important as the fact that libraries do exist. This is just one example of a Python constraint library, where now, rather than having to do all the work from scratch inside of schedule1.py, I’m just taking advantage of a library that implements a lot of these ideas already. So here, I create a new problem, add variables to it with particular domains. 
I add a whole bunch of these individual constraints, where I call addConstraint and pass in a function describing what the constraint is. And the constraint is basically a function that takes two variables, x and y, and makes sure that x is not equal to y, enforcing the idea that these two classes cannot have exams on the same day. And then, for any constraint satisfaction problem, I can call getSolutions to get all the solutions to that problem, and then, for each of those solutions, print out what that solution happens to be. And if I run python schedule1.py, we now see there are actually a number of different solutions that can be used to solve the problem. There are, in fact, six different solutions, assignments of variables to values that will give me a satisfactory answer to this constraint satisfaction problem. So this then was an implementation of a very basic backtracking search method, where really we just went through each of the variables, picked one that wasn't assigned, and tried the possible values the variable could take on. And then, if it worked, if it didn't violate any constraints, then we kept trying other variables. And if ever we hit a dead end, we had to backtrack. But ultimately, we might be able to be a little bit more intelligent about how we do this, in order to improve the efficiency of how we solve these sorts of problems. And one thing we might imagine trying to do is going back to this idea of inference, using the knowledge we know to be able to draw conclusions, in order to make the rest of the problem-solving process a little bit easier. And let's now go back to where we got stuck in this problem the first time. When we were solving this constraint satisfaction problem, we dealt with B, and then we went on to D, and we went ahead and just assigned D to Monday, because that seemed to work with the assignment so far. It didn't violate any constraints. But it turned out that later on, that choice turned out to be a bad one, a choice that wasn't consistent with the rest of the values that we could take on here. And the question is, is there anything we could do to avoid getting into a situation like this, to avoid trying to go down a path that's ultimately not going to lead anywhere, by taking advantage of knowledge that we have initially? And it turns out we do have that kind of knowledge. We can look at just the structure of this graph so far. And we can say that right now C's domain, for example, contains the values Monday, Tuesday, and Wednesday. And based on those values, we can say that this graph is not arc consistent. Recall that arc consistency is all about making sure that for every possible value for a particular node, there is some other value that we are able to choose. And as we can see here, Monday and Tuesday are not going to be possible values that we can choose for C. They're not going to be consistent with a node like B, for example, because B is equal to Tuesday, which means that C cannot be Tuesday. And because A is equal to Monday, C also cannot be Monday. So using that information, by making C arc consistent with A and B, we could remove Monday and Tuesday from C's domain and just leave C with Wednesday, for example. And if we continued to try and enforce arc consistency, we'd see there are some other conclusions we can draw as well. We see that B's only option is Tuesday and C's only option is Wednesday. And so if we want to make E arc consistent, well, E can't be Tuesday, because that wouldn't be arc consistent with B.
And E can’t be Wednesday, because that wouldn’t be arc consistent with C. So we can go ahead and say E and just set that equal to Monday, for example. And then we can begin to do this process again and again, that in order to make D arc consistent with B and E, then D would have to be Wednesday. That’s the only possible option. And likewise, we can make the same judgments for F and G as well. And it turns out that without having to do any additional search, just by enforcing arc consistency, we were able to actually figure out what the assignment of all the variables should be without needing to backtrack at all. And the way we did that is by interleaving this search process and the inference step, by this step of trying to enforce arc consistency. And the algorithm to do this is often called just the maintaining arc consistency algorithm, which just enforces arc consistency every time we make a new assignment of a value to an existing variable. So sometimes we can enforce our consistency using that AC3 algorithm at the very beginning of the problem before we even begin searching in order to limit the domain of the variables in order to make it easier to search. But we can also take advantage of the interleaving of enforcing our consistency with search such that every time in the search process we make a new assignment, we go ahead and enforce arc consistency as well to make sure that we’re just eliminating possible values from domains whenever possible. And how do we do this? Well, this is really equivalent to just every time we make a new assignment to a variable x. We’ll go ahead and call our AC3 algorithm, this algorithm that enforces arc consistency on a constraint satisfaction problem. And we go ahead and call that, starting it with a Q, not of all of the arcs, which we did originally, but just of all of the arcs that we want to make arc consistent with x, this thing that we have just made an assignment to. So all arcs yx, where y is a neighbor of x, something that shares a constraint with x, for example. And by maintaining arc consistency in the backtracking search process, we can ultimately make our search process a little bit more efficient. And so this is the revised version of this backtrack function. Same as before, the changes here are highlighted in yellow. Every time we add a new variable equals value to our assignment, we’ll go ahead and run this inference procedure, which might do a number of different things. But one thing it could do is call the maintaining arc consistency algorithm to make sure we’re able to enforce arc consistency on the problem. And we might be able to draw new inferences as a result of that process. Get new guarantees of this variable needs to be equal to that value, for example. That might happen one time. It might happen many times. And so long as those inferences are not a failure, as long as they don’t lead to a situation where there is no possible way to make forward progress, well, then we can go ahead and add those inferences, those new knowledge, that new pieces of knowledge I know about what variables should be assigned to what values, I can add those to the assignment in order to more quickly make forward progress by taking advantage of information that I can just deduce, information I know based on the rest of the structure of the constraint satisfaction problem. 
And the only other change I'll need to make now is that, if it turns out this value doesn't work, well, then down here, I'll need to remove not only variable equals value, but also any of those inferences that I made; remove those from the assignment as well. So here, then, we're often able to solve the problem by backtracking less than we might originally have needed to, just by taking advantage of the fact that every time we make a new assignment of one variable to one value, that might reduce the domains of other variables as well. And we can use that information to begin to more quickly draw conclusions, in order to try and solve the problem more efficiently as well. And it turns out there are other heuristics we can use to try and improve the efficiency of our search process as well. And it really boils down to a couple of these functions that I've talked about, but whose workings we haven't really talked about. And one of them is this function here, select unassigned variable, where we're selecting some variable in the constraint satisfaction problem that has not yet been assigned. So far, I've sort of just been selecting variables randomly, just picking some unassigned variable in order to decide, all right, this is the variable we're going to assign next, and going from there. But it turns out that by being a little bit intelligent, by following certain heuristics, we might be able to make the search process much more efficient just by choosing very carefully which variable we should explore next. So some of those heuristics include minimum remaining values, or the MRV heuristic, which generally says that if I have a choice between which variable I should select, I should select the variable with the smallest domain, the variable that has the fewest remaining values left, with the idea being that if there are only two remaining values left, well, I may as well prune one of them very quickly in order to get to the other, because one of those two has got to be the solution, if a solution does exist. Sometimes minimum remaining values might not give a conclusive result, if all the nodes have the same number of remaining values, for example. And in that case, another heuristic that can be helpful to look at is the degree heuristic. The degree of a node is the number of nodes that are attached to that node, the number of nodes that are constrained by that particular node. And if you imagine which variable you should choose, should you choose a variable that has a high degree, that is connected to a lot of different things, or a variable with a low degree, that is not connected to a lot of different things? Well, it can often make sense to choose the variable that has the highest degree, that is connected to the most other nodes, as the thing you would search first. Why is that the case? Well, it's because choosing a variable with a high degree is immediately going to constrain the rest of the variables more, and it's more likely to be able to eliminate large sections of the state space that you don't need to search through at all. So what could this actually look like? Let's go back to this search problem here. In this particular case, I've made an assignment here, and I've made an assignment here. And the question is, what should I look at next? And according to the minimum remaining values heuristic, what I should choose is the variable that has the fewest remaining possible values.
So what could this actually look like? Let’s go back to this search problem here. In this particular case, I’ve made an assignment here, and I’ve made an assignment here. And the question is, what should I look at next? According to the minimum remaining values heuristic, what I should choose is the variable that has the fewest remaining possible values. And in this case, that’s this node here, node C, which only has one value left in its domain, Wednesday. That is a very reasonable choice of a next assignment to make, because I know it’s the only option: the only possible value for C is Wednesday, so I may as well make that assignment and then explore the rest of the space after that. But meanwhile, at the very start of the problem, when I didn’t yet have any knowledge of which nodes should have which values, I still had to pick which node should be the first one I try to assign a value to. And I arbitrarily just chose the one at the top, node A, originally. But we can be more intelligent about that. We can look at this particular graph, where all of the variables have domains of the same size, size 3, so minimum remaining values doesn’t really help us. But we might notice that node E has the highest degree; it is connected to the most things. And so perhaps it makes sense to begin our search, rather than at node A at the very top, with the node that has the highest degree: start by searching from node E, because from there, that’s going to much more easily allow us to enforce the constraints that are nearby, eliminating large portions of the search space that I might not need to search through. And in fact, by starting with E, we can immediately then assign other variables, and following that, we can assign the rest of the variables without needing to do any backtracking at all, even without using the inference procedure, just because starting with a node that has a high degree very quickly restricts the possible values that other nodes can take on. So that, then, is how we can go about selecting an unassigned variable in a particular order. Rather than randomly picking a variable, if we’re a little bit intelligent about how we choose it, we can make our search process much, much more efficient, by making sure we don’t have to search through portions of the search space that ultimately aren’t going to matter. The other function we haven’t really talked about is this domain values function, which takes a variable and gives me back a sequence of all of the values inside of that variable’s domain. The naive way to approach it is what we did before: just go in order, Monday, then Tuesday, then Wednesday. But the problem is that going in that order might not be the most efficient order to search in; sometimes it might be more efficient to choose values that are likely to be solutions first, and then go on to other values. Now, how do you assess whether a value is more or less likely to lead to a solution? Well, one thing you can look at is how many constraints get added, how many things get removed from domains, as you make this new assignment of a variable to this particular value. And the heuristic we can use here is the least constraining value heuristic, which is the idea that we should return values in order based on the number of choices that are ruled out for neighboring variables, starting with the least constraining value, the value that rules out the fewest possible options.
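A sketch of that ordering, again assuming the same hypothetical `csp` interface plus a `csp.satisfies(x, vx, y, vy)` test for whether two variable-value pairs are compatible:

```python
def domain_values(var, assignment, csp):
    """Order var's values by the least constraining value heuristic."""
    def rules_out(value):
        # Count how many options assigning var=value would eliminate
        # from the domains of var's unassigned neighbors.
        return sum(
            1
            for neighbor in csp.neighbors(var) if neighbor not in assignment
            for other in csp.domains[neighbor]
            if not csp.satisfies(var, value, neighbor, other)  # hypothetical test
        )
    return sorted(csp.domains[var], key=rules_out)  # least constraining first
```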
And the idea there is that if all I care about is finding a solution, then if I start with a value that rules out a lot of other choices, I’m eliminating a lot of possibilities, which may make it less likely that this particular choice leads to a solution. Whereas on the other hand, if I start by choosing a value that doesn’t rule out very much, well, then I still have a lot of space in which there might be a solution that I could ultimately find. And this might seem a little counterintuitive, a little at odds with what we were talking about before, where I said that when you’re picking a variable, you should pick the variable that has the fewest possible values remaining, yet here I want to pick the value for that variable that is the least constraining. But the general idea is this: when I am picking a variable, I would like to prune large portions of the search space by choosing a variable that is going to let me quickly eliminate possible options; whereas here, within a particular variable, as I’m considering values that variable could take on, I would just like to find a solution. So I want to choose a value that leaves finding a solution as likely as possible. By not ruling out many options, I leave open the possibility that I can still find a solution without needing to go back later and backtrack. An example of that might be this particular situation here, where I’m trying to choose a value for node C, and C is equal to either Tuesday or Wednesday. We know it can’t be Monday, because that conflicts with this domain here, where we already know that A is Monday; so C must be Tuesday or Wednesday. And the question is, should I try Tuesday first, or should I try Wednesday first? If I try Tuesday, what gets ruled out? Well, one option gets ruled out here, a second option gets ruled out here, and a third option gets ruled out here. So choosing Tuesday would rule out three possible options. And what about choosing Wednesday? Well, choosing Wednesday would rule out one option here, and it would rule out one option there. So I have two choices: Tuesday, which rules out three options, or Wednesday, which rules out two. And according to the least constraining value heuristic, what I should probably do is go ahead and choose Wednesday, the one that rules out the fewest possible options, leaving open as many chances as possible for me to eventually find the solution inside of the state space. And ultimately, if you continue this process, we will find the solution: an assignment of variables to values that allows us to give each of these classes an exam date that doesn’t conflict with anyone who happens to be enrolled in two classes at the same time. So the big takeaway now, with all of this, is that there are a number of different ways we can formulate a problem. The ways we’ve looked at today: we can formulate a problem as a local search problem, a problem where we’re looking at a current node and moving to a neighbor based on whether that neighbor is better or worse than the current node we are looking at. We looked at formulating problems as linear programs, where just by putting things in terms of equations and constraints, we’re able to solve problems a little more efficiently.
And we saw formulating a problem as a constraint satisfaction problem, creating this graph of all of the constraints connecting variables that have some constraint between them, and using that information to figure out what the solution should be. And so the takeaway of all of this is that if we have some problem that we would like to use AI to solve, whether that’s trying to figure out where hospitals should be, or trying to solve the traveling salesman problem, trying to optimize production and costs and whatnot, or trying to figure out how to satisfy certain constraints, whether that’s in a Sudoku puzzle, or in trying to schedule exams for a university, or any of a wide variety of types of problems, then if we can formulate that problem as one of these sorts of problems, we can use these known algorithms, these algorithms for enforcing arc consistency and backtracking search, these hill climbing and simulated annealing algorithms, these simplex and interior point algorithms that can be used to solve linear programs, to begin to solve a whole wide variety of problems, all in this world of optimization inside of artificial intelligence. This was an introduction to artificial intelligence with Python for today. We will see you next time. All right. Welcome back, everyone, to an introduction to artificial intelligence with Python. So far in this class, we’ve used AI to solve a number of different problems, giving AI instructions for how to search for a solution, or how to satisfy certain constraints, in order to find its way from some input point to some output point to solve some sort of problem. Today, we’re going to turn to the world of learning, in particular the idea of machine learning, which generally refers to the idea that we are not going to give the computer explicit instructions for how to perform a task, but rather we are going to give the computer access to information in the form of data, or patterns that it can learn from, and let the computer try to figure out what those patterns are, try to understand that data, so that it can perform a task on its own. Now, machine learning comes in a number of different forms, and it’s a very wide field. So today, we’ll explore some of the foundational algorithms and ideas behind a lot of the different areas within machine learning. And one of the most popular is the idea of supervised machine learning, or just supervised learning. Supervised learning is a particular type of task: it refers to the task where we give the computer access to a data set consisting of input-output pairs, and what we would like our AI to do is figure out some function that maps inputs to outputs. So we have a whole bunch of data that generally consists of some kind of input, some evidence, some information that the computer will have access to, and we would like the computer, based on that input information, to predict what some output is going to be. And we’ll give it some data so that the computer can train its model on it and begin to understand how this information works, how the inputs and the outputs relate to each other. But ultimately, we hope that our computer will be able to figure out some function that, given those inputs, is able to get those outputs.
There are a couple of different tasks within supervised learning. The one we’ll focus on and start with is known as classification. Classification is the problem where, if I give you a whole bunch of inputs, you need to figure out some way to map those inputs into discrete categories, and it’s the job of the computer to predict what those categories are going to be. So that might be, for example: I give you information about a bank note, like a US dollar bill, and I’m asking you to predict for me, does it belong to the category of authentic bank notes, or does it belong to the category of counterfeit bank notes? You need to categorize the input, and we want to train the computer to figure out some function to be able to do that calculation. Another example might be the case of weather, something we’ve talked about a little bit so far in this class, where we would like to predict, on a given day, is it going to rain on that day? Is it going to be cloudy on that day? And before, we’ve seen how we could do this if we give the computer all the exact probabilities: if these are the conditions, what’s the probability of rain? Oftentimes, though, we don’t have access to that information. But what we do have access to is a whole bunch of data. So if we wanted to predict something like, is it going to rain or is it not going to rain, we would give the computer historical information about days when it was raining and days when it was not raining, and ask the computer to look for patterns in that data. So what might that data look like? Well, we could structure that data in a table, where for any particular day, going back, we have information about that day’s humidity and that day’s air pressure, and then, importantly, we have a label, something where a human has said that on this particular day it was raining or it was not raining. You could fill in this table with a whole bunch of data. And what makes this what we would call a supervised learning exercise is that a human has gone in and labeled each of these data points: said that on this day, when these were the values for the humidity and pressure, that day was a rainy day, and this day was a not rainy day. And what we would like the computer to be able to do, then, is to figure out, given these inputs, given the humidity and the pressure, can the computer predict what label should be associated with that day? Does that day look more like a day that rains, or more like a day when it’s not going to rain? Put a little more mathematically, you can think of this as a function that takes two inputs, the data points that our computer will have access to, things like humidity and pressure. So we could write a function f that takes as input both humidity and pressure, and the output is going to be what category we would ascribe to these particular input points, what label we would associate with that input. So we’ve seen a couple of example data points here, where given this value for humidity and this value for pressure, we predict, is it going to rain or is it not going to rain? And that’s information that we just gathered from the world: we measured, on various different days, what the humidity and pressure were, and we observed whether or not we saw rain on that particular day. And this function f is what we would like to approximate.
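One simple way to represent such a labeled data set in Python is as a list of input-label pairs; the numbers here are invented purely for illustration:

```python
# Each example pairs an input (humidity, pressure) with a human-assigned label.
data = [
    ((0.93, 0.998), "rain"),
    ((0.49, 1.022), "no rain"),
    ((0.88, 0.985), "rain"),
    ((0.30, 1.040), "no rain"),
]
```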
Now, the computer, and we humans, don’t really know exactly how this function f works. It’s probably quite a complex function. So what we’re going to do instead is attempt to estimate it. We would like to come up with a hypothesis function, h, which is going to try to approximate what f does. We want some function h that will take the same inputs and will also produce an output, rain or no rain, and ideally, we’d like these two functions to agree as much as possible. So the goal of the supervised learning classification task is going to be to figure out: what does that function h look like? How can we begin to estimate, given all of this information, all of this data, what category or what label should be assigned to a particular data point? So where could you begin doing this? Well, a reasonable thing to do, especially in this situation where I have two numerical values, is to try to plot this on a graph that has two axes, an x-axis and a y-axis. In this case, we’re just going to be using two numerical values as input, but these same types of ideas scale as you add more and more inputs. We’ll be plotting things in two dimensions, but as we’ll soon see, you could add more inputs and just imagine things in multiple dimensions. And while we humans have trouble conceptualizing anything beyond three dimensions, at least visually, a computer has no problem trying to imagine things in many, many more dimensions; for a computer, each dimension is just some separate number that it is keeping track of. So it wouldn’t be unreasonable for a computer to think in 10 dimensions or 100 dimensions to try to solve a problem. But for now, we’ve got two inputs, so we’ll graph things along two axes: an x-axis, which will here represent humidity, and a y-axis, which here represents pressure. And what we might do is say, let’s take all of the days that were raining, plot them on this graph, and see where they fall. And here might be all of the rainy days, where each rainy day is one of these blue dots, corresponding to a particular value for humidity and a particular value for pressure. And then I might do the same thing with the days that were not rainy: take all the not rainy days, figure out what their values were for each of these two inputs, and plot them on this graph as well, here plotted in red. So blue here stands for a rainy day; red here stands for a not rainy day. And this, then, is the input that my computer has access to. And what I would like the computer to be able to do is to train a model such that if I’m ever presented with a new input that doesn’t have a label associated with it, something like this white dot here, I would like to predict, given those values for each of the two inputs, should we classify it as a blue dot, a rainy day, or should we classify it as a red dot, a not rainy day? And if you’re just looking at this picture graphically, trying to say, all right, this white dot, does it look like it belongs to the blue category or the red category, I think most people would agree that it probably belongs to the blue category. And why is that? Well, it looks like it’s close to other blue dots. And that’s not a very formal notion, but it’s a notion that we’ll formalize in just a moment.
Because it seems to be close to this blue dot here, and nothing else is closer to it, we might say that it should be categorized as blue: it should fall into the category of, I think that day is going to be a rainy day, based on that input. That might not be totally accurate, but it’s a pretty good guess. And this type of algorithm is actually a very popular and common machine learning algorithm known as nearest neighbor classification. It’s an algorithm for solving these classification-type problems. In nearest neighbor classification, given an input, the algorithm will choose the class of the nearest data point to that input. By class, we here just mean category, like rain or no rain, counterfeit or not counterfeit; and we choose the category, or the class, based on the nearest data point. So given all that data we just looked at: is the nearest data point a blue point, or is it a red point? And depending on the answer to that question, we were able to make some sort of judgment, to say something like, we think it’s going to be blue, or we think it’s going to be red. Likewise, we could apply this to other data points that we encounter as well. If suddenly this data point comes about, well, its nearest data point is red, so we would classify this as a red point: not raining. Things get a little trickier, though, when you look at a point like this white point over here and ask the same sort of question: should it belong to the category of blue points, the rainy days, or to the category of red points, the not rainy days? Nearest neighbor classification would say that the way you solve this problem is to look at whichever point is nearest. You look at this nearest point and say it’s red; it’s a not rainy day. And therefore, according to nearest neighbor classification, this unlabeled point should also be red, classified as a not rainy day. Your intuition might say that that’s a reasonable judgment to make: the closest thing is a not rainy day, so we may as well guess that it’s a not rainy day. But it’s probably also reasonable to look at the bigger picture of things and say, yes, it is true that the nearest point to it is a red point, but it’s surrounded by a whole bunch of other blue points. So looking at the bigger picture, there’s potentially an argument to be made that this point should actually be blue. And with only this data, we actually don’t know for sure. We are given some input, something we’re trying to predict, and we don’t necessarily know what the output is going to be; in this case, which answer is correct is difficult to say. But oftentimes, considering more than just a single neighbor, considering multiple neighbors, can give us a better result. And so there’s a variant on the nearest neighbor classification algorithm known as the k-nearest-neighbor classification algorithm, where k is some parameter, some number that we choose, for how many neighbors we are going to look at. One-nearest-neighbor classification is what we saw before: just pick the one nearest neighbor and use that category. But with k-nearest-neighbor classification, where k might be 3, or 5, or 7, meaning look at the 3, or 5, or 7 closest data points to that point, it works a little bit differently. This algorithm, we’ll give it an input.
Choose the most common class out of the k nearest data points to that input. So if we look at the five nearest points, and three of them say it’s raining and two of them say it’s not raining, we’ll go with the three instead of the two, because each one effectively gets one vote toward what it believes the category ought to be, and ultimately you choose the category that has the most votes. So k-nearest-neighbor classification is a fairly straightforward algorithm to understand intuitively: you just look at the neighbors and figure out what the answer might be. And it turns out this can work very, very well for solving a whole variety of different types of classification problems. But not every model is going to work in every situation. And so one of the things we’ll take a look at today, especially in the context of supervised machine learning, is that there are a number of different approaches to machine learning, a number of different algorithms we can apply, all solving the same type of problem, all solving some kind of classification problem where we want to take inputs and organize them into different categories. And no one algorithm is necessarily always going to be better than some other algorithm; they each have their trade-offs. And maybe, depending on the data, one type of algorithm is going to be better suited to modeling that information than some other algorithm. And so this is what a lot of machine learning research ends up being about: when you’re trying to apply machine learning techniques, you’re often looking not just at one particular algorithm but trying multiple different algorithms, to see what is going to give you the best results for predicting some function that maps inputs to outputs. So what, then, are the drawbacks of k-nearest-neighbor classification? Well, there are a couple. One might be that, in a naive approach at least, it could be fairly slow to have to go through and measure the distance between a point and every single one of these points that exist here. Now, there are ways of trying to get around that. There are data structures that can help make it quicker to find these neighbors. There are also techniques you can use to prune some of this data, removing some of the data points so that you’re only left with the relevant ones, just to make it a little bit easier.
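None of this appears as code in the lecture, but the basic algorithm is short; here is a minimal sketch in Python, reusing the `data` list from the earlier snippet (the Euclidean distance metric and the default k of 3 are illustrative choices, not something the lecture prescribes):

```python
from collections import Counter
import math

def k_nearest_neighbors(data, point, k=3):
    """Classify `point` by a majority vote among its k nearest labeled neighbors."""
    # Sort the labeled examples by Euclidean distance to the query point.
    neighbors = sorted(data, key=lambda example: math.dist(example[0], point))[:k]
    # Each of the k nearest points gets one vote for its own label.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(k_nearest_neighbors(data, (0.85, 0.990), k=3))  # e.g. "rain"
```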
But ultimately, what we might like to do is come up with another way of doing this classification. One way was looking at the neighboring points. Another way might be to look at all of the data and see if we can come up with some decision boundary, some boundary that will separate the rainy days from the not rainy days. In the case of two dimensions, we can do that by drawing a line, for example. So what we might want to try to do is find some line, some separator, that divides the rainy days, the blue points over here, from the not rainy days, the red points over there. We’re now trying a different approach, in contrast with the nearest neighbor approach, which just looked at local data around the input data point we cared about. Now what we’re doing is trying to use a technique known as linear regression to find some sort of line that will separate the two halves from each other. Now, sometimes it’ll actually be possible to come up with some line that perfectly separates all the rainy days from the not rainy days. Realistically, though, this is probably cleaner than many data sets will actually be. Oftentimes data is messier: there are outliers, there’s random noise that happens inside of a particular system, and we’d still like to be able to figure out what a line might look like. So in practice, the data will not always be linearly separable, where linearly separable refers to a data set in which I could draw a line to separate the two halves of it perfectly. Instead, you might have a situation like this, where there are some rainy points on this side of the line and some not rainy points on that side of the line, and there may not be a line that perfectly separates one half of the inputs from the other half, that perfectly separates all the rainy days from the not rainy days. But we can still say that this line does a pretty good job, and we’ll try to formalize a little bit later what we mean when we say that this line does a pretty good job of making that prediction. For now, let’s just say we’re looking for a line that does as good a job as we can at separating one category of things from another. So let’s now try to formalize this a little more mathematically. We want to come up with some sort of function, some way we can define this line. Our inputs are things like humidity and pressure in this case, so we might call our inputs x1, which is going to represent humidity, and x2, which is going to represent pressure. These are the inputs we are going to provide to our machine learning algorithm, and given those inputs, we would like for our model to predict some sort of output. We are going to predict that using our hypothesis function, which we called h. Our hypothesis function is going to take as input x1 and x2, humidity and pressure in this case. And you can imagine that if we didn’t just have two inputs but three or four or five or more, the hypothesis function could take all of those as input; we’ll see examples of that a little bit later as well. And now the question is, what does this hypothesis function do? Well, it really just needs to measure: is this data point on one side of the boundary, or is it on the other side? And how do we formalize that boundary? The boundary is generally going to be a linear combination of these input variables, at least in this particular case. What we mean by linear combination is that we take each of these inputs and multiply them by some number that we’re going to have to figure out. We’ll generally call that number a weight, for how important these variables should be in determining the answer. So we’ll weight each of these variables with some weight, and we might add a constant to it, just to shift the function by some amount. Then we just need to compare the result: is it greater than 0 or less than 0, to say whether the point belongs on one side of the line or the other? So what that mathematical expression might look like is this: we take each of my variables, x1 and x2, and multiply them by some weight. I don’t yet know what those weights are, but they’re going to be some numbers, weight 1 and weight 2. And maybe we just want to add some other weight 0 to it, because the function might require us to shift the entire value up or down by a certain amount. And then we just compare: if we do all this math, is it greater than or equal to 0?
If so, we might categorize that data point as a rainy day; otherwise, we might say no rain. So the key here, then, is that this expression is how we are going to calculate whether it’s a rainy day or not. We do a bunch of math where we take each of the variables, multiply them by a weight, maybe add an extra weight to it, and see if the result is greater than or equal to 0. And using the result of that expression, we’re able to determine whether it’s raining or not raining. This expression is, in this case, just going to refer to some line; if you were to plot it graphically, it would be some line. And what the line actually looks like depends upon these weights. x1 and x2 are the inputs, but the weights are really what determine the shape of that line, the slope of that line, what that line actually looks like. So we would like to figure out what these weights should be. We can choose whatever weights we want, but we want to choose weights in such a way that if you pass in a rainy day’s humidity and pressure, you end up with a result that is greater than or equal to 0, and such that if we pass our hypothesis function a not rainy day’s inputs, the output we get should be not raining. So before we get there, let’s formalize this a little more mathematically, just to get a sense for how you’ll often see this if you ever go further into supervised machine learning and explore these ideas. One thing is that, for these categories, we’ll sometimes just use the names of the categories, like rain and not rain. But mathematically, if we’re trying to do comparisons between these things, it’s often easier just to deal in the world of numbers, so we could say 1 and 0: 1 for raining, 0 for not raining. So we do all this math, and if the result is greater than or equal to 0, we’ll say our hypothesis function outputs 1, meaning raining; otherwise, it outputs 0, meaning not raining. And oftentimes, this type of expression will instead be expressed using vector mathematics. All a vector is, if you’re not familiar with the term, is a sequence of numerical values; you could represent that in Python using a list of numerical values, or a tuple of numerical values. And here, we have a couple of sequences of numerical values. One of our vectors, one of our sequences of numerical values, is all of these individual weights: w0, w1, and w2. So we could construct what we’ll call a weight vector, and we’ll see why this is useful in a moment, called w, generally represented using a boldface w, that is just a sequence of these three weights: weight 0, weight 1, and weight 2. And to calculate, based on those weights, whether we think a day is raining or not raining, we’re going to multiply each of those weights by one of our input variables: w2 is going to be multiplied by input variable x2, and w1 is going to be multiplied by input variable x1. And w0, well, it’s not being multiplied by anything; but to make sure the vectors are the same length, and we’ll see why that’s useful in just a second, we’ll say w0 is being multiplied by 1, because you can multiply something by 1 and end up with the exact same number. So in addition to the weight vector w, we’ll also have an input vector that we’ll call x, with three values: 1, again because we’re just multiplying w0 by 1 eventually, and then x1 and x2.
So here, then, we’ve represented two distinct vectors. One is a vector of weights, which we need to somehow learn: the goal of our machine learning algorithm is to learn what this weight vector is supposed to be. We could choose any arbitrary set of numbers, and it would produce a function that tries to predict rain or not rain, but it probably wouldn’t be very good; what we want is to come up with a good choice of these weights so that we’re able to make accurate predictions. And the input vector represents a particular input to the function, a data point for which we would like to estimate: is that day a rainy day, or is it a not rainy day? That’s going to vary depending on what input is provided to our function, what it is we are trying to estimate. And then, to do the calculation, we want to calculate this expression, and it turns out that expression is what we would call the dot product of these two vectors. The dot product of two vectors just means taking each of the terms in the vectors and multiplying them together, w0 multiplied by 1, w1 multiplied by x1, w2 multiplied by x2, which is why these vectors need to be the same length, and then adding all of the results together. So the dot product of w and x, our weight vector and our input vector, is just going to be w0 times 1, or just w0, plus w1 times x1, plus w2 times x2. So we have our weight vector, which we need our machine learning algorithm to figure out, and the input vector, representing the data point for which we’re trying to predict a category, a label. And we’re able to do that calculation by taking this dot product, which you’ll often see represented in vector form. But if you haven’t seen vectors before, you can think of it as identical to just this mathematical expression: doing the multiplications, adding the results together, and then seeing whether or not the result is greater than or equal to 0. This expression here is identical to the expression that we’re calculating to see whether that answer is greater than or equal to 0. And so for that reason, you’ll often see the hypothesis function written as something like this, a simpler representation, where the hypothesis takes as input some input vector x, some humidity and pressure for some day, and we want to predict an output like rain or no rain, or 1 or 0 if we choose to represent things numerically. And the way we do that is by taking the dot product of the weights and our input: if it’s greater than or equal to 0, we say the output is 1; otherwise, the output is 0. This hypothesis, we say, is parameterized by the weights. Depending on what weights we choose, we’ll end up with a different hypothesis. If we choose the weights randomly, we’re probably not going to get a very good hypothesis function: we’ll get a 1 or a 0, but it’s probably not going to accurately reflect whether we think a day is going to be rainy or not rainy. But if we choose the weights right, we can often do a pretty good job of estimating whether the output of the function should be a 1 or a 0. And so the question, then, is how to figure out what these weights should be, how to tune those parameters.
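That thresholded dot product might be sketched in Python like this; a minimal sketch, with the vector layout (a leading 1 in x paired with w0) following the description above:

```python
def hypothesis(weights, x):
    """h(x): output 1 (rain) if the dot product w . x >= 0, else 0 (no rain).

    weights is [w0, w1, w2] and x is [1, x1, x2], so the leading 1 pairs with w0.
    """
    dot_product = sum(w_i * x_i for w_i, x_i in zip(weights, x))
    return 1 if dot_product >= 0 else 0
```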
And there are a number of ways to do that. One of the most common is known as the perceptron learning rule, and we’ll see more of this later. The idea of the perceptron learning rule, and we’re not going to get too deep into the mathematics, we’ll mostly introduce it conceptually, is to say that given some data point we would like to learn from, some data point that has an input x and an output y, where y is like 1 for rain or 0 for not rain, we’re going to update the weights. And we’ll look at the formula in just a moment. But the big-picture idea is that we can start with random weights and then learn from the data: take the data points one at a time, and for each one, figure out what we need to change inside of the weights in order to better match that data point. And that is the value of having access to a lot of data in a supervised machine learning algorithm: you take each of the data points, maybe look at them multiple times, and constantly figure out whether you need to shift your weights in order to create some weight vector that more accurately estimates what the output should be, whether we think it’s going to be raining or not. So what does that weight update look like? Without going into too much of the mathematics, we’re going to update each weight to be the result of the original weight plus some additional expression: each weight wi becomes wi + α × (y − h(x)) × xi. And to understand this expression: y is what the actual output is, and h(x), the hypothesis applied to the input, is what we predicted the output to be. So I can read that middle term as the actual value minus our estimate. And based on the difference between the actual value and our estimate, we might want to change our hypothesis, change the way we do that estimation. If the actual value and the estimate were the same thing, meaning we correctly predicted what category this data point belonged to, then actual value minus estimate is just going to be 0, which means the whole term on the right-hand side is 0, and the weight doesn’t change: weight i, where i is like 1 or 2 or 0, just stays at weight i. None of the weights change if we were able to correctly predict the category of the input. But if our hypothesis didn’t correctly predict the category, well, then maybe we need to make some changes, to adjust the weights so that we’re better able to predict this kind of data point in the future. And how might we do that? Well, if the actual value was bigger than the estimate, and for now we’ll assume these x’s are positive values, then we need to increase the weight in order to make the output bigger, so that we’re more likely to get to the right actual value. In that case, actual value minus estimate is a positive number, so you can imagine we’re just adding some positive number to the weight to increase it ever so slightly. And likewise, the inverse case is true: if the actual value was less than the estimate, the actual value was 0 but we estimated 1, meaning it was actually not raining but we predicted it was going to be raining.
Well, then we want to decrease the value of the weight, because in that case we want to lower the total value of that dot product, to make it less likely that we would predict that it would actually be raining. So no need to get too deep into the mathematics, but the general idea is that every time we encounter some data point, we can adjust these weights accordingly, to make the weights better line up with the actual data we have access to. And you can repeat this process, data point after data point, until eventually, hopefully, the algorithm converges to some set of weights that do a pretty good job of figuring out whether a day is going to be rainy or not. And just as a final point about this particular equation: this value alpha here is generally what we’ll call the learning rate. It’s just some parameter, some number we choose, for how quickly we’re actually going to be updating these weight values. If alpha is bigger, then we’re going to update these weight values by a lot; if alpha is smaller, we’ll update them by less. You can choose a value of alpha, and depending on the problem, different values might suit the situation better or worse than others.
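In Python, one step of that update might be sketched like this, reusing the `hypothesis` function from the earlier snippet (the default alpha of 0.1 is an arbitrary illustrative choice):

```python
def perceptron_update(weights, x, y, alpha=0.1):
    """One step of the perceptron learning rule: w_i <- w_i + alpha * (y - h(x)) * x_i."""
    error = y - hypothesis(weights, x)  # 0 if correct; +1 or -1 if wrong
    # Each weight moves in the direction that would have reduced the error.
    return [w_i + alpha * error * x_i for w_i, x_i in zip(weights, x)]
```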
So after all of that, after we’ve done this training process of taking all this data and, using this learning rule, looking at each piece of data as an indication of whether the weights stay the same, increase, or decrease, and if so, by how much, what you end up with is effectively a threshold function. And we can look at what the threshold function looks like, like this. On the x-axis here, we have the output of that function, the dot product of the weights with the input; and on the y-axis, we have what the output is going to be: 0, which in this case represents not raining, and 1, which in this case represents raining. The way our hypothesis function works is that it calculates this value, and if it’s greater than 0, or greater than some threshold value, we declare that it’s a rainy day; otherwise, we declare that it’s a not rainy day. Graphically, this is what that function looks like: initially, when the value of the dot product is small, it’s not raining, it’s not raining, it’s not raining; but as soon as it crosses that threshold, we suddenly say, OK, now it’s raining, now it’s raining, now it’s raining. And the way to interpret this kind of representation is that anything on this side of the line falls into the category of data points where we say, yes, it’s raining, and anything on the other side falls into the category where we say it’s not raining. And again, we want to choose some value for the weights that results in a function that does a pretty good job of making this estimation. But one tricky thing with this type of hard threshold is that it only leaves two possible outcomes. We plug in some data as input, and the output we get is raining or not raining, with no room for anything in between. And maybe that’s what you want: maybe, given some data point, all you want is to classify it into one of two or more of these various categories. But it might also be the case that you care about knowing how strong that prediction is. So if we go back to this instance here, where we have rainy days on this side of the line and not rainy days on that side of the line, let’s look now at these two white data points: this one here, which we would like to predict a label or category for, and this one over here, which we would also like to predict a label or category for. It seems likely that you could pretty confidently say that this data point should be a rainy day: it’s close to the other rainy days, if we’re going by the nearest neighbor strategy, and it’s on the rainy side of the line, if we’re going by the strategy of checking which side of the decision boundary it falls on. And by the line strategy, we’d also say that this point over here is a rainy day, because it, too, falls on the side of the line that corresponds to rainy days. But it’s likely that, even so, we don’t feel nearly as confident about the data point on the left as about the data point on the right. For the one on the right, we can feel very confident that, yes, it’s a rainy day; the other one is pretty close to the line, if we’re judging just by distance, and so you might be less sure. But our threshold function doesn’t allow for a notion of less sure or more sure about something. It’s what we would call a hard threshold: once you’ve crossed this line, we immediately say, yes, this is going to be a rainy day, and anywhere before it, we say it’s not a rainy day. And that may not be helpful, for a couple of reasons. One, this is not a particularly easy function to deal with: as you get deeper into the world of machine learning and are trying to do things like taking derivatives of these curves, this type of function makes things challenging. The other challenge is that we don’t have any notion of gradation between things. We don’t have a notion of, yes, this is a very strong belief that it’s going to be raining, as opposed to, it’s probably more likely than not that it’s going to be raining, but we’re maybe not totally sure about that either. So what we can do, by taking advantage of a technique known as logistic regression, is use, instead of this hard threshold type of function, a logistic function, something we might call a soft threshold. That’s going to transform this into something that looks a little more like this, a curve that changes more smoothly. As a result, the possible output values are no longer just 0 and 1, 0 for not raining and 1 for raining; you can get any real-numbered value between 0 and 1. If you’re way over on this side, you get a value of 0: it’s not going to be raining, and we’re pretty sure about that. If you’re over on this side, you get a value of 1: yes, we’re very sure that it’s going to be raining. But in between, you could get some real-numbered value, where a value like 0.7 might mean we think it’s going to rain, that it’s more probable that it’s going to rain than not based on the data, but we’re not as confident as for some of the other data points. So one of the advantages of the soft threshold is that it allows us to have an output that could be some real number, potentially reflecting some sort of probability, the likelihood that we think this particular data point belongs to that particular category.
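Swapping the hard threshold for a logistic function might look like this; a minimal sketch of the activation only, not a full logistic regression implementation (which would also involve learning the weights differently):

```python
import math

def soft_threshold(weights, x):
    """Logistic (sigmoid) activation: squashes the dot product into the range (0, 1)."""
    dot_product = sum(w_i * x_i for w_i, x_i in zip(weights, x))
    return 1 / (1 + math.exp(-dot_product))  # e.g. 0.7 ~ "probably rain"
```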
And there are some other nice mathematical properties of that as well. So those, then, are two different approaches to trying to solve this type of classification problem. One is the nearest neighbor type of approach, where you take a data point and look at the data points nearby to estimate what category we think it belongs to. And the other is the approach of saying, all right, let’s just try to use linear regression, figure out what these weights should be, adjusting the weights in order to figure out what line, what decision boundary, is going to best separate these two categories. It turns out that another very popular approach, if you just have a data set and you want to start trying to do some learning on it, is what we call the support vector machine. We’re not going to go too much into the mathematics of the support vector machine, but we’ll at least explore it graphically to see what it looks like. The motivation behind the support vector machine is the observation that there are actually a lot of different lines, a lot of different decision boundaries, that we could draw to separate two groups. So for example, with the red data points over here and the blue data points over here, one possible line I could draw is a line like this, which separates the red points from the blue points, and does so perfectly: all the red points are on one side of the line, all the blue points on the other. But this should probably make you a little bit nervous, if you come up with a model and the model produces a line that looks like this. The reason why is that you worry about how well it’s going to generalize to data points that are not necessarily in the data set we have access to. For example, if there was a point that fell right here, on the right side of the line, well, then based on its position we might want to guess that it is, in fact, a red point, but it falls on the side of the line where we would instead estimate that it’s a blue point. So this line is probably not a great choice, just because it is so close to these various data points. We might instead prefer a diagonal line that goes diagonally through the data set, like we’ve seen before. But there, too, there are a lot of diagonal lines that we could draw. For example, I could draw this diagonal line here, which also successfully separates all the red points from all of the blue points. From the perspective of just trying to figure out some setting of weights that allows us to predict the correct output, this line will predict the correct output for this particular set of data every single time, because the red points are on one side and the blue points are on the other. But yet again, you should probably be a little nervous, because this line is so close to these red points. Even though we’re able to correctly predict on the input data, if there was a point that fell somewhere in this general area, our model would say that, yeah, we think it’s a blue point, when in actuality it might belong to the red category, just because it looks like it’s close to the other red points. What we really want, given this data, in order to generalize as well as possible, is to come up with a line like this, which seems like the intuitive line to draw.
And the reason why it’s intuitive is that it seems to be as far apart as possible from the red data and the blue data, so that if we generalize a little bit, and assume that maybe we have some points that are different from the input but still slightly further away, we can still say that something on this side is probably red and something on that side is probably blue, and make those judgments that way. And that is what support vector machines are designed to do: they’re designed to find what we call the maximum margin separator, where the maximum margin separator is just some boundary that maximizes the distance between the groups of points, rather than some boundary that’s very close to one set or the other. In the case before, we wouldn’t have cared: as long as we’re categorizing the input well, that seemed to be all we needed to do. The support vector machine, though, will try to find this maximum margin separator, some way of maximizing that particular distance. And it does so by finding what we call the support vectors, which are the vectors closest to the line, and trying to maximize the distance between the line and those particular points. It works that way in two dimensions; it also works in higher dimensions, where we’re not looking for some line that separates the data points, but instead for what we generally call a hyperplane, some decision boundary, effectively, that separates one set of data from the other. And this ability of support vector machines to work in higher dimensions actually has a number of other applications as well. One is that it helpfully deals with cases where data may not be linearly separable. We talked about linear separability before, this idea that you can take data and draw a line, some linear combination of the inputs, that allows us to perfectly separate the two sets from each other. There are some data sets that are not linearly separable, and some where even in two dimensions you would not be able to find a line that does a good job of that kind of separation. Something like this, for example, where here are the red points, with the blue points around them. If you try to find a line that divides the red points from the blue points, it’s going to be difficult, if not impossible: whatever line you choose, if you draw a line here, for instance, then you misclassify all of these blue points, which should actually be blue and not red; and anywhere else you draw a line, there’s going to be a lot of error, a lot of mistakes, a lot of what we’ll soon call loss to that line you draw, a lot of points that you’re going to categorize incorrectly. What we really want is to find a better decision boundary, which may not be just a straight line through this two-dimensional space. And what support vector machines can do is begin to operate in higher dimensions, and thereby find some other decision boundary, like the circle in this case, that is able to separate one of these sets of data from the other a lot better. So oftentimes, in data sets where the data is not linearly separable, support vector machines, by working in higher dimensions, can figure out a way to solve that kind of problem effectively.
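The lecture doesn’t show code for this, but as one illustration, scikit-learn’s SVC class implements a support vector machine; with a radial basis function (RBF) kernel it can learn exactly this kind of non-linear boundary. The data values here are invented:

```python
from sklearn.svm import SVC

# X: [humidity, pressure] inputs; y: 1 for rain, 0 for no rain (values invented).
X = [[0.93, 0.998], [0.49, 1.022], [0.88, 0.985], [0.30, 1.040]]
y = [1, 0, 1, 0]

# The RBF kernel lets the learned boundary be non-linear in the original two
# dimensions, like the circular boundary described above.
model = SVC(kernel="rbf")
model.fit(X, y)
print(model.predict([[0.85, 0.990]]))
```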
So those, then, are three different approaches to trying to solve these sorts of classification problems: support vector machines; linear regression and the perceptron learning rule, to figure out how to categorize inputs and outputs; and the nearest neighbor approach. No one of them is necessarily better than any other; again, it’s going to depend on the data set, the information you have access to, and what the function you’re ultimately trying to predict looks like. And this is where a lot of research and experimentation can be involved, in trying to figure out how best to perform that kind of estimation. But classification is only one of the tasks you might encounter in supervised machine learning. In classification, what we’re trying to predict is some discrete category: red or blue, rain or not rain, authentic or counterfeit. But sometimes what we want to predict is a real-numbered value. And for that, we have a related problem, not classification, but one known instead as regression. Regression is the supervised learning problem where we try to learn a function mapping inputs to outputs, same as before; but instead of the outputs being discrete categories, things like rain or not rain, in a regression problem the output values are generally continuous values, some real number that we would like to predict. This happens all the time as well. You might imagine that a company might take this approach if it’s trying to figure out, for instance, what the effect of its advertising is: how do advertising dollars spent translate into sales for the company’s product, for example? And so they might like to predict some function that takes as input the amount of money spent on advertising. Here, we’re just going to use one input, but again, you could scale this up to many more inputs if you have a lot of different kinds of data. And the goal is to learn a function such that, given this amount of spending on advertising, we’re going to get this amount in sales. And you might judge that based on having access to a whole bunch of data: for every past month, here is how much we spent on advertising, and here is what sales were. And we would like to come up with some sort of hypothesis function that, given the amount spent on advertising, predicts, in this case, some real number, some estimate of how much in sales we expect that company to do in this month, or in this quarter, or in whatever unit of time we’re choosing to measure things in. And so again, to solve this type of problem, we could try a linear regression type of approach, where we take this data and just plot it: on the x-axis, we have advertising dollars spent; on the y-axis, we have sales. And we might just want to draw a line that does a pretty good job of estimating this relationship between advertising and sales. In this case, unlike before, we’re not trying to separate the data points into discrete categories; instead, we’re trying to find a line that approximates the relationship between advertising and sales, so that if we want to figure out the estimated sales for a particular advertising budget, we just look it up on this line: for this amount of advertising, we would have this amount of sales, and we make the estimate that way. And so you can try to come up with a line, again figuring out how to modify the weights using various techniques, to try to make it so that this line fits as well as possible.
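As a rough illustration of fitting such a line in Python (the figures are invented, and NumPy’s `polyfit`, which finds the least-squares line, is just one of several ways the weights could be fit):

```python
import numpy as np

advertising = np.array([1000, 2000, 3000, 4000, 5000])  # dollars spent (invented)
sales = np.array([9500, 20100, 28900, 41000, 48800])    # resulting sales (invented)

# Fit a degree-1 polynomial, i.e. a line: sales ~= slope * advertising + intercept.
slope, intercept = np.polyfit(advertising, sales, 1)
estimated_sales = slope * 3500 + intercept  # estimate for a $3,500 budget
```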
So with all of these approaches, then, to trying to solve machine learning style problems, the question becomes: how do we evaluate these approaches? How do we evaluate the various different hypotheses that we could come up with? Each of these algorithms will give us some sort of hypothesis, some function that maps inputs to outputs, and we want to know: how well does that function work? You can think of evaluating these hypotheses, and trying to get a better hypothesis, as kind of like an optimization problem. In an optimization problem, as you recall from before, we were either trying to maximize some objective function by finding a global maximum, or trying to minimize some cost function by finding a global minimum. And in the case of evaluating these hypotheses, one thing we might say is that the cost function, the thing we’re trying to minimize, is what we would call a loss function. A loss function is a function that is going to estimate for us how poorly our function performs. More formally, it’s a loss of utility: whenever we predict something that is wrong, that is a loss of utility, and it’s going to add to the output of our loss function. You could come up with any loss function you want, just some mathematical way of estimating, given what each actual output is and what our projected output, our estimate, is, some sort of numerical loss. But there are a couple of popular loss functions worth discussing, just so that you’ve seen them before. When it comes to discrete categories, things like rain or not rain, counterfeit or not counterfeit, one approach is the 0-1 loss function. The way that works is that, for each of the data points, our loss function takes as input what the actual output is, like whether it was actually raining or not raining, and takes our prediction into account: did we predict, given this data point, that it was raining or not raining? If the actual value equals the prediction, the 0-1 loss function says the loss is 0; there was no loss of utility, because we were able to predict correctly. Otherwise, if the actual value was not the same as what we predicted, our loss is 1. We lost some utility, because what we predicted as the output of the function was not what it actually was. And the goal, then, in a situation like this, would be to come up with some hypothesis that minimizes the total empirical loss, the total amount we’ve lost if you add up, across all these data points, what the actual output is and what your hypothesis would have predicted. So in this case, for example, if we go back to classifying days as raining or not raining, and we came up with this decision boundary, how would we evaluate it? How much better is it than drawing the line here, or drawing the line there? Well, we could take each of the input data points, where each data point has a label, whether it was raining or not raining, compare it to the prediction, whether we predicted it would be raining or not raining, and assign it a numerical value as a result. So for example, these points over here were all rainy days, and we predicted they would be raining, because they fall on the bottom side of the line. So they have a loss of 0, nothing lost from those situations.
And likewise, the same is true for some of the points where it was not raining and we predicted it would not be raining either. Where we do have loss are the points where we predicted it would not be raining, but in actuality it's a blue point, it was raining, or likewise where we predicted rain, but in actuality it's a red point, it was not raining. We miscategorized those data points we were trying to train on, and as a result there is some loss: one loss here, there, here, and there, for a total loss of 4, for example, in this case. That might be how we would say this line is better than a line that goes somewhere else or further down, because this line might minimize the loss: there may be no way to do better than these four points of loss if you're just drawing a straight line through our space. So the 0-1 loss function checks: did we get it right, or did we get it wrong? If we got it right, the loss is 0, nothing lost; if we got it wrong, the loss function for that data point says 1. We add up all of those losses across all of our data points to get the empirical loss: how much we have lost across all of the original data points the algorithm had access to. There are other forms of loss as well that work especially well when we deal with real-valued cases, cases like the mapping between advertising budget and the amount we do in sales. In that case, you care not just whether you get the number exactly right, but how close you were to the actual value. If the actual value is $2,800 in sales and you predicted $2,900, maybe that's pretty good; it's much better than predicting $1,000 in sales. So we would like our loss function to take that into account as well: not just whether the actual value and the predicted value are exactly the same, but how far apart they were. For that, one approach is what we call L1 loss. L1 loss doesn't just look at whether actual and predicted are equal to each other; we take the absolute value of the actual value minus the predicted value. In other words, we ask how far apart the actual and predicted values were, and we sum that up across all of the data points to get our ultimate answer. So what might this look like for our data set? If we go back to the representation with advertising along the x-axis and sales along the y-axis, our line was our prediction, our estimate, for any given amount of advertising, of what sales were going to be. Our L1 loss is just how far apart, vertically along the sales axis, our prediction was from each of the data points. We can figure out exactly how far our prediction was from each data point and, as a result, what our overall loss is for this particular hypothesis, just by adding up all of these individual losses. Our goal, then, is to minimize that loss: to come up with some line that minimizes the utility lost, judging by how far our estimated amount of sales is from the actual amount of sales.
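Since each of these loss functions is just a few lines of code, here is a minimal sketch of all three with invented numbers; the L2 loss shown last is the one described next.

def zero_one_loss(actuals, predictions):
    """Total 0-1 loss: 1 for every wrong prediction, 0 for every correct one."""
    return sum(0 if a == p else 1 for a, p in zip(actuals, predictions))

def l1_loss(actuals, predictions):
    """Sum of absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actuals, predictions))

def l2_loss(actuals, predictions):
    """Sum of squared differences; large misses are penalized much more."""
    return sum((a - p) ** 2 for a, p in zip(actuals, predictions))

print(zero_one_loss(["rain", "rain"], ["rain", "no rain"]))  # 1
print(l1_loss([2800, 2500], [2900, 2100]))  # 100 + 400 = 500
print(l2_loss([2800, 2500], [2900, 2100]))  # 10000 + 160000 = 170000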
And it turns out there are other loss functions as well; one that's quite popular is the L2 loss. The L2 loss, instead of just using the absolute value of how far the actual value is from the predicted value, uses the square of actual minus predicted. It takes how far apart the actual and predicted values are and squares that value, effectively penalizing worse predictions much more harshly. Imagine two data points that you predict as being one away from their actual values, as opposed to one data point that you predict as being two away from its actual value: the L2 loss function will penalize the one that is two away more harshly, because it squares however large the difference is between the actual value and the predicted value. Depending on the situation, you might choose a loss function based on what you care about minimizing. If you really care about minimizing the error on outlier cases, then you might want to consider something like the L2 loss. But if you've got a lot of outliers and you don't necessarily care about modeling them, then maybe an L1 loss function is preferable. There are trade-offs here that you need to decide based on a particular set of data. But what you run the risk of, with any of these loss functions, with anything we're trying to do, is a problem known as overfitting. Overfitting is a big problem you can encounter in machine learning, which happens any time a model fits too closely to a data set and, as a result, fails to generalize. We would like our model to accurately predict the input-output pairs in the data we have access to, but the reason we want to do so is that we want the model to generalize well to data it hasn't seen before. I would like to take data from the past year of whether it was raining or not raining, and use that data to generalize into the future: is it going to be raining or not raining? Or if I have a whole bunch of data on what counterfeit and genuine US dollar bills have looked like in the past when people encountered them, I'd like to train a computer that can, in the future, generalize to other dollar bills it might see as well. The problem with overfitting is that if you tie yourself too closely to the data set you're training your model on, you can end up not generalizing very well. So what does this look like? Consider the rainy day and not rainy day example again, where the blue points indicate rainy days and the red points indicate not rainy days, and we decided we felt pretty comfortable drawing a line like this as the decision boundary between rainy and not rainy days: points on one side are more likely to be rainy days, points on the other side more likely to be not rainy days. But the empirical loss isn't zero in this particular case, because we didn't categorize everything perfectly: there was one outlier, one day when it wasn't raining, and yet our model still predicts that it was. That doesn't necessarily mean our model is bad; it just means the model isn't 100% accurate. If you really wanted to find a hypothesis that minimizes the loss, you could come up with a different decision boundary. It wouldn't be a line, but it would look something like this.
This decision boundary does separate all of the red points from all of the blue points: the red points fall on one side of it, the blue points on the other. But this, we would probably argue, is not as good a prediction. Even though it seems more accurate on all of the available training data, it's probably not going to generalize well. If there were other data points nearby, we might still want to consider those rainy days, because we think this one point was probably just an outlier. So if the only thing you care about is minimizing the loss on the data available to you, you run the risk of overfitting. This can happen in the classification case, and it can also happen in the regression case. Here we predicted what we thought was a pretty good line relating advertising to sales, trying to predict what sales would be for a given amount of advertising. But I could come up with a line that does a better job of predicting the training data: something that just connects all of the various data points. Now there is no loss at all; I've perfectly predicted, given any advertising amount in my data, what sales are, and for all the data available to me, it's going to be accurate. But it's probably not going to generalize very well; I have overfit my model on the training data available to me. So in general, we want to avoid overfitting, and we'd like strategies to make sure we haven't overfit our model to a particular data set. There are a number of ways to try to do this. One is by examining what it is we're optimizing for. In an optimization problem, all we do is say there is some cost, and I want to minimize that cost. So far, we've defined that cost function, the cost of a hypothesis, as just the empirical loss of that hypothesis: how far away the actual outputs are from what the hypothesis predicted them to be. And if all you're trying to do is minimize cost, meaning minimize the loss, then the result might be overfitting: to minimize cost, you'll try to find a way to perfectly match all the input data, and that can happen as a result of overfitting on that particular input data. To address this, you can add something to the cost function. What counts as cost will be not just the loss, but also some measure of the complexity of the hypothesis. How to define the complexity of the hypothesis is something you would need to decide: how complicated does our line look? This is an Occam's razor-style approach, where we give preference to a simpler decision boundary, a straight line, for example, or some simpler curve, as opposed to something far more complex that might represent the training data better but might not generalize as well. We'll generally say that a simpler solution is probably the better one, and probably the one more likely to generalize well to other inputs. So we measure the loss, but we also measure the complexity, and both get taken into account when we consider the overall cost: something might have less loss because it better predicts the training data, but if it's much more complex, it still might not be the best option we have.
And we need to come up with some balance between loss and complexity. For that reason, you'll often see this represented as multiplying the complexity by some parameter we have to choose, a parameter lambda: cost(h) = loss(h) + lambda * complexity(h). If lambda is a greater value, we really penalize more complex hypotheses; if lambda is smaller, we penalize more complex hypotheses only a little bit. It's up to the machine learning programmer to decide where to set that value of lambda, for how much they want to penalize a more complex hypothesis that might fit the data a little better. Again, there's no one right answer to a lot of these things; depending on the data set and the problem you're trying to solve, your choice of these parameters may vary, and you may need to experiment a little to figure out the right choice. This process of considering not only the loss but also some measure of the complexity is known as regularization. Regularization is the process of penalizing a hypothesis that is more complex in order to favor a simpler hypothesis that is more likely to generalize well, more likely to apply to other situations dealing with input points unlike the ones we've necessarily seen before. So oftentimes you'll see us add some regularizing term to what we're trying to minimize in order to avoid this problem of overfitting.
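As a tiny sketch of that idea: here complexity is measured, as one arbitrary choice, by the sum of squared weights, the way ridge regression does it, and lambda is the balancing parameter just described. All numbers are invented.

def regularized_cost(loss, complexity, lam):
    """cost(h) = loss(h) + lambda * complexity(h)"""
    return loss + lam * complexity

# One common concrete choice: complexity as the sum of squared weights.
weights = [4.0, -2.5, 0.5]
complexity = sum(w ** 2 for w in weights)  # 22.5

print(regularized_cost(loss=12.0, complexity=complexity, lam=0.1))   # 14.25
print(regularized_cost(loss=12.0, complexity=complexity, lam=10.0))  # 237.0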
Now, another way of making sure we don't overfit is to run some experiments and see whether we're able to generalize the model we've created to other data. For that reason, when you're doing a machine learning experiment, when you've got some data and want to come up with a function that predicts, given some input, what the output will be, you don't necessarily want to train on all of the data available to you. Instead, you can employ a method known as holdout cross-validation, where we split our data into a training set and a testing set. The training set is the data we use to train our machine learning model; the testing set is the data we use to test how well that model actually performed. The learning happens on the training set: we figure out what the parameters should be, we figure out the right model. Then we see how well the trained model does at predicting things in the testing set, data it hasn't seen before. The hope is that we'll predict the testing set pretty well if we're able to generalize from the training data; if we've overfit the training data and can't generalize, then we likely won't predict things in the testing set nearly as effectively. One of the downsides of this holdout cross-validation is that if I split the data 50-50, training on 50% and testing on the other 50% (or whatever percentages I choose), there's a fair amount of data I'm now not using to train, and I might have gotten a better model as a result of using it. So one approach is known as k-fold cross-validation. In k-fold cross-validation, rather than dividing things into two sets and running one experiment, we divide things into k different sets. Maybe I divide things into 10 different sets and run 10 different experiments: each time, I hold out one of those sets, train my model on the other nine, and then test to see how well it predicts on the set I held out. Then I pick another nine sets to train on and test on the one held out, each time training the model on everything minus the one set I'm holding out and testing on that set. What you end up getting is 10 different results, 10 different answers for how accurately the model worked, and oftentimes you can just take the average of those 10 to approximate how well we think the model performs overall. But the key idea is separating the training data from the testing data, because you want to test your model on data different from what you trained it on. In training, you want to avoid overfitting; you want to be able to generalize. And the way you test whether you can generalize is by looking at some data you haven't seen before and seeing how well you actually perform.
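Jumping ahead slightly to the scikit-learn library introduced below, k-fold cross-validation is essentially built in. A minimal sketch, using one of scikit-learn's bundled data sets purely as a stand-in:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Any labeled data set would do; this bundled one is just a stand-in.
X, y = load_breast_cancer(return_X_y=True)

model = KNeighborsClassifier(n_neighbors=3)

# cv=10: split into 10 folds, train on 9, test on the held-out fold,
# and repeat 10 times, yielding 10 accuracy scores.
scores = cross_val_score(model, X, y, cv=10)
print(scores)
print(scores.mean())  # average accuracy across the 10 experiments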
So if we want to implement any of these techniques in a programming language like Python, there are a number of ways to do that. We could write it all from scratch on our own, but there are libraries out there that let us take advantage of existing implementations of these algorithms, so we can use the same types of algorithms in a lot of different situations. One very popular library is scikit-learn, which lets us very quickly get set up in Python with a lot of these machine learning models: it has already-written algorithms for nearest-neighbor classification, for perceptron learning, and for a bunch of other types of inference and supervised learning we haven't yet talked about. Using it, we can begin to test how these methods work and how accurately they perform. So let's take a look at one approach to solving this type of problem. First I'm going to pull up banknotes.csv, a data set provided by UC Irvine with information about various banknotes: people took pictures of various banknotes and measured various properties of them, and a human categorized each one as either a counterfeit banknote or not counterfeit. Each row represents one banknote, formatted as a CSV spreadsheet with comma-separated values separating each of the fields. We have four different input values for each data point, measurements that were made on the banknote; what those measurements exactly are isn't as important as the fact that we have access to this data. More importantly, for each data point we have access to a label, where 0 indicates the bill was not counterfeit, meaning it was authentic, and 1 indicates it was counterfeit, at least according to the human researcher who labeled this particular data. So we have a whole bunch of data points, each with these various measurements made on that particular bill, and each with an output value: 0 meaning a genuine bill, 1 meaning a counterfeit bill. What we would like to do is use supervised learning to model some function that can take these four values as input and predict the output: we want our learning algorithm to find some pattern that, based on measurements you could take just from a photo of a bill, predicts whether that bill is authentic or counterfeit. So how can we do that? I'll open up banknote0.py and see how we do this. I first import a number of things from scikit-learn, and, importantly, I set my model equal to the Perceptron model, one of those models we talked about before: we're going to try to figure out some setting of weights that divides our data into two groups. Then I read the data in from banknotes.csv, and for every row I separate it into the first four values of that row, which are the evidence for that row, and the label, which is authentic if the final column is 0 and counterfeit otherwise. So I'm effectively reading data in from the CSV file and dividing it into a whole bunch of rows, where each row has some evidence, the four input values that will be inputs to my hypothesis function, and a label, the output, authentic or counterfeit, that I'm trying to predict. The next step is to split my data set into a training set and a testing set: some data to train my machine learning model on, and some data to test that model with and see how well it performed. I figure out the length of the data, how many data points I have, and take half of that, saving the number as a variable called holdout: that is how many items I'm going to hold out from my data set for the testing phase. I randomly shuffle the data so it's in some random order, then take holdout-many data items as my testing set; my training data is everything else. Then I need to divide my training data into two different parts: the x values, where x represents the inputs, and the y values, the labels.
So the x values, the ones I'm going to train on, are, for every row in my training set, the evidence for that row: those four values, basically a vector of four numbers, which together are the input. Then I need the y values, the outputs I want to learn from, the labels that belong to each of these input points. That's the same iteration over each row in the training data, but this time I take the row and get its label, whether it is authentic or counterfeit. So I end up with one list of all these input vectors, and one list, in the same order, of all the labels that correspond with each of those vectors. Then, to train my model, which in this case is just this Perceptron model, I call model.fit, passing in the training data and the labels for that training data, and scikit-learn takes care of fitting the model; it runs the entire algorithm for me. When it's done, I can test to see how well the model performed. I get all of the input vectors I want to test on, the evidence for each row in my testing data set, and the y values, the actual labels for each of those rows. Then I generate predictions: I use the model to predict, based on the testing vectors, what the output should be. My goal then is to compare y_testing with those predictions, to see how well the predictions based on the model reflect the actual labels. Because I have this labeled data, I can assess how well the algorithm worked. So now I can compute how well we did: the zip function lets me loop through two different lists, item by item, at the same time, so for each actual value and each predicted value, if the actual is the same as what I predicted, I increment my correct counter by one; otherwise, I increment my incorrect counter by one. At the end, I print out the results: how many I got right, how many I got wrong, and the overall accuracy. So I can run python banknote0.py, and it trains on half the data set and tests on the other half. Here are the results for my Perceptron model: in this case, it correctly classified 679 bills as either authentic or counterfeit and incorrectly classified 7 of them, for an overall accuracy of close to 99%. So on this particular data set, using this Perceptron model, we were able to predict the output very well.
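Here is a condensed sketch of the kind of script just walked through; the exact layout of the course's banknote0.py may differ, and the header-row handling is an assumption about the CSV file.

import csv
import random

from sklearn.linear_model import Perceptron

model = Perceptron()

# Read the data: four measurements (evidence) plus a 0/1 label per row.
with open("banknotes.csv") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row, assuming the file has one
    data = []
    for row in reader:
        data.append({
            "evidence": [float(cell) for cell in row[:4]],
            "label": "Authentic" if row[4] == "0" else "Counterfeit"
        })

# Hold out half of the data for testing.
random.shuffle(data)
holdout = len(data) // 2
testing, training = data[:holdout], data[holdout:]

# Train the model on the training set.
X_training = [row["evidence"] for row in training]
y_training = [row["label"] for row in training]
model.fit(X_training, y_training)

# Make predictions on the held-out testing set.
X_testing = [row["evidence"] for row in testing]
y_testing = [row["label"] for row in testing]
predictions = model.predict(X_testing)

# Count how many predictions matched the actual labels.
correct = sum(1 for actual, predicted in zip(y_testing, predictions)
              if actual == predicted)
incorrect = len(testing) - correct
print(f"Correct: {correct}")
print(f"Incorrect: {incorrect}")
print(f"Accuracy: {100 * correct / len(testing):.2f}%")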
And we can try different models, too; scikit-learn makes it very easy to swap one model out for another. Instead of the Perceptron model, I can use a support vector machine via SVC, otherwise known as a support vector classifier, to classify things into two groups, and see how that performs. This time, we correctly predicted 682 and incorrectly predicted 4, for an accuracy of 99.4%. We could even try the KNeighborsClassifier as the model instead. That classifier takes a parameter, n_neighbors, for how many neighbors to look at: let's look at just one neighbor, the single nearest neighbor, and use that to predict. Running this, it looks like, with one neighbor, we correctly classified 685 data points and incorrectly classified 1. Maybe let's try three neighbors instead, doing more of a k-nearest-neighbors approach where I look at the three nearest neighbors, and that one, in this case, seems to have gotten 100% of the predictions correct, classifying every banknote as authentic or counterfeit. We could run these experiments multiple times; because I'm randomly reorganizing the data every time, we're technically training on slightly different data sets, so you might want to run multiple experiments to really see how well each model performs. In short, though, they all perform very well, and while some perform slightly better than others here, that might not always be the case for every data set. But you can now very quickly put together these machine learning models using scikit-learn, train on some training set, and then test on some testing set. This splitting into training groups and testing groups happens so often that scikit-learn has functions built in for it. I did it all by hand just now, but if we take a look at banknote1.py, we can take advantage of other features of scikit-learn that simplify a lot of our logic. There is a function built into scikit-learn called train_test_split, which automatically splits data into a training group and a testing group; I just have to say what proportion should be in the testing group, something like 0.5 for half the data. Then I can fit the model on the training data, make predictions on the testing data, and just count up; scikit-learn has some nice methods for counting how many times the testing data matched the predictions and how many times it didn't. So very quickly, with maybe 40 lines of code, you can write programs that get through all of these predictions and see how well they do. These types of libraries allow us, without really knowing the implementation details of these algorithms, to use them in a very practical way to solve these types of problems.
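And a sketch of the train_test_split version described here, again assuming the same banknotes.csv layout; swapping in KNeighborsClassifier for the Perceptron just echoes the experiments above.

import csv

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=3)

# Read the evidence and labels as before.
with open("banknotes.csv") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row, assuming the file has one
    evidence, labels = [], []
    for row in reader:
        evidence.append([float(cell) for cell in row[:4]])
        labels.append("Authentic" if row[4] == "0" else "Counterfeit")

# Let scikit-learn shuffle and split the data: half held out for testing.
X_training, X_testing, y_training, y_testing = train_test_split(
    evidence, labels, test_size=0.5
)

model.fit(X_training, y_training)
predictions = model.predict(X_testing)

# predictions is a NumPy array, so comparisons count matches elementwise.
correct = (predictions == y_testing).sum()
incorrect = (predictions != y_testing).sum()
print(f"Correct: {correct}")
print(f"Incorrect: {incorrect}")
print(f"Accuracy: {100 * correct / len(predictions):.2f}%")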
So that, then, was supervised learning: the task of, given a whole set of data, some input-output pairs, learning some function that maps those inputs to those outputs. But it turns out there are other forms of learning as well. Another popular type of machine learning, especially nowadays, is known as reinforcement learning. The idea of reinforcement learning is that rather than being given a whole data set of input-output pairs at the beginning, reinforcement learning is all about learning from experience. In reinforcement learning, our agent, whether it's a physical robot trying to take actions in the world or just some virtual agent running as a program somewhere, is given a set of rewards or punishments in the form of numerical values, and based on those, it learns what actions to take in the future. Our agent, our AI, will be put in some sort of environment. It will take some actions, and based on the actions it takes, it learns something: it gets a reward when it does something well, it gets a punishment when it does something poorly, and it learns what to do or what not to do in the future based on those individual experiences. What this often looks like is that we start with some agent, some AI, which might be a physical robot moving around but could also just be a program, situated in an environment, where it will take its actions and which will give it rewards or punishments for the various actions it takes. The environment starts by putting our agent in a state. In a game, that might be the state of the game the agent is playing; in a world the agent is exploring, it might be some position inside a grid representing that world. In that state, the agent needs to choose an action: it likely has multiple actions to choose from, but it picks one. As a result of taking an action in a particular state, the agent will generally get two things back, as we model it: a new state that it finds itself in, and some sort of numerical reward, positive meaning a reward, meaning it did something good, negative generally meaning a punishment, it did something bad. That is all the information the agent has: it's told what state it's in, it takes some action, it ends up in another state, and it gets some particular reward, and from that information it needs to learn what actions to take in the future. You can imagine generalizing this to a lot of different situations. This is oftentimes how you train, say, the robots that can now walk around the way humans do: it would be quite difficult to program a robot in exactly the right way to get it to walk like a human, but you could instead train it through reinforcement learning, giving it some numerical reward every time it does something good, like taking steps forward, and punishing it every time it does something bad, like falling over, and then letting the AI learn, based on that sequence of rewards from trying various actions, what to do and what not to do in the future. To begin to formalize this, the first thing we need to do is formalize the notions of states, actions, and rewards: what does this world look like? Oftentimes we'll formulate this world as what's known as a Markov decision process, similar in spirit to Markov chains, which you might recall from before. A Markov decision process is a model we can use for decision making, for an agent trying to make decisions in its environment.
And it's a model that allows us to represent the various states an agent can be in, the various actions it can take, and what the reward is for taking one action as opposed to another. So what does it actually look like? If you recall, a Markov chain looked like a whole bunch of individual states, where each state immediately transitioned to another state based on some probability distribution. We saw this in the context of the weather before: if it was sunny, we said that with some probability it would be sunny the next day, and with some other probability it would be rainy. We can generalize this beyond sun and rain: we just have states, where one state leads to another according to some probability distribution. But in that original model, there was no agent with any control over the process; it was entirely probability-based, where with some probability we moved to one next state, and with some other probability to another. What we'll now have is the ability for the agent, in each state, to choose from a set of actions: instead of just one path forward, there might be several different choices of actions, each leading down a different path. Even this is a bit of an oversimplification, because in each of those states you might imagine further branching points where more decisions can be taken as well. So we've extended the Markov chain to say that from a state, you now have available action choices, and each of those actions might be associated with its own probability distribution over the states it leads to. Then we add one more extension: any time you move from a state, taking an action, into another state, we can associate a reward with that outcome, saying either that r is positive, meaning some positive reward, or that r is negative, meaning some sort of punishment. This, then, is what we'll consider a Markov decision process. A Markov decision process has some set of states, the states of the world we can be in. It has some set of actions, such that, given a state, I can ask which actions are available for me to choose from in that state. Then we have a transition model. The transition model before just said: given my current state, what is the probability that I end up in some particular next state? The transition model now effectively conditions on two things: given that I'm in this state and that I take this action, what is the probability that I end up in this next state? Maybe we live in a very deterministic world in this Markov decision process, where given a state and an action, we know for sure what next state we'll end up in; or maybe there's some randomness in the world, where taking an action in a state might not always lead to the exact same next state. The Markov decision process can handle both of those cases. And finally, we have a reward function, generally called r, that says: what is the reward for being in state s, taking action a, and ending up in the next state s'?
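One way to make this concrete is to notice that a tiny Markov decision process can be written down as plain dictionaries. Everything here, the states, actions, probabilities, and rewards, is invented purely for illustration.

# States are names; each state has its available actions.
actions = {
    "cold": ["heat", "wait"],
    "warm": ["heat", "wait"],
}

# Transition model: (state, action) -> {next_state: probability}
transitions = {
    ("cold", "heat"): {"warm": 0.9, "cold": 0.1},
    ("cold", "wait"): {"cold": 1.0},
    ("warm", "heat"): {"warm": 1.0},
    ("warm", "wait"): {"warm": 0.5, "cold": 0.5},
}

# Reward function: (state, action, next_state) -> numerical reward
rewards = {
    ("cold", "heat", "warm"): 10,
    ("cold", "heat", "cold"): -1,
    ("cold", "wait", "cold"): -2,
    ("warm", "heat", "warm"): -1,   # wasted energy
    ("warm", "wait", "warm"): 5,
    ("warm", "wait", "cold"): -5,
}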
You can add up these rewards every time you take an action to get the total amount of reward an agent might get from interacting with an environment modeled using this Markov decision process. So what might this actually look like in practice? Let's create a little simulated world with an agent, this yellow dot here, like a robot in the world, trying to navigate its way through a grid, ultimately trying to find its way to the goal. If it gets to the green goal, it gets some sort of reward; but we might also have some red squares, places where the agent gets some sort of punishment, bad places where we don't want the agent to go. The agent originally doesn't know all of these details. It doesn't know which states are associated with punishments, and maybe it doesn't even know which state gives the reward; it just needs to interact with the environment to try to figure out what to do and what not to do. So the first thing the agent might do, given no additional information, is just take an action. It takes an action and realizes it got some sort of punishment. What does it learn from that experience? It might learn that when you're in this state in the future, don't take the action of moving to the right, that that's a bad action to take: if you ever find yourself back in this state, don't go right, because that leads to punishment. That's the intuition, at least. So it might try other actions: move up, which doesn't lead to any immediate reward; then try something else, then something else, and now it finds another punishment, and it learns something from that experience too. The next time it goes through this whole process, it knows that if it ever ends up in that square, it shouldn't take the down action, because being in that state and taking that action ultimately leads to a punishment, a negative reward. And this process repeats: you might imagine just letting the agent explore the world, learning over time which actions in which states tend to lead to punishments, until eventually, if it tries enough things randomly, it might find that when it gets to a particular state and takes the up action, it actually gets a reward. What it can learn from that is that in that state, it should take the up action, because that leads to a reward. And over time, it can also learn that in the state before, it should take the left action, because that leads to the state that eventually lets it reach the reward. So it begins to learn over time not only which actions are good in particular states, but also which actions are bad, such that once it knows some sequence of good actions that leads to a reward, the agent can just follow those instructions, following the experience it has learned.
We didn't tell the agent what the goal was, and we didn't tell it where the punishments were, but the agent can begin to learn from this experience and learn to perform these sorts of tasks better in the future. So let's now try to formalize this idea: we would like to be able to learn, for this state and this action, whether taking that action is a good thing or a bad thing. There are lots of different models for reinforcement learning; we're just going to look at one of them today, a method known as Q-learning. Q-learning is all about learning a function Q(s, a), which takes as input a state s and an action a that you take in that state, and estimates the value of that state-action pair: how much reward will I get from taking this action in this state? Originally, we don't know what this Q function should be, but over time, based on experience, based on trying things out and seeing the results, I would like to learn what Q(s, a) is for any particular state and any particular action I might take in that state. So what is the approach? Originally, we start with Q(s, a) = 0 for all states s and all actions a: before I've ever started anything, before I've had any experiences, I don't know the value of taking any action in any given state, so I assume the value is just 0 across the board. Then, as I interact with the world, as I experience rewards or punishments, or maybe go to a cell where I get neither, I want to update my estimate of Q(s, a). I want to continually update it based on the experiences, rewards, and punishments I've received, such that in the future my knowledge of which actions are good in which states will be better. When we take an action and receive some sort of reward, I want to estimate the new value of Q(s, a), and I estimate it based on a couple of things. I base it on the reward I'm getting from taking this action and getting into the next state. But assuming the situation isn't over, assuming there are still future actions I might take as well, I also need to take into account the expected future rewards: if you imagine an agent interacting with the environment, sometimes you take an action and get a reward, but then you can keep taking more actions and get more rewards, so both are relevant, the current reward I'm getting from this step and also my future rewards. It might even be the case that I'll want to take a step that doesn't immediately lead to a reward, because later on down the line, I know it will lead to more rewards. So there's a balancing act between the current rewards the agent experiences and the future rewards. Then we need to update Q(s, a): we estimate its value based on the current reward and the expected future rewards, and then update the Q function to take this new estimate into account. Now, as we go through this process, we'll already have an estimate for what we think the value is; now we have a new estimate, and we somehow need to combine these two estimates together. We'll look at more formal ways to do that.
So to actually show you what this formula looks like, here is the approach we'll take with Q-learning. We again start with Q(s, a) = 0 for all states and actions. Then, every time we take an action a in state s and observe a reward r, we update our value, our estimate, for Q(s, a). The idea is to figure out what the new value estimate is minus what our existing value estimate is. We have some preconceived notion of the value of taking this action in this state; maybe we currently think the value is 10. Then we estimate what we now think it's going to be; maybe the new value estimate is something like 20. So there's a delta of 10: our new value estimate is 10 points higher than our current value estimate. We then have to decide how much we want to adjust our current expectation of the value of taking this action in this particular state. How much we add or subtract from our existing notion depends on a parameter alpha, also called a learning rate. Alpha represents, in effect, how much we value new information compared to how much we value old information. An alpha value of 1 means we value new information so much that our old estimate doesn't matter at all; we only consider our new estimate. The way that works is that if alpha is 1, we take the old value of Q(s, a) and add 1 times the new value minus the old value, which just leaves us with the new value: when alpha is 1, all we take into consideration is the new estimate. But over time, as we go through a lot of experiences, we already have some existing information: maybe we've tried taking this action nine times already, and now we've just tried it a tenth time. We don't only want to consider the tenth experience; the prior nine experiences were meaningful too, and that's data we don't necessarily want to lose. So alpha controls that decision, controls how important the new information is: 0 means ignore all the new information and keep the Q value the same; 1 means replace the old information entirely with the new information; and values in between keep some balance between the two. We can put this equation a little more formally as well. The old value estimate is our old estimate of the value of taking this action in this particular state: that's just Q(s, a). We have it once, and we're going to add something to it: alpha times the new value estimate minus the old value estimate, where the old value estimate is just looked up by calling the Q function. And what is the new value estimate? Based on the experience we've just had, it's composed of two parts: the reward I just got from taking this action in this state, and what I can expect my future rewards to be from this point forward. So it's r, some reward I'm getting right now, plus whatever I estimate I'm going to get in the future.
And how do I estimate what I'm going to get in the future? With another call to this Q function: take the maximum across all possible actions I could take next, and ask which of those actions has the highest value. Putting it together, the update is: Q(s, a) = Q(s, a) + alpha * (r + max over a' of Q(s', a') - Q(s, a)). This looks a little complicated, but it's our notion of how to perform this kind of update: I have some old estimate of the value of taking this action in this state, and I update it based on new information, the reward I just experienced plus my prediction of my future rewards, to produce a new estimate of the value of taking this action in this particular state. And there are other additions you might make to this algorithm as well. Sometimes you might not want to weight future rewards equally to current rewards; maybe you want an agent that values reward now over reward later. So sometimes you can add another parameter here, one that discounts future rewards, saying that future rewards are not as valuable as immediate ones: getting a reward in the current time step is better than waiting a year and getting the reward later. That's up to the programmer to decide what that parameter ought to be. But the big-picture idea of this entire formula is that every time we experience some new reward, we take it into account, updating our estimate of how good this action is, and in the future, we can make decisions based on that estimate.
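A minimal sketch of that update rule in Python; the grid-style states, the dictionary layout, and the optional discount gamma are all assumptions for illustration.

q = {}  # maps (state, action) pairs to value estimates, default 0

def get_q(state, action):
    """Return the current estimate, treating unseen pairs as 0."""
    return q.get((state, action), 0)

def update_q(state, action, next_state, reward, next_actions,
             alpha=0.5, gamma=1.0):
    """Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))

    alpha: learning rate (how much new information outweighs old).
    gamma: optional discount on future rewards (1.0 = no discounting).
    """
    old_estimate = get_q(state, action)
    best_future = max((get_q(next_state, a) for a in next_actions), default=0)
    new_estimate = reward + gamma * best_future
    q[(state, action)] = old_estimate + alpha * (new_estimate - old_estimate)

# Example: in state (0, 0), moving "right" led to state (0, 1) and a
# punishment of -1; the available actions in (0, 1) are "up" and "right".
update_q((0, 0), "right", (0, 1), -1, ["up", "right"])
print(q[(0, 0), "right"])  # -0.5 with alpha=0.5, starting from 0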
Once we have some good estimate, for every state and every action, of the value of taking that action, we can do something like implement a greedy decision-making policy: if I'm in a state and want to know what action to take, I consider, across all of my possible actions, the value of Q(s, a), my estimated value of taking that action in that state, and I just pick the action with the highest value. At any given state I'm in, I can greedily say: across all my actions, this action gives me the highest expected value, so I'll choose it. But there is a downside to this kind of approach, and it comes up in a situation where we know some path that gets us to the reward, and our agent has been able to figure that out, but that path might not be the best or the fastest way. If the agent were allowed to explore a little more, it might find it can get the reward faster by taking some other route instead, and maybe we'd like the agent to figure that out as well. But if the agent always takes the actions it knows to be best, then when it gets to a particular square, it doesn't know that a different action is good, because it has never really tried it; it only knows that going down eventually leads to the reward, so it might learn to always take that route and never explore the alternative. So in reinforcement learning, there is a tension between exploration and exploitation. Exploitation generally refers to using knowledge the AI already has: the AI already knows this move leads to a reward, so it uses that move. Exploration is all about exploring other actions we may not have explored as thoroughly before, because one of those actions, even if we know nothing about it, might lead to better rewards faster, or to more rewards in the future. An agent that only ever exploits and never explores might be able to get rewards, but it might not maximize its rewards, because it doesn't know what other possibilities are out there, possibilities we only learn about by taking advantage of exploration. So how can we address this? One possible solution is known as the epsilon-greedy algorithm, where we set epsilon equal to how often we want to make a random move. The logic of the algorithm is: with probability 1 minus epsilon, choose the estimated best move; with probability epsilon, choose a random move instead. In the purely greedy case, we'd always choose the best move; in epsilon-greedy, most of the time we choose the best move, but sometimes we choose a random move. This type of algorithm can be quite powerful in a reinforcement learning context: by not always choosing the best possible move right now, and especially early on allowing yourself to make random moves, you can explore the various possible states and actions more, and over time you might decrease your value of epsilon, choosing the best move more and more often once you're more confident you've explored what all of the possibilities actually are.
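The epsilon-greedy choice itself is only a few lines. This sketch assumes the q dictionary and get_q helper from the Q-learning sketch above.

import random

def choose_action(state, available_actions, epsilon=0.1):
    """With probability epsilon, explore (pick a random move); otherwise
    exploit by picking the action with the highest current Q estimate."""
    if random.random() < epsilon:
        return random.choice(available_actions)
    return max(available_actions, key=lambda a: get_q(state, a))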
So we can put this into practice. One very common application of reinforcement learning is game playing: if you want to teach an agent how to play a game, you just let the agent play the game a whole bunch, and the reward signal comes at the end of the game. When the game is over, if our AI won the game, it gets a reward of, say, 1, and if it lost, a reward of -1. From that, it begins to learn which actions are good and which are bad: you don't have to tell the AI what's good and what's bad, but winning the game is one signal and losing is another, and based on all of that, it begins to figure out what decisions it should actually make. One very simple game, which you may have played before, is a game called Nim. In Nim, you've got a whole bunch of objects in a whole bunch of different piles, where here each pile is represented as an individual row: one object in the first pile, three in the second, five in the third, seven in the fourth. Nim is a two-player game where players take turns removing objects from piles, and the rule is that on any given turn, you're allowed to remove as many objects as you want from any one of the piles; you have to remove at least one object, and they all come from exactly one pile. Whoever takes the last object loses. So player one might remove four objects from one pile, and player two might remove four from another; now we've got four piles left of sizes one, three, one, and three. Player one might remove the entirety of the second pile; player two, if they're being strategic, might remove two from the third pile, leaving three piles, each with one object left. Player one removes one from one pile, player two removes one from another, and now player one is left taking the one object from the last pile, at which point player one loses the game. So it's a fairly simple game: piles of objects, on any turn you choose how many objects to remove from one pile, and whoever removes the last object loses. And this is the type of game you could encode into an AI fairly easily, because the states are really just four numbers: every state is just how many objects are in each of the four piles. The actions are things like how many objects to remove from which pile, and the reward happens at the end: if you were the player who had to remove the last object, you get some sort of punishment, and if the other player had to remove the last object, you get some sort of reward. So we can try a demonstration of this; I've implemented an AI to play the game of Nim. What we're going to do is create an AI by training it on some number of games played against itself: the AI plays games against itself, learns from each of those experiences, and learns what to do in the future, and then I, the human, will play against the AI. Initially, we'll train zero times, meaning we won't let the AI play any practice games against itself in order to learn from its experiences; we'll just see how well it plays. There are four piles, and I can choose how many I remove from any one of the piles. So maybe from pile three, I'll remove five objects. The AI chose to take one item from pile zero, so I'm left with these piles. Here I might choose to remove all five objects from pile two. The AI chose to take two away from pile one, and now I'm left with one pile with one object and one pile with two objects. So from pile three, I'll remove two objects, and now I've left the AI no choice but to take the last one. The game is over, and I was able to win, but only because the AI was really just playing randomly: it had no prior experience to use in making these sorts of judgments. Now let me let the AI train itself on 10,000 games: I'm going to let the AI play 10,000 games of Nim against itself, and every time it wins or loses, it learns from that experience what to do and what not to do in the future. So I'll go ahead and run this again, and now you see the AI running through a whole bunch of training games, 10,000 training games against itself; then it lets me make my decisions. Maybe I'll remove one from pile three. The AI took everything from pile three, so I'm left with three piles. From pile two I'll remove three items, and the AI removes one item from pile zero, leaving two piles, each with two items. I'll remove one from pile one, I guess.
And the AI took two from pile two, leaving me no choice but to take one away from pile one. So it seems that after playing 10,000 games of Nim against itself, the AI has learned something about which states and which actions tend to be good, and has begun to learn some pattern for predicting which actions will be good and which will be bad in any given state. So reinforcement learning can be a very powerful technique for building these sorts of game-playing agents, agents that learn to play a game well just from experience, whether that's playing against other people or playing against themselves and learning from those experiences. Now, Nim is a fairly easy game to apply reinforcement learning to, because there are so few states: there are only as many states as there are ways of putting objects into these various piles. You might imagine it's going to be harder with a game like chess, or other games with many, many more states and many, many more actions you can imagine taking, where it's not going to be as easy to learn, for every state and every action, what the value is going to be. Often in that case, we can't necessarily learn exactly what the value is for every state and every action, but we can approximate it. Much as we saw with minimax, where we could use a depth-limiting approach to stop calculating at a certain point, we can do a similar type of approximation, known as function approximation, in a reinforcement learning context. Instead of learning a value of Q for every state and every action, we have some function that estimates the value of taking an action in a particular state based on various features of the state the agent happens to be in, where you might have to choose what those features actually are. You can then begin to learn patterns that generalize beyond one specific state and one specific action: you can learn that certain features tend to be good things or bad things, and that if some other state looks kind of like this state, then similar types of actions that worked in one state may also work in the other. This type of approach can be quite helpful as you deal with reinforcement learning in larger and larger state spaces, where it just isn't feasible to explore all of the possible states that could actually exist. Those, then, are two of the main categories of machine learning: supervised learning, where you have labeled input-output pairs, and reinforcement learning, where an agent learns from rewards or punishments that it receives. The third major category of machine learning, which we'll just touch on briefly, is known as unsupervised learning. Unsupervised learning happens when we have data without any additional feedback, without labels. In the supervised learning case, all of our data had labels: we labeled each data point as a rainy day or not a rainy day, and using those labels, we were able to infer what the pattern was; or we labeled data as a counterfeit banknote or not, and using those labels, we drew inferences and patterns to figure out what a counterfeit banknote looks like versus a genuine one.
In unsupervised learning, by contrast, we don’t have access to any of those labels, but we still would like to learn some of those patterns. And one of the tasks that you might want to perform in unsupervised learning is something like clustering, where clustering is just the task of, given some set of objects, organizing them into distinct clusters, groups of objects that are similar to one another. And there are lots of applications for clustering. It comes up in genetic research, where you might have a whole bunch of different genes and you want to cluster them into similar genes, if you’re trying to analyze them across a population or across species. It comes up in image analysis, if you want to take all the pixels of an image and cluster them into different parts of the image. It comes up a lot in market research, if you want to divide your consumers into different groups so you know which groups to target with certain types of product advertisements, for example. And there are a number of other contexts as well in which clustering can be very applicable. One technique for clustering is an algorithm known as k-means clustering. And what k-means clustering is going to do is divide all of our data points into k different clusters, and it’s going to do so by repeating this process of assigning points to clusters and then moving around those clusters’ centers. We’re going to define a cluster by its center, the middle of the cluster, and then assign points to that cluster based on which center is closest to that point. And I’ll show you an example of that now. Here, for example, I have a whole bunch of unlabeled data, just various data points that are in some sort of graphical space, and I would like to group them into various different clusters. But I don’t know how to do that originally. Let’s say I want to assign, like, three clusters to this group. You have to choose how many clusters you want in k-means clustering, though you could try multiple values and see how well each of them performs. But I’ll start just by randomly picking some places to put the centers of those clusters. Maybe I have a blue cluster, a red cluster, and a green cluster, and I’m going to start with the centers of those clusters just being in these three locations here. And what k-means clustering tells us to do is, once I have the centers of the clusters, assign every point to a cluster based on which cluster center it is closest to. So we end up with something like this, where all of these points are closer to the blue cluster center than any other cluster center, all of these points here are closer to the green cluster center than any other cluster center, and then these two points plus these points over here are all closest to the red cluster center instead. So here then is one possible assignment of all these points to three different clusters. But it’s not great: it seems like in this red cluster, these points are kind of far apart, and in this green cluster, these points are kind of far apart. It might not be my ideal choice of how I would cluster these various different data points. But k-means clustering is an iterative process: after I do this, there is a next step, which is that after I’ve assigned all of the points to the cluster center that each is nearest to, we are going to re-center the clusters, meaning take the cluster centers, these diamond shapes here, and move them to the middle, or the average, effectively, of all of the points that are in that cluster.
So we’ll take this blue center and go ahead and move it to the middle, or the center, of all of the points that were assigned to the blue cluster, moving it slightly to the right in this case. And we’ll do the same thing for red. We’ll move the cluster center to the middle of all of these points, weighted by how many points there are. There are more points over here, so the red center ends up moving a little bit further that way. And likewise for the green center: there are many more points on this side of the green center, so the green center ends up being pulled a little bit further in this direction. So we re-center all of the clusters, and then we repeat the process. We go ahead and now reassign all of the points to the cluster center that they are now closest to. And now that we’ve moved around the cluster centers, these cluster assignments might change: this point originally was closer to the red cluster center, but now it’s actually closer to the blue cluster center. The same goes for this point as well. And these three points that were originally closer to the green cluster center are now closer to the red cluster center instead. So we can reassign which clusters each of these data points belongs to, and then repeat the process again, moving each of these cluster centers, the middles of the clusters, to the mean, the average, of all of the points that happen to be assigned to them, and then repeat the process again, assigning each of the points to the cluster that they are closest to. Once we reach a point where we’ve assigned all the points to the cluster that they are nearest to and nothing has changed, we’ve reached a sort of equilibrium in this situation: no points are changing their allegiance. And as a result, we can declare the algorithm is now over, and we now have some assignment of each of these points into three different clusters. And it looks like we did a pretty good job of trying to identify which points are more similar to one another than they are to points in other groups: we have the green cluster down here, this blue cluster here, and then this red cluster over there as well. And we did so without any access to labels to tell us what these various different clusters were. We just used an algorithm, in an unsupervised sense, without any of those labels, to figure out which points belonged to which categories. And again, there are lots of applications for this type of clustering technique. And there are many more algorithms in each of these various different fields within machine learning, supervised and reinforcement and unsupervised, but those are many of the big-picture, foundational ideas that underlie a lot of these techniques, where these are the problems that we’re trying to solve. And we try to solve those problems using a number of different methods of taking data and learning patterns in that data, whether that’s trying to find neighboring data points that are similar, or trying to minimize some sort of loss function, or any number of other techniques that allow us to begin to solve these sorts of problems. That, then, was a look at some of the principles that are at the foundation of modern machine learning, this ability to take data and learn from that data so that the computer can perform a task even if it hasn’t explicitly been given instructions in order to do so.
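To make the clustering procedure above concrete, here is a minimal k-means sketch in Python. It assumes two-dimensional points given as (x, y) tuples, and it alternates the two steps just described, assigning points to their nearest center and then re-centering, until no assignment changes; the function names here are mine, just for illustration.

```python
import random

def dist(p, q):
    """Squared Euclidean distance (enough for deciding which center is closest)."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def k_means(points, k):
    """Cluster a list of (x, y) tuples into k clusters."""
    centers = random.sample(points, k)  # start with random centers
    assignment = None
    while True:
        # Step 1: assign every point to the nearest cluster center.
        new_assignment = [
            min(range(k), key=lambda i: dist(p, centers[i])) for p in points
        ]
        if new_assignment == assignment:  # equilibrium: nothing changed
            return centers, assignment
        assignment = new_assignment
        # Step 2: re-center each cluster at the mean of its assigned points.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # leave a center alone if no points were assigned to it
                centers[i] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
```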
Next time, we’ll continue this conversation about machine learning, looking at other techniques we can use for solving these sorts of problems. We’ll see you then. All right, welcome back, everyone, to an introduction to artificial intelligence with Python. Now, last time, we took a look at machine learning, a set of techniques that computers can use in order to take a set of data and learn some patterns inside of that data, to learn how to perform a task even if we, the programmers, didn’t give the computer explicit instructions for how to perform that task. Today, we transition to one of the most popular techniques and tools within machine learning: neural networks. Neural networks were inspired, as early as the 1940s, by researchers who were thinking about how it is that humans learn, studying neuroscience and the human brain, and trying to see whether or not we could apply those same ideas to computers as well, modeling computer learning off of human learning. So how is the brain structured? Well, very simply put, the brain consists of a whole bunch of neurons, and those neurons are connected to one another and communicate with one another in some way. In particular, if you think about the structure of a biological neural network, something like this, there are a couple of key properties that scientists observed. One was that these neurons are connected to each other and receive electrical signals from one another; one neuron can propagate electrical signals to another neuron. Another is that neurons process those input signals and then can be activated: a neuron becomes activated at a certain point and can then propagate further signals onto other neurons. And so the question then became, could we take this biological idea of how it is that humans learn, with brains and with neurons, and apply that to a machine as well, in effect designing an artificial neural network, or ANN, which will be a mathematical model for learning that is inspired by these biological neural networks? And what artificial neural networks will allow us to do is, first, model some sort of mathematical function. Every time you look at a neural network, which we’ll see more of later today, each one of them is really just some mathematical function that is mapping certain inputs to particular outputs based on the structure of the network; depending on where we place particular units inside of this neural network, that’s going to determine how the network is going to function. And in particular, artificial neural networks are going to lend themselves to a way that we can learn what the network’s parameters should be. We’ll see more on that in just a moment. But in effect, we want a model such that it is easy for us to write some code that allows the network to figure out how to model the right mathematical function given a particular set of input data. So in order to create our artificial neural network, instead of using biological neurons, we’re just going to use what we call units, units inside of a neural network, which we can represent kind of like a node in a graph, and which will here be represented just by a blue circle like this. And these artificial units, these artificial neurons, can be connected to one another. So here, for instance, we have two units that are connected by this edge inside of this graph, effectively.
And so what we’re going to do now is think of this idea as some sort of mapping from inputs to outputs. We have one unit that is connected to another unit, where we might think of this side as the input and that side as the output. And what we’re trying to do, then, is figure out how to solve a problem, how to model some sort of mathematical function. And this might take the form of something we saw last time, where we have certain inputs, like variables x1 and x2, and given those inputs, we want to perform some sort of task, a task like predicting whether or not it’s going to rain. And ideally, we’d like some way, given these inputs x1 and x2, which stand for some sort of variables to do with the weather, to be able to predict, in this case, a Boolean classification: is it going to rain, or is it not going to rain? And we did this last time by way of a mathematical function. We defined some function h, for our hypothesis function, that took as input x1 and x2, the two inputs that we cared about processing, in order to determine whether we thought it was going to rain or not. The question then becomes, what does this hypothesis function do in order to make that determination? And we decided last time to use a linear combination of these input variables to determine what the output should be. So our hypothesis function was equal to something like this: weight 0, plus weight 1 times x1, plus weight 2 times x2. So what’s going on here is that x1 and x2 are input variables, the inputs to this hypothesis function, and each of those input variables is being multiplied by some weight, which is just some number. So x1 is being multiplied by weight 1, x2 is being multiplied by weight 2, and we have this additional weight, weight 0, that doesn’t get multiplied by an input variable at all; it just serves to move the function’s value up or down. You can think of this as a weight that’s multiplied by some dummy value, like the number 1, so that its value passes through unchanged. Or sometimes, you’ll see in the literature, people call this value weight 0 a bias, so that you can think of these variables as slightly different: we have weights that are multiplied by the inputs, and we separately add some bias to the result as well. You’ll hear both of those terminologies used when people talk about neural networks and machine learning. So in effect, what we’ve done here is that in order to define a hypothesis function, we just need to decide and figure out what these weights should be, to determine what values to multiply by our inputs to get some sort of result. Of course, at the end of this, what we need to do is make some sort of classification, like rainy or not rainy, and to do that, we use some sort of function that defines a threshold. We saw, for instance, the step function, which is defined as 1 if the result of multiplying the weights by the inputs is at least 0, and otherwise is 0. And you can think of this line down the middle as kind of like a dotted line: effectively, the function stays at 0 all the way up to one point, and then it steps, or jumps, up to 1. So it’s 0 before it reaches some threshold, and then it’s 1 after it reaches that particular threshold.
And so this was one way we could define what we’ll come to call an activation function, a function that determines when it is that this output becomes active, changing to 1 instead of being 0. But we also saw that if we didn’t just want a purely binary classification, if we didn’t want purely 1 or 0 but wanted to allow for some in-between, real-numbered values, we could use a different function. And there are a number of choices, but the one that we looked at was the logistic sigmoid function, which has sort of an S-shaped curve, where we could represent the output as a probability that may be somewhere in between: maybe the probability of rain is something like 0.5, and a little bit later the probability of rain is 0.8. So rather than just have a binary classification of 0 or 1, we can allow for numbers that are in between as well. And it turns out there are many other types of activation functions, where an activation function just takes the result of multiplying the weights by the inputs and adding that bias, and figures out what the actual output should be. Another popular one is the rectified linear unit, otherwise known as ReLU, and the way that works is that it just takes its input and returns the maximum of that input and 0. So if the input is positive, it remains unchanged, but if it’s negative, it levels out at 0. And there are other activation functions that we could choose as well. But in short, each of these activation functions you can just think of as a function that gets applied to the result of all of this computation: we take some function g and apply it to the result of all of that calculation. And this, then, is what we saw last time, a way of defining some hypothesis function that takes in inputs, calculates some linear combination of those inputs, and then passes that through some sort of activation function to get our output. And this actually turns out to be the model for the simplest of neural networks; we’re going to instead represent this mathematical idea graphically, by using a structure like this. Here, then, is a neural network that has two inputs, which we can think of as x1 and x2, and one output, which you can think of as classifying whether or not we think it’s going to rain in this particular instance, for example. So how exactly does this model work? Well, each of these two input units represents one of our input variables, x1 and x2. And notice that these inputs are connected to the output via these edges, which are going to be defined by their weights: these edges each have a weight associated with them, weight 1 and weight 2. And then this output unit is going to calculate an output based on those inputs and those weights: it will multiply all the inputs by their weights, add in this bias term, which you can think of as an extra w0 term, and then pass the result through an activation function. So this is just a graphical way of representing the same idea we saw last time, just mathematically, and we’re going to call this a very simple neural network. We’d like for this neural network to be able to learn how to calculate some function, that we want some function for the neural network to learn. And the neural network is going to learn: what should the values of w0, w1, and w2 be? What should the activation function be, in order to get the result that we would expect? So we can actually take a look at an example of this.
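As a quick aside, the activation functions just mentioned, and a single unit built on top of them, can be written in a few lines of Python. This is only a sketch to make the math concrete, and the helper names here are mine, not anything from the lecture:

```python
import math

def step(x):
    """Step activation: jumps from 0 to 1 once the input reaches the threshold of 0."""
    return 1 if x >= 0 else 0

def sigmoid(x):
    """Logistic sigmoid: an S-shaped curve squashing any input into (0, 1)."""
    return 1 / (1 + math.exp(-x))

def relu(x):
    """Rectified linear unit: the maximum of the input and 0."""
    return max(0, x)

def unit(inputs, weights, bias, activation=step):
    """A single unit: weighted sum of the inputs, plus the bias, then activation."""
    return activation(sum(w * x for w, x in zip(weights, inputs)) + bias)
```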
What then is a very simple function that we might calculate? Well, if we recall back to when we were looking at propositional logic, one of the simplest functions we looked at was the OR function, which takes two inputs, x and y, and outputs 1, otherwise known as true, if either one of the inputs or both of them are 1, and outputs 0 if both of the inputs are 0, or false. So this, then, is the OR function, and this was the truth table for it: as long as either of the inputs is 1, the output of the function is 1, and the only case where the output is 0 is where both of the inputs are 0. So the question is, how could we take this and train a neural network to learn this particular function? What would those weights look like? Well, we could do something like this. Here’s our neural network, and I’ll propose that in order to calculate the OR function, we’re going to use a value of 1 for each of the weights, we’ll use a bias of negative 1, and we’ll just use this step function as our activation function. How then does this work? Well, if I wanted to calculate something like 0 or 0, which we know to be 0 because false or false is false, then what are we going to do? Our output unit is going to calculate this input multiplied by the weight: 0 times 1, that’s 0. Same thing here: 0 times 1, that’s 0. And we’ll add to that the bias, negative 1, so that gives us a result of negative 1. If we plot that on our activation function, negative 1 is here, before the threshold, and the output is only 1 after the threshold. Since negative 1 is before the threshold, the output that this unit provides is going to be 0. And that’s what we would expect it to be: 0 or 0 should be 0. What if instead we had 1 or 0, where this input is the number 1? Well, in this case, in order to calculate what the output is going to be, we again have to do this weighted sum: 1 times 1 is 1, and 0 times 1 is 0, so the sum so far is 1; add negative 1 to that, and the result is 0. If we plot 0 on the step function, 0 ends up being here, right at the threshold, and so the output here is going to be 1, because the output of 1 or 0 is 1. So that’s what we would expect as well. And just for one more example: if I had 1 or 1, what would the result be? Well, 1 times 1 is 1, and 1 times 1 is 1; the sum of those is 2; I add the bias term to that and get the number 1. And 1, plotted on this graph, is way over there, well beyond the threshold, and so this output is going to be 1 as well. The output is always 0 or 1, depending on whether or not we’re past the threshold. And this neural network, then, models the OR function, a very simple function, definitely, but it still is able to model it correctly: if I give it the inputs, it will tell me what x1 or x2 happens to be. And you could imagine trying to do this for other functions as well, a function like the AND function, for instance, that takes two inputs and calculates whether both x and y are true. So if x is 1 and y is 1, then the output of x and y is 1, but in all the other cases the output is 0. How could we model that inside of a neural network as well? Well, it turns out we could do it in the same way, except instead of negative 1 as the bias, we can use negative 2 as the bias instead. What does that end up looking like? Well, if I had 1 and 1, that should be 1, because true and true is equal to true. I take 1 times 1, that’s 1; 1 times 1 is 1; so I get a total sum of 2 so far.
Now I add the bias of negative 2, and I get the value 0. And 0, when I plot it on the activation function, is right at that threshold, and so the output is going to be 1. But if I had any other input, for example 1 and 0, well, the weighted sum of those is 1 plus 0, which is 1; minus 2 gives us negative 1; and negative 1 is below that threshold, so the output is going to be 0. So those, then, are some very simple functions that we can model using a neural network that has two inputs and one output, where our goal is to figure out what those weights should be in order to determine what the output should be. And you could imagine generalizing this to calculate more complex functions as well: maybe, given the humidity and the pressure, we want to calculate the probability that it’s going to rain, for example. Or we might want to do a regression-style problem, where, given some amount of advertising and maybe what month it is, we want to predict what our expected sales are going to be for that particular month. So you could imagine these inputs and outputs being different as well. And it turns out that in some problems, we’re not just going to have two inputs, and the nice thing about these neural networks is that we can compose multiple units together and make our networks more complex just by adding more units into the neural network. The network we’ve been looking at has two inputs and one output, but we could just as easily have three inputs in there, or even more, where we can have however many inputs our problem calls for, all feeding into some output whose value we care about calculating. How then does the math work for figuring out that output? Well, it’s going to work in a very similar way. In the case of two inputs, we had two weights, indicated by these edges, and we multiplied the weights by the input values, adding this bias term. And we’ll do the same thing in the other cases as well: if I have three inputs, you can imagine multiplying each of those three inputs by each of these weights, and if I had five inputs instead, we do the same thing. Here I’m saying: sum up, for i from 1 to 5, xi multiplied by weight i; so take each of the five input variables, multiply them by their corresponding weights, and then add the bias to that. So this would be a case where there are five inputs into this neural network, for example. But there could be more, arbitrarily many nodes inside of this neural network, where each time we just sum up all of those input variables multiplied by their weights and then add the bias term at the very end. And so this allows us to represent problems that have even more inputs, just by growing the size of our neural network. Now, the next question we might ask is how it is that we train these neural networks. In the case of the OR function and the AND function, they were simple enough functions that I could just tell you, like, here is what the weights should be, and you could probably reason through yourself what the weights should be in order to calculate the output that you want. But in general, with functions like predicting sales or predicting whether or not it’s going to rain, these are much trickier functions for which to figure out the weights.
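To check those hand-picked OR and AND weights in code, here is a tiny self-contained sketch (the `unit` and `step` helpers are repeated from the earlier sketch so this runs on its own):

```python
def step(x):
    """Step activation: 1 once the input reaches the threshold of 0."""
    return 1 if x >= 0 else 0

def unit(inputs, weights, bias):
    """A single unit: weighted sum of the inputs, plus the bias, then step."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

# Weights of 1 and 1 with a bias of -1 model OR; a bias of -2 models AND.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2,
              "OR:", unit([x1, x2], [1, 1], bias=-1),
              "AND:", unit([x1, x2], [1, 1], bias=-2))
```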
We would like the computer to have some mechanism of calculating what the weights should be, of setting the weights so that our neural network is able to accurately model the function that we care about trying to estimate. And it turns out that the strategy for doing this, inspired by the domain of calculus, is a technique called gradient descent. Gradient descent is an algorithm for minimizing loss when you’re training a neural network. And recall that loss refers to how bad our hypothesis function happens to be; we can define certain loss functions, and we saw some examples of loss functions last time, that just give us a number for any particular hypothesis, saying, how poorly does it model the data? How many examples does it get wrong? How much better or worse is it compared to other hypothesis functions that we might define? And this loss function is just a mathematical function, and when you have a mathematical function, in calculus what you can do is calculate something known as the gradient, which you can think of as like a slope: the direction the loss function is moving at any particular point. And what it’s going to tell us is in which direction we should be moving these weights in order to minimize the amount of loss. So, generally speaking, we won’t get into the calculus of it, but the high-level idea for gradient descent is going to look something like this. If we want to train a neural network, we’ll go ahead and start just by choosing the weights randomly: pick random weights for all of the weights in the neural network. And then we’ll use the input data that we have access to in order to train the network, in order to figure out what the weights should actually be. So we’ll repeat this process again and again. The first step is that we’re going to calculate the gradient based on all of the data points: we’ll look at all the data and figure out what the gradient is at the place where we currently are, for the current setting of the weights, which means in which direction we should move the weights in order to minimize the total amount of loss, in order to make our solution better. And once we’ve calculated that gradient, which direction we should move in the loss function, well, then we can just update those weights according to the gradient, taking a small step in that direction in order to try to make our solution a little bit better. And the size of the step that we take, that’s going to vary, and you can choose it when you’re training a particular neural network. But in short, the idea is going to be: take all the data points, figure out based on those data points in what direction the weights should move, and then move the weights one small step in that direction. And if you repeat that process over and over again, adjusting the weights a little bit at a time based on all the data points, eventually you should end up with a pretty good solution to the problem; at least, that’s what we would hope to happen. Now, if you look at this algorithm, a good question to ask, any time you’re analyzing an algorithm, is: what is going to be the expensive part of doing the calculation? What’s going to take a lot of work to figure out? What is going to be expensive to calculate?
And in particular, in the case of gradient descent, the really expensive part is this “all data points” part right here, having to take all of the data points and use all of those data points to figure out what the gradient is at this particular setting of all of the weights. Because odds are, in a big machine learning problem where you’re trying to solve a big problem with a lot of data, you have a lot of data points to calculate over, and figuring out the gradient based on all of those data points is going to be expensive. And you’ll have to do it many times: you’ll likely repeat this process again and again and again, going through all the data points, taking one small step over and over as you try to figure out what the optimal setting of those weights happens to be. It turns out that we would ideally like to be able to train our neural networks faster, to more quickly converge to some sort of solution that is going to be a good solution to the problem. In that case, there are alternatives to standard gradient descent, which looks at all of the data points at once. We can employ a method like stochastic gradient descent, which will just randomly choose one data point at a time to calculate the gradient based on, instead of calculating it based on all of the data points. So the idea there is that we have some setting of the weights, we pick a data point, and based on that one data point we figure out in which direction we should move all of the weights, move the weights in that small direction, then take another data point and do it again, and repeat this process again and again, maybe looking at each of the data points multiple times, but each time only using one data point to calculate the gradient, to calculate which direction we should move in. Now, just using one data point instead of all of the data points probably gives us a less accurate estimate of what the gradient actually is. But on the plus side, it’s going to be much faster to calculate: we can much more quickly estimate the gradient based on one data point, instead of calculating it based on all of the data points and having to do all of that computational work again and again. So there are trade-offs here between looking at all of the data points and looking at just one data point. And it turns out that a middle ground that is also quite popular is a technique called mini-batch gradient descent, where the idea is that, instead of looking at all of the data on the one hand or just a single point on the other, we divide our data set up into small batches, groups of data points, where you can decide how big a particular batch is. But in short, you’re just going to look at a small number of points at any given time, hopefully getting a more accurate estimate of the gradient while also not requiring all of the computational effort needed to look at every single one of the data points. So gradient descent, then, is this technique that we can use in order to train these neural networks, in order to figure out what the setting of all of these weights should be if we want some way to get an accurate notion of how this function should work, some way of modeling how to transform the inputs into particular outputs.
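The difference between these three flavors of gradient descent is really just how many data points feed each gradient estimate. Here is one way to sketch that in Python, assuming some `gradient(weights, batch)` function, a stand-in for the calculus we’re skipping, that returns one partial derivative of the loss per weight:

```python
import random

def gradient_descent_step(weights, data, gradient, learning_rate, batch_size=None):
    """One weight update. With batch_size=None, this is standard gradient
    descent (the gradient uses all the data); with batch_size=1, it is
    stochastic gradient descent; anything in between is mini-batch gradient
    descent. `gradient(weights, batch)` is an assumed helper returning one
    partial derivative of the loss per weight."""
    if batch_size is None:
        batch = data                             # all of the data points
    else:
        batch = random.sample(data, batch_size)  # a random subset
    grad = gradient(weights, batch)
    # Take a small step against the gradient to reduce the loss.
    return [w - learning_rate * g for w, g in zip(weights, grad)]
```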
Now, so far, the networks that we’ve taken a look at have all been structured similarly: we have some number of inputs, maybe two or three or five or more, and then we have one output that is just predicting, like, rain or no rain, or just predicting one particular value. But often, in machine learning problems, we don’t just care about one output; we might care about an output that has multiple different values associated with it. So in the same way that we could take a neural network and add units to the input layer, we can likewise add outputs to the output layer as well. Instead of just one output, you could imagine we have two outputs, or four outputs, for example, where in each case, as we add more inputs or more outputs, if we want to keep this network fully connected between these two layers, we just need to add more weights, so that now each of these input nodes has four weights, one associated with each of the four outputs. And that’s true for each of these various different input nodes. So as we add nodes, we add more weights, in order to make sure that each of the inputs can somehow be connected to each of the outputs, so that each output value can be calculated based on what the values of the inputs happen to be. So what might a case be where we want multiple different output values? Well, you might consider that in the case of weather prediction, for example, we might not just care whether it’s raining or not raining; there might be multiple categories of weather that we would like to classify the weather into. With just a single output variable, we can do a binary classification, like rain or no rain, 1 or 0, but it doesn’t allow us to do much more than that. With multiple output variables, I might be able to use each one to predict something a little different. Maybe I want to categorize the weather into one of four different categories, something like: is it going to be rainy, or sunny, or cloudy, or snowy? And I now have four output variables that can be used to represent maybe the probability that it is rainy, as opposed to sunny, as opposed to cloudy, as opposed to snowy. How then would this neural network work? Well, we have some input variables that represent some data that we have collected about the weather, and each of those inputs gets multiplied by each of these various different weights. We have more multiplications to do, but these are fairly quick mathematical operations to perform. And then, after passing the results through some sort of activation function at the outputs, we end up getting some numbers, where each number you might imagine interpreting as a probability, like a probability that it is one category as opposed to another category. So here we’re saying that, based on the inputs, we think there is a 10% chance that it’s raining, a 60% chance that it’s sunny, a 20% chance of cloudy, and a 10% chance that it’s snowy. And given that output, if these represent a probability distribution, well, then you could just pick whichever one has the highest value, in this case sunny, and say that most likely we think this input means that the output should be sunny; that is what we would expect the weather to be in this particular instance. And so this allows us to do these sorts of multi-class classifications, where instead of just having a binary classification, 1 or 0, we can have as many different categories as we want, and we can have our neural network output these probabilities over which categories are more likely than other categories.
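The lecture doesn’t name the mechanism here, but one standard way to turn a layer’s raw output scores into a probability distribution like that 10%/60%/20%/10% example is the softmax function. A sketch, with scores chosen so the result roughly matches the example:

```python
import math

def softmax(scores):
    """Turn raw output scores into a probability distribution summing to 1."""
    # Subtracting the max first is a common trick for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative scores for [rainy, sunny, cloudy, snowy]:
probabilities = softmax([0.1, 1.9, 0.8, 0.1])   # roughly [0.10, 0.60, 0.20, 0.10]
prediction = max(range(len(probabilities)), key=lambda i: probabilities[i])
# prediction == 1, i.e., "sunny", the category with the highest probability
```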
And using that data, we’re able to draw some sort of inference about what it is that we should do. So this was sort of the idea of supervised machine learning: I can give this neural network a whole bunch of input data corresponding to some label, some output data, like we know that it was raining on this day and we know that it was sunny on that day, and using all of that data, the algorithm can use gradient descent to figure out what all of the weights should be, in order to create some sort of model that hopefully allows us a way to predict what we think the weather is going to be. But neural networks have a lot of other applications as well. You could imagine applying the same sort of idea to a reinforcement learning example, where you’ll remember that in reinforcement learning, what we wanted to do is train some sort of agent to learn what action to take depending on what state it currently happens to be in. So depending on the current state of the world, we wanted the agent to pick one of the actions available to it. And you might model that by having each of these input variables represent some information about the state, some data about what state our agent is currently in, and the outputs, for example, could be each of the various different actions that our agent could take, actions 1, 2, 3, and 4. And you might imagine that this network would work in the same way: based on these particular inputs, we go ahead and calculate values for each of these outputs, and those outputs could model which actions are better than others, and we could just choose, based on looking at those outputs, which action we should take. And so these neural networks are very broadly applicable. All they’re really doing is modeling some mathematical function, so anything that we can frame as a mathematical function, something like classifying inputs into various different categories, or figuring out, based on some input state, what action we should take, these are all mathematical functions that we could attempt to model by taking advantage of this neural network structure, and in particular by taking advantage of this technique, gradient descent, that we can use in order to figure out what the weights should be in order to do this sort of calculation. Now, how is it that you would go about training a neural network that has multiple outputs instead of just one? Well, with just a single output, we can see what that output should be and update all of the weights that correspond to it. And when we have multiple outputs, at least in this particular case, we can really think of this as four separate neural networks: we really just have one network here that has these three inputs, with three weights, corresponding to this one output value. And the same thing is true for this output value: it effectively defines yet another neural network that has the same three inputs but a different set of weights that correspond to this output. And likewise, this output has its own set of weights as well, and the same is true for the fourth output, too. And so if you wanted to train a neural network that has four outputs instead of just one, in this case where the inputs are directly connected to the outputs, you could really think of it as just training four independent neural networks.
We know what the outputs for each of these four should be based on our input data, and using that data, we can begin to figure out what all of these individual weights should be. And maybe there’s an additional step at the end, to make sure that we turn these values into a probability distribution, so that we can interpret which category is better, or more likely, than another. So this seems like it does a pretty good job of taking inputs and trying to predict what the outputs should be, and we’ll see some real examples of this in just a moment as well. But it’s important, then, to think about what the limitations of this sort of approach are, of just taking some linear combination of inputs and passing it into some sort of activation function. And it turns out that when we do this in the case of binary classification, trying to predict whether something belongs to one category or another, we can only predict things that are linearly separable, because we’re taking a linear combination of inputs and using that to define some decision boundary or threshold. So if we have this set of data, we can find a line that linearly separates the red points from the blue points. But a single unit that is making a binary classification, otherwise known as a perceptron, can’t deal with a situation like this one, which we’ve seen before, where there is no straight line that goes through the data and divides the red points from the blue points. It’s a more complex decision boundary: the boundary somehow needs to capture the things inside of this circle, and there isn’t really a line that will allow us to deal with that. So this is the limitation of the perceptron, these units that just make binary decisions based on their inputs: a single perceptron is only capable of learning a linearly separable decision boundary. All it can do is define a line. Sure, it can give us probabilities based on how close to that decision boundary we are, but it can only really decide based on a linear decision boundary. And so this doesn’t seem like it’s going to generalize well to situations where real-world data is involved, because real-world data often isn’t linearly separable; it often isn’t the case that we can just draw a line through the data and divide it up into multiple groups. So what, then, is the solution to this? Well, what was proposed was the idea of a multilayer neural network. So far, all of the neural networks we’ve seen have had a set of inputs and a set of outputs, with the inputs connected directly to those outputs. But a multilayer neural network is an artificial neural network that still has an input layer and an output layer, but also has one or more hidden layers in between, other layers of artificial neurons, or units, that are going to calculate their own values as well. So instead of a neural network that looks like this, with three inputs and one output, you might imagine injecting a hidden layer in the middle, something like this. This is a hidden layer that has four nodes; you could choose how many nodes or units end up going into the hidden layer, and you can have multiple hidden layers as well. And so now each of the inputs isn’t directly connected to the output: each of the inputs is connected to this hidden layer, and then all of the nodes in the hidden layer are connected to the one output.
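As a sketch of what a forward pass through such a network involves, here is a small Python example with three inputs, a hidden layer of four units, and one output; the weights below are placeholder numbers just to show the shapes involved, not learned values:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def layer(inputs, weights, biases):
    """One dense layer: each unit takes a weighted sum of all the inputs,
    adds its own bias, and applies the activation function."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]

# 3 inputs -> hidden layer of 4 units -> 1 output.
hidden_weights = [[0.2, -0.4, 0.1] for _ in range(4)]  # 4 units, 3 weights each
hidden_biases = [0.0] * 4
output_weights = [[0.5, 0.5, 0.5, 0.5]]                # 1 unit, 4 weights
output_biases = [0.0]

x = [1.0, 0.0, 1.0]
hidden = layer(x, hidden_weights, hidden_biases)  # the hidden activations
output = layer(hidden, output_weights, output_biases)
```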
And so this is just another step that we can take towards calculating more complex functions. Each of these hidden units will calculate its output value, otherwise known as its activation, based on a linear combination of all the inputs. And once we have values for all of these hidden nodes, as opposed to those just being the output, we do the same thing again: calculate the output for this node, based on multiplying each of the values of those units by their weights as well. So in effect, the way this works is that we start with the inputs; they get multiplied by weights in order to calculate values for the hidden nodes; and those get multiplied by weights in order to figure out what the ultimate output is going to be. And the advantage of layering things like this is that it gives us the ability to model more complex functions: instead of just having a single decision boundary, a single line dividing the red points from the blue points, each of these hidden nodes can learn a different decision boundary, and we can combine those decision boundaries to figure out what the ultimate output is going to be. And as we begin to imagine more complex situations, you could imagine each of these nodes learning some useful property, some useful feature, of all of the inputs, and us somehow learning how to combine those features together in order to get the output that we actually want. Now, the natural question, when we begin to look at this, is: how do we train a neural network that has hidden layers inside of it? And this turns out to initially be a bit of a tricky question, because the data that we are given contains values for all of the inputs and what the value of the output should be, what the category is, for example, but the input data doesn’t tell us what the values for all of these intermediate nodes should be. So we don’t know how far off each of these nodes actually is, because we’re only given data for the inputs and the outputs. The reason this is called a hidden layer is that the data made available to us doesn’t tell us what the values for all of these intermediate nodes should actually be. And so the strategy people came up with was to say that if you know what the error, or the loss, is on the output node, well, then, based on what these weights are, if one of these weights is higher than another, you can calculate an estimate for how much of that error was due to this part of the hidden layer as opposed to that part of the hidden layer, based on the values of the weights. In effect, based on the error from the output, I can back-propagate the error and figure out an estimate for what the error is for each of the nodes in the hidden layer as well. There’s some more calculus here that we won’t get into the details of, but this idea is an algorithm known as backpropagation: an algorithm for training a neural network with multiple hidden layers. And the pseudocode for it, if we want to run gradient descent with backpropagation, will again be: start with a random choice of weights, as we did before, and then repeat the training process again and again. But what we’re going to do each time is calculate the error for the output layer first. We know what the output should be, and we know what we calculated, so we can figure out what the error there is.
But then we’re going to repeat, for every layer, starting with the output layer, moving back into the hidden layer, then the hidden layer before that if there are multiple hidden layers, going back all the way to the very first hidden layer: we’re going to propagate the error back one layer. Whatever the error was at the output, figure out what the error should be one layer before that, based on what the values of those weights are, and then we can update those weights. So graphically, the way you might think about this is that we first start with the output. We know what the output should be and what output we calculated, and based on that, we can figure out how we need to update those weights. Then, backpropagating the error to these nodes, we can figure out how we should update these weights. And you might imagine that if there are multiple layers, we can repeat this process again and again to figure out how all of these weights should be updated. And this backpropagation algorithm is really the key algorithm that makes neural networks possible, that makes it possible to take these multi-level structures and train them, to figure out how we should go about updating the weights in order to create some function that minimizes the total amount of loss, some good setting of the weights that will take the inputs and translate them into the output that we expect.
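To make that less abstract, here is a minimal from-scratch sketch of gradient descent with backpropagation: a tiny network with two inputs, two hidden units, and one output. It is trained on XOR, chosen because, unlike OR and AND, no single unit can model it; this is my illustrative example, not one from the lecture, and with an unlucky random initialization such a small network can occasionally get stuck, so results vary from run to run.

```python
import math
import random

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Random starting weights: 2 hidden units with 2 weights each, 1 output unit.
w1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
b1 = [0.0, 0.0]
w2 = [random.uniform(-1, 1) for _ in range(2)]
b2 = 0.0

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]  # XOR
lr = 0.5  # learning rate: the size of each small step

for epoch in range(10000):
    for x, y in data:
        # Forward pass: inputs -> hidden activations -> output.
        h = [sigmoid(sum(w * xi for w, xi in zip(w1[j], x)) + b1[j])
             for j in range(2)]
        o = sigmoid(sum(w * hj for w, hj in zip(w2, h)) + b2)

        # Backpropagation: the error signal at the output layer first...
        d_out = (o - y) * o * (1 - o)
        # ...then propagated back one layer, scaled by the connecting weights.
        d_hid = [d_out * w2[j] * h[j] * (1 - h[j]) for j in range(2)]

        # Gradient descent: nudge every weight a small step against its gradient.
        for j in range(2):
            w2[j] -= lr * d_out * h[j]
            for i in range(2):
                w1[j][i] -= lr * d_hid[j] * x[i]
            b1[j] -= lr * d_hid[j]
        b2 -= lr * d_out

# On most runs, the outputs are now close to the targets 0, 1, 1, 0.
for x, y in data:
    h = [sigmoid(sum(w * xi for w, xi in zip(w1[j], x)) + b1[j]) for j in range(2)]
    print(x, y, round(sigmoid(sum(w * hj for w, hj in zip(w2, h)) + b2), 2))
```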
And this works, as we said, not just for a single hidden layer: you can imagine multiple hidden layers, where for each hidden layer we just define however many nodes we want, and each of the nodes in one layer can connect to the nodes in the next layer, defining more and more complex networks that are able to model more and more complex types of functions. And so this type of network is what we might call a deep neural network, part of a larger family of deep learning algorithms, if you’ve ever heard that term. And all deep learning is about is using multiple layers to be able to predict and model higher-level features inside of the input, to be able to figure out what the output should be. So a deep neural network is just a neural network that has multiple of these hidden layers, where we start at the input, calculate values for this layer, then this layer, then this layer, and then ultimately get an output. And this allows us to model more and more sophisticated types of functions: each of these layers can calculate something a little bit different, and we can combine that information to figure out what the output should be. Of course, as with any situation in machine learning, as we begin to make our models more complex, to model more and more complex functions, the risk we run is something like overfitting. And we talked about overfitting last time, in the context of training our models to learn some sort of decision boundary, where overfitting happens when we fit too closely to the training data and, as a result, don’t generalize well to other situations. And one of the risks we run with a far more complex neural network that has many, many different nodes is that we might overfit based on the input data; we might grow over-reliant on certain nodes to calculate things purely based on the input data, in a way that doesn’t allow us to generalize very well to the output. And there are a number of strategies for dealing with overfitting, but one of the most popular in the context of neural networks is a technique known as dropout. What dropout does, while we’re training the neural network, is temporarily remove units, these artificial neurons, from our network, chosen at random, and the goal here is to prevent over-reliance on certain units. What generally happens in overfitting is that we begin to over-rely on certain units inside the neural network to tell us how to interpret the input data. What dropout will do is randomly remove some of these units, in order to reduce the chance that we over-rely on certain units and to make our neural network more robust, able to handle the situation even when we drop out particular neurons entirely. So the way that might work is that we have a network like this, and as we’re training it, when we go about trying to update the weights the first time, we’ll just randomly pick some percentage of the nodes to drop out of the network; it’s as if those nodes, and the weights associated with them, aren’t there at all, and we’ll train it this way. Then, the next time we update the weights, we’ll pick a different set and train that way, and then again randomly choose and train with other nodes dropped out as well. And the goal of that is that, after the training process, if you trained by dropping out random nodes inside of this neural network, you hopefully end up with a network that’s a little bit more robust, one that doesn’t rely too heavily on any one particular node but more generally learns how to approximate a function. So that, then, is a look at some of these techniques that we can use in order to implement a neural network, to get at this idea of taking input, passing it through these various different layers, and producing some sort of output. What we’d like to do now is take those ideas and put them into code. And to do that, there are a number of different machine learning libraries, neural network libraries, that we can use that give us access to someone else’s implementation of backpropagation and all of these hidden layers. One of the most popular, developed by Google, is known as TensorFlow, a library that we can use for quickly creating neural networks, modeling them, and running them on some sample data to see what the output is going to be. And before we actually start writing code, we’ll go ahead and take a look at TensorFlow’s playground, which will be an opportunity for us just to play around with this idea of neural networks and different layers, just to get a sense for what it is that we can do by taking advantage of neural networks. So let’s go ahead and go into TensorFlow’s playground, which you can go to by visiting that URL from before. And what we’re going to do now is try to learn the decision boundary for this particular output: I want to learn to separate the orange points from the blue points, and I’d like to learn some sort of setting of weights inside of a neural network that will be able to separate those from each other. The features we have access to, our input data, are the x value and the y value, the two values along each of the two axes.
And what I’ll do now is set particular parameters, like what activation function I would like to use, and I’ll just go ahead and press play and see what happens. And what happens here is that, just by using these two input features, the x value and the y value, with no hidden layers, just taking the inputs and figuring out a decision boundary, our neural network learns pretty quickly that in order to divide these two groups of points, we should just use this line. This line acts as a decision boundary that separates this group of points from that group of points, and it does it very well. You can see up here what the loss is: the training loss is 0, meaning we were able to perfectly separate these two groups of points from each other inside of our training data. So this was a fairly simple case of trying to apply a neural network, because the data is very clean; it’s very nicely linearly separable, so we could just draw a line that separates all of those points from each other. Let’s now consider a more complex case. I’ll go ahead and pause the simulation, and we’ll look at this data set here. This data set is a little bit more complex. In this data set, we still have blue and orange points that we’d like to separate from each other, but there’s no single line that we can draw that is going to be able to separate the blue from the orange, because the blue is located in these two quadrants, and the orange is located here and here. It’s a more complex function to be able to learn. So let’s see what happens if we just try to predict, based on those inputs, the x and y coordinates, what the output should be. I’ll press Play. And what you’ll notice is that we’re not really able to draw much of a conclusion: we’re not able to see very cleanly how we should divide the orange points from the blue points, and you don’t see a very clean separation there. So it seems like we don’t have enough sophistication inside of our network to be able to model something that complex. We need a better model for this neural network, and I’ll do that by adding a hidden layer. So now I have a hidden layer that has two neurons inside of it: two inputs that go to two neurons inside of a hidden layer, which then go to our output. And now I’ll press Play. And what you’ll notice here is that we’re able to do slightly better. We’re able to now say, all right, these points are definitely blue, and these points are definitely orange; we’re still struggling a little bit with these points up here, though. And what we can do is look at each of these hidden neurons and see what exactly each of them is doing. Each hidden neuron is learning its own decision boundary, and we can see what that boundary is. This first neuron is learning, all right, this line that seems to separate some of the blue points from the rest of the points. This other hidden neuron is learning another line that seems to be separating the orange points in the lower right from the rest of the points. So that’s why we’re able to figure out these two areas in the bottom region, but we’re still not able to perfectly classify all of the points. So let’s go ahead and add another neuron. Now we’ve got three neurons inside of our hidden layer; let’s see what we’re able to learn now. All right, well, now we seem to be doing a better job.
By learning three different decision boundaries, one with each of the three neurons inside of our hidden layer, we’re able to much better figure out how to separate these blue points from the orange points. And we can see what each of these hidden neurons is learning: each one is learning a slightly different decision boundary, and then we’re combining those decision boundaries together to figure out what the overall output should be. And then we can try it one more time, by adding a fourth neuron there and trying to learn that. And it seems like now we can do even better at trying to separate the blue points from the orange points. But we were only able to do this by adding a hidden layer, by adding some layer that is learning other boundaries and combining those boundaries to determine the output. And the strength, the size and the thickness, of these lines indicates how high these weights are, how important each of these inputs is for making this sort of calculation. And we can do maybe one more simulation. Let’s go ahead and try this on a data set that looks like this, and get rid of the hidden layer. Here, now, we’re trying to separate the blue points from the orange points, where all the blue points are located, again, effectively inside of a circle. So we’re not going to be able to learn a line. I press Play, and we’re really not able to draw any sort of classification at all, because there is no line that cleanly separates the blue points from the orange points. So let’s try to solve this by introducing a hidden layer. I’ll go ahead and press Play. And all right, with two neurons in a hidden layer, we’re able to do a little better, because we effectively learned two different decision boundaries: we learned this line here, and we learned this line on the right-hand side. And right now we’re just saying, all right, well, if it’s in between, we’ll call it blue, and if it’s outside, we’ll call it orange. So, not great, but certainly better than before: we’re learning one decision boundary and another, and based on those, we can figure out what the output should be. But let’s now go ahead and add a third neuron and see what happens. I go ahead and train it, and now, using three different decision boundaries that are learned by each of these hidden neurons, we’re able to much more accurately model this distinction between the blue points and the orange points. With these three decision boundaries combined together, you can imagine figuring out what the output should be and how to make that sort of classification. And so the goal here is just to get a sense for how having more neurons in these hidden layers allows us to learn more structure in the data, allows us to figure out what the relevant and important decision boundaries are. And then, using this backpropagation algorithm, we’re able to figure out what the values of these weights should be in order to train this network to classify one category of points as separate from another category of points. And this is ultimately what we’re going to be trying to do whenever we’re training a neural network. So let’s go ahead and actually see an example of this. You’ll recall from last time that we had this banknotes file that included information about counterfeit banknotes as opposed to authentic banknotes, where I had four different values for each banknote and then a categorization of whether that banknote is considered to be authentic or a counterfeit note.
And what I wanted to do was, based on that input information, figure out some function that could calculate, based on the input information, what category it belonged to. And what I’ve written here in banknotes.py is a neural network that will learn just that, a network that learns, based on all of the input, whether or not we should categorize a banknote as authentic or as counterfeit. The first step is the same as what we saw from last time. I’m really just reading the data in and getting it into an appropriate format. And so this is where more of writing Python code on your own comes in: manipulating this data, massaging the data into a format that will be understood by a machine learning library like scikit-learn or TensorFlow. And so here I separate it into a training and a testing set. And now what I’m doing down below is I’m creating a neural network. Here I’m using tf, which stands for TensorFlow. Up above, I said import tensorflow as tf, where tf is just an abbreviation that we’ll often use so we don’t need to write out TensorFlow every time we want to use anything inside of the library. I’m using tf.keras. Keras is an API, a set of functions that we can use in order to manipulate neural networks inside of TensorFlow. And it turns out there are other machine learning libraries that also use the Keras API. But here I’m saying, all right, go ahead and give me a model that is a sequential model, a sequential neural network, meaning one layer after another. And now I’m going to add to that model what layers I want inside of my neural network. So here I’m saying model.add: go ahead and add a dense layer. And when we say a dense layer, we mean a layer in which each of the nodes is connected to each of the nodes from the previous layer. So we have a densely connected layer. This layer is going to have eight units inside of it. So it’s going to be a hidden layer inside of a neural network with eight different units, eight artificial neurons, each of which might learn something different. And I just sort of chose eight arbitrarily. You could choose a different number of hidden nodes inside of the layer. And as we saw before, depending on the number of units there are inside of your hidden layer, more units means you can learn more complex functions, so maybe you can more accurately model the training data. But it comes at a cost: more units mean more weights that you need to figure out how to update, so it might be more expensive to do that calculation. And you also run the risk of overfitting on the data. If you have too many units and you learn to just overfit on the training data, that’s not good either. So there is a balance, and there’s often a testing process where you’ll train on some data and maybe validate how well you’re doing on a separate set of data, often called a validation set, to see which settings of parameters perform best: how many layers should I have? How many units should be in each layer? Which one of those performs the best on the validation set? So you can do some testing to figure out what these so-called hyperparameters should be equal to. Next, I specify what the input shape is, meaning, all right, what does my input look like? My input has four values, and so the input shape is just four, because we have four inputs. And then I specify what the activation function is. And the activation function, again, we can choose. There are a number of different activation functions.
Here I’m using relu, which you might recall from earlier. And then I’ll add an output layer. So I have my hidden layer. Now I’m adding one more layer that will just have one unit, because all I want to do is predict something like counterfeit bill or authentic bill. So I just need a single unit. And the activation function I’m going to use here is that sigmoid activation function, which, again, was that S-shaped curve that just gave us a probability of what is the probability that this is a counterfeit bill, as opposed to an authentic bill. So that, then, is the structure of my neural network: a sequential neural network that has one hidden layer with eight units inside of it, and then one output layer that just has a single unit inside of it. And I can choose how many units there are. I can choose the activation function. Then I’m going to compile this model. TensorFlow gives you a choice of how you would like to optimize the weights, and there are various different algorithms for doing that; what type of loss function you want to use, again with many different options; and then how I want to evaluate my model. Well, I care about accuracy. I care about how many of my points I am able to classify correctly versus not correctly as counterfeit or not counterfeit, and I would like it to report to me how accurately my model is performing. Then, now that I’ve defined that model, I call model.fit to say go ahead and train the model. Train it on all the training data plus all of the training labels, so labels for each of those pieces of training data. And I’m saying run it for 20 epochs, meaning go ahead and go through each of these training points 20 times, effectively. Go through the data 20 times and keep trying to update the weights. If I did it for more epochs, I could train for even longer and maybe get a more accurate result. But then after I fit it on all the data, I’ll go ahead and just test it. I’ll evaluate my model using model.evaluate, built into TensorFlow, which is just going to tell me how well I perform on the testing data. So ultimately, this is just going to give me some numbers that tell me how well we did in this particular case. So now what I’m going to do is go into banknotes and go ahead and run banknotes.py. And what’s going to happen now is it’s going to read in all of that training data. It’s going to generate a neural network with all my inputs, my eight hidden units inside my layer, and then an output unit. And now what it’s doing is it’s training. It’s training 20 times. And each time, you can see how my accuracy is increasing on my training data. It starts off the very first time not very accurate, though better than random: something like 79% of the time, it’s able to accurately classify one bill from another. But as I keep training, notice this accuracy value improves and improves and improves, until after I’ve trained through all the data points 20 times, it looks like my accuracy is above 99% on the training data. And here’s where I tested it on a whole bunch of testing data. And it looks like in this case, I was also about 99.8% accurate. So just using that, I was able to generate a neural network that can detect counterfeit bills from authentic bills based on this input data 99.8% of the time, at least based on this particular testing data. And I might want to test it with more data as well, just to be confident about that. But this is really the value of using a machine learning library like TensorFlow.
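To make that concrete, here is a rough sketch of the kind of code being described, assuming the training and testing splits (X_training, y_training, X_testing, y_testing) have already been prepared; the variable names are assumptions, not necessarily the exact ones used in banknotes.py:

import tensorflow as tf

# A hidden layer with 8 densely connected units, taking the 4 banknote measurements as input
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(8, input_shape=(4,), activation="relu"))

# An output layer with a single unit and sigmoid activation, giving the probability that a bill is counterfeit
model.add(tf.keras.layers.Dense(1, activation="sigmoid"))

# Pick an optimizer and a loss function, and report accuracy during training
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train for 20 epochs, then evaluate on the held-out testing data
model.fit(X_training, y_training, epochs=20)
model.evaluate(X_testing, y_testing, verbose=2)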
And there are other machine learning libraries available for Python and other languages as well. But all I have to do is define the structure of the network and define the data that I’m going to pass into the network. And then TensorFlow runs the backpropagation algorithm for learning what all of those weights should be, for figuring out how to train this neural network to figure out, as accurately as possible, what the output values should be. And so this then was a look at what it is that neural networks can do just using these sequences of layer after layer after layer. And you can begin to imagine applying these to much more general problems. And one big problem in computing and artificial intelligence more generally is the problem of computer vision. Computer vision is all about computational methods for analyzing and understanding images. You might have pictures that you want the computer to figure out how to deal with, how to process those images and figure out how to produce some sort of useful result out of them. You’ve seen this in the context of social media websites that are able to look at a photo that contains a whole bunch of faces, figure out who is pictured, and label and tag them with the appropriate people. This is becoming increasingly relevant as we begin to discuss self-driving cars: these cars now have cameras, and we would like for the computer to have some sort of algorithm that looks at the image and figures out what color the light is, what cars are around us and in what direction, for example. And so computer vision is all about taking an image and figuring out what sort of computation, what sort of calculation, we can do with that image. It’s also relevant in the context of something like handwriting recognition. What you’re looking at here is an example of the MNIST data set, a big data set of handwritten digits that we could ideally use to try and figure out how to predict, given a photo of a digit that someone has drawn, whether it’s a 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9, for example. So handwriting recognition is yet another task that we might want to apply computer vision tools towards, another task that we might care about. So how, then, can we use neural networks to be able to solve a problem like this? Well, neural networks rely upon some sort of input, where that input is just numerical data. We have a whole bunch of units, where each one of them just represents some sort of number. And so in the context of something like handwriting recognition, or in the context of just an image, you might imagine that an image is really just a grid of pixels, a grid of dots, where each dot has some sort of color. And in the context of something like handwriting recognition, you might imagine that if you just fill in each of these dots in a particular way, you can generate a 2 or an 8, for example, based on which dots happen to be shaded in and which dots are not. And we can represent each of these pixel values just using numbers. So for a particular pixel, for example, 0 might represent entirely black. Depending on how you’re representing color, it’s common to represent color values on a 0 to 255 range, so that you can represent a color using 8 bits for a particular value, like how much white is in the image. So 0 might represent all black, and 255 might represent an entirely white pixel.
And somewhere in between might represent some shade of gray, for example. But you might imagine not just having a single value that determines how much white is in the pixel; if you had a color image, you might imagine three different numerical values, a red, green, and blue value, where the red value controls how much red is in the pixel, one value controls how much green is in the pixel, and one value controls how much blue is in the pixel as well. And depending on how it is that you set these values of red, green, and blue, you can get a different color. And so any pixel can really be represented, in this case, by three numerical values: a red value, a green value, and a blue value. And if you take a whole bunch of these pixels and assemble them together inside of a grid of pixels, then you really just have a whole bunch of numerical values that you can use in order to perform some sort of prediction task. And so what you might imagine doing is using the same techniques we talked about before: just design a neural network with a lot of inputs, where for each of the pixels we might have one input, or three different inputs in the case of a color image, each connected to a deep neural network, for example. And this deep neural network might take all of the pixels inside of the image of what digit a person drew, and the output might be something like 10 neurons that classify it as a 0, or a 1, or a 2, or a 3, or just tell us in some way what that digit happens to be. Now, there are a couple of drawbacks to this approach. The first drawback to the approach is just the size of this input array, that we have a whole bunch of inputs. If we have a big image that has a lot of different channels, we’re looking at a lot of inputs, and therefore a lot of weights that we have to calculate. And a second problem is the fact that by flattening everything into just this structure of all the pixels, we’ve lost access to a lot of the information about the structure of the image that’s relevant, that really, when a person looks at an image, they’re looking at particular features of the image. They’re looking at curves. They’re looking at shapes. They’re looking at what things they can identify in different regions of the image, and maybe putting those things together in order to get a better picture of what the overall image is about. And by just turning it into pixel values for each of the pixels, sure, you might be able to learn that structure, but it might be challenging to do so. It might be helpful to take advantage of the fact that you can use properties of the image itself, the fact that it’s structured in a particular way, to be able to improve the way that we learn based on that image too. So in order to figure out how we can train our neural networks to better be able to deal with images, we’ll introduce a couple of ideas, a couple of algorithms, that we can apply to take the image and extract some useful information out of it. And the first idea we’ll introduce is the notion of image convolution. And what image convolution is all about is filtering an image, sort of extracting useful or relevant features out of the image. And the way we do that is by applying a particular filter that basically combines the value of every pixel with the values of all of the neighboring pixels, according to some sort of kernel matrix, which, as we’ll see in a moment, is going to allow us to weight these pixels in various different ways.
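As a preview of the arithmetic we’re about to walk through, here is a minimal sketch of that sliding-window computation. The 4 by 4 image and the kernel with 5 in the middle are reconstructions chosen so that the results reproduce the numbers in the example that follows (10, 20, and 40), not values taken verbatim from the lecture materials:

import numpy as np

# A 4 by 4 grayscale image and a 3 by 3 kernel (values chosen for illustration)
image = np.array([[10, 20, 30, 40],
                  [10, 20, 30, 40],
                  [20, 30, 40, 50],
                  [20, 30, 40, 50]])
kernel = np.array([[ 0, -1,  0],
                   [-1,  5, -1],
                   [ 0, -1,  0]])

# Slide the kernel one pixel at a time, multiplying and summing at each position
feature_map = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        window = image[i:i+3, j:j+3]
        feature_map[i, j] = (window * kernel).sum()

print(feature_map)  # [[10. 20.], [40. 50.]]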
And the goal of image convolution, then, is to extract some sort of interesting or useful features out of an image, to be able to take a pixel and, based on its neighboring pixels, maybe predict some sort of valuable information. Something like taking a pixel and looking at its neighboring pixels, you might be able to predict whether or not there’s some sort of curve inside the image, or whether it’s forming the outline of a particular line or a shape, for example. And that might be useful if you’re trying to use all of these various different features to combine them to say something meaningful about an image as a whole. So how, then, does image convolution work? Well, we start with a kernel matrix. And the kernel matrix looks something like this. And the idea of this is that, given a pixel that will be the middle pixel, we’re going to multiply each of the neighboring pixels by these values in order to get some sort of result by summing up all the numbers together. So if I take this kernel, which you can think of as a filter that I’m going to apply to the image, and let’s say that I take this image. This is a 4 by 4 image. We’ll think of it as just a black and white image, where each one is just a single pixel value. So somewhere between 0 and 255, for example. So we have a whole bunch of individual pixel values like this. And what I’d like to do is apply this kernel, this filter, so to speak, to this image. And the way I’ll do that is, all right, the kernel is 3 by 3. You can imagine a 5 by 5 kernel or a larger kernel, too. And I’ll take it and just first apply it to the first 3 by 3 section of the image. And what I’ll do is I’ll take each of these pixel values, multiply it by its corresponding value in the filter matrix, and add all of the results together. So here, for example, I’ll say 10 times 0, plus 20 times negative 1, plus 30 times 0, so on and so forth, doing all of this calculation. And at the end, if I take all these values, multiply them by their corresponding value in the kernel, add the results together, for this particular set of 9 pixels, I get the value of 10, for example. And then what I’ll do is I’ll slide this 3 by 3 grid, effectively, over. I’ll slide the kernel by 1 to look at the next 3 by 3 section. Here, I’m just sliding it over by 1 pixel. But you might imagine a different stride length, or maybe I jump by multiple pixels at a time if you really wanted to. You have different options here. But here, I’m just sliding over, looking at the next 3 by 3 section. And I’ll do the same math, 20 times 0, plus 30 times negative 1, plus 40 times 0, plus 20 times negative 1, so on and so forth, plus 30 times 5. And what I end up getting is the number 20. Then you can imagine shifting over to this one, doing the same thing, calculating the number 40, for example, and then doing the same thing here, and calculating a value there as well. And so what we have now is what we’ll call a feature map. We have taken this kernel, applied it to each of these various different regions, and what we get is some representation of a filtered version of that image. And so to give a more concrete example of why it is that this kind of thing could be useful, let’s take this kernel matrix, for example, which is quite a famous one, that has an 8 in the middle, and then all of the neighboring pixels get a negative 1. And let’s imagine we wanted to apply that to a 3 by 3 part of an image that looks like this, where all the values are the same. They’re all 20, for instance. 
Well, in this case, if you do 20 times 8, and then subtract 20, subtract 20, subtract 20 for each of the eight neighbors, well, the result of that is you just get that expression, which comes out to be 0. You multiplied 20 by 8, but then you subtracted 20 eight times, according to that particular kernel. The result of all that is just 0. So the takeaway here is that when a lot of the pixels are the same value, we end up getting a value close to 0. If, though, we had something like this, where 20 is along the first row and 50 is in the second and third rows, well, then when you do the same kind of math, 20 times negative 1, 20 times negative 1, so on and so forth, I get a higher value, a value like 90 in this particular case. And so the more general idea here is that by applying this kernel, negative 1s, 8 in the middle, and then negative 1s, what I get is that when this middle value is very different from the neighboring values, like 50 is greater than these 20s, then you’ll end up with a value higher than 0. If this number is higher than its neighbors, you end up getting a bigger output. But if this value is the same as all of its neighbors, then you get a lower output, something like 0. And it turns out that this sort of filter can therefore be used in something like detecting edges in an image. If I want to detect the boundaries between various different objects inside of an image, I might use a filter like this, which is able to tell whether the value of a pixel is different from the values of the neighboring pixels, whether it’s greater than the values of the pixels that happen to surround it. And so we can use this in terms of image filtering. And so I’ll show you an example of that. I have here in filter.py a file that uses the Python Imaging Library, or PIL, to do some image filtering. I go ahead and open an image. And then all I’m going to do is apply a kernel to that image. It’s going to be a 3 by 3 kernel, the same kind of kernel we saw before. And here is the kernel. This is just a list representation of the same matrix that I showed you a moment ago: it’s negative 1, negative 1, negative 1; the second row is negative 1, 8, negative 1; and the third row is all negative 1s. And then at the end, I’m going to go ahead and show the filtered image. So if, for example, I go into the convolution directory and I open up an image, like bridge.png, this is what an input image might look like: just an image of a bridge over a river. Now I’m going to go ahead and run this filter program on the bridge. And what I get is this image here. Just by taking the original image and applying that filter to each 3 by 3 grid, I’ve extracted all of the boundaries, all of the edges inside the image that separate one part of the image from another. So here I’ve got a representation of boundaries between particular parts of the image. And you might imagine that if a machine learning algorithm is trying to learn what an image is of, a filter like this could be pretty useful. Maybe the machine learning algorithm doesn’t care about all of the details of the image. It just cares about certain useful features. It cares about particular shapes that are able to help it determine that, based on the image, this is going to be a bridge, for example. And so this type of idea of image convolution can allow us to apply filters to images that allow us to extract useful results out of those images, taking an image and extracting its edges, for example.
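For reference, a sketch of what a filter.py like the one described might look like, using Pillow’s implementation of PIL; the command line handling here is an assumption:

import sys
from PIL import Image, ImageFilter

# Open the image named on the command line, e.g. python filter.py bridge.png
image = Image.open(sys.argv[1]).convert("RGB")

# Apply the 3 by 3 edge detection kernel: 8 in the middle, -1 everywhere else
filtered = image.filter(ImageFilter.Kernel(
    size=(3, 3),
    kernel=[-1, -1, -1,
            -1,  8, -1,
            -1, -1, -1],
    scale=1
))

# Show the filtered image
filtered.show()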
And you might imagine many other filters that could be applied to an image that are able to extract particular values as well. And a filter might have separate kernels for the red values, the green values, and the blue values that are all summed together at the end, such that you could have particular filters looking for, is there red in this part of the image? Is there green in other parts of the image? You can begin to assemble these relevant and useful filters that are able to do these calculations as well. So that then was the idea of image convolution: applying some sort of filter to an image to be able to extract some useful features out of that image. But all the while, these images are still pretty big. There are a lot of pixels involved in the image. And realistically speaking, if you’ve got a really big image, that poses a couple of problems. One, it means a lot of input going into the neural network. But two, it also means that we really have to care about what’s in each particular pixel. Whereas realistically, if you’re looking at an image, you often don’t care whether something is in one particular pixel versus the pixel immediately to the right of it. They’re pretty close together. You really just care about whether there’s a particular feature in some region of the image, and maybe you don’t care about exactly which pixel it happens to be in. And so there’s a technique we can use known as pooling. And what pooling means is reducing the size of an input by sampling from regions inside of the input. So we’re going to take a big image and turn it into a smaller image by using pooling. And in particular, one of the most popular types of pooling is called max pooling. And what max pooling does is it pools just by choosing the maximum value in a particular region. So for example, let’s imagine I had this 4 by 4 image, but I wanted to reduce its dimensions. I wanted to make it a smaller image so that I have fewer inputs to work with. Well, what I could do is apply a 2 by 2 max pool, where the idea would be that I’m going to first look at this 2 by 2 region and say, what is the maximum value in that region? Well, it’s the number 50. So we’ll go ahead and just use the number 50. And then we’ll look at this 2 by 2 region. What is the maximum value here? It’s 110, so that’s going to be my value. Likewise here, the maximum value looks like 20. Go ahead and put that there. Then for this last region, the maximum value was 40. So we’ll go ahead and use that. And what I have now is a smaller representation of this same original image that I obtained just by picking the maximum value from each of these regions. So again, the advantage here is that now I only have to deal with a 2 by 2 input instead of a 4 by 4, and you can imagine shrinking the size of an image even more. But in addition to that, I’m now able to make my analysis independent of whether a particular value was in this pixel or that pixel. I don’t care if the 50 was here or here. As long as it was generally in this region, I’ll still get access to that value. So it makes our algorithms a little bit more robust as well. So that then is pooling: taking the image and reducing its size a little bit by just sampling from particular regions inside of the image.
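Here is a minimal sketch of that 2 by 2 max pooling step. The pixel values are made up, chosen so that the maximum in each region matches the numbers from the walkthrough (50, 110, 20, and 40):

import numpy as np

# A 4 by 4 image (illustrative values)
image = np.array([[30,  50,  90, 110],
                  [10,  20,  40,  60],
                  [ 5,  20,  30,  40],
                  [10,  15,  25,  35]])

# Take the maximum value from each non-overlapping 2 by 2 region
pooled = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        region = image[2*i:2*i+2, 2*j:2*j+2]
        pooled[i, j] = region.max()

print(pooled)  # [[ 50. 110.], [ 20.  40.]]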
And now we can put all of these ideas together, pooling, image convolution, and neural networks all together into another type of neural network called a convolutional neural network, or a CNN, which is a neural network that uses this convolution step usually in the context of analyzing an image, for example. And so the way that a convolutional neural network works is that we start with some sort of input image, some grid of pixels. But rather than immediately put that into the neural network layers that we’ve seen before, we’ll start by applying a convolution step, where the convolution step involves applying some number of different image filters to our original image in order to get what we call a feature map, the result of applying some filter to an image. And we could do this once, but in general, we’ll do this multiple times, getting a whole bunch of different feature maps, each of which might extract some different relevant feature out of the image, some different important characteristic of the image that we might care about using in order to calculate what the result should be. And in the same way that when we train neural networks, we can train neural networks to learn the weights between particular units inside of the neural networks, we can also train neural networks to learn what those filters should be, what the values of the filters should be in order to get the most useful, most relevant information out of the original image just by figuring out what setting of those filter values, the values inside of that kernel, results in minimizing the loss function, minimizing how poorly our hypothesis actually performs in figuring out the classification of a particular image, for example. So we first apply this convolution step, get a whole bunch of these various different feature maps. But these feature maps are quite large. There’s a lot of pixel values that happen to be here. And so a logical next step to take is a pooling step, where we reduce the size of these images by using max pooling, for example, extracting the maximum value from any particular region. There are other pooling methods that exist as well, depending on the situation. You could use something like average pooling, where instead of taking the maximum value from a region, you take the average value from a region, which has its uses as well. But in effect, what pooling will do is it will take these feature maps and reduce their dimensions so that we end up with smaller grids with fewer pixels. And this then is going to be easier for us to deal with. It’s going to mean fewer inputs that we have to worry about. And it’s also going to mean we’re more resilient, more robust against potential movements of particular values, just by one pixel, when ultimately we really don’t care about those one-pixel differences that might arise in the original image. And now, after we’ve done this pooling step, now we have a whole bunch of values that we can then flatten out and just put into a more traditional neural network. So we go ahead and flatten it, and then we end up with a traditional neural network that has one input for each of these values in each of these resulting feature maps after we do the convolution and after we do the pooling step. And so this then is the general structure of a convolutional network. We begin with the image, apply convolution, apply pooling, flatten the results, and then put that into a more traditional neural network that might itself have hidden layers. 
You can have deep convolutional networks that have hidden layers in between this flattened layer and the eventual output to be able to calculate various different features of those values. But this then can help us use convolution and pooling, and our knowledge about the structure of an image, to get better results and to train our networks faster in order to better capture particular parts of the image. And there’s no reason necessarily why you can only use these steps once. In fact, in practice, you’ll often use convolution and pooling multiple times in multiple different steps. So what you might imagine doing is starting with an image, first applying convolution to get a whole bunch of feature maps, then applying pooling, then applying convolution again, because these maps are still pretty big. You can apply convolution to try and extract relevant features out of this result, then take those results, apply pooling in order to reduce their dimensions, and then take that and feed it into a neural network that maybe has fewer inputs. So here I have two different convolution and pooling steps. I do convolution and pooling once, and then I do convolution and pooling a second time, each time extracting useful features from the layer before it, each time using pooling to reduce the dimensions of what you’re ultimately looking at. And the goal now of this sort of model is that in each of these steps, you can begin to learn different types of features of the original image. Maybe in the first step, you learn very low-level features, just looking for features like edges and curves and shapes, because based on pixels and their neighboring values, you can figure out, all right, what are the edges? What are the curves? What are the various different shapes that might be present there? But then once you have a mapping that just represents where the edges and curves and shapes happen to be, you can imagine applying the same sort of process again to begin to look for higher-level features: looking for objects, maybe looking for people’s eyes in facial recognition, for example, or maybe looking for more complex shapes, like the curves on a particular number if you’re trying to recognize a digit in a handwriting recognition sort of scenario. And then after all of that, now that you have these results that represent these higher-level features, you can pass them into a neural network, which is really just a deep neural network that looks like this, where you might imagine making a binary classification, or classifying into multiple categories, or performing various different tasks on this sort of model. So convolutional neural networks can be quite powerful and quite popular when it comes to trying to analyze images. We don’t strictly need them. We could have just used a vanilla neural network that just operates with layer after layer, as we’ve seen before.
But these convolutional neural networks can be quite helpful, in particular, because of the way they model the way a human might look at an image: instead of looking at every single pixel simultaneously, what convolution is really doing is looking at various different regions of the image and extracting relevant information and features out of those parts of the image, the same way that a human might have visual receptors that are looking at particular parts of what they see, combining them to figure out what meaning they can draw from all of those various different inputs. And so you might imagine applying this to a situation like handwriting recognition. So we’ll go ahead and see an example of that now, where I’ll go ahead and open up handwriting.py. Again, what we do here is we first import TensorFlow. And then TensorFlow, it turns out, has a few data sets that are built into the library that you can just immediately access. And one of the most famous data sets in machine learning is the MNIST data set, which is just a data set of a whole bunch of samples of people’s handwritten digits. I showed you a slide of that a little while ago. And what we can do is just immediately access that data set, which is built into the library, so that if I want to do something like train on a whole bunch of handwritten digits, I can just use the data set that is provided to me. Of course, if I had my own data set of handwritten images, I could apply the same idea. I’d first just need to take those images and turn them into an array of pixels, because that’s the way that these are going to be formatted. They’re going to be formatted as, effectively, an array of individual pixels. Now there’s a bit of reshaping I need to do, just turning the data into a format that I can put into my convolutional neural network. So this is doing things like taking all the values and dividing them by 255. If you remember, these color values tend to range from 0 to 255, so I can divide them by 255 just to put them into a 0 to 1 range, which might be a little bit easier to train on. And then I do various other modifications to the data just to get it into a nice, usable format. But here’s the interesting and important part. Here is where I create the convolutional neural network, the CNN, where I’m saying, go ahead and use a sequential model. And whereas before I used model.add to say add a layer, add a layer, add a layer, another way I could define it is just by passing, as input to this sequential neural network, a list of all of the layers that I want. And so here, the very first layer in my model is a convolution layer, where I’m first going to apply convolution to my image. I’m going to use 32 different filters, so my model is going to learn 32 different filters on the input image, where each filter is going to be a 3 by 3 kernel. So we saw those 3 by 3 kernels before, where we could multiply each value in a 3 by 3 grid by its corresponding kernel value and add all the results together. So here, I’m going to learn 32 of these different 3 by 3 filters. I can, again, specify my activation function. And I specify what my input shape is. My input shape in the banknotes case was just 4; I had 4 inputs. My input shape here is going to be 28, 28, 1, because for each of these handwritten digits, it turns out that the MNIST data set organizes its data so that each image is a 28 by 28 pixel grid.
So we’re going to have a 28 by 28 pixel grid, and each one of those images only has one channel value. These handwritten digits are just black and white, so there’s just a single color value representing how much black or how much white. You might imagine that in a color image, if you were doing this sort of thing, you might have three different channels: a red, a green, and a blue channel, for example. But in the case of handwriting recognition, recognizing a digit, we’re just going to use a single value for how shaded in each pixel is. It might range, but it’s just a single color value. And that, then, is the very first layer of our neural network: a convolutional layer that will take the input and learn a whole bunch of different filters that we can apply to the input to extract meaningful features. The next step is going to be a max pooling layer, also built right into TensorFlow, a layer that is going to use a pool size of 2 by 2, meaning we’re going to look at 2 by 2 regions inside of the image and just extract the maximum value. Again, we’ve seen why this can be helpful: it’ll help to reduce the size of our input. And once we’ve done that, we’ll go ahead and flatten all of the units just into a single layer that we can then pass into the rest of the neural network. And now, here’s the rest of the neural network. Here, I’m saying, let’s add a hidden layer to my neural network with 128 units, so a whole bunch of hidden units inside of the hidden layer. And just to prevent overfitting, I can add dropout to that: say, you know what, when you’re training, randomly drop out half of the nodes from this hidden layer, just to make sure we don’t become over-reliant on any particular node and so that we begin to really generalize and stop ourselves from overfitting. So TensorFlow allows us, just by adding a single line, to add dropout into our model as well, such that when it’s training, it will perform this dropout step in order to help make sure that we don’t overfit on this particular data. And then finally, I add an output layer. The output layer is going to have 10 units, one for each category that I would like to classify digits into, so 0 through 9, 10 different categories. And the activation function I’m going to use here is called the softmax activation function. In short, what the softmax activation function is going to do is take the output and turn it into a probability distribution. So ultimately, it’s going to tell me, what do we estimate the probability is that this is a 2 versus a 3 versus a 4, and it will turn the output into that probability distribution for me. Next up, I’ll go ahead and compile my model and fit it on all of my training data. And then I can evaluate how well the neural network performs. And then I’ve added to my Python program that, if I’ve provided a command line argument like the name of a file, I’m going to go ahead and save the model to a file. And so this can be quite useful too. Once you’ve done the training step, which could take some time, going through the data and running backpropagation with gradient descent to figure out, all right, how should we adjust the weights of this particular model, you end up calculating values for these weights, calculating values for these filters. You’d like to remember that information so you can use it later.
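Putting that all together, here is a sketch of what the handwriting.py structure being described might look like. The data preparation details are simplified, and the variable names are assumptions, not necessarily the ones in the actual file:

import sys
import tensorflow as tf

# Load the MNIST handwritten digit data set and scale pixel values into a 0 to 1 range
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1) / 255.0
x_test = x_test.reshape(-1, 28, 28, 1) / 255.0
y_train = tf.keras.utils.to_categorical(y_train)
y_test = tf.keras.utils.to_categorical(y_test)

model = tf.keras.models.Sequential([
    # Convolutional layer: learn 32 filters, each a 3 by 3 kernel
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    # Max pooling layer, using a 2 by 2 pool size
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    # Flatten the pooled feature maps into a single layer of units
    tf.keras.layers.Flatten(),
    # Hidden layer with 128 units, with dropout to help prevent overfitting
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    # Output layer: 10 units, one per digit, with softmax giving a probability distribution
    tf.keras.layers.Dense(10, activation="softmax")
])

# Train the network, evaluate it, and optionally save the trained model to a file
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10)
model.evaluate(x_test, y_test, verbose=2)
if len(sys.argv) == 2:
    model.save(sys.argv[1])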
And so TensorFlow allows us to just save a model to a file, such that later, if we want to use the model we’ve learned, use the weights that we’ve learned to make some sort of new prediction, we can just use the model that already exists. So what we’re doing here is, after we’ve done all the calculation, we go ahead and save the model to a file, such that we can use it a little bit later. So for example, if I go into digits, I’m going to run handwriting.py. I won’t save it this time. We’ll just run it and go ahead and see what happens. What will happen is we’re going to train the model on all of these samples of handwritten digits. The MNIST data set gives us thousands and thousands of sample handwritten digits in the same format that we can use in order to train. And so now what you’re seeing is this training process. And unlike the banknotes case, where there were far fewer data points and the data was very, very simple, here the data is more complex, and this training process takes time. And so this is another one of those cases that shows why computational power is so important when training neural networks: oftentimes you see people wanting to use sophisticated GPUs in order to do this sort of neural network training more efficiently. It also speaks to the reason why more data can be helpful. The more sample data points you have, the better you can begin to do this training. So here we’re going through 60,000 different samples of handwritten digits. And I said we’re going to go through them 10 times. We’re going to go through the data set 10 times, training each time, hopefully improving upon our weights every time we run through the data set. And we can see over here on the right what the accuracy is each time we go ahead and run this model: the first time, it looks like we got an accuracy of about 92% of the digits correct based on this training set. We increased that to 96% or 97%. And every time we run this, we’re hopefully going to see the accuracy improve, as we continue to use gradient descent, that process of running the algorithm to minimize the loss that we get, in order to more accurately predict what the output should be. And what this process is doing is learning not only the weights, but also the features to use, the kernel matrices to use when performing that convolution step. Because this is a convolutional neural network, where I’m first performing those convolutions and then doing the more traditional neural network structure, it’s going to learn all of those individual steps as well. And so here we see that TensorFlow provides me with some very nice output, telling me how many seconds are left in each of these training runs, which allows me to see just how well we’re doing. So we’ll go ahead and see how this network performs. It looks like we’ve gone through the data set seven times. We’re going through it an eighth time now. And at this point, the accuracy is pretty high. We saw we went from 92% up to 97%. Now it looks like 98%. And at this point, it seems like things are starting to level out. There’s probably a limit to how accurate we can ultimately be without running the risk of overfitting. Of course, with enough nodes, you could just memorize the inputs and overfit on them, but we’d like to avoid doing that, and dropout will help us with this. But now we see we’re almost done finishing our training step. We’re at 55,000. All right, we finished training.
And now it’s going to go ahead and test for us on 10,000 samples. And it looks like on the testing set, we were 98.8% accurate. So we ended up doing pretty well, it seems, on this testing set, in terms of how accurately we can predict these handwritten digits. And so what we could do then is actually test it out. I’ve written a program called recognition.py using Pygame. If you pass it a model that’s been trained, and I pre-trained an example model using this input data, what we can do is see whether or not we’ve been able to train this convolutional neural network to be able to predict handwriting, for example. So I can try just drawing a handwritten digit. I’ll go ahead and draw the number 2, for example. So there’s my number 2. Again, this is messy. If you tried to imagine how you would write a program with just ifs and thens to be able to do this sort of calculation, it would be tricky to do so. But here I’ll press Classify, and all right, it seems I was able to correctly classify that what I drew was the number 2. I’ll go ahead and reset it, and try it again. We’ll draw an 8, for example. So here is an 8. Press Classify. And all right, it predicts that the digit that I drew was an 8. And the key here is that this really begins to show the power of what the neural network is doing: somehow looking at various different features of these different pixels, figuring out what the relevant features are, and figuring out how to combine them to get a classification. And this would be a difficult task to provide explicit instructions to the computer on how to do, to use a whole bunch of ifs and thens to process all these pixel values to figure out what the handwritten digit is. Everyone’s going to draw their 8s a little bit differently. If I drew the 8 again, it would look a little bit different. And yet, ideally, we want to train a network to be robust enough that it begins to learn these patterns on its own. All I said was, here is the structure of the network, and here is the data on which to train the network. And the network learning algorithm just tries to figure out what the optimal set of weights is, what the optimal set of filters is, in order to be able to accurately classify a digit into one category or another. It just goes to show the power of these sorts of convolutional neural networks. And so that then was a look at how we can use convolutional neural networks to begin to solve problems with regards to computer vision, the ability to take an image and begin to analyze it. So this is the type of analysis you might imagine happening in self-driving cars, which are able to figure out what filters to apply to an image to understand what it is that the computer is looking at, or the same type of idea that might be applied to facial recognition in social media, to be able to determine how to recognize faces in an image as well. You can imagine a neural network that, instead of classifying into one of 10 different digits, could instead classify whether this is person A or person B, trying to tell those people apart just based on convolution. And so now what we’ll take a look at is yet another type of neural network that can be quite popular for certain types of tasks. But to do so, we’ll try to generalize and think about our neural network a little bit more abstractly.
Here we have a sample deep neural network, where we have this input layer, a whole bunch of different hidden layers that are performing certain types of calculations, and then an output layer that just generates some sort of output that we care about calculating. But we could imagine representing this a little more simply, like this. Here is just a more abstract representation of our neural network. We have some input, which might be a vector of a whole bunch of different values. That gets passed into a network that performs some sort of calculation or computation, and that network produces some sort of output. That output might be a single value. It might be a whole bunch of different values. But this is the general structure of the neural network that we’ve seen. There is some sort of input that gets fed into the network, and using that input, the network calculates what the output should be. And this sort of model for a neural network is what we might call a feed-forward neural network. Feed-forward neural networks have connections only in one direction. They move from one layer to the next layer to the layer after that, such that the inputs pass through various different hidden layers and then ultimately produce some sort of output. So feed-forward neural networks are very helpful for solving the types of classification problems that we saw before. We have a whole bunch of input. We want to learn what setting of weights will allow us to calculate the output effectively. But there are some limitations on feed-forward neural networks that we’ll see in a moment. In particular, the input needs to be of a fixed shape, with a fixed number of neurons in the input layer, and there’s a fixed shape for the output, with a fixed number of neurons in the output layer. And that has some limitations of its own. A possible solution to this, and we’ll see examples of the types of problems we can solve with this in just a second, is that instead of just a feed-forward neural network, where there are only connections in one direction, from left to right effectively across the network, we could also imagine a recurrent neural network, where a recurrent neural network generates output that gets fed back into itself as input for future runs of that network. So whereas in a traditional neural network, we have inputs that get fed into the network, which produces the output, and the only thing that determines the output is the original input and the calculation we do inside of the network itself, in a recurrent neural network you can imagine output from the network feeding back to itself, into the network again, as input for the next time you do the calculations inside of the network. What this allows is for the network to maintain some sort of state, to store some sort of information that can be used on future runs of the network. Previously, the network just defined some weights, and we passed inputs through the network, and it generated outputs. But the network wasn’t saving any information based on those inputs to be able to remember for future iterations or for future runs. What a recurrent neural network will let us do is let the network store information that gets passed back in as input to the network again, the next time we try and perform some sort of action. And this is particularly helpful when dealing with sequences of data.
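To make the idea of carrying state concrete, here is a tiny sketch of the recurrence at the heart of such a network. The weights are random and untrained, purely for illustration:

import numpy as np

# The same weights are reused at every time step, and the hidden state carries information forward
rng = np.random.default_rng(0)
W_x = rng.normal(size=(4, 3))  # input-to-hidden weights
W_h = rng.normal(size=(4, 4))  # hidden-to-hidden weights
state = np.zeros(4)

# Feed in a 5-step sequence of 3-value inputs, updating the state each time
for x in [rng.normal(size=3) for _ in range(5)]:
    state = np.tanh(W_x @ x + W_h @ state)

print(state)  # the final state summarizes the whole sequence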
So we’ll see a real world example of this right now, actually. Microsoft has developed an AI known as the caption bot. And what the caption bot does is it says, I can understand the content of any photograph, and I’ll try to describe it as well as any human. I’ll analyze your photo, but I won’t store it or share it. And so what Microsoft’s caption bot seems to be claiming to do is it can take an image and figure out what’s in the image and just give us a caption to describe it. So let’s try it out. Here, for example, is an image of Harvard Square. It’s some people walking in front of one of the buildings at Harvard Square. I’ll go ahead and take the URL for that image, and I’ll paste it into caption bot and just press Go. So caption bot is analyzing the image, and then it says, I think it’s a group of people walking in front of a building, which seems amazing. The AI is able to look at this image and figure out what’s in the image. And the important thing to recognize here is that this is no longer just a classification task. We saw being able to classify images with a convolutional neural network where the job was take the image and then figure out, is it a 0 or a 1 or a 2, or is it this person’s face or that person’s face? What seems to be happening here is the input is an image, and we know how to get networks to take input of images, but the output is text. It’s a sentence. It’s a phrase, like a group of people walking in front of a building. And this would seem to pose a challenge for our more traditional feed-forward neural networks, for the reason being that in traditional neural networks, we just have a fixed-size input and a fixed-size output. There are a certain number of neurons in the input to our neural network and a certain number of outputs for our neural network, and then some calculation that goes on in between. But the size of the inputs and the number of values in the input and the number of values in the output, those are always going to be fixed based on the structure of the neural network. And that makes it difficult to imagine how a neural network could take an image like this and say it’s a group of people walking in front of the building because the output is text, like it’s a sequence of words. Now, it might be possible for a neural network to output one word, one word you could represent as a vector of values, and you can imagine ways of doing that. Next time, we’ll talk a little bit more about AI as it relates to language and language processing. But a sequence of words is much more challenging because depending on the image, you might imagine the output is a different number of words. We could have sequences of different lengths, and somehow we still want to be able to generate the appropriate output. And so the strategy here is to use a recurrent neural network, a neural network that can feed its own output back into itself as input for the next time. And this allows us to do what we call a one-to-many relationship for inputs to outputs, that in vanilla, more traditional neural networks, these are what we might consider to be one-to-one neural networks. You pass in one set of values as input. You get one vector of values as the output. But in this case, we want to pass in one value as input, the image, and we want to get a sequence, many values as output, where each value is like one of these words that gets produced by this particular algorithm. 
And so the way we might do this is we might imagine starting by providing input, the image, into our neural network. And the neural network is going to generate output, but the output is not going to be the whole sequence of words, because we can’t represent the whole sequence of words using just a fixed set of neurons. Instead, the output is just going to be the first word. We’re going to train the network to output what the first word of the caption should be. And you could imagine that Microsoft has trained this by running a whole bunch of training samples through the AI, giving it a whole bunch of pictures and what the appropriate caption was, and having the AI begin to learn from that. But now, because the network generates output that can be fed back into itself, you could imagine the output of the network being fed back into the same network. This here looks like a separate network, but it’s really the same network that’s just getting different input, that this network’s output gets fed back into itself, but it’s going to generate another output. And that other output is going to be the second word in the caption. And this recurrent neural network then, this network is going to generate other output that can be fed back into itself to generate yet another word, fed back into itself to generate another word. And so recurrent neural networks allow us to represent this one-to-many structure. You provide one image as input, and the neural network can pass data into the next run of the network, and then again and again, such that you could run the network multiple times, each time generating a different output still based on that original input. And this is where recurrent neural networks become particularly useful when dealing with sequences of inputs or outputs. And my output is a sequence of words, and since I can’t very easily represent outputting an entire sequence of words, I’ll instead output that sequence one word at a time by allowing my network to pass information about what still needs to be said about the photo into the next stage of running the network. So you could run the network multiple times, the same network with the same weights, just getting different input each time. First, getting input from the image, and then getting input from the network itself as additional information about what additionally needs to be given in a particular caption, for example. So this then is a one-to-many relationship inside of a recurrent neural network, but it turns out there are other models that we can use, other ways we can try and use recurrent neural networks to be able to represent data that might be stored in other forms as well. We saw how we could use neural networks in order to analyze images in the context of convolutional neural networks that take an image, figure out various different properties of the image, and are able to draw some sort of conclusion based on that. But you might imagine that something like YouTube, they need to be able to do a lot of learning based on video. They need to look through videos to detect if they’re like copyright violations, or they need to be able to look through videos to maybe identify what particular items are inside of the video, for example. And video, you might imagine, is much more difficult to put in as input to a neural network, because whereas an image, you could just treat each pixel as a different value, videos are sequences. They’re sequences of images, and each sequence might be of different length. 
And so it might be challenging to represent that entire video as a single vector of values that you could pass in to a neural network. And so here, too, recurrent neural networks can be a valuable solution for trying to solve this type of problem. Then instead of just passing in a single input into our neural network, we could pass in the input one frame at a time, you might imagine. First, taking the first frame of the video, passing it into the network, and then maybe not having the network output anything at all yet. Let it take in another input, and this time, pass it into the network. But the network gets information from the last time we provided an input into the network. Then we pass in a third input, and then a fourth input, where each time, what the network gets is it gets the most recent input, like each frame of the video. But it also gets information the network processed from all of the previous iterations. So on frame number four, you end up getting the input for frame number four plus information the network has calculated from the first three frames. And using all of that data combined, this recurrent neural network can begin to learn how to extract patterns from a sequence of data as well. And so you might imagine, if you want to classify a video into a number of different genres, like an educational video, or a music video, or different types of videos, that’s a classification task, where you want to take as input each of the frames of the video, and you want to output something like what it is, what category that it happens to belong to. And you can imagine doing this sort of thing, this sort of many-to-one learning, any time your input is a sequence. And so input is a sequence in the context of video. It could be in the context of, like, if someone has typed a message and you want to be able to categorize that message, like if you’re trying to take a movie review and trying to classify it as, is it a positive review or a negative review? That input is a sequence of words, and the output is a classification, positive or negative. There, too, a recurrent neural network might be helpful for analyzing sequences of words. And they’re quite popular when it comes to dealing with language. Could even be used for spoken language as well, that spoken language is an audio waveform that can be segmented into distinct chunks. And each of those could be passed in as an input into a recurrent neural network to be able to classify someone’s voice, for instance. If you want to do voice recognition to say, is this one person or is this another, here are also cases where you might want this many-to-one architecture for a recurrent neural network. And then as one final problem, just to take a look at in terms of what we can do with these sorts of networks, imagine what Google Translate is doing. So what Google Translate is doing is it’s taking some text written in one language and converting it into text written in some other language, for example, where now this input is a sequence of data. It’s a sequence of words. And the output is a sequence of words as well. It’s also a sequence. So here we want effectively a many-to-many relationship. Our input is a sequence and our output is a sequence as well. And it’s not quite going to work to just say, take each word in the input and translate it into a word in the output. Because ultimately, different languages put their words in different orders. And maybe one language uses two words for something, whereas another language only uses one. 
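Before moving on to translation, here is a rough sketch of what the many-to-one setup described above might look like in Keras, for a task like classifying a movie review as positive or negative. The vocabulary size, embedding size, and other details are entirely made up for illustration:

import tensorflow as tf

# Map word indices to vectors, read the sequence with an LSTM, and keep only the final state
model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),
    tf.keras.layers.LSTM(64),
    # One output unit: the probability that the review is positive
    tf.keras.layers.Dense(1, activation="sigmoid")
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])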
So we really want some way to take this information, this input, encode it somehow, and use that encoding to generate what the output ultimately should be. And one of the big advancements in automated translation technology has been the ability to use neural networks to do this instead of older, more traditional methods. And this has improved accuracy dramatically. And the way you might imagine doing this is, again, using a recurrent neural network with multiple inputs and multiple outputs. We start by passing in all the input. Input goes into the network. Another input, like another word, goes into the network. And we do this multiple times, like once for each word in the input that I'm trying to translate. And only after all of that is done does the network now start to generate output, like the first word of the translated sentence, and the next word of the translated sentence, so on and so forth, where each time the network passes information to itself by allowing for this model of giving some sort of state from one run in the network to the next run, assembling information about all the inputs, and then passing along information about which part of the output to generate next. And there are a number of different types of these sorts of recurrent neural networks. One of the most popular is known as the long short-term memory neural network, otherwise known as LSTM. But in general, these types of networks can be very, very powerful whenever we're dealing with sequences, whether those are sequences of images or especially sequences of words when it comes to dealing with natural language. And so those, then, were just some of the different types of neural networks that can be used to do all sorts of different computations. And these are incredibly versatile tools that can be applied to a number of different domains. We only looked at a couple of the most popular types of neural networks, from more traditional feed-forward neural networks to convolutional neural networks and recurrent neural networks. But there are other types as well. There are adversarial networks, where networks compete with each other to try and be able to generate new types of data, as well as other networks that can solve other tasks based on what they happen to be structured and adapted for. And these are very powerful tools in machine learning, from being able to very easily learn based on some set of input data and to be able to, therefore, figure out how to calculate some function from inputs to outputs, whether it's input to some sort of classification like analyzing an image and getting a digit, or machine translation where the input is in one language and the output is in another. These tools have a lot of applications for machine learning more generally. Next time, we'll look at machine learning and AI in particular in the context of natural language. We talked a little bit about this today, but looking at how it is that our AI can begin to understand natural language and can begin to be able to analyze and do useful tasks with regards to human language, which turns out to be a challenging and interesting task. So we'll see you next time.

And welcome back, everybody, to our final class in an introduction to artificial intelligence with Python. Now, so far in this class, we've been taking problems that we want to solve intelligently and framing them in ways that computers are going to be able to make sense of.
We’ve been taking problems and framing them as search problems or constraint satisfaction problems or optimization problems, for example. In essence, we have been trying to communicate about problems in ways that our computer is going to be able to understand. Today, the goal is going to be to get computers to understand the way you and I communicate naturally via our own natural languages, languages like English. But natural language contains a lot of nuance and complexity that’s going to make it challenging for computers to be able to understand. So we’ll need to explore some new tools and some new techniques to allow computers to make sense of natural language. So what is it exactly that we’re trying to get computers to do? Well, they all fall under this general heading of natural language processing, getting computers to work with natural language. And these tasks include tasks like automatic summarization. Given a long text, can we train the computer to be able to come up with a shorter representation of it? Information extraction, getting the computer to pull out relevant facts or details out of some text. Machine translation, like Google Translate, translating some text from one language into another language. Question answering, if you’ve ever asked a question to your phone or had a conversation with an AI chatbot where you provide some text to the computer, the computer is able to understand that text and then generate some text in response. Text classification, where we provide some text to the computer and the computer assigns it a label, positive or negative, inbox or spam, for example. And there are several other kinds of tasks that all fall under this heading of natural language processing. But before we take a look at how the computer might try to solve these kinds of tasks, it might be useful for us to think about language in general. What are the kinds of challenges that we might need to deal with as we start to think about language and getting a computer to be able to understand it? So one part of language that we’ll need to consider is the syntax of language. Syntax is all about the structure of language. Language is composed of individual words. And those words are composed together in some kind of structured whole. And if our computer is going to be able to understand language, it’s going to need to understand something about that structure. So let’s take a couple of examples. Here, for instance, is a sentence. Just before 9 o’clock, Sherlock Holmes stepped briskly into the room. That sentence is made up of words. And those words together form a structured whole. This is syntactically valid as a sentence. But we could take some of those same words, rearrange them, and come up with a sentence that is not syntactically valid. Here, for example, just before Sherlock Holmes 9 o’clock stepped briskly the room is still composed of valid words. But they’re not in any kind of logical whole. This is not a syntactically well-formed sentence. Another interesting challenge is that some sentences will have multiple possible valid structures. Here’s a sentence, for example. I saw the man on the mountain with a telescope. And here, this is a valid sentence. But it actually has two different possible structures that lend themselves to two different interpretations and two different meanings. Maybe I, the one doing the seeing, am the one with the telescope. Or maybe the man on the mountain is the one with the telescope. And so natural language is ambiguous. 
Sometimes the same sentence can be interpreted in multiple ways. And that’s something that we’ll need to think about as well. And this lends itself to another problem within language that we’ll need to think about, which is semantics. While syntax is all about the structure of language, semantics is about the meaning of language. It’s not enough for a computer just to know that a sentence is well-structured if it doesn’t know what that sentence means. And so semantics is going to concern itself with the meaning of words and the meaning of sentences. So if we go back to that same sentence as before, just before 9 o’clock, Sherlock Holmes stepped briskly into the room, I could come up with another sentence, say the sentence, a few minutes before 9, Sherlock Holmes walked quickly into the room. And those are two different sentences with some of the words the same and some of the words different. But the two sentences have essentially the same meaning. And so ideally, whatever model we build, we’ll be able to understand that these two sentences, while different, mean something very similar. Some syntactically well-formed sentences don’t mean anything at all. A famous example from linguist Noam Chomsky is the sentence, colorless green ideas sleep furiously. This is a syntactically, structurally well-formed sentence. We’ve got adjectives modifying a noun, ideas. We’ve got a verb and an adverb in the correct positions. But when taken as a whole, the sentence doesn’t really mean anything. And so if our computers are going to be able to work with natural language and perform tasks in natural language processing, these are some concerns we’ll need to think about. We’ll need to be thinking about syntax. And we’ll need to be thinking about semantics. So how could we go about trying to teach a computer how to understand the structure of natural language? Well, one approach we might take is by starting by thinking about the rules of natural language. Our natural languages have rules. In English, for example, nouns tend to come before verbs. Nouns can be modified by adjectives, for example. And so if only we could formalize those rules, then we could give those rules to a computer, and the computer would be able to make sense of them and understand them. And so let’s try to do exactly that. We’re going to try to define a formal grammar. Where a formal grammar is some system of rules for generating sentences in a language. This is going to be a rule-based approach to natural language processing. We’re going to give the computer some rules that we know about language and have the computer use those rules to make sense of the structure of language. And there are a number of different types of formal grammars. Each one of them has slightly different use cases. But today, we’re going to focus specifically on one kind of grammar known as a context-free grammar. So how does the context-free grammar work? Well, here is a sentence that we might want a computer to generate. She saw the city. And we’re going to call each of these words a terminal symbol. A terminal symbol, because once our computer has generated the word, there’s nothing else for it to generate. Once it’s generated the sentence, the computer is done. We’re going to associate each of these terminal symbols with a non-terminal symbol that generates it. So here we’ve got n, which stands for noun, like she or city. We’ve got v as a non-terminal symbol, which stands for a verb. And then we have d, which stands for determiner. 
A determiner is a word like the or a or an in English, for example. So each of these non-terminal symbols can generate the terminal symbols that we ultimately care about generating. But how do we know, or how does the computer know, which non-terminal symbols are associated with which terminal symbols? Well, to do that, we need some kind of rule. Here are what we call rewriting rules, which have a non-terminal symbol on the left-hand side of an arrow. And on the right side is what that non-terminal symbol can be replaced with. So here we're saying the non-terminal symbol N, again, which stands for noun, could be replaced by any of these options separated by vertical bars. N could be replaced by she or city or car or Harry. D for determiner could be replaced by the, a, or an, and so forth. Each of these non-terminal symbols could be replaced by any of these words. We can also have non-terminal symbols that are replaced by other non-terminal symbols. Here is an interesting rule: NP → N | D N. So what does that mean? Well, NP stands for a noun phrase. Sometimes when we have a noun phrase in a sentence, it's not just a single word, it could be multiple words. And so here we're saying a noun phrase could be just a noun, or it could be a determiner followed by a noun. So we might have a noun phrase that's just a noun, like she, that's a noun phrase. Or we could have a noun phrase that's multiple words, something like the city also acts as a noun phrase. But in this case, it's composed of two words, a determiner, the, and a noun, city. We could do the same for verb phrases. A verb phrase, or VP, might be just a verb, or it might be a verb followed by a noun phrase. So we could have a verb phrase that's just a single word, like the word walked, or we could have a verb phrase that is an entire phrase, something like saw the city, as an entire verb phrase. A sentence, meanwhile, we might then define as a noun phrase followed by a verb phrase. And so this would allow us to generate a sentence like she saw the city, an entire sentence made up of a noun phrase, which is just the word she, and then a verb phrase, which is saw the city, saw which is a verb, and then the city, which itself is also a noun phrase. And so if we could give these rules to a computer, explaining to it what non-terminal symbols could be replaced by what other symbols, then a computer could take a sentence and begin to understand the structure of that sentence. And so let's take a look at an example of how we might do that. And to do that, we're going to use a Python library called NLTK, or the Natural Language Toolkit, which we'll see a couple of times today. It contains a lot of helpful features and functions that we can use for trying to deal with and process natural language. So here we'll take a look at how we can use NLTK in order to parse a context-free grammar. So let's go ahead and open up cfg0.py, cfg standing for context-free grammar. And what you'll see in this file is that I first import NLTK, the Natural Language Toolkit. And the first thing I do is define a context-free grammar, saying that a sentence is a noun phrase followed by a verb phrase. I'm defining what a noun phrase is, defining what a verb phrase is, and then giving some examples of what I can do with these non-terminal symbols, D for determiner, N for noun, and V for verb. We're going to use NLTK to parse that grammar. Then we'll ask the user for some input in the form of a sentence and split it into words.
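Putting that description together, the heart of cfg0.py might look something like this; it's a sketch, and the exact word lists in the grammar are assumptions.

```python
import nltk

# A context-free grammar: a sentence is a noun phrase followed by a
# verb phrase, and each non-terminal can be rewritten as shown.
grammar = nltk.CFG.fromstring("""
    S -> NP VP

    NP -> N | D N
    VP -> V | V NP

    D -> "the" | "a" | "an"
    N -> "she" | "city" | "car" | "Harry"
    V -> "saw" | "walked"
""")

parser = nltk.ChartParser(grammar)

# Ask the user for a sentence and split it into words.
sentence = input("Sentence: ").split()

# Try to parse the sentence and print the resulting syntax trees.
try:
    for tree in parser.parse(sentence):
        tree.pretty_print()
except ValueError:
    print("No parse tree possible.")
```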
And then we’ll use this context-free grammar parser to try to parse that sentence and print out the resulting syntax tree. So let’s take a look at an example. We’ll go ahead and go into my cfg directory, and we’ll run cfg0.py. And here I’m asked to type in a sentence. Let’s say I type in she walked. And when I do that, I see that she walked is a valid sentence, where she is a noun phrase, and walked is the corresponding verb phrase. I could try to do this with a more complex sentence too. I could do something like she saw the city. And here we see that she is the noun phrase, and then saw the city is the entire verb phrase that makes up this sentence. So that was a very simple grammar. Let’s take a look at a slightly more complex grammar. Here is cfg1.py, where a sentence is still a noun phrase followed by a verb phrase, but I’ve added some other possible non-terminal symbols too. I have AP for adjective phrase and PP for prepositional phrase. And we specified that we could have an adjective phrase before a noun phrase or a prepositional phrase after a noun, for example. So lots of additional ways that we might try to structure a sentence and interpret and parse one of those resulting sentences. So let’s see that one in action. We’ll go ahead and run cfg1.py with this new grammar. And we’ll try a sentence like she saw the wide street. Here, Python’s NLTK is able to parse that sentence and identify that she saw the wide street has this particular structure, a sentence with a noun phrase and a verb phrase, where that verb phrase has a noun phrase that within it contains an adjective. And so it’s able to get some sense for what the structure of this language actually is. Let’s try another example. Let’s say she saw the dog with the binoculars. And we’ll try that sentence. And here, we get one possible syntax tree, she saw the dog with the binoculars. But notice that this sentence is actually a little bit ambiguous in our own natural language. Who has the binoculars? Is it she who has the binoculars or the dog who has the binoculars? And NLTK is able to identify both possible structures for the sentence. In this case, the dog with the binoculars is an entire noun phrase. It’s all underneath this NP here. So it’s the dog that has the binoculars. But we also got an alternative parse tree, where the dog is just the noun phrase. And with the binoculars is a prepositional phrase modifying saw. So she saw the dog and she used the binoculars in order to see the dog as well. So this allows us to get a sense for the structure of natural language. But it relies on us writing all of these rules. And it would take a lot of effort to write all of the rules for any possible sentence that someone might write or say in the English language. Language is complicated. And as a result, there are going to be some very complex rules. So what else might we try? We might try to take a statistical lens towards approaching this problem of natural language processing. If we were able to give the computer a lot of existing data of sentences written in the English language, what could we try to learn from that data? Well, it might be difficult to try and interpret long pieces of text all at once. So instead, what we might want to do is break up that longer text into smaller pieces of information instead. In particular, we might try to create n-grams out of a longer sequence of text. An n-gram is just some contiguous sequence of n items from a sample of text. 
It might be n characters in a row or n words in a row, for example. So let's take a passage from Sherlock Holmes. And let's look for all of the trigrams. A trigram is an n-gram where n is equal to 3. So in this case, we're looking for sequences of three words in a row. So the trigrams here would be phrases like how often have. That's three words in a row. Often have I is another trigram. Have I said, I said to, said to you, to you that. These are all trigrams, sequences of three words that appear in sequence. And if we could give the computer a large corpus of text and have it pull out all of the trigrams in this case, it could get a sense for what sequences of three words tend to appear next to each other in our own natural language and, as a result, get some sense for what the structure of the language actually is. So let's take a look at an example of that. How can we use NLTK to try to get access to information about n-grams? So here, we're going to open up ngrams.py. And this is a Python program that's going to load a corpus of data, just some text files, into our computer's memory. And then we're going to use NLTK's ngrams function, which is going to go through the corpus of text, pulling out all of the ngrams for a particular value of n. And then, by using Python's counter class, we're going to figure out what are the most common ngrams inside of this entire corpus of text. And we're going to need a data set in order to do this. And I've prepared a data set of some of the stories of Sherlock Holmes. So it's just a bunch of text files. A lot of words for it to analyze. And as a result, we'll get a sense for what sequences of two or three words tend to be most common in natural language. So let's give this a try. We'll go into my ngrams directory. And we'll run ngrams.py. We'll try an n value of 2. So we're looking for sequences of two words in a row. And we'll use our corpus of stories from Sherlock Holmes. And when we run this program, we get a list of the most common ngrams where n is equal to 2, otherwise known as a bigram. So the most common one is of the. That's a sequence of two words that appears quite frequently in natural language. Then in the. And it was. These are all common sequences of two words that appear in a row. Let's instead now try running ngrams with n equal to 3. Let's get all of the trigrams and see what we get. And now we see the most common trigrams are it was a. One of the. I think that. These are all sequences of three words that appear quite frequently. And we were able to do this essentially via a process known as tokenization. Tokenization is the process of splitting a sequence of characters into pieces. In this case, we're splitting a long sequence of text into individual words and then looking at sequences of those words to get a sense for the structure of natural language.
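Stepping back to the n-gram counting itself, the core of ngrams.py might look roughly like this; the directory name, the word filtering, and the reliance on NLTK's tokenizer data are assumptions.

```python
import os
from collections import Counter

import nltk  # assumes the punkt tokenizer data has been downloaded

def count_ngrams(n, corpus_dir):
    # Load every text file in the corpus into one long list of words.
    words = []
    for filename in os.listdir(corpus_dir):
        with open(os.path.join(corpus_dir, filename)) as f:
            words.extend(
                word.lower()
                for word in nltk.word_tokenize(f.read())
                if any(c.isalpha() for c in word)
            )

    # Count every contiguous sequence of n words, then print the winners.
    ngrams = Counter(nltk.ngrams(words, n))
    for ngram, count in ngrams.most_common(10):
        print(f"{count}: {ngram}")

count_ngrams(3, "holmes")  # e.g. trigrams over the Sherlock Holmes stories
```

So once we've done this, once we've done the tokenization, once we've built up our corpus of ngrams, what can we do with that information? So the one thing that we might try is we could build a Markov chain, which you might recall from when we talked about probability. Recall that a Markov chain is some sequence of values where we can predict one value based on the values that came before it. And as a result, if we know all of the common ngrams in the English language, what words tend to be associated with what other words in sequence, we can use that to predict what word might come next in a sequence of words.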
And so we could build a Markov chain for language in order to try to generate natural language that follows the same statistical patterns as some input data. So let's take a look at that and build a Markov chain for natural language. And as input, I'm going to use the works of William Shakespeare. So here I have a file Shakespeare.txt, which is just a bunch of the works of William Shakespeare. It's a long text file, so plenty of data to analyze. And here in generator.py, I'm using a third party Python library in order to do this analysis. We're going to read in the sample of text, and then we're going to train a Markov model based on that text. And then we're going to have the Markov chain generate some sentences. We're going to generate sentences that don't appear in the original text but that follow the same statistical patterns, generated based on the n-grams by trying to predict what word is likely to come next based on those statistical patterns. So we'll go ahead and go into our Markov directory, run this generator with the works of William Shakespeare as input. And what we're going to get are five new sentences, where these sentences are not necessarily sentences from the original input text itself, but ones that just follow the same statistical patterns. It's predicting what word is likely to come next based on the input data that we've seen and the types of words that tend to appear in sequence there too. And so we're able to generate these sentences. Of course, so far, there's no guarantee that any of the sentences that are generated actually mean anything or make any sense. They just happen to follow the statistical patterns that our computer is already aware of. So we'll return to this issue of how to generate text in perhaps a more accurate or more meaningful way a little bit later.
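The transcript doesn't name the third-party library, but markovify is one common choice for exactly this job, and under that assumption generator.py might look roughly like this:

```python
import sys

import markovify

# Read the input corpus, e.g. shakespeare.txt, from the command line.
with open(sys.argv[1]) as f:
    text = f.read()

# Train a Markov model on the text, then generate five sentences that
# follow its statistical patterns without appearing in it verbatim.
model = markovify.Text(text)
for _ in range(5):
    print(model.make_sentence())
    print()
```

So let's now turn our attention to a slightly different problem, and that's the problem of text classification. Text classification is the problem where we have some text and we want to put that text into some kind of category. We want to apply some sort of label to that text. And this kind of problem shows up in a wide variety of places. A common place might be your email inbox, for example. You get an email and you want your computer to be able to identify whether the email belongs in your inbox or whether it should be filtered out into spam. So we need to classify the text. Is it a good email or is it spam? Another common use case is sentiment analysis. We might want to know whether the sentiment of some text is positive or negative. And so how might we do that? This comes up in situations like product reviews, where we might have a bunch of reviews for a product on some website. My grandson loved it, so much fun. Product broke after a few days. One of the best games I've played in a long time. And kind of cheap and flimsy, not worth it. Here are some example sentences that you might see on a product review website. And you and I could pretty easily look at this list of product reviews and decide which ones are positive and which ones are negative. We might say the first one and the third one, those seem like positive sentiment messages. But the second one and the fourth one seem like negative sentiment messages. But how did we know that? And how could we train a computer to be able to figure that out as well? Well, you might have clued in on particular key words, where those particular words tend to mean something positive or negative.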
So you might have identified words like loved and fun and best tend to be associated with positive messages. And words like broke and cheap and flimsy tend to be associated with negative messages. So if only we could train a computer to be able to learn what words tend to be associated with positive versus negative messages, then maybe we could train a computer to do this kind of sentiment analysis as well. So we’re going to try to do just that. We’re going to use a model known as the bag of words model, which is a model that represents text as just an unordered collection of words. For the purpose of this model, we’re not going to worry about the sequence and the ordering of the words, which word came first, second, or third. We’re just going to treat the text as a collection of words in no particular order. And we’re losing information there, right? The order of words is important. And we’ll come back to that a little bit later. But for now, to simplify our model, it’ll help us tremendously just to think about text as some unordered collection of words. And in particular, we’re going to use the bag of words model to build something known as a naive Bayes classifier. So what is a naive Bayes classifier? Well, it’s a tool that’s going to allow us to classify text based on Bayes rule, again, which you might remember from when we talked about probability. Bayes rule says that the probability of B given A is equal to the probability of A given B multiplied by the probability of B divided by the probability of A. So how are we going to use this rule to be able to analyze text? Well, what are we interested in? We’re interested in the probability that a message has a positive sentiment and the probability that a message has a negative sentiment, which I’m here for simplicity going to represent just with these emoji, happy face and frown face, as positive and negative sentiment. And so if I had a review, something like my grandson loved it, then what I’m interested in is not just the probability that a message has positive sentiment, but the conditional probability that a message has positive sentiment given that this is the message my grandson loved it. But how do I go about calculating this value, the probability that the message is positive given that the review is this sequence of words? Well, here’s where the bag of words model comes in. Rather than treat this review as a string of a sequence of words in order, we’re just going to treat it as an unordered collection of words. We’re going to try to calculate the probability that the review is positive given that all of these words, my grandson loved it, are in the review in no particular order, just this unordered collection of words. And this is a conditional probability, which we can then apply Bayes rule to try to make sense of. And so according to Bayes rule, this conditional probability is equal to what? It’s equal to the probability that all of these four words are in the review given that the review is positive multiplied by the probability that the review is positive divided by the probability that all of these words happen to be in the review. So this is the value now that we’re going to try to calculate. Now, one thing you might notice is that the denominator here, the probability that all of these words appear in the review, doesn’t actually depend on whether or not we’re looking at the positive sentiment or negative sentiment case. So we can actually get rid of this denominator. We don’t need to calculate it. 
We can just say that this probability is proportional to the numerator. And then at the end, we're going to need to normalize the probability distribution to make sure that all of the values sum up to the value 1. So now, how do we calculate this value? Well, this is the probability of all of these words given positive times the probability of positive. And that, by the definition of joint probability, is just one big joint probability, the probability that all of these things are the case, that it's a positive review, and that all four of these words are in the review. But still, it's not entirely obvious how we calculate that value. And here is where we need to make one more assumption. And this is where the naive part of naive Bayes comes in. We're going to make the assumption that all of the words are independent of each other. And by that, I mean that if the word grandson is in the review, that doesn't change the probability that the word loved is in the review or that the word it is in the review, for example. And in practice, this assumption might not be true. It's almost certainly the case that the probabilities of words do depend on each other. But it's going to simplify our analysis and still give us reasonably good results just to assume that the words are independent of each other and they only depend on whether it's positive or negative. You might, for example, expect the word loved to appear more often in a positive review than in a negative review. So what does that mean? Well, if we make this assumption, then we can say that this value, the probability we're interested in, is not directly proportional to, but it's naively proportional to, this value: the probability that the review is positive times the probability that my is in the review, given that it's positive, times the probability that grandson is in the review, given that it's positive, and so on for the other two words that happen to be in this review. And now this value, which looks a little more complex, is actually a value that we can calculate pretty easily. So how are we going to estimate the probability that the review is positive? Well, if we have some training data, some example data of example reviews where each one has already been labeled as positive or negative, then we can estimate the probability that a review is positive just by counting the number of positive samples and dividing by the total number of samples that we have in our training data. And for the conditional probabilities, the probability of loved, given that it's positive, well, that's going to be the number of positive samples with loved in it divided by the total number of positive samples. So let's take a look at an actual example to see how we could try to calculate these values. Here I've put together some sample data. The way to interpret the sample data is that based on the training data, 49% of the reviews are positive, 51% are negative. And then over here in this table, we have some conditional probabilities. We have that if the review is positive, then there is a 30% chance that my appears in it. And if the review is negative, there is a 20% chance that my appears in it. And based on our training data, among the positive reviews, 1% of them contain the word grandson. And among the negative reviews, 2% contain the word grandson. So using this data, let's try to calculate this value, the value we're interested in. And to do that, we'll need to multiply all of these values together.
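Concretely, the arithmetic might go like this. The transcript only specifies the probabilities for my and grandson, so the values for loved and it below are assumptions, chosen for illustration so that the numbers come out to the 0.68 and 0.32 described next.

```python
# P(pos) * P(my|pos) * P(grandson|pos) * P(loved|pos) * P(it|pos)
p_positive = 0.49 * 0.30 * 0.01 * 0.32 * 0.30   # loved/it values are assumed

# P(neg) * P(my|neg) * P(grandson|neg) * P(loved|neg) * P(it|neg)
p_negative = 0.51 * 0.20 * 0.02 * 0.08 * 0.40   # loved/it values are assumed

# The two products don't sum to 1, so normalize them into a distribution.
total = p_positive + p_negative
print(round(p_positive / total, 2))  # 0.68
print(round(p_negative / total, 2))  # 0.32
```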
The probability of positive, and then all of these positive conditional probabilities. And when we do that, we get some value. And then we can do the same thing for the negative case. We're going to do the same thing, take the probability that it's negative, multiply it by all of these conditional probabilities, and we're going to get some other value. And now these values don't sum to one. They're not a probability distribution yet. But I can normalize them and get some values. And that tells me what we're going to predict for my grandson loved it: we think there's a 68% chance, probability 0.68, that it is a positive sentiment review, and a 0.32 probability that it's a negative review. So what problems might we run into here? What could potentially go wrong when doing this kind of analysis in order to analyze whether text has a positive or negative sentiment? Well, a couple of problems might arise. One problem might be, what if the word grandson never appears for any of the positive reviews? If that were the case, then when we try to calculate the value, the probability that we think the review is positive, we're going to multiply all these values together, and we're just going to get 0 for the positive case, because we're ultimately going to multiply by that 0 value. And so we're going to say that we think there is no chance that the review is positive because it contains the word grandson, and in our training data, we've never seen the word grandson appear in a positive sentiment message before. And that's probably not the right analysis, because in cases of rare words, it might just be that nowhere in our training data did we ever see the word grandson appear in a message that has positive sentiment. So what can we do to solve this problem? Well, one thing we'll often do is some kind of additive smoothing, where we add some value alpha to each value in our distribution just to smooth out the data a little bit. And a common form of this is Laplace smoothing, where we add 1 to each value in our distribution. In essence, we pretend we've seen each value one more time than we actually have. So if we've never seen the word grandson for a positive review, we pretend we've seen it once. If we've seen it once, we pretend we've seen it twice, just to avoid the possibility that we might multiply by 0 and, as a result, get some results we don't want in our analysis. So let's see what this looks like in practice. Let's try to do some naive Bayes classification in order to classify text as either positive or negative. We'll take a look at sentiment.py. And what this is going to do is load some sample data into memory, some examples of positive reviews and negative reviews. And then we're going to train a naive Bayes classifier on all of this training data, training data that includes all of the words we see in positive reviews and all of the words we see in negative reviews. And then we're going to try to classify some input. And so we're going to do this based on a corpus of data. I have some example positive reviews: it was great, so much fun, for example. And then some negative reviews: not worth it, kind of cheap. These are some examples of negative reviews. So now let's try to run this classifier and see how it would classify particular text as either positive or negative. We'll go ahead and run our sentiment analysis on this corpus. And we need to provide it with a review. So I'll say something like, I enjoyed it.
And we see that the classifier says there is about a 0.92 probability that we think that this particular review is positive. Let's try something negative. We'll try kind of overpriced. And we see that there is a 0.96 probability now that we think that this particular review is negative. And so our naive Bayes classifier has learned what kinds of words tend to appear in positive reviews and what kinds of words tend to appear in negative reviews. And as a result of that, we've been able to design a classifier that can predict whether a particular review is positive or negative. And so this definitely is a useful tool that we can use to try and make some predictions. But we had to make some assumptions in order to get there. So what if we want to now try to build some more sophisticated models, use some tools from machine learning to try and take better advantage of language data to be able to draw more accurate conclusions and solve new kinds of tasks and new kinds of problems? Well, we've seen a couple of times now that when we want to take some data and take some input, put it in a way that the computer is going to be able to make sense of, it can be helpful to take that data and turn it into numbers, ultimately. And so what we might want to try to do is come up with some word representation, some way to take a word and translate its meaning into numbers. Because, for example, if we wanted to use a neural network to be able to process language, give our language to a neural network and have it make some predictions or perform some analysis there, a neural network takes as its input and produces as its output a vector of values, a vector of numbers. And so what we might want to do is take our data and somehow take words and convert them into some kind of numeric representation. So how might we do that? How might we take words and turn them into numbers? Let's take a look at an example. Here's a sentence, he wrote a book. And let's say I wanted to take each of those words and turn it into a vector of values. Here's one way I might do that. We'll say he is going to be a vector that has a 1 in the first position and the rest of the values are 0. Wrote will have a 1 in the second position and the rest of the values are 0. A has a 1 in the third position with the rest of the values 0. And book has a 1 in the fourth position with the rest of the values 0. So each of these words now has a distinct vector representation. And this is what we often call a one-hot representation, a representation of the meaning of a word as a vector with a single 1 and all of the rest of the values are 0. And so when doing this, we now have a numeric representation for every word and we could pass those vector representations into a neural network or other models that require some kind of numeric data as input. But this one-hot representation actually has a couple of problems and it's not ideal for a few reasons. One reason is, here we're just looking at four words. But if you imagine a vocabulary of thousands of words or more, these vectors are going to get quite long in order to have a distinct vector for every possible word in a vocabulary. And as a result of that, these longer vectors are going to be more difficult to deal with, more difficult to train, and so forth. And so that might be a problem.
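In code, building such one-hot vectors is straightforward; here's a tiny sketch for the four-word example above.

```python
# One-hot vectors for a tiny four-word vocabulary.
vocabulary = ["he", "wrote", "a", "book"]
one_hot = {
    word: [1 if j == i else 0 for j in range(len(vocabulary))]
    for i, word in enumerate(vocabulary)
}
# one_hot["wrote"] == [0, 1, 0, 0]
# With a vocabulary of 50,000 words, each vector would have 50,000 entries.
```

Another problem is a little bit more subtle.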
If we want to represent a word as a vector, and in particular the meaning of a word as a vector, then ideally it should be the case that words that have similar meanings should also have similar vector representations, so that they're close together inside a vector space. But that's not really going to be the case with these one-hot representations, because if we take some similar words, say the word wrote and the word authored, which mean similar things, they have entirely different vector representations. Likewise, book and novel, those two words mean somewhat similar things, but they have entirely different vector representations because they each have a one in some different position. And so that's not ideal either. So what we might be interested in instead is some kind of distributed representation. A distributed representation is the representation of the meaning of a word distributed across multiple values, instead of just being one-hot with a one in one position. Here is what a distributed representation of words might be. Each word is associated with some vector of values, with the meaning distributed across multiple values, ideally in such a way that similar words have a similar vector representation. But how are we going to come up with those values? Where do those values come from? How can we define the meaning of a word in this distributed sequence of numbers? Well, to do that, we're going to draw inspiration from a quote from British linguist J.R. Firth, who said, you shall know a word by the company it keeps. In other words, we're going to define the meaning of a word based on the words that appear around it, the context words around it. Take, for example, this context: for blank he ate. You might wonder, what words could reasonably fill in that blank? Well, it might be words like breakfast or lunch or dinner. All of those could reasonably fill in that blank. And so what we're going to say is that because the words breakfast and lunch and dinner appear in a similar context, they must have a similar meaning. And that's something our computer could understand and try to learn. A computer could look at a big corpus of text, look at what words tend to appear in similar contexts to each other, and use that to identify which words have a similar meaning and should therefore appear close to each other inside a vector space. And so one common model for doing this is known as the word2vec model. It's a model for generating word vectors, a vector representation for every word, by looking at data and looking at the context in which a word appears. The idea is going to be this. If you start out with all of the words just in some random position in space and train it on some training data, what the word2vec model will do is start to learn what words appear in similar contexts. And it will move these vectors around in such a way that hopefully words with similar meanings, breakfast, lunch, and dinner, book, memoir, novel, will hopefully appear to be near to each other as vectors as well. So let's now take a look at what word2vec might look like in practice when implemented in code. What I have here inside of words.txt is a pre-trained model where each of these words has some vector representation trained by word2vec. Each of these words has some sequence of values representing its meaning, hopefully in such a way that similar words are represented by similar vectors.
I also have this file vectors.py, which is going to open up the words and form them into a dictionary. And we also define some useful functions like distance, to get the distance between two word vectors, and closest words, to find which words are nearby in terms of having close vectors to each other.
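Those helper functions might look something like this; it's a sketch, and the file format and the choice of cosine distance are assumptions.

```python
import numpy as np
from scipy.spatial.distance import cosine

# Assumed file format: one word per line, followed by its vector's values.
words = {}
with open("words.txt") as f:
    for line in f:
        row = line.split()
        words[row[0]] = np.array(row[1:], dtype=float)

def distance(v1, v2):
    # Cosine distance: roughly 0 for very similar vectors, 1 for unrelated ones.
    return cosine(v1, v2)

def closest_words(vector, n=10):
    # Sort the entire vocabulary by distance to the given vector.
    return sorted(words, key=lambda w: distance(vector, words[w]))[:n]
```

And so let's give this a try. We'll go ahead and open a Python interpreter. And I'm going to import these vectors. And we might say, all right, what is the vector representation of the word book? And we get this big long vector that represents the word book as a sequence of values. And this sequence of values by itself is not all that meaningful. But it is meaningful in the context of comparing it to other vectors for other words. So we could use this distance function, which is going to get us the distance between two word vectors. And we might say, what is the distance between the vector representation for the word book and the vector representation for the word novel? And we see that it's 0.34. You can kind of interpret 0 as being really close together and 1 being very far apart. And so now, what is the distance between book and, let's say, breakfast? Well, book and breakfast are more different from each other than book and novel are. So I would hopefully expect the distance to be larger. And in fact, it is 0.64 approximately. These two words are further away from each other. And what about now the distance between, let's say, lunch and breakfast? Well, that's about 0.2. Those are even closer together. They have a meaning that is closer to each other. Another interesting thing we might do is calculate the closest words. We might say, what are the closest words, according to Word2Vec, to the word book? And let's say, let's get the 10 closest words. What are the 10 closest vectors to the vector representation for the word book? And when we perform that analysis, we get this list of words. The closest one is book itself, but we also have books plural, and then essay, memoir, essays, novella, anthology, and so on. All of these words mean something similar to the word book, according to Word2Vec, at least, because they have a similar vector representation. So it seems like we've done a pretty good job of trying to capture this kind of vector representation of word meaning. One other interesting side effect of Word2Vec is that it's also able to capture something about the relationships between words as well. Let's take a look at an example. Here, for instance, are two words, man and king. And these are each represented by Word2Vec as vectors. So what might happen if I subtracted one from the other, calculated the value king minus man? Well, that will be the vector that will take us from man to king, somehow represent this relationship between the vector representation of the word man and the vector representation of the word king. And that's what this value, king minus man, represents. So what would happen if I took the vector representation of the word woman and added that same value, king minus man, to it? What would we get as the closest word to that, for example? Well, we could try it. Let's go ahead and go back to our Python interpreter and give this a try. I could say, what is the closest word to the vector representation of the word king minus the representation of the word man plus the representation of the word woman? And we see that the closest word is the word queen. We've somehow been able to capture the relationship between king and man.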
And then when we apply it to the word woman, we get, as the result, the word queen. So Word2Vec has been able to capture not just the words and how they're similar to each other, but also something about the relationships between words and how those words are connected to each other. So now that we have this vector representation of words, what can we now do with it? Now we can represent words as numbers. And so we might try to pass those words as input to, say, a neural network. Neural networks, we've seen, are very powerful tools for identifying patterns and making predictions. Recall that you can think of a neural network as all of these units, but really what the neural network is doing is taking some input, passing it into the network, and then producing some output. And by providing the neural network with training data, we're able to update the weights inside of the network so that the neural network can do a more accurate job of translating those inputs into those outputs. And now that we can represent words as numbers that could be the input or output, you could imagine passing a word in as input to a neural network and getting a word as output. And so when might that be useful? One common use for neural networks is in machine translation, when we want to translate text from one language into another, say translate English into French by passing English into the neural network and getting some French output. You might imagine, for instance, that we could take the English word for lamp, pass it into the neural network, get the French word for lamp as output. But in practice, when we're translating text from one language to another, we're usually not just interested in translating a single word from one language to another, but a sequence, say a sentence or a paragraph of words. Here, for example, is another passage, again taken from Sherlock Holmes, written in English. And what I might want to do is take that entire sentence, pass it into the neural network, and get as output a French translation of the same sentence. But recall that a neural network's input and output need to be of some fixed size. And a sentence is not a fixed size. It's variable. You might have shorter sentences, and you might have longer sentences. So somehow, we need to solve the problem of translating a sequence into another sequence by means of a neural network. And that's going to be true not only for machine translation, but also for other problems, problems like question answering. If I want to pass as input a question, something like what is the capital of Massachusetts, feed that as input into the neural network, I would hope that what I would get as output is a sentence like the capital is Boston, again, translating some sequence into some other sequence. And if you've ever had a conversation with an AI chatbot, or have ever asked your phone a question, it needs to do something like this. It needs to understand the sequence of words that you, the human, provided as input. And then the computer needs to generate some sequence of words as output. So how can we do this? Well, one tool that we can use is the recurrent neural network, which we took a look at last time, which is a way for us to provide a sequence of values to a neural network by running the neural network multiple times. And each time we run the neural network, what we're going to do is we're going to keep track of some hidden state.
And that hidden state is going to be passed from one run of the neural network to the next run of the neural network, keeping track of all of the relevant information. And so let's take a look at how we can apply that to something like this. And in particular, we're going to look at an architecture known as an encoder-decoder architecture, where we're going to encode this question into some kind of hidden state, and then use a decoder to decode that hidden state into the output that we're interested in. So what's that going to look like? We'll start with the first word, the word what. That goes into our neural network, and it's going to produce some hidden state. This is some information about the word what that our neural network is going to need to keep track of. Then when the second word comes along, we're going to feed it into that same encoder neural network, but it's going to get as input that hidden state as well. So we pass in the second word. We also get the information about the hidden state, and that's going to continue for the other words in the input. This is going to produce a new hidden state. And so then when we get to the third word, the, that goes into the encoder. It also gets access to the hidden state, and then it produces a new hidden state that gets passed into the next run when we use the word capital. And the same thing is going to repeat for the other words that appear in the input. So of Massachusetts, that produces one final piece of hidden state. Now somehow, we need to signal the fact that we're done. There's nothing left in the input. And we typically do this by passing some kind of special token, say an end token, into the neural network. And now the decoding process is going to start. We're going to generate the word the. But in addition to generating the word the, this decoder network is also going to generate some kind of hidden state. And so what happens the next time? Well, to generate the next word, it might be helpful to know what the first word was. So we might pass the first word the back into the decoder network. It's going to get as input this hidden state, and it's going to generate the next word capital. And that's also going to generate some hidden state. And we'll repeat that, passing capital into the network to generate the third word is, and then one more time in order to get the fourth word Boston. And at that point, we're done. But how do we know we're done? Usually, we'll do this one more time, pass Boston into the decoder network, and get as output some end token to indicate that that is the end of our input. And so this then is how we could use a recurrent neural network to take some input, encode it into some hidden state, and then use that hidden state to decode it into the output we're interested in. To visualize it in a slightly different way, we have some input sequence. This is just some sequence of words. That input sequence goes into the encoder, which in this case is a recurrent neural network generating these hidden states along the way until we generate some final hidden state, at which point we start the decoding process. Again, using a recurrent neural network, that's going to generate the output sequence as well. So we've got the encoder, which is encoding the information about the input sequence into this hidden state, and then the decoder, which takes that hidden state and uses it in order to generate the output sequence.
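A compact sketch of that encoder-decoder shape, in PyTorch, might look like the following; the sizes, the one-hot inputs, and the reserved end-token index are all assumptions, and a real system would use learned embeddings and trained weights.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; a real system would use learned word embeddings
# rather than raw one-hot vectors, and trained weights.
vocab_size, hidden_size = 10000, 256
END = 0  # index we reserve for the special end token

encoder = nn.GRU(vocab_size, hidden_size)
decoder = nn.GRU(vocab_size, hidden_size)
to_word = nn.Linear(hidden_size, vocab_size)

def one_hot(index):
    v = torch.zeros(1, 1, vocab_size)  # shape: (sequence, batch, vocabulary)
    v[0, 0, index] = 1
    return v

def answer(input_sequence, max_words=20):
    # Encode: run the encoder over the whole input sequence; all of the
    # information gets packed into the final hidden state.
    _, hidden = encoder(input_sequence)

    # Decode: feed each output word back in to generate the next one,
    # stopping when the network produces the end token.
    word, output = one_hot(END), []
    for _ in range(max_words):
        _, hidden = decoder(word, hidden)
        index = to_word(hidden[-1]).argmax().item()
        if index == END:
            break
        output.append(index)
        word = one_hot(index)
    return output

# e.g. a five-word question, as a (5, 1, vocab_size) tensor of one-hots:
question = torch.cat([one_hot(i) for i in [17, 4, 99, 256, 3]])
print(answer(question))  # word indices (arbitrary here, since untrained)
```

But there are some problems. And for many years, this was the state of the art.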
The recurrent neural network and variants on this approach were some of the best ways we knew in order to perform tasks in natural language processing. But there are some problems that we might want to try to deal with and that have been dealt with over the years to try and improve upon this kind of model. And one problem you might notice happens in this encoder stage. We've taken this input sequence, the sequence of words, and encoded it all into this final piece of hidden state. And that final piece of hidden state needs to contain all of the information from the input sequence that we need in order to generate the output sequence. And while that's possible, it becomes increasingly difficult as the sequence gets larger and larger. For larger and larger input sequences, it's going to become more and more difficult to store all of the information we need about the input inside this single hidden state piece of context. That's a lot of information to pack into just a single value. It might be useful for us, when generating output, to not just refer to this one value, but to all of the previous hidden values that have been generated by the encoder. And so that might be useful, but how could we do that? We've got a lot of different values. We need to combine them somehow. So you could imagine adding them together, taking the average of them, for example. But doing that would assume that all of these pieces of hidden state are equally important. But that's not necessarily true either. Some of these pieces of hidden state are going to be more important than others, depending on what word they most closely correspond to. This piece of hidden state very closely corresponds to the first word of the input sequence. This one very closely corresponds to the second word of the input sequence, for example. And some of those are going to be more important than others. To make matters more complicated, depending on which word of the output sequence we're generating, different input words might be more or less important. And so what we really want is some way to decide for ourselves which of the input values are worth paying attention to, at what point in time. And this is the key idea behind a mechanism known as attention. Attention is all about letting us decide which values are important to pay attention to, when generating, in this case, the next word in our sequence. So let's take a look at an example of that. Here's a sentence. What is the capital of Massachusetts? Same sentence as before. And let's imagine that we were trying to answer that question by generating tokens of output. So what would the output look like? Well, it's going to look like something like the capital is. And let's say we're now trying to generate this last word here. What is that last word? How is the computer going to figure it out? Well, what it's going to need to do is decide which values it's going to pay attention to. And so the attention mechanism will allow us to calculate some attention scores for each word, some value corresponding to each word, determining how relevant it is for us to pay attention to that word right now. And in this case, when generating the fourth word of the output sequence, the most important words to pay attention to might be capital and Massachusetts, for example. Those words are going to be particularly relevant. And there are a number of different mechanisms that have been used in order to calculate these attention scores.
It could be something as simple as a dot product to see how similar two vectors are, or we could train an entire neural network to calculate these attention scores. But the key idea is that during the training process for our neural network, we're going to learn how to calculate these attention scores. Our model is going to learn what is important to pay attention to in order to decide what the next word should be. So the result of all of this, calculating these attention scores, is that we can calculate some value, some value for each input word, determining how important it is for us to pay attention to that particular value. And recall that each of these input words is also associated with one of these hidden state context vectors, capturing information about the sentence up to that point, but primarily focused on that word in particular. And so what we can now do, if we have all of these vectors and values representing how important it is for us to pay attention to each of them, is take a weighted average. We can take all of these vectors, multiply them by their attention scores, and add them up to get some new vector value, which is going to represent the context from the input, but specifically paying attention to the words that we think are most important. And once we've done that, that context vector can be fed into our decoder in order to say that the word should be, in this case, Boston. So attention is this very powerful tool that allows any word, when we're trying to decode it, to decide which words from the input we should pay attention to in order to determine what's important for generating the next word of the output. And one of the first places this was really used was in the field of machine translation. Here's an example of a diagram from the paper that introduced this idea, which was focused on trying to translate English sentences into French sentences. So we have an input English sentence up along the top, and then along the left side, the output French equivalent of that same sentence. And what you see in all of these squares are the attention scores visualized, where a lighter square indicates a higher attention score. And what you'll notice is that there's a strong correspondence between the French word and the equivalent English word, that the French word for agreement is really paying attention to the English word for agreement in order to decide what French word should be generated at that point in time. And sometimes you might pay attention to multiple words if you look at the French word for economic. That's primarily paying attention to the English word for economic, but also paying attention to the English word for European in this case too. And so attention scores are very easy to visualize to get a sense for what is our machine learning model really paying attention to, what information is it using in order to determine what's important and what's not in order to determine what the ultimate output token should be. And so when we combine the attention mechanism with a recurrent neural network, we can get very powerful and useful results where we're able to generate an output sequence by paying attention to the input sequence too.
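Numerically, the dot-product version of that attention step might look like this; the shapes are assumptions for illustration.

```python
import torch

# Hypothetical setup: 6 input words, each with a 256-value hidden state
# vector from the encoder, plus the decoder's current hidden state.
hidden_states = torch.randn(6, 256)
decoder_state = torch.randn(256)

# Attention scores: a dot product measuring how similar each input word's
# hidden state is to where the decoder currently is.
scores = hidden_states @ decoder_state        # shape: (6,)

# Normalize the scores into weights that sum to 1, then take the
# weighted average of the hidden state vectors.
weights = torch.softmax(scores, dim=0)
context = weights @ hidden_states             # shape: (256,)

# "context" summarizes the input while focusing on the words with the
# highest attention scores; it would be fed to the decoder next.
```

But there are other problems with this approach of using a recurrent neural network as well. In particular, notice that every run of the neural network depends on the output of the previous step.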
But there are other problems with this approach of using a recurrent neural network as well. In particular, notice that every run of the neural network depends on the output of the previous step. And that was important for getting a sense for the sequence of words and the ordering of those particular words. But we can’t run this unit of the neural network until after we’ve calculated the hidden state from the run before it, from the previous input token. And what that means is that it’s very difficult to parallelize this process. As the input sequence gets longer and longer, we might want to use parallelism to try and speed up this process of training the neural network and making sense of all of this language data. But it’s difficult to do that, and it’s slow to do that with a recurrent neural network, because all of it needs to be performed in sequence. And that’s become an increasing challenge as we’ve started to get larger and larger language models. The more language data that we have available to use to train our machine learning models, the more accurate they can be, the better representation of language they can have, the better understanding they can have, and the better results that we can see. And so we’ve seen this growth of large language models that are using larger and larger data sets. But as a result, they take longer and longer to train. And so this limitation, that recurrent neural networks are not easy to parallelize, has become an increasing problem. And as a result of that, that was one of the main motivations for a different architecture for thinking about how to deal with natural language. And that’s known as the transformer architecture. And this has been a significant milestone in the world of natural language processing for really increasing how well we can perform these kinds of natural language processing tasks, as well as how quickly we can train a machine learning model to be able to produce effective results. There are a number of different types of transformers in terms of how they work. But what we’re going to take a look at here is the basic architecture for how one might work with a transformer to get a sense for what’s involved and what we’re doing. So let’s start with the model we were looking at before, specifically at this encoder part of our encoder-decoder architecture, where we used a recurrent neural network to take this input sequence and capture all of this information about the hidden state and the information we need to know about that input sequence. Right now, it all needs to happen in this linear progression. But what the transformer is going to allow us to do is process each of the words independently, in a way that’s easy to parallelize, rather than have each word wait for some other word. Each word is going to go through this same neural network and produce some kind of encoded representation of that particular input word. And all of this is going to happen in parallel. Now, it’s happening for all of the words at once, but we’re really just going to focus on what’s happening for one word to make it clear. But know that whatever you’re seeing happen for this one word is going to happen for all of the other input words, too. So what’s going on here? Well, we start with some input word. That input word goes into the neural network. And the output is hopefully some encoded representation of the input word, the information we need to know about the input word that’s going to be relevant to us as we’re generating the output. And because we’re doing this for each word independently, it’s easy to parallelize. We don’t have to wait for the previous word before we run this word through the neural network.
But what did we lose in this process by trying to parallelize this whole thing? Well, we’ve lost all notion of word ordering. The order of words is important. The sentence “Sherlock Holmes gave the book to Watson” has a different meaning than “Watson gave the book to Sherlock Holmes.” And so we want to keep track of that information about word position. In the recurrent neural network, that happened for us automatically, because we could run each word one at a time through the neural network, get the hidden state, and pass it on to the next run of the neural network. But that’s not the case here with the transformer, where each word is being processed independently of all of the other ones. So what are we going to do to try to solve that problem? One thing we can do is add some kind of positional encoding to the input word. The positional encoding is some vector that represents the position of the word in the sentence: this is the first word, the second word, the third word, and so forth. We’re going to add that to the input word. And the result of that is going to be a vector that captures multiple pieces of information: it captures the input word itself as well as where in the sentence it appears. The result is that we can pass the output of that addition, the addition of the input word and the positional encoding, into the neural network. That way, the neural network knows the word and where it appears in the sentence, and can use both of those pieces of information to determine how best to represent the meaning of that word in the encoded representation at the end of it. In addition to the positional encoding and this feed-forward neural network, we’re also going to add one additional component, which is going to be a self-attention step. This is going to be attention where we’re paying attention to the other input words, because the meaning or interpretation of an input word might vary depending on the other words in the input as well. And so we’re going to allow each word in the input to decide what other words in the input it should pay attention to in order to decide on its encoded representation. And that’s going to allow us to get a better encoded representation for each word, because words are defined by their context, by the words around them and how they’re used in that particular context. This kind of self-attention is so valuable, in fact, that oftentimes the transformer will use multiple different self-attention layers at the same time, to allow for this model to be able to pay attention to multiple facets of the input at the same time. We call this multi-headed attention, where each attention head can pay attention to something different. And as a result, this network can learn to pay attention to many different parts of the input for this input word all at the same time. And in the spirit of deep learning, these two steps, this multi-headed self-attention layer and this neural network layer, can themselves be repeated multiple times, too, in order to get a deeper representation, in order to learn deeper patterns within the input text, and ultimately get a better representation of language in order to get useful encoded representations of all of the input words. And so this is the process that a transformer might use in order to take an input word and get its encoded representation. And the key idea is to really rely on this attention step in order to get information that’s useful in order to determine how to encode that word.
And that process is going to repeat for all of the input words that are in the input sequence. We’re going to take all of the input words, encode them with some kind of positional encoding, feed those into these self-attention and feed-forward neural networks in order to ultimately get these encoded representations of the words. That’s the result of the encoder. We get all of these encoded representations that will be useful to us when it comes time then to try to decode all of this information into the output sequence we’re interested in. And again, this might take place in the context of machine translation, where the output is going to be the same sentence in a different language, or it might be an answer to a question in the case of an AI chatbot, for example. And so now let’s take a look at how that decoder is going to work. Ultimately, it’s going to have a very similar structure. Any time we’re trying to generate the next output word, we need to know what the previous output word is, as well as its positional encoding. Where in the output sequence are we? And we’re going to have these same steps, self-attention, because we might want an output word to be able to pay attention to other words in that same output, as well as a neural network. And that might itself repeat multiple times. But in this decoder, we’re going to add one additional step. We’re going to add an additional attention step, where instead of self-attention, where the output word is going to pay attention to other output words, in this step, we’re going to allow the output word to pay attention to the encoded representations. So recall that the encoder is taking all of the input words and transforming them into these encoded representations of all of the input words. But it’s going to be important for us to be able to decide which of those encoded representations we want to pay attention to when generating any particular token in the output sequence. And that’s what this additional attention step is going to allow us to do. It’s saying that every time we’re generating a word of the output, we can pay attention to the other words in the output, because we might want to know, what are the words we’ve generated previously? And we want to pay attention to some of them to decide what word is going to be next in the sequence. But we also care about paying attention to the input words, too. And we want the ability to decide which of these encoded representations of the input words are going to be relevant in order for us to generate the next step. And so these two pieces combine together. We have this encoder that takes all of the input words and produces this encoded representation. And we have this decoder that is able to take the previous output word, pay attention to that encoded input, and then generate the next output word. And this is one of the possible architectures we could use for a transformer, with the key idea being these attention steps that allow words to pay attention to each other. During the training process here, we can now much more easily parallelize this, because we don’t have to wait for all of the words to happen in sequence. And we can learn how we should perform these attention steps. The model is able to learn what is important to pay attention to, what things do I need to pay attention to, in order to be more accurate at predicting what the output word is. And this has proved to be a tremendously effective model for conversational AI agents, for building machine translation systems. 
And there have been many variants proposed on this model, too. Some transformers only use an encoder. Some only use a decoder. Some use some other combination of these different particular features. But the key ideas ultimately remain the same: this real focus on trying to pay attention to what is most important. And the world of natural language processing is fast growing and fast evolving. Year after year, we keep coming up with new models that allow us to do an even better job of performing these natural language related tasks, all in the service of solving this tricky problem, which is our own natural language. We’ve seen how the syntax and semantics of our language are ambiguous, which introduces all of these new challenges that we need to think about if we’re going to be able to design AI agents that are able to work with language effectively. So as we think about where we’ve been in this class, we’ve looked at artificial intelligence in a wide variety of different forms now. We started by taking a look at search problems, where we looked at how AI can search for solutions, play games, and find the optimal decision to make. We talked about knowledge, how AI can represent information that it knows and use that information to generate new knowledge as well. Then we looked at what AI can do when it’s less certain, when it doesn’t know things for sure, and we have to represent things in terms of probability. We then took a look at optimization problems. We saw how a lot of problems in AI can be boiled down to trying to maximize or minimize some function. And we looked at strategies that AI can use in order to do that kind of maximizing and minimizing. We then looked at the world of machine learning: learning from data in order to figure out some patterns and identify how to perform a task by looking at the training data available. And one of the most powerful tools there was the neural network, a sequence of units whose weights can be trained in order to allow us to really effectively go from input to output and predict how to get there by learning these underlying patterns. And then today, we took a look at language itself, trying to understand how we can train the computer to understand our natural language, to understand syntax and semantics, and to make sense of and generate natural language, which introduces a number of interesting problems too. And we’ve really just scratched the surface of artificial intelligence. There is so much interesting research and so many interesting new techniques, algorithms, and ideas being introduced to try to solve these types of problems. So I hope you enjoyed this exploration into the world of artificial intelligence. A huge thanks to all of the course’s teaching staff and production team for making the class possible. This was an introduction to artificial intelligence with Python.

    Deep Learning and Machine Learning with PyTorch

    Short-Answer Quiz

    Instructions: Answer the following questions in 2-3 sentences each.

    1. What are the key differences between a scalar, a vector, a matrix, and a tensor in PyTorch?
    2. How can you determine the number of dimensions of a tensor in PyTorch?
    3. Explain the concept of “shape” in relation to PyTorch tensors.
    4. Describe how to create a PyTorch tensor filled with ones and specify its data type.
    5. What is the purpose of the torch.zeros_like() function?
    6. How do you convert a PyTorch tensor from one data type to another?
    7. Explain the importance of ensuring tensors are on the same device and have compatible data types for operations.
    8. What are tensor attributes, and provide two examples?
    9. What is tensor broadcasting, and what are the two key rules for its operation?
    10. Define tensor aggregation and provide two examples of aggregation functions in PyTorch.

    Short-Answer Quiz Answer Key

    1. In PyTorch, a scalar is a single number, a vector is an array of numbers with direction, a matrix is a 2-dimensional array of numbers, and a tensor is a multi-dimensional array that encompasses scalars, vectors, and matrices. All of these are represented as torch.Tensor objects in PyTorch.
    2. The number of dimensions of a tensor can be determined using the tensor.ndim attribute, which returns the number of dimensions or axes present in the tensor.
    3. The shape of a tensor refers to the number of elements along each dimension of the tensor. It is represented as a tuple, where each element in the tuple corresponds to the size of each dimension.
    4. To create a PyTorch tensor filled with ones, use torch.ones(size) where size is a tuple specifying the desired dimensions. To specify the data type, use the dtype parameter, for example, torch.ones(size, dtype=torch.float64).
    5. The torch.zeros_like() function creates a new tensor filled with zeros, having the same shape and data type as the input tensor. It is useful for quickly creating a tensor with the same structure but with zero values.
    6. To convert a PyTorch tensor from one data type to another, use the .type() method, specifying the desired data type as an argument. For example, to convert a tensor to float16: tensor = tensor.type(torch.float16).
    7. PyTorch operations require tensors to be on the same device (CPU or GPU) and have compatible data types for successful computation. Performing operations on tensors with mismatched devices or incompatible data types will result in errors.
    8. Tensor attributes provide information about the tensor’s properties. Two examples are:
    • dtype: Specifies the data type of the tensor elements.
    • shape: Represents the dimensionality of the tensor as a tuple.
    9. Tensor broadcasting allows operations between tensors with different shapes by automatically expanding the smaller tensor to match the larger one under certain conditions. The two key rules for broadcasting are:
    • Shapes are compared element-wise, starting from the trailing (rightmost) dimension.
    • Two dimensions are compatible when they are equal, or when one of them is 1 (a short sketch follows this answer key).
    10. Tensor aggregation involves reducing the elements of a tensor to a single value using specific functions. Two examples are:
    • torch.min(): Finds the minimum value in a tensor.
    • torch.mean(): Calculates the average value of the elements in a tensor.
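    As referenced in answer 9, here is a minimal sketch of the broadcasting rules in action; the tensor shapes are illustrative assumptions:

    import torch

    a = torch.ones(3, 4)                     # shape (3, 4)
    b = torch.tensor([1.0, 2.0, 3.0, 4.0])   # shape (4,)

    # Trailing dimensions are compared: 4 matches 4, and b is expanded
    # along the missing leading dimension, so the result has shape (3, 4).
    print((a + b).shape)                     # torch.Size([3, 4])

    c = torch.ones(3, 1)                     # shape (3, 1)
    # A size of 1 is compatible with any size, so c is stretched to (3, 4).
    print((a * c).shape)                     # torch.Size([3, 4])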

    Essay Questions

    1. Discuss the concept of dimensionality in PyTorch tensors. Explain how to create tensors with different dimensions and demonstrate how to access specific elements within a tensor. Provide examples and illustrate the relationship between dimensions, shape, and indexing.
    2. Explain the importance of data types in PyTorch. Describe different data types available for tensors and discuss the implications of choosing specific data types for tensor operations. Provide examples of data type conversion and highlight potential issues arising from data type mismatches.
    3. Compare and contrast the torch.reshape(), torch.view(), and torch.permute() functions. Explain their functionalities, use cases, and any potential limitations or considerations. Provide code examples to illustrate their usage.
    4. Discuss the purpose and functionality of the PyTorch nn.Module class. Explain how to create custom neural network modules by subclassing nn.Module. Provide a code example demonstrating the creation of a simple neural network module with at least two layers.
    5. Describe the typical workflow for training a neural network model in PyTorch. Explain the steps involved, including data loading, model creation, loss function definition, optimizer selection, training loop implementation, and model evaluation. Provide a code example outlining the essential components of the training process.

    Glossary of Key Terms

    Tensor: A multi-dimensional array, the fundamental data structure in PyTorch.

    Dimensionality: The number of axes or dimensions present in a tensor.

    Shape: A tuple representing the size of each dimension in a tensor.

    Data Type: The type of values stored in a tensor (e.g., float32, int64).

    Tensor Broadcasting: Automatically expanding the dimensions of tensors during operations to enable compatibility.

    Tensor Aggregation: Reducing the elements of a tensor to a single value using functions like min, max, or mean.

    nn.Module: The base class for building neural network modules in PyTorch.

    Forward Pass: The process of passing input data through a neural network to obtain predictions.

    Loss Function: A function that measures the difference between predicted and actual values during training.

    Optimizer: An algorithm that adjusts the model’s parameters to minimize the loss function.

    Training Loop: Iteratively performing forward passes, loss calculation, and parameter updates to train a model.

    Device: The hardware used for computation (CPU or GPU).

    Data Loader: An iterable that efficiently loads batches of data for training or evaluation.

    Exploring Deep Learning with PyTorch

    Fundamentals of Tensors

    1. Understanding Tensors

    • Introduction to tensors, the fundamental data structure in PyTorch.
    • Differentiating between scalars, vectors, matrices, and tensors.
    • Exploring tensor attributes: dimensions, shape, and indexing.

    2. Manipulating Tensors

    • Creating tensors with varying data types, devices, and gradient tracking.
    • Performing arithmetic operations on tensors and managing potential data type errors.
    • Reshaping tensors, understanding the concept of views, and employing stacking operations like torch.stack, torch.vstack, and torch.hstack.
    • Utilizing torch.squeeze to remove single dimensions and torch.unsqueeze to add them (a short sketch follows this list).
    • Practicing advanced indexing techniques on multi-dimensional tensors.
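    As noted above, here is a minimal sketch of these reshaping operations; the tensor values are illustrative assumptions:

    import torch

    x = torch.arange(6)               # shape (6,)
    print(x.reshape(2, 3).shape)      # torch.Size([2, 3])

    v = x.view(3, 2)                  # a view shares memory with x
    v[0, 0] = 99                      # also changes x[0]

    stacked = torch.stack([x, x])     # new leading dimension: shape (2, 6)

    y = x.unsqueeze(0)                # add a single dimension: shape (1, 6)
    print(y.squeeze().shape)          # remove it again: torch.Size([6])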

    3. Tensor Aggregation and Comparison

    • Exploring tensor aggregation with functions like torch.min, torch.max, and torch.mean.
    • Utilizing torch.argmin and torch.argmax to find the indices of minimum and maximum values.
    • Understanding element-wise tensor comparison and its role in machine learning tasks.

    Building Neural Networks

    4. Introduction to torch.nn

    • Introducing the torch.nn module, the cornerstone of neural network construction in PyTorch.
    • Exploring the concept of neural network layers and their role in transforming data.
    • Utilizing matplotlib for data visualization and understanding PyTorch version compatibility.

    5. Linear Regression with PyTorch

    • Implementing a simple linear regression model using PyTorch.
    • Generating synthetic data, splitting it into training and testing sets.
    • Defining a linear model with parameters, understanding gradient tracking with requires_grad.
    • Setting up a training loop, iterating through epochs, performing forward and backward passes, and optimizing model parameters.

    6. Non-Linear Regression with PyTorch

    • Transitioning from linear to non-linear regression.
    • Introducing non-linear activation functions like ReLU and Sigmoid.
    • Visualizing the impact of activation functions on data transformations.
    • Implementing custom ReLU and Sigmoid functions and comparing them with PyTorch’s built-in versions.

    Working with Datasets and Data Loaders

    7. Multi-Class Classification with PyTorch

    • Exploring multi-class classification using the make_blobs dataset from scikit-learn.
    • Setting hyperparameters for data creation, splitting data into training and testing sets.
    • Visualizing multi-class data with matplotlib and understanding the relationship between features and labels.
    • Converting NumPy arrays to PyTorch tensors, managing data type consistency between NumPy and PyTorch.

    8. Building a Multi-Class Classification Model

    • Constructing a multi-class classification model using PyTorch.
    • Defining a model class, utilizing linear layers and activation functions.
    • Implementing the forward pass, calculating logits and probabilities.
    • Setting up a training loop, calculating loss, performing backpropagation, and optimizing model parameters.

    9. Model Evaluation and Prediction

    • Evaluating the trained multi-class classification model.
    • Making predictions using the model and converting probabilities to class labels.
    • Visualizing model predictions and comparing them to true labels.

    10. Introduction to Data Loaders

    • Understanding the importance of data loaders in PyTorch for efficient data handling.
    • Implementing data loaders using torch.utils.data.DataLoader for both training and testing data.
    • Exploring data loader attributes and understanding their role in data batching and shuffling.

    11. Building a Convolutional Neural Network (CNN)

    • Introduction to CNNs, a specialized architecture for image and sequence data.
    • Implementing a CNN using PyTorch’s nn.Conv2d layer, understanding concepts like kernels, strides, and padding.
    • Flattening convolutional outputs using nn.Flatten and connecting them to fully connected layers.
    • Defining a CNN model class, implementing the forward pass, and understanding the flow of data through the network (a minimal sketch follows this list).
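    As a companion to this outline, here is a minimal sketch of that pattern; the class name, layer sizes, and 28x28 input size are illustrative assumptions:

    import torch
    from torch import nn

    class TinyCNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv = nn.Conv2d(in_channels=1, out_channels=8,
                                  kernel_size=3, stride=1, padding=1)
            self.flatten = nn.Flatten()
            self.fc = nn.Linear(8 * 28 * 28, 10)  # assumes 28x28 inputs, 10 classes

        def forward(self, x):
            x = torch.relu(self.conv(x))   # (batch, 8, 28, 28) since padding=1
            x = self.flatten(x)            # (batch, 8*28*28)
            return self.fc(x)              # (batch, 10)

    model = TinyCNN()
    print(model(torch.randn(1, 1, 28, 28)).shape)  # torch.Size([1, 10])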

    12. Training and Evaluating a CNN

    • Setting up a training loop for the CNN model, utilizing device-agnostic code for CPU and GPU compatibility.
    • Implementing helper functions for training and evaluation, calculating loss, accuracy, and training time.
    • Visualizing training progress, tracking loss and accuracy over epochs.

    13. Transfer Learning with Pre-trained Models

    • Exploring the concept of transfer learning, leveraging pre-trained models for faster training and improved performance.
    • Introducing torchvision, a library for computer vision tasks, and understanding its dataset and model functionalities.
    • Implementing data transformations using torchvision.transforms for data augmentation and pre-processing.

    14. Custom Datasets and Data Augmentation

    • Creating custom datasets using torch.utils.data.Dataset for managing image data.
    • Implementing data transformations for resizing, converting to tensors, and normalizing images.
    • Visualizing data transformations and understanding their impact on image data.
    • Implementing data augmentation techniques to increase data variability and improve model robustness.

    15. Advanced CNN Architectures and Optimization

    • Exploring advanced CNN architectures, understanding concepts like convolutional blocks, residual connections, and pooling layers.
    • Implementing a more complex CNN model using convolutional blocks and exploring its performance.
    • Optimizing the training process, introducing learning rate scheduling and momentum-based optimizers.

    Briefing Doc: Deep Dive into PyTorch for Deep Learning

    This briefing document summarizes key themes and concepts extracted from excerpts of the “748-PyTorch for Deep Learning & Machine Learning – Full Course.pdf” focusing on PyTorch fundamentals, tensor manipulation, model building, and training.

    Core Themes:

    1. Tensors: The Heart of PyTorch:
    • Understanding Tensors:
    • Tensors are multi-dimensional arrays representing numerical data in PyTorch.
    • Understanding dimensions, shapes, and data types of tensors is crucial.
    • Scalar, Vector, Matrix, and Tensor are different names for tensors with varying dimensions.
    • “Dimension is like the number of square brackets… the shape of the vector is two. So we have two by one elements. So that means a total of two elements.”
    • Manipulating Tensors:
    • Reshaping, viewing, stacking, squeezing, and unsqueezing tensors are essential for preparing data.
    • Indexing and slicing allow access to specific elements within a tensor.
    • “Reshape has to be compatible with the original dimensions… view of a tensor shares the same memory as the original input.”
    • Tensor Operations:
    • PyTorch provides various operations for manipulating tensors, including arithmetic, aggregation, and matrix multiplication.
    • Understanding broadcasting rules is vital for performing element-wise operations on tensors of different shapes.
    • “The min of this tensor would be 27. So you’re turning it from nine elements to one element, hence aggregation.”
    2. Building Neural Networks with PyTorch:
    • torch.nn Module:
    • This module provides building blocks for constructing neural networks, including layers, activation functions, and loss functions.
    • nn.Module is the base class for defining custom models.
    • “nn is the building block layer for neural networks. And within nn, so nn stands for neural network, is module.”
    • Model Construction:
    • Defining a model involves creating layers and arranging them in a specific order.
    • nn.Sequential allows stacking layers in a sequential manner.
    • Custom models can be built by subclassing nn.Module and defining the forward method.
    • “Can you see what’s going on here? So as you might have guessed, sequential, it implements most of this code for us”
    • Parameters and Gradients:
    • Model parameters are tensors that store the model’s learned weights and biases.
    • Gradients are used during training to update these parameters.
    • requires_grad=True enables gradient tracking for a tensor.
    • “Requires grad optional. If the parameter requires gradient. Hmm. What does requires gradient mean? Well, let’s come back to that in a second.”
    3. Training Neural Networks:
    • Training Loop:
    • The training loop iterates over the dataset multiple times (epochs) to optimize the model’s parameters.
    • Each iteration involves a forward pass (making predictions), calculating the loss, performing backpropagation, and updating parameters.
    • “Epochs, an epoch is one loop through the data…So epochs, we’re going to start with one. So one time through all of the data.”
    • Optimizers:
    • Optimizers, like Stochastic Gradient Descent (SGD), are used to update model parameters based on the calculated gradients.
    • “Optimizer zero grad, loss backward, optimizer step, step, step.” (A device-agnostic sketch of these steps follows this list.)
    • Loss Functions:
    • Loss functions measure the difference between the model’s predictions and the actual targets.
    • The choice of loss function depends on the specific task (e.g., mean squared error for regression, cross-entropy for classification).
    4. Data Handling and Visualization:
    • Data Loading:
    • PyTorch provides DataLoader for efficiently iterating over datasets in batches.
    • “DataLoader, this creates a python iterable over a data set.”
    • Data Transformations:
    • The torchvision.transforms module offers various transformations for preprocessing images, such as converting to tensors, resizing, and normalization.
    • Visualization:
    • matplotlib is a commonly used library for visualizing data and model outputs.
    • Visualizing data and model predictions is crucial for understanding the learning process and debugging potential issues.
    5. Device Agnostic Code:
    • PyTorch allows running code on different devices (CPU or GPU).
    • Writing device agnostic code ensures flexibility and portability.
    • “Device agnostic code for the model and for the data.”
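    Pulling these themes together, here is a minimal, device-agnostic sketch of one optimization step; the model, data, and learning rate are illustrative assumptions:

    import torch
    from torch import nn

    device = "cuda" if torch.cuda.is_available() else "cpu"  # device-agnostic setup

    model = nn.Linear(in_features=2, out_features=1).to(device)
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    X = torch.randn(8, 2).to(device)   # a small made-up batch
    y = torch.randn(8, 1).to(device)

    model.train()
    y_pred = model(X)              # 1. forward pass
    loss = loss_fn(y_pred, y)      # 2. calculate the loss
    optimizer.zero_grad()          # 3. zero gradients from the previous step
    loss.backward()                # 4. backpropagation
    optimizer.step()               # 5. update the parameters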

    Important Facts:

    • PyTorch’s default tensor data type is torch.float32.
    • CUDA (Compute Unified Device Architecture) enables utilizing GPUs for accelerated computations.
    • torch.no_grad() disables gradient tracking, often used during inference or evaluation.
    • torch.argmax finds the index of the maximum value in a tensor.

    Next Steps:

    • Explore different model architectures (CNNs, RNNs, etc.).
    • Implement various optimizers and loss functions.
    • Work with more complex datasets and tasks.
    • Experiment with hyperparameter tuning.
    • Dive deeper into PyTorch’s documentation and tutorials.

    Traditional Programming vs. Machine Learning

    Traditional programming involves providing the computer with data and explicit rules to generate output. Machine learning, on the other hand, involves providing the computer with data and desired outputs, allowing the computer to learn the rules for itself. [1, 2]

    Here’s a breakdown of the differences, illustrated with the example of creating a program for cooking a Sicilian grandmother’s roast chicken dish:

    Traditional Programming

    • Input: Vegetables, chicken
    • Rules: Cut vegetables, season chicken, preheat oven, cook chicken for 30 minutes, add vegetables (rules are explicitly programmed)
    • Output: Roast chicken dish

    Machine Learning

    • Input: A collection of various ingredients (vegetables, chicken)
    • Desired output: A collection of images showcasing the finished roast chicken dish
    • Process: The machine learning algorithm analyzes the inputs and desired outputs to figure out the rules (patterns) for transforming the ingredients into the desired dish. [2-4]

    Key Differences in Approach

    • In traditional programming, humans define the rules, while in machine learning, the computer learns the rules. [2, 3]
    • Traditional programming requires a complete understanding of the problem and the steps to solve it. In contrast, machine learning can be used for complex problems where explicitly defining all the rules would be challenging. [5]

    Why Use Machine Learning?

    Machine learning excels in scenarios where:

    • Complex problems involve long lists of rules: For instance, teaching a computer to drive a car involves countless rules that would be difficult to program manually. [5]
    • Environments are constantly changing: A self-driving car needs to adapt to unpredictable situations on the road. Machine learning enables algorithms to learn and adjust based on new data. [6]
    • Large datasets offer valuable insights: Machine learning algorithms can uncover hidden patterns and relationships within massive datasets. [7]

    The sources also emphasize that machine learning isn’t always the solution. Google’s number one rule of machine learning advises that if a problem can be solved with a simple rule-based system, that approach should be preferred. [8-10]

    Here is how PyTorch leverages tensors and neural networks for deep learning:

    • Tensors: Deep learning relies on numerical data representation. In PyTorch, this is done using tensors. Tensors are multi-dimensional arrays of numbers that can represent various data types, including images, audio, and text [1-3].
    • Neural Networks: Neural networks are a fundamental aspect of deep learning, consisting of interconnected layers that perform mathematical operations on tensors [2, 4-6]. PyTorch provides the building blocks for creating these networks through the torch.nn module [7, 8].
    • GPU Acceleration: PyTorch leverages GPUs (Graphics Processing Units) to accelerate the computation of deep learning models [9]. GPUs excel at number crunching, originally designed for video games but now crucial for deep learning tasks due to their parallel processing capabilities [9, 10]. PyTorch uses CUDA, a parallel computing platform, to interface with NVIDIA GPUs, allowing for faster computations [10, 11].
    • Key Modules:
    • torch.nn: Contains layers, loss functions, and other components needed for constructing computational graphs (neural networks) [8, 12].
    • torch.nn.Parameter: Defines learnable parameters for the model, often set by PyTorch layers [12].
    • torch.nn.Module: The base class for all neural network modules; models should subclass this and override the forward method [12].
    • torch.optim: Contains optimizers that help adjust model parameters during training through gradient descent [13].
    • torch.utils.data.Dataset: The base class for creating custom datasets [14].
    • torch.utils.data.DataLoader: Creates a Python iterable over a dataset, allowing for batched data loading [14-16].
    • Workflow (a minimal sketch of steps 2-4 follows below):
    1. Data Preparation: Involves loading, preprocessing, and transforming data into tensors [17, 18].
    2. Building a Model: Constructing a neural network by combining different layers from torch.nn [7, 19, 20].
    3. Loss Function: Choosing a suitable loss function to measure the difference between model predictions and the actual targets [21-24].
    4. Optimizer: Selecting an optimizer (e.g., SGD, Adam) to adjust the model’s parameters based on the calculated gradients [21, 22, 24-26].
    5. Training Loop: Implementing a training loop that iteratively feeds data through the model, calculates the loss, backpropagates the gradients, and updates the model’s parameters [22, 24, 27, 28].
    6. Evaluation: Evaluating the trained model on unseen data to assess its performance [24, 28].

    Overall, PyTorch uses tensors as the fundamental data structure and provides the necessary tools (modules, classes, and functions) to construct neural networks, optimize their parameters using gradient descent, and efficiently run deep learning models, often with GPU acceleration.
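    As a minimal illustration of the modules listed above, here is a sketch of a custom nn.Module subclass; the class name and layer sizes are assumptions for illustration:

    import torch
    from torch import nn

    class TwoLayerNet(nn.Module):
        def __init__(self):
            super().__init__()
            # nn layers create and register learnable nn.Parameter tensors
            self.layer_1 = nn.Linear(in_features=4, out_features=16)
            self.layer_2 = nn.Linear(in_features=16, out_features=1)

        def forward(self, x):  # subclasses of nn.Module override forward
            return self.layer_2(torch.relu(self.layer_1(x)))

    model = TwoLayerNet()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # a torch.optim optimizer
    print(model(torch.randn(2, 4)).shape)  # torch.Size([2, 1])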

    Training, Evaluating, and Saving a Deep Learning Model Using PyTorch

    To train a deep learning model with PyTorch, you first need to prepare your data and turn it into tensors [1]. Tensors are the fundamental building blocks of deep learning and can represent almost any kind of data, such as images, videos, audio, or even DNA [2, 3]. Once your data is ready, you need to build or pick a pre-trained model to suit your problem [1, 4].

    • PyTorch offers a variety of pre-built deep learning models through resources like Torch Hub and Torch Vision.Models [5]. These models can be used as is or adjusted for a specific problem through transfer learning [5].
    • If you are building your model from scratch, PyTorch provides a flexible and powerful framework for building neural networks using various layers and modules [6].
    • The torch.nn module contains all the building blocks for computational graphs, another term for neural networks [7, 8].
    • PyTorch also offers layers for specific tasks, such as convolutional layers for image data, linear layers for simple calculations, and many more [9].
    • The torch.nn.Module serves as the base class for all neural network modules [8, 10]. When building a model from scratch, you should subclass nn.Module and override the forward method to define the computations that your model will perform [8, 11].

    After choosing or building a model, you need to select a loss function and an optimizer [1, 4].

    • The loss function measures how wrong your model’s predictions are compared to the ideal outputs [12].
    • The optimizer takes into account the loss of a model and adjusts the model’s parameters, such as weights and biases, to improve the loss function [13].
    • The specific loss function and optimizer you use will depend on the problem you are trying to solve [14].

    With your data, model, loss function, and optimizer in place, you can now build a training loop [1, 13].

    • The training loop iterates through your training data, making predictions, calculating the loss, and updating the model’s parameters to minimize the loss [15].
    • PyTorch implements the mathematical algorithms of back propagation and gradient descent behind the scenes, making the training process relatively straightforward [16, 17].
    • The loss.backward() function calculates the gradients of the loss function with respect to each parameter in the model [18]. The optimizer.step() function then uses those gradients to update the model’s parameters in the direction that minimizes the loss [18].
    • You can monitor the training process by printing out the loss and other metrics [19].

    In addition to a training loop, you also need a testing loop to evaluate your model’s performance on data it has not seen during training [13, 20]. The testing loop is similar to the training loop but does not update the model’s parameters. Instead, it calculates the loss and other metrics to evaluate how well the model generalizes to new data [21, 22].

    To save your trained model, PyTorch provides several methods, including torch.save, torch.load, and torch.nn.Module.load_state_dict [23-25].

    • The recommended way to save and load a PyTorch model is by saving and loading its state dictionary [26].
    • The state dictionary is a Python dictionary object that maps each layer in the model to its parameter tensor [27].
    • You can save the state dictionary using torch.save and load it back in using torch.load and the model’s load_state_dict method [28, 29].

    By following this general workflow, you can train, evaluate, and save deep learning models using PyTorch for a wide range of real-world applications.
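    As a concrete sketch of the state-dictionary approach described above (the file name and the stand-in model are illustrative assumptions):

    import torch
    from torch import nn

    model = nn.Linear(2, 1)   # stand-in for a trained model

    # Save only the state dictionary: parameter tensors keyed by layer name.
    torch.save(model.state_dict(), "model_0.pth")

    # Load: create a fresh instance, then load the saved parameters into it.
    loaded_model = nn.Linear(2, 1)
    loaded_model.load_state_dict(torch.load("model_0.pth"))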

    A Comprehensive Discussion of the PyTorch Workflow

    The PyTorch workflow outlines the steps involved in building, training, and deploying deep learning models using the PyTorch framework. The sources offer a detailed walkthrough of this workflow, emphasizing its application in various domains, including computer vision and custom datasets.

    1. Data Preparation and Loading

    The foundation of any machine learning project lies in data. Getting your data ready is the crucial first step in the PyTorch workflow [1-3]. This step involves:

    • Data Acquisition: Gathering the data relevant to your problem. This could involve downloading existing datasets or collecting your own.
    • Data Preprocessing: Cleaning and transforming the raw data into a format suitable for training a machine learning model. This often includes handling missing values, normalizing numerical features, and converting categorical variables into numerical representations.
    • Data Transformation into Tensors: Converting the preprocessed data into PyTorch tensors. Tensors are multi-dimensional arrays that serve as the fundamental data structure in PyTorch [4-6]. This step uses torch.tensor to create tensors from various data types.
    • Dataset and DataLoader Creation (a short sketch follows this list):
    • Organizing the data into PyTorch datasets using torch.utils.data.Dataset. This involves defining how to access individual samples and their corresponding labels [7, 8].
    • Creating data loaders using torch.utils.data.DataLoader [7, 9-11]. Data loaders provide a Python iterable over the dataset, allowing you to efficiently iterate through the data in batches during training. They handle shuffling, batching, and other data loading operations.
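    As referenced above, here is a minimal sketch of wrapping tensors in a dataset and iterating over it in batches; the data is a made-up assumption:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    features = torch.randn(100, 3)          # made-up preprocessed features
    labels = torch.randint(0, 2, (100,))    # made-up labels

    dataset = TensorDataset(features, labels)   # pairs each sample with its label
    loader = DataLoader(dataset, batch_size=16, shuffle=True)

    for batch_features, batch_labels in loader:  # batched, shuffled iteration
        print(batch_features.shape, batch_labels.shape)
        break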

    2. Building or Picking a Pre-trained Model

    Once your data is ready, the next step is to build or pick a pre-trained model [1, 2]. This is a critical decision that will significantly impact your model’s performance.

    • Pre-trained Models: PyTorch offers pre-built models through resources like Torch Hub and Torch Vision.Models [12].
    • Benefits: Leveraging pre-trained models can save significant time and resources. These models have already learned useful features from large datasets, which can be adapted to your specific task through transfer learning [12, 13].
    • Transfer Learning: Involves fine-tuning a pre-trained model on your dataset, adapting its learned features to your problem. This is especially useful when working with limited data [12, 14].
    • Building from Scratch:
    • When Necessary: You might need to build a model from scratch if your problem is unique or if no suitable pre-trained models exist.
    • PyTorch Flexibility: PyTorch provides the tools to create diverse neural network architectures, including:
    • Multi-layer Perceptrons (MLPs): Composed of interconnected layers of neurons, often using torch.nn.Linear layers [15].
    • Convolutional Neural Networks (CNNs): Specifically designed for image data, utilizing convolutional layers (torch.nn.Conv2d) to extract spatial features [16-18].
    • Recurrent Neural Networks (RNNs): Suitable for sequential data, leveraging recurrent layers to process information over time.

    Key Considerations in Model Building:

    • Subclassing torch.nn.Module: PyTorch models typically subclass nn.Module and override the forward method to define the computational flow [19-23].
    • Understanding Layers: Familiarity with various PyTorch layers (available in torch.nn) is crucial for constructing effective models. Each layer performs specific mathematical operations that transform the data as it flows through the network [24-26].
    • Model Inspection:
    • print(model): Provides a basic overview of the model’s structure and parameters.
    • model.parameters(): Allows you to access and inspect the model’s learnable parameters [27].
    • Torch Info: This package offers a more programmatic way to obtain a detailed summary of your model, including the input and output shapes of each layer [28-30].

    3. Setting Up a Loss Function and Optimizer

    Training a deep learning model involves optimizing its parameters to minimize a loss function. Therefore, choosing the right loss function and optimizer is essential [31-33].

    • Loss Function: Measures the difference between the model’s predictions and the actual target values. The choice of loss function depends on the type of problem you are solving [34, 35]:
    • Regression: Mean Squared Error (MSE) or Mean Absolute Error (MAE) are common choices [36].
    • Binary Classification: Binary Cross Entropy (BCE) is often used [35-39]. PyTorch offers variations like torch.nn.BCELoss and torch.nn.BCEWithLogitsLoss. The latter combines a sigmoid layer with the BCE loss, often simplifying the code [38, 39].
    • Multi-Class Classification: Cross Entropy Loss is a standard choice [35-37].
    • Optimizer: Responsible for updating the model’s parameters based on the calculated gradients to minimize the loss function [31-33, 40]. Popular optimizers in PyTorch include:
    • Stochastic Gradient Descent (SGD): A foundational optimization algorithm [35, 36, 41, 42].
    • Adam: An adaptive optimization algorithm often offering faster convergence [35, 36, 42].

    PyTorch provides various loss functions in torch.nn and optimizers in torch.optim [7, 40, 43].
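    For example, a binary-classification setup might pair BCEWithLogitsLoss with SGD; the layer sizes and learning rate here are illustrative assumptions:

    import torch
    from torch import nn

    model = nn.Linear(in_features=5, out_features=1)   # raw outputs are logits

    # BCEWithLogitsLoss applies the sigmoid internally, so the model can
    # output raw logits directly, with no separate nn.Sigmoid layer.
    loss_fn = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    logits = model(torch.randn(4, 5))                  # shape (4, 1)
    loss = loss_fn(logits, torch.rand(4, 1).round())   # targets are 0.0 or 1.0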

    4. Building a Training Loop

    The heart of the PyTorch workflow lies in the training loop [32, 44-46]. It’s where the model learns patterns in the data through repeated iterations of:

    • Forward Pass: Passing the input data through the model to generate predictions [47, 48].
    • Loss Calculation: Using the chosen loss function to measure the difference between the predictions and the actual target values [47, 48].
    • Back Propagation: Calculating the gradients of the loss with respect to each parameter in the model using loss.backward() [41, 47-49]. PyTorch handles this complex mathematical operation automatically.
    • Parameter Update: Updating the model’s parameters using the calculated gradients and the chosen optimizer (e.g., optimizer.step()) [41, 47, 49]. This step nudges the parameters in a direction that minimizes the loss.

    Key Aspects of a Training Loop:

    • Epochs: The number of times the training loop iterates through the entire training dataset [50].
    • Batches: Dividing the training data into smaller batches to improve computational efficiency and model generalization [10, 11, 51].
    • Monitoring Training Progress: Printing the loss and other metrics during training allows you to track how well the model is learning [50]. You can use techniques like progress bars (e.g., using the tqdm library) to visualize the training progress [52]. (A short sketch follows this list.)
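    A minimal sketch of a training loop over epochs and batches, with a tqdm progress bar; the model, data, and hyperparameters are illustrative assumptions:

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset
    from tqdm import tqdm

    model = nn.Linear(3, 1)
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loader = DataLoader(TensorDataset(torch.randn(64, 3), torch.randn(64, 1)),
                        batch_size=16, shuffle=True)

    epochs = 3
    for epoch in tqdm(range(epochs)):   # progress bar over epochs
        for X, y in loader:             # iterate through the batches
            loss = loss_fn(model(X), y)     # forward pass + loss
            optimizer.zero_grad()
            loss.backward()                 # backpropagation
            optimizer.step()                # parameter update
        print(f"Epoch {epoch} | last batch loss: {loss.item():.4f}")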

    5. Evaluation and Testing Loop

    After training, you need to evaluate your model’s performance on unseen data using a testing loop [46, 48, 53]. The testing loop is similar to the training loop, but it does not update the model’s parameters [48]. Its purpose is to assess how well the trained model generalizes to new data.

    Steps in a Testing Loop:

    • Setting Evaluation Mode: Switching the model to evaluation mode (model.eval()) deactivates certain layers like dropout, which are only needed during training [53, 54].
    • Inference Mode: Using PyTorch’s inference mode (torch.inference_mode()) disables gradient tracking and other computations unnecessary for inference, making the evaluation process faster [53-56].
    • Forward Pass: Making predictions on the test data by passing it through the model [57].
    • Loss and Metric Calculation: Calculating the loss and other relevant metrics (e.g., accuracy, precision, recall) to assess the model’s performance on the test data [53]. (A minimal sketch of these steps follows.)
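    A minimal sketch of the testing-loop steps above; the stand-in model and test split are illustrative assumptions:

    import torch
    from torch import nn

    model = nn.Linear(3, 1)        # stand-in for a trained model
    loss_fn = nn.MSELoss()
    X_test, y_test = torch.randn(20, 3), torch.randn(20, 1)  # made-up test split

    model.eval()                                  # 1. switch to evaluation mode
    with torch.inference_mode():                  # 2. disable gradient tracking
        test_pred = model(X_test)                 # 3. forward pass
        test_loss = loss_fn(test_pred, y_test)    # 4. loss/metric calculation
    print(f"Test loss: {test_loss.item():.4f}")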

    6. Saving and Loading the Model

    Once you have a trained model that performs well, you need to save it for later use or deployment [58]. PyTorch offers different ways to save and load models, including saving the entire model or saving its state dictionary [59].

    • State Dictionary: The recommended way is to save the model’s state dictionary [59, 60], which is a Python dictionary containing the model’s parameters. This approach is more efficient and avoids saving unnecessary information.

    Saving and Loading using State Dictionary:

    • Saving: torch.save(model.state_dict(), 'model_filename.pth')
    • Loading:
    1. Create an instance of the model: loaded_model = MyModel()
    2. Load the state dictionary: loaded_model.load_state_dict(torch.load('model_filename.pth'))

    7. Improving the Model (Iterative Process)

    Building a successful deep learning model often involves an iterative process of experimentation and improvement [61-63]. After evaluating your initial model, you might need to adjust various aspects to enhance its performance. This includes:

    • Hyperparameter Tuning: Experimenting with different values for hyperparameters like learning rate, batch size, and model architecture [64].
    • Data Augmentation: Applying transformations to the training data (e.g., random cropping, flipping, rotations) to increase data diversity and improve model generalization [65]. (See the sketch after this list.)
    • Regularization Techniques: Using techniques like dropout or weight decay to prevent overfitting and improve model robustness.
    • Experiment Tracking: Utilizing tools like TensorBoard or Weights & Biases to track your experiments, log metrics, and visualize results [66]. This can help you gain insights into the training process and make informed decisions about model improvements.
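    As one concrete example of the data augmentation point above, a torchvision.transforms pipeline might look like this; the specific transforms and sizes are illustrative assumptions:

    from torchvision import transforms

    # Randomly flip and rotate training images to increase data diversity.
    train_transform = transforms.Compose([
        transforms.Resize((64, 64)),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomRotation(degrees=15),
        transforms.ToTensor(),   # convert a PIL image to a float tensor in [0, 1]
    ])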

    Additional Insights from the Sources:

    • Functionalization: As your models and training loops become more complex, it’s beneficial to functionalize your code to improve readability and maintainability [67]. The sources demonstrate this by creating functions for training and evaluation steps [68, 69].
    • Device Agnostic Code: PyTorch allows you to write code that can run on either a CPU or a GPU [70-73]. By using torch.device to determine the available device, you can make your code more flexible and efficient.
    • Debugging and Troubleshooting: The sources emphasize common debugging tips, such as printing shapes and values to check for errors and using the PyTorch documentation as a reference [9, 74-77].

    By following the PyTorch workflow and understanding the key steps involved, you can effectively build, train, evaluate, and deploy deep learning models for various applications. The sources provide valuable code examples and explanations to guide you through this process, enabling you to tackle real-world problems with PyTorch.

    A Comprehensive Discussion of Neural Networks

    Neural networks are a cornerstone of deep learning, a subfield of machine learning. They are computational models inspired by the structure and function of the human brain. The sources, while primarily focused on the PyTorch framework, offer valuable insights into the principles and applications of neural networks.

    1. What are Neural Networks?

    Neural networks are composed of interconnected nodes called neurons, organized in layers. These layers typically include:

    • Input Layer: Receives the initial data, representing features or variables.
    • Hidden Layers: Perform computations on the input data, transforming it through a series of mathematical operations. A network can have multiple hidden layers, increasing its capacity to learn complex patterns.
    • Output Layer: Produces the final output, such as predictions or classifications.

    The connections between neurons have associated weights that determine the strength of the signal transmitted between them. During training, the network adjusts these weights to learn the relationships between input and output data.

    2. The Power of Linear and Nonlinear Functions

    Neural networks leverage a combination of linear and nonlinear functions to approximate complex relationships in data.

    • Linear functions represent straight lines. While useful, they are limited in their ability to model nonlinear patterns.
    • Nonlinear functions introduce curves and bends, allowing the network to capture more intricate relationships in the data.

    The sources illustrate this concept by demonstrating how a simple linear model struggles to separate circularly arranged data points. However, introducing nonlinear activation functions like ReLU (Rectified Linear Unit) allows the model to capture the nonlinearity and successfully classify the data.
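    To make the contrast concrete, here is a sketch of the two model shapes; the layer sizes are illustrative assumptions:

    from torch import nn

    # Linear-only model: stacked linear layers still compose to a linear function.
    linear_model = nn.Sequential(
        nn.Linear(2, 8),
        nn.Linear(8, 1),
    )

    # Inserting ReLU between the layers introduces the bends needed
    # to separate nonlinear (e.g., circularly arranged) data.
    nonlinear_model = nn.Sequential(
        nn.Linear(2, 8),
        nn.ReLU(),
        nn.Linear(8, 1),
    )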

    3. Key Concepts and Terminology

    • Activation Functions: Nonlinear functions applied to the output of neurons, introducing nonlinearity into the network and enabling it to learn complex patterns. Common activation functions include sigmoid, ReLU, and tanh.
    • Layers: Building blocks of a neural network, each performing specific computations.
    • Linear Layers (torch.nn.Linear): Perform linear transformations on the input data using weights and biases.
    • Convolutional Layers (torch.nn.Conv2d): Specialized for image data, extracting features using convolutional kernels.
    • Pooling Layers: Reduce the spatial dimensions of feature maps, often used in CNNs.

    4. Architectures and Applications

    The specific arrangement of layers and their types defines the network’s architecture. Different architectures are suited to various tasks. The sources explore:

    • Multi-layer Perceptrons (MLPs): Basic neural networks with fully connected layers, often used for tabular data.
    • Convolutional Neural Networks (CNNs): Excellent at image recognition tasks, utilizing convolutional layers to extract spatial features.
    • Recurrent Neural Networks (RNNs): Designed for sequential data like text or time series, using recurrent connections to process information over time.

    5. Training Neural Networks

    Training a neural network involves adjusting its weights to minimize a loss function, which measures the difference between predicted and actual values. The sources outline the key steps of a training loop:

    1. Forward Pass: Input data flows through the network, generating predictions.
    2. Loss Calculation: The loss function quantifies the error between predictions and target values.
    3. Backpropagation: The algorithm calculates gradients of the loss with respect to each weight, indicating the direction and magnitude of weight adjustments needed to reduce the loss.
    4. Parameter Update: An optimizer (e.g., SGD or Adam) updates the weights based on the calculated gradients, moving them towards values that minimize the loss.

    6. PyTorch and Neural Network Implementation

    The sources demonstrate how PyTorch provides a flexible and powerful framework for building and training neural networks. Key features include:

    • torch.nn Module: Contains pre-built layers, activation functions, and other components for constructing neural networks.
    • Automatic Differentiation: PyTorch automatically calculates gradients during backpropagation, simplifying the training process.
    • GPU Acceleration: PyTorch allows you to leverage GPUs for faster training, especially beneficial for computationally intensive deep learning models.

    7. Beyond the Basics

    While the sources provide a solid foundation, the world of neural networks is vast and constantly evolving. Further exploration might involve:

    • Advanced Architectures: Researching more complex architectures like ResNet, Transformer networks, and Generative Adversarial Networks (GANs).
    • Transfer Learning: Utilizing pre-trained models to accelerate training and improve performance on tasks with limited data.
    • Deployment and Applications: Learning how to deploy trained models into real-world applications, from image recognition systems to natural language processing tools.

    By understanding the fundamental principles, architectures, and training processes, you can unlock the potential of neural networks to solve a wide range of problems across various domains. The sources offer a practical starting point for your journey into the world of deep learning.

    Training Machine Learning Models: A Deep Dive

    Building upon the foundation of neural networks, the sources provide a detailed exploration of the model training process, focusing on the practical aspects using PyTorch. Here’s an expanded discussion on the key concepts and steps involved:

    1. The Significance of the Training Loop

    The training loop lies at the heart of fitting a model to data, iteratively refining its parameters to learn the underlying patterns. The process involves several key steps, often likened to a song with a specific sequence (a minimal code sketch follows the list):

    1. Forward Pass: Input data, transformed into tensors, is passed through the model’s layers, generating predictions.
    2. Loss Calculation: The loss function quantifies the discrepancy between the model’s predictions and the actual target values, providing a measure of how “wrong” the model is.
    3. Optimizer Zero Grad: Before backpropagation, the gradients stored on the model’s parameters are reset to zero (via optimizer.zero_grad()) to prevent gradients from previous iterations accumulating.
    4. Loss Backwards: Backpropagation calculates the gradients of the loss with respect to each weight in the network, indicating how much each weight contributes to the error.
    5. Optimizer Step: The optimizer, using algorithms like Stochastic Gradient Descent (SGD) or Adam, adjusts the model’s weights based on the calculated gradients. These adjustments aim to nudge the weights in a direction that minimizes the loss.
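    In code, the sequence looks roughly like the sketch below, assuming a model, loss_fn, optimizer, training tensors X_train and y_train, and an epochs count already exist:

    for epoch in range(epochs):
        model.train()
        y_pred = model(X_train)          # 1. forward pass
        loss = loss_fn(y_pred, y_train)  # 2. calculate the loss
        optimizer.zero_grad()            # 3. zero accumulated gradients
        loss.backward()                  # 4. backpropagation
        optimizer.step()                 # 5. update the parameters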

    2. Choosing a Loss Function and Optimizer

    The sources emphasize the crucial role of selecting an appropriate loss function and optimizer tailored to the specific machine learning task:

    • Loss Function: Different tasks require different loss functions. For example, binary classification tasks often use binary cross-entropy loss, while multi-class classification tasks use cross-entropy loss. The loss function guides the model’s learning by quantifying its errors.
    • Optimizer: Optimizers like SGD and Adam employ various algorithms to update the model’s weights during training. Selecting the right optimizer can significantly impact the model’s convergence speed and performance.

    3. Training and Evaluation Modes

    PyTorch provides distinct training and evaluation modes for models, each with specific settings to optimize performance:

    • Training Mode (model.train()): This mode activates training-specific behavior in components such as dropout and batch normalization layers, which is essential for the learning process.
    • Evaluation Mode (model.eval()): This mode switches those same components to their evaluation behavior (dropout is disabled and batch normalization uses its running statistics). It is typically paired with torch.inference_mode(), which disables gradient tracking, so that the model’s behavior during testing reflects its true performance without the influence of training-specific mechanisms (see the sketch after this list).
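    A minimal sketch of how the two modes are typically used (model and X_test are assumed to exist):

    model.train()  # dropout active, batch norm updates its running statistics
    # ... training steps ...

    model.eval()   # dropout off, batch norm uses its stored running statistics
    with torch.inference_mode():  # additionally turns off gradient tracking
        test_pred = model(X_test)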

    4. Monitoring Progress with Loss Curves

    The sources introduce the concept of loss curves as visual tools to track the model’s performance during training. Loss curves plot the loss value over epochs (passes through the entire dataset). Observing these curves helps identify potential issues like underfitting or overfitting:

    • Underfitting: Indicated by a high and relatively unchanging loss value for both training and validation data, suggesting the model is not effectively learning the patterns in the data.
    • Overfitting: Characterized by a low training loss but a high validation loss, implying the model has memorized the training data but struggles to generalize to unseen data.

    5. Improving Through Experimentation

    Model training often involves an iterative process of experimentation to improve performance. The sources suggest several strategies for improving a model’s ability to learn and generalize:

    Model-centric approaches:

    • Adding more layers: Increasing the depth of the network can enhance its capacity to learn complex patterns.
    • Adding more hidden units: Expanding the width of layers can provide more representational power.
    • Changing the activation function: Experimenting with different activation functions like ReLU or sigmoid can influence the model’s nonlinearity and learning behavior.

    Training- and data-centric approaches:

    • Training for longer: Increasing the number of epochs allows the model more iterations to adjust its weights and potentially reach a lower loss.
    • Data Augmentation: Artificially expanding the training dataset by applying transformations like rotations, flips, and crops can help the model generalize better to unseen data.

    6. Saving and Loading Models

    PyTorch enables saving and loading trained models, crucial for deploying models or resuming training from a previous state. This process often involves saving the model’s state dictionary, containing the learned weights and biases:

    • Saving a model (torch.save()): Preserves the model’s state dictionary for later use.
    • Loading a model (torch.load() plus model.load_state_dict()): Retrieves a saved state dictionary and loads it into a freshly instantiated model, restoring its trained weights (see the sketch after this list).
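    A minimal sketch of the state-dictionary approach (the file name and the MyModel class are illustrative placeholders):

    import torch

    # Save only the learned parameters (the commonly recommended approach)
    torch.save(model.state_dict(), "model_0.pth")

    # To restore, instantiate the same architecture first, then load the weights
    loaded_model = MyModel()  # hypothetical class matching the saved architecture
    loaded_model.load_state_dict(torch.load("model_0.pth"))
    loaded_model.eval()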

    7. Going Beyond the Basics

    The sources provide a comprehensive foundation for understanding and implementing model training using PyTorch. As you progress, further exploration might include:

    • Advanced Optimizers: Investigating optimizers beyond SGD and Adam, such as RMSprop and Adagrad, each with different advantages and characteristics.
    • Hyperparameter Tuning: Exploring techniques like grid search and random search to systematically find optimal hyperparameters for the model, loss function, and optimizer.
    • Monitoring with TensorBoard: Utilizing TensorBoard, a visualization tool, to track various metrics like loss, accuracy, and gradients during training, providing insights into the learning process.

    By grasping the core principles of the training loop, the importance of loss functions and optimizers, and techniques for improving model performance, you gain the tools to effectively train neural networks and other machine learning models using PyTorch. The sources offer a practical guide to navigate the intricacies of model training, setting the stage for tackling more complex deep learning challenges.

    A Deep Dive into Computer Vision with PyTorch

    Building on the foundation of neural networks and model training, the sources provide an extensive exploration of computer vision using the PyTorch framework. They guide you through the process of building, training, and evaluating computer vision models, offering valuable insights into the core concepts and practical techniques involved.

    1. Understanding Computer Vision Problems

    Computer vision, broadly defined, encompasses tasks that enable computers to “see” and interpret visual information, mimicking human visual perception. The sources illustrate the vast scope of computer vision problems, ranging from basic classification to more complex tasks like object detection and image segmentation.

    Examples of Computer Vision Problems:

    • Image Classification: Assigning a label to an image from a predefined set of categories. For instance, classifying an image as containing a cat, dog, or bird.
    • Object Detection: Identifying and localizing specific objects within an image, often by drawing bounding boxes around them. Applications include self-driving cars recognizing pedestrians and traffic signs.
    • Image Segmentation: Dividing an image into meaningful regions, labeling each pixel with its corresponding object or category. This technique is used in medical imaging to identify organs and tissues.

    2. The Power of Convolutional Neural Networks (CNNs)

    The sources highlight CNNs as powerful deep learning models well-suited for computer vision tasks. CNNs excel at extracting spatial features from images using convolutional layers, mimicking the human visual system’s hierarchical processing of visual information.

    Key Components of CNNs:

    • Convolutional Layers: Perform convolutions using learnable filters (kernels) that slide across the input image, extracting features like edges, textures, and patterns.
    • Activation Functions: Introduce nonlinearity, allowing CNNs to model complex relationships between image features and output predictions.
    • Pooling Layers: Downsample feature maps, reducing computational complexity and making the model more robust to variations in object position and scale.
    • Fully Connected Layers: Combine features extracted by convolutional and pooling layers, generating final predictions for classification or other tasks.

    The sources provide practical insights into building CNNs using PyTorch’s torch.nn module, guiding you through the process of defining layers, constructing the network architecture, and implementing the forward pass.

    3. Working with Torchvision

    PyTorch’s Torchvision library emerges as a crucial tool for computer vision projects, offering a rich ecosystem of pre-built datasets, models, and transformations.

    Key Components of Torchvision:

    • Datasets: Provides access to popular computer vision datasets like MNIST, FashionMNIST, CIFAR, and ImageNet. These datasets simplify the process of obtaining and loading data for model training and evaluation.
    • Models: Offers pre-trained models for various computer vision tasks, allowing you to leverage the power of transfer learning by fine-tuning these models on your own datasets.
    • Transforms: Enables data preprocessing and augmentation. You can use transforms to resize, crop, flip, normalize, and augment images, artificially expanding your dataset and improving model generalization.

    4. The Computer Vision Workflow

    The sources outline a typical workflow for computer vision projects using PyTorch, emphasizing practical steps and considerations:

    1. Data Preparation: Obtaining or creating a suitable dataset, organizing it into appropriate folders (e.g., by class labels), and applying necessary preprocessing or transformations.
    2. Dataset and DataLoader: Utilizing PyTorch’s Dataset and DataLoader classes to load and batch data efficiently for training and evaluation (a short example appears after this list).
    3. Model Construction: Defining the CNN architecture using PyTorch’s torch.nn module, specifying layers, activation functions, and other components based on the problem’s complexity and requirements.
    4. Loss Function and Optimizer: Selecting a suitable loss function that aligns with the task (e.g., cross-entropy loss for classification) and choosing an optimizer like SGD or Adam to update the model’s weights during training.
    5. Training Loop: Implementing the iterative training process, involving forward pass, loss calculation, backpropagation, and weight updates. Monitoring training progress using loss curves to identify potential issues like underfitting or overfitting.
    6. Evaluation: Assessing the model’s performance on a held-out test dataset using metrics like accuracy, precision, recall, and F1-score, depending on the task.
    7. Model Saving and Loading: Preserving trained models for later use or deployment using torch.save and loading them back using torch.load.
    8. Prediction on Custom Data: Demonstrating how to load and preprocess custom images, pass them through the trained model, and obtain predictions.
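    As a short example of step 2, the sketch below loads the built-in FashionMNIST dataset and wraps it in a DataLoader (the batch size is illustrative):

    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    train_data = datasets.FashionMNIST(
        root="data", train=True, download=True,
        transform=transforms.ToTensor(),
    )
    train_dataloader = DataLoader(train_data, batch_size=32, shuffle=True)

    images, labels = next(iter(train_dataloader))
    print(images.shape)  # torch.Size([32, 1, 28, 28])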

    5. Going Beyond the Basics

    The sources provide a comprehensive foundation, but computer vision is a rapidly evolving field. Further exploration might lead you to:

    • Advanced Architectures: Exploring more complex CNN architectures like ResNet, Inception, and EfficientNet, each designed to address challenges in image recognition.
    • Object Detection and Segmentation: Investigating specialized models and techniques for object detection (e.g., YOLO, Faster R-CNN) and image segmentation (e.g., U-Net, Mask R-CNN).
    • Transfer Learning in Depth: Experimenting with various pre-trained models and fine-tuning strategies to optimize performance on your specific computer vision tasks.
    • Real-world Applications: Researching how computer vision is applied in diverse domains, such as medical imaging, autonomous driving, robotics, and image editing software.

    By mastering the fundamentals of computer vision, understanding CNNs, and leveraging PyTorch’s powerful tools, you can build and deploy models that empower computers to “see” and understand the visual world. The sources offer a practical guide to navigate this exciting domain, equipping you with the skills to tackle a wide range of computer vision challenges.

    Understanding Data Augmentation in Computer Vision

    Data augmentation is a crucial technique in computer vision that artificially expands the diversity and size of a training dataset by applying various transformations to the existing images [1, 2]. This process enhances the model’s ability to generalize and learn more robust patterns, ultimately improving its performance on unseen data.

    Why Data Augmentation is Important

    1. Increased Dataset Diversity: Data augmentation introduces variations in the training data, exposing the model to different perspectives of the same image [2]. This prevents the model from overfitting, where it learns to memorize the specific details of the training set rather than the underlying patterns of the target classes.
    2. Reduced Overfitting: By making the training data more challenging, data augmentation forces the model to learn more generalizable features that are less sensitive to minor variations in the input images [3, 4].
    3. Improved Model Generalization: A model trained with augmented data is better equipped to handle unseen data, as it has learned to recognize objects and patterns under various transformations, making it more robust and reliable in real-world applications [1, 5].

    Types of Data Augmentations

    The sources highlight several commonly used data augmentation techniques, particularly within the context of PyTorch’s torchvision.transforms module [6-8].

    • Resize: Changing the dimensions of the images [9]. This helps standardize the input size for the model and can also introduce variations in object scale.
    • Random Horizontal Flip: Flipping the images horizontally with a certain probability [8]. This technique is particularly effective for objects that are symmetric or appear in both left-right orientations.
    • Random Rotation: Rotating the images by a random angle [3]. This helps the model learn to recognize objects regardless of their orientation.
    • Random Crop: Cropping random sections of the images [9, 10]. This forces the model to focus on different parts of the image and can also introduce variations in object position.
    • Color Jitter: Adjusting the brightness, contrast, saturation, and hue of the images [11]. This helps the model learn to recognize objects under different lighting conditions.

    Trivial Augment: A State-of-the-Art Approach

    The sources mention Trivial Augment, a data augmentation strategy used by the PyTorch team to achieve state-of-the-art results on their computer vision models [12, 13]. Trivial Augment leverages randomness to select and apply a combination of augmentations from a predefined set with varying intensities, leading to a diverse and challenging training dataset [14].
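    Torchvision ships this strategy as transforms.TrivialAugmentWide (available in recent torchvision versions); the pipeline below is an illustrative sketch, with the resize dimensions chosen arbitrarily:

    from torchvision import transforms

    # TrivialAugmentWide picks one augmentation at random per image,
    # with a random strength drawn from num_magnitude_bins levels.
    augmented_transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.TrivialAugmentWide(num_magnitude_bins=31),
        transforms.ToTensor(),
    ])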

    Practical Implementation in PyTorch

    PyTorch’s torchvision.transforms module provides a comprehensive set of functions for data augmentation [6-8]. You can create a transform pipeline by composing a sequence of transformations using transforms.Compose. For example, a basic transform pipeline might include resizing, random horizontal flipping, and conversion to a tensor:

    from torchvision import transforms

    train_transform = transforms.Compose([
        transforms.Resize((64, 64)),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.ToTensor(),
    ])

    To apply data augmentation during training, you pass this transform pipeline to the Dataset when loading your images; the DataLoader then batches the already-transformed tensors [7, 15].
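    For example, one common pattern is to hand the pipeline to torchvision’s ImageFolder dataset, which applies it to every image it loads (the directory path here is an illustrative placeholder with one sub-folder per class):

    from torchvision import datasets

    train_data = datasets.ImageFolder(root="data/train", transform=train_transform)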

    Evaluating the Impact of Data Augmentation

    The sources emphasize the importance of comparing model performance with and without data augmentation to assess its effectiveness [16, 17]. By monitoring training metrics like loss and accuracy, you can observe how data augmentation influences the model’s learning process and its ability to generalize to unseen data [18, 19].

    The Crucial Role of Hyperparameters in Model Training

    Hyperparameters are external configurations that are set by the machine learning engineer or data scientist before training a model. They are distinct from the parameters of a model, which are the internal values (weights and biases) that the model learns from the data during training. Hyperparameters play a critical role in shaping the model’s architecture, behavior, and ultimately, its performance.

    Defining Hyperparameters

    As the sources explain, hyperparameters are values that we, as the model builders, control and adjust. In contrast, parameters are values that the model learns and updates during training. The sources use the analogy of parking a car:

    • Hyperparameters are akin to the external controls of the car, such as the steering wheel, accelerator, and brake, which the driver uses to guide the vehicle.
    • Parameters are like the internal workings of the engine and transmission, which adjust automatically based on the driver’s input.

    Impact of Hyperparameters on Model Training

    Hyperparameters directly influence the learning process of a model. They determine factors such as:

    • Model Complexity: Hyperparameters like the number of layers and hidden units dictate the model’s capacity to learn intricate patterns in the data. More layers and hidden units typically increase the model’s complexity and ability to capture nonlinear relationships. However, excessive complexity can lead to overfitting.
    • Learning Rate: The learning rate governs how much the optimizer adjusts the model’s parameters during each training step. A high learning rate allows for rapid learning but can lead to instability or divergence. A low learning rate ensures stability but may require longer training times.
    • Batch Size: The batch size determines how many training samples are processed together before updating the model’s weights. Smaller batches can lead to faster convergence but might introduce more noise in the gradients. Larger batches provide more stable gradients but can slow down training.
    • Number of Epochs: The number of epochs determines how many times the entire training dataset is passed through the model. More epochs can improve learning, but excessive training can also lead to overfitting.

    Example: Tuning Hyperparameters for a CNN

    Consider the task of building a CNN for image classification, as described in the sources. Several hyperparameters are crucial to the model’s performance:

    • Number of Convolutional Layers: This hyperparameter determines how many layers are used to extract features from the images. More layers allow for the capture of more complex features but increase computational complexity.
    • Kernel Size: The kernel size (filter size) in convolutional layers dictates the receptive field of the filters, influencing the scale of features extracted. Smaller kernels capture fine-grained details, while larger kernels cover wider areas.
    • Stride: The stride defines how the kernel moves across the image during convolution. A larger stride results in downsampling and a smaller feature map.
    • Padding: Padding adds extra pixels around the image borders before convolution, preventing information loss at the edges and ensuring consistent feature map dimensions.
    • Activation Function: Activation functions like ReLU introduce nonlinearity, enabling the model to learn complex relationships between features. The choice of activation function can significantly impact model performance.
    • Optimizer: The optimizer (e.g., SGD, Adam) determines how the model’s parameters are updated based on the calculated gradients. Different optimizers have different convergence properties and might be more suitable for specific datasets or architectures.

    By carefully tuning these hyperparameters, you can optimize the CNN’s performance on the image classification task. Experimentation and iteration are key to finding the best hyperparameter settings for a given dataset and model architecture.
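    To see kernel size, stride, and padding in action, the sketch below creates a single convolutional layer and checks its output shape; with kernel_size=3, stride=1, and padding=1, the spatial dimensions are preserved because (64 + 2*1 - 3) // 1 + 1 = 64 (the channel counts are illustrative):

    import torch
    from torch import nn

    conv = nn.Conv2d(in_channels=3, out_channels=10,
                     kernel_size=3, stride=1, padding=1)

    x = torch.rand(1, 3, 64, 64)  # (batch, channels, height, width)
    print(conv(x).shape)          # torch.Size([1, 10, 64, 64])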

    The Hyperparameter Tuning Process

    The sources highlight the iterative nature of finding the best hyperparameter configurations. There’s no single “best” set of hyperparameters that applies universally. The optimal settings depend on the specific dataset, model architecture, and task. The sources also emphasize:

    • Experimentation: Try different combinations of hyperparameters to observe their impact on model performance.
    • Monitoring Loss Curves: Use loss curves to gain insights into the model’s training behavior, identifying potential issues like underfitting or overfitting and adjusting hyperparameters accordingly.
    • Validation Sets: Employ a validation dataset to evaluate the model’s performance on unseen data during training, helping to prevent overfitting and select the best-performing hyperparameters.
    • Automated Techniques: Explore automated hyperparameter tuning methods like grid search, random search, or Bayesian optimization to efficiently search the hyperparameter space.

    By understanding the role of hyperparameters and mastering techniques for tuning them, you can unlock the full potential of your models and achieve optimal performance on your computer vision tasks.

    The Learning Process of Deep Learning Models

    Deep learning models learn from data by adjusting their internal parameters to capture patterns and relationships within the data. The sources provide a comprehensive overview of this process, particularly within the context of supervised learning using neural networks.

    1. Data Representation: Turning Data into Numbers

    The first step in deep learning is to represent the data in a numerical format that the model can understand. As the sources emphasize, “machine learning is turning things into numbers” [1, 2]. This process involves encoding various forms of data, such as images, text, or audio, into tensors, which are multi-dimensional arrays of numbers.

    2. Model Architecture: Building the Learning Framework

    Once the data is numerically encoded, a model architecture is defined. Neural networks are a common type of deep learning model, consisting of interconnected layers of neurons. Each layer performs mathematical operations on the input data, transforming it into increasingly abstract representations.

    • Input Layer: Receives the numerical representation of the data.
    • Hidden Layers: Perform computations on the input, extracting features and learning representations.
    • Output Layer: Produces the final output of the model, which is tailored to the specific task (e.g., classification, regression).

    3. Parameter Initialization: Setting the Starting Point

    The parameters of a neural network, typically weights and biases, are initially assigned random values. These parameters determine how the model processes the data and ultimately define its behavior.

    4. Forward Pass: Calculating Predictions

    During training, the data is fed forward through the network, layer by layer. Each layer performs its mathematical operations, using the current parameter values to transform the input data. The final output of the network represents the model’s prediction for the given input.

    5. Loss Function: Measuring Prediction Errors

    A loss function is used to quantify the difference between the model’s predictions and the true target values. The loss function measures how “wrong” the model’s predictions are, providing a signal for how to adjust the parameters to improve performance.

    6. Backpropagation: Calculating Gradients

    Backpropagation is the core algorithm that enables deep learning models to learn. It involves calculating the gradients of the loss function with respect to each parameter in the network. These gradients indicate the direction and magnitude of change needed for each parameter to reduce the loss.

    7. Optimizer: Updating Parameters

    An optimizer uses the calculated gradients to update the model’s parameters. The optimizer’s goal is to minimize the loss function by iteratively adjusting the parameters in the direction that reduces the error. Common optimizers include Stochastic Gradient Descent (SGD) and Adam.

    8. Training Loop: Iterative Learning Process

    The training loop encompasses the steps of forward pass, loss calculation, backpropagation, and parameter update. This process is repeated iteratively over the training data, allowing the model to progressively refine its parameters and improve its predictive accuracy.

    • Epochs: Each pass through the entire training dataset is called an epoch.
    • Batch Size: Data is typically processed in batches, where a batch is a subset of the training data.

    9. Evaluation: Assessing Model Performance

    After training, the model is evaluated on a separate dataset (validation or test set) to assess its ability to generalize to unseen data. Metrics like accuracy, precision, and recall are used to measure the model’s performance on the task.

    10. Hyperparameter Tuning: Optimizing the Learning Process

    Hyperparameters are external configurations that influence the model’s learning process. Examples include learning rate, batch size, and the number of layers. Tuning hyperparameters is crucial to achieving optimal model performance. This often involves experimentation and monitoring training metrics to find the best settings.

    Key Concepts and Insights

    • Iterative Learning: Deep learning models learn through an iterative process of making predictions, calculating errors, and adjusting parameters.
    • Gradient Descent: Backpropagation and optimizers work together to implement gradient descent, guiding the parameter updates towards minimizing the loss function.
    • Feature Learning: Hidden layers in neural networks automatically learn representations of the data, extracting meaningful features that contribute to the model’s predictive ability.
    • Nonlinearity: Activation functions introduce nonlinearity, allowing models to capture complex relationships in the data that cannot be represented by simple linear models.

    By understanding these fundamental concepts, you can gain a deeper appreciation for how deep learning models learn from data and achieve remarkable performance on a wide range of tasks.

    Key Situations for Deep Learning Solutions

    The sources provide a detailed explanation of when deep learning is a good solution and when simpler approaches might be more suitable. Here are three key situations where deep learning often excels:

    1. Problems with Long Lists of Rules

    Deep learning models are particularly effective when dealing with problems that involve a vast and intricate set of rules that would be difficult or impossible to program explicitly. The sources use the example of driving a car, which encompasses countless rules regarding navigation, safety, and traffic regulations.

    • Traditional programming struggles with such complexity, requiring engineers to manually define and code every possible scenario. This approach quickly becomes unwieldy and prone to errors.
    • Deep learning offers a more flexible and adaptable solution. Instead of explicitly programming rules, deep learning models learn from data, automatically extracting patterns and relationships that represent the underlying rules.

    2. Continuously Changing Environments

    Deep learning shines in situations where the environment or the data itself is constantly evolving. Unlike traditional rule-based systems, which require manual updates to adapt to changes, deep learning models can continuously learn and update their knowledge as new data becomes available.

    • The sources highlight the adaptability of deep learning, stating that models can “keep learning if it needs to” and “adapt and learn to new scenarios.”
    • This capability is crucial in applications such as self-driving cars, where road conditions, traffic patterns, and even driving regulations can change over time.

    3. Discovering Insights Within Large Collections of Data

    Deep learning excels at uncovering hidden patterns and insights within massive datasets. The ability to process vast amounts of data is a key advantage of deep learning, enabling it to identify subtle relationships and trends that might be missed by traditional methods.

    • The sources emphasize the flourishing of deep learning in handling large datasets, citing examples like the Food 101 dataset, which contains images of 101 different kinds of foods.
    • This capacity for large-scale data analysis is invaluable in fields such as medical image analysis, where deep learning can assist in detecting diseases, identifying anomalies, and predicting patient outcomes.

    In these situations, deep learning offers a powerful and flexible approach, allowing models to learn from data, adapt to changes, and extract insights from vast datasets, providing solutions that were previously challenging or even impossible to achieve with traditional programming techniques.

    The Most Common Errors in Deep Learning

    The sources highlight shape errors as one of the most prevalent challenges encountered by deep learning developers. The sources emphasize that this issue stems from the fundamental reliance on matrix multiplication operations in neural networks.

    • Neural networks are built upon interconnected layers, and matrix multiplication is the primary mechanism for data transformation between these layers. [1]
    • Shape errors arise when the dimensions of the matrices involved in these multiplications are incompatible. [1, 2]
    • The sources illustrate this concept by explaining that for matrix multiplication to succeed, the inner dimensions of the matrices must match, and the result takes on the outer dimensions (illustrated below). [2, 3]
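    A short illustration of the rule (the shapes are chosen arbitrarily):

    import torch

    A = torch.rand(3, 2)
    B = torch.rand(2, 4)

    # Inner dimensions match (2 and 2), so this succeeds;
    # the result takes on the outer dimensions: (3, 4)
    print(torch.matmul(A, B).shape)  # torch.Size([3, 4])

    # torch.matmul(B, A) would raise a RuntimeError, because the
    # inner dimensions of (2, 4) @ (3, 2) are 4 and 3, which do not match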

    Three Big Errors in PyTorch and Deep Learning

    The sources further elaborate on this concept within the specific context of the PyTorch deep learning framework, identifying three primary categories of errors:

    1. Tensors not having the Right Data Type: The sources point out that using the incorrect data type for tensors can lead to errors, especially during the training of large neural networks. [4]
    2. Tensors not having the Right Shape: This echoes the earlier discussion of shape errors and their importance in matrix multiplication operations. [4]
    3. Device Issues: This category of errors arises when tensors are located on different devices, typically the CPU and GPU. PyTorch requires tensors involved in an operation to reside on the same device. [5]

    The Ubiquity of Shape Errors

    The sources consistently underscore the significance of understanding tensor shapes and dimensions in deep learning.

    • They emphasize that mismatches in input and output shapes between layers are a frequent source of errors. [6]
    • The process of reshaping, stacking, squeezing, and unsqueezing tensors is presented as a crucial technique for addressing shape-related issues. [7, 8]
    • The sources advise developers to become familiar with their data’s shape and consult documentation to understand the expected input shapes for various layers and operations. [9]

    Troubleshooting Tips and Practical Advice

    Beyond identifying shape errors as a common challenge, the sources offer practical tips and insights for troubleshooting such issues.

    • Understanding matrix multiplication rules: Developers are encouraged to grasp the fundamental rules governing matrix multiplication to anticipate and prevent shape errors. [3]
    • Visualizing matrix multiplication: The sources recommend using the website matrixmultiplication.xyz as a tool for visualizing matrix operations and understanding their dimensional requirements. [10]
    • Programmatic shape checking: The sources advocate for incorporating programmatic checks of tensor shapes using functions like tensor.shape to identify and debug shape mismatches. [11, 12]

    By understanding the importance of tensor shapes and diligently checking for dimensional compatibility, deep learning developers can mitigate the occurrence of shape errors and streamline their development workflow.

    Two Common Deep Learning Errors

    The sources describe three major errors faced by deep learning developers: tensors not having the correct data type, tensors not having the correct shape, and device issues. [1] Two particularly common errors are data type and shape mismatches. [1, 2]

    Data Type Mismatches

    The sources explain that using the wrong data type for a tensor, especially when training large neural networks, can lead to errors. [1] For example, torch.mean() requires a floating-point tensor (such as float32); calling it on a long (int64) tensor raises an error. [3] Data type expectations also apply to loss functions: torch.nn.BCELoss expects inputs that have already passed through a sigmoid activation (i.e., probabilities), whereas torch.nn.BCEWithLogitsLoss takes raw logits and applies the sigmoid internally. [4-6]
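    A small sketch of the torch.mean() case and its fix:

    import torch

    x = torch.arange(0, 10)  # integer (int64 / long) tensor by default
    # torch.mean(x) would raise a RuntimeError: mean() needs a floating-point dtype
    print(torch.mean(x.type(torch.float32)))  # tensor(4.5000) — convert, then aggregate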

    Shape Mismatches

    Shape errors are extremely common in deep learning. [1, 2, 7-13] They arise when the dimensions of matrices are incompatible during matrix multiplication operations: for the multiplication to succeed, the inner dimensions of the matrices must match. [7-9, 14] Shape errors also occur when the input or output shapes of tensors are mismatched between layers in a neural network. [11, 15] For example, a convolutional layer typically expects a four-dimensional tensor of shape (batch, channels, height, width); passing a three-dimensional tensor without a batch dimension causes an error. [13] The sources recommend checking the shape of tensors frequently to catch these errors (see the sketch below). [11, 16]
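    A sketch of the convolutional-layer case and the usual fix of adding a batch dimension with unsqueeze (recent PyTorch versions also accept unbatched 3D input, but the batched form is the conventional one):

    import torch
    from torch import nn

    conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=3)

    image = torch.rand(3, 64, 64)     # (colour channels, height, width): 3 dims
    batched = image.unsqueeze(dim=0)  # add a batch dimension -> (1, 3, 64, 64)
    print(conv(batched).shape)        # torch.Size([1, 10, 62, 62])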

    Let’s go through the topics covered in the “PyTorch for Deep Learning & Machine Learning – Full Course” one by one.

    1. Introduction: Deep Learning vs. Traditional Programming

    The sources start by introducing deep learning as a subset of machine learning, which itself is a subset of artificial intelligence [1]. They explain the key difference between traditional programming and machine learning [2].

    • In traditional programming, we give the computer specific rules and data, and it produces the output.
    • In machine learning, we provide the computer with data and desired outputs, and it learns the rules to map the data to the outputs.

    The sources argue that deep learning is particularly well-suited for complex problems where it’s difficult to hand-craft rules [3, 4]. Examples include self-driving cars and image recognition. However, they also caution against using machine learning when a simpler, rule-based system would suffice [4, 5].

    2. PyTorch Fundamentals: Tensors and Operations

    The sources then introduce PyTorch, a popular deep learning framework written in Python [6, 7]. The core data structure in PyTorch is the tensor, a multi-dimensional array that can be used to represent various types of data [8].

    • The sources explain the different types of tensors: scalars, vectors, matrices, and higher-order tensors [9].
    • They demonstrate how to create tensors using torch.tensor() and showcase various operations like reshaping, indexing, stacking, and permuting [9-11].

    Understanding tensor shapes and dimensions is crucial for avoiding errors in deep learning, as highlighted in our previous conversation about shape mismatches [12].

    3. The PyTorch Workflow: From Data to Model

    The sources then outline a typical PyTorch workflow [13] for developing deep learning models:

    1. Data Preparation and Loading: The sources emphasize the importance of preparing data for machine learning [14] and the process of transforming raw data into a numerical representation suitable for models. They introduce data loaders (torch.utils.data.DataLoader) [15] for efficiently loading data in batches [16].
    2. Building a Machine Learning Model: The sources demonstrate how to build models in PyTorch by subclassing nn.Module [17]. This involves defining the model’s layers and the forward pass, which specifies how data flows through the model.
    3. Fitting the Model to the Data (Training): The sources explain the concept of a training loop [18], where the model iteratively learns from the data. Key steps in the training loop include:
    • Forward Pass: Passing data through the model to get predictions.
    • Calculating the Loss: Measuring how wrong the model’s predictions are using a loss function [19].
    • Backpropagation: Calculating gradients to determine how to adjust the model’s parameters.
    • Optimizer Step: Updating the model’s parameters using an optimizer [20] to minimize the loss.
    4. Evaluating the Model: The sources highlight the importance of evaluating the model’s performance on unseen data to assess its generalization ability. This typically involves calculating metrics such as accuracy, precision, and recall [21].
    5. Saving and Reloading the Model: The sources discuss methods for saving and loading trained models using torch.save() and torch.load() [22, 23].
    6. Improving the Model: The sources provide tips and strategies for enhancing the model’s performance, including techniques like hyperparameter tuning, data augmentation, and using different model architectures [24].

    4. Classification with PyTorch: Binary and Multi-Class

    The sources dive into classification problems, a common type of machine learning task where the goal is to categorize data into predefined classes [25]. They discuss:

    • Binary Classification: Predicting one of two possible classes [26].
    • Multi-Class Classification: Choosing from more than two classes [27].

    The sources demonstrate how to build classification models in PyTorch and showcase various techniques:

    • Choosing appropriate loss functions, such as binary cross entropy loss (nn.BCELoss, or nn.BCEWithLogitsLoss for raw logits) for binary classification and cross entropy loss (nn.CrossEntropyLoss) for multi-class classification [28].
    • Using activation functions like sigmoid for binary classification and softmax for multi-class classification (see the sketch after this list) [29].
    • Evaluating classification models using metrics like accuracy, precision, recall, and confusion matrices [30].
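    A brief sketch of how the two setups pair logits, activations, and losses (sample and class counts are illustrative):

    import torch
    from torch import nn

    # Binary classification: one logit per sample; sigmoid turns it into a
    # probability. nn.BCEWithLogitsLoss fuses the sigmoid with the loss.
    binary_logits = torch.randn(4, 1)
    binary_labels = torch.randint(0, 2, (4, 1)).float()
    binary_loss = nn.BCEWithLogitsLoss()(binary_logits, binary_labels)

    # Multi-class classification: one logit per class; softmax turns them into
    # probabilities. nn.CrossEntropyLoss applies log-softmax internally.
    multi_logits = torch.randn(4, 3)          # 4 samples, 3 classes
    multi_labels = torch.randint(0, 3, (4,))  # class index per sample
    multi_loss = nn.CrossEntropyLoss()(multi_logits, multi_labels)
    print(binary_loss.item(), multi_loss.item())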

    5. Computer Vision with PyTorch: Convolutional Neural Networks (CNNs)

    The sources introduce computer vision, the field of enabling computers to “see” and interpret images [31]. They focus on convolutional neural networks (CNNs), a type of neural network architecture specifically designed for processing image data [32].

    • Torchvision: The sources introduce torchvision, a PyTorch library containing datasets, model architectures, and image transformation tools [33].
    • Data Augmentation: The sources showcase data augmentation techniques using torchvision.transforms to create variations of training images and improve model robustness [34].
    • CNN Building Blocks: The sources explain and demonstrate key CNN components like convolutional layers (nn.Conv2d), pooling layers, and activation functions [35].

    They guide you through building CNNs from scratch and visualizing the learned features.

    6. Custom Datasets: Working with Your Own Data

    The sources address the challenge of working with custom datasets not readily available in PyTorch’s built-in libraries [36]. They explain how to:

    • Create custom datasets by subclassing torch.utils.data.Dataset [37] and implementing the __len__ and __getitem__ methods for loading and processing data (a minimal skeleton follows this list).
    • Integrate custom datasets with PyTorch’s data loaders for efficient training.
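    A minimal skeleton of such a subclass (the class name and in-memory sample storage are illustrative; real datasets often load files from disk inside __getitem__):

    from torch.utils.data import Dataset

    class CustomImageDataset(Dataset):
        def __init__(self, samples, transform=None):
            self.samples = samples    # e.g. a list of (image, label) pairs
            self.transform = transform

        def __len__(self):
            return len(self.samples)  # total number of samples

        def __getitem__(self, index):
            image, label = self.samples[index]
            if self.transform:
                image = self.transform(image)
            return image, label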

    This section empowers you to apply PyTorch to real-world problems using your own data.

    7. Further Resources and Exercises

    The sources provide numerous extracurricular resources and exercises [38-40] to deepen your understanding and practice your skills. These resources include:

    • Links to relevant PyTorch documentation and tutorials
    • Blog posts and articles on specific deep learning concepts
    • Code templates and solution notebooks for exercises

    By actively engaging with these resources, you can continue your learning journey and solidify your PyTorch proficiency.

    This comprehensive overview of the topics covered in the “PyTorch for Deep Learning & Machine Learning – Full Course” provides a structured understanding of the key concepts and techniques. Remember, the best way to learn is by practicing and experimenting with the code provided in the sources.

    Here are summaries of each set of 10 pages from the source document:

    Pages 1-10 Summary: Introduction to Deep Learning and PyTorch Fundamentals

    These pages introduce the fundamental concepts of deep learning, positioning it as a powerful subset of machine learning. The sources draw a clear distinction between traditional programming, where explicit rules dictate output, and machine learning, where algorithms learn rules from data. The emphasis is on PyTorch as the chosen deep learning framework, highlighting its core data structure: the tensor.

    The sources provide practical guidance on creating tensors using torch.tensor() and manipulating them with operations like reshaping and indexing. They underscore the crucial role of understanding tensor shapes and dimensions, connecting it to the common challenge of shape errors discussed in our earlier conversation.

    This set of pages lays the groundwork for understanding both the conceptual framework of deep learning and the practical tools provided by PyTorch.

    Pages 11-20 Summary: Exploring Tensors, Neural Networks, and PyTorch Documentation

    These pages build upon the introduction of tensors, expanding on operations like stacking and permuting to manipulate tensor structures further. They transition into a conceptual overview of neural networks, emphasizing their ability to learn complex patterns from data. However, the sources don’t provide detailed definitions of deep learning or neural networks, encouraging you to explore these concepts independently through external resources like Wikipedia and educational channels.

    The sources strongly advocate for actively engaging with PyTorch documentation. They highlight the website as a valuable resource for understanding PyTorch’s features, functions, and examples. They encourage you to spend time reading and exploring the documentation, even if you don’t fully grasp every detail initially.

    Pages 21-30 Summary: The PyTorch Workflow: Data, Models, Loss, and Optimization

    This section of the source delves into the core PyTorch workflow, starting with the importance of data preparation. It emphasizes the transformation of raw data into tensors, making it suitable for deep learning models. Data loaders are presented as essential tools for efficiently handling large datasets by loading data in batches.

    The sources then guide you through the process of building a machine learning model in PyTorch, using the concept of subclassing nn.Module. The forward pass is introduced as a fundamental step that defines how data flows through the model’s layers. The sources explain how models are trained by fitting them to the data, highlighting the iterative process of the training loop:

    1. Forward pass: Input data is fed through the model to generate predictions.
    2. Loss calculation: A loss function quantifies the difference between the model’s predictions and the actual target values.
    3. Backpropagation: The model’s parameters are adjusted by calculating gradients, indicating how each parameter contributes to the loss.
    4. Optimization: An optimizer uses the calculated gradients to update the model’s parameters, aiming to minimize the loss.

    Pages 31-40 Summary: Evaluating Models, Running Tensors, and Important Concepts

    The sources focus on evaluating the model’s performance, emphasizing its significance in determining how well the model generalizes to unseen data. They mention common metrics like accuracy, precision, and recall as tools for evaluating model effectiveness.

    The sources introduce the concept of running tensors on different devices (CPU and GPU) using .to(device), highlighting its importance for computational efficiency. They also discuss the use of random seeds (torch.manual_seed()) to ensure reproducibility in deep learning experiments, enabling consistent results across multiple runs.

    The sources stress the importance of documentation reading as a key exercise for understanding PyTorch concepts and functionalities. They also advocate for practical coding exercises to reinforce learning and develop proficiency in applying PyTorch concepts.

    Pages 41-50 Summary: Exercises, Classification Introduction, and Data Visualization

    The sources dedicate these pages to practical application and reinforcement of previously learned concepts. They present exercises designed to challenge your understanding of PyTorch workflows, data manipulation, and model building. They recommend referring to the documentation, practicing independently, and checking provided solutions as a learning approach.

    The focus shifts to classification problems, distinguishing between binary classification, where the task is to predict one of two classes, and multi-class classification, involving more than two classes.

    The sources then begin exploring data visualization, emphasizing the importance of understanding your data before applying machine learning models. They introduce the make_circles dataset as an example and use scatter plots to visualize its structure, highlighting the need for visualization as a crucial step in the data exploration process.

    Pages 51-60 Summary: Data Splitting, Building a Classification Model, and Training

    The sources discuss the critical concept of splitting data into training and test sets. This separation ensures that the model is evaluated on unseen data to assess its generalization capabilities accurately. They utilize the train_test_split function to divide the data and showcase the process of building a simple binary classification model in PyTorch.

    The sources emphasize the familiar training loop process, where the model iteratively learns from the training data:

    1. Forward pass through the model
    2. Calculation of the loss function
    3. Backpropagation of gradients
    4. Optimization of model parameters

    They guide you through implementing these steps and visualizing the model’s training progress using loss curves, highlighting the importance of monitoring these curves for insights into the model’s learning behavior.

    Pages 61-70 Summary: Multi-Class Classification, Data Visualization, and the Softmax Function

    The sources delve into multi-class classification, expanding upon the previously covered binary classification. They illustrate the differences between the two and provide examples of scenarios where each is applicable.

    The focus remains on data visualization, emphasizing the importance of understanding your data before applying machine learning algorithms. The sources introduce techniques for visualizing multi-class data, aiding in pattern recognition and insight generation.

    The softmax function is introduced as a crucial component in multi-class classification models. The sources explain its role in converting the model’s raw outputs (logits) into probabilities, enabling interpretation and decision-making based on these probabilities.
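    A small illustration of softmax turning logits into probabilities (the logit values are arbitrary):

    import torch

    logits = torch.tensor([2.0, 1.0, 0.1])  # raw model outputs for 3 classes
    probs = torch.softmax(logits, dim=0)    # convert logits to probabilities
    print(probs)           # tensor([0.6590, 0.2424, 0.0986])
    print(probs.sum())     # tensor(1.)
    print(probs.argmax())  # tensor(0) -> predicted class index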

    Pages 71-80 Summary: Evaluation Metrics, Saving/Loading Models, and Computer Vision Introduction

    This section explores various evaluation metrics for assessing the performance of classification models. They introduce metrics like accuracy, precision, recall, F1 score, confusion matrices, and classification reports. The sources explain the significance of each metric and how to interpret them in the context of evaluating model effectiveness.

    The sources then discuss the practical aspects of saving and loading trained models, highlighting the importance of preserving model progress and enabling future use without retraining.

    The focus shifts to computer vision, a field that enables computers to “see” and interpret images. They discuss the use of convolutional neural networks (CNNs) as specialized neural network architectures for image processing tasks.

    Pages 81-90 Summary: Computer Vision Libraries, Data Exploration, and Mini-Batching

    The sources introduce essential computer vision libraries in PyTorch, particularly highlighting torchvision. They explain the key components of torchvision, including datasets, model architectures, and image transformation tools.

    They guide you through exploring a computer vision dataset, emphasizing the importance of understanding data characteristics before model building. Techniques for visualizing images and examining data structure are presented.

    The concept of mini-batching is discussed as a crucial technique for efficiently training deep learning models on large datasets. The sources explain how mini-batching involves dividing the data into smaller batches, reducing memory requirements and improving training speed.

    Pages 91-100 Summary: Building a CNN, Training Steps, and Evaluation

    This section dives into the practical aspects of building a CNN for image classification. They guide you through defining the model’s architecture, including convolutional layers (nn.Conv2d), pooling layers, activation functions, and a final linear layer for classification.

    The familiar training loop process is revisited, outlining the steps involved in training the CNN model:

    1. Forward pass of data through the model
    2. Calculation of the loss function
    3. Backpropagation to compute gradients
    4. Optimization to update model parameters

    The sources emphasize the importance of monitoring the training process by visualizing loss curves and calculating evaluation metrics like accuracy and loss. They provide practical code examples for implementing these steps and evaluating the model’s performance on a test dataset.

    Pages 101-110 Summary: Troubleshooting, Non-Linear Activation Functions, and Model Building

    The sources provide practical advice for troubleshooting common errors in PyTorch code, encouraging the use of the data explorer’s motto: visualize, visualize, visualize. The importance of checking tensor shapes, understanding error messages, and referring to the PyTorch documentation is highlighted. They recommend searching for specific errors online, utilizing resources like Stack Overflow, and if all else fails, asking questions on the course’s GitHub discussions page.

    The concept of non-linear activation functions is introduced as a crucial element in building effective neural networks. These functions, such as ReLU, introduce non-linearity into the model, enabling it to learn complex, non-linear patterns in the data. The sources emphasize the importance of combining linear and non-linear functions within a neural network to achieve powerful learning capabilities.

    Building upon this concept, the sources guide you through the process of constructing a more complex classification model incorporating non-linear activation functions. They demonstrate the step-by-step implementation, highlighting the use of ReLU and its impact on the model’s ability to capture intricate relationships within the data.

    Pages 111-120 Summary: Data Augmentation, Model Evaluation, and Performance Improvement

    The sources introduce data augmentation as a powerful technique for artificially increasing the diversity and size of training data, leading to improved model performance. They demonstrate various data augmentation methods, including random cropping, flipping, and color adjustments, emphasizing the role of torchvision.transforms in implementing these techniques. The TrivialAugment technique is highlighted as a particularly effective and efficient data augmentation strategy.

    The sources reinforce the importance of model evaluation and explore advanced techniques for assessing the performance of classification models. They introduce metrics beyond accuracy, including precision, recall, F1-score, and confusion matrices. The use of torchmetrics and other libraries for calculating these metrics is demonstrated.

    The sources discuss strategies for improving model performance, focusing on optimizing training speed and efficiency. They introduce concepts like mixed precision training and highlight the potential benefits of using TPUs (Tensor Processing Units) for accelerated deep learning tasks.

    Pages 121-130 Summary: CNN Hyperparameters, Custom Datasets, and Image Loading

    The sources provide a deeper exploration of CNN hyperparameters, focusing on kernel size, stride, and padding. They utilize the CNN Explainer website as a valuable resource for visualizing and understanding the impact of these hyperparameters on the convolutional operations within a CNN. They guide you through calculating output shapes based on these hyperparameters, emphasizing the importance of understanding the transformations applied to the input data as it passes through the network’s layers.

    The concept of custom datasets is introduced, moving beyond the use of pre-built datasets like FashionMNIST. The sources outline the process of creating a custom dataset using PyTorch’s Dataset class, enabling you to work with your own data sources. They highlight the importance of structuring your data appropriately for use with PyTorch’s data loading utilities.

    They demonstrate techniques for loading images using PyTorch, leveraging libraries like PIL (Python Imaging Library) and showcasing the steps involved in reading image data, converting it into tensors, and preparing it for use in a deep learning model.

    Pages 131-140 Summary: Building a Custom Dataset, Data Visualization, and Data Augmentation

    The sources guide you step-by-step through the process of building a custom dataset in PyTorch, specifically focusing on creating a food image classification dataset called FoodVision Mini. They cover techniques for organizing image data, creating class labels, and implementing a custom dataset class that inherits from PyTorch’s Dataset class.

    They emphasize the importance of data visualization throughout the process, demonstrating how to visually inspect images, verify labels, and gain insights into the dataset’s characteristics. They provide code examples for plotting random images from the custom dataset, enabling visual confirmation of data loading and preprocessing steps.

    The sources revisit data augmentation in the context of custom datasets, highlighting its role in improving model generalization and robustness. They demonstrate the application of various data augmentation techniques using torchvision.transforms to artificially expand the training dataset and introduce variations in the images.

    Pages 141-150 Summary: Training and Evaluation with a Custom Dataset, Transfer Learning, and Advanced Topics

    The sources guide you through the process of training and evaluating a deep learning model using your custom dataset (FoodVision Mini). They cover the steps involved in setting up data loaders, defining a model architecture, implementing a training loop, and evaluating the model’s performance using appropriate metrics. They emphasize the importance of monitoring training progress through visualization techniques like loss curves and exploring the model’s predictions on test data.

    The sources introduce transfer learning as a powerful technique for leveraging pre-trained models to improve performance on a new task, especially when working with limited data. They explain the concept of using a model trained on a large dataset (like ImageNet) as a starting point and fine-tuning it on your custom dataset to achieve better results.

    The sources provide an overview of advanced topics in PyTorch deep learning, including:

    • Model experiment tracking: Tools and techniques for managing and tracking multiple deep learning experiments, enabling efficient comparison and analysis of model variations.
    • PyTorch paper replicating: Replicating research papers using PyTorch, a valuable approach for understanding cutting-edge deep learning techniques and applying them to your own projects.
    • PyTorch workflow debugging: Strategies for debugging and troubleshooting issues that may arise during the development and training of deep learning models in PyTorch.

    These advanced topics provide a glimpse into the broader landscape of deep learning research and development using PyTorch, encouraging further exploration and experimentation beyond the foundational concepts covered in the previous sections.

    Pages 151-160 Summary: Custom Datasets, Data Exploration, and the FoodVision Mini Dataset

    The sources emphasize the importance of custom datasets when working with data that doesn’t fit into pre-existing structures like FashionMNIST. They highlight the different domain libraries available in PyTorch for handling specific types of data, including:

    • Torchvision: for image data
    • Torchtext: for text data
    • Torchaudio: for audio data
    • Torchrec: for recommendation systems data

    Each of these libraries has a datasets module that provides tools for loading and working with data from that domain. Additionally, the sources mention Torchdata, which is a more general-purpose data loading library that is still under development.

    The sources guide you through the process of creating a custom image dataset called FoodVision Mini, based on the larger Food101 dataset. They provide detailed instructions for:

    1. Obtaining the Food101 data: This involves downloading the dataset from its original source.
    2. Structuring the data: The sources recommend organizing the data in a specific folder structure, where each subfolder represents a class label and contains images belonging to that class.
    3. Exploring the data: The sources emphasize the importance of becoming familiar with the data through visualization and exploration. This can help you identify potential issues with the data and gain insights into its characteristics.

    They introduce the concept of becoming one with the data, spending significant time understanding its structure, format, and nuances before diving into model building. This echoes the data explorer’s motto: visualize, visualize, visualize.

    The sources provide practical advice for exploring the dataset, including walking through directories and visualizing images to confirm the organization and content of the data. They introduce a helper function called walk_through_dir that allows you to systematically traverse the dataset’s folder structure and gather information about the number of directories and images within each class.
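
One plausible implementation of such a helper, built on Python's os.walk (a sketch, not necessarily the book's exact code):

    import os

    def walk_through_dir(dir_path):
        # Walk through dir_path, printing how many subdirectories and files live in each folder.
        for dirpath, dirnames, filenames in os.walk(dir_path):
            print(f"There are {len(dirnames)} directories and {len(filenames)} images in '{dirpath}'.")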

    Pages 161-170 Summary: Creating a Custom Dataset Class and Loading Images

    The sources continue the process of building the FoodVision Mini custom dataset, guiding you through creating a custom dataset class using PyTorch’s Dataset class. They outline the essential components and functionalities of such a class:

    1. Initialization (__init__): This method sets up the dataset’s attributes, including the target directory containing the data and any necessary transformations to be applied to the images.
    2. Length (__len__): This method returns the total number of samples in the dataset, providing a way to iterate through the entire dataset.
    3. Item retrieval (__getitem__): This method retrieves a specific sample (image and label) from the dataset based on its index, enabling access to individual data points during training.

    The sources demonstrate how to load images using the PIL (Python Imaging Library) and convert them into tensors, a format suitable for PyTorch deep learning models. They provide a detailed implementation of the load_image function, which takes an image path as input and returns a PIL image object. This function is then utilized within the __getitem__ method to load and preprocess images on demand.

    They highlight the steps involved in creating a class-to-index mapping, associating each class label with a numerical index, a requirement for training classification models in PyTorch. This mapping is generated by scanning the target directory and extracting the class names from the subfolder names.
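
Putting these pieces together, a custom dataset class along the lines described might look like this sketch (the class name, file pattern, and helper are illustrative; the book's exact implementation may differ):

    import os
    import pathlib
    from PIL import Image
    from torch.utils.data import Dataset

    def load_image(image_path):
        # Takes an image path and returns a PIL image object.
        return Image.open(image_path)

    class ImageFolderCustom(Dataset):
        def __init__(self, targ_dir, transform=None):
            # Assumes one subfolder per class, each containing .jpg images.
            self.paths = list(pathlib.Path(targ_dir).glob("*/*.jpg"))
            self.transform = transform
            classes = sorted(entry.name for entry in os.scandir(targ_dir) if entry.is_dir())
            self.class_to_idx = {name: i for i, name in enumerate(classes)}

        def __len__(self):
            return len(self.paths)  # total number of samples

        def __getitem__(self, index):
            img = load_image(self.paths[index])
            class_idx = self.class_to_idx[self.paths[index].parent.name]  # folder name is the label
            if self.transform:
                img = self.transform(img)
            return img, class_idx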

    Pages 171-180 Summary: Data Visualization, Data Augmentation Techniques, and Implementing Transformations

    The sources reinforce the importance of data visualization as an integral part of building a custom dataset. They provide code examples for creating a function that displays random images from the dataset along with their corresponding labels. This visual inspection helps ensure that the images are loaded correctly, the labels are accurate, and the data is appropriately preprocessed.

    They further explore data augmentation techniques, highlighting their significance in enhancing model performance and generalization. They demonstrate the implementation of various augmentation methods, including random horizontal flipping, random cropping, and color jittering, using torchvision.transforms. These augmentations introduce variations in the training images, artificially expanding the dataset and helping the model learn more robust features.

    The sources introduce the TrivialAugment technique, a data augmentation strategy that leverages randomness to apply a series of transformations to images, promoting diversity in the training data. They provide code examples for implementing TrivialAugment using torchvision.transforms and showcase its impact on the visual appearance of the images. They suggest experimenting with different augmentation strategies and visualizing their effects to understand their impact on the dataset.

    Pages 181-190 Summary: Building a TinyVGG Model and Evaluating its Performance

    The sources guide you through building a TinyVGG model architecture, a simplified version of the VGG convolutional neural network architecture. They demonstrate the step-by-step implementation of the model’s layers, including convolutional layers, ReLU activation functions, and max-pooling layers, using torch.nn modules. They use the CNN Explainer website as a visual reference for the TinyVGG architecture and encourage exploration of this resource to gain a deeper understanding of the model’s structure and operations.
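
A sketch of a TinyVGG-style architecture (layer counts and padding choices here are illustrative; consult the CNN Explainer website for the reference configuration):

    from torch import nn

    class TinyVGG(nn.Module):
        def __init__(self, input_channels, hidden_units, output_classes):
            super().__init__()
            self.block_1 = nn.Sequential(
                nn.Conv2d(input_channels, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2),   # halves height and width
            )
            self.block_2 = nn.Sequential(
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2),
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(hidden_units * 16 * 16, output_classes),  # assumes 64x64 inputs: 64 -> 32 -> 16
            )

        def forward(self, x):
            return self.classifier(self.block_2(self.block_1(x)))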

    The sources introduce the torchinfo package, a helpful tool for summarizing the structure and parameters of a PyTorch model. They demonstrate its usage for the TinyVGG model, providing a clear representation of the input and output shapes of each layer, the number of parameters in each layer, and the overall model size. This information helps in verifying the model’s architecture and understanding its computational complexity.
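
Using torchinfo is essentially a one-liner (reusing the TinyVGG sketch above; the input size is illustrative):

    from torchinfo import summary

    model = TinyVGG(input_channels=3, hidden_units=10, output_classes=3)
    summary(model, input_size=(1, 3, 64, 64))  # (batch, channels, height, width)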

    They walk through the process of evaluating the TinyVGG model’s performance on the FoodVision Mini dataset, covering the steps involved in setting up data loaders, defining a training loop, and calculating metrics like loss and accuracy. They emphasize the importance of monitoring training progress through visualization techniques like loss curves, plotting the loss value over epochs to observe the model’s learning trajectory and identify potential issues like overfitting.

    Pages 191-200 Summary: Implementing Training and Testing Steps, and Setting Up a Training Loop

    The sources guide you through the implementation of separate functions for the training step and testing step of the model training process. These functions encapsulate the logic for processing a single batch of data during training and testing, respectively.

    The train_step function, as described in the sources, performs the following actions:

    1. Forward pass: Passes the input batch through the model to obtain predictions.
    2. Loss calculation: Computes the loss between the predictions and the ground truth labels.
    3. Backpropagation: Calculates the gradients of the loss with respect to the model’s parameters.
    4. Optimizer step: Updates the model’s parameters based on the calculated gradients to minimize the loss.

    The test_step function is similar to the training step, but it omits the backpropagation and optimizer step since the goal during testing is to evaluate the model’s performance on unseen data without updating its parameters.
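
A sketch of the two functions as described (the signatures are illustrative):

    import torch

    def train_step(model, dataloader, loss_fn, optimizer, device):
        model.train()
        train_loss = 0
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            y_pred = model(X)                  # 1. Forward pass
            loss = loss_fn(y_pred, y)          # 2. Loss calculation
            train_loss += loss.item()
            optimizer.zero_grad()
            loss.backward()                    # 3. Backpropagation
            optimizer.step()                   # 4. Optimizer step
        return train_loss / len(dataloader)

    def test_step(model, dataloader, loss_fn, device):
        model.eval()
        test_loss = 0
        with torch.inference_mode():           # no backpropagation or parameter updates
            for X, y in dataloader:
                X, y = X.to(device), y.to(device)
                test_loss += loss_fn(model(X), y).item()
        return test_loss / len(dataloader)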

    The sources then demonstrate how to integrate these functions into a training loop. This loop iterates over the specified number of epochs, processing the training data in batches. For each epoch, the loop performs the following steps:

    1. Training phase: Calls the train_step function for each batch of training data, updating the model’s parameters.
    2. Testing phase: Calls the test_step function for each batch of testing data, evaluating the model’s performance on unseen data.

    The sources emphasize the importance of monitoring training progress by tracking metrics like loss and accuracy during both the training and testing phases. This allows you to observe how well the model is learning and identify potential issues like overfitting.

    Pages 201-210 Summary: Visualizing Model Predictions and Exploring the Concept of Transfer Learning

    The sources emphasize the value of visualizing the model’s predictions to gain insights into its performance and identify potential areas for improvement. They guide you through the process of making predictions on a set of test images and displaying the images along with their predicted and actual labels. This visual assessment helps you understand how well the model is generalizing to unseen data and can reveal patterns in the model’s errors.

    They introduce the concept of transfer learning, a powerful technique in deep learning where you leverage knowledge gained from training a model on a large dataset to improve the performance of a model on a different but related task. The sources suggest exploring the torchvision.models module, which provides a collection of pre-trained models for various computer vision tasks. They highlight that these pre-trained models can be used as a starting point for your own models, either by fine-tuning the entire model or using parts of it as feature extractors.

    They provide an overview of how to load pre-trained models from the torchvision.models module and modify their architecture to suit your specific task. The sources encourage experimentation with different pre-trained models and fine-tuning strategies to achieve optimal performance on your custom dataset.

    Pages 211-310 Summary: Fine-Tuning a Pre-trained ResNet Model, Multi-Class Classification, and Exploring Binary vs. Multi-Class Problems

    The sources shift focus to fine-tuning a pre-trained ResNet model for the FoodVision Mini dataset. They highlight the advantages of using a pre-trained model, such as faster training and potentially better performance due to leveraging knowledge learned from a larger dataset. The sources guide you through:

    1. Loading a pre-trained ResNet model: They show how to use the torchvision.models module to load a pre-trained ResNet model, such as ResNet18 or ResNet34.
    2. Modifying the final fully connected layer: To adapt the model to the FoodVision Mini dataset, the sources demonstrate how to change the output size of the final fully connected layer to match the number of classes in the dataset (3 in this case).
    3. Freezing the initial layers: The sources discuss the strategy of freezing the weights of the initial layers of the pre-trained model to preserve the learned features from the larger dataset. This helps prevent catastrophic forgetting, where the model loses its previously acquired knowledge during fine-tuning.
    4. Training the modified model: They provide instructions for training the fine-tuned model on the FoodVision Mini dataset, emphasizing the importance of monitoring training progress and evaluating the model’s performance.
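
Steps 1–3 might look like this sketch (assuming a recent torchvision version that exposes the weights argument):

    import torchvision
    from torch import nn

    # 1. Load a pre-trained ResNet18
    weights = torchvision.models.ResNet18_Weights.DEFAULT
    model = torchvision.models.resnet18(weights=weights)

    # 3. Freeze the pre-trained layers to preserve learned features
    for param in model.parameters():
        param.requires_grad = False

    # 2. Replace the final fully connected layer for 3 classes
    #    (newly created layers are trainable by default)
    model.fc = nn.Linear(model.fc.in_features, 3)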

    The sources transition to discussing multi-class classification, explaining the distinction between binary classification (predicting between two classes) and multi-class classification (predicting among more than two classes). They provide examples of both types of classification problems:

    • Binary Classification: Identifying email as spam or not spam, classifying images as containing a cat or a dog.
    • Multi-class Classification: Categorizing images of different types of food, assigning topics to news articles, predicting the sentiment of a text review.

    They introduce the ImageNet dataset, a large-scale dataset for image classification with 1000 object classes, as an example of a multi-class classification problem. They highlight the use of the softmax activation function for multi-class classification, explaining its role in converting the model’s raw output (logits) into probability scores for each class.
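
For example, turning raw logits into class probabilities:

    import torch

    logits = torch.tensor([[2.0, 0.5, -1.0]])   # raw model outputs for 3 classes (illustrative values)
    probs = torch.softmax(logits, dim=1)        # convert logits to a probability distribution
    print(probs.sum())                          # tensor(1.) -- probabilities sum to 1
    print(probs.argmax(dim=1))                  # index of the most likely class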

    The sources guide you through building a neural network for a multi-class classification problem using PyTorch. They illustrate:

    1. Creating a multi-class dataset: They use the sklearn.datasets.make_blobs function to generate a synthetic dataset with multiple classes for demonstration purposes.
    2. Visualizing the dataset: The sources emphasize the importance of visualizing the dataset to understand its structure and distribution of classes.
    3. Building a neural network model: They walk through the steps of defining a neural network model with multiple layers and activation functions using torch.nn modules.
    4. Choosing a loss function: For multi-class classification, they introduce the cross-entropy loss function and explain its suitability for this type of problem.
    5. Setting up an optimizer: They discuss the use of optimizers, such as stochastic gradient descent (SGD), for updating the model’s parameters during training.
    6. Training the model: The sources provide instructions for training the multi-class classification model, highlighting the importance of monitoring training progress and evaluating the model’s performance.
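
A compact sketch covering steps 1, 3, 4, and 5 above (the dataset sizes and hyperparameters are illustrative):

    import torch
    from torch import nn
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import train_test_split

    # 1. Create a synthetic multi-class dataset
    X, y = make_blobs(n_samples=1000, n_features=2, centers=4, random_state=42)
    X = torch.from_numpy(X).type(torch.float)
    y = torch.from_numpy(y).type(torch.LongTensor)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # 3. Build a model, 4. choose a loss function, 5. set up an optimizer
    model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 4))  # 2 features in, 4 classes out
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)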

    Pages 311-410 Summary: Building a Robust Training Loop, Working with Nonlinearities, and Performing Model Sanity Checks

    The sources guide you through building a more robust training loop for the multi-class classification problem, incorporating best practices like using a validation set for monitoring overfitting. They provide a detailed code implementation of the training loop, highlighting the key steps:

    1. Iterating over epochs: The loop iterates over a specified number of epochs, processing the training data in batches.
    2. Forward pass: For each batch, the input data is passed through the model to obtain predictions.
    3. Loss calculation: The loss between the predictions and the target labels is computed using the chosen loss function.
    4. Backward pass: The gradients of the loss with respect to the model’s parameters are calculated through backpropagation.
    5. Optimizer step: The optimizer updates the model’s parameters based on the calculated gradients.
    6. Validation: After each epoch, the model’s performance is evaluated on a separate validation set to monitor overfitting.
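
The validation step at the end of each epoch typically looks like this sketch (model, loss_fn, and val_dataloader are assumed to exist):

    import torch

    model.eval()                           # evaluation mode (affects dropout, batch norm, etc.)
    val_loss = 0
    with torch.inference_mode():           # no gradient tracking needed for validation
        for X, y in val_dataloader:
            val_loss += loss_fn(model(X), y).item()
    val_loss /= len(val_dataloader)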

    The sources introduce the concept of nonlinearities in neural networks and explain the importance of activation functions in introducing non-linearity to the model. They discuss various activation functions, such as:

    • ReLU (Rectified Linear Unit): A popular activation function that sets negative values to zero and leaves positive values unchanged.
    • Sigmoid: An activation function that squashes the input values between 0 and 1, commonly used for binary classification problems.
    • Softmax: An activation function used for multi-class classification, producing a probability distribution over the different classes.

    They demonstrate how to incorporate these activation functions into the model architecture and explain their impact on the model’s ability to learn complex patterns in the data.

    The sources stress the importance of performing model sanity checks to verify that the model is functioning correctly and learning as expected. They suggest techniques like:

    1. Testing on a simpler problem: Before training on the full dataset, the sources recommend testing the model on a simpler problem with known solutions to ensure that the model’s architecture and implementation are sound.
    2. Visualizing model predictions: Comparing the model’s predictions to the ground truth labels can help identify potential issues with the model’s learning process.
    3. Checking the loss function: Monitoring the loss value during training can provide insights into how well the model is optimizing its parameters.

    Pages 411-510 Summary: Exploring Multi-class Classification Metrics and Deep Diving into Convolutional Neural Networks

    The sources explore a range of multi-class classification metrics beyond accuracy, emphasizing that different metrics provide different perspectives on the model’s performance. They introduce:

    • Precision: A measure of the proportion of correctly predicted positive cases out of all positive predictions.
    • Recall: A measure of the proportion of correctly predicted positive cases out of all actual positive cases.
    • F1-score: A harmonic mean of precision and recall, providing a balanced measure of the model’s performance.
    • Confusion matrix: A visualization tool that shows the counts of true positive, true negative, false positive, and false negative predictions, providing a detailed breakdown of the model’s performance across different classes.

    They guide you through implementing these metrics using PyTorch and visualizing the confusion matrix to gain insights into the model’s strengths and weaknesses.
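
For instance, a confusion matrix via torchmetrics (assuming a recent version with the task argument; the values are illustrative):

    import torch
    from torchmetrics import ConfusionMatrix

    confmat = ConfusionMatrix(task="multiclass", num_classes=3)
    preds = torch.tensor([0, 2, 1, 1, 0])
    target = torch.tensor([0, 1, 1, 1, 2])
    print(confmat(preds, target))   # rows = true class, columns = predicted class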

    The sources transition to discussing convolutional neural networks (CNNs), a specialized type of neural network architecture well-suited for image classification tasks. They provide an in-depth explanation of the key components of a CNN, including:

    1. Convolutional layers: Layers that apply convolution operations to the input image, extracting features at different spatial scales.
    2. Activation functions: Functions like ReLU that introduce non-linearity to the model, enabling it to learn complex patterns.
    3. Pooling layers: Layers that downsample the feature maps, reducing the computational complexity and increasing the model’s robustness to variations in the input.
    4. Fully connected layers: Layers that connect all the features extracted by the convolutional and pooling layers, performing the final classification.

    They provide a visual explanation of the convolution operation, using the CNN Explainer website as a reference to illustrate how filters are applied to the input image to extract features. They discuss important hyperparameters of convolutional layers, such as:

    • Kernel size: The size of the filter used for the convolution operation.
    • Stride: The step size used to move the filter across the input image.
    • Padding: The technique of adding extra pixels around the borders of the input image to control the output size of the convolutional layer.

    Pages 511-610 Summary: Building a CNN Model from Scratch and Understanding Convolutional Layers

    The sources provide a step-by-step guide to building a CNN model from scratch using PyTorch for the FoodVision Mini dataset. They walk through the process of defining the model architecture, including specifying the convolutional layers, activation functions, pooling layers, and fully connected layers. They emphasize the importance of carefully designing the model architecture to suit the specific characteristics of the dataset and the task at hand. They recommend starting with a simpler architecture and gradually increasing the model’s complexity if needed.

    They delve deeper into understanding convolutional layers, explaining how they work and their role in extracting features from images. They illustrate:

    1. Filters: Convolutional layers use filters (also known as kernels) to scan the input image, detecting patterns like edges, corners, and textures.
    2. Feature maps: The output of a convolutional layer is a set of feature maps, each representing the presence of a particular feature in the input image.
    3. Hyperparameters: They revisit the importance of hyperparameters like kernel size, stride, and padding in controlling the output size and feature extraction capabilities of convolutional layers.

    The sources guide you through experimenting with different hyperparameter settings for the convolutional layers, emphasizing the importance of understanding how these choices affect the model’s performance. They recommend using visualization techniques, such as displaying the feature maps generated by different convolutional layers, to gain insights into how the model is learning features from the data.

    The sources emphasize the iterative nature of the model development process, where you experiment with different architectures, hyperparameters, and training strategies to optimize the model’s performance. They recommend keeping track of the different experiments and their results to identify the most effective approaches.

    Pages 611-710 Summary: Understanding CNN Building Blocks, Implementing Max Pooling, and Building a TinyVGG Model

    The sources guide you through a deeper understanding of the fundamental building blocks of a convolutional neural network (CNN) for image classification. They highlight the importance of:

    • Convolutional Layers: These layers extract features from input images using learnable filters. They discuss the interplay of hyperparameters like kernel size, stride, and padding, emphasizing their role in shaping the output feature maps and controlling the network’s receptive field.
    • Activation Functions: Introducing non-linearity into the network is crucial for learning complex patterns. They revisit popular activation functions like ReLU (Rectified Linear Unit), which helps prevent vanishing gradients and speeds up training.
    • Pooling Layers: Pooling layers downsample feature maps, making the network more robust to variations in the input image while reducing computational complexity. They explain the concept of max pooling, where the maximum value within a pooling window is selected, preserving the most prominent features.

    The sources provide a detailed code implementation for max pooling using PyTorch’s torch.nn.MaxPool2d module, demonstrating how to apply it to the output of convolutional layers. They showcase how to calculate the output dimensions of the pooling layer based on the input size, stride, and pooling kernel size.
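
For example (the tensor shapes are illustrative):

    import torch
    from torch import nn

    max_pool = nn.MaxPool2d(kernel_size=2)   # 2x2 window; stride defaults to the kernel size
    x = torch.randn(1, 10, 32, 32)           # (batch, channels, height, width)
    print(max_pool(x).shape)                 # torch.Size([1, 10, 16, 16]) -- spatial dims halved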

    Building on these foundational concepts, the sources guide you through the construction of a TinyVGG model, a simplified version of the popular VGG architecture known for its effectiveness in image classification tasks. They demonstrate how to define the network architecture using PyTorch, stacking convolutional layers, activation functions, and pooling layers to create a deep and hierarchical representation of the input image. They emphasize the importance of designing the network structure based on principles like increasing the number of filters in deeper layers to capture more complex features.

    The sources highlight the role of flattening the output of the convolutional layers before feeding it into fully connected layers, transforming the multi-dimensional feature maps into a one-dimensional vector. This transformation prepares the extracted features for the final classification task. They emphasize the importance of aligning the output size of the flattening operation with the input size of the subsequent fully connected layer.
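
A quick sketch of that alignment (the feature-map shape is illustrative):

    import torch
    from torch import nn

    feature_maps = torch.randn(1, 10, 13, 13)   # (batch, channels, height, width)
    flat = nn.Flatten()(feature_maps)
    print(flat.shape)                           # torch.Size([1, 1690]) -- 10 * 13 * 13
    # The following nn.Linear layer must therefore use in_features=1690.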

    Pages 711-810 Summary: Training a TinyVGG Model, Addressing Overfitting, and Evaluating the Model

    The sources guide you through training the TinyVGG model on the FoodVision Mini dataset, emphasizing the importance of structuring the training process for optimal performance. They showcase a training loop that incorporates:

    • Data Loading: Using DataLoader from PyTorch to efficiently load and batch training data, shuffling the samples in each epoch to prevent the model from learning spurious patterns from the data order.
    • Device Agnostic Code: Writing code that can seamlessly switch between CPU and GPU devices for training and inference, making the code more flexible and adaptable to different hardware setups.
    • Forward Pass: Passing the input data through the model to obtain predictions, applying the softmax function to the output logits to obtain probabilities for each class.
    • Loss Calculation: Computing the loss between the model’s predictions and the ground truth labels using a suitable loss function, typically cross-entropy loss for multi-class classification tasks.
    • Backward Pass: Calculating gradients of the loss with respect to the model’s parameters using backpropagation, highlighting the importance of understanding this fundamental algorithm that allows neural networks to learn from data.
    • Optimization: Updating the model’s parameters using an optimizer like stochastic gradient descent (SGD) to minimize the loss and improve the model’s ability to make accurate predictions.
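
The data-loading and device-agnostic pieces might look like this sketch (train_dataset, test_dataset, and model are assumed to exist; the batch size is illustrative):

    import torch
    from torch.utils.data import DataLoader

    device = "cuda" if torch.cuda.is_available() else "cpu"   # device-agnostic setup

    train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)   # reshuffle each epoch
    test_dataloader = DataLoader(test_dataset, batch_size=32, shuffle=False)

    model = model.to(device)   # move the model to the target device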

    The sources emphasize the importance of monitoring the training process to ensure the model is learning effectively and generalizing well to unseen data. They guide you through tracking metrics like training loss and accuracy across epochs, visualizing them to identify potential issues like overfitting, where the model performs well on the training data but struggles to generalize to new data.

    The sources address the problem of overfitting, suggesting techniques like:

    • Data Augmentation: Artificially increasing the diversity of the training data by applying random transformations to the images, such as rotations, flips, and color adjustments, making the model more robust to variations in the input.
    • Dropout: Randomly deactivating a proportion of neurons during training, forcing the network to learn more robust and generalizable features.

    The sources showcase how to implement these techniques in PyTorch, highlighting the importance of finding the right balance between overfitting and underfitting, where the model is too simple to capture the patterns in the data.
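
As a sketch, dropout is typically added as a layer inside the model (the placement and probability here are illustrative):

    from torch import nn

    classifier = nn.Sequential(
        nn.Flatten(),
        nn.Dropout(p=0.2),        # randomly zero 20% of activations, during training only
        nn.Linear(1690, 3),       # in_features must match the flattened size
    )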

    The sources guide you through evaluating the trained model on the test set, measuring its performance using metrics like accuracy, precision, recall, and the F1-score. They emphasize the importance of using a separate test set, unseen during training, to assess the model’s ability to generalize to new data. They showcase how to generate a confusion matrix to visualize the model’s performance across different classes, identifying which classes the model struggles with the most.

    The sources provide insights into analyzing the confusion matrix to gain a deeper understanding of the model’s strengths and weaknesses, informing further improvements and refinements. They emphasize that evaluating a model is not merely about reporting a single accuracy score, but rather a multifaceted process of understanding its behavior and limitations.

The main topic of the book, based on the provided excerpts, is deep learning with PyTorch. The book appears to function as a comprehensive course, designed to guide readers from foundational concepts to practical implementation, ultimately empowering them to build their own deep learning models.

• The book begins by introducing fundamental concepts:
  • Machine Learning (ML) and Deep Learning (DL): The book establishes a clear understanding of these core concepts, explaining that DL is a subset of ML. [1-3] It emphasizes that DL is particularly well-suited for tasks involving complex patterns in large datasets. [1, 2]
  • PyTorch: The book highlights PyTorch as a popular and powerful framework for deep learning. [4, 5] It emphasizes the practical, hands-on nature of the course, encouraging readers to “see things happen” rather than getting bogged down in theoretical definitions. [1, 3, 6]
  • Tensors: The book underscores the role of tensors as the fundamental building blocks of data in deep learning, explaining how they represent data numerically for processing within neural networks. [5, 7, 8]
• The book then transitions into the PyTorch workflow, outlining the key steps involved in building and training deep learning models:
  • Preparing and Loading Data: The book emphasizes the critical importance of data preparation, [9] highlighting techniques for loading, splitting, and visualizing data. [10-17]
  • Building Models: The book guides readers through the process of constructing neural network models in PyTorch, introducing key modules like torch.nn. [18-22] It covers essential concepts like:
    • Sub-classing nn.Module to define custom models [20]
    • Implementing the forward method to define the flow of data through the network [21, 22]
  • Training Models: The book details the training process, explaining:
    • Loss Functions: These measure how well the model is performing, guiding the optimization process. [23, 24]
    • Optimizers: These update the model’s parameters based on the calculated gradients, aiming to minimize the loss and improve accuracy. [25, 26]
    • Training Loops: These iterate through the data, performing forward and backward passes to update the model’s parameters. [26-29]
    • The Importance of Monitoring: The book stresses the need to track metrics like loss and accuracy during training to ensure the model is learning effectively and to diagnose issues like overfitting. [30-32]
  • Evaluating Models: The book explains techniques for evaluating the performance of trained models on a separate test set, unseen during training. [15, 30, 33] It introduces metrics like accuracy, precision, recall, and the F1-score to assess model performance. [34, 35]
  • Saving and Loading Models: The book provides instructions on how to save trained models and load them for later use, preserving the model’s learned parameters. [36-39]
• Beyond the foundational workflow, the book explores specific applications of deep learning:
  • Classification: The book dedicates significant attention to classification problems, which involve categorizing data into predefined classes. [40-42] It covers:
    • Binary Classification: Distinguishing between two classes (e.g., spam or not spam) [41, 43]
    • Multi-Class Classification: Categorizing into more than two classes (e.g., different types of images) [41, 43]
  • Computer Vision: The book dives into the world of computer vision, which focuses on enabling computers to “see” and interpret images. [44, 45] It introduces:
    • Convolutional Neural Networks (CNNs): Specialized architectures designed to effectively process image data. [44-46]
    • Torchvision: PyTorch’s library specifically designed for computer vision tasks. [47]
• Throughout the book, there is a strong emphasis on practical implementation, with:
  • Coding Examples: The book uses plentiful code snippets to illustrate concepts and techniques, encouraging readers to experiment and learn by doing. [1, 3, 6, 48, 49]
  • Exercises: The book provides exercises at the end of each section to reinforce learning and challenge readers to apply their knowledge. [50-64]
  • Troubleshooting Tips: The book offers practical advice on how to debug code, interpret errors, and find solutions to common problems. [65-78]
• The book encourages continuous learning beyond the course material:
  • Extracurricular Resources: The book consistently points readers to additional resources, such as the official PyTorch documentation, research papers, and blog posts, to deepen their understanding and explore more advanced topics. [7, 8, 19, 35, 37, 39, 55, 57-59, 61, 63, 64, 71, 79-94]
• The book strongly advocates for a visual and experimental approach to deep learning:
  • Visualization: The book frequently encourages visualizing data and model outputs to gain insights and understand complex concepts. [14, 16, 17, 95-101]
  • Experimentation: The book encourages readers to actively experiment with code, adjust parameters, and observe the effects to develop a deeper understanding. [32, 51, 53, 65, 102, 103]

    In essence, the book provides a guided journey into the world of deep learning with PyTorch. It doesn’t just explain concepts—it empowers readers to build, train, and evaluate their own deep learning models. It does this by combining clear explanations, engaging code examples, practical exercises, and insightful references to supplementary resources, creating a comprehensive and stimulating learning experience.

    Python’s Role in Deep Learning and Machine Learning with PyTorch

    The sources strongly emphasize the use of Python as the primary programming language for working with PyTorch, a popular deep learning framework. While the sources focus on teaching PyTorch itself, they highlight Python’s integral role in making deep learning accessible and efficient. Here’s a discussion of Python’s significance, drawing upon insights from the sources:

    • Python’s User-Friendliness: The sources mention that PyTorch “allows you to write fast deep learning code in Python” [1], implying that Python’s readability and ease of use make it suitable for expressing complex deep learning concepts in code. This is particularly beneficial for beginners with some Python experience (3-6 months is suggested) [2]. Python’s clear syntax and extensive libraries simplify the process of translating mathematical ideas into working deep learning models.
    • PyTorch’s Pythonic Nature: The sources highlight how PyTorch seamlessly integrates with Python. Phrases like “beautifully interactive with pure Python” [3] and “testament to the Pythonic nature of PyTorch” [3] suggest that PyTorch leverages Python’s strengths, allowing developers to work with familiar Python constructs like functions and classes while building deep learning models. This tight integration enhances flexibility and allows for the incorporation of custom Python code directly into PyTorch workflows, as seen in the creation of custom accuracy functions [3].
    • Integration with the Scientific Python Ecosystem: The sources mention the interaction between PyTorch and NumPy, a core library in the scientific Python ecosystem. PyTorch provides methods for converting data between NumPy arrays and PyTorch tensors [4, 5]. This interoperability signifies that PyTorch doesn’t exist in isolation. It leverages the rich ecosystem of scientific computing tools available in Python, allowing developers to utilize familiar libraries for tasks like data manipulation, visualization, and analysis in conjunction with PyTorch’s deep learning capabilities.
    • Python as the Foundation: The sources consistently present code examples in Python, using Python syntax and constructs to demonstrate PyTorch concepts. This pervasive use of Python throughout the course material reinforces its position as the foundational language for interacting with PyTorch.

    The sources, while focusing specifically on PyTorch, make it evident that Python plays a critical role in enabling deep learning. Python’s user-friendliness, its seamless integration with PyTorch, and its connection to a broader ecosystem of scientific computing tools contribute to PyTorch’s popularity and make deep learning more accessible to a wider audience.

Note: It is worth noting that while Python is dominant in the deep learning landscape, other languages, such as C++ and R, are also used for deep learning. This information is not directly derived from the sources and may warrant independent verification.

Here is a detailed summary of the book, broken down into sets of 10 pages, based on the provided excerpts. Keep in mind that these excerpts do not represent the full text of the book, so this summary provides insights into the covered topics and the overall teaching style.

    Pages 1-10: Laying the Foundation: Machine Learning, Deep Learning, and PyTorch

    • Introduction (Pages 1-2): These pages set the stage for the course, highlighting the importance of hands-on learning and experimentation with PyTorch. They emphasize the availability of course materials on GitHub and through the online book version at learnpytorch.io. It is also stated that the book may contain more content than is covered in the video transcript.
    • Understanding Deep Learning (Pages 3-6): The book provides a concise overview of machine learning (ML) and deep learning (DL), emphasizing DL’s ability to handle complex patterns in large datasets. It suggests focusing on practical implementation rather than dwelling on detailed definitions, as these can be easily accessed online. The importance of considering simpler, rule-based solutions before resorting to ML is also stressed.
    • Embracing Self-Learning (Pages 6-7): The book encourages active learning by suggesting readers explore topics like deep learning and neural networks independently, utilizing resources such as Wikipedia and specific YouTube channels like 3Blue1Brown. It stresses the value of forming your own understanding by consulting multiple sources and synthesizing information.
    • Introducing PyTorch (Pages 8-10): PyTorch is introduced as a prominent deep learning framework, particularly popular in research. Its Pythonic nature is highlighted, making it efficient for writing deep learning code. The book directs readers to the official PyTorch documentation as a primary resource for exploring the framework’s capabilities.

    Pages 11-20: PyTorch Fundamentals: Tensors, Operations, and More

    • Getting Specific (Pages 11-12): The book emphasizes a hands-on approach, encouraging readers to explore concepts like tensors through online searches and coding experimentation. It highlights the importance of asking questions and actively engaging with the material rather than passively following along. The inclusion of exercises at the end of each module is mentioned to reinforce understanding.
    • Learning Through Doing (Pages 12-14): The book emphasizes the importance of active learning through:
    • Asking questions of yourself, the code, the community, and online resources.
    • Completing the exercises provided to test knowledge and solidify understanding.
    • Sharing your work to reinforce learning and contribute to the community.
    • Avoiding Overthinking (Page 13): A key piece of advice is to avoid getting overwhelmed by the complexity of the subject. Starting with a clear understanding of the fundamentals and building upon them gradually is encouraged.
    • Course Resources (Pages 14-17): The book reiterates the availability of course materials:
    • GitHub repository: Containing code and other resources.
    • GitHub discussions: A platform for asking questions and engaging with the community.
    • learnpytorch.io: The online book version of the course.
    • Tensors in Action (Pages 17-20): The book dives into PyTorch tensors, explaining their creation using torch.tensor and referencing the official documentation for further exploration. It demonstrates basic tensor operations, emphasizing that writing code and interacting with tensors is the best way to grasp their functionality. The use of the torch.arange function is introduced to create tensors with specific ranges and step sizes.
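
For example, creating tensors with specific ranges and step sizes:

    import torch

    print(torch.arange(0, 10))                      # tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    print(torch.arange(start=0, end=100, step=25))  # tensor([ 0, 25, 50, 75])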

    Pages 21-30: Understanding PyTorch’s Data Loading and Workflow

• Tensor Manipulation and Stacking (Pages 21-22): The book covers tensor manipulation techniques, including permuting dimensions (e.g., rearranging color channels, height, and width in an image tensor). The torch.stack function is introduced to concatenate tensors along a new dimension. The concept of a pseudo-random number generator and the role of a random seed are briefly touched upon, referencing the PyTorch documentation for a deeper understanding. A short sketch of permute and stack appears at the end of this section.
    • Running Tensors on Devices (Pages 22-23): The book mentions the concept of running PyTorch tensors on different devices, such as CPUs and GPUs, although the details of this are not provided in the excerpts.
    • Exercises and Extra Curriculum (Pages 23-27): The importance of practicing concepts through exercises is highlighted, and the book encourages readers to refer to the PyTorch documentation for deeper understanding. It provides guidance on how to approach exercises using Google Colab alongside the book material. The book also points out the availability of solution templates and a dedicated folder for exercise solutions.
    • PyTorch Workflow in Action (Pages 28-31): The book begins exploring a complete PyTorch workflow, emphasizing a code-driven approach with explanations interwoven as needed. A six-step workflow is outlined:
    1. Data preparation and loading
    2. Building a machine learning/deep learning model
    3. Fitting the model to data
    4. Making predictions
    5. Evaluating the model
    6. Saving and loading the model
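
Touching back on the tensor manipulation bullet at the top of this section, here is the promised sketch of permute and stack (the shapes are illustrative):

    import torch

    image = torch.randn(224, 224, 3)             # (height, width, color channels)
    image_chw = image.permute(2, 0, 1)           # rearrange to (channels, height, width)
    print(image_chw.shape)                       # torch.Size([3, 224, 224])

    stacked = torch.stack([image, image, image], dim=0)  # concatenate along a new dimension
    print(stacked.shape)                         # torch.Size([3, 224, 224, 3])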

    Pages 31-40: Data Preparation, Linear Regression, and Visualization

    • The Two Parts of Machine Learning (Pages 31-33): The book breaks down machine learning into two fundamental parts:
    • Representing Data Numerically: Converting data into a format suitable for models to process.
    • Building a Model to Learn Patterns: Training a model to identify relationships within the numerical representation.
    • Linear Regression Example (Pages 33-35): The book uses a linear regression example (y = a + bx) to illustrate the relationship between data and model parameters. It encourages a hands-on approach by coding the formula, emphasizing that coding helps solidify understanding compared to simply reading formulas.
    • Visualizing Data (Pages 35-40): The book underscores the importance of data visualization using Matplotlib, adhering to the “visualize, visualize, visualize” motto. It provides code for plotting data, highlighting the use of scatter plots and the importance of consulting the Matplotlib documentation for detailed information on plotting functions. It guides readers through the process of creating plots, setting figure sizes, plotting training and test data, and customizing plot elements like colors, markers, and labels.
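
A sketch of the kind of setup and plot described (the weight, bias, and split values are illustrative):

    import torch
    import matplotlib.pyplot as plt

    weight, bias = 0.7, 0.3                       # "ideal" parameters the model should learn
    X = torch.arange(0, 1, 0.02).unsqueeze(dim=1)
    y = weight * X + bias                         # linear regression: y = a + bx

    split = int(0.8 * len(X))                     # 80/20 train/test split
    X_train, y_train = X[:split], y[:split]
    X_test, y_test = X[split:], y[split:]

    plt.figure(figsize=(10, 7))
    plt.scatter(X_train, y_train, c="b", s=4, label="Training data")
    plt.scatter(X_test, y_test, c="g", s=4, label="Testing data")
    plt.legend()
    plt.show()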

    Pages 41-50: Model Building Essentials and Inference

    • Color-Coding and PyTorch Modules (Pages 41-42): The book uses color-coding in the online version to enhance visual clarity. It also highlights essential PyTorch modules for data preparation, model building, optimization, evaluation, and experimentation, directing readers to the learnpytorch.io book and the PyTorch documentation.
    • Model Predictions (Pages 42-43): The book emphasizes the process of making predictions using a trained model, noting the expectation that an ideal model would accurately predict output values based on input data. It introduces the concept of “inference mode,” which can enhance code performance during prediction. A Twitter thread and a blog post on PyTorch’s inference mode are referenced for further exploration.
    • Understanding Loss Functions (Pages 44-47): The book dives into loss functions, emphasizing their role in measuring the discrepancy between a model’s predictions and the ideal outputs. It clarifies that loss functions can also be referred to as cost functions or criteria in different contexts. A table in the book outlines various loss functions in PyTorch, providing common values and links to documentation. The concept of Mean Absolute Error (MAE) and the L1 loss function are introduced, with encouragement to explore other loss functions in the documentation.
    • Understanding Optimizers and Hyperparameters (Pages 48-50): The book explains optimizers, which adjust model parameters based on the calculated loss, with the goal of minimizing the loss over time. The distinction between parameters (values set by the model) and hyperparameters (values set by the data scientist) is made. The learning rate, a crucial hyperparameter controlling the step size of the optimizer, is introduced. The process of minimizing loss within a training loop is outlined, emphasizing the iterative nature of adjusting weights and biases.
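
In code, the MAE loss and an optimizer with its learning-rate hyperparameter look like this sketch (assuming a model is already defined; the learning rate is illustrative):

    import torch
    from torch import nn

    loss_fn = nn.L1Loss()                                      # Mean Absolute Error (MAE)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # lr is a hyperparameter you choose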

    Pages 51-60: Training Loops, Saving Models, and Recap

    • Putting It All Together: The Training Loop (Pages 51-53): The book assembles the previously discussed concepts into a training loop, demonstrating the iterative process of updating a model’s parameters over multiple epochs. It shows how to track and print loss values during training, illustrating the gradual reduction of loss as the model learns. The convergence of weights and biases towards ideal values is shown as a sign of successful training.
• Saving and Loading Models (Pages 53-56): The book explains the process of saving trained models, preserving learned parameters for later use. The concept of a “state dict,” a Python dictionary mapping layers to their parameter tensors, is introduced. The use of torch.save and torch.load for saving and loading models is demonstrated. The book also references the PyTorch documentation for more detailed information on saving and loading models. A minimal saving-and-loading sketch appears at the end of this section.
    • Wrapping Up the Fundamentals (Pages 57-60): The book concludes the section on PyTorch workflow fundamentals, reiterating the key steps:
    • Getting data ready
    • Converting data to tensors
    • Building or selecting a model
    • Choosing a loss function and an optimizer
    • Training the model
    • Evaluating the model
    • Saving and loading the model
    • Exercises and Resources (Pages 57-60): The book provides exercises focused on the concepts covered in the section, encouraging readers to practice implementing a linear regression model from scratch. A variety of extracurricular resources are listed, including links to articles on gradient descent, backpropagation, loading and saving models, a PyTorch cheat sheet, and the unofficial PyTorch optimization loop song. The book directs readers to the extras folder in the GitHub repository for exercise templates and solutions.
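
The saving-and-loading sketch referenced above (the path is illustrative, and LinearRegressionModel stands in for whatever nn.Module subclass was trained):

    from pathlib import Path
    import torch

    MODEL_PATH = Path("models/01_workflow_model.pth")
    MODEL_PATH.parent.mkdir(parents=True, exist_ok=True)

    torch.save(obj=model.state_dict(), f=MODEL_PATH)        # save only the learned parameters

    loaded_model = LinearRegressionModel()                  # re-create the model architecture first
    loaded_model.load_state_dict(torch.load(f=MODEL_PATH))  # then load the saved state dict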

    This breakdown of the first 60 pages, based on the excerpts provided, reveals the book’s structured and engaging approach to teaching deep learning with PyTorch. It balances conceptual explanations with hands-on coding examples, exercises, and references to external resources. The book emphasizes experimentation and active learning, encouraging readers to move beyond passive reading and truly grasp the material by interacting with code and exploring concepts independently.

    Note: Please keep in mind that this summary only covers the content found within the provided excerpts, which may not represent the entirety of the book.

    Pages 61-70: Multi-Class Classification and Building a Neural Network

    • Multi-Class Classification (Pages 61-63): The book introduces multi-class classification, where a model predicts one out of multiple possible classes. It shifts from the linear regression example to a new task involving a data set with four distinct classes. It also highlights the use of one-hot encoding to represent categorical data numerically, and emphasizes the importance of understanding the problem domain and using appropriate data representations for a given task.
• Preparing Data (Pages 63-64): The sources demonstrate the creation of a multi-class data set. The book uses scikit-learn’s make_blobs function (sklearn.datasets.make_blobs) to generate synthetic data points representing four classes, each with its own color. It emphasizes the importance of visualizing the generated data and confirming that it aligns with the desired structure. The train_test_split function is used to divide the data into training and testing sets.
    • Building a Neural Network (Pages 64-66): The book starts building a neural network model using PyTorch’s nn.Module class, showing how to define layers and connect them in a sequential manner. It provides a step-by-step explanation of the process:
    1. Initialization: Defining the model class with layers and computations.
    2. Input Layer: Specifying the number of features for the input layer based on the data set.
    3. Hidden Layers: Creating hidden layers and determining their input and output sizes.
    4. Output Layer: Defining the output layer with a size corresponding to the number of classes.
    5. Forward Method: Implementing the forward pass, where data flows through the network.
    • Matching Shapes (Pages 67-70): The book emphasizes the crucial concept of shape compatibility between layers. It shows how to calculate output shapes based on input shapes and layer parameters. It explains that input shapes must align with the expected shapes of subsequent layers to ensure smooth data flow. The book also underscores the importance of code experimentation to confirm shape alignment. The sources specifically focus on checking that the output shape of the network matches the shape of the target values (y) for training.
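
A sketch combining the five construction steps and the shape check (the class name and layer sizes are illustrative):

    import torch
    from torch import nn

    class BlobModel(nn.Module):
        def __init__(self, input_features=2, output_features=4, hidden_units=8):
            super().__init__()
            self.linear_layer_stack = nn.Sequential(
                nn.Linear(input_features, hidden_units),    # input layer
                nn.Linear(hidden_units, hidden_units),      # hidden layer
                nn.Linear(hidden_units, output_features),   # output layer: one logit per class
            )

        def forward(self, x):
            return self.linear_layer_stack(x)

    model = BlobModel()
    print(model(torch.randn(5, 2)).shape)   # torch.Size([5, 4]) -- matches (batch, number of classes)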

    Pages 71-80: Loss Functions and Activation Functions

• Revisiting Loss Functions (Pages 71-73): The book revisits loss functions, now in the context of multi-class classification. It highlights that the choice of loss function depends on the specific problem type. The Mean Absolute Error (MAE), used for regression in previous examples, is not suitable for classification. Instead, the book introduces cross-entropy loss (nn.CrossEntropyLoss), emphasizing its suitability for classification tasks with multiple classes. It also mentions BCEWithLogitsLoss, a common loss function for binary classification problems.
    • The Role of Activation Functions (Pages 74-76): The book raises the concept of activation functions, hinting at their significance in model performance. The sources state that combining multiple linear layers in a neural network doesn’t increase model capacity because a series of linear transformations is still ultimately linear. This suggests that linear models might be limited in capturing complex, non-linear relationships in data.
    • Visualizing Limitations (Pages 76-78): The sources introduce the “Data Explorer’s Motto”: “Visualize, visualize, visualize!” This highlights the importance of visualization for understanding both data and model behavior. The book provides a visualization demonstrating the limitations of a linear model, showing its inability to accurately classify data with non-linear boundaries.
    • Exploring Nonlinearities (Pages 78-80): The sources pose the question, “What patterns could you draw if you were given an infinite amount of straight and non-straight lines?” This prompts readers to consider the expressive power of combining linear and non-linear components. The book then encourages exploring non-linear activation functions within the PyTorch documentation, specifically referencing torch.nn, and suggests trying to identify an activation function that has already been used in the examples. This interactive approach pushes learners to actively seek out information and connect concepts.

    Pages 81-90: Building and Training with Non-Linearity

    • Introducing ReLU (Pages 81-83): The sources emphasize the crucial role of non-linearity in neural network models, introducing the Rectified Linear Unit (ReLU) as a commonly used non-linear activation function. The book describes ReLU as a “magic piece of the puzzle,” highlighting its ability to add non-linearity to the model and enable the learning of more complex patterns. The sources again emphasize the importance of trying to draw various patterns using a combination of straight and curved lines to gain intuition about the impact of non-linearity.
    • Building with ReLU (Pages 83-87): The book guides readers through modifying the neural network model by adding ReLU activation functions between the existing linear layers. The placement of ReLU functions within the model architecture is shown. The sources suggest experimenting with the TensorFlow Playground, a web-based tool for visualizing neural networks, to recreate the model and observe the effects of ReLU on data separation.
    • Training the Enhanced Model (Pages 87-90): The book outlines the training process for the new model, utilizing familiar steps such as creating a loss function (BCEWithLogitsLoss in this case), setting up an optimizer (torch.optim.Adam), and defining training and evaluation loops. It demonstrates how to pass data through the model, calculate the loss, perform backpropagation, and update model parameters. The sources emphasize that even though the code structure is familiar, learners should strive to understand the underlying mechanisms and how they contribute to model training. It also suggests considering how the training code could be further optimized and modularized into functions for reusability.
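
A sketch of the modified model and training setup described (the layer sizes and learning rate are illustrative):

    import torch
    from torch import nn

    class CircleModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.layer_1 = nn.Linear(2, 10)
            self.layer_2 = nn.Linear(10, 10)
            self.layer_3 = nn.Linear(10, 1)
            self.relu = nn.ReLU()   # non-linear activation placed between the linear layers

        def forward(self, x):
            return self.layer_3(self.relu(self.layer_2(self.relu(self.layer_1(x)))))

    model = CircleModel()
    loss_fn = nn.BCEWithLogitsLoss()                            # binary classification loss on raw logits
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)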

    It’s important to remember that this information is based on the provided excerpts, and the book likely covers these topics and concepts in more depth. The book’s interactive approach, focusing on experimentation, code interaction, and visualization, encourages active engagement with the material, urging readers to explore, question, and discover rather than passively follow along.

    Continuing with Non-Linearity and Multi-Class Classification

    • Visualizing Non-Linearity (Pages 91-94): The sources emphasize the importance of visualizing the model’s performance after incorporating the ReLU activation function. They use a custom plotting function, plot_decision_boundary, to visually assess the model’s ability to separate the circular data. The visualization reveals a significant improvement compared to the linear model, demonstrating that ReLU enables the model to learn non-linear decision boundaries and achieve a better separation of the classes.
    • Pushing for Improvement (Pages 94-96): Even though the non-linear model shows improvement, the sources encourage continued experimentation to achieve even better performance. They challenge readers to improve the model’s accuracy on the test data to over 80%. This encourages an iterative approach to model development, where experimentation, analysis, and refinement are key. The sources suggest potential strategies, such as:
    • Adding more layers to the network
    • Increasing the number of hidden units
    • Training for a greater number of epochs
    • Adjusting the learning rate of the optimizer
    • Multi-Class Classification Revisited (Pages 96-99): The sources return to multi-class classification, moving beyond the binary classification example of the circular data. They introduce a new "blob" dataset (stored in variables such as X_blob), consisting of data points belonging to three distinct classes. This shift introduces additional challenges in model building and training, requiring adjustments to the model architecture, loss function, and evaluation metrics.
    • Data Preparation and Model Building (Pages 99-102): The sources guide readers through preparing the blob dataset for training, using familiar steps such as splitting the data into training and testing sets and creating data loaders (sketched just below). The book emphasizes the importance of understanding the dataset's characteristics, such as the number of classes, and adjusting the model architecture accordingly. It also encourages experimentation with different model architectures, specifically referencing PyTorch's torch.nn module, to find an appropriate model for the task. The TensorFlow Playground is again suggested as a tool for visualizing and experimenting with neural network architectures.
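
    As a rough sketch of that data preparation, the blob dataset can be generated with scikit-learn's make_blobs and converted to PyTorch tensors. The parameter values (sample count, cluster spread, seed) are illustrative assumptions.

        import torch
        from sklearn.datasets import make_blobs
        from sklearn.model_selection import train_test_split

        NUM_CLASSES = 3   # three distinct classes, per the text
        NUM_FEATURES = 2

        # Generate blob data (NumPy arrays), then convert to tensors
        X_blob, y_blob = make_blobs(n_samples=1000,
                                    n_features=NUM_FEATURES,
                                    centers=NUM_CLASSES,
                                    cluster_std=1.5,      # illustrative spread
                                    random_state=42)
        X_blob = torch.from_numpy(X_blob).type(torch.float)
        y_blob = torch.from_numpy(y_blob).type(torch.LongTensor)  # integer labels for CrossEntropyLoss

        X_train, X_test, y_train, y_test = train_test_split(
            X_blob, y_blob, test_size=0.2, random_state=42)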

    The sources repeatedly emphasize the iterative and experimental nature of machine learning and deep learning, urging learners to actively engage with the code, explore different options, and visualize results to gain a deeper understanding of the concepts. This hands-on approach fosters a mindset of continuous learning and improvement, crucial for success in these fields.

    Building and Training with Non-Linearity: Pages 103-113

    • The Power of Non-Linearity (Pages 103-105): The sources continue emphasizing the crucial role of non-linearity in neural networks, highlighting its ability to capture complex patterns in data. The book states that neural networks combine linear and non-linear functions to find patterns in data. It reiterates that linear functions alone are limited in their expressive power and that non-linear functions, like ReLU, enable models to learn intricate decision boundaries and achieve better separation of classes. The sources encourage readers to experiment with different non-linear activation functions and observe their impact on model performance, reinforcing the idea that experimentation is essential in machine learning.
    • Multi-Class Model with Non-Linearity (Pages 105-108): Building upon the previous exploration, the sources guide readers through constructing a multi-class classification model with a non-linear activation function. The book provides a step-by-step breakdown of the model architecture (a code sketch follows this list), including:
    1. Input Layer: Takes in features from the dataset, as before.
    2. Hidden Layers: Incorporate linear transformations using PyTorch’s nn.Linear layers, just like in previous models.
    3. ReLU Activation: Introduces ReLU activation functions between the linear layers, adding non-linearity to the model.
    4. Output Layer: Produces a set of raw output values, also known as logits, corresponding to the number of classes.
    • Prediction Probabilities (Pages 108-110): The sources explain that the raw output logits from the model need to be converted into probabilities to interpret the model’s predictions. They introduce the torch.softmax function, which transforms the logits into a probability distribution over the classes, indicating the likelihood of each class for a given input. The book emphasizes that understanding the relationship between logits, probabilities, and model predictions is crucial for evaluating and interpreting model outputs.
    • Training and Evaluation (Pages 110-111): The sources outline the training process for the multi-class model, utilizing familiar steps such as setting up a loss function (Cross-Entropy Loss is recommended for multi-class classification), defining an optimizer (torch.optim.SGD), creating training and testing loops, and evaluating the model’s performance using loss and accuracy metrics. The sources reiterate the importance of device-agnostic code, ensuring that the model and data reside on the same device (CPU or GPU) for seamless computation. They also encourage readers to experiment with different optimizers and hyperparameters, such as learning rate and batch size, to observe their effects on training dynamics and model performance.
    • Experimentation and Visualization (Pages 111-113): The sources strongly advocate for ongoing experimentation, urging readers to modify the model, adjust hyperparameters, and visualize results to gain insights into model behavior. They demonstrate how removing the ReLU activation function leads to a model with linear decision boundaries, resulting in a significant decrease in accuracy, highlighting the importance of non-linearity in capturing complex patterns. The sources also encourage readers to refer back to previous notebooks, experiment with different model architectures, and explore advanced visualization techniques to enhance their understanding of the concepts and improve model performance.
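
    A condensed sketch of the four-step architecture and the logits-to-probabilities conversion described above might look as follows; the class name, hidden-unit count, and learning rate are illustrative assumptions.

        import torch
        from torch import nn

        class BlobModel(nn.Module):
            def __init__(self, input_features=2, output_features=3, hidden_units=8):
                super().__init__()
                self.linear_layer_stack = nn.Sequential(
                    nn.Linear(input_features, hidden_units),   # hidden layer 1
                    nn.ReLU(),                                 # non-linearity
                    nn.Linear(hidden_units, hidden_units),     # hidden layer 2
                    nn.ReLU(),
                    nn.Linear(hidden_units, output_features),  # output layer -> raw logits
                )

            def forward(self, x):
                return self.linear_layer_stack(x)

        model = BlobModel()
        loss_fn = nn.CrossEntropyLoss()                         # multi-class loss
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

        # logits -> prediction probabilities -> predicted labels
        logits = model(torch.randn(5, 2))
        pred_probs = torch.softmax(logits, dim=1)  # each row sums to 1
        pred_labels = pred_probs.argmax(dim=1)     # most likely class per sample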

    The consistent theme across these sections is the value of active engagement and experimentation. The sources emphasize that learning in machine learning and deep learning is an iterative process. Readers are encouraged to question assumptions, try different approaches, visualize results, and continuously refine their models based on observations and experimentation. This hands-on approach is crucial for developing a deep understanding of the concepts and fostering the ability to apply these techniques to real-world problems.

    The Impact of Non-Linearity and Multi-Class Classification Challenges: Pages 113-116

    • Non-Linearity’s Impact on Model Performance: The sources examine the critical role non-linearity plays in a model’s ability to accurately classify data. They demonstrate this by training a model without the ReLU activation function, resulting in linear decision boundaries and significantly reduced accuracy. The visualizations provided highlight the stark difference between the model with ReLU and the one without, showcasing how non-linearity enables the model to capture the circular patterns in the data and achieve better separation between classes [1]. This emphasizes the importance of understanding how different activation functions contribute to a model’s capacity to learn complex relationships within data.
    • Understanding the Data and Model Relationship (Pages 115-116): The sources remind us that evaluating a model is as crucial as building one. They highlight the importance of becoming one with the data, both at the beginning and after training a model, to gain a deeper understanding of its behavior and performance. Analyzing the model’s predictions on the data helps identify potential issues, such as overfitting or underfitting, and guides further experimentation and refinement [2].
    • Key Takeaways: The sources reinforce several key concepts and best practices in machine learning and deep learning:
    • Visualize, Visualize, Visualize: Visualizing data and model predictions is crucial for understanding patterns, identifying potential issues, and guiding model development.
    • Experiment, Experiment, Experiment: Trying different approaches, adjusting hyperparameters, and iteratively refining models based on observations is essential for achieving optimal performance.
    • The Data Scientist’s/Machine Learning Practitioner’s Motto: Experimentation is at the heart of successful machine learning, encouraging continuous learning and improvement.
    • Steps in Modeling with PyTorch: The sources repeatedly reinforce a structured workflow for building and training models in PyTorch, emphasizing the importance of following a methodical approach to ensure consistency and reproducibility.

    The sources conclude this section by directing readers to a set of exercises and extra curriculum designed to solidify their understanding of non-linearity, multi-class classification, and the steps involved in building, training, and evaluating models in PyTorch. These resources provide valuable opportunities for hands-on practice and further exploration of the concepts covered. They also serve as a reminder that learning in these fields is an ongoing process that requires continuous engagement, experimentation, and a willingness to iterate and refine models based on observations and analysis [3].

    Continuing the Computer Vision Workflow: Pages 116-129

    • Introducing Computer Vision and CNNs: The sources introduce a new module focusing on computer vision and convolutional neural networks (CNNs). They acknowledge the excitement surrounding this topic and emphasize its importance as a core concept within deep learning. The sources also provide clear instructions on how to access help and resources if learners encounter challenges during the module, encouraging active engagement and a problem-solving mindset. They reiterate the motto of “if in doubt, run the code,” highlighting the value of practical experimentation. They also point to available resources, including the PyTorch Deep Learning repository, specific notebooks, and a dedicated discussions tab for questions and answers.
    • Understanding Custom Datasets: The sources explain the concept of custom datasets, recognizing that while pre-built datasets like FashionMNIST are valuable for learning, real-world applications often involve working with unique data. They acknowledge the potential need for custom data loading solutions when existing libraries don't provide the necessary functionality. The sources introduce the idea of creating a custom PyTorch dataset class by subclassing torch.utils.data.Dataset and implementing specific methods to handle data loading and preparation tailored to the unique requirements of the custom dataset (a minimal skeleton of such a class appears after this list).
    • Building a Baseline Model (Pages 118-120): The sources guide readers through building a baseline computer vision model using PyTorch. They emphasize the importance of understanding the input and output shapes to ensure the model is appropriately configured for the task. The sources also introduce the concept of creating a dummy forward pass to check the model’s functionality and verify the alignment of input and output dimensions.
    • Training the Baseline Model (Pages 120-125): The sources step through the process of training the baseline computer vision model. They provide a comprehensive breakdown of the code, including the use of a progress bar for tracking training progress. The steps highlighted include:
    1. Setting up the training loop: Iterating through epochs and batches of data
    2. Performing the forward pass: Passing data through the model to obtain predictions
    3. Calculating the loss: Measuring the difference between predictions and ground truth labels
    4. Backpropagation: Calculating gradients to update model parameters
    5. Updating model parameters: Using the optimizer to adjust weights based on calculated gradients
    • Evaluating Model Performance (Pages 126-128): The sources stress the importance of comprehensive evaluation, going beyond simple loss and accuracy metrics. They introduce techniques like plotting loss curves to visualize training dynamics and gain insights into model behavior. The sources also emphasize the value of experimentation, encouraging readers to explore the impact of different devices (CPU vs. GPU) on training time and performance.
    • Improving Through Experimentation: The sources encourage ongoing experimentation to improve model performance. They introduce the idea of building a better model with non-linearity, suggesting the inclusion of activation functions like ReLU. They challenge readers to try building such a model and experiment with different configurations to observe their impact on results.
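
    As a minimal illustration of subclassing torch.utils.data.Dataset (mentioned above), the skeleton below shows the three methods a custom dataset typically implements. The attribute names and file handling are illustrative assumptions, not the book's exact implementation.

        import torch
        from torch.utils.data import Dataset
        from PIL import Image

        class ImageDatasetCustom(Dataset):
            def __init__(self, paths, labels, transform=None):
                self.paths = paths          # list of image file paths (hypothetical)
                self.labels = labels        # list of integer class labels
                self.transform = transform  # optional torchvision transforms

            def __len__(self):
                # Total number of samples in the dataset
                return len(self.paths)

            def __getitem__(self, index):
                # Load and return one (image, label) pair
                image = Image.open(self.paths[index]).convert("RGB")
                if self.transform:
                    image = self.transform(image)
                return image, self.labels[index]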

    The sources maintain their consistent focus on hands-on learning, guiding readers through each step of building, training, and evaluating computer vision models using PyTorch. They emphasize the importance of understanding the underlying concepts while actively engaging with the code, trying different approaches, and visualizing results to gain deeper insights and build practical experience.

    Functionizing Code for Efficiency and Readability: Pages 129-139

    • The Benefits of Functionizing Training and Evaluation Loops: The sources introduce the concept of functionizing code, specifically focusing on training and evaluation (testing) loops in PyTorch. They explain that writing reusable functions for these repetitive tasks brings several advantages:
    • Improved code organization and readability: Breaking down complex processes into smaller, modular functions enhances the overall structure and clarity of the code. This makes it easier to understand, maintain, and modify in the future.
    • Reduced errors: Encapsulating common operations within functions helps prevent inconsistencies and errors that can arise from repeatedly writing similar code blocks.
    • Increased efficiency: Reusable functions streamline the development process by eliminating the need to rewrite the same code for different models or datasets.
    • Creating the train_step Function (Pages 130-132): The sources guide readers through creating a function called train_step that encapsulates the logic of a single training step within a PyTorch training loop. The function takes several arguments:
    • model: The PyTorch model to be trained
    • data_loader: The data loader providing batches of training data
    • loss_function: The loss function used to calculate the training loss
    • optimizer: The optimizer responsible for updating model parameters
    • accuracy_function: A function for calculating the accuracy of the model’s predictions
    • device: The device (CPU or GPU) on which to perform the computations
    • The train_step function performs the following steps for each batch of training data:
    1. Sets the model to training mode using model.train()
    2. Sends the input data and labels to the specified device
    3. Performs the forward pass by passing the data through the model
    4. Calculates the loss using the provided loss function
    5. Performs backpropagation to calculate gradients
    6. Updates model parameters using the optimizer
    7. Calculates and accumulates the training loss and accuracy for the batch
    • Creating the test_step Function (Pages 132-136): The sources proceed to create a function called test_step that performs a single evaluation step on a batch of testing data. This function follows a similar structure to train_step, but with key differences:
    • It sets the model to evaluation mode using model.eval() to disable certain behaviors, such as dropout, specific to training.
    • It utilizes the torch.inference_mode() context manager to potentially optimize computations for inference tasks, aiming for speed improvements.
    • It calculates and accumulates the testing loss and accuracy for the batch without updating the model’s parameters.
    • Combining train_step and test_step into a train Function (Pages 137-139): The sources combine the functionality of train_step and test_step into a single function called train, which orchestrates the entire training and evaluation process over a specified number of epochs. The train function takes arguments similar to train_step and test_step, including the number of epochs to train for. It iterates through the specified epochs, calling train_step for each batch of training data and test_step for each batch of testing data. It tracks and prints the training and testing loss and accuracy for each epoch, providing a clear view of the model’s progress during training.
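
    Pulling the description together, here is a condensed sketch of the three functions. The signatures and the accuracy_fn(y_true, y_pred) helper are simplified assumptions based on the text, not the book's verbatim code.

        import torch

        def train_step(model, data_loader, loss_fn, optimizer, accuracy_fn, device):
            model.train()                                # training mode
            train_loss, train_acc = 0, 0
            for X, y in data_loader:
                X, y = X.to(device), y.to(device)        # send data to target device
                y_pred = model(X)                        # forward pass
                loss = loss_fn(y_pred, y)                # calculate loss
                train_loss += loss.item()
                train_acc += accuracy_fn(y, y_pred.argmax(dim=1))
                optimizer.zero_grad()                    # reset gradients
                loss.backward()                          # backpropagation
                optimizer.step()                         # update parameters
            # average the accumulated metrics over all batches
            return train_loss / len(data_loader), train_acc / len(data_loader)

        def test_step(model, data_loader, loss_fn, accuracy_fn, device):
            model.eval()                                 # disable training-only behaviour (e.g. dropout)
            test_loss, test_acc = 0, 0
            with torch.inference_mode():                 # no gradient tracking during evaluation
                for X, y in data_loader:
                    X, y = X.to(device), y.to(device)
                    y_pred = model(X)
                    test_loss += loss_fn(y_pred, y).item()
                    test_acc += accuracy_fn(y, y_pred.argmax(dim=1))
            return test_loss / len(data_loader), test_acc / len(data_loader)

        def train(model, train_loader, test_loader, loss_fn, optimizer, accuracy_fn, device, epochs):
            for epoch in range(epochs):
                tl, ta = train_step(model, train_loader, loss_fn, optimizer, accuracy_fn, device)
                vl, va = test_step(model, test_loader, loss_fn, accuracy_fn, device)
                print(f"Epoch {epoch} | train loss {tl:.4f}, acc {ta:.2f} | "
                      f"test loss {vl:.4f}, acc {va:.2f}")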

    By encapsulating the training and evaluation logic into these functions, the sources demonstrate best practices in PyTorch code development, emphasizing modularity, readability, and efficiency. This approach makes it easier to experiment with different models, datasets, and hyperparameters while maintaining a structured and manageable codebase.

    Leveraging Functions for Model Training and Evaluation: Pages 139-148

    • Training Model 1 Using the train Function: The sources demonstrate how to use the newly created train function to train the model_1 that was built earlier. They highlight that only a few lines of code are needed to initiate the training process, showcasing the efficiency gained from functionization.
    • Examining Training Results and Performance Comparison: The sources emphasize the importance of carefully examining the training results, particularly the training and testing loss curves. They point out that while model_1 achieves good results, the baseline model_0 appears to perform slightly better. This observation prompts a discussion on potential reasons for the difference in performance, including the possibility that the simpler baseline model might be better suited for the dataset or that further experimentation and hyperparameter tuning might be needed for model_1 to surpass model_0. The sources also highlight the impact of using a GPU for computations, showing that training on a GPU generally leads to faster training times compared to using a CPU.
    • Creating a Results Dictionary to Track Experiments: The sources introduce the concept of creating a dictionary to store the results of different experiments. This organized approach allows for easy comparison and analysis of model performance across various configurations and hyperparameter settings. They emphasize the importance of such systematic tracking, especially when exploring multiple models and variations, to gain insights into the factors influencing performance and make informed decisions about model selection and improvement.
    • Visualizing Loss Curves for Model Analysis: The sources encourage visualizing the loss curves using a function called plot_loss_curves (a minimal sketch of such a function follows this list). They stress the value of visual representations in understanding the training dynamics and identifying potential issues like overfitting or underfitting. By plotting the training and testing losses over epochs, it becomes easier to assess whether the model is learning effectively and generalizing well to unseen data. The sources present different scenarios for loss curves, including:
    • Underfitting: The training loss remains high, indicating that the model is not capturing the patterns in the data effectively.
    • Overfitting: The training loss decreases significantly, but the testing loss increases, suggesting that the model is memorizing the training data and failing to generalize to new examples.
    • Good Fit: Both the training and testing losses decrease and converge, indicating that the model is learning effectively and generalizing well to unseen data.
    • Addressing Overfitting and Introducing Data Augmentation: The sources acknowledge overfitting as a common challenge in machine learning and introduce data augmentation as one technique to mitigate it. Data augmentation involves creating variations of existing training data by applying transformations like random rotations, flips, or crops. This expands the effective size of the training set, potentially improving the model’s ability to generalize to new data. They acknowledge that while data augmentation may not always lead to significant improvements, it remains a valuable tool in the machine learning practitioner’s toolkit, especially when dealing with limited datasets or complex models prone to overfitting.
    • Building and Training a CNN Model: The sources shift focus towards building a convolutional neural network (CNN) using PyTorch. They guide readers through constructing a CNN architecture, referencing the TinyVGG model from the CNN Explainer website as a starting point. The process involves stacking convolutional layers, activation functions (ReLU), and pooling layers to create a network capable of learning features from images effectively. They emphasize the importance of choosing appropriate hyperparameters, such as the number of filters, kernel size, and padding, and understanding their influence on the model’s capacity and performance.
    • Creating Functions for Training and Evaluation with Custom Datasets: The sources revisit the concept of functionization, this time adapting the train_step and test_step functions to work with custom datasets. They highlight the importance of writing reusable and adaptable code that can handle various data formats and scenarios.
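
    A minimal sketch of a plot_loss_curves-style helper, assuming the training results are collected into a dictionary of per-epoch lists (the key names are illustrative):

        import matplotlib.pyplot as plt

        def plot_loss_curves(results):
            # results, e.g.: {"train_loss": [...], "test_loss": [...],
            #                 "train_acc": [...], "test_acc": [...]}
            epochs = range(len(results["train_loss"]))

            plt.figure(figsize=(12, 5))

            plt.subplot(1, 2, 1)                     # loss curves
            plt.plot(epochs, results["train_loss"], label="train loss")
            plt.plot(epochs, results["test_loss"], label="test loss")
            plt.title("Loss")
            plt.xlabel("Epochs")
            plt.legend()

            plt.subplot(1, 2, 2)                     # accuracy curves
            plt.plot(epochs, results["train_acc"], label="train accuracy")
            plt.plot(epochs, results["test_acc"], label="test accuracy")
            plt.title("Accuracy")
            plt.xlabel("Epochs")
            plt.legend()
            plt.show()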

    The sources continue to guide learners through a comprehensive workflow for building, training, and evaluating models in PyTorch, introducing advanced concepts and techniques along the way. They maintain their focus on practical application, encouraging hands-on experimentation, visualization, and analysis to deepen understanding and foster mastery of the tools and concepts involved in machine learning and deep learning.

    Training and Evaluating Models with Custom Datasets: Pages 171-187

    • Building the TinyVGG Architecture: The sources guide the creation of a CNN model based on the TinyVGG architecture. The model consists of convolutional layers, ReLU activation functions, and max-pooling layers arranged in a specific pattern to extract features from images effectively. The sources highlight the importance of understanding the role of each layer and how they work together to process image data. They also recommend the blog post "Making deep learning go brrr from first principles" as extra reading for deeper insight into the principles behind deep learning models.
    • Adapting Training and Evaluation Functions for Custom Datasets: The sources revisit the train_step and test_step functions, modifying them to accommodate custom datasets. They emphasize the need for flexibility in code, enabling it to handle different data formats and structures. The changes involve ensuring the data is loaded and processed correctly for the specific dataset used.
    • Creating a train Function for Custom Dataset Training: The sources combine the train_step and test_step functions within a new train function specifically designed for custom datasets. This function orchestrates the entire training and evaluation process, looping through epochs, calling the appropriate step functions for each batch of data, and tracking the model’s performance.
    • Training and Evaluating the Model: The sources demonstrate the process of training the TinyVGG model on the custom food image dataset using the newly created train function. They emphasize the importance of setting random seeds for reproducibility, ensuring consistent results across different runs.
    • Analyzing Loss Curves and Accuracy Trends: The sources analyze the training results, focusing on the loss curves and accuracy trends. They point out that the model exhibits good performance, with the loss decreasing and the accuracy increasing over epochs. They also highlight the potential for further improvement by training for a longer duration.
    • Exploring Different Loss Curve Scenarios: The sources discuss different types of loss curves, including:
    • Underfitting: The training loss remains high, indicating the model isn’t effectively capturing the data patterns.
    • Overfitting: The training loss decreases substantially, but the testing loss increases, signifying the model is memorizing the training data and failing to generalize to new examples.
    • Good Fit: Both training and testing losses decrease and converge, demonstrating that the model is learning effectively and generalizing well.
    • Addressing Overfitting with Data Augmentation: The sources introduce data augmentation as a technique to combat overfitting. Data augmentation creates variations of the training data through transformations like rotations, flips, and crops (an example transform pipeline is sketched after this list). This approach effectively expands the training dataset, potentially improving the model's generalization abilities. They acknowledge that while data augmentation might not always yield significant enhancements, it remains a valuable strategy, especially for smaller datasets or complex models prone to overfitting.
    • Building a Model with Data Augmentation: The sources demonstrate how to build a TinyVGG model incorporating data augmentation techniques. They explore the impact of data augmentation on model performance.
    • Visualizing Results and Evaluating Performance: The sources advocate for visualizing results to gain insights into model behavior. They encourage using techniques like plotting loss curves and creating confusion matrices to assess the model’s effectiveness.
    • Saving and Loading the Best Model: The sources highlight the importance of saving the best-performing model to preserve its state for future use. They demonstrate the process of saving and loading a PyTorch model.
    • Exercises and Extra Curriculum: The sources provide guidance on accessing exercises and supplementary materials, encouraging learners to further explore and solidify their understanding of custom datasets, data augmentation, and CNNs in PyTorch.
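
    As an illustration of the augmentation idea referenced above, a torchvision transform pipeline might combine the kinds of transforms mentioned (flips, rotations) with resizing and tensor conversion. The exact transforms and parameter values here are illustrative assumptions.

        from torchvision import transforms

        train_transform = transforms.Compose([
            transforms.Resize((64, 64)),
            transforms.RandomHorizontalFlip(p=0.5),  # random flip
            transforms.RandomRotation(degrees=15),   # random rotation
            transforms.ToTensor(),                   # PIL image -> float tensor in [0, 1]
        ])

        # The test data is usually left un-augmented so evaluation stays consistent
        test_transform = transforms.Compose([
            transforms.Resize((64, 64)),
            transforms.ToTensor(),
        ])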

    The sources provide a comprehensive walkthrough of building, training, and evaluating models with custom datasets in PyTorch, introducing and illustrating various concepts and techniques along the way. They underscore the value of practical application, experimentation, and analysis to enhance understanding and skill development in machine learning and deep learning.

    Continuing the Exploration of Custom Datasets and Data Augmentation

    • Building a Model with Data Augmentation: The sources guide the construction of a TinyVGG model incorporating data augmentation techniques to potentially improve its generalization ability and reduce overfitting. [1] They introduce data augmentation as a way to create variations of existing training data by applying transformations like random rotations, flips, or crops. [1] This increases the effective size of the training dataset and exposes the model to a wider range of input patterns, helping it learn more robust features.
    • Training the Model with Data Augmentation and Analyzing Results: The sources walk through the process of training the model with data augmentation and evaluating its performance. [2] They observe that, in this specific case, data augmentation doesn’t lead to substantial improvements in quantitative metrics. [2] The reasons for this could be that the baseline model might already be underfitting, or the specific augmentations used might not be optimal for the dataset. They emphasize that experimenting with different augmentations and hyperparameters is crucial to determine the most effective strategies for a given problem.
    • Visualizing Loss Curves and Emphasizing the Importance of Evaluation: The sources stress the importance of visualizing results, especially loss curves, to understand the training dynamics and identify potential issues like overfitting or underfitting. [2] They recommend using the plot_loss_curves function to visually compare the training and testing losses across epochs. [2]
    • Providing Access to Exercises and Extra Curriculum: The sources conclude by directing learners to the resources available for practicing the concepts covered, including an exercise template notebook and example solutions. [3] They encourage readers to attempt the exercises independently and use the example solutions as a reference only after making a genuine effort. [3] The exercises focus on building a CNN model for image classification, highlighting the steps involved in data loading, model creation, training, and evaluation. [3]
    • Concluding the Section on Custom Datasets and Looking Ahead: The sources wrap up the section on working with custom datasets and using data augmentation techniques. [4] They point out that learners have now covered a significant portion of the course material and gained valuable experience in building, training, and evaluating PyTorch models for image classification tasks. [4] They briefly touch upon the next steps in the deep learning journey, including deployment, and encourage learners to continue exploring and expanding their knowledge. [4]

    The sources aim to equip learners with the necessary tools and knowledge to tackle real-world deep learning projects. They advocate for a hands-on, experimental approach, emphasizing the importance of understanding the data, choosing appropriate models and techniques, and rigorously evaluating the results. They also encourage learners to continuously seek out new information and refine their skills through practice and exploration.

    Exploring Techniques for Model Improvement and Evaluation: Pages 188-190

    • Examining the Impact of Data Augmentation: The sources continue to assess the effectiveness of data augmentation in improving model performance. They observe that, despite its potential benefits, data augmentation might not always result in significant enhancements. In the specific example provided, the model trained with data augmentation doesn’t exhibit noticeable improvements compared to the baseline model. This outcome could be attributed to the baseline model potentially underfitting the data, implying that the model’s capacity is insufficient to capture the complexities of the dataset even with augmented data. Alternatively, the specific data augmentations employed might not be well-suited to the dataset, leading to minimal performance gains.
    • Analyzing Loss Curves to Understand Model Behavior: The sources emphasize the importance of visualizing results, particularly loss curves, to gain insights into the model’s training dynamics. They recommend plotting the training and validation loss curves to observe how the model’s performance evolves over epochs. These visualizations help identify potential issues such as:
    • Underfitting: When both training and validation losses remain high, suggesting the model isn’t effectively learning the patterns in the data.
    • Overfitting: When the training loss decreases significantly while the validation loss increases, indicating the model is memorizing the training data rather than learning generalizable features.
    • Good Fit: When both training and validation losses decrease and converge, demonstrating the model is learning effectively and generalizing well to unseen data.
    • Directing Learners to Exercises and Supplementary Materials: The sources encourage learners to engage with the exercises and extra curriculum provided to solidify their understanding of the concepts covered. They point to resources like an exercise template notebook and example solutions designed to reinforce the knowledge acquired in the section. The exercises focus on building a CNN model for image classification, covering aspects like data loading, model creation, training, and evaluation.

    The sources strive to equip learners with the critical thinking skills necessary to analyze model performance, identify potential problems, and explore strategies for improvement. They highlight the value of visualizing results and understanding the implications of different loss curve patterns. Furthermore, they encourage learners to actively participate in the provided exercises and seek out supplementary materials to enhance their practical skills in deep learning.

    Evaluating the Effectiveness of Data Augmentation

    The sources consistently emphasize the importance of evaluating the impact of data augmentation on model performance. While data augmentation is a widely used technique to mitigate overfitting and potentially improve generalization ability, its effectiveness can vary depending on the specific dataset and model architecture.

    In the context of the food image classification task, the sources demonstrate building a TinyVGG model with and without data augmentation. They analyze the results and observe that, in this particular instance, data augmentation doesn’t lead to significant improvements in quantitative metrics like loss or accuracy. This outcome could be attributed to several factors:

    • Underfitting Baseline Model: The baseline model, even without augmentation, might already be underfitting the data. This suggests that the model’s capacity is insufficient to capture the complexities of the dataset effectively. In such scenarios, data augmentation might not provide substantial benefits as the model’s limitations prevent it from leveraging the augmented data fully.
    • Suboptimal Augmentations: The specific data augmentation techniques used might not be well-suited to the characteristics of the food image dataset. The chosen transformations might not introduce sufficient diversity or might inadvertently alter crucial features, leading to limited performance gains.
    • Dataset Size: The size of the original dataset could influence the impact of data augmentation. For larger datasets, data augmentation might have a more pronounced effect, as it helps expand the training data and exposes the model to a wider range of variations. However, for smaller datasets, the benefits of augmentation might be less noticeable.

    The sources stress the importance of experimentation and analysis to determine the effectiveness of data augmentation for a specific task. They recommend exploring different augmentation techniques, adjusting hyperparameters, and carefully evaluating the results to find the optimal strategy. They also point out that even if data augmentation doesn’t result in substantial quantitative improvements, it can still contribute to a more robust and generalized model. [1, 2]

    Exploring Data Augmentation and Addressing Overfitting

    The sources highlight the importance of data augmentation as a technique to combat overfitting in machine learning models, particularly in the realm of computer vision. They emphasize that data augmentation involves creating variations of the existing training data by applying transformations such as rotations, flips, or crops. This effectively expands the training dataset and presents the model with a wider range of input patterns, promoting the learning of more robust and generalizable features.

    However, the sources caution that data augmentation is not a guaranteed solution and its effectiveness can vary depending on several factors, including:

    • The nature of the dataset: The type of data and the inherent variability within the dataset can influence the impact of data augmentation. Certain datasets might benefit significantly from augmentation, while others might exhibit minimal improvement.
    • The model architecture: The complexity and capacity of the model can determine how effectively it can leverage augmented data. A simple model might not fully utilize the augmented data, while a more complex model might be prone to overfitting even with augmentation.
    • The choice of augmentation techniques: The specific transformations applied during augmentation play a crucial role in its success. Selecting augmentations that align with the characteristics of the data and the task at hand is essential. Inappropriate or excessive augmentations can even hinder performance.

    The sources demonstrate the application of data augmentation in the context of a food image classification task using a TinyVGG model. They train the model with and without augmentation and compare the results. Notably, they observe that, in this particular scenario, data augmentation does not lead to substantial improvements in quantitative metrics such as loss or accuracy. This outcome underscores the importance of carefully evaluating the impact of data augmentation and not assuming its universal effectiveness.

    To gain further insights into the model’s behavior and the effects of data augmentation, the sources recommend visualizing the training and validation loss curves. These visualizations can reveal patterns that indicate:

    • Underfitting: If both the training and validation losses remain high, it suggests the model is not adequately learning from the data, even with augmentation.
    • Overfitting: If the training loss decreases while the validation loss increases, it indicates the model is memorizing the training data and failing to generalize to unseen data.
    • Good Fit: If both the training and validation losses decrease and converge, it signifies the model is learning effectively and generalizing well.

    The sources consistently emphasize the importance of experimentation and analysis when applying data augmentation. They encourage trying different augmentation techniques, fine-tuning hyperparameters, and rigorously evaluating the results to determine the optimal strategy for a given problem. They also highlight that, even if data augmentation doesn’t yield significant quantitative gains, it can still contribute to a more robust and generalized model.

    Ultimately, the sources advocate for a nuanced approach to data augmentation, recognizing its potential benefits while acknowledging its limitations. They urge practitioners to adopt a data-driven methodology, carefully considering the characteristics of the dataset, the model architecture, and the task requirements to determine the most effective data augmentation strategy.

    The Purpose and Impact of Inference Mode in PyTorch

    The sources introduce inference mode, a feature in PyTorch designed to optimize the model for making predictions, often referred to as “inference” or “evaluation” in machine learning. Inference mode is activated using the torch.inference_mode context manager, as demonstrated in source [1].

    Key Benefits of Inference Mode

    While the sources don’t go into extensive detail about the internal workings of inference mode, they highlight its primary benefits:

    • Improved Speed: Inference mode disables gradient calculations and certain operations not required during prediction, resulting in faster code execution. Source [2] specifically mentions a Twitter thread where PyTorch developers discuss the speed enhancements achieved using inference mode.
    • Reduced Memory Consumption: By disabling gradient tracking, inference mode reduces the memory footprint of the model during prediction. This can be particularly advantageous when deploying models on resource-constrained devices or handling large datasets.

    Observing the Difference

    Source [3] presents a comparison between predictions made with and without inference mode. The key distinction lies in the presence or absence of a grad_fn attribute in the output. The grad_fn attribute is associated with gradient calculations, essential for training but unnecessary during inference. When inference mode is active, the output lacks the grad_fn, indicating that gradient tracking is disabled, leading to speed and memory optimizations.
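
    The difference is easy to reproduce with a toy model; this minimal sketch (not from the book) shows the output of a regular forward pass tracking gradients while the inference-mode output does not:

        import torch

        model = torch.nn.Linear(in_features=2, out_features=1)
        x = torch.randn(4, 2)

        y_regular = model(x)
        print(y_regular.requires_grad)    # True -- output carries a grad_fn for backprop

        with torch.inference_mode():      # gradient tracking disabled
            y_inference = model(x)
        print(y_inference.requires_grad)  # False -- no grad_fn, faster and lighter on memory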

    Making Inference Mode a Habit

    The sources encourage developing the habit of using inference mode whenever making predictions with a PyTorch model. This practice ensures that the model operates in its most efficient mode for inference tasks, maximizing performance and minimizing resource utilization.

    Beyond the Sources

    The sources provide a high-level overview of inference mode and its benefits, recommending further exploration through external resources, particularly the PyTorch documentation and the mentioned Twitter thread. For a deeper understanding of the technical aspects and implementation details of inference mode, consulting the official PyTorch documentation would be beneficial.

    Building a Robust PyTorch Training Loop: Pages 201-210

    The sources transition into providing a detailed walkthrough of constructing a robust training loop in PyTorch for a machine-learning task involving straight-line data. This example focuses on regression, where the goal is to predict a continuous numerical value. They emphasize that while this specific task involves a simple linear relationship, the concepts and steps involved are generalizable to more complex scenarios.

    Here’s a breakdown of the key elements covered in the sources:

    • Data Generation and Preparation: The sources guide the reader through generating a synthetic dataset representing a straight line with a predefined weight and bias. This dataset simulates a real-world scenario where the goal is to train a model to learn the underlying relationship between input features and target variables.
    • Model Definition: The sources introduce the nn.Linear module, a fundamental building block in PyTorch for defining linear layers in neural networks. They demonstrate how to instantiate a linear layer, specifying the input and output dimensions based on the dataset. This layer will learn the weight and bias parameters during training to approximate the straight-line relationship.
    • Loss Function and Optimizer: The sources explain the importance of a loss function in training a machine learning model. In this case, they use the Mean Squared Error (MSE) loss, a common choice for regression tasks that measures the average squared difference between the predicted and actual values. They also introduce the concept of an optimizer, specifically Stochastic Gradient Descent (SGD), responsible for updating the model’s parameters to minimize the loss function during training.
    • Training Loop Structure: The sources outline the core components of a training loop (a compact end-to-end sketch follows this list):
    • Iterating Through Epochs: The training process typically involves multiple passes over the entire training dataset, each pass referred to as an epoch. The loop iterates through the specified number of epochs, performing the training steps for each epoch.
    • Forward Pass: For each batch of data, the model makes predictions based on the current parameter values. This step involves passing the input data through the linear layer and obtaining the output, referred to as logits.
    • Loss Calculation: The loss function (MSE in this example) is used to compute the difference between the model’s predictions (logits) and the actual target values.
    • Backpropagation: This step involves calculating the gradients of the loss with respect to the model’s parameters. These gradients indicate the direction and magnitude of adjustments needed to minimize the loss.
    • Optimizer Step: The optimizer (SGD in this case) utilizes the calculated gradients to update the model’s weight and bias parameters, moving them towards values that reduce the loss.
    • Visualizing the Training Process: The sources emphasize the importance of visualizing the training progress to gain insights into the model’s behavior. They demonstrate plotting the loss values and parameter updates over epochs, helping to understand how the model is learning and whether the loss is decreasing as expected.
    • Illustrating Epochs and Stepping the Optimizer: The sources use a coin analogy to explain the concept of epochs and the role of the optimizer in adjusting model parameters. They compare each epoch to moving closer to a coin at the back of a couch, with the optimizer taking steps to reduce the distance to the target (the coin).
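
    A compact end-to-end sketch of the workflow described above, under illustrative assumptions (the weight/bias values, learning rate, and epoch count are not the book's exact choices):

        import torch
        from torch import nn

        # Synthetic straight-line data with a known weight and bias
        weight, bias = 0.7, 0.3
        X = torch.arange(0, 1, 0.02).unsqueeze(dim=1)
        y = weight * X + bias

        model = nn.Linear(in_features=1, out_features=1)          # learns weight and bias
        loss_fn = nn.MSELoss()                                    # mean squared error
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # stochastic gradient descent

        for epoch in range(200):
            model.train()
            y_pred = model(X)             # forward pass
            loss = loss_fn(y_pred, y)     # loss calculation
            optimizer.zero_grad()         # reset gradients
            loss.backward()               # backpropagation
            optimizer.step()              # optimizer step: update parameters
            if epoch % 20 == 0:
                print(f"Epoch {epoch} | loss: {loss.item():.4f}")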

    The sources provide a comprehensive guide to constructing a fundamental PyTorch training loop for a regression problem, emphasizing the key components and the rationale behind each step. They stress the importance of visualization to understand the training dynamics and the role of the optimizer in guiding the model towards a solution that minimizes the loss function.

    Understanding Non-Linearities and Activation Functions: Pages 211-220

    The sources shift their focus to the concept of non-linearities in neural networks and their crucial role in enabling models to learn complex patterns beyond simple linear relationships. They introduce activation functions as the mechanism for introducing non-linearity into the model’s computations.

    Here’s a breakdown of the key concepts covered in the sources:

    • Limitations of Linear Models: The sources revisit the previous example of training a linear model to fit a straight line. They acknowledge that while linear models are straightforward to understand and implement, they are inherently limited in their capacity to model complex, non-linear relationships often found in real-world data.
    • The Need for Non-Linearities: The sources emphasize that introducing non-linearity into the model’s architecture is essential for capturing intricate patterns and making accurate predictions on data with non-linear characteristics. They highlight that without non-linearities, neural networks would essentially collapse into a series of linear transformations, offering no advantage over simple linear models.
    • Activation Functions: The sources introduce activation functions as the primary means of incorporating non-linearities into neural networks. Activation functions are applied to the output of linear layers, transforming the linear output into a non-linear representation. They act as “decision boundaries,” allowing the network to learn more complex and nuanced relationships between input features and target variables.
    • Sigmoid Activation Function: The sources specifically discuss the sigmoid activation function, a common choice that squashes the input values into a range between 0 and 1. They highlight that while sigmoid was historically popular, it has limitations, particularly in deep networks where it can lead to vanishing gradients, hindering training.
    • ReLU Activation Function: The sources present the ReLU (Rectified Linear Unit) activation function as a more modern and widely used alternative to sigmoid. ReLU is computationally efficient and addresses the vanishing gradient problem associated with sigmoid. It simply sets all negative values to zero and leaves positive values unchanged, introducing non-linearity while preserving the benefits of linear behavior in certain regions.
    • Visualizing the Impact of Non-Linearities: The sources emphasize the importance of visualization to understand the impact of activation functions. They demonstrate how the addition of a ReLU activation function to a simple linear model drastically changes the model’s decision boundary, enabling it to learn non-linear patterns in a toy dataset of circles. They showcase how the ReLU-augmented model achieves near-perfect performance, highlighting the power of non-linearities in enhancing model capabilities.
    • Exploration of Activation Functions in torch.nn: The sources guide the reader to explore the torch.nn module in PyTorch, which contains a comprehensive collection of activation functions. They encourage exploring the documentation and experimenting with different activation functions to understand their properties and impact on model behavior.
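
    To accompany that exploration, the two activation functions discussed above can be written by hand in a few lines; torch.nn provides the equivalent nn.Sigmoid and nn.ReLU modules. This is a small illustrative sketch:

        import torch

        def sigmoid(x):
            return 1 / (1 + torch.exp(-x))               # squashes values into (0, 1)

        def relu(x):
            return torch.maximum(torch.tensor(0.0), x)   # zeroes negatives, keeps positives

        x = torch.linspace(-3, 3, 7)
        print(sigmoid(x))
        print(relu(x))
        print(torch.nn.ReLU()(x))  # built-in module gives the same result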

    The sources provide a clear and concise introduction to the fundamental concepts of non-linearities and activation functions in neural networks. They emphasize the limitations of linear models and the essential role of activation functions in empowering models to learn complex patterns. The sources encourage a hands-on approach, urging readers to experiment with different activation functions in PyTorch and visualize their effects on model behavior.

    Optimizing Gradient Descent: Pages 221-230

    The sources move on to refining the gradient descent process, a crucial element in training machine-learning models. They highlight several techniques and concepts aimed at enhancing the efficiency and effectiveness of gradient descent.

    • Gradient Accumulation and the optimizer.zero_grad() Method: The sources explain that PyTorch accumulates gradients by default: each call to loss.backward() adds newly computed gradients on top of any existing ones. Unless this accumulation across batches is wanted deliberately, the gradients must be reset to zero before each batch using the optimizer.zero_grad() method. This prevents gradients from previous batches from interfering with the current batch's calculations, ensuring accurate gradient updates.
    • The Intertwined Nature of Gradient Descent Steps: The sources point out the interconnectedness of the steps involved in gradient descent:
    • optimizer.zero_grad(): Resets the gradients to zero.
    • loss.backward(): Calculates gradients through backpropagation.
    • optimizer.step(): Updates model parameters based on the calculated gradients.
    • They emphasize that these steps work in tandem to optimize the model parameters, moving them towards values that minimize the loss function.
    • Learning Rate Scheduling and the Coin Analogy: The sources introduce the concept of learning rate scheduling, a technique for dynamically adjusting the learning rate, a hyperparameter controlling the size of parameter updates during training. They use the analogy of reaching for a coin at the back of a couch to explain this concept.
    • Large Steps Initially: When starting the arm far from the coin (analogous to the initial stages of training), larger steps are taken to cover more ground quickly.
    • Smaller Steps as the Target Approaches: As the arm gets closer to the coin (similar to approaching the optimal solution), smaller, more precise steps are needed to avoid overshooting the target.
    • The sources suggest exploring resources on learning rate scheduling for further details; a brief sketch using one of PyTorch's built-in schedulers follows this list.
    • Visualizing Model Improvement: The sources demonstrate the positive impact of training for more epochs, showing how predictions align better with the target values as training progresses. They visualize the model’s predictions alongside the actual data points, illustrating how the model learns to fit the data more accurately over time.
    • The torch.no_grad() Context Manager for Evaluation: The sources introduce the torch.no_grad() context manager, used during the evaluation phase to disable gradient calculations. This optimization enhances speed and reduces memory consumption, as gradients are unnecessary for evaluating a trained model.
    • The Jingle for Remembering Training Steps: To help remember the key steps in a training loop, the sources introduce a catchy jingle: “For an epoch in a range, do the forward pass, calculate the loss, optimizer zero grad, loss backward, optimizer step, step, step.” This mnemonic device reinforces the sequence of actions involved in training a model.
    • Customizing Printouts and Monitoring Metrics: The sources emphasize the flexibility of customizing printouts during training to monitor relevant metrics. They provide examples of printing the loss, weights, and bias values at specific intervals (every 10 epochs in this case) to track the training progress. They also hint at introducing accuracy metrics in later stages.
    • Reinitializing the Model and the Importance of Random Seeds: The sources demonstrate reinitializing the model to start training from scratch, showcasing how the model begins with random predictions but progressively improves as training progresses. They emphasize the role of random seeds in ensuring reproducibility, allowing for consistent model initialization and experimentation.
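
    The sources leave scheduling details to outside resources; as one assumed example, PyTorch's built-in StepLR scheduler implements the coin analogy directly by shrinking the learning rate at fixed intervals:

        import torch
        from torch import nn

        model = nn.Linear(1, 1)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

        # Cut the learning rate by 10x every 30 epochs:
        # large steps early on, smaller steps as the "coin" gets closer.
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

        for epoch in range(90):
            # ... forward pass, loss, optimizer.zero_grad(), loss.backward(), optimizer.step() ...
            scheduler.step()  # advance the schedule once per epoch
            if epoch % 30 == 0:
                print(epoch, scheduler.get_last_lr())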

    The sources provide a comprehensive exploration of techniques and concepts for optimizing the gradient descent process in PyTorch. They cover gradient accumulation, learning rate scheduling, and the use of context managers for efficient evaluation. They emphasize visualization to monitor progress and the importance of random seeds for reproducible experiments.

    Saving, Loading, and Evaluating Models: Pages 231-240

    The sources guide readers through saving a trained model, reloading it for later use, and exploring additional evaluation metrics beyond just loss.

    • Saving a Trained Model with torch.save(): The sources introduce the torch.save() function in PyTorch to save a trained model to a file. They emphasize the importance of saving models to preserve the learned parameters, allowing for later reuse without retraining. The code examples demonstrate saving the model’s state dictionary, containing the learned parameters, to a file named “01_pytorch_workflow_model_0.pth”.
    • Verifying Model File Creation with ls: The sources suggest using the ls command in a terminal or command prompt to verify that the model file has been successfully created in the designated directory.
    • Loading a Saved Model with torch.load(): The sources then present the torch.load() function for loading the saved state dictionary back into the environment, after which it is applied to a fresh model instance with load_state_dict(). They highlight the ease of loading saved models, allowing for continued training or deployment for making predictions without the need to repeat the entire training process. They challenge readers to attempt loading the saved model before providing the code solution (a save-and-load sketch follows this list).
    • Examining Loaded Model Parameters: The sources suggest examining the loaded model’s parameters, particularly the weights and biases, to confirm that they match the values from the saved model. This step ensures that the model has been loaded correctly and is ready for further use.
    • Improving Model Performance with More Epochs: The sources revisit the concept of training for more epochs to improve model performance. They demonstrate how increasing the number of epochs can lead to lower loss and better alignment between predictions and target values. They encourage experimentation with different epoch values to observe the impact on model accuracy.
    • Plotting Loss Curves to Visualize Training Progress: The sources showcase plotting loss curves to visualize the training progress over time. They track the loss values for both the training and test sets across epochs and plot these values to observe the trend of decreasing loss as training proceeds. The sources point out that if the training and test loss curves converge closely, it indicates that the model is generalizing well to unseen data, a desirable outcome.
    • Storing Useful Values During Training: The sources recommend creating empty lists to store useful values during training, such as epoch counts, loss values, and test loss values. This organized storage facilitates later analysis and visualization of the training process.
    • Reviewing Code, Slides, and Extra Curriculum: The sources encourage readers to review the code, accompanying slides, and extra curriculum resources for a deeper understanding of the concepts covered. They particularly recommend the book version of the course, which contains comprehensive explanations and additional resources.
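
    A self-contained save-and-load sketch following the pattern above; the model class here is an illustrative stand-in so the snippet runs on its own, while the file name comes from the text:

        import torch
        from torch import nn
        from pathlib import Path

        # Illustrative stand-in for the trained model discussed in the text
        class LinearRegressionModel(nn.Module):
            def __init__(self):
                super().__init__()
                self.linear_layer = nn.Linear(in_features=1, out_features=1)

            def forward(self, x):
                return self.linear_layer(x)

        model_0 = LinearRegressionModel()

        # Save only the state dict (the learned parameters)
        MODEL_PATH = Path("models")
        MODEL_PATH.mkdir(parents=True, exist_ok=True)
        MODEL_SAVE_PATH = MODEL_PATH / "01_pytorch_workflow_model_0.pth"
        torch.save(obj=model_0.state_dict(), f=MODEL_SAVE_PATH)

        # Load: instantiate the same class, then fill in the saved parameters
        loaded_model_0 = LinearRegressionModel()
        loaded_model_0.load_state_dict(torch.load(f=MODEL_SAVE_PATH))
        print(loaded_model_0.state_dict())  # should match the saved parameters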

    This section of the sources focuses on the practical aspects of saving, loading, and evaluating PyTorch models. The sources provide clear code examples and explanations for these essential tasks, enabling readers to efficiently manage their trained models and assess their performance. They continue to emphasize the importance of visualization for understanding training progress and model behavior.

    Building and Understanding Neural Networks: Pages 241-250

    The sources transition from focusing on fundamental PyTorch workflows to constructing and comprehending neural networks for more complex tasks, particularly classification. They guide readers through building a neural network designed to classify data points into distinct categories.

    • Shifting Focus to PyTorch Fundamentals: The sources highlight that the upcoming content will concentrate on the core principles of PyTorch, shifting away from the broader workflow-oriented perspective. They direct readers to specific sections in the accompanying resources, such as the PyTorch Fundamentals notebook and the online book version of the course, for supplementary materials and in-depth explanations.
    • Exercises and Extra Curriculum: The sources emphasize the availability of exercises and extra curriculum materials to enhance learning and practical application. They encourage readers to actively engage with these resources to solidify their understanding of the concepts.
    • Introduction to Neural Network Classification: The sources mark the beginning of a new section focused on neural network classification, a common machine learning task where models learn to categorize data into predefined classes. They distinguish between binary classification (one thing or another) and multi-class classification (more than two classes).
    • Examples of Classification Problems: To illustrate classification tasks, the sources provide real-world examples:
    • Image Classification: Classifying images as containing a cat or a dog.
    • Spam Filtering: Categorizing emails as spam or not spam.
    • Social Media Post Classification: Labeling posts on platforms like Facebook or Twitter based on their content.
    • Fraud Detection: Identifying fraudulent transactions.
    • Multi-Label Classification with Wikipedia Categories: The sources extend the discussion to multi-label classification, using the category labels from the Wikipedia page for “deep learning.” They note that the page itself has multiple categories or labels, such as “deep learning,” “artificial neural networks,” “artificial intelligence,” and “emerging technologies.” This example highlights how a machine learning model could be trained to assign multiple labels to a single piece of text.
    • Architecture, Input/Output Shapes, Features, and Labels: The sources outline the key aspects of neural network classification models that they will cover:
    • Architecture: The structure and organization of the neural network, including the layers and their connections.
    • Input/Output Shapes: The dimensions of the data fed into the model and the expected dimensions of the model’s predictions.
    • Features: The input variables or characteristics used by the model to make predictions.
    • Labels: The target variables representing the classes or categories to which the data points belong.
    • Practical Example with the make_circles Dataset: The sources introduce a hands-on example using the make_circles dataset from scikit-learn, a Python library for machine learning. They generate a synthetic dataset consisting of 1000 data points arranged in two concentric circles, each circle representing a different class. A short sketch recreating this dataset appears after this list.
    • Data Exploration and Visualization: The sources emphasize the importance of exploring and visualizing data before model building. They print the first five samples of both the features (X) and labels (Y) and guide readers through understanding the structure of the data. They acknowledge that discerning patterns from raw numerical data can be challenging and advocate for visualization to gain insights.
    • Creating a Dictionary for Structured Data Representation: The sources structure the data into a dictionary format to organize the features (X1, X2) and labels (Y) for each sample. They explain the rationale behind this approach, highlighting how it improves readability and understanding of the dataset.
    • Transitioning to Visualization: The sources prepare to shift from numerical representations to visual representations of the data, emphasizing the power of visualization for revealing patterns and gaining a deeper understanding of the dataset’s characteristics.
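
    A short sketch recreating the dataset described above; the noise and random_state values are assumptions for illustration:

    ```python
    import pandas as pd
    from sklearn.datasets import make_circles

    # 1000 points arranged in two concentric circles, one class per circle
    X, y = make_circles(n_samples=1000, noise=0.03, random_state=42)

    print(X[:5])  # first five samples of the features (X1, X2)
    print(y[:5])  # first five labels (0 or 1)

    # Structure the data for readability: one column per feature plus the label
    circles = pd.DataFrame({"X1": X[:, 0], "X2": X[:, 1], "label": y})
    print(circles.head())
    ```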

    This section of the sources marks a transition to a more code-centric and hands-on approach to understanding neural networks for classification. They introduce essential concepts, provide real-world examples, and guide readers through a practical example using a synthetic dataset. They continue to advocate for visualization as a crucial tool for data exploration and model understanding.

    Visualizing and Building a Classification Model: Pages 251-260

    The sources demonstrate how to visualize the make_circles dataset and begin constructing a neural network model designed for binary classification.

    • Visualizing the make_circles Dataset: The sources utilize Matplotlib, a Python plotting library, to visualize the make_circles dataset created earlier. They emphasize the data explorer’s motto: “Visualize, visualize, visualize,” underscoring the importance of visually inspecting data to understand patterns and relationships. The visualization reveals two distinct circles, each representing a different class, confirming the expected structure of the dataset.
    • Splitting Data into Training and Test Sets: The sources guide readers through splitting the dataset into training and test sets using array slicing. They explain the rationale for this split:
    • Training Set: Used to train the model and allow it to learn patterns from the data.
    • Test Set: Held back from training and used to evaluate the model’s performance on unseen data, providing an estimate of its ability to generalize to new examples.
    • They calculate and verify the lengths of the training and test sets, ensuring that the split adheres to the desired proportions (in this case, 80% for training and 20% for testing).
    • Building a Simple Neural Network with PyTorch: The sources initiate building a simple neural network model using PyTorch. They introduce essential components of a PyTorch model:
    • torch.nn.Module: The base class for all neural network modules in PyTorch.
    • __init__ Method: The constructor method where model layers are defined.
    • forward Method: Defines the forward pass of data through the model.
    • They guide readers through creating a class named CircleModelV0 that inherits from torch.nn.Module and outline the steps for defining the model’s layers and the forward pass logic. A hedged, end-to-end sketch of this model, its loss function, optimizer, and training loop appears after this list.
    • Key Concepts in the Neural Network Model:
    • Linear Layers: The model uses linear layers (torch.nn.Linear), which apply a linear transformation to the input data.
    • Non-Linear Activation Function (Sigmoid): The model employs a non-linear activation function, specifically the sigmoid function (torch.sigmoid), to introduce non-linearity into the model. Non-linearity allows the model to learn more complex patterns in the data.
    • Input and Output Dimensions: The sources carefully consider the input and output dimensions of each layer to ensure compatibility between the layers and the data. They emphasize the importance of aligning these dimensions to prevent errors during model execution.
    • Visualizing the Neural Network Architecture: The sources present a visual representation of the neural network architecture, highlighting the flow of data through the layers, the application of the sigmoid activation function, and the final output representing the model’s prediction. They encourage readers to visualize their own neural networks to aid in comprehension.
    • Loss Function and Optimizer: The sources introduce the concept of a loss function and an optimizer, crucial components of the training process:
    • Loss Function: Measures the difference between the model’s predictions and the true labels, providing a signal to guide the model’s learning.
    • Optimizer: Updates the model’s parameters (weights and biases) based on the calculated loss, aiming to minimize the loss and improve the model’s accuracy.
    • They select the binary cross-entropy loss function (torch.nn.BCELoss) and the stochastic gradient descent (SGD) optimizer (torch.optim.SGD) for this classification task. They mention that alternative loss functions and optimizers exist and provide resources for further exploration.
    • Training Loop and Evaluation: The sources establish a training loop, a fundamental process in machine learning where the model iteratively learns from the training data. They outline the key steps involved in each iteration of the loop:
    1. Forward Pass: Pass the training data through the model to obtain predictions.
    2. Calculate Loss: Compute the loss using the chosen loss function.
    3. Zero Gradients: Reset the gradients of the model’s parameters.
    4. Backward Pass (Backpropagation): Calculate the gradients of the loss with respect to the model’s parameters.
    5. Update Parameters: Adjust the model’s parameters using the optimizer based on the calculated gradients.
    • They perform a small number of training epochs (iterations over the entire training dataset) to demonstrate the training process. They evaluate the model’s performance after training by calculating the loss on the test data.
    • Visualizing Model Predictions: The sources visualize the model’s predictions on the test data using Matplotlib. They plot the data points, color-coded by their true labels, and overlay the decision boundary learned by the model, illustrating how the model separates the data into different classes. They note that the model’s predictions, although far from perfect at this early stage of training, show some initial separation between the classes, indicating that the model is starting to learn.
    • Improving a Model: An Overview: The sources provide a high-level overview of techniques for improving the performance of a machine learning model. They suggest various strategies for enhancing model accuracy, including adding more layers, increasing the number of hidden units, training for a longer duration, and incorporating non-linear activation functions. They emphasize that these strategies may not always guarantee improvement and that experimentation is crucial to determine the optimal approach for a particular dataset and problem.
    • Saving and Loading Models with PyTorch: The sources reiterate the importance of saving trained models for later use. They demonstrate the use of torch.save() to save the model’s state dictionary to a file. They also showcase how to restore a saved model by reading the state dictionary back with torch.load() and passing it to the model’s load_state_dict() method, allowing for reuse without retraining.
    • Transition to Putting It All Together: The sources prepare to transition to a section where they will consolidate the concepts covered so far by working through a comprehensive example that incorporates the entire machine learning workflow, emphasizing practical application and problem-solving.
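
    A hedged, end-to-end sketch of the pieces described above (CircleModelV0, the binary cross-entropy loss, the SGD optimizer, and the five training-loop steps); the dataset parameters, learning rate, and epoch count are illustrative assumptions rather than the course's exact values:

    ```python
    import torch
    from torch import nn
    from sklearn.datasets import make_circles
    from sklearn.model_selection import train_test_split

    # Recreate the circles data and split it 80/20 into train and test sets
    X, y = make_circles(n_samples=1000, noise=0.03, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    X_train = torch.from_numpy(X_train).type(torch.float)
    X_test = torch.from_numpy(X_test).type(torch.float)
    y_train = torch.from_numpy(y_train).type(torch.float)
    y_test = torch.from_numpy(y_test).type(torch.float)

    class CircleModelV0(nn.Module):
        def __init__(self):
            super().__init__()
            self.layer_1 = nn.Linear(in_features=2, out_features=5)  # 2 features in, 5 hidden units
            self.layer_2 = nn.Linear(in_features=5, out_features=1)  # 5 hidden units in, 1 logit out

        def forward(self, x):
            return self.layer_2(self.layer_1(x))  # returns raw logits

    model = CircleModelV0()
    loss_fn = nn.BCELoss()  # expects sigmoid-activated inputs
    optimizer = torch.optim.SGD(params=model.parameters(), lr=0.1)

    for epoch in range(100):
        model.train()
        y_probs = torch.sigmoid(model(X_train).squeeze())  # 1. forward pass (+ sigmoid)
        loss = loss_fn(y_probs, y_train)                   # 2. calculate loss
        optimizer.zero_grad()                              # 3. zero gradients
        loss.backward()                                    # 4. backpropagation
        optimizer.step()                                   # 5. update parameters

    model.eval()
    with torch.inference_mode():
        test_probs = torch.sigmoid(model(X_test).squeeze())
        print(f"Test loss: {loss_fn(test_probs, y_test):.4f}")
    ```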

    This section of the sources focuses on the practical aspects of building and training a simple neural network for binary classification. They guide readers through defining the model architecture, choosing a loss function and optimizer, implementing a training loop, and visualizing the model’s predictions. They also introduce strategies for improving model performance and reinforce the importance of saving and loading trained models.

    Putting It All Together: Pages 261-270

    The sources revisit the key steps in the PyTorch workflow, bringing together the concepts covered previously to solidify readers’ understanding of the end-to-end process. They emphasize a code-centric approach, encouraging readers to code along to reinforce their learning.

    • Reiterating the PyTorch Workflow: The sources highlight the importance of practicing the PyTorch workflow to gain proficiency. They guide readers through a step-by-step review of the process, emphasizing a shift toward coding over theoretical explanations.
    • The Importance of Practice: The sources stress that actively writing and running code is crucial for internalizing concepts and developing practical skills. They encourage readers to participate in coding exercises and explore additional resources to enhance their understanding.
    • Data Preparation and Transformation into Tensors: The sources reiterate the initial steps of preparing data and converting it into tensors, a format suitable for PyTorch models. They remind readers of the importance of data exploration and transformation, emphasizing that these steps are fundamental to successful model development.
    • Model Building, Loss Function, and Optimizer Selection: The sources revisit the core components of model construction:
    • Building or Selecting a Model: Choosing an appropriate model architecture or constructing a custom model based on the problem’s requirements.
    • Picking a Loss Function: Selecting a loss function that measures the difference between the model’s predictions and the true labels, guiding the model’s learning process.
    • Building an Optimizer: Choosing an optimizer that updates the model’s parameters based on the calculated loss, aiming to minimize the loss and improve the model’s accuracy.
    • Training Loop and Model Fitting: The sources highlight the central role of the training loop in machine learning. They recap the key steps involved in each iteration:
    1. Forward Pass: Pass the training data through the model to obtain predictions.
    2. Calculate Loss: Compute the loss using the chosen loss function.
    3. Zero Gradients: Reset the gradients of the model’s parameters.
    4. Backward Pass (Backpropagation): Calculate the gradients of the loss with respect to the model’s parameters.
    5. Update Parameters: Adjust the model’s parameters using the optimizer based on the calculated gradients.
    • Making Predictions and Evaluating the Model: The sources remind readers of the steps involved in using the trained model to make predictions on new data and evaluating its performance using appropriate metrics, such as loss and accuracy (a small accuracy helper is sketched after this list). They emphasize the importance of evaluating models on unseen data (the test set) to assess their ability to generalize to new examples.
    • Saving and Loading Trained Models: The sources reiterate the value of saving trained models to avoid retraining. They demonstrate the use of torch.save() to save the model’s state dictionary to a file and torch.load() to load a saved model for reuse.
    • Exercises and Extra Curriculum Resources: The sources consistently emphasize the availability of exercises and extra curriculum materials to supplement learning. They direct readers to the accompanying resources, such as the online book and the GitHub repository, where these materials can be found. They encourage readers to actively engage with these resources to solidify their understanding and develop practical skills.
    • Transition to Convolutional Neural Networks: The sources prepare to move into a new section focused on computer vision and convolutional neural networks (CNNs), indicating that readers have gained a solid foundation in the fundamental PyTorch workflow and are ready to explore more advanced deep learning architectures. [1]
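
    The evaluation step can be illustrated with a small accuracy helper; this is a sketch of a common pattern, not the course's exact code:

    ```python
    import torch

    def accuracy_fn(y_true: torch.Tensor, y_pred: torch.Tensor) -> float:
        """Return the percentage of predictions that match the labels."""
        correct = torch.eq(y_true, y_pred).sum().item()
        return correct / len(y_pred) * 100

    # Hypothetical test labels and predictions
    y_true = torch.tensor([1, 0, 1, 1, 0])
    y_pred = torch.tensor([1, 0, 0, 1, 0])
    print(f"Accuracy: {accuracy_fn(y_true, y_pred):.1f}%")  # Accuracy: 80.0%
    ```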

    This section of the sources serves as a review and consolidation of the key concepts and steps involved in the PyTorch workflow. It reinforces the importance of practice and hands-on coding and prepares readers to explore more specialized deep learning techniques, such as CNNs for computer vision tasks.

    Navigating Resources and Deep Learning Concepts: Pages 271-280

    The sources transition into discussing resources for further learning and exploring essential deep learning concepts, setting the stage for a deeper understanding of PyTorch and its applications.

    • Emphasizing Continuous Learning: The sources emphasize the importance of ongoing learning in the ever-evolving field of deep learning. They acknowledge that a single course cannot cover every aspect of PyTorch and encourage readers to actively seek out additional resources to expand their knowledge.
    • Recommended Resources for PyTorch Mastery: The sources provide specific recommendations for resources that can aid in further exploration of PyTorch:
    • Google Search: A fundamental tool for finding answers to specific questions, troubleshooting errors, and exploring various concepts related to PyTorch and deep learning. [1, 2]
    • PyTorch Documentation: The official PyTorch documentation serves as an invaluable reference for understanding PyTorch’s functions, modules, and classes. The sources demonstrate how to effectively navigate the documentation to find information about specific functions, such as torch.arange. [3]
    • GitHub Repository: The sources highlight a dedicated GitHub repository that houses the materials covered in the course, including notebooks, code examples, and supplementary resources. They encourage readers to utilize this repository as a learning aid and a source of reference. [4-14]
    • Learn PyTorch Website: The sources introduce an online book version of the course, accessible through a website, offering a readable format for revisiting course content and exploring additional chapters that cover more advanced topics, including transfer learning, model experiment tracking, and paper replication. [1, 4, 5, 7, 11, 15-30]
    • Course Q&A Forum: The sources acknowledge the importance of community support and encourage readers to utilize a dedicated Q&A forum, possibly on GitHub, to seek assistance from instructors and fellow learners. [4, 8, 11, 15]
    • Encouraging Active Exploration of Definitions: The sources recommend that readers proactively research definitions of key deep learning concepts, such as deep learning and neural networks. They suggest using resources like Google Search and Wikipedia to explore various interpretations and develop a personal understanding of these concepts. They prioritize hands-on work over rote memorization of definitions. [1, 2]
    • Structured Approach to the Course: The sources suggest a structured approach to navigating the course materials, presenting them in numerical order for ease of comprehension. They acknowledge that alternative learning paths exist but recommend following the numerical sequence for clarity. [31]
    • Exercises, Extra Curriculum, and Documentation Reading: The sources emphasize the significance of hands-on practice and provide exercises designed to reinforce the concepts covered in the course. They also highlight the availability of extra curriculum materials for those seeking to deepen their understanding. Additionally, they encourage readers to actively engage with the PyTorch documentation to familiarize themselves with its structure and content. [6, 10, 12, 13, 16, 18-21, 23, 24, 28-30, 32-34]

    This section of the sources focuses on directing readers towards valuable learning resources and fostering a mindset of continuous learning in the dynamic field of deep learning. They provide specific recommendations for accessing course materials, leveraging the PyTorch documentation, engaging with the community, and exploring definitions of key concepts. They also encourage active participation in exercises, exploration of extra curriculum content, and familiarization with the PyTorch documentation to enhance practical skills and deepen understanding.

    Introducing the Coding Environment: Pages 281-290

    The sources transition from theoretical discussion and resource navigation to a more hands-on approach, guiding readers through setting up their coding environment and introducing Google Colab as the primary tool for the course.

    • Shifting to Hands-On Coding: The sources signal a shift in focus toward practical coding exercises, encouraging readers to actively participate and write code alongside the instructions. They emphasize the importance of getting involved with hands-on work rather than solely focusing on theoretical definitions.
    • Introducing Google Colab: The sources introduce Google Colab, a cloud-based Jupyter notebook environment, as the primary tool for coding throughout the course. They suggest that using Colab facilitates a consistent learning experience and removes the need for local installations and setup, allowing readers to focus on learning PyTorch. They recommend using Colab as the preferred method for following along with the course materials.
    • Advantages of Google Colab: The sources highlight the benefits of using Google Colab, including its accessibility, ease of use, and collaborative features. Colab provides a pre-configured environment with necessary libraries and dependencies already installed, simplifying the setup process for readers. Its cloud-based nature allows access from various devices and facilitates code sharing and collaboration.
    • Navigating the Colab Interface: The sources guide readers through the basic functionality of Google Colab, demonstrating how to create new notebooks, run code cells, and access various features within the Colab environment. They introduce essential attributes, such as torch.__version__ and torchvision.__version__, for checking the versions of installed libraries.
    • Creating and Running Code Cells: The sources demonstrate how to create new code cells within Colab notebooks and execute Python code within these cells. They illustrate the use of print() statements to display output and introduce the concept of importing necessary libraries, such as torch for PyTorch functionality.
    • Checking Library Versions: The sources emphasize the importance of ensuring compatibility between PyTorch and its associated libraries. They demonstrate how to check the versions of installed libraries, such as torch and torchvision, using attributes like torch.__version__ and torchvision.__version__ (see the snippet after this list). This step ensures that readers are using compatible versions for the upcoming code examples and exercises.
    • Emphasizing Hands-On Learning: The sources reiterate their preference for hands-on learning and a code-centric approach, stating that they will prioritize coding together rather than spending extensive time on slides or theoretical explanations.
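
    A short snippet of the version check described above; the printed values will depend on the installed versions:

    ```python
    import torch
    import torchvision

    print(torch.__version__)        # e.g. "2.1.0+cu121"
    print(torchvision.__version__)  # e.g. "0.16.0+cu121"
    ```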

    This section of the sources marks a transition from theoretical discussions and resource exploration to a more hands-on coding approach. They introduce Google Colab as the primary coding environment for the course, highlighting its benefits and demonstrating its basic functionality. The sources guide readers through creating code cells, running Python code, and checking library versions to ensure compatibility. By focusing on practical coding examples, the sources encourage readers to actively participate in the learning process and reinforce their understanding of PyTorch concepts.

    Setting the Stage for Classification: Pages 291-300

    The sources shift focus to classification problems, a fundamental task in machine learning, and begin by explaining the core concepts of binary, multi-class, and multi-label classification, providing examples to illustrate each type. They then delve into the specifics of binary and multi-class classification, setting the stage for building classification models in PyTorch.

    • Introducing Classification Problems: The sources introduce classification as a key machine learning task where the goal is to categorize data into predefined classes or categories. They differentiate between various types of classification problems:
    • Binary Classification: Involves classifying data into one of two possible classes. Examples include:
    • Image Classification: Determining whether an image contains a cat or a dog.
    • Spam Detection: Classifying emails as spam or not spam.
    • Fraud Detection: Identifying fraudulent transactions from legitimate ones.
    • Multi-Class Classification: Deals with classifying data into one of multiple (more than two) classes. Examples include:
    • Image Recognition: Categorizing images into different object classes, such as cars, bicycles, and pedestrians.
    • Handwritten Digit Recognition: Classifying handwritten digits into the numbers 0 through 9.
    • Natural Language Processing: Assigning text documents to specific topics or categories.
    • Multi-Label Classification: Involves assigning multiple labels to a single data point. Examples include:
    • Image Tagging: Assigning multiple tags to an image, such as “beach,” “sunset,” and “ocean.”
    • Text Classification: Categorizing documents into multiple relevant topics.
    • Understanding the ImageNet Dataset: The sources reference the ImageNet dataset, a large-scale dataset commonly used in computer vision research, as an example of multi-class classification. They point out that ImageNet contains thousands of object categories, making it a challenging dataset for multi-class classification tasks.
    • Illustrating Multi-Label Classification with Wikipedia: The sources use a Wikipedia article about deep learning as an example of multi-label classification. They point out that the article has multiple categories assigned to it, such as “deep learning,” “artificial neural networks,” and “artificial intelligence,” demonstrating that a single data point (the article) can have multiple labels.
    • Real-World Examples of Classification: The sources provide relatable examples from everyday life to illustrate different classification scenarios:
    • Photo Categorization: Modern smartphone cameras often automatically categorize photos based on their content, such as “people,” “food,” or “landscapes.”
    • Email Filtering: Email services frequently categorize emails into folders like “primary,” “social,” or “promotions,” performing a multi-class classification task.
    • Focusing on Binary and Multi-Class Classification: The sources acknowledge the existence of other types of classification but choose to focus on binary and multi-class classification for the remainder of the section. They indicate that these two types are fundamental and provide a strong foundation for understanding more complex classification scenarios.

    This section of the sources sets the stage for exploring classification problems in PyTorch. They introduce different types of classification, providing examples and real-world applications to illustrate each type. The sources emphasize the importance of understanding binary and multi-class classification as fundamental building blocks for more advanced classification tasks. By providing clear definitions, examples, and a structured approach, the sources prepare readers to build and train classification models using PyTorch.

    Building a Binary Classification Model with PyTorch: Pages 301-310

    The sources begin the practical implementation of a binary classification model using PyTorch. They guide readers through generating a synthetic dataset, exploring its characteristics, and visualizing it to gain insights into the data before proceeding to model building.

    • Generating a Synthetic Dataset with make_circles: The sources introduce the make_circles function from the sklearn.datasets module to create a synthetic dataset for binary classification. This function generates a dataset with two concentric circles, each representing a different class. The sources provide a code example using make_circles to generate 1000 samples, storing the features in the variable X and the corresponding labels in the variable Y. They emphasize the common convention of using capital X to represent a matrix of features and capital Y for labels.
    • Exploring the Dataset: The sources guide readers through exploring the characteristics of the generated dataset:
    • Examining the First Five Samples: The sources provide code to display the first five samples of both features (X) and labels (Y) using array slicing. They use print() statements to display the output, encouraging readers to visually inspect the data.
    • Formatting for Clarity: The sources emphasize the importance of presenting data in a readable format. They use a dictionary to structure the data, mapping feature names (X1 and X2) to the corresponding values and including the label (Y). This structured format enhances the readability and interpretation of the data.
    • Visualizing the Data: The sources highlight the importance of visualizing data, especially in classification tasks. They emphasize the data explorer’s motto: “visualize, visualize, visualize.” They point out that while patterns might not be evident from numerical data alone, visualization can reveal underlying structures and relationships.
    • Visualizing with Matplotlib: The sources introduce Matplotlib, a popular Python plotting library, for visualizing the generated dataset. They provide a code example using plt.scatter() to create a scatter plot of the data, with different colors representing the two classes. The visualization reveals the circular structure of the data, with one class forming an inner circle and the other class forming an outer circle. This visual representation provides a clear understanding of the dataset’s characteristics and the challenge posed by the binary classification task. A hedged plotting sketch appears after this list.
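
    A hedged plotting sketch, assuming the same make_circles parameters as before:

    ```python
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_circles

    X, y = make_circles(n_samples=1000, noise=0.03, random_state=42)

    # Scatter plot with one color per class: the inner and outer circles
    plt.scatter(x=X[:, 0], y=X[:, 1], c=y, cmap=plt.cm.RdYlBu)
    plt.xlabel("X1")
    plt.ylabel("X2")
    plt.show()
    ```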

    This section of the sources marks the beginning of hands-on model building with PyTorch. They start by generating a synthetic dataset using make_circles, allowing for controlled experimentation and a clear understanding of the data’s structure. They guide readers through exploring the dataset’s characteristics, both numerically and visually. The use of Matplotlib to visualize the data reinforces the importance of understanding data patterns before proceeding to model development. By emphasizing the data explorer’s motto, the sources encourage readers to actively engage with the data and gain insights that will inform their subsequent modeling choices.

    Exploring Model Architecture and PyTorch Fundamentals: Pages 311-320

    The sources proceed with building a simple neural network model using PyTorch, introducing key components like layers, neurons, activation functions, and matrix operations. They guide readers through understanding the model’s architecture, emphasizing the connection between the code and its visual representation. They also highlight PyTorch’s role in handling computations and the importance of visualizing the network’s structure.

    • Creating a Simple Neural Network Model: The sources guide readers through creating a basic neural network model in PyTorch. They introduce the concept of layers, representing different stages of computation in the network, and neurons, the individual processing units within each layer. They provide code to construct a model with:
    • An Input Layer: Takes in two features, corresponding to the X1 and X2 features from the generated dataset.
    • A Hidden Layer: Consists of five neurons, introducing the idea of hidden layers for learning complex patterns.
    • An Output Layer: Produces a single output, suitable for binary classification.
    • Relating Code to Visual Representation: The sources emphasize the importance of understanding the connection between the code and its visual representation. They encourage readers to visualize the network’s structure, highlighting the flow of data through the input, hidden, and output layers. This visualization clarifies how the network processes information and makes predictions.
    • PyTorch’s Role in Computation: The sources explain that while they write the code to define the model’s architecture, PyTorch handles the underlying computations. PyTorch takes care of matrix operations, activation functions, and other mathematical processes involved in training and using the model.
    • Illustrating Network Structure with torch.nn.Linear: The sources use the torch.nn.Linear module to create the layers in the neural network. They provide code examples demonstrating how to define the input and output dimensions for each layer, emphasizing that the output of one layer becomes the input to the subsequent layer.
    • Understanding Input and Output Shapes: The sources emphasize the significance of input and output shapes in neural networks. They explain that the input shape corresponds to the number of features in the data, while the output shape depends on the type of problem. In this case, the binary classification model has an output shape of one, representing a single probability score for the positive class. A short sketch tracing these shapes appears after this list.
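
    A short sketch tracing the shapes through the 2 -> 5 -> 1 architecture described above; the batch size of 8 is an arbitrary choice:

    ```python
    import torch
    from torch import nn

    layer_1 = nn.Linear(in_features=2, out_features=5)  # input layer -> hidden layer
    layer_2 = nn.Linear(in_features=5, out_features=1)  # hidden layer -> output layer

    x = torch.randn(8, 2)                    # a batch of 8 samples, 2 features each
    hidden = layer_1(x)                      # linear transformation
    output = torch.sigmoid(layer_2(hidden))  # sigmoid squashes logits into (0, 1)

    print(x.shape)       # torch.Size([8, 2])
    print(hidden.shape)  # torch.Size([8, 5])
    print(output.shape)  # torch.Size([8, 1])
    ```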

    This section of the sources introduces readers to the fundamental concepts of building neural networks in PyTorch. They guide through creating a simple binary classification model, explaining the key components like layers, neurons, and activation functions. The sources emphasize the importance of visualizing the network’s structure and understanding the connection between the code and its visual representation. They highlight PyTorch’s role in handling computations and guide readers through defining the input and output shapes for each layer, ensuring the model’s structure aligns with the dataset and the classification task. By combining code examples with clear explanations, the sources provide a solid foundation for building and understanding neural networks in PyTorch.

    Setting up for Success: Approaching the PyTorch Deep Learning Course: Pages 321-330

    The sources transition from the specifics of model architecture to a broader discussion about navigating the PyTorch deep learning course effectively. They emphasize the importance of active learning, self-directed exploration, and leveraging available resources to enhance understanding and skill development.

    • Embracing Google and Exploration: The sources advocate for active learning and encourage learners to “Google it.” They suggest that encountering unfamiliar concepts or terms should prompt learners to independently research and explore, using search engines like Google to delve deeper into the subject matter. This approach fosters a self-directed learning style and encourages learners to go beyond the course materials.
    • Prioritizing Hands-On Experience: The sources stress the significance of hands-on experience over theoretical definitions. They acknowledge that while definitions are readily available online, the focus of the course is on practical implementation and building models. They encourage learners to prioritize coding and experimentation to solidify their understanding of PyTorch.
    • Utilizing Wikipedia for Definitions: The sources specifically recommend Wikipedia as a reliable resource for looking up definitions. They recognize Wikipedia’s comprehensive and well-maintained content, suggesting it as a valuable tool for learners seeking clear and accurate explanations of technical terms.
    • Structuring the Course for Effective Learning: The sources outline a structured approach to the course, breaking down the content into manageable modules and emphasizing a sequential learning process. They introduce the concept of “chapters” as distinct units of learning, each covering specific topics and building upon previous knowledge.
    • Encouraging Questions and Discussion: The sources foster an interactive learning environment, encouraging learners to ask questions and engage in discussions. They highlight the importance of seeking clarification and sharing insights with instructors and peers to enhance the learning experience. They recommend utilizing online platforms, such as GitHub discussion pages, for asking questions and engaging in course-related conversations.
    • Providing Course Materials on GitHub: The sources ensure accessibility to course materials by making them readily available on GitHub. They specify the repository where learners can access code, notebooks, and other resources used throughout the course. They also mention “learnpytorch.io” as an alternative location where learners can find an online, readable book version of the course content.

    This section of the sources provides guidance on approaching the PyTorch deep learning course effectively. The sources encourage a self-directed learning style, emphasizing the importance of active exploration, independent research, and hands-on experimentation. They recommend utilizing online resources, including search engines and Wikipedia, for in-depth understanding and advocate for engaging in discussions and seeking clarification. By outlining a structured approach, providing access to comprehensive course materials, and fostering an interactive learning environment, the sources aim to equip learners with the necessary tools and mindset for a successful PyTorch deep learning journey.

    Navigating Course Resources and Documentation: Pages 331-340

    The sources guide learners on how to effectively utilize the course resources and navigate PyTorch documentation to enhance their learning experience. They emphasize the importance of referring to the materials provided on GitHub, engaging in Q&A sessions, and familiarizing oneself with the structure and features of the online book version of the course.

    • Identifying Key Resources: The sources highlight three primary resources for the PyTorch course:
    • Materials on GitHub: The sources specify a GitHub repository (mrdbourke/pytorch-deep-learning [1]) as the central location for accessing course materials, including outlines, code, notebooks, and additional resources. This repository serves as a comprehensive hub for learners to find everything they need to follow along with the course. They note that the repository is a work in progress [1] but assure users that its organization will remain largely the same [1].
    • Course Q&A: The sources emphasize the importance of asking questions and seeking clarification throughout the learning process. They encourage learners to utilize the designated Q&A platform, likely a forum or discussion board, to post their queries and engage with instructors and peers. This interactive component of the course fosters a collaborative learning environment and provides a valuable avenue for resolving doubts and gaining insights.
    • Course Online Book (learnpytorch.io): The sources recommend referring to the online book version of the course, accessible at learnpytorch.io [2, 3]. This platform offers a structured and readable format for the course content, presenting the material in a more organized and comprehensive manner than the video lectures. The online book gives learners a valuable resource to reinforce their understanding and revisit concepts in greater detail.
    • Navigating the Online Book: The sources describe the key features of the online book platform, highlighting its user-friendly design and functionality:
    • Readable Format and Search Functionality: The online book presents the course content in a clear and easily understandable format, making it convenient for learners to review and grasp the material. Additionally, the platform offers search functionality, enabling learners to quickly locate specific topics or concepts within the book. This feature enhances the book’s usability and allows learners to efficiently find the information they need.
    • Structured Headings and Images: The online book utilizes structured headings and includes relevant images to organize and illustrate the content effectively. The use of headings breaks down the material into logical sections, improving readability and comprehension. The inclusion of images provides visual aids to complement the textual explanations, further enhancing understanding and engagement.

    This section of the sources focuses on guiding learners on how to effectively utilize the various resources provided for the PyTorch deep learning course. The sources emphasize the importance of accessing the materials on GitHub, actively engaging in Q&A sessions, and utilizing the online book version of the course to supplement learning. By describing the structure and features of these resources, the sources aim to equip learners with the knowledge and tools to navigate the course effectively, enhance their understanding of PyTorch, and ultimately succeed in their deep learning journey.

    Deep Dive into PyTorch Tensors: Pages 341-350

    The sources shift focus to PyTorch tensors, the fundamental data structure for working with numerical data in PyTorch. They explain how to create tensors using various methods and introduce essential tensor operations like indexing, reshaping, and stacking. The sources emphasize the significance of tensors in deep learning, highlighting their role in representing data and performing computations. They also stress the importance of understanding tensor shapes and dimensions for effective manipulation and model building.

    • Introducing the torch.nn Module: The sources introduce the torch.nn module as the core component for building neural networks in PyTorch. They explain that torch.nn provides a collection of classes and functions for defining and working with various layers, activation functions, and loss functions. They highlight that almost everything in PyTorch relies on torch.tensor as the foundational data structure.
    • Creating PyTorch Tensors: The sources provide a practical introduction to creating PyTorch tensors using the torch.tensor function. They emphasize that this function serves as the primary method for creating tensors, which act as multi-dimensional arrays for storing and manipulating numerical data. They guide readers through basic examples, illustrating how to create tensors from lists of values.
    • Encouraging Exploration of PyTorch Documentation: The sources consistently encourage learners to explore the official PyTorch documentation for in-depth understanding and reference. They specifically recommend spending at least 10 minutes reviewing the documentation for torch.tensor after completing relevant video tutorials. This practice fosters familiarity with PyTorch’s functionalities and encourages a self-directed learning approach.
    • Exploring the torch.arange Function: The sources introduce the torch.arange function for generating tensors containing a sequence of evenly spaced values within a specified range. They provide code examples demonstrating how to use torch.arange to create tensors similar to Python’s built-in range function. They also explain the function’s parameters, including start, end, and step, allowing learners to control the sequence generation.
    • Highlighting Deprecated Functions: The sources point out that certain PyTorch functions, like torch.range, may become deprecated over time as the library evolves. They inform learners about such deprecations and recommend using updated functions like torch.arange as alternatives. This awareness ensures learners are using the most current and recommended practices.
    • Addressing Tensor Shape Compatibility in Reshaping: The sources discuss the concept of shape compatibility when reshaping tensors using the torch.reshape function. They emphasize that the new shape specified for the tensor must be compatible with the original number of elements in the tensor. They provide examples illustrating both compatible and incompatible reshaping scenarios, explaining the potential errors that may arise when incompatibility occurs. They also note that encountering and resolving errors during coding is a valuable learning experience, promoting problem-solving skills.
    • Understanding Tensor Stacking with torch.stack: The sources introduce the torch.stack function for combining multiple tensors along a new dimension. They explain that stacking effectively concatenates tensors, creating a higher-dimensional tensor. They guide readers through code examples, demonstrating how to use torch.stack to combine tensors and control the stacking dimension using the dim parameter. They also reference the torch.stack documentation, encouraging learners to review it for a comprehensive understanding of the function’s usage.
    • Illustrating Tensor Permutation with torch.permute: The sources delve into the torch.permute function for rearranging the dimensions of a tensor. They explain that permuting changes the order of axes in a tensor, effectively reshaping it without altering the underlying data. They provide code examples demonstrating how to use torch.permute to change the order of dimensions, illustrating the transformation of tensor shape. They also connect this concept to real-world applications, particularly in image processing, where permuting can be used to rearrange color channels, height, and width dimensions.
    • Explaining Random Seed for Reproducibility: The sources address the importance of setting a random seed for reproducibility in deep learning experiments. They introduce the concept of pseudo-random number generators and explain how setting a random seed ensures consistent results when working with random processes. They link to PyTorch documentation for further exploration of random number generation and the role of random seeds. A combined sketch of the tensor operations from this list appears below.
    • Providing Guidance on Exercises and Curriculum: The sources transition to discussing exercises and additional curriculum for learners to solidify their understanding of PyTorch fundamentals. They refer to the “PyTorch fundamentals notebook,” which likely contains a collection of exercises and supplementary materials for learners to practice the concepts covered in the course. They recommend completing these exercises to reinforce learning and gain hands-on experience. They also mention that each chapter in the online book concludes with exercises and extra curriculum, providing learners with ample opportunities for practice and exploration.
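
    A combined sketch of the tensor operations discussed above; the shapes and values are illustrative:

    ```python
    import torch

    x = torch.arange(start=0, end=10, step=1)  # tensor([0, 1, ..., 9])

    # Reshaping: the new shape must be compatible with the element count
    reshaped = x.reshape(2, 5)  # works: 2 * 5 == 10
    # x.reshape(3, 4)           # would raise an error: 12 != 10

    # Stacking concatenates tensors along a new dimension
    stacked = torch.stack([x, x], dim=0)  # shape: torch.Size([2, 10])

    # Permuting reorders dimensions without changing the underlying data,
    # e.g. an image from [channels, height, width] to [height, width, channels]
    image = torch.rand(3, 224, 224)
    permuted = image.permute(1, 2, 0)  # shape: torch.Size([224, 224, 3])

    # A random seed makes "random" results reproducible
    torch.manual_seed(42)
    a = torch.rand(2, 2)
    torch.manual_seed(42)
    b = torch.rand(2, 2)
    print(torch.equal(a, b))  # True: same seed, same "random" numbers
    ```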

    This section focuses on introducing PyTorch tensors, a fundamental concept in deep learning, and providing practical examples of tensor manipulation using functions like torch.arange, torch.reshape, and torch.stack. The sources encourage learners to refer to PyTorch documentation for comprehensive understanding and highlight the significance of tensors in representing data and performing computations. By combining code demonstrations with explanations and real-world connections, the sources equip learners with a solid foundation for working with tensors in PyTorch.

    Working with Loss Functions and Optimizers in PyTorch: Pages 351-360

    The sources transition to a discussion of loss functions and optimizers, crucial components of the training process for neural networks in PyTorch. They explain that loss functions measure the difference between model predictions and actual target values, guiding the optimization process towards minimizing this difference. They introduce different types of loss functions suitable for various machine learning tasks, such as binary classification and multi-class classification, highlighting their specific applications and characteristics. The sources emphasize the significance of selecting an appropriate loss function based on the nature of the problem and the desired model output. They also explain the role of optimizers in adjusting model parameters to reduce the calculated loss, introducing common optimizer choices like Stochastic Gradient Descent (SGD) and Adam, each with its unique approach to parameter updates.

    • Understanding Binary Cross Entropy Loss: The sources introduce binary cross entropy loss as a commonly used loss function for binary classification problems, where the model predicts one of two possible classes. They note that PyTorch provides multiple implementations of binary cross entropy loss, including torch.nn.BCELoss and torch.nn.BCEWithLogitsLoss. They highlight a key distinction: torch.nn.BCELoss requires inputs to have already passed through the sigmoid activation function, while torch.nn.BCEWithLogitsLoss incorporates the sigmoid activation internally, offering enhanced numerical stability. The sources emphasize the importance of understanding these differences and selecting the appropriate implementation based on the model’s structure and activation functions. A short comparison of the two implementations appears after this list.
    • Exploring Loss Functions and Optimizers for Diverse Problems: The sources emphasize that PyTorch offers a wide range of loss functions and optimizers suitable for various machine learning problems beyond binary classification. They recommend referring to the online book version of the course for a comprehensive overview and code examples of different loss functions and optimizers applicable to diverse tasks. This comprehensive resource aims to equip learners with the knowledge to select appropriate components for their specific machine learning applications.
    • Outlining the Training Loop Steps: The sources outline the key steps involved in a typical training loop for a neural network:
    1. Forward Pass: Input data is fed through the model to obtain predictions.
    2. Loss Calculation: The difference between predictions and actual target values is measured using the chosen loss function.
    3. Optimizer Zeroing Gradients: Accumulated gradients from previous iterations are reset to zero.
    4. Backpropagation: Gradients of the loss function with respect to model parameters are calculated, indicating the direction and magnitude of parameter adjustments needed to minimize the loss.
    5. Optimizer Step: Model parameters are updated based on the calculated gradients and the optimizer’s update rule.
    • Applying Sigmoid Activation for Binary Classification: The sources emphasize the importance of applying the sigmoid activation function to the raw output (logits) of a binary classification model before making predictions. They explain that the sigmoid function transforms the logits into a probability value between 0 and 1, representing the model’s confidence in each class.
    • Illustrating Tensor Rounding and Dimension Squeezing: The sources demonstrate the use of torch.round to round tensor values to the nearest integer, often used for converting predicted probabilities into class labels in binary classification. They also explain the use of torch.squeeze to remove singleton dimensions from tensors, ensuring compatibility for operations requiring specific tensor shapes.
    • Structuring Training Output for Clarity: The sources highlight the practice of organizing training output to enhance clarity and monitor progress. They suggest printing relevant metrics like epoch number, loss, and accuracy at regular intervals, allowing users to track the model’s learning progress over time.
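
    A short comparison of the two binary cross entropy implementations on made-up logits, followed by the sigmoid -> round -> squeeze pattern described above; the values are illustrative:

    ```python
    import torch
    from torch import nn

    logits = torch.tensor([[2.0], [-1.0], [0.5]])  # raw model outputs
    targets = torch.tensor([[1.0], [0.0], [1.0]])

    # BCELoss expects probabilities, so sigmoid is applied first...
    loss_bce = nn.BCELoss()(torch.sigmoid(logits), targets)
    # ...while BCEWithLogitsLoss applies sigmoid internally (more numerically stable)
    loss_bce_logits = nn.BCEWithLogitsLoss()(logits, targets)
    print(loss_bce.item(), loss_bce_logits.item())  # same value, up to float precision

    # From logits to class labels: sigmoid -> round -> squeeze
    probs = torch.sigmoid(logits)          # probabilities between 0 and 1
    labels = torch.round(probs).squeeze()  # tensor([1., 0., 1.])
    print(labels)
    ```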

    This section introduces the concepts of loss functions and optimizers in PyTorch, emphasizing their importance in the training process. It guides learners on choosing suitable loss functions based on the problem type and provides insights into common optimizer choices. By explaining the steps involved in a typical training loop and showcasing practical code examples, the sources aim to equip learners with a solid understanding of how to train neural networks effectively in PyTorch.

    Building and Evaluating a PyTorch Model: Pages 361-370

    The sources transition to the practical application of the previously introduced concepts, guiding readers through the process of building, training, and evaluating a PyTorch model for a specific task. They emphasize the importance of structuring code clearly and organizing output for better understanding and analysis. The sources highlight the iterative nature of model development, involving multiple steps of training, evaluation, and refinement.

    • Defining a Simple Linear Model: The sources provide a code example demonstrating how to define a simple linear model in PyTorch using torch.nn.Linear. They explain that this model takes a specified number of input features and produces a corresponding number of output features, performing a linear transformation on the input data. They stress that while this simple model may not be suitable for complex tasks, it serves as a foundational example for understanding the basics of building neural networks in PyTorch.
    • Emphasizing Visualization in Data Exploration: The sources reiterate the importance of visualization in data exploration, encouraging readers to represent data visually to gain insights and understand patterns. They advocate for the “data explorer’s motto: visualize, visualize, visualize,” suggesting that visualizing data helps users become more familiar with its structure and characteristics, aiding in the model development process.
    • Preparing Data for Model Training: The sources outline the steps involved in preparing data for model training, which often includes splitting data into training and testing sets. They explain that the training set is used to train the model, while the testing set is used to evaluate its performance on unseen data. They introduce a simple method for splitting data based on a predetermined index and mention the popular scikit-learn library’s train_test_split function as a more robust method for random data splitting. They highlight that data splitting ensures that the model’s ability to generalize to new data is assessed accurately.
    • Creating a Training Loop: The sources provide a code example demonstrating the creation of a training loop, a fundamental component of training neural networks. The loop iterates over the training data for a specified number of epochs (one epoch being a complete pass through the entire training dataset), performing the steps outlined previously: forward pass, loss calculation, optimizer zeroing gradients, backpropagation, and optimizer step. They also provide guidance on customizing the loop, such as printing loss and other metrics at set intervals to monitor training progress.
    • Visualizing Loss and Parameter Convergence: The sources encourage visualizing the loss function’s value over epochs to observe its convergence, indicating the model’s learning progress. They also suggest tracking changes in model parameters (weights and bias) to understand how they adjust during training to minimize the loss. The sources highlight that these visualizations provide valuable insights into the training process and help users assess the model’s effectiveness.
    • Understanding the Concept of Overfitting: The sources introduce the concept of overfitting, a common challenge in machine learning, where a model performs exceptionally well on the training data but poorly on unseen data. They explain that overfitting occurs when the model learns the training data too well, capturing noise and irrelevant patterns that hinder its ability to generalize. They mention that techniques like early stopping, regularization, and data augmentation can mitigate overfitting, promoting better model generalization.
    • Evaluating Model Performance: The sources guide readers through evaluating a trained model’s performance using the testing set, data that the model has not seen during training. They calculate the loss on the testing set to assess how well the model generalizes to new data. They emphasize the importance of evaluating the model on data separate from the training set to obtain an unbiased estimate of its real-world performance. They also introduce the idea of visualizing model predictions alongside the ground truth data (actual labels) to gain qualitative insights into the model’s behavior.
    • Saving and Loading a Trained Model: The sources highlight the significance of saving a trained PyTorch model to preserve its learned parameters for future use. They provide a code example demonstrating how to save the model’s state dictionary, which contains the trained weights and biases, using torch.save. They also show how to restore a saved model by loading the state dictionary with torch.load and passing it to load_state_dict, enabling users to reuse trained models without retraining. A hedged sketch of this save/load round trip appears after this list.
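
    A hedged sketch of the save/load round trip, using a stand-in model and a hypothetical file name:

    ```python
    import torch
    from torch import nn
    from pathlib import Path

    model = nn.Linear(in_features=1, out_features=1)  # stand-in for a trained model

    # Save only the learned parameters (the state dict), not the whole object
    model_path = Path("model_0.pth")  # hypothetical file name
    torch.save(obj=model.state_dict(), f=model_path)

    # Load: create a fresh instance, then load the saved parameters into it
    loaded_model = nn.Linear(in_features=1, out_features=1)
    loaded_model.load_state_dict(torch.load(f=model_path))

    # Compare parameters to confirm the model loaded correctly
    print(loaded_model.state_dict())
    ```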

    This section guides readers through the practical steps of building, training, and evaluating a simple linear model in PyTorch. The sources emphasize visualization as a key aspect of data exploration and model understanding. By combining code examples with clear explanations and introducing essential concepts like overfitting and model evaluation, the sources equip learners with a practical foundation for building and working with neural networks in PyTorch.

    Understanding Neural Networks and PyTorch Resources: Pages 371-380

    The sources shift focus to neural networks, providing a conceptual understanding and highlighting resources for further exploration. They encourage active learning by posing challenges to readers, prompting them to apply their knowledge and explore concepts independently. The sources also emphasize the practical aspects of learning PyTorch, advocating for a hands-on approach with code over theoretical definitions.

    • Encouraging Exploration of Neural Network Definitions: The sources acknowledge the abundance of definitions for neural networks available online and encourage readers to formulate their own understanding by exploring various sources. They suggest engaging with external resources like Google searches and Wikipedia to broaden their knowledge and develop a personal definition of neural networks.
    • Recommending a Hands-On Approach to Learning: The sources advocate for a hands-on approach to learning PyTorch, emphasizing the importance of practical experience over theoretical definitions. They prioritize working with code and experimenting with different concepts to gain a deeper understanding of the framework.
    • Presenting Key PyTorch Resources: The sources introduce valuable resources for learning PyTorch, including:
    • GitHub Repository: A repository containing all course materials, including code examples, notebooks, and supplementary resources.
    • Course Q&A: A dedicated platform for asking questions and seeking clarification on course content.
    • Online Book: A comprehensive online book version of the course, providing in-depth explanations and code examples.
    • Highlighting Benefits of the Online Book: The sources highlight the advantages of the online book version of the course, emphasizing its user-friendly features:
    • Searchable Content: Users can easily search for specific topics or keywords within the book.
    • Interactive Elements: The book incorporates interactive elements, allowing users to engage with the content more dynamically.
    • Comprehensive Material: The book covers a wide range of PyTorch concepts and provides in-depth explanations.
    • Demonstrating PyTorch Documentation Usage: The sources demonstrate how to effectively utilize PyTorch documentation, emphasizing its value as a reference guide. They showcase examples of searching for specific functions within the documentation, highlighting the clear explanations and usage examples provided.
    • Addressing Common Errors in Deep Learning: The sources acknowledge that shape errors are common in deep learning, emphasizing the importance of understanding tensor shapes and dimensions for successful model implementation. They provide examples of shape errors encountered during code demonstrations, illustrating how mismatched tensor dimensions can lead to errors. They encourage users to pay close attention to tensor shapes and use debugging techniques to identify and resolve such issues.
    • Introducing the Concept of Tensor Stacking: The sources introduce the concept of tensor stacking using torch.stack, explaining its functionality in concatenating a sequence of tensors along a new dimension. They clarify the dim parameter, which specifies the dimension along which the stacking operation is performed. They provide code examples demonstrating the usage of torch.stack and its impact on tensor shapes, emphasizing its utility in combining tensors effectively.
    • Explaining Tensor Permutation: The sources explain tensor permutation as a method for rearranging the dimensions of a tensor using torch.permute. They emphasize that permuting a tensor changes how the data is viewed without altering the underlying data itself. They illustrate the concept with an example of permuting a tensor representing color channels, height, and width of an image, highlighting how the permutation operation reorders these dimensions while preserving the image data.
    • Introducing Indexing on Tensors: The sources introduce the concept of indexing on tensors, a fundamental operation for accessing specific elements or subsets of data within a tensor. They present a challenge to readers, asking them to practice indexing on a given tensor to extract specific values. This exercise aims to reinforce the understanding of tensor indexing and its practical application.
    • Explaining Random Seed and Random Number Generation: The sources explain the concept of a random seed in the context of random number generation, highlighting its role in controlling the reproducibility of random processes. They mention that setting a random seed ensures that the same sequence of random numbers is generated each time the code is executed, enabling consistent results for debugging and experimentation. They provide external resources, such as documentation links, for those interested in delving deeper into random number generation concepts in computing.
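
    As a short illustration of the seed idea (pure PyTorch, nothing assumed beyond the library itself), setting the same seed before each call produces identical "random" tensors:

    import torch

    torch.manual_seed(42)      # fix the random generator's starting state
    a = torch.rand(2, 2)

    torch.manual_seed(42)      # reset to the same starting state
    b = torch.rand(2, 2)

    print(torch.equal(a, b))   # True: the same sequence of random numbers is produced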

    This section transitions from general concepts of neural networks to practical aspects of using PyTorch, highlighting valuable resources for further exploration and emphasizing a hands-on learning approach. By demonstrating documentation usage, addressing common errors, and introducing tensor manipulation techniques like stacking, permutation, and indexing, the sources equip learners with essential tools for working effectively with PyTorch.

    Building a Model with PyTorch: Pages 381-390

    The sources guide readers through building a more complex model in PyTorch, introducing the concept of subclassing nn.Module to create custom model architectures. They highlight the importance of understanding the PyTorch workflow, which involves preparing data, defining a model, selecting a loss function and optimizer, training the model, making predictions, and evaluating performance. The sources emphasize that while the steps involved remain largely consistent across different tasks, understanding the nuances of each step and how they relate to the specific problem being addressed is crucial for effective model development.

    • Introducing the nn.Module Class: The sources explain that in PyTorch, neural network models are built by subclassing the nn.Module class, which provides a structured framework for defining model components and their interactions. They highlight that this approach offers flexibility and organization, enabling users to create custom architectures tailored to specific tasks.
    • Defining a Custom Model Architecture: The sources provide a code example demonstrating how to define a custom model architecture by subclassing nn.Module (a minimal sketch of this pattern appears after this list). They emphasize the key components of a model definition:
    • Constructor (__init__): This method initializes the model’s layers and other components.
    • Forward Pass (forward): This method defines how the input data flows through the model’s layers during the forward propagation step.
    • Understanding PyTorch Building Blocks: The sources explain that PyTorch provides a rich set of building blocks for neural networks, contained within the torch.nn module. They highlight that nn contains various layers, activation functions, loss functions, and other components essential for constructing neural networks.
    • Illustrating the Flow of Data Through a Model: The sources visually illustrate the flow of data through the defined model, using diagrams to represent the input features, hidden layers, and output. They explain that the input data is passed through a series of linear transformations (nn.Linear layers) and activation functions, ultimately producing an output that corresponds to the task being addressed.
    • Creating a Training Loop with Multiple Epochs: The sources demonstrate how to create a training loop that iterates over the training data for a specified number of epochs, performing the steps involved in training a neural network: forward pass, loss calculation, optimizer zeroing gradients, backpropagation, and optimizer step. They highlight the importance of training for multiple epochs to allow the model to learn from the data iteratively and adjust its parameters to minimize the loss function.
    • Observing Loss Reduction During Training: The sources show the output of the training loop, emphasizing how the loss value decreases over epochs, indicating that the model is learning from the data and improving its performance. They explain that this decrease in loss signifies that the model’s predictions are becoming more aligned with the actual labels.
    • Emphasizing Visual Inspection of Data: The sources reiterate the importance of visualizing data, advocating for visually inspecting the data before making predictions. They highlight that understanding the data’s characteristics and patterns is crucial for informed model development and interpretation of results.
    • Preparing Data for Visualization: The sources guide readers through preparing data for visualization, including splitting it into training and testing sets and organizing it into appropriate data structures. They mention using libraries like matplotlib to create visual representations of the data, aiding in data exploration and understanding.
    • Introducing the torch.no_grad Context: The sources introduce the concept of the torch.no_grad context, explaining its role in performing computations without tracking gradients. They highlight that this context is particularly useful during model evaluation or inference, where gradient calculations are not required, leading to more efficient computation.
    • Defining a Testing Loop: The sources guide readers through defining a testing loop, similar to the training loop, which iterates over the testing data to evaluate the model’s performance on unseen data. They emphasize the importance of evaluating the model on data separate from the training set to obtain an unbiased assessment of its ability to generalize. They outline the steps involved in the testing loop: performing a forward pass, calculating the loss, and accumulating relevant metrics like loss and accuracy.
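
    A minimal sketch of the pattern described in this list: a custom module subclassing nn.Module with an __init__ and a forward method, then evaluated under torch.no_grad (the layer sizes and dummy data are illustrative assumptions):

    import torch
    from torch import nn

    class SimpleModel(nn.Module):
        def __init__(self):
            super().__init__()
            # Layers are registered as attributes in the constructor
            self.layer_1 = nn.Linear(in_features=2, out_features=8)
            self.layer_2 = nn.Linear(in_features=8, out_features=1)

        def forward(self, x):
            # Defines how input data flows through the layers
            return self.layer_2(self.layer_1(x))

    model = SimpleModel()
    X_test = torch.rand(5, 2)    # dummy test batch: 5 samples, 2 features

    model.eval()                 # evaluation mode
    with torch.no_grad():        # no gradient tracking needed for inference
        test_preds = model(X_test)
    print(test_preds.shape)      # torch.Size([5, 1])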

    The sources provide a comprehensive walkthrough of building and training a more sophisticated neural network model in PyTorch. They emphasize the importance of understanding the PyTorch workflow, from data preparation to model evaluation, and highlight the flexibility and organization offered by subclassing nn.Module to create custom model architectures. They continue to stress the value of visual inspection of data and encourage readers to explore concepts like data visualization and model evaluation in detail.

    Building and Evaluating Models in PyTorch: Pages 391-400

    The sources focus on training and evaluating a regression model in PyTorch, emphasizing the iterative nature of model development and improvement. They guide readers through the process of building a simple model, training it, evaluating its performance, and identifying areas for potential enhancements. They introduce the concept of non-linearity in neural networks, explaining how the addition of non-linear activation functions can enhance a model’s ability to learn complex patterns.

    • Building a Regression Model with PyTorch: The sources provide a step-by-step guide to building a simple regression model using PyTorch. They showcase the creation of a model with linear layers (nn.Linear), illustrating how to define the input and output dimensions of each layer. They emphasize that for regression tasks, the output layer typically has a single output unit representing the predicted value.
    • Creating a Training Loop for Regression: The sources demonstrate how to create a training loop specifically for regression tasks (a minimal sketch appears after this list). They outline the familiar steps involved: forward pass, loss calculation, optimizer zeroing gradients, backpropagation, and optimizer step. They emphasize that the loss function used for regression differs from that used for classification tasks, typically employing mean squared error (MSE) or similar metrics to measure the difference between predicted and actual values.
    • Observing Loss Reduction During Regression Training: The sources show the output of the training loop for the regression model, highlighting how the loss value decreases over epochs, indicating that the model is learning to predict the target values more accurately. They explain that this decrease in loss signifies that the model’s predictions are converging towards the actual values.
    • Evaluating the Regression Model: The sources guide readers through evaluating the trained regression model. They emphasize the importance of using a separate testing dataset to assess the model’s ability to generalize to unseen data. They outline the steps involved in evaluating the model on the testing set, including performing a forward pass, calculating the loss, and accumulating metrics.
    • Visualizing Regression Model Predictions: The sources advocate for visualizing the predictions of the regression model, explaining that visual inspection can provide valuable insights into the model’s performance and potential areas for improvement. They suggest plotting the predicted values against the actual values, allowing users to assess how well the model captures the underlying relationship in the data.
    • Introducing Non-Linearities in Neural Networks: The sources introduce the concept of non-linearity in neural networks, explaining that real-world data often exhibits complex, non-linear relationships. They highlight that incorporating non-linear activation functions into neural network models can significantly enhance their ability to learn and represent these intricate patterns. They mention activation functions like ReLU (Rectified Linear Unit) as common choices for introducing non-linearity.
    • Encouraging Experimentation with Non-Linearities: The sources encourage readers to experiment with different non-linear activation functions, explaining that the choice of activation function can impact model performance. They suggest trying various activation functions and observing their effects on the model’s ability to learn from the data and make accurate predictions.
    • Highlighting the Role of Hyperparameters: The sources emphasize that various components of a neural network, such as the number of layers, number of units in each layer, learning rate, and activation functions, are hyperparameters that can be adjusted to influence model performance. They encourage experimentation with different hyperparameter settings to find optimal configurations for specific tasks.
    • Demonstrating the Impact of Adding Layers: The sources visually demonstrate the effect of adding more layers to a neural network model, explaining that increasing the model’s depth can enhance its ability to learn complex representations. They show how a deeper model, compared to a shallower one, can better capture the intricacies of the data and make more accurate predictions.
    • Illustrating the Addition of ReLU Activation Functions: The sources provide a visual illustration of incorporating ReLU activation functions into a neural network model. They show how ReLU introduces non-linearity by applying a thresholding operation to the output of linear layers, enabling the model to learn non-linear decision boundaries and better represent complex relationships in the data.
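
    A minimal sketch of the regression loop just described, trained on synthetic linear data (the data, layer sizes, learning rate, and epoch count are all illustrative assumptions):

    import torch
    from torch import nn

    # Synthetic regression data: y = 2x + 1 plus a little noise
    X = torch.rand(100, 1)
    y = 2 * X + 1 + 0.05 * torch.randn(100, 1)

    model = nn.Linear(in_features=1, out_features=1)   # single output unit for regression
    loss_fn = nn.MSELoss()                             # mean squared error
    optimizer = torch.optim.SGD(params=model.parameters(), lr=0.1)

    for epoch in range(100):
        y_pred = model(X)              # 1. forward pass
        loss = loss_fn(y_pred, y)      # 2. calculate the loss
        optimizer.zero_grad()          # 3. zero accumulated gradients
        loss.backward()                # 4. backpropagation
        optimizer.step()               # 5. update parameters
        if epoch % 20 == 0:
            print(f"Epoch {epoch} | loss: {loss.item():.4f}")   # loss should shrink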

    This section guides readers through the process of building, training, and evaluating a regression model in PyTorch, emphasizing the iterative nature of model development. The sources highlight the importance of visualizing predictions and the role of non-linear activation functions in enhancing model capabilities. They encourage experimentation with different architectures and hyperparameters, fostering a deeper understanding of the factors influencing model performance and promoting a data-driven approach to model building.

    Working with Tensors and Data in PyTorch: Pages 401-410

    The sources guide readers through various aspects of working with tensors and data in PyTorch, emphasizing the fundamental role tensors play in deep learning computations. They introduce techniques for creating, manipulating, and understanding tensors, highlighting their importance in representing and processing data for neural networks.

    • Creating Tensors in PyTorch: The sources detail methods for creating tensors in PyTorch, focusing on the torch.arange() function. They explain that torch.arange() generates a tensor containing a sequence of evenly spaced values within a specified range. They provide code examples illustrating the use of torch.arange() with various parameters like start, end, and step to control the generated sequence. (A combined sketch of the tensor operations covered in this list appears after it.)
    • Understanding the Deprecation of torch.range(): The sources note that the torch.range() function, previously used for creating tensors with a range of values, has been deprecated in favor of torch.arange(). They encourage users to adopt torch.arange() for creating tensors containing sequences of values.
    • Exploring Tensor Shapes and Reshaping: The sources emphasize the significance of understanding tensor shapes in PyTorch, explaining that the shape of a tensor determines its dimensionality and the arrangement of its elements. They introduce the concept of reshaping tensors, using functions like torch.reshape() to modify a tensor’s shape while preserving its total number of elements. They provide code examples demonstrating how to reshape tensors to match specific requirements for various operations or layers in neural networks.
    • Stacking Tensors Together: The sources introduce the torch.stack() function, explaining its role in concatenating a sequence of tensors along a new dimension. They explain that torch.stack() takes a list of tensors as input and combines them into a higher-dimensional tensor, effectively stacking them together along a specified dimension. They illustrate the use of torch.stack() with code examples, highlighting how it can be used to combine multiple tensors into a single structure.
    • Permuting Tensor Dimensions: The sources explore the concept of permuting tensor dimensions, explaining that it involves rearranging the axes of a tensor. They introduce the torch.permute() function, which reorders the dimensions of a tensor according to specified indices. They demonstrate the use of torch.permute() with code examples, emphasizing its application in tasks like transforming image data from the format (Height, Width, Channels) to (Channels, Height, Width), which is often required by convolutional neural networks.
    • Visualizing Tensors and Their Shapes: The sources advocate for visualizing tensors and their shapes, explaining that visual inspection can aid in understanding the structure and arrangement of tensor data. They suggest using tools like matplotlib to create graphical representations of tensors, allowing users to better comprehend the dimensionality and organization of tensor elements.
    • Indexing and Slicing Tensors: The sources guide readers through techniques for indexing and slicing tensors, explaining how to access specific elements or sub-regions within a tensor. They demonstrate the use of square brackets ([]) for indexing tensors, illustrating how to retrieve elements based on their indices along various dimensions. They further explain how slicing allows users to extract a portion of a tensor by specifying start and end indices along each dimension. They provide code examples showcasing various indexing and slicing operations, emphasizing their role in manipulating and extracting data from tensors.
    • Introducing the Concept of Random Seeds: The sources introduce the concept of random seeds, explaining their significance in controlling the randomness in PyTorch operations that involve random number generation. They explain that setting a random seed ensures that the same sequence of random numbers is generated each time the code is run, promoting reproducibility of results. They provide code examples demonstrating how to set a random seed using torch.manual_seed(), highlighting its importance in maintaining consistency during model training and experimentation.
    • Exploring the torch.rand() Function: The sources explore the torch.rand() function, explaining its role in generating tensors filled with random numbers drawn from a uniform distribution between 0 and 1. They provide code examples demonstrating the use of torch.rand() to create tensors of various shapes filled with random values.
    • Discussing Running Tensors and GPUs: The sources introduce the concept of running tensors on GPUs (Graphics Processing Units), explaining that GPUs offer significant computational advantages for deep learning tasks compared to CPUs. They highlight that PyTorch provides mechanisms for transferring tensors to and from GPUs, enabling users to leverage GPU acceleration for training and inference.
    • Emphasizing Documentation and Extra Resources: The sources consistently encourage readers to refer to the PyTorch documentation for detailed information on functions, modules, and concepts. They also highlight the availability of supplementary resources, including online tutorials, blog posts, and research papers, to enhance understanding and provide deeper insights into various aspects of PyTorch.
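
    The tensor operations covered in this list can be exercised in a few lines; the shapes and values below are illustrative assumptions:

    import torch

    x = torch.arange(start=0, end=12, step=1)    # tensor([0, 1, ..., 11])
    x_reshaped = x.reshape(3, 4)                 # same 12 elements, new shape (3, 4)

    stacked = torch.stack([x, x, x], dim=0)      # new leading dimension -> shape (3, 12)

    image = torch.rand(size=(224, 224, 3))       # (height, width, channels)
    image_chw = image.permute(2, 0, 1)           # -> (channels, height, width), same data

    print(x_reshaped[0])       # first row -> tensor([0, 1, 2, 3])
    print(x_reshaped[1][2])    # row 1, column 2 -> tensor(6)
    print(x_reshaped[:, 1])    # all rows, column 1 -> tensor([1, 5, 9])

    torch.manual_seed(42)                        # reproducible randomness
    random_tensor = torch.rand(2, 3)             # uniform values in [0, 1)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    random_tensor = random_tensor.to(device)     # move to the GPU when available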

    This section guides readers through various techniques for working with tensors and data in PyTorch, highlighting the importance of understanding tensor shapes, reshaping, stacking, permuting, indexing, and slicing operations. They introduce concepts like random seeds and GPU acceleration, emphasizing the importance of leveraging available documentation and resources to enhance understanding and facilitate effective deep learning development using PyTorch.

    Constructing and Training Neural Networks with PyTorch: Pages 411-420

    The sources focus on building and training neural networks in PyTorch, specifically in the context of binary classification tasks. They guide readers through the process of creating a simple neural network architecture, defining a suitable loss function, setting up an optimizer, implementing a training loop, and evaluating the model’s performance on test data. They emphasize the use of activation functions, such as the sigmoid function, to introduce non-linearity into the network and enable it to learn complex decision boundaries.

    • Building a Neural Network for Binary Classification: The sources provide a step-by-step guide to constructing a neural network specifically for binary classification (an end-to-end sketch of these pieces appears after this list). They show the creation of a model with linear layers (nn.Linear) stacked sequentially, illustrating how to define the input and output dimensions of each layer. They emphasize that the output layer for binary classification tasks typically has a single output unit, representing the probability of the positive class.
    • Using the Sigmoid Activation Function: The sources introduce the sigmoid activation function, explaining its role in transforming the output of linear layers into a probability value between 0 and 1. They highlight that the sigmoid function introduces non-linearity into the network, allowing it to model complex relationships between input features and the target class.
    • Creating a Training Loop for Binary Classification: The sources demonstrate the implementation of a training loop tailored for binary classification tasks. They outline the familiar steps involved: forward pass to calculate the loss, optimizer zeroing gradients, backpropagation to calculate gradients, and optimizer step to update model parameters.
    • Understanding Binary Cross-Entropy Loss: The sources explain the concept of binary cross-entropy loss, a common loss function used for binary classification tasks. They describe how binary cross-entropy loss measures the difference between the predicted probabilities and the true labels, guiding the model to learn to make accurate predictions.
    • Calculating Accuracy for Binary Classification: The sources demonstrate how to calculate accuracy for binary classification tasks. They show how to convert the model’s predicted probabilities into binary predictions using a threshold (typically 0.5), comparing these predictions to the true labels to determine the percentage of correctly classified instances.
    • Evaluating the Model on Test Data: The sources emphasize the importance of evaluating the trained model on a separate testing dataset to assess its ability to generalize to unseen data. They outline the steps involved in testing the model, including performing a forward pass on the test data, calculating the loss, and computing the accuracy.
    • Plotting Predictions and Decision Boundaries: The sources advocate for visualizing the model’s predictions and decision boundaries, explaining that visual inspection can provide valuable insights into the model’s behavior and performance. They suggest using plotting techniques to display the decision boundary learned by the model, illustrating how the model separates data points belonging to different classes.
    • Using Helper Functions to Simplify Code: The sources introduce the use of helper functions to organize and streamline the code for training and evaluating the model. They demonstrate how to encapsulate repetitive tasks, such as plotting predictions or calculating accuracy, into reusable functions, improving code readability and maintainability.
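
    The pieces above fit together in a few lines. The sketch below uses random stand-in data, so the architecture, data, and hyperparameters are illustrative assumptions rather than the source's exact model:

    import torch
    from torch import nn

    # Stand-in binary data: 100 samples, 2 features, labels 0 or 1
    X = torch.rand(100, 2)
    y = (X.sum(dim=1) > 1.0).float().unsqueeze(dim=1)   # shape (100, 1)

    model = nn.Sequential(
        nn.Linear(in_features=2, out_features=8),
        nn.Linear(in_features=8, out_features=1),
        nn.Sigmoid(),                 # squash outputs to probabilities in (0, 1)
    )
    loss_fn = nn.BCELoss()            # binary cross-entropy on probabilities
    optimizer = torch.optim.SGD(params=model.parameters(), lr=0.1)

    for epoch in range(100):
        y_prob = model(X)             # forward pass -> probabilities
        loss = loss_fn(y_prob, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Accuracy: threshold probabilities at 0.5 to get hard 0/1 predictions
    with torch.no_grad():
        y_pred = (model(X) >= 0.5).float()
    accuracy = (y_pred == y).float().mean() * 100
    print(f"Training accuracy: {accuracy:.1f}%")

    In practice, nn.BCEWithLogitsLoss (which folds the sigmoid into the loss) is often preferred for numerical stability, in which case the model outputs raw logits instead of probabilities.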

    This section guides readers through the construction and training of neural networks for binary classification in PyTorch. The sources emphasize the use of activation functions to introduce non-linearity, the choice of suitable loss functions and optimizers, the implementation of a training loop, and the evaluation of the model on test data. They highlight the importance of visualizing predictions and decision boundaries and introduce techniques for organizing code using helper functions.

    Exploring Non-Linearities and Multi-Class Classification in PyTorch: Pages 421-430

    The sources continue the exploration of neural networks, focusing on incorporating non-linearities using activation functions and expanding into multi-class classification. They guide readers through the process of enhancing model performance by adding non-linear activation functions, transitioning from binary classification to multi-class classification, choosing appropriate loss functions and optimizers, and evaluating model performance with metrics such as accuracy.

    • Incorporating Non-Linearity with Activation Functions: The sources emphasize the crucial role of non-linear activation functions in enabling neural networks to learn complex patterns and relationships within data. They introduce the ReLU (Rectified Linear Unit) activation function, highlighting its effectiveness and widespread use in deep learning. They explain that ReLU introduces non-linearity by setting negative values to zero and passing positive values unchanged. This simple yet powerful activation function allows neural networks to model non-linear decision boundaries and capture intricate data representations. (A short sketch combining ReLU and softmax appears after this list.)
    • Understanding the Importance of Non-Linearity: The sources provide insights into the rationale behind incorporating non-linearity into neural networks. They explain that without non-linear activation functions, a neural network, regardless of its depth, would essentially behave as a single linear layer, severely limiting its ability to learn complex patterns. Non-linear activation functions, like ReLU, introduce bends and curves into the model’s decision boundaries, allowing it to capture non-linear relationships and make more accurate predictions.
    • Transitioning to Multi-Class Classification: The sources smoothly transition from binary classification to multi-class classification, where the task involves classifying data into more than two categories. They explain the key differences between binary and multi-class classification, highlighting the need for adjustments in the model’s output layer and the choice of loss function and activation function.
    • Using Softmax for Multi-Class Classification: The sources introduce the softmax activation function, commonly used in the output layer of multi-class classification models. They explain that softmax transforms the raw output scores (logits) of the network into a probability distribution over the different classes, ensuring that the predicted probabilities for all classes sum up to one.
    • Choosing an Appropriate Loss Function for Multi-Class Classification: The sources guide readers in selecting appropriate loss functions for multi-class classification. They discuss cross-entropy loss, a widely used loss function for multi-class classification tasks, explaining how it measures the difference between the predicted probability distribution and the true label distribution.
    • Implementing a Training Loop for Multi-Class Classification: The sources outline the steps involved in implementing a training loop for multi-class classification models. They demonstrate the familiar process of iterating through the training data in batches, performing a forward pass, calculating the loss, backpropagating to compute gradients, and updating the model’s parameters using an optimizer.
    • Evaluating Multi-Class Classification Models: The sources focus on evaluating the performance of multi-class classification models using metrics like accuracy. They explain that accuracy measures the percentage of correctly classified instances over the entire dataset, providing an overall assessment of the model’s predictive ability.
    • Visualizing Multi-Class Classification Results: The sources suggest visualizing the predictions and decision boundaries of multi-class classification models, emphasizing the importance of visual inspection for gaining insights into the model’s behavior and performance. They demonstrate techniques for plotting the decision boundaries learned by the model, showing how the model divides the feature space to separate data points belonging to different classes.
    • Highlighting the Interplay of Linear and Non-linear Functions: The sources emphasize the combined effect of linear transformations (performed by linear layers) and non-linear transformations (introduced by activation functions) in allowing neural networks to learn complex patterns. They explain that the interplay of linear and non-linear functions enables the model to capture intricate data representations and make accurate predictions across a wide range of tasks.
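
    A small sketch of the logits-to-probabilities step described above (the shapes are illustrative assumptions). Note that PyTorch's nn.CrossEntropyLoss expects raw logits and applies log-softmax internally, so an explicit softmax is typically used only when interpreting predictions:

    import torch
    from torch import nn

    model = nn.Sequential(
        nn.Linear(in_features=2, out_features=8),
        nn.ReLU(),                                  # negatives -> 0, positives pass through
        nn.Linear(in_features=8, out_features=4),   # 4 output units for 4 classes
    )

    X = torch.rand(5, 2)                   # 5 samples, 2 features
    logits = model(X)                      # raw scores, shape (5, 4)
    probs = torch.softmax(logits, dim=1)   # probability distribution over classes

    print(probs.sum(dim=1))                # each row sums to 1
    print(probs.argmax(dim=1))             # predicted class index per sample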

    This section guides readers through the process of incorporating non-linearity into neural networks using activation functions like ReLU and transitioning from binary to multi-class classification using the softmax activation function. The sources discuss the choice of appropriate loss functions for multi-class classification, demonstrate the implementation of a training loop, and highlight the importance of evaluating model performance using metrics like accuracy and visualizing decision boundaries to gain insights into the model’s behavior. They emphasize the critical role of combining linear and non-linear functions to enable neural networks to effectively learn complex patterns within data.

    Visualizing and Building Neural Networks for Multi-Class Classification: Pages 431-440

    The sources emphasize the importance of visualization in understanding data patterns and building intuition for neural network architectures. They guide readers through the process of visualizing data for multi-class classification, designing a simple neural network for this task, understanding input and output shapes, and selecting appropriate loss functions and optimizers. They introduce tools like PyTorch’s nn.Sequential container to structure models and highlight the flexibility of PyTorch for customizing neural networks.

    • Visualizing Data for Multi-Class Classification: The sources advocate for visualizing data before building models, especially for multi-class classification. They illustrate the use of scatter plots to display data points with different colors representing different classes (a sketch of this appears after the list). This visualization helps identify patterns, clusters, and potential decision boundaries that a neural network could learn.
    • Designing a Neural Network for Multi-Class Classification: The sources demonstrate the construction of a simple neural network for multi-class classification using PyTorch’s nn.Sequential container, which allows for a streamlined definition of the model’s architecture by stacking layers in a sequential order. They show how to define linear layers (nn.Linear) with appropriate input and output dimensions based on the number of features and the number of classes in the dataset.
    • Determining Input and Output Shapes: The sources guide readers in determining the input and output shapes for the different layers of the neural network. They explain that the input shape of the first layer is determined by the number of features in the dataset, while the output shape of the last layer corresponds to the number of classes. The input and output shapes of intermediate layers can be adjusted to control the network’s capacity and complexity. They highlight the importance of ensuring that the input and output dimensions of consecutive layers are compatible for a smooth flow of data through the network.
    • Selecting Loss Functions and Optimizers: The sources discuss the importance of choosing appropriate loss functions and optimizers for multi-class classification. They explain the concept of cross-entropy loss, a commonly used loss function for this type of classification task, and discuss its role in guiding the model to learn to make accurate predictions. They also mention optimizers like Stochastic Gradient Descent (SGD), highlighting their role in updating the model’s parameters to minimize the loss function.
    • Using PyTorch’s nn Module for Neural Network Components: The sources emphasize the use of PyTorch’s nn module, which contains building blocks for constructing neural networks. They specifically demonstrate the use of nn.Linear for creating linear layers and nn.Sequential for structuring the model by combining multiple layers in a sequential manner. They highlight that PyTorch offers a vast array of modules within the nn package for creating diverse and sophisticated neural network architectures.
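
    Both ideas can be tried in a few lines. The sketch below uses scikit-learn's make_blobs as a stand-in dataset, an assumption for illustration, with arbitrary layer sizes:

    import torch
    import matplotlib.pyplot as plt
    from torch import nn
    from sklearn.datasets import make_blobs

    # Toy 4-class dataset with 2 features so it can be plotted directly
    X, y = make_blobs(n_samples=1000, centers=4, n_features=2, random_state=42)
    X = torch.from_numpy(X).float()
    y = torch.from_numpy(y).long()

    # Color each point by its class label to reveal clusters
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdYlBu)
    plt.xlabel("feature 0")
    plt.ylabel("feature 1")
    plt.show()

    # A matching model: input shape = 2 features, output shape = 4 classes
    model = nn.Sequential(
        nn.Linear(in_features=2, out_features=8),
        nn.Linear(in_features=8, out_features=4),
    )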

    This section encourages the use of visualization to gain insights into data patterns for multi-class classification and guides readers in designing simple neural networks for this task. The sources emphasize the importance of understanding and setting appropriate input and output shapes for the different layers of the network and provide guidance on selecting suitable loss functions and optimizers. They showcase PyTorch’s flexibility and its powerful nn module for constructing neural network architectures.

    Building a Multi-Class Classification Model: Pages 441-450

    The sources continue the discussion of multi-class classification, focusing on designing a neural network architecture and creating a custom MultiClassClassification model in PyTorch. They guide readers through the process of defining the input and output shapes of each layer based on the number of features and classes in the dataset, constructing the model using PyTorch’s nn.Linear and nn.Sequential modules, and testing the data flow through the model with a forward pass. They emphasize the importance of understanding how the shape of data changes as it passes through the different layers of the network.

    • Defining the Neural Network Architecture: The sources present a structured approach to designing a neural network architecture for multi-class classification. They outline the key components of the architecture:
    • Input layer shape: Determined by the number of features in the dataset.
    • Hidden layers: Allow the network to learn complex relationships within the data. The number of hidden layers and the number of neurons (hidden units) in each layer can be customized to control the network’s capacity and complexity.
    • Output layer shape: Corresponds to the number of classes in the dataset. Each output neuron represents a different class.
    • Output activation: Typically uses the softmax function for multi-class classification. Softmax transforms the network’s output scores (logits) into a probability distribution over the classes, ensuring that the predicted probabilities sum to one.
    • Creating a Custom MultiClassClassification Model in PyTorch: The sources guide readers in implementing a custom MultiClassClassification model using PyTorch. They demonstrate how to define the model class, inheriting from PyTorch’s nn.Module, and how to structure the model using nn.Sequential to stack layers in a sequential manner.
    • Using nn.Linear for Linear Transformations: The sources explain the use of nn.Linear for creating linear layers in the neural network. nn.Linear applies a linear transformation to the input data, calculating a weighted sum of the input features and adding a bias term. The weights and biases are the learnable parameters of the linear layer that the network adjusts during training to make accurate predictions.
    • Testing Data Flow Through the Model: The sources emphasize the importance of testing the data flow through the model to ensure that the input and output shapes of each layer are compatible. They demonstrate how to perform a forward pass with dummy data to verify that data can successfully pass through the network without encountering shape errors.
    • Troubleshooting Shape Issues: The sources provide tips for troubleshooting shape issues, highlighting the significance of paying attention to the error messages that PyTorch provides. Error messages related to shape mismatches often provide clues about which layers or operations need adjustments to ensure compatibility.
    • Visualizing Shape Changes with Print Statements: The sources suggest using print statements within the model’s forward method to display the shape of the data as it passes through each layer. This visual inspection helps confirm that data transformations are occurring as expected and aids in identifying and resolving shape-related issues.
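
    A sketch in the spirit of this section (the class name, sizes, and temporary print statements are illustrative assumptions):

    import torch
    from torch import nn

    class MultiClassClassification(nn.Module):
        def __init__(self, input_features, output_features, hidden_units=8):
            super().__init__()
            self.linear_layer_stack = nn.Sequential(
                nn.Linear(in_features=input_features, out_features=hidden_units),
                nn.Linear(in_features=hidden_units, out_features=hidden_units),
                nn.Linear(in_features=hidden_units, out_features=output_features),
            )

        def forward(self, x):
            print(f"input shape: {x.shape}")    # temporary debugging aid
            out = self.linear_layer_stack(x)
            print(f"output shape: {out.shape}")
            return out

    model = MultiClassClassification(input_features=2, output_features=4)

    # Forward pass with dummy data to verify the shapes line up
    dummy = torch.rand(5, 2)     # 5 samples, 2 features
    logits = model(dummy)        # prints (5, 2) then (5, 4): no shape errors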

    This section guides readers through the process of designing and implementing a multi-class classification model in PyTorch. The sources emphasize the importance of understanding input and output shapes for each layer, utilizing PyTorch’s nn.Linear for linear transformations, using nn.Sequential for structuring the model, and verifying the data flow with a forward pass. They provide tips for troubleshooting shape issues and encourage the use of print statements to visualize shape changes, facilitating a deeper understanding of the model’s architecture and behavior.

    Training and Evaluating the Multi-Class Classification Model: Pages 451-460

    The sources shift focus to the practical aspects of training and evaluating the multi-class classification model in PyTorch. They guide readers through creating a training loop, setting up an optimizer and loss function, implementing a testing loop to evaluate model performance on unseen data, and calculating accuracy as a performance metric. The sources emphasize the iterative nature of model training, involving forward passes, loss calculation, backpropagation, and parameter updates using an optimizer.

    • Creating a Training Loop in PyTorch: The sources emphasize the importance of a training loop in machine learning, which is the process of iteratively training a model on a dataset. They guide readers in creating a training loop in PyTorch, incorporating the following key steps:
    1. Iterating over epochs: An epoch represents one complete pass through the entire training dataset. The number of epochs determines how many times the model will see the training data during the training process.
    2. Iterating over batches: The training data is typically divided into smaller batches to make the training process more manageable and efficient. Each batch contains a subset of the training data.
    3. Performing a forward pass: Passing the input data (a batch of data) through the model to generate predictions.
    4. Calculating the loss: Comparing the model’s predictions to the true labels to quantify how well the model is performing. This comparison is done using a loss function, such as cross-entropy loss for multi-class classification.
    5. Performing backpropagation: Calculating gradients of the loss function with respect to the model’s parameters. These gradients indicate how much each parameter contributes to the overall error.
    6. Updating model parameters: Adjusting the model’s parameters (weights and biases) using an optimizer, such as Stochastic Gradient Descent (SGD). The optimizer uses the calculated gradients to update the parameters in a direction that minimizes the loss function.
    • Setting up an Optimizer and Loss Function: The sources demonstrate how to set up an optimizer and a loss function in PyTorch. They explain that optimizers play a crucial role in updating the model’s parameters to minimize the loss function during training. They showcase the use of the Adam optimizer (torch.optim.Adam), a popular optimization algorithm for deep learning. For the loss function, they use the cross-entropy loss (nn.CrossEntropyLoss), a common choice for multi-class classification tasks.
    • Evaluating Model Performance with a Testing Loop: The sources guide readers in creating a testing loop in PyTorch to evaluate the trained model’s performance on unseen data (the test dataset). The testing loop follows a similar structure to the training loop but without the backpropagation and parameter update steps. It involves performing a forward pass on the test data, calculating the loss, and often using additional metrics like accuracy to assess the model’s generalization capability.
    • Calculating Accuracy as a Performance Metric: The sources introduce accuracy as a straightforward metric for evaluating classification model performance. Accuracy measures the proportion of correctly classified samples in the test dataset, providing a simple indication of how well the model generalizes to unseen data.
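
    The training and testing steps just listed condense into the following sketch. The random stand-in data carries no real signal, and full-batch passes replace the per-batch iteration a DataLoader would provide, so the point is the mechanics of the loop rather than the numbers:

    import torch
    from torch import nn

    torch.manual_seed(42)
    X_train, y_train = torch.rand(160, 2), torch.randint(0, 4, (160,))
    X_test, y_test = torch.rand(40, 2), torch.randint(0, 4, (40,))

    model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 4))
    loss_fn = nn.CrossEntropyLoss()                    # expects raw logits
    optimizer = torch.optim.Adam(params=model.parameters(), lr=0.01)

    for epoch in range(100):
        model.train()
        y_logits = model(X_train)            # forward pass
        loss = loss_fn(y_logits, y_train)    # calculate the loss
        optimizer.zero_grad()                # zero gradients
        loss.backward()                      # backpropagation
        optimizer.step()                     # update parameters

        model.eval()
        with torch.no_grad():                # testing: no gradients, no updates
            test_logits = model(X_test)
            test_loss = loss_fn(test_logits, y_test)
            test_acc = (test_logits.argmax(dim=1) == y_test).float().mean() * 100
        if epoch % 20 == 0:
            print(f"Epoch {epoch} | train loss {loss:.3f} | "
                  f"test loss {test_loss:.3f} | test acc {test_acc:.1f}%")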

    This section emphasizes the importance of the training loop, which iteratively improves the model’s performance by adjusting its parameters based on the calculated loss. It guides readers through implementing the training loop in PyTorch, setting up an optimizer and loss function, creating a testing loop to evaluate model performance, and calculating accuracy as a basic performance metric for classification tasks.

    Refining and Improving Model Performance: Pages 461-470

    The sources guide readers through various strategies for refining and improving the performance of the multi-class classification model. They cover techniques like adjusting the learning rate, experimenting with different optimizers, exploring the concept of nonlinear activation functions, and understanding the idea of running tensors on a Graphical Processing Unit (GPU) for faster training. They emphasize that model improvement in machine learning often involves experimentation, trial-and-error, and a systematic approach to evaluating and comparing different model configurations.

    • Adjusting the Learning Rate: The sources emphasize the importance of the learning rate in the training process. They explain that the learning rate controls the size of the steps the optimizer takes when updating model parameters during backpropagation. A learning rate that is too high can cause the optimizer to overshoot the minimum of the loss function, while one that is too low leads to slow convergence, making the training process unnecessarily lengthy. The sources suggest experimenting with different learning rates to find an appropriate balance between speed and convergence.
    • Experimenting with Different Optimizers: The sources highlight the importance of choosing an appropriate optimizer for training neural networks. They mention that different optimizers use different strategies for updating model parameters based on the calculated gradients, and some optimizers might be more suitable than others for specific problems or datasets. The sources encourage readers to experiment with various optimizers available in PyTorch, such as Stochastic Gradient Descent (SGD), Adam, and RMSprop, to observe their impact on model performance.
    • Introducing Nonlinear Activation Functions: The sources introduce the concept of nonlinear activation functions and their role in enhancing the capacity of neural networks. They explain that linear layers alone can only model linear relationships within the data, limiting the complexity of patterns the model can learn. Nonlinear activation functions, applied to the outputs of linear layers, introduce nonlinearities into the model, enabling it to learn more complex relationships and capture nonlinear patterns in the data. The sources mention the sigmoid activation function as an example, but PyTorch offers a variety of nonlinear activation functions within the nn module.
    • Utilizing GPUs for Faster Training: The sources touch on the concept of running PyTorch tensors on a GPU (Graphical Processing Unit) to significantly speed up the training process (a device-agnostic setup sketch appears after this list). GPUs are specialized hardware designed for parallel computations, making them particularly well-suited for the matrix operations involved in deep learning. By utilizing a GPU, training times can be significantly reduced, allowing for faster experimentation and model development.
    • Improving a Model: The sources discuss the iterative process of improving a machine learning model, highlighting that model development rarely produces optimal results on the first attempt. They suggest a systematic approach involving the following:
    • Starting simple: Beginning with a simpler model architecture and gradually increasing complexity if needed.
    • Experimenting with hyperparameters: Tuning parameters like learning rate, batch size, and the number of hidden layers to find an optimal configuration.
    • Evaluating and comparing results: Carefully analyzing the model’s performance on the training and test datasets, using metrics like loss and accuracy to assess its effectiveness and generalization capabilities.
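
    A common device-agnostic setup, sketched below with a tiny model and data as illustrative assumptions; the same code runs on CPU and, when available, on GPU:

    import torch
    from torch import nn

    device = "cuda" if torch.cuda.is_available() else "cpu"

    model = nn.Linear(2, 1).to(device)    # move model parameters to the device
    X = torch.rand(8, 2).to(device)       # data must live on the same device

    preds = model(X)                      # computation runs on `device`
    print(preds.device)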

    This section guides readers in exploring various strategies for refining and improving the multi-class classification model. The sources emphasize the importance of adjusting the learning rate, experimenting with different optimizers, introducing nonlinear activation functions for enhanced model capacity, and leveraging GPUs for faster training. They underscore the iterative nature of model improvement, encouraging readers to adopt a systematic approach involving experimentation, hyperparameter tuning, and thorough evaluation.

    Please note that specific recommendations about optimal learning rates or best optimizers for a given problem may vary depending on the dataset, model architecture, and other factors. These aspects often require experimentation and a deeper understanding of the specific machine learning problem being addressed.

    Exploring the PyTorch Workflow and Model Evaluation: Pages 471-480

    The sources guide readers through crucial aspects of the PyTorch workflow, focusing on saving and loading trained models, understanding common choices for loss functions and optimizers, and exploring additional classification metrics beyond accuracy. They delve into the concept of a confusion matrix as a valuable tool for evaluating classification models, providing deeper insights into the model’s performance across different classes. The sources advocate for a holistic approach to model evaluation, emphasizing that multiple metrics should be considered to gain a comprehensive understanding of a model’s strengths and weaknesses.

    • Saving and Loading Trained PyTorch Models: The sources emphasize the importance of saving trained models in PyTorch. They demonstrate the process of saving a model’s state dictionary, which contains the learned parameters (weights and biases), using torch.save(). They also showcase the process of loading a saved model using torch.load(), enabling users to reuse trained models for inference or further training.
    • Common Choices for Loss Functions and Optimizers: The sources present a table summarizing common choices for loss functions and optimizers in PyTorch, specifically tailored for binary and multi-class classification tasks. They provide brief descriptions of each loss function and optimizer, highlighting key characteristics and situations where they are commonly used. For binary classification, they mention the Binary Cross Entropy Loss (nn.BCELoss) and the Stochastic Gradient Descent (SGD) optimizer as common choices. For multi-class classification, they mention the Cross Entropy Loss (nn.CrossEntropyLoss) and the Adam optimizer.
    • Exploring Additional Classification Metrics: The sources introduce additional classification metrics beyond accuracy, emphasizing the importance of considering multiple metrics for a comprehensive evaluation. They touch on precision, recall, the F1 score, confusion matrices, and classification reports as valuable tools for assessing model performance, particularly when dealing with imbalanced datasets or situations where different types of errors carry different weights.
    • Constructing and Interpreting a Confusion Matrix: The sources introduce the confusion matrix as a powerful tool for visualizing the performance of a classification model. They explain that a confusion matrix displays the counts (or proportions) of correctly and incorrectly classified instances for each class. The rows of the matrix typically represent the true classes, while the columns represent the predicted classes. Each cell counts the instances of a given true class that received a given predicted class, so the diagonal holds correct classifications and the off-diagonal cells hold misclassifications. The sources guide readers through creating a confusion matrix in PyTorch using the torchmetrics library, which provides a dedicated ConfusionMatrix class. They emphasize that confusion matrices offer valuable insights into:
    • True positives (TP): Correctly predicted positive instances.
    • True negatives (TN): Correctly predicted negative instances.
    • False positives (FP): Incorrectly predicted positive instances (Type I errors).
    • False negatives (FN): Incorrectly predicted negative instances (Type II errors).
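
    A minimal sketch with torchmetrics, assuming a recent version of the library where the task argument is required; the labels are made up for illustration:

    import torch
    from torchmetrics import ConfusionMatrix

    # Stand-in predictions and true labels for a 3-class problem
    y_true = torch.tensor([0, 1, 2, 2, 1, 0, 2, 1])
    y_pred = torch.tensor([0, 1, 1, 2, 1, 0, 2, 0])

    confmat = ConfusionMatrix(task="multiclass", num_classes=3)
    print(confmat(y_pred, y_true))
    # Rows = true classes, columns = predicted classes; the diagonal holds
    # correct predictions, off-diagonal cells are confusions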

    This section highlights the practical steps of saving and loading trained PyTorch models, providing users with the ability to reuse trained models for different purposes. It presents common choices for loss functions and optimizers, aiding users in selecting appropriate configurations for their classification tasks. The sources expand the discussion on classification metrics, introducing additional measures like precision, recall, the F1 score, and the confusion matrix. They advocate for using a combination of metrics to gain a more nuanced understanding of model performance, particularly when addressing real-world problems where different types of errors have varying consequences.

    Visualizing and Evaluating Model Predictions: Pages 481-490

    The sources guide readers through the process of visualizing and evaluating the predictions made by the trained convolutional neural network (CNN) model. They emphasize the importance of going beyond overall accuracy and examining individual predictions to gain a deeper understanding of the model’s behavior and identify potential areas for improvement. The sources introduce techniques for plotting predictions visually, comparing model predictions to ground truth labels, and using a confusion matrix to assess the model’s performance across different classes.

    • Visualizing Model Predictions: The sources introduce techniques for visualizing model predictions on individual images from the test dataset (a sketch of this appears after the list). They suggest randomly sampling a set of images from the test dataset, obtaining the model’s predictions for these images, and then displaying both the images and their corresponding predicted labels. This approach allows for a qualitative assessment of the model’s performance, enabling users to visually inspect how well the model aligns with human perception.
    • Comparing Predictions to Ground Truth: The sources stress the importance of comparing the model’s predictions to the ground truth labels associated with the test images. By visually aligning the predicted labels with the true labels, users can quickly identify instances where the model makes correct predictions and instances where it errs. This comparison helps to pinpoint specific types of images or classes that the model might struggle with, providing valuable insights for further model refinement.
    • Creating a Confusion Matrix for Deeper Insights: The sources reiterate the value of a confusion matrix for evaluating classification models. They guide readers through creating a confusion matrix using libraries like torchmetrics and mlxtend, which offer tools for calculating and visualizing confusion matrices. The confusion matrix provides a comprehensive overview of the model’s performance across all classes, highlighting the counts of true positives, true negatives, false positives, and false negatives. This visualization helps to identify classes that the model might be confusing, revealing patterns of misclassification that can inform further model development or data augmentation strategies.
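
    A sketch of the sample-plotting idea referenced above. It assumes a trained model, a test_data dataset of (image, label) pairs, and a class_names list built earlier, so those three names are assumptions rather than givens:

    import random
    import torch
    import matplotlib.pyplot as plt

    # Assumes `model`, `test_data`, and `class_names` exist from earlier steps
    model.eval()
    fig = plt.figure(figsize=(9, 9))
    rows, cols = 3, 3
    for i in range(1, rows * cols + 1):
        img, label = test_data[random.randint(0, len(test_data) - 1)]
        with torch.no_grad():
            pred = model(img.unsqueeze(dim=0)).argmax(dim=1).item()
        ax = fig.add_subplot(rows, cols, i)
        ax.imshow(img.squeeze(), cmap="gray")
        # Title shows prediction vs ground truth; mismatches stand out in red
        ax.set_title(f"pred: {class_names[pred]} | true: {class_names[label]}",
                     fontsize=8, color="g" if pred == label else "r")
        ax.axis("off")
    plt.show()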

    This section guides readers through practical techniques for visualizing and evaluating the predictions made by the trained CNN model. The sources advocate for a multi-faceted evaluation approach, emphasizing the value of visually inspecting individual predictions, comparing them to ground truth labels, and utilizing a confusion matrix to analyze the model’s performance across all classes. By combining qualitative and quantitative assessment methods, users can gain a more comprehensive understanding of the model’s capabilities, identify its strengths and weaknesses, and glean insights for potential improvements.

    Getting Started with Computer Vision and Convolutional Neural Networks: Pages 491-500

    The sources introduce the field of computer vision and convolutional neural networks (CNNs), providing readers with an overview of key libraries, resources, and the basic concepts involved in building computer vision models with PyTorch. They guide readers through setting up the necessary libraries, understanding the structure of CNNs, and preparing to work with image datasets. The sources emphasize a hands-on approach to learning, encouraging readers to experiment with code and explore the concepts through practical implementation.

    • Essential Computer Vision Libraries in PyTorch: The sources present several essential libraries commonly used for computer vision tasks in PyTorch, highlighting their functionalities and roles in building and training CNNs:
    • Torchvision: This library serves as the core domain library for computer vision in PyTorch. It provides utilities for data loading, image transformations, pre-trained models, and more. Within torchvision, several sub-modules are particularly relevant:
    • datasets: This module offers a collection of popular computer vision datasets, including ImageNet, CIFAR10, CIFAR100, MNIST, and FashionMNIST, readily available for download and use in PyTorch.
    • models: This module contains a variety of pre-trained CNN architectures, such as ResNet, AlexNet, VGG, and Inception, which can be used directly for inference or fine-tuned for specific tasks.
    • transforms: This module provides a range of image transformations, including resizing, cropping, flipping, and normalization, which are crucial for preprocessing image data before feeding it into a CNN.
    • utils: This module offers helpful utilities for tasks like visualizing images, displaying model summaries, and saving and loading checkpoints.
    • Matplotlib: This versatile plotting library is essential for visualizing images, plotting training curves, and exploring data patterns in computer vision tasks.
    • Exploring Convolutional Neural Networks: The sources provide a high-level introduction to CNNs, explaining that they are specialized neural networks designed for processing data with a grid-like structure, such as images. They highlight the key components of a CNN:
    • Convolutional Layers: These layers apply a series of learnable filters (kernels) to the input image, extracting features like edges, textures, and patterns. The filters slide across the input image, performing convolutions to produce feature maps that highlight specific characteristics of the image.
    • Pooling Layers: These layers downsample the feature maps generated by convolutional layers, reducing their spatial dimensions while preserving important features. Pooling layers help to make the model more robust to variations in the position of features within the image.
    • Fully Connected Layers: These layers, often found in the final stages of a CNN, connect all the features extracted by the convolutional and pooling layers, enabling the model to learn complex relationships between these features and perform high-level reasoning about the image content.
    • Obtaining and Preparing Image Datasets: The sources guide readers through the process of obtaining image datasets for training computer vision models, emphasizing the importance of:
    • Choosing the right dataset: Selecting a dataset relevant to the specific computer vision task being addressed.
    • Understanding dataset structure: Familiarizing oneself with the organization of images and labels within the dataset, ensuring compatibility with PyTorch’s data loading mechanisms.
    • Preprocessing images: Applying necessary transformations to the images, such as resizing, cropping, normalization, and data augmentation, to prepare them for input into a CNN.
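
    A compact sketch of this setup using FashionMNIST; the root directory and batch size are arbitrary choices:

    import torch
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    # Download FashionMNIST, converting images to tensors on the fly
    transform = transforms.ToTensor()    # PIL image -> float tensor in [0, 1]
    train_data = datasets.FashionMNIST(root="data", train=True,
                                       download=True, transform=transform)
    test_data = datasets.FashionMNIST(root="data", train=False,
                                      download=True, transform=transform)

    # Wrap the datasets in data loaders: batching plus optional shuffling
    train_dataloader = DataLoader(train_data, batch_size=32, shuffle=True)
    test_dataloader = DataLoader(test_data, batch_size=32, shuffle=False)

    images, labels = next(iter(train_dataloader))
    print(images.shape)   # torch.Size([32, 1, 28, 28]) -> (batch, channels, H, W)
    print(labels.shape)   # torch.Size([32])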

    This section serves as a starting point for readers venturing into the world of computer vision and CNNs using PyTorch. The sources introduce essential libraries, resources, and basic concepts, equipping readers with the foundational knowledge and tools needed to begin building and training computer vision models. They highlight the structure of CNNs, emphasizing the roles of convolutional, pooling, and fully connected layers in processing image data. The sources stress the importance of selecting appropriate image datasets, understanding their structure, and applying necessary preprocessing steps to prepare the data for training.

    Getting Hands-on with the FashionMNIST Dataset: Pages 501-510

    The sources walk readers through the practical steps involved in working with the FashionMNIST dataset for image classification using PyTorch. They cover checking library versions, exploring the torchvision.datasets module, setting up the FashionMNIST dataset for training, understanding data loaders, and visualizing samples from the dataset. The sources emphasize the importance of familiarizing oneself with the dataset’s structure, accessing its elements, and gaining insights into the images and their corresponding labels.

    • Checking Library Versions for Compatibility: The sources recommend checking the versions of the PyTorch and torchvision libraries to ensure compatibility and leverage the latest features. They provide code snippets to display the version numbers of both libraries using torch.__version__ and torchvision.__version__. This step helps to avoid potential issues arising from version mismatches and ensures a smooth workflow.
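    For example:

        import torch
        import torchvision

        print(torch.__version__)        # e.g. "2.1.0" -- your version will likely differ
        print(torchvision.__version__)  # e.g. "0.16.0"
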
    • Exploring the torchvision.datasets Module: The sources introduce the torchvision.datasets module as a valuable resource for accessing a variety of popular computer vision datasets. They demonstrate how to explore the available datasets within this module, providing examples like Caltech101, CIFAR100, CIFAR10, MNIST, FashionMNIST, and ImageNet. The sources explain that these datasets can be easily downloaded and loaded into PyTorch using dedicated functions within the torchvision.datasets module.
    • Setting Up the FashionMNIST Dataset: The sources guide readers through the process of setting up the FashionMNIST dataset for training an image classification model. They outline the following steps (a code sketch follows the list):
    1. Importing Necessary Modules: Import the required modules from torchvision.datasets and torchvision.transforms.
    2. Downloading the Dataset: Download the FashionMNIST dataset using the FashionMNIST class from torchvision.datasets, specifying the desired root directory for storing the dataset.
    3. Applying Transformations: Apply transformations to the images using the transforms.Compose function. Common transformations include:
    • transforms.ToTensor(): Converts PIL images (common format for image data) to PyTorch tensors.
    • transforms.Normalize(): Standardizes pixel values using a per-channel mean and standard deviation, for example shifting the 0-to-1 values produced by transforms.ToTensor() to roughly -1 to 1, which can help to improve model training.
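    Putting these steps together, a minimal setup might look like this (the root directory name is arbitrary; the dataset is downloaded on first use):

        from torchvision import datasets, transforms

        transform = transforms.Compose([
            transforms.ToTensor(),                  # PIL image -> float tensor with values in [0, 1]
            transforms.Normalize((0.5,), (0.5,))    # shift values to roughly [-1, 1]
        ])

        train_data = datasets.FashionMNIST(root="data", train=True, download=True, transform=transform)
        test_data = datasets.FashionMNIST(root="data", train=False, download=True, transform=transform)
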
    • Understanding Data Loaders: The sources introduce data loaders as an essential component for efficiently loading and iterating through datasets in PyTorch. They explain that data loaders provide several benefits:
    • Batching: They allow you to easily create batches of data, which is crucial for training models on large datasets that cannot be loaded into memory all at once.
    • Shuffling: They can shuffle the data between epochs, helping to prevent the model from memorizing the order of the data and improving its ability to generalize.
    • Parallel Loading: They support parallel loading of data, which can significantly speed up the training process.
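    A minimal sketch of wrapping the datasets from the previous sketch in data loaders (a batch size of 32 is a common default, not a requirement):

        from torch.utils.data import DataLoader

        BATCH_SIZE = 32
        train_dataloader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True)   # reshuffled each epoch
        test_dataloader = DataLoader(test_data, batch_size=BATCH_SIZE, shuffle=False)    # order is irrelevant for evaluation
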
    • Visualizing Samples from the Dataset: The sources emphasize the importance of visualizing samples from the dataset to gain a better understanding of the data being used for training. They provide code examples for iterating through a data loader, extracting image tensors and their corresponding labels, and displaying the images using matplotlib. This visual inspection helps to ensure that the data has been loaded and preprocessed correctly and can provide insights into the characteristics of the images within the dataset.
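    For instance, assuming the train_dataloader from the previous sketch, one sample from a batch can be displayed like this:

        import matplotlib.pyplot as plt

        images, labels = next(iter(train_dataloader))    # one batch: shapes [32, 1, 28, 28] and [32]
        plt.imshow(images[0].squeeze(), cmap="gray")     # drop the channel dimension for plotting
        plt.title(train_data.classes[labels[0]])         # human-readable class name
        plt.axis("off")
        plt.show()
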

    This section offers practical guidance on working with the FashionMNIST dataset for image classification. The sources emphasize the importance of checking library versions, exploring available datasets in torchvision.datasets, setting up the FashionMNIST dataset for training, understanding the role of data loaders, and visually inspecting samples from the dataset. By following these steps, readers can effectively load, preprocess, and visualize image data, laying the groundwork for building and training computer vision models.

    Mini-Batches and Building a Baseline Model with Linear Layers: Pages 511-520

    The sources introduce the concept of mini-batches in machine learning, explaining their significance in training models on large datasets. They guide readers through the process of creating mini-batches from the FashionMNIST dataset using PyTorch’s DataLoader class. The sources then demonstrate how to build a simple baseline model using linear layers for classifying images from the FashionMNIST dataset, highlighting the steps involved in setting up the model’s architecture, defining the input and output shapes, and performing a forward pass to verify data flow.

    • The Importance of Mini-Batches: The sources explain that mini-batches play a crucial role in training machine learning models, especially when dealing with large datasets. They break down the dataset into smaller, manageable chunks called mini-batches, which are processed by the model in each training iteration. Using mini-batches offers several advantages:
    • Efficient Memory Usage: Processing the entire dataset at once can overwhelm the computer’s memory, especially for large datasets. Mini-batches allow the model to work on smaller portions of the data, reducing memory requirements and making training feasible.
    • Faster Training: Updating the model’s parameters after each sample can be computationally expensive. Mini-batches enable the model to calculate gradients and update parameters based on a group of samples, leading to faster convergence and reduced training time.
    • Improved Generalization: Training on mini-batches introduces some randomness into the process, since shuffling changes which samples end up in each batch from epoch to epoch. This randomness can help the model to learn more robust patterns and improve its ability to generalize to unseen data.
    • Creating Mini-Batches with DataLoader: The sources demonstrate how to create mini-batches from the FashionMNIST dataset using PyTorch’s DataLoader class. The DataLoader class provides a convenient way to iterate through the dataset in batches, handling shuffling, batching, and data loading automatically. It takes the dataset as input, along with the desired batch size and other optional parameters.
    • Building a Baseline Model with Linear Layers: The sources guide readers through the construction of a simple baseline model using linear layers for classifying images from the FashionMNIST dataset. They outline the following steps (a sketch of the finished model appears after the list):
    1. Defining the Model Architecture: The sources start by creating a class called LinearModel that inherits from nn.Module, which is the base class for all neural network modules in PyTorch. Within the class, they define the following layers:
    • A linear layer (nn.Linear) that takes the flattened input image (784 features, representing the 28×28 pixels of a FashionMNIST image) and maps it to a hidden layer with a specified number of units.
    • Another linear layer that maps the hidden layer to the output layer, producing a tensor of scores for each of the 10 classes in FashionMNIST.
    2. Setting Up the Input and Output Shapes: The sources emphasize the importance of aligning the input and output shapes of the linear layers to ensure proper data flow through the model. They specify the input features and output features for each linear layer based on the dataset’s characteristics and the desired number of hidden units.
    3. Performing a Forward Pass: The sources demonstrate how to perform a forward pass through the model using a randomly generated tensor. This step verifies that the data flows correctly through the layers and helps to confirm the expected output shape. They print the output tensor and its shape, providing insights into the model’s behavior.
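    A minimal sketch of such a baseline (the class name, hidden-unit count, and variable names here are illustrative):

        import torch
        from torch import nn

        class LinearModel(nn.Module):
            def __init__(self, input_shape: int, hidden_units: int, output_shape: int):
                super().__init__()
                self.layer_stack = nn.Sequential(
                    nn.Flatten(),                           # [batch, 1, 28, 28] -> [batch, 784]
                    nn.Linear(input_shape, hidden_units),
                    nn.Linear(hidden_units, output_shape)
                )

            def forward(self, x):
                return self.layer_stack(x)

        model_0 = LinearModel(input_shape=784, hidden_units=10, output_shape=10)
        dummy_batch = torch.randn(32, 1, 28, 28)
        print(model_0(dummy_batch).shape)   # torch.Size([32, 10]) -- one raw score (logit) per class
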

    This section introduces the concept of mini-batches and their importance in machine learning, providing practical guidance on creating mini-batches from the FashionMNIST dataset using PyTorch’s DataLoader class. It then demonstrates how to build a simple baseline model using linear layers for classifying images, highlighting the steps involved in defining the model architecture, setting up the input and output shapes, and verifying data flow through a forward pass. This foundation prepares readers for building more complex convolutional neural networks for image classification tasks.

    Training and Evaluating a Linear Model on the FashionMNIST Dataset: Pages 521-530

    The sources guide readers through the process of training and evaluating the previously built linear model on the FashionMNIST dataset, focusing on creating a training loop, setting up a loss function and an optimizer, calculating accuracy, and implementing a testing loop to assess the model’s performance on unseen data.

    • Setting Up the Loss Function and Optimizer: The sources explain that a loss function quantifies how well the model’s predictions match the true labels, with lower loss values indicating better performance. They discuss common choices for loss functions and optimizers, emphasizing the importance of selecting appropriate options based on the problem and dataset.
    • The sources specifically recommend binary cross-entropy loss (BCE) for binary classification problems and cross-entropy loss (CE) for multi-class classification problems.
    • They highlight that PyTorch provides both nn.BCELoss and nn.CrossEntropyLoss implementations for these loss functions.
    • For the optimizer, the sources mention stochastic gradient descent (SGD) as a common choice, with PyTorch offering the torch.optim.SGD class for its implementation.
    • Creating a Training Loop: The sources outline the fundamental steps involved in a training loop, emphasizing the iterative process of adjusting the model’s parameters to minimize the loss and improve its ability to classify images correctly. The typical steps in a training loop include the following (see the code sketch after the list):
    1. Forward Pass: Pass a batch of data through the model to obtain predictions.
    2. Calculate the Loss: Compare the model’s predictions to the true labels using the chosen loss function.
    3. Optimizer Zero Grad: Reset the gradients calculated from the previous batch to avoid accumulating gradients across batches.
    4. Loss Backward: Perform backpropagation to calculate the gradients of the loss with respect to the model’s parameters.
    5. Optimizer Step: Update the model’s parameters based on the calculated gradients and the optimizer’s learning rate.
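    Assuming the model_0 and train_dataloader from the earlier sketches, the five steps translate almost line for line into code:

        loss_fn = nn.CrossEntropyLoss()
        optimizer = torch.optim.SGD(params=model_0.parameters(), lr=0.1)

        for epoch in range(3):
            for X, y in train_dataloader:
                y_pred = model_0(X)          # 1. forward pass (raw logits)
                loss = loss_fn(y_pred, y)    # 2. calculate the loss
                optimizer.zero_grad()        # 3. reset gradients from the previous batch
                loss.backward()              # 4. backpropagation
                optimizer.step()             # 5. update the parameters
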
    • Calculating Accuracy: The sources introduce accuracy as a metric for evaluating the model’s performance, representing the percentage of correctly classified samples. They provide a code snippet to calculate accuracy by comparing the predicted labels to the true labels.
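    A simple helper along these lines (the function name is illustrative):

        def accuracy_fn(y_true, y_pred):
            """Returns accuracy as a percentage, given predicted and true class labels."""
            correct = torch.eq(y_true, y_pred).sum().item()
            return (correct / len(y_pred)) * 100

        # Usage: logits have shape [batch, 10], so take the highest-scoring class per sample:
        # acc = accuracy_fn(y_true=y, y_pred=y_pred.argmax(dim=1))
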
    • Implementing a Testing Loop: The sources explain the importance of evaluating the model’s performance on a separate set of data, the test set, that was not used during training. This helps to assess the model’s ability to generalize to unseen data and prevent overfitting, where the model performs well on the training data but poorly on new data. The testing loop follows similar steps to the training loop, but without updating the model’s parameters (a sketch follows the list):
    1. Forward Pass: Pass a batch of test data through the model to obtain predictions.
    2. Calculate the Loss: Compare the model’s predictions to the true test labels using the loss function.
    3. Calculate Accuracy: Determine the percentage of correctly classified test samples.
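    A sketch of the testing loop, assuming the objects from the earlier sketches; since no parameters are updated here, torch.inference_mode() is used to disable gradient tracking:

        model_0.eval()                        # switch off training-specific behaviour
        test_loss, test_acc = 0, 0
        with torch.inference_mode():
            for X, y in test_dataloader:
                test_pred = model_0(X)                                             # 1. forward pass
                test_loss += loss_fn(test_pred, y).item()                          # 2. accumulate loss
                test_acc += accuracy_fn(y_true=y, y_pred=test_pred.argmax(dim=1))  # 3. accumulate accuracy
        test_loss /= len(test_dataloader)     # average over all batches
        test_acc /= len(test_dataloader)
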

    The sources provide code examples for implementing the training and testing loops, including detailed explanations of each step. They also emphasize the importance of monitoring the loss and accuracy values during training to track the model’s progress and ensure that it is learning effectively. These steps provide a comprehensive understanding of the training and evaluation process, enabling readers to apply these techniques to their own image classification tasks.

    Building and Training a Multi-Layer Model with Non-Linear Activation Functions: Pages 531-540

    The sources extend the image classification task by introducing non-linear activation functions and building a more complex multi-layer model. They emphasize the importance of non-linearity in enabling neural networks to learn complex patterns and improve classification accuracy. The sources guide readers through implementing the ReLU (Rectified Linear Unit) activation function and constructing a multi-layer model, demonstrating its performance on the FashionMNIST dataset.

    • The Role of Non-Linear Activation Functions: The sources explain that linear models, while straightforward, are limited in their ability to capture intricate relationships in data. Introducing non-linear activation functions between linear layers enhances the model’s capacity to learn complex patterns. Non-linear activation functions allow the model to approximate non-linear decision boundaries, enabling it to classify data points that are not linearly separable.
    • Introducing ReLU Activation: The sources highlight ReLU as a popular non-linear activation function, known for its simplicity and effectiveness. ReLU replaces negative values in the input tensor with zero, while retaining positive values. This simple operation introduces non-linearity into the model, allowing it to learn more complex representations of the data. The sources provide the code for implementing ReLU in PyTorch using nn.ReLU().
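    For example:

        relu = nn.ReLU()
        x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
        print(relu(x))   # tensor([0.0000, 0.0000, 0.0000, 1.5000]) -- negatives become zero
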
    • Constructing a Multi-Layer Model: The sources guide readers through building a more complex model with multiple linear layers and ReLU activations. They introduce a model with three linear layers interleaved with ReLU activations (a sketch follows the list):
    1. A linear layer that takes the flattened input image (784 features) and maps it to a hidden layer with a specified number of units.
    2. A ReLU activation function applied to the output of the first linear layer.
    3. Another linear layer that maps the activated hidden layer to a second hidden layer with a specified number of units.
    4. A ReLU activation function applied to the output of the second linear layer.
    5. A final linear layer that maps the activated second hidden layer to the output layer (10 units, representing the 10 classes in FashionMNIST).
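    A compact sketch of this architecture (10 hidden units per layer is an illustrative choice):

        model_1 = nn.Sequential(
            nn.Flatten(),            # [batch, 1, 28, 28] -> [batch, 784]
            nn.Linear(784, 10),      # input -> first hidden layer
            nn.ReLU(),
            nn.Linear(10, 10),       # first -> second hidden layer
            nn.ReLU(),
            nn.Linear(10, 10)        # second hidden layer -> 10 class scores
        )
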
    • Training and Evaluating the Multi-Layer Model: The sources demonstrate how to train and evaluate this multi-layer model using the same training and testing loops described in the previous pages summary. They emphasize that the inclusion of ReLU activations between the linear layers significantly enhances the model’s performance compared to the previous linear models. This improvement highlights the crucial role of non-linearity in enabling neural networks to learn complex patterns and achieve higher classification accuracy.

    The sources provide code examples for implementing the multi-layer model with ReLU activations, showcasing the steps involved in defining the model’s architecture, setting up the layers and activations, and training the model using the established training and testing loops. These examples offer practical guidance on building and training more complex models with non-linear activation functions, laying the foundation for understanding and implementing even more sophisticated architectures like convolutional neural networks.

    Improving Model Performance and Visualizing Predictions: Pages 541-550

    The sources discuss strategies for improving the performance of machine learning models, focusing on techniques to enhance a model’s ability to learn from data and make accurate predictions. They also guide readers through visualizing the model’s predictions, providing insights into its decision-making process and highlighting areas for potential improvement.

    • Improving a Model’s Performance: The sources acknowledge that achieving satisfactory results with machine learning models often involves an iterative process of experimentation and refinement. They outline several strategies to improve a model’s performance, emphasizing that the effectiveness of these techniques can vary depending on the complexity of the problem and the characteristics of the dataset. Some common approaches include:
    1. Adding More Layers: Increasing the depth of the neural network by adding more layers can enhance its capacity to learn complex representations of the data. However, adding too many layers can lead to overfitting, especially if the dataset is small.
    2. Adding More Hidden Units: Increasing the number of hidden units within each layer can also enhance the model’s ability to capture intricate patterns. Similar to adding more layers, adding too many hidden units can contribute to overfitting.
    3. Training for Longer: Allowing the model to train for a greater number of epochs can provide more opportunities to adjust its parameters and minimize the loss. However, excessive training can also lead to overfitting, especially if the model’s capacity is high.
    4. Changing the Learning Rate: The learning rate determines the step size the optimizer takes when updating the model’s parameters. A learning rate that is too high can cause the optimizer to overshoot the optimal values, while a learning rate that is too low can slow down convergence. Experimenting with different learning rates can improve the model’s ability to find the optimal parameter values.
    • Visualizing Model Predictions: The sources stress the importance of visualizing the model’s predictions to gain insights into its decision-making process. Visualizations can reveal patterns in the data that the model is capturing and highlight areas where it is struggling to make accurate predictions. The sources guide readers through creating visualizations using Matplotlib, demonstrating how to plot the model’s predictions for different classes and analyze its performance.

    The sources provide practical advice and code examples for implementing these improvement strategies, encouraging readers to experiment with different techniques to find the optimal configuration for their specific problem. They also emphasize the value of visualizing model predictions to gain a deeper understanding of its strengths and weaknesses, facilitating further model refinement and improvement. This section equips readers with the knowledge and tools to iteratively improve their models and enhance their understanding of the model’s behavior through visualizations.

    Saving, Loading, and Evaluating Models: Pages 551-560

    The sources shift their focus to the practical aspects of saving, loading, and comprehensively evaluating trained models. They emphasize the importance of preserving trained models for future use, enabling the application of trained models to new data without retraining. The sources also introduce techniques for assessing model performance beyond simple accuracy, providing a more nuanced understanding of a model’s strengths and weaknesses.

    • Saving and Loading Trained Models: The sources highlight the significance of saving trained models to avoid the time and computational expense of retraining. They outline the process of saving a model’s state dictionary, which contains the learned parameters (weights and biases), using PyTorch’s torch.save() function. The sources provide a code example demonstrating how to save a model’s state dictionary to a file, typically with a .pth extension. They also explain how to load a saved model using torch.load(), emphasizing the need to create an instance of the model with the same architecture before loading the saved state dictionary.
    • Making Predictions With a Loaded Model: The sources guide readers through making predictions using a loaded model, emphasizing the importance of setting the model to evaluation mode (model.eval()) before making predictions. Evaluation mode deactivates certain layers, such as dropout, that are used during training but not during inference. They provide a code snippet illustrating the process of loading a saved model, setting it to evaluation mode, and using it to generate predictions on new data.
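    A minimal save-load-predict sketch, assuming the model_1 and images from the earlier sketches (the file name is arbitrary):

        # Save only the learned parameters (the state dictionary)
        torch.save(model_1.state_dict(), "fashion_model.pth")

        # To load, first recreate a model with the same architecture...
        loaded_model = nn.Sequential(
            nn.Flatten(), nn.Linear(784, 10), nn.ReLU(),
            nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 10)
        )
        # ...then load the saved parameters into it
        loaded_model.load_state_dict(torch.load("fashion_model.pth"))

        # Predict with the loaded model
        loaded_model.eval()
        with torch.inference_mode():
            preds = loaded_model(images).argmax(dim=1)
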
    • Evaluating Model Performance Beyond Accuracy: The sources acknowledge that accuracy, while a useful metric, can provide an incomplete picture of a model’s performance, especially when dealing with imbalanced datasets where some classes have significantly more samples than others. They introduce the concept of a confusion matrix as a valuable tool for evaluating classification models. A confusion matrix displays the number of correct and incorrect predictions for each class, providing a detailed breakdown of the model’s performance across different classes. The sources explain how to interpret a confusion matrix, highlighting its ability to reveal patterns in misclassifications and identify classes where the model is performing poorly.
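    Libraries such as torchmetrics or scikit-learn provide ready-made confusion matrices, but one can also be built with plain PyTorch, as in this sketch (assuming the trained model and test_dataloader from the earlier sketches):

        num_classes = 10
        confusion = torch.zeros(num_classes, num_classes, dtype=torch.int64)

        model_1.eval()
        with torch.inference_mode():
            for X, y in test_dataloader:
                preds = model_1(X).argmax(dim=1)
                for t, p in zip(y, preds):
                    confusion[t, p] += 1   # rows: true class, columns: predicted class

        # Diagonal entries are correct predictions; large off-diagonal entries
        # reveal which pairs of classes the model confuses.
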

    The sources guide readers through the essential steps of saving, loading, and evaluating trained models, equipping them with the skills to manage trained models effectively and perform comprehensive assessments of model performance beyond simple accuracy. This section focuses on the practical aspects of deploying and understanding the behavior of trained models, providing a valuable foundation for applying machine learning models to real-world tasks.

    Putting it All Together: A PyTorch Workflow and Building a Classification Model: Pages 561-570

    The sources guide readers through a comprehensive PyTorch workflow for building and training a classification model, consolidating the concepts and techniques covered in previous sections. They illustrate this workflow by constructing a binary classification model to classify data points generated using the make_circles dataset in scikit-learn.

    • PyTorch End-to-End Workflow: The sources outline a structured approach to developing PyTorch models, encompassing the following key steps:
    1. Data: Acquire, prepare, and transform data into a suitable format for training. This step involves understanding the dataset, loading the data, performing necessary preprocessing steps, and splitting the data into training and testing sets.
    2. Model: Choose or build a model architecture appropriate for the task, considering the complexity of the problem and the nature of the data. This step involves selecting suitable layers, activation functions, and other components of the model.
    3. Loss Function: Select a loss function that quantifies the difference between the model’s predictions and the actual target values. The choice of loss function depends on the type of problem (e.g., binary classification, multi-class classification, regression).
    4. Optimizer: Choose an optimization algorithm that updates the model’s parameters to minimize the loss function. Popular optimizers include stochastic gradient descent (SGD), Adam, and RMSprop.
    5. Training Loop: Implement a training loop that iteratively feeds the training data to the model, calculates the loss, and updates the model’s parameters using the chosen optimizer.
    6. Evaluation: Evaluate the trained model’s performance on the testing set using appropriate metrics, such as accuracy, precision, recall, and the confusion matrix.
    • Building a Binary Classification Model: The sources demonstrate this workflow by creating a binary classification model to classify data points generated using scikit-learn’s make_circles dataset. They guide readers through the following steps (condensed into the code sketch after the list):
    1. Generating the Dataset: Using make_circles to create a dataset of data points arranged in concentric circles, with each data point belonging to one of two classes.
    2. Visualizing the Data: Employing Matplotlib to visualize the generated data points, providing a visual representation of the classification task.
    3. Building the Model: Constructing a multi-layer neural network with linear layers and ReLU activation functions. The output layer utilizes the sigmoid activation function to produce probabilities for the two classes.
    4. Choosing the Loss Function and Optimizer: Selecting the binary cross-entropy loss function (nn.BCELoss) and the stochastic gradient descent (SGD) optimizer for this binary classification task.
    5. Implementing the Training Loop: Implementing the training loop to train the model, including the steps for calculating the loss, backpropagation, and updating the model’s parameters.
    6. Evaluating the Model: Assessing the model’s performance using accuracy, precision, recall, and visualizing the predictions.
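    Condensed into code, the workflow might look like this sketch (train/test splitting and evaluation are omitted for brevity):

        import torch
        from torch import nn
        from sklearn.datasets import make_circles

        # 1. Data
        X, y = make_circles(n_samples=1000, noise=0.03, random_state=42)
        X = torch.tensor(X, dtype=torch.float32)
        y = torch.tensor(y, dtype=torch.float32).unsqueeze(dim=1)   # shape [1000, 1] to match BCELoss

        # 2. Model: linear layers + ReLU, sigmoid output for a class-1 probability
        model = nn.Sequential(
            nn.Linear(2, 10), nn.ReLU(),
            nn.Linear(10, 10), nn.ReLU(),
            nn.Linear(10, 1), nn.Sigmoid()
        )

        # 3. & 4. Loss function and optimizer
        loss_fn = nn.BCELoss()
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

        # 5. Training loop
        for epoch in range(100):
            y_pred = model(X)
            loss = loss_fn(y_pred, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
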

    The sources provide a clear and structured approach to developing PyTorch models for classification tasks, emphasizing the importance of a systematic workflow that encompasses data preparation, model building, loss function and optimizer selection, training, and evaluation. This section offers a practical guide to applying the concepts and techniques covered in previous sections to build a functioning classification model, preparing readers for more complex tasks and datasets.

    Multi-Class Classification with PyTorch: Pages 571-580

    The sources introduce the concept of multi-class classification, expanding on the binary classification discussed in previous sections. They guide readers through building a multi-class classification model using PyTorch, highlighting the key differences and considerations when dealing with problems involving more than two classes. The sources utilize a synthetic dataset of multi-dimensional blobs created using scikit-learn’s make_blobs function to illustrate this process.

    • Multi-Class Classification: The sources distinguish multi-class classification from binary classification, explaining that multi-class classification involves assigning data points to one of several possible classes. They provide examples of real-world multi-class classification problems, such as classifying images into different categories (e.g., cats, dogs, birds) or identifying different types of objects in an image.
    • Building a Multi-Class Classification Model: The sources outline the steps for building a multi-class classification model in PyTorch, emphasizing the adjustments needed compared to binary classification (key pieces are sketched in code after the list):
    1. Generating the Dataset: Using scikit-learn’s make_blobs function to create a synthetic dataset with multiple classes, where each data point has multiple features and belongs to one specific class.
    2. Visualizing the Data: Utilizing Matplotlib to visualize the generated data points and their corresponding class labels, providing a visual understanding of the multi-class classification problem.
    3. Building the Model: Constructing a neural network with linear layers and ReLU activation functions. The key difference in multi-class classification lies in the output layer. Instead of a single output neuron with a sigmoid activation function, the output layer has multiple neurons, one for each class. The softmax activation function is applied to the output layer to produce a probability distribution over the classes.
    4. Choosing the Loss Function and Optimizer: Selecting an appropriate loss function for multi-class classification, such as the cross-entropy loss (nn.CrossEntropyLoss), and choosing an optimizer like stochastic gradient descent (SGD) or Adam.
    5. Implementing the Training Loop: Implementing the training loop to train the model, similar to binary classification but using the chosen loss function and optimizer for multi-class classification.
    6. Evaluating the Model: Evaluating the performance of the trained model using appropriate metrics for multi-class classification, such as accuracy and the confusion matrix. The sources emphasize that accuracy alone may not be sufficient for evaluating models on imbalanced datasets and suggest exploring other metrics like precision and recall.
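    A sketch of the multi-class pieces (four blob centers are an illustrative choice; note that nn.CrossEntropyLoss applies log-softmax internally, so the model can output raw logits):

        import torch
        from torch import nn
        from sklearn.datasets import make_blobs

        X_blob, y_blob = make_blobs(n_samples=1000, n_features=2, centers=4, random_state=42)
        X_blob = torch.tensor(X_blob, dtype=torch.float32)
        y_blob = torch.tensor(y_blob, dtype=torch.long)    # CrossEntropyLoss expects integer class labels

        blob_model = nn.Sequential(
            nn.Linear(2, 8), nn.ReLU(),
            nn.Linear(8, 8), nn.ReLU(),
            nn.Linear(8, 4)                                # one logit per class
        )
        loss_fn = nn.CrossEntropyLoss()
        optimizer = torch.optim.SGD(blob_model.parameters(), lr=0.1)
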

    The sources provide a comprehensive guide to building and training multi-class classification models in PyTorch, highlighting the adjustments needed in model architecture, loss function, and evaluation metrics compared to binary classification. By working through a concrete example using the make_blobs dataset, the sources equip readers with the fundamental knowledge and practical skills to tackle multi-class classification problems using PyTorch.

    Enhancing a Model and Introducing Nonlinearities: Pages 581-590

    The sources discuss strategies for improving the performance of machine learning models and introduce the concept of nonlinear activation functions, which play a crucial role in enabling neural networks to learn complex patterns in data. They explore ways to enhance a previously built multi-class classification model and introduce the ReLU (Rectified Linear Unit) activation function as a widely used nonlinearity in deep learning.

    • Improving a Model’s Performance: The sources acknowledge that achieving satisfactory results with a machine learning model often involves experimentation and iterative improvement. They present several strategies for enhancing a model’s performance, including:
    1. Adding More Layers: Increasing the depth of the neural network by adding more layers can allow the model to learn more complex representations of the data. The sources suggest that adding layers can be particularly beneficial for tasks with intricate data patterns.
    2. Increasing Hidden Units: Expanding the number of hidden units within each layer can provide the model with more capacity to capture and learn the underlying patterns in the data.
    3. Training for Longer: Extending the number of training epochs can give the model more opportunities to learn from the data and potentially improve its performance. However, training for too long can lead to overfitting, where the model performs well on the training data but poorly on unseen data.
    4. Using a Smaller Learning Rate: Decreasing the learning rate can lead to more stable training and allow the model to converge to a better solution, especially when dealing with complex loss landscapes.
    5. Adding Nonlinearities: Incorporating nonlinear activation functions between layers is essential for enabling neural networks to learn nonlinear relationships in the data. Without nonlinearities, the model would essentially be a series of linear transformations, limiting its ability to capture complex patterns.
    • Introducing the ReLU Activation Function: The sources introduce the ReLU activation function as a widely used nonlinearity in deep learning. They describe ReLU’s simple yet effective operation: it outputs the input directly if the input is positive and outputs zero if the input is negative. Mathematically, ReLU(x) = max(0, x).
    • The sources highlight the benefits of ReLU, including its computational efficiency and its tendency to mitigate the vanishing gradient problem, which can hinder training in deep networks.
    • Incorporating ReLU into the Model: The sources guide readers through adding ReLU activation functions to the previously built multi-class classification model. They demonstrate how to insert ReLU layers between the linear layers of the model, enabling the network to learn nonlinear decision boundaries and improve its ability to classify the data.

    The sources provide a practical guide to improving machine learning model performance and introduce the concept of nonlinearities, emphasizing the importance of ReLU activation functions in enabling neural networks to learn complex data patterns. By incorporating ReLU into the multi-class classification model, the sources showcase the power of nonlinearities in enhancing a model’s ability to capture and represent the underlying structure of the data.

    Building and Evaluating Convolutional Neural Networks: Pages 591-600

    The sources transition from traditional feedforward neural networks to convolutional neural networks (CNNs), a specialized architecture particularly effective for computer vision tasks. They emphasize the power of CNNs in automatically learning and extracting features from images, eliminating the need for manual feature engineering. The sources utilize a simplified version of the VGG architecture, dubbed “TinyVGG,” to illustrate the building blocks of CNNs and their application in image classification.

    • Convolutional Neural Networks (CNNs): The sources introduce CNNs as a powerful type of neural network specifically designed for processing data with a grid-like structure, such as images. They explain that CNNs excel in computer vision tasks because they exploit the spatial relationships between pixels in an image, learning to identify patterns and features that are relevant for classification.
    • Key Components of CNNs: The sources outline the fundamental building blocks of CNNs:
    1. Convolutional Layers: Convolutional layers perform convolutions, a mathematical operation that involves sliding a filter (also called a kernel) over the input image to extract features. The filter acts as a pattern detector, learning to recognize specific shapes, edges, or textures in the image.
    2. Activation Functions: Non-linear activation functions, such as ReLU, are applied to the output of convolutional layers to introduce non-linearity into the network, enabling it to learn complex patterns.
    3. Pooling Layers: Pooling layers downsample the output of convolutional layers, reducing the spatial dimensions of the feature maps while retaining the most important information. Common pooling operations include max pooling and average pooling.
    4. Fully Connected Layers: Fully connected layers, similar to those in traditional feedforward networks, are often used in the final stages of a CNN to perform classification based on the extracted features.
    • Building TinyVGG: The sources guide readers through implementing a simplified version of the VGG architecture, named TinyVGG, to demonstrate how to build and train a CNN for image classification. They detail the architecture of TinyVGG, which consists of the following (see the sketch after the list):
    1. Convolutional Blocks: Multiple convolutional blocks, each comprising convolutional layers, ReLU activation functions, and a max pooling layer.
    2. Classifier Layer: A final classifier layer consisting of a flattening operation followed by fully connected layers to perform classification.
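    A sketch of such an architecture, sized here for 28x28 grayscale inputs like FashionMNIST (the exact channel counts and padding are illustrative choices):

        import torch
        from torch import nn

        class TinyVGG(nn.Module):
            def __init__(self, input_channels: int, hidden_units: int, output_classes: int):
                super().__init__()
                self.block_1 = nn.Sequential(
                    nn.Conv2d(input_channels, hidden_units, kernel_size=3, padding=1),
                    nn.ReLU(),
                    nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                    nn.ReLU(),
                    nn.MaxPool2d(kernel_size=2)     # halves height and width
                )
                self.block_2 = nn.Sequential(
                    nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                    nn.ReLU(),
                    nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                    nn.ReLU(),
                    nn.MaxPool2d(kernel_size=2)
                )
                self.classifier = nn.Sequential(
                    nn.Flatten(),
                    nn.Linear(hidden_units * 7 * 7, output_classes)   # 28x28 -> 7x7 after two poolings
                )

            def forward(self, x):
                return self.classifier(self.block_2(self.block_1(x)))
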
    • Training and Evaluating TinyVGG: The sources provide code for training TinyVGG using the FashionMNIST dataset, a collection of grayscale images of clothing items. They demonstrate how to define the training loop, calculate the loss, perform backpropagation, and update the model’s parameters using an optimizer. They also guide readers through evaluating the trained model’s performance using accuracy and other relevant metrics.

    The sources provide a clear and accessible introduction to CNNs and their application in image classification, demonstrating the power of CNNs in automatically learning features from images without manual feature engineering. By implementing and training TinyVGG, the sources equip readers with the practical skills and understanding needed to build and work with CNNs for computer vision tasks.

    Visualizing CNNs and Building a Custom Dataset: Pages 601-610

    The sources emphasize the importance of understanding how convolutional neural networks (CNNs) operate and guide readers through visualizing the effects of convolutional layers, kernels, strides, and padding. They then transition to the concept of custom datasets, explaining the need to go beyond pre-built datasets and create datasets tailored to specific machine learning problems. The sources utilize the Food101 dataset, creating a smaller subset called “Food Vision Mini” to illustrate building a custom dataset for image classification.

    • Visualizing CNNs: The sources recommend using the CNN Explainer website (https://poloclub.github.io/cnn-explainer/) to gain a deeper understanding of how CNNs work.
    • They acknowledge that the mathematical operations involved in convolutions can be challenging to grasp. The CNN Explainer provides an interactive visualization that allows users to experiment with different CNN parameters and observe their effects on the input image.
    • Key Insights from CNN Explainer: The sources highlight the following key concepts illustrated by the CNN Explainer:
    1. Kernels: Kernels, also called filters, are small matrices that slide across the input image, extracting features by performing element-wise multiplications and summations. The values within the kernel represent the weights that the CNN learns during training.
    2. Strides: Strides determine how much the kernel moves across the input image in each step. Larger strides result in a larger downsampling of the input, reducing the spatial dimensions of the output feature maps.
    3. Padding: Padding involves adding extra pixels around the borders of the input image. Padding helps control the spatial dimensions of the output feature maps and can prevent information loss at the edges of the image.
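    These three hyperparameters together determine the output shape of a convolutional layer, as a quick check shows:

        import torch
        from torch import nn

        conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=3, stride=1, padding=1)
        dummy_image = torch.randn(1, 3, 64, 64)   # (batch, channels, height, width)
        print(conv(dummy_image).shape)            # torch.Size([1, 10, 64, 64])
        # padding=1 with a 3x3 kernel and stride=1 preserves the spatial size;
        # stride=2 would roughly halve it, and padding=0 would shrink it by 2 pixels.
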
    • Building a Custom Dataset: The sources recognize that many real-world machine learning problems require creating custom datasets that are not readily available. They guide readers through the process of building a custom dataset for image classification, using the Food101 dataset as an example.
    • Creating Food Vision Mini: The sources construct a smaller subset of the Food101 dataset called Food Vision Mini, which contains only three classes (pizza, steak, and sushi) and a reduced number of images. They advocate for starting with a smaller dataset for experimentation and development, scaling up to the full dataset once the model and workflow are established.
    • Standard Image Classification Format: The sources emphasize the importance of organizing the dataset into a standard image classification format, where images are grouped into separate folders corresponding to their respective classes. This standard format facilitates data loading and preprocessing using PyTorch’s built-in tools.
    • Loading Image Data using ImageFolder: The sources introduce PyTorch’s ImageFolder class, a convenient tool for loading image data that is organized in the standard image classification format. They demonstrate how to use ImageFolder to create dataset objects for the training and testing splits of Food Vision Mini.
    • They highlight the benefits of ImageFolder, including its automatic labeling of images based on their folder location and its ability to apply transformations to the images during loading.
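    A minimal sketch (the directory path is hypothetical and assumes the standard class-per-folder layout described above):

        from torchvision import datasets, transforms

        data_transform = transforms.Compose([
            transforms.Resize((64, 64)),
            transforms.ToTensor()
        ])

        train_data = datasets.ImageFolder(root="data/pizza_steak_sushi/train",
                                          transform=data_transform)
        print(train_data.classes)   # e.g. ['pizza', 'steak', 'sushi'] -- inferred from folder names
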
    • Visualizing the Custom Dataset: The sources encourage visualizing the custom dataset to ensure that the images and labels are loaded correctly. They provide code for displaying random images and their corresponding labels from the training dataset, enabling a qualitative assessment of the dataset’s content.

    The sources offer a practical guide to understanding and visualizing CNNs and provide a step-by-step approach to building a custom dataset for image classification. By using the Food Vision Mini dataset as a concrete example, the sources equip readers with the knowledge and skills needed to create and work with datasets tailored to their specific machine learning problems.

    Building a Custom Dataset Class and Exploring Data Augmentation: Pages 611-620

    The sources shift from using the convenient ImageFolder class to building a custom Dataset class in PyTorch, providing greater flexibility and control over data loading and preprocessing. They explain the structure and key methods of a custom Dataset class and demonstrate how to implement it for the Food Vision Mini dataset. The sources then explore data augmentation techniques, emphasizing their role in improving model generalization by artificially increasing the diversity of the training data.

    • Building a Custom Dataset Class: The sources guide readers through creating a custom Dataset class in PyTorch, offering a more versatile approach compared to ImageFolder for handling image data. They outline the essential components of a custom Dataset:
    1. Initialization (__init__): The initialization method sets up the necessary attributes of the dataset, such as the image paths, labels, and transformations.
    2. Length (__len__): The length method returns the total number of samples in the dataset, allowing PyTorch’s data loaders to determine the dataset’s size.
    3. Get Item (__getitem__): The get item method retrieves a specific sample from the dataset given its index. It typically involves loading the image, applying transformations, and returning the transformed image and its corresponding label.
    • Implementing the Custom Dataset: The sources provide a step-by-step implementation of a custom Dataset class for the Food Vision Mini dataset. They demonstrate how to do the following (a condensed sketch appears after the list):
    1. Collect Image Paths and Labels: Iterate through the image directories and store the paths to each image along with their corresponding labels.
    2. Define Transformations: Specify the desired image transformations to be applied during data loading, such as resizing, cropping, and converting to tensors.
    3. Implement __getitem__: Retrieve the image at the given index, apply transformations, and return the transformed image and label as a tuple.
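    Condensed into code, such a class might look like this sketch (the class name, glob pattern, and .jpg assumption are illustrative):

        import pathlib
        from PIL import Image
        from torch.utils.data import Dataset

        class ImageFolderCustom(Dataset):
            def __init__(self, targ_dir: str, transform=None):
                # Expects a class_name/image.jpg folder layout
                self.paths = list(pathlib.Path(targ_dir).glob("*/*.jpg"))
                self.transform = transform
                self.classes = sorted({p.parent.name for p in self.paths})
                self.class_to_idx = {name: i for i, name in enumerate(self.classes)}

            def __len__(self):
                return len(self.paths)

            def __getitem__(self, index):
                image = Image.open(self.paths[index])
                label = self.class_to_idx[self.paths[index].parent.name]
                if self.transform:
                    image = self.transform(image)
                return image, label
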
    • Benefits of Custom Dataset Class: The sources highlight the advantages of using a custom Dataset class:
    1. Flexibility: Custom Dataset classes offer greater control over data loading and preprocessing, allowing developers to tailor the data handling process to their specific needs.
    2. Extensibility: Custom Dataset classes can be easily extended to accommodate various data formats and incorporate complex data loading logic.
    3. Code Clarity: Custom Dataset classes promote code organization and readability, making it easier to understand and maintain the data loading pipeline.
    • Data Augmentation: The sources introduce data augmentation as a crucial technique for improving the generalization ability of machine learning models. Data augmentation involves artificially expanding the training dataset by applying various transformations to the original images.
    • Purpose of Data Augmentation: The goal of data augmentation is to expose the model to a wider range of variations in the data, reducing the risk of overfitting and enabling the model to learn more robust and generalizable features.
    • Types of Data Augmentations: The sources showcase several common data augmentation techniques, including:
    1. Random Flipping: Flipping images horizontally or vertically.
    2. Random Cropping: Cropping images to different sizes and positions.
    3. Random Rotation: Rotating images by a random angle.
    4. Color Jitter: Adjusting image brightness, contrast, saturation, and hue.
    • Benefits of Data Augmentation: The sources emphasize the following benefits of data augmentation:
    1. Increased Data Diversity: Data augmentation artificially expands the training dataset, exposing the model to a wider range of image variations.
    2. Improved Generalization: Training on augmented data helps the model learn more robust features that generalize better to unseen data.
    3. Reduced Overfitting: Data augmentation can mitigate overfitting by preventing the model from memorizing specific examples in the training data.
    • Incorporating Data Augmentations: The sources guide readers through applying data augmentations to the Food Vision Mini dataset using PyTorch’s transforms module.
    • They demonstrate how to compose multiple transformations into a pipeline, applying them sequentially to the images during data loading.
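    For example, a training-time pipeline combining several of the augmentations above might look like this sketch (the specific parameters are illustrative):

        from torchvision import transforms

        train_transform = transforms.Compose([
            transforms.Resize((64, 64)),
            transforms.RandomHorizontalFlip(p=0.5),                 # flip half the images
            transforms.RandomRotation(degrees=15),                  # rotate by up to 15 degrees
            transforms.ColorJitter(brightness=0.2, contrast=0.2),   # vary brightness and contrast
            transforms.ToTensor()
        ])
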
    • Visualizing Augmented Images: The sources encourage visualizing the augmented images to ensure that the transformations are being applied as expected. They provide code for displaying random augmented images from the training dataset, allowing a qualitative assessment of the augmentation pipeline’s effects.

    The sources provide a comprehensive guide to building a custom Dataset class in PyTorch, empowering readers to handle data loading and preprocessing with greater flexibility and control. They then explore the concept and benefits of data augmentation, emphasizing its role in enhancing model generalization by introducing artificial diversity into the training data.

    Constructing and Training a TinyVGG Model: Pages 621-630

    The sources guide readers through constructing a TinyVGG model, a simplified version of the VGG (Visual Geometry Group) architecture commonly used in computer vision. They explain the rationale behind TinyVGG’s design, detail its layers and activation functions, and demonstrate how to implement it in PyTorch. They then focus on training the TinyVGG model using the custom Food Vision Mini dataset. They highlight the importance of setting a random seed for reproducibility and illustrate the training process using a combination of code and explanatory text.

    • Introducing TinyVGG Architecture: The sources introduce the TinyVGG architecture as a simplified version of the VGG architecture, well-known for its performance in image classification tasks.
    • Rationale Behind TinyVGG: They explain that TinyVGG aims to capture the essential elements of the VGG architecture while using fewer layers and parameters, making it more computationally efficient and suitable for smaller datasets like Food Vision Mini.
    • Layers and Activation Functions in TinyVGG: The sources provide a detailed breakdown of the layers and activation functions used in the TinyVGG model:
    1. Convolutional Layers (nn.Conv2d): Multiple convolutional layers are used to extract features from the input images. Each convolutional layer applies a set of learnable filters (kernels) to the input, generating feature maps that highlight different patterns in the image.
    2. ReLU Activation Function (nn.ReLU): The rectified linear unit (ReLU) activation function is applied after each convolutional layer. ReLU introduces non-linearity into the model, allowing it to learn complex relationships between features. It is defined as f(x) = max(0, x), meaning it outputs the input directly if it is positive and outputs zero if the input is negative.
    3. Max Pooling Layers (nn.MaxPool2d): Max pooling layers downsample the feature maps by selecting the maximum value within a small window. This reduces the spatial dimensions of the feature maps while retaining the most salient features.
    4. Flatten Layer (nn.Flatten): The flatten layer converts the multi-dimensional feature maps from the convolutional layers into a one-dimensional feature vector. This vector is then fed into the fully connected layers for classification.
    5. Linear Layer (nn.Linear): The linear layer performs a matrix multiplication on the input feature vector, producing a set of scores for each class.
    • Implementing TinyVGG in PyTorch: The sources guide readers through implementing the TinyVGG architecture using PyTorch’s nn.Module class. They define a class called TinyVGG that inherits from nn.Module and implements the model’s architecture in its __init__ and forward methods.
    • __init__ Method: This method initializes the model’s layers, including convolutional layers, ReLU activation functions, max pooling layers, a flatten layer, and a linear layer for classification.
    • forward Method: This method defines the flow of data through the model, taking an input tensor and passing it through the various layers in the correct sequence.
    • Setting the Random Seed: The sources stress the importance of setting a random seed before training the model using torch.manual_seed(42). This ensures that the model’s initialization and training process are deterministic, making the results reproducible.
    • Training the TinyVGG Model: The sources demonstrate how to train the TinyVGG model on the Food Vision Mini dataset. They provide code for the following (a minimal setup sketch appears after the list):
    1. Creating an Instance of the Model: Instantiating the TinyVGG class creates an object representing the model.
    2. Choosing a Loss Function: Selecting an appropriate loss function to measure the difference between the model’s predictions and the true labels.
    3. Setting up an Optimizer: Choosing an optimization algorithm to update the model’s parameters during training, aiming to minimize the loss function.
    4. Defining a Training Loop: Implementing a loop that iterates through the training data, performs forward and backward passes, updates model parameters, and tracks the training progress.
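    A minimal setup sketch, assuming a TinyVGG variant sized for 64x64 RGB inputs (with two 2x2 poolings, the flattened classifier input becomes hidden_units * 16 * 16 rather than the 7x7 version shown earlier):

        torch.manual_seed(42)    # reproducible weight initialization

        model_2 = TinyVGG(input_channels=3,    # RGB images
                          hidden_units=10,
                          output_classes=3)    # pizza, steak, sushi
        loss_fn = nn.CrossEntropyLoss()
        optimizer = torch.optim.Adam(model_2.parameters(), lr=0.001)
        # ...then run the same training loop structure used for the earlier models.
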

    The sources provide a practical walkthrough of constructing and training a TinyVGG model using the Food Vision Mini dataset. They explain the architecture’s design principles, detail its layers and activation functions, and demonstrate how to implement and train the model in PyTorch. They emphasize the importance of setting a random seed for reproducibility, enabling others to replicate the training process and results.

    Visualizing the Model, Evaluating Performance, and Comparing Results: Pages 631-640

    The sources move towards visualizing the TinyVGG model’s layers and their effects on input data, offering insights into how convolutional neural networks process information. They then focus on evaluating the model’s performance using various metrics, emphasizing the need to go beyond simple accuracy and consider measures like precision, recall, and F1 score for a more comprehensive assessment. Finally, the sources introduce techniques for comparing the performance of different models, highlighting the role of dataframes in organizing and presenting the results.

    • Visualizing TinyVGG’s Convolutional Layers: The sources explore how to visualize the convolutional layers of the TinyVGG model.
    • They leverage the CNN Explainer website, which offers an interactive tool for understanding the workings of convolutional neural networks.
    • The sources guide readers through creating dummy data in the same shape as the input data used in the CNN Explainer, allowing them to observe how the model’s convolutional layers transform the input.
    • The sources emphasize the importance of understanding hyperparameters like kernel size, stride, and padding and their influence on the convolutional operation.
    • Understanding Kernel Size, Stride, and Padding: The sources explain the significance of key hyperparameters involved in convolutional layers:
    1. Kernel Size: Refers to the size of the filter that slides across the input image. A larger kernel captures a wider receptive field, allowing the model to learn more complex features. However, a larger kernel also increases the number of parameters and computational complexity.
    2. Stride: Determines the step size at which the kernel moves across the input. A larger stride results in a smaller output feature map, effectively downsampling the input.
    3. Padding: Involves adding extra pixels around the input image to control the output size and prevent information loss at the edges. Different padding strategies, such as “same” padding or “valid” padding, influence how the kernel interacts with the image boundaries.
    • Evaluating Model Performance: The sources shift focus to evaluating the performance of the trained TinyVGG model. They emphasize that relying solely on accuracy may not provide a complete picture, especially when dealing with imbalanced datasets where one class might dominate the others.
    • Metrics Beyond Accuracy: The sources introduce several additional metrics for evaluating classification models:
    1. Precision: Measures the proportion of correctly predicted positive instances out of all instances predicted as positive. A high precision indicates that the model is good at avoiding false positives.
    2. Recall: Measures the proportion of correctly predicted positive instances out of all actual positive instances. A high recall suggests that the model is effective at identifying most of the positive instances.
    3. F1 Score: The harmonic mean of precision and recall, providing a balanced measure that considers both false positives and false negatives. It is particularly useful when dealing with imbalanced datasets where precision and recall might provide conflicting insights.
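    In terms of raw prediction counts (true positives TP, false positives FP, and false negatives FN), these metrics work out to: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 * (precision * recall) / (precision + recall).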
    • Confusion Matrix: The sources introduce the concept of a confusion matrix, a powerful tool for visualizing the performance of a classification model.
    • Structure of a Confusion Matrix: The confusion matrix is a table that shows the counts of true positives, true negatives, false positives, and false negatives for each class, providing a detailed breakdown of the model’s prediction patterns.
    • Benefits of Confusion Matrix: The confusion matrix helps identify classes that the model struggles with, providing insights into potential areas for improvement.
    • Comparing Model Performance: The sources explore techniques for comparing the performance of different models trained on the Food Vision Mini dataset. They demonstrate how to use Pandas dataframes to organize and present the results clearly and concisely.
    • Creating a Dataframe for Comparison: The sources guide readers through creating a dataframe that includes relevant metrics like training time, training loss, test loss, and test accuracy for each model. This allows for a side-by-side comparison of their performance.
    • Benefits of Dataframes: Dataframes provide a structured and efficient way to handle and analyze tabular data. They enable easy sorting, filtering, and visualization of the results, facilitating the process of model selection and comparison.
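    For example, metrics recorded during each model’s run can be collected into a dataframe like this sketch (the variable names are illustrative; each list holds one recorded value per model):

        import pandas as pd

        compare_results = pd.DataFrame({
            "model": ["model_0", "model_1", "model_2"],
            "train_time_s": train_times,    # values recorded during each training run
            "test_loss": test_losses,
            "test_acc": test_accs,
        })
        compare_results.sort_values(by="test_acc", ascending=False)
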

    The sources emphasize the importance of going beyond simple accuracy when evaluating classification models. They introduce a range of metrics, including precision, recall, and F1 score, and highlight the usefulness of the confusion matrix in providing a detailed analysis of the model’s prediction patterns. The sources then demonstrate how to use dataframes to compare the performance of multiple models systematically, aiding in model selection and understanding the impact of different design choices or training strategies.

    Building, Training, and Evaluating a Multi-Class Classification Model: Pages 641-650

    The sources transition from binary classification, where models distinguish between two classes, to multi-class classification, which involves predicting one of several possible classes. They introduce the concept of multi-class classification, comparing it to binary classification, and use the Fashion MNIST dataset as an example, where models need to classify images into ten different clothing categories. The sources guide readers through adapting the TinyVGG architecture and training process for this multi-class setting, explaining the modifications needed for handling multiple classes.

    • From Binary to Multi-Class Classification: The sources explain the shift from binary to multi-class classification.
    • Binary Classification: Involves predicting one of two possible classes, like “cat” or “dog” in an image classification task.
    • Multi-Class Classification: Extends the concept to predicting one of multiple classes, as in the Fashion MNIST dataset, where models must classify images into classes like “T-shirt,” “Trouser,” “Pullover,” “Dress,” “Coat,” “Sandal,” “Shirt,” “Sneaker,” “Bag,” and “Ankle Boot.” [1, 2]
    • Adapting TinyVGG for Multi-Class Classification: The sources explain how to modify the TinyVGG architecture for multi-class problems.
    • Output Layer: The key change involves adjusting the output layer of the TinyVGG model. The number of output units in the final linear layer needs to match the number of classes in the dataset. For Fashion MNIST, this means having ten output units, one for each clothing category. [3]
    • Activation Function: They also recommend using the softmax activation function in the output layer for multi-class classification. The softmax function converts the raw output scores (logits) from the linear layer into a probability distribution over the classes, where each probability represents the model’s confidence in assigning the input to that particular class. [4]
    • Choosing the Right Loss Function and Optimizer: The sources guide readers through selecting appropriate loss functions and optimizers for multi-class classification:
    • Cross-Entropy Loss: They recommend using the cross-entropy loss function, a common choice for multi-class classification tasks. Cross-entropy loss measures the dissimilarity between the predicted probability distribution and the true label distribution. [5]
    • Optimizers: The sources discuss using optimizers like Stochastic Gradient Descent (SGD) or Adam to update the model’s parameters during training, aiming to minimize the cross-entropy loss. [5]
    • Training the Multi-Class Model: The sources demonstrate how to train the adapted TinyVGG model on the Fashion MNIST dataset, following a similar training loop structure used in previous sections:
    • Data Loading: Loading batches of image data and labels from the Fashion MNIST dataset using PyTorch’s DataLoader. [6, 7]
    • Forward Pass: Passing the input data through the model to obtain predictions (logits). [8]
    • Calculating Loss: Computing the cross-entropy loss between the predicted logits and the true labels. [8]
    • Backpropagation: Calculating gradients of the loss with respect to the model’s parameters. [8]
    • Optimizer Step: Updating the model’s parameters using the chosen optimizer, aiming to minimize the loss. [8]
    • Evaluating Performance: The sources reiterate the importance of evaluating model performance using metrics beyond simple accuracy, especially in multi-class settings.
    • Precision, Recall, F1 Score: They encourage considering metrics like precision, recall, and F1 score, which provide a more nuanced understanding of the model’s ability to correctly classify instances across different classes. [9]
    • Confusion Matrix: They highlight the usefulness of the confusion matrix, allowing visualization of the model’s prediction patterns and identification of classes the model struggles with. [10]
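
    The following is a minimal sketch of these pieces working together; the flattened-input classifier below is a stand-in for the full TinyVGG, and the layer sizes are illustrative assumptions:

    import torch
    from torch import nn

    NUM_CLASSES = 10  # Fashion MNIST has ten clothing categories

    # The final linear layer must output one logit per class
    model = nn.Sequential(
        nn.Flatten(),
        nn.Linear(in_features=28 * 28, out_features=128),
        nn.ReLU(),
        nn.Linear(in_features=128, out_features=NUM_CLASSES),
    )

    loss_fn = nn.CrossEntropyLoss()  # expects raw logits, not probabilities
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    x = torch.randn(32, 1, 28, 28)        # dummy batch of grayscale images
    logits = model(x)                     # shape: [32, 10]
    probs = torch.softmax(logits, dim=1)  # probability distribution over classes
    preds = probs.argmax(dim=1)           # predicted class index per sample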

    The sources smoothly transition readers from binary to multi-class classification. They outline the key differences, provide clear instructions on adapting the TinyVGG architecture for multi-class tasks, and guide readers through the training process. They emphasize the need for comprehensive model evaluation, suggesting the use of metrics beyond accuracy and showcasing the value of the confusion matrix in analyzing the model’s performance.

    Evaluating Model Predictions and Understanding Data Augmentation: Pages 651-660

    The sources guide readers through evaluating model predictions on individual samples from the Fashion MNIST dataset, emphasizing the importance of visual inspection and understanding where the model succeeds or fails. They then introduce the concept of data augmentation as a technique for artificially increasing the diversity of the training data, aiming to improve the model’s generalization ability and robustness.

    • Visually Evaluating Model Predictions: The sources demonstrate how to make predictions on individual samples from the test set and visualize them alongside their true labels.
    • Selecting Random Samples: They guide readers through selecting random samples from the test data, preparing the images for visualization using matplotlib, and making predictions using the trained model.
    • Visualizing Predictions: They showcase a technique for creating a grid of images, displaying each test sample alongside its predicted label and its true label. This visual approach provides insights into the model’s performance on specific instances.
    • Analyzing Results: The sources encourage readers to analyze the visual results, looking for patterns in the model’s predictions and identifying instances where it might be making errors. This process helps understand the strengths and weaknesses of the model’s learned representations.
    • Confusion Matrix for Deeper Insights: The sources revisit the concept of the confusion matrix, introduced earlier, as a powerful tool for evaluating classification model performance.
    • Creating a Confusion Matrix: They guide readers through creating a confusion matrix using libraries like torchmetrics and mlxtend, which offer convenient functions for computing and visualizing confusion matrices.
    • Interpreting the Confusion Matrix: The sources explain how to interpret the confusion matrix, highlighting the patterns in the model’s predictions and identifying classes that might be easily confused.
    • Benefits of Confusion Matrix: They emphasize that the confusion matrix provides a more granular view of the model’s performance compared to simple accuracy, allowing for a deeper understanding of its prediction patterns.
    • Data Augmentation: The sources introduce the concept of data augmentation as a technique to improve model generalization and performance.
    • Definition of Data Augmentation: They define data augmentation as the process of artificially increasing the diversity of the training data by applying various transformations to the original images.
    • Benefits of Data Augmentation: The sources explain that data augmentation helps expose the model to a wider range of variations during training, making it more robust to changes in input data and improving its ability to generalize to unseen examples.
    • Common Data Augmentation Techniques: The sources discuss several commonly used data augmentation techniques:
    1. Random Cropping: Involves randomly selecting a portion of the image to use for training, helping the model learn to recognize objects regardless of their location within the image.
    2. Random Flipping: Horizontally flipping images, teaching the model to recognize objects even when they are mirrored.
    3. Random Rotation: Rotating images by a random angle, improving the model’s ability to handle different object orientations.
    4. Color Jitter: Adjusting the brightness, contrast, saturation, and hue of images, making the model more robust to variations in lighting and color.
    • Applying Data Augmentation in PyTorch: The sources demonstrate how to apply data augmentation using PyTorch’s transforms module, which offers a wide range of built-in transformations for image data. They create a custom transformation pipeline that includes random cropping, random horizontal flipping, and random rotation (a minimal version is sketched below). They then visualize examples of augmented images, highlighting the diversity introduced by these transformations.
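
    A minimal version of such a pipeline might look like the following; the crop size and rotation range are illustrative assumptions:

    from torchvision import transforms

    train_transform = transforms.Compose([
        transforms.RandomResizedCrop(size=(28, 28)),  # random crop, resized back to 28x28
        transforms.RandomHorizontalFlip(p=0.5),       # mirror half of the images
        transforms.RandomRotation(degrees=15),        # rotate within +/- 15 degrees
        transforms.ToTensor(),                        # PIL image -> float tensor in [0, 1]
    ])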

    The sources guide readers through evaluating individual model predictions, showcasing techniques for visual inspection and analysis using matplotlib. They reiterate the importance of the confusion matrix as a tool for gaining deeper insights into the model’s prediction patterns. They then introduce the concept of data augmentation, explaining its purpose and benefits. The sources provide clear explanations of common data augmentation techniques and demonstrate how to apply them using PyTorch’s transforms module, emphasizing the role of data augmentation in improving model generalization and robustness.

    Building and Training a TinyVGG Model on a Custom Dataset: Pages 661-670

    The sources shift focus to building and training a TinyVGG convolutional neural network model on the custom food dataset (pizza, steak, sushi) prepared in the previous sections. They guide readers through the process of model definition, setting up a loss function and optimizer, and defining training and testing steps for the model. The sources emphasize a step-by-step approach, encouraging experimentation and understanding of the model’s architecture and training dynamics.

    • Defining the TinyVGG Architecture: The sources provide a detailed breakdown of the TinyVGG architecture, outlining the layers and their configurations:
    • Convolutional Blocks: They describe the arrangement of convolutional layers (nn.Conv2d), activation functions (typically ReLU – nn.ReLU), and max-pooling layers (nn.MaxPool2d) within convolutional blocks. They explain how these blocks extract features from the input images at different levels of abstraction.
    • Classifier Layer: They describe the classifier layer, consisting of a flattening operation (nn.Flatten) followed by fully connected linear layers (nn.Linear). This layer takes the extracted features from the convolutional blocks and maps them to the output classes (pizza, steak, sushi).
    • Model Implementation: The sources guide readers through implementing the TinyVGG model in PyTorch, showing how to define the model class by subclassing nn.Module (a condensed sketch appears after this list):
    • __init__ Method: They demonstrate the initialization of the model’s layers within the __init__ method, setting up the convolutional blocks and the classifier layer.
    • forward Method: They explain the forward method, which defines the flow of data through the model during the forward pass, outlining how the input data passes through each layer and transformation.
    • Input and Output Shape Verification: The sources stress the importance of verifying the input and output shapes of each layer in the model. They encourage readers to print the shapes at different stages to ensure the data is flowing correctly through the network and that the dimensions are as expected. They also mention techniques for troubleshooting shape mismatches.
    • Introducing torchinfo Package: The sources introduce the torchinfo package as a helpful tool for summarizing the architecture of a PyTorch model, providing information about layer shapes, parameters, and the overall structure of the model. They demonstrate how to use torchinfo to get a concise overview of the defined TinyVGG model.
    • Setting Up the Loss Function and Optimizer: The sources guide readers through selecting a suitable loss function and optimizer for training the TinyVGG model:
    • Cross-Entropy Loss: They recommend using the cross-entropy loss function for the multi-class classification problem of the food dataset. They explain that cross-entropy loss is commonly used for classification tasks and measures the difference between the predicted probability distribution and the true label distribution.
    • Stochastic Gradient Descent (SGD) Optimizer: They suggest using the SGD optimizer for updating the model’s parameters during training. They explain that SGD is a widely used optimization algorithm that iteratively adjusts the model’s parameters to minimize the loss function.
    • Defining Training and Testing Steps: The sources provide code for defining the training and testing steps of the model training process:
    • train_step Function: They define a train_step function, which takes a batch of training data as input, performs a forward pass through the model, calculates the loss, performs backpropagation to compute gradients, and updates the model’s parameters using the optimizer. They emphasize accumulating the loss and accuracy over the batches within an epoch.
    • test_step Function: They define a test_step function, which takes a batch of testing data as input, performs a forward pass to get predictions, calculates the loss, and accumulates the loss and accuracy over the batches. They highlight that the test_step does not involve updating the model’s parameters, as it’s used for evaluation purposes.
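
    Below is a condensed sketch of the architecture described above, assuming 64×64 RGB inputs, three output classes, and padding of 1 in each convolution; the sources’ exact kernel and padding settings may differ, which would change the classifier’s in_features:

    import torch
    from torch import nn

    class TinyVGG(nn.Module):
        def __init__(self, input_channels: int, hidden_units: int, output_classes: int):
            super().__init__()
            self.conv_block_1 = nn.Sequential(
                nn.Conv2d(input_channels, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2),  # halves height and width
            )
            self.conv_block_2 = nn.Sequential(
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2),
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                # 64x64 input halved twice by pooling -> 16x16 feature maps
                nn.Linear(hidden_units * 16 * 16, output_classes),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.classifier(self.conv_block_2(self.conv_block_1(x)))

    model = TinyVGG(input_channels=3, hidden_units=10, output_classes=3)
    print(model(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 3])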

    The sources guide readers through the process of defining the TinyVGG architecture, verifying layer shapes, setting up the loss function and optimizer, and defining the training and testing steps for the model. They emphasize the importance of understanding the model’s structure and the flow of data through it. They encourage readers to experiment and pay attention to details to ensure the model is correctly implemented and set up for training.

    Training, Evaluating, and Saving the TinyVGG Model: Pages 671-680

    The sources guide readers through the complete training process of the TinyVGG model on the custom food dataset, highlighting techniques for visualizing training progress, evaluating model performance, and saving the trained model for later use. They emphasize practical considerations, such as setting up training loops, tracking loss and accuracy metrics, and making predictions on test data.

    • Implementing the Training Loop: The sources provide code for implementing the training loop, iterating through multiple epochs and performing training and testing steps for each epoch. They break down the training loop into clear steps:
    • Epoch Iteration: They use a for loop to iterate over the specified number of training epochs.
    • Setting Model to Training Mode: Before starting the training step for each epoch, they explicitly set the model to training mode using model.train(). They explain that this is important for activating certain layers, like dropout or batch normalization, which behave differently during training and evaluation.
    • Iterating Through Batches: Within each epoch, they use another for loop to iterate through the batches of data from the training data loader.
    • Calling the train_step Function: For each batch, they call the previously defined train_step function, which performs a forward pass, calculates the loss, performs backpropagation, and updates the model’s parameters.
    • Accumulating Loss and Accuracy: They accumulate the training loss and accuracy values over the batches within an epoch.
    • Setting Model to Evaluation Mode: Before starting the testing step, they set the model to evaluation mode using model.eval(). They explain that this deactivates training-specific behaviors of certain layers.
    • Iterating Through Test Batches: They iterate through the batches of data from the test data loader.
    • Calling the test_step Function: For each batch, they call the test_step function, which calculates the loss and accuracy on the test data.
    • Accumulating Test Loss and Accuracy: They accumulate the test loss and accuracy values over the test batches.
    • Calculating Average Loss and Accuracy: After iterating through all the training and testing batches, they calculate the average training loss, training accuracy, test loss, and test accuracy for the epoch.
    • Printing Epoch Statistics: They print the calculated statistics for each epoch, providing a clear view of the model’s progress during training.
    • Visualizing Training Progress: The sources emphasize the importance of visualizing the training process to gain insights into the model’s learning dynamics:
    • Creating Loss and Accuracy Curves: They guide readers through creating plots of the training loss and accuracy values over the epochs, allowing for visual inspection of how the model is improving.
    • Analyzing Loss Curves: They explain how to analyze the loss curves, looking for trends that indicate convergence or potential issues like overfitting. They suggest that a steadily decreasing loss curve generally indicates good learning progress.
    • Saving and Loading the Best Model: The sources highlight the importance of saving the model with the best performance achieved during training:
    • Tracking the Best Test Loss: They introduce a variable to track the best test loss achieved so far during training.
    • Saving the Model When Test Loss Improves: They include a condition within the training loop to save the model’s state dictionary (model.state_dict()) whenever a new best test loss is achieved.
    • Loading the Saved Model: They demonstrate how to load the saved state dictionary with torch.load() and restore it into the model via load_state_dict(), recovering the best parameters for later use (see the sketch after this list).
    • Evaluating the Loaded Model: The sources guide readers through evaluating the performance of the loaded model on the test data:
    • Performing a Test Pass: They use the test_step function to calculate the loss and accuracy of the loaded model on the entire test dataset.
    • Comparing Results: They compare the results of the loaded model with the results obtained during training to ensure that the loaded model performs as expected.
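
    A minimal, self-contained sketch of the save-on-improvement pattern; the tiny linear model, dummy test batch, and file name are stand-ins for the real training setup:

    import torch
    from torch import nn

    model = nn.Linear(4, 2)                # stand-in for the trained model
    loss_fn = nn.CrossEntropyLoss()
    x_test = torch.randn(8, 4)             # stand-in test batch
    y_test = torch.randint(0, 2, (8,))

    best_test_loss = float("inf")
    for epoch in range(5):
        # ... the training step for this epoch would run here ...
        model.eval()
        with torch.inference_mode():
            test_loss = loss_fn(model(x_test), y_test).item()
        if test_loss < best_test_loss:     # save only when the test loss improves
            best_test_loss = test_loss
            torch.save(model.state_dict(), "best_model.pth")

    # Restore the best parameters into a fresh instance of the same architecture
    loaded_model = nn.Linear(4, 2)
    loaded_model.load_state_dict(torch.load("best_model.pth"))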

    The sources provide a comprehensive walkthrough of the training process for the TinyVGG model, emphasizing the importance of setting up the training loop, tracking loss and accuracy metrics, visualizing training progress, saving the best model, and evaluating its performance. They offer practical tips and best practices for effective model training, encouraging readers to actively engage in the process, analyze the results, and gain a deeper understanding of how the model learns and improves.

    Understanding and Implementing Custom Datasets: Pages 681-690

    The sources shift focus to explaining the concept and implementation of custom datasets in PyTorch, emphasizing the flexibility and customization they offer for handling diverse types of data beyond pre-built datasets. They guide readers through the process of creating a custom dataset class, understanding its key methods, and visualizing samples from the custom dataset.

    • Introducing Custom Datasets: The sources introduce the concept of custom datasets in PyTorch, explaining that they allow for greater control and flexibility in handling data that doesn’t fit the structure of pre-built datasets. They highlight that custom datasets are especially useful when working with:
    • Data in Non-Standard Formats: Data that is not readily available in formats supported by pre-built datasets, requiring specific loading and processing steps.
    • Data with Unique Structures: Data with specific organizational structures or relationships that need to be represented in a particular way.
    • Data Requiring Specialized Transformations: Data that requires specific transformations or augmentations to prepare it for model training.
    • Using torchvision.datasets.ImageFolder: The sources acknowledge that the torchvision.datasets.ImageFolder class can handle many image classification datasets. They explain that ImageFolder works well when the data follows a standard directory structure, where images are organized into subfolders representing different classes. However, they also emphasize the need for custom dataset classes when dealing with data that doesn’t conform to this standard structure.
    • Building FoodVisionMini Custom Dataset: The sources guide readers through creating a custom dataset class called FoodVisionMini, designed to work with the smaller subset of the Food 101 dataset (pizza, steak, sushi) prepared earlier. They outline the key steps and considerations involved:
    • Subclassing torch.utils.data.Dataset: They explain that custom dataset classes should inherit from the torch.utils.data.Dataset class, which provides the basic framework for representing a dataset in PyTorch.
    • Implementing Required Methods: They highlight the essential methods that need to be implemented in a custom dataset class:
    • __init__ Method: The __init__ method initializes the dataset, taking the necessary arguments, such as the data directory, transformations to be applied, and any other relevant information.
    • __len__ Method: The __len__ method returns the total number of samples in the dataset.
    • __getitem__ Method: The __getitem__ method retrieves a data sample at a given index. It typically involves loading the data, applying transformations, and returning the processed data and its corresponding label.
    • __getitem__ Method Implementation: The sources provide a detailed breakdown of implementing the __getitem__ method in the FoodVisionMini dataset (a condensed sketch of the full class appears after this list):
    • Getting the Image Path: The method first determines the file path of the image to be loaded based on the provided index.
    • Loading the Image: It uses PIL.Image.open() to open the image file.
    • Applying Transformations: It applies the specified transformations (if any) to the loaded image.
    • Converting to Tensor: It converts the transformed image to a PyTorch tensor.
    • Returning Data and Label: It returns the processed image tensor and its corresponding class label.
    • Overriding the __len__ Method: The sources also explain the importance of overriding the __len__ method to return the correct number of samples in the custom dataset. They demonstrate a simple implementation that returns the length of the list of image file paths.
    • Visualizing Samples from the Custom Dataset: The sources emphasize the importance of visually inspecting samples from the custom dataset to ensure that the data is loaded and processed correctly. They guide readers through creating a function to display random images from the dataset, including their labels, to verify the dataset’s integrity and the effectiveness of applied transformations.
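
    A condensed sketch of such a dataset class, assuming the class-per-subfolder layout and .jpg images (the directory structure and details are illustrative):

    import pathlib
    from PIL import Image
    from torch.utils.data import Dataset

    class FoodVisionMini(Dataset):
        def __init__(self, root_dir: str, transform=None):
            self.paths = list(pathlib.Path(root_dir).glob("*/*.jpg"))  # class_name/image.jpg
            self.transform = transform
            self.classes = sorted({p.parent.name for p in self.paths})
            self.class_to_idx = {name: i for i, name in enumerate(self.classes)}

        def __len__(self) -> int:
            return len(self.paths)

        def __getitem__(self, index: int):
            path = self.paths[index]
            image = Image.open(path).convert("RGB")      # load the image with PIL
            label = self.class_to_idx[path.parent.name]  # label from the folder name
            if self.transform:
                image = self.transform(image)            # e.g. resize + ToTensor
            return image, label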

    The sources provide a detailed guide to understanding and implementing custom datasets in PyTorch. They explain the motivations for using custom datasets, the key methods to implement, and practical considerations for loading, processing, and visualizing data. They encourage readers to explore the flexibility of custom datasets and create their own to handle diverse data formats and structures for their specific machine learning tasks.

    Exploring Data Augmentation and Building the TinyVGG Model Architecture: Pages 691-700

    The sources introduce the concept of data augmentation, a powerful technique for enhancing the diversity and robustness of training datasets, and then guide readers through building the TinyVGG model architecture using PyTorch.

    • Visualizing the Effects of Data Augmentation: The sources demonstrate the visual effects of applying data augmentation techniques to images from the custom food dataset. They showcase examples where images have been:
    • Cropped: Portions of the original images have been removed, potentially changing the focus or composition.
    • Darkened/Brightened: The overall brightness or contrast of the images has been adjusted, simulating variations in lighting conditions.
    • Shifted: The content of the images has been moved within the frame, altering the position of objects.
    • Rotated: The images have been rotated by a certain angle, introducing variations in orientation.
    • Color-Modified: The color balance or saturation of the images has been altered, simulating variations in color perception.

    The sources emphasize that applying these augmentations randomly during training can help the model learn more robust and generalizable features, making it less sensitive to variations in image appearance and less prone to overfitting the training data.

    • Creating a Function to Display Random Transformed Images: The sources provide code for creating a function to display random images from the custom dataset after they have been transformed using data augmentation techniques. This function allows for visual inspection of the augmented images, helping readers understand the impact of different transformations on the dataset. They explain how this function can be used to:
    • Verify Transformations: Ensure that the intended augmentations are being applied correctly to the images.
    • Assess Augmentation Strength: Evaluate whether the strength or intensity of the augmentations is appropriate for the dataset and task.
    • Visualize Data Diversity: Observe the increased diversity in the dataset resulting from data augmentation.
    • Implementing the TinyVGG Model Architecture: The sources guide readers through implementing the TinyVGG model architecture, a convolutional neural network architecture known for its simplicity and effectiveness in image classification tasks. They outline the key building blocks of the TinyVGG model:
    • Convolutional Blocks (conv_block): The model uses multiple convolutional blocks, each consisting of:
    • Convolutional Layers (nn.Conv2d): These layers apply learnable filters to the input image, extracting features at different scales and orientations.
    • ReLU Activation Layers (nn.ReLU): These layers introduce non-linearity into the model, allowing it to learn complex patterns in the data.
    • Max Pooling Layers (nn.MaxPool2d): These layers downsample the feature maps, reducing their spatial dimensions while retaining the most important features.
    • Classifier Layer: The convolutional blocks are followed by a classifier layer, which consists of:
    • Flatten Layer (nn.Flatten): This layer converts the multi-dimensional feature maps from the convolutional blocks into a one-dimensional feature vector.
    • Linear Layer (nn.Linear): This layer performs a linear transformation on the feature vector, producing output logits that represent the model’s predictions for each class.

    The sources emphasize the hierarchical structure of the TinyVGG model, where the convolutional blocks progressively extract more abstract and complex features from the input image, and the classifier layer uses these features to make predictions. They explain that the TinyVGG model’s simple yet effective design makes it a suitable choice for various image classification tasks, and its modular structure allows for customization and experimentation with different layer configurations.

    • Troubleshooting Shape Mismatches: The sources address the common issue of shape mismatches that can occur when building deep learning models, emphasizing the importance of carefully checking the input and output dimensions of each layer:
    • Using Error Messages as Guides: They explain that error messages related to shape mismatches can provide valuable clues for identifying the source of the issue.
    • Printing Shapes for Verification: They recommend printing the shapes of tensors at various points in the model to verify that the dimensions are as expected and to trace the flow of data through the model.
    • Calculating Shapes Manually: They suggest calculating the expected output shapes of convolutional and pooling layers manually, considering factors like kernel size, stride, and padding, to ensure that the model is structured correctly (this calculation is sketched below, after the torchinfo overview).
    • Using torchinfo for Model Summary: The sources introduce the torchinfo package, a useful tool for visualizing the structure and parameters of a PyTorch model. They explain that torchinfo can provide a comprehensive summary of the model, including:
    • Layer Information: The type and configuration of each layer in the model.
    • Input and Output Shapes: The expected dimensions of tensors at each stage of the model.
    • Number of Parameters: The total number of trainable parameters in the model.
    • Memory Usage: An estimate of the model’s memory requirements.
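
    For one spatial dimension, the output size of a convolutional or pooling layer is floor((in + 2 * padding - kernel_size) / stride) + 1. A small sketch of this calculation, plus a torchinfo summary of a stand-in model (substitute the actual TinyVGG instance):

    def conv_out_size(in_size: int, kernel_size: int, stride: int = 1, padding: int = 0) -> int:
        """Output height/width of a conv or pooling layer along one dimension."""
        return (in_size + 2 * padding - kernel_size) // stride + 1

    print(conv_out_size(64, kernel_size=3, padding=1))  # 64 -- 3x3 conv with padding 1 preserves size
    print(conv_out_size(64, kernel_size=2, stride=2))   # 32 -- 2x2 max pool halves it

    from torch import nn
    from torchinfo import summary

    model = nn.Sequential(  # stand-in; substitute the real TinyVGG instance
        nn.Conv2d(3, 10, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2),
        nn.Flatten(),
        nn.Linear(10 * 32 * 32, 3),
    )
    summary(model, input_size=(1, 3, 64, 64))  # layer shapes, parameter counts, memory estimate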

    The sources demonstrate how to use torchinfo to summarize the TinyVGG model, highlighting its ability to provide insights into the model’s architecture and complexity, and assist in debugging shape-related issues.

    The sources provide a practical guide to understanding and implementing data augmentation techniques, building the TinyVGG model architecture, and troubleshooting common issues. They emphasize the importance of visualizing the effects of augmentations, carefully checking layer shapes, and utilizing tools like torchinfo for model analysis. These steps lay the foundation for training the TinyVGG model on the custom food dataset in subsequent sections.

    Training and Evaluating the TinyVGG Model on a Custom Dataset: Pages 701-710

    The sources guide readers through training and evaluating the TinyVGG model on the custom food dataset, explaining how to implement training and evaluation loops, track model performance, and visualize results.

    • Preparing for Model Training: The sources outline the steps to prepare for training the TinyVGG model:
    • Setting a Random Seed: They emphasize the importance of setting a random seed for reproducibility. This ensures that the random initialization of model weights and any data shuffling during training is consistent across different runs, making it easier to compare and analyze results. [1]
    • Creating a List of Image Paths: They generate a list of paths to all the image files in the custom dataset. This list will be used to access and process images during training. [1]
    • Visualizing Data with PIL: They demonstrate how to use the Python Imaging Library (PIL) to:
    • Open and Display Images: Load and display images from the dataset using PIL.Image.open(). [2]
    • Convert Images to Arrays: Transform images into numerical arrays using np.array(), enabling further processing and analysis. [3]
    • Inspect Color Channels: Examine the red, green, and blue (RGB) color channels of images, understanding how color information is represented numerically. [3]
    • Implementing Image Transformations: They review the concept of image transformations and their role in preparing images for model input, highlighting:
    • Conversion to Tensors: Transforming images into PyTorch tensors, the required data format for inputting data into PyTorch models. [3]
    • Resizing and Cropping: Adjusting image dimensions to ensure consistency and compatibility with the model’s input layer. [3]
    • Normalization: Scaling pixel values to a common range or distribution (for example, to values between 0 and 1, or standardizing with a channel-wise mean and standard deviation) to improve model training stability and efficiency. [3]
    • Data Augmentation: Applying random transformations to images during training to increase data diversity and prevent overfitting. [4]
    • Utilizing ImageFolder for Data Loading: The sources demonstrate the convenience of using the torchvision.datasets.ImageFolder class for loading images from a directory structured according to image classification standards. They explain how ImageFolder:
    • Organizes Data by Class: Automatically infers class labels based on the subfolder structure of the image directory, streamlining data organization. [5]
    • Provides Data Length: Offers a __len__ method to determine the number of samples in the dataset, useful for tracking progress during training. [5]
    • Enables Sample Access: Implements a __getitem__ method to retrieve a specific image and its corresponding label based on its index, facilitating data access during training. [5]
    • Creating DataLoader for Batch Processing: The sources emphasize the importance of using the torch.utils.data.DataLoader class to create data loaders (a minimal loading pipeline is sketched at the end of this list), explaining their role in:
    • Batching Data: Grouping multiple images and labels into batches, allowing the model to process multiple samples simultaneously, which can significantly speed up training. [6]
    • Shuffling Data: Randomizing the order of samples within batches to prevent the model from learning spurious patterns based on the order of data presentation. [6]
    • Loading Data Efficiently: Optimizing data loading and transfer, especially when working with large datasets, to minimize training time and resource usage. [6]
    • Visualizing a Sample and Label: The sources guide readers through visualizing an image and its label from the custom dataset using Matplotlib, allowing for a visual confirmation that the data is being loaded and processed correctly. [7]
    • Understanding Data Shape and Transformations: The sources highlight the importance of understanding how data shapes change as they pass through different stages of the model:
    • Color Channels First (NCHW): PyTorch often expects images in the format “Batch Size (N), Color Channels (C), Height (H), Width (W).” [8]
    • Transformations and Shape: They reiterate the importance of verifying that image transformations result in the expected output shapes, ensuring compatibility with subsequent layers. [8]
    • Replicating ImageFolder Functionality: The sources provide code for replicating the core functionality of ImageFolder manually. They explain that this exercise can deepen understanding of how custom datasets are created and provide a foundation for building more specialized datasets in the future. [9]
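
    A minimal sketch of this loading pipeline; the directory path, image size, and batch size are assumptions:

    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    data_transform = transforms.Compose([
        transforms.Resize((64, 64)),
        transforms.ToTensor(),
    ])

    # Class labels are inferred from subfolder names, e.g. pizza/, steak/, sushi/
    train_data = datasets.ImageFolder(root="data/train", transform=data_transform)
    train_dataloader = DataLoader(train_data, batch_size=32, shuffle=True)

    images, labels = next(iter(train_dataloader))
    print(images.shape)        # torch.Size([32, 3, 64, 64]) -- NCHW format
    print(train_data.classes)  # e.g. ['pizza', 'steak', 'sushi']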

    The sources meticulously guide readers through the essential steps of preparing data, loading it using ImageFolder, and creating data loaders for efficient batch processing. They emphasize the importance of data visualization, shape verification, and understanding the transformations applied to images. These detailed explanations set the stage for training and evaluating the TinyVGG model on the custom food dataset.

    Constructing the Training Loop and Evaluating Model Performance: Pages 711-720

    The sources focus on building the training loop and evaluating the performance of the TinyVGG model on the custom food dataset. They introduce techniques for tracking training progress, calculating loss and accuracy, and visualizing the training process.

    • Creating Training and Testing Step Functions: The sources explain the importance of defining separate functions for the training and testing steps. They guide readers through implementing these functions (a condensed sketch follows at the end of this list):
    • train_step Function: This function outlines the steps involved in a single training iteration. It includes:
    1. Setting the Model to Train Mode: The model is set to training mode (model.train()) to enable gradient calculations and updates during backpropagation.
    2. Performing a Forward Pass: The input data (images) is passed through the model to obtain the output predictions (logits).
    3. Calculating the Loss: The predicted logits are compared to the true labels using a loss function (e.g., cross-entropy loss), providing a measure of how well the model’s predictions match the actual data.
    4. Calculating the Accuracy: The model’s accuracy is calculated by determining the percentage of correct predictions.
    5. Zeroing Gradients: The gradients from the previous iteration are reset to zero (optimizer.zero_grad()) to prevent their accumulation and ensure that each iteration’s gradients are calculated independently.
    6. Performing Backpropagation: The gradients of the loss function with respect to the model’s parameters are calculated (loss.backward()), tracing the path of error back through the network.
    7. Updating Model Parameters: The optimizer updates the model’s parameters (optimizer.step()) based on the calculated gradients, adjusting the model’s weights and biases to minimize the loss function.
    8. Returning Loss and Accuracy: The function returns the calculated loss and accuracy for the current training iteration, allowing for performance monitoring.
    • test_step Function: This function performs a similar process to the train_step function, but without gradient calculations or parameter updates. It is designed to evaluate the model’s performance on a separate test dataset, providing an unbiased assessment of how well the model generalizes to unseen data.
    • Implementing the Training Loop: The sources outline the structure of the training loop, which iteratively trains and evaluates the model over a specified number of epochs:
    • Looping through Epochs: The loop iterates through the desired number of epochs, allowing the model to see and learn from the training data multiple times.
    • Looping through Batches: Within each epoch, the loop iterates through the batches of data provided by the training data loader.
    • Calling train_step and test_step: For each batch, the train_step function is called to train the model, and periodically, the test_step function is called to evaluate the model’s performance on the test dataset.
    • Tracking and Accumulating Loss and Accuracy: The loss and accuracy values from each batch are accumulated to calculate the average loss and accuracy for the entire epoch.
    • Printing Progress: The training progress, including epoch number, loss, and accuracy, is printed to the console, providing a real-time view of the model’s performance.
    • Using tqdm for Progress Bars: The sources recommend using the tqdm library to create progress bars, which visually display the progress of the training loop, making it easier to track how long each epoch takes and estimate the remaining training time.
    • Visualizing Training Progress with Loss Curves: The sources emphasize the importance of visualizing the model’s training progress by plotting loss curves. These curves show how the loss function changes over time (epochs or batches), providing insights into:
    • Model Convergence: Whether the model is successfully learning and reducing the error on the training data, indicated by a decreasing loss curve.
    • Overfitting: If the loss on the training data continues to decrease while the loss on the test data starts to increase, it might indicate that the model is overfitting the training data and not generalizing well to unseen data.
    • Understanding Ideal and Problematic Loss Curves: The sources provide examples of ideal and problematic loss curves, helping readers identify patterns that suggest healthy training progress or potential issues that may require adjustments to the model’s architecture, hyperparameters, or training process.
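
    A condensed sketch of the two step functions; the signatures and the per-batch accuracy calculation are reasonable assumptions rather than verbatim code from the sources:

    import torch

    def train_step(model, dataloader, loss_fn, optimizer, device):
        model.train()                                  # enable training behaviours
        total_loss, total_acc = 0.0, 0.0
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            logits = model(X)                          # forward pass
            loss = loss_fn(logits, y)                  # calculate the loss
            total_loss += loss.item()
            total_acc += (logits.argmax(dim=1) == y).float().mean().item()  # accuracy
            optimizer.zero_grad()                      # reset old gradients
            loss.backward()                            # backpropagation
            optimizer.step()                           # update parameters
        return total_loss / len(dataloader), total_acc / len(dataloader)

    def test_step(model, dataloader, loss_fn, device):
        model.eval()                                   # disable training behaviours
        total_loss, total_acc = 0.0, 0.0
        with torch.inference_mode():                   # no gradients, no updates
            for X, y in dataloader:
                X, y = X.to(device), y.to(device)
                logits = model(X)
                total_loss += loss_fn(logits, y).item()
                total_acc += (logits.argmax(dim=1) == y).float().mean().item()
        return total_loss / len(dataloader), total_acc / len(dataloader)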

    The sources provide a detailed guide to constructing the training loop, tracking model performance, and visualizing the training process. They explain how to implement training and testing steps, use tqdm for progress tracking, and interpret loss curves to monitor the model’s learning and identify potential issues. These steps are crucial for successfully training and evaluating the TinyVGG model on the custom food dataset.

    Experiment Tracking and Enhancing Model Performance: Pages 721-730

    The sources guide readers through tracking model experiments and exploring techniques to enhance the TinyVGG model’s performance on the custom food dataset. They explain methods for comparing results, adjusting hyperparameters, and introduce the concept of transfer learning.

    • Comparing Model Results: The sources introduce strategies for comparing the results of different model training experiments. They demonstrate how to:
    • Create a Dictionary to Store Results: Organize the results of each experiment, including loss, accuracy, and training time, into separate dictionaries for easy access and comparison.
    • Use Pandas DataFrames for Analysis: Leverage the power of Pandas DataFrames to:
    • Structure Results: Neatly organize the results from different experiments into a tabular format, facilitating clear comparisons.
    • Sort and Analyze Data: Sort and analyze the data to identify trends, such as which model configuration achieved the lowest loss or highest accuracy, and to observe how changes in hyperparameters affect performance.
    • Exploring Ways to Improve a Model: The sources discuss various techniques for improving the performance of a deep learning model, including:
    • Adjusting Hyperparameters: Modifying hyperparameters, such as the learning rate, batch size, and number of epochs, can significantly impact model performance. They suggest experimenting with these parameters to find optimal settings for a given dataset.
    • Adding More Layers: Increasing the depth of the model by adding more layers can potentially allow the model to learn more complex representations of the data, leading to improved accuracy.
    • Adding More Hidden Units: Increasing the number of hidden units in each layer can also enhance the model’s capacity to learn intricate patterns in the data.
    • Training for Longer: Training the model for more epochs can sometimes lead to further improvements, but it is crucial to monitor the loss curves for signs of overfitting.
    • Using a Different Optimizer: Different optimizers employ distinct strategies for updating model parameters. Experimenting with various optimizers, such as Adam or RMSprop, might yield better performance compared to the default stochastic gradient descent (SGD) optimizer.
    • Leveraging Transfer Learning: The sources introduce the concept of transfer learning, a powerful technique where a model pre-trained on a large dataset is used as a starting point for training on a smaller, related dataset. They explain how transfer learning can:
    • Improve Performance: Benefit from the knowledge gained by the pre-trained model, often resulting in faster convergence and higher accuracy on the target dataset.
    • Reduce Training Time: Leverage the pre-trained model’s existing feature representations, potentially reducing the need for extensive training from scratch.
    • Making Predictions on a Custom Image: The sources demonstrate how to use the trained model to make predictions on a custom image (a minimal pipeline is sketched at the end of this list). This involves:
    • Loading and Transforming the Image: Loading the image using PIL, applying the same transformations used during training (resizing, normalization, etc.), and converting the image to a PyTorch tensor.
    • Passing the Image through the Model: Inputting the transformed image tensor into the trained model to obtain the predicted logits.
    • Applying Softmax for Probabilities: Converting the raw logits into probabilities using the softmax function, indicating the model’s confidence in each class prediction.
    • Determining the Predicted Class: Selecting the class with the highest probability as the model’s prediction for the input image.
    • Understanding Model Performance: The sources emphasize the importance of evaluating the model’s performance both quantitatively and qualitatively:
    • Quantitative Evaluation: Using metrics like loss and accuracy to assess the model’s performance numerically, providing objective measures of its ability to learn and generalize.
    • Qualitative Evaluation: Examining predictions on individual images to gain insights into the model’s decision-making process. This can help identify areas where the model struggles and suggest potential improvements to the training data or model architecture.
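
    A minimal sketch of this prediction pipeline; the file name is hypothetical and the tiny linear model stands in for the trained network:

    import torch
    from PIL import Image
    from torch import nn
    from torchvision import transforms

    class_names = ["pizza", "steak", "sushi"]

    transform = transforms.Compose([
        transforms.Resize((64, 64)),  # match the training input size
        transforms.ToTensor(),
    ])

    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 3))  # stand-in model

    image = Image.open("my_food_photo.jpg").convert("RGB")  # hypothetical image file
    image_tensor = transform(image).unsqueeze(dim=0)        # add batch dim -> [1, 3, 64, 64]

    model.eval()
    with torch.inference_mode():
        logits = model(image_tensor)
    probs = torch.softmax(logits, dim=1)                    # confidence per class
    print(class_names[probs.argmax(dim=1).item()])          # predicted class name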

    The sources cover important aspects of tracking experiments, improving model performance, and making predictions. They explain methods for comparing results, discuss various hyperparameter tuning techniques and introduce transfer learning. They also guide readers through making predictions on custom images and emphasize the importance of both quantitative and qualitative evaluation to understand the model’s strengths and limitations.

    Building Custom Datasets with PyTorch: Pages 731-740

    The sources shift focus to constructing custom datasets in PyTorch. They explain the motivation behind creating custom datasets, walk through the process of building one for the food classification task, and highlight the importance of understanding the dataset structure and visualizing the data.

    • Understanding the Need for Custom Datasets: The sources explain that while pre-built datasets like FashionMNIST are valuable for learning and experimentation, real-world machine learning projects often require working with custom datasets specific to the problem at hand. Building custom datasets allows for greater flexibility and control over the data used for training models.
    • Creating a Custom ImageDataset Class: The sources guide readers through creating a custom dataset class named ImageDataset, which inherits from the Dataset class provided by PyTorch. They outline the key steps and methods involved:
    1. Initialization (__init__): This method initializes the dataset by:
    • Defining the root directory where the image data is stored.
    • Setting up the transformation pipeline to be applied to each image (e.g., resizing, normalization).
    • Creating a list of image file paths by recursively traversing the directory structure.
    • Generating a list of corresponding labels based on the image’s parent directory (representing the class).
    2. Calculating Dataset Length (__len__): This method returns the total number of samples in the dataset, determined by the length of the image file path list. This allows PyTorch’s data loaders to know how many samples are available.
    3. Getting a Sample (__getitem__): This method fetches a specific sample from the dataset given its index. It involves:
    • Retrieving the image file path and label corresponding to the provided index.
    • Loading the image using PIL.
    • Applying the defined transformations to the image.
    • Converting the image to a PyTorch tensor.
    • Returning the transformed image tensor and its associated label.
    • Mapping Class Names to Integers: The sources demonstrate a helper function that maps class names (e.g., “pizza”, “steak”, “sushi”) to integer labels (e.g., 0, 1, 2). This is necessary because PyTorch models and loss functions work with numerical labels (a sketch of such a helper follows this list).
    • Visualizing Samples and Labels: The sources stress the importance of visually inspecting the data to gain a better understanding of the dataset’s structure and contents. They guide readers through creating a function to display random images from the custom dataset along with their corresponding labels, allowing for a qualitative assessment of the data.
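
    A minimal sketch of such a mapping helper, assuming the class-per-subfolder layout (the name find_classes is an assumption):

    import os

    def find_classes(directory: str):
        """Map class subfolder names to integer labels, e.g. {'pizza': 0, 'steak': 1, 'sushi': 2}."""
        classes = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir())
        if not classes:
            raise FileNotFoundError(f"No class folders found in {directory}.")
        class_to_idx = {name: index for index, name in enumerate(classes)}
        return classes, class_to_idx

    # classes, class_to_idx = find_classes("data/train")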

    The sources provide a comprehensive overview of building custom datasets in PyTorch, specifically focusing on creating an ImageDataset class for image classification tasks. They outline the essential methods for initialization, calculating length, and retrieving samples, along with the process of mapping class names to integers and visualizing the data.

    Visualizing and Augmenting Custom Datasets: Pages 741-750

    The sources focus on visualizing data from the custom ImageDataset and introduce the concept of data augmentation as a technique to enhance model performance. They guide readers through creating a function to display random images from the dataset and explore various data augmentation techniques, specifically using the torchvision.transforms module.

    • Creating a Function to Display Random Images: The sources outline the steps involved in creating a function to visualize random images from the custom dataset, enabling a qualitative assessment of the data and the transformations applied (a condensed sketch of this function appears after the list below). They provide detailed guidance on:
    1. Function Definition: Define a function that accepts the dataset, class names, the number of images to display (defaulting to 10), and a boolean flag (display_shape) to optionally show the shape of each image.
    2. Limiting Display for Practicality: To prevent overwhelming the display, the function caps the maximum number of images to 10. If the user requests more than 10 images, the function automatically sets the limit to 10 and disables the display_shape option.
    3. Random Sampling: Generate a list of random indices within the range of the dataset’s length using random.sample. The number of indices to sample is determined by the n parameter (number of images to display).
    4. Setting up the Plot: Create a Matplotlib figure with a size adjusted based on the number of images to display.
    5. Iterating through Samples: Loop through the randomly sampled indices, retrieving the corresponding image and label from the dataset using the __getitem__ method.
    6. Creating Subplots: For each image, create a subplot within the Matplotlib figure, arranging them in a single row.
    7. Displaying Images: Use plt.imshow to display the image within its designated subplot.
    8. Setting Titles: Set the title of each subplot to display the class name of the image.
    9. Optional Shape Display: If the display_shape flag is True, print the shape of each image tensor below its subplot.
    • Introducing Data Augmentation: The sources highlight the importance of data augmentation, a technique that artificially increases the diversity of training data by applying various transformations to the original images. Data augmentation helps improve the model’s ability to generalize and reduces the risk of overfitting. They provide a conceptual explanation of data augmentation and its benefits, emphasizing its role in enhancing model robustness and performance.
    • Exploring torchvision.transforms: The sources guide readers through the torchvision.transforms module, a valuable tool in PyTorch that provides a range of image transformations for data augmentation. They discuss specific transformations like:
    • RandomHorizontalFlip: Randomly flips the image horizontally with a given probability.
    • RandomRotation: Rotates the image by a random angle within a specified range.
    • ColorJitter: Randomly adjusts the brightness, contrast, saturation, and hue of the image.
    • RandomResizedCrop: Crops a random portion of the image and resizes it to a given size.
    • ToTensor: Converts the PIL image to a PyTorch tensor.
    • Normalize: Normalizes the image tensor using specified mean and standard deviation values.
    • Visualizing Transformed Images: The sources demonstrate how to visualize images after applying data augmentation transformations. They create a new transformation pipeline incorporating the desired augmentations and then use the previously defined function to display random images from the dataset after they have been transformed.
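
    A condensed sketch of the display function described above, assuming the dataset returns image tensors in channels-first (CHW) format:

    import random
    import matplotlib.pyplot as plt

    def display_random_images(dataset, classes, n: int = 10, display_shape: bool = True):
        if n > 10:                     # cap the display at 10 images for readability
            n, display_shape = 10, False
        indices = random.sample(range(len(dataset)), k=n)
        plt.figure(figsize=(16, 8))
        for i, index in enumerate(indices):
            image, label = dataset[index]        # calls the dataset's __getitem__
            plt.subplot(1, n, i + 1)
            plt.imshow(image.permute(1, 2, 0))   # CHW -> HWC for matplotlib
            plt.axis("off")
            title = classes[label]
            if display_shape:
                title += f"\n{tuple(image.shape)}"
            plt.title(title, fontsize=8)
        plt.show()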

    The sources provide valuable insights into visualizing custom datasets and leveraging data augmentation to improve model training. They explain the creation of a function to display random images, introduce data augmentation as a concept, and explore various transformations provided by the torchvision.transforms module. They also demonstrate how to visualize the effects of these transformations, allowing for a better understanding of how they augment the training data.

    Implementing a Convolutional Neural Network for Food Classification: Pages 751-760

    The sources shift focus to building and training a convolutional neural network (CNN) to classify images from the custom food dataset. They walk through the process of implementing a TinyVGG architecture, setting up training and testing functions, and evaluating the model’s performance.

    • Building a TinyVGG Architecture: The sources introduce the TinyVGG architecture as a simplified version of the popular VGG network, known for its effectiveness in image classification tasks. They provide a step-by-step guide to constructing the TinyVGG model using PyTorch:
    1. Defining Input Shape and Hidden Units: Establish the input shape of the images, considering the number of color channels, height, and width. Also, determine the number of hidden units to use in convolutional layers.
    2. Constructing Convolutional Blocks: Create two convolutional blocks, each consisting of:
    • A 2D convolutional layer (nn.Conv2d) to extract features from the input images.
    • A ReLU activation function (nn.ReLU) to introduce non-linearity.
    • Another 2D convolutional layer.
    • Another ReLU activation function.
    • A max-pooling layer (nn.MaxPool2d) to downsample the feature maps, reducing their spatial dimensions.
    3. Creating the Classifier Layer: Define the classifier layer, responsible for producing the final classification output. This layer comprises:
    • A flattening layer (nn.Flatten) to convert the multi-dimensional feature maps from the convolutional blocks into a one-dimensional feature vector.
    • A linear layer (nn.Linear) to perform the final classification, mapping the features to the number of output classes.
    • A ReLU activation function.
    • Another linear layer to produce the final output with the desired number of classes.
    4. Combining Layers in nn.Sequential: Utilize nn.Sequential to organize and connect the convolutional blocks and the classifier layer in a sequential manner, defining the flow of data through the model.
    • Verifying Model Architecture with torchinfo: The sources introduce the torchinfo package as a helpful tool for summarizing and verifying the architecture of a PyTorch model. They demonstrate its usage by passing the created TinyVGG model to torchinfo.summary, providing a concise overview of the model’s layers, input and output shapes, and the number of trainable parameters.
    • Setting up Training and Testing Functions: The sources outline the process of creating functions for training and testing the TinyVGG model. They provide a detailed explanation of the steps involved in each function:
    • Training Function (train_step): This function handles a single training step, accepting the model, data loader, loss function, optimizer, and device as input:
    1. Set the model to training mode (model.train()).
    2. Iterate through batches of data from the data loader.
    3. For each batch, send the input data and labels to the specified device.
    4. Perform a forward pass through the model to obtain predictions (logits).
    5. Calculate the loss using the provided loss function.
    6. Perform backpropagation to compute gradients.
    7. Update model parameters using the optimizer.
    8. Accumulate training loss for the epoch.
    9. Return the average training loss.
    • Testing Function (test_step): This function evaluates the model’s performance on a given dataset, accepting the model, data loader, loss function, and device as input:
    1. Set the model to evaluation mode (model.eval()).
    2. Disable gradient calculation using torch.no_grad().
    3. Iterate through batches of data from the data loader.
    4. For each batch, send the input data and labels to the specified device.
    5. Perform a forward pass through the model to obtain predictions.
    6. Calculate the loss.
    7. Accumulate testing loss.
    8. Return the average testing loss.
    • Training and Evaluating the Model: The sources guide readers through the process of training the TinyVGG model using the defined training function. They outline steps such as:
    1. Instantiating the model and moving it to the desired device (CPU or GPU).
    2. Defining the loss function (e.g., cross-entropy loss) and optimizer (e.g., SGD).
    3. Setting up the training loop for a specified number of epochs.
    4. Calling the train_step function for each epoch to train the model on the training data.
    5. Evaluating the model’s performance on the test data using the test_step function.
    6. Tracking and printing training and testing losses for each epoch.
    • Visualizing the Loss Curve: The sources emphasize the importance of visualizing the loss curve to monitor the model’s training progress and detect potential issues like overfitting or underfitting. They provide guidance on creating a plot showing the training loss over epochs, allowing users to observe how the loss decreases as the model learns.
    • Preparing for Model Improvement: The sources acknowledge that the initial performance of the TinyVGG model may not be optimal. They suggest various techniques to potentially improve the model’s performance in subsequent steps, paving the way for further experimentation and model refinement.

    The sources offer a comprehensive walkthrough of building and training a TinyVGG model for image classification using a custom food dataset. They detail the architecture of the model, explain the training and testing procedures, and highlight the significance of visualizing the loss curve. They also lay the foundation for exploring techniques to enhance the model’s performance in later stages.

    Improving Model Performance and Tracking Experiments: Pages 761-770

    The sources transition from establishing a baseline model to exploring techniques for enhancing its performance and introduce methods for tracking experimental results. They focus on data augmentation strategies using the torchvision.transforms module and creating a system for comparing different model configurations.

    • Evaluating the Custom ImageDataset: The sources revisit the custom ImageDataset created earlier, emphasizing the importance of assessing its functionality. They use the previously defined plot_random_images function to visually inspect a sample of images from the dataset, confirming that the images are loaded correctly and transformed as intended.
    • Data Augmentation for Enhanced Performance: The sources delve deeper into data augmentation as a crucial technique for improving the model’s ability to generalize to unseen data. They highlight how data augmentation artificially increases the diversity and size of the training data, leading to more robust models that are less prone to overfitting.
    • Exploring torchvision.transforms for Augmentation: The sources guide users through different data augmentation techniques available in the torchvision.transforms module. They explain the purpose and effects of various transformations, including:
    • RandomHorizontalFlip: Randomly flips the image horizontally, adding variability to the dataset.
    • RandomRotation: Rotates the image by a random angle within a specified range, exposing the model to different orientations.
    • ColorJitter: Randomly adjusts the brightness, contrast, saturation, and hue of the image, making the model more robust to variations in lighting and color.
    • Visualizing Augmented Images: The sources demonstrate how to visualize the effects of data augmentation by applying transformations to images and then displaying the transformed images. This visual inspection helps understand the impact of the augmentations and ensure they are applied correctly.
    • Introducing TrivialAugment: The sources introduce TrivialAugment, a data augmentation strategy that randomly selects a single simple augmentation for each image and applies it at a random strength (see the code sketch after this list). They explain that TrivialAugment has been shown to be effective in improving model performance, particularly when combined with other techniques. They provide a link to a research paper for further reading on TrivialAugment, encouraging users to explore the strategy in more detail.
    • Applying TrivialAugment to the Custom Dataset: The sources guide users through applying TrivialAugment to the custom food dataset. They create a new transformation pipeline incorporating TrivialAugment and then use the plot_random_images function to display a sample of augmented images, allowing users to visually assess the impact of the augmentations.
    • Creating a System for Comparing Model Results: The sources shift focus to establishing a structured approach for tracking and comparing the performance of different model configurations. They create a dictionary called compare_results to store results from various model experiments. This dictionary is designed to hold information such as training time, training loss, testing loss, and testing accuracy for each model.
    • Setting Up a Pandas DataFrame: The sources introduce Pandas DataFrames as a convenient tool for organizing and analyzing experimental results. They convert the compare_results dictionary into a Pandas DataFrame, providing a structured table-like representation of the results, making it easier to compare the performance of different models.
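
    To make the augmentation pipeline described above concrete, here is a minimal sketch assuming torchvision ≥ 0.12 (which ships transforms.TrivialAugmentWide) and the 64×64 image size used in the course:

    ```python
    from torchvision import transforms

    # Training transform: resize, apply one random augmentation per image, tensorize
    train_transform = transforms.Compose([
        transforms.Resize((64, 64)),
        transforms.TrivialAugmentWide(num_magnitude_bins=31),
        transforms.ToTensor(),
    ])

    # Test transform: no augmentation, just resize and convert to a tensor
    test_transform = transforms.Compose([
        transforms.Resize((64, 64)),
        transforms.ToTensor(),
    ])
    ```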

    The sources provide valuable insights into techniques for improving model performance, specifically focusing on data augmentation strategies. They guide users through various transformations available in the torchvision.transforms module, explain the concept and benefits of TrivialAugment, and demonstrate how to visualize the effects of these augmentations. Moreover, they introduce a structured approach for tracking and comparing experimental results using a dictionary and a Pandas DataFrame, laying the groundwork for systematic model experimentation and analysis.

    Predicting on a Custom Image and Wrapping Up the Custom Datasets Section: Pages 771-780

    The sources shift focus to making predictions on a custom image using the trained TinyVGG model and summarize the key concepts covered in the custom datasets section. They guide users through the process of preparing the image, making predictions, and analyzing the results.

    • Preparing a Custom Image for Prediction: The sources outline the steps for preparing a custom image for prediction:
    1. Obtaining the Image: Acquire an image that aligns with the classes the model was trained on. In this case, the image should be of either pizza, steak, or sushi.
    2. Resizing and Converting to RGB: Ensure the image is resized to the dimensions expected by the model (64×64 in this case) and converted to RGB format. This resizing step is crucial as the model was trained on images with specific dimensions and expects the same input format during prediction.
    3. Converting to a PyTorch Tensor: Transform the image into a PyTorch tensor using torchvision.transforms.ToTensor(). This conversion is necessary to feed the image data into the PyTorch model.
    • Making Predictions with the Trained Model: The sources walk through the process of using the trained TinyVGG model to make predictions on the prepared custom image (sketched in code after this list):
    1. Setting the Model to Evaluation Mode: Switch the model to evaluation mode using model.eval(). This step ensures that the model behaves appropriately for prediction, deactivating functionalities like dropout that are only used during training.
    2. Performing a Forward Pass: Pass the prepared image tensor through the model to obtain the model’s predictions (logits).
    3. Applying Softmax to Obtain Probabilities: Convert the raw logits into prediction probabilities using the softmax function (torch.softmax()). Softmax transforms the logits into a probability distribution, where each value represents the model’s confidence in the image belonging to a particular class.
    4. Determining the Predicted Class: Identify the class with the highest predicted probability, representing the model’s final prediction for the input image.
    • Analyzing the Prediction Results: The sources emphasize the importance of carefully analyzing the prediction results, considering both quantitative and qualitative aspects. They highlight that even if the model’s accuracy may not be perfect, a qualitative assessment of the predictions can provide valuable insights into the model’s behavior and potential areas for improvement.
    • Summarizing the Custom Datasets Section: The sources provide a comprehensive summary of the key concepts covered in the custom datasets section:
    1. Understanding Custom Datasets: They reiterate the importance of working with custom datasets, especially when dealing with domain-specific problems or when pre-trained models may not be readily available. They emphasize the ability of custom datasets to address unique challenges and tailor models to specific needs.
    2. Building a Custom Dataset: They recap the process of building a custom dataset using torchvision.datasets.ImageFolder. They highlight the benefits of ImageFolder for handling image data organized in standard image classification format, where images are stored in separate folders representing different classes.
    3. Creating a Custom ImageDataset Class: They review the steps involved in creating a custom ImageDataset class, demonstrating the flexibility and control this approach offers for handling and processing data. They explain the key methods required for a custom dataset, including __init__, __len__, and __getitem__, and how these methods interact with the data loader.
    4. Data Augmentation Techniques: They emphasize the importance of data augmentation for improving model performance, particularly in scenarios where the training data is limited. They reiterate the techniques explored earlier, including random horizontal flipping, random rotation, color jittering, and TrivialAugment, highlighting how these techniques can enhance the model’s ability to generalize to unseen data.
    5. Training and Evaluating Models: They summarize the process of training and evaluating models on custom datasets, highlighting the steps involved in setting up training loops, evaluating model performance, and visualizing results.
    • Introducing Exercises and Extra Curriculum: The sources conclude the custom datasets section by providing a set of exercises and extra curriculum resources to reinforce the concepts covered. They direct users to the learnpytorch.io website and the pytorch-deep-learning GitHub repository for exercise templates, example solutions, and additional learning materials.
    • Previewing Upcoming Sections: The sources briefly preview the upcoming sections of the course, hinting at topics like transfer learning, model experiment tracking, paper replicating, and more advanced architectures. They encourage users to continue their learning journey, exploring more complex concepts and techniques in deep learning with PyTorch.
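
    A minimal sketch of the prediction steps described above, assuming a trained model and a class_names list are already defined; the image file name is hypothetical:

    ```python
    import torch
    from PIL import Image
    from torchvision import transforms

    # Load the image and make sure it's RGB
    img = Image.open("custom_image.jpg").convert("RGB")  # hypothetical file name

    # Resize to 64x64 and convert to a tensor, matching the training setup
    transform = transforms.Compose([
        transforms.Resize((64, 64)),
        transforms.ToTensor(),
    ])
    img_tensor = transform(img)

    model.eval()  # evaluation mode: deactivates dropout etc.
    with torch.inference_mode():
        logits = model(img_tensor.unsqueeze(0))       # add a batch dimension
    probs = torch.softmax(logits, dim=1)              # logits -> probabilities
    pred_idx = probs.argmax(dim=1).item()             # index of the most likely class
    print(class_names[pred_idx], probs.max().item())  # predicted class and confidence
    ```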

    The sources provide a practical guide to making predictions on a custom image using a trained TinyVGG model, carefully explaining the preparation steps, prediction process, and analysis of results. Additionally, they offer a concise summary of the key concepts covered in the custom datasets section, reinforcing the understanding of custom datasets, data augmentation techniques, and model training and evaluation. Finally, they introduce exercises and extra curriculum resources to encourage further practice and learning while previewing the exciting topics to come in the remainder of the course.

    Setting Up a TinyVGG Model and Exploring Model Architectures: Pages 781-790

    The sources transition from data preparation and augmentation to building a convolutional neural network (CNN) model using the TinyVGG architecture. They guide users through the process of defining the model’s architecture, understanding its components, and preparing it for training.

    • Introducing the TinyVGG Architecture: The sources introduce TinyVGG, a simplified version of the VGG (Visual Geometry Group) architecture, known for its effectiveness in image classification tasks. They provide a visual representation of the TinyVGG architecture, outlining its key components, including:
    • Convolutional Blocks: The foundation of TinyVGG, composed of convolutional layers (nn.Conv2d) followed by ReLU activation functions (nn.ReLU) and max-pooling layers (nn.MaxPool2d). Convolutional layers extract features from the input images, ReLU introduces non-linearity, and max-pooling downsamples the feature maps, reducing their dimensionality and making the model more robust to variations in the input.
    • Classifier Layer: The final layer of TinyVGG, responsible for classifying the extracted features into different categories. It consists of a flattening layer (nn.Flatten), which converts the multi-dimensional feature maps from the convolutional blocks into a single vector, followed by a linear layer (nn.Linear) that outputs a score for each class.
    • Building a TinyVGG Model in PyTorch: The sources provide a step-by-step guide to building a TinyVGG model in PyTorch using the nn.Module class. They explain the structure of the model definition, outlining the key components:
    1. __init__ Method: Initializes the model’s layers and components, including convolutional blocks and the classifier layer.
    2. forward Method: Defines the forward pass of the model, specifying how the input data flows through the different layers and operations.
    • Understanding Input and Output Shapes: The sources emphasize the importance of understanding and verifying the input and output shapes of each layer in the model. They guide users through calculating the dimensions of the feature maps at different stages of the network, taking into account factors such as the kernel size, stride, and padding of the convolutional layers. This understanding of shape transformations is crucial for ensuring that data flows correctly through the network and for debugging potential shape mismatches.
    • Passing a Random Tensor Through the Model: The sources recommend passing a random tensor with the expected input shape through the model as a preliminary step to verify the model’s architecture and identify potential shape errors. This technique helps ensure that data can successfully flow through the network before proceeding with training.
    • Introducing torchinfo for Model Summary: The sources introduce the torchinfo package as a helpful tool for summarizing PyTorch models. They demonstrate how to use torchinfo.summary to obtain a concise overview of the model’s architecture, including the input and output shapes of each layer and the number of trainable parameters. This package provides a convenient way to visualize and verify the model’s structure, making it easier to understand and debug.
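
    Putting the pieces together, here is a sketch of a TinyVGG-style definition consistent with the description above (hidden_units=10 and padding=1 are illustrative choices, not necessarily the course's exact values), followed by the random-tensor check and an optional torchinfo summary:

    ```python
    import torch
    from torch import nn

    class TinyVGG(nn.Module):
        """Sketch of a TinyVGG-style CNN: two conv blocks + a classifier."""
        def __init__(self, input_shape: int, hidden_units: int, output_shape: int):
            super().__init__()
            self.conv_block_1 = nn.Sequential(
                nn.Conv2d(input_shape, hidden_units, kernel_size=3, stride=1, padding=1),
                nn.ReLU(),
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, stride=1, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2),  # 64x64 -> 32x32
            )
            self.conv_block_2 = nn.Sequential(
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, stride=1, padding=1),
                nn.ReLU(),
                nn.Conv2d(hidden_units, hidden_units, kernel_size=3, stride=1, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2),  # 32x32 -> 16x16
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(hidden_units * 16 * 16, output_shape),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.classifier(self.conv_block_2(self.conv_block_1(x)))

    model = TinyVGG(input_shape=3, hidden_units=10, output_shape=3)

    # Pass a random tensor with the expected input shape to verify the architecture
    print(model(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 3])

    # Optional: summarize the model with torchinfo (pip install torchinfo)
    # from torchinfo import summary
    # summary(model, input_size=(1, 3, 64, 64))
    ```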

    The sources provide a detailed walkthrough of building a TinyVGG model in PyTorch, explaining the architecture’s components, the steps involved in defining the model using nn.Module, and the significance of understanding input and output shapes. They introduce practical techniques like passing a random tensor through the model for verification and leverage the torchinfo package for obtaining a comprehensive model summary. These steps lay a solid foundation for building and understanding CNN models for image classification tasks.

    Training the TinyVGG Model and Evaluating its Performance: Pages 791-800

    The sources shift focus to training the constructed TinyVGG model on the custom food image dataset. They guide users through creating training and testing functions, setting up a training loop, and evaluating the model’s performance using metrics like loss and accuracy.

    • Creating Training and Testing Functions: The sources outline the process of creating separate functions for the training and testing steps, promoting modularity and code reusability (both functions are sketched in code after this list).
    • train_step Function: This function performs a single training step, encompassing the forward pass, loss calculation, backpropagation, and parameter updates.
    1. Forward Pass: It takes a batch of data from the training dataloader, passes it through the model, and obtains the model’s predictions.
    2. Loss Calculation: It calculates the loss between the predictions and the ground truth labels using a chosen loss function (e.g., cross-entropy loss for classification).
    3. Backpropagation: It computes the gradients of the loss with respect to the model’s parameters using the loss.backward() method. Backpropagation determines how each parameter contributed to the error, guiding the optimization process.
    4. Parameter Updates: It updates the model’s parameters based on the computed gradients using an optimizer (e.g., stochastic gradient descent). The optimizer adjusts the parameters to minimize the loss, improving the model’s performance over time.
    5. Accuracy Calculation: It calculates the accuracy of the model’s predictions on the current batch of training data. Accuracy measures the proportion of correctly classified samples.
    • test_step Function: This function evaluates the model’s performance on a batch of test data, computing the loss and accuracy without updating the model’s parameters.
    1. Forward Pass: It takes a batch of data from the testing dataloader, passes it through the model, and obtains the model’s predictions. The model’s behavior is set to evaluation mode (model.eval()) before performing the forward pass to ensure that training-specific functionalities like dropout are deactivated.
    2. Loss Calculation: It calculates the loss between the predictions and the ground truth labels using the same loss function as in train_step.
    3. Accuracy Calculation: It calculates the accuracy of the model’s predictions on the current batch of testing data.
    • Setting up a Training Loop: The sources demonstrate the implementation of a training loop that iterates through the training data for a specified number of epochs, calling the train_step and test_step functions at each epoch.
    1. Epoch Iteration: The loop iterates for a predefined number of epochs, each epoch representing a complete pass through the entire training dataset.
    2. Training Phase: For each epoch, the loop iterates through the batches of training data provided by the training dataloader, calling the train_step function for each batch. The train_step function performs the forward pass, loss calculation, backpropagation, and parameter updates as described above. The training loss and accuracy values are accumulated across all batches within an epoch.
    3. Testing Phase: After each epoch, the loop iterates through the batches of testing data provided by the testing dataloader, calling the test_step function for each batch. The test_step function computes the loss and accuracy on the testing data without updating the model’s parameters. The testing loss and accuracy values are also accumulated across all batches.
    4. Printing Progress: The loop prints the training and testing loss and accuracy values at regular intervals, typically after each epoch or a set number of epochs. This step provides feedback on the model’s progress and allows for monitoring its performance over time.
    • Visualizing Training Progress: The sources highlight the importance of visualizing the training process, particularly the loss curves, to gain insights into the model’s behavior and identify potential issues like overfitting or underfitting. They suggest plotting the training and testing losses over epochs to observe how the loss values change during training.
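
    A sketch of what the two functions might look like, assuming a multi-class setup with cross-entropy loss; the function names follow the course, while the argument order is illustrative:

    ```python
    import torch

    def train_step(model, dataloader, loss_fn, optimizer, device):
        model.train()
        train_loss, train_acc = 0.0, 0.0
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            y_logits = model(X)          # 1. forward pass
            loss = loss_fn(y_logits, y)  # 2. calculate the loss
            optimizer.zero_grad()        # 3. zero gradients
            loss.backward()              # 4. backpropagation
            optimizer.step()             # 5. update the parameters
            train_loss += loss.item()
            train_acc += (y_logits.argmax(dim=1) == y).float().mean().item()
        return train_loss / len(dataloader), train_acc / len(dataloader)

    def test_step(model, dataloader, loss_fn, device):
        model.eval()                      # evaluation mode (deactivates dropout etc.)
        test_loss, test_acc = 0.0, 0.0
        with torch.inference_mode():      # no gradient tracking needed
            for X, y in dataloader:
                X, y = X.to(device), y.to(device)
                y_logits = model(X)
                test_loss += loss_fn(y_logits, y).item()
                test_acc += (y_logits.argmax(dim=1) == y).float().mean().item()
        return test_loss / len(dataloader), test_acc / len(dataloader)
    ```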

    The sources guide users through setting up a robust training pipeline for the TinyVGG model, emphasizing modularity through separate training and testing functions and a structured training loop. They recommend monitoring and visualizing training progress, particularly using loss curves, to gain a deeper understanding of the model’s behavior and performance. These steps provide a practical foundation for training and evaluating CNN models on custom image datasets.

    Training and Experimenting with the TinyVGG Model on a Custom Dataset: Pages 801-810

    The sources guide users through training their TinyVGG model on the custom food image dataset using the training functions and loop set up in the previous steps. They emphasize the importance of tracking and comparing model results, including metrics like loss, accuracy, and training time, to evaluate performance and make informed decisions about model improvements.

    • Tracking Model Results: The sources recommend using a dictionary to store the training and testing results for each epoch, including the training loss, training accuracy, testing loss, and testing accuracy. This approach allows users to track the model’s performance over epochs and to easily compare the results of different models or training configurations. [1]
    • Setting Up the Training Process: The sources provide code for setting up the training process (a sketch follows this list), including:
    1. Initializing a Results Dictionary: Creating a dictionary to store the model’s training and testing results. [1]
    2. Implementing the Training Loop: Utilizing the tqdm library to display a progress bar during training and iterating through the specified number of epochs. [2]
    3. Calling Training and Testing Functions: Invoking the train_step and test_step functions for each epoch, passing in the necessary arguments, including the model, dataloaders, loss function, optimizer, and device. [3]
    4. Updating the Results Dictionary: Storing the training and testing loss and accuracy values for each epoch in the results dictionary. [2]
    5. Printing Epoch Results: Displaying the training and testing results for each epoch. [3]
    6. Calculating and Printing Total Training Time: Measuring the total time taken for training and printing the result. [4]
    • Evaluating and Comparing Model Results: The sources guide users through plotting the training and testing losses and accuracies over epochs to visualize the model’s performance. They explain how to analyze the loss curves for insights into the training process, such as identifying potential overfitting or underfitting. [5, 6] They also recommend comparing the results of different models trained with various configurations to understand the impact of different architectural choices or hyperparameters on performance. [7]
    • Improving Model Performance: Building upon the visualization and comparison of results, the sources discuss strategies for improving the model’s performance, including:
    1. Adding More Layers: Increasing the depth of the model to enable it to learn more complex representations of the data. [8]
    2. Adding More Hidden Units: Expanding the capacity of each layer to enhance its ability to capture intricate patterns in the data. [8]
    3. Training for Longer: Increasing the number of epochs to allow the model more time to learn from the data. [9]
    4. Using a Smaller Learning Rate: Adjusting the learning rate, which determines the step size during parameter updates, to potentially improve convergence and prevent oscillations around the optimal solution. [8]
    5. Trying a Different Optimizer: Exploring alternative optimization algorithms, each with its unique approach to updating parameters, to potentially find one that better suits the specific problem. [8]
    6. Using Learning Rate Decay: Gradually reducing the learning rate over epochs to fine-tune the model and improve convergence towards the optimal solution. [8]
    7. Adding Regularization Techniques: Implementing methods like dropout or weight decay to prevent overfitting, which occurs when the model learns the training data too well and performs poorly on unseen data. [8]
    • Visualizing Loss Curves: The sources emphasize the importance of understanding and interpreting loss curves to gain insights into the training process. They provide visual examples of different loss curve shapes and explain how to identify potential issues like overfitting or underfitting based on the curves’ behavior. They also offer guidance on interpreting ideal loss curves and discuss strategies for addressing problems like overfitting or underfitting, pointing to additional resources for further exploration. [5, 10]
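
    A sketch of the training process described above, assuming the train_step and test_step functions from the previous section; the epoch count is illustrative:

    ```python
    from timeit import default_timer as timer
    from tqdm.auto import tqdm

    results = {"train_loss": [], "train_acc": [], "test_loss": [], "test_acc": []}

    start_time = timer()
    for epoch in tqdm(range(5)):  # 5 epochs, purely illustrative
        train_loss, train_acc = train_step(model, train_dataloader, loss_fn, optimizer, device)
        test_loss, test_acc = test_step(model, test_dataloader, loss_fn, device)

        results["train_loss"].append(train_loss)
        results["train_acc"].append(train_acc)
        results["test_loss"].append(test_loss)
        results["test_acc"].append(test_acc)

        print(f"Epoch {epoch}: train_loss={train_loss:.4f} | test_loss={test_loss:.4f}")
    end_time = timer()
    print(f"Total training time: {end_time - start_time:.2f}s")
    ```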

    The sources offer a structured approach to training and evaluating the TinyVGG model on a custom food image dataset, encouraging the use of dictionaries to track results, visualizing performance through loss curves, and comparing different model configurations. They discuss potential areas for model improvement and highlight resources for delving deeper into advanced techniques like learning rate scheduling and regularization. These steps empower users to systematically experiment, analyze, and enhance their models’ performance on image classification tasks using custom datasets.

    Evaluating Model Performance and Introducing Data Augmentation: Pages 811-820

    The sources emphasize the need to comprehensively evaluate model performance beyond just loss and accuracy. They introduce concepts like training time and tools for visualizing comparisons between different trained models. They also explore the concept of data augmentation as a strategy to improve model performance, focusing specifically on the “Trivial Augment” technique.

    • Comparing Model Results: The sources guide users through creating a Pandas DataFrame to organize and compare the results of different trained models. The DataFrame includes columns for metrics like training loss, training accuracy, testing loss, testing accuracy, and training time, allowing for a clear comparison of the models’ performance across various metrics (see the code sketch after this list).
    • Data Augmentation: The sources explain data augmentation as a technique for artificially increasing the diversity and size of the training dataset by applying various transformations to the original images. Data augmentation aims to improve the model’s generalization ability and reduce overfitting by exposing the model to a wider range of variations within the training data.
    • Trivial Augment: The sources focus on Trivial Augment [1], a data augmentation technique known for its simplicity and effectiveness. They guide users through implementing Trivial Augment using PyTorch’s torchvision.transforms module, showcasing how to apply transformations like random cropping, horizontal flipping, color jittering, and other augmentations to the training images. They provide code examples for defining a transformation pipeline using torchvision.transforms.Compose to apply a sequence of augmentations to the input images.
    • Visualizing Augmented Images: The sources recommend visualizing the augmented images to ensure that the applied transformations are appropriate and effective. They provide code using Matplotlib to display a grid of augmented images, allowing users to visually inspect the impact of the transformations on the training data.
    • Understanding the Benefits of Data Augmentation: The sources explain the potential benefits of data augmentation, including:
    • Improved Generalization: Exposing the model to a wider range of variations within the training data can help it learn more robust and generalizable features, leading to better performance on unseen data.
    • Reduced Overfitting: Increasing the diversity of the training data can mitigate overfitting, which occurs when the model learns the training data too well and performs poorly on new, unseen data.
    • Increased Effective Dataset Size: Artificially expanding the training dataset through augmentations can be beneficial when the original dataset is relatively small.
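
    A minimal sketch of the comparison DataFrame, assuming two per-epoch results dictionaries (model_0_results and model_1_results are hypothetical names) shaped like the results dictionary built in the training loop sketch earlier:

    ```python
    import pandas as pd

    # Take the final-epoch value of each metric for each model
    compare_results = pd.DataFrame(
        {
            "model_0_baseline": {k: v[-1] for k, v in model_0_results.items()},
            "model_1_augmented": {k: v[-1] for k, v in model_1_results.items()},
        }
    ).T  # transpose: one row per model, one column per metric
    print(compare_results)
    ```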

    The sources present a structured approach to evaluating and comparing model performance using Pandas DataFrames. They introduce data augmentation, particularly Trivial Augment, as a valuable technique for enhancing model generalization and performance. They guide users through implementing data augmentation pipelines using PyTorch’s torchvision.transforms module and recommend visualizing augmented images to ensure their effectiveness. These steps empower users to perform thorough model evaluation, understand the importance of data augmentation, and implement it effectively using PyTorch to potentially boost model performance on image classification tasks.

    Exploring Convolutional Neural Networks and Building a Custom Model: Pages 821-830

    The sources shift focus to the fundamentals of Convolutional Neural Networks (CNNs), introducing their key components and operations. They walk users through building a custom CNN model, incorporating concepts like convolutional layers, ReLU activation functions, max pooling layers, and flattening layers to create a model capable of learning from image data.

    • Introduction to CNNs: The sources provide an overview of CNNs, explaining their effectiveness in image classification tasks due to their ability to learn spatial hierarchies of features. They introduce the essential components of a CNN, including:
    1. Convolutional Layers: Convolutional layers apply filters to the input image to extract features like edges, textures, and patterns. These filters slide across the image, performing convolutions to create feature maps that capture different aspects of the input.
    2. ReLU Activation Function: ReLU (Rectified Linear Unit) is a non-linear activation function applied to the output of convolutional layers. It introduces non-linearity into the model, allowing it to learn complex relationships between features.
    3. Max Pooling Layers: Max pooling layers downsample the feature maps produced by convolutional layers, reducing their dimensionality while retaining important information. They help make the model more robust to variations in the input image.
    4. Flattening Layer: A flattening layer converts the multi-dimensional output of the convolutional and pooling layers into a one-dimensional vector, preparing it as input for the fully connected layers of the network.
    • Building a Custom CNN Model: The sources guide users through constructing a custom CNN model using PyTorch’s nn.Module class. They outline a step-by-step process, explaining how to define the model’s architecture:
    1. Defining the Model Class: Creating a Python class that inherits from nn.Module, setting up the model’s structure and layers.
    2. Initializing the Layers: Instantiating the convolutional layers (nn.Conv2d), ReLU activation function (nn.ReLU), max-pooling layers (nn.MaxPool2d), and flattening layer (nn.Flatten) within the model’s constructor (__init__).
    3. Implementing the Forward Pass: Defining the forward method, outlining the flow of data through the model’s layers during the forward pass, including the application of convolutional operations, activation functions, and pooling.
    4. Setting Model Input Shape: Determining the expected input shape for the model based on the dimensions of the input images, considering the number of color channels, height, and width.
    5. Verifying Input and Output Shapes: Ensuring that the input and output shapes of each layer are compatible, using techniques like printing intermediate shapes or utilizing tools like torchinfo to summarize the model’s architecture.
    • Understanding Input and Output Shapes: The sources highlight the importance of comprehending the input and output shapes of each layer in the CNN. They explain how to calculate the output shape of convolutional layers based on factors like kernel size, stride, and padding, providing resources for a deeper understanding of these concepts.
    • Using torchinfo for Model Summary: The sources introduce the torchinfo package as a helpful tool for summarizing PyTorch models, visualizing their architecture, and verifying input and output shapes. They demonstrate how to use torchinfo to print a concise summary of the model’s layers, parameters, and input/output sizes, aiding in understanding the model’s structure and ensuring its correctness.
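
    For reference, the spatial output size of a convolutional layer follows output = floor((input + 2 * padding - kernel_size) / stride) + 1. A quick sketch to verify the shape arithmetic interactively:

    ```python
    import torch
    from torch import nn

    # (64 + 2*0 - 3) / 1 + 1 = 62, so a 64x64 image becomes 62x62
    conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=3, stride=1, padding=0)
    x = torch.randn(1, 3, 64, 64)
    print(conv(x).shape)        # torch.Size([1, 10, 62, 62])

    # Max pooling with kernel_size=2 halves each spatial dimension: 62 -> 31
    pool = nn.MaxPool2d(kernel_size=2)
    print(pool(conv(x)).shape)  # torch.Size([1, 10, 31, 31])
    ```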

    The sources provide a clear and structured introduction to CNNs and guide users through building a custom CNN model using PyTorch. They explain the key components of CNNs, including convolutional layers, activation functions, pooling layers, and flattening layers. They walk users through defining the model’s architecture, understanding input/output shapes, and using tools like torchinfo to visualize and verify the model’s structure. These steps equip users with the knowledge and skills to create and work with CNNs for image classification tasks using custom datasets.

    Training and Evaluating the TinyVGG Model: Pages 831-840

    The sources walk users through the process of training and evaluating the TinyVGG model using the custom dataset created in the previous steps. They guide users through setting up training and testing functions, training the model for multiple epochs, visualizing the training progress using loss curves, and comparing the performance of the custom TinyVGG model to a baseline model.

    • Setting up Training and Testing Functions: The sources present Python functions for training and testing the model, highlighting the key steps involved in each phase:
    • train_step Function: This function performs a single training step, iterating through batches of training data and performing the following actions:
    1. Forward Pass: Passing the input data through the model to get predictions.
    2. Loss Calculation: Computing the loss between the predictions and the target labels using a chosen loss function.
    3. Backpropagation: Calculating gradients of the loss with respect to the model’s parameters.
    4. Optimizer Update: Updating the model’s parameters using an optimization algorithm to minimize the loss.
    5. Accuracy Calculation: Calculating the accuracy of the model’s predictions on the training batch.
    • test_step Function: Similar to the train_step function, this function evaluates the model’s performance on the test data, iterating through batches of test data and performing the forward pass, loss calculation, and accuracy calculation.
    • Training the Model: The sources guide users through training the TinyVGG model for a specified number of epochs, calling the train_step and test_step functions in each epoch. They showcase how to track and store the training and testing loss and accuracy values across epochs for later analysis and visualization.
    • Visualizing Training Progress with Loss Curves: The sources emphasize the importance of visualizing the training progress by plotting loss curves. They explain that loss curves depict the trend of the loss value over epochs, providing insights into the model’s learning process.
    • Interpreting Loss Curves: They guide users through interpreting loss curves, highlighting that a decreasing loss generally indicates that the model is learning effectively. They explain that if the training loss continues to decrease but the testing loss starts to increase or plateau, it might indicate overfitting, where the model performs well on the training data but poorly on unseen data.
    • Comparing Models and Exploring Hyperparameter Tuning: The sources compare the performance of the custom TinyVGG model to a baseline model, providing insights into the effectiveness of the chosen architecture. They suggest exploring techniques like hyperparameter tuning to potentially improve the model’s performance.
    • Hyperparameter Tuning: They briefly introduce hyperparameter tuning as the process of finding the optimal values for the model’s hyperparameters, such as learning rate, batch size, and the number of hidden units.
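
    A minimal plotting sketch for the loss curves, assuming a results dictionary of per-epoch metrics like the one built earlier:

    ```python
    import matplotlib.pyplot as plt

    epochs_range = range(len(results["train_loss"]))
    plt.plot(epochs_range, results["train_loss"], label="train loss")
    plt.plot(epochs_range, results["test_loss"], label="test loss")
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.legend()
    plt.show()  # diverging train/test curves can signal overfitting
    ```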

    The sources provide a comprehensive guide to training and evaluating the TinyVGG model using the custom dataset. They outline the steps involved in creating training and testing functions, performing the training process, visualizing training progress using loss curves, and comparing the model’s performance to a baseline model. These steps equip users with a structured approach to training, evaluating, and iteratively improving CNN models for image classification tasks.

    Saving, Loading, and Reflecting on the PyTorch Workflow: Pages 841-850

    The sources guide users through saving and loading the trained TinyVGG model, emphasizing the importance of preserving trained models for future use. They also provide a comprehensive reflection on the key steps involved in the PyTorch workflow for computer vision tasks, summarizing the concepts and techniques covered throughout the previous sections and offering insights into the overall process.

    • Saving and Loading the Trained Model: The sources highlight the significance of saving trained models to avoid retraining from scratch. They explain that saving the model’s state dictionary, which contains the learned parameters, allows for easy reloading and reuse.
    • Using torch.save: They demonstrate how to use PyTorch’s torch.save function to save the model’s state dictionary to a file, specifying the file path and the state dictionary as arguments. This step ensures that the trained model’s parameters are stored persistently.
    • Using torch.load: They showcase how to use PyTorch’s torch.load function to load the saved state dictionary back into a new model instance. They explain the importance of creating a new model instance with the same architecture as the saved model before loading the state dictionary. This step allows for seamless restoration of the trained model’s parameters.
    • Verifying Loaded Model: They suggest making predictions using the loaded model to ensure that it performs as expected and the loading process was successful.
    • Reflecting on the PyTorch Workflow: The sources provide a comprehensive recap of the essential steps involved in the PyTorch workflow for computer vision tasks, summarizing the concepts and techniques covered in the previous sections. They present a structured overview of the workflow, highlighting the following key stages:
    1. Data Preparation: Preparing the data, including loading, splitting into training and testing sets, and applying necessary transformations.
    2. Model Building: Constructing the neural network model, defining its architecture, layers, and activation functions.
    3. Loss Function and Optimizer Selection: Choosing an appropriate loss function to measure the model’s performance and an optimizer to update the model’s parameters during training.
    4. Training Loop: Implementing a training loop to iteratively train the model on the training data, performing forward passes, loss calculations, backpropagation, and optimizer updates.
    5. Model Evaluation: Evaluating the model’s performance on the test data, using metrics like loss and accuracy.
    6. Hyperparameter Tuning and Experimentation: Exploring different model architectures, hyperparameters, and data augmentation techniques to potentially improve the model’s performance.
    7. Saving and Loading the Model: Preserving the trained model by saving its state dictionary to a file for future use.
    • Encouraging Further Exploration and Practice: The sources emphasize that mastering the PyTorch workflow requires practice and encourage users to explore different datasets, models, and techniques to deepen their understanding. They recommend referring to the PyTorch documentation and online resources for additional learning and problem-solving.
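
    A minimal save/load sketch; the file name is hypothetical, and the new instance must use the same architecture as the saved model:

    ```python
    import torch

    # Save only the learned parameters (the state dict)
    torch.save(model.state_dict(), "tinyvgg_model.pth")  # hypothetical file name

    # To restore: re-create an instance with the same architecture, then load
    loaded_model = TinyVGG(input_shape=3, hidden_units=10, output_shape=3)
    loaded_model.load_state_dict(torch.load("tinyvgg_model.pth"))
    loaded_model.eval()  # set to evaluation mode before making predictions
    ```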

    The sources provide clear guidance on saving and loading trained models, emphasizing the importance of preserving trained models for reuse. They offer a thorough recap of the PyTorch workflow for computer vision tasks, summarizing the key steps and techniques covered in the previous sections. They guide users through the process of saving the model’s state dictionary and loading it back into a new model instance. By emphasizing the overall workflow and providing practical examples, the sources equip users with a solid foundation for tackling computer vision projects using PyTorch. They encourage further exploration and experimentation to solidify understanding and enhance practical skills in building, training, and deploying computer vision models.

    Expanding the Horizons of PyTorch: Pages 851-860

    The sources shift focus from the specific TinyVGG model and custom dataset to a broader exploration of PyTorch’s capabilities. They introduce additional concepts, resources, and areas of study within the realm of deep learning and PyTorch, encouraging users to expand their knowledge and pursue further learning beyond the scope of the initial tutorial.

    • Advanced Topics and Resources for Further Learning: The sources recognize that the covered material represents a foundational introduction to PyTorch and deep learning, and they acknowledge that there are many more advanced topics and areas of specialization within this field.
    • Transfer Learning: The sources highlight transfer learning as a powerful technique that involves leveraging pre-trained models on large datasets to improve the performance on new, potentially smaller datasets.
    • Model Experiment Tracking: They introduce the concept of model experiment tracking, emphasizing the importance of keeping track of different model architectures, hyperparameters, and results for organized experimentation and analysis.
    • PyTorch Paper Replication: The sources mention the practice of replicating research papers that introduce new deep learning architectures or techniques using PyTorch. They suggest that this is a valuable way to gain deeper understanding and practical experience with cutting-edge advancements in the field.
    • Additional Chapters and Resources: The sources point to additional chapters and resources available on the learnpytorch.io website, indicating that the learning journey continues beyond the current section. They encourage users to explore these resources to deepen their understanding of various aspects of deep learning and PyTorch.
    • Encouraging Continued Learning and Exploration: The sources strongly emphasize the importance of continuous learning and exploration within the field of deep learning. They recognize that deep learning is a rapidly evolving field with new architectures, techniques, and applications emerging frequently.
    • Staying Updated with Advancements: They advise users to stay updated with the latest research papers, blog posts, and online courses to keep their knowledge and skills current.
    • Building Projects and Experimenting: The sources encourage users to actively engage in building projects, experimenting with different datasets and models, and participating in the deep learning community.

    The sources gracefully transition from the specific tutorial on TinyVGG and custom datasets to a broader perspective on the vast landscape of deep learning and PyTorch. They introduce additional topics, resources, and areas of study, encouraging users to continue their learning journey and explore more advanced concepts. By highlighting these areas and providing guidance on where to find further information, the sources empower users to expand their knowledge, skills, and horizons within the exciting and ever-evolving world of deep learning and PyTorch.

    Diving into Multi-Class Classification with PyTorch: Pages 861-870

    The sources introduce the concept of multi-class classification, a common task in machine learning where the goal is to categorize data into one of several possible classes. They contrast this with binary classification, which involves only two classes. The sources then present the FashionMNIST dataset, a collection of grayscale images of clothing items, as an example for demonstrating multi-class classification using PyTorch.

    • Multi-Class Classification: The sources distinguish multi-class classification from binary classification, explaining that multi-class classification involves assigning data points to one of multiple possible categories, while binary classification deals with only two categories. They emphasize that many real-world problems fall under the umbrella of multi-class classification. [1]
    • FashionMNIST Dataset: The sources introduce the FashionMNIST dataset, a widely used dataset for image classification tasks. This dataset comprises 70,000 grayscale images of 10 different clothing categories, including T-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot. The sources highlight that this dataset provides a suitable playground for experimenting with multi-class classification techniques using PyTorch. [1, 2]
    • Preparing the Data: The sources outline the steps involved in preparing the FashionMNIST dataset for use in PyTorch, emphasizing the importance of loading the data, splitting it into training and testing sets, and applying necessary transformations. They mention using PyTorch’s DataLoader class to efficiently handle data loading and batching during training and testing. [2]
    • Building a Multi-Class Classification Model: The sources guide users through building a simple neural network model for multi-class classification using PyTorch. They discuss the choice of layers, activation functions, and the output layer’s activation function. They mention using a softmax activation function in the output layer to produce a probability distribution over the possible classes. [2]
    • Training the Model: The sources outline the process of training the multi-class classification model, highlighting the use of a suitable loss function (such as cross-entropy loss) and an optimization algorithm (such as stochastic gradient descent) to minimize the loss and improve the model’s accuracy during training. [2]
    • Evaluating the Model: The sources emphasize the need to evaluate the trained model’s performance on the test dataset, using metrics such as accuracy, precision, recall, and the F1-score to assess its effectiveness in classifying images into the correct categories. [2]
    • Visualization for Understanding: The sources advocate for visualizing the data and the model’s predictions to gain insights into the classification process. They suggest techniques like plotting the images and their corresponding predicted labels to qualitatively assess the model’s performance. [2]
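
    A sketch of the data preparation step using torchvision’s built-in dataset; the batch size is illustrative:

    ```python
    import torch
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    # Download FashionMNIST (60,000 training / 10,000 test images, 10 classes)
    train_data = datasets.FashionMNIST(root="data", train=True, download=True,
                                       transform=transforms.ToTensor())
    test_data = datasets.FashionMNIST(root="data", train=False, download=True,
                                      transform=transforms.ToTensor())

    train_dataloader = DataLoader(train_data, batch_size=32, shuffle=True)
    test_dataloader = DataLoader(test_data, batch_size=32, shuffle=False)

    print(train_data.classes)  # ['T-shirt/top', 'Trouser', ..., 'Ankle boot']
    ```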

    The sources effectively introduce the concept of multi-class classification and its relevance in various machine learning applications. They guide users through the process of preparing the FashionMNIST dataset, building a neural network model, training the model, and evaluating its performance. By emphasizing visualization and providing code examples, the sources equip users with the tools and knowledge to tackle multi-class classification problems using PyTorch.

    Beyond Accuracy: Exploring Additional Classification Metrics: Pages 871-880

    The sources introduce several additional metrics for evaluating the performance of classification models, going beyond the commonly used accuracy metric. They highlight the importance of considering multiple metrics to gain a more comprehensive understanding of a model’s strengths and weaknesses. The sources also emphasize that the choice of appropriate metrics depends on the specific problem and the desired balance between different types of errors.

    • Limitations of Accuracy: The sources acknowledge that accuracy, while a useful metric, can be misleading in situations where the classes are imbalanced. In such cases, a model might achieve high accuracy simply by correctly classifying the majority class, even if it performs poorly on the minority class.
    • Precision and Recall: The sources introduce precision and recall as two important metrics that provide a more nuanced view of a classification model’s performance, particularly when dealing with imbalanced datasets.
    • Precision: Precision measures the proportion of correctly classified positive instances out of all instances predicted as positive. A high precision indicates that the model is good at avoiding false positives.
    • Recall: Recall, also known as sensitivity or the true positive rate, measures the proportion of correctly classified positive instances out of all actual positive instances. A high recall suggests that the model is effective at identifying all positive instances.
    • F1-Score: The sources present the F1-score as a harmonic mean of precision and recall, providing a single metric that balances both precision and recall. A high F1-score indicates a good balance between minimizing false positives and false negatives.
    • Confusion Matrix: The sources introduce the confusion matrix as a valuable tool for visualizing the performance of a classification model. A confusion matrix displays the counts of true positives, true negatives, false positives, and false negatives, providing a detailed breakdown of the model’s predictions across different classes.
    • Classification Report: The sources mention the classification report as a comprehensive summary of key classification metrics, including precision, recall, F1-score, and support (the number of instances of each class) for each class in the dataset.
    • TorchMetrics Package: The sources recommend exploring the torchmetrics package, a metrics library for PyTorch that provides a wide range of pre-implemented classification metrics. Using this package simplifies the calculation and tracking of various metrics during model training and evaluation.
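
    For reference, precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 * precision * recall / (precision + recall). A minimal torchmetrics sketch (assuming torchmetrics ≥ 0.11; the labels are random, purely for illustration):

    ```python
    import torch
    from torchmetrics import Accuracy, F1Score
    from torchmetrics.classification import MulticlassConfusionMatrix

    num_classes = 10
    preds = torch.randint(0, num_classes, (100,))   # hypothetical predicted labels
    target = torch.randint(0, num_classes, (100,))  # hypothetical true labels

    accuracy = Accuracy(task="multiclass", num_classes=num_classes)
    f1 = F1Score(task="multiclass", num_classes=num_classes)
    confmat = MulticlassConfusionMatrix(num_classes=num_classes)

    print(accuracy(preds, target))  # overall accuracy
    print(f1(preds, target))        # F1-score
    print(confmat(preds, target))   # 10x10 confusion matrix
    ```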

    The sources effectively expand the discussion of classification model evaluation by introducing additional metrics that go beyond accuracy. They explain precision, recall, the F1-score, the confusion matrix, and the classification report, highlighting their importance in understanding a model’s performance, especially in cases of imbalanced datasets. By encouraging the use of the torchmetrics module, the sources provide users with practical tools to easily calculate and track these metrics during their machine learning workflows. They emphasize that choosing the right metrics depends on the specific problem and the relative importance of different types of errors.

    Exploring Convolutional Neural Networks and Computer Vision: Pages 881-890

    The sources mark a transition into the realm of computer vision, specifically focusing on Convolutional Neural Networks (CNNs), a type of neural network architecture highly effective for image-related tasks. They introduce core concepts of CNNs and showcase their application in image classification using the FashionMNIST dataset.

    • Introduction to Computer Vision: The sources acknowledge computer vision as a rapidly expanding field within deep learning, encompassing tasks like image classification, object detection, and image segmentation. They emphasize the significance of CNNs as a powerful tool for extracting meaningful features from image data, enabling machines to “see” and interpret visual information.
    • Convolutional Neural Networks (CNNs): The sources provide a foundational understanding of CNNs, highlighting their key components and how they differ from traditional neural networks.
    • Convolutional Layers: They explain how convolutional layers apply filters (also known as kernels) to the input image to extract features such as edges, textures, and patterns. These filters slide across the image, performing convolutions to produce feature maps.
    • Activation Functions: The sources discuss the use of activation functions like ReLU (Rectified Linear Unit) within CNNs to introduce non-linearity, allowing the network to learn complex relationships in the image data.
    • Pooling Layers: They explain how pooling layers, such as max pooling, downsample the feature maps, reducing their dimensionality while retaining essential information, making the network more computationally efficient and robust to variations in the input image.
    • Fully Connected Layers: The sources mention that after several convolutional and pooling layers, the extracted features are flattened and passed through fully connected layers, similar to those found in traditional neural networks, to perform the final classification.
    • Applying CNNs to FashionMNIST: The sources guide users through building a simple CNN model for image classification using the FashionMNIST dataset. They walk through the process of defining the model architecture, choosing appropriate layers and hyperparameters, and training the model using the training dataset.
    • Evaluation and Visualization: The sources emphasize evaluating the trained CNN model on the test dataset, using metrics like accuracy to assess its performance. They also encourage visualizing the model’s predictions and the learned feature maps to gain a deeper understanding of how the CNN is “seeing” and interpreting the images.
    • Importance of Experimentation: The sources highlight that designing and training effective CNNs often involves experimentation with different architectures, hyperparameters, and training techniques. They encourage users to explore different approaches and carefully analyze the results to optimize their models for specific computer vision tasks.
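
    As a sketch of these components working together, here is a hypothetical minimal CNN for 28×28 grayscale FashionMNIST images; the layer sizes are illustrative:

    ```python
    import torch
    from torch import nn

    class FashionCNN(nn.Module):
        """Hypothetical minimal CNN for 28x28 grayscale FashionMNIST images."""
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 8, kernel_size=3, padding=1),  # 1 input channel (grayscale)
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2),                # 28x28 -> 14x14
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(8 * 14 * 14, 10),                 # 10 clothing classes
            )

        def forward(self, x):
            return self.classifier(self.features(x))

    model = FashionCNN()
    print(model(torch.randn(1, 1, 28, 28)).shape)  # torch.Size([1, 10])
    ```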

    Working with Tensors and Building Models in PyTorch: Pages 891-900

    The sources shift focus to the practical aspects of working with tensors in PyTorch and building neural network models for both regression and classification tasks. They emphasize the importance of understanding tensor operations, data manipulation, and building blocks of neural networks within the PyTorch framework.

    • Understanding Tensors: The sources reiterate the importance of tensors as the fundamental data structure in PyTorch, highlighting their role in representing data and model parameters. They discuss tensor creation, indexing, and various operations like stacking, permuting, and reshaping tensors to prepare data for use in neural networks.
    • Building a Regression Model: The sources walk through the steps of building a simple linear regression model in PyTorch to predict a continuous target variable from a set of input features (see the sketch after this list). They explain:
    • Model Architecture: Defining a model class that inherits from PyTorch’s nn.Module, specifying the linear layers and activation functions that make up the model.
    • Loss Function: Choosing an appropriate loss function, such as Mean Squared Error (MSE), to measure the difference between the model’s predictions and the actual target values.
    • Optimizer: Selecting an optimizer, such as Stochastic Gradient Descent (SGD), to update the model’s parameters during training, minimizing the loss function.
    • Training Loop: Implementing a training loop that iterates through the training data, performs forward and backward passes, calculates the loss, and updates the model’s parameters using the optimizer.
    • Addressing Shape Errors: The sources address common shape errors that arise when working with tensors in PyTorch, emphasizing the importance of ensuring that tensor dimensions are compatible for operations like matrix multiplication. They provide examples of troubleshooting shape mismatches and adjusting tensor dimensions using techniques like reshaping or transposing.
    • Visualizing Data and Predictions: The sources advocate for visualizing the data and the model’s predictions to gain insights into the regression process. They suggest plotting the input features against the target variable, along with the model’s predicted line, to visually assess the model’s fit and performance.
    • Introducing Non-linearities: The sources acknowledge the limitations of linear models in capturing complex relationships in data. They introduce the concept of non-linear activation functions, such as ReLU (Rectified Linear Unit), as a way to introduce non-linearity into the model, enabling it to learn more complex patterns. They explain how incorporating ReLU layers can enhance a model’s ability to fit non-linear data.
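
    A minimal end-to-end sketch of the regression workflow described above, using a toy dataset with a known linear relationship (weight 0.7, bias 0.3):

    ```python
    import torch
    from torch import nn

    class LinearRegressionModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = nn.Linear(in_features=1, out_features=1)

        def forward(self, x):
            return self.linear(x)

    # Toy data with a known linear relationship: y = 0.7x + 0.3
    X = torch.arange(0, 1, 0.02).unsqueeze(1)
    y = 0.7 * X + 0.3

    model = LinearRegressionModel()
    loss_fn = nn.MSELoss()                                   # mean squared error
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(100):
        y_pred = model(X)          # forward pass
        loss = loss_fn(y_pred, y)  # measure the error
        optimizer.zero_grad()      # reset gradients
        loss.backward()            # backpropagation
        optimizer.step()           # update parameters
    ```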

    The sources effectively transition from theoretical concepts to practical implementation by demonstrating how to work with tensors in PyTorch and build basic neural network models for both regression and classification tasks. They guide users through the essential steps of model definition, loss function selection, optimizer choice, and training loop implementation. By highlighting common pitfalls like shape errors and emphasizing visualization, the sources provide a hands-on approach to learning PyTorch and its application in building machine learning models. They also introduce the crucial concept of non-linear activation functions, laying the foundation for exploring more complex neural network architectures in subsequent sections.

    Here are two ways to improve a model’s performance, based on the provided sources:

    • Add More Layers to the Model: Adding more layers gives the model more opportunities to learn patterns in the data. If a model currently has two layers with approximately 20 parameters, adding more layers increases the number of parameters the model can use to try to learn the patterns in the data [1].
    • Fit the Model for Longer: Every epoch is one pass through the data, so fitting the model for longer gives it more chances to learn. For example, if the model has only had 100 passes through a dataset, that may not be enough; increasing the opportunities to 1,000 may improve the model’s results [2].

    How Loss Functions Measure Model Performance

    The sources explain that a loss function is crucial for training machine learning models. A loss function quantifies how “wrong” a model’s predictions are compared to the desired output. [1-6] The output of a loss function is a numerical value representing the error. Lower loss values indicate better performance.

    Here’s how the loss function works in practice:

    • Forward Pass: The model makes predictions on the input data. [7, 8] These predictions are often referred to as “logits” before further processing. [9-14]
    • Comparing Predictions to True Values: The loss function takes the model’s predictions and compares them to the true labels from the dataset. [4, 8, 15-19]
    • Calculating the Error: The loss function calculates a numerical value representing the difference between the predictions and the true labels. [1, 4-6, 8, 20-29] This value is the “loss,” and the specific calculation depends on the type of loss function used.
    • Guiding Model Improvement: The loss value is used by the optimizer to adjust the model’s parameters (weights and biases) to reduce the error in subsequent predictions. [3, 20, 24, 27, 30-38] This iterative process of making predictions, calculating the loss, and updating the parameters is what drives the model’s learning during training.

    The goal of training is to minimize the loss function, effectively bringing the model’s predictions closer to the true values. [4, 21, 27, 32, 37, 39-41]

    The sources explain that different loss functions are appropriate for different types of problems. [42-48] For example:

    • Regression problems (predicting a continuous numerical value) often use loss functions like Mean Absolute Error (MAE, also called L1 loss in PyTorch) or Mean Squared Error (MSE). [42, 44-46, 49, 50]
    • Classification problems (predicting a category or class label) might use loss functions like Binary Cross Entropy (BCE) for binary classification or Cross Entropy for multi-class classification. [42, 43, 45, 46, 48, 50, 51]

    The sources also highlight the importance of using the appropriate loss function for the chosen model and task. [44, 52, 53]
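
    A quick sketch of these loss functions in action; the tensors are toy values, purely for illustration:

    ```python
    import torch
    from torch import nn

    # Regression losses: compare continuous predictions to continuous targets
    preds = torch.tensor([2.5, 0.0, 2.0])
    targets = torch.tensor([3.0, -0.5, 2.0])
    print(nn.L1Loss()(preds, targets))   # MAE: mean(|pred - target|) ~= 0.3333
    print(nn.MSELoss()(preds, targets))  # MSE: mean((pred - target)^2) ~= 0.1667

    # Multi-class classification: cross entropy takes raw logits and class indices
    logits = torch.tensor([[1.5, 0.3, -0.2]])  # one sample, three classes
    label = torch.tensor([0])                  # true class index
    print(nn.CrossEntropyLoss()(logits, label))
    ```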

    Key takeaway: Loss functions serve as a feedback mechanism, providing a quantitative measure of how well a model is performing. By minimizing the loss, the model learns to make more accurate predictions and improve its overall performance.

    Main Steps in a PyTorch Training Loop

    The sources provide a detailed explanation of the PyTorch training loop, highlighting its importance in the machine learning workflow. The training loop is the process where the model iteratively learns from the data and adjusts its parameters to improve its predictions. The sources provide code examples and explanations for both regression and classification problems.

    Here is a breakdown of the main steps involved in a PyTorch training loop:

    1. Setting Up

    • Epochs: Define the number of epochs, which represent the number of times the model will iterate through the entire training dataset. [1]
    • Training Mode: Set the model to training mode using model.train(). This activates specific settings and behaviors within the model, such as enabling dropout and batch normalization layers, crucial for training. [1, 2]
    • Data Loading: Prepare the data loader to feed batches of training data to the model. [3]

    2. Iterating Through Data Batches

    • Loop: Initiate a loop to iterate through each batch of data provided by the data loader. [1]

    3. The Optimization Loop (for each batch)

    • Forward Pass: Pass the input data through the model to obtain predictions (often referred to as “logits” before further processing). [4, 5]
    • Loss Calculation: Calculate the loss, which measures the difference between the model’s predictions and the true labels. Choose a loss function appropriate for the problem type (e.g., MSE for regression, Cross Entropy for classification). [5, 6]
    • Zero Gradients: Reset the gradients of the model’s parameters to zero. This step is crucial to ensure that gradients from previous batches do not accumulate and affect the current batch’s calculations. [5, 7]
    • Backpropagation: Calculate the gradients of the loss function with respect to the model’s parameters. This step involves going backward through the network, computing how much each parameter contributed to the loss. PyTorch handles this automatically using loss.backward(). [5, 7, 8]
    • Gradient Descent: Update the model’s parameters to minimize the loss function. This step uses an optimizer (e.g., SGD, Adam) to adjust the weights and biases in the direction that reduces the loss. PyTorch’s optimizer.step() performs this parameter update. [5, 7, 8]

    4. Testing (Evaluation) Loop (typically performed after each epoch)

    • Evaluation Mode: Set the model to evaluation mode using model.eval(). This deactivates training-specific settings (like dropout) and prepares the model for inference. [2, 9]
    • Inference Mode: Use the torch.inference_mode() context manager to perform inference. This disables gradient calculations and other operations not required for testing, potentially improving speed and memory efficiency. [9, 10]
    • Forward Pass (on Test Data): Pass the test data through the model to obtain predictions. [4, 9]
    • Loss Calculation (on Test Data): Calculate the loss on the test data to assess the model’s performance on unseen data. [9, 11]
    • Performance Metrics: Calculate additional performance metrics relevant to the problem, such as accuracy for classification or other evaluation metrics suitable for the specific task. [9]

    5. Repetition

    The training loop repeats for the specified number of epochs, iterating through the data, performing the optimization steps, and periodically evaluating the model on the test data.
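
    Putting the steps above together, here is a minimal sketch of a training and testing loop. It assumes that model, loss_fn, optimizer, epochs, train_dataloader, and test_dataloader have already been defined; the names are illustrative:

    import torch

    for epoch in range(epochs):
        # --- Training ---
        model.train()                              # training mode
        for X_batch, y_batch in train_dataloader:  # iterate through data batches
            y_pred = model(X_batch)                # 1. forward pass
            loss = loss_fn(y_pred, y_batch)        # 2. calculate the loss
            optimizer.zero_grad()                  # 3. zero gradients from the previous batch
            loss.backward()                        # 4. backpropagation
            optimizer.step()                       # 5. gradient descent (update parameters)

        # --- Testing (evaluation) ---
        model.eval()                               # evaluation mode
        with torch.inference_mode():               # disable gradient tracking
            for X_test, y_test in test_dataloader:
                test_pred = model(X_test)
                test_loss = loss_fn(test_pred, y_test)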

    Key Takeaways

    • The training loop is the heart of the model’s learning process in PyTorch.
    • Each step in the loop plays a vital role in enabling the model to learn from data and improve its predictions.
    • PyTorch simplifies many aspects of the training loop, handling backpropagation and gradient updates automatically, allowing you to focus on defining the model architecture, choosing appropriate loss functions and optimizers, and preparing the data.

    The sources emphasize the importance of understanding these steps and how they work together to train effective machine learning models in PyTorch. They provide code examples and detailed explanations for each step, illustrating the concepts with practical implementations.

    Reasons for PyTorch’s Popularity

    The sources offer several insights into why PyTorch has become so popular for deep learning and machine learning:

    • Research Favorite: The sources emphasize that PyTorch is the most popular deep learning research framework, as evidenced by its widespread use in published research papers and code repositories. [1, 2] For example, Papers with Code, a website tracking machine learning papers and their associated code, shows that 58% of the 65,000 papers with code they’ve tracked are implemented with PyTorch. [2] This popularity stems from PyTorch’s flexibility, ease of use, and ability to support cutting-edge research in various domains.
    • Pythonic and User-Friendly: PyTorch is written in Python, making it highly accessible to the vast community of Python developers and researchers. [3] Its intuitive and Pythonic API simplifies the process of defining, training, and evaluating models. The dynamic computational graph enables flexible experimentation and debugging, contributing to its popularity among researchers and practitioners.
    • Strong Industry Adoption: Beyond research, PyTorch has seen significant adoption by industry leaders like Tesla, Microsoft, OpenAI, and Facebook (Meta). [4-9] Tesla utilizes PyTorch for the computer vision models powering its Autopilot system. [5] OpenAI has standardized on PyTorch for its research and development. [6, 7] Facebook leverages PyTorch for its diverse machine learning applications. [9] This widespread industry adoption further validates PyTorch’s capabilities and reinforces its position as a leading deep learning framework.
    • GPU Acceleration: PyTorch allows developers to leverage the power of GPUs (Graphics Processing Units) for faster computations. [3, 10] This is crucial for deep learning tasks, which often involve large datasets and computationally intensive operations. PyTorch utilizes CUDA, NVIDIA’s parallel computing platform and API, to enable efficient execution on GPUs, significantly accelerating training and inference processes. [11, 12]
    • Rich Ecosystem and Resources: PyTorch benefits from a robust ecosystem of tools, libraries, and resources. Torch Hub provides access to pre-trained models, facilitating transfer learning. [3] Torchvision offers datasets, transformations, and model architectures specifically designed for computer vision tasks. [3, 13] PyTorch’s extensive documentation, active community forums, and numerous tutorials provide ample support for learners and practitioners at all levels.
    • Accessibility and Ease of Setup: Tools like Google Colab further enhance PyTorch’s accessibility by providing a cloud-based environment with free GPU access and pre-installed PyTorch. [14-21] This ease of setup lowers the barrier to entry for beginners and enables rapid experimentation and prototyping without requiring extensive local hardware setup.

    The sources, along with the conversation history, consistently portray PyTorch as a powerful, flexible, and user-friendly framework well-suited for both research and production-level deep learning applications. Its popularity is driven by a confluence of factors, including its research focus, Pythonic nature, strong industry support, GPU acceleration capabilities, a rich ecosystem, and ease of accessibility.

    The Impact of Random Seeds on Neural Network Training

    The sources explain that randomness is a fundamental aspect of training neural networks in PyTorch. They highlight the use of random seeds as a mechanism for controlling this randomness and enabling reproducibility in experiments. Here’s how random seeds influence the training process:

    1. Initialization with Random Numbers

    • Random Parameters: Neural networks typically start with their parameters (weights and biases) initialized with random values. Source [1] emphasizes this, stating that “Random tensors are important because the way many neural networks learn is that they start with tensors full of random numbers and then adjust those random numbers to better represent the data.” The sources demonstrate this with code examples, using functions like torch.rand() to generate random tensors for weight and bias initialization.
    • Stochastic Gradient Descent: The sources mention “stochastic” gradient descent [2], highlighting that the optimization process itself involves randomness. The optimizer takes random steps to update the model’s parameters, gradually reducing the loss.

    2. The Role of the Random Seed

    • Flavoring the Randomness: Source [3] explains the concept of a random seed as a way to “flavor the randomness.” Because computers generate pseudo-random numbers, the random seed acts as a starting point for this pseudo-random number generation process. Setting a specific seed ensures that the sequence of random numbers generated will be the same every time the code is run.
    • Enabling Reproducibility: Using a random seed makes experiments reproducible. This means that if you share your code and the seed value with someone else, they should be able to obtain the same results as you, assuming they are using the same software and hardware environment. Source [4] illustrates this with the example of sharing a notebook with a friend and wanting to reduce the randomness in the results.
    • PyTorch Implementation: In PyTorch, you can set the random seed using torch.manual_seed(seed_value). Source [5] provides examples of setting the random seed and creating reproducible random tensors. Source [6] further notes that there is a separate seed for CUDA operations, torch.cuda.manual_seed(seed_value), to control randomness when using GPUs. A minimal sketch follows this list.
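
    A minimal sketch of seeding for reproducibility (the seed value 42 is arbitrary):

    import torch

    torch.manual_seed(42)        # seed the CPU random number generator
    tensor_a = torch.rand(3, 3)

    torch.manual_seed(42)        # reset the seed before generating again
    tensor_b = torch.rand(3, 3)

    print(torch.equal(tensor_a, tensor_b))  # True: same seed, same "random" numbers

    # Separate seed for CUDA (GPU) operations
    torch.cuda.manual_seed(42)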

    3. Effects on Training

    • Deterministic Initialization: Setting a random seed ensures that the model starts with the same initial parameters every time you run the training code. This removes variability in the starting point of the optimization process.
    • Consistent Optimization Path: With a fixed seed, the optimizer will take the same steps during training, leading to a more consistent learning trajectory. This can be helpful for debugging and understanding the model’s behavior.
    • Comparable Experiments: Using random seeds makes it easier to compare the results of different experiments. For example, if you are trying different model architectures or hyperparameters, using the same seed across experiments ensures that the observed differences are due to the changes you made rather than random variations in initialization or optimization.

    Key Takeaways

    • Randomness is inherent in neural network training, impacting both parameter initialization and the optimization process.
    • Random seeds provide a mechanism for controlling this randomness, making experiments reproducible and facilitating comparisons.
    • Setting a random seed doesn’t eliminate randomness entirely but ensures that the same sequence of “random” numbers is used, leading to more consistent and comparable results.

    The sources encourage developers to be mindful of randomness and leverage random seeds to enhance the reproducibility and reliability of their machine learning experiments. While complete determinism is often difficult to achieve due to factors beyond the random seed, using seeds is a valuable practice for improving the scientific rigor of deep learning research and development.

    Training a Neural Network in PyTorch: A Step-by-Step Guide

    The sources outline the primary steps involved in training a neural network using the PyTorch deep learning framework. These steps, often referred to as the PyTorch workflow, provide a structured approach to building, training, and evaluating models.

    1. Data Preparation and Loading

    • Data Acquisition: This initial step involves obtaining the data required for your machine-learning task. As noted in Source, data can take various forms, including structured data (e.g., spreadsheets), images, videos, audio, and even DNA sequences.
    • Data Exploration: Becoming familiar with your data is crucial. This might involve visualizing the data (e.g., plotting images, creating histograms) and understanding its distribution, patterns, and potential biases.
    • Data Preprocessing: Preparing the data for use with a PyTorch model often requires transformation and formatting. This could involve:
    • Numerical Encoding: Converting categorical data into numerical representations, as many machine learning models operate on numerical inputs.
    • Normalization: Scaling numerical features to a standard range (e.g., between 0 and 1) to prevent features with larger scales from dominating the learning process.
    • Reshaping: Restructuring data into the appropriate dimensions expected by the neural network.
    • Tensor Conversion: The sources emphasize that tensors are the fundamental building blocks of data in PyTorch. You’ll need to convert your data into PyTorch tensors using functions like torch.tensor().
    • Dataset and DataLoader: Source recommends using PyTorch’s Dataset and DataLoader classes to efficiently manage and load data during training. A Dataset object represents your dataset, while a DataLoader provides an iterable over the dataset, enabling batching, shuffling, and other data handling operations. A minimal sketch follows this list.
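
    A minimal sketch of wrapping tensors in a Dataset and iterating with a DataLoader (the data here is random and purely illustrative):

    import torch
    from torch.utils.data import TensorDataset, DataLoader

    features = torch.rand(100, 2)  # 100 samples, 2 features each
    labels = torch.rand(100, 1)

    dataset = TensorDataset(features, labels)  # wraps tensors as a Dataset
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True)  # batching + shuffling

    for X_batch, y_batch in dataloader:
        print(X_batch.shape, y_batch.shape)  # torch.Size([32, 2]) torch.Size([32, 1])
        break  # inspect just the first batch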

    2. Model Building or Selection

    • Model Architecture: This step involves defining the structure of your neural network. You’ll need to decide on:
    • Layer Types: PyTorch provides a wide range of layers in the torch.nn module, including linear layers (nn.Linear), convolutional layers (nn.Conv2d), recurrent layers (nn.LSTM), and more.
    • Number of Layers: The depth of your network, often determined through experimentation and the complexity of the task.
    • Number of Hidden Units: The dimensionality of the hidden representations within the network.
    • Activation Functions: Non-linear functions applied to the output of layers to introduce non-linearity into the model.
    • Model Implementation: You can build models from scratch, stacking layers together manually, or leverage pre-trained models from repositories like Torch Hub, particularly for tasks like image classification. Source showcases both approaches:
    • Subclassing nn.Module: This common pattern involves creating a Python class that inherits from nn.Module. You’ll define layers as attributes of the class and implement the forward() method to specify how data flows through the network.
    • Using nn.Sequential: Source demonstrates this simpler method for creating sequential models where data flows linearly through a sequence of layers. Both approaches are sketched after this list.
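
    A minimal sketch of both model-building approaches (the layer sizes are illustrative):

    import torch
    from torch import nn

    # Approach 1: subclass nn.Module and define the forward() method
    class SimpleModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.layer_1 = nn.Linear(in_features=2, out_features=8)
            self.layer_2 = nn.Linear(in_features=8, out_features=1)
            self.relu = nn.ReLU()  # non-linear activation

        def forward(self, x):
            # Defines how data flows through the network
            return self.layer_2(self.relu(self.layer_1(x)))

    # Approach 2: nn.Sequential, where data flows linearly through the layers
    sequential_model = nn.Sequential(
        nn.Linear(2, 8),
        nn.ReLU(),
        nn.Linear(8, 1),
    )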

    3. Loss Function and Optimizer Selection

    • Loss Function: The loss function measures how well the model is performing during training. It quantifies the difference between the model’s predictions and the actual target values. The choice of loss function depends on the nature of the problem:
    • Regression: Common loss functions include Mean Squared Error (MSE) and Mean Absolute Error (MAE).
    • Classification: Common loss functions include Cross-Entropy Loss and Binary Cross-Entropy Loss.
    • Optimizer: The optimizer is responsible for updating the model’s parameters (weights and biases) during training, aiming to minimize the loss function. Popular optimizers in PyTorch include Stochastic Gradient Descent (SGD) and Adam.
    • Hyperparameters: Both the loss function and optimizer often have hyperparameters that you’ll need to tune. For example, the learning rate for an optimizer controls the step size taken during parameter updates.

    4. Training Loop Implementation

    • Epochs: The training process is typically organized into epochs. An epoch involves iterating over the entire training dataset once. You’ll specify the number of epochs to train for.
    • Batches: To improve efficiency, data is often processed in batches rather than individually. You’ll set the batch size, determining the number of data samples processed in each iteration of the training loop.
    • Training Steps: The core of the training loop involves the following steps, repeated for each batch of data:
    • Forward Pass: Passing the input data through the model to obtain predictions.
    • Loss Calculation: Computing the loss by comparing predictions to the target values.
    • Backpropagation: Calculating gradients of the loss with respect to the model’s parameters. This identifies how each parameter contributed to the error.
    • Parameter Update: Using the optimizer to update the model’s parameters based on the calculated gradients. The goal is to adjust parameters in a direction that reduces the loss.
    • Evaluation: Periodically, you’ll evaluate the model’s performance on a separate validation set to monitor its progress and prevent overfitting (where the model learns the training data too well and performs poorly on unseen data).

    5. Model Saving and Loading

    • Saving: Once the model is trained to a satisfactory level, you’ll want to save it for later use. The sources describe methods for saving PyTorch models, including:
    • Saving the State Dictionary: This approach saves the model’s learned parameters in a dictionary-like object. It’s generally the recommended method as it’s more efficient and flexible.
    • Saving the Entire Model: This saves the entire model architecture and parameters. However, it can lead to larger file sizes and potential compatibility issues if the PyTorch version changes.
    • Loading: You can later load a saved model to reuse it for inference (making predictions on new data) or to continue training. A minimal sketch follows this list.
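
    A minimal sketch of saving and loading the state dictionary; the file name is hypothetical, and model / SimpleModel stand in for your trained model instance and its class:

    import torch

    MODEL_PATH = "model_0.pth"  # hypothetical file name

    # Save: the recommended approach stores only the learned parameters
    torch.save(model.state_dict(), MODEL_PATH)

    # Load: create a fresh instance of the same architecture, then load the weights
    loaded_model = SimpleModel()
    loaded_model.load_state_dict(torch.load(MODEL_PATH))
    loaded_model.eval()  # set to evaluation mode before inference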

    6. Additional Considerations

    • GPU Acceleration: The sources highlight the importance of utilizing GPUs for faster training, especially for large models and datasets. PyTorch provides mechanisms for transferring models and data to GPUs.
    • Experiment Tracking: As you train and experiment with different models and hyperparameters, it’s essential to keep track of your experiments. Tools like Weights & Biases, MLflow, and TensorBoard (mentioned in Source) can help you log metrics, visualize training progress, and compare different runs.
    • Deployment: Once you have a well-trained model, you might want to deploy it to make predictions in real-world applications. This often involves considerations related to model serving, optimization, and integration with other systems.

    Key Points from Sources

    • Iterative Workflow: The PyTorch workflow is iterative, meaning you’ll often go back and forth between different steps, experimenting with various model architectures, hyperparameters, and data preprocessing techniques to improve performance.
    • Experimentation: The sources encourage experimentation as a core part of the machine-learning process. Trying different things and observing their impact is crucial for finding the best model for your task.
    • Focus on Code: While a theoretical understanding of deep learning concepts is helpful, the sources prioritize a hands-on, code-centric approach, emphasizing that writing code and running experiments is the most effective way to learn PyTorch.
    • Abundant Resources: The sources emphasize the availability of extensive PyTorch documentation, tutorials, and community forums. These resources are invaluable for learning the framework and troubleshooting issues.

    Beyond the Basics

    The training steps described here provide a foundational understanding of how to train neural networks in PyTorch. As you progress in your deep learning journey, you’ll encounter more advanced topics and techniques. This might include:

    • Transfer Learning: Leveraging pre-trained models to jump-start your model’s performance, particularly when you have limited data.
    • Fine-tuning: Adapting pre-trained models to your specific task by modifying and training certain layers.
    • Custom Layers and Architectures: Designing and implementing your own specialized neural network layers and architectures to address unique challenges.
    • Distributed Training: Training models across multiple GPUs or machines to accelerate the process and handle extremely large datasets.

    The sources provide a strong foundation for getting started with PyTorch. They encourage an experimental mindset, highlighting the importance of writing code, running experiments, and iteratively refining your models to achieve optimal performance.

    Key Advantages of Utilizing PyTorch

    The sources highlight several key benefits of using PyTorch as a deep learning framework, particularly for research and building machine learning models. Here’s a summary of the main advantages:

    1. Research-Oriented and Widely Adopted:

    • PyTorch is positioned as the “most popular research deep learning framework” [1], with statistics from December 2021 indicating that 58% of papers with code on Papers With Code were implemented using PyTorch [2]. This popularity within the research community stems from its flexibility and ease of use, making it a preferred choice for developing state-of-the-art machine learning algorithms.
    • The widespread adoption of PyTorch is further evidenced by its use in prominent organizations like Tesla (for Autopilot computer vision models), OpenAI, Facebook (for in-house machine learning applications), and Microsoft [3-5].

    2. Pythonic and User-Friendly:

    • PyTorch is deeply integrated with Python, making it highly accessible for Python developers [1]. Its syntax and structure align closely with Pythonic conventions, reducing the learning curve for those already familiar with the language.
    • This user-friendliness is emphasized throughout the sources, advocating for a hands-on, code-centric approach to learning PyTorch and stressing that “if you know Python, it’s a very user-friendly programming language” [6].

    3. Dynamic Computational Graph and Debugging Ease:

    • PyTorch’s dynamic computational graph is a significant advantage. Unlike static graph frameworks like TensorFlow (at least in its earlier versions), PyTorch builds the graph as you execute the code [not from the provided sources]. This dynamic nature allows for greater flexibility during development, as you can modify the graph on the fly. It also simplifies debugging, as you can use standard Python debugging tools to inspect variables and step through the code.

    4. GPU Acceleration and Fast Computations:

    • PyTorch enables you to leverage the power of GPUs to accelerate computations [1, 7]. This is particularly crucial for deep learning, where training often involves vast amounts of data and computationally intensive operations.
    • PyTorch accomplishes GPU acceleration through CUDA, NVIDIA’s parallel computing platform and API [8, 9]. It also supports TPUs (Tensor Processing Units), although GPUs are more commonly used with PyTorch [9].
    • The sources highlight this benefit, noting that PyTorch “allows you to write fast deep learning code in Python” [1] and “enables you to run your code, your machine learning code accelerated on a GPU” [7].

    5. Extensive Ecosystem and Community Support:

    • PyTorch benefits from a rich ecosystem of tools, libraries, and resources. Source mentions Torch Hub, a repository for pre-trained models, and Torch Vision, a library specifically designed for computer vision tasks [6].
    • There’s also a vibrant PyTorch community, providing support through forums, tutorials, and online resources [not from the provided sources].

    6. Flexibility and Customization:

    • PyTorch’s flexibility allows you to build a wide range of models, from simple linear regressions to complex convolutional neural networks, and experiment with various architectures and techniques.
    • The sources demonstrate this flexibility through examples, showing how to create models by subclassing nn.Module, using nn.Sequential, and implementing custom layers.

    7. Focus on Tensors as Fundamental Building Blocks:

    • The sources repeatedly emphasize that tensors are the core data structures in PyTorch [10, 11]. This focus on tensors simplifies data handling and manipulation, as tensors provide a unified representation for various data types, including images, text, and numerical data.

    In summary, PyTorch offers a compelling combination of research-oriented features, user-friendliness, performance, and a robust ecosystem, making it a powerful and popular choice for deep learning practitioners.

    PyTorch and GPU Acceleration: A CUDA-Powered Partnership

    The sources provide a clear explanation of how PyTorch utilizes GPUs to achieve faster computations in deep learning. Here’s a breakdown:

    • GPUs are inherently designed for numerical computation. They were initially developed for video games, which require intensive graphical processing that involves a significant number of mathematical calculations. This inherent capability makes them well-suited for the complex operations involved in deep learning. [1]
    • PyTorch provides an interface to harness this computational power through CUDA. CUDA, which stands for Compute Unified Device Architecture, is NVIDIA’s parallel computing platform and API. It acts as a bridge between software, like PyTorch, and NVIDIA GPUs, enabling the software to use the GPU for general-purpose computing tasks. [2]
    • By leveraging CUDA, PyTorch can execute deep learning code on NVIDIA GPUs. This allows for significant speedups, particularly when dealing with large datasets and complex models. [3]
    • Device Agnostic Code: PyTorch promotes the concept of “device agnostic code”, which means the code can run on either a GPU (if available) or a CPU. This is achieved by setting a device variable that defaults to CUDA if a GPU is present and falls back to the CPU if not. [4, 5]
    • The to method plays a crucial role. This method is used to move tensors and models to the desired device (either CUDA for GPU or CPU). By moving the data and the model to the GPU, PyTorch ensures that all computations are performed on the faster hardware. [6, 7]
    • The sources strongly emphasize the performance benefits of using a GPU. They mention that it results in “faster computations”, allowing for faster pattern discovery in data, more experimentation, and ultimately, finding the best model. [8, 9]

    Key Takeaways:

    • PyTorch’s ability to utilize GPUs stems from its integration with CUDA.
    • The to method allows for seamless movement of data and models to the GPU.
    • Utilizing a GPU can lead to substantial speed improvements in deep learning tasks.

    It’s important to note that while GPUs generally offer significant performance gains, there are situations where the overhead of transferring data to and from the GPU might outweigh the computational benefits, particularly with smaller datasets and less complex models. [10]
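
    A minimal sketch of the device-agnostic pattern described above:

    import torch

    # Use the GPU if one is available, otherwise fall back to the CPU
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Move tensors (and, in the same way, models) to the target device with .to()
    tensor = torch.rand(3, 3).to(device)
    # model = model.to(device)  # a model (assumed defined elsewhere) moves the same way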

    Top Three Errors in PyTorch

    The sources identify three major error types that you’re likely to encounter when working with PyTorch and deep learning:

    1. Tensor Data Type Mismatches

    • The Root of the Problem: PyTorch relies heavily on tensors for representing and manipulating data. Tensors have an associated data type, such as float32, int64, or bool. Many PyTorch functions and operations require tensors to have specific data types to work correctly. If the data types of tensors involved in a calculation are incompatible, PyTorch will raise an error.
    • Common Manifestations: You might encounter this error when:
    • Performing mathematical operations between tensors with mismatched data types (e.g., multiplying a float32 tensor by an int64 tensor) [1, 2].
    • Using a function that expects a particular data type but receiving a tensor of a different type (e.g., torch.mean requires a float32 tensor) [3-5].
    • Real-World Example: The sources illustrate this error with torch.mean. If you attempt to calculate the mean of a tensor that isn’t a floating-point type, PyTorch will throw an error. To resolve this, you need to convert the tensor to float32 using tensor.type(torch.float32) [4], as reproduced in the sketch after this list.
    • Debugging Strategies:
    • Carefully inspect the data types of the tensors involved in the operation or function call where the error occurs.
    • Use tensor.dtype to check a tensor’s data type.
    • Convert tensors to the required data type using tensor.type().
    • Key Insight: Pay close attention to data types. When in doubt, default to float32 as it’s PyTorch’s preferred data type [6].
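
    A minimal reproduction of the torch.mean example described above:

    import torch

    int_tensor = torch.tensor([1, 2, 3])  # integer tensors default to int64
    # torch.mean(int_tensor)              # raises a RuntimeError: wrong data type

    print(int_tensor.dtype)                        # torch.int64
    float_tensor = int_tensor.type(torch.float32)  # convert to float32
    print(torch.mean(float_tensor))                # tensor(2.)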

    2. Tensor Shape Mismatches

    • The Core Issue: Tensors also have a shape, which defines their dimensionality. For example, a vector is a 1-dimensional tensor, a matrix is a 2-dimensional tensor, and an image with three color channels is often represented as a 3-dimensional tensor. Many PyTorch operations, especially matrix multiplications and neural network layers, have strict requirements regarding the shapes of input tensors.
    • Where It Goes Wrong:
    • Matrix Multiplication: The inner dimensions of matrices being multiplied must match [7, 8].
    • Neural Networks: The output shape of one layer needs to be compatible with the input shape of the next layer.
    • Reshaping Errors: Attempting to reshape a tensor into an incompatible shape (e.g., squeezing 9 elements into a shape of 1×7) [9].
    • Example in Action: The sources provide an example of a shape error during matrix multiplication using torch.matmul. If the inner dimensions don’t match, PyTorch will raise an error [8]; the sketch after this list reproduces it.
    • Troubleshooting Tips:
    • Shape Inspection: Thoroughly understand the shapes of your tensors using tensor.shape.
    • Visualization: When possible, visualize tensors (especially high-dimensional ones) to get a better grasp of their structure.
    • Reshape Carefully: Ensure that reshaping operations (tensor.reshape, tensor.view) result in compatible shapes.
    • Crucial Takeaway: Always verify shape compatibility before performing operations. Shape errors are prevalent in deep learning, so be vigilant.
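
    A minimal reproduction of the matrix multiplication shape error described above:

    import torch

    a = torch.rand(3, 2)
    b = torch.rand(3, 2)
    # torch.matmul(a, b)      # raises an error: inner dimensions (2 and 3) don't match

    print(a.shape, b.shape)   # inspect shapes before multiplying
    c = torch.matmul(a, b.T)  # transposing b makes the inner dimensions match: (3, 2) @ (2, 3)
    print(c.shape)            # torch.Size([3, 3])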

    3. Device Mismatches (CPU vs. GPU)

    • The Device Divide: PyTorch supports both CPUs and GPUs for computation. GPUs offer significant performance advantages, but require data and models to reside in GPU memory. If you attempt to perform an operation between tensors or models located on different devices, PyTorch will raise an error.
    • Typical Scenarios:
    • Moving Data to GPU: You might forget to move your input data to the GPU using tensor.to(device), leading to an error when performing calculations with a model that’s on the GPU [10].
    • NumPy and GPU Tensors: NumPy operates on CPU memory, so you can’t directly use NumPy functions on GPU tensors [11]. You need to first move the tensor back to the CPU using tensor.cpu() [12].
    • Source Illustration: The sources demonstrate this issue when trying to use numpy.array() on a tensor that’s on the GPU. The solution is to bring the tensor back to the CPU using tensor.cpu() [12], as reproduced in the sketch after this list.
    • Best Practices:
    • Device Agnostic Code: Use the device variable and the to() method to ensure that data and models are on the correct device [11, 13].
    • CPU-to-GPU Transfers: Minimize the number of data transfers between the CPU and GPU, as these transfers can introduce overhead.
    • Essential Reminder: Be device-aware. Always ensure that all tensors involved in an operation are on the same device (either CPU or GPU) to avoid errors.
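
    A minimal reproduction of the NumPy-on-GPU issue described above (the error only occurs when a GPU is actually available):

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    gpu_tensor = torch.rand(3).to(device)

    # gpu_tensor.numpy()           # raises a TypeError when the tensor lives on the GPU

    cpu_tensor = gpu_tensor.cpu()  # move the tensor back to CPU memory first
    array = cpu_tensor.numpy()     # now the NumPy conversion works
    print(array)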

    The Big Three Errors in PyTorch and Deep Learning

    The sources dedicate significant attention to highlighting the three most common errors encountered when working with PyTorch for deep learning, emphasizing that mastering these will equip you to handle a significant portion of the challenges you’ll face in your deep learning journey.

    1. Tensor Not the Right Data Type

    • The Core of the Issue: Tensors, the fundamental building blocks of data in PyTorch, come with associated data types (dtype), such as float32, float16, int32, and int64 [1, 2]. These data types specify how much detail a single number is stored with in memory [3]. Different PyTorch functions and operations may require specific data types to work correctly [3, 4].
    • Why it’s Tricky: Sometimes operations may unexpectedly work even if tensors have different data types [4, 5]. However, other operations, especially those involved in training large neural networks, can be quite sensitive to data type mismatches and will throw errors [4].
    • Debugging and Prevention:
    • Awareness is Key: Be mindful of the data types of your tensors and the requirements of the operations you’re performing.
    • Check Data Types: Utilize tensor.dtype to inspect the data type of a tensor [6].
    • Conversion: If needed, convert tensors to the desired data type using tensor.type(desired_dtype) [7].
    • Real-World Example: The sources provide examples of using torch.mean, a function that requires a float32 tensor [8, 9]. If you attempt to use it with an integer tensor, PyTorch will throw an error. You’ll need to convert the tensor to float32 before calculating the mean.

    2. Tensor Not the Right Shape

    • The Heart of the Problem: Neural networks are essentially intricate structures built upon layers of matrix multiplications. For these operations to work seamlessly, the shapes (dimensions) of tensors must be compatible [10-12].
    • Shape Mismatch Scenarios: This error arises when:
    • The inner dimensions of matrices being multiplied don’t match, violating the fundamental rule of matrix multiplication [10, 13].
    • Neural network layers receive input tensors with incompatible shapes, preventing the data from flowing through the network as expected [11].
    • You attempt to reshape a tensor into a shape that doesn’t accommodate all its elements [14].
    • Troubleshooting and Best Practices:
    • Inspect Shapes: Make it a habit to meticulously examine the shapes of your tensors using tensor.shape [6].
    • Visualize: Whenever possible, try to visualize your tensors to gain a clearer understanding of their structure, especially for higher-dimensional tensors. This can help you identify potential shape inconsistencies.
    • Careful Reshaping: Exercise caution when using operations like tensor.reshape or tensor.view to modify the shape of a tensor. Always ensure that the resulting shape is compatible with the intended operation or layer.
    • Source Illustration: The sources offer numerous instances where shape errors occur during matrix multiplication and when passing data through neural network layers [13-18].

    3. Tensor Not on the Right Device

    • The Device Dilemma: PyTorch allows you to perform computations on either a CPU or a GPU, with GPUs offering substantial speed advantages for deep learning tasks [19, 20]. However, this flexibility introduces the potential for device mismatches, where you attempt to perform operations between tensors located on different devices (CPU or GPU), resulting in errors [19, 21].
    • Common Culprits:
    • Data on CPU, Model on GPU: You might load your data but forget to transfer it to the GPU using tensor.to(device), leading to an error when interacting with a model that’s on the GPU [22].
    • NumPy and GPUs: NumPy, a popular Python library for numerical computing, operates on CPU memory. Attempting to directly use NumPy functions on GPU tensors will result in errors. You’ll need to first move the tensor back to the CPU using tensor.cpu() [19].
    • Debugging and Solutions:
    • Device-Agnostic Code: Adopt the practice of writing device-agnostic code, using the device variable and the to() method. This will help ensure that your data and models are consistently on the correct device [17].
    • Minimize Transfers: Be mindful of the overhead associated with data transfers between the CPU and GPU. Try to minimize these transfers to optimize performance.
    • Source Walkthrough: The sources provide examples of device mismatch errors, particularly when attempting to use NumPy functions with tensors on the GPU [19] and when data and models are not explicitly moved to the same device [17, 22-24].

    The sources consistently emphasize that these three types of errors are incredibly prevalent in PyTorch and deep learning in general, often leading to a significant amount of troubleshooting and debugging. By understanding the nature of these errors, their common causes, and how to address them, you’ll be well-prepared to tackle a substantial portion of the challenges you’ll encounter while developing and training deep learning models with PyTorch.

    The Dynamic Duo: Gradient Descent and Backpropagation

    The sources highlight two fundamental algorithms that are at the heart of training neural networks: gradient descent and backpropagation. Let’s explore each of these in detail.

    1. Gradient Descent: The Optimizer

    • What it Does: Gradient descent is an optimization algorithm that aims to find the best set of parameters (weights and biases) for a neural network to minimize the loss function. The loss function quantifies how “wrong” the model’s predictions are compared to the actual target values.
    • The Analogy: Imagine you’re standing on a mountain and want to find the lowest point (the valley). Gradient descent is like taking small steps downhill, following the direction of the steepest descent. The “steepness” is determined by the gradient of the loss function.
    • In PyTorch: PyTorch provides the torch.optim module, which contains various implementations of gradient descent and other optimization algorithms. You specify the model’s parameters and a learning rate (which controls the size of the steps taken downhill). [1-3]
    • Variations: There are different flavors of gradient descent:
    • Stochastic Gradient Descent (SGD): Updates parameters based on the gradient calculated from a single data point or a small batch of data. This introduces some randomness (noise) into the optimization process, which can help escape local minima. [3]
    • Adam: A more sophisticated variant of SGD that uses momentum and adaptive learning rates to improve convergence speed and stability. [4, 5]
    • Key Insight: The choice of optimizer and its hyperparameters (like learning rate) can significantly influence the training process and the final performance of your model. Experimentation is often needed to find the best settings for a given problem.

    2. Backpropagation: The Gradient Calculator

    • Purpose: Backpropagation is the algorithm responsible for calculating the gradients of the loss function with respect to the neural network’s parameters. These gradients are then used by gradient descent to update the parameters in the direction that reduces the loss.
    • How it Works: Backpropagation uses the chain rule from calculus to efficiently compute gradients, starting from the output layer and propagating them backward through the network layers to the input.
    • The “Backward Pass”: In PyTorch, you trigger backpropagation by calling the loss.backward() method. This calculates the gradients and stores them in the grad attribute of each parameter tensor. [6-9]
    • PyTorch’s Magic: PyTorch’s autograd feature handles the complexities of backpropagation automatically. You don’t need to manually implement the chain rule or derivative calculations. [10, 11]
    • Essential for Learning: Backpropagation is the key to enabling neural networks to learn from data by adjusting their parameters in a way that minimizes prediction errors.

    The sources emphasize that gradient descent and backpropagation work in tandem: backpropagation computes the gradients, and gradient descent uses these gradients to update the model’s parameters, gradually improving its performance over time. [6, 10]
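
    A minimal sketch of one optimization step showing the two algorithms working together (the model and data are illustrative):

    import torch
    from torch import nn

    model = nn.Linear(2, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    X = torch.rand(8, 2)
    y = torch.rand(8, 1)

    preds = model(X)          # forward pass
    loss = loss_fn(preds, y)  # how wrong are the predictions?

    optimizer.zero_grad()     # clear gradients from any previous step
    loss.backward()           # backpropagation: autograd computes the gradients
    print(model.weight.grad)  # gradients are stored in each parameter's .grad attribute
    optimizer.step()          # gradient descent: update parameters using the gradients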

    Transfer Learning: Leveraging Existing Knowledge

    Transfer learning is a powerful technique in deep learning where you take a model that has already been trained on a large dataset for a particular task and adapt it to solve a different but related task. This approach offers several advantages, especially when dealing with limited data or when you want to accelerate the training process. The sources provide examples of how transfer learning can be applied and discuss some of the key resources within PyTorch that support this technique.

    The Core Idea: Instead of training a model from scratch, you start with a model that has already learned a rich set of features from a massive dataset (often called a pre-trained model). These pre-trained models are typically trained on datasets like ImageNet, which contains millions of images across thousands of categories.

    How it Works:

    1. Choose a Pre-trained Model: Select a pre-trained model that is relevant to your target task. For image classification, popular choices include ResNet, VGG, and Inception.
    2. Feature Extraction: Use the pre-trained model as a feature extractor. You can either:
    • Freeze the weights of the early layers of the model (which have learned general image features) and only train the later layers (which are more specific to your task).
    • Fine-tune the entire pre-trained model, allowing all layers to adapt to your target dataset.
    3. Transfer to Your Task: Replace the final layer(s) of the pre-trained model with layers that match the output requirements of your task. For example, if you’re classifying images into 10 categories, you’d replace the final layer with a layer that outputs 10 probabilities.
    4. Train on Your Data: Train the modified model on your dataset. Since the pre-trained model already has a good understanding of general image features, the training process can converge faster and achieve better performance, even with limited data.

    PyTorch Resources for Transfer Learning:

    • Torch Hub: A repository of pre-trained models that can be easily loaded and used. The sources mention Torch Hub as a valuable resource for finding models to use in transfer learning.
    • torchvision.models: Contains a collection of popular computer vision architectures (like ResNet and VGG) that come with pre-trained weights. You can easily load these models and modify them for your specific tasks. A minimal transfer learning sketch follows this list.
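
    A minimal transfer learning sketch using torchvision.models, assuming a recent torchvision version (older versions use pretrained=True instead of the weights argument); the 10-class task is hypothetical:

    import torch
    from torch import nn
    from torchvision import models

    # Load a ResNet-18 with weights pre-trained on ImageNet
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

    # Freeze the pre-trained layers so their learned features stay fixed
    for param in model.parameters():
        param.requires_grad = False

    # Replace the final layer to match the new task's output requirements
    model.fc = nn.Linear(model.fc.in_features, 10)  # this new layer will be trained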

    Benefits of Transfer Learning:

    • Faster Training: Since you’re not starting from random weights, the training process typically requires less time.
    • Improved Performance: Pre-trained models often bring a wealth of knowledge that can lead to better accuracy on your target task, especially when you have a small dataset.
    • Less Data Required: Transfer learning can be highly effective even when your dataset is relatively small.

    Examples in the Sources:

    The sources provide a glimpse into how transfer learning can be applied to image classification problems. For instance, you could leverage a model pre-trained on ImageNet to classify different types of food images or to distinguish between different clothing items in fashion images.

    Key Takeaway: Transfer learning is a valuable technique that allows you to build upon the knowledge gained from training large models on extensive datasets. By adapting these pre-trained models, you can often achieve better results faster, particularly in scenarios where labeled data is scarce.

    Here are some reasons why you might choose a machine learning algorithm over traditional programming:

    • When you have problems with long lists of rules, it can be helpful to use a machine learning or a deep learning approach. For example, the rules of driving would be very difficult to code into a traditional program, but machine learning and deep learning are currently being used in self-driving cars to manage these complexities [1].
    • Machine learning can be beneficial in continually changing environments because it can adapt to new data. For example, a machine learning model for self-driving cars could learn to adapt to new neighborhoods and driving conditions [2].
    • Machine learning and deep learning excel at discovering insights within large collections of data. For example, the Food 101 data set contains images of 101 different kinds of food, which would be very challenging to classify using traditional programming techniques [3].
    • If a problem can be solved with a simple set of rules, you should use traditional programming. For example, if you could write five steps to make your grandmother’s famous roast chicken, then it is better to do that than to use a machine learning algorithm [4, 5].

    Traditional programming is when you write code to define a set of rules that map inputs to outputs. For example, you could write a program to make your grandmother’s roast chicken by defining a set of steps that map the ingredients to the finished dish [6, 7].

    Machine learning, on the other hand, is when you give a computer a set of inputs and outputs, and it figures out the rules for itself. For example, you could give a machine learning algorithm a bunch of pictures of cats and dogs, and it would learn to distinguish between them [8, 9]. This is often described as supervised learning, because the algorithm is given both the inputs and the desired outputs, also known as features and labels. The algorithm’s job is to figure out the relationship between the features and the labels [8].

    Deep learning is a subset of machine learning that uses neural networks with many layers. This allows deep learning models to learn more complex patterns than traditional machine learning algorithms. Deep learning is typically better for unstructured data, such as images, text, and audio [10].

    Machine learning can be used for a wide variety of tasks, including:

    • Image classification: Identifying the objects in an image. [11]
    • Object detection: Locating objects in an image. [11]
    • Natural language processing: Understanding and processing human language. [12]
    • Speech recognition: Converting speech to text. [13]
    • Machine translation: Translating text from one language to another. [13]

    Overall, machine learning algorithms can be a powerful tool for solving complex problems that would be difficult or impossible to solve with traditional programming. However, it is important to remember that machine learning is not a silver bullet. There are many problems that are still best solved with traditional programming.

    Here are the key advantages of using deep learning for problems with long lists of rules:

    • Deep learning can excel at finding patterns in complex data, making it suitable for problems where it is difficult to explicitly code all of the rules. [1] For example, driving a car involves many rules, such as how to back out of a driveway, how to turn left, how to parallel park, and how to stop at an intersection. It would be extremely difficult to code all of these rules into a traditional program. [2]
    • Deep learning is also well-suited for problems that involve continually changing environments. [3] This is because deep learning models can continue to learn and adapt to new data. [3] For example, a self-driving car might need to adapt to new neighborhoods and driving conditions. [3]
    • Deep learning can be used to discover insights within large collections of data. [4] This is because deep learning models are able to learn complex patterns from large amounts of data. [4] For example, a deep learning model could be trained on a large dataset of food images to learn to classify different types of food. [4]

    However, there are also some potential drawbacks to using deep learning for problems with long lists of rules:

    • Deep learning models can be difficult to interpret. [5] This is because the patterns learned by a deep learning model are often represented as a large number of weights and biases, which can be difficult for humans to understand. [5]
    • Deep learning models can be computationally expensive to train. [5] This is because deep learning models often have a large number of parameters, which require a lot of computational power to train. [5]

    Overall, deep learning can be a powerful tool for solving problems with long lists of rules, but it is important to be aware of the potential drawbacks before using it.

    Deep Learning Models Learn by Adjusting Random Numbers

    Deep learning models learn by starting with tensors full of random numbers and then adjusting those random numbers to represent data better. [1] This process is repeated over and over, with the model gradually improving its representation of the data. [2] This is a fundamental concept in deep learning. [1]

    This process of adjusting random numbers is driven by two algorithms: gradient descent and backpropagation. [3, 4]

    • Gradient descent minimizes the difference between the model’s predictions and the actual outputs by adjusting model parameters (weights and biases). [3, 4] The learning rate is a hyperparameter that determines how large the steps are that the model takes during gradient descent. [5, 6]
    • Backpropagation calculates the gradients of the parameters with respect to the loss function. [4] In other words, backpropagation tells the model how much each parameter needs to be adjusted to reduce the error. [4] PyTorch implements backpropagation behind the scenes, making it easier to build deep learning models without needing to understand the complex math involved. [4, 7]

    Deep learning models have many parameters, often thousands or even millions. [8, 9] These parameters represent the patterns that the model has learned from the data. [8, 10] By adjusting these parameters using gradient descent and backpropagation, the model can improve its performance on a given task. [1, 2]

    This learning process is similar to how humans learn. For example, when a child learns to ride a bike, they start by making random movements. Through trial and error, they gradually learn to coordinate their movements and balance on the bike. Similarly, a deep learning model starts with random parameters and gradually adjusts them to better represent the data it is trying to learn.

    In short, the main concept behind a deep learning model’s ability to learn is its ability to adjust a large number of random parameters to better represent the data, driven by gradient descent and backpropagation.

    Supervised and Unsupervised Learning Paradigms

    Supervised learning is a type of machine learning where you have data and labels. The labels are the desired outputs for each input. The goal of supervised learning is to train a model that can accurately predict the labels for new, unseen data. An example of supervised learning is training a model to discern between cat and dog photos using photos labeled as either “cat” or “dog”. [1, 2]

    Unsupervised and self-supervised learning are types of machine learning where you only have data and no labels. The goal of unsupervised learning is to find patterns in the data without any guidance from labels. The goal of self-supervised learning is similar, but the algorithm attempts to learn an inherent representation of the data without being told what to look for. [2, 3] For example, a self-supervised learning algorithm could be trained on a dataset of dog and cat photos without being told which photos are of cats and which are of dogs. The algorithm would then learn to identify the underlying patterns in the data that distinguish cats from dogs. This representation of the data could then be used to train a supervised learning model to classify cats and dogs. [3, 4]

    Transfer learning is a type of machine learning where you take the patterns that one model has learned on one dataset and apply them to another dataset. This is a powerful technique that can be used to improve the performance of machine learning models on new tasks. For example, you could use a model that has been trained to classify images of dogs and cats to help train a model to classify images of birds. [4, 5]

    Reinforcement learning is another machine learning paradigm that does not fall into the categories of supervised, unsupervised, or self-supervised learning. [6] In reinforcement learning, an agent learns to interact with an environment by performing actions and receiving rewards or observations in return. [6, 7] An example of reinforcement learning is teaching a dog to urinate outside by rewarding it for urinating outside. [7]

    Underfitting in Machine Learning

    Underfitting occurs when a machine learning model is not complex enough to capture the patterns in the training data. As a result, an underfit model will have high training error and high test error. This means it will make inaccurate predictions on both the data it was trained on and new, unseen data.

    Here are some ways to identify underfitting:

    • The model’s loss on the training and test data sets is higher than it could be [1].
    • The loss curve does not decrease significantly over time, remaining relatively flat [1].
    • The accuracy of the model is lower than desired on both the training and test sets [2].

    Here’s an analogy to better understand underfitting: Imagine you are trying to learn to play a complex piano piece but are only allowed to use one finger. You can learn to play a simplified version of the song, but it will not sound very good. You are underfitting the data because your one-finger technique is not complex enough to capture the nuances of the original piece.

    Underfitting is often caused by using a model that is too simple for the data. For example, using a linear model to fit data with a non-linear relationship will result in underfitting [3]. It can also be caused by not training the model for long enough. If you stop training too early, the model may not have had enough time to learn the patterns in the data.

    Here are some ways to address underfitting:

    • Add more layers or units to your model: This will increase the complexity of the model and allow it to learn more complex patterns [4].
    • Train for longer: This will give the model more time to learn the patterns in the data [5].
    • Tweak the learning rate: If the learning rate is too high, the model may not be able to converge on a good solution. Reducing the learning rate can help the model learn more effectively [4].
    • Use transfer learning: Transfer learning can help to improve the performance of a model by using knowledge learned from a previous task [6].
    • Use less regularization: Regularization is a technique that can help to prevent overfitting, but if you use too much regularization, it can lead to underfitting. Reducing the amount of regularization can help the model learn more effectively [7].

    The goal in machine learning is to find the sweet spot between underfitting and overfitting, where the model is complex enough to capture the patterns in the data, but not so complex that it overfits. This is an ongoing challenge, and there is no one-size-fits-all solution. However, by understanding the concepts of underfitting and overfitting, you can take steps to improve the performance of your machine learning models.

    Impact of the Learning Rate on Gradient Descent

    The learning rate, often abbreviated as “LR”, is a hyperparameter that determines the size of the steps taken during the gradient descent algorithm [1-3]. Gradient descent, as previously discussed, is an iterative optimization algorithm that aims to find the optimal set of model parameters (weights and biases) that minimize the loss function [4-6].

    A smaller learning rate means the model parameters are adjusted in smaller increments during each iteration of gradient descent [7-10]. This leads to slower convergence, requiring more epochs to reach the optimal solution. However, a smaller learning rate can also be beneficial as it allows the model to explore the loss landscape more carefully, potentially avoiding getting stuck in local minima [11].

    Conversely, a larger learning rate results in larger steps taken during gradient descent [7-10]. This can lead to faster convergence, potentially reaching the optimal solution in fewer epochs. However, a large learning rate can also be detrimental as it can cause the model to overshoot the optimal solution, leading to oscillations or even divergence, where the loss increases instead of decreasing [7, 10, 12].

    Visualizing the learning rate’s effect can be helpful. Imagine trying to find the lowest point in a valley. A small learning rate is like taking small, careful steps down the slope, ensuring you don’t miss the bottom. A large learning rate is like taking large, confident strides, potentially reaching the bottom faster but risking stepping over it entirely.

    The choice of learning rate is crucial and often involves experimentation to find the optimal balance between convergence speed and stability [12-15]. Setting the learning rate too low can result in unnecessarily long training times, while setting it too high can prevent the model from converging at all [7, 10, 12].

    In the sources, a learning rate of 0.01 is described as “relatively large” [16]. A learning rate of 0.1 is mentioned as a common starting point but potentially too large for some scenarios [10, 17]. Smaller learning rates like 0.001 are also used, particularly with the Adam optimizer, which has a default learning rate of 0.001 [18-20].
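
As a quick illustration (the model here is a placeholder, not from the sources), the learning rate is set when the optimizer is created:

import torch

model = torch.nn.Linear(1, 1)  # placeholder model

# SGD with a learning rate of 0.1, a common starting point
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Adam falls back to its default learning rate of 0.001 if none is passed
optimizer = torch.optim.Adam(model.parameters())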

    The sources demonstrate the impact of the learning rate on the training process. In one instance, lowering the learning rate from 0.1 to 0.01 resulted in a lower loss on the test dataset [21]. Additionally, using a TensorFlow Playground visualization, varying the learning rate illustrated how it affects the speed of convergence and the ability to find the optimal solution [11, 22, 23].

    Therefore, selecting an appropriate learning rate is an important consideration when training machine learning models. Experimentation and monitoring the training process are often necessary to find the learning rate that yields the best results for a specific problem and model architecture.

    Benefits and Drawbacks of Deep Learning Compared to Traditional Machine Learning

    Deep learning, a subfield of machine learning, has gained immense popularity due to its ability to achieve state-of-the-art performance on various complex tasks. However, it’s essential to consider both its advantages and disadvantages compared to traditional machine learning algorithms before deciding which approach is suitable for a particular problem.

    Benefits of Deep Learning

    • Handles Unstructured Data: Deep learning excels in handling unstructured data such as images, text, and audio. Traditional machine learning algorithms often struggle with such data types, requiring extensive feature engineering to convert them into structured formats [1, 2]. Deep learning models, particularly convolutional neural networks (CNNs) for images and recurrent neural networks (RNNs) or transformers for text, automatically learn relevant features from raw data, simplifying the process [1-3].
    • Superior Performance on Complex Tasks: Deep learning models, due to their depth and complexity, can capture intricate patterns and relationships in data that traditional algorithms may miss [3-5]. This enables them to outperform traditional methods on tasks like image recognition, natural language processing, and speech synthesis [6-8].
    • Adaptability and Continuous Learning: Deep learning models can adapt to changing environments and learn from new data continuously [9]. This is crucial for applications like self-driving cars, where the model needs to adjust to new scenarios and learn from ongoing experiences [4, 9, 10].

    Drawbacks of Deep Learning

    • Black Box Nature and Explainability: Deep learning models often lack explainability. Their complex architectures and vast number of parameters make it challenging to interpret how they arrive at their predictions [11, 12]. This can be a concern in applications where understanding the reasoning behind a decision is critical, such as medical diagnosis or financial modeling [11, 13].
    • Data Hungry and Computationally Intensive: Deep learning models typically require massive amounts of data to train effectively [14, 15]. Gathering and preparing such large datasets can be time-consuming and expensive. Additionally, training deep learning models can be computationally demanding, requiring powerful hardware like GPUs or TPUs [16, 17].
    • Potential for Overfitting: Due to their complexity, deep learning models are susceptible to overfitting, where they learn the training data too well but generalize poorly to unseen data [18-20]. This requires careful tuning and techniques like regularization or dropout to mitigate [21, 22].

    Traditional Machine Learning Algorithms: Strengths and Limitations

    Traditional machine learning algorithms, like linear regression, support vector machines (SVMs), and decision trees, have been widely used and remain relevant for many applications. They often excel in scenarios where data is structured, explainability is crucial, or computational resources are limited.

    • Strengths: They generally require less data, are easier to interpret, and are computationally less demanding compared to deep learning models [23, 24].
    • Limitations: They may struggle with complex tasks involving unstructured data, require extensive feature engineering, and may not perform as well as deep learning on specific tasks.

    The sources primarily focus on deep learning, and there is limited information on specific advantages and disadvantages of traditional machine learning algorithms. You may want to independently verify this information.

    The choice between deep learning and traditional machine learning depends on the specific problem, available data, computational resources, and the importance of explainability. Deep learning often outperforms traditional methods on complex tasks with unstructured data. However, traditional algorithms may be a better choice when data is structured, interpretability is crucial, or computational resources are limited.

    Google’s Number One Rule of Machine Learning: Benefits of Simplicity

    Google’s number one rule of machine learning emphasizes choosing simplicity over complexity when possible. It states: “If you can build a simple rule-based system that doesn’t require machine learning, do that.” [1-3] This principle, although seemingly counterintuitive at the start of a machine learning course, highlights the importance of considering the trade-offs before deploying complex models. [4]

    Here’s why it’s crucial to consider Google’s number one rule:

    • Efficiency and Maintainability: Simple rule-based systems are often easier to implement, understand, maintain, and debug compared to machine learning models. They require less code, are more transparent in their decision-making process, and don’t need extensive training data or computational resources. [1-3]
    • Reliability and Predictability: Rule-based systems provide predictable outcomes based on predefined rules. In contrast, machine learning models, especially deep learning models, involve probabilistic predictions, meaning there’s always a chance of error. [5] For tasks requiring absolute certainty or where errors are unacceptable, a rule-based system might be a more suitable choice. [5]
    • Reduced Development Time and Costs: Building and deploying a machine learning model involves several steps, including data collection, preprocessing, model selection, training, and evaluation. This process can be time-consuming and resource-intensive. If a simple rule-based system can achieve the desired outcome, it can significantly reduce development time and costs. [1, 2]
    • Avoiding Unnecessary Complexity: Machine learning models, especially deep learning models, can become highly complex, making them challenging to interpret and debug. Using a machine learning model when a simpler solution exists introduces unnecessary complexity, potentially leading to difficulties in maintenance and troubleshooting. [4]

    The sources provide an analogy to illustrate this principle. If a simple set of five rules can accurately map ingredients to a Sicilian grandmother’s roast chicken recipe, there’s no need to employ a complex machine learning model. The rule-based system, in this case, would be more efficient and reliable. [1, 2]
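
To make the contrast concrete, here is a hypothetical sketch of such a rule-based system in Python; the rules are invented for illustration, and no training or data is needed:

def roast_chicken_step(ingredient):
    # Five plain rules mapping ingredients to recipe steps, no ML required
    rules = {
        "chicken": "season and place in the roasting pan",
        "garlic": "crush and rub over the chicken",
        "lemon": "halve and place inside the cavity",
        "rosemary": "tuck sprigs under the skin",
        "olive oil": "drizzle over the top before roasting",
    }
    return rules.get(ingredient, "not part of this recipe")

print(roast_chicken_step("garlic"))  # crush and rub over the chicken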

    However, it’s important to acknowledge that rule-based systems have limitations. They may not be suitable for complex problems with a vast number of rules, constantly changing environments, or situations requiring insights from large datasets. [6, 7]

    Therefore, Google’s number one rule encourages a thoughtful approach to problem-solving, urging consideration of simpler alternatives before resorting to the complexity of machine learning. It emphasizes that machine learning, although powerful, is not a universal solution and should be applied judiciously when the problem demands it. [4, 7]

Parameters vs. Hyperparameters in Machine Learning

    Parameters: Learned by the Model

    • Parameters are the internal values of a machine learning model that are learned automatically during the training process. [1]
    • They are responsible for capturing patterns and relationships within the data. [1]
    • Examples of parameters include weights and biases in a neural network. [1, 2]
    • Parameters are updated iteratively through optimization algorithms like gradient descent, guided by the loss function. [3, 4]
    • The number of parameters can vary significantly depending on the complexity of the model and the dataset. Models can have from a few parameters to millions or even billions. [2]
    • In the context of PyTorch, accessing model parameters can be done using model.parameters(). [5]
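
For example, a minimal sketch of inspecting and counting parameters (the model is a placeholder):

import torch

model = torch.nn.Linear(10, 2)  # placeholder model

# Each entry is a learnable tensor (weights or biases)
for param in model.parameters():
    print(param.shape)

# Total number of learnable values: 10 * 2 weights + 2 biases = 22
total = sum(p.numel() for p in model.parameters())
print(total)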

    Hyperparameters: Set by the Machine Learning Engineer

    • Hyperparameters are external configurations that are set by the machine learning engineer or data scientist before training the model. [4]
    • They control the learning process and influence the behavior of the model, such as its complexity, learning speed, and ability to generalize. [6]
• Examples of hyperparameters:
  • Learning rate (LR) [7]
  • Number of hidden layers [8]
  • Number of hidden units per layer [8]
  • Number of epochs [9]
  • Activation functions [8]
  • Loss function [8]
  • Optimizer [8]
  • Batch size [10]
    • Choosing appropriate hyperparameters is crucial for optimal model performance. [6]
• Finding the best hyperparameter settings often involves experimentation and techniques like grid search or random search. [This information about grid search or random search is not from the provided sources, and you may want to independently verify it.]
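
Since grid search is flagged above as outside the sources, treat this as a hedged sketch of the idea only: it is simply nested iteration over candidate settings, with train_and_evaluate standing in as a placeholder for a real training run:

from itertools import product

learning_rates = [0.1, 0.01, 0.001]  # hypothetical candidates
batch_sizes = [16, 32, 64]

for lr, batch_size in product(learning_rates, batch_sizes):
    # score = train_and_evaluate(lr=lr, batch_size=batch_size)  # placeholder
    print(f"would train with lr={lr}, batch_size={batch_size}")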

    Key Differences

    • Learned vs. Set: The key difference is that parameters are learned by the model during training, while hyperparameters are set manually before training.
    • Internal vs. External: Parameters are internal to the model, representing its learned knowledge, whereas hyperparameters are external configurations that guide the learning process.
    • Optimization Target vs. Optimization Control: The model’s optimization algorithms aim to find the optimal parameter values, while hyperparameters control how this optimization process occurs.

    The sources provide a clear distinction between parameters and hyperparameters. Parameters are like the model’s internal settings that it adjusts to capture patterns in the data. Hyperparameters are the external knobs that the machine learning engineer tweaks to guide the model’s learning process. Understanding this distinction is essential for building and training effective machine learning models.

    Back Propagation and Gradient Descent: A Collaborative Learning Process

    Back propagation and gradient descent are two essential algorithms that work together to enable a machine learning model to learn from data and improve its performance. These concepts are particularly relevant to deep learning models, which involve complex architectures with numerous parameters that need to be optimized.

    Back Propagation: Calculating the Gradients

    Back propagation is an algorithm that calculates the gradients of the loss function with respect to each parameter in the model. The gradients represent the direction and magnitude of change needed in each parameter to minimize the loss function.

    • Forward Pass: It begins with a forward pass, where data is fed through the model’s layers, and predictions are generated.
    • Loss Calculation: The difference between these predictions and the actual target values is quantified using a loss function.
    • Backward Pass: The back propagation algorithm then works backward through the network, starting from the output layer and moving towards the input layer.
    • Chain Rule: It uses the chain rule of calculus to calculate the gradients of the loss function with respect to each parameter. This process involves calculating the partial derivatives of the loss function with respect to the outputs of each layer, and then using these derivatives to calculate the gradients for the parameters within that layer.
    • Gradient Accumulation: The gradients are accumulated during this backward pass, providing information about how each parameter contributes to the overall error.
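
A tiny self-contained example (values chosen for illustration) of PyTorch computing a gradient through the backward pass:

import torch

x = torch.tensor(2.0, requires_grad=True)  # track gradients for x
y = x ** 2 + 3 * x   # forward pass: y = x^2 + 3x
y.backward()         # backward pass: chain rule computes dy/dx
print(x.grad)        # dy/dx = 2x + 3 = 7.0 at x = 2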

    Gradient Descent: Updating the Parameters

    Gradient descent is an optimization algorithm that uses the gradients calculated by back propagation to update the model’s parameters iteratively. The goal is to find the parameter values that minimize the loss function, leading to improved model performance.

    • Learning Rate: The learning rate is a hyperparameter that determines the step size taken in the direction of the negative gradient. It controls how much the parameters are adjusted during each update.
    • Iterative Updates: Gradient descent starts with an initial set of parameter values (often randomly initialized) and repeatedly updates these values based on the calculated gradients.
    • Minimizing the Loss: The update rule involves moving the parameters in the opposite direction of the gradient, scaled by the learning rate. This process continues iteratively until the loss function reaches a minimum or a satisfactory level of convergence is achieved.
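
The update rule itself is short; here is a minimal sketch of a single manual gradient descent step (the numbers are illustrative):

import torch

w = torch.tensor(5.0, requires_grad=True)  # one parameter
lr = 0.1                                   # learning rate

loss = (w - 3.0) ** 2  # loss is minimized at w = 3
loss.backward()        # d(loss)/dw = 2 * (w - 3) = 4.0

with torch.no_grad():  # update outside of gradient tracking
    w -= lr * w.grad   # move against the gradient
w.grad.zero_()         # clear the gradient for the next step
print(w)               # tensor(4.6000, ...), moving toward 3.0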

    The Interplay

    Back propagation provides the essential information needed for gradient descent to operate. By calculating the gradients of the loss function with respect to each parameter, back propagation tells gradient descent which direction to move each parameter to reduce the error.

    The sources emphasize that PyTorch handles the intricate mathematics of back propagation and gradient descent behind the scenes. When you define your model, loss function, and optimizer in PyTorch, and execute the training loop, these algorithms are automatically triggered to update the model’s parameters. The loss.backward() function triggers back propagation, and the optimizer.step() function performs the parameter update using gradient descent.
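
Putting the pieces together, here is a minimal sketch of that training loop (the model, data, and settings are placeholders, not the sources’ exact code):

import torch

model = torch.nn.Linear(1, 1)  # placeholder model
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X = torch.randn(100, 1)  # placeholder inputs
y = 3 * X + 1            # placeholder targets

for epoch in range(100):
    y_pred = model(X)          # forward pass
    loss = loss_fn(y_pred, y)  # calculate loss
    optimizer.zero_grad()      # clear old gradients
    loss.backward()            # back propagation: compute gradients
    optimizer.step()           # gradient descent: update parameters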

    Example: Imagine a neural network trained to classify images of cats and dogs. During training, back propagation calculates how much each weight and bias parameter contributes to misclassifications. Gradient descent then uses this information to adjust these parameters slightly, moving them in a direction that reduces the error. This process continues iteratively, gradually improving the model’s ability to distinguish between cats and dogs.

    In summary, back propagation and gradient descent form a powerful duo in machine learning. Back propagation determines the direction and magnitude of change needed in the model’s parameters, and gradient descent uses this information to iteratively update the parameters, driving the model towards better performance. While the mathematical details can be complex, PyTorch simplifies this process by abstracting away the complexity, allowing you to focus on building and training models without manually implementing these algorithms.

    The Role of Random Seeds in Neural Networks

    The sources provide a detailed explanation of how randomness plays a crucial role in the initialization and training of neural networks. Here’s how random seeds influence this randomness:

    Random Initialization of Parameters:

    • Neural networks start with random values for their parameters, such as weights and biases. [1, 2] This random initialization is essential for breaking symmetry and allowing the model to explore different regions of the parameter space during training.
• Without random initialization, all neurons in a layer would learn the same features, hindering the network’s ability to learn complex patterns. [This point is not explicitly mentioned in the sources, so you may want to independently verify it.]

    Sources of Randomness in PyTorch:

    • PyTorch uses pseudo-random number generators to create these random values. [3] Pseudo-randomness means that while the generated numbers appear random, they are actually determined by a deterministic algorithm.
    • Random Tensor Creation: When you create a random tensor in PyTorch using functions like torch.rand(), the underlying random number generator determines the values within that tensor. [1, 4] Each time you run the code, you get a different set of random values.

    The Impact of Random Seeds:

    • Reproducibility: The problem with this inherent randomness is that it makes it difficult to reproduce experiments. If you share your code with someone else, they will likely get different results due to the different random initializations.
    • Controlling the Randomness: A random seed allows you to “flavor” the randomness. [5] Setting a seed using torch.manual_seed() ensures that the random number generator starts from a specific point, producing the same sequence of random numbers every time you run the code. [6]
    • Flavors of Randomness: Think of each seed value as a different “flavor” of randomness. [6] While the numbers will still be random, they will be the same random numbers every time you use that specific seed.

    Benefits of Using Random Seeds:

    • Consistent Results: Using a random seed enables you to reproduce experiments and ensure consistency in your results. This is essential for debugging, sharing code, and comparing different model architectures or hyperparameter settings.
    • Controlled Experiments: Random seeds allow you to control the randomness in your experiments, isolating the effects of other factors like model architecture or hyperparameters.

    Limitations of Random Seeds:

    • Not True Randomness: It’s important to remember that even with a random seed, you’re not getting true randomness. The numbers are still generated by a deterministic algorithm.
• Generalization: Over-reliance on a specific random seed may lead to a model that overfits to the specific random initialization, potentially harming its ability to generalize to unseen data. [This point is worth noting but is not from the sources.]

    Practical Usage:

    • The sources recommend setting the random seed at the beginning of your code using torch.manual_seed(). [6] This ensures that all subsequent random operations in your code, such as creating random tensors or initializing model parameters, use the same “flavored” randomness.
    • CUDA Random Seed: If you’re using CUDA for GPU computations, you also need to set the CUDA random seed using torch.cuda.manual_seed() to control randomness on the GPU. [7]
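
A minimal sketch of seeding in practice (the seed value 42 is just a common convention):

import torch

torch.manual_seed(42)       # seed the CPU random number generator
torch.cuda.manual_seed(42)  # seed the GPU generator too (ignored if no CUDA)

a = torch.rand(2, 2)

torch.manual_seed(42)       # reset to the same seed...
b = torch.rand(2, 2)        # ...and the same "random" numbers come out

print(torch.equal(a, b))  # True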

    In summary, random seeds are a valuable tool for controlling the randomness inherent in neural networks, enabling reproducibility and facilitating controlled experiments. By setting a random seed, you ensure that the same “flavored” randomness is used every time you run your code, leading to consistent and reproducible results.

    Batch Size: A Balancing Act for Efficiency and Performance

    Batch size is a crucial hyperparameter in machine learning that determines how many samples are processed by the model before updating its parameters. Instead of processing the entire training dataset in one go, the data is divided into smaller groups called batches. The model iterates through these batches, updating its parameters after processing each batch.

    Impact of Batch Size on Training:

    • Computational Efficiency: The sources highlight that batch size significantly impacts computational efficiency. Processing a large batch of images requires significant memory and computational power. Using a smaller batch size can make training more manageable, especially when dealing with limited hardware resources or large datasets.
    • Gradient Update Frequency: A smaller batch size leads to more frequent updates to the model’s parameters because the gradients are calculated and applied after each batch. This can lead to faster convergence, especially in the early stages of training.
    • Generalization: Using smaller batch sizes can also improve the model’s ability to generalize to unseen data. This is because the model is exposed to a more diverse set of samples during each epoch, potentially leading to a more robust representation of the data.

    Choosing the Right Batch Size:

    • Hardware Constraints: The sources emphasize that hardware constraints play a significant role in determining the batch size. If you have a powerful GPU with ample memory, you can use larger batch sizes without running into memory issues. However, if you’re working with limited hardware, smaller batch sizes may be necessary.
    • Dataset Size: The size of your dataset also influences the choice of batch size. For smaller datasets, you might be able to use larger batch sizes, but for massive datasets, smaller batch sizes are often preferred.
    • Experimentation: Finding the optimal batch size often involves experimentation. The sources recommend starting with a common batch size like 32 and adjusting it based on the specific problem and hardware limitations.

    Mini-Batch Gradient Descent:

    • Efficiency and Performance Trade-off: The concept of using batches to train a neural network is called mini-batch gradient descent. Mini-batch gradient descent strikes a balance between the computational efficiency of batch gradient descent (processing the entire dataset in one go) and the faster convergence of stochastic gradient descent (processing one sample at a time).
    • Advantages of Mini-Batches: The sources list two primary benefits of using mini-batches:
    1. Computational Efficiency: Mini-batches make it feasible to train models on large datasets that might not fit entirely in memory.
    2. More Frequent Gradient Updates: More frequent updates lead to potentially faster convergence and can help the model escape local minima during training.

    Example from the Sources:

    • In the context of image classification using the Fashion MNIST dataset, the sources demonstrate how a batch size of 32 is used to divide the 60,000 training images into smaller, manageable batches. This allows the model to process and learn from the data more efficiently.
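
A hedged sketch of that setup with torchvision’s FashionMNIST dataset and a DataLoader (the root path and transform are assumptions):

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_data = datasets.FashionMNIST(
    root="data",  # assumed download location
    train=True,
    download=True,
    transform=transforms.ToTensor(),
)

# 60,000 images in batches of 32 -> 1,875 batches per epoch
train_dataloader = DataLoader(train_data, batch_size=32, shuffle=True)

images, labels = next(iter(train_dataloader))
print(images.shape)  # torch.Size([32, 1, 28, 28])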

    Key Considerations When Choosing Batch Size:

• Larger Batch Sizes:
  • Can be more computationally efficient because you’re processing more samples in one go.
  • Can potentially lead to smoother gradient updates, but also may require more memory.
  • Risk of overfitting if the batch size is too large and doesn’t allow the model to explore diverse samples.
• Smaller Batch Sizes:
  • Lead to more frequent gradient updates, potentially leading to faster convergence, especially in the early stages of training.
  • Can help the model generalize better to unseen data due to exposure to more diverse samples during training.
  • May be less computationally efficient as you’re processing fewer samples at a time.

    In conclusion, batch size is a critical hyperparameter that significantly influences the efficiency of training a neural network. Choosing the right batch size involves considering hardware constraints, dataset size, and experimental findings. Mini-batch gradient descent, by processing the data in batches, offers a balance between computational efficiency and performance, enabling the training of complex models on large datasets.

    By Amjad Izhar
    Contact: amjad.izhar@gmail.com
    https://amjadizhar.blog